How to Build a Cost-Aware LLM Routing System with NadirClaw Using Local Prompt Classification and Gemini Model Switching
In this tutorial, we explore NadirClaw, a smart routing layer that classifies prompts into simple and complex tiers before sending them to the most suitable model. We start by installing the required packages, setting up an optional Gemini API key, and testing the local classifier through the NadirClaw CLI without making any live LLM calls. We then inspect the centroid vectors that power the routing decision, embed our own prompts, visualize how similarity scores separate simple and complex tasks, and experiment with confidence thresholds. After understanding the local routing logic, we move on to live routing by launching the NadirClaw proxy server, sending OpenAI-compatible requests through it, comparing routed model behavior, and estimating cost savings against an always-Pro baseline.
import subprocess, sys

def _pip(*pkgs):
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)

_pip("nadirclaw", "openai", "sentence-transformers", "matplotlib",
     "scikit-learn", "pandas", "requests")

import os, json, time, signal, shutil, getpass
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests
GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY", "").strip()
if not GEMINI_API_KEY:
    print("Paste your Gemini API key (input hidden), or press Enter to skip:")
    try:
        GEMINI_API_KEY = getpass.getpass(prompt="GEMINI_API_KEY: ").strip()
    except (EOFError, KeyboardInterrupt):
        GEMINI_API_KEY = ""

LIVE_ROUTING = bool(GEMINI_API_KEY)
if LIVE_ROUTING:
    os.environ["GEMINI_API_KEY"] = GEMINI_API_KEY
    print(f"✓ key captured ({len(GEMINI_API_KEY)} chars) — sections 8–11 enabled.")
else:
    print("no key entered — sections 3–7 still run; live routing skipped.")
We install NadirClaw and the supporting Python libraries required for routing, embeddings, plotting, API calls, and data handling. We then import all required modules and securely capture the Gemini API key through the environment or a hidden prompt. We also decide whether the live routing sections should run, while still allowing the local classifier sections to work without an API key.
def classify(prompt: str) -> dict:
    r = subprocess.run(
        ["nadirclaw", "classify", "--format", "json", prompt],
        capture_output=True, text=True, timeout=180,
    )
    if r.returncode != 0:
        return {"prompt": prompt, "error": (r.stderr or r.stdout).strip()}
    return json.loads(r.stdout.strip())

prompts = [
    "What is 2+2?",
    'Format this JSON: {"a":1,"b":2}',
    "Read the file at src/main.py",
    "Add a docstring to the foo function",
    "What does this function do?",
    "Refactor the auth module to use dependency injection without breaking existing callers",
    "Design a distributed event-sourced order pipeline that handles 50k req/s with strict ordering",
    "Analyze the tradeoffs between actor-model and CSP-style concurrency for our codebase",
    "Debug why this asyncio.gather call deadlocks under high load and provide a fix",
    "Prove that this scheduling algorithm is optimal step by step and derive the worst-case bound",
]

print("\n[3] Classifying 10 prompts (first call warms the encoder)…")
rows = [classify(p) for p in prompts]
df = pd.DataFrame(rows)
cols = [c for c in ["tier", "score", "confidence", "model", "prompt"] if c in df.columns]
print(df[cols].to_string(index=False))
import nadirclaw

PKG = Path(nadirclaw.__file__).parent
SIMPLE_C = np.load(PKG / "simple_centroid.npy").astype(np.float32).flatten()
COMPLEX_C = np.load(PKG / "complex_centroid.npy").astype(np.float32).flatten()

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

print(f"\n[4] simple_centroid shape={SIMPLE_C.shape} ‖·‖={np.linalg.norm(SIMPLE_C):.3f}")
print(f"    complex_centroid shape={COMPLEX_C.shape} ‖·‖={np.linalg.norm(COMPLEX_C):.3f}")
print(f"    cosine(simple,complex) = {cosine(SIMPLE_C, COMPLEX_C):.4f} "
      "← if this were 1.0 the classifier couldn't distinguish them.")
We define a reusable classify() function that sends prompts to the NadirClaw CLI and returns structured JSON results. We create a mixed set of simple and complex prompts, classify them, and display the routing tier, score, confidence, model, and prompt text in a table. We then load the simple and complex centroid vectors from the NadirClaw package and inspect their shapes, norms, and cosine similarity.
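To make the centroid comparison concrete, here is a minimal, self-contained sketch of nearest-centroid routing with toy vectors. The real NadirClaw centroids are 384-dimensional MiniLM embeddings; the 3-d vectors and the margin idea below are illustrative assumptions, not NadirClaw's exact decision rule:

```python
import numpy as np

def cosine(a, b):
    # cosine similarity with a small epsilon to avoid division by zero
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def route(embedding, simple_c, complex_c):
    # nearest-centroid rule: pick whichever centroid the prompt is closer to,
    # and report the similarity gap as a rough confidence signal
    s, c = cosine(embedding, simple_c), cosine(embedding, complex_c)
    return ("complex" if c > s else "simple"), abs(c - s)

# toy "centroids" and a toy "prompt embedding" (illustrative values only)
simple_c = np.array([1.0, 0.1, 0.0])
complex_c = np.array([0.1, 1.0, 0.3])
prompt_emb = np.array([0.9, 0.2, 0.05])

tier, margin = route(prompt_emb, simple_c, complex_c)
print(tier, round(margin, 3))  # prompt is far closer to the simple centroid → "simple"
```

The same two-number comparison is what the scatter plot in the next section visualizes: every prompt lands at (cos to simple, cos to complex), and the diagonal is the decision boundary.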
from sentence_transformers import SentenceTransformer

print("\n[5] Loading the same encoder NadirClaw uses (all-MiniLM-L6-v2)…")
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embs = encoder.encode(prompts, normalize_embeddings=True)
sim_simple = np.array([cosine(e, SIMPLE_C) for e in embs])
sim_complex = np.array([cosine(e, COMPLEX_C) for e in embs])

fig, ax = plt.subplots(figsize=(8.5, 6))
colors = ["tab:blue"] * 5 + ["tab:red"] * 5
ax.scatter(sim_simple, sim_complex, c=colors, s=110, edgecolor="k", linewidth=0.5)
for i, _ in enumerate(prompts):
    ax.annotate(str(i + 1), (sim_simple[i], sim_complex[i]),
                xytext=(6, 4), textcoords="offset points", fontsize=10)
xs = np.linspace(min(sim_simple.min(), sim_complex.min()),
                 max(sim_simple.max(), sim_complex.max()), 50)
ax.plot(xs, xs, "k--", alpha=0.4, label="cos(simple) = cos(complex)")
ax.set_xlabel("cosine similarity to SIMPLE centroid")
ax.set_ylabel("cosine similarity to COMPLEX centroid")
ax.set_title("Routing decision boundary\n(blue = expected simple, red = expected complex)")
ax.legend(loc="lower right")
ax.grid(alpha=0.25)
plt.tight_layout()
plt.savefig("centroid_decision_plot.png", dpi=120)
plt.show()
print("Legend: prompts above the dashed line route to COMPLEX, below to SIMPLE.")
print("\n[6] Prompts sorted by complexity score:")
sdf = df.sort_values("score").reset_index(drop=True)
for _, row in sdf.iterrows():
    bar = "█" * int(round(float(row["score"]) * 30))
    print(f"  score={float(row['score']):.2f} conf={float(row['confidence']):.2f} "
          f"{row['tier']:7s} |{bar:<30s}| {row['prompt'][:55]}")

print("\n[6] Confidence-threshold sweep (low confidence → forced complex):")
print("    NadirClaw default threshold is 0.06.")
for thr in [0.02, 0.06, 0.10, 0.20, 0.30]:
    forced_complex = sum(1 for r in rows if float(r["confidence"]) < thr)
    natural_complex = sum(1 for r in rows if float(r["score"]) >= 0.5)
    print(f"  threshold={thr:.2f} → {forced_complex} prompts force-complex "
          f"(low-confidence), {natural_complex} naturally complex by score")
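The escalation rule exercised by the sweep above can be sketched as a small pure function. The 0.06 default comes from the section above; everything else (the function name and treating "escalate" as a simple override) is our illustrative assumption, not necessarily how NadirClaw implements it internally:

```python
def effective_tier(tier: str, confidence: float, threshold: float = 0.06) -> str:
    # When the classifier is not confident in its tier assignment,
    # fail safe: escalate the prompt to the complex (stronger) model.
    return "complex" if confidence < threshold else tier

# a confident simple prompt stays simple; an uncertain one is escalated
print(effective_tier("simple", confidence=0.40))   # simple
print(effective_tier("simple", confidence=0.02))   # complex
print(effective_tier("complex", confidence=0.02))  # complex (already complex)
```

Raising the threshold trades cost for safety: more borderline prompts go to the expensive model, which is exactly the pattern the sweep printout shows.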
modifier_demos = [
    ("agentic — text-only marker",
     "You are a coding agent that can execute commands. Now add tests for the new endpoint."),
    ("reasoning — chain-of-thought markers",
     "Step by step, derive the closed form and prove correctness mathematically. "
     "Compare and contrast both approaches."),
    ("vision — would arrive with image_url part (only text shown)",
     "Describe the screenshot."),
]

print("\n[7] Modifier-marker scan:")
for label, p in modifier_demos:
    r = classify(p)
    print(f"  {label}")
    print(f"    prompt='{p[:65]}…'")
    print(f"    tier={r['tier']} score={float(r['score']):.2f} conf={float(r['confidence']):.2f}")
print("    NB: agentic & vision routing also trigger from request shape "
      "(tools=[…], image_url parts) — see live calls below.")
We use the same SentenceTransformer encoder as NadirClaw and embed all tutorial prompts locally. We compare each prompt embedding against the simple and complex centroids, then visualize the routing boundary with a scatter plot. We also sort prompts by complexity score, test confidence thresholds, and inspect routing-modifier examples for agentic, reasoning, and vision-style requests.
PORT = 8856
server_proc = None
if LIVE_ROUTING:
    print(f"\n[8] Starting `nadirclaw serve` on :{PORT} (background subprocess)…")
    env = os.environ.copy()
    env.update({
        "GEMINI_API_KEY": GEMINI_API_KEY,
        "NADIRCLAW_SIMPLE_MODEL": "gemini-2.5-flash",
        "NADIRCLAW_COMPLEX_MODEL": "gemini-2.5-pro",
        "NADIRCLAW_PORT": str(PORT),
    })
    server_proc = subprocess.Popen(
        ["nadirclaw", "serve", "--verbose"],
        env=env,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        preexec_fn=os.setsid if hasattr(os, "setsid") else None,
    )
    ready = False
    for _ in range(60):
        if server_proc.poll() is not None:
            break
        try:
            if requests.get(f"http://localhost:{PORT}/health", timeout=1).ok:
                ready = True
                break
        except Exception:
            time.sleep(1)
    if ready:
        print("  ✓ /health returned 200 — proxy is live.")
    else:
        print("  proxy didn't come up; dumping last log lines:")
        if server_proc.stdout:
            try:
                lines = server_proc.stdout.read1(4096).decode("utf-8", errors="replace")
                print(lines[-2000:])
            except Exception as e:
                print(f"  (couldn't read server stdout: {e})")
else:
    print("\n[8] Skipped — no GEMINI_API_KEY.")
def proxy_alive():
    return server_proc is not None and server_proc.poll() is None

if proxy_alive():
    from openai import OpenAI
    client = OpenAI(base_url=f"http://localhost:{PORT}/v1", api_key="local")
    side_by_side = [
        ("simple-ish", "Write a one-line docstring for: def add(a, b): return a + b"),
        ("complex", "Refactor a Python class to a dependency-injection pattern, "
                    "explain the trade-offs, and produce migration steps for callers."),
    ]
    summary = []
    for label, p in side_by_side:
        t0 = time.time()
        try:
            resp = client.chat.completions.create(
                model="auto",
                messages=[{"role": "user", "content": p}],
                max_tokens=220,
            )
            dt = time.time() - t0
            text = (resp.choices[0].message.content or "").strip()
            print(f"\n--- [{label}] {dt:.2f}s · model={resp.model} ---")
            print(text[:500] + ("…" if len(text) > 500 else ""))
            summary.append({
                "label": label, "model_used": resp.model,
                "latency_s": round(dt, 2),
                "tokens": getattr(resp.usage, "total_tokens", None),
            })
        except Exception as e:
            summary.append({"label": label, "model_used": "ERROR",
                            "latency_s": None, "tokens": str(e)[:80]})
            print(f"\n[{label}] failed: {e}")
    print("\n[9] Summary:")
    print(pd.DataFrame(summary).to_string(index=False))
We start the NadirClaw proxy server locally when a Gemini API key is available and configure it to route between Flash and Pro models. We check the /health endpoint to confirm that the proxy is running before sending requests. We then use the OpenAI SDK against the local proxy and compare how a simple prompt and a complex prompt are routed and answered.
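Because the proxy speaks the OpenAI chat-completions wire format, the SDK call above boils down to POSTing a JSON body like the one below to `/v1/chat/completions`. Here we only build and inspect the payload locally (no server needed); the port, endpoint path, and `model: "auto"` sentinel follow the section above:

```python
import json

PORT = 8856  # same port used for `nadirclaw serve` above

# The OpenAI-compatible payload the SDK serializes for us.
payload = {
    "model": "auto",  # lets NadirClaw pick gemini-2.5-flash or gemini-2.5-pro
    "messages": [{"role": "user", "content": "Capital of France?"}],
    "max_tokens": 140,
}
url = f"http://localhost:{PORT}/v1/chat/completions"
body = json.dumps(payload)
print(url)
print(body)
# With the proxy running, sending it is one line:
#   requests.post(url, json=payload, timeout=60)
```

This is why any OpenAI-compatible client (SDKs, curl, LangChain, etc.) can point its base URL at the proxy without code changes.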
if proxy_alive():
    print("\n[10] Mixed 10-prompt workload…")
    workload = [
        "Capital of France?",
        "Read foo.py",
        "Type hint for a list of dicts",
        "Lowercase: HELLO",
        "One-sentence summary of REST",
        "Refactor a callback chain into async/await with proper error handling",
        "Design a sharded multi-region key-value store with linearizable reads",
        "Analyze the asymptotic complexity of this code and prove the bound rigorously",
        "Debug why our gRPC stream stalls when the client TCP window saturates",
        "Compare and contrast B-trees and LSM-trees for write-heavy workloads",
    ]
    runs = []
    client = OpenAI(base_url=f"http://localhost:{PORT}/v1", api_key="local")
    for p in workload:
        t0 = time.time()
        try:
            r = client.chat.completions.create(
                model="auto",
                messages=[{"role": "user", "content": p}],
                max_tokens=140,
            )
            usage = getattr(r, "usage", None)
            runs.append({
                "prompt": p[:55],
                "model": r.model,
                "latency_s": round(time.time() - t0, 2),
                "in_tok": getattr(usage, "prompt_tokens", 0) if usage else 0,
                "out_tok": getattr(usage, "completion_tokens", 0) if usage else 0,
            })
        except Exception as e:
            runs.append({"prompt": p[:55], "model": "ERROR",
                         "latency_s": None, "in_tok": 0, "out_tok": 0,
                         "error": str(e)[:80]})
    rdf = pd.DataFrame(runs)
    print(rdf.to_string(index=False))

    PRICE = {
        "flash": {"in": 0.30 / 1e6, "out": 2.50 / 1e6},
        "pro":   {"in": 1.25 / 1e6, "out": 10.0 / 1e6},
    }
    def price_for(model_str, in_t, out_t):
        m = (model_str or "").lower()
        tier = "flash" if "flash" in m else "pro"
        return in_t * PRICE[tier]["in"] + out_t * PRICE[tier]["out"]
    cost_routed = sum(price_for(r["model"], r["in_tok"], r["out_tok"]) for r in runs)
    cost_no_route = sum(price_for("gemini-2.5-pro", r["in_tok"], r["out_tok"]) for r in runs)
    print(f"\n[10] Cost (NadirClaw routed)     : ${cost_routed:.6f}")
    print(f"     Cost (always-Pro baseline)  : ${cost_no_route:.6f}")
    if cost_no_route > 0:
        print(f"     Estimated savings this run  : "
              f"{(1 - cost_routed/cost_no_route) * 100:.1f}%")
print("\n[11] `nadirclaw report` (parses the JSONL request log):")
rep = subprocess.run(["nadirclaw", "report"], capture_output=True, text=True, timeout=60)
print(rep.stdout or rep.stderr)
if proxy_alive():
    print("\n[12] Stopping the proxy…")
    try:
        if hasattr(os, "killpg"):
            os.killpg(os.getpgid(server_proc.pid), signal.SIGTERM)
        else:
            server_proc.terminate()
        server_proc.wait(timeout=10)
    except Exception:
        try:
            server_proc.kill()
        except Exception:
            pass
    print("  ✓ proxy stopped.")

print("\nDone.")
We send a mixed 10-prompt workload through the NadirClaw proxy to observe which model each prompt uses. We calculate an illustrative routed cost and compare it with an always-Pro baseline to estimate savings. We finally run the built-in NadirClaw report command, stop the proxy cleanly, and finish the tutorial workflow.
In conclusion, we built a full hands-on understanding of how NadirClaw routes prompts based on complexity, confidence, and request modifiers. We saw how local classification happens before any API call, how centroid-based similarity helps explain routing behavior, and how threshold tuning affects whether uncertain prompts are escalated to a stronger model. We also ran NadirClaw as a proxy, tested it with the OpenAI SDK, analyzed a mixed workload, and generated a routing report from the request log. Along the way, we learned how to use NadirClaw to make model routing more transparent, cost-aware, and practical for real-world AI applications.
The post How to Build a Cost-Aware LLM Routing System with NadirClaw Using Local Prompt Classification and Gemini Model Switching appeared first on MarkTechPost.
