A three-tier model routing system splits LLM traffic by complexity. A lightweight classifier analyzes each request (task type, complexity markers, required reasoning depth) and routes it to the cheapest capable model. Simple tasks like classification and extraction (typically 50-65% of traffic) hit Haiku-class models at $0.25/M tokens. Moderate tasks like summarization and structured Q&A (~25%) hit mid-tier models at $3/M tokens. Only complex multi-step reasoning (~10%) reaches premium models at $15/M tokens. Confidence-based escalation catches low-quality cheap-model responses and re-routes them upward. Across production deployments we've audited, the pattern routinely cuts monthly API spend 40-70% with no measurable quality decline.
If your LLM bill is in five figures and growing, the odds are that most of that spend is waste. Across the production AI systems we've audited, the same pattern keeps surfacing: every request (customer support classification, FAQ answers, document extraction, complex analytical queries) hits the most expensive model. Instrument the traffic and categorize requests by actual complexity, and the majority turn out to be tasks that a model costing a fraction as much handles with identical quality.
This pattern is everywhere. Teams pick the best available model during prototyping, wire it into production, and never revisit the decision. The API costs scale linearly while most of the work doesn't require that level of intelligence.
Model routing fixes this by directing each request to the cheapest model capable of handling it well. Try the cheap model first. Check whether the response meets quality thresholds. Escalate to the expensive model only when the cheap one falls short. At Particula Tech, we've built model routing layers for systems processing millions of requests monthly. This article covers the architecture, the classification logic, and the engineering trade-offs that determine whether routing pays off. The routing layer itself usually lives in an AI gateway: our AI gateway decision framework walks through which tier (DIY, LiteLLM, Portkey, or Kong) fits the routing pattern below at your traffic level.
Why Every Request Ends Up on the Expensive Model
The root cause is how AI applications get built. During development, teams choose a capable model (Claude Opus, GPT-4.5) because it handles edge cases well and simplifies testing. Everything works. The application ships.
Nobody goes back to ask which requests actually needed that model's full capabilities. The code calls one endpoint, passes one model parameter, and that decision compounds silently against the monthly bill.
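We've audited over twenty production AI systems. The traffic breakdown is remarkably consistent across industries: simple tasks such as classification and extraction typically make up 50-65% of requests, moderate tasks such as summarization and structured Q&A about 25%, and complex multi-step reasoning only about 10%.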
The problem isn't that expensive models are bad. It's that using them for everything is like flying first class for every trip, including the one across town. Smaller specialized models outperform premium ones on narrowly defined tasks because they're optimized for that work, not carrying the overhead of broad general knowledge they'll never use.
The Cheap-First Model Routing Architecture
The architecture has four components: a request classifier, a model tier map, a confidence evaluator, and a quality monitor. Requests flow through them sequentially, with the confidence evaluator creating a feedback loop back to the tier map.
How Requests Flow Through the System
Every incoming request first hits the classifier, which assigns a complexity tier: simple, moderate, or complex. The tier map translates that classification into a specific model:
- Tier 1 (Simple): Claude 3.5 Haiku, GPT-4o-mini, or equivalent. Input costs around $0.25/M tokens, output around $1.25/M. Handles classification, extraction, simple Q&A, and templated responses.
- Tier 2 (Moderate): Claude Sonnet, GPT-4o. Input around $3/M tokens, output around $15/M. Handles summarization, structured generation, and moderate reasoning.
- Tier 3 (Complex): Claude Opus, GPT-4.5. Input around $15/M tokens, output around $75/M. Reserved for multi-step reasoning, nuanced analysis, and tasks where cheaper models demonstrably fail.
The selected model processes the request and returns a response. The confidence evaluator then assesses whether that response meets quality thresholds. If it does, the response ships to the user. If confidence is low (the model hedges, the output is malformed, or structured validation fails), the request escalates to the next tier automatically. This escalation mechanism is the safety net. You don't need the classifier to be perfect. You need it to be right often enough that escalation handles the exceptions without destroying your cost savings.
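The request flow can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the model identifiers in `TIER_MODELS` are placeholders rather than real API model IDs, and `classify`, `call_model`, and `is_confident` are hypothetical callables you would supply (the classifier, your provider client, and the confidence evaluator).

```python
from dataclasses import dataclass
from typing import Callable

# Tier map: complexity tier -> model. The names are placeholders, not real model IDs.
TIER_MODELS = {1: "tier1-haiku-class", 2: "tier2-mid", 3: "tier3-premium"}


@dataclass
class RoutedResponse:
    text: str
    tier: int      # tier that produced the accepted response
    attempts: int  # number of model calls made for this request


def route(
    prompt: str,
    classify: Callable[[str], int],
    call_model: Callable[[str, str], str],
    is_confident: Callable[[str, str], bool],
    max_tier: int = 3,
) -> RoutedResponse:
    """Start at the classified tier and escalate one tier at a time."""
    tier = classify(prompt)
    attempts = 0
    while True:
        text = call_model(TIER_MODELS[tier], prompt)
        attempts += 1
        # Accept the response if it passes the confidence check,
        # or if there is no higher tier left to escalate to.
        if tier >= max_tier or is_confident(prompt, text):
            return RoutedResponse(text=text, tier=tier, attempts=attempts)
        tier += 1
```

Passing the three callables in as arguments keeps the routing loop independent of any one provider SDK and makes it easy to test with stubs.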
Request Classification: Deciding Which Model Gets the Call
The classifier determines routing quality. Get this wrong and you either waste money routing too many requests to expensive models or degrade quality by under-routing complex tasks to cheap ones. We've used three classification approaches, each suited to different maturity stages.
Rule-Based Classification
Start here. Examine the request for structural signals that correlate with complexity:
- Task type markers: If the prompt includes "classify," "extract," or "label," route to Tier 1. If it includes "analyze," "compare," or "explain why," route to Tier 2 or 3.
- Input length: Requests under 200 tokens with structured prompts are almost always simple. Requests over 2,000 tokens with open-ended instructions skew complex.
- Output format constraints: Requests expecting JSON, enum values, or boolean outputs are strong Tier 1 candidates. Free-form text generation needs more capability.
- Domain markers: Industry-specific terminology, legal or medical content, or multi-language requests may need premium models for accuracy.
Rule-based classification captures 70-80% of routing decisions correctly with zero training data. It's deployable in a day.
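A rough sketch of what such rules might look like in Python follows. The marker words and the 200 and 2,000 token cut-offs come from the list above; everything else is an assumption made for illustration, including the whitespace-based token estimate, the `STRUCTURED_OUTPUT_HINTS` phrases, the precedence between rules, and the Tier 2 default for ambiguous requests. Domain markers are left out because they depend on your own vocabulary.

```python
TIER1_MARKERS = ("classify", "extract", "label")
HIGHER_TIER_MARKERS = ("analyze", "compare", "explain why")
# Hypothetical phrases suggesting a constrained output format.
STRUCTURED_OUTPUT_HINTS = ("json", "true or false", "one of:")


def estimate_tokens(text: str) -> int:
    # Crude whitespace estimate; a real tokenizer would be more accurate.
    return len(text.split())


def classify_by_rules(prompt: str) -> int:
    """Return a complexity tier (1, 2, or 3) from structural signals."""
    lowered = prompt.lower()
    tokens = estimate_tokens(prompt)

    # Reasoning-style verbs take precedence over simple-task verbs.
    if any(marker in lowered for marker in HIGHER_TIER_MARKERS):
        return 3 if tokens > 2000 else 2
    if any(marker in lowered for marker in TIER1_MARKERS):
        return 1
    # Short prompts with a constrained output format are Tier 1 candidates.
    if tokens < 200 and any(hint in lowered for hint in STRUCTURED_OUTPUT_HINTS):
        return 1
    # Long prompts with no other signal skew complex.
    if tokens > 2000:
        return 3
    # Assumption: ambiguous requests default to the middle tier.
    return 2
```

Defaulting to Tier 2 is a conservative choice: a misrouted simple request costs a little extra, while a misrouted complex request on Tier 1 costs a wasted call plus an escalation.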
ML-Based Classification
Once you have two to four weeks of production traffic with labeled outcomes (which model actually produced a good response), train a lightweight classifier. We typically use a small text classification model (a fine-tuned DistilBERT or even logistic regression on TF-IDF features) that adds under 5ms of latency. In production deployments we've shipped, an ML classifier of this size lifts routing accuracy roughly from the high 70s into the low 90s, primarily by catching moderate-complexity requests that rule-based systems consistently misclassify as simple. A training set of around 8,000 labeled request-response pairs collected during the rule-based phase is usually enough to get there.
Hybrid Approach for Production
The system we recommend combines both: rules handle clear-cut cases instantly, and the ML classifier handles ambiguous requests. This layered approach keeps latency minimal for obvious routing decisions while improving accuracy on the edge cases that matter most.
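One way to layer the two is sketched below. The sketch assumes the ML classifier is exposed as a callable, here called `ml_predict`, that returns a tier and a confidence between 0 and 1. That name, the 0.7 confidence floor, and the fallback to Tier 2 are illustrative assumptions, not values from our deployments.

```python
from typing import Callable, Optional, Tuple


def rule_tier(prompt: str) -> Optional[int]:
    """Return a tier only for clear-cut cases; None means ambiguous."""
    lowered = prompt.lower()
    words = len(prompt.split())  # crude stand-in for a token count
    if words < 200 and any(m in lowered for m in ("classify", "extract", "label")):
        return 1
    if words > 2000 and any(m in lowered for m in ("analyze", "compare", "explain why")):
        return 3
    return None


def hybrid_classify(
    prompt: str,
    ml_predict: Callable[[str], Tuple[int, float]],
    min_confidence: float = 0.7,  # assumed threshold
    fallback_tier: int = 2,       # assumed floor for low-confidence predictions
) -> int:
    # Rules answer the obvious cases without paying for ML inference.
    tier = rule_tier(prompt)
    if tier is not None:
        return tier
    # The ML classifier handles everything the rules could not decide.
    predicted, confidence = ml_predict(prompt)
    if confidence >= min_confidence:
        return predicted
    # Low confidence: never route below the fallback tier.
    return max(predicted, fallback_tier)
```

Because the rules run first, the ML model is only invoked for requests the rules cannot decide, which keeps its latency off the obvious cases.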
Confidence-Based Escalation: The Safety Net
The classifier won't be perfect. Confidence-based escalation catches its mistakes by evaluating the cheap model's actual response before returning it to the user.
Three Signals That Trigger Escalation
Structured output validation. If the request expects JSON, a classification label, or an extraction result, validate the output programmatically. A cheap model that returns malformed JSON or an out-of-vocabulary label fails immediately, no judgment call needed. This catches roughly 40% of escalation-worthy responses.
Self-evaluation scoring. For open-ended responses, we append a lightweight self-assessment step: the same cheap model rates its own confidence on a 1-5 scale given the query and its response. Responses scoring below 3 escalate. This adds one extra cheap inference call but catches responses where the model is genuinely uncertain. Self-evaluation isn't reliable for every model, but it's surprisingly effective on well-calibrated ones.
Length and specificity heuristics. Responses that are suspiciously short, contain excessive hedging language ("I'm not sure," "it depends," "this could vary"), or fail to address the specific question all trigger escalation flags. These heuristics are crude individually but effective in combination.
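The three signals could be combined along the following lines. This is a sketch under stated assumptions: the self-evaluation score is passed in as a number already obtained from the cheap model, the `"label"` key is a hypothetical output schema, and the five-word minimum and two-flag rule are illustrative thresholds. The sketch does not check whether the response addresses the specific question, which needs more than string matching.

```python
import json
from typing import Collection, Optional

HEDGING_PHRASES = ("i'm not sure", "it depends", "this could vary")


def valid_structured_output(text: str, allowed_labels: Optional[Collection[str]] = None) -> bool:
    """Check that the output parses as JSON and, optionally, uses a known label."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return False
    if allowed_labels is None:
        return True
    # Assumes a hypothetical schema of the form {"label": "..."}.
    label = parsed.get("label") if isinstance(parsed, dict) else None
    return isinstance(label, str) and label in allowed_labels


def should_escalate(
    response: str,
    expects_json: bool = False,
    allowed_labels: Optional[Collection[str]] = None,
    self_eval_score: Optional[int] = None,  # 1-5 rating from the cheap model
    min_words: int = 5,                     # assumed "suspiciously short" cut-off
) -> bool:
    # Signal 1: structured output validation is a hard pass/fail.
    if expects_json:
        return not valid_structured_output(response, allowed_labels)
    # Signal 2: self-evaluation scores below 3 escalate.
    if self_eval_score is not None and self_eval_score < 3:
        return True
    # Signal 3: crude heuristics, only trusted in combination (two or more flags).
    lowered = response.lower()
    flags = sum(phrase in lowered for phrase in HEDGING_PHRASES)
    if len(response.split()) < min_words:
        flags += 1
    return flags >= 2
```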
Why Escalation Doesn't Kill Your Savings
Not every escalation wipes out cost savings. In well-tuned routing systems, only a small fraction of Tier 1 responses, typically under 10%, escalate to Tier 2, and rarely directly to Tier 3. The cost of processing that fraction twice (once on Tier 1, once on Tier 2) is still less than processing all of them on Tier 2 originally. Escalation only becomes expensive if your classifier is wrong more than 25-30% of the time, at which point you should fix the classifier rather than abandon routing entirely.
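The arithmetic behind that claim is short enough to check directly. The sketch below uses the approximate Tier 1 and Tier 2 input prices quoted earlier. It assumes both attempts consume the same number of tokens and ignores output tokens and the extra self-evaluation call, so treat it as a back-of-the-envelope estimate.

```python
def cost_with_escalation(cheap_price: float, upper_price: float, escalation_rate: float) -> float:
    """Expected price per million tokens when every request tries the cheap tier first.

    Every request pays the cheap tier once; the escalated share also pays the upper tier.
    """
    return cheap_price + escalation_rate * upper_price


TIER1_PRICE, TIER2_PRICE = 0.25, 3.00  # approximate input prices per million tokens

for rate in (0.10, 0.30):
    blended = cost_with_escalation(TIER1_PRICE, TIER2_PRICE, rate)
    print(f"escalation {rate:.0%}: ${blended:.2f}/M vs ${TIER2_PRICE:.2f}/M on Tier 2 only")
```

Under these assumptions a 10% escalation rate works out to $0.55/M against $3.00/M for sending everything to Tier 2, and a 30% rate to $1.15/M. Routing still comes out ahead at the higher rate, but the margin narrows quickly once output tokens and evaluation calls are counted.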
What the Numbers Look Like in Practice
After a phased deployment (rule-based routing first, ML classifier second, confidence escalation last), the savings show up across two full billing cycles. The shape of the result table below is what we typically see on systems with a healthy mix of simple, moderate, and complex traffic:

| Metric | Before | After |
|---|---|---|
| Requests routed to Tier 1 | 0% | ~60% |
| Requests routed to Tier 2 | 0% | ~25% |
| Requests routed to Tier 3 | 100% | ~10% |
| Escalation rate (Tier 1 → 2) | N/A | <10% |
| Avg response latency | high | lower |
| Quality score (human eval) | flat | flat |

The cost savings break down predictably by routing tier: at the input prices listed earlier, a request moved from Tier 3 to Tier 1 costs roughly one-sixtieth as much ($0.25/M versus $15/M), a request moved to Tier 2 costs one-fifth as much ($3/M versus $15/M), and the roughly 10% that stays on Tier 3 costs the same as before.
The latency improvement is an unexpected bonus. Cheap models respond faster: Haiku-class models return in 200-400ms versus 1.5-3 seconds for premium models. Since the majority of requests now hit fast models, the average response time drops significantly. For customer-facing applications, this improvement alone often justifies the routing investment.
Quality scores hold steady because the routing only sends tasks to cheap models that cheap models handle well. A human evaluation panel scoring random samples weekly will typically find no statistically significant difference between pre-routing and post-routing quality distributions.
Quality Safeguards That Prevent Silent Degradation
The risk with model routing isn't the initial deployment; it's drift. Traffic patterns change, models get updated by providers, and the boundary between "simple" and "complex" shifts over time. Without active monitoring, routing degrades quietly.
Shadow Testing During Rollout
Before enabling routing in production, we run two weeks of shadow testing. Every request still hits the premium model as before but is simultaneously processed through the routing pipeline. We compare outputs to validate that routing decisions match quality expectations before any user sees a routed response. This phase also generates the labeled training data for the ML classifier.
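A shadow-mode wrapper might look roughly like this. The callables `premium_call`, `routed_call`, and `outputs_agree` are hypothetical stand-ins for your existing premium call, the new routing pipeline, and whatever comparison you trust (exact match for labels, a judge model or human review for free text). For brevity the sketch runs the two paths one after the other; in production the shadow path would typically run asynchronously so it adds no user-facing latency.

```python
from typing import Callable, Dict, List, Tuple


def shadow_request(
    prompt: str,
    premium_call: Callable[[str], str],
    routed_call: Callable[[str], Tuple[str, int]],  # returns (response text, tier used)
    outputs_agree: Callable[[str, str], bool],
    log: List[Dict],
) -> str:
    """Serve the premium response; run the routing pipeline only to record a comparison."""
    premium_text = premium_call(prompt)
    try:
        routed_text, tier = routed_call(prompt)
        log.append({
            "prompt": prompt,
            "tier": tier,
            "agrees": outputs_agree(premium_text, routed_text),
        })
    except Exception as exc:
        # The shadow path must never break the user-facing path.
        log.append({"prompt": prompt, "tier": None, "agrees": False, "error": repr(exc)})
    # Users only ever see the premium output during shadow testing.
    return premium_text
```

The records collected in `log` can double as labeled examples: a Tier 1 response that agrees with the premium output is evidence the request was safely routable.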
Continuous Quality Sampling
In production, we route 5% of Tier 1 responses to the premium model in parallel and compare outputs. If the agreement rate drops below 95%, the monitoring system alerts and we investigate whether the classifier needs retraining or a model update has changed capability boundaries. This costs roughly 3% of the total savings but provides ongoing confidence that routing decisions remain valid.
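The sampling and alert logic is simple enough to sketch. The 5% sample rate and 95% agreement threshold come from the paragraph above. The `min_samples` guard is an added assumption to avoid alerting on a handful of comparisons, and how two outputs are judged to "agree" is left to the caller.

```python
import random
from typing import List

SAMPLE_RATE = 0.05      # share of Tier 1 responses also sent to the premium model
ALERT_THRESHOLD = 0.95  # agreement rate below which to investigate


def pick_for_comparison(rng: random.Random, sample_rate: float = SAMPLE_RATE) -> bool:
    """Decide whether this Tier 1 response should also be run on the premium model."""
    return rng.random() < sample_rate


def agreement_rate(agreements: List[bool]) -> float:
    if not agreements:
        return 1.0  # assumption: no samples yet means nothing to alert on
    return sum(agreements) / len(agreements)


def needs_investigation(
    agreements: List[bool],
    threshold: float = ALERT_THRESHOLD,
    min_samples: int = 100,  # assumed guard against noisy small samples
) -> bool:
    return len(agreements) >= min_samples and agreement_rate(agreements) < threshold
```

In practice `agreements` would hold the comparison results for a recent window, such as the last day or week, so that an old backlog of good results does not mask a recent drop.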
Escalation Rate as a Health Metric
Rising escalation rates signal classifier drift. A typical pattern: the rate starts under 10% and creeps a few points higher over a few months as the product adds new features generating query types the classifier hasn't seen. Retraining on recent data brings it back down. Set alerts at 15% to trigger mandatory classifier retraining. Escalation rate is the single most useful metric for routing health: if it's rising, something in your traffic or models has changed.
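A rolling-window monitor is one simple way to track this. In the sketch below the 15% alert level comes from the text, while the window size and the choice to wait for a full window before alerting are assumptions.

```python
from collections import deque


class EscalationRateMonitor:
    """Track the Tier 1 escalation rate over the most recent requests."""

    def __init__(self, window: int = 1000, alert_rate: float = 0.15):
        self._events = deque(maxlen=window)  # oldest events drop off automatically
        self.alert_rate = alert_rate

    def record(self, escalated: bool) -> None:
        self._events.append(escalated)

    @property
    def rate(self) -> float:
        if not self._events:
            return 0.0
        return sum(self._events) / len(self._events)

    def needs_retraining(self) -> bool:
        # Assumption: only alert once the window is full, to avoid noisy early readings.
        window_full = len(self._events) == self._events.maxlen
        return window_full and self.rate >= self.alert_rate
```

Call `record()` once per Tier 1 request with whether it escalated, and check `needs_retraining()` from whatever alerting job you already run.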
Building Your Own Model Routing Pipeline
If your LLM API bill exceeds $5,000/month and your traffic includes a mix of simple and complex tasks, model routing will save money. Here's the implementation path we recommend.
Week 1: Instrument and classify traffic. Log every request with metadata: task type, input length, output format, and the model currently processing it. Manually label 200-500 requests by actual complexity. This data reveals what percentage of traffic is routable and estimates your savings ceiling. Most teams discover the split is more favorable than they expected.
Week 2: Deploy rule-based routing. Build classification rules from your labeled data. Start conservative: only route requests you're highly confident are simple. A routing rate of 40-50% is normal initially. Enable escalation for any responses that fail output validation.
Weeks 3-4: Add ML classification and shadow testing. Train your classifier on production data collected during week 2. Shadow test against premium model outputs. Tune confidence thresholds. Expand routing coverage as accuracy data builds confidence.
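The savings ceiling from Week 1 can be estimated directly from the manually labeled sample. The sketch below uses the approximate input prices quoted earlier and assumes every request consumes a similar number of tokens. It ignores output tokens, escalations, and quality sampling, so it gives an upper bound rather than a forecast, and realized savings will land below it.

```python
from collections import Counter
from typing import Dict, Iterable

# Approximate input prices per million tokens, by tier.
INPUT_PRICE_PER_M: Dict[int, float] = {1: 0.25, 2: 3.00, 3: 15.00}


def savings_ceiling(
    labeled_tiers: Iterable[int],
    prices: Dict[int, float] = INPUT_PRICE_PER_M,
    baseline_tier: int = 3,
) -> float:
    """Fraction of spend saved if every request ran on its labeled tier."""
    counts = Counter(labeled_tiers)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one labeled request")
    # Traffic-weighted average price per million tokens after perfect routing.
    blended = sum(prices[tier] * n for tier, n in counts.items()) / total
    return 1 - blended / prices[baseline_tier]


# Hypothetical sample of 200 labeled requests: 120 simple, 50 moderate, 30 complex.
sample = [1] * 120 + [2] * 50 + [3] * 30
print(f"savings ceiling: {savings_ceiling(sample):.0%}")
```

For this hypothetical sample the ceiling comes out at 79%.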
When Model Routing Isn't Worth It
Skip routing if your traffic is uniformly complex: legal reasoning, medical diagnosis, or research synthesis where every request genuinely requires premium capability. Also skip it if your monthly spend is under $2,000; the engineering investment won't pay back at that scale. For cost optimization at lower volumes, token optimization and smart caching deliver better ROI with less implementation effort.
The Right Model for Every Request
Model routing is the most underused cost optimization in production AI systems. The concept is straightforward (send each request to the cheapest model that handles it well), but the impact compounds across every request in your pipeline. A well-tuned routing layer routinely cuts spend by half or more while holding quality flat and improving latency as a bonus.
The organizations paying the least per AI interaction aren't using the cheapest models exclusively. They're matching the right model to each request. Start by instrumenting your traffic, classifying request complexity, and routing your simplest tasks to your cheapest capable model. The savings accumulate from day one, and the architecture only improves as you layer in ML-based classification and confidence-based escalation.
Frequently Asked Questions
Quick answers to common questions about this topic
Model routing directs each LLM request to the most cost-effective model capable of handling it well, rather than sending everything to a single expensive model. A routing layer classifies incoming requests by complexity and task type, then selects the appropriate model tier. Simple tasks like classification go to cheap models, while complex reasoning goes to premium ones. This typically reduces API costs 40-70% without affecting response quality.



