    February 26, 2026

    LLM Model Routing: Cheap First, Expensive Only When Needed

    LLM model routing sends simple requests to cheap models and escalates complex ones to premium—cutting API costs 40-70% without losing response quality.

    Sebastian Mondragon
    8 min read
    TL;DR

    We built a three-tier model routing system for a client spending $38K/month on LLM APIs. A lightweight classifier analyzed each request—task type, complexity markers, required reasoning depth—and routed it to the cheapest capable model. Simple tasks like classification and extraction (62% of traffic) went to Haiku-class models at $0.25/M tokens. Moderate tasks like summarization and structured Q&A (27%) hit mid-tier models at $3/M tokens. Only complex multi-step reasoning (11%) reached premium models at $15/M tokens. Confidence-based escalation caught the 8% of cheap-model responses that fell below quality thresholds and re-routed them upward. Monthly costs dropped from $38K to $15.2K—a 60% reduction—with no measurable quality decline. The entire implementation took four weeks.

    A client came to us spending $38,000 per month on LLM API calls. Every request—customer support classification, FAQ answers, document extraction, complex analytical queries—hit their most expensive model. When we instrumented their traffic and categorized requests by actual complexity, 62% were tasks that a model costing 1/30th as much handled with identical quality.

    This pattern is everywhere. Teams pick the best available model during prototyping, wire it into production, and never revisit the decision. The API costs scale linearly while most of the work doesn't require that level of intelligence.

    Model routing fixes this by directing each request to the cheapest model capable of handling it well. Try the cheap model first. Check whether the response meets quality thresholds. Escalate to the expensive model only when the cheap one falls short. At Particula Tech, we've built model routing layers for systems processing millions of requests monthly. This article covers the exact architecture, the classification logic, and the real numbers from a recent implementation. If you're paying premium prices for every LLM call, most of that spend is waste.

    Why Every Request Ends Up on the Expensive Model

    The root cause is how AI applications get built. During development, teams choose a capable model—Claude Opus, GPT-4.5—because it handles edge cases well and simplifies testing. Everything works. The application ships.

    Nobody goes back to ask which requests actually needed that model's full capabilities. The code calls one endpoint, passes one model parameter, and that decision compounds silently against the monthly bill.

    We've audited over twenty production AI systems. The traffic breakdown is remarkably consistent across industries:

    • 60-70% of requests are simple tasks: classification, entity extraction, yes/no decisions, template-based responses, straightforward FAQ answers. These produce identical outputs on models costing $0.25/M tokens versus $15/M tokens.
    • 20-30% are moderate tasks: summarization, structured Q&A, content generation with constraints. Mid-tier models handle these well.
    • 5-15% are genuinely complex: multi-step reasoning, nuanced analysis, creative problem-solving requiring broad knowledge. These need premium models and should get them.

    The problem isn't that expensive models are bad. It's that using them for everything is like flying first class for every trip, including the one across town. Smaller specialized models can outperform premium ones on narrowly defined tasks because they're optimized for that work, not carrying the overhead of broad general knowledge they'll never use.
    The Cheap-First Model Routing Architecture

    The architecture has four components: a request classifier, a model tier map, a confidence evaluator, and a quality monitor. Requests flow through them sequentially, with the confidence evaluator creating a feedback loop back to the tier map.

    How Requests Flow Through the System

    Every incoming request first hits the classifier, which assigns a complexity tier—simple, moderate, or complex. The tier map translates that classification into a specific model:

    • Tier 1 (Simple): Claude 3.5 Haiku, GPT-4o-mini, or equivalent. Input costs around $0.25/M tokens, output around $1.25/M. Handles classification, extraction, simple Q&A, and templated responses.
    • Tier 2 (Moderate): Claude Sonnet, GPT-4o. Input around $3/M tokens, output around $15/M. Handles summarization, structured generation, and moderate reasoning.
    • Tier 3 (Complex): Claude Opus, GPT-4.5. Input around $15/M tokens, output around $75/M. Reserved for multi-step reasoning, nuanced analysis, and tasks where cheaper models demonstrably fail.

    The selected model processes the request and returns a response. The confidence evaluator then assesses whether that response meets quality thresholds. If it does, the response ships to the user. If confidence is low—the model hedges, the output is malformed, or structured validation fails—the request escalates to the next tier automatically.

    This escalation mechanism is the safety net. You don't need the classifier to be perfect. You need it to be right often enough that escalation handles the exceptions without destroying your cost savings.

    Request Classification: Deciding Which Model Gets the Call

    The classifier determines routing quality. Get this wrong and you either waste money routing too many requests to expensive models or degrade quality by under-routing complex tasks to cheap ones. We've used three classification approaches, each suited to different maturity stages.

    Rule-Based Classification

    Start here. Examine the request for structural signals that correlate with complexity:

    • Task type markers: If the prompt includes "classify," "extract," or "label," route to Tier 1. If it includes "analyze," "compare," or "explain why," route to Tier 2 or 3.
    • Input length: Requests under 200 tokens with structured prompts are almost always simple. Requests over 2,000 tokens with open-ended instructions skew complex.
    • Output format constraints: Requests expecting JSON, enum values, or boolean outputs are strong Tier 1 candidates. Free-form text generation needs more capability.
    • Domain markers: Industry-specific terminology, legal or medical content, or multi-language requests may need premium models for accuracy.

    Rule-based classification captures 70-80% of routing decisions correctly with zero training data. It's deployable in a day.
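The signals above translate directly into a few lines of code. A minimal sketch, assuming a whitespace word count as a rough token proxy and the marker lists from the bullets (the function name and thresholds are illustrative):

```python
# Hypothetical rule-based classifier built from structural signals.
SIMPLE_MARKERS = ("classify", "extract", "label")
COMPLEX_MARKERS = ("analyze", "compare", "explain why")

def classify_request(prompt: str, expects_structured_output: bool = False) -> str:
    """Assign a complexity tier from cheap structural signals."""
    text = prompt.lower()
    token_estimate = len(text.split())  # rough proxy for token count

    if any(m in text for m in COMPLEX_MARKERS) or token_estimate > 2000:
        return "complex"
    if expects_structured_output or any(m in text for m in SIMPLE_MARKERS):
        return "simple"
    if token_estimate < 200:
        return "simple"
    return "moderate"  # default to the middle tier when signals are ambiguous
```

Note the conservative default: when no signal fires on a long request, it lands on the middle tier rather than the cheapest one, which is the safer direction to be wrong in.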

    ML-Based Classification

    Once you have two to four weeks of production traffic with labeled outcomes—which model actually produced a good response—train a lightweight classifier. We typically use a small text classification model (a fine-tuned DistilBERT or even logistic regression on TF-IDF features) that adds under 5ms of latency. The ML classifier improved routing accuracy from 78% to 91% for our client, primarily by catching moderate-complexity requests that rule-based systems consistently misclassified as simple. The training set was 8,000 labeled request-response pairs collected during the rule-based phase.
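A sketch of the TF-IDF plus logistic regression variant mentioned above, using scikit-learn. The four training examples here are toy stand-ins for the thousands of labeled production pairs a real deployment would use:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for labeled request/outcome pairs collected in production.
requests = [
    "Classify this support ticket as billing or technical",
    "Extract the invoice number and due date from this email",
    "Summarize this quarterly report in three bullet points",
    "Compare these two vendor contracts and explain the risk tradeoffs",
]
tiers = ["simple", "simple", "moderate", "complex"]

# TF-IDF features + logistic regression: fast to train, milliseconds to run.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(requests, tiers)

prediction = clf.predict(["Extract the customer name from this message"])[0]
```

The pipeline object serializes cleanly, so the trained classifier can ship as a single artifact and be reloaded at the routing layer without retraining.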

    Hybrid Approach for Production

    The system we recommend combines both: rules handle clear-cut cases instantly, and the ML classifier handles ambiguous requests. This layered approach keeps latency minimal for obvious routing decisions while improving accuracy on the edge cases that matter most.
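One way to sketch that layering, assuming a hypothetical `route` function where rules fire first and an optional ML classifier handles whatever the rules leave ambiguous:

```python
# Hypothetical hybrid router: cheap rules first, ML fallback for ambiguity.
RULE_SIMPLE = ("classify", "extract", "label")
RULE_COMPLEX = ("analyze", "compare", "explain why")

def route(prompt: str, ml_classify=None) -> str:
    """Return a tier: rules decide clear cases, the ML model decides the rest."""
    text = prompt.lower()
    if any(m in text for m in RULE_COMPLEX):
        return "complex"
    if any(m in text for m in RULE_SIMPLE):
        return "simple"
    # Ambiguous request: defer to the ML classifier if one is available.
    if ml_classify is not None:
        return ml_classify(prompt)
    return "moderate"
```

Because the rules short-circuit, the ML model only runs on the minority of requests where its accuracy advantage actually matters, keeping median routing latency near zero.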

    Confidence-Based Escalation: The Safety Net

    The classifier won't be perfect. Confidence-based escalation catches its mistakes by evaluating the cheap model's actual response before returning it to the user.

    Three Signals That Trigger Escalation

    Structured output validation. If the request expects JSON, a classification label, or an extraction result, validate the output programmatically. A cheap model that returns malformed JSON or an out-of-vocabulary label fails immediately—no judgment call needed. This catches roughly 40% of escalation-worthy responses.

    Self-evaluation scoring. For open-ended responses, we append a lightweight self-assessment step: the same cheap model rates its own confidence on a 1-5 scale given the query and its response. Responses scoring below 3 escalate. This adds one extra cheap inference call but catches responses where the model is genuinely uncertain. Self-evaluation isn't reliable for every model, but it's surprisingly effective on well-calibrated ones.

    Length and specificity heuristics. Responses that are suspiciously short, contain excessive hedging language ("I'm not sure," "it depends," "this could vary"), or fail to address the specific question all trigger escalation flags. These heuristics are crude individually but effective in combination.

    Why Escalation Doesn't Kill Your Savings

    Not every escalation wipes out cost savings. In our client's system, 8% of Tier 1 responses escalated to Tier 2—not directly to Tier 3. The cost of processing that 8% twice (once on Tier 1, once on Tier 2) was still less than processing all of them on Tier 2 originally. Escalation only becomes expensive if your classifier is wrong more than 25-30% of the time, at which point you should fix the classifier rather than abandon routing entirely.
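A quick arithmetic sketch of why the double-processing overhead stays small. The per-request costs here are hypothetical round numbers, not the client's actual prices; the point is the shape of the comparison:

```python
# Illustrative average cost per request on each tier (hypothetical values).
tier1_cost, tier2_cost = 0.001, 0.012

def cost_with_routing(n_requests: int, escalation_rate: float) -> float:
    """Everything tries Tier 1 first; an escalated fraction is processed twice."""
    return n_requests * tier1_cost + n_requests * escalation_rate * tier2_cost

def cost_without_routing(n_requests: int) -> float:
    """Baseline: every request goes straight to Tier 2."""
    return n_requests * tier2_cost

# At an 8% escalation rate, routing wins by a wide margin.
routed = cost_with_routing(1_000_000, 0.08)   # 1,000 + 960 = ~1,960
direct = cost_without_routing(1_000_000)      # 12,000
```

Even paying for 8% of requests twice, the routed total is a fraction of the send-everything-to-Tier-2 baseline, which is why moderate classifier error rates don't erase the savings.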

    The Numbers: From $38K to $15K Monthly

    After four weeks of phased deployment—rule-based routing first, ML classifier second, confidence escalation last—we measured results across two full billing cycles.

    The cost savings broke down predictably by routing tier:

    • Tier 1 processing handled 62% of traffic at 1/60th the per-request cost of the original model, contributing roughly $18,400/month in savings.
    • Tier 2 processing handled 27% at 1/5th the original cost, contributing approximately $3,800/month.
    • Escalation overhead from requests processed twice cost an additional $800/month.

    The latency improvement was an unexpected bonus. Cheap models respond faster—Haiku-class models return in 200-400ms versus 1.5-3 seconds for premium models. Since 62% of requests now hit fast models, the average response time dropped significantly. For their customer-facing application, this improvement alone justified the routing investment.

    Quality scores held steady because the routing only sent tasks to cheap models that cheap models handle well. The human evaluation panel—five reviewers scoring 500 random responses weekly—found no statistically significant difference between pre-routing and post-routing quality distributions.

    Metric                          Before      After       Change
    Monthly API spend               $38,200     $15,200     -60%
    Requests routed to Tier 1       0%          62%         +62%
    Requests routed to Tier 2       0%          27%         +27%
    Requests routed to Tier 3       100%        11%         -89%
    Escalation rate (Tier 1 → 2)    N/A         8%          N/A
    Avg response latency            2,100ms     890ms       -58%
    Quality score (human eval)      4.2/5.0     4.2/5.0     No change

    Quality Safeguards That Prevent Silent Degradation

    The risk with model routing isn't the initial deployment—it's drift. Traffic patterns change, models get updated by providers, and the boundary between "simple" and "complex" shifts over time. Without active monitoring, routing degrades quietly.

    Shadow Testing During Rollout

    Before enabling routing in production, we ran two weeks of shadow testing. Every request still hit the premium model as before, but simultaneously processed through the routing pipeline. We compared outputs to validate that routing decisions matched quality expectations before any user saw a routed response. This phase also generated the labeled training data for the ML classifier.

    Continuous Quality Sampling

    In production, we route 5% of Tier 1 responses to the premium model in parallel and compare outputs. If the agreement rate drops below 95%, the monitoring system alerts and we investigate whether the classifier needs retraining or a model update has changed capability boundaries. This costs roughly 3% of the total savings but provides ongoing confidence that routing decisions remain valid.

    Escalation Rate as a Health Metric

    Rising escalation rates signal classifier drift. Our client's escalation rate started at 8% and crept to 12% over three months as their product added new features generating query types the classifier hadn't seen. Retraining on recent data brought it back to 7%. We set alerts at 15% to trigger mandatory classifier retraining. Escalation rate is the single most useful metric for routing health—if it's rising, something in your traffic or models has changed.
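One way to track that health metric is a rolling window over recent routing outcomes. A minimal sketch; the class name, window size, and the 15% alert threshold mirror the text but are otherwise illustrative:

```python
from collections import deque

class EscalationMonitor:
    """Rolling escalation-rate tracker with a retraining alert threshold."""

    def __init__(self, window: int = 1000, alert_at: float = 0.15):
        self.outcomes = deque(maxlen=window)  # True = request escalated
        self.alert_at = alert_at

    def record(self, escalated: bool) -> None:
        self.outcomes.append(escalated)

    @property
    def rate(self) -> float:
        """Escalation rate over the most recent window of requests."""
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0

    def needs_retraining(self) -> bool:
        return self.rate >= self.alert_at
```

A bounded deque keeps memory constant and makes the metric responsive to recent traffic rather than diluted by months of history.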

    Building Your Own Model Routing Pipeline

    If your LLM API bill exceeds $5,000/month and your traffic includes a mix of simple and complex tasks, model routing will save money. Here's the implementation path we recommend.

    Week 1: Instrument and classify traffic. Log every request with metadata: task type, input length, output format, and the model currently processing it. Manually label 200-500 requests by actual complexity. This data reveals what percentage of traffic is routable and estimates your savings ceiling. Most teams discover the split is more favorable than they expected.
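The week-1 instrumentation can be as simple as appending one JSON line per request. A sketch with hypothetical field names and a whitespace word count as a rough token proxy:

```python
import json
import time

def log_request(task_type: str, prompt: str, output_format: str,
                model: str, log_path: str = "llm_requests.jsonl") -> dict:
    """Append one request's routing metadata as a JSON line and return it."""
    record = {
        "ts": time.time(),
        "task_type": task_type,
        "input_tokens_est": len(prompt.split()),  # rough proxy for token count
        "output_format": output_format,
        "model": model,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

A flat JSONL file is enough at this stage: it loads straight into a dataframe for the manual labeling pass and later becomes the training set for the ML classifier.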

    Week 2: Deploy rule-based routing. Build classification rules from your labeled data. Start conservative—only route requests you're highly confident are simple. A routing rate of 40-50% is normal initially. Enable escalation for any responses that fail output validation.

    Weeks 3-4: Add ML classification and shadow testing. Train your classifier on production data collected during week 2. Shadow test against premium model outputs. Tune confidence thresholds. Expand routing coverage as accuracy data builds confidence.

    When Model Routing Isn't Worth It

    Skip routing if your traffic is uniformly complex—legal reasoning, medical diagnosis, or research synthesis where every request genuinely requires premium capability. Also skip it if your monthly spend is under $2,000; the engineering investment won't pay back at that scale. For cost optimization at lower volumes, token optimization and smart caching deliver better ROI with less implementation effort.

    The Right Model for Every Request

    Model routing is the most underused cost optimization in production AI systems. The concept is straightforward—send each request to the cheapest model that handles it well—but the impact compounds across every request in your pipeline. Our client's 60% cost reduction came with zero quality degradation and a 58% latency improvement as a bonus.

    The organizations paying the least per AI interaction aren't using the cheapest models exclusively. They're matching the right model to each request. Start by instrumenting your traffic, classifying request complexity, and routing your simplest tasks to your cheapest capable model. The savings accumulate from day one, and the architecture only improves as you layer in ML-based classification and confidence-based escalation.

    Frequently Asked Questions


    What is LLM model routing?

    Model routing directs each LLM request to the most cost-effective model capable of handling it well, rather than sending everything to a single expensive model. A routing layer classifies incoming requests by complexity and task type, then selects the appropriate model tier. Simple tasks like classification go to cheap models, while complex reasoning goes to premium ones. This typically reduces API costs 40-70% without affecting response quality.

