Claude Opus 4.6 costs 60x more per output token than DeepSeek V3.2. Most AI products send every request to the same model.
That is the single largest waste in AI infrastructure today.
Cursor routes simple completions to lightweight models and complex multi-file edits to frontier models. Their Auto mode is unlimited on all paid plans because routing makes it cheaper than giving everyone frontier access. UC Berkeley's RouteLLM demonstrated 85% cost reduction while maintaining 95% of GPT-4 quality. Martian reports 20-97% savings depending on task complexity. Red Hat built semantic routing directly into vLLM.
The concept is straightforward: send each request to the cheapest model that can handle it. The implementation requires knowing which requests are simple, which models are cheap enough, and whether routing decisions actually saved money or quietly degraded quality.
Last Updated: April 2026
The 60x Spread
Frontier models get more capable. Budget models get cheaper. The gap between them is now the largest variable in AI unit economics.
Current pricing as of April 2026:
| Model | Provider | Input (per 1M tokens) | Output (per 1M tokens) | Best For |
|---|---|---|---|---|
| GPT-5 Nano | OpenAI | $0.05 | $0.40 | Ultra-cheap lightweight tasks |
| Gemini 2.5 Flash-Lite | Google | $0.10 | $0.40 | Fast classification, extraction |
| DeepSeek V3.2 | DeepSeek | $0.28 | $0.42 | High-volume commodity tasks |
| GPT-4o | OpenAI | $2.50 | $10.00 | General multimodal |
| Gemini 2.5 Pro | Google | $1.00 | $10.00 | Long-context analysis |
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | Complex coding, agents |
| Claude Opus 4.6 | Anthropic | $5.00 | $25.00 | Frontier reasoning, architecture |
DeepSeek V3.2 outputs at $0.42 per million tokens. Claude Opus 4.6 at $25.00. Even within Anthropic's lineup, Haiku is 3x cheaper than Sonnet on output.
An AI assistant handling 1 million requests per month, averaging 500 input tokens and 200 output tokens per request:
| Strategy | Monthly Cost | Savings vs. All-Sonnet |
|---|---|---|
| All Claude Sonnet 4.6 | $4,500 | Baseline |
| 70% Haiku / 30% Sonnet | $2,400 | 47% |
| 70% DeepSeek / 30% Sonnet | $1,500 | 67% |
| Routed by complexity (mixed fleet) | $900-$1,800 | 60-80% |
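The arithmetic behind this table can be sketched in a few lines. A minimal sketch: the Sonnet and DeepSeek prices come from the pricing table above; the Haiku prices ($1.00 input / $5.00 output) are an assumption, since Haiku does not appear in that table.

```python
# Monthly-cost arithmetic behind the table above.
# Prices are USD per 1M tokens: (input, output).
PRICES = {
    "claude-sonnet-4.6": (3.00, 15.00),
    "claude-haiku": (1.00, 5.00),   # assumed -- Haiku is not in the pricing table
    "deepseek-v3.2": (0.28, 0.42),
}

def monthly_cost(mix, requests=1_000_000, in_tok=500, out_tok=200):
    """mix maps model -> share of traffic; returns monthly USD."""
    total = 0.0
    for model, share in mix.items():
        in_price, out_price = PRICES[model]
        total += requests * share * (in_tok * in_price + out_tok * out_price) / 1_000_000
    return total

all_sonnet = monthly_cost({"claude-sonnet-4.6": 1.0})                      # $4,500
haiku_mix = monthly_cost({"claude-haiku": 0.7, "claude-sonnet-4.6": 0.3})  # $2,400
```

Swapping the mix dictionary reproduces each row of the table, which is exactly what makes the blended-fleet rows easy to sanity-check before committing to a routing design.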
Routing simple requests to cheap models is the highest-leverage cost optimization available to teams running multi-model AI products.
Which Tasks Need Which Models
The majority of production AI traffic is routine work. Commodity models handle it at equivalent quality.
| Task Type | Cheap Model Sufficient | Premium Model Required |
|---|---|---|
| Intent classification | 95%+ accuracy on Haiku, GPT-4o-mini, Flash-Lite | |
| Data extraction (names, dates, amounts) | Pattern matching, not reasoning | |
| Format transformation (JSON, text) | Structural, not creative | |
| Simple Q&A and FAQ | Straightforward retrieval | |
| Standard summarization | Comparable across tiers | |
| Complex reasoning chains | | Multi-step inference, synthesis |
| Nuanced writing (legal, marketing) | | Tone, persuasion, style sensitivity |
| Architectural decisions | | System design, complex debugging |
| Ambiguous or novel situations | | Genuine judgment required |
| Safety-critical outputs | | Medical, legal, financial accuracy |
Cursor's Auto mode testing confirmed this in production. A February 2026 benchmark found Auto mode matched or beat manual Sonnet selection on every metric. Simple completions went to lightweight models. Complex multi-file edits went to frontier. The routing itself was invisible to the user.
The goal is matching capability to requirement. Underspending on complex tasks creates errors that cost more than the savings. Overspending on simple tasks compounds into thousands of dollars of waste per month.
Three Routing Architectures
| Approach | How It Works | Tradeoffs |
|---|---|---|
| Rule-based | Route by feature type, user tier, input length | Fast, predictable, easy to debug. Misses complex queries that look simple. |
| Classifier-based | Small model classifies complexity, then routes | Adapts to content, catches edge cases. Adds 50-100ms latency. Classifier can be wrong. |
| Learned routing | Train on quality scores from production traffic | Best accuracy over time. Requires labeled data and continuous retraining. |
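The rule-based row is the one you can ship in an afternoon. A minimal sketch, with the caveat that the feature names, length threshold, and model identifiers below are illustrative, not from any specific product; real rules come from your own traffic analysis.

```python
# Minimal rule-based router sketch. Feature names, the length threshold,
# and model identifiers are illustrative assumptions.
CHEAP = "deepseek-v3.2"
PREMIUM = "claude-sonnet-4.6"

SIMPLE_FEATURES = {"intent_classification", "field_extraction", "faq"}

def route(feature: str, prompt: str) -> str:
    """Route known-simple, short requests cheap; default to capability."""
    if feature in SIMPLE_FEATURES and len(prompt) < 2_000:
        return CHEAP
    return PREMIUM
```

Note the default direction: when the rules are unsure, the request goes to the premium model. Defaulting cheap saves more money but converts every rule gap into a silent quality failure, which is the tradeoff the table above flags.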
RouteLLM from UC Berkeley is the reference implementation for learned routing. It trains on preference data from Chatbot Arena to predict when a cheaper model will match frontier quality. Their benchmarks: 85% cost reduction at 95% quality retention.
Martian's model router takes the classifier approach, analyzing prompts to select the best model per request. LiteLLM provides the plumbing: a unified OpenAI-compatible interface across 100+ providers with basic cost tracking. Portkey adds a gateway with fallbacks, load balancing, and per-request logging.
Classifier economics. At Flash-Lite pricing, classifying 1 million requests costs roughly $50. If correct routing saves $2,000+ per month, the ROI clears immediately. But the classifier introduces a failure mode: misrouted complex queries return bad answers, misrouted simple queries waste money. Both are invisible without per-request measurement.
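The $50 figure falls out of the token math. A back-of-envelope sketch, assuming roughly 480 input and 5 output tokens per classification call (those per-call token counts are assumptions; the Flash-Lite prices are from the pricing table):

```python
# Classifier cost at Flash-Lite pricing ($0.10 in / $0.40 out per 1M tokens).
# Per-classification token counts (~480 in, 5 out) are assumed.
requests = 1_000_000
classifier_cost = requests * (480 * 0.10 + 5 * 0.40) / 1_000_000  # $50/month
monthly_savings = 2_000
roi = monthly_savings / classifier_cost  # 40x
```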
When routing is not worth it. Below 10,000 requests per month, engineering investment exceeds savings. If traffic is uniformly complex, routing overhead adds cost without benefit. Start with a single model. Instrument costs per request. Add routing when the data shows a clear split between simple and complex traffic.
The Measurement Problem
Here is the quiet part about routing: it is an optimization, not a strategy.
Routing reduces the cost per query. But if your pricing does not account for query complexity, cheaper queries subsidize expensive ones at the customer level. Routing without cost attribution is like negotiating a volume discount without knowing which customers drive the volume.
The monthly bill might drop. It might also rise because the classifier is miscategorizing queries and sending simple requests to expensive models. Without per-request cost tracking, you will not know until the invoice arrives.
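Per-request tracking is mostly bookkeeping. A minimal sketch, assuming the provider response exposes token usage (OpenAI-compatible APIs return `usage.prompt_tokens` and `usage.completion_tokens`); the customer IDs are hypothetical and the prices come from the pricing table above:

```python
# Per-request cost attribution sketch, keyed by (customer, model).
from collections import defaultdict

PRICES = {  # USD per 1M tokens: (input, output)
    "deepseek-v3.2": (0.28, 0.42),
    "claude-sonnet-4.6": (3.00, 15.00),
}

spend = defaultdict(float)

def record(customer_id, model, prompt_tokens, completion_tokens):
    """Accumulate per-customer, per-model spend; return this request's cost."""
    in_price, out_price = PRICES[model]
    cost = (prompt_tokens * in_price + completion_tokens * out_price) / 1_000_000
    spend[(customer_id, model)] += cost
    return cost

# One routed request for a hypothetical customer:
cost = record("acme", "deepseek-v3.2", 500, 200)  # ≈ $0.000224
```

Aggregating `spend` by customer is what turns the monthly invoice from a single opaque number into the per-customer margin picture the rest of this section argues for.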
IDC's research on model routing identifies measurement as the bottleneck holding back enterprise adoption. Organizations report 30-70% cost reductions with routing, but only when they can attribute costs to individual requests and measure quality alongside spend.
| Metric | What It Reveals |
|---|---|
| Cost per request by model | Whether routing decisions actually save money |
| Fallback rate | How often the router misjudges complexity |
| Quality scores by route | Whether cheap models degrade user experience |
| Cost per customer | Which customers show negative margin even with routing |
A 20% fallback rate can erode a large share of the savings; a high enough rate erases them entirely. Without per-model, per-request cost data, you are operating on faith.
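To put a number on the erosion, here is a sketch under one simple assumption: a fallback pays for both the cheap attempt and the premium retry. Per-request costs use the article's traffic profile (500 input / 200 output tokens) and the pricing table; real fallback costs may differ (retries can carry extra context, for instance).

```python
# Fallback-adjusted savings sketch: a fallback bills the cheap attempt
# AND the premium retry. Per-request costs from the pricing table at
# 500 input / 200 output tokens.
CHEAP_PER_REQ = (500 * 0.28 + 200 * 0.42) / 1_000_000     # DeepSeek ~ $0.000224
PREMIUM_PER_REQ = (500 * 3.00 + 200 * 15.00) / 1_000_000  # Sonnet = $0.0045

def routed_cost(cheap_share, fallback_rate):
    """Blended per-request cost when `fallback_rate` of cheap-routed
    requests must be retried on the premium model."""
    cheap_leg = cheap_share * (CHEAP_PER_REQ + fallback_rate * PREMIUM_PER_REQ)
    premium_leg = (1 - cheap_share) * PREMIUM_PER_REQ
    return cheap_leg + premium_leg

def savings(cheap_share, fallback_rate):
    """Fractional savings versus sending everything to the premium model."""
    return 1 - routed_cost(cheap_share, fallback_rate) / PREMIUM_PER_REQ

clean = savings(0.7, 0.0)  # ~0.665: ~66% savings with no fallbacks
noisy = savings(0.7, 0.2)  # ~0.525: a 20% fallback rate eats about a fifth of them
```

Under this simple double-billing model, a 20% fallback rate costs roughly a fifth of the savings; the point stands that without measuring the fallback rate, you cannot tell how much of the headline savings you are actually keeping.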
This connects to the unit economics framework for AI products. Routing changes your cost-per-request distribution, but only granular tracking tells you whether the distribution improved or just shifted. Teams running multi-provider stacks face this at every layer: each provider bills differently, each model has different token economics, and the routing layer adds its own overhead.
What to Build Based on Where You Are
| Your Situation | Next Move |
|---|---|
| Under 10K requests/month | Stay on one model. Instrument per-request costs now so you have data when volume grows. |
| 10K-500K requests/month, single model | Add rule-based routing by feature type and input length. No classifier needed yet. |
| 500K+ requests/month, rule-based routing | Evaluate LiteLLM or Portkey for provider abstraction. Add a classifier only if rules miss >15% of simple queries. |
| Any volume, no per-request cost data | Instrument first. Routing without measurement is optimization without feedback. |
The pattern is consistent across every team that gets routing right: they measure before they optimize. They know cost per request, cost per customer, and cost per feature before they build the routing layer. The routing logic is the easy part. Knowing whether it worked is the hard part.
Routing Saves Money. The Question Is Where It Goes.
Model pricing changes quarterly. DeepSeek dropped the floor on commodity pricing. Anthropic cut Opus pricing by 67% and expanded context to 1M tokens. OpenAI launched GPT-5 Nano at $0.05 per million input tokens. Google pushed Flash-Lite below $0.50 per million output tokens.
A routing strategy built on January 2026 pricing is already outdated. The teams that treat routing as living infrastructure, with continuous measurement and quarterly re-evaluation, will maintain cost advantages. The teams that deploy routing once and check their bill monthly will drift back toward overspending within two quarters.
But the deeper question is not whether routing saves money. It does. The question is whether the savings flow to your margins or get consumed by increased usage. The Copilot story proved this: better products drive more queries, and more queries drive higher costs. Routing compresses the per-query cost. It does not cap total spend per customer.
Without per-customer cost tracking, routing is a discount you cannot verify. You know the average cost per query dropped. You do not know which customers are profitable and which are consuming the savings. That is the same margin visibility gap that routing was supposed to solve, just at a different layer of the stack.
If you are building with multiple models, Bear Lumen gives you per-request cost attribution across providers, so you can measure whether routing savings reach your margins or get consumed by your highest-volume customers.