Claude Opus 4.6 costs 60x more per output token than DeepSeek V3.2. Most AI products send every request to the same model.
That is the single largest waste in AI infrastructure today.
Cursor routes simple completions to lightweight models and complex multi-file edits to frontier models. Their Auto mode is unlimited on all paid plans because routing makes it cheaper than giving everyone frontier access. UC Berkeley's RouteLLM demonstrated 85% cost reduction while maintaining 95% of GPT-4 quality. Martian reports 20-97% savings depending on task complexity. Red Hat built semantic routing directly into vLLM.
The concept is straightforward: send each request to the cheapest model that can handle it. The implementation requires knowing which requests are simple, which models are cheap enough, and whether routing decisions actually saved money or quietly degraded quality.
Last Updated: April 2026
The 60x Spread
Frontier models get more capable. Budget models get cheaper. The gap between them is now the largest variable in AI unit economics.
Current pricing as of April 2026:
| Model | Provider | Input (per 1M tokens) | Output (per 1M tokens) | Best For |
|---|---|---|---|---|
| GPT-5 Nano | OpenAI | $0.05 | $0.40 | Ultra-cheap lightweight tasks |
| Gemini 2.5 Flash-Lite | Google | $0.10 | $0.40 | Fast classification, extraction |
| DeepSeek V3.2 | DeepSeek | $0.28 | $0.42 | High-volume commodity tasks |
| GPT-4o | OpenAI | $2.50 | $10.00 | General multimodal |
| Gemini 2.5 Pro | Google | $1.00 | $10.00 | Long-context analysis |
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | Complex coding, agents |
| Claude Opus 4.6 | Anthropic | $5.00 | $25.00 | Frontier reasoning, architecture |
DeepSeek V3.2 outputs at $0.42 per million tokens. Claude Opus 4.6 at $25.00. Even within Anthropic's lineup, Haiku is 3x cheaper than Sonnet on output.
An AI assistant handling 1 million requests per month, averaging 500 input tokens and 200 output tokens per request:
| Strategy | Monthly Cost | Savings vs. All-Sonnet |
|---|---|---|
| All Claude Sonnet 4.6 | $4,500 | Baseline |
| 70% Haiku / 30% Sonnet | $2,400 | 47% |
| 70% DeepSeek / 30% Sonnet | $1,500 | 67% |
| Routed by complexity (mixed fleet) | $900-$1,800 | 60-80% |
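The arithmetic behind this table can be sketched in a few lines. A minimal sketch: the Sonnet and DeepSeek prices come from the pricing table above; the Haiku prices ($1.00 input / $5.00 output) are an assumption, since Haiku does not appear in that table.

```python
# Monthly-cost arithmetic behind the table above.
# Prices are USD per 1M tokens: (input, output).
PRICES = {
    "claude-sonnet-4.6": (3.00, 15.00),
    "claude-haiku": (1.00, 5.00),   # assumed -- Haiku is not in the pricing table
    "deepseek-v3.2": (0.28, 0.42),
}

def monthly_cost(mix, requests=1_000_000, in_tok=500, out_tok=200):
    """mix maps model -> share of traffic; returns monthly USD."""
    total = 0.0
    for model, share in mix.items():
        in_price, out_price = PRICES[model]
        total += requests * share * (in_tok * in_price + out_tok * out_price) / 1_000_000
    return total

all_sonnet = monthly_cost({"claude-sonnet-4.6": 1.0})                      # $4,500
haiku_mix = monthly_cost({"claude-haiku": 0.7, "claude-sonnet-4.6": 0.3})  # $2,400
```

Swapping the mix dictionary reproduces each row of the table, which is exactly what makes the blended-fleet rows easy to sanity-check before committing to a routing design.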
Routing simple requests to cheap models is the highest-leverage cost optimization available to teams running multi-model AI products.
Which Tasks Need Which Models
The majority of production AI traffic is routine work. Commodity models handle it at equivalent quality.
| Task Type | Cheap Model Sufficient | Premium Model Required |
|---|---|---|
| Intent classification | 95%+ accuracy on Haiku, GPT-4o-mini, Flash-Lite | |
| Data extraction (names, dates, amounts) | Pattern matching, not reasoning | |
| Format transformation (JSON, text) | Structural, not creative | |
| Simple Q&A and FAQ | Straightforward retrieval | |
| Standard summarization | Comparable across tiers | |
| Complex reasoning chains | | Multi-step inference, synthesis |
| Nuanced writing (legal, marketing) | | Tone, persuasion, style sensitivity |
| Architectural decisions | | System design, complex debugging |
| Ambiguous or novel situations | | Genuine judgment required |
| Safety-critical outputs | | Medical, legal, financial accuracy |
Cursor's Auto mode testing confirmed this in production. A February 2026 benchmark found Auto mode matched or beat manual Sonnet selection on every metric. Simple completions went to lightweight models. Complex multi-file edits went to frontier. The routing itself was invisible to the user.
The goal is matching capability to requirement. Underspending on complex tasks creates errors that cost more than the savings. Overspending on simple tasks compounds into thousands of dollars of waste per month.
Three Routing Architectures
| Approach | How It Works | Tradeoffs |
|---|---|---|
| Rule-based | Route by feature type, user tier, input length | Fast, predictable, easy to debug. Misses complex queries that look simple. |
| Classifier-based | Small model classifies complexity, then routes | Adapts to content, catches edge cases. Adds 50-100ms latency. Classifier can be wrong. |
| Learned routing | Train on quality scores from production traffic | Best accuracy over time. Requires labeled data and continuous retraining. |
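The rule-based row is the one you can ship in an afternoon. A minimal sketch, with the caveat that the feature names, length threshold, and model identifiers below are illustrative, not from any specific product; real rules come from your own traffic analysis.

```python
# Minimal rule-based router sketch. Feature names, the length threshold,
# and model identifiers are illustrative assumptions.
CHEAP = "deepseek-v3.2"
PREMIUM = "claude-sonnet-4.6"

SIMPLE_FEATURES = {"intent_classification", "field_extraction", "faq"}

def route(feature: str, prompt: str) -> str:
    """Route known-simple, short requests cheap; default to capability."""
    if feature in SIMPLE_FEATURES and len(prompt) < 2_000:
        return CHEAP
    return PREMIUM
```

Note the default direction: when the rules are unsure, the request goes to the premium model. Defaulting cheap saves more money but converts every rule gap into a silent quality failure, which is the tradeoff the table above flags.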
RouteLLM from UC Berkeley is the reference implementation for learned routing. It trains on preference data from Chatbot Arena to predict when a cheaper model will match frontier quality. Their benchmarks: 85% cost reduction at 95% quality retention.
Martian's model router takes the classifier approach, analyzing prompts to select the best model per request. LiteLLM provides the plumbing: a unified OpenAI-compatible interface across 100+ providers with basic cost tracking. Portkey adds a gateway with fallbacks, load balancing, and per-request logging.
Classifier economics. At Flash-Lite pricing, classifying 1 million requests costs roughly $50. If correct routing saves $2,000+ per month, the ROI clears immediately. But the classifier introduces a failure mode: misrouted complex queries return bad answers, misrouted simple queries waste money. Both are invisible without per-request measurement.
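The $50 figure falls out of the token math. A back-of-envelope sketch, assuming roughly 480 input and 5 output tokens per classification call (those per-call token counts are assumptions; the Flash-Lite prices are from the pricing table):

```python
# Classifier cost at Flash-Lite pricing ($0.10 in / $0.40 out per 1M tokens).
# Per-classification token counts (~480 in, 5 out) are assumed.
requests = 1_000_000
classifier_cost = requests * (480 * 0.10 + 5 * 0.40) / 1_000_000  # $50/month
monthly_savings = 2_000
roi = monthly_savings / classifier_cost  # 40x
```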
When routing is not worth it. Below 10,000 requests per month, engineering investment exceeds savings. If traffic is uniformly complex, routing overhead adds cost without benefit. Start with a single model. Instrument costs per request. Add routing when the data shows a clear split between simple and complex traffic.
The Measurement Problem
Here is the quiet part about routing: it is an optimization, not a strategy.
Routing reduces the cost per query. But if your pricing does not account for query complexity, cheaper queries subsidize expensive ones at the customer level. Routing without cost attribution is like negotiating a volume discount without knowing which customers drive the volume.
The monthly bill might drop. It might also rise because the classifier is miscategorizing queries and sending simple requests to expensive models. Without per-request cost tracking, you will not know until the invoice arrives.
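Per-request tracking is mostly bookkeeping. A minimal sketch, assuming the provider response exposes token usage (OpenAI-compatible APIs return `usage.prompt_tokens` and `usage.completion_tokens`); the customer IDs are hypothetical and the prices come from the pricing table above:

```python
# Per-request cost attribution sketch, keyed by (customer, model).
from collections import defaultdict

PRICES = {  # USD per 1M tokens: (input, output)
    "deepseek-v3.2": (0.28, 0.42),
    "claude-sonnet-4.6": (3.00, 15.00),
}

spend = defaultdict(float)

def record(customer_id, model, prompt_tokens, completion_tokens):
    """Accumulate per-customer, per-model spend; return this request's cost."""
    in_price, out_price = PRICES[model]
    cost = (prompt_tokens * in_price + completion_tokens * out_price) / 1_000_000
    spend[(customer_id, model)] += cost
    return cost

# One routed request for a hypothetical customer:
cost = record("acme", "deepseek-v3.2", 500, 200)  # ≈ $0.000224
```

Aggregating `spend` by customer is what turns the monthly invoice from a single opaque number into the per-customer margin picture the rest of this section argues for.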
IDC's research on model routing identifies measurement as the bottleneck holding back enterprise adoption. Organizations report 30-70% cost reductions with routing, but only when they can attribute costs to individual requests and measure quality alongside spend.
| Metric | What It Reveals |
|---|---|
| Cost per request by model | Whether routing decisions actually save money |
| Fallback rate | How often the router misjudges complexity |
| Quality scores by route | Whether cheap models degrade user experience |
| Cost per customer | Which customers show negative margin even with routing |
A 20% fallback rate can erode a large share of the savings; a high enough rate erases them entirely. Without per-model, per-request cost data, you are operating on faith.
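To put a number on the erosion, here is a sketch under one simple assumption: a fallback pays for both the cheap attempt and the premium retry. Per-request costs use the article's traffic profile (500 input / 200 output tokens) and the pricing table; real fallback costs may differ (retries can carry extra context, for instance).

```python
# Fallback-adjusted savings sketch: a fallback bills the cheap attempt
# AND the premium retry. Per-request costs from the pricing table at
# 500 input / 200 output tokens.
CHEAP_PER_REQ = (500 * 0.28 + 200 * 0.42) / 1_000_000     # DeepSeek ~ $0.000224
PREMIUM_PER_REQ = (500 * 3.00 + 200 * 15.00) / 1_000_000  # Sonnet = $0.0045

def routed_cost(cheap_share, fallback_rate):
    """Blended per-request cost when `fallback_rate` of cheap-routed
    requests must be retried on the premium model."""
    cheap_leg = cheap_share * (CHEAP_PER_REQ + fallback_rate * PREMIUM_PER_REQ)
    premium_leg = (1 - cheap_share) * PREMIUM_PER_REQ
    return cheap_leg + premium_leg

def savings(cheap_share, fallback_rate):
    """Fractional savings versus sending everything to the premium model."""
    return 1 - routed_cost(cheap_share, fallback_rate) / PREMIUM_PER_REQ

clean = savings(0.7, 0.0)  # ~0.665: ~66% savings with no fallbacks
noisy = savings(0.7, 0.2)  # ~0.525: a 20% fallback rate eats about a fifth of them
```

Under this simple double-billing model, a 20% fallback rate costs roughly a fifth of the savings; the point stands that without measuring the fallback rate, you cannot tell how much of the headline savings you are actually keeping.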
This connects to the unit economics framework for AI products. Routing changes your cost-per-request distribution, but only granular tracking tells you whether the distribution improved or just shifted. Teams running multi-provider stacks face this at every layer: each provider bills differently, each model has different token economics, and the routing layer adds its own overhead.
What to Build Based on Where You Are
| Your Situation | Next Move |
|---|---|
| Under 10K requests/month | Stay on one model. Instrument per-request costs now so you have data when volume grows. |
| 10K-500K requests/month, single model | Add rule-based routing by feature type and input length. No classifier needed yet. |
| 500K+ requests/month, rule-based routing | Evaluate LiteLLM or Portkey for provider abstraction. Add a classifier only if rules miss >15% of simple queries. |
| Any volume, no per-request cost data | Instrument first. Routing without measurement is optimization without feedback. |
The pattern is consistent across every team that gets routing right: they measure before they optimize. They know cost per request, cost per customer, and cost per feature before they build the routing layer. The routing logic is the easy part. Knowing whether it worked is the hard part.
Routing Saves Money. The Question Is Where It Goes.
Model pricing changes quarterly. DeepSeek dropped the floor on commodity pricing. Anthropic cut Opus pricing by 67% and expanded context to 1M tokens. OpenAI launched GPT-5 Nano at $0.05 per million input tokens. Google pushed Flash-Lite below $0.50 per million output tokens.
A routing strategy built on January 2026 pricing is already outdated. The teams that treat routing as living infrastructure, with continuous measurement and quarterly re-evaluation, will maintain cost advantages. The teams that deploy routing once and check their bill monthly will drift back toward overspending within two quarters.
But the deeper question is not whether routing saves money. It does. The question is whether the savings flow to your margins or get consumed by increased usage. The Copilot story proved this: better products drive more queries, and more queries drive higher costs. Routing compresses the per-query cost. It does not cap total spend per customer.
Without per-customer cost tracking, routing is a discount you cannot verify. You know the average cost per query dropped. You do not know which customers are profitable and which are consuming the savings. That is the same margin visibility gap that routing was supposed to solve, just at a different layer of the stack.
If you are building with multiple models, Bear Lumen gives you per-request cost attribution across providers, so you can measure whether routing savings reach your margins or get consumed by your highest-volume customers.