
Multi-Model Routing: Matching Query Complexity to the Right Model

How intelligent model routing cuts AI API costs by matching query complexity to the cheapest capable model. Real pricing data, routing strategies, and the infrastructure needed to measure what routing actually saves.


Bear Lumen Team

Research

#cost-optimization #multi-model-routing #ai-infrastructure #unit-economics #llm-costs

Claude Opus 4.6 costs 60x more per output token than DeepSeek V3.2. Most AI products send every request to the same model.

That is the single largest waste in AI infrastructure today.

Cursor routes simple completions to lightweight models and complex multi-file edits to frontier models. Their Auto mode is unlimited on all paid plans because routing makes it cheaper than giving everyone frontier access. UC Berkeley's RouteLLM demonstrated 85% cost reduction while maintaining 95% of GPT-4 quality. Martian reports 20-97% savings depending on task complexity. Red Hat built semantic routing directly into vLLM.

The concept is straightforward: send each request to the cheapest model that can handle it. The implementation requires knowing which requests are simple, which models are cheap enough, and whether routing decisions actually saved money or quietly degraded quality.

Last Updated: April 2026


The 60x Spread

Frontier models get more capable. Budget models get cheaper. The gap between them is now the largest variable in AI unit economics.

Current pricing as of April 2026:

| Model | Provider | Input (per 1M tokens) | Output (per 1M tokens) | Best For |
|---|---|---|---|---|
| GPT-5 Nano | OpenAI | $0.05 | $0.40 | Ultra-cheap lightweight tasks |
| Gemini 2.5 Flash-Lite | Google | $0.10 | $0.40 | Fast classification, extraction |
| DeepSeek V3.2 | DeepSeek | $0.28 | $0.42 | High-volume commodity tasks |
| GPT-4o | OpenAI | $2.50 | $10.00 | General multimodal |
| Gemini 2.5 Pro | Google | $1.00 | $10.00 | Long-context analysis |
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | Complex coding, agents |
| Claude Opus 4.6 | Anthropic | $5.00 | $25.00 | Frontier reasoning, architecture |

DeepSeek V3.2 outputs at $0.42 per million tokens. Claude Opus 4.6 at $25.00. Even within Anthropic's lineup, Haiku is 5x cheaper than Sonnet on output.

Consider an AI assistant handling 1 million requests per month, averaging 500 input and 200 output tokens per request:

| Strategy | Monthly Cost | Savings vs. All-Sonnet |
|---|---|---|
| All Claude Sonnet 4.6 | $4,500 | Baseline |
| 70% Haiku / 30% Sonnet | $2,400 | 47% |
| 70% DeepSeek / 30% Sonnet | $1,450 | 68% |
| Routed by complexity (mixed fleet) | $900-$1,800 | 60-80% |

Routing simple requests to cheap models is the highest-leverage cost optimization available to teams running multi-model AI products.
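The arithmetic behind the table above is simple enough to sketch directly. This uses the April 2026 pricing from the earlier table and the example traffic profile (1 million requests per month, 500 input and 200 output tokens per request); the model IDs and mix shares are illustrative.

```python
# Pricing from the April 2026 table: (input, output) USD per 1M tokens.
PRICES = {
    "claude-sonnet-4.6": (3.00, 15.00),
    "deepseek-v3.2": (0.28, 0.42),
}

def cost_per_request(model, input_tokens=500, output_tokens=200):
    """Blended per-request cost at the example traffic profile."""
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

def monthly_cost(mix, requests=1_000_000):
    """mix maps model -> share of traffic; shares should sum to 1.0."""
    return requests * sum(share * cost_per_request(m) for m, share in mix.items())

all_sonnet = monthly_cost({"claude-sonnet-4.6": 1.0})
routed = monthly_cost({"deepseek-v3.2": 0.70, "claude-sonnet-4.6": 0.30})
print(f"All-Sonnet:  ${all_sonnet:,.0f}/mo")   # $4,500/mo
print(f"70/30 split: ${routed:,.0f}/mo")        # ≈ $1,507/mo, near the table's rounded $1,450
```

The 70/30 DeepSeek split lands near $1,500 a month against the $4,500 baseline, in line with the table's figures.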


Which Tasks Need Which Models

The majority of production AI traffic is routine work. Commodity models handle it at equivalent quality.

| Task Type | Cheap Model Sufficient | Premium Model Required |
|---|---|---|
| Intent classification | 95%+ accuracy on Haiku, GPT-4o-mini, Flash-Lite | — |
| Data extraction (names, dates, amounts) | Pattern matching, not reasoning | — |
| Format transformation (JSON, text) | Structural, not creative | — |
| Simple Q&A and FAQ | Straightforward retrieval | — |
| Standard summarization | Comparable across tiers | — |
| Complex reasoning chains | — | Multi-step inference, synthesis |
| Nuanced writing (legal, marketing) | — | Tone, persuasion, style sensitivity |
| Architectural decisions | — | System design, complex debugging |
| Ambiguous or novel situations | — | Genuine judgment required |
| Safety-critical outputs | — | Medical, legal, financial accuracy |

Cursor's Auto mode testing confirmed this in production. A February 2026 benchmark found Auto mode matched or beat manual Sonnet selection on every metric. Simple completions went to lightweight models. Complex multi-file edits went to frontier. The routing itself was invisible to the user.

The goal is matching capability to requirement. Underspending on complex tasks creates errors that cost more than the savings. Overspending on simple tasks compounds into thousands of dollars of waste per month.


Three Routing Architectures

| Approach | How It Works | Tradeoffs |
|---|---|---|
| Rule-based | Route by feature type, user tier, input length | Fast, predictable, easy to debug. Misses complex queries that look simple. |
| Classifier-based | Small model classifies complexity, then routes | Adapts to content, catches edge cases. Adds 50-100ms latency. Classifier can be wrong. |
| Learned routing | Train on quality scores from production traffic | Best accuracy over time. Requires labeled data and continuous retraining. |
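A rule-based router from the first row of the table can be sketched in a few lines. The thresholds, feature names, and model IDs here are illustrative assumptions, not a production policy:

```python
# Minimal rule-based router sketch. Rules: user tier, known-simple
# feature types, and an input-length guard for hidden complexity.
CHEAP_MODEL = "deepseek-v3.2"
PREMIUM_MODEL = "claude-sonnet-4.6"

# Hypothetical feature types this product treats as known-simple.
SIMPLE_FEATURES = {"intent_classification", "extraction", "faq", "summarization"}

def route(feature: str, prompt: str, user_tier: str = "standard") -> str:
    # Rule 1: premium-tier users always get the frontier model.
    if user_tier == "premium":
        return PREMIUM_MODEL
    # Rule 2: known-simple features go to the cheap model...
    # Rule 3: ...unless the input is long enough to suggest hidden complexity.
    if feature in SIMPLE_FEATURES and len(prompt) < 4000:
        return CHEAP_MODEL
    return PREMIUM_MODEL

assert route("faq", "What are your business hours?") == CHEAP_MODEL
assert route("code_review", "def f(): ...") == PREMIUM_MODEL
assert route("faq", "short question", user_tier="premium") == PREMIUM_MODEL
```

This is also where the table's "misses complex queries that look simple" tradeoff lives: a short FAQ-shaped prompt that actually needs reasoning will be routed cheap, which is why fallback and quality metrics matter downstream.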

RouteLLM from UC Berkeley is the reference implementation for learned routing. It trains on preference data from Chatbot Arena to predict when a cheaper model will match frontier quality. Their benchmarks: 85% cost reduction at 95% quality retention.

Martian's model router takes the classifier approach, analyzing prompts to select the best model per request. LiteLLM provides plumbing with a unified OpenAI-compatible interface across 100+ providers and basic cost tracking. Portkey adds a gateway with fallbacks, load balancing, and per-request logging.

Classifier economics. At Flash-Lite pricing, classifying 1 million requests costs roughly $50. If correct routing saves $2,000+ per month, the ROI clears immediately. But the classifier introduces a failure mode: misrouted complex queries return bad answers, misrouted simple queries waste money. Both are invisible without per-request measurement.
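The classifier-economics figure above works out as follows, assuming roughly 500 prompt tokens per classification call (the per-request token count is an assumption for illustration):

```python
# Classifier cost at Flash-Lite input pricing ($0.10 per 1M tokens),
# assuming ~500 prompt tokens per request. Output tokens for a one-label
# classification are negligible and are ignored here.
requests = 1_000_000
prompt_tokens = 500
flash_lite_input = 0.10  # USD per 1M input tokens

classifier_cost = requests * prompt_tokens * flash_lite_input / 1_000_000
print(f"Classifier cost: ${classifier_cost:,.0f}/mo")  # $50/mo
```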

When routing is not worth it. Below 10,000 requests per month, engineering investment exceeds savings. If traffic is uniformly complex, routing overhead adds cost without benefit. Start with a single model. Instrument costs per request. Add routing when the data shows a clear split between simple and complex traffic.


The Measurement Problem

Here is the quiet part about routing: it is an optimization, not a strategy.

Routing reduces the cost per query. But if your pricing does not account for query complexity, cheaper queries subsidize expensive ones at the customer level. Routing without cost attribution is like negotiating a volume discount without knowing which customers drive the volume.

The monthly bill might drop. It might also rise because the classifier is miscategorizing queries and sending simple requests to expensive models. Without per-request cost tracking, you will not know until the invoice arrives.

IDC's research on model routing identifies measurement as the bottleneck holding back enterprise adoption. Organizations report 30-70% cost reductions with routing, but only when they can attribute costs to individual requests and measure quality alongside spend.

| Metric | What It Reveals |
|---|---|
| Cost per request by model | Whether routing decisions actually save money |
| Fallback rate | How often the router misjudges complexity |
| Quality scores by route | Whether cheap models degrade user experience |
| Cost per customer | Which customers show negative margin even with routing |

A 20% fallback rate can erase the entire savings. Without per-model, per-request cost data, you are operating on faith.

This connects to the unit economics framework for AI products. Routing changes your cost-per-request distribution, but only granular tracking tells you whether the distribution improved or just shifted. Teams running multi-provider stacks face this at every layer: each provider bills differently, each model has different token economics, and the routing layer adds its own overhead.
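The metrics in the table above all fall out of one primitive: a per-request log with model, token counts, and customer. A minimal sketch of that attribution layer, with illustrative field names and pricing:

```python
from collections import defaultdict
from dataclasses import dataclass

# Illustrative pricing: (input, output) USD per 1M tokens.
PRICES = {"deepseek-v3.2": (0.28, 0.42), "claude-sonnet-4.6": (3.00, 15.00)}

@dataclass
class RequestLog:
    customer: str
    model: str
    input_tokens: int
    output_tokens: int
    fallback: bool = False  # router escalated this request to a bigger model

def request_cost(r: RequestLog) -> float:
    inp, out = PRICES[r.model]
    return (r.input_tokens * inp + r.output_tokens * out) / 1_000_000

def rollup(logs: list[RequestLog]):
    """Aggregate per-customer cost, per-model cost, and fallback rate."""
    by_customer = defaultdict(float)
    by_model = defaultdict(float)
    fallbacks = 0
    for r in logs:
        c = request_cost(r)
        by_customer[r.customer] += c
        by_model[r.model] += c
        fallbacks += r.fallback
    return dict(by_customer), dict(by_model), fallbacks / len(logs)

logs = [
    RequestLog("acme", "deepseek-v3.2", 500, 200),
    RequestLog("acme", "claude-sonnet-4.6", 500, 200, fallback=True),
    RequestLog("globex", "deepseek-v3.2", 500, 200),
]
per_customer, per_model, fallback_rate = rollup(logs)
```

With this in place, "cost per customer" and "fallback rate" are queries over the log, not estimates from the monthly invoice.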


What to Build Based on Where You Are

| Your Situation | Next Move |
|---|---|
| Under 10K requests/month | Stay on one model. Instrument per-request costs now so you have data when volume grows. |
| 10K-500K requests/month, single model | Add rule-based routing by feature type and input length. No classifier needed yet. |
| 500K+ requests/month, rule-based routing | Evaluate LiteLLM or Portkey for provider abstraction. Add a classifier only if rules miss >15% of simple queries. |
| Any volume, no per-request cost data | Instrument first. Routing without measurement is optimization without feedback. |

The pattern is consistent across every team that gets routing right: they measure before they optimize. They know cost per request, cost per customer, and cost per feature before they build the routing layer. The routing logic is the easy part. Knowing whether it worked is the hard part.


Routing Saves Money. The Question Is Where It Goes.

Model pricing changes quarterly. DeepSeek dropped the floor on commodity pricing. Anthropic cut Opus pricing by 67% and expanded context to 1M tokens. OpenAI launched GPT-5 Nano at $0.05 per million input tokens. Google pushed Flash-Lite below $0.50 per million output tokens.

A routing strategy built on January 2026 pricing is already outdated. The teams that treat routing as living infrastructure, with continuous measurement and quarterly re-evaluation, will maintain cost advantages. The teams that deploy routing once and check their bill monthly will drift back toward overspending within two quarters.

But the deeper question is not whether routing saves money. It does. The question is whether the savings flow to your margins or get consumed by increased usage. The Copilot story proved this: better products drive more queries, and more queries drive higher costs. Routing compresses the per-query cost. It does not cap total spend per customer.

Without per-customer cost tracking, routing is a discount you cannot verify. You know the average cost per query dropped. You do not know which customers are profitable and which are consuming the savings. That is the same margin visibility gap that routing was supposed to solve, just at a different layer of the stack.

If you are building with multiple models, Bear Lumen gives you per-request cost attribution across providers, so you can measure whether routing savings reach your margins or get consumed by your highest-volume customers.
