
When Your AI Costs Drop 90%: Three Pricing Responses

AI inference costs dropped 90%. Three pricing responses exist: pocket the savings, pass them through, or redesign tiers. Only the third improves both margins and customer value. It requires per-customer cost data most teams lack.

Bear Lumen Team

Research

#unit-economics #ai-margins #pricing-strategy #cost-attribution #commoditization

GPT-4 level inference costs roughly 90% less than they did 18 months ago. DeepSeek R1 launched at $0.55 per million input tokens, 78% below GPT-4o. Alibaba cut Qwen3-Max by 50% to $0.46. Google pushed Gemini 2.5 Flash-Lite to $0.10. The floor keeps falling.

For products where inference makes up 70-90% of variable costs, a 90% cost drop changes the margin math overnight. Three responses exist. Two of them eventually give the improvement away: one to competitors, one to customers. The third is the only one that improves both margins and customer value. Which one you pick depends on whether you can see per-customer cost data.

Last Updated: April 2026


Response A: Pocket the Savings

Keep prices unchanged. Costs fell. Gross margin expands. A product running at 30% jumps to 60-70% overnight.

This is the default for most companies, often by accident. "We did not change anything" is margin capture without the intentionality.

GitHub Copilot was losing $20 per user per month at its $10 price point. As model costs dropped through 2024 and 2025, those losses narrowed. Microsoft held the $10 price and let cost deflation close the gap. By early 2026, Copilot had 4.7 million paid subscribers. The strategy worked because Microsoft had $80 billion in cash reserves to absorb two years of per-user losses.

Intercom's Fin charges $0.99 per resolved support ticket. If average inference cost per resolution drops from $0.30 to $0.08, gross margin on Fin jumps from 70% to 92%. The buyer never sees the change.
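The arithmetic is easy to verify. The sketch below uses Fin's published $0.99 per-resolution price; the $0.30 and $0.08 cost figures are the illustrative numbers from this article, not Intercom's reported actuals.

```python
def gross_margin(price: float, unit_cost: float) -> float:
    """Gross margin as a fraction of price."""
    return (price - unit_cost) / price

PRICE_PER_RESOLUTION = 0.99  # Fin's published per-resolution price

# Inference cost per resolution before and after the drop (illustrative)
before = gross_margin(PRICE_PER_RESOLUTION, 0.30)
after = gross_margin(PRICE_PER_RESOLUTION, 0.08)

print(f"margin before: {before:.0%}")  # ~70%
print(f"margin after:  {after:.0%}")   # ~92%
```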

The problem: competitors do. If a rival chooses Response B and undercuts pricing by 40%, the window for margin capture closes. The market determines how long this holds, not the company.


Response B: Pass the Savings Through

Cut prices 40-70% to match the cost drop. Trade margin for volume. Lock customers in now, bet they stay when the next shift arrives.

Alibaba has done this repeatedly: Qwen visual model costs cut 85%, Qwen3-Max cut 50% in successive rounds. The goal is market dominance in China's AI infrastructure layer.

Klarna deployed AI customer service handling two-thirds of all chats, equivalent to 700 agents. Cost per transaction dropped 40% over two years, from $0.32 to $0.19. Klarna passed the savings through to lower transaction fees, betting cheaper service would expand their merchant base.

The problem: you train customers to wait. Every cost drop creates expectation of a price cut. And if a reasoning model like DeepSeek R1 generates 10x more tokens for the same task, the savings you passed through cannot be recovered.


Response C: Redesign Tiers Around Cost Data

Keep headline prices stable. Use the cost savings to restructure what each tier includes, who qualifies for each tier, and where the margin lands per customer segment.

Cursor moved from request-based limits to a compute credit pool at $20/month, absorbing cost drops into higher-quality completions rather than lower prices. The product improved each quarter. The price stayed the same. Customers got more value. Margins held.

This is different from Response A. Response A is passive. Response C is active: it uses per-customer cost data to identify which segments can absorb more value, which segments are unprofitable, and where tier boundaries should move.

It is the only response that improves both margins and customer value simultaneously. It is also the only response that requires per-customer, per-model cost attribution.
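In data terms, Response C starts with two questions the cost distribution can answer: who is unprofitable at the current flat price, and where should the base tier's included usage cap sit. A minimal sketch, using a Cursor-style $20/month flat price and entirely hypothetical per-customer costs:

```python
# Sketch: using per-customer cost data to redesign tiers.
# All customer costs below are hypothetical; the $20/month flat
# price mirrors the Cursor-style tier discussed above.

monthly_price = 20.00

# Hypothetical monthly inference cost per customer
customer_costs = {
    "c1": 0.80, "c2": 2.10, "c3": 6.50,
    "c4": 14.00, "c5": 23.40, "c6": 31.00,
}

# Which customers are unprofitable at the flat price?
unprofitable = {c for c, cost in customer_costs.items() if cost > monthly_price}

# A crude base-tier cap: the cost level that keeps roughly 80% of
# customers inside the base tier; heavier users move to a metered tier.
costs = sorted(customer_costs.values())
boundary = costs[min(int(0.8 * len(costs)), len(costs) - 1)]

print(unprofitable)   # candidates for a usage-based tier
print(f"base-tier cap ~ ${boundary:.2f} of compute per month")
```

A real version would work from percentiles over thousands of customers rather than a toy dict, but the shape of the decision is the same: the tier boundary is derived from the cost distribution, not guessed.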


The Quiet Part: Jevons Wins Again

In 1865, William Stanley Jevons observed that more efficient coal engines did not reduce coal consumption. They increased it. Cheaper energy made more applications viable.

The same pattern applies to inference. Every cost drop makes AI viable for tasks that were previously too expensive. Customers who ran 1,000 queries at $0.01 each now run 5,000 queries at $0.002 each. The per-query cost dropped 80%. The total spend dropped 0%.
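The arithmetic in that example is worth making explicit, because it is the whole trap in two lines:

```python
# Jevons in miniature: per-query cost falls 80%, usage expands 5x,
# and total spend is unchanged. Numbers are the article's example.
before_spend = 1_000 * 0.01    # $10.00 at the old per-query cost
after_spend = 5_000 * 0.002    # $10.00 at the new per-query cost

assert abs(before_spend - after_spend) < 1e-9
print(before_spend, after_spend)
```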

This is the structural force that makes Response A fragile and Response B self-defeating.

Response A assumes the margin improvement is permanent. It is not. Cheaper inference attracts heavier usage. The power-user economics problem that Copilot revealed does not disappear with lower costs. It shifts. The most expensive 20% of customers are still the most expensive 20%. Their absolute cost may drop, but their relative cost to the median customer stays the same, or widens.

Response B assumes customers will stay at current usage levels. They will not. Pass a 70% price cut and usage expands to consume the savings. You gave away margin to fund adoption that erases the margin you gave away.

Response C is the only response that accounts for Jevons. It uses per-customer cost data to set tier boundaries that remain profitable as usage patterns shift. It treats a cost drop not as a windfall (Response A) or a competitive weapon (Response B) but as a pricing design opportunity.


What the Data Needs to Show

Every response requires the same prerequisite: knowing current margins at the customer level.

| Response | What you need to know |
| --- | --- |
| A: Pocket savings | How much margin expanded per customer segment, and how long before usage growth closes the gap |
| B: Pass savings | How far you can cut before your most expensive cohort becomes unprofitable |
| C: Redesign tiers | Per-customer cost distribution, segment boundaries, and which tier changes improve both value and margin |

Most teams know their aggregate API spend from a provider dashboard. They do not know per-customer, per-model cost attribution. They cannot distinguish the customer who costs $0.02/query from the one who costs $0.40/query. A blanket pricing decision affects these cohorts differently, and without segmented data, the decision is a guess.
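At its core, attribution is a join of raw usage events against a per-model price sheet. The event fields, model names, and prices below are illustrative assumptions, not any provider's actual schema:

```python
# Sketch: per-customer, per-model cost attribution from usage events.
# Event shapes, model names, and prices are hypothetical.
from collections import defaultdict

# (customer_id, model, input_tokens, output_tokens)
events = [
    ("acme", "budget-model", 900_000, 50_000),
    ("acme", "frontier-model", 10_000, 5_000),
    ("globex", "frontier-model", 2_000_000, 800_000),
]

# $ per 1M tokens: (input, output) — illustrative budget vs. frontier spread
PRICES = {
    "budget-model": (0.05, 0.40),
    "frontier-model": (5.00, 25.00),
}

cost = defaultdict(float)  # (customer, model) -> dollars
for customer, model, tok_in, tok_out in events:
    p_in, p_out = PRICES[model]
    cost[(customer, model)] += (tok_in * p_in + tok_out * p_out) / 1_000_000

for key, dollars in sorted(cost.items()):
    print(key, f"${dollars:.4f}")
```

Even this toy version surfaces the cohort problem: the two customers differ in cost by orders of magnitude, which an aggregate provider dashboard would never show.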

Multi-model routing amplifies this gap. The spread between budget and frontier models is now 100x ($0.05 vs. $5.00 per million input tokens). Routing simple queries to cheap models and reserving frontier models for complex tasks can cut blended costs 70%+. But the routing decision itself depends on per-query cost visibility. Without it, you cannot determine which queries justify $5.00/M and which perform identically at $0.05/M.
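The routing logic itself can be sketched in a few lines. The word-count heuristic, prices, and 80/20 query mix below are assumptions for illustration; production routers use trained classifiers or quality scores, not query length:

```python
# Sketch of cost-aware model routing: simple queries go to the budget
# model, complex ones to the frontier model. Heuristic and mix are
# illustrative assumptions.

BUDGET_PRICE, FRONTIER_PRICE = 0.05, 5.00  # $ per 1M input tokens

def route(query: str) -> str:
    # Stand-in heuristic; a real router would use a classifier
    return "frontier" if len(query.split()) > 20 else "budget"

queries = ["short question"] * 8 + [("word " * 30).strip()] * 2
tokens_per_query = 1_000

blended = 0.0
for q in queries:
    price = FRONTIER_PRICE if route(q) == "frontier" else BUDGET_PRICE
    blended += tokens_per_query * price / 1_000_000

all_frontier = len(queries) * tokens_per_query * FRONTIER_PRICE / 1_000_000
print(f"blended ${blended:.4f} vs all-frontier ${all_frontier:.4f}")
print(f"savings: {1 - blended / all_frontier:.0%}")
```

With an 80/20 simple-to-complex mix, the blended cost lands well below the all-frontier baseline, consistent with the 70%+ figure above; but note that the savings calculation only exists because each query's cost is visible.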


The Current Landscape

The market has not settled. As of April 2026:

| Model | Provider | Input (per 1M tokens) | Output (per 1M tokens) |
| --- | --- | --- | --- |
| GPT-5 Nano | OpenAI | $0.05 | $0.40 |
| Gemini 2.5 Flash-Lite | Google | $0.10 | $0.40 |
| DeepSeek V3.2 | DeepSeek | $0.28 | $0.42 |
| Gemini 2.5 Pro | Google | $1.25 | $10.00 |
| GPT-5.4 | OpenAI | $2.50 | $10.00 |
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 |
| Claude Opus 4.6 | Anthropic | $5.00 | $25.00 |

The 100x spread between cheapest and most expensive creates both risk and opportunity. Ninety-two percent of AI software companies now use mixed pricing models combining subscriptions with usage fees, because pure per-seat pricing produces 40% lower gross margins and 2.3x higher churn than usage-aligned alternatives. Bessemer's data shows AI companies at 50-60% gross margins vs. 80-90% for traditional SaaS.

Another cost drop is coming. It always is. The question is not whether, but which response you choose when it arrives.


The Pricing Opportunity

Cost drops are not a problem to react to. They are a pricing design opportunity. But only for teams that can see where the savings land.

Response A is passive. Response B is destructive. Response C, redesigning tiers around per-customer cost data, is the only path that compounds. Each cost drop becomes a chance to improve both margin structure and customer value, instead of sacrificing one for the other.

The teams that instrument per-customer cost attribution and build real-time margin visibility will have pricing flexibility when the next drop hits. The rest will have pricing constraints.

If you are building on AI and need per-customer, per-model cost visibility before the next reset, Bear Lumen provides the data layer to make the call.
