
Unit Economics for AI Products: A Cost Framework Beyond Tokens

Token math captures 20-40% of what an AI product actually costs. The complete cost-to-serve includes six layers most teams never instrument. Here is the framework.


Bear Lumen Team

Research

#unit-economics #ai-margins #cost-tracking #pricing-strategy

Token pricing tells you almost nothing about what your AI product actually costs.

GitHub Copilot charges $10/month and loses $20/month per user on heavy accounts. Intercom Fin charges $0.99 per resolution, but a hard ticket requiring multiple model calls can cost $0.85 in inference alone, leaving $0.14 of gross margin. Devin bills $20/month plus $2.25 per compute unit, with each coding session generating dozens of internal model calls that never appear on the invoice. The average AI-first B2B company runs 40-50% COGS against revenue, compared to 15-30% for traditional SaaS.

That gap is not a pricing mistake. It is structural. Token math captures one layer. Production AI products have six.

Last Updated: April 2026


Token Math Captures 20-40% of Real Costs

Most AI startups calculate costs as monthly tokens multiplied by price per token. That formula worked when products were single-call wrappers around GPT-3. It stopped working when products became workflows.

A customer support agent that appears to make one LLM call actually executes six operations per interaction. It embeds the user query. It retrieves context from a vector database. It calls a planner model to decide what to do. It executes a tool call to the CRM. It generates a response. It runs a safety check.

Token math captures maybe two of those six. The rest are invisible unless you instrument the full trace.
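The six operations above can be sketched as per-span costs rolled up into one trace total. All dollar figures here are illustrative assumptions, not measured values:

```python
# Sketch: summing per-span costs across one support-agent trace.
# Every span cost below is an illustrative assumption.

def trace_cost(spans: dict[str, float]) -> float:
    """A trace's cost is the sum of every span, not just the LLM calls."""
    return sum(spans.values())

spans = {
    "embed_query":     0.0002,  # embedding API call
    "vector_retrieve": 0.0030,  # vector DB read units
    "planner_call":    0.0020,  # small planner model
    "crm_tool_call":   0.0090,  # external API fee
    "generate_reply":  0.0050,  # main LLM completion
    "safety_check":    0.0020,  # guardrail model
}

# Token math typically sees only the model calls it was told about.
visible = spans["planner_call"] + spans["generate_reply"]
total = trace_cost(spans)

print(f"total ${total:.4f}, token math sees {visible / total:.0%}")
```

With these assumed numbers, token-only accounting captures about a third of the trace, consistent with the 20-40% range above.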

ICONIQ data shows AI company gross margins averaging 41% in 2024, improving to 45% in 2025, and reaching roughly 52% in 2026. Traditional SaaS runs 70-90%. The difference is largely explained by cost components that token math ignores.


The Six Cost Layers Inside a Single Trace

The real unit of accounting is the trace: a complete workflow execution from user request to final response. Every trace contains multiple spans, each with its own cost profile. Langfuse and similar observability platforms model applications this way because it reflects how costs actually accumulate.

Layer 1: LLM Inference (The Visible Part)

This is the cost everyone tracks. As of April 2026, the pricing spread across providers is wide.

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
| --- | --- | --- |
| GPT-4o | $2.50 | $10.00 |
| Claude Sonnet 4 | $3.00 | $15.00 |
| Gemini 2.5 Pro | $1.25 | $10.00 |
| Gemini 2.5 Flash | $0.15 | $0.60 |
| DeepSeek V3 | $0.14 | $0.28 |
| DeepSeek R1 | $0.55 | $2.19 |

These numbers create a false sense of precision. A single agent task on a reasoning model can consume 3-5x more tokens than the visible output suggests, and none of that overhead appears on vendor pricing pages. A coding task advertised at $0.02 can cost $0.12 when the model engages extended thinking. Multi-step agents generate 5-20x more internal tokens than visible output.
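The hidden-multiplier effect can be shown in a few lines. The 6x multiplier here is derived from the $0.02-to-$0.12 example in the text; treat it as an illustration, not a constant:

```python
# Illustrative only: how hidden reasoning/agent tokens inflate an
# advertised per-task cost. The multiplier is an assumption taken
# from the $0.02 -> $0.12 extended-thinking example.

def effective_cost(advertised: float, hidden_multiplier: float) -> float:
    """Advertised cost covers visible tokens; hidden tokens scale it up."""
    return advertised * hidden_multiplier

advertised = 0.02                                  # visible-token price
with_thinking = effective_cost(advertised, 6.0)    # extended thinking engaged

print(f"${advertised:.2f} advertised -> ${with_thinking:.2f} actual")
```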

Layer 2: Retrieval and Vector Infrastructure

RAG is the industry standard for grounding AI products in company-specific data. It is not free.

Pinecone charges $0.33/GB/month for storage plus $16/million read units on Standard. At scale, a production deployment runs $70-300/month for modest workloads and $500-1,500/month at 5M+ vectors. Weaviate Cloud starts at $45/month base plus $0.095 per million vector dimensions. Qdrant self-hosted is free, but a production cluster on AWS typically costs $300-800/month in infrastructure.

Then there are the embedding costs. Every search query must be embedded before it hits the vector database. OpenAI charges $0.10 per million tokens for embeddings. At 10,000 searches per day with average query lengths of 50 tokens, that is $1.50/month. Small, but it compounds across retrieval-heavy products.

The context stuffing problem is worse. Retrieved chunks become input tokens to the LLM. An unoptimized RAG pipeline that fetches 10 chunks of 500 tokens each adds 5,000 input tokens per query. At GPT-4o's $2.50/million input rate, that is $0.0125 per query just for the context window. At 100,000 queries per month, that is $1,250 in context costs that token-only math attributes incorrectly to "LLM spend."
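The context-stuffing arithmetic above is simple enough to verify directly:

```python
# Reproducing the context-stuffing arithmetic from the text:
# retrieved chunks become input tokens billed at the LLM's input rate.

def context_cost(chunks: int, tokens_per_chunk: int,
                 input_price_per_m: float) -> float:
    """Per-query cost of retrieved chunks entering the context window."""
    return chunks * tokens_per_chunk * input_price_per_m / 1_000_000

per_query = context_cost(chunks=10, tokens_per_chunk=500,
                         input_price_per_m=2.50)   # GPT-4o input rate
monthly = per_query * 100_000                      # 100K queries/month

print(f"${per_query:.4f} per query, ${monthly:,.0f} per month")
```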

Layer 3: Tool and API Costs

AI agents call external services. Each call has a price.

| Tool Type | Typical Cost |
| --- | --- |
| Web search (Serper, Tavily) | $0.001-0.01 per query |
| CRM APIs (Salesforce, HubSpot) | Per-call or monthly + overage |
| Database reads | Per-read, varies by provider |
| Code execution (E2B) | $0.005-0.02 per sandbox session |
If an agent calls three tools per trace, that is three line items token math ignores. A research agent that runs web search, reads a database, and calls a summarization endpoint adds $0.02-0.05 per trace in tool costs alone.

Layer 4: Observability and Monitoring

Production AI requires observability. LangSmith charges $39/seat/month plus $2.50 per 1,000 traces on the Plus plan. Langfuse Cloud starts at $29/month for 100K units with $8 per 100K overage. A five-engineer team on LangSmith running 2M traces per month pays roughly $5,195/month. The same volume on Langfuse costs approximately $189/month.

These costs are real COGS. They scale with usage, not headcount. Most teams bury them in "dev tools" instead of attributing them to the product they monitor.

Layer 5: Reliability and Failure Costs

Production AI adds model calls for quality assurance: offline evaluations, online judges, guardrail checks, incident reprocessing. A workflow costing $0.05/trace in development often reaches $0.15/trace in production after adding safety layers.

Failures still consume tokens. Tokens burned before a timeout. Retries from rate limits. Fallback chains that call a second model when the first fails. A 5% retry rate means 5% more spend than happy-path math suggests. A fallback from Claude Sonnet to GPT-4o on timeout adds a second inference cost to every failed request.
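A minimal sketch of that overhead, treating retries and fallbacks as expected values. The rates and costs are assumptions for illustration:

```python
# Sketch of failure-path overhead on top of happy-path inference cost.
# The retry rate, fallback rate, and costs are illustrative assumptions.

def cost_with_failures(base: float, retry_rate: float,
                       fallback_rate: float, fallback_cost: float) -> float:
    """Expected per-trace cost including retries and fallback model calls."""
    retries = base * retry_rate                 # retried calls re-spend tokens
    fallbacks = fallback_cost * fallback_rate   # second model on timeout
    return base + retries + fallbacks

happy_path = 0.05
actual = cost_with_failures(happy_path, retry_rate=0.05,
                            fallback_rate=0.02, fallback_cost=0.06)

print(f"happy path ${happy_path:.4f}, expected ${actual:.4f}")
```

Even modest failure rates push the expected cost a few percent above what happy-path math predicts, before any guardrail or evaluation calls are counted.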

Layer 6: Infrastructure Overhead

Compute for orchestration, queue processing, caching layers, and the application servers themselves. These costs are often fixed monthly, but they are directly attributable to the AI product. A production orchestration layer on AWS ECS or Railway runs $50-500/month depending on scale. Redis caching for prompt deduplication adds $15-100/month.

None of these appear in token-per-dollar calculations.


Per-Feature Margins Are the Real Target

Aggregate cost tracking hides the features that lose money. The only way to find them is to calculate margins at the feature level, not the product level.

| Feature | Avg Trace Cost | Revenue per Use | Margin |
| --- | --- | --- | --- |
| Quick chat | $0.02 | $0.05 | 60% |
| Document analysis | $0.18 | $0.25 | 28% |
| Agent workflow | $1.85 | $2.00 | 7.5% |
| Research task | $4.20 | $3.00 | -40% |

That last row is where margin compression hides. A single negative-margin feature, heavily used by power users, offsets profits from everything else. This is the same dynamic that makes GitHub Copilot lose $20/month per power user and the same margin squeeze reshaping AI operating models.

The research task costs $4.20 because it runs a multi-step agent loop. Each iteration calls the LLM, runs a web search, retrieves context, and evaluates whether the answer is sufficient. Five iterations at $0.84 each. The revenue model charges a flat $3.00. The more thorough the agent, the deeper the loss.
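The research-task row reduces to a one-function check:

```python
# Reproducing the research-task margin: five agent iterations
# against a flat per-use price.

def feature_margin(iterations: int, cost_per_iteration: float,
                   price: float) -> tuple[float, float]:
    """Return (trace cost, margin fraction) for a flat-priced feature."""
    cost = iterations * cost_per_iteration
    return cost, (price - cost) / price

cost, margin = feature_margin(iterations=5, cost_per_iteration=0.84,
                              price=3.00)

print(f"cost ${cost:.2f}, margin {margin:.0%}")
```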

Invisible at the aggregate level. Obvious at the feature level.


The Cheaper-Inference Trap

Here is the quiet part. Model costs dropped roughly 90% in 18 months. Teams celebrated. Then total bills stayed flat, or rose.

Cheaper inference triggers a predictable sequence. Lower per-call cost makes new features viable. Teams ship those features. Users adopt them. Usage expands. More agent loops, more retrieval, more guardrails. The per-token price fell. The number of tokens per customer rose faster.

This is the same pattern that hit Cursor. Better completions drove more usage. More usage forced four pricing changes in 18 months. The product crossed $1 billion ARR in under two years. The pricing took 18 months to catch up.

Tracking tokens alone misses this dynamic entirely. The token line item improves quarter over quarter. The total cost-to-serve does not. The gap between those two numbers is filled by the five layers above that never made it onto the dashboard.


Attribution Makes Costs Actionable

Tracking all six layers is necessary but not sufficient. Without attribution, finance cannot tie AI spend to business units. Product cannot identify which features lose money. Sales cannot set prices that reflect cost-to-serve.

The fix is trace-native attribution. Tag every trace with customer ID, feature name, model version, and environment. This transforms a single $15,000 monthly AI bill into a per-customer, per-feature cost map.

Consider 150 customers paying $99/month. Total AI infrastructure costs $15,000/month. The naive per-customer cost is $100, which means the business is losing $1 per customer. But with attribution, 20 power users generate 60% of the cost ($9,000), while 130 standard users generate 40% ($6,000). The power users cost $450/month each. The standard users cost $46/month each.

Standard users are profitable at $53/month margin. Power users lose $351/month each. Without per-customer attribution: a break-even business. With it: 130 profitable customers subsidizing 20 unprofitable ones. Two completely different pricing decisions follow from the same $15,000 bill.
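The same split can be computed from the numbers in the text:

```python
# Reproducing the attribution example: the same $15,000 bill read
# per-segment instead of as a naive average.

def per_segment_costs(total_cost: float,
                      shares: dict[str, tuple[float, int]]) -> dict[str, float]:
    """shares maps segment -> (fraction of total cost, customer count)."""
    return {seg: total_cost * frac / n
            for seg, (frac, n) in shares.items()}

price = 99.0
costs = per_segment_costs(15_000, {
    "power":    (0.60, 20),    # 20 power users drive 60% of spend
    "standard": (0.40, 130),   # 130 standard users drive 40%
})

for seg, cost in costs.items():
    print(f"{seg}: ${cost:.0f}/month cost, ${price - cost:.0f}/month margin")
```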


The Cost Framework

The complete cost-per-trace formula:

Cost per Feature Invocation = LLM inference + retrieval costs + tool/API costs + observability share + reliability overhead + infrastructure share

Each component requires different instrumentation. LLM costs come from provider invoices and token counting. Retrieval costs come from vector database billing and embedding call logs. Tool costs come from third-party API dashboards. Observability and infrastructure costs require allocation models that divide fixed monthly costs across trace volume.

| Cost Layer | Source | Update Frequency |
| --- | --- | --- |
| LLM inference | Provider API logs | Per-trace (real-time) |
| Retrieval/RAG | Vector DB billing + embedding logs | Daily aggregation |
| Tool/API calls | Third-party dashboards | Daily aggregation |
| Observability | Platform billing | Monthly allocation |
| Reliability overhead | Retry/fallback logs | Weekly aggregation |
| Infrastructure | Cloud billing | Monthly allocation |
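The full formula can be sketched as a simple data structure. The six layer names come from the framework above; the per-trace values are assumptions chosen so that inference plus retrieval carries most of the cost:

```python
# A minimal sketch of the six-layer cost-per-trace formula.
# Layer names follow the framework; the numbers are assumptions.

from dataclasses import dataclass

@dataclass
class TraceCost:
    llm_inference: float    # provider token costs (real-time)
    retrieval: float        # vector DB reads + embeddings
    tools: float            # third-party API calls
    observability: float    # platform fees / monthly trace volume
    reliability: float      # retries, fallbacks, eval calls
    infrastructure: float   # orchestration, cache, compute share

    def total(self) -> float:
        return (self.llm_inference + self.retrieval + self.tools
                + self.observability + self.reliability
                + self.infrastructure)

trace = TraceCost(llm_inference=0.040, retrieval=0.014, tools=0.012,
                  observability=0.003, reliability=0.009,
                  infrastructure=0.004)

print(f"cost per invocation: ${trace.total():.3f}")
```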

Early-stage teams do not need all six layers on day one. Start with LLM inference and retrieval, which typically account for 60-70% of total trace cost. Add tool costs when agent features ship. Add observability and infrastructure allocation when monthly AI spend exceeds $5,000.


Tracking tokens alone is like tracking server costs and ignoring bandwidth, storage, and support. It is one line item out of six. The other five compound silently until the margin report arrives.

The teams that instrument trace-level costs across all six layers have the data to price accurately, identify unprofitable customers, and optimize the right cost drivers. The teams that track only tokens have a monthly AI bill, an average cost per customer, and no way to tell which number is wrong.

Full-stack cost visibility is not a reporting upgrade. It is what makes pricing decisions possible.

If you are building AI products with variable cost structures, Bear Lumen gives you per-customer, per-feature cost visibility across all six layers. See how multi-model routing cuts inference spend.
