Token pricing tells you almost nothing about what your AI product actually costs.
GitHub Copilot charges $10/month and loses $20/month per user on heavy accounts. Intercom Fin charges $0.99 per resolution, but a hard ticket requiring multiple model calls can cost $0.85 in inference alone, leaving $0.14 of gross margin. Devin bills $20/month plus $2.25 per compute unit, with each coding session generating dozens of internal model calls that never appear on the invoice. The average AI-first B2B company runs 40-50% COGS against revenue, compared to 15-30% for traditional SaaS.
That gap is not a pricing mistake. It is structural. Token math captures one layer. Production AI products have six.
Last Updated: April 2026
Token Math Captures 20-40% of Real Costs
Most AI startups calculate costs as monthly tokens multiplied by price per token. That formula worked when products were single-call wrappers around GPT-3. It stopped working when products became workflows.
A customer support agent that appears to make one LLM call actually executes six operations per interaction. It embeds the user query. It retrieves context from a vector database. It calls a planner model to decide what to do. It executes a tool call to the CRM. It generates a response. It runs a safety check.
Token math captures maybe two of those six. The rest are invisible unless you instrument the full trace.
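The six spans above can be sketched as a per-trace cost map. Every figure below is a hypothetical estimate for illustration, not a measured value, and this view still excludes the amortized observability, retry, and infrastructure layers covered later:

```python
# Illustrative per-span costs ($ per trace) for one support-agent interaction.
# All numbers are hypothetical estimates, not vendor-measured values.
SPANS = {
    "embed_query":    0.000005,  # embedding API call
    "vector_search":  0.000160,  # vector DB read units
    "planner_call":   0.004000,  # small planner model
    "crm_tool_call":  0.002000,  # external API fee
    "generate_reply": 0.012000,  # main LLM call
    "safety_check":   0.000800,  # guardrail model
}

# Token-only accounting sees the LLM calls and nothing else.
visible = {"planner_call", "generate_reply"}

total = sum(SPANS.values())
tracked = sum(SPANS[s] for s in visible)
print(f"full trace:      ${total:.6f}")
print(f"token-math view: ${tracked:.6f}")
```

Instrumenting at the span level, rather than summing provider invoices, is what makes the gap between the two numbers visible.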
ICONIQ data shows AI company gross margins averaging 41% in 2024, improving to 45% in 2025, and reaching roughly 52% in 2026. Traditional SaaS runs 70-90%. The difference is largely explained by cost components that token math ignores.
The Six Cost Layers Inside a Single Trace
The real unit of accounting is the trace: a complete workflow execution from user request to final response. Every trace contains multiple spans, each with its own cost profile. Langfuse and similar observability platforms model applications this way because it reflects how costs actually accumulate.
Layer 1: LLM Inference (The Visible Part)
This is the cost everyone tracks. As of April 2026, the pricing spread across providers is wide.
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| Claude Sonnet 4 | $3.00 | $15.00 |
| Gemini 2.5 Pro | $1.25 | $10.00 |
| Gemini 2.5 Flash | $0.15 | $0.60 |
| DeepSeek V3 | $0.14 | $0.28 |
| DeepSeek R1 | $0.55 | $2.19 |
These numbers create a false sense of precision. A single agent task on a reasoning model can consume 3-5x the tokens that vendor pricing pages imply. A coding task advertised at $0.02 can cost $0.12 when the model engages extended thinking. Multi-step agents generate 5-20x more internal tokens than visible output.
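The hidden-multiplier effect is easy to model. The sketch below uses the DeepSeek R1 list prices from the table; the task sizes and the 5x reasoning-token multiplier are illustrative assumptions:

```python
# Effective cost of a task once hidden reasoning/agent-loop tokens are billed.
# Prices are $ per 1M tokens; the multiplier is an illustrative assumption.
def effective_cost(input_tokens: int, output_tokens: int,
                   in_price: float, out_price: float,
                   hidden_output_multiplier: float = 1.0) -> float:
    billed_output = output_tokens * hidden_output_multiplier
    return (input_tokens * in_price + billed_output * out_price) / 1_000_000

# A coding task: 2K input tokens, 1K visible output, DeepSeek R1 rates.
naive = effective_cost(2_000, 1_000, 0.55, 2.19)
with_thinking = effective_cost(2_000, 1_000, 0.55, 2.19,
                               hidden_output_multiplier=5)
print(f"naive estimate: ${naive:.5f}")
print(f"with reasoning: ${with_thinking:.5f}")
```

The same formula applied to a visible-output estimate and the actual billed tokens is how the $0.02-versus-$0.12 surprise shows up in practice.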
Layer 2: Retrieval and Vector Infrastructure
RAG is the industry standard for grounding AI products in company-specific data. It is not free.
Pinecone charges $0.33/GB/month for storage plus $16/million read units on Standard. In production, a deployment runs $70-300/month for modest workloads and $500-1,500/month at 5M+ vectors. Weaviate Cloud starts at $45/month base plus $0.095 per million vector dimensions. Qdrant self-hosted is free, but a production cluster on AWS typically costs $300-800/month in infrastructure.
Then there are the embedding costs. Every search query must be embedded before it hits the vector database. OpenAI charges $0.10 per million tokens for embeddings. At 10,000 searches per day with average query lengths of 50 tokens, that is $1.50/month. Small, but it compounds across retrieval-heavy products.
The context stuffing problem is worse. Retrieved chunks become input tokens to the LLM. An unoptimized RAG pipeline that fetches 10 chunks of 500 tokens each adds 5,000 input tokens per query. At GPT-4o's $2.50/million input rate, that is $0.0125 per query just for the context window. At 100,000 queries per month, that is $1,250 in context costs that token-only math attributes incorrectly to "LLM spend."
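The context-stuffing arithmetic above is worth verifying directly:

```python
# Reproduce the context-stuffing math: retrieved chunks become LLM input
# tokens, billed at GPT-4o's $2.50 per 1M input rate.
CHUNKS_PER_QUERY = 10
TOKENS_PER_CHUNK = 500
INPUT_PRICE_PER_M = 2.50       # $ per 1M input tokens
QUERIES_PER_MONTH = 100_000

context_tokens = CHUNKS_PER_QUERY * TOKENS_PER_CHUNK   # 5,000 per query
cost_per_query = context_tokens * INPUT_PRICE_PER_M / 1_000_000
monthly = cost_per_query * QUERIES_PER_MONTH
print(f"${cost_per_query:.4f} per query, ${monthly:,.0f}/month in context cost")
```

Attributing that $1,250 to "retrieval" rather than "LLM spend" is what makes the RAG pipeline, not the model, the optimization target.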
Layer 3: Tool and API Costs
AI agents call external services. Each call has a price.
| Tool Type | Typical Cost |
|---|---|
| Web search (Serper, Tavily) | $0.001-0.01 per query |
| CRM APIs (Salesforce, HubSpot) | Per-call or monthly + overage |
| Database reads | Per-read, varies by provider |
| Code execution (E2B) | $0.005-0.02 per sandbox session |
If an agent calls three tools per trace, that is three line items token math ignores. A research agent that runs web search, reads a database, and calls a summarization endpoint adds $0.02-0.05 per trace in tool costs alone.
Layer 4: Observability and Monitoring
Production AI requires observability. LangSmith charges $39/seat/month plus $2.50 per 1,000 traces on the Plus plan. Langfuse Cloud starts at $29/month for 100K units with $8 per 100K overage. A five-engineer team on LangSmith running 2M traces per month pays roughly $5,195/month. The same volume on Langfuse costs approximately $189/month.
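The platform comparison above follows from the list prices. The sketch below simplifies both bills (it ignores included trace allotments, so treat it as an approximation of the cited figures):

```python
# Rough monthly observability bill for a 5-engineer team at 2M traces/month,
# using the list prices cited above. Simplified: included allotments ignored.
seats, traces = 5, 2_000_000

langsmith = seats * 39 + (traces / 1_000) * 2.50   # $39/seat + $2.50 per 1K traces
langfuse = 29 + (traces / 100_000) * 8             # $29 base + $8 per 100K units
print(f"LangSmith: ${langsmith:,.0f}/month  Langfuse: ${langfuse:,.0f}/month")
```

A 27x spread on an identical workload is why this line item belongs in COGS review, not a dev-tools bucket.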
These costs are real COGS. They scale with usage, not headcount. Most teams bury them in "dev tools" instead of attributing them to the product they monitor.
Layer 5: Reliability and Failure Costs
Production AI adds model calls for quality assurance: offline evaluations, online judges, guardrail checks, incident reprocessing. A workflow costing $0.05/trace in development often reaches $0.15/trace in production after adding safety layers.
Failures still consume tokens. Tokens burned before a timeout. Retries from rate limits. Fallback chains that call a second model when the first fails. A 5% retry rate means 5% more spend than happy-path math suggests. A fallback from Claude Sonnet to GPT-4o on timeout adds a second inference cost to every failed request.
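Retry and fallback overhead can be folded into an expected-cost formula. The rates and the fallback price below are illustrative assumptions:

```python
# Expected per-trace cost once retries and fallbacks are priced in.
# Retry rate, fallback rate, and fallback cost are illustrative assumptions.
def expected_cost(base: float, retry_rate: float,
                  fallback_rate: float, fallback_cost: float) -> float:
    retries = base * retry_rate                # tokens burned before timeout/429
    fallbacks = fallback_cost * fallback_rate  # second model on failure
    return base + retries + fallbacks

happy_path = 0.05
real = expected_cost(happy_path, retry_rate=0.05,
                     fallback_rate=0.02, fallback_cost=0.08)
print(f"happy-path: ${happy_path:.4f}  expected: ${real:.4f}")
```

The gap looks small per trace; multiplied across millions of traces, it is the difference between the modeled margin and the reported one.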
Layer 6: Infrastructure Overhead
Compute for orchestration, queue processing, caching layers, and the application servers themselves. These costs are often fixed monthly, but they are directly attributable to the AI product. A production orchestration layer on AWS ECS or Railway runs $50-500/month depending on scale. Redis caching for prompt deduplication adds $15-100/month.
None of these appear in token-per-dollar calculations.
Per-Feature Margins Are the Real Target
Aggregate cost tracking hides the features that lose money. The only way to find them is to calculate margins at the feature level, not the product level.
| Feature | Avg Trace Cost | Revenue per Use | Margin |
|---|---|---|---|
| Quick chat | $0.02 | $0.05 | 60% |
| Document analysis | $0.18 | $0.25 | 28% |
| Agent workflow | $1.85 | $2.00 | 7.5% |
| Research task | $4.20 | $3.00 | -40% |
That last row is where margin compression hides. A single negative-margin feature, heavily used by power users, offsets profits from everything else. This is the same dynamic that makes GitHub Copilot lose $20/month on power users, and the same margin squeeze now reshaping AI operating models.
The research task costs $4.20 because it runs a multi-step agent loop. Each iteration calls the LLM, runs a web search, retrieves context, and evaluates whether the answer is sufficient. Five iterations at $0.84 each. The revenue model charges a flat $3.00. The more thorough the agent, the deeper the loss.
Invisible at the aggregate level. Obvious at the feature level.
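The margin column in the table above is just `(revenue - cost) / revenue` applied per feature:

```python
# Recompute the per-feature margin table: (revenue - cost) / revenue.
features = {
    "Quick chat":        (0.02, 0.05),   # (avg trace cost, revenue per use)
    "Document analysis": (0.18, 0.25),
    "Agent workflow":    (1.85, 2.00),
    "Research task":     (4.20, 3.00),
}
margins = {name: (revenue - cost) / revenue
           for name, (cost, revenue) in features.items()}
for name, m in margins.items():
    print(f"{name:<18} {m:+.1%}")
```

Running this over real trace data, grouped by feature tag, is the whole exercise: the aggregate number hides exactly the row that matters.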
The Cheaper-Inference Trap
Here is the quiet part. Model costs dropped roughly 90% in 18 months. Teams celebrated. Then total bills stayed flat, or rose.
Cheaper inference triggers a predictable sequence. Lower per-call cost makes new features viable. Teams ship those features. Users adopt them. Usage expands. More agent loops, more retrieval, more guardrails. The per-token price fell. The number of tokens per customer rose faster.
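The sequence reduces to simple arithmetic: price per token falls, tokens per customer rise faster. The figures below are illustrative, chosen only to show the shape of the effect:

```python
# Cheaper-inference trap in two lines: per-token price drops 90%, but usage
# per customer grows 12x as new features ship. All numbers are illustrative.
old_price, new_price = 10.00, 1.00    # $ per 1M tokens
old_volume, new_volume = 5, 60        # 1M-token units per customer per month

old_bill = old_price * old_volume
new_bill = new_price * new_volume
print(f"before: ${old_bill:.0f}/customer  after: ${new_bill:.0f}/customer")
```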
This is the same pattern that hit Cursor. Better completions drove more usage. More usage forced four pricing changes in 18 months. The product crossed $1 billion ARR in under two years. The pricing took 18 months to catch up.
Tracking tokens alone misses this dynamic entirely. The token line item improves quarter over quarter. The total cost-to-serve does not. The gap between those two numbers is filled by the five layers above that never made it onto the dashboard.
Attribution Makes Costs Actionable
Tracking all six layers is necessary but not sufficient. Without attribution, finance cannot tie AI spend to business units. Product cannot identify which features lose money. Sales cannot set prices that reflect cost-to-serve.
The fix is trace-native attribution. Tag every trace with customer ID, feature name, model version, and environment. This transforms a single $15,000 monthly AI bill into a per-customer, per-feature cost map.
Consider 150 customers paying $99/month. Total AI infrastructure costs $15,000/month. The naive per-customer cost is $100, which means the business is losing $1 per customer. But with attribution, 20 power users generate 60% of the cost ($9,000), while 130 standard users generate 40% ($6,000). The power users cost $450/month each. The standard users cost $46/month each.
Standard users are profitable at $53/month margin. Power users lose $351/month each. Without per-customer attribution: a break-even business. With it: 130 profitable customers subsidizing 20 unprofitable ones. Two completely different pricing decisions follow from the same $15,000 bill.
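The attribution arithmetic from the example above, written out:

```python
# Per-customer cost attribution for the 150-customer example above.
total_cost, price = 15_000, 99
power_users, standard_users = 20, 130

naive_per_customer = total_cost / (power_users + standard_users)

power_cost = 0.60 * total_cost / power_users        # 60% of spend, 20 users
standard_cost = 0.40 * total_cost / standard_users  # 40% of spend, 130 users
print(f"naive: ${naive_per_customer:.0f}  "
      f"power: ${power_cost:.0f}  standard: ${standard_cost:.0f}")
print(f"standard margin: ${price - standard_cost:.0f}/month  "
      f"power loss: ${power_cost - price:.0f}/month")
```

The split comes from trace tags, not from the invoice; without per-customer tagging, the 60/40 breakdown is unknowable.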
The Cost Framework
The complete cost-per-trace formula:
Cost per Feature Invocation = LLM inference + retrieval costs + tool/API costs + observability share + reliability overhead + infrastructure share
Each component requires different instrumentation. LLM costs come from provider invoices and token counting. Retrieval costs come from vector database billing and embedding call logs. Tool costs come from third-party API dashboards. Observability and infrastructure costs require allocation models that divide fixed monthly costs across trace volume.
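One way to carry the formula in code is a per-trace record where the fixed monthly layers arrive pre-divided by trace volume. The allocation approach (monthly bill divided evenly across traces) and all the sample figures are assumptions for illustration:

```python
# The cost-per-trace formula as a record. Fixed monthly layers (observability,
# infrastructure) are allocated as monthly bill / monthly trace volume --
# an even-allocation assumption; sample figures are illustrative.
from dataclasses import dataclass

@dataclass
class TraceCost:
    llm: float             # per-trace, from provider token logs
    retrieval: float       # per-trace, vector DB + embedding calls
    tools: float           # per-trace, third-party API fees
    observability: float   # monthly platform bill / monthly traces
    reliability: float     # retry + fallback overhead per trace
    infrastructure: float  # monthly infra bill / monthly traces

    def total(self) -> float:
        return (self.llm + self.retrieval + self.tools +
                self.observability + self.reliability + self.infrastructure)

trace = TraceCost(llm=0.012, retrieval=0.003, tools=0.02,
                  observability=189 / 2_000_000, reliability=0.004,
                  infrastructure=400 / 2_000_000)
print(f"${trace.total():.4f} per trace")
```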
| Cost Layer | Source | Update Frequency |
|---|---|---|
| LLM inference | Provider API logs | Per-trace (real-time) |
| Retrieval/RAG | Vector DB billing + embedding logs | Daily aggregation |
| Tool/API calls | Third-party dashboards | Daily aggregation |
| Observability | Platform billing | Monthly allocation |
| Reliability overhead | Retry/fallback logs | Weekly aggregation |
| Infrastructure | Cloud billing | Monthly allocation |
Early-stage teams do not need all six layers on day one. Start with LLM inference and retrieval, which typically account for 60-70% of total trace cost. Add tool costs when agent features ship. Add observability and infrastructure allocation when monthly AI spend exceeds $5,000.
Tracking tokens alone is like tracking server costs and ignoring bandwidth, storage, and support. It is one line item out of six. The other five compound silently until the margin report arrives.
The teams that instrument trace-level costs across all six layers have the data to price accurately, identify unprofitable customers, and optimize the right cost drivers. The teams that track only tokens have a monthly AI bill, an average cost per customer, and no way to tell which number is wrong.
Full-stack cost visibility is not a reporting upgrade. It is what makes pricing decisions possible.
If you are building AI products with variable cost structures, Bear Lumen gives you per-customer, per-feature cost visibility across all six layers. See how multi-model routing cuts inference spend.