
In early 2024, Klarna reported that its AI assistant had handled two-thirds of all customer service chats in its first month, doing the work of 700 full-time agents, and projected a $40 million profit improvement for the year. By mid-2025, Klarna was rehiring humans. The AI's cost-per-interaction had dropped, but its error rate on complex cases drove escalation costs that ate into the savings. The full cost of an AI agent isn't what you pay per token. It's everything that happens after the token is generated.

This guide breaks down what AI agents actually cost in production, where the money goes, and how to cut spend without cutting capability.

The Sticker Price Is Misleading

Every provider publishes a price-per-million-tokens table. Here's what the major APIs charge as of early 2026:

Frontier models: OpenAI's GPT-4o charges $2.50 per million input tokens and $10.00 per million output tokens. Anthropic's Claude 3.5 Sonnet sits at $3.00/$15.00. Google's Gemini 1.5 Pro comes in at $1.25/$5.00.

Mid-tier models: GPT-4o-mini runs $0.15/$0.60. Claude 3.5 Haiku charges $0.80/$4.00. Gemini 1.5 Flash is $0.075/$0.30.

Budget tier: DeepSeek V3 charges $0.27/$1.10. DeepSeek R1 sits at $0.55/$2.19 for reasoning-level performance at a fraction of OpenAI's o1 pricing ($15.00/$60.00).

These numbers look manageable until you multiply them by what agents actually consume. A single agent task, say researching a topic and drafting a summary, might involve 5-15 LLM calls with tool use, context assembly, and verification steps. A customer service interaction averages 3,000-8,000 tokens. A coding agent session can burn through 50,000-200,000 tokens.

A three-agent orchestration system processing 1,000 tasks per day on GPT-4o runs roughly $75-250 per day in raw API costs. Switch to GPT-4o-mini for the routing and simple tasks, keep GPT-4o for complex reasoning, and that drops to $30-80. Use DeepSeek V3 for everything non-critical and you're looking at $8-25. The model choice alone creates a 10x cost spread for the same workload.
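The spread is easy to verify from the price table above. A minimal sketch (the per-task profile of roughly 20,000 input and 3,000 output tokens is an illustrative assumption, not a benchmark):

```python
# Rough daily cost for a multi-agent pipeline, using the per-million-token
# prices quoted above. Token counts per task are illustrative assumptions.
PRICES = {  # (input, output) in USD per 1M tokens
    "gpt-4o":      (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "deepseek-v3": (0.27, 1.10),
}

def daily_cost(model: str, tasks_per_day: int,
               in_tokens: int, out_tokens: int) -> float:
    """Raw API cost per day for a given per-task token profile."""
    p_in, p_out = PRICES[model]
    return tasks_per_day * (in_tokens * p_in + out_tokens * p_out) / 1_000_000

# 1,000 tasks/day at ~20k input + 3k output tokens per task
for model in PRICES:
    print(model, round(daily_cost(model, 1000, 20_000, 3_000), 2))
```

Swapping the model name is the whole experiment: the same workload lands at the top or bottom of the 10x spread depending on that one argument.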

Where the Real Money Goes

Raw inference is typically 30-50% of total cost. The rest is infrastructure nobody budgets for.

Orchestration overhead adds up fast. Every agent framework, whether LangGraph, AutoGen, or CrewAI, adds coordination tokens. System prompts get repeated across calls. Context windows get packed with conversation history, tool results, and intermediate reasoning. Research from Berkeley's Gorilla project found that tool-use agents consume 3-5x more tokens than equivalent single-turn completions. The coordination tax in multi-agent systems means that adding a second agent doesn't double your costs. It typically triples them.

Retries and error handling are the silent budget killer. Agents fail. They hallucinate tool parameters, misparse API responses, and lose track of multi-step plans. Production systems retry failed calls, sometimes with escalation to a more capable (and expensive) model. With a 15% failure rate and automatic retry, each successful task costs roughly 1.18x the tokens of a clean run (1/(1 - 0.15)), before counting the premium when retries escalate to a pricier model. The testing and debugging guide covers how to diagnose these failures before they compound costs.
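Under the standard assumption of independent failures, the retry multiplier is a one-line geometric series:

```python
def retry_multiplier(failure_rate: float) -> float:
    """Expected attempts per successful task when each attempt fails
    independently with probability `failure_rate` (geometric series)."""
    return 1.0 / (1.0 - failure_rate)

print(round(retry_multiplier(0.15), 2))  # ~1.18x token spend per success
```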

Monitoring and observability are non-negotiable but rarely budgeted. Tools like LangSmith, Helicone, and Braintrust add $50-500/month depending on volume. Without them, you're flying blind on cost attribution; with them, your monitoring spend can approach 5-15% of your inference budget. The observability guide covers what to actually track.

Human review loops remain the most expensive layer. For high-stakes applications (medical, legal, financial), human-in-the-loop review adds $0.50-5.00 per interaction. This often exceeds the entire AI inference cost. But skipping it in domains where errors carry real consequences is a false economy.

The Optimization Playbook

Four techniques consistently cut agent costs by 40-90% without meaningful accuracy loss.

Prompt caching is the highest-ROI optimization for most agent systems. Anthropic's prompt caching reduces costs by up to 90% on cached tokens. OpenAI's equivalent cuts cached input costs by 50%. Google offers 75% reduction on cached context. The catch: caching works best when your system prompt and context are stable across calls. Agent systems that dynamically assemble context on every turn benefit less. For agent architectures with a consistent system prompt plus tool definitions (which describes most production setups), caching alone can cut your bill in half.
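The blended input price under caching is simple arithmetic. A sketch (the 80% hit rate is an assumed workload figure, and this ignores the cache-write premiums some providers charge):

```python
def effective_input_price(base: float, cached_frac: float,
                          discount: float) -> float:
    """Blended $/1M input tokens when `cached_frac` of tokens hit the
    cache at a `discount` off the base price. Ignores cache-write
    premiums, which some providers charge on the first write."""
    return base * ((1 - cached_frac) + cached_frac * (1 - discount))

# Claude 3.5 Sonnet input ($3.00/M), 80% cache hit rate, 90% discount
print(round(effective_input_price(3.00, 0.8, 0.9), 2))  # 0.84
```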

Model routing sends each task to the cheapest model that can handle it. The pattern is straightforward: classify incoming requests by complexity, route simple tasks to a small model, and only invoke the expensive model for genuinely hard problems. Martian's research showed that intelligent routing achieves GPT-4-level accuracy at GPT-3.5-level cost, roughly a 10x reduction on the routed tasks. OpenAI's own model distillation pipeline, where a frontier model generates training data for a smaller model, achieves 90-95% of the teacher's accuracy at 5-10% of the cost. In practice, 60-80% of agent tasks are simple enough for a mid-tier model.
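A minimal sketch of the pattern, assuming a hypothetical classifier that scores request complexity on a 0-1 scale and illustrative per-task costs:

```python
# Hypothetical two-tier router: a cheap classifier scores each request
# for complexity; only hard requests reach the frontier model.
def route(complexity_score: float, threshold: float = 0.7) -> str:
    return "frontier" if complexity_score >= threshold else "mid-tier"

def blended_cost_per_task(simple_share: float,
                          cheap_cost: float,
                          frontier_cost: float) -> float:
    """Average per-task cost when `simple_share` of traffic stays cheap."""
    return simple_share * cheap_cost + (1 - simple_share) * frontier_cost

# 70% of tasks on a ~10x cheaper model vs. $0.10/task all-frontier
print(blended_cost_per_task(0.7, 0.01, 0.10))
```

The threshold is the tuning knob: raise it and savings grow but hard tasks leak to the cheap model; lower it and quality holds but the blended cost creeps back toward the frontier price.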

Prompt compression reduces token count without reducing information. Microsoft's LLMLingua compresses prompts by 2-20x with minimal quality loss. For agent systems with large tool descriptions or long conversation histories, compression can cut input costs dramatically. A simpler version: just trim your system prompts. Most production system prompts contain instructions the model already follows by default. Cut the obvious, measure the impact, and add back only what actually changes behavior.
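LLMLingua-style compression needs a model in the loop, but the cheapest variant, a sliding window over conversation history, is a few lines. A sketch using character counts as a crude stand-in for real token counting:

```python
def trim_history(messages: list[str], max_chars: int = 4000) -> list[str]:
    """Keep the most recent messages that fit a rough character budget,
    dropping the oldest first (characters stand in for tokens here)."""
    kept: list[str] = []
    used = 0
    for m in reversed(messages):          # walk newest to oldest
        if used + len(m) > max_chars:
            break                         # budget exhausted, drop the rest
        kept.append(m)
        used += len(m)
    return list(reversed(kept))           # restore chronological order
```

Production versions use a real tokenizer and often summarize the dropped prefix instead of discarding it, but the budget-enforcement shape is the same.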

Batching collects multiple requests and processes them together. OpenAI's Batch API offers a 50% discount for async workloads with 24-hour completion windows. Anthropic offers similar batch pricing. If your agent tasks aren't latency-sensitive (overnight data processing, bulk classification, content generation queues), batching is free money.
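A sketch of assembling a Batch API input file, one JSON request per line; the request shape follows OpenAI's batch format, while the model name and prompts are placeholders:

```python
import json

def batch_line(custom_id: str, prompt: str, model: str = "gpt-4o-mini") -> str:
    """One JSONL line in the OpenAI Batch API input format."""
    return json.dumps({
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    })

tasks = ["Classify ticket #1", "Classify ticket #2"]
jsonl = "\n".join(batch_line(f"task-{i}", t) for i, t in enumerate(tasks))
print(jsonl)
```

The file is uploaded with purpose `batch`, a batch job is created against it, and results come back within the completion window at the discounted rate.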

When Self-Hosting Beats APIs

The break-even calculation for self-hosting depends on volume, latency requirements, and model size.

At low volume (under 100,000 tokens per day), APIs win every time. The infrastructure cost of running even a small model exceeds API prices. A single A100 GPU costs $1.50-2.50/hour on cloud, or roughly $1,100-1,800/month. That buys a lot of API calls.

The crossover point depends on your blended API price. A self-hosted Llama 3.1 70B on two A100s costs roughly $2,200/month regardless of volume; against a blended frontier-API price of $5-10 per million tokens, that infrastructure pays for itself somewhere around 7-15 million tokens per day. Against cheaper mid-tier API pricing, the break-even volume is far higher.

At high volume (10M+ tokens/day), self-hosting is dramatically cheaper. Together AI's pricing study showed 5-10x cost savings at scale compared to frontier API pricing. But self-hosting adds operational complexity: model updates, GPU maintenance, scaling, and the engineering time to keep everything running. DeepSeek's open-weight models have changed this calculus significantly. R1 delivers reasoning performance competitive with o1 at self-hosted costs, making the "build vs buy" decision genuinely harder than it was a year ago. A deeper analysis of how models like DeepSeek changed the cost picture is in the DeepSeek economics breakdown.
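The break-even arithmetic is a one-liner; plug in your own infrastructure cost and blended API price (the figures below are assumptions for illustration):

```python
def breakeven_tokens_per_day(infra_monthly_usd: float,
                             blended_api_price_per_m: float) -> float:
    """Daily token volume at which fixed self-hosting cost equals
    API spend, assuming a 30-day month."""
    return infra_monthly_usd / 30 / blended_api_price_per_m * 1_000_000

# Two A100s (~$2,200/month) vs. a blended frontier price of $5/M tokens
print(round(breakeven_tokens_per_day(2200, 5.00) / 1e6, 1), "M tokens/day")
```

Note how sensitive the answer is to the blended price: halve the API price and the break-even volume doubles, which is exactly why cheap open-weight API providers keep pushing the self-hosting threshold upward.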

The Hidden Multipliers

Three cost factors catch teams off guard after launch.

Token inflation over time. As agents handle more complex tasks, average token consumption grows. Teams that budgeted based on initial simple use cases find costs 3-5x higher within six months. The budget-aware approach to agent reasoning exists specifically because unconstrained reasoning agents will use as many tokens as you let them.

The Jevons paradox. As AI gets cheaper, teams use more of it. This is well documented in economics, and it applies directly to LLM spend. Stanford's 2025 AI Index found that the cost of achieving GPT-3.5-level performance dropped over 280-fold between November 2022 and October 2024, yet total enterprise AI spending continued rising. Cheaper per-token doesn't mean cheaper total spend when you 10x your usage.

Evaluation costs. Testing your agents costs tokens too. Running eval suites, A/B tests, and regression checks against new model versions adds 10-30% to your inference budget. This is money well spent (shipping a degraded agent costs more in the long run), but it needs to be in the budget from day one. The model evaluation guide covers how to build eval suites that don't waste tokens testing the wrong things.

A Cost Calculator Framework

Here's how to estimate your actual monthly cost before you build.

Step 1: Map your agent's call pattern. For each task type, count the average number of LLM calls, the average input and output tokens per call, and the tool-use overhead. A customer service agent might average 4 calls at 2,000 input + 500 output tokens each. A coding agent might average 12 calls at 8,000 input + 2,000 output tokens each.

Step 2: Multiply by volume. If you handle 500 customer service conversations per day, that's 2,000 LLM calls generating roughly 5 million tokens daily.

Step 3: Apply your model pricing. At GPT-4o rates ($2.50/$10.00 per million), that's 4 million input tokens ($10.00) and 1 million output tokens ($10.00) per day, roughly $20/day or $600/month in raw inference.

Step 4: Add the multipliers. Retries (+15%), orchestration overhead (+20%), monitoring ($200/month), and human review if applicable. That $600 becomes roughly $1,030/month all-in.

Step 5: Apply optimizations. Prompt caching (-40%) and model routing for simple tasks (-30% on the routed portion) work against the inference portion, not the fixed monitoring spend. Realistic post-optimization cost: roughly $600-750/month.

Step 6: Budget for growth. Multiply by 2-3x for the first year. Usage will increase. Token consumption will grow. New use cases will appear. If your year-one estimate is $12,000, budget $24,000-36,000.
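The six steps fold into one function. A sketch, with the Step 4 multipliers as defaults and the Step 1 customer-service profile as the example input:

```python
def monthly_cost_estimate(
    tasks_per_day: int,
    calls_per_task: int,
    in_tokens_per_call: int,
    out_tokens_per_call: int,
    price_in: float,                       # USD per 1M input tokens
    price_out: float,                      # USD per 1M output tokens
    retry_overhead: float = 0.15,          # Step 4
    orchestration_overhead: float = 0.20,  # Step 4
    monitoring_monthly: float = 200.0,     # Step 4, fixed cost
    optimization_discount: float = 0.0,    # Step 5, inference only
    growth_factor: float = 1.0,            # Step 6: 2-3x for year one
) -> float:
    calls = tasks_per_day * calls_per_task                           # Step 2
    daily = calls * (in_tokens_per_call * price_in
                     + out_tokens_per_call * price_out) / 1_000_000  # Step 3
    inference = daily * 30 * (1 + retry_overhead) * (1 + orchestration_overhead)
    inference *= (1 - optimization_discount)
    return (inference + monitoring_monthly) * growth_factor

# Step 1 profile: 500 conversations/day, 4 calls each,
# 2,000 input + 500 output tokens per call, at GPT-4o rates
print(round(monthly_cost_estimate(500, 4, 2000, 500, 2.50, 10.00)))
```

Re-running the estimate with `optimization_discount` and `growth_factor` set shows how far caching and routing move the number before growth takes it back up.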

What Actually Matters

The teams that control AI agent costs share three habits.

First, they measure per-task cost from day one. Not average monthly spend, but cost per successful task completion. This metric captures retries, failures, and escalations that averages hide. If your per-task cost is climbing, you have a quality problem masquerading as a cost problem.
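A sketch of the metric, assuming a hypothetical per-attempt cost ledger: divide total spend by successes, not attempts, so retries and escalations show up in the number:

```python
# Hypothetical ledger: every attempt is logged with its cost and outcome.
attempts = [
    {"cost": 0.04, "success": True},
    {"cost": 0.03, "success": False},  # failed attempt, retried below
    {"cost": 0.09, "success": True},   # retry escalated to a pricier model
]

total = sum(a["cost"] for a in attempts)
successes = sum(a["success"] for a in attempts)
print(round(total / successes, 3))  # cost per *successful* task
```

Averaging over attempts instead would report $0.053 per call and hide the failed attempt entirely; per-success accounting surfaces it.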

Second, they use the cheapest model that works for each subtask. A mixture-of-experts approach at the system level, routing to specialists rather than sending everything to the most capable model, mirrors how the best model architectures work internally. Most agents don't need frontier reasoning for every step.

Third, they treat cost optimization as continuous, not one-time. Prices drop constantly. New models launch monthly. Caching features ship quarterly. The team that set up their agent system six months ago and hasn't revisited the cost structure is overpaying by 2-5x compared to what's available today.

The true cost of running AI agents isn't a mystery. It's a measurement discipline. The teams that track it carefully spend less and ship better. The teams that don't find out the hard way that "the AI is basically free" was never true.
