Agent Cost Optimization: How to Track and Reduce LLM Spend

Token prices dropped 280x over two years. Enterprise AI budgets rose 320% in the same period. That's not a paradox. It's what happens when agentic workflows multiply token consumption by 5-30x per task while teams treat API calls like a utility bill nobody reads.

Gartner's March 2026 forecast predicts inference costs will fall another 90% by 2030. But falling unit prices don't fix rising total spend. The teams that control costs aren't waiting for cheaper tokens. They're measuring where tokens go and eliminating the ones that don't earn their keep.

This guide covers the full optimization stack: metering, routing, caching, prompt engineering, and architectural changes that cut 60-80% of agent spend without cutting capability.

Why Agent Costs Are Different

A chatbot takes a question, generates a response, done. An agent researches, plans, calls tools, checks results, retries on failure, and sometimes loops back to start over. That loop structure changes the cost equation.

Consider a research agent that summarizes a topic. It might make 8-15 LLM calls: one for planning, several for document analysis, a few for synthesis, and verification passes at the end. Each call carries a system prompt, conversation history, and tool results in its context window. A breakdown of the true cost of running agents in production shows that raw inference accounts for only 30-50% of total spend. The rest is orchestration overhead, retries, and context bloat.

Three properties make agent costs harder to predict than simple API usage:

Non-linear scaling. Adding a second agent to a workflow doesn't double cost. It typically triples it, because agents coordinate through shared context. Every message between agents costs tokens twice: once for the sender to generate it, once for the receiver to process it.

Retry amplification. A 15% task failure rate with automatic retry means you pay for roughly 1.15 attempts per completed task, and every failed attempt is pure waste. Multi-step agents compound this: five steps at 90% reliability each give you only 59% end-to-end success, meaning nearly half your token spend produces nothing.

Context accumulation. Agents carry growing context windows across calls. By step ten of a complex task, you might be sending 50,000 tokens of history with every request, even if the actual new instruction is 200 tokens.

Step 1: Meter Before You Manage

You can't optimize what you can't measure. The first move is instrumenting every LLM call with cost attribution.

What to Track

At minimum, log these fields for every inference call:

  • Model used (including version/snapshot)
  • Input tokens and output tokens (separately)
  • Latency (time to first token and total)
  • Task type (classification, generation, tool selection, verification)
  • Agent identity (which agent in a multi-agent system)
  • Session/trace ID (to group calls within a single user task)
  • Success/failure (did this call contribute to a completed task?)

The last two matter most for optimization. Without session-level grouping, you can't calculate cost-per-completed-task. Without success tracking, you can't identify wasted spend.
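The fields above can be captured in a small log record. This is an illustrative sketch, not any particular tool's schema; the model names and per-million-token prices in the table are placeholders you'd replace with your providers' actual rates.

```python
from dataclasses import dataclass, field
from uuid import uuid4

# Placeholder per-million-token prices; substitute your providers' real rates.
PRICES = {
    "budget-model":   {"input": 0.15, "output": 0.60},
    "frontier-model": {"input": 15.00, "output": 75.00},
}

@dataclass
class CallRecord:
    model: str            # including version/snapshot
    input_tokens: int
    output_tokens: int
    latency_ms: float     # total; log time-to-first-token separately if streaming
    task_type: str        # classification, generation, tool selection, verification
    agent_id: str         # which agent in a multi-agent system
    trace_id: str         # groups calls within a single user task
    success: bool         # did this call contribute to a completed task?
    call_id: str = field(default_factory=lambda: str(uuid4()))

    @property
    def cost_usd(self) -> float:
        p = PRICES[self.model]
        return (self.input_tokens * p["input"]
                + self.output_tokens * p["output"]) / 1_000_000
```

Emit one record per inference call to your log pipeline; the `trace_id` and `success` fields are what make the later cost-per-task and waste-ratio calculations possible.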

Tool Options

The monitoring ecosystem has matured. Here's what works for cost-specific tracking:

Langfuse is the most widely adopted open-source option, with 21,000+ GitHub stars and MIT-licensed tracing, prompt management, and cost dashboards. It handles multi-step agent traces natively and breaks down spend by model, feature, and user segment.

Helicone specializes in cost monitoring with a proxy-based architecture. Drop it between your application and the LLM API, and it logs every request with cost calculations, latency, and token counts without code changes.

LiteLLM acts as a unified proxy for 100+ LLM providers. It normalizes cost tracking across providers and supports budget limits per user, team, or project. If you're calling multiple providers, this is the fastest path to consolidated cost visibility.

For teams already using observability platforms, Braintrust provides cost attribution that breaks down spending by user, feature, or model, useful for product teams that need to understand unit economics per feature.

Building a Cost Dashboard

The dashboard that actually drives decisions shows three things:

Cost per completed task (not cost per API call). If an agent needs six calls to finish a research task, your metric is the total cost of those six calls, including retries. This is the number you optimize against.

Waste ratio. What percentage of token spend goes to failed attempts, unnecessary retries, or context that never influenced the output? Teams typically find 20-40% waste on first measurement.

Model utilization distribution. What percentage of calls go to your most expensive model? This directly informs routing decisions.
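Given call logs with the fields from Step 1, all three dashboard numbers fall out of one aggregation pass. A minimal sketch over plain dicts, assuming a task counts as completed if at least one of its calls succeeded:

```python
from collections import Counter, defaultdict

def dashboard_metrics(calls):
    """calls: dicts with trace_id, model, cost_usd, success."""
    by_trace = defaultdict(list)
    for c in calls:
        by_trace[c["trace_id"]].append(c)

    total = sum(c["cost_usd"] for c in calls)
    wasted = sum(c["cost_usd"] for c in calls if not c["success"])
    # A task is "completed" if any call in its trace succeeded.
    completed = [t for t, cs in by_trace.items() if any(c["success"] for c in cs)]
    model_share = Counter(c["model"] for c in calls)

    return {
        "cost_per_completed_task": total / len(completed) if completed else float("inf"),
        "waste_ratio": wasted / total if total else 0.0,
        "model_distribution": {m: n / len(calls) for m, n in model_share.items()},
    }
```

Run this over a day's logs and the waste ratio alone usually tells you whether retries or routing should be your first target.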

Step 2: Route Queries to the Right Model

Model routing is the highest-impact optimization for most teams. The idea is simple: not every request needs your most capable (and expensive) model. A classification task, a simple extraction, or a format validation doesn't need Claude Opus or GPT-4.5. It needs something fast and cheap.

How Routing Works

A small classifier model evaluates each incoming request and assigns it to the appropriate tier. The classification itself costs roughly $0.0001 per request, making it effectively free.

Tier 1 (Budget): Simple classification, extraction, formatting, yes/no decisions. Models like Claude Haiku or GPT-4o-mini at $0.15-0.80 per million input tokens.

Tier 2 (Standard): Summarization, moderate reasoning, tool selection. Claude Sonnet or GPT-4o at $2.50-3.00 per million input tokens.

Tier 3 (Frontier): Complex reasoning, novel problem solving, high-stakes decisions. Claude Opus or GPT-4.5 at $15+ per million input tokens.

Research from LMSYS's RouteLLM project showed routers can cut inference costs by up to 85% by directing most queries to smaller models. In practice, teams report 50-70% cost reduction with routing alone.

Implementation Patterns

Complexity-based routing scores request difficulty and routes accordingly. A lightweight classifier trained on your own query distribution works better than a generic one. Start with keyword and length heuristics, then graduate to a trained classifier as you collect data.

Confidence-based cascading starts with the cheapest model. If its confidence score falls below a threshold, escalate to the next tier. This costs more in latency but catches the common case cheaply. Research from IBM shows this approach works well when 60%+ of queries are straightforward.

Task-type routing maps agent operations to fixed model assignments. Planning steps get the frontier model. Tool parameter extraction gets the budget model. Verification gets the mid-tier. This is the easiest to implement and audit, but it's less adaptive than the other approaches.
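The cascading pattern is the least obvious of the three, so here is a minimal sketch. The tier call functions and their confidence scores are assumptions for illustration; in practice confidence might come from token logprobs or a self-rating prompt.

```python
def cascade(query, tiers, threshold=0.7):
    """Try the cheapest tier first; escalate when confidence is low.

    tiers: (name, call_fn) pairs ordered cheap -> expensive, where each
    call_fn returns (answer, confidence in [0, 1]). Both are stand-ins
    for real model calls.
    """
    for name, call_fn in tiers[:-1]:
        answer, confidence = call_fn(query)
        if confidence >= threshold:
            return name, answer
    # Final tier is always trusted -- there is nothing left to escalate to.
    name, call_fn = tiers[-1]
    answer, _ = call_fn(query)
    return name, answer
```

The latency cost is visible here: a query that escalates twice pays for three sequential calls, which is why cascading pays off only when most queries stop at the first tier.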

The BEST-Route Approach

BEST-Route, published at a major 2026 venue, takes a different angle. Instead of just picking a model, it also decides how many responses to sample and selects the best. For small models, generating five responses and picking the winner is cheaper than one response from a large model, and often produces comparable quality.

Step 3: Cache What You Can

Caching eliminates LLM calls entirely for repeated or similar queries. There are two layers, and they're complementary.

Prompt/Prefix Caching

Most LLM providers now offer native prefix caching. If consecutive requests share the same system prompt and early context, the provider caches the KV computations and charges a reduced rate for the shared prefix.

Anthropic charges $0.30 per million cached tokens versus $3.00 fresh, a 90% saving on the cached portion. OpenAI offers a 50% discount on cached tokens. For agents with long, stable system prompts, this adds up fast.

To maximize prefix caching, structure your prompts so the stable parts (system instructions, tool definitions, few-shot examples) come first, and the variable parts (user query, recent context) come last. Don't shuffle your tool definitions between calls.
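Concretely, that ordering looks like this. The sketch builds a request payload with the stable prefix first; the `cache_control` marker follows the shape of Anthropic's prompt-caching API as an assumption here, so check your provider's caching docs for the exact field names and cache-boundary rules.

```python
def build_request(system_prompt, tool_defs, few_shot, user_query, recent_context):
    """Order blocks so the stable prefix (instructions, tool definitions,
    few-shot examples) precedes everything that varies per call."""
    stable = "\n\n".join([system_prompt, tool_defs, few_shot])
    return {
        "system": [
            # Cache boundary sits after the stable prefix, so every call
            # sharing this prefix hits the cached KV computation.
            {"type": "text", "text": stable,
             "cache_control": {"type": "ephemeral"}},
        ],
        "messages": recent_context + [{"role": "user", "content": user_query}],
    }
```

The inverse anti-pattern is interleaving the user query into the system prompt or regenerating tool definitions in a different order each call, either of which breaks the shared prefix and forfeits the discount.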

Semantic Caching

Semantic caching stores embeddings of past queries and their responses. When a new query arrives, it checks for semantically similar past queries and returns the cached response if the match is close enough.

GPT Semantic Cache research showed a 68.8% reduction in API calls across various query categories. Redis LangCache reports up to 73% cost reduction in high-repetition workloads.

But the headline numbers need context. Production analysis shows roughly 18% of requests are exact duplicates, 60-70% are genuinely unique, and semantic cache hit rates range from 10% to 70% depending on workload characteristics. Customer support and FAQ-style applications see high hit rates. Research agents and creative tasks see low hit rates.

Before investing in semantic caching infrastructure, log your queries for a week and measure actual duplication rates. If fewer than 20% of your queries are near-duplicates, the engineering investment won't pay off.
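A cheap first pass at that measurement doesn't even need embeddings. The sketch below counts queries whose normalized form has appeared before; treat the result as a floor, since a real semantic cache would also catch paraphrases this misses.

```python
import re
from collections import Counter

def near_duplicate_rate(queries):
    """Share of queries whose normalized form (lowercase, punctuation
    stripped) was already seen -- a lower bound on cache-hit potential."""
    def normalize(q):
        return re.sub(r"[^a-z0-9 ]+", "", q.lower().strip())

    seen = Counter()
    dupes = 0
    for q in queries:
        key = normalize(q)
        if seen[key]:
            dupes += 1
        seen[key] += 1
    return dupes / len(queries) if queries else 0.0
```

If this exact-match floor is already near 20%, semantic caching is worth prototyping; if it's near zero, paraphrase matching is unlikely to rescue the business case.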

Step 4: Compress Your Prompts

After routing and caching, the next lever is reducing the tokens in each request.

System Prompt Optimization

System prompts in agent frameworks tend to bloat over time. Teams add instructions, examples, and guardrails without measuring their impact. A prompt audit typically reveals:

  • Redundant instructions that the model already follows by default
  • Few-shot examples that stopped helping after model upgrades
  • Verbose formatting instructions that can be compressed 3-5x
  • Tool descriptions longer than necessary

Measure the cost of your system prompt in tokens. If it exceeds 2,000 tokens, it's worth a compression pass. Test each removal against your eval suite. You'll often find that cutting 40% of the prompt produces identical output quality.
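A per-section size report makes the compression pass concrete. This sketch uses the rough four-characters-per-token rule of thumb for English text as an assumption; use your provider's tokenizer for exact counts.

```python
def audit_prompt_sections(sections, chars_per_token=4):
    """Approximate token count per named prompt section, plus a total.

    sections: dict mapping section name (e.g. "instructions", "examples",
    "tool_descriptions") to its text.
    """
    report = {name: len(text) // chars_per_token for name, text in sections.items()}
    report["total"] = sum(report.values())
    return report
```

Seeing that few-shot examples account for 60% of a 3,000-token prompt is usually what makes the "test each removal against your evals" step actually happen.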

Context Window Management

Agent workflows accumulate context across calls. Left unchecked, later calls carry the full history of every prior step, including the irrelevant ones.

Strategies that work in practice:

Sliding window. Keep only the last N turns of conversation. Simple to implement, but risks losing important early context.

Summarization checkpoints. Every K steps, compress the conversation history into a summary using a cheap model. Use the summary as context for subsequent steps. This adds one LLM call but can cut context size by 80%.

Selective context injection. Instead of carrying full history, retrieve only the relevant prior context for each step. This is more complex but produces the best results for long-running agents. The context window management guide covers the technical details.
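The summarization-checkpoint strategy fits in a few lines. In this sketch, `summarize` is a stand-in for a call to a cheap model, and `k` is the number of recent turns kept verbatim; both are assumptions to tune for your workload.

```python
def with_summarization_checkpoints(history, summarize, k=5):
    """Collapse everything before the last k turns into one summary
    message, keeping the most recent turns verbatim.

    history: list of {"role": ..., "content": ...} messages.
    summarize: callable mapping a message list to a summary string
    (a stand-in for a cheap-model call).
    """
    if len(history) <= k:
        return history
    summary = summarize(history[:-k])
    checkpoint = {"role": "system",
                  "content": f"Summary of earlier steps: {summary}"}
    return [checkpoint] + history[-k:]
```

Run this every K steps rather than every step, so the one extra summarization call is amortized across the context savings it buys.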

Output Length Control

Agents often generate more text than needed for intermediate steps. A planning step doesn't need 500 words. A tool selection step needs a function name and parameters, not an explanation.

Set explicit max_tokens for each call type. Planning: 500. Tool selection: 200. Verification: 100. Final output: whatever the task requires. This single change often cuts output token costs by 30-50% with no quality impact on intermediate steps.
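A simple lookup table is enough to enforce this. The cap values below are the illustrative numbers from the text, not recommendations; tune them against your eval suite.

```python
# Per-call-type output caps (illustrative values; tune against your evals).
MAX_TOKENS = {
    "planning": 500,
    "tool_selection": 200,
    "verification": 100,
}

def call_params(call_type, final_output_budget=4096):
    """Return the request parameters for a call type. Unknown types
    (including the final user-facing output) get the larger
    task-dependent budget."""
    return {"max_tokens": MAX_TOKENS.get(call_type, final_output_budget)}
```

Wiring this into the single place your code constructs requests means no individual agent step can quietly regress into paragraph-length tool selections.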

Step 5: Architectural Cost Wins

Some optimizations require changing how your agents work, not just how they call models.

Batch Processing

Both OpenAI and Anthropic offer batch APIs with significant discounts. OpenAI's Batch API provides a 50% discount on all models for non-real-time workloads. If your agent processes tasks asynchronously (email triage, document review, data enrichment), batching is free money.

Structure your pipeline so non-urgent work queues up and processes in batch windows. A nightly batch run at half price is better than real-time processing at full price when the user won't see results until morning anyway.
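Queued tasks get serialized into the JSONL format OpenAI's Batch API expects (one request object per line), then uploaded and submitted as a batch. A sketch of the serialization step, with the model name as a placeholder; the field names follow OpenAI's batch documentation:

```python
import json

def to_batch_jsonl(tasks, model="gpt-4o-mini"):
    """Serialize queued prompts into Batch API JSONL: one request object
    per line, each with a custom_id for matching results back to tasks."""
    lines = []
    for i, prompt in enumerate(tasks):
        lines.append(json.dumps({
            "custom_id": f"task-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }))
    return "\n".join(lines)
```

The `custom_id` is what lets you join the asynchronous results file back to the original queue entries when the batch completes.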

Speculative Execution

Instead of sequential agent steps where each waits for the previous, run likely next steps in parallel. If step 2 has three possible outcomes, start step 3 for all three and discard the two that don't match. This costs more tokens but cuts latency dramatically. It's only cost-effective when latency directly affects revenue (trading systems, real-time customer support) and the branching factor is low.

Replacing Agents with Deterministic Code

The most overlooked optimization: some agent steps don't need an LLM at all. If an agent always extracts dates from a standard format, writes a SQL query with a predictable pattern, or validates JSON against a schema, replace those steps with code.

Audit your agent traces for steps with >95% consistency in output format. These are candidates for deterministic replacement. Each step you remove from the LLM eliminates tokens, latency, and failure modes simultaneously.
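As a concrete instance of the date-extraction example above: if the agent only ever pulls ISO-format dates out of standardized documents, a regex does the same job for zero tokens and zero model latency.

```python
import re

# Deterministic replacement for an LLM extraction step: ISO 8601 dates.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

def extract_dates(text):
    """Return all YYYY-MM-DD dates in order of appearance."""
    return [m.group(0) for m in ISO_DATE.finditer(text)]
```

The same trade applies to schema validation and templated SQL: the code path is not just cheaper than the LLM call, it also removes a retry-prone failure mode from the trace entirely.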

Putting It Together: The Optimization Sequence

Don't try everything at once. The order matters because each step reveals data that informs the next:

Week 1: Instrument. Deploy cost tracking on every LLM call. Establish your baseline cost-per-task, waste ratio, and model distribution. No optimization yet, just measurement.

Week 2: Route. Implement model routing based on your task-type distribution. This is typically the highest-impact change, cutting 40-70% of spend immediately.

Week 3: Cache. Enable provider prefix caching (configuration change, no engineering). Evaluate semantic caching ROI based on your duplication analysis from Week 1.

Week 4: Compress. Audit system prompts, implement context summarization, and set output length limits per call type.

Ongoing: Replace. Identify deterministic agent steps and replace them with code. This is incremental and compounds over time.

Teams that follow this sequence typically see 60-80% total cost reduction within a month, with most gains coming from routing and prompt compression. The AI agent ROI calculator can help you model the financial impact before you start.

What Comes Next

Cost optimization is moving in two directions simultaneously.

Inference providers are competing on price. NVIDIA's Blackwell platform has enabled providers like DeepInfra to cut cost-per-token by 4x through hardware optimization alone. Gartner expects a 90% reduction in trillion-parameter inference costs by 2030. Cheaper tokens help everyone.

But agentic workflows keep finding new ways to consume them. As agents get more capable, they take on longer, more complex tasks. A coding agent that can handle a full feature implementation burns through tokens that a simple autocomplete never would. The agent reliability problem means agents also waste more tokens as task complexity increases.

The winning strategy isn't waiting for cheaper tokens. It's building cost awareness into your agent architecture from the start: meter every call, route aggressively, cache where repetition exists, and replace LLM calls with code wherever a deterministic solution works.

The teams shipping agents profitably in 2026 aren't the ones with the biggest budgets. They're the ones that know exactly where every token goes.