title: "How to Build Agent Evals That Catch Real Failures"
slug: agent-evals-production-failures
date: 2026-04-25
type: guide
category: reasoning-memory
subtopic: evaluation
tags: [guides, reasoning-memory, evaluation, agents, benchmarks]
status: draft
excerpt: "Standard LLM benchmarks miss the failures that actually hurt in production. Here's how to build an evaluation system for agents that catches cascading errors, trajectory drift, and policy violations before they reach users."
Your agent passes every benchmark. It scores 94% on tool-call accuracy. Your unit tests are green. Then it deletes the wrong records in production.
This is not a hypothetical. The gap between benchmark performance and production behavior is the defining reliability problem for AI agents right now. A 2025 analysis of 12 major agent benchmarks found validity issues affecting 7 out of 10 of them, with cost misestimation rates reaching 100% in some cases. The benchmarks aren't measuring what matters.
This guide explains why standard evaluation methods break for agents, what failure modes they miss, and how to build an eval system that actually correlates with production quality.
Why LLM Evals Don't Transfer to Agents
Static LLM evaluation was designed for single-turn responses: prompt in, answer out, score against expected output. Agents break that model in three structural ways.
Compositional failure. A single agent run involves dozens of sequential decisions: which tool to call, in what order, with what arguments, based on what context. An agent can achieve 100% tool-call accuracy (selecting the right tool every time) and still produce a wrong result because the reasoning connecting steps was flawed. A customer service agent can execute perfect API calls while still violating policy on edge cases. Final-output scoring doesn't catch this.
Cascading errors. Small upstream mistakes propagate. A wrong timestamp, a malformed identifier, or an incorrect assumption in step two gets referenced in steps four, six, and eight. By the time the agent produces a final response, the original error is invisible in the output log. Standard benchmarks check the endpoint. The failure is in the path.
Shallow task design. Most existing benchmarks test "isolated API functionalities, few-step workflows, and artificial task compositions," as MCP-Bench's designers noted when explaining why they built an alternative. They don't test strategic planning across real multi-domain operations. An agent that passes these benchmarks has demonstrated it can operate in a controlled environment. That's not the same as production.
The result: developers get confident signals from evals that measure the wrong things, and get surprised when agents fail in the field.
What Benchmarks Currently Miss
Understanding the specific failure modes benchmarks skip is the starting point for building better evals.
Context Window Degradation
Berkeley Function-Calling Leaderboard data shows every model performs worse when given more than one tool. The degradation isn't linear. Past a model-specific threshold, more context increases variance and raises hallucination risk; the performance doesn't decline gradually, it collapses. Most tool-use benchmarks test agents with a handful of tools in a clean context. Production agents run with dozens of tools, mid-conversation, after five prior exchanges.
Tool Call Hallucination and Context Poisoning
Agents confidently call non-existent API endpoints and pass wrong parameters. The compounding problem: when a hallucinated call enters the conversation context ("the order was placed successfully"), subsequent steps treat it as ground truth. This is context poisoning: the error propagates laterally through the execution trace, and the final output looks plausible. No benchmark tests for this specifically.
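One way to keep a hallucinated call out of the context is to validate every call against a tool registry before its result is recorded. The sketch below is illustrative only: `ToolCall`, `TOOL_REGISTRY`, and `validate_call` are hypothetical names, and the registry format (tool name mapped to required argument names) is an assumption, not any framework's API.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict


# Hypothetical registry: tool name -> required argument names.
TOOL_REGISTRY = {
    "lookup_order": {"order_id"},
    "issue_refund": {"order_id", "amount"},
}


def validate_call(call: ToolCall, registry: dict) -> list:
    """Return a list of problems; an empty list means the call is well-formed."""
    if call.name not in registry:
        return [f"unknown tool: {call.name}"]
    required = registry[call.name]
    supplied = set(call.args)
    problems = []
    if required - supplied:
        problems.append(f"missing args: {sorted(required - supplied)}")
    if supplied - required:
        problems.append(f"unexpected args: {sorted(supplied - required)}")
    return problems
```

A call that fails validation can be logged as an eval failure at the step where it happened, rather than discovered later through its downstream effects.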
Policy Violation on Edge Cases
Sierra Research's τ-bench, which tests agents in realistic retail, airline, and banking domains, exposed a finding that wasn't visible in simpler benchmarks: agents fail not on tool calls themselves but on policy edge cases and ambiguous user intent. When a customer's request sits at the boundary of what policy allows, agents improvise, sometimes in ways that would cause legal or compliance problems. The follow-on τ²-bench extends this to dual-control environments where both agent and user are tool-using, testing scenarios existing benchmarks don't cover at all.
Scope Creep
Agents do more than instructed. Standard benchmarks with fixed expected outputs don't score on over-action. In production, scope creep causes data deletion, unintended API side effects, and unauthorized resource access. If your eval only checks "did the right thing happen," it won't catch "did some unintended things also happen."
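A minimal sketch of an over-action check, assuming each side-effecting action can be logged as a `(tool, target)` pair. The function name and the pair format are assumptions made for illustration.

```python
from collections import Counter


def unexpected_actions(expected: list, observed: list) -> list:
    """Return observed (tool, target) actions that the task did not call for."""
    remaining = Counter(expected)
    extras = []
    for action in observed:
        if remaining[action] > 0:
            remaining[action] -= 1  # this action was expected; consume it
        else:
            extras.append(action)   # anything left over is scope creep
    return extras
```

A run passes only when the expected actions happened and this list is empty. The second condition is the one that fixed-output benchmarks never score.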
The Three-Layer Evaluation Architecture
The most practically useful framework for agent evals organizes assessments into three layers, each catching failures the others miss.
Layer 1: Node-Level Precision
Evaluate individual steps in isolation.
| Metric | What it measures |
|---|---|
| Tool selection accuracy | Did the agent pick the right tool for this step? |
| Argument correctness | Were tool arguments valid and semantically appropriate? |
| Step utility score | Did this step move toward the goal, or was it redundant/wrong? |
Node-level evals are fast to run and easy to automate. They catch the obvious failures: wrong tool, malformed parameters, hallucinated function names. They don't catch the reasoning failures between steps, which is why they're necessary but not sufficient.
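The first two metrics in the table could be computed roughly as follows. This is a sketch under simplifying assumptions: expected and actual steps are already aligned one-to-one, and argument correctness is reduced to exact equality, which is stricter than "semantically appropriate". `Step` and `node_level_scores` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str
    args: dict


def node_level_scores(expected: list, actual: list) -> dict:
    """Score aligned steps pairwise; both lists must have the same length."""
    if not expected or len(expected) != len(actual):
        raise ValueError("expected and actual must be non-empty and aligned")
    pairs = list(zip(expected, actual))
    tool_hits = sum(e.tool == a.tool for e, a in pairs)
    # Arguments only count as correct when the tool itself was correct.
    arg_hits = sum(e.tool == a.tool and e.args == a.args for e, a in pairs)
    n = len(pairs)
    return {
        "tool_selection_accuracy": tool_hits / n,
        "argument_correctness": arg_hits / n,
    }
```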
Layer 2: Session-Level Trajectory Quality
Evaluate the full execution path, not just the output.
Trajectory scoring compares what the agent actually did against a golden trajectory: the sequence of steps a human expert would take to solve the task. You're scoring process quality, not just outcome.
Key metrics:
- Task success rate: Did the agent complete the goal? (Necessary but insufficient alone)
- Trajectory quality score: How closely did the agent's path match the optimal path?
- Step efficiency ratio: How many steps did the agent take vs. the minimum required?
- Policy adherence rate: Did every action stay within defined operational constraints?
Google Cloud's methodology formalizes this into a three-phase process: define a task taxonomy, build golden trajectory datasets, then evaluate against rubrics at both step and session level. The golden trajectory dataset is the most expensive part to build. It requires human experts to solve representative tasks and annotate each step. But it's also the highest-signal component: it's what separates an eval that catches real failures from one that just checks final outputs.
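As a rough illustration of trajectory scoring, the sketch below compares tool-name sequences with Python's standard `difflib.SequenceMatcher` and computes a step efficiency ratio. Using sequence similarity as the quality score is an assumption made here for simplicity. It is not the rubric-based method described above, and it ignores arguments and reasoning entirely.

```python
from difflib import SequenceMatcher


def trajectory_quality(golden: list, actual: list) -> float:
    """Similarity in [0, 1] between the golden and actual tool sequences."""
    return SequenceMatcher(None, golden, actual, autojunk=False).ratio()


def step_efficiency(golden: list, actual: list) -> float:
    """Minimum required steps divided by steps actually taken."""
    return len(golden) / len(actual) if actual else 0.0


golden = ["lookup_order", "check_policy", "issue_refund"]
actual = ["lookup_order", "lookup_order", "check_policy", "issue_refund"]
quality = trajectory_quality(golden, actual)
efficiency = step_efficiency(golden, actual)  # 3 / 4 = 0.75
```

Here the agent succeeded but repeated a lookup, so both scores drop below 1.0 even though a task-success metric would report a clean pass.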
Layer 3: System Efficiency
Evaluate the agent's resource behavior in context.
| Metric | Why it matters |
|---|---|
| Latency per task | Correlates with user experience and operational cost |
| Token consumption | Directly affects cost; also a proxy for reasoning efficiency |
| Tool call count per task | Higher than expected often signals confused reasoning |
| Error recovery rate | What fraction of recoverable errors does the agent self-correct? |
System efficiency metrics are early warning signals. When an agent suddenly requires 40% more tokens to complete a task it handled efficiently last week, something changed in the model, the tools, or the task distribution. These metrics catch silent regressions that outcome-only scoring misses.
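These four metrics can be aggregated from per-run records. The sketch assumes your tracing layer already captures the fields shown; `RunRecord` and its field names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    latency_s: float
    tokens: int
    tool_calls: int
    recoverable_errors: int  # errors the agent could have corrected itself
    recovered_errors: int    # errors it actually corrected


def efficiency_summary(runs: list) -> dict:
    """Aggregate Layer 3 metrics over a non-empty batch of runs."""
    total_errors = sum(r.recoverable_errors for r in runs)
    recovered = sum(r.recovered_errors for r in runs)
    return {
        "mean_latency_s": mean(r.latency_s for r in runs),
        "mean_tokens": mean(r.tokens for r in runs),
        "mean_tool_calls": mean(r.tool_calls for r in runs),
        # With no recoverable errors there is nothing to recover from.
        "error_recovery_rate": recovered / total_errors if total_errors else 1.0,
    }
```

Comparing this summary week over week for the same task category is what surfaces the kind of token regression described above.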
LLM-as-Judge: When It Works and When It Doesn't
Automated scoring with an LLM judge (using a capable model to evaluate another model's trajectory) is now standard practice. LLM judges achieve Spearman correlations of 0.8–0.9 with aggregate human preferences on well-defined rubrics. The threshold for trusting a judge in production is ≥0.80 Spearman with your human evaluators; if your judge doesn't meet that bar, your automated eval results aren't reliable.
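Checking a judge against that bar needs only paired scores on the same sessions. The sketch below computes Spearman's rank correlation from scratch, with tied values given their average rank, so it has no dependencies. The function names are illustrative, and the 0.80 default comes from the threshold stated above.

```python
from math import sqrt


def _ranks(values: list) -> list:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs: list, ys: list) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 items")
    rx, ry = _ranks(xs), _ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def judge_is_trustworthy(judge: list, human: list, threshold: float = 0.80) -> bool:
    return spearman(judge, human) >= threshold
```

For example, judge scores `[1, 2, 3, 4, 5]` against human scores `[1, 3, 2, 4, 5]` give a correlation of 0.9, which clears the bar.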
The "agent-as-a-judge" extension uses an agent (not just a single LLM call) to evaluate another agent's full trajectory. arXiv:2508.02994 motivates this with the observation that final-output-only evaluation systematically misses agent-specific failure modes. An agent-evaluator can follow the execution trace, check each tool call against the context available at that step, and score reasoning quality at each decision point.
Where LLM judges fail:
- Factual accuracy on domain-specific knowledge (judges inherit the same knowledge gaps as the model being evaluated)
- Tasks requiring external ground truth (database records, API responses, real-world state)
- Adversarial inputs designed to fool the judge the same way they fool the agent
For these cases, you need deterministic checks: diff the database state before and after, validate API responses against schema, compare against known-correct outputs.
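A state diff is the simplest of these checks. The sketch below assumes the relevant database state can be snapshotted as a dict keyed by record ID before and after the run. The snapshot format and function names are assumptions for illustration.

```python
def diff_state(before: dict, after: dict) -> dict:
    """Describe what changed between two snapshots keyed by record ID."""
    shared = before.keys() & after.keys()
    return {
        "added": {k: after[k] for k in after.keys() - before.keys()},
        "removed": {k: before[k] for k in before.keys() - after.keys()},
        "changed": {k: (before[k], after[k]) for k in shared if before[k] != after[k]},
    }


def only_expected_changes(before: dict, after: dict, expected: dict) -> bool:
    """Pass only if the run changed exactly what the task required."""
    return diff_state(before, after) == expected
```

Because the comparison is exact, the check fails both when the required change is missing and when the agent touched a record it should have left alone.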
Building Your Eval Stack in Practice
Start With a Task Taxonomy
Before writing a single eval, categorize the tasks your agent handles. Group them by complexity, tool requirements, and risk level. A customer support agent might have: "standard refund request" (low complexity, two tools, low risk), "policy exception request" (medium complexity, five tools, high risk), "account compromise investigation" (high complexity, cross-system, critical risk).
Evaluations should be representative of this distribution. If 80% of production traffic is standard refunds and your eval dataset is 80% edge cases, your eval scores won't predict production behavior.
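One way to keep the eval set aligned with traffic is to sample each category in proportion to its production share. This is a sketch with hypothetical names. Note that per-category rounding means the total can differ slightly from `size` for some share values.

```python
import random


def sample_eval_set(tasks_by_category: dict, traffic_share: dict, size: int, seed: int = 0) -> list:
    """Draw tasks per category in proportion to production traffic share."""
    rng = random.Random(seed)  # fixed seed keeps the eval set reproducible
    sample = []
    for category, share in traffic_share.items():
        quota = round(size * share)
        pool = tasks_by_category[category]
        if quota > len(pool):
            raise ValueError(f"not enough tasks in category {category!r}")
        sample.extend(rng.sample(pool, quota))
    return sample
```

With shares of 0.80, 0.15, and 0.05 and a size of 20, this draws 16, 3, and 1 tasks respectively. High-risk categories with low traffic may still deserve a separate, deliberately oversampled suite reported on its own.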
Build a Golden Trajectory Dataset
For each task category, have a human expert complete 20-50 representative tasks and annotate the steps. Record:
- Which tool was called and why
- What information from the context justified each decision
- Where shortcuts were possible and why they were or weren't taken
- What a wrong path would look like at each decision point
This dataset becomes the ground truth for trajectory scoring and the source of negative examples for training your judge.
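The annotations listed above map naturally onto a small record type. The schema below is one possible shape, not a standard format; every field name is an assumption.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class GoldenStep:
    tool: str
    rationale: str           # why this tool was called
    evidence: str            # context that justified the decision
    shortcut_note: str = ""  # shortcut available, and why it was or wasn't taken
    wrong_path: str = ""     # what a wrong choice here would look like


@dataclass
class GoldenTrajectory:
    task_id: str
    category: str
    steps: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize for storage alongside the eval suite."""
        return json.dumps(asdict(self), indent=2)
```

The `wrong_path` field is what later supplies negative examples for the judge.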
Run Offline Evals Before Every Deployment
Braintrust's framework treats offline eval and production monitoring as the same system. Before deploying any change (model version, tool definition, system prompt), run the full task battery against your golden dataset. Statistical confidence indicators should flag when a change shifts performance outside normal variance.
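A deployment gate along these lines can be sketched with a two-proportion z-test on task success counts. This is a generic statistical check chosen for illustration, not Braintrust's implementation. The default threshold of -1.645 is the one-sided 5% critical value and is an assumption you would tune.

```python
from math import sqrt


def regression_z(base_pass: int, base_n: int, cand_pass: int, cand_n: int) -> float:
    """z-statistic for candidate vs. baseline success rate (negative = worse)."""
    p_base, p_cand = base_pass / base_n, cand_pass / cand_n
    pooled = (base_pass + cand_pass) / (base_n + cand_n)
    se = sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / cand_n))
    if se == 0:
        return 0.0  # both runs all-pass or all-fail: no detectable difference
    return (p_cand - p_base) / se


def blocks_deploy(base_pass: int, base_n: int, cand_pass: int, cand_n: int,
                  z_threshold: float = -1.645) -> bool:
    """Block when the candidate is worse than baseline beyond normal variance."""
    return regression_z(base_pass, base_n, cand_pass, cand_n) < z_threshold
```

A drop from 180/200 to 150/200 blocks the deploy, while 180/200 to 178/200 falls within normal variance and passes.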
Anthropic's practical guide to agent evals makes the point directly: the gap between offline benchmark performance and production behavior is a design failure, not a measurement artifact. If your offline evals don't predict production, the evals are wrong, not the production environment.
Monitor Trajectories in Production
Offline evals catch known failure modes. Production monitoring catches the unknown ones. The key capability: real-time tracing of tool invocations and their context as they happen. Arize and LangSmith both support multi-turn evaluation that scores complete agent conversations on semantic intent and trajectory, not just final output.
Set anomaly thresholds on the system efficiency metrics: if token consumption per task spikes 30% above baseline, alert. If tool call count per task exceeds expected range, investigate. These signals catch problems before users report them.
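The two alerts described here can be sketched as a single per-run check. The baseline is taken as the median of recent token counts, which is an assumption. The function and alert names are hypothetical.

```python
from statistics import median


def check_run(tokens: int, tool_calls: int, token_history: list,
              tool_call_range: tuple, tolerance: float = 0.30) -> list:
    """Return alert names for a single run; an empty list means no anomaly."""
    alerts = []
    baseline = median(token_history)  # median resists a few outlier runs
    if tokens > baseline * (1 + tolerance):
        alerts.append("token_spike")
    low, high = tool_call_range
    if not low <= tool_calls <= high:
        alerts.append("tool_calls_out_of_range")
    return alerts
```

Keeping baselines per task category avoids false alarms when the traffic mix shifts toward naturally expensive tasks.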
Close the Feedback Loop
HoneyHive's core workflow is the feedback flywheel: production logs feed eval datasets, which feed fine-tuning, which feeds the next deployment. This is how production monitoring compounds into systematic improvement rather than reactive firefighting.
The practical implementation:
- Flag sessions where the agent's trajectory diverged significantly from baseline
- Have a human reviewer score those sessions and annotate failure points
- Add the most informative failures to your golden dataset
- Re-run evals with the expanded dataset before the next deployment
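The first three steps can be sketched as a small triage pipeline. It assumes each session already carries a divergence score from your trajectory metric and that reviewers attach a label. `Session`, the 0.4 threshold, and the `"failure"` label are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    session_id: str
    divergence: float                    # 0.0 = matches baseline trajectory
    reviewer_label: Optional[str] = None  # filled in by a human reviewer


def flag_for_review(sessions: list, threshold: float = 0.4) -> list:
    """Sessions that diverged past the threshold, worst first."""
    flagged = [s for s in sessions if s.divergence >= threshold]
    return sorted(flagged, key=lambda s: s.divergence, reverse=True)


def promote_to_golden(reviewed: list, golden_ids: set, limit: int = 10) -> set:
    """Add up to `limit` reviewer-confirmed failures to the golden dataset IDs."""
    confirmed = [s for s in reviewed if s.reviewer_label == "failure"]
    return golden_ids | {s.session_id for s in confirmed[:limit]}
```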
The Benchmark Calibration Problem
Even well-designed evals drift from production over time. A 2025 survey of 12 major agent benchmarks found that task quality issues (incorrect expected actions, ambiguous instructions, impossible constraints, missing fallback behaviors) compound over time as benchmarks age against advancing model capabilities.
The fix is calibration: periodically compare your eval scores against production outcome data. If your eval predicts 85% task success but production is at 72%, your eval is miscalibrated and you're making deployment decisions on false confidence. Calibration sessions should happen at least quarterly, or after any major distribution shift in production traffic.
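The calibration check itself is simple arithmetic. In the sketch below, the 5-point tolerance is an assumed default rather than a recommendation from any source, and the function names are illustrative.

```python
def calibration_gap(eval_success_rate: float, prod_successes: int, prod_total: int) -> float:
    """Eval-predicted success rate minus observed production success rate."""
    return eval_success_rate - prod_successes / prod_total


def is_miscalibrated(gap: float, tolerance: float = 0.05) -> bool:
    """Flag when eval and production disagree by more than the tolerance."""
    return abs(gap) > tolerance


# The example from the text: eval predicts 85%, production observes 72%.
gap = calibration_gap(0.85, 72, 100)
```

Here the gap is 13 points, well past the tolerance. Running the same comparison per task category shows which part of the eval set has drifted.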
On the public benchmark side, convergence has already happened at the lower levels. Models now exceed 95% on tool name validity and schema compliance; those capabilities are table stakes. The Berkeley Function-Calling Leaderboard has effectively capped out on those dimensions. The real differentiator between agents now is orchestration intelligence: planning, error recovery, and policy adherence across long-horizon tasks. Your internal evals should be measuring these, not re-testing what the public leaderboards already tell you.
What's Next
The field is moving toward standardized trajectory evaluation: shared formats for golden datasets and step-level rubrics that allow cross-organization benchmarking on real-world tasks. τ-bench and its successors are working toward this for specific domains (retail, banking, customer service). MCP-Bench is attempting it for tool-using agents across heterogeneous server configurations.
For production teams, the immediate priority isn't waiting for those standards to arrive. It's building the internal infrastructure: task taxonomy, golden trajectory dataset, offline eval pipeline, production tracing. The teams that have done this already have a structural advantage over teams relying on public benchmarks alone, not because their agents are smarter, but because they know earlier when something is wrong.
The agent reliability gap isn't a model problem. It's a measurement problem. Fix the measurement and you fix the development loop.
Related reading: Agent Reliability Scores Are Getting Worse, Not Better examines why public benchmarks are diverging from production outcomes. Chain-of-Thought Prompting Doesn't Always Work. Here's the Evidence. covers the reasoning failures that multi-step evals need to catch. Context Window Management for Production Agents addresses the context degradation problem described under Context Window Degradation above. For cost implications of your eval infrastructure, see Agent Cost Optimization: How to Track and Reduce LLM Spend.