How to Build Agent Evals That Catch Real Failures

Your agent passes the benchmark. It scores well on tool-call accuracy. Your unit tests are green. Then it deletes the wrong records in production.

This is the reliability gap agent teams have to design around: benchmark performance can look clean while production behavior remains brittle. One 2025 analysis of agent benchmarks argues that several widely used benchmarks have validity and cost-estimation problems. Read that as a bounded warning about benchmark coverage, not proof that all benchmarks fail in the same way. Public benchmarks often emphasize cleaner task slices than the messy parts of agent behavior that matter in production.

This guide explains why standard evaluation methods can break for agents, what failure modes they often miss, and how to build an eval system that is more likely to correlate with production quality.

Why LLM Evals Don't Transfer to Agents

Static LLM evaluation was designed for single-turn responses: prompt in, answer out, score against expected output. Agents break that model in three structural ways.

Compositional failure. A single agent run can involve dozens of sequential decisions: which tool to call, in what order, with what arguments, based on what context. Even when individual tool calls look valid in isolation, the run can still produce a wrong result if the steps do not jointly satisfy the task and policy constraints. A customer service agent can execute valid API calls while still mishandling policy edge cases. Final-output scoring may not catch this.

Cascading errors. Small upstream mistakes propagate. A wrong timestamp, a malformed identifier, or an incorrect assumption in step two gets referenced in steps four, six, and eight. By the time the agent produces a final response, the original error is invisible in the output log. Standard benchmarks check the endpoint. The failure is in the path.

Shallow task design. Most existing benchmarks test "isolated API functionalities, few-step workflows, and artificial task compositions," as MCP-Bench's designers noted when explaining why they built an alternative. They don't test strategic planning across real multi-domain operations. An agent that passes these benchmarks has demonstrated it can operate in a controlled environment. That's not the same as production.

The result: developers get confident signals from evals that measure the wrong things, and get surprised when agents fail in the field.

What Benchmarks Currently Miss

Understanding the specific failure modes benchmarks skip is the starting point for building better evals.

Context Window Degradation

Berkeley Function-Calling Leaderboard results suggest that models perform worse when they must choose among multiple tools than when given a single one. The degradation isn't linear. Past a model-specific threshold, more context increases variance and raises hallucination risk, and performance can collapse rather than decline gradually. Most tool-use benchmarks test agents with a handful of tools in a clean context. Production agents often run with dozens of tools, mid-conversation, after five prior exchanges.

Tool Call Hallucination and Context Poisoning

Agents confidently call non-existent API endpoints and pass wrong parameters. The compounding problem: when a hallucinated call enters the conversation context ("the order was placed successfully"), subsequent steps treat it as ground truth. This is context poisoning: the error propagates through the rest of the execution trace while the final output still looks plausible. Few public benchmarks test for this specifically.
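One deterministic guard against context poisoning is to reconcile every tool call the agent's context treats as completed with what the execution layer actually ran. The sketch below assumes a hypothetical trace format in which each call carries a `call_id` and the executor keeps its own log with a `status` field. Neither name comes from a specific framework.

```python
def find_unbacked_calls(context_calls, executor_log):
    """Return call_ids present in the agent's context that the executor
    never completed successfully (hallucinated or failed calls)."""
    succeeded = {e["call_id"] for e in executor_log if e["status"] == "ok"}
    return [c["call_id"] for c in context_calls if c["call_id"] not in succeeded]


# Hypothetical trace: the agent "remembers" placing an order that never ran.
context_calls = [
    {"call_id": "c1", "tool": "lookup_order"},
    {"call_id": "c2", "tool": "place_order"},
]
executor_log = [{"call_id": "c1", "status": "ok"}]

print(find_unbacked_calls(context_calls, executor_log))  # ['c2']
```

Any non-empty result marks a run whose later steps may be reasoning from a result that never existed, regardless of how plausible the final answer looks.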

That makes memory design part of evaluation design. "How Agent Memory Got an Architecture" covers the long-term, episodic, and semantic memory choices that change what an eval has to inspect.

Policy Violation on Edge Cases

Sierra Research's τ-bench, which tests agents in realistic retail, airline, and banking domains, exposed a finding that wasn't visible in simpler benchmarks: agents fail not on tool calls themselves but on policy edge cases and ambiguous user intent. When a customer's request sits at the boundary of what policy allows, agents improvise, sometimes in ways that would cause legal or compliance problems. The follow-on τ²-bench extends this to dual-control environments where both agent and user are tool-using, testing scenarios existing benchmarks don't cover at all.

Scope Creep

Agents do more than instructed. Standard benchmarks with fixed expected outputs don't score on over-action. In production, scope creep causes data deletion, unintended API side effects, and unauthorized resource access. If your eval only checks "did the right thing happen," it won't catch "did some unintended things also happen."

The Three-Layer Evaluation Architecture

The most practically useful framework for agent evals organizes assessments into three layers, each catching failures the others miss.

Layer 1: Node-Level Precision

Evaluate individual steps in isolation.

Metric	What it measures
Tool selection accuracy	Did the agent pick the right tool for this step?
Argument correctness	Were tool arguments valid and semantically appropriate?
Step utility score	Did this step move toward the goal, or was it redundant/wrong?

Node-level evals are fast to run and easy to automate. They catch the obvious failures: wrong tool, malformed parameters, hallucinated function names. They don't catch the reasoning failures between steps, which is why they're necessary but not sufficient.
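As a rough illustration, node-level checks can be computed by comparing each recorded step with the step an annotator expected at the same position. The `Step` type, the positional alignment, and the use of exact argument equality as a stand-in for "semantically appropriate" are all simplifying assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str
    args: dict


def score_step(actual, expected, known_tools):
    """Score one step against the expected step at the same position."""
    tool_correct = actual.tool == expected.tool
    return {
        "tool_exists": actual.tool in known_tools,  # catches hallucinated names
        "tool_correct": tool_correct,
        "args_correct": tool_correct and actual.args == expected.args,
    }


def node_level_accuracy(actual_steps, expected_steps, known_tools):
    """Average each check over the steps that can be paired by position."""
    scores = [score_step(a, e, known_tools)
              for a, e in zip(actual_steps, expected_steps)]
    if not scores:
        return {}
    return {key: sum(s[key] for s in scores) / len(scores) for key in scores[0]}


known_tools = {"lookup_order", "issue_refund"}
expected_steps = [Step("lookup_order", {"order_id": "17"}),
                  Step("issue_refund", {"order_id": "17"})]
actual_steps = [Step("lookup_order", {"order_id": "17"}),
                Step("refund_ordr", {"order_id": "17"})]  # hallucinated tool name

report = node_level_accuracy(actual_steps, expected_steps, known_tools)
print(report)
```

In practice argument comparison usually needs normalization (date formats, ID casing) or a rubric-based check, because exact equality will flag harmless differences.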

Layer 2: Session-Level Trajectory Quality

Evaluate the full execution path, not just the output.

Trajectory scoring compares what the agent actually did against a golden trajectory: the sequence of steps a human expert would take to solve the task. You're scoring process quality, not just outcome.

Key metrics:

Task success rate: Did the agent complete the goal? (Necessary but insufficient alone)
Trajectory quality score: How closely did the agent's path match the optimal path?
Step efficiency ratio: How many steps did the agent take vs. the minimum required?
Policy adherence rate: Did every action stay within defined operational constraints?
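A minimal sketch of three of these metrics, assuming a trajectory can be reduced to an ordered list of tool names. Sequence similarity from Python's `difflib` is used here as one possible stand-in for a trajectory quality score, and step efficiency is defined as golden step count over actual step count. Both are choices made for this example, not a standard definition.

```python
from difflib import SequenceMatcher


def trajectory_quality(actual_tools, golden_tools):
    """Similarity of two tool-name sequences in [0, 1]; 1.0 means identical."""
    return SequenceMatcher(None, actual_tools, golden_tools).ratio()


def step_efficiency(actual_tools, golden_tools):
    """Golden step count divided by actual step count, capped at 1.0."""
    if not actual_tools:
        return 0.0
    return min(1.0, len(golden_tools) / len(actual_tools))


def policy_adherence(steps, is_allowed):
    """Fraction of steps for which the policy predicate returns True."""
    if not steps:
        return 1.0
    return sum(1 for s in steps if is_allowed(s)) / len(steps)


golden = ["lookup_order", "check_policy", "issue_refund"]
actual = ["lookup_order", "lookup_order", "check_policy", "issue_refund"]

quality = trajectory_quality(actual, golden)      # 6/7, one redundant step
efficiency = step_efficiency(actual, golden)      # 0.75
adherence = policy_adherence(actual, lambda tool: tool != "delete_account")
print(round(quality, 3), efficiency, adherence)
```

Sequence similarity penalizes valid alternative orderings, so tasks with several acceptable paths need more than one golden trajectory or a rubric-based judge.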

Google Cloud's methodology formalizes this into a three-phase process: define a task taxonomy, build golden trajectory datasets, then evaluate against rubrics at both step and session level. The golden trajectory dataset is the most expensive part to build. It requires human experts to solve representative tasks and annotate each step. But it's also the highest-signal component: it's what separates an eval that catches real failures from one that just checks final outputs.

Layer 3: System Efficiency

Evaluate the agent's resource behavior in context.

Metric	Why it matters
Latency per task	Correlates with user experience and operational cost
Token consumption	Directly affects cost; also a proxy for reasoning efficiency
Tool call count per task	Higher than expected often signals confused reasoning
Error recovery rate	What fraction of recoverable errors does the agent self-correct?

System efficiency metrics are early warning signals. When an agent suddenly requires 40% more tokens to complete a task it handled efficiently last week, something changed in the model, the tools, or the task distribution. These metrics catch silent regressions that outcome-only scoring misses.

LLM-as-Judge: When It Works and When It Doesn't

Automated scoring with an LLM judge (using a capable model to evaluate another model's trajectory) is now standard practice. On well-defined rubrics, LLM judges can reach Spearman correlations of 0.8–0.9 with aggregate human preferences. A reasonable bar for trusting a judge in production is a Spearman correlation of at least 0.80 with your own human evaluators, measured on your own tasks; if your judge doesn't meet that bar, treat its automated scores as unreliable.
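To check a judge against that bar, score the same sessions with both the judge and human reviewers and compute the rank correlation. The sketch below implements Spearman's coefficient in plain Python (average ranks for ties, then Pearson correlation on the ranks); the score lists are invented for illustration.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation; undefined if either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


human_scores = [5, 4, 4, 2, 1]   # illustrative reviewer ratings
judge_scores = [5, 3, 4, 2, 1]   # illustrative judge ratings, same sessions

rho = spearman(human_scores, judge_scores)
judge_trusted = rho >= 0.80
print(round(rho, 3), judge_trusted)
```

Five sessions is far too few for a real decision; the correlation should be measured on a sample large enough that it does not swing when a handful of sessions are swapped out.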

The "agent-as-a-judge" extension uses an agent (not just a single LLM call) to evaluate another agent's full trajectory. arXiv:2508.02994 presents this as motivated specifically by the observation that final-output-only evaluation systematically misses agent-specific failure modes. An agent-evaluator can follow the execution trace, check each tool call against the context available at that step, and score reasoning quality at each decision point.

Where LLM judges fail:

Factual accuracy on domain-specific knowledge (judges inherit the same knowledge gaps as the model being evaluated)
Tasks requiring external ground truth (database records, API responses, real-world state)
Adversarial inputs designed to fool the judge the same way they fool the agent

For these cases, you need deterministic checks: diff the database state before and after, validate API responses against schema, compare against known-correct outputs.
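A state diff of this kind can be very small. The sketch assumes the relevant state can be snapshotted as a dictionary keyed by record ID, and that each test case declares which changes are allowed; the `allowed` structure is a convention invented for this example.

```python
def diff_state(before, after):
    """Classify record-level changes between two snapshots."""
    shared = before.keys() & after.keys()
    return {
        "created": sorted(after.keys() - before.keys()),
        "deleted": sorted(before.keys() - after.keys()),
        "modified": sorted(k for k in shared if before[k] != after[k]),
    }


def check_side_effects(before, after, allowed):
    """Return (kind, record_id) pairs for every change the task did not permit."""
    violations = []
    for kind, keys in diff_state(before, after).items():
        for key in keys:
            if key not in allowed.get(kind, set()):
                violations.append((kind, key))
    return violations


before = {"order-17": "open", "order-18": "open"}
after = {"order-17": "refunded"}            # order-18 disappeared
allowed = {"modified": {"order-17"}}        # the task only permits this change

print(check_side_effects(before, after, allowed))  # [('deleted', 'order-18')]
```

Because the check looks at everything that changed rather than only the expected change, it also surfaces over-action that an output-only assertion would pass.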

Building Your Eval Stack in Practice

Start With a Task Taxonomy

Before writing a single eval, categorize the tasks your agent handles. Group them by complexity, tool requirements, and risk level. A customer support agent might have: "standard refund request" (low complexity, two tools, low risk), "policy exception request" (medium complexity, five tools, high risk), "account compromise investigation" (high complexity, cross-system, critical risk).

Evaluations should be representative of this distribution. If 80% of production traffic is standard refunds and your eval dataset is 80% edge cases, your eval scores won't predict production behavior.
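One way to keep the eval set representative is to sample each category in proportion to its share of production traffic. The category names and shares below are hypothetical, and the sketch assumes each category's pool holds at least as many tasks as its quota.

```python
import random


def sample_eval_set(tasks_by_category, production_share, total, seed=0):
    """Draw tasks per category in proportion to production traffic share."""
    rng = random.Random(seed)  # fixed seed keeps the eval set reproducible
    sample = []
    for category, share in production_share.items():
        quota = round(total * share)
        pool = tasks_by_category[category]
        chosen = rng.sample(pool, min(quota, len(pool)))
        sample.extend((category, task) for task in chosen)
    return sample


tasks_by_category = {
    "standard_refund": [f"refund-{i}" for i in range(50)],
    "policy_exception": [f"exception-{i}" for i in range(20)],
    "account_compromise": [f"compromise-{i}" for i in range(10)],
}
production_share = {
    "standard_refund": 0.80,
    "policy_exception": 0.15,
    "account_compromise": 0.05,
}

eval_set = sample_eval_set(tasks_by_category, production_share, total=20)
counts = {c: sum(1 for cat, _ in eval_set if cat == c) for c in production_share}
print(counts)
```

A proportional set predicts aggregate production behavior but leaves rare, high-risk categories with very few cases, so it is common to report those categories separately from a larger dedicated set.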

Build a Golden Trajectory Dataset

For each task category, have a human expert complete 20-50 representative tasks and annotate the steps. Record:

Which tool was called and why
What information from the context justified each decision
Where shortcuts were possible and why they were or weren't taken
What a wrong path would look like at each decision point

This dataset becomes the ground truth for trajectory scoring and the source of negative examples for training your judge.
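The annotations above map naturally onto a small record type. The field names here are illustrative and do not follow any published schema; the point is that each of the four annotation types has a place in every step.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class GoldenStep:
    tool: str
    arguments: dict
    rationale: str                 # why this tool was called
    supporting_context: str        # what in the context justified the decision
    shortcut_note: str = ""        # shortcut available here, and why (not) taken
    wrong_path_example: str = ""   # what a wrong move at this point looks like


@dataclass
class GoldenTrajectory:
    task_id: str
    category: str
    steps: list = field(default_factory=list)

    def to_json(self):
        """Serialize the trajectory, including nested steps, to JSON."""
        return json.dumps(asdict(self), indent=2)


trajectory = GoldenTrajectory(
    task_id="refund-001",
    category="standard_refund",
    steps=[
        GoldenStep(
            tool="lookup_order",
            arguments={"order_id": "17"},
            rationale="Need order status before deciding on a refund",
            supporting_context="User gave the order number in the first message",
            wrong_path_example="Issuing the refund without checking the order",
        )
    ],
)
print(trajectory.to_json())
```

Storing the wrong-path examples alongside the correct step is what makes the dataset usable as a source of negative examples later.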

Run Offline Evals Before Every Deployment

Braintrust's framework treats offline eval and production monitoring as the same system. Before deploying any change (model version, tool definition, system prompt), run the full task battery against your golden dataset. Statistical confidence indicators should flag when a change shifts performance outside normal variance.
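As one example of such a confidence indicator, a two-proportion z-test can flag whether a change in task success rate is larger than sampling noise would explain. This is a generic statistical check, not a description of how Braintrust or any other product computes its indicators, and the 1.96 threshold (roughly 95% confidence) is an assumption.

```python
import math


def success_rate_shift(base_pass, base_n, cand_pass, cand_n, z_threshold=1.96):
    """Two-proportion z-test on task success: baseline vs. candidate build."""
    p_base = base_pass / base_n
    p_cand = cand_pass / cand_n
    pooled = (base_pass + cand_pass) / (base_n + cand_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / cand_n))
    z = (p_cand - p_base) / se if se > 0 else 0.0
    return {"delta": p_cand - p_base, "z": z, "significant": abs(z) > z_threshold}


# Baseline passed 170 of 200 tasks; the candidate passed 150 of 200.
regression = success_rate_shift(170, 200, 150, 200)
# A two-task difference on the same battery is within normal variance.
noise = success_rate_shift(170, 200, 168, 200)

print(regression["significant"], noise["significant"])  # True False
```

With small task batteries the test has little power, so a non-significant result is weak evidence of safety; running each task several times per build helps.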

Anthropic's practical guide to agent evals makes the point directly: the gap between offline benchmark performance and production behavior is a design failure, not a measurement artifact. If your offline evals don't predict production, the evals are wrong, not the production environment.

Monitor Trajectories in Production

Offline evals catch known failure modes. Production monitoring catches the unknown ones. The key capability: real-time tracing of tool invocations and their context as they happen. Arize and LangSmith both support multi-turn evaluation that scores complete agent conversations on semantic intent and trajectory, not just final output.

Set anomaly thresholds on the system efficiency metrics: if token consumption per task spikes 30% above baseline, alert. If tool call count per task exceeds expected range, investigate. These signals catch problems before users report them.
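A threshold check of this kind might look like the following. The metric names are placeholders, and applying the same 30% relative threshold to every metric is a simplification; tool call count in particular may deserve its own expected range.

```python
from statistics import mean


def efficiency_alerts(baseline_runs, current, max_increase=0.30):
    """Return {metric: relative_increase} for metrics above the threshold."""
    alerts = {}
    for metric, history in baseline_runs.items():
        base = mean(history)
        if base <= 0 or metric not in current:
            continue
        increase = (current[metric] - base) / base
        if increase > max_increase:
            alerts[metric] = increase
    return alerts


baseline_runs = {
    "tokens_per_task": [1000, 1100, 900],
    "tool_calls_per_task": [4, 5, 6],
}
current = {"tokens_per_task": 1400, "tool_calls_per_task": 6}

alerts = efficiency_alerts(baseline_runs, current)
print(alerts)  # tokens are 40% above baseline; tool calls are within range
```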

If the agent depends on a human-curated knowledge base, "Obsidian's CLI Turns Your Second Brain Into an API" shows the retrieval surface your evals need to observe. For regulated workflows, "AI Agents in Legal" is the high-stakes version of the same rule: an eval is not complete until it tests verification, supervision, and auditability.

Close the Feedback Loop

HoneyHive's core workflow is the feedback flywheel: production logs feed eval datasets, which feed fine-tuning, which feeds the next deployment. This is how production monitoring compounds into systematic improvement rather than reactive firefighting.

The practical implementation:

Flag sessions where the agent's trajectory diverged significantly from baseline
Have a human reviewer score those sessions and annotate failure points
Add the most informative failures to your golden dataset
Re-run evals with the expanded dataset before the next deployment
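The steps above can be sketched as two small functions: one that builds the review queue and one that promotes annotated failures into the golden dataset. The sketch assumes each session already carries a precomputed `divergence` score (0 means it matched baseline) and that reviewers record a `failure_step`; both field names are hypothetical.

```python
def select_for_review(sessions, threshold=0.5, limit=20):
    """Pick the most divergent sessions for human review, worst first."""
    flagged = [s for s in sessions if s["divergence"] >= threshold]
    return sorted(flagged, key=lambda s: s["divergence"], reverse=True)[:limit]


def promote_to_golden(golden, reviewed):
    """Add reviewer-confirmed failures that are not already in the dataset."""
    known = {g["session_id"] for g in golden}
    added = [
        r for r in reviewed
        if r.get("failure_step") is not None and r["session_id"] not in known
    ]
    return golden + added


sessions = [
    {"session_id": "s1", "divergence": 0.1},
    {"session_id": "s2", "divergence": 0.9},
    {"session_id": "s3", "divergence": 0.6},
]
queue = select_for_review(sessions)

# Reviewer annotations: s2 failed at step 3, s3 turned out to be fine.
reviewed = [
    {**queue[0], "failure_step": 3},
    {**queue[1], "failure_step": None},
]
golden = promote_to_golden([], reviewed)
print([g["session_id"] for g in golden])  # ['s2']
```

Capping the queue keeps reviewer load predictable, and the deduplication step keeps repeated runs of the loop from inflating the dataset with copies of the same failure.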

The Benchmark Calibration Problem

Even well-designed evals drift from production over time. A 2025 survey of 12 major agent benchmarks found that task quality issues (incorrect expected actions, ambiguous instructions, impossible constraints, missing fallback behaviors) compound over time as benchmarks age against advancing model capabilities.

The fix is calibration: periodically compare your eval scores against production outcome data. If your eval predicts 85% task success but production is at 72%, your eval is miscalibrated and you're making deployment decisions on false confidence. Calibration sessions should happen at least quarterly, or after any major distribution shift in production traffic.
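A calibration check can be as simple as comparing predicted and observed success rates per task category. The five-point tolerance below is an arbitrary assumption; the right value depends on how much noise your production outcome data carries.

```python
def calibration_gaps(eval_rates, prod_rates, tolerance=0.05):
    """Compare eval-predicted success with observed production success."""
    report = {}
    for category, predicted in eval_rates.items():
        gap = predicted - prod_rates[category]
        report[category] = {"gap": gap, "miscalibrated": abs(gap) > tolerance}
    return report


# The example from the text, plus a hypothetical well-calibrated category.
eval_rates = {"overall": 0.85, "standard_refund": 0.93}
prod_rates = {"overall": 0.72, "standard_refund": 0.91}

report = calibration_gaps(eval_rates, prod_rates)
for category, row in report.items():
    print(category, round(row["gap"], 2), row["miscalibrated"])
```

Breaking the gap out by category shows where the eval is optimistic, which is usually more actionable than a single overall number.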

On the public benchmark side, convergence has already happened at the lower levels. Models now exceed 95% on tool name validity and schema compliance; those capabilities are table stakes. The Berkeley Function-Calling Leaderboard has effectively capped out on those dimensions. The real differentiator between agents now is orchestration intelligence: planning, error recovery, and policy adherence across long-horizon tasks. Your internal evals should be measuring these, not re-testing what the public leaderboards already tell you.

What's Next

The field is moving toward standardized trajectory evaluation: shared formats for golden datasets and step-level rubrics that allow cross-organization benchmarking on real-world tasks. τ-bench and its successors are working toward this for specific domains (retail, banking, customer service). MCP-Bench is attempting it for tool-using agents across heterogeneous server configurations.

For production teams, the immediate priority isn't waiting for those standards to arrive. It's building the internal infrastructure: task taxonomy, golden trajectory dataset, offline eval pipeline, production tracing. The teams that have done this already have a structural advantage over teams relying on public benchmarks alone, not because their agents are smarter, but because they know earlier when something is wrong.

The agent reliability gap isn't a model problem. It's a measurement problem. Fix the measurement and you fix the development loop.

Related reading: "Agent Reliability Scores Are Getting Worse, Not Better" examines why public benchmarks are diverging from production outcomes. "Chain-of-Thought Prompting Doesn't Always Work. Here's the Evidence." covers the reasoning failures that multi-step evals need to catch. "Context Window Management for Production Agents" addresses the context degradation problem described under Context Window Degradation. For cost implications of your eval infrastructure, see "Agent Cost Optimization: How to Track and Reduce LLM Spend."
