Every quarter, agent benchmarks climb higher. Pass rates on SWE-Bench tick up. Tool-use accuracy improves on paper. And yet, production failure rates aren't dropping. They're getting worse. The more capable we make agents, the less reliably they behave in the wild. That's not a bug in our measurement. It's a feature of how capability and reliability actually interact.
The Numbers Don't Add Up
Here's the core tension. SWE-Bench Verified scores have been climbing steadily through 2025 and into 2026, with frontier models now clearing 70%+ on the automated grader. But a March 2026 METR study found that roughly half of test-passing PRs written by AI agents wouldn't actually be merged by repository maintainers. The gap between the automated grader and real maintainer merge decisions was 24 percentage points. Agents are passing tests. They're not writing code humans would ship.
Meanwhile, MCP-Universe benchmarks show single-model architectures averaging just 23% success rates on real tool-use tasks. Even the best-performing model managed only 43.7%. These aren't cherry-picked failure cases. They're aggregate scores across standardized tool integrations.
Gartner projects that over 40% of agentic AI projects will be canceled by end of 2027, citing escalating costs and inadequate risk controls. The market is spending more and trusting less.
Why Capability Makes Things Worse

A February 2026 paper, "Capable but Unreliable", tested 22 frontier models across 108 real-world tool-use tasks with three independent runs each. The finding: 22.5% of model-task pairs showed mixed outcomes across runs. Same model, same task, different results. The models could solve these problems. They just couldn't do it consistently.
The cause is what the researchers call "canonical path deviation." Every tool-use task has an optimal sequence of operations. More capable models have access to more tools and more strategies, which means more opportunities to drift off the solution path. Stochastic sampling alone causes failures that have nothing to do with capability gaps.
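The study's core measurement is simple to reproduce in principle: run each model-task pair several times and label it by the spread of its outcomes, not by its average pass rate. A minimal sketch, where `runs` stands in for hypothetical three-run records (the `model-a/task-2` style keys and the results are illustrative, not from the paper):

```python
from collections import Counter

def classify_outcomes(results: list[bool]) -> str:
    """Label a model-task pair by its outcomes across repeated runs."""
    if all(results):
        return "stable-pass"
    if not any(results):
        return "stable-fail"
    return "mixed"

# Hypothetical three-run records for a few model-task pairs.
runs = {
    "model-a/task-1": [True, True, True],
    "model-a/task-2": [True, False, True],   # capable, but not reliable
    "model-b/task-1": [False, False, False],
}

labels = Counter(classify_outcomes(r) for r in runs.values())
print(labels)
```

The "mixed" bucket is the reliability signal: a leaderboard averaging those runs into a 67% pass rate hides the fact that the same model flips between success and failure on the same task.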
This compounds fast. If each action succeeds 85% of the time and errors are independent, a 10-step workflow succeeds roughly 20% of the time end to end. Scale that to the 30- or 50-step workflows that enterprise customers actually want, and you're looking at near-zero end-to-end reliability without heavy guardrails.
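The arithmetic behind that claim is just exponential decay, under the simplifying assumption that per-step failures are independent:

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability an agent completes every step of a workflow,
    assuming independent per-step success (a simplification)."""
    return per_step ** steps

for n in (10, 30, 50):
    print(f"{n:>2} steps @ 85%/step: {end_to_end_success(0.85, n):.1%}")
# 10 steps: ~19.7%; 30 steps: ~0.8%; 50 steps: ~0.03%
```

Real failures aren't perfectly independent, but correlation cuts both ways: one bad early step often poisons everything downstream, which can make long workflows worse than this model suggests, not better.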
Bigger models don't uniformly fix this. Research from the Allen Institute and others shows that scaling up improves some reliability dimensions like calibration and robustness, but can actually hurt consistency. Larger models sometimes show more run-to-run variability, not less.
The Production Gap Is Widening

The pattern is clear in deployment data. Organizations that successfully moved agents to production did it by constraining scope: fewer steps, internal-facing use cases, human review on every output. They're treating agents like interns, not autonomous systems.
That's a rational response. But it contradicts the pitch. The whole point of agents is autonomous multi-step execution. If you have to supervise every action, you've built an expensive autocomplete with extra steps.
The real problem isn't that agents fail. It's that they fail unpredictably. A coding agent that gets the wrong answer every time is easy to filter. One that produces working code 70% of the time and subtly broken code the other 30% is far more dangerous, because it erodes the reviewer's attention. You stop checking when things usually work.
What Actually Helps

The teams reporting stable production agents share common patterns: deterministic tool orchestration instead of letting models freestyle, constrained action spaces that reduce deviation paths, and aggressive retry-with-verification loops rather than single-shot execution.
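The retry-with-verification pattern is worth making concrete. The key design choice is that the verifier is independent of the agent: the agent attempts, a separate check accepts or rejects, and exhaustion escalates to a human rather than shipping an unverified result. A minimal sketch (the `attempt` and `verify` callables are placeholders for whatever your agent action and checker look like):

```python
from typing import Callable, Optional

def retry_with_verification(
    attempt: Callable[[], str],
    verify: Callable[[str], bool],
    max_tries: int = 3,
) -> Optional[str]:
    """Run a nondeterministic agent action until an independent
    verifier accepts the output, up to max_tries. Returns None on
    exhaustion so the caller can escalate to a human instead of
    shipping an unverified result."""
    for _ in range(max_tries):
        out = attempt()
        if verify(out):
            return out
    return None
```

Because per-run failures are substantially stochastic (the "mixed outcomes" finding above), even a dumb retry against a real verifier converts a 70%-per-shot agent into something much closer to reliable, at the cost of latency and tokens.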
These aren't glamorous solutions. They're the same reliability engineering principles that made distributed systems work in the 2010s, applied to a new failure mode. As we've covered in testing and debugging AI agents, the hard part isn't building the agent. It's building the system that catches when the agent goes sideways.
The benchmark problem feeds directly into this. If your eval says 72% and production says 48%, you're building on wrong assumptions. And if your production reliability framework doesn't account for stochastic drift, you're measuring the wrong thing entirely.
Agent capability will keep climbing. The question nobody's answering well is whether reliability can catch up before the gap becomes a credibility crisis for the entire category.