title: "When NOT to Use an Agent: The Production Data That Should Change Your Default"
slug: when-not-to-use-ai-agents
date: 2026-04-30
category: agent-design
subtopic: reliability
type: signal
tags: [signals, agent-design, reliability, real-world-ai]
status: draft

Gartner predicts over 40% of agentic AI projects will be canceled by the end of 2027, not because AI doesn't work, but because escalating costs, unclear business value, and inadequate risk controls compound faster in agent architectures than in simpler ones. The vendor that profits most from selling agents agrees. Anthropic's Building Effective Agents guide gives a direct instruction: "start with simple prompts, optimize them with comprehensive evaluation, and add multi-step agentic systems only when simpler solutions fall short."

That sentence doesn't appear in most agent pitch decks.

What Production Data Actually Shows

The case for agents rests on benchmark performance. The production data is less flattering.

A randomized controlled trial published in April 2025 (Bean et al., arXiv:2504.18919) ran 1,298 participants through medical decision support tasks using GPT-4o, Llama 3, and Command R+. In benchmark conditions, the models achieved 94.9% accuracy identifying medical conditions. When real humans interacted with the same systems on the same tasks, accuracy dropped below 34.5%. For treatment decisions, the AI-assisted group performed no better than the control group.

This isn't a critique of the models. It's a critique of assuming benchmark performance transfers to real deployment. Agents inherit this gap and add layers of it: each tool call, memory retrieval, and planning step is another place where measured behavior diverges from actual behavior.

Consistency compounds the problem. Mehta (arXiv:2511.14136, November 2025) analyzed 300 enterprise tasks across 12 benchmarks. Single-run success rate: 60%. Eight-run consistency (the metric that reflects production): 25%. That 35-point gap is invisible in standard evaluation, which reports peak performance. Agents don't run once. They run constantly.
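To make the gap concrete, here is a minimal sketch of how the two numbers can be computed from repeated runs. The data layout (a dict mapping a task ID to a list of pass/fail outcomes), the function names, and the sample outcomes are assumptions for illustration, not the paper's methodology or data.

```python
from typing import Dict, List


def single_run_success(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks whose first recorded run succeeded."""
    return sum(runs[0] for runs in results.values()) / len(results)


def all_runs_consistency(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks that succeeded on every recorded run."""
    return sum(all(runs) for runs in results.values()) / len(results)


# Hypothetical outcomes: four tasks, eight runs each.
results = {
    "task-a": [True] * 8,
    "task-b": [True, True, False, True, True, True, True, True],
    "task-c": [True, False, True, True, False, True, True, True],
    "task-d": [False] * 8,
}

print(single_run_success(results))    # 0.75
print(all_runs_consistency(results))  # 0.25
```

In this toy data three of four tasks pass on the first attempt, but only one passes every time. A report built on the first number alone would hide the other two tasks' flakiness.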

The Coordination Tax Is Not Hypothetical

Multi-agent systems carry costs that show up only at runtime: context propagation, inter-agent messaging, orchestration latency, failure cascades when one agent returns unexpected output. The coordination tax is real and documented. Adding another agent to improve output quality often degrades it, because each handoff introduces new failure modes and amplifies errors that a single-call architecture would have surfaced and handled in one step.
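A back-of-the-envelope sketch shows why handoffs are expensive. It assumes each step succeeds independently with the same probability, which is a simplification (real failures are often correlated), and the 95% figure is a hypothetical input, not a measured one.

```python
def chain_success(step_success: float, steps: int) -> float:
    """Probability that every step in a chain succeeds,
    assuming independent steps with equal reliability."""
    return step_success ** steps


# Hypothetical per-step reliability of 95%.
for steps in (1, 3, 5, 10):
    print(steps, round(chain_success(0.95, steps), 3))
# 1 0.95
# 3 0.857
# 5 0.774
# 10 0.599
```

Under these assumptions, a ten-step chain of individually reliable steps completes cleanly only about 60% of the time, before counting any error that one agent amplifies for the next.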

Anthropic's own guidance names this directly: frameworks "can make it tempting to add complexity when a simpler setup would suffice" and "obscure the underlying prompts and responses, making them harder to debug." Harder to debug means failures survive longer before discovery.

The Counterargument

Agents genuinely excel at tasks that require iterative search across heterogeneous tools, long-horizon planning where no single prompt can access all required context, or verification loops where independent agent passes catch errors the first pass misses. Coding agents, research agents, and process automation with well-defined checkpoints all have production wins.

The enterprise AI pilot failure rate isn't 70% because agents don't work. It's 70% partly because teams reach for agent architectures on problems that don't require them. The failure mode isn't misaligned AI. It's a misaligned tooling choice.

What This Changes

Three tests before adding agent infrastructure:

Can a single call with good retrieval solve 80% of the cases? If yes, build that first. Agent cost optimization data consistently finds that agentic pipelines cost 4 to 10 times more than single-call alternatives for well-defined tasks with comparable output quality. The complexity premium needs justification.

Does the task have reliable intermediate checkpoints? Agents fail at tasks where intermediate states can't be verified. If you can't detect failure partway through a pipeline, an agent will compound errors silently until the output is wrong in ways that require expert review to catch. This is why agent evals built for production failures prioritize stepwise verification over end-to-end accuracy.

What does consistency look like across 20 runs, not one? The 60%-to-25% collapse happens because teams test success on the first attempt and ship. An evaluation that stops at single-run accuracy isn't measuring production behavior. It's measuring best-case behavior.
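The second test, reliable intermediate checkpoints, can be sketched as a pipeline that validates each step's output before passing it on. Everything here is hypothetical: the `run_with_checkpoints` helper, the `CheckpointError` type, and the toy parse/filter/total steps stand in for real agent steps and their checks.

```python
from typing import Any, Callable, List, Tuple

# Each step pairs a name, a transformation, and a check on its output.
Step = Tuple[str, Callable[[Any], Any], Callable[[Any], bool]]


class CheckpointError(Exception):
    """Raised when an intermediate result fails its check."""


def run_with_checkpoints(value: Any, steps: List[Step]) -> Any:
    for name, transform, check in steps:
        value = transform(value)
        if not check(value):
            # Stop here rather than passing a bad state downstream.
            raise CheckpointError(
                f"step {name!r} produced an invalid result: {value!r}"
            )
    return value


# Hypothetical three-step pipeline standing in for agent steps.
steps: List[Step] = [
    ("parse", lambda s: [int(x) for x in s.split(",")], lambda xs: len(xs) > 0),
    ("filter", lambda xs: [x for x in xs if x >= 0], lambda xs: len(xs) > 0),
    ("total", sum, lambda t: t >= 0),
]

print(run_with_checkpoints("3,-1,4", steps))  # 7

try:
    run_with_checkpoints("-2,-5", steps)
except CheckpointError as err:
    print(err)  # step 'filter' produced an invalid result: []
```

If no meaningful `check` can be written for a step, that is a signal the task fails the second test: errors at that step will only surface in the final output.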

Gartner's 40% cancellation projection assumes the current trajectory continues. The teams most likely to avoid it treat "should this be an agent?" as a question to answer with data, rather than treating "yes" as a default that requires justification to override.


Related: When Multi-Agent Systems Break: The Coordination Tax Nobody Warns You About · How to Build Agent Evals That Catch Real Failures · AI Agent ROI: The Calculator and Framework That Cuts Through Vendor Math · Types of AI Agents