When NOT to Use an Agent: The Production Data That Should Change Your Default


Gartner's June 2025 forecast suggested that a substantial share of agentic AI projects could be canceled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. Treat that as a forward-looking analyst scenario, not a measured failure rate or proof that agents fail by default.

Note: The Gartner figure is a prediction from June 2025. Anthropic's Building Effective Agents guide points in a compatible direction, advising teams to start with simple prompts, evaluate them carefully, and add multi-step agentic systems only when simpler solutions fall short.

That sentence doesn't appear in most agent pitch decks.


What Production Data Actually Shows

The case for agents often leans on benchmark performance. Human-in-the-loop deployment evidence is more mixed.

A randomized controlled trial published in April 2025 (Bean et al., arXiv:2504.18919) ran 1,298 participants through medical decision support tasks using GPT-4o, Llama 3, and Command R+. The authors reported a sharp gap between standalone model performance and performance when people used the systems for decision support. Tested alone, the LLMs identified conditions accurately in 94.9% of cases on average. Participants using the same LLMs identified relevant conditions in fewer than 34.5% of cases and the correct disposition in fewer than 44.2%, no better than the control group.

This is not a claim that the models are useless. It is a caution against assuming benchmark performance transfers cleanly to real deployment. Agents can add extra places for reliability to degrade because tool calls, memory retrieval, and planning steps each create another point where measured behavior may diverge from the workflow's actual needs.

Note: This is an architectural observation about multi-step systems, not a quantified divergence rate. Transfer from benchmark to deployment depends on context, task complexity, tool design, and how much human oversight remains in the loop.

Consistency compounds the problem. A 2025 preprint by Sushant Mehta (arXiv:2511.14136), not yet peer-reviewed as of mid-2026, argues that enterprise agent evaluation should measure reliability, cost, latency, assurance, and policy behavior alongside task completion. Its benchmark discussion reports a drop from 60% single-run performance to 25% eight-run consistency in one 2025 reliability assessment. Treat that as a directional signal rather than a replicated production law: an evaluation that only measures one successful run may miss the reliability requirements of systems that run repeatedly under varying conditions.
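One way to make the gap between single-run and repeated-run results concrete is to compute both from the same recorded outcomes. The following is a minimal sketch, assuming each task was run several times and each run was logged as pass or fail. The function names and the sample data are hypothetical illustrations, not taken from the Mehta preprint, and "passed on every run" is only one reasonable definition of consistency.

```python
from typing import Dict, List


def single_run_rate(outcomes: Dict[str, List[bool]]) -> float:
    """Fraction of all individual runs that passed (average single-run success)."""
    runs = [ok for task_runs in outcomes.values() for ok in task_runs]
    return sum(runs) / len(runs)


def all_runs_rate(outcomes: Dict[str, List[bool]]) -> float:
    """Fraction of tasks that passed on every recorded run."""
    return sum(all(task_runs) for task_runs in outcomes.values()) / len(outcomes)


# Hypothetical outcomes: 4 tasks, 4 runs each (illustrative data only).
outcomes = {
    "task_a": [True, True, True, True],
    "task_b": [True, False, True, True],
    "task_c": [True, True, False, True],
    "task_d": [False, True, True, False],
}

print(f"single-run success:   {single_run_rate(outcomes):.2f}")
print(f"all-runs consistency: {all_runs_rate(outcomes):.2f}")
```

In this made-up data, most individual runs pass, yet only one task passes every time. An evaluation that reports only the first number would hide the second.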

Evidence caveat: The studies cited above offer useful directional signals, though they should not be treated as universal production laws. The Mehta preprint, in particular, reflects early-stage evaluation research that may evolve with further peer review. Taken together, the studies support a more cautious default: prove that agentic complexity improves the target workflow before making it the architecture.

Note: The consistency metrics referenced (60% single-run to 25% eight-run) come from a specific benchmark evaluation reported in late 2025 and may vary significantly across different agent architectures, model versions, and deployment contexts. These figures should not be generalized without replication.

The Coordination Tax Is Not Hypothetical

Multi-agent systems carry costs that often show up only at runtime: context propagation, inter-agent messaging, orchestration latency, and failure cascades when one agent returns unexpected output. The coordination tax is real enough to design around. Adding another agent to improve output quality can degrade the system if each handoff introduces new failure modes or amplifies errors that a single-call architecture would have surfaced and handled in one step.
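A rough way to reason about handoff risk is to multiply per-step success rates. The sketch below assumes every step succeeds independently with the same probability, which is a simplifying assumption rather than a measured property of any real system. The function name and the 0.95 figure are hypothetical; correlated failures or recovery logic would change the numbers.

```python
def end_to_end_success(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


# Illustrative only: 0.95 is an assumed per-step rate, not a reported one.
for steps in (1, 3, 5, 10):
    print(f"{steps:>2} steps -> {end_to_end_success(0.95, steps):.3f}")
```

Under these assumptions, each added handoff lowers the ceiling on end-to-end reliability, so an extra agent has to improve quality by more than it costs in compounded failure risk.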

Anthropic's own guidance names this directly: frameworks "can make it tempting to add complexity when a simpler setup would suffice" and "obscure the underlying prompts and responses, making them harder to debug." Harder to debug means failures survive longer before discovery.


The Counterargument

Agents genuinely excel at tasks that require iterative search across heterogeneous tools, long-horizon planning where no single prompt can access all required context, or verification loops where independent agent passes catch errors the first pass misses. Coding agents, research agents, and process automation with well-defined checkpoints all have production wins.

The broader enterprise AI pilot failure-rate problem is not simply evidence that agents do not work. A recurring failure mode is misaligned tooling choice: teams reach for agent architecture when the workflow does not require autonomy.

What This Changes

Three tests before adding agent infrastructure:

Can a single call with good retrieval solve most routine cases? If yes, build that first. Agent cost optimization should compare the simplest measurable workflow against the agentic version, not just compare model prices. Mehta's enterprise-evaluation paper reports multi-fold cost differences between accuracy-optimized and cost-aware agent configurations with comparable performance in its benchmark setting. The complexity premium needs justification.

Does the task have reliable intermediate checkpoints? Agents fail at tasks where intermediate states can't be verified. If you can't detect failure partway through a pipeline, an agent will compound errors silently until the output is wrong in ways that require expert review to catch. This is why agent evals built for production failures prioritize stepwise verification over end-to-end accuracy.

What does consistency look like across 20 runs, not one? The repeated-run failure pattern appears when teams test success on the first attempt and ship. An evaluation that stops at single-run accuracy is not measuring production behavior. It is measuring best-case behavior.
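The first two tests can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: `cost_per_success`, `Step`, and `run_with_checkpoints` are hypothetical names, the two-step pipeline is a toy, and the cost figures are invented for the example. The idea is to compare workflows on spend per successful outcome and to stop a pipeline at the first intermediate state that fails verification.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


def cost_per_success(total_cost: float, successes: int) -> float:
    """Total spend divided by successful runs; infinite if nothing succeeded."""
    return total_cost / successes if successes > 0 else float("inf")


@dataclass
class Step:
    name: str
    run: Callable[[Any], Any]      # produces the next intermediate state
    check: Callable[[Any], bool]   # verifies that state before continuing


def run_with_checkpoints(steps: List[Step], state: Any) -> Tuple[bool, str, Any]:
    """Run steps in order and stop at the first failed checkpoint.

    Returns (ok, name_of_failed_step_or_empty_string, last_state).
    """
    for step in steps:
        state = step.run(state)
        if not step.check(state):
            return False, step.name, state
    return True, "", state


# Hypothetical toy pipeline: split a string, then parse the parts as integers.
steps = [
    Step("extract", lambda text: text.split(","), lambda parts: len(parts) == 3),
    Step("parse", lambda parts: [int(p) for p in parts],
         lambda nums: all(n >= 0 for n in nums)),
]

print(run_with_checkpoints(steps, "1,2,3"))  # every checkpoint passes
print(run_with_checkpoints(steps, "1,2"))    # stops at the "extract" checkpoint

# Invented figures: compare spend per successful task, not price per call.
print(cost_per_success(total_cost=4.0, successes=8))
print(cost_per_success(total_cost=1.5, successes=6))
```

If a workflow has no step where a `check` like this can be written, that is a sign the second test fails and errors will only surface in the final output.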

Gartner's cancellation projection assumes the current trajectory continues. The teams most likely to avoid that outcome treat "should this be an agent?" as a question to answer with data, not as a default yes that has to be argued down.
