The Agent Project That Should Have Been One LLM Call

A useful agent recovery story is not always the one where a team adds better memory, finer-grained tools, and another evaluation layer.

Sometimes it is the one where the team deletes the agent.

That sounds like failure. In some enterprise deployments, it can be the first honest architectural decision the project makes. A six-month agent build usually begins with a plausible demo: the system reads a request, plans a few steps, calls tools, checks its own work, and writes back to the user. Then production adds authentication, permissions, retries, stale data, human approvals, audit trails, hidden edge cases, and people who do not phrase requests like the demo script. The agent can become slower, more expensive, harder to debug, and less trusted than the workflow it was meant to improve.

At that point, the right answer may be a single LLM call wrapped in ordinary software.

The sources reviewed below support a narrower lesson: some agent projects are better understood as workflow-design failures than as model-capability failures. When the task is classification, extraction, routing, or summarization, the project may not need an agent at all. It may need one bounded model call, a deterministic workflow, and a clear owner for the exceptions.

Background: The Pilot-to-Production Gap Is Now Measurable

The pull toward agent architecture is easy to understand. Agent systems are meant to close the gap between a chatbot that advises and software that acts. That capability is real in some domains. It is also expensive.

Gartner has warned that many agentic AI projects can stall or fail to justify themselves as costs rise, business value stays unclear, and risk controls stay weak. The same warning made a sharper point: many use cases marketed as agentic do not require agentic implementations.

Deloitte's recent agentic AI strategy reporting shows the deployment gap from another angle: organizations may explore agents and pilot them long before they are actually ready for production. Treat that as a maturity signal, not proof that every stalled pilot had the same cause.

McKinsey's State of AI survey found widespread AI use but continuing difficulty turning pilots into scaled impact. Its high performers were not distinguished by having more demos. They redesigned workflows, changed operating practices, and scaled with executive support.

The older AI project failure literature points in the same direction. RAND's report on AI project failure argued that AI projects often fail because teams misunderstand the problem, lack necessary data, focus on technology rather than user need, or lack infrastructure for deployment. Agentic systems amplify each of those failure modes because they add more moving parts before the workflow is proven.

The mistake is not "using AI." The mistake is skipping the simpler AI pattern.

The Architecture Smell: Autonomy Without a Decision

An agent is justified when the system must choose actions over time under changing conditions. It has to inspect the state of the world, decide what to do next, use tools, update its plan, and recover when earlier steps fail.

Many enterprise "agent" projects do not have that shape.

They have a user request. They need to classify it. They need to extract fields. They need to look up records. They need to generate a draft. They need to route the result to a queue. Those tasks may benefit from language models, but they do not automatically need planning loops, tool autonomy, memory stores, or multi-agent delegation.

The distinction matters because every extra agentic feature creates an operational obligation.

Planning means the system can choose a bad plan. Tool use means the system can call the wrong tool, call the right tool with the wrong arguments, or call it at the wrong time. Memory means the system can retrieve stale, irrelevant, or sensitive context. Reflection means the system can spend money convincing itself of a false answer. Autonomy means incident response must explain not only what output was wrong, but which decision path produced it.

A single LLM call has a narrower failure surface. It receives a bounded input, returns a bounded output, and hands control back to deterministic code. You can schema-validate the response, reject invalid fields, log the prompt and output, and route uncertain cases to a human. That does not make it perfect. It makes it inspectable.

This is why "we went back to one LLM call" should not be treated as retreat. It is often a sign that the team finally found the actual product requirement.

What Recent Benchmarks Say About Long-Horizon Agents

The research signal is not that agents are useless. It is that long-horizon autonomy remains brittle, especially when tasks resemble real work rather than benchmark puzzles.

TheAgentCompany benchmark simulates a workplace where agents browse the web, write code, run programs, and communicate with simulated coworkers. The project is valuable because it tests the kind of cross-tool, social, administrative work that "digital worker" claims depend on. The reported results are a warning: dependable full-task completion is still elusive, even for leading systems.

Computer-use benchmarks show the same pattern. OSWorld-Human studies the temporal efficiency of computer-use agents on desktop tasks and finds that agents often require many more steps than humans, even when they eventually succeed. The problem is not just final accuracy. It is latency, wandering, repeated actions, and fragile interaction with interfaces designed for people.

WindowsWorld, a 2026 benchmark for professional cross-application GUI agents, is even harsher. It reports that leading computer-use agents still struggle on multi-application tasks, especially when the work depends on conditional judgment, reasoning across several applications, and steady execution.

Software engineering tells a related story. Saving SWE-Bench argues that formal GitHub issue descriptions can overstate how well agents will perform on realistic user-style requests. When benchmark tasks are transformed to better match how users actually ask for help, agent capability estimates can shift. This matters for enterprise agent pilots because demos often use clean, complete task statements. Production users provide partial intent, ambiguous context, and requests that cut across team boundaries.

The practical read: if a benchmark built by researchers still exposes low completion rates on realistic multi-step tasks, a six-month internal project with noisier data and weaker evals should be cautious about betting on autonomy first.

The Six-Month Failure Pattern

The recurring pattern looks like this.

Month one: the demo works. The agent reads a ticket, checks a knowledge base, asks a follow-up question, drafts a response, and updates the CRM. Executives see a workflow that appears close to full automation.

Month two: integrations arrive. The CRM has custom fields. The knowledge base has duplicate articles. Some customers have multiple accounts. Permissions differ by region. The agent now needs tool schemas, retrieval filters, identity checks, and fallbacks.

Month three: exceptions dominate. The agent handles clean tickets but stalls on messy ones. It asks unnecessary questions, misses policy constraints, or calls a tool before it has enough information. The team adds routing rules, a verifier, and a memory layer.

Month four: cost and latency become visible. The system makes multiple model calls per request. Traces are hard to read. Users complain that the agent takes longer than manual handling for common cases. Finance notices the inference bill. Security asks who approved the tool permissions.

Month five: evaluation becomes the project. The team cannot tell whether the agent is improving because success depends on a chain of actions. A better final answer may hide a worse intermediate decision. A passing test may rely on the exact phrasing of the task. Every fix creates new edge cases.

Month six: the team rebuilds the workflow as ordinary software. One LLM call classifies the request and extracts required fields. Deterministic code checks policy, retrieves records, and applies business rules. A second optional LLM call drafts customer-facing language after the system already knows the decision. Human review handles exceptions.

The rebuild usually needs better measurement, not more autonomy. How to Build Agent Evals That Catch Real Failures covers the evaluation layer that tells teams when an agentic design is actually improving, while How Agent Memory Got an Architecture explains why adding memory should be treated as an architectural commitment rather than a quick fix.

The second system looks less impressive in a demo. It is also faster, cheaper, easier to approve, and easier to debug.

This is the same lesson behind AI Agent ROI: What Strong-Outcome Teams Do Differently: the winning teams define the measurable workflow before adding autonomy. It also matches the warning in When NOT to Use an Agent: if the task has stable rules, constrained inputs, and clear exception paths, agent architecture is probably not the starting point.

The Single-Call Pattern

The single-call alternative is not "just ask ChatGPT." It is a disciplined architecture:

Preprocess the input with deterministic code.
Send the model a narrow task.
Require structured output.
Validate the output against a schema.
Apply business logic outside the model.
Escalate uncertainty.
Log enough context to reproduce the result.
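
The seven steps above, taken in order, can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the intent labels, the 0.7 confidence floor, and the names validate, handle, and call_model are all assumptions made for the example, and the model call is passed in as a plain function so the sketch runs without any vendor SDK.

```python
import json
from typing import Callable

ALLOWED_INTENTS = {"refund_request", "order_status", "other"}  # hypothetical label set
CONFIDENCE_FLOOR = 0.7  # hypothetical escalation threshold


def validate(raw: str) -> dict:
    """Parse the model's reply and reject anything outside the expected shape."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    intent = data.get("intent")
    if not isinstance(intent, str) or intent not in ALLOWED_INTENTS:
        raise ValueError("unknown intent")
    confidence = data.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence out of range")
    return data


def handle(message: str, call_model: Callable[[str], str], log: list) -> dict:
    text = " ".join(message.split())      # step 1: deterministic preprocessing
    raw = call_model(text)                # steps 2-3: one narrow call, structured reply
    record = {"input": text, "raw_output": raw}
    try:
        parsed = validate(raw)            # step 4: schema validation
    except ValueError as err:
        record["route"] = "human_review"  # step 6: invalid output is escalated
        record["reason"] = str(err)
    else:
        # steps 5-6: routing is decided by code, not by the model
        low = parsed["confidence"] < CONFIDENCE_FLOOR
        record["route"] = "human_review" if low else parsed["intent"]
        record["parsed"] = parsed
    log.append(record)                    # step 7: keep enough context to replay
    return record
```

Because call_model is injected, the same function can be exercised in tests with a canned reply and in production with a real client; the model never sees or chooses the route.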

For a support workflow, the LLM might return:

{
  "intent": "refund_request",
  "confidence": 0.84,
  "required_fields": ["order_id", "purchase_date"],
  "customer_sentiment": "frustrated",
  "summary": "Customer says the item arrived damaged and wants a refund."
}

The refund policy does not live inside the prompt. The model does not decide whether to issue money. The code checks the order, purchase date, SKU, region, warranty status, and fraud flags. If the request qualifies, the system drafts a response. If it does not, it routes to a queue with the extracted summary.
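
To make "the policy lives in code" concrete, here is a hedged sketch of a refund check as a pure function. The Order fields are a subset of the checks named above, and the per-region refund windows and queue names are invented for illustration; a real policy would come from the business, not from this example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Order:
    order_id: str
    purchase_date: date
    region: str
    fraud_flag: bool


REFUND_WINDOW_DAYS = {"US": 30, "EU": 14}  # hypothetical values for illustration only


def refund_decision(order: Order, today: date) -> str:
    """Return 'approve' or the name of a review queue. No model is involved."""
    if order.fraud_flag:
        return "queue:fraud_review"
    window = REFUND_WINDOW_DAYS.get(order.region)
    if window is None:
        return "queue:unsupported_region"
    age_days = (today - order.purchase_date).days
    if age_days < 0 or age_days > window:
        return "queue:manual_review"
    return "approve"
```

The function takes today as an argument rather than reading the clock, so every decision can be reproduced later with the same inputs.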

For an internal analytics workflow, the LLM might translate a natural-language question into a constrained query plan, but a query builder enforces allowed tables and columns. For legal intake, the LLM might summarize facts and classify matter type, but conflict checks and retention decisions stay outside the model. For IT service management, the LLM might map a request to a category and priority, but access changes still require policy code and human approval.
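
The analytics case can be sketched the same way. Assume the model returns a small plan dictionary with table, columns, and an optional limit; the plan shape, the allowlist contents, and the build_query name are assumptions for this example rather than any particular product's API.

```python
ALLOWED = {  # hypothetical schema allowlist
    "orders": {"order_id", "region", "total", "created_at"},
    "customers": {"customer_id", "region"},
}


def build_query(plan: dict) -> str:
    """Turn a model-proposed plan into SQL only if every name is allowlisted."""
    table = plan.get("table")
    columns = plan.get("columns")
    if not isinstance(table, str) or table not in ALLOWED:
        raise ValueError(f"table not allowed: {table!r}")
    if not isinstance(columns, list) or not columns:
        raise ValueError("columns must be a non-empty list")
    for col in columns:
        if not isinstance(col, str) or col not in ALLOWED[table]:
            raise ValueError(f"column not allowed: {col!r}")
    limit = plan.get("limit", 100)
    if isinstance(limit, bool) or not isinstance(limit, int) or not 1 <= limit <= 1000:
        raise ValueError("limit must be an integer between 1 and 1000")
    # Every identifier was checked against the allowlist above, so no
    # unvalidated model text reaches the SQL string.
    return f"SELECT {', '.join(columns)} FROM {table} LIMIT {limit}"
```

A plan that names an unknown table or column raises before any SQL exists, which turns a potential data-exposure incident into an ordinary rejected request.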

This pattern is less autonomous. That is the point.

It gives the language model the job it is good at: reading messy language and producing useful structure. It gives deterministic systems the jobs they are better at: policy, permissions, calculation, state transitions, and auditability.

It also makes evaluation tractable. You can score intent classification, field extraction, summary faithfulness, confidence calibration, and escalation accuracy. You do not need to infer whether a five-step plan failed because of retrieval, reasoning, tool selection, memory, permissioning, or a bad intermediate observation.
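
As a rough illustration of how small that scoring code can be, the sketch below computes intent accuracy plus escalation recall and precision over a labeled set. The field names (gold_intent, pred_intent, gold_escalate, pred_escalate) are hypothetical, and summary faithfulness and calibration are left out because they need more than a comparison of labels.

```python
def score(examples: list[dict]) -> dict:
    """Score a labeled set; each example holds gold and predicted labels."""
    n = len(examples)
    if n == 0:
        raise ValueError("no examples")
    intent_hits = sum(e["pred_intent"] == e["gold_intent"] for e in examples)
    should_escalate = [e for e in examples if e["gold_escalate"]]
    caught = sum(bool(e["pred_escalate"]) for e in should_escalate)
    flagged = sum(bool(e["pred_escalate"]) for e in examples)
    return {
        "intent_accuracy": intent_hits / n,
        # Share of cases needing a human that the system actually escalated.
        "escalation_recall": caught / len(should_escalate) if should_escalate else None,
        # Share of escalations that genuinely needed a human.
        "escalation_precision": caught / flagged if flagged else None,
    }
```

Each metric maps to one component of the single call, so a regression points at a specific prompt or threshold rather than at an opaque chain of actions.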

For teams struggling with cost, this connects directly to Agent Cost Optimization: How to Track and Reduce LLM Spend. The cheapest agent call is the one you remove. The second cheapest is the one you narrow until a smaller model can handle it.

If that narrower call can run on a cheaper or owned model, Open Source AI Impact: Who Wins When Models Get Cheap explains how model routing changes the economics without making the workflow more autonomous.

When an Agent Is Still the Right Choice

The argument against overbuilt agents is not an argument against agents.

Agents make sense when the task genuinely requires iterative action. A research agent that must search, compare sources, revise hypotheses, and follow leads may need tool use and planning. A coding agent that edits files, runs tests, reads failures, and patches again is doing work that cannot be compressed into one completion. A security investigation agent that correlates logs across systems may need to branch as evidence changes.

The important test is whether the next step depends on new information created by the previous step.

If yes, an agent may be justified. If no, you may be adding a planning loop around a fixed workflow. Fixed workflows are usually better expressed as software.

There is also a value threshold. Anthropic's multi-agent research system, covered in The Hidden Cost of "Just Add Another Agent", can justify high token overhead when the task value is high and parallel research produces better answers. A customer support classification pipeline cannot make the same argument unless the improvement is measurable and large.

The governance threshold matters too. Gartner's warning on agent governance argues that autonomous systems need controls matched to their permissions and blast radius. A system that can act across trust boundaries needs more than prompt rules. It needs scoped permissions, monitoring, rollback, incident response, and a named owner.

If the business case cannot support those controls, the architecture is too ambitious for the value at stake.

Practical Implications: The De-Agenting Checklist

Before funding another quarter of agent work, ask seven questions.

What decision is the agent actually making? If the answer is "it follows the workflow," you may not need an agent. You need workflow software with model-assisted input handling.

Can the task be completed with one structured output? If the model can classify, extract, summarize, or draft in one pass, start there. Add steps only when measured failures require them.

Which parts must be deterministic? Payments, account changes, compliance checks, access grants, inventory movements, and legal commitments should not depend on free-form model judgment.

What happens when confidence is low? A useful non-agentic system has a clear fallback. A weak agent often tries another loop and burns time without improving certainty.

Can you replay the failure? If you cannot reproduce why the system acted, you are not ready for autonomy. A single-call system is easier to replay because the decision boundary is narrow.

Does the workflow improve if the model thinks longer? Some reasoning tasks benefit from extra compute. Many enterprise tasks benefit more from cleaner data and stricter schemas.

Would a rules engine plus one LLM call beat the current agent on cost, latency, and approvals? If yes, rebuild around that pattern and preserve the agent work only where it proves incremental value.

This is not a purity test. It is a sequencing rule. Start with the smallest model-mediated workflow that can be measured. Add autonomy when the simpler design hits a documented ceiling.
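
The replay question in the checklist is the easiest one to make concrete. One possible approach, sketched here with hypothetical names (replay_record, same_request, and the model_id and params fields are assumptions, not a standard), is to store each call with a hash of its exact request so a later re-run can confirm it is replaying the same thing.

```python
import hashlib
import json


def replay_record(prompt: str, model_id: str, params: dict, output: str) -> dict:
    """Bundle what is needed to re-run one call and compare the result."""
    payload = {"prompt": prompt, "model_id": model_id, "params": params}
    # Canonical JSON so identical requests always hash to the same key.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return {
        **payload,
        "request_hash": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "output": output,
    }


def same_request(a: dict, b: dict) -> bool:
    """True when two records describe the same prompt, model, and parameters."""
    return a["request_hash"] == b["request_hash"]
```

Two records with the same hash but different outputs indicate that the model's behavior changed while the request did not, which is a far narrower question than reconstructing a multi-step agent trace.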

What's Next

The next phase of enterprise AI will not be "agents everywhere." It will be better discrimination between tasks that need autonomy and tasks that need language understanding inside conventional systems.

That is a healthier direction. It means more AI will reach production, not less. The projects that survive will look less magical and more boring: structured outputs, policy code, retrieval limits, eval sets, audit logs, and human exception paths. They will use agents where open-ended action is worth the cost. They will use single LLM calls where the job is bounded.

The broader signal is consistent: industry warnings, workflow-redesign research, and realistic agent benchmarks all point to the same operational caution. Autonomy is not a feature to add by default. It is a liability to earn.

The six-month agent project that goes back to a single LLM call is not a cautionary tale about AI being overhyped. It is a cautionary tale about architecture pretending to be strategy.

The better question is not "Can we make this agentic?"

It is: "What is the least autonomous system that produces the business result?"

Start there. Then make the agent prove it deserves to exist.
