title: "The Agent Project That Should Have Been One LLM Call"
slug: agent-project-single-llm-call
category: real-world-ai
subtopic: enterprise-deployment
type: guide
tags: [guides, real-world-ai, enterprise-deployment, agents, contrarian]
date: 2026-06-06
status: draft

The cleanest agent recovery story is not the one where a team adds better memory, finer-grained tools, and another evaluation layer.

It is the one where the team deletes the agent.

That sounds like failure. In many enterprise deployments, it is the first honest architectural decision the project makes. A six-month agent build usually begins with a plausible demo: the system reads a request, plans a few steps, calls tools, checks its own work, and writes back to the user. Then production adds authentication, permissions, retries, stale data, human approvals, audit trails, hidden edge cases, and people who do not phrase requests like the demo script. The agent becomes slower, more expensive, harder to debug, and less trusted than the workflow it was meant to improve.

At that point, the right answer may be a single LLM call wrapped in ordinary software.

This is the contrarian lesson from the 2025-2026 enterprise agent cycle: many failed agent projects did not fail because language models were too weak. They failed because the architecture assigned autonomy to problems that needed classification, extraction, routing, or summarization. The project did not need an agent. It needed one bounded model call, a deterministic workflow, and a clear owner for the exceptions.


Background: The Pilot-to-Production Gap Is Now Measurable

The pull toward agent architecture is easy to understand. Agents promise to close the gap between a chatbot that advises and software that acts. That promise is real in some domains. It is also expensive.

Gartner predicted in June 2025 that more than 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs, unclear business value, and weak risk controls. The same release made a sharper point: many use cases marketed as agentic do not require agentic implementations.

Deloitte's 2026 agentic AI strategy report shows the deployment gap from another angle. In its 2025 Emerging Technology Trends study, 30% of organizations were exploring agentic options and 38% were piloting them, but only 14% had solutions ready for deployment and 11% were using them in production. That is not just a maturity curve. It is a filter that removes projects whose demo logic cannot survive enterprise constraints.

McKinsey's 2025 State of AI survey found widespread AI use but continuing difficulty turning pilots into scaled impact. Its high performers, about 6% of respondents by McKinsey's definition, were not distinguished by having more demos. They redesigned workflows, changed operating practices, and scaled with executive support.

The older AI project failure literature points in the same direction. RAND's 2024 report on AI project failure argued that AI projects often fail because teams misunderstand the problem, lack necessary data, focus on technology rather than user need, or lack infrastructure for deployment. Agentic systems amplify each of those failure modes because they add more moving parts before the workflow is proven.

The mistake is not "using AI." The mistake is skipping the simpler AI pattern.


The Architecture Smell: Autonomy Without a Decision

An agent is justified when the system must choose actions over time under changing conditions. It has to inspect the state of the world, decide what to do next, use tools, update its plan, and recover when earlier steps fail.

Many enterprise "agent" projects do not have that shape.

They have a user request. They need to classify it. They need to extract fields. They need to look up records. They need to generate a draft. They need to route the result to a queue. Those tasks may benefit from language models, but they do not automatically need planning loops, tool autonomy, memory stores, or multi-agent delegation.

The distinction matters because every extra agentic feature creates an operational obligation.

Planning means the system can choose a bad plan. Tool use means the system can call the wrong tool, call the right tool with the wrong arguments, or call it at the wrong time. Memory means the system can retrieve stale, irrelevant, or sensitive context. Reflection means the system can spend money convincing itself of a false answer. Autonomy means incident response must explain not only what output was wrong, but which decision path produced it.

A single LLM call has a narrower failure surface. It receives a bounded input, returns a bounded output, and hands control back to deterministic code. You can schema-validate the response, reject invalid fields, log the prompt and output, and route uncertain cases to a human. That does not make it perfect. It makes it inspectable.

This is why "we went back to one LLM call" should not be treated as retreat. It is often a sign that the team finally found the actual product requirement.


What Recent Benchmarks Say About Long-Horizon Agents

The research signal is not that agents are useless. It is that long-horizon autonomy remains brittle, especially when tasks resemble real work rather than benchmark puzzles.

TheAgentCompany benchmark simulates a workplace where agents browse the web, write code, run programs, and communicate with simulated coworkers. The project is valuable because it tests the kind of cross-tool, social, administrative work that "digital worker" claims depend on. The reported results are a warning: even leading agents complete only a minority of the work. In later results discussed by the project and follow-on coverage, top systems fully complete roughly 30% of tasks. Partial-credit scores are higher, but they still sit far from dependable automation.

Computer-use benchmarks show the same pattern. OSWorld-Human studies the temporal efficiency of computer-use agents on desktop tasks and finds that agents often require many more steps than humans, even when they eventually succeed. The problem is not just final accuracy. It is latency, wandering, repeated actions, and fragile interaction with interfaces designed for people.

WindowsWorld, a 2026 benchmark for professional cross-application GUI agents, is even harsher. It reports that leading computer-use agents perform poorly on multi-application tasks, with success below 21% for the tested systems. The failures cluster around conditional judgment, reasoning across three or more applications, and inefficient execution.

Software engineering tells a related story. Saving SWE-Bench argues that formal GitHub issue descriptions can overstate how well agents will perform on realistic user-style requests. When benchmark tasks are transformed to better match how users actually ask for help, agent capability estimates can shift. This matters for enterprise agent pilots because demos often use clean, complete task statements. Production users provide partial intent, ambiguous context, and requests that cut across team boundaries.

The practical read: if a benchmark built by researchers still exposes low completion rates on realistic multi-step tasks, a six-month internal project with noisier data and weaker evals should be cautious about betting on autonomy first.


The Six-Month Failure Pattern

The recurring pattern looks like this.

Month one: the demo works. The agent reads a ticket, checks a knowledge base, asks a follow-up question, drafts a response, and updates the CRM. Executives see a workflow that appears close to full automation.

Month two: integrations arrive. The CRM has custom fields. The knowledge base has duplicate articles. Some customers have multiple accounts. Permissions differ by region. The agent now needs tool schemas, retrieval filters, identity checks, and fallbacks.

Month three: exceptions dominate. The agent handles clean tickets but stalls on messy ones. It asks unnecessary questions, misses policy constraints, or calls a tool before it has enough information. The team adds routing rules, a verifier, and a memory layer.

Month four: cost and latency become visible. The system makes multiple model calls per request. Traces are hard to read. Users complain that the agent takes longer than manual handling for common cases. Finance notices the inference bill. Security asks who approved the tool permissions.

Month five: evaluation becomes the project. The team cannot tell whether the agent is improving because success depends on a chain of actions. A better final answer may hide a worse intermediate decision. A passing test may rely on the exact phrasing of the task. Every fix creates new edge cases.

Month six: the team rebuilds the workflow as ordinary software. One LLM call classifies the request and extracts required fields. Deterministic code checks policy, retrieves records, and applies business rules. A second optional LLM call drafts customer-facing language after the system already knows the decision. Human review handles exceptions.

The second system looks less impressive in a demo. It is also faster, cheaper, easier to approve, and easier to debug.

This is the same lesson behind "AI Agent ROI: What the 3.4% Who Hit Their Targets Do Differently": the winning teams define the measurable workflow before adding autonomy. It also matches the warning in "When NOT to Use an Agent": if the task has stable rules, constrained inputs, and clear exception paths, agent architecture is probably not the starting point.


The Single-Call Pattern

The single-call alternative is not "just ask ChatGPT." It is a disciplined architecture:

  1. Preprocess the input with deterministic code.
  2. Send the model a narrow task.
  3. Require structured output.
  4. Validate the output against a schema.
  5. Apply business logic outside the model.
  6. Escalate uncertainty.
  7. Log enough context to reproduce the result.
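
The seven steps above can be sketched in a few dozen lines. This is a minimal illustration, not a reference implementation: `call_model` is a hypothetical stand-in for whatever model client a team uses, and the label set, field names, and the 0.7 confidence floor are assumptions chosen for the example.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

ALLOWED_INTENTS = {"refund_request", "order_status", "other"}  # hypothetical labels
CONFIDENCE_FLOOR = 0.7  # hypothetical escalation threshold


@dataclass
class Result:
    route: str  # "auto" or "human_review"
    reason: str
    parsed: Optional[dict] = None


def preprocess(text: str) -> str:
    # Step 1: deterministic cleanup (collapse whitespace, cap length).
    return " ".join(text.split())[:4000]


def validate(raw: str) -> dict:
    # Steps 3-4: require structured output and check it against a schema.
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if data.get("intent") not in ALLOWED_INTENTS:
        raise ValueError("unknown intent")
    conf = data.get("confidence")
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0 <= conf <= 1:
        raise ValueError("confidence must be in [0, 1]")
    if not isinstance(data.get("summary"), str):
        raise ValueError("summary must be a string")
    return data


def handle(text: str, call_model: Callable[[str], str], log: list) -> Result:
    prompt = preprocess(text)
    raw = call_model(prompt)  # Step 2: one narrow model call
    try:
        parsed = validate(raw)
    except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
        result = Result("human_review", f"invalid output: {exc}")
    else:
        if parsed["confidence"] < CONFIDENCE_FLOOR:  # Step 6: escalate uncertainty
            result = Result("human_review", "low confidence", parsed)
        else:
            # Step 5: business rules run on `parsed` in ordinary code downstream.
            result = Result("auto", "validated", parsed)
    # Step 7: keep enough context to reproduce the result later.
    log.append({"prompt": prompt, "raw": raw,
                "route": result.route, "reason": result.reason})
    return result
```

Because the model is passed in as a plain callable, the whole flow can be exercised in tests with a stub that returns canned JSON, with no network access required.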

For a support workflow, the LLM might return:

{
  "intent": "refund_request",
  "confidence": 0.84,
  "required_fields": ["order_id", "purchase_date"],
  "customer_sentiment": "frustrated",
  "summary": "Customer says the item arrived damaged and wants a refund."
}

The refund policy does not live inside the prompt. The model does not decide whether to issue money. The code checks the order, purchase date, SKU, region, warranty status, and fraud flags. If the request qualifies, the system drafts a response. If it does not, it routes to a queue with the extracted summary.
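
To make "the policy lives in code" concrete, here is a hedged sketch of a deterministic refund check. The 30-day window, the supported regions, and the `Order` fields are all invented for illustration; a real policy would also cover SKU and warranty status as described above.

```python
from dataclasses import dataclass
from datetime import date

REFUND_WINDOW_DAYS = 30           # hypothetical policy value
AUTOMATED_REGIONS = {"US", "EU"}  # hypothetical supported regions


@dataclass(frozen=True)
class Order:
    order_id: str
    purchase_date: date
    region: str
    fraud_flag: bool


def refund_decision(order: Order, today: date) -> tuple[str, str]:
    """Return (route, reason). A pure function: no model call is involved."""
    if order.fraud_flag:
        return "manual_review", "fraud flag set"
    if order.region not in AUTOMATED_REGIONS:
        return "manual_review", "region not covered by automated policy"
    age_days = (today - order.purchase_date).days
    if age_days > REFUND_WINDOW_DAYS:
        return "manual_review", f"outside {REFUND_WINDOW_DAYS}-day window"
    return "approve", "within policy"
```

Every branch returns a reason string, so the queue that receives a non-qualifying request sees why it was routed there alongside the model's extracted summary.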

For an internal analytics workflow, the LLM might translate a natural-language question into a constrained query plan, but a query builder enforces allowed tables and columns. For legal intake, the LLM might summarize facts and classify matter type, but conflict checks and retention decisions stay outside the model. For IT service management, the LLM might map a request to a category and priority, but access changes still require policy code and human approval.
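
The analytics case might look like the sketch below: the model proposes a plan as plain data, and a builder refuses anything outside an allow-list. The table names, columns, plan shape, and row cap are assumptions made up for this example.

```python
ALLOWED = {  # hypothetical allow-list of tables and their columns
    "orders": {"order_id", "region", "total", "created_at"},
    "customers": {"customer_id", "region", "signup_date"},
}
MAX_LIMIT = 1000  # hypothetical row cap


def build_query(plan: dict) -> str:
    """Turn a model-proposed plan into SQL, rejecting anything off the allow-list."""
    table = plan.get("table")
    if not isinstance(table, str) or table not in ALLOWED:
        raise ValueError(f"table not allowed: {table!r}")
    columns = plan.get("columns")
    if not isinstance(columns, list) or not columns:
        raise ValueError("columns must be a non-empty list")
    bad = [c for c in columns
           if not isinstance(c, str) or c not in ALLOWED[table]]
    if bad:
        raise ValueError(f"columns not allowed: {bad!r}")
    limit = plan.get("limit", 100)
    if isinstance(limit, bool) or not isinstance(limit, int):
        raise ValueError("limit must be an integer")
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError("limit out of range")
    # Every identifier below was matched against the allow-list first,
    # so no free-form model text reaches the SQL string.
    return f"SELECT {', '.join(columns)} FROM {table} LIMIT {limit}"
```

The model never writes SQL here. It only fills in a small plan, and a rejected plan becomes an ordinary exception that the caller can log or escalate.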

This pattern is less autonomous. That is the point.

It gives the language model the job it is good at: reading messy language and producing useful structure. It gives deterministic systems the jobs they are better at: policy, permissions, calculation, state transitions, and auditability.

It also makes evaluation tractable. You can score intent classification, field extraction, summary faithfulness, confidence calibration, and escalation accuracy. You do not need to infer whether a five-step plan failed because of retrieval, reasoning, tool selection, memory, permissioning, or a bad intermediate observation.
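
As a rough illustration of why this is tractable, three of those scores reduce to simple counting over a labeled set. The example layout (`predicted`, `expected`, `should_escalate`) and the 0.7 threshold are assumptions for this sketch; summary faithfulness and calibration need more than exact matching and are left out.

```python
def score(examples: list[dict], threshold: float = 0.7) -> dict:
    """Score intent, field extraction, and escalation on labeled examples.

    Each example holds 'predicted' and 'expected' dicts plus a boolean
    'should_escalate' label supplied by a human reviewer.
    """
    n = len(examples)
    if n == 0:
        raise ValueError("need at least one labeled example")
    intent_hits = field_hits = escalation_hits = 0
    for ex in examples:
        pred, gold = ex["predicted"], ex["expected"]
        intent_hits += pred["intent"] == gold["intent"]
        # Exact set match: order of extracted fields does not matter.
        field_hits += set(pred["required_fields"]) == set(gold["required_fields"])
        escalated = pred["confidence"] < threshold
        escalation_hits += escalated == ex["should_escalate"]
    return {
        "intent_accuracy": intent_hits / n,
        "field_exact_match": field_hits / n,
        "escalation_accuracy": escalation_hits / n,
    }
```

Each metric maps to one field of one model output, so a regression points at a specific prompt or schema change rather than at an unknown step in a chain.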

For teams struggling with cost, this connects directly to "Agent Cost Optimization: How to Track and Reduce LLM Spend." The cheapest agent call is the one you remove. The second cheapest is the one you narrow until a smaller model can handle it.


When an Agent Is Still the Right Choice

The argument against overbuilt agents is not an argument against agents.

Agents make sense when the task genuinely requires iterative action. A research agent that must search, compare sources, revise hypotheses, and follow leads may need tool use and planning. A coding agent that edits files, runs tests, reads failures, and patches again is doing work that cannot be compressed into one completion. A security investigation agent that correlates logs across systems may need to branch as evidence changes.

The important test is whether the next step depends on new information created by the previous step.

If yes, an agent may be justified. If no, you may be adding a planning loop around a fixed workflow. Fixed workflows are usually better expressed as software.

There is also a value threshold. Anthropic's multi-agent research system, covered in "The Hidden Cost of 'Just Add Another Agent'," can justify high token overhead when the task value is high and parallel research produces better answers. A customer support classification pipeline cannot make the same argument unless the improvement is measurable and large.

The governance threshold matters too. Gartner's May 2026 warning on agent governance predicts that by 2027, 40% of enterprises will demote or decommission autonomous agents because governance gaps are found after production incidents. A system that can act across trust boundaries needs more than prompt rules. It needs scoped permissions, monitoring, rollback, incident response, and a named owner.

If the business case cannot support those controls, the architecture is too ambitious for the value at stake.


Practical Implications: The De-Agenting Checklist

Before funding another quarter of agent work, ask seven questions.

What decision is the agent actually making? If the answer is "it follows the workflow," you may not need an agent. You need workflow software with model-assisted input handling.

Can the task be completed with one structured output? If the model can classify, extract, summarize, or draft in one pass, start there. Add steps only when measured failures require them.

Which parts must be deterministic? Payments, account changes, compliance checks, access grants, inventory movements, and legal commitments should not depend on free-form model judgment.

What happens when confidence is low? A useful non-agentic system has a clear fallback. A weak agent often tries another loop and burns time without improving certainty.

Can you replay the failure? If you cannot reproduce why the system acted, you are not ready for autonomy. A single-call system is easier to replay because the decision boundary is narrow.
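
One way to picture replay for a single-call system, offered as a sketch under assumptions: store the prompt, the raw model output, and the route, then re-run only the deterministic step against the stored output. The record fields and the `decide` rule are hypothetical, and the model call itself is not re-executed because it may not be deterministic.

```python
import hashlib
import json


def decide(raw_output: str, threshold: float = 0.7) -> str:
    # Deterministic post-processing: the same recorded output
    # always produces the same route.
    try:
        data = json.loads(raw_output)
        confidence = float(data["confidence"])
    except (ValueError, KeyError, TypeError):
        return "human_review"
    return "auto" if confidence >= threshold else "human_review"


def record(prompt: str, raw_output: str, model: str) -> dict:
    """Capture what is needed to reproduce the decision later."""
    return {
        "model": model,
        "prompt": prompt,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "raw_output": raw_output,
        "route": decide(raw_output),
    }


def replay(entry: dict) -> bool:
    """Re-run the deterministic step and check it matches the logged route."""
    return decide(entry["raw_output"]) == entry["route"]
```

If `replay` returns `False` for a stored entry, the code or the threshold changed since the decision was made, which narrows an incident to the deterministic side before anyone inspects the prompt.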

Does the workflow improve if the model thinks longer? Some reasoning tasks benefit from extra compute. Many enterprise tasks benefit more from cleaner data and stricter schemas.

Would a rules engine plus one LLM call beat the current agent on cost, latency, and approvals? If yes, rebuild around that pattern and preserve the agent work only where it proves incremental value.

This is not a purity test. It is a sequencing rule. Start with the smallest model-mediated workflow that can be measured. Add autonomy when the simpler design hits a documented ceiling.


What's Next

The next phase of enterprise AI will not be "agents everywhere." It will be better discrimination between tasks that need autonomy and tasks that need language understanding inside conventional systems.

That is a healthier direction. It means more AI will reach production, not less. The projects that survive will look less magical and more boring: structured outputs, policy code, retrieval limits, eval sets, audit logs, and human exception paths. They will use agents where open-ended action is worth the cost. They will use single LLM calls where the job is bounded.

The market is already pushing that way. Gartner's cancellation forecast, Deloitte's production gap, McKinsey's emphasis on workflow redesign, and realistic agent benchmarks all point to the same operational truth: autonomy is not a feature to add by default. It is a liability that each use case has to justify.

The six-month agent project that goes back to a single LLM call is not a cautionary tale about AI being overhyped. It is a cautionary tale about architecture pretending to be strategy.

The better question for 2026 is not "Can we make this agentic?"

It is: "What is the least autonomous system that produces the business result?"

Start there. Then make the agent prove it deserves to exist.


Related: AI Agent ROI: What the 3.4% Who Hit Their Targets Do Differently · When NOT to Use an Agent · Agent Cost Optimization: How to Track and Reduce LLM Spend · The Hidden Cost of "Just Add Another Agent" · Deploying AI Agents to Production