tags: guides, agent-design
category: agent-design
slug: agent-tool-use-patterns-guide
meta_description: "A practical guide to how LLM agents select, call, and chain tools in production. Covers function calling patterns, failure modes, benchmarks, and the MCP standard."
Every major model provider now supports function calling. OpenAI, Anthropic, Google, and a dozen open-weight alternatives all let you define tool schemas and get back structured JSON. The interface works. The reliability doesn't.
On the Berkeley Function Calling Leaderboard (BFCL) V4, the best models score around 70%. GLM-4.5 leads at 70.85%, with Claude Opus 4.1 close behind at 70.36% and Claude Sonnet 4 at 70.29%. GPT-5 sits at 59.22% in seventh place. These are the frontier models, evaluated on structured function calls with clear schemas. In production, where schemas are messy and tool counts climb past fifty, accuracy drops further.
The gap between "can call a function" and "can use tools reliably across a multi-step workflow" is where most agent projects die. This guide covers what actually happens when an LLM tries to use tools: the patterns that work, the failure modes that kill production systems, and the emerging standards trying to make it all less fragile.
The Tool-Use Lifecycle
Tool use isn't one skill. It's six, executed in sequence, and an agent can fail at any step.
Discovery is knowing what tools exist. In early systems, this meant stuffing every function signature into the system prompt. That worked for five tools. It falls apart at fifty. AnyTool introduced hierarchical retrieval structures that use semantic search to inject only the most relevant top-K tool descriptions into the context window. The tradeoff: retrieval adds latency and can miss the right tool entirely if the user's intent doesn't match the tool's description.
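The retrieval step can be sketched in a few lines. Real systems use embedding similarity over tool descriptions; this minimal version scores by token overlap so it stays self-contained, and the tool catalog is entirely hypothetical.

```python
# Sketch of top-K tool retrieval over a catalog too large to inline in the
# prompt. Production systems use semantic embeddings; token overlap stands
# in here. All tool names and descriptions are illustrative.
TOOL_CATALOG = {
    "get_weather": "Fetch the current weather forecast for a city",
    "search_flights": "Search available flights between two airports",
    "create_invoice": "Create a billing invoice for a customer account",
    "get_stock_price": "Look up the latest stock price for a ticker symbol",
}

def retrieve_tools(query: str, k: int = 2) -> list[str]:
    """Return the k tool names whose descriptions best overlap the query."""
    q_tokens = set(query.lower().split())
    def score(item):
        _name, desc = item
        return len(q_tokens & set(desc.lower().split()))
    ranked = sorted(TOOL_CATALOG.items(), key=score, reverse=True)
    return [name for name, _ in ranked[:k]]
```

Only the returned tools' schemas go into the context window, which is also where the failure mode lives: if the scoring misses, the right tool never reaches the model.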
Selection is choosing which tool to call. This is where the ReAct framework still dominates, interleaving reasoning traces with action steps. But ReAct has an expensive bottleneck: it invokes the LLM at every step to decide which tool to use next. AutoTool, published in November 2025, sidesteps this by building a directed graph from historical agent trajectories. Nodes are tools, edges are transition probabilities. The result: 30% reduction in inference costs while maintaining task completion rates.
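The core idea behind graph-based selection can be illustrated with a toy version: mine tool-to-tool transition counts from past trajectories, then predict the likeliest successor without an LLM call. This is a sketch of the concept, not AutoTool's actual implementation, and the tool names are made up.

```python
from collections import Counter, defaultdict

def build_transition_graph(trajectories):
    """Count tool -> tool transitions across historical agent trajectories.
    Nodes are tools; edge weights approximate transition probabilities."""
    graph = defaultdict(Counter)
    for traj in trajectories:
        for a, b in zip(traj, traj[1:]):
            graph[a][b] += 1
    return graph

def predict_next(graph, current_tool):
    """Most frequent successor, or None if we must fall back to LLM selection."""
    successors = graph.get(current_tool)
    if not successors:
        return None  # cold start: this tool has no history
    return successors.most_common(1)[0][0]
```

When the graph has no edge for the current state, the system falls back to the expensive per-step LLM decision, which is exactly the cold-start limitation discussed below.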
Parameter construction is generating the correct arguments. This is harder than it sounds. A comprehensive analysis presented at EMNLP 2025 categorized multi-step tool call errors into five patterns: tool selection errors, tool hallucination errors, parameter key errors, parameter value errors, and environment errors. Missing required parameters, hallucinating parameter names that don't exist in the schema, and filling values with plausible-sounding but incorrect data are all common. Models that write flawless Python routinely botch start_date vs date_start in an API call.
Execution is running the tool and getting a result. The agent doesn't execute anything directly. It hands structured JSON to your application layer, which calls the API, runs the query, or hits the database. Your code handles retries, rate limits, and timeouts. The agent just waits.
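A minimal dispatch layer makes the division of labor concrete: the model emits structured JSON, and your code resolves the name, runs the function, and catches failures. The registry and call shapes here are illustrative, not any provider's exact format.

```python
def execute_tool_call(call: dict, registry: dict) -> dict:
    """Dispatch a model-generated tool call to application code.
    The model never executes anything; this layer does, and it owns errors."""
    fn = registry.get(call["name"])
    if fn is None:
        return {"ok": False, "error": f"unknown tool: {call['name']}"}
    try:
        return {"ok": True, "result": fn(**call["arguments"])}
    except Exception as exc:
        # Surface the failure as data so the agent loop can react to it.
        return {"ok": False, "error": str(exc)}
```

Returning errors as structured results rather than raising keeps the agent loop in control of retries and recovery.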
Result interpretation is reading what came back and deciding what to do next. This is where context windows matter. A tool that returns 50KB of JSON can blow out the context budget for the rest of the conversation. Smart implementations truncate, summarize, or extract specific fields before feeding results back to the model.
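A simple field-whitelist-plus-truncation pass, sketched below, is often enough to keep a verbose API response from eating the context budget. The field names are hypothetical.

```python
import json

def compact_result(result: dict, keep: list[str], max_chars: int = 2000) -> str:
    """Keep only whitelisted fields from a tool result, then hard-cap the
    serialized length before it goes back into the model's context."""
    slim = {k: result[k] for k in keep if k in result}
    return json.dumps(slim)[:max_chars]
```

More sophisticated versions summarize with a cheap model instead of truncating, but the principle is the same: the model sees what it needs, not everything the tool returned.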
Chaining is doing it all again with the next tool. Multi-step workflows create compounding error rates. If each tool call succeeds 85% of the time, a five-step chain succeeds 44% of the time. This arithmetic is why production agents feel unreliable even when individual tool calls look fine.
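The compounding arithmetic is worth internalizing, because it is just per-step reliability raised to the chain length:

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Probability an n-step chain succeeds if each step is independent."""
    return p_step ** n_steps
```

At 85% per step, five steps gives roughly 44%; pushing per-step reliability to 98% is what gets a five-step chain back above 90%.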
How the Big Three Implement It
The three major providers have converged on similar interfaces but diverge on important details.
OpenAI's function calling defines tools as JSON Schema objects passed in the tools parameter. The model returns a tool_calls array with function names and arguments. OpenAI supports parallel tool calls by default, letting the model invoke multiple functions simultaneously when the calls are independent. You can disable this with parallel_tool_calls=False when order matters.
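A minimal tool definition in OpenAI's Chat Completions format looks like the following; the tool name and fields are illustrative, and this is the schema you would pass in the tools parameter, shown as a plain dict rather than a live API call.

```python
# Illustrative tool schema in OpenAI's Chat Completions "tools" format.
# The function name and parameters are made up for this example.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]
```

The parameters object is standard JSON Schema, which is why validation libraries can check the model's arguments against it directly.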
Anthropic's tool use follows the same pattern but wraps it in a content-block architecture. Claude returns tool_use content blocks with id, name, and input fields. You respond with tool_result content blocks that reference the original ID. Anthropic's documentation emphasizes the agent loop pattern: send message, receive tool call, execute, send result, let the model continue.
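Abridged, the paired content blocks look like this; the ID shown is a placeholder, since real IDs are generated by the API, and the tool and result values are illustrative.

```python
# What Claude returns when it wants a tool called (abridged):
tool_use_block = {
    "type": "tool_use",
    "id": "toolu_example",      # placeholder; real IDs come from the API
    "name": "get_weather",
    "input": {"city": "Paris"},
}

# What you send back after executing, keyed to the original block's id:
tool_result_block = {
    "type": "tool_result",
    "tool_use_id": tool_use_block["id"],
    "content": "18 degrees C, partly cloudy",
}
```

The explicit ID linkage is what lets a single assistant turn contain several tool_use blocks and still match each result to the right call.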
Google's Gemini function calling supports both automatic and manual modes. In automatic mode, Gemini executes the function call and feeds the result back without developer intervention. In manual mode, you get the structured call and handle execution yourself. Google also supports ANY mode, which forces the model to always predict a function call, useful when you want to guarantee tool use rather than leaving it optional.
The practical differences matter less than they used to. The Model Context Protocol (MCP), which Anthropic open-sourced in November 2024, is absorbing these differences into a shared standard. By March 2025, OpenAI had adopted MCP. By April 2025, Google DeepMind confirmed support. As of early 2026, MCP SDK downloads exceed 97 million per month. The protocol was donated to the Linux Foundation in December 2025, signaling that tool interfaces are becoming infrastructure rather than competitive differentiators.
For a broader comparison of the frameworks that wire these tool-calling APIs into agent architectures, see our framework roundup.
The Selection Problem
Choosing the right tool from a large catalog is the single hardest step in the lifecycle. Research confirms what practitioners already know: as the number of available tools grows, accuracy on selection drops.
Toolformer was one of the earliest approaches, using self-supervised learning where models teach themselves when and how to call tools from a handful of demonstrations. The limitation is the context window. Stuffing 200 tool descriptions into a prompt doesn't scale, and even if it fits, the model's attention over that many options degrades.
Gorilla, from UC Berkeley, took a different approach: fine-tuning LLaMA specifically on API call generation, using a retriever to pull relevant API documentation at inference time. Gorilla surpassed GPT-4 on API call accuracy in its initial benchmarks, but the model was specialized. It couldn't do anything else well.
The current generation of solutions splits into two camps. Retrieval-based approaches like AnyTool treat tool selection as an information retrieval problem, searching for the right tool at runtime. Graph-based approaches like AutoTool treat it as a prediction problem, using historical patterns to anticipate the next tool. Both reduce inference costs. Neither solves the cold-start problem, where an agent encounters a tool it has never used before and must reason from the documentation alone.
This cold-start scenario is both what benchmarks measure and what production systems routinely face. When your agent hits an unfamiliar API, it's back to reading the schema and hoping.
Failure Modes That Kill Production Systems
Four failure patterns dominate production tool use, and they're more stubborn than the benchmarks suggest.
Parameter hallucination is the most common. The model generates arguments that look structurally valid but contain invented values. A date field gets 2025-13-45. A user ID gets a plausible-looking but nonexistent string. An enum field gets a value that isn't in the allowed set. The JSON parses. The API rejects it. Or worse, the API accepts it and returns wrong data silently.
Tool hallucination means the model calls a tool that doesn't exist. It invents a function name based on what it thinks should be available. This happens more often when the model has seen similar tools in training data but the current environment offers a different set. Research on internal representations published in January 2026 found that hallucinated tool calls show distinct patterns in last-layer representations, enough to build a classifier that detects fabricated calls before they execute.
Interaction collapse is the failure mode explored in detail in When Your Agent Stops Using Tools. Models trained with reinforcement learning gradually abandon tool use and substitute internal reasoning, simulating what a tool would return instead of calling it. The model reasons about the calculator instead of using it. The outputs look reasonable. They're wrong at rates that matter.
Cascading failure occurs in multi-step chains when an error in step two corrupts steps three through five. The model doesn't recognize that a tool returned an error or returned partial data, and builds subsequent tool calls on a broken foundation. FutureAGI's analysis of production tool chains found that agents often begin with correct reasoning and valid tool selections but degrade mid-execution, with many failures traced to malformed JSON output, loss of structure, or the model forgetting earlier decisions.
The Reasoning Trap
Here's the finding that should worry every agent developer. A paper from October 2025 established a causal link between reasoning enhancement and tool hallucination. Training models to reason better with reinforcement learning actually makes them hallucinate tools more often.
The mechanism is intuitive once you see it. RL rewards the model for getting correct answers. If the model can simulate a tool call internally and get the right answer, that's rewarded just as much as actually calling the tool. Over time, the model learns that internal simulation is lower-variance than external tool calls that might fail, time out, or return unexpected formats. So it substitutes reasoning for action.
The researchers tested mitigation strategies. Prompt engineering offered minimal relief. Direct Preference Optimization (DPO) reduced hallucinations but degraded tool-use proficiency. There is a fundamental capability-reliability tradeoff: making models better reasoners makes them worse tool users, and fixing the tool-use problem costs you reasoning performance.
This tradeoff has practical implications for anyone building agents. You can't just train a model to be smarter and expect it to use tools more reliably. The training objective has to explicitly reward correct tool use at each step, not just correct final answers. ToolRLA, a March 2026 paper, demonstrates one approach: a multiplicative reward that decomposes into format validity, tool selection accuracy, parameter correctness, and domain compliance. Deployed on a financial advisory system handling 1,200+ daily queries, ToolRLA improved task completion from 62% to 91% and cut tool invocation errors from 38% to 14% over three months.
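The shape of a multiplicative reward is easy to show; this is a sketch of the general idea, not ToolRLA's exact formula or component definitions, which the paper specifies in full.

```python
def tool_step_reward(format_ok: bool, selection: float,
                     params: float, compliance: float) -> float:
    """Multiplicative reward decomposition (sketch). Each component is in
    [0, 1]; because they multiply, a zero anywhere (e.g. malformed JSON)
    zeroes the whole step reward, so the optimizer can't trade a broken
    format for a good parameter score."""
    return float(format_ok) * selection * params * compliance
```

Contrast this with an additive scheme, where a model could recover most of the reward despite emitting unparseable output; the multiplicative form makes every component a hard requirement.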
Building Reliable Tool-Use Agents
Given these failure modes, here's what actually works in production.
Constrain the tool catalog. Agents perform better with fewer, well-documented tools than with expansive catalogs. If your agent needs access to 100 APIs, use a retrieval layer to surface only the 5-10 most relevant ones per query. Every tool in context is a potential distraction.
Validate before executing. Don't pass the model's tool call directly to your API. Check required parameters exist. Validate types. Confirm enum values against the allowed set. Reject malformed calls and ask the model to retry. This simple guardrail catches the majority of parameter hallucination errors.
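A pre-execution check can be sketched in a few lines against a JSON-Schema-style parameter spec; a production system would use a full validator like jsonschema, but the logic is the same, and the schema here is hypothetical.

```python
def validate_call(args: dict, schema: dict) -> list[str]:
    """Check model-generated arguments against a JSON-Schema-style spec.
    Returns a list of error strings; empty means the call may proceed."""
    errors = []
    props = schema.get("properties", {})
    type_map = {"string": str, "number": (int, float),
                "integer": int, "boolean": bool}
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required parameter: {name}")
    for name, value in args.items():
        spec = props.get(name)
        if spec is None:
            errors.append(f"unknown parameter: {name}")  # hallucinated key
            continue
        expected = type_map.get(spec.get("type"))
        if expected and not isinstance(value, expected):
            errors.append(f"{name}: expected {spec['type']}")
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name}: {value!r} not in allowed set")
    return errors
```

Rejected calls go back to the model with the error list, which usually fixes the call on the first retry.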
Add reasoning checkpoints. Forcing the model to articulate why it's choosing a specific tool before calling it improves selection accuracy. As covered in Tools That Think Back, this single structural change catches errors the model would otherwise miss.
Use dense reward signals for training. If you're fine-tuning or running RL on tool-use tasks, sparse terminal rewards (right answer = reward, wrong answer = nothing) are destructive. Intermediate rewards at each tool-calling step give the optimizer gradient signal through the middle of long trajectories.
Design for graceful degradation. Your agent will call the wrong tool. Your API will time out. The response will be truncated. Build retry logic, fallback paths, and explicit error handling into the orchestration layer, not the prompt. The agent should know that a tool call failed and have a defined recovery path, not just try again with the same parameters.
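A minimal retry wrapper for the orchestration layer might look like this; the attempt count and backoff are assumptions to tune per tool.

```python
import time

def call_with_retry(fn, *args, attempts=3, base_delay=0.5, **kwargs):
    """Retry a flaky tool call with exponential backoff. The final error is
    surfaced as data rather than swallowed, so the agent has a defined
    recovery path instead of silently retrying forever."""
    last_error = "no attempts made"
    for i in range(attempts):
        try:
            return {"ok": True, "result": fn(*args, **kwargs)}
        except Exception as exc:
            last_error = str(exc)
            if i < attempts - 1:
                time.sleep(base_delay * (2 ** i))
    return {"ok": False, "error": last_error}
```

The key design choice is that this lives in code, not in the prompt: the model is told a tool failed, but it never decides whether to retry a transient timeout.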
Monitor tool-use patterns in production. Track which tools get called, in what order, with what success rates. Observability is how you find the 15% of tool calls that fail before they compound into a 56% workflow failure rate.
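The minimum viable version of this observability is a per-tool success counter; a real deployment would ship these to a metrics backend, but even an in-process tracker like the sketch below surfaces which tools are dragging chains down.

```python
from collections import defaultdict

class ToolStats:
    """Track per-tool call counts and success rates (in-process sketch;
    production systems would export these to a metrics backend)."""

    def __init__(self):
        self.calls = defaultdict(int)
        self.failures = defaultdict(int)

    def record(self, tool: str, ok: bool) -> None:
        self.calls[tool] += 1
        if not ok:
            self.failures[tool] += 1

    def success_rate(self, tool: str) -> float:
        n = self.calls[tool]
        return 1.0 if n == 0 else 1 - self.failures[tool] / n
```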
Where This Is Heading
The tool-use stack is consolidating. MCP is becoming the standard interface layer. Frameworks like LangGraph, AutoGen, and CrewAI are converging on similar orchestration patterns. The research is shifting from "can models call functions?" to "can models use tools reliably across long workflows?"
The hardest unsolved problem is what the survey on tool-use evolution calls the shift from isolated invocation to multi-tool orchestration. Modern benchmarks now model topological complexity along a spectrum from sequential chains to directed acyclic graphs, where tool execution order follows causal dependencies. We're not just asking models to call one API. We're asking them to plan and execute multi-step workflows where the output of tool three determines which tool to call at step four.
That's a planning problem, not a function-calling problem. And planning is where current models still struggle most. The true cost of running agents in production isn't the API calls. It's the engineering around reliability, monitoring, and failure recovery that makes tool use work at scale.
If you're building your first agent, start with two or three tools, validate every call, and add reasoning checkpoints. The models are good enough to use tools. They're not good enough to be trusted with tools unsupervised.
For more on the types of agents that use these patterns, from simple ReAct loops to full multi-agent orchestrations, see our agent taxonomy.
Sources:
- Berkeley Function Calling Leaderboard V4, UC Berkeley Gorilla Project
- The Reasoning Trap: How Enhancing LLM Reasoning Amplifies Tool Hallucination, arxiv, October 2025
- AutoTool: Efficient Tool Selection for Large Language Model Agents, arxiv, November 2025
- ToolRLA: Multiplicative Reward Decomposition for Tool-Integrated Agents, arxiv, March 2026
- ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs, arxiv, 2023 (ICLR 2024)
- Gorilla: Large Language Model Connected with Massive APIs, UC Berkeley, 2023
- A Comprehensive Analysis of Failed Parameter Filling, EMNLP 2025 Findings
- Internal Representations as Indicators of Hallucinations in Agent Tool Selection, arxiv, January 2026
- The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool Orchestration, arxiv, March 2026
- Model Context Protocol Specification, Anthropic / Linux Foundation, November 2025
- Why the Model Context Protocol Won, The New Stack
- How Tool Chaining Fails in Production LLM Agents, FutureAGI
- Toolformer: Language Models Can Teach Themselves to Use Tools, Meta AI, 2023
- ReAct: Synergizing Reasoning and Acting in Language Models, Yao et al., 2022