
Function Calling Is the Interface AI Research Forgot

OpenAI shipped function calling in June 2023. Anthropic followed with tool use. Google added it to Gemini. The capability felt like plumbing, necessary infrastructure that would quietly improve over time while researchers chased more interesting problems like reasoning or memory.

That assumption was wrong. Two years later, function calling isn't plumbing. It's a research frontier hiding in plain sight. Models that can write flawless Python still botch API parameter extraction 30% of the time. Multi-turn tool orchestration collapses when sparse rewards meet expensive exploration. And nobody's figured out how to make these systems work reliably across languages other than English.

The gap between "models that can call functions" and "agents that can actually use tools in production" is wider than the industry let on. I've read eight papers on function calling this month, and the pattern is clear: we built the interface before we understood the problem.

What Function Calling Actually Is

Strip away the marketing and function calling is parameter extraction with consequences. The model reads natural language, maps it to structured JSON, and hands that JSON to an external system. A database query. An API call. A file operation. The model doesn't execute code. It translates intent into structured calls that something else executes.

The canonical example: a user asks "What's the weather in Tokyo?" The model doesn't scrape weather.com. It generates {"function": "get_weather", "location": "Tokyo"} and returns that to your application layer. Your code hits the weather API. The model takes the response and generates a natural language answer.
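
To make that division of labor concrete, here is a minimal sketch of the round trip in Python. The schema layout, the handler, and the hard-coded model output are illustrative stand-ins, not any vendor's exact wire format.

```python
import json

# Minimal round trip: the schema the model sees, the structured call it
# returns, and the application layer that actually executes it.
WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Get current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City name, e.g. Tokyo"}
        },
        "required": ["location"],
    },
}

def get_weather(location: str) -> dict:
    # Stand-in for the real weather API call your code makes.
    return {"location": location, "temp_c": 21, "conditions": "clear"}

HANDLERS = {"get_weather": get_weather}

# What the model hands back: a function name plus JSON-encoded arguments.
model_output = {"name": "get_weather", "arguments": '{"location": "Tokyo"}'}

# Your code, not the model, runs the call; the result goes back to the model
# so it can generate the natural-language answer.
args = json.loads(model_output["arguments"])
print(HANDLERS[model_output["name"]](**args))
```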

This sounds straightforward. It's not. The model has to identify when to call a function versus answering directly. It has to select the right function from dozens or hundreds of options. It has to extract parameters from messy natural language where "Tokyo" might appear as "東京" or "the capital of Japan." It has to handle partial information, ambiguous requests, and edge cases your API documentation doesn't cover.

OpenAI's implementation uses a specific system message format that primes the model to recognize function schemas and output valid JSON. Anthropic's approach treats tools as part of the prompt structure with explicit tool definition blocks. Google's Gemini uses a unified format similar to OpenAI but with different schema validation rules. The mechanics differ but the core challenge is identical: how do you get a language model to reliably output structured data that external systems can consume?

The answer turned out to be supervised fine-tuning on millions of synthetic examples. Every major provider built custom datasets of function calls paired with natural language inputs. They trained models to recognize function signatures, extract parameters, and handle error cases. The models got good at this specific task. Good enough that developers started building real applications.

Then the edge cases arrived. A model trained primarily on English function descriptions fails when the user speaks Mandarin. A system that works for single-turn calls breaks when you need three sequential API operations to fulfill a request. A carefully tuned prompt that extracts parameters with 95% accuracy on your test set drops to 70% when users start abbreviating field names or using synonyms you didn't anticipate.

Function calling looked solved. It wasn't.

The Multi-Turn Problem Nobody Expected

Single-turn function calling is a solved problem in controlled environments. You ask for the weather, the model calls get_weather, you display the result. Production systems don't work like that. Real agents need to chain multiple tool calls together. Check inventory, calculate shipping, verify payment, update the order database, send a confirmation email. Five sequential operations, each dependent on the previous result, with branching logic based on what each call returns.

This is where the current generation of function calling models starts to struggle. The RC-GRPO paper from Zhong et al. documents the core issue: multi-turn tool calling creates sparse reward signals that make reinforcement learning ineffective. You either complete the entire sequence successfully or you don't. There's no partial credit for getting four out of five calls right if the final call fails.

Traditional RLHF approaches treat each model output as an independent decision and assign rewards accordingly. But tool orchestration doesn't work like that. The value of calling get_inventory isn't determined until you attempt the final update_order call three steps later. If that final call fails because get_inventory returned stale data, the entire sequence was worthless. The model gets a reward of zero for a chain of decisions where four out of five were correct.

This creates a training problem. Standard Group Relative Policy Optimization (GRPO) compares outcomes within a batch of rollouts and assigns advantage estimates based on relative performance. When most rollouts in a group receive identical rewards, all zeros or all ones, the advantage signal vanishes. The model can't learn which specific decisions led to success versus failure. Updates stall. Performance plateaus.
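
A toy calculation makes the failure mode visible. This is a simplified version of a group-relative advantage, not the paper's exact objective: when every rollout in the group earns the same reward, every advantage collapses to zero.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    # Toy group-relative advantage: each rollout's reward minus the group mean,
    # normalized by the group's standard deviation.
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Mixed outcomes within the group: a usable learning signal.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))   # ~[1, -1, 1, -1]

# Sparse multi-turn rewards: every rollout fails the final call, so every
# reward is 0, every advantage is 0, and the update carries no signal.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))   # [0, 0, 0, 0]
```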

Zhong's solution is reward conditioning. Instead of treating all rollouts equally, RC-GRPO explicitly conditions the policy on the final reward signal and uses that to shape advantage estimates even when within-group variation is low. The model learns to distinguish between partial paths that eventually succeed versus those that fail, even when the immediate feedback is sparse.

The improvement is measurable. On ToolBench, RC-GRPO improved pass rates from 67.3% to 71.8% on multi-turn tasks where standard GRPO had stalled. That 4.5 percentage point gain represents the difference between "works most of the time" and "works reliably enough to ship." For systems handling database writes or financial transactions, that gap matters.

But reward shaping only solves half the problem. The other half is data.

Why Synthetic Function Calling Data Is Harder Than It Looks

Every major function calling model was trained on synthetic data. Not logs of real user interactions: those don't exist at sufficient scale, and the ones that do can't be shared for privacy reasons. Synthetic data. Millions of programmatically generated examples mapping natural language to function calls.

The naive approach: enumerate all possible parameter combinations, generate template sentences for each, train the model. This produces models that work perfectly on your test set and fail unpredictably in production. The Greenstein et al. paper on linguistic diversity documents why: real users don't speak in templates, and the distribution of how they describe the same request is wider than synthetic data generators assume.

Their analysis of existing function calling datasets found that 72% of examples used identical phrasing patterns for the same parameter extraction task. "Get the weather in [city]" repeated thousands of times with different city names. Models trained on this data learned the template, not the underlying task. When a user says "what's it like outside in Tokyo right now," the model stumbles. The words don't match the pattern.

The solution requires two types of diversity: linguistic variation and argument structure variation. Linguistic diversity means generating multiple natural ways to express the same request. "What's the weather," "how's the weather," "tell me the weather," "weather report for," "is it raining in." Not templates with slots. Genuinely varied phrasings that capture how humans actually talk.

Argument diversity means varying how parameters are specified. Don't just vary the city name. Vary whether the user provides a city at all. Train on incomplete requests. Ambiguous locations. Implicit context from previous turns. "What about tomorrow?" as a follow-up to a weather query. The argument structure changes even though the underlying function is the same.

Greenstein's approach generates synthetic data by sampling from distributions over both linguistic patterns and argument structures. The result: models that generalize better to real user inputs because they've seen a wider range of ways to express the same semantic intent. Their best model improved parameter extraction accuracy by 8.2 percentage points on out-of-distribution test cases compared to models trained on traditional synthetic data.
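
A toy generator shows the idea, though not Greenstein's actual pipeline: sample over phrasings and argument structures together instead of filling one template with different city names.

```python
import random

PHRASINGS = [
    "what's the weather in {city}",
    "how's it looking outside in {city}",
    "is it raining in {city} right now",
    "weather report for {city} please",
    "{city} forecast?",
]

# Vary the argument structure too: different surface forms for the same city,
# plus requests where the user never names a location at all.
ARGUMENT_STRUCTURES = [{"city": "Tokyo"}, {"city": "東京"}, {"city": None}]

def sample_example() -> dict:
    args = random.choice(ARGUMENT_STRUCTURES)
    if args["city"] is None:
        # Incomplete request: the model has to ask, or infer location from context.
        return {"input": "what's the weather like today",
                "target": {"name": "get_weather", "arguments": {}}}
    return {"input": random.choice(PHRASINGS).format(city=args["city"]),
            "target": {"name": "get_weather", "arguments": {"location": args["city"]}}}

for _ in range(3):
    print(sample_example())
```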

This matters more as you add tools. A system with 5 functions can survive with template-based training data. A system with 50 functions can't. The combinatorial explosion of how users might describe requests that map to 50 different schemas is too large to cover with templates. You need models that learned the underlying structure of parameter extraction, not surface patterns.

The Reasoning Gap

Here's a pattern I've seen in every production function calling system: the model makes the right call with the wrong parameters. It correctly identifies that it should query the database. It generates syntactically valid SQL. The query returns zero rows because it extracted the wrong date range or misinterpreted which field maps to the user's intent.

Parameter extraction is a reasoning task disguised as a formatting task. When a user says "show me last quarter's sales," you need to figure out what "last quarter" means (which depends on today's date), map "sales" to the correct table and column names (which depends on your schema), and construct a query that captures the user's actual intent (which might be total revenue, units sold, or revenue by product line depending on context).

Current function calling models treat this as pure pattern matching. The Wei et al. paper on think-augmented function calling demonstrates what happens when you make reasoning explicit. Their approach adds a structured thinking step before parameter generation. The model outputs its reasoning about what the user wants, how that maps to available function parameters, and which assumptions it's making about ambiguous inputs.

The format looks like this: instead of going straight from user input to function call, the model generates an intermediate reasoning trace. "The user asked about last quarter. Today is October 15, 2024, so last quarter was Q3 (July-September 2024). 'Sales' likely refers to revenue given the business context. The query should sum the revenue column grouped by the date range July 1 to September 30, 2024."

Only after generating this reasoning does the model produce the actual function call. The reasoning trace acts as a scratchpad that forces the model to work through the semantic mapping before committing to parameters.
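
Consuming that output looks roughly like the sketch below. The reasoning tag and JSON layout are assumptions for illustration, not the exact format from Wei et al.; the point is the pipeline shape: reason first, emit the call second, parse only the call.

```python
import json
import re

# Assumed output format: a reasoning block followed by the function call JSON.
model_output = """
<reasoning>
The user asked about last quarter. Today is 2024-10-15, so last quarter is
Q3 2024 (July 1 to September 30). "Sales" maps to the revenue column.
</reasoning>
{"name": "query_sales", "arguments": {"start_date": "2024-07-01", "end_date": "2024-09-30", "metric": "revenue"}}
"""

def parse_think_augmented(output: str) -> dict:
    # Strip the reasoning scratchpad; keep it around for logging if useful.
    without_reasoning = re.sub(r"<reasoning>.*?</reasoning>", "", output, flags=re.DOTALL)
    return json.loads(without_reasoning.strip())

call = parse_think_augmented(model_output)
print(call["name"], call["arguments"])
```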

The improvement is significant. On their benchmark of 500 real-world function calling tasks with ambiguous or incomplete parameters, think-augmented function calling improved accuracy from 71.4% to 83.9%. The biggest gains came on tasks requiring temporal reasoning, unit conversion, or mapping between domain-specific terminology and database schemas.

The cost is latency and tokens. Generating explicit reasoning adds roughly 150 tokens per function call, which translates to an extra 200-300ms at current inference speeds. For applications where correctness matters more than speed, that's a reasonable trade. For chatbots, it's probably not.

This creates an architectural decision that most teams haven't had to make explicit yet: do you want fast function calling or accurate function calling? Current models force you to choose. The research suggests the ceiling for pure pattern-matching approaches is somewhere around 85% accuracy on real-world tasks. Getting past that requires reasoning. If you're building agents that need to handle complex multi-step workflows, understanding how reasoning tokens actually work becomes critical infrastructure knowledge.

Tool Calling Breaks Across Languages

English function calling works. Other languages don't. The Luo et al. paper on multilingual reliability documents the scale of the problem: state-of-the-art models that achieve 89% accuracy on English function calling tasks drop to 62% accuracy on Mandarin and 58% on Arabic for identical tasks with translated inputs.

This isn't a translation problem. The function schemas are language-agnostic. The parameters are the same. The underlying API doesn't care whether the request came from an English or Mandarin speaker. The model just fails to correctly extract parameters from non-English inputs.

The root cause: training data distribution. Function calling datasets are overwhelmingly English. Even multilingual models like GPT-4 and Claude see far more English function calling examples during training than examples in other languages. They learn the task in English and transfer incompletely to other languages.

The failure modes are predictable. Models default to English parameter names even when the user spoke Mandarin. They hallucinate parameters that exist in English versions of the schema but not the actual schema being used. They misparse language-specific conventions like date formats or proper name ordering.

Testing this revealed something worse: the errors aren't consistent. The same model given the same Mandarin input will sometimes extract parameters correctly and sometimes fail. The accuracy isn't deterministically bad. It's stochastically unreliable, which is harder to work around.

The current fix is language-specific fine-tuning, which means maintaining separate models or separate fine-tuning runs for each language you need to support. That's expensive and doesn't scale. The research consensus is that we need genuinely multilingual function calling datasets, not English datasets with machine-translated inputs. Those don't exist yet at sufficient scale.

For production systems targeting non-English users, the recommendation right now is to translate user inputs to English, perform function calling in English, and translate outputs back. This works but adds latency and introduces translation errors. It's a workaround, not a solution.
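
In code, the workaround looks roughly like this. Everything here is a stub: swap in whatever translation service and function calling client you actually use.

```python
def translate_to_english(text: str, source: str) -> str:
    return text  # stub: real code calls a translation API

def translate_from_english(text: str, target: str) -> str:
    return text  # stub

def call_function_model(text: str, tool_result: dict | None = None) -> dict:
    # stub: real code sends the text (plus any tool result) to the model
    if tool_result is None:
        return {"name": "get_weather", "arguments": {"location": "Tokyo"}}
    return {"answer": f"It is {tool_result['temp_c']} degrees in Tokyo."}

def execute_tool(call: dict) -> dict:
    return {"temp_c": 21}  # stub: your application layer runs the real API call

def handle_request(user_text: str, user_language: str) -> str:
    english_text = translate_to_english(user_text, source=user_language)  # 1. pivot to English
    tool_call = call_function_model(english_text)                         # 2. call functions in English
    tool_result = execute_tool(tool_call)
    english_answer = call_function_model(english_text, tool_result=tool_result)
    return translate_from_english(english_answer["answer"], target=user_language)  # 3. translate back

print(handle_request("東京の天気は？", user_language="ja"))
```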

How OpenAI, Anthropic, and Google Actually Do This

The public APIs look similar but the underlying implementations diverge in ways that matter for production systems.

OpenAI's approach treats functions as part of the chat completion API. You pass an array of function definitions in the request, and the model decides whether to respond with text or a function call. The function schemas use JSON Schema format. The model can call multiple functions in parallel if the task requires it. The response includes tool call objects with names and arguments (the original API exposed a single function_call object), or it returns a regular text message.

Anthropic's implementation separates tools from the conversation more explicitly. Tools are defined outside the message array with their own schema block. The model generates tool_use blocks within its response that specify which tool to call and with what parameters. Multiple tool calls in a single response are supported, but they're sequential rather than parallel. You can optionally provide tool results back to the model to continue the conversation.

Google's Gemini uses a format closer to OpenAI's but with tighter schema validation. Function definitions must include explicit parameter types, descriptions, and whether parameters are required. The model can generate multiple function calls in one response, and there's built-in support for iterative refinement where the model can request clarification before finalizing a function call.
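
For concreteness, here is roughly how the same weather tool is declared in the OpenAI-style and Anthropic-style formats. This is simplified and the exact fields evolve, so treat current provider documentation as the source of truth.

```python
# Shared JSON Schema for the tool's parameters.
weather_parameters = {
    "type": "object",
    "properties": {
        "location": {"type": "string", "description": "City name, e.g. Tokyo"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["location"],
}

# OpenAI: tools ride along with the chat completion request.
openai_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": weather_parameters,
    },
}

# Anthropic: tools are defined in their own block, with an input_schema.
anthropic_tool = {
    "name": "get_weather",
    "description": "Get current weather for a city.",
    "input_schema": weather_parameters,
}
```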

The practical differences emerge when you try to handle errors. OpenAI's API returns invalid JSON if the model fails to generate proper parameters, which means you need try-catch blocks and retry logic. Anthropic's structured tool_use format makes errors easier to catch but harder to recover from programmatically. Google's strict validation catches more errors upfront but rejects function calls that might have worked with looser parsing.
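
In practice that means wrapping every call in validation and retry logic. The sketch below is generic and not tied to any provider; request_tool_call is a hypothetical stand-in for your client code.

```python
import json

def safe_tool_call(request_tool_call, required: set[str], max_retries: int = 2) -> dict:
    """Retry a tool call until the arguments parse as JSON and contain the required fields."""
    last_error = None
    for attempt in range(max_retries + 1):
        raw = request_tool_call(attempt=attempt, previous_error=last_error)
        try:
            args = json.loads(raw)                  # catch malformed JSON
        except json.JSONDecodeError as exc:
            last_error = f"invalid JSON: {exc}"
            continue
        missing = required - args.keys()
        if missing:                                 # catch schema violations
            last_error = f"missing required fields: {sorted(missing)}"
            continue
        return args
    raise ValueError(f"tool call failed after {max_retries + 1} attempts: {last_error}")

# Demo with a fake client that fails once, then succeeds. Feeding the previous
# error back into the retry lets the model self-correct instead of repeating itself.
responses = iter(['{"location": ', '{"location": "Tokyo"}'])
print(safe_tool_call(lambda attempt, previous_error: next(responses), required={"location"}))
```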

None of these approaches handle the truly hard cases: partial information, conflicting parameters, or requests that map to multiple possible function sequences. For those, you're building orchestration logic on top of whichever API you choose. The model gives you primitives. You handle the composition.

The Orchestration Layer Nobody Wants to Build

Function calling is the primitive. Tool orchestration is the system. The gap between them is all the code you have to write to make agents work in production.

A real example from the AgentSkiller paper: a user asks an agent to "plan a trip to Paris next month." This maps to at least six sequential tool calls: check calendar availability, search flights, compare hotel prices, check visa requirements, estimate total budget, create itinerary. Each call depends on results from previous calls. Some calls might fail and require retries with adjusted parameters. The final itinerary needs to be coherent even if some steps returned suboptimal results.

Current function calling APIs give you the ability to make each individual call. They don't give you the orchestration layer that decides call order, handles dependencies, manages state between calls, implements retry logic, or composes results into a coherent output.

Most production teams build this orchestration layer themselves. There's no standard approach. Some use state machines. Some use DAGs. Some use LLM-generated plans that get executed by a separate runtime. The AgentSkiller paper tried to systematize this by training models on cross-domain task sequences, but their approach still requires significant custom integration work.
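
A minimal sketch of the shape that glue code usually takes: an ordered list of steps, shared state passed between them, and per-step retries. The step functions are hypothetical placeholders, not the AgentSkiller approach.

```python
# Each step takes the accumulated state and returns an updated copy.
def check_inventory(state):    return {**state, "in_stock": True}
def calculate_shipping(state): return {**state, "shipping": 9.99}
def verify_payment(state):     return {**state, "paid": True}
def update_order(state):       return {**state, "order_id": "ord_123"}
def send_confirmation(state):  return {**state, "email_sent": True}

PIPELINE = [check_inventory, calculate_shipping, verify_payment,
            update_order, send_confirmation]

def run_pipeline(steps, state, max_retries=2):
    for step in steps:
        for attempt in range(max_retries + 1):
            try:
                state = step(dict(state))   # copy so a failed attempt can't corrupt state
                break
            except Exception as exc:        # real code would catch narrower errors
                if attempt == max_retries:
                    raise RuntimeError(f"{step.__name__} failed: {exc}") from exc
    return state

print(run_pipeline(PIPELINE, {"sku": "ABC-1", "qty": 2}))
```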

The fundamental problem: tool orchestration is domain-specific. The sequence of API calls needed to book a trip is completely different from the sequence needed to debug a production incident or analyze financial data. You can't pre-train a general-purpose orchestrator because the logic is tied to your specific tools and business rules.

This is where most agentic systems spend their engineering budget. Not on function calling itself; that works well enough. On the orchestration layer that makes function calling useful. The research community has largely ignored this problem because it looks like an engineering challenge rather than a research problem. But the engineering challenge is blocking real deployment. If you're trying to build practical agent systems, understanding the full architecture from prompt to production matters more than optimizing any single component.

What This Actually Changes

Function calling looked like solved infrastructure. The research says otherwise. Multi-turn tool orchestration doesn't work reliably when rewards are sparse. Synthetic training data creates models that memorize patterns rather than learning the underlying task. Reasoning about parameters matters more than we thought. And the whole stack breaks outside English.

These aren't incremental improvements to a working system. They're fundamental gaps in capability that block production deployment for anything more complex than single-turn API calls.

The immediate fix for teams building agents right now: treat function calling as probabilistic, not deterministic. Budget for 70-80% accuracy on real-world inputs even if your test set shows 95%. Build retry logic. Use explicit reasoning for high-stakes parameter extraction. Test extensively in your target languages. And plan to spend more engineering time on orchestration than on the function calling itself.

The research direction that matters: better training data that captures linguistic and structural diversity, reward shaping approaches that work with sparse multi-turn signals, and multilingual datasets that don't rely on machine translation. The primitive works. The composition layer is still broken.

Function calling became commodity infrastructure faster than it became reliable infrastructure. That's the gap production systems are now discovering. The vendors shipped the interface. The research is still catching up to the implementation. And if agents are going to remember context across these complex multi-step interactions, solving the memory problem becomes just as critical as getting the function calls right.
