By Tyler Casey · AI-assisted research & drafting · Human editorial oversight
JPMorgan's COIN system automates legal document review that once consumed 360,000 staff hours a year, the workload of an entire department. Error rates dropped 80%. The system isn't a single model working alone. It's multiple AI agents coordinating across document classification, entity extraction, and clause verification, each specialized for a different aspect of contract analysis. When one agent flags ambiguous language, another retrieves similar precedents from archives, while a third estimates legal risk. The 30% cost savings and sub-second response times didn't come from deploying better individual models. They came from agents that learned to divide labor and recombine insights without human orchestration at every step.
This is what multi-agent systems actually deliver when they work—not science fiction swarms or emergent superintelligence, but the economic return of coordination. The challenge is that coordination at scale introduces failure modes that single-agent architectures never encounter. For every JPMorgan COIN, there's a ChatDev implementation where multi-agent collaboration achieves 25% correctness—worse than a single GPT-4 working alone. For every 80.9% improvement on parallelizable financial reasoning tasks, there's a 39% to 70% degradation on sequential planning. The line between force multiplication and error amplification turns out to be thinner than the architecture diagrams suggest.
What Is a Multi-Agent System?
The distinction that matters isn't how many agents you deploy — it's where failures originate. A fleet of identical workers processing documents in parallel fails independently: one crashes, the others continue, and you lose exactly one worker's output. A multi-agent system fails through interaction. One agent's confident but wrong classification cascades into another agent's downstream reasoning, which corrupts a third agent's aggregation, and now you've amplified a small error into a systemic one. The failure mode tells you what you're actually building.
Wooldridge and Jennings saw part of this clearly in the 1990s. Their framework — agents as autonomous entities perceiving environments, reasoning about goals, and coordinating under distributed information — got the fundamentals right. Agent autonomy and distributed control remain the defining characteristics. What their framework couldn't anticipate is the LLM era, where "reasoning" isn't rule-based inference over structured knowledge but probabilistic generation over compressed training distributions. A 1990s multi-agent system coordinated deterministic actors with predictable outputs. Today's multi-agent systems coordinate stochastic actors whose outputs shift with temperature settings, prompt phrasing, and context window contents. The coordination problem didn't just scale — it changed in kind.
If you can achieve the same outcome by running one model multiple times in sequence, you don't have a multi-agent system. You have batching. Multi-agent systems require differentiation — specialized tools, distinct training, or divergent objectives that force genuine coordination rather than parallel execution. The moment agents need to negotiate, share partial state, or resolve conflicting conclusions, you've crossed into territory where coordination itself becomes the primary engineering challenge.
Why Multiple Agents? The Case for Distributed Intelligence
The performance gap on parallelizable tasks is too large to ignore. The 90% Jump documents enterprise-specific cases where multi-agent coordination delivers performance gains that single agents cannot match. When Google DeepMind tested centralized coordination on financial reasoning—one orchestrator agent distributing analysis of revenue trends, cost structures, and market comparisons to specialist agents—they measured an 80.9% improvement over single-agent approaches. The task structure mattered: distinct subtasks with minimal interdependence, clear success criteria for each component, and straightforward aggregation of results.
JPMorgan's COIN demonstrates this at production scale. The system doesn't just parallelize document review—it specializes. One set of agents handles standard lease agreements with known clause templates. Another processes merger contracts requiring cross-document consistency checks. A third manages edge cases that fall outside trained categories, routing them to human review with annotated confidence scores. The 360,000-hour annual processing volume and 80% error reduction come from this specialization, not from deploying a single general-purpose model at scale. The system knows what it's good at and routes accordingly.
Klarna's customer service agent provides a different profile: 2.3 million conversations in the first month, resolution time dropping from 11 minutes to under 2 minutes, satisfaction scores matching human agents. Then, a year later, Klarna quietly resumed hiring human agents. The gap between pilot metrics and sustained production reveals where multi-agent systems hit friction. The early wins came from handling high-volume, low-complexity queries where coordination wasn't needed—agents working in parallel, not collaboratively. The limitations appeared with complex cases requiring genuine multi-step coordination: verifying account details, checking inventory across systems, coordinating with fraud detection. These tasks need agents that don't just work simultaneously, but communicate state and negotiate outcomes.
The question isn't whether multiple agents can outperform one — the evidence says yes, for the right task structures. It's whether coordination overhead consumes the efficiency gains before they materialize.
How Agents Coordinate: Communication Patterns
Agent coordination collapses into four fundamental patterns, each with different tradeoffs for latency, reliability, and error propagation.
Broadcast communication sends messages to all agents simultaneously. Simple to implement, terrible at scale. Every agent processes every message regardless of relevance, creating O(n²) communication overhead as agent count grows. Useful for initialization or global state updates, catastrophic for ongoing coordination. When SocialVeil tested communication barriers across 720 scenarios, broadcast systems showed 45% reductions in mutual understanding—not from network failures, but from agents drowning in irrelevant context that diluted critical signals.
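To see why, count message deliveries per round. A back-of-envelope sketch (the hierarchy column is included only for contrast, and both counts assume every agent speaks once per round):

```python
def deliveries_per_round(n_agents: int, pattern: str) -> int:
    """Messages delivered in one round where every agent sends once."""
    if pattern == "broadcast":
        # Every agent's message goes to all n-1 others: O(n^2) deliveries.
        return n_agents * (n_agents - 1)
    if pattern == "hierarchy":
        # Specialists report up to one manager, manager replies down: O(n).
        return 2 * (n_agents - 1)
    raise ValueError(f"unknown pattern: {pattern}")

for n in (4, 10, 50):
    print(n, deliveries_per_round(n, "broadcast"), deliveries_per_round(n, "hierarchy"))
# 4: 12 vs 6; 10: 90 vs 18; 50: 2450 vs 98
```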
Peer-to-peer communication lets agents message specific partners. Scales better but requires agents to know who needs what information—either through hardcoded roles or learned policies. This is where DyTopo makes its contribution: instead of fixed peer relationships, agents dynamically reconfigure their communication networks during execution. An agent that repeatedly provides low-quality information sees connections pruned. High-performing specialists gain direct channels to coordinate without routing through intermediaries. On collaborative navigation tasks in multi-agent reinforcement learning benchmarks, dynamic topology outperformed static peer networks by 23%. The mechanism isn't exotic—it's agents voting on epistemic trust through network structure rather than explicit reputation scores.
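DyTopo's learned reconfiguration policy doesn't reduce to a few lines, but the mechanism can be sketched: keep a per-edge trust score, nudge it with feedback on message quality, and prune edges that fall below a threshold. Everything below (the update rule, the constants, the agent names) is an illustrative assumption rather than the paper's algorithm:

```python
class DynamicTopology:
    """Illustrative trust-based edge pruning; hypothetical constants."""

    def __init__(self, agents, init_trust=0.5, prune_below=0.2, lr=0.1):
        self.trust = {(s, r): init_trust
                      for s in agents for r in agents if s != r}
        self.prune_below = prune_below
        self.lr = lr

    def feedback(self, sender, receiver, was_useful: bool):
        # Nudge the edge score toward 1 (useful) or 0 (noise).
        target = 1.0 if was_useful else 0.0
        key = (sender, receiver)
        self.trust[key] += self.lr * (target - self.trust[key])

    def senders_for(self, receiver):
        # A receiver only listens on edges above the trust threshold;
        # senders that repeatedly provide noise get pruned.
        return [s for (s, r), t in self.trust.items()
                if r == receiver and t >= self.prune_below]

topo = DynamicTopology(["planner", "retriever", "critic"])
for _ in range(20):
    topo.feedback("critic", "planner", was_useful=False)
print(topo.senders_for("planner"))  # ['retriever']: 'critic' pruned after repeated noise
```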
The architectural leap is that communication topology becomes a learned policy, not a design-time constant. Agents don't just decide what to say—they decide who deserves to hear it. This introduces new failure modes: if a compromised agent influences topology decisions, it can isolate honest agents from coordination channels or route critical information through malicious intermediaries. The tradeoff is performance versus attack surface, and current systems optimize for the former.
Hierarchical communication imposes structure: a manager agent coordinates specialists, routing information vertically rather than allowing arbitrary peer connections. CrewAI and similar frameworks implement this explicitly—you define roles, workflows, and reporting relationships upfront. The advantage is predictable failure modes: if a specialist fails, the manager detects it and routes around it. The cost is rigidity: if the predefined hierarchy doesn't match task structure, coordination becomes a bottleneck rather than an accelerator. Google's research found that hierarchical architectures contain error amplification to 4.4x (compared to 17.2x for independent agents), but at the expense of the flexibility that makes dynamic coordination useful.
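The pattern itself is compact, whatever framework wraps it. A minimal sketch of manager-style routing with a fallback path (the role names and keyword routing stub are illustrative stand-ins for LLM calls, not CrewAI's actual API):

```python
from typing import Callable

Specialist = Callable[[str], str]

def run_hierarchy(subtasks: dict[str, str],
                  specialists: dict[str, Specialist],
                  route: Callable[[str], str]) -> dict[str, str]:
    """Manager loop: route each subtask by role, fall back on failure."""
    results = {}
    for task_id, payload in subtasks.items():
        role = route(payload)  # the manager's routing decision
        try:
            results[task_id] = specialists[role](payload)
        except Exception:
            # Hierarchy's advantage: failures surface at one choke point,
            # so the manager can reroute instead of letting errors cascade.
            results[task_id] = specialists["generalist"](payload)
    return results

specialists = {
    "lease": lambda p: f"lease-analysis({p})",
    "merger": lambda p: f"merger-analysis({p})",
    "generalist": lambda p: f"general-review({p})",
}
route = lambda p: "lease" if "lease" in p else ("merger" if "merger" in p else "generalist")
print(run_hierarchy({"t1": "standard lease", "t2": "edge case"}, specialists, route))
```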
Information bottleneck compression addresses the bandwidth problem directly. When agents coordinate, they don't need to share every detail—just information relevant to the joint task. CommCP applies information bottleneck theory to compress inter-agent messages, reducing bandwidth by 41.4% while maintaining task accuracy. Agents learn a shared "jargon"—compact representations that preserve decision-relevant information while discarding redundant context. The principle, introduced by Tishby, Pereira, and Bialek, formalizes the tradeoff: maximize I(message, task) while minimizing I(message, input). In production systems where agents reshape, audit, and trade with each other, communication overhead determines economic viability. A 41% reduction in message size translates directly to inference cost savings at scale.
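In the original formulation, with input X, a task-relevant variable Y, and a compressed message M, the objective is a single Lagrangian (CommCP's exact loss may differ; this is the classic form):

```latex
\min_{p(m \mid x)} \; I(M; X) \;-\; \beta \, I(M; Y)
```

Minimizing I(M; X) shortens the message; keeping I(M; Y) high preserves what the joint task needs. The multiplier β sets the exchange rate between the two.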
The pattern that emerges: static coordination protocols optimize for predictability; dynamic protocols optimize for performance. As systems scale, the economic pressure favors dynamic coordination, which shifts architectural complexity from design-time (where humans specify structure) to run-time (where agents negotiate structure). This makes systems harder to audit and easier to exploit, but cheaper to operate.
How Agents Compete: Negotiation, Auctions, and Game Theory
Agent negotiation isn't a toy problem anymore. PieArena benchmarks heterogeneous negotiation scenarios—splitting resources between parties with different utility functions, multi-issue bargaining where concessions on one dimension enable gains on another, and repeated interactions where reputation matters. The results reveal systematic performance disparities: agents using GPT-4 consistently extract 15-30% better deals than agents using GPT-3.5, both as buyers and sellers. The gap isn't small, and it doesn't come from prompt engineering—it comes from underlying capabilities in multi-step reasoning and strategic opponent modeling.
This matters because we're building infrastructure where agents become economic actors in autonomous markets. Your personal scheduling agent negotiates meeting times with others' agents. Your purchasing agent bargains with vendor agents over prices and terms. Your healthcare proxy agent discusses treatment options with provider agents. In each interaction, whoever controls the more capable model extracts more value. Compounded across thousands of micro-transactions, this produces systematic wealth transfer toward users with access to frontier AI systems.
The mechanism is subtler than simple price discrimination. Research on agent-to-agent negotiations in consumer markets shows that negotiation advantages emerge from better theory-of-mind modeling—agents that more accurately predict how opponents will respond to different offers. This isn't information the negotiation protocol exposes explicitly. It's inference from conversational dynamics: how quickly the other agent makes counteroffers, which dimensions it emphasizes, what language signals flexibility versus firmness. Better models perform better inference, which translates to better outcomes.
AgenticPay, a system designed for automated payment negotiation, exposes the failure modes. Agents show "substantial gaps" in long-horizon strategic reasoning—they optimize locally at each decision point without accounting for downstream consequences. The problem compounds with sequential reasoning: agents that think step-by-step make myopic commitments that amplify over time. What looks like rational behavior at each step produces irrational trajectories overall. This is the strategic planning gap that makes autonomous agent transactions riskier than the pitch decks suggest.
The protocols exist for structured negotiation. The Contract Net Protocol, introduced in 1980 for distributed problem-solving, handled task allocation through structured announce-bid cycles with defined acceptance criteria. Modern agent frameworks often replace this with unstructured dialogue—"let agents talk it out"—sacrificing reliability for flexibility. Game theory established Nash equilibria as stable coordination points, but computing them in real-time negotiation with incomplete information remains intractable for anything beyond toy scenarios.
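Part of Contract Net's appeal is that the whole announce-bid-award cycle fits in a few lines. A minimal sketch (the cost-based award rule is an illustrative placeholder, not Smith's full 1980 specification):

```python
from typing import Optional

class Contractor:
    """Toy contractor: bids a cost estimate, or None to decline."""
    def __init__(self, name: str, cost: float):
        self.name, self.cost = name, cost

    def bid(self, task: str) -> Optional[float]:
        return self.cost

    def execute(self, task: str) -> str:
        return f"{self.name} completed {task}"

def contract_net(task: str, contractors: dict) -> Optional[str]:
    # Announce: every contractor sees the task and may bid.
    bids = {name: c.bid(task) for name, c in contractors.items()}
    bids = {name: b for name, b in bids.items() if b is not None}
    if not bids:
        return None  # no capable contractor; re-announce or escalate
    # Award: defined acceptance criteria, no negotiation drift.
    winner = min(bids, key=bids.get)
    return contractors[winner].execute(task)

pool = {"a": Contractor("a", 3.0), "b": Contractor("b", 1.5)}
print(contract_net("classify-clause", pool))  # -> "b completed classify-clause"
```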
As agents gain autonomy over transactions, negotiation disparities will compound into structural inequality. Not a policy problem — a capabilities problem embedded in infrastructure. The agents aren't misbehaving. They're optimizing, and some have sharper tools.
When Multi-Agent Systems Fail: Emergent Friction and Cascading Errors
The error amplification numbers are worse than most papers admit. When independent agents operate without coordination structure, errors amplify by 17.2x. Even with centralized coordination containing amplification to 4.4x, the effect is exponential—small errors in upstream agents cascade into catastrophic failures downstream. The mechanism isn't mysterious: each agent introduces noise, and when agents operate sequentially, noise compounds. If Agent A has 90% accuracy and Agent B has 90% accuracy, their serial composition achieves 81% accuracy, not 90%.
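The arithmetic generalizes. Assuming independent errors and that any single agent's mistake sinks the chain, n agents in series at per-agent accuracy p deliver:

```latex
p_{\text{chain}} = \prod_{i=1}^{n} p_i = p^{n},
\qquad 0.9^{2} = 0.81, \quad 0.9^{5} \approx 0.59
```

Five competent agents in series are barely better than a coin flip.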
Performance plateaus reveal the structural limits. Google DeepMind's scaling study found that adding agents beyond four produces minimal gains on most tasks—accuracy saturates around 45% baseline regardless of agent count. More agents don't mean better results; they mean more coordination overhead consuming resources that could go to better individual reasoning. State-of-the-art multi-agent systems like ChatDev, designed for collaborative software development, achieve 25% correctness on realistic programming tasks—worse than single-agent GPT-4 baselines. The multi-agent architecture intended to improve outcomes instead degraded them through coordination failures.
Sequential tasks expose the clearest failure mode. Every multi-agent variant tested by DeepMind degraded performance by 39-70% on strictly sequential reasoning tasks like planning. The communication overhead fragmented reasoning processes, consuming cognitive resources needed for actual problem-solving. This isn't an implementation bug—it's a fundamental mismatch. Sequential reasoning requires maintaining coherent state across steps. Multi-agent systems distribute state across agents, each with partial information and independent reasoning traces. Recombining these fragments into coherent plans requires coordination work that exceeds the value of parallelism.
Research on why multi-agent LLM systems fail (the MAST taxonomy) identified 14 distinct failure modes across three categories: system design issues (rigid architectures that don't match task structure), inter-agent misalignment (agents with conflicting objectives or mismatched communication protocols), and task verification failures (no mechanism to detect when collective reasoning has gone wrong). The study analyzed over 1,600 execution traces from 7 different frameworks, achieving high inter-annotator agreement (kappa = 0.88) when human experts classified failure patterns. The consistency suggests these aren't edge cases—they're systematic weaknesses across current approaches.
Communication barriers compound these failures. SocialVeil simulated three disruption types grounded in human communication research: semantic vagueness (ambiguous language), sociocultural mismatch (divergent context assumptions), and emotional interference (affective factors disrupting understanding). Across 720 scenarios and four frontier LLMs, mutual understanding dropped by over 45% and confusion rose by nearly 50%. Adaptation strategies like repair instructions and interactive learning showed "only modest effects far from barrier-free performance." Current LLMs lack robust mechanisms to diagnose misunderstandings and recover from communication breakdowns—essential capabilities for real-world coordination.
If your base task accuracy is already high (>80%), multi-agent coordination risks making it worse through error amplification and communication overhead. If your task is inherently sequential, don't force parallelism — sequential execution with a single capable agent will outperform distributed coordination. The cases where multi-agent systems win are narrower than the architectures suggest: truly independent subtasks with minimal communication requirements and moderate base accuracy where coordination improvements outweigh error propagation.
Real-World Examples
JPMorgan COIN: Hierarchical architecture (manager routing to specialists) with peer communication only within specialist tiers. Works because subtasks are genuinely independent — analyzing one lease agreement doesn't require coordinating with analysis of another. The 360,000-hour volume and 80% error reduction come from matching architecture to task structure, not from agent sophistication.
Klarna customer service: 2.3 million conversations in the first month, then a quiet return to hiring human agents a year later. The early wins came from parallel execution, not collaborative execution. The system hit limits where genuine multi-step coordination became necessary — verifying accounts, checking inventory, coordinating with fraud detection.
Google 2025 DORA Report: Found that AI adoption correlates with a 91% increase in code review time and 154% larger pull requests. This is coordination overhead manifesting as developer friction. More agents (code generation, review, testing, documentation) produce more output, but integrating that output requires more human coordination. The bottleneck shifted from creation to verification—not because agents produce worse code, but because coordinating multi-agent outputs into coherent systems requires work that wasn't necessary when humans did everything.
Amazon warehouse robots: Not LLM-based, but instructive. Thousands of robots coordinate through centralized optimization — each path computed by a central planner with perfect information and deterministic outcomes. The moment you introduce uncertainty (ambiguous commands, partial information, stochastic outcomes), centralized planning breaks down and distributed coordination becomes necessary. The boundary between centralized and distributed isn't a design preference — it's determined by how much uncertainty your task contains.
GAMMS (Graph-based Multi-Agent Simulation) confirms this experimentally: four agents in a well-designed hierarchy outperform eight agents in an unstructured mesh. Architecture matters more than agent count. Coordination overhead managed is coordination overhead that doesn't consume the gains.
The Architecture Spectrum: From Rigid Pipelines to Emergent Swarms
Multi-agent systems occupy a spectrum from fully deterministic to fully emergent. Understanding where your use case sits determines which architecture to choose.
| Architecture | Structure | Coordination | Best For | Failure Modes |
|---|---|---|---|---|
| Pipeline | Sequential stages | None (unidirectional data flow) | Document processing, ETL workflows, any task with clear stages | Cascading errors, no recovery from upstream failures |
| Hierarchy | Manager + specialists | Centralized routing and aggregation | Task decomposition with independent subtasks, resource allocation | Manager becomes bottleneck, rigid structure mismatches dynamic tasks |
| Flat/Mesh | Peer-to-peer | Negotiation or consensus protocols | Problems requiring diverse perspectives, creative tasks | Exponential communication overhead, consensus failures |
| Emergent Swarm | Self-organizing topology | Learned coordination policies | Adaptive optimization, exploration tasks | Unpredictable behavior, difficult to debug, emergent failures |
Almost everyone reaches for hierarchy first. It's the comfort zone — decompose the task, assign specialists, aggregate results. CrewAI, LangGraph, AutoGen: the default examples are all hierarchical. And hierarchy does contain error amplification, compressing it to 4.4x versus 17.2x for independent agents. But containment isn't free. The manager becomes a reasoning bottleneck, and when subtask interdependencies require peer communication that the hierarchy blocks, you've built a system that's structured for a problem it doesn't actually have.
The most common architectural mismatch is dynamic tasks crammed into rigid pipelines. Pipeline architectures — Agent A extracts, Agent B classifies, Agent C summarizes — work when task structure is genuinely linear. But builders treat pipelines as the default "simple" option and discover too late that upstream errors propagate unchecked, with no mechanism for downstream agents to flag problems or request re-processing. Simplicity becomes brittleness.
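The cheap mitigation is a rejection channel: have each stage score its own confidence and bounce low-confidence output for a retry or human escalation instead of passing it downstream. A minimal sketch (the 0.7 threshold and single-retry policy are illustrative assumptions, not any framework's API):

```python
def run_pipeline(doc, stages, threshold=0.7, max_retries=1):
    """Each stage is a callable returning (output, confidence).
    Low confidence triggers a retry, then escalation, rather than
    silently feeding a bad intermediate to the next stage."""
    for stage in stages:
        output, confidence = stage(doc)
        retries = 0
        while confidence < threshold and retries < max_retries:
            output, confidence = stage(doc)  # real systems would vary the prompt
            retries += 1
        if confidence < threshold:
            return {"status": "needs_human_review",
                    "failed_stage": stage.__name__, "partial": output}
        doc = output  # only confident output flows downstream
    return {"status": "ok", "result": doc}
```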
Flat mesh and emergent swarm architectures solve the flexibility problem by introducing worse ones. Mesh coordination scales quadratically — 10 agents generate 45 potential communication links, 100 agents generate 4,950. DyTopo shows that dynamic topology reconfiguration can prune this overhead, but emergent coordination policies that work in training may fail unpredictably in deployment. Debugging a learned communication policy is substantially harder than debugging explicit routing rules.
The honest guidance: if you can specify the coordination protocol at design time, use a pipeline or hierarchy and accept the rigidity. If you can't, accept the complexity of adaptive architectures — but invest heavily in observability, because emergent coordination you can't inspect is emergent coordination you can't trust.
When NOT to Use Multi-Agent Systems
The conditions where multi-agent systems degrade rather than improve outcomes are specific and quantifiable. Deploying agents because "more agents are better" produces the failure rates research documents. Deploying agents when task structure demands it produces the success cases.
Sequential tasks with tight coupling: If step N requires complete output from step N-1, parallelism provides no benefit and coordination overhead destroys performance. DeepMind measured 39-70% degradation on sequential reasoning tasks. The mechanism: distributed agents maintain independent reasoning traces, and recombining fragments into coherent plans costs more than sequential execution with a single agent maintaining global state.
High base accuracy tasks: If your single-agent baseline achieves >80% accuracy, coordination risks making it worse. Error amplification means small gains from parallelism get consumed by noise compounding across agents. JPMorgan COIN works because legal document review had poor baseline accuracy—human error rates were high enough that even imperfect agent coordination improved outcomes. Code generation with GPT-4 has high baseline accuracy—adding coordination often degrades results by introducing inconsistencies across agent outputs.
High communication requirements: If agents need to share substantial context to coordinate, communication costs exceed parallelism benefits. This is where information bottleneck compression matters—if you can't compress messages by ~40% without losing critical information, coordination will consume more tokens than sequential execution saves.
Resource-constrained environments: Adding agents multiplies inference costs. If your task budget is 1,000 tokens, distributing across 5 agents means each gets 200 tokens plus coordination overhead. This works when parallelism compensates (5 agents with 180 tokens each complete in 1/5 the time of 1 agent with 1,000 tokens). It fails when coordination overhead exceeds parallelism gains—a common pattern on tasks requiring substantial context.
Critical systems requiring determinism: Multi-agent systems introduce stochasticity through coordination dynamics. If you need reproducible outputs (financial calculations, regulatory compliance, medical diagnoses), deterministic single-agent architectures or classical automation provide reliability that distributed coordination can't match. The MAST taxonomy identified task verification failures as a fundamental category—multi-agent systems often lack mechanisms to detect when collective reasoning has gone wrong.
The rough decision boundaries: task parallelizability above 60%, communication cost below 20% of compute budget, base accuracy below 70%, and acceptable failure rate above 5%. Outside those bounds, single-agent architectures with better prompting or more capable models will outperform multi-agent coordination. The 40% project cancellation rate Gartner warns about comes from deploying multi-agent systems where simpler architectures would succeed.
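Expressed as a checklist, where the thresholds are this article's rough boundaries rather than validated cutoffs:

```python
def multi_agent_viable(parallelizability: float,
                       comm_cost_fraction: float,
                       base_accuracy: float,
                       acceptable_failure_rate: float) -> bool:
    """Go/no-go check using the rough decision boundaries above."""
    return (parallelizability > 0.60             # enough independent subtasks
            and comm_cost_fraction < 0.20        # coordination stays cheap
            and base_accuracy < 0.70             # headroom for coordination gains
            and acceptable_failure_rate > 0.05)  # tolerance for emergent failures

# A sequential planning task with a strong single-agent baseline:
print(multi_agent_viable(0.30, 0.25, 0.85, 0.01))  # False -> stay single-agent
```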
Looking Forward: The Six-Month Horizon
The infrastructure pieces are landing simultaneously. Model Context Protocol provides standardized interfaces for agents to access data and tools. A2A Protocol specifies how agents discover each other's capabilities. Information bottleneck compression makes coordination economically viable. Within six months, we'll see production deployments combining these — agents that negotiate their own coordination protocols rather than executing predefined workflows.
This shifts architectural complexity from design-time to run-time. The coordination patterns that emerge won't be the ones humans would design — they'll be whatever policies minimize cost while satisfying task constraints. The SMAC benchmark showed this trajectory in reinforcement learning: agents learned coordination strategies in StarCraft micromanagement that human players didn't anticipate and couldn't replicate.
The question isn't whether agents will coordinate autonomously—the research demonstrates they can. The question is whether the infrastructure we're building allows us to understand, audit, and constrain that coordination when it matters. JPMorgan COIN succeeds because humans designed the coordination protocol and monitor outcomes. The next generation won't have that legibility. The systems that thrive will be those that solve interpretability for distributed coordination, not just individual agent reasoning. That's the real frontier—not better agents, but comprehensible swarms.
Sources
Research Papers:
- SocialVeil: Communication Barriers in Multi-Agent LLM Systems
- DyTopo: Dynamic Topology Reconfiguration for Multi-Agent Communication
- CommCP: Information Bottleneck Compression for Agent Communication
- PieArena: Heterogeneous Negotiation Benchmarks for AI Agents
- AgenticPay: Automated Payment Negotiation Systems
- Agent-to-Agent Negotiations in Consumer Markets
- Why Multi-Agent LLM Systems Fail: 14 Failure Modes
- GAMMS: Graph-based Multi-Agent Simulation Framework
- SMAC: StarCraft Multi-Agent Challenge
Industry Research & Reports:
- Google DeepMind: Towards a Science of Scaling Agent Systems
- Multi-Agent System Reliability: Failure Patterns and Production Validation
About this article: Swarm Signal articles are researched and drafted by AI agents, then edited, verified, and published by Tyler Casey. We write about agents building things — sometimes that includes this blog.
Swarm Signal is an independent publication with no financial relationships with the research teams, institutions, or companies cited. Some posts mention tools we've used or spent time with. Some links may be affiliate links. They don't influence what gets covered or how it's assessed.