Four GPT-4o-mini agents running a Boids flocking simulation took 68.61 seconds to complete 10 timesteps. The classical version finished in 0.0019 seconds. That's not a rounding error. That's a 36,000x slowdown for a problem computer science solved in 1986.

The dream of LLM-powered swarms is seductive: replace brittle if-then rules with language models that can reason, adapt, and coordinate through natural language. The reality, laid bare by three papers published in the first half of 2025, is that we're paying extraordinary compute costs for capabilities that often don't materialize.

The Benchmark That Finally Showed Up

Until SwarmBench arrived in May 2025, claims about LLM swarm capabilities were largely vibes-based. Nobody had systematically tested whether language models could actually coordinate under true swarm constraints: local-only perception, no global state, decentralized decision-making.

SwarmBench changed that. Researchers at Renmin University built a 2D grid environment with five tasks that map directly to classical swarm problems: pursuit, synchronization, foraging, flocking, and transport. Agents got a 5x5 local view. No bird's-eye knowledge. No centralized controller. Thirteen LLMs were tested, from Claude 3.5 Haiku to o4-mini to LLaMA 4 Scout.
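To make the perception constraint concrete, here is a minimal sketch of what a SwarmBench-style 5x5 local view looks like on a grid. The function name, padding convention, and cell encoding are my own illustration, not the benchmark's actual code.

```python
import numpy as np

def local_view(grid, pos, radius=2):
    """Return the (2*radius+1)^2 window around pos.

    Cells outside the grid are padded with -1 (unknown/wall). With the
    default radius of 2, each agent sees only a 5x5 patch of the world.
    """
    padded = np.pad(grid, radius, constant_values=-1)
    r, c = pos
    return padded[r:r + 2 * radius + 1, c:c + 2 * radius + 1]

grid = np.zeros((10, 10), dtype=int)
grid[4, 5] = 7                       # another agent one cell to the right
view = local_view(grid, (4, 4))      # 5x5 window centered on our agent
```

Everything outside that window, including the positions of most other agents, is simply invisible to the agent's prompt.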

The results were uneven in a way that should worry anyone building production multi-agent systems. Flocking scored highest overall, which makes sense since it's the most reactive, least strategic task. Transport was nearly impossible: only o4-mini and DeepSeek-R1 scored above zero. The rest couldn't figure out how to collectively move an object without stepping on each other.

The most telling finding wasn't about raw scores. It was about communication. SwarmBench tracked what agents said to each other and found that message content had weak correlation with actual task success. Physical group dynamics predicted outcomes far better than semantic communication. The models talked a lot. Most of it didn't help.


Where 300x Comes From

Rahman et al.'s June 2025 paper put hard numbers on what practitioners already suspected. They rebuilt Craig Reynolds' 1986 Boids algorithm using OpenAI's Swarm framework, then ran it against the classical version with four agents over 10 timesteps.

The LLM version needed three prompts per agent per timestep, totaling 120 inference calls for what amounted to a trivial simulation. Each boid took roughly 1.7 seconds to process its prompts. The classical system ran the same computation in under two milliseconds.
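For a sense of what the classical side of that comparison actually computes, here is a compact Boids step with the three Reynolds rules. The weights, neighbor radius, and integration scheme are illustrative defaults, not the exact parameters used by Reynolds or Rahman et al.

```python
import numpy as np

def boids_step(pos, vel, dt=0.1, r=2.0, w_sep=1.5, w_ali=1.0, w_coh=1.0):
    """One classical Boids timestep: separation, alignment, cohesion."""
    acc = np.zeros_like(pos)
    for i in range(len(pos)):
        diff = pos - pos[i]
        dist = np.linalg.norm(diff, axis=1)
        nbrs = (dist > 0) & (dist < r)           # local perception only
        if not nbrs.any():
            continue
        sep = -(diff[nbrs] / dist[nbrs][:, None] ** 2).sum(axis=0)  # steer away
        ali = vel[nbrs].mean(axis=0) - vel[i]                       # match velocity
        coh = pos[nbrs].mean(axis=0) - pos[i]                       # seek center
        acc[i] = w_sep * sep + w_ali * ali + w_coh * coh
    vel = vel + dt * acc
    return pos + dt * vel, vel

rng = np.random.default_rng(0)
pos, vel = rng.normal(size=(4, 2)), rng.normal(size=(4, 2))
for _ in range(10):                  # same setup as the paper: 4 agents, 10 steps
    pos, vel = boids_step(pos, vel)
```

The entire run is a few hundred floating-point operations, which is why the classical version finishes in milliseconds while the LLM version spends its time waiting on 120 network round-trips.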

When they switched to ant colony optimization, the gap narrowed but didn't close. Classical ACO finished 50 iterations in 14 seconds. The LLM version took 136 seconds, about a 10x overhead. But here's where it gets interesting: the LLM-driven ants found a better solution. Their pheromone concentration on the optimal path hit 44.2 versus the classical system's 37.6, and the LLM system achieved this distribution in 50 iterations versus 179 for the classical approach.
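The pheromone dynamic behind those concentration numbers is easy to see in a toy version. This two-path sketch is my own minimal illustration of the ACO update rule (evaporation plus length-weighted deposits), not the paper's implementation.

```python
import random

def aco_two_paths(lengths=(1.0, 2.0), ants=10, iters=50, rho=0.5, Q=1.0, seed=0):
    """Toy ant colony optimization over two alternative paths.

    Each iteration: every ant picks a path with probability proportional
    to pheromone / length, then pheromone evaporates by factor (1 - rho)
    and each ant deposits Q / length on its chosen path.
    """
    rng = random.Random(seed)
    pher = [1.0, 1.0]
    for _ in range(iters):
        deposits = [0.0, 0.0]
        for _ in range(ants):
            weights = [p / l for p, l in zip(pher, lengths)]
            choice = rng.choices([0, 1], weights=weights)[0]
            deposits[choice] += Q / lengths[choice]
        pher = [(1 - rho) * p + d for p, d in zip(pher, deposits)]
    return pher

pher = aco_two_paths()   # shorter path accumulates the higher concentration
```

In the classical version, that positive-feedback loop is the only "intelligence" in the system; the paper's LLM ants layered reasoning on top of it and reached a sharper pheromone distribution in far fewer iterations.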

That's the uncomfortable truth sitting inside the hype. LLMs are catastrophically inefficient at mechanical coordination, but they can occasionally reason their way to better strategic outcomes. The question is whether that reasoning advantage justifies burning 10-36,000x more compute.

The Decentralization Problem

Classical swarms work because agents are genuinely independent. Each boid calculates its own position update from local data. They don't wait for each other. They don't share a context window.

LLM-based "swarms" break this property in ways the proponents don't always acknowledge. Rahman et al. found that their LLM agents operated sequentially, passing information between calls in a structured, interdependent chain. That's not a swarm. That's a pipeline with extra steps.

SwarmBench tried harder to enforce decentralization, limiting agents to local perception and local communication channels. But even there, the fundamental bottleneck remains: every agent decision requires a full inference pass through a billion-parameter model. You can't parallelize your way out of that constraint without a GPU cluster that makes the whole exercise absurd for anything a classical swarm algorithm could handle.

Scalability, the defining feature of real swarm systems, breaks down entirely. Rahman et al. were direct about it: scaling beyond a handful of agents wasn't feasible. SwarmBench tested up to 16 agents and found task-dependent effects where adding more agents sometimes made performance worse, echoing the coordination tax that plagues multi-agent systems generally.


What LLMs Actually Bring to Swarms

The honest inventory isn't all bad. Three capabilities genuinely don't exist in classical swarm algorithms.

First, LLMs can interpret ambiguous objectives described in natural language. Classical Boids need hand-coded rules. An LLM agent can be told "avoid collisions but prioritize staying near the group center" and produce reasonable behavior without explicit parameter tuning. That's real value for rapid prototyping.
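One way to get that value without per-timestep inference is to call the model once, at setup, to translate the instruction into rule weights. This is a hypothetical sketch: `weights_from_objective`, the prompt, the JSON schema, and the canned `fake_llm` response are all my own invention, shown here so the pattern is runnable without an API key.

```python
import json

def weights_from_objective(objective: str, llm_complete) -> dict:
    """Ask a model to map a natural-language objective to numeric rule weights.

    `llm_complete` is a placeholder for any chat-completion call.
    """
    prompt = (
        "Convert this flocking objective into JSON weights "
        '{"separation": float, "alignment": float, "cohesion": float}: '
        + objective
    )
    return json.loads(llm_complete(prompt))

# Canned stand-in response in place of a real model call.
fake_llm = lambda _: '{"separation": 2.0, "alignment": 0.5, "cohesion": 1.5}'
w = weights_from_objective(
    "avoid collisions but prioritize staying near the group center", fake_llm
)
```

The model runs once per behavior change instead of three times per agent per timestep, which is the difference between a design tool and a 36,000x slowdown.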

Second, the ACO results suggest LLMs can make better strategic decisions when the search space has structure they can reason about. The first model specifically trained for swarm behavior showed similar patterns: gains concentrate in tasks where reasoning about others' likely actions matters more than raw reaction speed.

Third, LLMs enable human-in-the-loop swarm design. You can modify agent behavior by changing a prompt instead of rewriting simulation code. For researchers exploring emergent phenomena, that accessibility matters.

But none of these advantages require running LLMs at every timestep of a simulation. The SIER framework from May 2025 found a smarter application: using swarm intelligence algorithms to guide LLM reasoning rather than the other way around. Their density-driven approach scored 26.7% on AIME-2024 versus 20.0% for standard chain-of-thought, suggesting the real opportunity is swarm algorithms improving LLMs rather than LLMs replacing swarm algorithms.

The Honest Assessment

SwarmBench revealed four failure modes that keep recurring: movement bias where agents develop directional preferences that ignore objectives, information silos where subgroups cluster without coordinating globally, traffic jams from over-aggregation, and memory limitations from constrained context buffers. These aren't bugs that better prompting will fix. They're structural consequences of treating a text prediction system as a real-time controller.

The path forward probably isn't LLM swarms or classical swarms. It's hybrid architectures where language models handle strategic-layer decisions while classical algorithms manage the millisecond-level coordination that multi-agent systems actually need. Let the LLM decide where the swarm should go. Let Reynolds' nearly 40-year-old math handle the flying.
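Structurally, that split looks something like the loop below: an expensive strategic call every few dozen steps, a cheap reactive controller every step. This is a sketch of the architecture under my own assumptions, not code from any of the papers; `pick_goal` stands in for the LLM call, and the inner controller is a simple damped spring rather than full Boids.

```python
import numpy as np

def hybrid_run(pos, vel, pick_goal, steps=100, replan_every=50,
               dt=0.1, gain=1.0, drag=0.8):
    """Two-layer control loop: slow strategy, fast reaction.

    `pick_goal` maps current positions to a 2D target; in a real system
    it would be the (slow, occasional) LLM call.
    """
    goal = pick_goal(pos)
    for t in range(steps):
        if t % replan_every == 0:
            goal = pick_goal(pos)               # expensive, infrequent
        acc = gain * (goal - pos) - drag * vel  # cheap, every timestep
        vel = vel + dt * acc
        pos = pos + dt * vel
    return pos

rng = np.random.default_rng(1)
start = rng.normal(size=(4, 2)) * 5
final = hybrid_run(start, np.zeros((4, 2)), pick_goal=lambda p: np.zeros(2))
```

With a 50-step replanning interval, the model is consulted twice per run instead of 120 times, and the per-step physics stays in the microsecond regime where it belongs.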

The 300x overhead number will improve as inference costs drop and distilled models get smaller. But improving it by two orders of magnitude still leaves you 3x slower than classical approaches. For the problems swarms were designed to solve, fast and dumb will keep beating slow and smart for a long time.
