Benchmark Watch

Evaluation notes, benchmark interpretation, leaderboard skepticism, and measurement failures.

Self-Improving Agents Have an Evaluator Problem

Anthropic's June 2026 update on recursive self-improvement is not a distant sci-fi warning. The company says its engineers now ship 8x as much code per...

3 min read
How to Build Agent Evals That Catch Real Failures

Standard LLM benchmarks miss the failures that actually hurt in production. Here's how to build an evaluation system for agents that catches cascading errors, trajectory drift, and policy violations before they reach users.

8 min read
Why Multi-Agent Papers Don't Replicate in Production

A paper from Tran and Kiela tested 28 multi-agent configurations across four architectures: Sequential, Parallel, Debate, and Ensemble. Every single one...

4 min read
Multimodal Agents Score 40% Where Humans Score 72%

Every frontier lab now ships models that see, hear, and read. The assumption is that more modalities mean more capable agents. The benchmarks tell a...

3 min read
Multi-Agent Systems Are Booming — But Real-Work Benchmarks Still Bite

Multi-agent workflows are growing fast, but APEX-Agents, AgentRx, Databricks, and Gartner show a gap between adoption, task success, and production readiness.

3 min read
Agent Reliability Scores Are Getting Worse, Not Better

SWE-Bench scores tick up every quarter, but production failure rates aren't dropping. A METR study found half of test-passing PRs wouldn't be merged. The more capable we make agents, the less reliably they behave.

4 min read
RAG for Legal: Building Document Retrieval That Survives Court

More than 300 documented instances of AI-generated fake citations have appeared in court filings since mid-2023. The question isn't whether to use AI for legal research — it's how to build retrieval systems that hold up under adversarial scrutiny.

12 min read
Agent Benchmarks Won't Sit Still

Static agent benchmarks assume frozen environments. ProEvolve evolved one environment into 200 with 3,000 task sandboxes. Every frontier model failed in structurally different ways when familiar tools disappeared.

3 min read
The UK Is Letting AI Diagnose Your Dog

ManyPets routes every insurance claim through an AI agent. 55% need zero human involvement. In the same year, the RCVS dropped the physical exam requirement for prescribing. Each piece works. Nobody's testing the integration.

6 min read
Agentic RAG: How AI Agents Are Rewriting Retrieval

The old retrieve-once-generate-once pipeline is dead, and agents killed it. Four architectural patterns are reshaping how production systems handle knowledge retrieval.

8 min read
Swarm Signal