Benchmark Watch
Evaluation notes, benchmark interpretation, leaderboard skepticism, and measurement failures.
Signals, Maps, and Watch Lists
Production-oriented analysis, benchmarks, and market/system intelligence.
Self-Improving Agents Have an Evaluator Problem
Anthropic's June 2026 update on recursive self-improvement is not a distant sci-fi warning. The company says its engineers now ship 8x as much code per...
How to Build Agent Evals That Catch Real Failures
Standard LLM benchmarks miss the failures that actually hurt in production. Here's how to build an evaluation system for agents that catches cascading errors, trajectory drift, and policy violations before they reach users.
Why Multi-Agent Papers Don't Replicate in Production
A paper from Tran and Kiela tested 28 multi-agent configurations across four architectures: Sequential, Parallel, Debate, and Ensemble. Every single one...
Multimodal Agents Score 40% Where Humans Score 72%
Every frontier lab now ships models that see, hear, and read. The assumption is that more modalities mean more capable agents. The benchmarks tell a...
Multi-Agent Systems Are Booming — But Real-Work Benchmarks Still Bite
Multi-agent workflows are growing fast, but APEX-Agents, AgentRx, Databricks, and Gartner show a gap between adoption, task success, and production readiness.
Agent Reliability Scores Are Getting Worse, Not Better
SWE-bench scores tick up every quarter, but production failure rates aren't dropping. A METR study found half of test-passing PRs wouldn't be merged. The more capable we make agents, the less reliably they behave.
RAG for Legal: Building Document Retrieval That Survives Court
More than 300 documented instances of AI-generated fake citations have appeared in court filings since mid-2023. The question isn't whether to use AI for legal research — it's how to build retrieval systems that hold up under adversarial scrutiny.
Agent Benchmarks Won't Sit Still
Static agent benchmarks assume frozen environments. ProEvolve evolved one environment into 200 with 3,000 task sandboxes. Every frontier model failed in structurally different ways when familiar tools disappeared.
The UK Is Letting AI Diagnose Your Dog
ManyPets routes every insurance claim through an AI agent. 55% need zero human involvement. In the same year, the RCVS dropped the physical exam requirement for prescribing. Each piece works. Nobody's testing the integration.
Agentic RAG: How AI Agents Are Rewriting Retrieval
The old retrieve-once-generate-once pipeline is dead, and agents killed it. Four architectural patterns are reshaping how production systems handle knowledge retrieval.