Benchmark Watch

Evaluation notes, benchmark interpretation, leaderboard skepticism, and measurement failures.

Signals, Maps, and Watch Lists

Production-oriented analysis, benchmarks, and market/system intelligence.

Signal Benchmark Watch Evidence-first framing

Agent Leaderboards Can Be Cheaper Without Being Safer

A March 2026 paper on efficient agent benchmarking found that mid-difficulty task subsets can remove large parts of an agent benchmark while preserving...

Multimodal Memory Tests Expose the Personal-Agent Gap

Product teams are turning memory into the selling point for personal agents. The hard question is no longer whether they can remember a preference; it is...

Power Grid Agents Need Constraint Tests, Not Chat Scores

A June 2026 power-systems benchmark argues that language-model agents can solve grid-engineering tasks, but the useful signal is narrower: the agent must...

TerminalWorld Makes Agent Benchmarks Harder to Fake

TerminalWorld turns public terminal recordings into validated agent tasks. The signal is not a higher leaderboard score. It is a harder benchmark supply chain.

Million-Token Context Still Fails the Workload Test

Anthropic reported on February 5, 2026 that Claude Opus 4.6 scored 76% on the 8-needle 1M-token MRCR v2 test while Claude Sonnet 4.5 scored 18.5% on the...

Coding Agent Benchmarks Hit the Generalization Wall

Scale's SWE-Bench Pro public leaderboard reports that top models scoring above 70% on SWE-Bench Verified fall to 23.3% for OpenAI GPT-5 and 23.1% for...

Self-Improving Agents Have an Evaluator Problem

Anthropic's June 2026 update on recursive self-improvement is not a distant sci-fi warning. The company says its engineers now ship 8x as much code per...

The 12-to-72 Problem: Computer-Use Agents Hit Human Scores but Miss the Point

Computer-use agents jumped from 12% to 72% on OSWorld in 18 months. The scores look like progress. The latency and efficiency numbers tell a different story.

Why Multi-Agent Papers Don't Replicate in Production

A paper from Tran and Kiela tested 28 multi-agent configurations across four architectures: Sequential, Parallel, Debate, and Ensemble. Every single one...

Multimodal Agents Score 40% Where Humans Score 72%

Every frontier lab now ships models that see, hear, and read. The assumption is that more modalities mean more capable agents. The benchmarks tell a...