Benchmark Watch

Evaluation notes, benchmark interpretation, leaderboard skepticism, and measurement failures.

Signals, Maps, and Watch Lists

Production-oriented analysis, benchmarks, and market/system intelligence.

Signal Benchmark Watch Evidence-first framing

Terminal Agents Need Progress Curves, Not Victory Screens

Long-Horizon-Terminal-Bench landed on arXiv in July 2026 with an awkward result for terminal-agent buyers: even the strongest tested model still failed...

Agent Benchmarks Need Runtime Receipts, Not Model Labels

RuBench's revised 19 July 2026 release contains a small but important warning for coding-agent buyers: one audited product configuration silently...

Data Agents Need Exploration Budgets, Not SQL Magic

Data Agent Benchmark landed on arXiv on 21 March 2026 with a result that should make enterprise analytics teams pause: the best tested frontier model...

Agent Evals Need Harnesses, Not More Scoreboards

AgentCompass first landed on arXiv on 15 July 2026 with a practical complaint: agent evaluation is fragmented, tightly coupled, and hard to reproduce...

Agent Test-Time Scaling Needs Reuse, Not More Rollouts

General AgentBench reports that running agents for more interaction steps or more sampled trajectories did not reliably improve ten leading agents,...

Multi-Agent Finance Workflows Need Cost Curves, Not More Agents

A March 2026 benchmark on financial-document processing makes the uncomfortable point: the most accurate multi-agent architecture was not the obvious...

Assistant Agents Need Reminder Tests, Not Recall Scores

Most agent-memory benchmarks ask whether a model can recover old information. PM-Bench asks a harsher question: can an agent remember to do the right...

Coding Agents Need Trajectory Reviews, Not Pass Bits

Most coding-agent benchmarks still compress a whole run into one bit: did the task pass? AgentLens argues that users experience the whole trajectory...

Tool-Use Agents Need Failure Labels, Not Pass Rates

Tool-use agents can fail in ways a final accuracy score hides, because the same wrong answer can come from skipped tools, ignored outputs, fabricated...

Computer-Use Agents Fail Long Workflows, Not Mouse Clicks

Computer-use agents are clearing more short benchmark tasks, but the new failure line is workflow length. A June 2026 benchmark called OSWorld 2.0 tests...