Agent Reliability Scores Are Getting Worse, Not Better
SWE-Bench scores tick up every quarter, but production failure rates aren't dropping. A METR study found half of test-passing PRs wouldn't be merged. The more capable we make agents, the less reliably they behave.