The system prompts powering today's coding agents are growing into small codebases of their own. Claude Code's prompt runs 1,490 lines and 78,000 characters. Nobody is testing them like code. A preprint from Tony Mason at the University of British Columbia and Georgia Institute of Technology introduces Arbiter, a framework that treats system prompts as auditable artifacts — and the first results suggest these prompts are riddled with internal contradictions.
Prompts Have Architecture Problems
Mason analyzed the system prompts of three major coding agents: Claude Code (Anthropic), Codex CLI (OpenAI), and Gemini CLI (Google). Each prompt has a distinct structural pattern. Claude Code's is a monolith, a single massive document where subsystem boundaries blur. Codex CLI's is flat, trading capability for brevity at 298 lines. Gemini CLI's is modular, composing behavior from separate sections across 245 lines.
Each architecture produces its own class of failure. The monolith accumulates contradictions at subsystem boundaries as teams append rules without reviewing what already exists. The flat design avoids contradiction but underspecifies behavior. The modular approach introduces bugs at composition seams, where separately authored sections interact in unintended ways.
What Arbiter Found

Arbiter combines formal evaluation rules with multi-model LLM analysis. The undirected analysis phase — where ten different models independently scour the prompts — surfaced 152 findings across all three vendors. Claude Code alone accounted for 116 of those, including 12 rated "alarming" severity and 34 rated "concerning."
A directed analysis of Claude Code decomposed the prompt into 56 classified blocks and identified 21 hand-labeled interference patterns: 4 critical contradictions, 13 scope overlaps where constraints were restated two to three times with subtle variations, 2 priority ambiguities, and 2 implicit dependencies. One example: Claude Code's prompt simultaneously mandates using a TodoWrite tool and contains workflow prohibitions that conflict with that mandate.
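The paper does not publish Arbiter's detection code, but the mandate-versus-prohibition class of contradiction can be sketched as a toy block-level check. Everything below is invented for illustration: the regexes, the block texts, and the assumption that directives are detectable by surface patterns at all.

```python
import re

# Hypothetical patterns; Arbiter's actual classification is far richer than this.
MANDATE = re.compile(r"\b(always|must|you should) use (\w+)", re.IGNORECASE)
PROHIBIT = re.compile(r"\b(never|do not|don't) use (\w+)", re.IGNORECASE)

def find_contradictions(blocks):
    """Flag tool names that one block mandates and another block prohibits."""
    mandated, prohibited = {}, {}
    for i, block in enumerate(blocks):
        for m in MANDATE.finditer(block):
            mandated.setdefault(m.group(2), []).append(i)
        for m in PROHIBIT.finditer(block):
            prohibited.setdefault(m.group(2), []).append(i)
    # A tool appearing in both maps is a candidate interference pattern.
    return {tool: (mandated[tool], prohibited[tool])
            for tool in mandated.keys() & prohibited.keys()}

# Invented example blocks echoing the TodoWrite-style conflict described above.
blocks = [
    "Always use TodoWrite to track multi-step work.",
    "Never use TodoWrite for single-step tasks; respond directly.",
]
print(find_contradictions(blocks))  # {'TodoWrite': ([0], [1])}
```

Even a check this crude runs in milliseconds over a 1,490-line prompt, which is the point of the paper's static-detectability argument.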
The most consequential finding hit Gemini CLI. Arbiter detected that Gemini's save_memory preferences were structurally guaranteed to be deleted during history compression — the compression schema simply contained no field for saved memories. This was independently confirmed by Google's own Issue #16213, patched in PR #16914 in January 2026. A user's explicitly saved preferences were being silently erased by design.
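The failure mode is easy to reproduce in miniature. The sketch below uses invented field names, not Gemini CLI's actual schema: any compression step that rebuilds state from a fixed field list silently drops whatever the list does not name.

```python
# Hypothetical schema for compressed conversation state. Note what's missing:
# there is no field for saved memories, so they cannot survive compression.
COMPRESSION_SCHEMA = ["goal", "recent_turns", "open_files"]

def compress_history(state: dict) -> dict:
    """Keep only schema fields; everything else is lost without error."""
    return {k: state[k] for k in COMPRESSION_SCHEMA if k in state}

state = {
    "goal": "refactor auth module",
    "recent_turns": ["..."],
    "open_files": ["auth.py"],
    "saved_memories": ["user prefers tabs"],  # written by a save_memory-style tool
}

compressed = compress_history(state)
assert "saved_memories" not in compressed  # silently erased, no warning raised
```

Nothing here is a bug in the usual sense; every line does what it says. The defect lives in the gap between two separately authored pieces, the memory tool and the compression schema, which is exactly the composition-seam failure class attributed to modular prompts.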
Ten Models See Different Things
The multi-model approach is central to Arbiter's design, and the results justify it. Ten distinct LLMs (including Claude Opus 4.6, Gemini 2.0 Flash, Grok 4.1, DeepSeek V3.2, Llama 4 Maverick, and others) generated 107 unique finding categories across Claude Code's 116 results. The models showed categorical complementarity: security and trust issues were flagged by 9 of the 10 models, while resource-management findings came almost entirely from a single model, Kimi K2.5. A single-model audit would have missed entire vulnerability classes.
This matters for anyone building agent security systems. If your red-teaming or testing pipeline relies on a single model to evaluate prompt quality, you are likely blind to categories of interference that model does not prioritize.
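The complementarity argument can be made concrete with a toy aggregation. The model names and finding categories below are invented, not the paper's data; the point is only the set arithmetic: the union of per-model findings covers classes that every individual model misses.

```python
# Hypothetical per-model finding categories from a prompt audit.
findings_by_model = {
    "model_a": {"security", "trust", "contradiction"},
    "model_b": {"security", "trust", "scope_overlap"},
    "model_c": {"resource_management"},  # a class only one auditor surfaces
}

# The ensemble's coverage is the union of everything any model found.
union = set().union(*findings_by_model.values())

# What a single-model audit would have been blind to:
for name, found in findings_by_model.items():
    print(f"{name} alone misses: {sorted(union - found)}")
```

If your evaluation pipeline is one model grading prompts, its blind spots are structural, not random, and adding a second reviewer from a different family recovers categories for pennies.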
The Cost of Not Looking

The total cost of running Arbiter across all three vendor prompts was $0.27, or roughly $0.002 per finding. Claude Code required 10 analysis passes to converge; Codex CLI needed only 2. And Arbiter rates 95% of Claude Code's interference patterns as statically detectable, meaning most of these issues could be caught by automated tooling before deployment, if anyone thought to run it.
This connects to a broader pattern in agent development. We have extensive frameworks for guardrails and prompt injection defense, but almost no discipline around prompt hygiene. System prompts are edited by multiple teams, accumulated over months, and deployed without regression testing. The result is agents that receive contradictory instructions and resolve the conflict silently, in ways nobody audits.
What Changes

Arbiter reframes system prompts as engineering artifacts that deserve the same rigor as source code: version control, static analysis, multi-reviewer audits. The paper does not claim these interference patterns cause catastrophic failures in practice; many may be resolved by the model's own disambiguation. But 4 critical contradictions in a single product's prompt, discoverable for a quarter, suggest the current approach of "append and hope" has a shelf life.
The question is whether agent vendors will adopt prompt testing before a contradiction causes a production incident, or after.