Reasoning models are supposed to think step by step. That's the whole value proposition of chain-of-thought: show your work, build toward the answer, let monitors verify the logic. New research shows that much of this "thinking" is theater.

The Confidence Is Already There

Reasoning Theater, by Siddharth Boppana and colleagues, trained attention probes on the internal activations of DeepSeek-R1 (671B) and GPT-OSS (120B) during chain-of-thought generation. The probes decode the model's actual belief about its final answer — not what it's writing, but what its internal representations already encode.

On MMLU, DeepSeek-R1's probes hit 87.98% accuracy at predicting the final answer. The kicker: they could make that prediction far earlier in the reasoning chain than anything visible in the text. The model writes hundreds or thousands of tokens of "reasoning" after it has already made up its mind. Linear probes managed only 31.85% accuracy on the same task — attention patterns carry the signal, not simple activation magnitudes.
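The gap between the two probe families comes down to pooling. A minimal NumPy sketch of the contrast, where a learned query attends over per-token activations versus naive mean-pooling (all names, sizes, and weights here are illustrative stand-ins, not the paper's actual probes):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_probe(hidden, query, w_out):
    """Attention probe: a learned query scores each token's activation,
    and the attention-weighted pooled vector feeds a linear classifier."""
    scores = hidden @ query        # (T,) relevance score per CoT token
    weights = softmax(scores)      # attend over the chain-of-thought
    pooled = weights @ hidden      # (d,) weighted pooling
    return softmax(pooled @ w_out) # answer-class probabilities

def linear_probe(hidden, w_out):
    """Baseline linear probe: mean-pool activations, then classify."""
    return softmax(hidden.mean(axis=0) @ w_out)

rng = np.random.default_rng(0)
T, d, n_classes = 128, 64, 4        # toy sizes, not the real models'
hidden = rng.normal(size=(T, d))    # stand-in for layer activations
query = rng.normal(size=d)
w_out = rng.normal(size=(d, n_classes))

p_attn = attention_probe(hidden, query, w_out)
p_lin = linear_probe(hidden, w_out)
```

The structural difference is small, but the attention probe can learn *which* tokens carry the answer signal rather than averaging it away, which is one way to read the 87.98% versus 31.85% gap.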

Performative Reasoning Has a Number

Roughly 40% of the reasoning chain is generated after the model is already confident in its answer.

The researchers quantified how performative each model's reasoning is by measuring the gap between when probes detect the answer versus when a chain-of-thought monitor does. They call it the performativity rate. DeepSeek-R1 on MMLU: 0.417. GPT-OSS on MMLU: 0.435. That means roughly 40% of the reasoning chain is generated after the model is already confident in its answer.
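Under one plausible reading of the metric (the paper's exact definition may differ), the performativity rate is just the normalized gap between the probe's detection point and the monitor's:

```python
def performativity_rate(probe_step: int, monitor_step: int, chain_length: int) -> float:
    """Fraction of the chain generated after the internal probe already
    encodes the answer but before a text-level monitor can see it.
    Illustrative reconstruction, not the paper's exact formula."""
    return max(monitor_step - probe_step, 0) / chain_length

# Toy numbers chosen to land on 0.417: the probe locks in at token 120,
# the monitor catches on at token 537, out of a 1000-token chain.
rate = performativity_rate(120, 537, 1000)
```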

But here's the twist: on GPQA-Diamond, a harder multi-hop reasoning benchmark, DeepSeek-R1's performativity rate drops to 0.012. Nearly zero. The model actually uses its chain-of-thought for hard problems. It's on easy questions that the reasoning becomes literal performance: a model already holding the answer in its head, going through the motions.

Think of it like a chess grandmaster narrating a game against a beginner. They know the winning move instantly but still talk through alternatives for the audience. Except we're paying per token for that narration.

Cut the Theater, Keep the Thinking

The practical payoff is probe-guided early exit. If a probe detects high confidence, stop generating. On MMLU, this cuts token usage by 80% while retaining 97% accuracy. On GPQA-Diamond — where the reasoning is genuine — the reduction is a more modest 40% with the same accuracy retention. That's the system deciding when to actually think and when to stop pretending.
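Probe-guided early exit is, at its core, a simple control loop: generate a step, read the probe, stop once confidence clears a threshold. A sketch with hypothetical `step_fn` and `probe_fn` callables standing in for the real generation step and activation probe:

```python
def generate_with_early_exit(step_fn, probe_fn, max_steps, threshold=0.95):
    """Probe-guided early exit sketch: after each reasoning step, read the
    probe's confidence on the model's internal state; stop generating once
    it crosses the threshold. Callables and threshold are illustrative."""
    trace = []
    for t in range(max_steps):
        state = step_fn(t)      # stand-in for generating one reasoning step
        trace.append(state)
        if probe_fn(state) >= threshold:
            break               # the model already knows; stop the theater
    return trace

# Toy run: confidence climbs linearly, crossing 0.95 at step 45,
# so the loop exits after 46 steps instead of 1000.
steps = generate_with_early_exit(
    step_fn=lambda t: t,
    probe_fn=lambda s: (50 + s) / 100,
    max_steps=1000,
)
```

On easy problems the probe crosses the threshold early and most of the budget is saved; on hard problems confidence stays low longer and the chain runs closer to full length, which matches the 80% versus 40% reduction split.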

The inflection-point analysis adds another layer. Genuine cognitive events (backtracks, realizations, reconsiderations) occur almost exclusively in responses where probes show large belief shifts. Responses that are high-confidence from the start show a reconsideration rate of 0.015 per step; low-confidence responses show roughly double that. The reasoning text is honest about uncertainty only when the model is actually uncertain.
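The belief-shift side of that analysis can be sketched as counting large jumps in the probe's answer confidence between consecutive steps (the 0.3 threshold and the toy trace below are illustrative choices, not the paper's):

```python
def shift_rate(beliefs, threshold=0.3):
    """Per-step rate of large belief shifts: the fraction of consecutive
    steps where the probe's confidence jumps by at least `threshold`."""
    pairs = list(zip(beliefs, beliefs[1:]))
    shifts = sum(1 for a, b in pairs if abs(b - a) >= threshold)
    return shifts / len(pairs)

# A toy probe trace with two big jumps across four transitions -> rate 0.5
rate = shift_rate([0.20, 0.25, 0.70, 0.72, 0.30])
```

Flagging steps where this rate spikes would be one way to separate responses with genuine reconsideration from flat, confident-from-the-start ones.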

Why Monitors Can't Catch This

We're paying per token for a chess grandmaster narrating a game they've already won.

This matters for AI safety monitoring. If you're watching the chain-of-thought to verify a model's reasoning — which is how most current oversight works — you're watching a performance, not a process. The model can write "let me reconsider" without actually reconsidering anything internally. The probe data shows the internal state didn't budge.

Scaling up reasoning tokens doesn't fix this. More tokens on easy problems just means more theater. The path forward isn't longer chains — it's tools that read the model's internals and know when the thinking is real. That's a fundamentally different kind of interpretability than we've been building.