Chain-of-Thought Prompting Doesn't Always Work. Here's the Evidence.

"Think step by step." It's the most common prompt engineering advice in circulation, repeated in tutorials, baked into system prompts, and treated as a universal performance upgrade. But a growing body of research says otherwise. Chain-of-thought (CoT) prompting can reduce accuracy by as much as 36.3 percentage points on some tasks, can lengthen response times by anywhere from 20% to 600% depending on the model, and adds almost nothing measurable on reasoning models.

The technique still works in specific situations. But the gap between "sometimes useful" and "always add it" is where teams burn tokens, add latency, and erode trust in their systems.

The Wharton Numbers

A June 2025 report from Wharton's Generative AI Labs tested chain-of-thought prompting across multiple models and task types. The headline finding: CoT's value is decreasing as models get more capable.

For non-reasoning models, CoT improved average performance modestly. Gemini Flash 2.0 gained 13.5%. Sonnet 3.5 gained 11.7%. But GPT-4o-mini gained just 4.4%, and that wasn't statistically significant. More importantly, CoT introduced variability. Models got harder questions right but started missing easier ones they'd previously answered correctly. The net effect was often a wash.
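One way to see this kind of variability in your own evaluations is to compare the two prompting styles question by question instead of by average score. The sketch below is an illustration, not the report's methodology: the function names and the ten sample results are made up, and the exact McNemar test is simply one reasonable choice for paired right/wrong outcomes.

```python
from math import comb


def paired_flips(direct_correct, cot_correct):
    """Count questions CoT fixed (gained) and questions CoT broke (lost)."""
    if len(direct_correct) != len(cot_correct):
        raise ValueError("both runs must cover the same questions")
    pairs = list(zip(direct_correct, cot_correct))
    gained = sum(1 for d, c in pairs if not d and c)
    lost = sum(1 for d, c in pairs if d and not c)
    return gained, lost


def mcnemar_exact_p(gained, lost):
    """Two-sided exact McNemar p-value computed on the discordant pairs."""
    n = gained + lost
    if n == 0:
        return 1.0
    k = min(gained, lost)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up per-question results (True = correct). Not data from the report.
direct = [True, True, True, False, False, True, True, False, True, True]
cot = [True, False, True, True, True, True, False, True, True, True]

gained, lost = paired_flips(direct, cot)
p_value = mcnemar_exact_p(gained, lost)
print(f"gained={gained} lost={lost} p={p_value:.3f}")
```

In this made-up sample CoT fixes three questions and breaks two. Accuracy rises from 7/10 to 8/10, yet the paired test gives no evidence of a real difference, which is the pattern a headline average can hide.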

For reasoning models like o1 and o3, explicit CoT prompting added almost nothing. These models already perform internal chain-of-thought during inference. Asking them to "think step by step" is redundant at best and counterproductive at worst. CoT requests took 35-600% longer than direct requests across models tested.
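Percent-longer figures are easy to misread as multipliers, so a quick conversion helps. This is a minimal sketch: the function name and the 2-second baseline are assumptions chosen for illustration, and only the 35% and 600% endpoints come from the figures above.

```python
def latency_with_cot(direct_seconds: float, percent_longer: float) -> float:
    """Response time when CoT takes `percent_longer` percent longer than a direct request."""
    return direct_seconds * (1 + percent_longer / 100)


# Hypothetical 2-second direct request at both ends of the reported range.
for percent_longer in (35, 600):
    seconds = latency_with_cot(2.0, percent_longer)
    print(f"{percent_longer}% longer: {seconds:.1f}s ({seconds / 2.0:.2f}x)")
```

Note that "600% longer" works out to seven times the direct latency, not six.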

When Thinking Makes It Worse

A Princeton study published at ICML 2025 asked a sharper question: are there tasks where reasoning itself hurts?

The researchers drew from cognitive psychology. Humans perform worse on certain tasks when they deliberate: implicit statistical learning, visual pattern recognition, and classifying patterns that contain exceptions. The hypothesis was that LLMs might show the same effect.

They did. Across three representative tasks, state-of-the-art models showed significant accuracy drops with CoT. The worst case: an absolute accuracy drop of 36.3 percentage points for OpenAI's o1-preview compared to GPT-4o without CoT. On tasks involving intuitive pattern matching or exception-heavy classification, forcing step-by-step reasoning disrupted the model's ability to use learned statistical regularities.

This maps to something practitioners encounter but rarely name. When your agent needs to classify a support ticket into one of forty categories, it doesn't need to reason through each option. When your RAG pipeline retrieves five documents and needs to pick the most relevant, explicit reasoning often introduces second-guessing that degrades selection quality.

The Faithfulness Problem

Even when CoT appears to improve outputs, there's a deeper issue: the reasoning trace may not reflect how the model actually reached its answer.

Oxford researchers found that CoT explanations frequently diverge from models' real decision processes. Models use shortcuts, pattern matching, and latent knowledge that never surfaces in the visible reasoning chain. A 2025 study on open-weight reasoning models documented what researchers called "Reasoning Theater": models that lock in their answer early in the generation process, then continue producing tokens that look like deliberation but change nothing.

This isn't just an interpretability concern. If you're using CoT traces for agent observability or debugging, you may be reading fiction. The reasoning chain tells you what the model thinks good reasoning looks like, not what computation it actually performed.
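One rough way to probe for the lock-in behavior described above is to truncate a trace at each step, force an answer, and see when the answer stops changing. The sketch below is an assumption-laden illustration, not the method used in the studies cited: `answer_from_prefix` is a hypothetical callable standing in for a real model call, and `fake_model` is a toy stub that decides after the first step.

```python
from typing import Callable, List, Sequence, Tuple


def earliest_lock_in(
    question: str,
    steps: Sequence[str],
    answer_from_prefix: Callable[[str, Sequence[str]], str],
) -> Tuple[int, List[str]]:
    """Return the smallest number of steps after which the answer never changes.

    answers[k] is the answer forced after showing only the first k steps.
    """
    answers = [answer_from_prefix(question, steps[:k]) for k in range(len(steps) + 1)]
    final = answers[-1]
    lock = len(steps)
    # Walk backwards while the earlier answer already matches the final one.
    while lock > 0 and answers[lock - 1] == final:
        lock -= 1
    return lock, answers


def fake_model(question: str, prefix: Sequence[str]) -> str:
    """Toy stub (hypothetical): answers "A" with no reasoning, "B" after any step."""
    return "B" if len(prefix) >= 1 else "A"


trace = ["restate the problem", "consider option A", "consider option B", "conclude"]
lock, answers = earliest_lock_in("Which option?", trace, fake_model)
print(f"answer fixed after {lock} of {len(trace)} steps: {answers}")
```

If the answer is fixed after the first step of a four-step trace, as with this stub, the remaining steps had no effect on the outcome, whatever they appear to argue.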

What This Changes

None of this means "stop using chain-of-thought." It means stop using it by default.

The evidence points to a decision framework:

Use CoT when the task involves genuine multi-step reasoning: math, logic chains, or compositional planning. These are the domains where the original 2022 research demonstrated real gains, and where the technique still holds up.

Skip CoT when you're working with reasoning models (they already do it internally), when the task involves pattern matching or classification, when latency matters, or when you're paying per token and the accuracy gain doesn't justify the extra tokens.

Never trust CoT traces as explanations of model behavior. Use them as outputs to evaluate, not as windows into computation.
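The framework above can be sketched as a small routing check. This is a minimal illustration under stated assumptions: the `Task` fields and the `should_use_cot` name are hypothetical, and letting any "skip" condition override the "use" condition is one reading of the rules, not something the research prescribes.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical description of a prompt's workload; all fields are assumptions."""
    needs_multi_step_reasoning: bool = False
    uses_reasoning_model: bool = False
    is_pattern_matching: bool = False
    latency_sensitive: bool = False
    cost_outweighs_gain: bool = False


def should_use_cot(task: Task) -> bool:
    """Apply the skip rules first, then require genuine multi-step reasoning."""
    skip = (
        task.uses_reasoning_model
        or task.is_pattern_matching
        or task.latency_sensitive
        or task.cost_outweighs_gain
    )
    if skip:
        return False
    return task.needs_multi_step_reasoning


math_word_problem = Task(needs_multi_step_reasoning=True)
ticket_classifier = Task(is_pattern_matching=True)
math_on_reasoning_model = Task(needs_multi_step_reasoning=True, uses_reasoning_model=True)

for name, task in [
    ("math word problem", math_word_problem),
    ("ticket classifier", ticket_classifier),
    ("math on reasoning model", math_on_reasoning_model),
]:
    print(f"{name}: use CoT = {should_use_cot(task)}")
```

The default is deliberately "no CoT": a task has to show a need for multi-step reasoning, and clear every skip condition, before the extra tokens are spent.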

The broader pattern here echoes what we've seen with scaling laws and prompt engineering ceilings: techniques that drove early LLM performance don't scale linearly into frontier territory. The models changed. The best practices haven't caught up.
