Signal Signals

Chain-of-Thought Prompting Doesn't Always Work. Here's the Evidence.

By Tyler · April 29, 2026 · 6 min read



Think step by step. It's the most common prompt engineering advice in circulation, repeated in tutorials, baked into system prompts, and treated as a universal performance upgrade. But a growing body of research says otherwise. Chain-of-thought prompting can reduce accuracy by up to 36.3%, costs 20-80% more in compute time, and on reasoning models, adds almost nothing measurable (https://gail.wharton.upenn.edu/research-and-insights/tech-report-chain-of-thought/).

The technique still works in specific situations. But the gap between "sometimes useful" and "always add it" is where teams waste tokens, latency, and trust in their systems.

The Wharton Numbers

A June 2025 report from Wharton's Generative AI Labs tested chain-of-thought prompting across multiple models and task types. The headline finding: CoT's value is decreasing (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5285532) as models get more capable.

For non-reasoning models, CoT improved average performance modestly. Gemini Flash 2.0 gained 13.5%. Sonnet 3.5 gained 11.7%. But GPT-4o-mini gained just 4.4%, and that wasn't statistically significant. More importantly, CoT introduced variability. Models got harder questions right but started missing easier ones they'd previously answered correctly. The net effect was often a wash.

For reasoning models like o1 and o3, explicit CoT prompting added almost nothing. These models already perform internal chain-of-thought during inference. Asking them to "think step by step" is redundant at best and counterproductive at worst. CoT requests took 35-600% longer than direct requests across models tested.

When Thinking Makes It Worse

A Princeton study published at ICML 2025 asked a sharper question: are there tasks where reasoning itself hurts (https://arxiv.org/abs/2410.21333)?

The researchers drew from cognitive psychology. Humans perform worse on certain tasks when they deliberate, specifically implicit statistical learning, visual pattern recognition, and classifying patterns with exceptions. The hypothesis was that LLMs might show the same effect.

They did. Across three representative tasks, state-of-the-art models showed significant accuracy drops with CoT. The worst case: a 36.3% absolute accuracy drop for OpenAI's o1-preview compared to GPT-4o without CoT. On tasks involving intuitive pattern matching or exception-heavy classification, forcing step-by-step reasoning disrupted the model's ability to use learned statistical regularities.

This maps to something practitioners encounter but rarely name. When your agent needs to classify a support ticket into one of forty categories, it doesn't need to reason through each option. When your RAG pipeline (https://swarmsignal.net/building-rag-systems-that-work/) retrieves five documents and needs to pick the most relevant, explicit reasoning often introduces second-guessing that degrades selection quality.

The Faithfulness Problem

Even when CoT appears to improve outputs, there's a deeper issue: the reasoning trace may not reflect how the model actually reached its answer (https://aigi.ox.ac.uk/wp-content/uploads/2025/07/Cot_Is_Not_Explainability.pdf).

Oxford researchers found that CoT explanations frequently diverge from models' real decision processes. Models use shortcuts, pattern matching, and latent knowledge that never surfaces in the visible reasoning chain. A 2025 study on open-weight reasoning models documented what researchers called "Reasoning Theater" (https://www.researchgate.net/publication/403262878_Why_Models_Know_But_Don't_Say_Chain-of-Thought_Faithfulness_Divergence_Between_Thinking_Tokens_and_Answers_in_Open-Weight_Reasoning_Models): models that lock in their answer early in the generation process, then continue producing tokens that look like deliberation but change nothing.

This isn't just an interpretability concern. If you're using CoT traces for agent observability (https://swarmsignal.net/ai-agent-observability-and-monitoring-in-production-distribu/) or debugging, you may be reading fiction. The reasoning chain tells you what the model thinks good reasoning looks like, not what computation it actually performed.

What This Changes

None of this means "stop using chain-of-thought." It means stop using it by default.

The evidence points to a decision framework:

Use CoT when the task involves genuine multi-step reasoning, math, logic chains, or compositional planning. These are the domains where the original 2022 research (https://openreview.net/pdf?id=_VjQlMeSB_J) demonstrated real gains, and where the technique still holds up.

Skip CoT when you're working with reasoning models (they already do it internally), the task involves pattern matching or classification, latency matters, or you're paying per token and the accuracy gain doesn't justify the 2-6x cost increase.

Never trust CoT traces as explanations of model behavior. Use them as outputs to evaluate, not as windows into computation.

The broader pattern here echoes what we've seen with scaling laws (https://swarmsignal.net/scaling-laws-explained-for-practitioners-what-actually-matte/) and prompt engineering ceilings (https://swarmsignal.net/the-prompt-engineering-ceiling/): techniques that drove early LLM performance don't scale linearly into frontier territory. The models changed. The best practices haven't caught up.
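To make the decision framework concrete, here is a minimal Python sketch of a prompt router that applies the "use CoT when / skip CoT when" rules. Everything named here is hypothetical and chosen for illustration: the `Task` fields, the task-kind labels, and the prompt wording are assumptions, not something the cited studies prescribe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    kind: str                # hypothetical label, e.g. "math" or "classification"
    reasoning_model: bool    # True if the target model already reasons internally
    latency_sensitive: bool  # True if response time matters for this call


# Assumed labels for the multi-step domains where CoT is said to still help.
MULTI_STEP_KINDS = {"math", "logic", "planning"}


def should_use_cot(task: Task) -> bool:
    """Return True only when none of the skip conditions apply."""
    if task.reasoning_model:
        return False  # the model already does chain-of-thought internally
    if task.latency_sensitive:
        return False  # CoT requests take noticeably longer
    # Pattern matching and classification fall through to False here.
    return task.kind in MULTI_STEP_KINDS


def build_prompt(question: str, task: Task) -> str:
    """Append a CoT instruction only when the router allows it."""
    if should_use_cot(task):
        return f"{question}\n\nThink step by step, then give the final answer."
    return f"{question}\n\nAnswer directly."


if __name__ == "__main__":
    ticket = Task(kind="classification", reasoning_model=False, latency_sensitive=False)
    print(build_prompt("Which category fits this support ticket?", ticket))
```

The router defaults to the direct prompt and opts in to CoT only for the listed task kinds, which mirrors the "stop using it by default" stance. The per-token cost condition is left out because it depends on an accuracy gain you would have to measure on your own workload.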
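One way to check whether CoT is "a wash" on your own workload is to run the same eval items with and without it, then count the questions that flipped in each direction alongside the latency overhead. The sketch below assumes you already have paired per-item results. The `Result` type and `compare` function are hypothetical names, and the numbers in the usage example are made up for illustration, not measurements from any study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    correct: bool   # whether the model answered this item correctly
    seconds: float  # wall-clock time for the request


def compare(direct: list[Result], cot: list[Result]) -> dict[str, float]:
    """Compare paired direct vs. CoT runs over the same eval items."""
    if not direct or len(direct) != len(cot):
        raise ValueError("need two non-empty result lists of equal length")
    direct_time = sum(r.seconds for r in direct)
    if direct_time <= 0:
        raise ValueError("direct run must have positive total time")
    cot_time = sum(r.seconds for r in cot)
    # Items CoT fixed, and items CoT broke that were previously correct.
    gained = sum(1 for d, c in zip(direct, cot) if not d.correct and c.correct)
    lost = sum(1 for d, c in zip(direct, cot) if d.correct and not c.correct)
    return {
        "gained": gained,
        "lost": lost,
        "net_accuracy_delta": (gained - lost) / len(direct),
        "latency_overhead_pct": 100.0 * (cot_time - direct_time) / direct_time,
    }


if __name__ == "__main__":
    # Illustrative data only: CoT fixes one item and breaks another.
    direct = [Result(True, 1.0), Result(False, 1.0), Result(True, 1.0), Result(False, 1.0)]
    cot = [Result(False, 2.0), Result(True, 2.0), Result(True, 2.0), Result(False, 2.0)]
    print(compare(direct, cot))
```

Reporting `gained` and `lost` separately matters because a net delta near zero can hide real churn: the model may now miss items it used to get right.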
