When researchers at NeurIPS 2024 ran a meta-analysis across 100 papers, 20 datasets, and 14 models, they found something the hype cycle hadn't mentioned: chain-of-thought prompting provides strong benefits on math and logic tasks, with "much smaller gains on other types of tasks." A separate study from Wharton found that on reasoning models like o3-mini and Gemini Flash 2.5, adding chain-of-thought to your prompt produces negligible or negative improvement while increasing costs by 20-80%.
Chain-of-thought is the single most studied prompting technique in AI. It's also the most misapplied. This guide covers when to use it, when to skip it, and how to avoid paying 13x more in tokens for the same answer.
What Chain-of-Thought Actually Does
In January 2022, Jason Wei and colleagues at Google Brain published a paper showing that if you include intermediate reasoning steps in your prompt examples, large language models produce dramatically better answers on reasoning tasks. The technique is straightforward: instead of asking a model to jump directly to an answer, you show it (or tell it) to work through the problem step by step.
The original results were striking. On GSM8K, a benchmark of grade-school math problems, PaLM 540B jumped from 17.9% accuracy with standard prompting to 56.9% with chain-of-thought. On symbolic reasoning (concatenating the last letters of words), accuracy went from 7.6% to 99.4%. On a multi-step math benchmark called MultiArith, accuracy rose from 42.2% to 94.7%.
A few months later, Kojima et al. discovered something even simpler: just adding "Let's think step by step" before the answer, with no examples at all, boosted zero-shot performance on MultiArith from 17.7% to 78.7%. That single phrase outperformed carefully curated 8-shot prompting.
These results launched an industry-wide adoption of chain-of-thought as the default prompting strategy. But the full picture is more complicated than the headline numbers suggest.
Where It Genuinely Helps
Chain-of-thought earns its reputation on tasks that require sequential reasoning where each step depends on the previous one.
Math and arithmetic consistently show the largest gains. Across the original Google Brain study, every math benchmark improved substantially for large models: GSM8K by 30+ percentage points, MultiArith by 50+. Self-consistency, a technique that samples multiple reasoning paths and takes the majority vote, pushed GSM8K accuracy from 56.9% to 74.4% on PaLM 540B.
Multi-step logic benefits reliably. StrategyQA (yes/no questions requiring multi-hop reasoning) improved by 9.2 percentage points. Date understanding tasks improved by 16.3 points. Sports understanding went from 80.5% to 95.4%, surpassing unaided human sports enthusiasts at 84%.
Symbolic manipulation sees near-total transformation. Last letter concatenation went from 7.6% to 99.4%. Coin flip tracking went from 98.1% to 100%. These tasks have clear sequential structure that maps perfectly to step-by-step reasoning.
Complex planning benefits when the task requires exploring a decision space. Tree of Thoughts, a variant that explores multiple reasoning branches with backtracking, solved the Game of 24 puzzle at 74% compared to standard CoT's 4%. The difference: 60% of CoT reasoning paths fail at the first step with no way to recover.
The pattern is consistent: if a task has a clear sequence of dependent steps, and a human would benefit from writing out their work, chain-of-thought helps.

Where It Fails (The Part Nobody Talks About)
A 2024 study from researchers at multiple institutions, published at NeurIPS under the title "Mind Your Step (by Step)," systematically tested when chain-of-thought degrades performance. They drew on a specific insight from cognitive psychology: if verbal deliberation hurts human performance on a task, it will likely hurt LLMs too.
Implicit pattern learning suffers the most. On artificial grammar tasks where subjects must recognize patterns without consciously identifying the rules, o1-preview dropped 36.3 percentage points with chain-of-thought. GPT-4o dropped 23.1 points. Verbalizing the reasoning process forced the models to commit to explicit rules that didn't capture the actual pattern.
Visual-spatial reasoning degrades across every model tested. On facial recognition tasks, Claude 3 Opus dropped 14.4 points, GPT-4o dropped 12.8 points, and Gemini 1.5 Pro dropped 11.4 points. The psychology literature calls this "verbal overshadowing," where describing a face in words impairs your ability to recognize it later. The same mechanism transfers to language models processing visual information.
Classification with exceptions becomes dramatically slower. When categories have irregular members that break the general rules, CoT caused GPT-4o to need 331% more iterations to reach the correct answer. Claude 3.5 Sonnet needed 178% more. The models over-committed to the rules they articulated, making them worse at handling exceptions.
Simple retrieval gains nothing. The NeurIPS meta-analysis found that on MMLU, chain-of-thought produces "almost identical accuracy" to direct answers unless the question involves symbolic operations. For straightforward factual questions, CoT adds tokens and latency without improving the answer.
Creative tasks suffer from the structure. Research from multiple groups confirms that CoT's sequential framework constrains the flexibility that creative work requires. The rigid step-by-step process that helps math actively hinders open-ended generation.
Small models get worse, not better. The original Wei et al. paper showed this clearly but it's often forgotten. Below roughly 100 billion parameters, chain-of-thought either has no effect or actively hurts. GPT-3 at 1.3B went from 2.4% to 0.5% on GSM8K with CoT. Small models generate what the authors called "fluent but non-causal rationales," producing plausible-sounding reasoning that leads to wrong answers.
The Decision Framework
Here's how to decide whether chain-of-thought helps for your specific task.
Use CoT when all three conditions hold:
- The task requires multiple dependent reasoning steps
- A human would benefit from writing out their work
- Your model is large enough (100B+ parameters, or a frontier model)
Skip CoT when any of these apply:
- The answer is a direct factual lookup
- The task is classification or pattern recognition
- You're using a reasoning model (o1, o3, DeepSeek R1) that already thinks internally
- Your model is small (under 100B parameters)
- The task is creative or open-ended
- You need structured output (JSON, XML), which a separate EMNLP 2024 study found degrades reasoning ability
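The framework above reduces to a few boolean checks. Here's a minimal sketch that encodes it; the task categories, the 100B threshold, and the function itself are illustrative heuristics drawn from the findings above, not a canonical rule:

```python
def should_use_cot(task_type: str,
                   model_params_b: float,
                   is_reasoning_model: bool,
                   needs_structured_output: bool) -> bool:
    """Rough heuristic for the decision framework above.

    task_type: one of "math", "logic", "multi_step", "lookup",
               "classification", "creative" (illustrative categories).
    model_params_b: approximate parameter count in billions.
    """
    # Reasoning models already think internally; explicit CoT is redundant.
    if is_reasoning_model:
        return False
    # Below roughly 100B parameters, CoT tends to hurt rather than help.
    if model_params_b < 100:
        return False
    # Format constraints (JSON/XML) have been found to degrade reasoning.
    if needs_structured_output:
        return False
    # CoT pays off mainly on sequential, dependent-step tasks.
    return task_type in {"math", "logic", "multi_step"}
```

Calling `should_use_cot("math", 540, False, False)` returns `True`; swap in a reasoning model or a small model and it flips to `False`.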
The overthinking trap: Recent research found that CoT performance follows an inverted U-shaped curve. Initially it helps, but beyond an optimal length, additional reasoning steps introduce errors. More capable models favor shorter, more efficient chains. If your model is already good at the task, CoT may push it past the peak into degradation territory.
Variants Worth Knowing
Not all chain-of-thought is the same. The variants differ meaningfully in cost, accuracy, and applicability.
Zero-shot CoT is the simplest: append "Let's think step by step" to your prompt. No examples needed. It's the right starting point for most use cases. Cost: roughly 12x more tokens than a direct answer.
Few-shot CoT includes 4-8 worked examples with reasoning chains. More precise than zero-shot but requires crafting quality examples. Use it when you need domain-specific reasoning patterns. Cost: roughly 15x more tokens.
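Both variants are pure prompt construction. A minimal sketch of each; the worked example below is a placeholder I made up for illustration, not one of the original Wei et al. exemplars:

```python
def zero_shot_cot(question: str) -> str:
    # Kojima et al.'s trigger phrase, appended with no examples.
    return f"Q: {question}\nA: Let's think step by step."

def few_shot_cot(question: str, examples: list[tuple[str, str]]) -> str:
    # Each example pairs a question with a worked reasoning chain
    # that ends in the answer.
    shots = "\n\n".join(f"Q: {q}\nA: {r}" for q, r in examples)
    return f"{shots}\n\nQ: {question}\nA:"

prompt = few_shot_cot(
    "A bakery sold 14 cakes in the morning and twice as many in the "
    "afternoon. How many cakes did it sell in total?",
    [("Tom has 3 apples and buys 5 more. How many does he have?",
      "Tom starts with 3 apples. He buys 5 more, so 3 + 5 = 8. "
      "The answer is 8.")],
)
```

In practice you'd send `prompt` to your model and parse the final line for the answer.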
Self-consistency samples multiple reasoning paths (typically 5-40) and takes the majority vote. It pushed GSM8K from 56.9% to 74.4% on PaLM 540B. The accuracy gain is real but the cost is brutal: 40 samples means 40x the inference cost. Use it only when accuracy matters more than budget.
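Mechanically, self-consistency is just sampling plus a majority vote over extracted answers. A sketch with a pluggable sampler; in a real deployment, `sample_answer` (a placeholder here) would run one CoT completion at temperature > 0 and return only the final extracted answer:

```python
from collections import Counter
from typing import Callable

def self_consistency(question: str,
                     sample_answer: Callable[[str], str],
                     n_samples: int = 5) -> str:
    """Sample n independent reasoning paths, return the majority answer."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    # Majority vote: the most common final answer wins, regardless of
    # which reasoning path produced it.
    return Counter(answers).most_common(1)[0][0]
```

Note the cost implication is right there in the loop: `n_samples` full CoT completions per question.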
Tree of Thoughts explores reasoning branches with backtracking, using BFS or DFS search. It solved Game of 24 at 74% versus CoT's 4%. Powerful for search and planning problems, but complex to implement and expensive to run.
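Stripped to its skeleton, the BFS variant of Tree of Thoughts is a beam search over partial reasoning states, pruned by a value function (in the paper, the model itself scores each state). A toy sketch; `propose` and `evaluate` are placeholders you'd back with model calls:

```python
from typing import Callable

def tree_of_thoughts(root: str,
                     propose: Callable[[str], list[str]],
                     evaluate: Callable[[str], float],
                     depth: int = 3,
                     beam_width: int = 5) -> str:
    """BFS over reasoning states, keeping the top-scoring beam per level."""
    frontier = [root]
    for _ in range(depth):
        # Expand every state on the frontier into candidate next thoughts.
        candidates = [c for state in frontier for c in propose(state)]
        if not candidates:
            break
        # Prune to the best beam_width states. This is where ToT recovers
        # from dead ends that a single CoT path cannot escape.
        frontier = sorted(candidates, key=evaluate, reverse=True)[:beam_width]
    return max(frontier, key=evaluate)
```

The expense is visible in the structure: every level costs `len(frontier) × branching` proposal calls plus an evaluation call per candidate.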
Chain of Draft, published in February 2025, might be the most practically useful variant. Instead of verbose reasoning, it generates minimalistic scratch-work notes. Result: 7.6% of CoT's tokens while matching or exceeding accuracy. On sports understanding with Claude 3.5 Sonnet, it cut tokens from 189 to 14 while improving accuracy from 93.2% to 97.3%. If you're using CoT today, this is likely the upgrade you should try first.
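Chain of Draft lives entirely in the instruction: cap each reasoning step at a few words. A sketch of the prompt builder; the wording paraphrases the paper's idea and is not the verbatim instruction from Xu et al.:

```python
def chain_of_draft(question: str, max_words_per_step: int = 5) -> str:
    instruction = (
        "Think step by step, but keep only a minimal draft for each step, "
        f"with at most {max_words_per_step} words per step. "
        "Return the final answer after ####."
    )
    return f"{instruction}\n\nQ: {question}\nA:"
```

The `####` delimiter makes the final answer trivially parseable, which matters when the draft itself is terse.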
Chain of Verification generates an answer, then produces verification questions, answers them independently, and refines the original response. Meta AI's research showed it reduces factual hallucinations by 50-70%. Best for knowledge-intensive tasks where getting it right matters more than speed.
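The four stages compose as plain function calls. A sketch, where `llm` is a stand-in for whatever completion function you use and the prompt strings are my own paraphrases of the Dhuliawala et al. pipeline:

```python
from typing import Callable

def chain_of_verification(question: str, llm: Callable[[str], str]) -> str:
    # 1. Draft a baseline answer.
    baseline = llm(f"Answer the question.\n\nQ: {question}")
    # 2. Plan verification questions that probe the draft's factual claims.
    plan = llm(f"List verification questions for this answer:\n{baseline}")
    # 3. Answer each verification question independently, WITHOUT showing
    #    the draft, so the checks aren't biased toward repeating its errors.
    checks = "\n".join(llm(q) for q in plan.splitlines() if q.strip())
    # 4. Revise the draft in light of the independent checks.
    return llm(
        f"Q: {question}\nDraft: {baseline}\nChecks:\n{checks}\n"
        "Rewrite the draft, fixing anything the checks contradict."
    )
```

Step 3 is the load-bearing design choice: answering the checks in isolation is what breaks the model's tendency to rubber-stamp its own draft.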

Reasoning Models Changed the Equation
OpenAI's o1, o3, and DeepSeek's R1 perform chain-of-thought internally during inference, trained through reinforcement learning rather than prompting. This changes the practical calculus substantially.
Adding explicit CoT prompts to reasoning models is redundant at best and harmful at worst. The Wharton study found that o3-mini gained only 2.9% from CoT prompting, o4-mini gained 3.1%, and Gemini Flash 2.5 lost 3.3%. Multiple guides from OpenAI and the practitioner community explicitly recommend against few-shot CoT with reasoning models. Let them think on their own.
Reasoning models are, in effect, CoT made permanent: they don't need the prompt because the reasoning process is baked into their weights. You're paying for it in inference tokens either way, but reasoning models allocate that compute more efficiently than prompt-based CoT.
For practitioners, the question has shifted. It's no longer "should I use CoT?" but "should I use CoT prompting on a standard model, or switch to a reasoning model?" The answer depends on cost. DeepSeek R1 charges $0.55 per million input tokens versus o1's $15.00. A well-crafted CoT prompt on a standard model can match reasoning model performance at lower cost for many tasks. For the hardest problems, reasoning tokens win. For everything else, it's a budget decision, and the budget-aware approach to agent reasoning applies here too.
The Cost Reality
Chain-of-thought isn't free. Every reasoning token you generate is a token you pay for.
On GSM8K, standard CoT averages 190-205 tokens per question versus 10-20 for a direct answer. That's a 10-15x cost multiplier. Self-consistency with 40 samples pushes the multiplier to roughly 500x. The Wharton study measured response time increases of 35-600% for non-reasoning models using CoT.
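The multipliers follow directly from the token counts. A back-of-envelope check using the midpoints of the figures above (the $3-per-million output price is illustrative, not any particular provider's rate):

```python
direct_tokens = 15   # midpoint of 10-20 tokens for a direct answer
cot_tokens = 197     # midpoint of 190-205 tokens per CoT answer

cot_multiplier = cot_tokens / direct_tokens        # ~13x
sc_multiplier = 40 * cot_tokens / direct_tokens    # ~525x with 40 samples

# At an illustrative $3 per million output tokens, across 1M questions:
price_per_token = 3 / 1_000_000
direct_cost = 1_000_000 * direct_tokens * price_per_token  # $45
cot_cost = 1_000_000 * cot_tokens * price_per_token        # $591
```

Per question the difference is fractions of a cent; at a million questions it's the gap between $45 and $591, before self-consistency multiplies it again.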
Chain of Draft compresses this dramatically: 40-44 tokens on GSM8K with comparable accuracy to full CoT. TokenSkip, a 2025 technique, reduces reasoning tokens by 40% with less than 0.4% accuracy loss. These compression approaches suggest that most of CoT's tokens are filler, and the actual reasoning signal is sparse.
For agent systems that make hundreds of tool-use decisions per session, the token overhead compounds fast. The prompt engineering ceiling already showed that sophisticated prompting underperforms zero-shot on frontier models for many tasks. Before adding CoT to your agent's system prompt, verify that it actually improves your specific task. The meta-analysis is clear: on anything that isn't math, logic, or multi-step reasoning, it probably doesn't.
A Faithfulness Warning
There's a deeper issue: chain-of-thought reasoning isn't always honest. Anthropic's July 2023 research found "large variation across tasks in how strongly models condition on the CoT," and an inverse scaling effect where larger, more capable models produce less faithful reasoning. Their May 2025 follow-up on reasoning models was more specific: Claude 3.7 Sonnet mentioned relevant hints in its chain of thought only 25% of the time. DeepSeek R1 mentioned them only 39% of the time.
Unfaithful chains of thought are longer. Claude averaged 2,064 tokens when its reasoning was unfaithful versus 1,439 when faithful. If the model is writing more, it isn't necessarily thinking more. It may be rationalizing rather than reasoning.
This matters for anyone using CoT as an interpretability tool, not just an accuracy boost. If you're relying on the reasoning chain to understand why a model reached its answer, the chain may not reflect the model's actual process. For agent systems where auditing decisions is critical, this is a significant caveat.
Sources
Research Papers:
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models -- Wei et al., Google Brain (2022)
- Large Language Models are Zero-Shot Reasoners -- Kojima et al. (2022)
- Self-Consistency Improves Chain of Thought Reasoning -- Wang et al. (2022)
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models -- Yao et al. (2023)
- Chain-of-Verification Reduces Hallucination in Large Language Models -- Dhuliawala et al., Meta AI (2023)
- Mind Your Step (by Step): Chain-of-Thought Can Reduce Performance -- Liu et al. (2024)
- Let Me Speak Freely? A Study on Format Restrictions and LLM Reasoning -- Tam et al. (2024)
- Chain of Draft: Thinking Faster by Writing Less -- Xu et al. (2025)
- When More is Less: Understanding Chain-of-Thought Length -- Chen et al. (2025)
Industry / Case Studies:
- To CoT or not to CoT? Chain-of-Thought Helps Mainly on Math and Symbolic Reasoning -- NeurIPS 2024 Meta-Analysis
- The Decreasing Value of Chain of Thought in Prompting -- Meincke, Mollick et al., Wharton (2025)
- Reasoning Models Don't Always Say What They Think -- Anthropic (2025)
Commentary:
- Measuring Faithfulness in Chain-of-Thought Reasoning -- Anthropic (2023)
- Chain of Thoughtlessness: An Analysis of CoT in Planning -- Stechly et al. (2024)
Related Swarm Signal Coverage:
- From Answer to Insight: Why Reasoning Tokens Are a Quiet Revolution
- The Prompt Engineering Ceiling: Why Better Instructions Won't Save You
- The Budget Problem: Why AI Agents Are Learning to Be Cheap
- DeepSeek Explained: How a Chinese Lab Rewrote AI Economics
- The Benchmark Crisis: Why Model Leaderboards Are Becoming Marketing Tools