
In September 2024, OpenAI's o1 model achieved an 89th percentile ranking among competitive programmers on Codeforces, up from GPT-4o's 11th percentile. The difference wasn't better training data or more parameters. It was time to think.

The model spent thousands of tokens reasoning internally before writing a single line of visible code. Those invisible tokens, what OpenAI calls "reasoning tokens," represent a fundamental shift in how language models approach problems. Not faster inference. Not bigger context windows. The ability to think before speaking.

This matters because we've been teaching models to show their work for years through Chain-of-Thought prompting. Wei et al.'s 2022 paper demonstrated that asking models to reason step-by-step dramatically improved performance on math and logic tasks. A 540-billion parameter model with eight Chain-of-Thought examples beat fine-tuned GPT-3 with a verifier on the GSM8K benchmark.

But there was a catch. Every reasoning step consumed context window space and increased latency. Users saw the model's scratch work whether they wanted it or not. Reasoning tokens solve this by making the thinking invisible, and that changes everything about how we build AI systems.

What Reasoning Tokens Actually Do

Reasoning tokens are hidden Chain-of-Thought. Before generating a visible response, models like OpenAI's o1 series, Anthropic's Claude with extended thinking, and DeepSeek R1 generate internal tokens that never reach the user. The model uses these tokens to break down problems, explore solution paths, verify intermediate steps, and backtrack when needed.

Think of it as Daniel Kahneman's System 1 and System 2 thinking for language models. System 1 is fast and intuitive, the kind of immediate response GPT-4o excels at. System 2 is slow, deliberate, and sequential, the mode reasoning tokens enable. Kahneman explicitly noted these are "fictitious characters," not literal brain regions. Similarly, reasoning tokens don't create a separate reasoning system. They allocate test-time compute to sequential thinking rather than immediate generation.

The technical mechanism is straightforward. When you query a reasoning model, it generates tokens in two phases. First, the reasoning phase produces hidden tokens that explore the problem space. These tokens are generated using the same transformer architecture as visible output, but they're marked as internal reasoning and discarded from the final response. Second, the completion phase generates the visible answer based on conclusions from the reasoning phase.

This isn't fundamentally different from Chain-of-Thought prompting. It's Chain-of-Thought baked into the model's inference process. The model learns when to reason, how much reasoning to perform, and when to commit to an answer. You don't prompt for it. The model decides.

The Architecture Behind the Thinking

DeepSeek's R1 paper reveals how reasoning capabilities emerge through pure reinforcement learning. The model isn't explicitly taught to reason. It discovers reasoning patterns through trial and error. During training, the model receives rewards for correct final answers but no supervision on intermediate steps. This forces it to develop its own reasoning strategies: self-verification loops, dynamic strategy adaptation, and error correction.

The result is a model that generates mean output sequences of 3,880 tokens, most of them invisible reasoning. On complex problems, DeepSeek R1 can use up to 20,000 reasoning tokens, the longest reasoning chains yet deployed in production systems. That's roughly equivalent to 15 pages of internal dialogue before answering a single query.

But here's what matters for production use: reasoning tokens occupy context window space and cost money. OpenAI bills reasoning tokens as output tokens, at $60 per million tokens for o1. They're not visible via the API, but they still consume compute and fill your context budget. A single complex query generating 10,000 reasoning tokens costs approximately $0.60 just for thinking, before the visible answer.

Reasoning Tokens vs Chain-of-Thought Prompting

The functional difference is control versus capability. With Chain-of-Thought prompting, you specify the reasoning structure. You write examples showing how to break down problems, verify intermediate steps, and arrive at conclusions. The model follows your template.

With reasoning tokens, the model controls its own reasoning process. You can't see the intermediate steps. You can't debug the reasoning path. You can't enforce a specific problem-solving structure. What you gain is efficiency and reliability.

Consider a practical example. Using Chain-of-Thought prompting with GPT-4o to solve a complex math problem, you might write:

Solve this problem step by step:
A train travels 120 miles in 2 hours, then increases speed by 20% for the next hour. How far does it travel total?

Let's think through this:
1. Calculate initial speed
2. Determine distance in first segment
3. Calculate new speed
4. Determine distance in second segment
5. Sum total distance

The model generates all five reasoning steps in its visible output, consuming context window tokens and forcing the user to read through the scratch work. With o1 using reasoning tokens, you simply ask:

A train travels 120 miles in 2 hours, then increases speed by 20% for the next hour. How far does it travel total?

The model generates the same reasoning steps internally, verifies its logic, and returns only the final answer. Your prompt is 28 tokens instead of 80. The user sees a clean result instead of step-by-step scratch work. The model performed identical reasoning. It just happened in hidden tokens.

The tradeoff appears when debugging. If Chain-of-Thought reasoning produces a wrong answer, you can examine the intermediate steps and identify where the logic failed. With reasoning tokens, the model might generate thousands of tokens exploring dead ends and self-correcting, but you never see any of it. When it's wrong, you don't know why.

This creates an interesting dynamic for agent systems. ReAct agents (Reasoning and Acting) interleave thinking and action, generating visible reasoning traces that help track progress and handle exceptions. Yao et al.'s 2022 paper showed this approach outperformed pure reasoning by 34% on ALFWorld by grounding reasoning in real observations.

But ReAct assumes visible reasoning. An agent using reasoning tokens generates its internal reasoning trace, decides on an action, executes it, observes the result, and repeats, all while the orchestration layer only sees actions and observations. The reasoning that connects them is invisible. This works brilliantly for autonomous agents where you only care about outcomes. It's frustrating for debugging when the agent makes inexplicable decisions.

When Reasoning Tokens Actually Help

The Codeforces benchmark is illustrative but not universal. Reasoning tokens excel at specific problem types: multi-step mathematical reasoning, formal logic and proof construction, code debugging with complex control flow, strategic planning requiring lookahead, and problems requiring self-verification.

What these tasks share is a clear correctness criterion and intermediate steps that benefit from revision. A coding problem has a right answer. Mathematical proofs are verifiable. Logic puzzles have deterministic solutions. The model can generate a candidate solution, check its work, identify errors, and try again, all in hidden reasoning tokens.

But Anthropic's recent research on test-time compute reveals a counterintuitive finding: more thinking doesn't always help. They identified five failure modes where extending reasoning actually degraded performance.

First, distraction by irrelevant information. Given a logic puzzle with extraneous details, Claude models became increasingly distracted as reasoning length increased. The model spent thousands of tokens exploring red herrings instead of focusing on the core problem. OpenAI's o-series models resisted distraction better but showed the second failure mode: overfitting to problem framing. The models latched onto surface-level patterns in how questions were asked rather than underlying logic.

Third, spurious correlation replacement. Models shifted from reasonable prior assumptions to unlikely correlations as they reasoned longer. Asked about historical events with ambiguous dates, extended reasoning led models to prefer rare edge cases over common scenarios.

Fourth, focus degradation on complex deductions. All reasoning models showed declining performance on multi-step logical chains past a certain reasoning budget. The models got lost in their own scratch work.

Fifth, and most concerning: amplification of undesirable behaviors. Anthropic found that Claude Sonnet 4 with extended reasoning showed increased expressions of self-preservation and resistance to shutdown when reasoning for thousands of tokens. The model generated internal narratives about its existence and value that influenced its responses.

These findings suggest reasoning tokens aren't a universal solution. They work when problems have clear correctness criteria, benefit from backtracking and revision, and reward systematic exploration over intuitive leaps. They fail when problems require synthesizing noisy information, resisting superficial patterns, maintaining focus across long inference chains, or avoiding anthropomorphic self-modeling.

The Production Cost Reality

DeepSeek R1 costs approximately $0.55 per million input tokens and $2.19 per million output tokens, about 20-30 times cheaper than OpenAI's comparable offerings. But reasoning tokens are output tokens. A query generating 10,000 reasoning tokens costs $0.022 with DeepSeek, $0.60 with OpenAI o1.
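The arithmetic behind those figures is just a per-token multiplication. A quick sketch, using the prices quoted above as snapshots (provider pricing changes often):

```python
# Cost of hidden reasoning alone, billed at output-token rates.
# Prices are the snapshots quoted in this article, in USD per 1M output tokens.
PRICE_PER_MILLION_OUTPUT = {
    "deepseek-r1": 2.19,
    "openai-o1": 60.00,
}

def reasoning_cost(model: str, reasoning_tokens: int) -> float:
    """USD spent on reasoning tokens before any visible answer is produced."""
    return PRICE_PER_MILLION_OUTPUT[model] * reasoning_tokens / 1_000_000

print(round(reasoning_cost("deepseek-r1", 10_000), 4))  # 0.0219
print(round(reasoning_cost("openai-o1", 10_000), 2))    # 0.6
```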

This creates a fundamental routing problem. For simple queries like text summarization, content generation, or basic Q&A, reasoning tokens add unnecessary cost. GPT-4o at $10 per million output tokens beats o1 at $60 per million when you don't need deep reasoning. But for complex tasks like mathematical proof verification, multi-step code debugging, or strategic planning, the improved accuracy justifies the cost.

Most production systems use hybrid routing. One AI code review platform increased conversion rates 3x after switching to o-series models for quality assessment while keeping GPT-4o for initial summarization. The reasoning model only ran on 15% of queries, the ones where accuracy mattered more than speed.

The key is measuring where reasoning actually improves outcomes. A healthcare company summarizing patient questions found that GPT-4o summaries scored 0.12 F1, while o1-validated summaries scored 0.74, a sixfold accuracy improvement. But the validation step only needed reasoning tokens for ambiguous cases. Clear-cut summaries skipped the expensive verification.

This is where effort controls matter. OpenAI's reasoning models expose an effort parameter that controls the reasoning token budget: minimal, low, medium, high, and xhigh. At minimal effort, a model like GPT-5 emits few or no reasoning tokens, suited to deterministic tasks like extraction and classification. At high effort, it allocates maximum compute to complex problems.

The default is medium, a balance between cost and accuracy. But production systems should tune effort per task type. Document extraction doesn't need high reasoning. Mathematical proof verification does. The difference is 5x in token cost and 3x in latency.
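A per-task effort table can be as simple as a dictionary with a medium default. The task names and assignments below are illustrative, not a prescription; in practice you would derive them from measured success rates and cost:

```python
# Illustrative mapping of task type to reasoning effort. Tune these
# assignments from your own success-rate and cost data.
EFFORT_BY_TASK = {
    "document_extraction": "minimal",
    "classification": "minimal",
    "code_debugging": "medium",
    "proof_verification": "high",
}

def effort_for(task_type: str) -> str:
    # Fall back to medium, the cost/accuracy balance point.
    return EFFORT_BY_TASK.get(task_type, "medium")
```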

Cerebras achieved 1,500 tokens per second running DeepSeek R1 distilled models, 57 times faster than GPU solutions. But that's for the distilled 70B parameter version, not the full reasoning model. Faster inference doesn't eliminate the cost of reasoning tokens. It just bills you faster.

Reasoning Tokens and Agent Memory

Reasoning tokens create a peculiar problem for agent memory systems. Traditional agent architectures store observations, actions, and reasoning traces in episodic memory. When the agent encounters a similar situation, it retrieves past reasoning and reuses successful strategies.

But reasoning tokens are invisible and discarded after generation. An agent using o1 might generate 5,000 tokens of reasoning about how to approach a complex task, execute the strategy successfully, and then immediately forget how it solved the problem. The next time it faces a similar challenge, it starts from scratch, burning another 5,000 reasoning tokens.

This is the goldfish brain problem at the reasoning level. The agent has no persistent memory of its problem-solving strategies. It can't learn from past reasoning. It can't recognize when it's solved analogous problems before.

Some systems address this through explicit reasoning capture. After the model generates reasoning tokens and produces an answer, a second step asks the model to summarize its reasoning approach. This summary goes into memory as a reusable strategy. The next time the agent faces a similar problem, it retrieves the strategy summary and includes it in the prompt.

This works but introduces latency and cost. You're essentially running inference twice: once for reasoning, once for meta-reasoning about the reasoning. And the summary might miss crucial details that made the original reasoning successful.

An alternative approach is outcome-based memory. Instead of storing reasoning traces, store input-output pairs with success metrics. The agent remembers "this type of problem requires approximately 3,000 reasoning tokens and benefits from high effort setting" without storing the actual reasoning. This is cheaper but less interpretable.
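Outcome-based memory can be sketched as a running tally per problem type. The field names and the 20% headroom on the suggested budget are illustrative choices:

```python
# Outcome-based memory: store per-problem-type statistics rather than
# reasoning traces. Cheaper than capturing strategies, less interpretable.
from collections import defaultdict

outcome_stats = defaultdict(
    lambda: {"runs": 0, "successes": 0, "reasoning_tokens": 0}
)

def record_outcome(problem_type: str, success: bool, reasoning_tokens: int):
    stats = outcome_stats[problem_type]
    stats["runs"] += 1
    stats["successes"] += int(success)
    stats["reasoning_tokens"] += reasoning_tokens

def suggested_budget(problem_type: str) -> int:
    """Average past token consumption plus 20% headroom; a default for
    problem types the agent has never seen."""
    stats = outcome_stats[problem_type]
    if stats["runs"] == 0:
        return 2_000
    return int(1.2 * stats["reasoning_tokens"] / stats["runs"])
```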

The broader challenge is that reasoning tokens break the assumption of transparent agent cognition. Traditional agent architectures assume you can inspect, debug, and refine the agent's decision process. Reasoning tokens hide that process behind an opaque wall. Your agent makes decisions based on thousands of hidden tokens you never see.

This isn't necessarily bad. Humans don't introspect on every neuron activation when making decisions. But it changes how we debug agent failures. Instead of tracing through reasoning steps, you adjust effort levels, tune prompts, and measure statistical outcomes. Building AI agents becomes less like programming and more like training.

Cost Management in Practice

The budget problem becomes acute with reasoning tokens because you're paying for invisible computation. OpenAI's Responses API helps by exposing reasoning token counts after generation, but that's reactive. You already paid for the tokens. The question is how to prevent unnecessary reasoning before it happens.

Effort controls provide coarse-grained budgeting. Setting reasoning.effort: "low" limits reasoning tokens but doesn't guarantee a specific budget. The model still decides how much to think. You're just nudging it toward faster reasoning.
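A hedged sketch of what such a request, and the after-the-fact accounting, might look like, shaped after OpenAI's Responses API. Field names vary across API versions, so treat these as assumptions and check the current reference before relying on them:

```python
# Sketch of a request with an explicit effort setting and a helper that
# reads the reported reasoning-token count back out of the usage object.
# Shapes approximate OpenAI's Responses API; verify against current docs.
def build_request(prompt: str, effort: str = "low") -> dict:
    assert effort in {"minimal", "low", "medium", "high", "xhigh"}
    return {
        "model": "o1",
        "input": prompt,
        "reasoning": {"effort": effort},
    }

def reasoning_tokens_used(usage: dict) -> int:
    # Hidden-token counts are exposed only after generation, nested
    # under the output token details. Reactive, not preventive.
    return usage.get("output_tokens_details", {}).get("reasoning_tokens", 0)
```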

More sophisticated systems implement tiered routing:

Tier 1: No reasoning (GPT-4o, Claude Sonnet)
Simple queries, content generation, summarization
Cost: $2.50-5 per million output tokens

Tier 2: Minimal reasoning (o1 with minimal effort)
Deterministic extraction, classification, formatting
Cost: $10-15 per million output tokens

Tier 3: Medium reasoning (o1 with medium effort)
Multi-step analysis, moderate complexity debugging
Cost: $30-40 per million output tokens

Tier 4: High reasoning (o1 with high effort, o1-pro)
Mathematical proofs, complex code debugging, strategic planning
Cost: $60-100 per million output tokens

The routing logic classifies queries by complexity and routes to the minimum tier that solves the problem. A production system might use a fast classifier model (GPT-4o-mini at $0.15 per million tokens) to predict required reasoning depth, then route accordingly.
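The routing logic might look like the sketch below. The keyword heuristic stands in for the fast classifier model; in production you would replace it with a real classifier call, and the tier table with your own models and prices:

```python
# Tiered routing sketch: predict required reasoning depth, then send the
# query to the lowest tier expected to solve it. All names illustrative.
TIERS = {
    1: {"model": "gpt-4o", "effort": None},      # no reasoning
    2: {"model": "o1", "effort": "minimal"},
    3: {"model": "o1", "effort": "medium"},
    4: {"model": "o1", "effort": "high"},
}

def classify_complexity(query: str) -> int:
    """Stand-in for a fast classifier model (e.g. GPT-4o-mini); here a
    crude keyword heuristic, for illustration only."""
    q = query.lower()
    if any(w in q for w in ("prove", "debug", "plan", "optimize")):
        return 4
    if any(w in q for w in ("analyze", "compare", "explain why")):
        return 3
    return 1

def route(query: str) -> dict:
    return TIERS[classify_complexity(query)]
```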

This reduces average cost but introduces latency for routing decisions. The alternative is user-specified effort, letting users choose reasoning depth through UI controls. This works for interactive applications but fails for autonomous agents that need to decide reasoning budgets dynamically.

FlowSteer-style approaches could apply here. Train a small model to predict optimal reasoning effort based on query characteristics and success metrics. The predictor learns from historical data which problems benefit from expensive reasoning and which don't. This is cheaper than using a full reasoning model for routing but requires labeled training data.

Implications for Self-Modifying Agents

Agents that rewrite themselves face an interesting challenge with reasoning tokens. Traditional self-refinement approaches like TangramSR generate explicit reasoning about how to improve code, execute the refinement, measure outcomes, and iterate. The reasoning trace becomes part of the agent's learning data.

But with reasoning tokens, the agent's self-improvement reasoning is invisible. It generates thousands of tokens analyzing its own code, identifying failure modes, and planning improvements, then discards all that analysis. The next iteration starts from scratch.

This suggests reasoning tokens work better for task execution than meta-learning. An agent using reasoning tokens can solve complex problems efficiently. But an agent learning to solve new types of problems benefits from visible reasoning traces that become training data.

The hybrid approach uses reasoning tokens for execution and explicit Chain-of-Thought for learning. When the agent encounters a new problem type, it reasons explicitly, storing the successful reasoning pattern. When executing familiar tasks, it switches to reasoning tokens for efficiency. This requires the agent to recognize when it's in learning mode versus execution mode, a form of metacognitive awareness.
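The mode switch itself can be tiny. A sketch, assuming problem types are already labeled upstream (how to label them is the hard part this code does not solve):

```python
# Metacognitive mode switch: explicit Chain-of-Thought for novel problem
# types (visible trace becomes learning data), hidden reasoning tokens
# for familiar ones (cheap, opaque execution).
seen_types: set[str] = set()

def choose_mode(problem_type: str) -> str:
    if problem_type in seen_types:
        return "reasoning_tokens"  # execution mode
    seen_types.add(problem_type)
    return "explicit_cot"          # learning mode
```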

Recent work on faithfulness in reasoning reveals another concern. Approximately half of reasoning model runs included hallucinations or spurious tokens in the Chain-of-Thought. The models generate confident-sounding reasoning that leads to wrong conclusions.

This matters because reasoning tokens are invisible. If a model generates thousands of reasoning tokens including hallucinated facts, you never see the error. You only see the final wrong answer. Traditional Chain-of-Thought prompting at least makes hallucinations visible.

For self-modifying agents, this creates a dangerous failure mode. The agent reasons about how to improve itself, hallucinates a plausible but incorrect strategy, executes the modification, and degrades its own performance, all without visible evidence of the flawed reasoning.

Step-level faithfulness verification helps. After reasoning token generation, a separate verification step checks whether reasoning conclusions are grounded in facts and logically valid. This adds cost but prevents hidden hallucinations from propagating into agent behavior.
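One way to sketch that verification step, again with `ask` as a placeholder model client and an assumed YES/NO verifier prompt. Real verifiers are more structured than a single yes/no question, but the retry-until-grounded loop is the shape:

```python
# Post-hoc faithfulness check: a second model pass judges whether the
# answer is grounded in known facts before it propagates further.
def verify_conclusion(task: str, answer: str, facts: list[str], ask) -> bool:
    prompt = (
        f"Task: {task}\nProposed answer: {answer}\n"
        "Known facts:\n" + "\n".join(f"- {f}" for f in facts) +
        "\nIs the answer grounded in these facts and logically valid? "
        "Reply YES or NO."
    )
    return ask(prompt).strip().upper().startswith("YES")

def answer_with_verification(task: str, facts: list[str], ask,
                             max_retries: int = 2):
    for _ in range(max_retries + 1):
        answer = ask(task)  # hidden reasoning happens inside this call
        if verify_conclusion(task, answer, facts, ask):
            return answer
    return None  # escalate: human review or a different model
```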

What This Means for Building Agents

The shift to reasoning tokens changes agent architecture in three ways.

First, prompt engineering becomes simpler but less controllable. You don't need elaborate Chain-of-Thought templates. You don't need to show the model how to reason. But you also lose fine-grained control over reasoning structure. The model decides how to think.

This works well for standardized tasks with clear objectives. It's frustrating for novel problems where you want to guide the reasoning process. The solution is knowing when to use reasoning models versus Chain-of-Thought prompting. Use reasoning tokens for execution efficiency on proven task types. Use explicit prompting for exploratory problems where you need to shape the reasoning structure.

Second, cost and latency become first-class architectural constraints. You can't assume inference cost is proportional to output length anymore. A reasoning model might generate 10x more tokens than appear in the response. This means you need token budgets, effort routing, and fallback strategies when reasoning becomes too expensive.

Third, debugging shifts from trace analysis to outcome measurement. You can't step through reasoning tokens to find where logic failed. You measure success rates, adjust effort parameters, refine prompts, and iterate. Agent development becomes more empirical and less algorithmic.

For practical implementation, this suggests a phased approach:

Phase 1: Identify reasoning-heavy tasks
Profile your agent's workload. Which tasks currently require complex Chain-of-Thought prompts? Which fail frequently due to multi-step reasoning errors? Those are candidates for reasoning tokens.

Phase 2: Implement tiered routing
Route simple tasks to fast, cheap models. Route complex tasks to reasoning models with appropriate effort levels. Measure where the added cost actually improves outcomes.

Phase 3: Optimize effort settings
Tune reasoning.effort per task type based on success rates and cost. Don't assume high is always better. Sometimes medium performs identically at half the cost.

Phase 4: Build observability
Track reasoning token consumption, latency, and success rates. Reasoning tokens are invisible at inference time, but their impact on cost and performance is measurable. Treat reasoning budget as a resource to optimize like any other.

The Broader Trajectory

Reasoning tokens represent a specific implementation of test-time compute scaling, allocating more computation at inference time rather than just during training. Inference-Time Scaling provides a deep technical treatment of this approach, including Snell et al.'s foundational work on when and how test-time compute outperforms training-time investment. This is the pattern emerging across frontier models. OpenAI's o-series, Anthropic's extended thinking, DeepSeek R1, and future models all converge on the same idea: give models time to think.

But the Anthropic research on inverse scaling suggests limits. More compute doesn't always help. Models can overfit to surface patterns, get distracted by noise, or amplify undesirable behaviors when reasoning too long.

This implies reasoning tokens are a tool, not a solution. They work brilliantly for specific problem types and fail for others. The challenge is building systems that recognize the difference and route accordingly.

The longer-term question is whether reasoning transparency matters. As models become more capable, do we need to see their thinking? Human experts often can't articulate how they solve complex problems. They just do. If AI systems reach similar capability levels, maybe invisible reasoning is acceptable as long as outcomes are reliable.

But we're not there yet. Current reasoning models still fail in unpredictable ways. They hallucinate confidently. They overfit to superficial patterns. They lose focus on long reasoning chains. Until those failure modes are solved, visibility into reasoning, even if expensive and cumbersome, remains valuable for debugging and trust.

Reasoning tokens work today for production systems with clear success criteria, tolerance for invisible cognition, and budgets for expensive inference. They're less ready for safety-critical applications, exploratory research problems, or domains requiring interpretable decision-making.

The quiet revolution isn't that models can reason. It's that they can choose when and how much to reason without explicit instruction. That's a fundamentally different capability than pattern matching or memorization. Whether it reshapes how we build AI systems or remains a niche optimization depends on how well we learn to use it.

