The Inference Budget Just Got Interesting: Why Test-Time Compute Is Rewriting Scaling Laws
OpenAI's o1 made headlines for "thinking harder" during inference. But the real story isn't that a model can spend more tokens on reasoning: it's that we've been fundamentally underinvesting in the wrong phase of the AI lifecycle. A cluster of recent papers reveals something uncomfortable: the scaling laws that defined the last five years of AI development don't translate to inference time. Throwing more compute at a trained model doesn't follow the same predictable curves we see during pre-training. It gets weird, and it gets expensive in ways nobody anticipated.
Time Series Foundation Models break scaling laws 78% of the time under standard sampling, according to research from Hua et al. The problem isn't the models. It's that standard inference techniques produce degenerate solutions. When you sample outputs repeatedly without controlling for diversity, most models converge to the same answer quickly, wasting compute on redundant paths. This isn't a time series problem. It's an inference problem that shows up everywhere once you start looking.
The Pre-Training Playbook Doesn't Work Here
Pre-training compute scales predictably. Scale parameters and training data together and you get measurable improvements that follow power laws. Loss drops. Benchmarks improve. CFOs can model ROI. This predictability drove the entire foundation model boom.
Inference-time compute doesn't behave the same way. Halder and Pehlevan built an analytically tractable model of LLM-as-a-Judge systems and found that performance gains plateau rapidly unless you redesign the sampling strategy itself. Their model shows that standard best-of-N sampling hits diminishing returns after N=8 in most scenarios. Keep sampling past that threshold and you're burning cycles on near-identical outputs.
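To make the diminishing-returns point concrete, here is a minimal best-of-N sketch in Python. The `generate` and `judge_score` callables are hypothetical stand-ins for a sampling-enabled model call and an LLM-as-a-Judge scorer, not code from Halder and Pehlevan.

```python
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],            # hypothetical: one sampled completion
              judge_score: Callable[[str, str], float],  # hypothetical: judge score for (prompt, answer)
              n: int = 8) -> str:
    """Sample n candidates and return the one the judge scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    # Past a small n, most new samples repeat earlier answers, so the
    # distinct-answer count is where the extra compute stops paying off.
    print(f"{n} samples, {len(set(candidates))} distinct answers")
    return max(candidates, key=lambda c: judge_score(prompt, c))
```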
The root cause: most foundation models weren't optimized for inference-time exploration. They were trained to maximize likelihood on static datasets, not to generate diverse, high-quality candidates under computational budgets. The result is models that confidently converge to local optima when you need them to explore the solution space.
This creates a resource allocation paradox. Companies spent millions on pre-training compute to build models that can't effectively use inference compute. The fix isn't more training data. It's rethinking how models search during inference.

Diversity Isn't a Nice-to-Have, It's a Compute Strategy
Hua et al. tested diversified sampling on Time Series Foundation Models and found that controlled diversity increases performance by 23% compared to standard sampling at the same compute budget. The key word is "controlled." Random diversity doesn't help. You need structured exploration that forces the model to consider genuinely different solution paths, not minor variations on the same answer.
Their approach uses temperature-controlled sampling combined with solution-space clustering to ensure each candidate explores a distinct region of possible outputs. When you plot their results, the scaling curve becomes predictable again, but only with diversity constraints in place.
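A minimal sketch of that idea, not Hua et al.'s implementation: sample at a higher temperature, embed the candidates, cluster them, and keep one representative per cluster. The `sample` and `embed` callables are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from typing import Callable, List

def diverse_candidates(prompt: str,
                       sample: Callable[[str, float], str],  # hypothetical: sampler(prompt, temperature)
                       embed: Callable[[str], np.ndarray],   # hypothetical: text embedding
                       n_samples: int = 32,
                       n_keep: int = 8,
                       temperature: float = 1.0) -> List[str]:
    candidates = [sample(prompt, temperature) for _ in range(n_samples)]
    X = np.stack([embed(c) for c in candidates])
    labels = KMeans(n_clusters=n_keep, n_init=10).fit_predict(X)
    keep = []
    for k in range(n_keep):
        members = np.where(labels == k)[0]
        if len(members) == 0:
            continue
        # Keep the member closest to its cluster centroid, so each kept
        # candidate represents a distinct region of the solution space.
        centroid = X[members].mean(axis=0)
        keep.append(candidates[members[np.argmin(np.linalg.norm(X[members] - centroid, axis=1))]])
    return keep
```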
Misaki and Akiba's UnMaskFork takes a different approach for masked diffusion models. Instead of sampling multiple outputs in parallel, they branch the generation process at critical decision points, creating a tree of possibilities. Each branch explores a deterministic path, which eliminates the redundancy problem entirely. Their method achieves comparable performance to best-of-N sampling at 40% of the computational cost.
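A schematic version of branching search, kept deliberately generic rather than reproducing UnMaskFork: branch only at decision points, extend each branch deterministically, and score the leaves. All of the callables here are hypothetical.

```python
from typing import Callable, List

def branch_search(state: str,
                  top_k_actions: Callable[[str], List[str]],  # hypothetical: distinct next steps
                  extend: Callable[[str, str], str],          # hypothetical: deterministic rollout of one step
                  score: Callable[[str], float],              # hypothetical: quality of a finished branch
                  depth: int = 2,
                  k: int = 3) -> str:
    """Branch at decision points instead of sampling n independent outputs."""
    frontier = [state]
    for _ in range(depth):
        # Each branch takes a different action, so no two branches
        # redundantly retrace the same path.
        frontier = [extend(s, a) for s in frontier for a in top_k_actions(s)[:k]]
    return max(frontier, key=score)
```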
The pattern across architectures: inference scaling works when you force the model to explore, not when you let it repeatedly confirm its first instinct.
Where This Gets Expensive
Bai et al.'s Prism system for discrete diffusion language models reveals the cost structure nobody wants to talk about. Their hierarchical search with self-verification achieves state-of-the-art results on reasoning benchmarks, but requires 3-5x more inference compute than standard sampling. The compute doesn't scale linearly with problem difficulty. It scales with solution space complexity.
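In outline, this family of methods is a generate-verify-refine loop. The sketch below is schematic, with hypothetical `propose`, `verify`, and `refine` calls, and is not the Prism system itself.

```python
from typing import Callable, Optional

def search_with_verification(problem: str,
                             propose: Callable[[str], str],       # hypothetical: draft a solution
                             verify: Callable[[str, str], bool],  # hypothetical: self-verification check
                             refine: Callable[[str, str], str],   # hypothetical: revise a failed draft
                             width: int = 4,
                             max_rounds: int = 4) -> Optional[str]:
    candidates = [propose(problem) for _ in range(width)]
    for _ in range(max_rounds):
        for c in candidates:
            if verify(problem, c):      # return the first verified solution
                return c
        # Nothing verified: spend another round of compute refining the pool.
        # This is where the 3-5x cost multiplier comes from on hard problems.
        candidates = [refine(problem, c) for c in candidates]
    return None                          # budget exhausted without a verified answer
```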
For simple problems, standard sampling is cheaper and faster. For complex reasoning tasks where the solution space is large and poorly constrained, test-time scaling via search becomes essential. The question isn't whether to use inference compute at all; it's which problems justify the cost.
Zeng et al.'s ARTIS system makes this explicit for agentic settings. They built a risk-aware test-time scaling framework that simulates potential action sequences before execution. In their experiments on agent benchmarks, ARTIS improves success rates by 31% on high-risk tasks, but uses 4.2x more inference compute than baseline agents. The system learns to allocate compute based on estimated risk and irreversibility of actions.
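The budgeting logic can be stated in a few lines. This is an illustrative policy, not the ARTIS algorithm: estimate how risky or irreversible a proposed action is, then scale the number of simulated rollouts before committing. `risk_score` is a hypothetical model call.

```python
from typing import Callable

def rollouts_for_action(action: str,
                        risk_score: Callable[[str], float],  # hypothetical: 0.0 (safe) .. 1.0 (irreversible)
                        base_rollouts: int = 2,
                        max_rollouts: int = 32) -> int:
    """Allocate more simulation to riskier, harder-to-undo actions."""
    r = max(0.0, min(1.0, risk_score(action)))
    # Safe, reversible actions get the minimum budget; high-risk actions
    # get simulated up to max_rollouts times before execution.
    return round(base_rollouts + r * (max_rollouts - base_rollouts))
```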
This creates a new optimization problem: inference compute budgeting. Unlike pre-training, where you can run training longer to improve all downstream tasks, inference compute must be allocated per-query based on task characteristics. Get it wrong and you either waste money on easy problems or fail on hard ones. The Budget Problem: Why AI Agents Are Learning to Be Cheap explores this tension in agent systems specifically.

The Search Space Problem Nobody Solved
Kong et al.'s work on latent thought vectors for math reasoning exposes a fundamental limitation. They show that current models can't reliably estimate the difficulty of a problem before attempting it. Their difficulty-aware policy optimization improves compute allocation, but only achieves 67% accuracy in predicting which problems need extended reasoning.
This isn't a model size problem. Larger models are slightly better at self-assessment, but the improvement plateaus quickly. The issue is that problem difficulty is often emergent. You can't know if a reasoning path will work until you've explored it. This makes upfront compute budgeting inherently approximate.
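A difficulty-gated allocation policy is simple to write down; the hard part is the predictor. A minimal sketch, with hypothetical `predict_difficulty`, `answer_fast`, and `answer_extended` calls:

```python
from typing import Callable

def answer_with_gate(problem: str,
                     predict_difficulty: Callable[[str], float],  # hypothetical: 0.0 easy .. 1.0 hard
                     answer_fast: Callable[[str], str],           # hypothetical: standard decoding
                     answer_extended: Callable[[str], str],       # hypothetical: long-reasoning path
                     threshold: float = 0.5) -> str:
    # At ~67% gate accuracy, roughly a third of queries get misrouted:
    # easy problems overspend, or hard problems take the cheap path and fail.
    if predict_difficulty(problem) >= threshold:
        return answer_extended(problem)  # spend the extended-reasoning budget
    return answer_fast(problem)          # cheap path for predicted-easy problems
```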
Tomlinson et al. analyzed chain-of-thought token complexity using bounded accuracy posterior optimization (BAPO) and found that reasoning token requirements grow super-linearly with problem complexity for certain task classes. For mathematical proofs and multi-step logical reasoning, the token budget needed for reliable solutions grows faster than the complexity of the problem statement itself. From Answer to Insight: Why Reasoning Tokens Are a Quiet Revolution in AI examines how this changes what reasoning even means for language models.
Their lower bounds suggest that some problem classes may be fundamentally inefficient for sequential reasoning architectures. Inference-time compute helps, but it doesn't change the asymptotic complexity class.
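To see what super-linear growth does to a budget, here is a purely illustrative calculation. The exponent is invented for the example; Tomlinson et al. establish formal bounds, not this formula.

```python
def token_budget(problem_size: int, base: int = 200, alpha: float = 1.5) -> int:
    """Hypothetical rule: required reasoning tokens grow like problem_size ** alpha."""
    return int(base * problem_size ** alpha)

for n in (2, 4, 8, 16):
    print(n, token_budget(n))
# With alpha = 1.5, doubling the problem size multiplies the token
# budget by about 2.8x, so budgets outrun problem statements quickly.
```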
What This Actually Changes
Inference-time compute scaling is real, but it's not the smooth power law that defined pre-training. It's spiky, task-dependent, and expensive in ways that require new cost models. The o1 approach of spending more tokens on reasoning works, but only when the model is designed to explore diverse solution paths rather than confidently repeating its first guess.
The immediate implications for system design: inference budgets need to be dynamic and risk-aware. Spending 10x compute on a high-stakes reasoning task makes sense. Spending 10x on simple retrieval doesn't. Current serving infrastructure treats all queries equally, which means we're systematically over-serving simple requests and under-serving complex ones.
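At the serving layer, that implies a per-query budget policy rather than a flat limit. The tiers and numbers below are illustrative, and `classify` is a hypothetical request classifier.

```python
from typing import Callable

# Illustrative per-tier caps on reasoning tokens (not real limits or pricing).
BUDGETS = {
    "retrieval": 0,                   # simple lookups get no extended reasoning
    "standard": 1_000,
    "high_stakes_reasoning": 10_000,  # the 10x-compute tier
}

def budget_for(query: str, classify: Callable[[str], str]) -> int:
    """Assign a reasoning-token budget based on the query's tier."""
    return BUDGETS.get(classify(query), BUDGETS["standard"])  # unknown tiers fall back to standard
```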
For model developers, the priority shifts to training models that can explore effectively during inference, not just predict accurately during training. This likely requires new training objectives that explicitly reward diverse solution generation, not just maximum likelihood.
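One way such an objective could look, purely as an illustration and not a proposal from any of the cited papers: reward the likelihood of sampled candidates while penalizing how similar they are to one another.

```python
import numpy as np

def diversity_aware_objective(log_likelihoods: np.ndarray,  # shape (n,): per-candidate log-probability
                              pairwise_sim: np.ndarray,     # shape (n, n): similarity in [0, 1]
                              lam: float = 0.1) -> float:
    """Mean likelihood minus a penalty on how much the candidates overlap."""
    n = len(log_likelihoods)
    off_diagonal = pairwise_sim[~np.eye(n, dtype=bool)]
    # Maximizing this rewards candidates that are both plausible and
    # genuinely different from one another, not n copies of one answer.
    return float(log_likelihoods.mean() - lam * off_diagonal.mean())
```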
The uncomfortable truth: scaling laws still work, but they're more complicated now. Pre-training compute improves baseline capability. Inference compute improves solution quality on a per-query basis. The two don't substitute for each other. You need both, and the optimal allocation between them depends on your task distribution.
For deployment teams, this means inference cost modeling just got harder. You can't amortize inference compute the way you amortize training compute. Every query has its own budget, and you need systems that can estimate difficulty, allocate compute dynamically, and fail gracefully when the budget runs out.
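The contrast is easy to see with back-of-the-envelope numbers (all hypothetical): training cost spreads over every future query, while search-based inference cost is paid again on every query it touches.

```python
TRAIN_COST = 10_000_000             # one-time training spend, dollars (hypothetical)
LIFETIME_QUERIES = 10_000_000_000   # queries served over the model's life (hypothetical)
BASE_INFER_COST = 0.002             # dollars per query with standard decoding (hypothetical)
SEARCH_MULTIPLIER = 4               # extra compute when search-based inference is on

amortized_training = TRAIN_COST / LIFETIME_QUERIES           # shrinks as query volume grows
per_query_with_search = BASE_INFER_COST * SEARCH_MULTIPLIER  # paid on every searched query

print(f"training per query:  ${amortized_training:.4f}")
print(f"inference per query: ${BASE_INFER_COST:.4f} -> ${per_query_with_search:.4f} with search")
# Training amortizes to a fraction of a cent here; the search multiplier
# does not amortize at all, which is why it has to be gated per request.
```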
The research frontier is shifting toward principled search during inference rather than blind sampling. Tree search, hierarchical verification, and risk-aware branching all show promise. None of them are cheap, but they're predictably expensive, which is better than unpredictably wasteful.
Sources
Research Papers:
- Diversified Scaling Inference in Time Series Foundation Models — Ruijin Hua, Zichuan Liu, Kun Zhang et al. (2026)
- UnMaskFork: Test-Time Scaling for Masked Diffusion via Deterministic Action Branching — Kou Misaki, Takuya Akiba (2026)
- Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models — Jinbin Bai, Yixuan Li, Yuchen Zhu et al. (2026)
- Demystifying LLM-as-a-Judge: Analytically Tractable Model for Inference-Time Scaling — Indranil Halder, Cengiz Pehlevan (2025)
- ARTIS: Agentic Risk-Aware Test-Time Scaling via Iterative Simulation — Xingshan Zeng, Lingzhi Wang, Weiwen Liu et al. (2026)
- Inference-Time Rethinking with Latent Thought Vectors for Math Reasoning — Deqian Kong, Minglu Zhao, Aoyang Qin et al. (2026)
- Reasoning about Reasoning: BAPO Bounds on Chain-of-Thought Token Complexity in LLMs — Kiran Tomlinson, Tobias Schnabel, Adith Swaminathan et al. (2026)
Related Swarm Signal Coverage:
- The Budget Problem: Why AI Agents Are Learning to Be Cheap
- From Answer to Insight: Why Reasoning Tokens Are a Quiet Revolution in AI