
The Inference Budget Just Got Interesting: Why Test-Time Compute Is Rewriting Scaling Laws

OpenAI's o1 made headlines for "thinking harder" during inference. But the real story isn't that a model can spend more tokens on reasoning: it's that the industry has been systematically underinvesting in inference, the phase of the AI lifecycle where those tokens are actually spent. A cluster of recent papers reveals something uncomfortable: the scaling laws that defined the last five years of AI development don't carry over to inference time. Throwing more compute at a trained model doesn't follow the predictable curves we see during pre-training. It gets weird, and it gets expensive in ways nobody anticipated.

Time Series Foundation Models frequently fail to follow scaling laws under standard sampling, according to research from Hua et al. The problem isn't the models. It's that standard inference techniques produce degenerate solutions. When you sample outputs repeatedly without controlling for diversity, most models converge to the same answer quickly, wasting compute on redundant paths. This isn't a time series problem. It's an inference problem that shows up everywhere once you start looking.

The Pre-Training Playbook Doesn't Work Here

Pre-training compute scales predictably. Double your parameters, quadruple your data, and you get measurable improvements that follow power laws. Loss drops. Benchmarks improve. CFOs can model ROI. This predictability drove the entire foundation model boom.

Inference-time compute doesn't behave the same way. Halder and Pehlevan built an analytically tractable model of LLM-as-a-Judge systems and found that performance gains plateau rapidly unless you redesign the sampling strategy itself. Their model shows that when the reward signal is substantially misspecified relative to the true objective, best-of-N sampling hits a finite optimum beyond which additional samples actually increase generalization error. Even with well-aligned rewards, gains follow an inverse-quadratic decay of Theta(1/k^2), meaning each additional sample yields diminishing returns.
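The diminishing-returns effect is easy to see in a toy simulation. The sketch below is not Halder and Pehlevan's analytical model: it only illustrates the plateau under a noisy proxy reward (their result that a badly misspecified reward eventually makes extra samples *hurt* requires their full setup). Each candidate gets a latent true quality, the judge sees quality plus noise, and best-of-N selects on the noisy score.

```python
import random

def best_of_n(n, reward_noise, trials=2000, seed=0):
    """Average true quality of the best-of-n pick under a noisy proxy reward.

    Each candidate has latent true quality ~ N(0, 1); the judge sees
    proxy = quality + N(0, reward_noise). We select the proxy argmax and
    report its *true* quality, averaged over many trials.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        qualities = [rng.gauss(0, 1) for _ in range(n)]
        best = max(qualities, key=lambda q: q + rng.gauss(0, reward_noise))
        total += best
    return total / trials

# A well-aligned judge (noise 0.1) keeps improving with n; a badly
# misspecified one (noise 3.0) flattens out much sooner.
for n in (1, 4, 16, 64):
    print(n, round(best_of_n(n, 0.1), 3), round(best_of_n(n, 3.0), 3))
```

Running this shows the gap: with an aligned judge the selected quality keeps climbing as n grows, while with a noisy judge most of the extra samples buy almost nothing.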

The root cause: most foundation models weren't optimized for inference-time exploration. They were trained to maximize likelihood on static datasets, not to generate diverse, high-quality candidates under computational budgets. The result is models that confidently converge to local optima when you need them to explore the solution space.

This creates a resource allocation paradox. Companies spent millions on pre-training compute to build models that can't effectively use inference compute. The fix isn't more training data. It's rethinking how models search during inference.


Diversity Isn't a Nice-to-Have, It's a Compute Strategy

Hua et al. tested diversified sampling on Time Series Foundation Models and found that controlled diversity yields improvements of up to 50% over standard sampling at the same compute budget. The key word is "controlled." Random diversity doesn't help. You need structured exploration that forces the model to consider genuinely different solution paths, not minor variations on the same answer.

Their approach uses temperature-controlled sampling combined with solution-space clustering to ensure each candidate explores a distinct region of possible outputs. When you plot their results, the scaling curve becomes predictable again, but only with diversity constraints in place.
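The clustering step can be approximated with a greedy farthest-point heuristic: keep only candidates that are far apart in some solution-space embedding, so near-duplicates collapse into one representative. This is a generic stand-in, not Hua et al.'s actual pipeline, and the `embed` function is a hypothetical placeholder for whatever feature map you use.

```python
import math

def farthest_point_selection(candidates, embed, k):
    """Pick k candidates that are maximally spread out in solution space.

    Greedy farthest-point selection: start from an arbitrary candidate,
    then repeatedly add the one whose minimum distance to everything
    already chosen is largest. `embed` maps a candidate to a vector.
    """
    def dist(a, b):
        return math.dist(embed(a), embed(b))

    chosen = [candidates[0]]
    while len(chosen) < min(k, len(candidates)):
        best = max((c for c in candidates if c not in chosen),
                   key=lambda c: min(dist(c, s) for s in chosen))
        chosen.append(best)
    return chosen

# Toy solution space: candidates embedded as 2-D points. The three
# near-duplicates around the origin collapse to a single representative.
points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (10, 0)]
print(farthest_point_selection(points, lambda p: p, 3))
```

At equal budget, scoring three spread-out candidates explores more of the output space than scoring three near-identical ones, which is the mechanism behind the "controlled" part of controlled diversity.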

Misaki and Akiba's UnMaskFork takes a different approach for masked diffusion models. Instead of sampling multiple outputs in parallel, they formulate the unmasking trajectory as a search tree and use Monte Carlo Tree Search to optimize the generation path. Each branch explores a deterministic partial unmasking, which eliminates the redundancy problem of stochastic sampling. On coding benchmarks like LiveCodeBench and HumanEval+, UnMaskFork outperforms best-of-N baselines by 9 to 13 percentage points at equivalent compute budgets, while achieving cache hit rates above 55% that further improve efficiency.
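The core object UnMaskFork searches over is the order in which masked positions get resolved. The sketch below is deliberately much simpler than their MCTS: for a handful of positions the whole tree fits in memory, so plain enumeration over orders shows the structure. The `score_step` callback is a hypothetical stand-in for the model's per-step confidence.

```python
from itertools import permutations

def best_unmask_order(masked_positions, score_step):
    """Exhaustively search unmasking orders for a tiny example.

    UnMaskFork explores this tree with MCTS; here we just enumerate.
    Each branch is a deterministic partial unmasking, so no two branches
    duplicate work the way repeated stochastic samples do.
    score_step(pos, step) is a stand-in for model confidence.
    """
    def trajectory_score(order):
        return sum(score_step(pos, step) for step, pos in enumerate(order))

    return max(permutations(masked_positions), key=trajectory_score)

# Toy scorer: position 2 is best resolved early, position 0 late.
score = lambda pos, step: {0: step, 1: 1.0, 2: -step}[pos]
print(best_unmask_order([0, 1, 2], score))
```

The point of the tree formulation survives even in this toy: because every branch commits to a distinct deterministic prefix, compute spent on one branch is never redundant with another, which is exactly what stochastic best-of-N fails to guarantee.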

The pattern across architectures: inference scaling works when you force the model to explore, not when you let it repeatedly confirm its first instinct.

Where This Gets Interesting

Bai et al.'s Prism system for discrete diffusion language models reveals a more nuanced cost picture. Their hierarchical trajectory search with self-verification matches best-of-N performance while using over 4x fewer function evaluations. On GSM8K, Prism achieves 85.3% accuracy with roughly 1,000 denoising steps where best-of-16 needs over 4,000 to reach 87.5%. The efficiency comes from dynamic pruning: Prism kills unpromising trajectories early and reallocates compute to survivors.
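The pruning-and-reallocation idea can be sketched as generic successive halving. This is not Prism's actual algorithm, just the same economics: advance every surviving trajectory one step, score it with a (hypothetical) verifier, drop the bottom half, and stop when the evaluation budget runs out or one survivor remains.

```python
def successive_halving(trajectories, advance, score, budget):
    """Prune unpromising trajectories early, in the spirit of Prism.

    Each round advances all survivors by one step (one function
    evaluation each), ranks them with `score`, and keeps the top half,
    so compute concentrates on trajectories that keep looking good.
    """
    evals = 0
    while len(trajectories) > 1 and evals < budget:
        trajectories = [advance(t) for t in trajectories]
        evals += len(trajectories)
        trajectories.sort(key=score, reverse=True)
        trajectories = trajectories[: max(1, len(trajectories) // 2)]
    return trajectories[0], evals

# Toy: a "trajectory" is (running_sum, hidden_drift); advancing adds drift.
starts = [(0.0, d) for d in (-1.0, -0.5, 0.2, 1.5)]
best, used = successive_halving(
    starts,
    advance=lambda t: (t[0] + t[1], t[1]),
    score=lambda t: t[0],
    budget=100,
)
print(best, used)  # the drift-1.5 trajectory survives after 6 evaluations
```

Compare the cost with exhaustive search: running all four toy trajectories for three steps would cost 12 evaluations, while halving reaches the same winner in 6, which is the flavor of Prism's 4x reduction in function evaluations.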

The tradeoff isn't that search is always expensive. It's that the optimal strategy depends on the task. For simple problems, standard sampling remains fast and sufficient. For complex reasoning where the solution space is large, structured search like Prism becomes essential because it makes inference compute predictable rather than wasteful.

Zeng et al.'s ARTIS system makes this explicit for agentic settings. They built a risk-aware test-time scaling framework that simulates potential action sequences before committing to real-world execution. On multi-turn agent benchmarks like ACEBench, ARTIS dramatically improves reliability, boosting overall scores from around 15% to over 52% with Qwen3-8B. Crucially, it achieves this while using fewer tokens than sequential revision baselines, because the risk-aware simulator focuses compute on failure-prone actions rather than uniformly scaling all interactions.
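The non-uniform allocation is the key design choice, and it can be gated on a risk estimate. The sketch below is loosely inspired by ARTIS rather than taken from it; `simulate_risk` and `refine` are hypothetical callbacks standing in for the simulator and the revision step.

```python
def choose_action(candidates, simulate_risk, refine, risk_threshold=0.5):
    """Risk-gated compute allocation, loosely inspired by ARTIS.

    Estimate each candidate action's failure risk via simulation, and
    spend extra refinement compute only on actions that look
    failure-prone, instead of scaling every interaction uniformly.
    """
    budgeted = []
    extra_compute = 0
    for action in candidates:
        risk = simulate_risk(action)
        if risk > risk_threshold:
            action = refine(action)   # extra compute only where risky
            extra_compute += 1
            risk = simulate_risk(action)
        budgeted.append((risk, action))
    best = min(budgeted)[1]           # commit to the lowest-risk action
    return best, extra_compute

# Toy: risk is stored on the action; refinement halves it.
actions = [{"name": "a", "risk": 0.9}, {"name": "b", "risk": 0.2}]
best, extra = choose_action(
    actions,
    simulate_risk=lambda a: a["risk"],
    refine=lambda a: {**a, "risk": a["risk"] / 2},
)
print(best["name"], extra)
```

Only the risky candidate triggers refinement here, so total compute scales with how failure-prone the action set is, not with its size.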

This creates a new optimization problem: inference compute budgeting. Unlike pre-training, where you can run training longer to improve all downstream tasks, inference compute must be allocated per-query based on task characteristics. Get it wrong and you either waste money on easy problems or fail on hard ones. The Budget Problem: Why AI Agents Are Learning to Be Cheap explores this tension in agent systems specifically.


The Search Space Problem Nobody Solved

Kong et al.'s work on latent thought vectors for math reasoning takes a different angle on the search space problem. Rather than trying to predict difficulty upfront, they decouple reasoning into a continuous latent vector (what to reason about) and a decoder that generates the actual trace (how to reason). Their 0.2B parameter model with 30 iterative rethinking steps surpasses baselines with 10 to 15 times more parameters on GSM8K, achieving 31.5% accuracy versus a 3B model's 22.7%.

The implication is striking: you don't necessarily need to know how hard a problem is before allocating compute. You can instead build models that iteratively refine their reasoning strategy during inference. But the approach still highlights the core challenge. The issue is that problem difficulty is often emergent. You can't know if a reasoning path will work until you've explored it. This makes upfront compute budgeting inherently approximate.
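The rethinking loop can be caricatured as iterative refinement of a continuous vector against a verifier signal. This hill-climbing sketch is a stand-in, not Kong et al.'s update rule: perturb the latent, decode and score it (the `decode_score` callback is hypothetical), and keep the perturbation only if the score improves.

```python
import random

def iterative_rethink(z0, decode_score, steps=30, step_size=0.1, seed=0):
    """Iteratively refine a latent 'thought' vector at inference time.

    A hill-climbing stand-in for the rethinking loop: propose a random
    perturbation of the latent, score the decoded result, and accept
    the perturbation only when it improves on the best score so far.
    """
    rng = random.Random(seed)
    z, best = list(z0), decode_score(z0)
    for _ in range(steps):
        cand = [zi + rng.gauss(0, step_size) for zi in z]
        s = decode_score(cand)
        if s > best:
            z, best = cand, s
    return z, best

# Toy objective: the score peaks when the latent reaches (1, -2).
score = lambda z: -((z[0] - 1) ** 2 + (z[1] + 2) ** 2)
z, s = iterative_rethink([0.0, 0.0], score, steps=200, step_size=0.3)
print([round(v, 2) for v in z], round(s, 3))
```

Notice what the loop does not need: any upfront estimate of how hard the problem is. Compute is spent adaptively, step by step, which is exactly the property that sidesteps the difficulty-prediction problem.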

Tomlinson et al. analyzed chain-of-thought token complexity using the bounded attention prefix oracle (BAPO) model, an abstraction that quantifies the information flow required for LLMs to solve a task. They proved that canonical problems like binary majority, triplet matching, and graph reachability each require Omega(n) reasoning tokens where n is the input size, and complemented these with matching or near-matching upper bounds. Experiments with frontier reasoning models confirmed approximately linear token scaling and showed that models fail when constrained to smaller reasoning budgets. From Answer to Insight: Why Reasoning Tokens Are a Quiet Revolution in AI examines how this changes what reasoning even means for language models.

Their lower bounds establish fundamental floors on how many reasoning tokens different problem classes require as complexity grows. Inference-time compute helps, but these theoretical limits mean you can't compress reasoning below certain thresholds regardless of how clever your sampling strategy is.

What This Actually Changes

Inference-time compute scaling is real, but it's not the smooth power law that defined pre-training. It's spiky, task-dependent, and expensive in ways that require new cost models. The o1 approach of spending more tokens on reasoning works, but only when the model is designed to explore diverse solution paths rather than confidently repeating its first guess.

The immediate implications for system design: inference budgets need to be dynamic and risk-aware. Spending 10x compute on a high-stakes reasoning task makes sense. Spending 10x on simple retrieval doesn't. Current serving infrastructure treats all queries equally, which means we're systematically over-serving simple requests and under-serving complex ones.
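A dynamic budget can be as simple as a difficulty-conditioned policy. This is a deliberately minimal sketch of the idea, not a scheme from any of the papers: map an estimated difficulty in [0, 1] to a token budget that scales geometrically between a floor and a ceiling, so the 10x spend is reserved for queries that earn it.

```python
def allocate_budget(est_difficulty, base=256, ceiling=8192):
    """Map an estimated difficulty score in [0, 1] to a token budget.

    Easy queries get the base budget; hard ones scale up geometrically
    toward the ceiling. Clamping keeps bad difficulty estimates from
    producing out-of-range budgets.
    """
    d = min(max(est_difficulty, 0.0), 1.0)
    return int(base * (ceiling / base) ** d)

for d in (0.0, 0.5, 1.0):
    print(d, allocate_budget(d))
```

The hard part in practice is the difficulty estimator itself, which, as the search-space discussion above notes, can only ever be approximate.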

For model developers, the priority shifts to training models that can explore effectively during inference, not just predict accurately during training. This likely requires new training objectives that explicitly reward diverse solution generation, not just maximum likelihood.

The uncomfortable truth: scaling laws still work, but they're more complicated now. Pre-training compute improves baseline capability. Inference compute improves solution quality on a per-query basis. The two don't substitute for each other. You need both, and the optimal allocation between them depends on your task distribution.

For deployment teams, this means inference cost modeling just got harder. You can't amortize inference compute the way you amortize training compute. Every query has its own budget, and you need systems that can estimate difficulty, allocate compute dynamically, and fail gracefully when the budget runs out.

The research frontier is shifting toward principled search during inference rather than blind sampling. Tree search, hierarchical verification, and risk-aware branching all show promise. None of them are cheap, but they're predictably expensive, which is better than unpredictably wasteful.
