Flow Matching models just got measurably better at protein generation without retraining. The technique? Throwing more compute at inference rather than pre-training. While everyone fixates on OpenAI's o1 and chain-of-thought reasoning, a parallel universe of non-autoregressive models has been quietly adopting the same core insight: you can trade compute at inference time for better outputs. The difference is these models don't need to think step-by-step to benefit.

An October 2025 paper from researchers at McGill and Mila shows Flow Matching models, increasingly popular for scientific and vision tasks, can scale quality with inference compute just like their autoregressive cousins. But here's what matters: they do it without the serial bottleneck that makes LLM inference expensive. This isn't about teaching models to reason. It's about teaching them to search better.

The Autoregressive Trap Everyone Ignores

Inference-time compute scaling got famous through LLMs. You let the model generate multiple reasoning paths, score them, and pick the best one. Simple. Effective. Incredibly slow.

The problem: autoregressive models generate one token at a time. Even if you batch 10 candidate solutions, each pass is internally serial: every token waits for the previous one, so latency scales with output length no matter how you parallelize across candidates. This is why o1 costs several times more than GPT-4o per query and why you wait seconds for responses that GPT-3.5 would've returned instantly.

Masked diffusion and Flow Matching models don't have this constraint. They generate all tokens simultaneously through iterative refinement. When you want multiple candidates, you run parallel denoising trajectories. The wall-clock time doesn't scale linearly with the number of attempts.
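The contrast fits in a few lines of code. Here's a toy Euler integrator for a flow model; the velocity field is a hand-built stand-in, not any paper's trained model, but it shows the structural point: generating more candidates widens the batch while the denoising loop stays a fixed number of steps deep.

```python
import numpy as np

def euler_sample(velocity_fn, n_candidates, dim, steps=50, seed=0):
    """Integrate a velocity field from noise (t=0) toward data (t=1).
    All n_candidates trajectories advance together in one batched update
    per step: extra candidates widen the batch, not the loop depth."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_candidates, dim))  # every trajectory starts from noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)  # one parallel step for the whole batch
    return x

# Stand-in velocity field (an assumption, not a learned model): it pulls
# samples toward a fixed target, mimicking what a trained flow would do.
target = np.ones(4)
flow = lambda x, t: target - x

samples = euler_sample(flow, n_candidates=8, dim=4)
```

Asking for 8 candidates instead of 1 changes only the batch dimension; the loop still runs 50 iterations either way.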

The UnMaskFork paper from Sakana AI demonstrates this concretely: their masked diffusion model uses Monte Carlo Tree Search with deterministic action branching to explore the search space efficiently. On coding benchmarks like HumanEval+, UnMaskFork achieves 88.0% Pass@1 at a budget of 12,288 function evaluations. The key insight is that deterministic branching with node caching avoids the redundant computation that plagues stochastic sampling approaches. Still expensive, but fundamentally different physics.

What Flow Matching Actually Changes

Flow Matching has become the architecture of choice for scientific applications. Proteins, molecules, climate data: domains where you need continuous output spaces and can't tokenize your way out of the problem. These models learn to transform noise into structured data through learned vector fields.

The McGill team's contribution is showing these models can scale quality at inference using the same search-and-verify loop that works for LLMs. They test on protein backbone generation and image synthesis. The method is straightforward: generate multiple samples, score each using a learned verifier (sometimes the model's own likelihood), keep the best one.
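The loop itself is almost embarrassingly simple. This sketch uses toy stand-ins (the "model" emits noisy vectors, the "verifier" rewards proximity to a known reference; in the real setting the verifier would be something like model likelihood or a structure score), but the generate-score-select skeleton is the whole method:

```python
import numpy as np

def best_of_n(generate, verifier, n, seed=0):
    """Inference-time search: draw n candidates, score each with a
    verifier, keep the best. The generator is never retrained."""
    rng = np.random.default_rng(seed)
    candidates = [generate(rng) for _ in range(n)]
    scores = [verifier(c) for c in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

# Toy stand-ins for illustration only: noisy-vector "generator" and a
# proximity-based "verifier" playing the role of a learned scorer.
reference = np.zeros(3)
generate = lambda rng: rng.standard_normal(3)
verifier = lambda x: -float(np.linalg.norm(x - reference))

_, score_1 = best_of_n(generate, verifier, n=1)
_, score_8 = best_of_n(generate, verifier, n=8)
```

The selected score can only improve as n grows, which is exactly the quality-for-compute trade the paper measures.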

Results on protein backbone generation using FoldFlow2: the two-stage search method was the only approach to achieve an average TM-score above 0.9 at 8x compute, with designability (fraction of samples with scRMSD below 2.0 Angstroms) improving consistently with scaling. On ImageNet 256x256, FID and Inception Score both improve steadily when scaling from 1x to 8x compute budgets using DINO-based verification. These aren't marginal gains. They're the difference between research-quality and production-ready outputs.

The interesting detail everyone misses: they achieve this without changing the interpolation schedule. A concurrent paper by Kim et al. tried replacing Flow Matching's linear interpolant with a variance-preserving schedule to enable better scoring. It works but sacrifices the training efficiency that made Flow Matching attractive in the first place. The McGill approach keeps the simple linear schedule and just searches harder at inference.

Time Series Models Join The Party

Time series forecasting has its own inference-time scaling problem, and it's not about reasoning. It's about uncertainty. The Diversified Scaling Inference paper shows Time Series Foundation Models can generate more reliable predictions by exploring diverse forecasting trajectories at test time.

Their approach is cleaner than what's happening in language models. Instead of hoping diverse reasoning paths emerge from temperature sampling, they explicitly inject diversity through two categories of perturbation: task-agnostic strategies like prefix padding, Gaussian noise, and random offsets that work across any forecasting setup, and task-specific perturbations that exploit sensitivity, dependency, and reconstruction properties of the input data.
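A minimal version of the task-agnostic side might look like this. The specific perturbations (a random level offset plus Gaussian noise) and the persistence "foundation model" are illustrative assumptions, not the paper's actual recipe, but they show the pattern: inject diversity into the input, forecast each variant, aggregate.

```python
import numpy as np

def diversified_forecasts(context, forecast_fn, n=16, noise_scale=0.05, seed=0):
    """Diversify by explicitly perturbing the input context, not by
    cranking sampler temperature: each variant gets a random level
    offset plus Gaussian noise, is forecast, then the offset is undone."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n):
        offset = rng.uniform(-0.1, 0.1)            # random level offset
        noisy = context + offset + noise_scale * rng.standard_normal(len(context))
        preds.append(forecast_fn(noisy) - offset)  # undo the offset on the output
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.std(axis=0)   # point forecast + spread

# Toy "model" stand-in (an assumption): a persistence forecast that
# repeats the last observed value over a 4-step horizon.
forecast_fn = lambda ctx: np.full(4, ctx[-1])
context = np.sin(np.linspace(0, 3, 24))
mean_pred, spread = diversified_forecasts(context, forecast_fn)
```

The spread across perturbed trajectories doubles as a free uncertainty estimate, which is part of why this matters for forecasting.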

Results on standard forecasting benchmarks including ETTh1, ETTm1, Electricity, and Traffic datasets show MSE improvements of up to approximately 50% under diversified sampling compared to standard approaches. When combining sampling strategies with context length extensions, performance gains reach up to approximately 90% in the best cases. The predictions become reliable enough to use in production systems where wrong forecasts have real costs.

What makes this work interesting is the explicit rejection of the "more samples = better" assumption. They show you need diversity, not just quantity. The paper derives a critical sample threshold where diversified sampling begins to outperform standard sampling, and demonstrates that structured perturbation consistently beats naive random sampling by wide margins. This matters because it suggests inference-time compute scaling isn't one technique. It's a design space with actual optimization problems to solve.

The Verification Bottleneck Nobody Talks About

Here's the uncomfortable truth about all inference-time scaling: it only works if you can verify outputs cheaper than you can generate them. For math problems, you can check answers. For code, you can run tests. For reasoning chains, you can score logical consistency.
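For code, the asymmetry is easy to see concretely: candidates are expensive to generate but mechanical to check. A minimal filter (the candidate functions and tests here are made-up examples, not any benchmark's harness):

```python
def verify_by_tests(candidates, tests):
    """Return the first candidate passing every unit test, or None.
    Verification is cheap and mechanical; generation is where the
    model's compute went."""
    for fn in candidates:
        try:
            if all(t(fn) for t in tests):
                return fn
        except Exception:
            continue  # a crashing candidate simply fails verification
    return None

# Three "generated" candidates for absolute value: two buggy, one correct.
candidates = [
    lambda x: x,                      # wrong for negatives
    lambda x: -x,                     # wrong for positives
    lambda x: x if x >= 0 else -x,    # correct
]
tests = [lambda f: f(3) == 3, lambda f: f(-3) == 3, lambda f: f(0) == 0]
winner = verify_by_tests(candidates, tests)
```

Every domain where inference-time scaling works well has some analogue of this cheap, reliable checker. The domains below mostly don't.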

For proteins? For images? For time series forecasts? The verification problem gets much harder.

The McGill team uses the Flow Matching model's own likelihood as a verifier. This is elegant but circular: you're using the model to score its own outputs. The UnMaskFork paper uses Monte Carlo Tree Search with self-evaluation to guide branching. The time series work uses domain-specific metrics like MSE that require ground truth you won't have at deployment.

The self-correction paper from researchers at Cornell and NVIDIA attacks this directly. Their Progressive Self-Correction (ProSeCo) framework trains masked diffusion models to both unmask and correct tokens, reusing outputs from the denoising network as inputs for corrector training. The model learns to score and revise its own outputs, fixing errors that would otherwise accumulate through the iterative generation process.
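The mechanics can be sketched as a loop that alternates filling masked positions and revising committed ones. Everything below is a hand-built stand-in inspired by the unmask-and-correct idea, not ProSeCo's implementation: the "denoiser" injects errors on purpose, and the "corrector" is an oracle that plays the role of the learned error-fixer.

```python
import numpy as np

MASK = -1

def unmask_and_correct(denoiser, corrector, length, rounds=4, seed=0):
    """Each round: the denoiser fills a subset of masked positions,
    then the corrector may revise tokens that were already committed,
    so early mistakes don't survive the whole generation."""
    rng = np.random.default_rng(seed)
    seq = np.full(length, MASK)
    for _ in range(rounds):
        masked = np.flatnonzero(seq == MASK)
        if masked.size:
            fill = rng.choice(masked, size=max(1, masked.size // 2), replace=False)
            seq[fill] = denoiser(seq, fill)   # unmask a subset of positions
        seq = corrector(seq)                  # revise committed tokens
    return seq

target = np.arange(8)

def noisy_denoiser(seq, pos):
    # Stand-in "model": right token, but deliberately wrong on even
    # positions, leaving errors for the corrector to catch.
    out = target[pos].copy()
    out[pos % 2 == 0] += 1
    return out

# Oracle corrector stand-in (an assumption): fixes committed tokens,
# leaves masked slots alone.
corrector = lambda seq: np.where(seq == MASK, seq, target)

result = unmask_and_correct(noisy_denoiser, corrector, length=8)
```

Without the corrector pass, the injected errors would be frozen into the output; with it, the final sequence comes out clean despite a deliberately unreliable denoiser.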

Results across math and code benchmarks: ProSeCo achieves 82.18% on GSM8K (up from 77.48% baseline), 62.20% on HumanEval (up from 48.17%), and 50.20% on MBPP (up from 43.20%). On unconditional text generation, generation perplexity drops from 14.9 to 11.1 as sampling budget increases. The model achieves up to roughly 1.3x improvement on benchmarks through iterative self-correction at inference time.

But self-correction only gets you so far. The verification problem is still the constraint. You can't effectively search a space if you can't tell good from bad until after deployment.

The Discrete Diffusion Problem

Language models are trying to force inference-time scaling into a discrete token space, and it shows. The Prism paper demonstrates why this is harder than the continuous case.

Discrete diffusion language models work differently from both autoregressive transformers and continuous Flow Matching models. They corrupt text by randomly masking tokens and learn to denoise iteratively. This gives you parallel generation like Flow Matching but in discrete token space like autoregressive models.

The search problem becomes: which masked tokens should you branch on during test-time scaling? Branch on everything and you get combinatorial explosion. Branch on nothing and you don't get the benefits of search.

Prism solves this with Hierarchical Trajectory Search that dynamically prunes and reallocates compute during denoising. It introduces local branching with partial remasking to explore diverse completions while preserving high-confidence tokens, and replaces external verifiers with Self-Verified Feedback obtained via self-evaluation prompts on intermediate completions. It's beam search, but with the beam concentrated where the model is uncertain rather than spread uniformly.
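A stripped-down sketch of the local-branching idea looks like this. All of the pieces are illustrative assumptions rather than Prism's actual components: the confidences, the uniform resampler, and the oracle-style scorer standing in for self-verified feedback.

```python
import numpy as np

def branch_and_select(seq, confidence, resample, score, k=8, frac=0.25, seed=0):
    """Partial remasking: freeze high-confidence tokens, remask only the
    least-confident fraction, draw k alternative completions for those
    slots, and keep the branch the verifier scores highest."""
    rng = np.random.default_rng(seed)
    n_remask = max(1, int(frac * len(seq)))
    low_conf = np.argsort(confidence)[:n_remask]  # positions worth re-exploring
    branches = []
    for _ in range(k):
        cand = seq.copy()
        cand[low_conf] = resample(low_conf, rng)  # only remasked slots change
        branches.append(cand)
    scores = [score(b) for b in branches]
    return branches[int(np.argmax(scores))]

# Toy setup (all assumptions): a draft with two suspect tokens, distinct
# per-token confidences, a uniform resampler over a 10-token vocabulary,
# and an oracle scorer standing in for self-evaluation.
target = np.arange(8)
draft = target.copy()
draft[[2, 5]] = 99                          # two injected mistakes
confidence = np.array([0.9, 0.8, 0.1, 0.7, 0.95, 0.2, 0.85, 0.75])
resample = lambda pos, rng: rng.integers(0, 10, size=len(pos))
score = lambda b: -int((b != target).sum())

revised = branch_and_select(draft, confidence, resample, score)
```

Only the two low-confidence positions are ever rewritten, which is what keeps the branching factor manageable: the search space is the uncertain slots, not the whole sequence.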

Results on math and code benchmarks: on GSM8K, Prism boosts LLaDA 8B Instruct from 67.58% baseline to 85.30% accuracy with K=8 branching. On HumanEval, the same model jumps from 54.88% to 79.27%. These gains hold across three different discrete diffusion models (LLaDA 8B, Dream 7B, LLaDA 2.0-mini) and match best-of-N performance with substantially fewer function evaluations. Whether that trade is worth it depends entirely on whether your problem is inference-bound or training-bound.

Here's what nobody says clearly: for most production systems, you're training-bound. You'd rather spend compute improving your base model than running expensive inference. Inference-time scaling makes sense when you've hit diminishing returns on training or when you need quality on-demand for specific queries where you can't predict which ones need the extra compute.

What This Actually Changes

Inference-time compute scaling is expanding beyond the reasoning tasks that made it famous. Flow Matching models can search output space effectively. Time series models can hedge uncertainty through diverse sampling. Discrete diffusion models can move through token space with hierarchical search. These aren't the same technique applied to different domains. They're different techniques optimized for different output spaces.

The unifying insight is simpler than it appears: models trained to generate outputs in one pass can often generate better outputs with multiple attempts and selection. This works when verification is cheaper than generation. It fails when verification is expensive or unreliable.

The practical implication is uncomfortable: we're entering a regime where inference cost matters as much as training cost. The companies that figure out efficient inference-time scaling (where to spend compute, when to spend it, how to verify results) will have better models than competitors with bigger training budgets. This is a different game than the pre-training race.

The research frontier is shifting from "how do we scale pre-training" to "how do we allocate compute across training and inference." Flow Matching models scaling at inference without retraining is a proof point. Time series models getting reliable through diverse sampling is another. The autoregressive monopoly on inference-time scaling is over.
