Scaling Laws Explained for Practitioners: What Actually Matters in 2026
Scaling laws promised a simple deal: spend more compute, get better models. For three years, that deal held. Kaplan et al. drew the first power-law curves in 2020. Chinchilla refined them in 2022. Labs spent hundreds of millions of dollars following these recipes, and the results were good enough that nobody questioned the approach too closely.
Then the recipes stopped working the way everyone expected. Llama 3 trained its 8B model on 15 trillion tokens, a ratio of 1,875 tokens per parameter. Chinchilla said 20:1 was optimal. That's nearly a 100x deviation from the formula that was supposed to be definitive. DeepSeek V3 uses 671 billion total parameters but activates only 37 billion per token through Mixture of Experts routing. The original scaling laws didn't account for architectures where "model size" isn't a single number.
This guide covers what scaling laws actually tell practitioners in 2026, where they've been revised or outright replaced, and the three dimensions of scaling that matter for anyone building, deploying, or buying AI systems.
What Scaling Laws Actually Are
A scaling law is a power-law relationship between resources (compute, data, parameters) and model performance (usually measured as cross-entropy loss on a held-out set). The core claim: if you know how performance scales with each resource, you can predict how good a model will be before training it.
Kaplan et al. (2020) established the first empirical scaling laws for neural language models at OpenAI. Their findings: loss decreases as a power law with model size, dataset size, and training compute. Crucially, they found that model size matters more than data size for a fixed compute budget. This led OpenAI to train GPT-3 at 175 billion parameters on only 300 billion tokens, a ratio of roughly 1.7 tokens per parameter.
Hoffmann et al. (2022), the "Chinchilla" paper from DeepMind, overturned that finding. Their conclusion: for compute-optimal training, parameters and tokens should scale equally. The magic ratio was roughly 20 tokens per parameter. A 10B model needs 200B tokens. A 100B model needs 2 trillion. By this measure, GPT-3 was undertrained by a factor of 10x.
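Chinchilla's prescription is simple enough to sketch in a few lines. The snippet below uses the standard C ≈ 6ND approximation for training FLOPs and the rough 20:1 token-to-parameter ratio; both are approximations, not exact constants, and the GPT-3 numbers come from the text above.

```python
import math

# Back-of-the-envelope Chinchilla sizing. Assumes the standard
# C ~ 6 * N * D approximation for training FLOPs (N = parameters,
# D = training tokens) and the rough 20-tokens-per-parameter
# compute-optimal ratio reported by Hoffmann et al.

TOKENS_PER_PARAM = 20

def training_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Split a FLOP budget into (params, tokens) at the 20:1 ratio."""
    # Solve 6 * N * (20 * N) = C  =>  N = sqrt(C / 120)
    n = math.sqrt(compute_flops / (6 * TOKENS_PER_PARAM))
    return n, TOKENS_PER_PARAM * n

# GPT-3's actual budget: 175B params on 300B tokens
budget = training_flops(175e9, 300e9)
n_opt, d_opt = chinchilla_optimal(budget)
print(f"GPT-3 spent {budget:.2e} FLOPs")
print(f"Chinchilla-optimal at that budget: ~{n_opt/1e9:.0f}B params on ~{d_opt/1e12:.1f}T tokens")
```

Run against GPT-3's budget, this says the same compute would have bought roughly a 51B model on a trillion tokens, which is the sense in which GPT-3 was undertrained.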
The Chinchilla result reshaped the industry overnight. It implied that most existing models were too large and too data-starved. Google trained Gemini with Chinchilla-aware ratios. Meta used Chinchilla numbers to plan Llama 2.
But Chinchilla optimized for one thing: minimizing training loss for a fixed training compute budget. It didn't consider what happens after training.
The Chinchilla Trap
Here's the problem practitioners discovered: a model that's compute-optimal for training can be wildly suboptimal for deployment.
Chinchilla says a 70B model trained on 1.4T tokens is compute-optimal. But deploying a 70B model costs roughly 10x more per inference call than deploying a 7B model. If your model serves a million requests per day, those inference costs dwarf the original training bill within weeks.
Sardana and Frankle (2024) formalized this as "inference-aware scaling laws." When you factor in expected deployment volume, the compute-optimal model size drops significantly. For a model expecting one billion inference requests over its lifetime, the optimal parameter count can be 5-10x smaller than Chinchilla would suggest, trained on proportionally more data.
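The tradeoff is easy to make concrete. The toy comparison below counts lifetime FLOPs (dollar costs scale with these) under two standard approximations, training at 6ND FLOPs and inference at 2N FLOPs per generated token; the request volume and the 1,000 generated tokens per request are hypothetical round numbers, not measurements.

```python
# Toy total-cost-of-ownership comparison in FLOPs. Assumptions:
# training ~ 6 * N * D FLOPs, inference ~ 2 * N FLOPs per generated
# token, 1,000 generated tokens per request. Volumes are hypothetical.

def lifetime_flops(params, train_tokens, requests, tokens_per_request=1000):
    train = 6 * params * train_tokens
    inference = 2 * params * tokens_per_request * requests
    return train, inference, train + inference

REQUESTS = 1e9  # a billion requests over the model's lifetime

# Chinchilla-optimal 70B vs a Llama-3-style "overtrained" 8B
for name, n, d in [("70B @ 20:1  ", 70e9, 1.4e12),
                   ("8B @ 1875:1 ", 8e9, 15e12)]:
    train, inf, total = lifetime_flops(n, d, REQUESTS)
    print(f"{name} train {train:.2e}  inference {inf:.2e}  total {total:.2e}")
```

At this volume the two totals already land within a few percent of each other despite the 8B's far larger training bill; beyond it, every additional request favors the smaller model, whose per-request serving cost is almost 9x lower throughout.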
This is exactly what happened in practice. Llama 3's 8B model trained on 15 trillion tokens isn't a mistake or a deviation from theory. It's the result of optimizing for total cost of ownership rather than training loss alone. The model is smaller than Chinchilla-optimal, overtrained by Chinchilla standards, and cheaper to deploy by an order of magnitude.
The industry term for this is "overtraining," but calling it that obscures what's really going on. These models aren't overtrained. They're optimized for a different objective function than the one Chinchilla used.
The practitioner takeaway: if you're not a frontier lab training from scratch, Chinchilla ratios are largely irrelevant to your decisions. They describe compute-optimal pretraining. Most teams do fine-tuning, distillation, or prompt engineering, none of which follow pretraining scaling laws.
The Three Scaling Axes of 2026
The original scaling laws tracked one dimension: pretraining compute. In 2026, there are three dimensions that matter, and they interact in non-obvious ways.
Axis 1: Pre-Training (Data-Limited)
Pre-training scaling continues to work, but it's running into a hard constraint: data availability. Epoch AI estimates that high-quality internet text totals somewhere between 10 and 50 trillion tokens. Chinchilla-optimal training for a one-trillion-parameter model would require roughly 20 trillion tokens of high-quality data. We're approaching the ceiling.
Labs have responded in three ways:
Synthetic data generation. Phi-4 from Microsoft demonstrated that carefully curated synthetic datasets can match or exceed web-scraped data quality for specific domains. The key word is "carefully." Naive synthetic data causes model collapse, where generators trained on their own outputs produce increasingly degenerate text. Effective synthetic data requires diverse seed prompts, quality filters, and domain-specific validators.
Data curation over data volume. FineWeb from Hugging Face showed that aggressive filtering of Common Crawl (keeping roughly 15% of documents) produces training data that outperforms the full dataset. The implication: we haven't exhausted available data, but we've exhausted the easy approach of scraping everything and hoping quality averages out.
Multimodal training. Text may be running out, but video, audio, and image data are effectively unlimited. Gemini and GPT-4o train on multimodal mixtures partly to access this larger data pool.
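The curation idea is easy to illustrate, even though none of the sources above publish a drop-in filter. The toy filter below keeps documents that pass two cheap heuristics and drops exact duplicates; the thresholds and helper are invented for illustration and are nothing like FineWeb's real pipeline, which relies on trained quality classifiers and far richer heuristics.

```python
import hashlib

# A toy quality filter in the spirit of aggressive curation:
# keep only documents passing cheap heuristics, drop exact duplicates.
# Thresholds are illustrative, not FineWeb's actual criteria.

def keep(doc: str, seen_hashes: set) -> bool:
    if len(doc) < 200:                       # too short to be useful
        return False
    letters = sum(ch.isalpha() for ch in doc)
    if letters / len(doc) < 0.6:             # mostly markup/numbers/noise
        return False
    digest = hashlib.sha256(doc.encode()).hexdigest()
    if digest in seen_hashes:                # exact duplicate
        return False
    seen_hashes.add(digest)
    return True

docs = ["word " * 100, "123 456 " * 50, "word " * 100, "short"]
seen = set()
kept = [d for d in docs if keep(d, seen)]
print(f"kept {len(kept)} of {len(docs)}")  # kept 1 of 4
```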
Axis 2: Post-Training (The Quiet Multiplier)
Post-training, which covers reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), and constitutional AI methods, doesn't follow pretraining scaling laws at all. The relationship between compute invested in post-training and model quality is poorly characterized and highly task-specific.
What we know empirically:
DeepSeek-R1 achieved a jump from 15.6% to 71% accuracy on AIME (math competition problems) primarily through reinforcement learning during post-training. The base model had the capability latent in its weights. Post-training unlocked it. The technique was Group Relative Policy Optimization, simpler and cheaper than the process reward models and Monte Carlo Tree Search approaches that OpenAI used for o1.
This matters for practitioners because post-training is where fine-tuning lives. A LoRA adapter trained for $500-$5,000 can shift a model's behavior dramatically for a specific domain, while changing its raw capabilities very little. The scaling law here isn't about parameters or tokens. It's about the quality of your preference data and the alignment between your reward signal and your actual objective.
The practitioner takeaway: post-training is where most teams should spend their optimization budget. The ROI per dollar is orders of magnitude higher than pretraining, and the required expertise is closer to traditional ML engineering than to the GPU cluster management that pretraining demands.
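The cost asymmetry is visible from parameter counts alone: a rank-r LoRA adapter on a d_in × d_out weight matrix adds only r * (d_in + d_out) trainable parameters. The dimensions below are illustrative (roughly 7B-class: 32 layers, 4096 hidden size, adapters on two attention projections per layer, rank 16), not a prescription.

```python
# Why LoRA fine-tuning is cheap: a rank-r adapter on a (d_in x d_out)
# weight adds only r * (d_in + d_out) trainable parameters.
# Dimensions below are illustrative, roughly Llama-7B-shaped;
# adapting the q/v projections is a common but not universal choice.

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    return rank * (d_in + d_out)

hidden = 4096
layers = 32
rank = 16
adapted_per_layer = 2  # e.g. q_proj and v_proj, each hidden x hidden

trainable = layers * adapted_per_layer * lora_params(hidden, hidden, rank)
total = 7e9  # ballpark base model size

print(f"Trainable LoRA params: {trainable/1e6:.1f}M "
      f"({100 * trainable / total:.3f}% of the base model)")
```

Roughly eight million trainable parameters against a frozen 7B base, about 0.12%, is why adapter fine-tuning fits inside a $500-$5,000 budget.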
Axis 3: Inference-Time Compute (The New Frontier)
The newest scaling dimension is inference-time compute: spending more tokens and more computation during a single model call to improve output quality. This is what reasoning models like OpenAI's o1/o3 and DeepSeek-R1 exploit.
The results are striking. A 7B-parameter model with tree search outperforms a 34B model on the MATH benchmark across all tested inference strategies, according to research presented at ICLR 2025. At small compute budgets, sampling more candidates from a small model is compute-optimal; at larger budgets, bigger models eventually win because small models saturate.
This creates a practical tradeoff that pretraining scaling laws never addressed: should you spend your compute budget on a bigger model or on more inference-time reasoning? The answer depends on your latency requirements, your workload distribution, and whether your tasks benefit from extended reasoning chains.
For straightforward tasks (classification, simple extraction, translation), inference scaling provides no benefit. The model either knows the answer or it doesn't, and "thinking harder" won't help. For complex reasoning, multi-step planning, and novel problem-solving, inference scaling can substitute for 10x larger models at comparable accuracy.
The practitioner takeaway: inference scaling is most valuable for high-stakes, low-volume tasks where latency tolerance is high and accuracy requirements are strict. For high-volume, low-latency applications, a bigger pretrained model served efficiently remains the better choice.
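The simplest inference-time scaling strategy, self-consistency (sample N answers, take the majority), is worth sketching because it shows where the benefit comes from. The model call below is replaced by a toy sampler that is right 40% of the time, with wrong answers scattered across many distinct strings; those probabilities are invented for illustration.

```python
import random
from collections import Counter

# Self-consistency: sample N candidate answers, return the majority.
# `sample_answer` stands in for a real model call; it returns the
# correct answer with probability p and otherwise one of ten
# distinct wrong answers.

def sample_answer(p_correct: float, rng: random.Random) -> str:
    if rng.random() < p_correct:
        return "correct"
    return f"wrong-{rng.randint(0, 9)}"

def majority_vote(n_samples: int, p_correct: float, rng: random.Random) -> str:
    votes = Counter(sample_answer(p_correct, rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

rng = random.Random(0)
# A model that is right only 40% of the time per sample...
single = sum(sample_answer(0.4, rng) == "correct" for _ in range(1000)) / 1000
voted = sum(majority_vote(15, 0.4, rng) == "correct" for _ in range(1000)) / 1000
print(f"single-sample accuracy ~{single:.2f}, 15-sample vote ~{voted:.2f}")
```

Voting helps even though a single sample is usually wrong, because the correct answer's probability mass is concentrated on one string while the errors are spread thin. The same argument explains why it does nothing for tasks where the model is confidently wrong in one consistent way.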
Task-Specific Scaling Plateaus
One of the most practically useful findings from recent scaling research is that different capabilities plateau at different model sizes. BuildFastWithAI's 2026 analysis synthesized benchmarks across model families and found clear saturation points:
| Capability | Approximate Plateau | Implication |
|---|---|---|
| Language understanding (GLUE-style) | ~13B parameters | A 13B model captures most of the linguistic competence of a 70B model |
| Knowledge recall (MMLU) | ~30B parameters | Beyond 30B, gains come from better data curation, not more parameters |
| Code generation (HumanEval) | ~34B parameters | Inference-time scaling matters more than model size past this point |
| Mathematical reasoning (GSM8K) | ~70B parameters | The hardest capability to scale; still benefits from larger models |
These plateaus have direct implications for model selection. If your application is primarily language understanding, you're paying for unused capacity with anything over 13B. If you need strong math reasoning, smaller models will struggle regardless of how much inference compute you throw at them.
The plateaus also explain why the "small language model" movement has gained traction. For many production workloads, a well-trained 7-13B model does 90% of what a 70B model does at a fraction of the cost. The remaining 10% is often not worth the 5-10x cost multiplier.
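Read as a lookup table, the plateaus give a crude sizing rule: pick the smallest model at which your hardest required capability saturates. The numbers below just transcribe the table's approximations and should be treated as soft thresholds, not hard cutoffs.

```python
# Model sizing from the capability plateaus above. Plateau values
# are the article's approximations, in billions of parameters.

PLATEAUS_B = {
    "language_understanding": 13,
    "knowledge_recall": 30,
    "code_generation": 34,
    "math_reasoning": 70,
}

def min_sufficient_size(required_capabilities: list[str]) -> int:
    """Smallest size (B params) covering every required capability."""
    return max(PLATEAUS_B[c] for c in required_capabilities)

print(min_sufficient_size(["language_understanding"]))               # 13
print(min_sufficient_size(["code_generation", "knowledge_recall"]))  # 34
```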
The Densing Law: A New Framing
Yan et al. (2025) proposed the "Densing Law" in Nature Machine Intelligence: a counterpart to traditional scaling laws that measures how model capability per parameter improves over time rather than how performance improves with more parameters.
The finding: effective model density doubles roughly every 2.5 years. A 2026 model with 7B parameters achieves what a 2023 model needed 30B+ parameters to match. This means the performance frontier advances even when model sizes stay constant, driven by architectural improvements (MoE, FlashAttention, grouped-query attention), training recipe improvements (better data mixing, curriculum learning), and post-training techniques (RLHF, DPO, GRPO).
For practitioners, the densing law suggests a counterintuitive strategy: waiting can be cheaper than scaling. If you need capabilities that currently require a 70B model, a substantially smaller model released a year or two from now may match them, with correspondingly lower deployment costs.
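The densing-law arithmetic is one line. The doubling time below is the article's 2.5-year figure; treat it as a parameter to argue about rather than a constant of nature.

```python
DOUBLING_YEARS = 2.5  # the article's figure; swap in your own estimate

def equivalent_size(params_now: float, years_later: float) -> float:
    """Model size expected to match `params_now`-level capability
    `years_later` from now, if density doubles every DOUBLING_YEARS."""
    return params_now / 2 ** (years_later / DOUBLING_YEARS)

# Capability that takes ~70B today:
print(f"{equivalent_size(70e9, 2.5) / 1e9:.1f}B in 2.5 years")  # 35.0B
print(f"{equivalent_size(70e9, 5.0) / 1e9:.1f}B in 5 years")    # 17.5B
```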
How to Use Scaling Laws as a Planning Tool
If you're training or fine-tuning models, scaling laws are most useful as a budgeting and planning tool, not as an optimization target.
MIT-IBM Watson AI Lab research offers concrete guidance for constructing your own scaling curves:
- Train 5 proxy models across a range of sizes (e.g., 100M to 3B parameters). This costs a fraction of your final training run.
- Fit power-law curves to the proxy results. Expect roughly 4% absolute relative error as the theoretical best for predictions.
- Discard early training data. Results from before 10B tokens of training are too noisy to extrapolate from.
- Partial training works. Training proxies to 30% of your target dataset is sufficient for reliable extrapolation.
- Cross-family borrowing. When budget-constrained, borrow scaling law parameters from a similar architecture family. This works reliably for decoder-only models.
This approach lets you predict the performance of a 70B model from the results of five models between 100M and 3B, saving millions in wasted training compute.
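With the irreducible loss term pinned down, fitting the curve is a linear regression in log-log space. The sketch below generates synthetic proxy losses from an assumed power law, fits it back, and extrapolates to 70B; the constants are invented, and in practice you would fit the irreducible loss jointly (e.g. with scipy.optimize.curve_fit) rather than assume it.

```python
import numpy as np

# Proxy-model workflow sketch: fit loss(N) = a * N**(-alpha) + L_inf
# to a handful of small runs, then extrapolate to the target size.
# Proxy losses are synthetic (assumed law plus ~1% noise), standing
# in for real training runs.

rng = np.random.default_rng(0)
proxy_sizes = np.array([1e8, 3e8, 1e9, 2e9, 3e9])  # 100M .. 3B params

true_a, true_alpha, true_L_inf = 4e2, 0.3, 1.7
proxy_losses = true_a * proxy_sizes**-true_alpha + true_L_inf
proxy_losses *= 1 + rng.normal(0, 0.01, size=proxy_sizes.shape)

# With L_inf fixed (assumed known here), the fit is linear in logs:
# log(loss - L_inf) = log(a) - alpha * log(N)
slope, intercept = np.polyfit(np.log(proxy_sizes),
                              np.log(proxy_losses - true_L_inf), 1)
alpha_hat, a_hat = -slope, np.exp(intercept)

predicted_70b = a_hat * (70e9) ** -alpha_hat + true_L_inf
print(f"fitted alpha ~{alpha_hat:.3f}, predicted loss at 70B ~{predicted_70b:.3f}")
```

The recovered exponent lands near the true 0.3 despite the noise, which is the whole trick: five cheap runs pin down the curve that the expensive run will follow.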
What This Means for Different Teams
If you're choosing a model to deploy: ignore scaling laws entirely and focus on benchmarks relevant to your task. The model that scores highest on your evaluation suite is the right choice, regardless of its parameter count. A 7B model that dominates your use case beats a 70B model that's slightly better on average.
If you're fine-tuning: scaling laws for fine-tuning are poorly established. The relationship between fine-tuning data quality and model improvement is closer to logarithmic than power-law. Your marginal dollar is better spent on data quality than data quantity. One thousand carefully curated examples often outperform ten thousand noisy ones.
If you're building agents: agent performance doesn't scale cleanly with model capability. A more capable base model doesn't guarantee a more reliable agent, because agent failures are dominated by tool use errors, context management issues, and coordination overhead rather than raw language modeling quality.
If you're a technical leader making buy decisions: the densing law is your friend. Unless you have urgent competitive pressure, waiting 12-18 months often gets you equivalent capability at half the cost. Budget for the model you need now, but architect your system to swap in cheaper models as they become available.
What's Next
Three developments will reshape scaling laws in the next 12 months:
Mixture of Experts will complicate everything. MoE architectures break the assumption that "model size" is a single number. A 671B total-parameter model that activates 37B per token doesn't fit neatly into existing scaling curves. We need scaling laws parameterized by active parameters, total parameters, and expert count.
Inference compute markets will emerge. As inference-time scaling matures, providers will offer tiered pricing: pay more per query for extended reasoning, less for quick answers. This turns scaling laws from an academic curiosity into a real-time pricing problem.
The data wall will force creativity. With high-quality text data approaching its ceiling, the labs that find effective ways to use synthetic data, multimodal data, and cross-lingual transfer will define the next generation of scaling behavior. The power laws will hold, but the constants will change dramatically.
The bottom line: scaling laws remain the closest thing AI has to physics. They tell you what's possible for a given budget. But in 2026, "budget" has three dimensions, not one, and the models that win aren't the biggest. They're the ones that allocate compute most intelligently across training, post-training, and inference.