Models Training Models: The Promise and Peril of Synthetic Data

Microsoft's Phi-4 trained on more than 50% synthetic data and beat its own teacher, GPT-4o, on graduate-level science benchmarks. A 14-billion-parameter student outscored the model that generated its training examples. That should make you uncomfortable, because the old rules about data are changing fast.

Human-labeled data is expensive, slow, and running out. The alternative is training on machine-generated data, and the results are now too good to ignore. But the risk is specific and well-documented: model collapse. Synthetic data clearly works. Whether labs can keep the gains without poisoning their own wells is the part nobody's figured out yet.

When AI Replaces the Human Rater

The shift started with RLAIF, Reinforcement Learning from AI Feedback. Instead of paying humans to rank model outputs, you use another AI. Anthropic's Constitutional AI framework pioneered this in 2022, and a 2024 Google study at ICML tested RLAIF head-to-head against traditional RLHF across summarization, helpful dialogue, and harmless dialogue. RLAIF matched human-feedback quality on all three. The study's direct-RLAIF variant, which skips the reward model and gets scores straight from an off-the-shelf LLM, actually outperformed the standard approach.

The kicker: RLAIF worked even when the AI labeler was the same checkpoint as the model being trained. A model improving itself by judging itself.
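The labeling step is easier to picture in code. What follows is a minimal sketch, not the Google study's pipeline: `toy_labeler` is a hypothetical heuristic standing in for an LLM judge, and `PreferencePair` and `label_pairs` are illustrative names. It only shows how AI scores turn two candidate responses into a chosen/rejected pair.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def toy_labeler(prompt: str, response: str) -> float:
    """Hypothetical stand-in for an LLM judge.

    Rewards responses that reuse words from the prompt and stay short.
    A real setup would ask a language model for this score instead.
    """
    prompt_words = set(prompt.lower().split())
    words = response.lower().split()
    if not words:
        return 0.0
    overlap = sum(w in prompt_words for w in words) / len(words)
    brevity = 1.0 / (1.0 + len(words))
    return overlap + brevity


def label_pairs(
    prompts: List[str],
    candidates: List[Tuple[str, str]],
    labeler: Callable[[str, str], float],
) -> List[PreferencePair]:
    """Turn two candidate responses per prompt into AI-labeled preference pairs."""
    pairs = []
    for prompt, (a, b) in zip(prompts, candidates):
        score_a, score_b = labeler(prompt, a), labeler(prompt, b)
        if score_a == score_b:
            continue  # no preference signal, so skip the example
        chosen, rejected = (a, b) if score_a > score_b else (b, a)
        pairs.append(PreferencePair(prompt, chosen, rejected))
    return pairs
```

In this sketch the labeler is just a function argument, so nothing stops you from passing in the same model that produced the candidates, which is the setup described above.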

Self-Play Preference Optimization pushed further. Wu et al. framed alignment as a two-player game and let the model compete against itself. Starting from Mistral-7B-Instruct with only 60,000 prompts and zero GPT-4 labels, SPPO hit a 28.53% win rate against GPT-4-Turbo on AlpacaEval 2.0. With Llama-3-8B-Instruct, that jumped to 38.77%. No human preference data at all.
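To make the two-player framing concrete, here is a toy sketch of self-play using a generic multiplicative-weights update over three fixed responses. It is an assumption-heavy illustration, not SPPO's published training objective: the `PREF` table is invented, `eta` is an arbitrary step size, and a real system updates a language model rather than a three-entry probability vector.

```python
import math
from typing import List

# Hypothetical preference table: PREF[i][j] is the probability that
# response i is preferred over response j.
PREF = [
    [0.5, 0.7, 0.9],
    [0.3, 0.5, 0.6],
    [0.1, 0.4, 0.5],
]


def self_play_step(policy: List[float], pref: List[List[float]], eta: float) -> List[float]:
    """One multiplicative-weights update against the current policy."""
    # Expected win rate of each response against an opponent drawn from the policy.
    win_rate = [sum(p_ij * q for p_ij, q in zip(row, policy)) for row in pref]
    # Upweight responses in proportion to how often they beat the current policy.
    unnormalized = [p * math.exp(eta * w) for p, w in zip(policy, win_rate)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def self_play(pref: List[List[float]], rounds: int, eta: float = 1.0) -> List[float]:
    """Start from a uniform policy and repeatedly play it against itself."""
    n = len(pref)
    policy = [1.0 / n] * n
    for _ in range(rounds):
        policy = self_play_step(policy, pref, eta)
    return policy
```

Calling `self_play(PREF, rounds=50)` shifts almost all probability onto the first response, because it is preferred over every alternative in the table. The opponent in every round is the policy itself, so no external labels enter the loop beyond the preference table.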

Self-Play Eats Math

DeepMind's AlphaProof delivered the most dramatic self-play result. Built on AlphaZero's RL architecture, it taught itself to prove mathematical theorems by playing against a formal verification system. It was trained on roughly one million informal math problems, which were auto-formalized into Lean and then explored via RL-driven proof search.

At the 2024 International Mathematical Olympiad, AlphaProof scored 28 out of 42 points, silver-medal territory. It solved the competition's hardest problem, one only five human contestants cracked. The Nature paper introduced "Test-Time RL," generating millions of problem variants during inference for deep adaptation.

The pattern matters more than the medal. AlphaProof didn't learn from human mathematicians explaining proofs. It generated candidates, checked them against a formal verifier, and iterated. That generate-verify-iterate loop now shows up across reasoning, code, and alignment research.
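The loop itself is simple enough to sketch. The toy below is nothing like AlphaProof's actual system: it swaps theorem proving for integer factoring, where `verify` plays the role of the formal checker and a weight table plays the role of the policy. All names are illustrative assumptions.

```python
import random
from typing import Dict, List, Tuple


def verify(n: int, a: int, b: int) -> bool:
    """Exact checker: accepts only a genuine nontrivial factorization."""
    return a > 1 and b > 1 and a * b == n


def generate(n: int, weights: Dict[int, float], rng: random.Random) -> Tuple[int, int]:
    """Propose a candidate factor, favoring factors that were verified before."""
    candidates = list(range(2, n))
    a = rng.choices(candidates, weights=[weights.get(c, 1.0) for c in candidates])[0]
    return a, n // a


def generate_verify_iterate(
    problems: List[int], rounds: int, samples_per_problem: int, seed: int = 0
) -> Tuple[List[Tuple[int, int, int]], Dict[int, float]]:
    rng = random.Random(seed)
    weights: Dict[int, float] = {}  # the toy "policy"
    verified: List[Tuple[int, int, int]] = []  # only checked solutions land here
    for _ in range(rounds):
        for n in problems:
            for _ in range(samples_per_problem):
                a, b = generate(n, weights, rng)
                if verify(n, a, b):
                    verified.append((n, a, b))
                    # Iterate: reinforce what the verifier accepted.
                    weights[a] = weights.get(a, 1.0) + 1.0
    return verified, weights
```

The design point is that unverified guesses never reach the training set. The generator can be wrong as often as it likes, because the checker, not the generator, decides what counts as data.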

The Collapse Problem

In July 2024, Shumailov et al. published in Nature what happens when models train recursively on their own outputs. Early model collapse erased tail distributions, the rare examples and unusual phrasings that make language interesting. Late model collapse was worse: data distributions converged until they looked nothing like the originals. The model forgot what real data looked like.
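A toy simulation gives a feel for the mechanism. This is a sketch under strong simplifying assumptions, not the paper's experiment: the "model" is a one-dimensional Gaussian fitted to 20 samples, and each generation trains only on samples drawn from the previous generation's fit. The function name and parameters are illustrative.

```python
import random
import statistics
from typing import List


def recursive_fit(generations: int, n_samples: int, seed: int = 0) -> List[float]:
    """Refit a Gaussian to its own samples and track the spread per generation."""
    rng = random.Random(seed)
    # Generation 0: "real" data from a standard normal distribution.
    data = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    stds = [statistics.pstdev(data)]
    for _ in range(generations):
        # "Train": estimate mean and spread from the current data.
        mu, sigma = statistics.fmean(data), statistics.pstdev(data)
        # "Generate": the next generation sees only the model's own samples.
        data = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        stds.append(statistics.pstdev(data))
    return stds
```

Running `recursive_fit(300, 20)` and comparing the first and last entries shows the spread shrinking toward zero. Each refit loses a little of the tails to sampling noise, and with nothing real in the loop there is no way to get them back.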

This isn't theoretical. As AI-generated text floods the internet, every lab scraping web data is ingesting more synthetic slop. The training data problem isn't just running out of human text. It's contaminating what's left.

But Gerstgrasser et al. showed collapse is avoidable under one condition: keep original real data in the mix. Accumulate synthetic generations alongside the real training set rather than replacing it, and test error stays bounded no matter how many iterations you run. This held across language models, diffusion models, and autoencoders. Don't throw away the originals.
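The difference between replacing and accumulating can be sketched with a toy Gaussian model. This is an illustration under simplifying assumptions, not a reproduction of the paper's experiments: the "model" is a fitted one-dimensional Gaussian, and `fit` and `final_spread` are hypothetical helper names.

```python
import math
import random
from typing import List, Tuple


def fit(xs: List[float]) -> Tuple[float, float]:
    """Maximum-likelihood mean and standard deviation of a sample."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, math.sqrt(var)


def final_spread(generations: int, n_samples: int, accumulate: bool, seed: int = 0) -> float:
    """Spread of the training pool after repeated train-and-generate rounds."""
    rng = random.Random(seed)
    # The pool starts as real data from a standard normal distribution.
    pool = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    for _ in range(generations):
        mu, sigma = fit(pool)
        synthetic = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        # Accumulate: keep real data and all earlier generations in the pool.
        # Replace: train the next model on the newest synthetic batch alone.
        pool = pool + synthetic if accumulate else synthetic
    return fit(pool)[1]
```

Comparing `final_spread(300, 20, accumulate=False)` with `final_spread(300, 20, accumulate=True)` shows the contrast: the replace run shrinks toward zero spread, while the accumulate run stays on the same order as the original data, since each new batch is a shrinking fraction of a pool that still contains the real samples.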

Phi-4 proves careful mixing works. Curated synthetic reasoning examples plus filtered organic web data got a 14B model to 84.8 on MMLU and 56% on GPQA, numbers that embarrass much larger models. The student beat the teacher not because synthetic data is magic, but because targeted curation beats raw scale.

What's Actually at Stake

Labs building frontier models are already committed here. Human annotation can't scale to match their ambitions. RLAIF, self-play, and synthetic generation are becoming default infrastructure.

The risk is treating model collapse as solved before guardrails are proven at web scale. AlphaProof works because math has formal proof checkers. Most tasks we care about don't have anything equivalent. Without a reliable verifier, self-play is just a model high-fiving itself in a hall of mirrors.

Phi-4's benchmarks are real. AlphaProof's IMO score is real. SPPO beating GPT-4-Turbo with zero human labels is real. But every success relied on careful curation, formal verification, or preserved access to original human data. Strip those guardrails away and you get recursive collapse. The models can train themselves. Making sure they don't train themselves into a corner is the actual hard problem.

Key Takeaways

  • Synthetic data generation risks model collapse when training loops become self-referential. Models trained on their own outputs show measurable quality degradation after 3-4 generations, particularly on minority subpopulations.
  • Domain-specific synthetic data fails in medical imaging at rates up to 50%. GANs trained on X-rays produce anatomically plausible but medically inaccurate images that degrade diagnostic model performance on edge cases.
  • Self-play training for multi-agent systems improved coordination by 29% in controlled settings. Agents that generate and resolve their own training scenarios showed improved generalization, but the approach doesn't scale to tasks requiring external knowledge.
  • Curated synthetic data (filtered, validated, mixed with real data) avoids collapse. The key is maintaining a minimum 70% real data ratio and applying domain-expert validation to synthetic examples before including them in training sets.
