Microsoft's Phi-4 trained on more than 50% synthetic data and beat its own teacher, GPT-4o, on graduate-level science benchmarks. A 14-billion parameter student outscoring the model that generated its training examples. That should make you uncomfortable, because the old rules about data are changing fast.

Human-labeled data is expensive, slow, and running out. The alternative is training on machine-generated data, and the results are now too good to ignore: synthetic data clearly works. But the risk is specific and well-documented, and it's called model collapse. Whether labs can keep the gains without poisoning their own wells is the part nobody has figured out yet.

When AI Replaces the Human Rater

The shift started with RLAIF, Reinforcement Learning from AI Feedback. Instead of paying humans to rank model outputs, you use another AI. Anthropic's Constitutional AI framework pioneered this in 2022, and a 2024 Google study at ICML tested RLAIF head-to-head against traditional RLHF across summarization, helpful dialogue, and harmless dialogue. RLAIF matched human-feedback quality on all three. Their direct-RLAIF variant, which skips the reward model and gets scores straight from an off-the-shelf LLM, actually outperformed the standard approach.

The kicker: RLAIF worked even when the AI labeler was the same checkpoint as the model being trained. A model improving itself by judging itself.
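
To make the mechanics concrete, here is a minimal sketch of the idea behind AI preference labeling. It is an illustration, not the Google paper's implementation: the `llm` hook, the prompt format, and the score parsing are all assumptions standing in for whatever chat model and rating scheme a lab actually uses.

```python
# A minimal sketch of AI preference labeling in the direct-RLAIF spirit: an
# off-the-shelf LLM rates two candidate responses, and the ratings become a
# soft preference label that can drive RL training. `llm` is a hypothetical
# hook (prompt in, text out), not a real API.

from typing import Callable

def ai_preference_label(
    llm: Callable[[str], str],
    context: str,
    response_a: str,
    response_b: str,
) -> float:
    """Return the labeler's estimate of P(response A is better than B)."""
    prompt = (
        "Rate each response to the task below from 1 (poor) to 10 (excellent).\n\n"
        f"Task:\n{context}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Answer with two numbers separated by a space, e.g. '7 4'."
    )
    reply = llm(prompt)
    score_a, score_b = (float(tok) for tok in reply.split()[:2])
    # Crude soft label from parsed scores; real systems tend to work from the
    # labeler's token probabilities rather than free text, which is more robust.
    return score_a / (score_a + score_b)
```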

Self-Play Preference Optimization (SPPO) pushed further. Wu et al. framed alignment as a two-player game and let the model compete against itself. Starting from Mistral-7B-Instruct with only 60,000 prompts and zero GPT-4 labels, SPPO hit a 28.53% length-controlled win rate against GPT-4-Turbo on AlpacaEval 2.0. With Llama-3-8B-Instruct, that jumped to 38.77%. No human preference data at all.
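
The training loop itself is compact enough to sketch. The version below is a rough illustration in the spirit of SPPO, not Wu et al.'s code: `policy`, `ref_policy`, and `prefers` are hypothetical stand-ins for the current model, the previous round's checkpoint, and a preference model, and the loss is a simplified rendering of the paper's squared-error objective.

```python
# One illustrative SPPO-style self-play round: sample candidates from the
# previous checkpoint, estimate how often each beats the others according to
# a preference model, and regress the log-probability ratio toward that signal.

import torch

def sppo_round(policy, ref_policy, prefers, prompts, eta=1.0, k=5):
    losses = []
    for x in prompts:
        candidates = [ref_policy.sample(x) for _ in range(k)]
        for i, y in enumerate(candidates):
            # Estimated probability that y beats the current policy, using the
            # other sampled candidates as stand-ins for the opponent.
            win = sum(
                prefers(x, y, other)
                for j, other in enumerate(candidates) if j != i
            ) / (k - 1)
            # Winners get their log-prob ratio pushed up, losers pushed down.
            ratio = policy.log_prob(x, y) - ref_policy.log_prob(x, y)
            losses.append((ratio - eta * (win - 0.5)) ** 2)
    return torch.stack(losses).mean()  # minimize with an ordinary optimizer
```

Run a few rounds, copying the trained policy into `ref_policy` between them, and the model is competing against progressively stronger versions of itself, with no human labels anywhere in the loop.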

Self-Play Eats Math

DeepMind's AlphaProof delivered the most dramatic self-play result. Built on AlphaZero's RL architecture, it taught itself to prove mathematical theorems by playing against a formal verification system. It trained on roughly one million informal math problems that were auto-formalized into Lean and then explored through RL-driven proof search.

At the 2024 International Mathematical Olympiad, AlphaProof, paired with AlphaGeometry 2 for the geometry problem, scored 28 out of 42 points, silver-medal territory. It solved the competition's hardest problem, one only five human contestants cracked. The accompanying Nature paper introduced "Test-Time RL": generating millions of variants of a problem during inference so the system can adapt to that specific problem.

The pattern matters more than the medal. AlphaProof didn't learn from human mathematicians explaining proofs. It generated candidates, checked them against a formal verifier, and iterated. That generate-verify-iterate loop now shows up across reasoning, code, and alignment research.
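
In code, that loop is almost embarrassingly small. The sketch below is illustrative, not AlphaProof: `propose` and `verify` are hypothetical stand-ins for the prover model and the formal checker (Lean in AlphaProof's case, unit tests or a compiler for code generation).

```python
# An illustrative generate-verify-iterate loop: a model drafts candidates, a
# verifier accepts or rejects them, and accepted solutions become training
# signal for the next round. No human-written solutions are needed anywhere.

import random
from typing import Callable, List, Tuple

def generate_verify_iterate(
    propose: Callable[[str, List[Tuple[str, str]]], str],  # problem + solved examples -> candidate
    verify: Callable[[str, str], bool],                     # problem + candidate -> does it check out?
    problems: List[str],
    rounds: int = 3,
    attempts: int = 16,
) -> List[Tuple[str, str]]:
    solved: List[Tuple[str, str]] = []
    for _ in range(rounds):
        for problem in random.sample(problems, len(problems)):
            for _ in range(attempts):
                candidate = propose(problem, solved)   # conditioned on what has already passed
                if verify(problem, candidate):         # the verifier, not a human, is the ground truth
                    solved.append((problem, candidate))
                    break
    return solved
```

The whole scheme stands or falls on `verify`: math and code have cheap, trustworthy checkers, which is exactly why they collapsed to self-play first.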

The Collapse Problem

In July 2024, Shumailov et al. published in Nature what happens when models train recursively on their own outputs. Early model collapse erased the tails of the distribution, the rare examples and unusual phrasings that make language interesting. Late model collapse was worse: the data distributions converged until they looked nothing like the originals. The model forgot what real data looked like.

This isn't theoretical. As AI-generated text floods the internet, every lab scraping web data is ingesting more synthetic slop. The training data problem isn't just running out of human text. It's contaminating what's left.

But Gerstgrasser et al. showed collapse is avoidable under one condition: keep original real data in the mix. Accumulate synthetic generations alongside the real training set rather than replacing it, and test error stays bounded no matter how many iterations you run. This held across language models, diffusion models, and autoencoders. Don't throw away the originals.
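
The mechanism is easy to see in a toy simulation. The snippet below is an illustration with Gaussians, not either paper's experiment: each generation fits a normal distribution to its training data and samples new "synthetic" data from the fit, either replacing the data or accumulating it alongside the originals.

```python
# Toy model collapse demo: replacing the data each generation compounds the
# fitting error until the spread collapses; accumulating synthetic data next
# to the original real samples keeps the distribution anchored.

import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=1_000)   # stand-in for original human data
generations, batch = 1_000, 100                      # synthetic samples drawn per generation

replace, accumulate = real.copy(), real.copy()
for _ in range(generations):
    # Replace: train only on the previous generation's synthetic output.
    replace = rng.normal(replace.mean(), replace.std(), size=batch)
    # Accumulate: add new synthetic data, but never drop what came before.
    accumulate = np.concatenate(
        [accumulate, rng.normal(accumulate.mean(), accumulate.std(), size=batch)]
    )

print(f"replace-only std after {generations} generations:  {replace.std():.3f}")   # shrinks toward 0
print(f"accumulating std after {generations} generations:  {accumulate.std():.3f}")  # stays near 1.0
```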

Phi-4 proves careful mixing works. Curated synthetic reasoning examples plus filtered organic web data got a 14B model to 84.8 on MMLU and 56% on GPQA, numbers that embarrass much larger models. The student beat the teacher not because synthetic data is magic, but because targeted curation beats raw scale.

What's Actually at Stake

Labs building frontier models are already committed here. Human annotation can't scale to match their ambitions. RLAIF, self-play, and synthetic generation are becoming default infrastructure.

The risk is treating model collapse as solved before guardrails are proven at web scale. AlphaProof works because math has formal proof checkers. Most tasks we care about don't have anything equivalent. Without a reliable verifier, self-play is just a model high-fiving itself in a hall of mirrors.

Phi-4's benchmarks are real. AlphaProof's IMO score is real. SPPO beating GPT-4-Turbo with zero human labels is real. But every success relied on careful curation, formal verification, or preserved access to original human data. Strip those guardrails away and you get recursive collapse. The models can train themselves. Making sure they don't train themselves into a corner is the actual hard problem.

Sources

Research Papers:

Industry Sources:

Related Swarm Signal Coverage: