On January 27, 2025, Nvidia lost $589 billion in market cap in a single day. That's the largest single-day loss in U.S. stock market history. The cause wasn't an earnings miss, a product recall, or a fraud scandal. It was a PDF. A technical report from a Chinese AI lab called DeepSeek claimed it had trained a frontier-class reasoning model for $5.576 million. The entire GPU scarcity thesis that had driven Nvidia's trillion-dollar valuation suddenly looked fragile.
The stock recovered within weeks. The implications didn't.
From Quant Trading to AI Research
DeepSeek's origin story doesn't start in a university lab or a Silicon Valley garage. It starts in quantitative finance. Liang Wenfeng, born in 1985 in Guangdong province, founded High-Flyer Capital in 2015. By its peak, the firm managed $14 billion in assets, making it one of China's largest quant funds. To run its trading strategies, High-Flyer built a cluster of over 10,000 Nvidia A100 GPUs.
In May 2023, Liang made an unusual move. He spun off a separate entity called DeepSeek, dedicated entirely to AI research. Not AI products. Not chatbots for consumers. Research. "We're not here to make money from AI," Liang told interviewers. "We're here to understand intelligence."
That framing matters. DeepSeek operates more like a national research lab than a startup chasing revenue. It publishes its weights openly, releases detailed technical reports, and doesn't sell API access as its primary business. The quant fund bankrolls the whole operation. This structure freed DeepSeek from the pressure to ship products quickly, and it shows in the work.

The Model Progression
DeepSeek's technical trajectory over 18 months is where things get interesting.
DeepSeek-V2, released in May 2024, introduced Multi-head Latent Attention, a technique that compresses the key-value cache used during inference by 93.3%. Together with its sparse Mixture-of-Experts design, that cut training costs by 42.5% and lifted maximum generation throughput 5.76x relative to DeepSeek's previous 67B dense model. It was a signal that this team wasn't just training bigger models; it was rethinking the architecture.
DeepSeek-V3 arrived in December 2024 with 671 billion total parameters, but here's the trick: only 37 billion are active for any given token. That's Mixture of Experts at work. The model routes each input to a small subset of specialized sub-networks, so you get the knowledge capacity of a 671B model at a fraction of the compute cost per query. V3 trained on 14.8 trillion tokens using FP8 mixed-precision training across 2,048 H800 GPUs. DeepSeek claimed the final training run cost $5.576 million.
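To make the sparsity concrete, here is a minimal top-k routing layer in PyTorch. It's an illustrative toy with made-up dimensions, not DeepSeek's implementation (V3 routes each token to 8 of 256 fine-grained experts plus a shared expert); the point is simply that only the selected experts' weights do any work for a given token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Minimal top-k Mixture-of-Experts layer (illustrative sketch, not DeepSeek's code)."""

    def __init__(self, d_model=64, d_ff=128, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                  # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)         # affinity of each token to each expert
        weights, idx = scores.topk(self.top_k, dim=-1)     # keep only the top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                   # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

x = torch.randn(4, 64)          # 4 tokens
print(ToyMoELayer()(x).shape)   # torch.Size([4, 64]); only 2 of 8 experts run per token
```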
Then came R1 in January 2025. This is the model that broke the market. DeepSeek-R1 is a reasoning model, built to compete directly with OpenAI's o1. Instead of the standard PPO-based RLHF pipeline, which trains a separate value model (critic) to estimate how good each response is, R1 uses Group Relative Policy Optimization. GRPO samples a group of candidate responses per prompt and scores each one against the group's average reward, eliminating the critic entirely. The result: R1 hit 79.8% on AIME 2024 versus o1's 79.2%, scored 97.3% on MATH-500, and reached a 2029 rating on Codeforces, roughly the 96th percentile of human competitors and effectively at parity with o1.
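The heart of GRPO is easy to state in code: sample a group of responses per prompt and use the group's own reward statistics as the baseline, rather than a learned critic. The sketch below assumes simple binary correctness rewards and uses hypothetical names of my own; it shows only the advantage computation, not the full clipped policy-gradient update.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: each sampled response is scored against its group.

    rewards: (groups, group_size), e.g. 1.0 if the final answer is correct, else 0.0.
    No learned value model (critic) is needed -- the group mean acts as the baseline.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

# One prompt, 8 sampled responses: three got the right answer, five did not.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))  # correct responses get positive advantage, incorrect ones negative

# In training, each response's token log-probs are then scaled by its advantage
# (with PPO-style ratio clipping and a KL penalty toward a reference policy).
```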
Perhaps the most fascinating variant is R1-Zero, trained with pure reinforcement learning and no supervised fine-tuning at all. Chain-of-thought reasoning emerged on its own. The model taught itself to think step by step without being shown examples of step-by-step thinking. That result alone has implications for how we understand the relationship between training methodology and emergent capability.
The Cost Controversy
Let's be honest about the $5.6 million number, because it's been used as both a rallying cry and a misleading headline.
DeepSeek's claimed $5.576 million covers the final training run of V3: 2.788 million H800 GPU-hours at an assumed rental rate of $2 per GPU-hour, or 2,048 GPUs running for roughly two months. It doesn't include the cost of building the GPU cluster, the failed experiments that preceded the successful run, the pre-training data curation, or the iterative research that produced the architectural innovations. SemiAnalysis estimates the true all-in cost at $1.3 to $1.6 billion once you account for the full R&D pipeline and infrastructure.
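For the record, the headline figure is straightforward arithmetic over the numbers in the V3 technical report; the $2-per-GPU-hour rental rate is the report's own assumption, not a market quote.

```python
# Back-of-envelope check on the headline number, using the GPU-hour
# breakdown reported in the DeepSeek-V3 technical report.
gpu_hours = {
    "pre-training": 2_664_000,
    "context extension": 119_000,
    "post-training": 5_000,
}
rate_per_gpu_hour = 2.00          # assumed H800 rental rate used in the report (USD)
n_gpus = 2_048

total_hours = sum(gpu_hours.values())          # 2,788,000 GPU-hours
cost = total_hours * rate_per_gpu_hour         # $5,576,000
wall_clock_days = total_hours / n_gpus / 24    # ~57 days, i.e. roughly two months

print(f"{total_hours:,} GPU-hours -> ${cost:,.0f} over ~{wall_clock_days:.0f} days")
```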
That context matters. DeepSeek isn't training frontier models in a garage for pocket change. It's a well-funded operation with access to serious hardware, including the roughly 10,000 A100s High-Flyer acquired before U.S. export controls tightened and, by SemiAnalysis's estimate, a fleet of some 50,000 Hopper-generation GPUs.
But even with that caveat, the narrow figure is still remarkable. GPT-4's training compute alone is estimated at over $100 million. DeepSeek achieved comparable benchmark performance on a final training run that cost a fraction of that. The real story isn't "AI is cheap now." It's that the relationship between dollars spent and model quality isn't linear. Architectural innovation can substitute for brute-force compute, and DeepSeek proved it with receipts.
The Technical Playbook
What separates DeepSeek from labs that simply scale up existing architectures is the density of novel techniques packed into each release.
Multi-head Latent Attention compresses the key-value pairs that models store during generation. Standard transformer attention requires caching keys and values for every attention head at every layer. MLA projects these into a low-dimensional latent space, slashing memory usage by 93.3% during inference. This isn't a minor optimization. It changes the economics of serving the model to millions of users.
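A toy version of the idea, with arbitrary dimensions and the rotary-embedding handling omitted: instead of caching full per-head keys and values, cache one small latent per token and reconstruct K and V from it with up-projections. This is a sketch of the compression principle, not DeepSeek's actual attention code.

```python
import torch
import torch.nn as nn

class ToyLatentKV(nn.Module):
    """Sketch of MLA-style KV compression: cache one small latent per token
    instead of full per-head keys and values (RoPE handling omitted)."""

    def __init__(self, d_model=4096, n_heads=32, d_head=128, d_latent=512):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)           # compress token -> latent
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct values
        self.n_heads, self.d_head = n_heads, d_head

    def forward(self, h):                       # h: (seq, d_model)
        c_kv = self.down(h)                     # (seq, d_latent)  <-- this is all we cache
        k = self.up_k(c_kv).view(-1, self.n_heads, self.d_head)
        v = self.up_v(c_kv).view(-1, self.n_heads, self.d_head)
        return c_kv, k, v

m = ToyLatentKV()
c_kv, k, v = m(torch.randn(1024, 4096))
full_cache = k.numel() + v.numel()              # standard attention: cache K and V per head
latent_cache = c_kv.numel()                     # MLA: cache only the latent
print(f"cache size: {latent_cache / full_cache:.1%} of standard KV cache")
# ~6% here -- the same order of magnitude as the 93.3% reduction DeepSeek reports
```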
FP8 training was another bet that paid off. Training in 8-bit floating point instead of the standard 16-bit roughly halves memory requirements and speeds up matrix multiplication on modern GPUs. Most labs considered FP8 too unstable for large-scale training. DeepSeek made it work across 14.8 trillion tokens without quality degradation, using fine-grained scaling to keep outlier values from wrecking precision. V3 was the first model of its scale trained with FP8 mixed precision.
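A small experiment conveys why that fine-grained scaling matters. The sketch below quantizes a weight matrix to FP8 (E4M3) with one scale factor per 128x128 tile and measures the round-trip error. It assumes a PyTorch build with the float8_e4m3fn dtype (2.1 or later) and is a demonstration of the numerics, not DeepSeek's training kernels.

```python
import torch

def quantize_fp8_blockwise(w: torch.Tensor, block: int = 128) -> torch.Tensor:
    """Round-trip a 2-D weight through FP8 (E4M3) with one scale per (block x block) tile.

    Per-tile scaling keeps outliers in one tile from crushing the precision of the rest --
    the kind of fine-grained scaling DeepSeek-V3 describes for its FP8 training.
    """
    out = torch.empty_like(w)
    fp8_max = 448.0  # largest finite value representable in E4M3
    for i in range(0, w.shape[0], block):
        for j in range(0, w.shape[1], block):
            tile = w[i:i+block, j:j+block]
            scale = tile.abs().max().clamp(min=1e-12) / fp8_max
            q = (tile / scale).to(torch.float8_e4m3fn)                # cast to 8-bit floats
            out[i:i+block, j:j+block] = q.to(torch.float32) * scale   # dequantize for comparison
    return out

w = torch.randn(1024, 1024)
w_fp8 = quantize_fp8_blockwise(w)
rel_err = (w - w_fp8).abs().mean() / w.abs().mean()
print(f"mean relative error after FP8 round-trip: {rel_err:.3%}")  # typically a few percent
```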
The auxiliary-loss-free load balancing for MoE routing solved a persistent problem: Mixture of Experts models tend to develop "dead experts" where certain sub-networks never get activated. DeepSeek's approach keeps utilization balanced without the auxiliary loss terms that other implementations rely on, which can interfere with the primary training objective.
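A sketch of that bias-based balancing, with a toy router and step size standing in for the real thing: a per-expert bias is added to the affinity scores only when choosing the top-k experts, then nudged down for overloaded experts and up for underused ones after each batch. The output weights still come from the unbiased affinities.

```python
import torch

def select_experts(affinity, bias, top_k=2):
    """Pick experts using affinity + bias, but weight outputs by affinity alone."""
    _, idx = (affinity + bias).topk(top_k, dim=-1)       # bias only influences *selection*
    gate = torch.gather(affinity, -1, idx)
    return idx, gate / gate.sum(dim=-1, keepdim=True)

def update_bias(bias, idx, n_experts, gamma=0.001):
    """Nudge each expert's bias: down if it took more than its fair share of tokens, up if fewer."""
    load = torch.bincount(idx.flatten(), minlength=n_experts).float()
    target = idx.numel() / n_experts
    return bias - gamma * torch.sign(load - target)

n_experts, bias = 8, torch.zeros(8)
for step in range(100):
    affinity = torch.rand(256, n_experts).softmax(dim=-1)  # stand-in for real router scores
    idx, gate = select_experts(affinity, bias)
    bias = update_bias(bias, idx, n_experts)
# No auxiliary loss term ever touches the main training objective;
# balancing happens entirely through the selection bias.
```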
Multi-Token Prediction, where the model predicts several future tokens simultaneously during training, improved both training efficiency and downstream performance. It's a technique that's been discussed in research papers for years. DeepSeek actually shipped it at scale.
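A minimal version of that objective, with plain linear heads standing in for the small sequential modules DeepSeek actually uses (and toy dimensions throughout): each extra head asks position t to predict the token at t + depth, and its loss is averaged into the ordinary next-token loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMTPHeads(nn.Module):
    """Toy multi-token prediction: extra heads predict tokens 2, 3, ... steps ahead.

    DeepSeek-V3 uses small sequential transformer modules for the extra depth
    (one extra token in practice); plain linear heads keep this sketch short.
    """

    def __init__(self, d_model=256, vocab=1000, extra_depth=2):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(1 + extra_depth))

    def loss(self, hidden, tokens):
        # hidden: (seq, d_model) hidden states; tokens: (seq,) token ids
        total = 0.0
        for depth, head in enumerate(self.heads, start=1):
            logits = head(hidden[:-depth])           # position t predicts token t + depth
            total = total + F.cross_entropy(logits, tokens[depth:])
        return total / len(self.heads)

model = ToyMTPHeads()
hidden, tokens = torch.randn(128, 256), torch.randint(0, 1000, (128,))
print(model.loss(hidden, tokens))  # denser training signal per sequence than next-token alone
```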

The Things Nobody Wants to Talk About
DeepSeek's technical achievements are real. So are the complications.
OpenAI has alleged that DeepSeek's training data includes outputs distilled from OpenAI's models. If true, that means R1's performance partially bootstraps off proprietary work from a competitor. The open weights debate gets messier when the training data pipeline is opaque.
DeepSeek-R1 refuses to discuss Tiananmen Square, Taiwanese independence, or criticism of Xi Jinping. This isn't a bug. Chinese data laws require cooperation with government data requests, and content moderation aligned with state policy is a condition of operating in China. For researchers outside China who want to use the open weights, this raises questions about what other biases are baked into the training that aren't as immediately visible.
The "open weights" label itself deserves scrutiny. DeepSeek releases model weights, which is more than most Western labs offer. But it doesn't release training data, full training recipes, or the intermediate checkpoints that would let others truly reproduce the work. Open weights are a step toward transparency. They aren't transparency.
What It Means for the Industry
DeepSeek hit 96.88 million monthly active users globally by January 2026, according to SimilarWeb data. V4 is slated for mid-February 2026. The Chinese government poured $137 billion into AI infrastructure in 2025, and Beijing's broader AI ambitions extend well beyond a single lab.
The real impact is the efficiency thesis. Before DeepSeek, the dominant narrative in AI was simple: whoever buys the most GPUs wins. The frontier model race was framed as a spending competition. DeepSeek demonstrated that a team with strong research fundamentals could match the output of labs spending 10 to 100x more on compute. That doesn't mean compute doesn't matter. It means compute alone isn't a moat.
This realization is already spreading. Qwen's team at Alibaba has adopted similar efficiency-first approaches. Yi and MiniMax are following suit. The "AI Sputnik moment" framing that dominated U.S. policy discussions in early 2025 was overblown in some ways, but it captured something real: the assumption that export controls on chips would kneecap Chinese AI capability was wrong.
Benchmarks tell part of the story but not all of it. DeepSeek-V3 scores 88.5% on MMLU versus GPT-4o's 87.2%. R1 edges out o1 on math and coding tasks. These numbers suggest parity. Whether that parity holds across real-world deployment at scale, across languages, across the long tail of use cases that benchmarks tend to miss, remains an open question.
What isn't an open question is the economics. DeepSeek proved that architectural cleverness can compress the cost curve for training frontier models. Every AI lab on the planet is now working under that assumption. The spending war isn't over, but the terms of engagement have permanently changed.
Sources
Research Papers:
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model — DeepSeek-AI (2024)
- DeepSeek-V3 Technical Report — DeepSeek-AI (2024)
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — DeepSeek-AI (2025)
Industry / Case Studies:
- DeepSeek Debates: Chinese Leadership On Cost, True Training Cost, Closed Model Margin Impacts — SemiAnalysis (2025)
- chat.deepseek.com Traffic Analytics — SimilarWeb (2026)
- Nvidia Loses $589 Billion as DeepSeek Batters Stock — Bloomberg (2025)
Commentary:
- DeepSeek-R1, DeepSeek Implications — Ben Thompson, Stratechery (2025)
- DeepSeek Debrief: >128 Days Later — Dylan Patel, SemiAnalysis (2025)
- Machine Learning Trends — Epoch AI (2025)