In 2023, the most capable open-weight model was a 70-billion-parameter dense transformer. By early 2026, it's a 671-billion-parameter Mixture of Experts that activates just 37 billion parameters per token. That shift tells you everything about where large language model architecture is heading: not toward bigger monoliths, but toward smarter routing.
Mixture of Experts (MoE) isn't new. The core idea dates back to 1991. But the last two years have turned it from a research curiosity into the default architecture for frontier models. DeepSeek-V3, Qwen3, Mixtral, Llama 4, Grok-1, and (almost certainly) GPT-4 all use some variant of MoE. Understanding how it works isn't optional anymore. It's table stakes for anyone following AI development.
The Core Idea: Divide and Specialize
A standard dense transformer runs every input token through every parameter in the network. A 70-billion-parameter model uses all 70 billion parameters for every single token, whether the input is a calculus problem or a grocery list.
MoE takes a different approach. Instead of one massive feed-forward network (FFN) at each layer, it uses multiple smaller sub-networks called "experts." A routing mechanism (sometimes called a gating network) decides which experts process each token. Only a fraction of the total parameters activate for any given input.
The result: you can scale total parameter count into the hundreds of billions (or trillions) while keeping per-token compute costs comparable to a much smaller dense model. More capacity, roughly the same inference cost.
The concept first appeared in Robert Jacobs, Michael Jordan, Steven Nowlan, and Geoffrey Hinton's 1991 paper "Adaptive Mixtures of Local Experts," which proposed a system where specialized sub-networks each handle a subset of training cases. A gating network learns which expert should process which input. The fundamental architecture hasn't changed. The scale has.
How Routing Works (And Why It's the Hard Part)
The router is the most critical component. It takes each token's hidden representation and produces a probability distribution over all available experts. The token gets sent to the top-scoring expert(s), and their outputs are combined (usually weighted by the router's scores).
This sounds simple. It isn't.
Top-K Routing
Most production MoE models use top-k routing, where each token selects its k highest-scored experts. Mixtral 8x7B uses top-2 routing: every token goes to 2 out of 8 experts. DeepSeek-V3 uses top-8 out of 256 routed experts, plus one shared expert that processes every token.
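For readers who want to see the mechanics, here is a minimal top-k MoE layer sketch in PyTorch. The class name, dimensions, and the naive dispatch loop are illustrative assumptions, not any production model's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Minimal sketch of a top-k routed MoE feed-forward layer (illustrative only)."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                   # x: (n_tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)          # (n_tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)          # keep the k best experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize kept scores
        out = torch.zeros_like(x)
        for slot in range(self.k):                          # naive loop; real kernels batch by expert
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out
```

Mixtral's setup corresponds to n_experts=8, k=2; DeepSeek-V3 uses 256 routed experts with k=8 plus a shared expert that bypasses the router. Production implementations group tokens by expert rather than looping as above.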
Google's 2021 Switch Transformer simplified this further with top-1 routing (each token goes to exactly one expert), which reduced communication overhead but required careful load balancing to avoid expert collapse.
Expert Choice Routing
In 2022, Google Research flipped the paradigm. Instead of tokens choosing experts, experts choose tokens. Each expert has a fixed buffer capacity and selects its top-k preferred tokens from the batch. This guarantees perfect load balance by construction, and Google reported 2x faster training convergence in their 8B/64-expert model compared to standard top-1 and top-2 approaches.
The tradeoff: some tokens might get processed by zero experts (dropped) or many experts (over-processed), which creates unpredictable quality variance at inference time.
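A sketch of the inverted selection, with toy sizes and a hypothetical helper function rather than Google's implementation:

```python
import torch
import torch.nn.functional as F

def expert_choice_route(scores, capacity):
    """Sketch of expert-choice routing: each expert picks its top-`capacity` tokens.

    scores: (n_tokens, n_experts) router affinities. Illustrative only; real
    implementations also build dispatch/combine tensors for the experts.
    """
    probs = F.softmax(scores, dim=-1)            # token-to-expert affinities
    picks = []
    for e in range(probs.shape[1]):
        w, idx = probs[:, e].topk(capacity)      # expert e chooses its best tokens
        picks.append((idx, w))
    return picks

# Load balance is perfect by construction, but a token no expert selects is
# simply dropped for that layer.
scores = torch.randn(16, 4)                      # 16 tokens, 4 experts (toy sizes)
routed = expert_choice_route(scores, capacity=8) # each expert takes exactly 8 tokens
```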
Hash Routing
The simplest approach assigns tokens to experts deterministically using a hash function. It maintains perfect load balance and adds zero learnable parameters to the router. But it also ignores token content entirely, so experts end up learning overlapping representations. In practice, hash routing consistently underperforms learned routing methods.
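The entire router fits in a couple of lines; the constant below is an arbitrary multiplicative hash, chosen only for illustration:

```python
# Hash routing sketch: the expert is a fixed function of the token id, with no
# learned parameters and no dependence on token content.
def hash_route(token_id: int, n_experts: int = 8) -> int:
    return (token_id * 2654435761) % n_experts
```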
The Load Balancing Problem
Left unconstrained, routers tend to collapse. A few experts get selected repeatedly, others receive almost no tokens, and the model effectively becomes a dense network with wasted parameters. This is called routing collapse, and it's the single most common failure mode in MoE training.
The standard fix is an auxiliary loss that penalizes uneven expert utilization during training. But this creates its own problem: too large an auxiliary loss interferes with the primary training objective and degrades model quality; too small, and collapse happens anyway.
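A sketch of what this auxiliary loss typically looks like, following the Switch Transformer formulation rather than any specific model's code (function name and coefficient are illustrative):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_idx, n_experts, alpha=0.01):
    """Auxiliary load-balancing loss sketch.

    router_logits: (n_tokens, n_experts) raw router outputs
    expert_idx:    (n_tokens,) expert each token was dispatched to (top-1)
    alpha:         coefficient; too large hurts the main objective,
                   too small fails to prevent collapse.
    """
    probs = F.softmax(router_logits, dim=-1)
    # f_i: fraction of tokens actually routed to expert i
    f = torch.bincount(expert_idx, minlength=n_experts).float() / expert_idx.numel()
    # P_i: mean router probability assigned to expert i
    p = probs.mean(dim=0)
    return alpha * n_experts * torch.sum(f * p)
```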
DeepSeek-V3 introduced an auxiliary-loss-free approach that adds a dynamic bias term to expert affinity scores. At each training step, the system monitors expert load and adjusts the bias upward for underloaded experts and downward for overloaded ones. No gradient interference with the main loss. This innovation is one reason DeepSeek-V3 achieved frontier performance at a reported training cost of approximately $5.6 million, a fraction of comparable models.
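The bias update itself is simple. The sketch below captures the idea as described in the DeepSeek-V3 report, with a hypothetical function name and an illustrative update speed rather than a published hyperparameter:

```python
import torch

def update_routing_bias(bias, tokens_per_expert, gamma=0.001):
    """Auxiliary-loss-free balancing sketch: nudge a per-expert bias that is
    added to affinity scores for routing only (not for the output weights)."""
    mean_load = tokens_per_expert.float().mean()
    overloaded = tokens_per_expert.float() > mean_load
    # Push overloaded experts down and underloaded experts up. No gradient
    # flows through this update, so it cannot interfere with the main loss.
    return bias - gamma * overloaded.float() + gamma * (~overloaded).float()
```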
The Model Comparison: MoE in 2026
The table below covers the major open-weight MoE models released from late 2023 onward. The pattern is clear: total parameters keep climbing, but active parameters stay constrained.
| Model | Total Params | Active Params | Experts | Active/Token | Released |
|---|---|---|---|---|---|
| Mixtral 8x7B | 47B | 13B | 8 | 2 | Dec 2023 |
| Grok-1 | 314B | 86B | 8 | 2 | Mar 2024 |
| DBRX | 132B | 36B | 16 | 4 | Mar 2024 |
| Mixtral 8x22B | 141B | 39B | 8 | 2 | Apr 2024 |
| Jamba 1.5 | 398B | 94B | MoE+Mamba hybrid | Variable | Sep 2024 |
| DeepSeek-V3 | 671B | 37B | 256 + 1 shared | 8 + shared | Dec 2024 |
| Llama 4 Maverick | 400B | 17B | 128 + 1 shared | 1 + shared | Apr 2025 |
| Qwen3-235B | 235B | 22B | 128 | 8 | May 2025 |
GPT-4, though never officially confirmed by OpenAI, was reported by multiple sources (including George Hotz and PyTorch co-founder Soumith Chintala) to use a mixture of 8 or 16 experts with approximately 1.76 trillion total parameters. If accurate, that makes even OpenAI's flagship a MoE model.
Several trends stand out. First, the shift toward fine-grained experts: DeepSeek-V3 uses 256 small experts rather than 8 large ones, which gives it roughly thirteen orders of magnitude more possible expert combinations (256-choose-8 versus 8-choose-2). Second, the shared expert pattern, where one expert processes every token to maintain baseline quality while routed experts specialize. Both DeepSeek and Llama 4 adopted this. Third, active parameter counts are dropping relative to total parameters. Llama 4 Maverick activates just 4.25% of its total parameters per token.
Why MoE Wins on Economics
The case for MoE comes down to simple math. Training and inference costs scale primarily with active parameters (the FLOPs per token), not total parameters.
DeepSeek-V3 trained on 14.8 trillion tokens using 2.788 million H800 GPU hours, at a cost of roughly $5.6 million. For comparison, Llama 3 405B (a dense model) required 30.8 million GPU hours for training. That's an 11x difference in GPU hours for models that compete on the same benchmarks.
At inference time, MoE models process tokens using only their active parameter subset. DeepSeek-V3's 37 billion active parameters give it roughly the per-token compute cost of a 37B dense model, while its 671B total parameters provide the knowledge capacity of a much larger network. A rough industry rule of thumb: an 8-way sparse MoE has similar short-context decoding economics to a dense model half its total size.
The catch is memory. All parameters (active or not) must be loaded into GPU memory, because the router needs access to every expert to make its selection. A 671B-parameter MoE model needs the same VRAM as a 671B dense model, even though it only uses 37B parameters per forward pass. This is the central tension of MoE deployment, and it's why cost efficiency in inference remains an active research area.
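A back-of-envelope calculation makes the tension concrete. The 2-FLOPs-per-parameter rule and the byte-per-parameter figures below are standard approximations, not measurements of any particular serving stack:

```python
# Rough compute vs. memory split for DeepSeek-V3-scale deployment.
TOTAL_PARAMS  = 671e9   # total parameters (all must sit in GPU memory)
ACTIVE_PARAMS = 37e9    # parameters activated per token

flops_per_token = 2 * ACTIVE_PARAMS            # ~7.4e10 FLOPs, like a 37B dense model
weights_fp8_gb  = TOTAL_PARAMS * 1 / 1e9       # ~671 GB at 1 byte/param (FP8)
weights_bf16_gb = TOTAL_PARAMS * 2 / 1e9       # ~1342 GB at 2 bytes/param (BF16)

print(f"{flops_per_token:.2e} FLOPs/token, {weights_fp8_gb:.0f} GB (FP8) "
      f"or {weights_bf16_gb:.0f} GB (BF16) just for weights")
```

The per-token compute is what a single mid-range accelerator can handle; the weight footprint is what forces multi-GPU clusters.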
The Engineering Challenges Nobody Talks About
MoE models are harder to serve than dense models of equivalent quality. The architecture introduces several deployment headaches that don't show up in benchmark scores.
Communication Overhead
When experts are distributed across multiple GPUs (expert parallelism), tokens must be routed between devices. NVIDIA's own measurements show that at NVLink's 350 GB/s bandwidth, communication consumes 77% of total MoE layer processing time during the decode phase. As the number of active experts increases and individual expert sizes shrink, the ratio of communication to useful compute gets worse.
DeepSeek addressed this through a co-design of algorithms, frameworks, and hardware that achieves near-complete computation-communication overlap during training. But this required custom engineering that isn't available off the shelf.
Expert Parallelism vs. Load Balance
Standard model parallelism splits a model's layers across GPUs. Expert parallelism places different experts on different GPUs. This creates a new problem: if token routing is uneven (and it always is in practice), some GPUs sit idle while others are overloaded. Inter-GPU communication and the workload imbalance caused by real-world routing patterns remain only partially solved, even in 2026.
The MoETuner system (2025) demonstrated that balanced expert placement combined with intelligent token routing can reduce inference latency, but achieving this requires runtime monitoring and dynamic rebalancing that adds operational complexity.
Memory Fragmentation
Each expert is a separate FFN with its own weight matrices. With 256 experts (DeepSeek-V3), that's 256 separate parameter blocks that must be managed in GPU memory. This fragments memory allocation and reduces the effectiveness of standard memory optimization techniques designed for dense, contiguous parameter tensors.
Fine-Grained vs. Coarse-Grained: The Design Spectrum
Early MoE models used a few large experts. Mixtral 8x7B has 8 experts, each roughly 7 billion parameters. This coarse-grained approach is simple to implement and reason about, but limits the router's ability to specialize.
The industry trend has shifted firmly toward fine-grained architectures. DBRX (2024) increased to 16 experts selecting 4, which Databricks noted provides 65x more possible expert combinations than an 8-choose-2 design. DeepSeek-V3 pushed this further with 256 experts selecting 8, and both Qwen3 and Llama 4 settled on 128 experts.
More experts means more possible combinations, which theoretically allows finer specialization. But it also means each individual expert has fewer parameters and less capacity. The benchmark results for these models suggest the tradeoff favors fine-grained designs, at least at current scales.
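The combinatorics behind these figures are easy to check; this counts routed experts only and says nothing about quality by itself:

```python
from math import comb

coarse = comb(8, 2)            # Mixtral-style: 8 experts, choose 2  -> 28
dbrx   = comb(16, 4)           # DBRX: 16 choose 4                   -> 1,820
fine   = comb(256, 8)          # DeepSeek-V3: 256 choose 8           -> ~4.1e14

print(dbrx // coarse)          # 65x more combinations than 8-choose-2
print(f"{fine / coarse:.1e}")  # ~1.5e13, i.e. over ten trillion times more
```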
An emerging variation is the hybrid approach. Jamba (AI21 Labs, 2024) interleaves Transformer attention layers with Mamba state-space layers, and applies MoE only to certain layers. The 398B Jamba 1.5 model activates 94B parameters and claims 2.5x faster inference on long contexts compared to similar-sized pure transformer MoE models. This points toward a future where MoE isn't just about expert routing within transformers, but about mixing entire architectural paradigms.
What the Research Frontier Looks Like
The 2025-2026 literature reveals several active research directions that will shape the next generation of MoE models.
Routing stability under reinforcement learning. When MoE models undergo RLHF or GRPO post-training, routing distributions can drift significantly. This increases the variance of importance sampling weights and can destabilize optimization, particularly with multiple active experts per token. DeepSeek-V3 used its auxiliary-loss-free balancing to mitigate this during its reinforcement learning stage, but the problem remains open for general-purpose solutions.
ReLU routing. A 2025 ICLR paper proposed replacing the standard top-k + softmax routing with ReLU activation and adaptive L1 regularization (called ReMoE). This makes the routing function fully differentiable, which eliminates the discrete top-k selection that prevents gradient flow through the router. ReMoE reportedly outperforms traditional top-k routing across multiple model sizes and expert counts.
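A minimal sketch of a ReLU gate shows the shape of the idea; the dimensions and penalty coefficient below are illustrative, and ReMoE's actual adaptive regularization schedule is more involved:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReLURouter(nn.Module):
    """Sketch of a ReLU gate: sparsity comes from the activation plus an L1
    penalty rather than a discrete top-k, so the whole gate is differentiable."""
    def __init__(self, d_model=512, n_experts=64, l1_coef=1e-3):
        super().__init__()
        self.proj = nn.Linear(d_model, n_experts, bias=False)
        self.l1_coef = l1_coef

    def forward(self, x):
        gates = F.relu(self.proj(x))                    # zero gates skip the expert entirely
        l1_penalty = self.l1_coef * gates.abs().mean()  # regularizer that encourages sparsity
        return gates, l1_penalty
```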
Optimal expert count. A December 2025 paper asked "How Many Experts Are Enough?" and found that optimal expert count depends on the relationship between model capacity and data diversity. More isn't always better. Beyond a certain point, additional experts fragment knowledge without improving specialization.
Distributed inference optimization. GRACE-MoE (2025) introduced grouping and replication with locality-aware routing to reduce communication overhead in distributed MoE inference. MegaScale-MoE (2025) tackled large-scale communication efficiency for production deployments. Both represent the kind of systems-level engineering required to make MoE practical beyond research settings.
These aren't incremental improvements. They address the fundamental bottlenecks (routing instability, deployment cost, scaling laws) that will determine whether MoE remains the dominant architecture or gets superseded by something else entirely.
Where MoE Falls Short
MoE isn't a free lunch. The architecture carries genuine limitations that its proponents tend to understate.
Expert collapse remains unsolved in the general case. Despite DeepSeek's auxiliary-loss-free approach, most MoE training runs still require careful hyperparameter tuning to prevent routing collapse. The router can, as Cerebras researchers put it, "single-handedly destroy an MoE model." Even with implementations that appear sound, mysterious training instabilities still occur.
Memory requirements are punishing. A 671B MoE model needs the same memory as a 671B dense model, even though it performs the FLOPs of a 37B model. For organizations that can't afford multi-GPU clusters, this erases the efficiency advantage. The cost of frontier inference is falling, but MoE's memory footprint keeps it out of reach for many deployment scenarios.
Interpretability is harder. Dense models are already opaque. MoE adds another layer of opacity: which experts activated, why, and whether different routing paths produce meaningfully different outputs. Early research into MoE interpretability suggests that experts don't cleanly specialize by topic or task the way the architecture's metaphor implies. The routing is messier than the diagrams suggest.
Throughput advantage shrinks at long context. MoE's efficiency advantage is most pronounced for short-context, high-batch inference. As context lengths grow (and inference-time compute scaling becomes more common), the memory bandwidth bottleneck of loading all expert parameters dominates, and the gap between MoE and dense models narrows.
The Default Architecture, For Now
Every major frontier model released in the last 18 months uses some form of MoE. The reasons are straightforward: it's the most efficient way to scale model capacity without proportionally scaling inference cost. The economics are too compelling to ignore.
But MoE in 2026 looks very different from the concept described in 1991, or even from the Switch Transformer of 2021. Fine-grained routing with 128-256 experts has replaced coarse 8-way splits. Shared experts provide stability. Auxiliary-loss-free balancing has started to solve the collapse problem. Hybrid architectures are mixing MoE with state-space models and other non-transformer components.
The architecture still has real problems: memory overhead, communication bottlenecks, routing instability, and interpretability gaps. And the research community hasn't converged on answers for any of them. Whether MoE stays dominant depends less on the quality of the routing algorithms and more on whether the engineering community can solve the systems-level deployment challenges at scale.
For now, if you're building on, evaluating, or competing with frontier models, you're working with MoE whether you know it or not. The question isn't whether to understand this architecture. It's whether you can afford not to.
Sources
Research Papers:
- Adaptive Mixtures of Local Experts — Jacobs, Jordan, Nowlan, Hinton (1991)
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer — Shazeer et al. (2017)
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding — Lepikhin et al. (2020)
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity — Fedus, Zoph, Shazeer (2022)
- Mixture-of-Experts with Expert Choice Routing — Zhou et al. (2022)
- Mixtral of Experts — Jiang et al. (2024)
- DeepSeek-V3 Technical Report — DeepSeek-AI (2024)
- Qwen3 Technical Report — Qwen Team (2025)
- How Many Experts Are Enough? Towards Optimal Semantic Specialization for MoE (2025)
- A Comprehensive Survey of Mixture-of-Experts (2025)
- Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts (2024)
Industry / Case Studies:
- Mixture of Experts Explained — Hugging Face
- Introducing DBRX: A New State-of-the-Art Open LLM — Databricks
- Router Wars: Which MoE Routing Strategy Actually Works — Cerebras
- MoE at Scale: Making Sparse Models Fast on Real Hardware — Cerebras
- MoE vs Dense Models: How Do They Compare in Inference? — Epoch AI
- Scaling Large MoE Models with Wide Expert Parallelism on NVL72 — NVIDIA
- The Llama 4 Herd: The Beginning of a New Era of Natively Multimodal AI Innovation — Meta AI