Mixture of Experts: Architecture, Models, and Trade-offs

How Mixture of Experts architectures achieve massive scale without proportional compute costs, and why the biggest models in 2026 are almost all MoE.

What Is Mixture of Experts?

Mixture of Experts (MoE) is a neural network architecture that routes each input to a subset of specialized sub-networks (experts) rather than processing it through every parameter. A gating network decides which experts activate for a given token, meaning only a fraction of the model's total parameters are used per forward pass.
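
To make the routing concrete, here is a minimal, illustrative sketch of a sparse MoE layer in PyTorch. The class and parameter names (SimpleMoE, num_experts, top_k) are assumptions for this example, not taken from any particular model's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    """Toy sparse MoE layer: a linear gate picks top_k experts per token."""

    def __init__(self, d_model, d_hidden, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # One small feed-forward network per expert.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        # Gating network: scores every expert for every token.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x):                          # x: (num_tokens, d_model)
        scores = self.gate(x)                      # (num_tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # normalize over the chosen experts
        out = torch.zeros_like(x)
        # Each token passes through only its top_k experts; the rest are skipped.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```

The loop over experts is written for readability; production systems batch tokens per expert and shard experts across devices instead.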

This matters because it breaks the traditional scaling law that ties compute cost to parameter count. A 1.8-trillion-parameter MoE model might only activate 200 billion parameters per token, delivering performance competitive with dense models at a fraction of the inference cost. Mixtral, DeepSeek-V3, and (reportedly) GPT-4 all use MoE architectures.

The trade-off is complexity. MoE models require more memory (all experts must be loaded even if only some activate), face load-balancing challenges during training, and can exhibit unstable routing behavior. But the efficiency gains are compelling enough that MoE has become the dominant architecture for frontier models.

Key Concepts

  • Sparse routing means each token only activates a subset of experts (typically 2 out of 8 or 16), keeping per-token compute constant even as total parameters grow.
  • Load balancing losses are auxiliary training objectives that prevent the gating network from sending all tokens to the same few experts, which would waste capacity (a minimal example appears after this list).
  • Expert parallelism distributes different experts across different GPUs, enabling training and serving of models too large to fit on a single device.
  • Top-k routing is the most common gating strategy, where the router selects the k highest-scoring experts for each token and combines their outputs with learned weights.
  • Granular experts (as in DeepSeek-V3) use many small experts instead of few large ones, improving routing flexibility and specialization.
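
As a hedged sketch of how a load-balancing loss can work, the snippet below implements a Switch-Transformer-style auxiliary term: it compares the fraction of tokens each expert actually receives with the average routing probability it was assigned, and is minimized when both are uniform. The function name and the top-1 dispatch are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """router_logits: (num_tokens, num_experts) raw gate scores for one batch."""
    num_tokens, num_experts = router_logits.shape
    probs = F.softmax(router_logits, dim=-1)            # soft routing probabilities
    assignments = probs.argmax(dim=-1)                  # hard top-1 expert choice
    # f_i: fraction of tokens actually dispatched to expert i.
    fraction = torch.bincount(assignments, minlength=num_experts).float() / num_tokens
    # P_i: mean routing probability given to expert i.
    mean_prob = probs.mean(dim=0)
    # num_experts * sum_i(f_i * P_i) equals 1.0 at perfect balance and grows as routing skews.
    return num_experts * torch.sum(fraction * mean_prob)
```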

Frequently Asked Questions

Why do MoE models need more memory than dense models with similar performance?

Because all experts must be resident in memory even though only a few activate per token. A MoE model with 1.8 trillion total parameters but 200 billion active parameters needs memory for all 1.8 trillion parameters, even though its per-token compute resembles a 200B dense model.
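
A back-of-the-envelope calculation makes the gap concrete. Assuming 2 bytes per parameter (fp16/bf16 weights) and ignoring activations, KV cache, and optimizer state:

```python
BYTES_PER_PARAM = 2          # fp16/bf16 weights (assumption for illustration)

total_params  = 1.8e12       # every expert must stay resident in memory
active_params = 200e9        # parameters actually used per token

print(total_params  * BYTES_PER_PARAM / 1e9)   # ~3600 GB of resident weights
print(active_params * BYTES_PER_PARAM / 1e9)   # ~400 GB for the matching dense model
```

Under these assumptions, the MoE model needs roughly nine times the weight memory of the dense model whose per-token compute it matches.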

Which major models use Mixture of Experts?

As of 2026, models confirmed to use MoE include Mixtral 8x7B and 8x22B, DeepSeek-V3, Grok-1, and several Gemini variants; GPT-4 is also widely reported to be MoE, though OpenAI has not published its architecture. The trend is strongly toward MoE for any model above roughly 100 billion active parameters.

What is the biggest disadvantage of MoE compared to dense models?

Expert collapse — where the router learns to use only a few experts while the rest go dormant. This wastes parameters and can cause sudden quality drops. Modern MoE training uses auxiliary losses and expert-level dropout to mitigate this, but it remains an active research problem.
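
One practical way to watch for collapse is to track expert utilization during training. The sketch below (names are assumptions) computes the fraction of routed tokens each expert receives; with 8 experts, a healthy router keeps every value near 1/8, while a vector dominated by one or two experts signals collapse.

```python
import torch

def expert_utilization(expert_indices: torch.Tensor, num_experts: int) -> torch.Tensor:
    """expert_indices: expert ids chosen by the router for a batch, any shape."""
    counts = torch.bincount(expert_indices.flatten(), minlength=num_experts).float()
    return counts / counts.sum()    # fraction of routed tokens handled by each expert
```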
