For most of the deep learning era, the path to better AI models was straightforward: make them bigger. Double the parameters, train on more data, spend more compute. Each generation of “dense” models — where every parameter is used for every input — followed this playbook. GPT-2 had 1.5 billion parameters. GPT-3 had 175 billion. The performance improved, but so did the cost. The next doubling would be ruinous.
Mixture of Experts (MoE) broke that curve. Instead of activating all parameters for every input, MoE models contain many specialized sub-networks (“experts”) and a routing mechanism that selects only a few for each token. The result: a model with the knowledge of a very large network and the inference cost of a much smaller one.
This is not a new idea — MoE was proposed by Jacobs et al. in 1991. But it has become the dominant architecture at the frontier. GPT-4 is widely reported to be a MoE model. Mixtral proved MoE could work at open-weight scale. DeepSeek V3 pushed the efficiency further than anyone expected. Understanding MoE is now essential to understanding how modern AI works.
The core MoE architecture has three components: a set of expert networks, a router (also called a gating network), and a combining mechanism.
In a transformer-based MoE model, the “experts” replace the standard feed-forward network (FFN) in each transformer block. The attention layers remain shared across all tokens — only the FFN is split into multiple expert copies. Each expert is a full FFN with its own weights, meaning it can specialize in different types of patterns.
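A model with 8 experts, each holding about 7B parameters of FFN weights, would have 8 x 7B = 56B parameters in its expert layers alone, plus the shared attention layers. Model names do not always follow this arithmetic: Mixtral 8x7B totals 46.7B parameters, not 56B, because the “7B” describes a full 7B-class model whose attention layers and embeddings are shared rather than copied per expert. Either way, the total parameter count is much larger than what any single token activates.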
The router is a small linear layer that takes the current token’s hidden state as input and produces a probability distribution over all experts. For a model with N experts, the router output is an N-dimensional vector passed through softmax:
gate(x) = softmax(W_g * x)
The top-k experts (commonly k=2, though some designs use k=1 or k=8) are selected, and their weights are renormalized to sum to 1. This is called top-k gating. Only the selected experts actually run their forward pass; the idle experts consume no compute for that token.
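The gating step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's router: the function names `softmax` and `top_k_gate`, the toy weight matrix `W_G`, and the input vector are all made up for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(hidden, w_gate, k=2):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    # One logit per expert: row i of w_gate dotted with the hidden state.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w_gate]
    probs = softmax(logits)
    # Keep the k largest probabilities, then renormalize them to sum to 1.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return {i: probs[i] / mass for i in chosen}

# Toy example: 4 experts, hidden size 3 (weights are invented for illustration).
W_G = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]
gates = top_k_gate([2.0, 1.0, 0.0], W_G, k=2)
print(gates)
```

Here the logits are 2.0, 1.0, 0.0 and 1.5, so experts 0 and 3 are selected and their two weights are rescaled to sum to 1.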
The final output for each token is the weighted sum of the selected experts’ outputs:
output = sum(gate_i * expert_i(x)) for each selected expert i
This is differentiable, which means the entire system — router included — can be trained end-to-end with standard backpropagation. The router learns which types of inputs each expert is best suited for.
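To make the combining step concrete, here is a small forward pass for a single token, written in plain Python under simplifying assumptions: experts are stand-in functions that just scale their input, and `moe_forward`, `make_expert`, and the `calls` log are hypothetical names used only for this sketch. Gradient computation is omitted; the point is that only the selected experts are ever called.

```python
import math

calls = []  # records which experts actually ran

def make_expert(index, scale):
    """Stand-in for an expert FFN: multiplies the input by a constant."""
    def expert(x):
        calls.append(index)
        return [scale * v for v in x]
    return expert

def moe_forward(x, w_gate, experts, k=2):
    # Router: softmax over one logit per expert.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in w_gate]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    probs = [e / sum(exps) for e in exps]
    # Top-k selection and renormalization.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    # Weighted sum of the selected experts' outputs; unselected experts never run.
    out = [0.0] * len(x)
    for i in chosen:
        gate = probs[i] / mass
        y = experts[i](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, chosen

w_gate = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.0]]
experts = [make_expert(i, s) for i, s in enumerate((1.0, 2.0, 3.0, 4.0))]
output, chosen = moe_forward([2.0, 1.0, 0.0], w_gate, experts, k=2)
print(chosen, output, calls)
```

With four experts and k=2, the `calls` log ends up with two entries, which is the compute saving in miniature.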
Mixtral 8x7B (December 2023): Mistral’s breakout MoE model. Each layer has 8 expert FFNs, with 2 active per token. Total: 46.7 billion parameters. Active per token: 12.9 billion. The total is less than 8 x 7B = 56B because only the FFNs are replicated; attention layers and embeddings are shared. On benchmarks, it matched or exceeded Llama 2 70B (a model with 5.4x more active parameters) while being significantly faster at inference. This was the proof of concept that made the industry take MoE seriously for open models.
Mixtral 8x22B (April 2024): The scaled-up version. 8 experts per layer, 141B total parameters, 39B active per token (again less than the naive 8 x 22B = 176B, because attention layers are shared). Performance competitive with GPT-4 Turbo on several benchmarks. Suggested that MoE scaling continues to pay off: larger experts kept improving performance.
GPT-4 (March 2023): OpenAI has never confirmed its architecture, but multiple sources (including George Hotz and leaked details) report it is an 8-expert MoE model with approximately 220B parameters per expert, totaling roughly 1.8 trillion parameters, with 2 experts active per token (~440B active parameters). If true, this would help explain how GPT-4 achieved a massive capability jump over GPT-3 (175B dense) while keeping inference costs manageable.
DeepSeek V3 (December 2024): 671B total parameters with a MoE design using 256 fine-grained routed experts (plus 1 shared expert) with 8 active per token, yielding 37B active parameters. It also uses Multi-Head Latent Attention (MLA), which was introduced in DeepSeek V2, where it was reported to shrink the KV cache by 93.3% relative to the earlier dense DeepSeek 67B model, and an auxiliary-loss-free load balancing strategy. DeepSeek reported roughly $5.6M in GPU rental cost for the final training run on 2,048 H800 GPUs, a figure that excludes prior research and ablation experiments. The architecture innovations were as significant as the headline cost figure.
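The gap between the 46.7B total and the 12.9B active figures can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes the publicly documented Mixtral 8x7B configuration (hidden size 4096, 32 layers, expert inner size 14336, grouped-query attention with 8 KV heads, vocabulary of 32,000, untied output head); treat these constants as assumptions rather than something stated in this article. `count_params` is a hypothetical helper.

```python
# Configuration values as published for Mixtral 8x7B (treated here as assumptions).
DIM = 4096          # hidden size
N_LAYERS = 32
FFN_HIDDEN = 14336  # inner size of each expert FFN
N_EXPERTS = 8
TOP_K = 2
N_HEADS = 32
N_KV_HEADS = 8      # grouped-query attention
HEAD_DIM = 128
VOCAB = 32000

def count_params(experts_counted):
    """Parameter count when `experts_counted` experts per layer are included."""
    expert_ffn = 3 * DIM * FFN_HIDDEN  # gate, up, and down projections
    # Query and output projections use all heads; key and value use the KV heads.
    attention = 2 * DIM * N_HEADS * HEAD_DIM + 2 * DIM * N_KV_HEADS * HEAD_DIM
    router = DIM * N_EXPERTS
    norms = 2 * DIM
    per_layer = experts_counted * expert_ffn + attention + router + norms
    embeddings = 2 * VOCAB * DIM  # input embedding plus untied output head
    return N_LAYERS * per_layer + embeddings + DIM  # + final norm

total_params = count_params(N_EXPERTS)  # every expert is stored
active_params = count_params(TOP_K)     # only top-k experts run per token
print(f"total:  {total_params / 1e9:.1f}B")
print(f"active: {active_params / 1e9:.1f}B")
```

Under these assumptions the expert FFNs account for about 45B of the total, while attention, embeddings, and routers together add under 2B, which is why sharing them keeps the total well below 56B.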
| Dimension | Dense Model | MoE Model |
|---|---|---|
| Inference Speed | Predictable. All parameters compute every token. Speed = f(active params). | Faster per token for equivalent capability. Only k-of-N experts run. But routing adds latency (~1-5%). |
| Memory (VRAM) | Straightforward: ~2 bytes per parameter (FP16). 70B model needs ~140GB. | Higher than active count suggests. All experts must be loaded. 46.7B MoE needs ~93GB, not ~26GB. |
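| Training Compute | Every parameter receives gradients on every token, so compute per token scales with total parameters. | Each token updates only its selected experts, so FLOPs per token track active parameters and the model converges faster per FLOP. All experts must still be held in memory, and expert parallelism adds communication overhead. |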
| Throughput | Predictable batch processing. Good GPU utilization. | Varies per batch. Different tokens route to different experts, creating load imbalance and communication overhead. |
| Model Quality | Strong up to ~400B params. Beyond that, training becomes prohibitively expensive. | Better capability per FLOP at large scale. Can store more knowledge in total params while keeping inference fast. |
| Deployment Complexity | Simple. One model, no routing logic, standard serving infrastructure. | Complex. Expert parallelism across GPUs, load-balanced routing, larger memory footprint, specialized serving frameworks. |
| Fine-tuning | Standard LoRA/QLoRA works well. Mature tooling. | More complex. Must decide whether to fine-tune all experts or selected ones. Tooling is less mature. |
| Quantization | Straightforward. INT4/INT8 quantization well-understood. | Trickier. Different experts may have different sensitivity to quantization. Shared layers vs expert layers need different treatment. |
The router is the most critical and most fragile component of a MoE model. If trained naively, it develops pathological behaviors.
The most common failure mode is expert collapse (also called the “rich get richer” problem). During training, if one expert happens to perform slightly better on early batches, the router sends it more tokens, which gives it more training signal, which makes it better, which means it gets even more tokens. The result: one or two experts handle most of the work while the rest atrophy and contribute nothing.
This wastes the entire point of MoE. A model with 8 experts but only 2 actually doing anything is just a dense model with extra memory overhead.
The field has developed several approaches:
Auxiliary loss (Switch Transformer, 2021). Google’s Switch Transformer added a loss term that penalizes unequal expert utilization. The training objective becomes: generate good text AND distribute tokens evenly. This works but introduces a hyperparameter (the auxiliary loss coefficient) that is tricky to tune — too high and it hurts model quality by forcing suboptimal routing, too low and collapse still occurs.
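As a rough illustration of the auxiliary loss, the sketch below follows the form given in the Switch Transformer paper as I understand it: the coefficient times the number of experts times the sum, over experts, of the fraction of tokens routed to that expert multiplied by the mean router probability for it. The function name `load_balancing_loss` and the toy inputs are hypothetical, and the default coefficient of 0.01 is an assumption for the example.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """router_probs[t][e] = softmax probability of token t for expert e."""
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    frac_tokens = [0.0] * num_experts  # share of tokens whose top-1 choice is e
    mean_prob = [0.0] * num_experts    # average router probability given to e
    for probs in router_probs:
        top = max(range(num_experts), key=lambda i: probs[i])
        frac_tokens[top] += 1.0 / num_tokens
        for i, p in enumerate(probs):
            mean_prob[i] += p / num_tokens
    # In real training only mean_prob carries gradient; frac_tokens is a count.
    return alpha * num_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))

# Balanced batch: each of 4 tokens prefers a different expert.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Collapsed batch: every token prefers expert 0.
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4

balanced_loss = load_balancing_loss(balanced)
collapsed_loss = load_balancing_loss(collapsed)
print(balanced_loss, collapsed_loss)
```

The balanced batch gives a loss equal to the coefficient itself (the minimum), while the collapsed batch is penalized more heavily, which is the pressure that pushes the router toward even utilization.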
Expert capacity limits (GShard, 2020). Each expert has a hard cap on how many tokens it can process per batch. Tokens routed to a full expert are either dropped or processed by a secondary expert. This prevents collapse but wastes compute when tokens are dropped.
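A capacity limit is simple to sketch. In the version below, each expert accepts tokens in arrival order until it reaches a cap derived from a capacity factor, and overflow tokens are dropped. The helper name `dispatch_with_capacity` and the choice to drop (rather than re-route) are assumptions for illustration; real systems differ in how they handle overflow.

```python
import math

def dispatch_with_capacity(assignments, num_experts, capacity_factor=1.25):
    """assignments[t] = expert chosen by token t. Returns (buckets, dropped, capacity)."""
    # Each expert may take its fair share of the batch, scaled by the factor.
    capacity = math.ceil(capacity_factor * len(assignments) / num_experts)
    buckets = [[] for _ in range(num_experts)]
    dropped = []
    for token_id, expert in enumerate(assignments):
        if len(buckets[expert]) < capacity:
            buckets[expert].append(token_id)
        else:
            dropped.append(token_id)  # expert is full: token skips the layer
    return buckets, dropped, capacity

# 8 tokens, 4 experts, and a router that over-uses expert 0.
buckets, dropped, capacity = dispatch_with_capacity(
    [0, 0, 0, 0, 0, 1, 2, 3], num_experts=4, capacity_factor=1.0
)
print(capacity, buckets, dropped)
```

With a capacity factor of 1.0 each expert takes at most 2 of the 8 tokens, so three of the five tokens sent to expert 0 are dropped. Raising the factor reduces drops at the cost of more padded, partly idle expert batches.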
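Auxiliary-loss-free balancing (DeepSeek V3, 2024). DeepSeek V3 balances load without relying on an explicit loss term. Each expert carries a bias that is added to its gating score when the top-k experts are chosen. After each training step, the bias is lowered for experts that received more than their share of tokens and raised for those that received less, steering future tokens away from over-utilized experts. The bias affects only which experts are selected; the weights used to combine expert outputs still come from the original scores. DeepSeek reported better model quality than with a pure auxiliary loss, and the method is considered a significant architectural contribution.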
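The bias-based idea can be sketched as two small functions: one that adds a per-expert bias to the gating scores when picking experts, and one that nudges each bias down or up depending on whether the expert was over- or under-used. This is a simplified reading of the approach, not DeepSeek's code; `select_with_bias`, `update_bias`, and the step size are hypothetical.

```python
def select_with_bias(scores, bias, k):
    """Pick the top-k experts by biased score; the bias is used for selection only."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    return ranked[:k]

def update_bias(bias, load, step=0.001):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    target = sum(load) / len(load)  # tokens per expert under perfect balance
    new_bias = []
    for b, n in zip(bias, load):
        if n > target:
            new_bias.append(b - step)
        elif n < target:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias

scores = [0.50, 0.45, 0.05]  # one token's raw gating scores
bias = [0.0, 0.0, 0.0]
before = select_with_bias(scores, bias, k=1)

# Suppose the last batch sent 10 tokens to expert 0, 2 to expert 1, 0 to expert 2.
bias = update_bias(bias, load=[10, 2, 0], step=0.1)
after = select_with_bias(scores, bias, k=1)
print(before, after, bias)
```

After one update the same token is routed to expert 1 instead of the overloaded expert 0. The step size here is exaggerated so the effect shows in a single step; a real run would use a much smaller value and many steps.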
Expert choice routing (Zhou et al., 2022). Instead of tokens choosing experts (top-k gating), experts choose tokens. Each expert selects its top-k tokens from the batch. This guarantees perfect load balance by construction but changes the computational pattern and can be harder to implement efficiently.
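Expert choice routing inverts the selection loop, which a short sketch makes clear. Below, each expert ranks all tokens in the batch by affinity and keeps a fixed number of them. The function name `expert_choice` and the score matrix are invented for the example.

```python
def expert_choice(scores, capacity):
    """scores[t][e] = affinity of token t for expert e.

    Returns picks[e] = list of token indices chosen by expert e.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    picks = []
    for e in range(num_experts):
        # Each expert ranks every token and keeps its `capacity` favourites.
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        picks.append(ranked[:capacity])
    return picks

scores = [
    [0.9, 0.8],  # token 0: attractive to both experts
    [0.8, 0.7],  # token 1: attractive to both experts
    [0.1, 0.2],  # token 2
    [0.2, 0.1],  # token 3
]
picks = expert_choice(scores, capacity=2)
covered = {t for chosen in picks for t in chosen}
print(picks, covered)
```

Every expert processes exactly two tokens, so load is balanced by construction. The example also shows the changed computational pattern: tokens 0 and 1 are processed by both experts, while tokens 2 and 3 are picked by none and would pass through the layer on the residual path only.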
A natural question: do individual experts learn to handle specific types of content? The evidence is mixed.
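In Mixtral 8x7B, Mistral’s own routing analysis found little evidence of topic-level specialization: experts were not clearly assigned to code, mathematics, or particular subject areas. The router did show structured behavior at a finer, more syntactic level, for example sending consecutive tokens, or the same keyword in code, to the same expert. Whatever specialization exists is soft, not hard. Most experts can handle most inputs; the router just has preferences.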
In models with many more experts (like DeepSeek V3’s 256), the specialization becomes finer-grained but harder to interpret. Some researchers have found clusters of experts that activate together for specific domains, but the picture is complex and not fully understood.
The fundamental advantage of MoE is economic. Consider two hypothetical models:
The MoE model will be faster at inference while potentially matching or exceeding the dense model’s quality, because:
The tradeoff is memory. Both models need approximately the same VRAM. MoE does not save memory — it saves compute. This is why MoE is most advantageous at the frontier: when you are already provisioning serious GPU clusters, the memory overhead is acceptable, and the compute savings (or equivalently, the quality gains at the same compute budget) are significant.
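The memory-versus-compute point can be checked with simple arithmetic. The sketch below compares a MoE model using the Mixtral 8x7B figures quoted earlier (46.7B total, 12.9B active) against two hypothetical dense models, one matching its total size and one matching its active size. It assumes FP16 weights at 2 bytes per parameter and the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass; the helper names are made up.

```python
def weight_memory_gb(total_params_billions, bytes_per_param=2):
    """Approximate weight memory in GB (1e9 params x bytes per param)."""
    return total_params_billions * bytes_per_param

def forward_gflops_per_token(active_params_billions):
    """Rule-of-thumb forward cost: about 2 FLOPs per active parameter."""
    return 2 * active_params_billions

models = {
    # name: (total params in billions, active params in billions)
    "MoE (46.7B total, 12.9B active)": (46.7, 12.9),
    "dense, same total size": (46.7, 46.7),
    "dense, same active size": (12.9, 12.9),
}

report = {
    name: (weight_memory_gb(total), forward_gflops_per_token(active))
    for name, (total, active) in models.items()
}
for name, (mem, flops) in report.items():
    print(f"{name}: {mem:.1f} GB weights, {flops:.1f} GFLOPs per token")
```

The MoE row has the memory footprint of the larger dense model and the per-token compute of the smaller one, which is the trade the paragraph above describes. KV cache, activations, and routing overhead are ignored here.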
For smaller models (under ~14B active parameters), the overhead of routing, the complexity of multi-expert deployment, and the risk of expert collapse often outweigh the benefits. This is why most small open models (Mistral 7B, Llama 3 8B, Phi-3 Mini) remain dense.
Finer-grained routing. DeepSeek V3’s 256-expert, 8-active design showed that more experts at smaller sizes can outperform fewer, larger experts. The trend is toward hundreds or thousands of very small experts, approaching a continuum between MoE and parameter-efficient methods.
Mixture of Depths. A 2024 Google research paper proposed not just routing tokens to different experts, but allowing some tokens to skip entire layers. Easy tokens (like predicting “the” after “in”) get fewer layers of processing. Hard tokens (like resolving ambiguous pronouns) get more. This is MoE applied to the depth dimension, not just the width.
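A heavily simplified sketch of the idea: a per-layer router scores every token, only the highest-scoring tokens go through the layer's computation, and the rest pass through unchanged on the residual path. Tokens are plain floats here, `mod_layer` and `toy_block` are hypothetical names, and details of the published method (such as scaling the block output by the router weight) are left out.

```python
def mod_layer(tokens, router_scores, block, capacity):
    """Apply `block` only to the `capacity` tokens with the highest router scores."""
    ranked = sorted(range(len(tokens)), key=lambda i: router_scores[i], reverse=True)
    keep = set(ranked[:capacity])
    # Selected tokens get the residual update; the others skip the layer.
    return [x + block(x) if i in keep else x for i, x in enumerate(tokens)]

def toy_block(x):
    """Stand-in for an attention + FFN block."""
    return 0.5 * x

tokens = [1.0, 2.0, 3.0, 4.0]
router_scores = [0.1, 0.9, 0.2, 0.8]  # tokens 1 and 3 are judged "hard"
result = mod_layer(tokens, router_scores, toy_block, capacity=2)
print(result)
```

Only two of the four tokens pay for the block's computation in this layer, so the compute budget per layer is fixed by `capacity` rather than by sequence length.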
Hardware-aware routing. Current MoE models route tokens purely based on learned gating scores. Future designs may incorporate hardware awareness — routing to experts that happen to be on the same GPU, reducing communication overhead at the cost of slightly suboptimal expert selection.
MoE for fine-tuning. Rather than fine-tuning an entire model, you could add new experts for your specific domain while keeping existing experts frozen. This would combine the efficiency of LoRA-style fine-tuning with the capacity of MoE, and early research (MoLoRA, 2024) is exploring exactly this direction.
Mixture of Experts is the answer to a question that has haunted deep learning for a decade: how do you keep making models smarter without making inference proportionally more expensive? The solution — many specialized sub-networks with a learned router — is elegant in concept and messy in practice, requiring careful load balancing, complex distributed systems, and substantial memory overhead.
But it works. GPT-4’s rumored MoE architecture helped it leap ahead of GPT-3.5. Mixtral proved the approach could be done in the open. DeepSeek V3 showed it could be done cheaply. The next generation of frontier models will almost certainly be MoE, and probably with architectures more sophisticated than anything deployed today.
The era of simple “make it bigger” scaling is over. The era of “make it bigger, but only use the parts you need” has begun.