The A.I. Beat

Dispatches from the frontier of machine intelligence

April 21, 2026 · 6 min read

ARCHITECTURE

What Is Mixture of Experts? The Architecture Powering Frontier AI

A detailed technical explainer of the Mixture of Experts architecture — the routing mechanism, the efficiency tradeoffs, and why MoE is behind GPT-4, Mixtral, and DeepSeek V3.

By The AI Beat · Technology

“Mixtral 8x7B has 46.7 billion total parameters but only uses 12.9 billion per token. It gets the knowledge of a large model at the speed of a small one. That is not a free lunch — but it is a deeply discounted one.”

For most of the deep learning era, the path to better AI models was straightforward: make them bigger. Double the parameters, train on more data, spend more compute. Each generation of “dense” models — where every parameter is used for every input — followed this playbook. GPT-2 had 1.5 billion parameters. GPT-3 had 175 billion. The performance improved, but so did the cost. The next doubling would be ruinous.

Mixture of Experts (MoE) broke that curve. Instead of activating all parameters for every input, MoE models contain many specialized sub-networks (“experts”) and a routing mechanism that selects only a few for each token. The result: a model with the knowledge of a very large network and the inference cost of a much smaller one.

This is not a new idea — MoE was proposed by Jacobs et al. in 1991. But it has become the dominant architecture at the frontier. GPT-4 is widely reported to be a MoE model. Mixtral proved MoE could work at open-weight scale. DeepSeek V3 pushed the efficiency further than anyone expected. Understanding MoE is now essential to understanding how modern AI works.

How Mixture of Experts Works

The core MoE architecture has three components: a set of expert networks, a router (also called a gating network), and a combining mechanism.

Fig. 1 — Mixture of Experts architecture. The router assigns each token to the top-k experts (typically k=2). Only selected experts run; the rest are idle. Expert outputs are combined using router-assigned weights.

The Experts

In a transformer-based MoE model, the “experts” replace the standard feed-forward network (FFN) in each transformer block. The attention layers remain shared across all tokens — only the FFN is split into multiple expert copies. Each expert is a full FFN with its own weights, meaning it can specialize in different types of patterns.

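The naming can mislead. A model labeled “8x7B” does not contain 8 x 7B = 56B parameters, because only the FFN is replicated per expert while the attention layers, embeddings, and norms are shared. In Mixtral 8x7B, each expert accounts for roughly 5.6B parameters summed across all layers, and the total comes to 46.7B. The total parameter count is still much larger than what any single token touches.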

The Router

The router is a small linear layer that takes the current token’s hidden state as input and produces a probability distribution over all experts. For a model with N experts, the router output is an N-dimensional vector passed through softmax:

gate(x) = softmax(W_g * x)

The top-k experts (usually k=2) are selected, and their weights are renormalized to sum to 1. This is called top-k gating. Only the selected experts actually run their forward pass — the idle experts consume no compute.
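As a minimal sketch of top-k gating, the snippet below applies softmax to a vector of router logits, keeps the k largest entries, and renormalizes them. The logits are made-up numbers standing in for W_g * x, and the helper names (softmax, top_k_gate) are illustrative rather than taken from any particular library.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(router_logits, k=2):
    """Return {expert_index: weight} for the k most probable experts.

    The selected probabilities are renormalized so the weights sum to 1.
    """
    probs = softmax(router_logits)
    # Indices of the k largest probabilities (ties go to the lower index).
    chosen = sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}

# Hypothetical router logits for one token and N = 4 experts.
logits = [2.0, 0.5, 1.0, -1.0]
gates = top_k_gate(logits, k=2)
print(gates)  # experts 0 and 2 are selected; the other two stay idle
```

Renormalizing the top-k probabilities gives the same weights as taking a softmax over only the top-k logits, which is how some implementations write it.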

The Combination

The final output for each token is the weighted sum of the selected experts’ outputs:

output = sum(gate_i * expert_i(x)) for each selected expert i

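The weighted sum is differentiable with respect to the gate values of the selected experts, which means the entire system, router included, can be trained end-to-end with standard backpropagation. The discrete top-k selection itself passes no gradient to the experts that were not chosen, which is one reason the router needs explicit load balancing. Over training, the router learns which types of inputs each expert is best suited for.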
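The combination step can be sketched in a few lines. The toy "experts" here are simple functions standing in for full FFNs, and the gate weights are assumed to come from a router that has already done top-2 selection and renormalization; all names are illustrative.

```python
def combine(x, experts, gates):
    """Weighted sum of the selected experts' outputs.

    experts: list of callables mapping a vector to a vector
    gates:   {expert_index: weight}; only selected experts appear here
    """
    out = [0.0] * len(x)
    for i, w in gates.items():
        y = experts[i](x)  # only the selected experts are evaluated
        out = [o + w * v for o, v in zip(out, y)]
    return out

# Toy stand-ins for expert FFNs (illustrative only).
experts = [
    lambda x: [2.0 * v for v in x],  # expert 0 doubles its input
    lambda x: [v + 1.0 for v in x],  # expert 1 shifts its input
    lambda x: [-v for v in x],       # expert 2 negates its input
    lambda x: [0.0 for _ in x],      # expert 3 outputs zeros
]

gates = {0: 0.7, 2: 0.3}  # assumed router output for this token
result = combine([1.0, -2.0], experts, gates)
print(result)
```

Experts 1 and 3 are never called for this token, which is where the compute saving comes from.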

MoE in Practice: Real Models, Real Numbers

Fig. 2 — Total parameter count for major models. MoE models (top four) have far more total parameters than active parameters per token. Dense models (bottom two) use all parameters for every token.

Mixtral 8x7B (December 2023): Mistral’s breakout MoE model. 8 experts per layer, built on the Mistral 7B architecture, with 2 active per token. Because only the FFN is replicated and the attention layers are shared, the total is 46.7 billion parameters rather than 56 billion. Active per token: 12.9 billion. On benchmarks, it matched or exceeded Llama 2 70B (a model with 5.4x more active parameters) while being significantly faster at inference. This was the proof of concept that made the industry take MoE seriously for open models.

Mixtral 8x22B (April 2024): The scaled-up version. 8 experts per layer, 141B total parameters (not 8 x 22B = 176B, since the attention layers are shared), 39B active. Performance competitive with GPT-4 Turbo on several benchmarks. Demonstrated that MoE scaling holds: larger experts continue to improve performance at the same expert count.

GPT-4 (March 2023): OpenAI has never confirmed its architecture, but multiple sources (including George Hotz and leaked details) report it is an 8-expert MoE model with approximately 220B parameters per expert, totaling roughly 1.8 trillion parameters, with 2 experts active per token (~440B active compute). If true, this explains how GPT-4 achieved a massive capability jump over GPT-3 (175B dense) while keeping inference costs manageable.

DeepSeek V3 (December 2024): 671B total parameters with a fine-grained MoE design using 256 routed experts (plus 1 shared expert) per MoE layer, with 8 routed experts active per token, yielding 37B active parameters. It also uses Multi-Head Latent Attention (MLA), introduced in DeepSeek V2, where it was reported to cut the KV cache by 93.3% relative to DeepSeek 67B, and an auxiliary-loss-free load balancing strategy. DeepSeek reported a cost of approximately $5.6M for the final training run on 2,048 H800 GPUs, a figure that excludes prior research and ablation experiments. The architecture innovations were as significant as the headline cost figure.
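The Mixtral 8x7B totals quoted above can be reproduced with a back-of-the-envelope count. The hyperparameters below are assumptions taken from the publicly released Mixtral 8x7B model configuration, not from this article, and the sketch assumes a SwiGLU FFN with three weight matrices per expert and untied input and output embeddings.

```python
# Assumed Mixtral 8x7B hyperparameters (public model config, not from the article).
HIDDEN = 4096      # model width
FFN = 14336        # FFN inner width
LAYERS = 32
EXPERTS = 8        # experts per layer
ACTIVE = 2         # experts used per token
HEADS = 32         # query heads
KV_HEADS = 8       # key/value heads (grouped-query attention)
HEAD_DIM = 128
VOCAB = 32000

def param_counts():
    """Return (total, active-per-token) parameter counts."""
    expert = 3 * HIDDEN * FFN                      # gate, up, down projections
    attn = 2 * HIDDEN * HEADS * HEAD_DIM           # query and output projections
    attn += 2 * HIDDEN * KV_HEADS * HEAD_DIM       # key and value projections
    router = HIDDEN * EXPERTS                      # one linear gating layer
    norms = 2 * HIDDEN                             # two RMSNorm weights per layer
    shared = LAYERS * (attn + router + norms)
    shared += HIDDEN                               # final norm
    shared += 2 * VOCAB * HIDDEN                   # input embedding + output head
    total = shared + LAYERS * EXPERTS * expert
    active = shared + LAYERS * ACTIVE * expert     # only 2 of 8 experts run
    return total, active

total, active = param_counts()
print(f"total  = {total / 1e9:.1f}B")
print(f"active = {active / 1e9:.1f}B")
```

Under these assumptions the expert FFNs account for about 45B of the total, which is why the headline figure sits well below 8 x 7B = 56B.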

Dense vs MoE: The Engineering Tradeoffs

Dimension	Dense Model	MoE Model
Inference Speed	Predictable. All parameters compute every token. Speed = f(active params).	Faster per token for equivalent capability. Only k-of-N experts run. But routing adds latency (~1-5%).
Memory (VRAM)	Straightforward: ~2 bytes per parameter (FP16). 70B model needs ~140GB.	Higher than active count suggests. All experts must be loaded. 46.7B MoE needs ~93GB, not ~26GB.
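Training Compute	Efficient. Every parameter gets gradients on every example.	Compute per token scales with active parameters, not total. Each expert receives gradients only from the tokens routed to it, but all experts' weights and optimizer state must still be held in memory. Typically reaches a given quality with fewer training FLOPs than a dense model.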
Throughput	Predictable batch processing. Good GPU utilization.	Varies per batch. Different tokens route to different experts, creating load imbalance and communication overhead.
Model Quality	Strong up to ~400B params. Beyond that, training becomes prohibitively expensive.	Better capability per FLOP at large scale. Can store more knowledge in total params while keeping inference fast.
Deployment Complexity	Simple. One model, no routing logic, standard serving infrastructure.	Complex. Expert parallelism across GPUs, load-balanced routing, larger memory footprint, specialized serving frameworks.
Fine-tuning	Standard LoRA/QLoRA works well. Mature tooling.	More complex. Must decide whether to fine-tune all experts or selected ones. Tooling is less mature.
Quantization	Straightforward. INT4/INT8 quantization well-understood.	Trickier. Different experts may have different sensitivity to quantization. Shared layers vs expert layers need different treatment.

Fig. 3 — Dense vs MoE across engineering dimensions. MoE wins on capability-per-FLOP at scale but adds significant deployment and training complexity.
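The memory row of the comparison is simple arithmetic, sketched below. The function covers weights only (no KV cache, activations, or framework overhead), uses GB to mean 10^9 bytes, and its name and precision table are illustrative assumptions.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billion, dtype="fp16"):
    """Memory for model weights alone, in GB (10^9 bytes)."""
    return params_billion * BYTES_PER_PARAM[dtype]

# Dense 70B model: every parameter is both stored and used.
print(weight_memory_gb(70.0))            # about 140 GB

# Mixtral 8x7B: all 46.7B parameters must be resident,
# even though only 12.9B are used for any one token.
print(weight_memory_gb(46.7))            # about 93 GB actually required
print(weight_memory_gb(12.9))            # about 26 GB if only active params counted

# Quantization shrinks the footprint for both kinds of model.
print(weight_memory_gb(46.7, "int4"))    # about 23 GB
```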

The Routing Problem: Expert Collapse and Load Balancing

The router is the most critical and most fragile component of a MoE model. If trained naively, it develops pathological behaviors.

Expert Collapse

The most common failure mode is expert collapse (also called the “rich get richer” problem). During training, if one expert happens to perform slightly better on early batches, the router sends it more tokens, which gives it more training signal, which makes it better, which means it gets even more tokens. The result: one or two experts handle most of the work while the rest atrophy and contribute nothing.

This wastes the entire point of MoE. A model with 8 experts but only 2 actually doing anything is just a dense model with extra memory overhead.
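Collapse is straightforward to detect once routing decisions are logged. The sketch below computes the fraction of token slots each expert received and a simple imbalance ratio; the metric and function names are illustrative choices, not a standard API.

```python
from collections import Counter

def expert_load(assignments, num_experts):
    """Fraction of routed token slots that each expert received."""
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(e, 0) / total for e in range(num_experts)]

def imbalance(load):
    """Busiest expert's load divided by the ideal uniform load.

    1.0 means perfectly balanced; len(load) means one expert gets everything.
    """
    return max(load) * len(load)

# Hypothetical routing logs: one expert index per routed token slot.
healthy = [0, 1, 2, 3, 4, 5, 6, 7] * 2
collapsed = [0] * 10 + [1] * 5 + [2]

print(imbalance(expert_load(healthy, 8)))    # balanced routing
print(imbalance(expert_load(collapsed, 8)))  # expert 0 dominates
```

In the collapsed log, five of the eight experts receive no tokens at all, so they get no gradient signal and cannot recover on their own.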

Load Balancing Solutions

The field has developed several approaches:

Auxiliary loss (Switch Transformer, 2021). Google’s Switch Transformer used a simplified version of the load-balancing loss from earlier sparse MoE work: a term that penalizes unequal expert utilization. The training objective becomes: generate good text AND distribute tokens evenly. This works but introduces a hyperparameter (the auxiliary loss coefficient) that is tricky to tune. Too high and it hurts model quality by forcing suboptimal routing; too low and collapse still occurs.
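A minimal sketch of a Switch-Transformer-style auxiliary loss for top-1 routing follows. It computes alpha * N * sum_i(f_i * P_i), where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability for expert i. The function name, the toy batches, and the default coefficient are assumptions for illustration.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """Auxiliary balance loss for one batch under top-1 routing.

    router_probs: list of per-token probability vectors over N experts.
    Returns alpha * N * sum_i f_i * P_i, which equals alpha when
    routing is perfectly uniform and grows as routing concentrates.
    """
    num_tokens = len(router_probs)
    n = len(router_probs[0])
    f = [0.0] * n  # fraction of tokens whose top choice is expert i
    p = [0.0] * n  # mean router probability assigned to expert i
    for probs in router_probs:
        top = max(range(n), key=lambda i: probs[i])
        f[top] += 1.0 / num_tokens
        for i in range(n):
            p[i] += probs[i] / num_tokens
    return alpha * n * sum(fi * pi for fi, pi in zip(f, p))

# Four tokens, four experts: each token prefers a different expert.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Every token prefers expert 0.
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4

print(load_balancing_loss(balanced))
print(load_balancing_loss(collapsed))
```

The collapsed batch produces a loss almost three times larger than the balanced one, so gradient descent on this term pushes router probability away from the overused expert.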

Expert capacity limits (GShard, 2020). Each expert has a hard cap on how many tokens it can process per batch. Tokens routed to a full expert are either dropped or processed by a secondary expert. This bounds the load on any one expert, but a dropped token skips the expert layer entirely and is carried forward only by the residual connection, which costs quality.
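A capacity limit can be sketched as a counter per expert. The version below handles top-1 routing only and simply drops overflow tokens; the capacity formula (capacity factor times tokens divided by experts, rounded up) and the 1.25 default are common conventions assumed here for illustration.

```python
import math

def route_with_capacity(top1_choices, num_experts, capacity_factor=1.25):
    """Assign tokens to their chosen expert until that expert is full.

    top1_choices: expert index chosen by the router for each token, in order.
    Returns (capacity, kept, dropped) where kept is a list of
    (token_index, expert_index) pairs and dropped is a list of token indices.
    """
    capacity = math.ceil(capacity_factor * len(top1_choices) / num_experts)
    load = [0] * num_experts
    kept, dropped = [], []
    for token, expert in enumerate(top1_choices):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append((token, expert))
        else:
            dropped.append(token)  # overflow: this token skips the expert layer
    return capacity, kept, dropped

# Hypothetical batch of 8 tokens where expert 0 is heavily favored.
choices = [0, 0, 0, 0, 1, 2, 3, 0]
capacity, kept, dropped = route_with_capacity(choices, num_experts=4)
print(capacity, dropped)
```

With 8 tokens and 4 experts the cap works out to 3 tokens per expert, so the fourth and fifth tokens sent to expert 0 overflow.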

Auxiliary-loss-free balancing (DeepSeek V3, 2024). DeepSeek V3 adopted a method that dynamically adjusts a per-expert bias during training instead of relying on a loss term. The bias is added to the gating scores only when choosing the top-k experts; the weights used to combine expert outputs still come from the unbiased scores. After each training step, the bias of an overloaded expert is nudged down and the bias of an underloaded expert is nudged up. DeepSeek reported better results than with a pure auxiliary loss, though V3 keeps a very small sequence-level balance loss as a safeguard. The approach is considered a significant architectural contribution.
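The bias mechanism can be sketched as two small functions: one that selects experts using biased scores, and one that adjusts the biases after a batch. This is a simplified illustration in the spirit of the DeepSeek V3 description, assuming positive affinity scores; the function names, toy numbers, and step size gamma are hypothetical.

```python
def select_with_bias(scores, bias, k):
    """Pick top-k experts by (score + bias).

    The bias affects only which experts are chosen. The returned
    combination weights are computed from the unbiased scores.
    """
    n = len(scores)
    chosen = sorted(range(n), key=lambda i: (-(scores[i] + bias[i]), i))[:k]
    norm = sum(scores[i] for i in chosen)
    return {i: scores[i] / norm for i in chosen}

def update_bias(bias, token_counts, gamma=0.001):
    """Lower the bias of overloaded experts, raise it for underloaded ones."""
    mean = sum(token_counts) / len(token_counts)
    updated = []
    for b, count in zip(bias, token_counts):
        if count > mean:
            updated.append(b - gamma)
        elif count < mean:
            updated.append(b + gamma)
        else:
            updated.append(b)
    return updated

scores = [0.9, 0.8, 0.3, 0.2]  # hypothetical affinities for one token
no_bias = select_with_bias(scores, [0.0, 0.0, 0.0, 0.0], k=2)
with_bias = select_with_bias(scores, [-0.5, 0.0, 0.4, 0.0], k=2)
print(no_bias)    # experts 0 and 1
print(with_bias)  # expert 0 is penalized, expert 2 is boosted

new_bias = update_bias([0.0, 0.0, 0.0, 0.0], [10, 2, 4, 0], gamma=0.1)
print(new_bias)
```

Because the bias never enters the combination weights, it steers traffic without directly distorting the model's output the way a competing loss term can.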

Expert choice routing (Zhou et al., 2022). Instead of tokens choosing experts (top-k gating), experts choose tokens. Each expert selects its top-k tokens from the batch. This guarantees perfect load balance by construction but changes the computational pattern and can be harder to implement efficiently.
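Expert choice routing inverts the selection loop, as the sketch below shows: each expert ranks all tokens by affinity and takes a fixed number of them. The score matrix and function name are illustrative assumptions.

```python
def expert_choice(scores, capacity):
    """Each expert selects its `capacity` highest-scoring tokens.

    scores[t][e] is the affinity of token t for expert e.
    Returns {expert_index: sorted list of chosen token indices}.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    assignment = {}
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: (-scores[t][e], t))
        assignment[e] = sorted(ranked[:capacity])
    return assignment

# Hypothetical affinities for 4 tokens and 2 experts.
scores = [
    [0.9, 0.8],
    [0.6, 0.7],
    [0.2, 0.1],
    [0.5, 0.3],
]
assignment = expert_choice(scores, capacity=2)
print(assignment)

# Tokens that no expert picked.
picked = {t for tokens in assignment.values() for t in tokens}
unrouted = [t for t in range(len(scores)) if t not in picked]
print(unrouted)
```

Every expert processes exactly two tokens, so load is balanced by construction. The cost is visible in the same example: tokens 0 and 1 are processed by both experts while tokens 2 and 3 are processed by none.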

Do Experts Actually Specialize?

A natural question: do individual experts learn to handle specific types of content? The evidence is mixed.

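In Mixtral 8x7B, Mistral’s routing analysis found little specialization by topic: experts were selected at similar rates for text from different domains, with only marginal differences for mathematical content. What the router did show was more syntactic structure, such as particular code tokens being sent consistently to the same experts and consecutive tokens often sharing experts. Where specialization exists, it is soft, not hard. Most experts can handle most inputs; the router just has preferences.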

In models with many more experts (like DeepSeek V3’s 256), the specialization becomes finer-grained but harder to interpret. Some researchers have found clusters of experts that activate together for specific domains, but the picture is complex and not fully understood.

Why MoE Dominates the Frontier

The fundamental advantage of MoE is economic. Consider two hypothetical models:

Dense 200B: 200 billion parameters, all active per token. Requires ~400GB VRAM (FP16). Inference uses 200B parameters of compute per token.
MoE 8x25B: 200 billion total parameters, 8 experts of 25B each (in FFN layers), 2 active per token. Approximately 50B active parameters per token (ignoring the shared attention layers for simplicity). Requires the same ~400GB VRAM (all experts loaded). But inference uses only 50B parameters of compute per token, about 4x less compute than the dense model.

The MoE model will be faster at inference while potentially matching or exceeding the dense model’s quality, because:

It has the same total parameter count to store knowledge
It can specialize different experts for different patterns
The router adds only minimal overhead

The tradeoff is memory. Both models need approximately the same VRAM. MoE does not save memory — it saves compute. This is why MoE is most advantageous at the frontier: when you are already provisioning serious GPU clusters, the memory overhead is acceptable, and the compute savings (or equivalently, the quality gains at the same compute budget) are significant.
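The comparison between the two hypothetical models can be put in numbers. The sketch uses a common rule of thumb of roughly 2 FLOPs per active parameter per token for a forward pass, which ignores the sequence-length-dependent cost of attention; treat the result as an estimate of compute, not of wall-clock latency.

```python
def weights_gb(total_params, bytes_per_param=2):
    """FP16 weight memory in GB (10^9 bytes)."""
    return total_params * bytes_per_param / 1e9

def forward_flops_per_token(active_params):
    """Rule-of-thumb forward cost: about 2 FLOPs per active parameter."""
    return 2 * active_params

# The two hypothetical models from the text.
dense = {"total": 200e9, "active": 200e9}
moe = {"total": 200e9, "active": 50e9}

for name, m in (("Dense 200B", dense), ("MoE 8x25B", moe)):
    print(name, weights_gb(m["total"]), "GB,",
          forward_flops_per_token(m["active"]), "FLOPs/token")

compute_ratio = (forward_flops_per_token(dense["active"])
                 / forward_flops_per_token(moe["active"]))
print(compute_ratio)
```

Both models need the same weight memory, while the MoE model does a quarter of the arithmetic per token.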

For smaller models (under roughly 14B total parameters), the overhead of routing, the complexity of multi-expert deployment, and the risk of expert collapse often outweigh the benefits. This is why most small open models (Mistral 7B, Llama 3 8B, Phi-3 Mini) remain dense.

What Comes Next

Finer-grained routing. DeepSeek V3’s 256-expert, 8-active design showed that more experts at smaller sizes can outperform fewer, larger experts. The trend is toward hundreds or thousands of very small experts, approaching a continuum between MoE and parameter-efficient methods.

Mixture of Depths. A 2024 Google DeepMind paper (Raposo et al.) proposed not just routing tokens to different experts, but allowing some tokens to skip entire layers. In each routed layer, a router picks a fixed number of tokens to process, and the rest pass through unchanged via the residual connection. Easy tokens (like predicting “the” after “in”) get fewer layers of processing. Hard tokens (like resolving ambiguous pronouns) get more. This is MoE applied to the depth dimension, not just the width.
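A simplified sketch of the layer-skipping idea: a router scores every token, a fixed number of the highest-scoring tokens go through the block, and the rest are passed along unchanged. Tokens are plain floats here and the block is a toy function; a real implementation operates on vectors and also scales the block output by the router weight, which this sketch omits.

```python
def mixture_of_depths_layer(tokens, router_scores, block, capacity):
    """Run `block` on only the `capacity` highest-scoring tokens.

    Selected tokens get the usual residual update x + block(x).
    All other tokens skip the block and are returned unchanged.
    Returns (outputs, sorted indices of processed tokens).
    """
    ranked = sorted(range(len(tokens)), key=lambda i: (-router_scores[i], i))
    selected = set(ranked[:capacity])
    outputs = []
    for i, x in enumerate(tokens):
        if i in selected:
            outputs.append(x + block(x))  # processed by this layer
        else:
            outputs.append(x)             # skipped: identity path only
    return outputs, sorted(selected)

# Hypothetical token values and router scores for one layer.
tokens = [1.0, 2.0, 3.0, 4.0]
router_scores = [0.1, 0.9, 0.3, 0.8]
outputs, processed = mixture_of_depths_layer(
    tokens, router_scores, block=lambda x: 10.0 * x, capacity=2
)
print(outputs, processed)
```

Fixing the capacity in advance keeps the compute per layer predictable, which is the same motivation behind expert capacity limits in MoE.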

Hardware-aware routing. Current MoE models route tokens purely based on learned gating scores. Future designs may incorporate hardware awareness — routing to experts that happen to be on the same GPU, reducing communication overhead at the cost of slightly suboptimal expert selection.

MoE for fine-tuning. Rather than fine-tuning an entire model, you could add new experts for your specific domain while keeping existing experts frozen. This would combine the efficiency of LoRA-style fine-tuning with the capacity of MoE, and early research (MoLoRA, 2024) is exploring exactly this direction.

The Bottom Line

Mixture of Experts is the answer to a question that has haunted deep learning for a decade: how do you keep making models smarter without making inference proportionally more expensive? The solution — many specialized sub-networks with a learned router — is elegant in concept and messy in practice, requiring careful load balancing, complex distributed systems, and substantial memory overhead.

But it works. GPT-4’s rumored MoE architecture helped it leap ahead of GPT-3.5. Mixtral proved the approach could be done in the open. DeepSeek V3 showed it could be done cheaply. The next generation of frontier models will almost certainly be MoE, and probably with architectures more sophisticated than anything deployed today.

The era of simple “make it bigger” scaling is over. The era of “make it bigger, but only use the parts you need” has begun.
