🧩 Mixture of Experts
DeepSeek-V3 has 671B params but each token only uses 37B
Dense Transformers use every parameter for every token. Mixture of Experts (MoE) breaks this constraint — a learned router selects a small subset of "expert" FFN layers per token, enabling models with hundreds of billions of parameters while keeping per-token compute fixed. This is how DeepSeek-V3 reaches 671B total parameters while activating only 37B per token.
Expert Routing Simulator
What you're seeing
Select a token category. The router scores all 8 experts via softmax(W_g · x). The top-2 highest-scoring experts are activated (highlighted in accent). Their gate scores are renormalized to sum to 1 before weighting outputs.
What to try: switch between categories and watch which experts “specialize”. Notice that “code” and “math” share Expert 3 but activate different second experts — the router has learned overlapping representations.
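Here is a tiny numeric sketch of that routing step — the router logits are made up, but the softmax → top-2 → renormalize pipeline matches the simulator:

import torch
import torch.nn.functional as F

# Hypothetical router logits W_g · x for one token across 8 experts
logits = torch.tensor([1.2, -0.3, 2.1, 0.4, -1.0, 0.9, 2.4, 0.1])

probs = F.softmax(logits, dim=-1)       # softmax over all 8 experts
top2_p, top2_idx = probs.topk(2)        # keep the two highest-scoring experts
gates = top2_p / top2_p.sum()           # renormalize so the two gates sum to 1

print(top2_idx.tolist())   # [6, 2] — the two active experts
print(gates.tolist())      # roughly [0.57, 0.43] — weights applied to their outputs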
MoE Layer Architecture
What you're seeing
A single MoE layer replacing a dense FFN. Each token passes through a lightweight router that scores all 8 experts. Only the top-2 scoring experts run — the other 6 sit idle (zero FLOPs that step). Their outputs are combined using the router's renormalized gate weights.
What to notice
The inactive experts (grey, dashed lines) still occupy GPU memory — that's the storage cost. But they contribute zero arithmetic that token, giving Mixtral 8×7B its 3.6× compute efficiency: 46.7B params stored, only 12.9B active per token.
The Intuition
The core problem: expert collapse. MoE routes each token to 2 of 8 experts. Without load balancing, the router learns a shortcut: send everything to expert #1 (it's already the best, so the gradient reinforces it). The other 7 experts never train. You paid for 8× parameters but use 1×. The load balancing loss forces the router to spread tokens evenly across all experts.
The compute math: a dense 7B model stores 7B params and activates all 7B per token. MoE Mixtral 8×7B stores 46.7B params while activating only ~12.9B per token — you get extra capacity almost for free by storing knowledge in experts that sit idle for any given token.
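Rough arithmetic behind those numbers (the layer hyperparameters are Mixtral's published ones; the split between expert and non-expert parameters is an estimate, not an official figure):

# Back-of-envelope parameter accounting for Mixtral 8x7B.
# Assumed: 32 layers, d_model=4096, d_ff=14336, SwiGLU experts (3 weight matrices each).
n_layers, d_model, d_ff = 32, 4096, 14336
per_expert = 3 * d_model * d_ff                 # ~0.176B params per expert FFN
expert_total = 8 * n_layers * per_expert        # ~45.1B stored across all experts
expert_active = 2 * n_layers * per_expert       # ~11.3B run per token (top-2)
dense_rest = 46.7e9 - expert_total              # ~1.6B attention/embeddings/norms (inferred)

print(f"total  ≈ {(expert_total + dense_rest)/1e9:.1f}B")    # ~46.7B
print(f"active ≈ {(expert_active + dense_rest)/1e9:.1f}B")   # ~12.9B
print(f"ratio  ≈ {(expert_total + dense_rest)/(expert_active + dense_rest):.1f}x")  # ~3.6x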
Think of a hospital. A dense model is like one general practitioner who sees every patient — they know a bit about everything but can't go deep. An MoE model is like a hospital with specialist departments. A triage desk (the router) examines each patient and sends them to the right specialists. The hospital has far more total expertise, but each patient only sees 1-2 doctors.
The router is a simple learned linear layer: W_g · x produces a score for each expert. Softmax converts these scores to probabilities, and top-k selection picks the winners. The selected experts process the token in parallel, and their outputs are combined using the gating weights.
The key challenge is load balancing. Without intervention, the router collapses — sending most tokens to a few "popular" experts while the rest go unused. This is called expert collapse, and it defeats the purpose of having multiple experts. An auxiliary loss term penalizes uneven routing distributions, keeping all experts utilized.
MoE scales remarkably well. Mixtral 8×7B activates only ~12.9B params per token yet matches Llama-2 70B on benchmarks with far less compute, and DeepSeek-V3 pushes the same recipe to 671B total / 37B active.
DeepSeek-V3 Architecture (2024): 671B total parameters with 37B active per token — 256 routed experts plus 1 shared expert per MoE layer, with top-8 routing. Two key innovations set it apart. First, auxiliary-loss-free load balancing: instead of adding an explicit balance loss term (which can destabilize training), DeepSeek-V3 adds small bias terms to expert routing scores that are dynamically adjusted to keep utilization even — same effect, cleaner gradients. Second, Multi-Token Prediction (MTP) as a training objective: the model predicts multiple future tokens per step, not just the next one. This improves data efficiency and sharpens representations. The full training run was reported at roughly 2.79M H800 GPU-hours (about $5.6M) — strikingly cheap for its capability tier. Expert parallelism distributes experts across GPUs so each GPU holds a subset of experts, and tokens are dispatched via all-to-all communication to whichever GPU hosts their assigned expert.
Expert parallelism in practice. MoE introduces a new dimension of parallelism. In a dense model, you split parameters across devices via tensor parallelism or pipeline parallelism. With MoE, you can shard experts across devices: each device holds a different subset of experts, and tokens are all-to-all routed to whichever device has their assigned expert. At DeepSeek-V3 scale this means spreading hundreds of experts across many GPUs, with tokens dynamically dispatched via high-speed interconnects. The downside is that every token causes at least one all-to-all communication: cheap for batch inference but latency-sensitive for interactive requests. This is why MoE excels at high-throughput batch workloads but adds overhead for single-stream queries (Clark et al., “Unified Scaling Laws for Routed Language Models,” 2022).
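A single-process sketch of that dispatch pattern — in a real deployment each "device" is a separate GPU and the grouping/return steps are all-to-all collectives over the interconnect:

import torch
import torch.nn as nn

# Toy expert-parallel dispatch: 8 experts sharded across 4 "devices" (2 each).
n_experts, n_devices, d_model = 8, 4, 16
experts = [nn.Linear(d_model, d_model) for _ in range(n_experts)]
device_of_expert = [i // 2 for i in range(n_experts)]      # expert id -> device id

tokens = torch.randn(10, d_model)
assigned = torch.randint(0, n_experts, (10,))              # top-1 routing for simplicity

out = torch.zeros_like(tokens)
for dev in range(n_devices):
    # "Send" every token whose expert lives on this device (the all-to-all step),
    # run the local experts, then write results back into the original order.
    for e in [i for i in range(n_experts) if device_of_expert[i] == dev]:
        mask = assigned == e
        if mask.any():
            out[mask] = experts[e](tokens[mask])
# out now holds each token's expert output, gathered back in original token order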
Quick check
Mixtral 8x7B stores 46.7B parameters but only 12.9B are active per token. What must be true for the inactive 33.8B parameters to still contribute value?
Why does MoE use a load balancing loss during training?
Step-by-Step Derivation
Router Gating Function
The router computes a gating score for each expert using a learned weight matrix W_g. Softmax normalizes the scores, then top-k keeps the highest:

g(x) = TopK(softmax(W_g · x), k)
MoE Layer Output
The final output is a weighted sum of the selected experts' outputs. Each expert FFN_i is a standard FFN, and g_i(x) is its gating weight (zero for non-selected experts):

y = Σ_i g_i(x) · FFN_i(x)
This replaces the single FFN in a standard Transformer block. The attention layers remain dense — MoE only applies to the feed-forward layers.
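Schematically, only the FFN slot of the block changes (a sketch assuming a pre-norm residual block; attn, ln1 and ln2 stand in for the base model's usual sub-modules):

# One Transformer block with an MoE feed-forward (pre-norm residual form).
# `attn`, `ln1`, `ln2` are the standard dense sub-modules; only `moe` is sparse.
def transformer_block(x, attn, moe, ln1, ln2):
    x = x + attn(ln1(x))   # attention stays dense: every parameter sees every token
    x = x + moe(ln2(x))    # MoE replaces the single FFN: top-k experts per token
    return x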
Load Balancing Loss
To prevent expert collapse, an auxiliary loss encourages uniform routing. f_i is the fraction of tokens routed to expert i, P_i is the average router probability for expert i, and α (typically 0.01-0.1) controls the penalty strength:

L_balance = α · N · Σ_i f_i · P_i
The product f_i · P_i is differentiable through P_i, which provides the gradient signal (f_i involves discrete routing decisions and carries no gradient). The factor N normalizes across different expert counts: under perfectly uniform routing, f_i = P_i = 1/N, so the loss equals α regardless of N.
PyTorch: MoE Forward Pass
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFN(nn.Module):
    # Minimal stand-in for each expert: a standard two-layer feed-forward block
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)

class MoELayer(nn.Module):
    def __init__(self, d_model, d_ff, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([FFN(d_model, d_ff) for _ in range(n_experts)])
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):
        scores = F.softmax(self.gate(x), dim=-1)                  # [batch, seq, n_experts]
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)   # pick top-k experts per token
        topk_scores = topk_scores / topk_scores.sum(dim=-1, keepdim=True)  # renormalize gates
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            expert_idx = topk_idx[..., k]
            for i, expert in enumerate(self.experts):
                mask = expert_idx == i
                if mask.any():
                    out[mask] += topk_scores[..., k:k+1][mask] * expert(x[mask])
        return out
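Quick sanity check for the layer above (arbitrary dimensions; the point is that the output shape matches a dense FFN's):

# The MoE layer is a drop-in replacement for a dense FFN.
layer = MoELayer(d_model=64, d_ff=256, n_experts=8, top_k=2)
x = torch.randn(4, 32, 64)      # [batch, seq, d_model]
y = layer(x)
print(y.shape)                  # torch.Size([4, 32, 64]) — same shape as a dense FFN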
PyTorch implementation

# MoE routing: top-k expert selection with load balancing loss
import torch
import torch.nn as nn
import torch.nn.functional as F

def moe_forward(x, gate, experts, top_k=2, balance_coeff=0.01):
    # x: (batch * seq_len, d_model); gate: linear router; experts: list of FFN modules
    scores = F.softmax(gate(x), dim=-1)                        # (N, n_experts)
    topk_scores, topk_idx = scores.topk(top_k, dim=-1)
    topk_scores = topk_scores / topk_scores.sum(dim=-1, keepdim=True)  # renormalize
    # Load balancing loss: penalize uneven routing
    f_i = torch.zeros(len(experts), device=x.device)
    for i in range(len(experts)):
        f_i[i] = (topk_idx == i).float().mean()                # fraction routed to expert i
    P_i = scores.mean(dim=0)                                   # avg router prob per expert
    balance_loss = balance_coeff * len(experts) * (f_i * P_i).sum()
    # Weighted sum of selected expert outputs
    out = torch.zeros_like(x)
    for k in range(top_k):
        for i, expert in enumerate(experts):
            mask = topk_idx[:, k] == i
            if mask.any():
                out[mask] += topk_scores[mask, k:k+1] * expert(x[mask])
    return out, balance_loss

Break It — See What Happens
Quick check
You remove the load balancing loss and observe that after 5000 steps, 80% of tokens route to 2 of 8 experts. What is the effective model capacity, and what should you try first?
Real-World Numbers
| Model | Total Params | Active Params | Experts | Top-k | Shared |
|---|---|---|---|---|---|
| Mixtral 8x7B | 46.7B | 12.9B | 8 | 2 | No |
| DeepSeek-V2 | 236B | 21B | 160 | 6 | 2 shared |
| DeepSeek-V3 | 671B | 37B | 256 | 8 | 1 shared |
| Qwen2-MoE | 14.3B | 2.7B | 60 | 4 | 4 shared |
| Switch Transformer | 1.6T | — | 2048 | 1 | No |
| GShard | 600B | — | 2048 | 2 | No |
Quick check
DeepSeek-V3 has 671B total parameters but only 37B active per token. Compared to Mixtral 8x7B (46.7B total / 12.9B active), which model has the higher parameter efficiency ratio?
MoE Frontier (2024–2025)
In 2024–2025, MoE became the dominant architecture for frontier-scale models. Three design patterns emerged: fine-grained experts (256+), shared expert backbones, and aux-loss-free balancing. Below are the key models and one architectural innovation that changes the routing axis entirely.
| Model | Total Params | Active Params | Experts | Top-k | Routing Strategy |
|---|---|---|---|---|---|
| Mixtral 8×7B (legacy) | 46.7B | 12.9B | 8 | 2 | Aux-loss balance |
| DeepSeek-V3 (Dec 2024) | 671B | 37B | 256 routed + 1 shared | 8 | Aux-loss-free bias |
| Llama 4 Scout (Apr 2025) | 109B | 17B | 16 | — | Interleaved dense/MoE |
| Llama 4 Maverick (Apr 2025) | 400B | 17B | 128 | — | Interleaved dense/MoE |
| Qwen3-235B-A22B (Apr 2025) | 235B | 22B | — | — | Thinking/non-thinking toggle |
Deep dive: DeepSeek-V3 aux-loss-free balancing
Standard MoE training adds an auxiliary load-balance loss (e.g., Mixtral) that penalises routing imbalance directly. The problem: this auxiliary gradient competes with the main language modelling loss, distorting expert specialisation.
DeepSeek-V3 removes the auxiliary loss entirely. Instead, it maintains a running per-expert bias term that is added to the router logits. If an expert is over-loaded relative to the target, its bias is decremented; if under-loaded, it is incremented. The language modelling gradient never “sees” the load signal — balancing is handled as a separate, gradient-free control loop. Result: cleaner gradients, better specialisation, and comparable balance to aux-loss training.
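A minimal sketch of that control loop — the sign-based update and step size gamma are illustrative assumptions, not DeepSeek's exact recipe:

import torch

def biased_topk_routing(scores, expert_bias, top_k=8):
    # Selection uses score + bias; the gate weights use the raw scores only,
    # so the balancing signal never enters the language-modelling gradient.
    _, topk_idx = (scores + expert_bias).topk(top_k, dim=-1)
    gates = torch.gather(scores, -1, topk_idx)
    gates = gates / gates.sum(dim=-1, keepdim=True)
    return topk_idx, gates

def update_bias(expert_bias, topk_idx, n_experts, gamma=1e-3):
    # Gradient-free control loop: decrement the bias of over-loaded experts,
    # increment under-loaded ones, nudging future routing toward balance.
    counts = torch.bincount(topk_idx.flatten(), minlength=n_experts).float()
    target = counts.mean()
    expert_bias += gamma * torch.sign(target - counts)
    return expert_bias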
Deep dive: Mixture of Depths (DeepMind, Apr 2024)
Standard MoE routes tokens to different expert FFNs within a layer. Mixture of Depths (Raposo et al., 2024) routes tokens to different layers — the router decides whether each token should pass through a transformer block or skip it entirely via a residual bypass.
This is distinct from early-exit (which stops at the first confident layer): all layers are available for all tokens across all examples. The router simply allocates FLOPs where they are most needed per token per layer. On a fixed FLOP budget, MoD models match baseline performance; on equal quality, they need fewer total FLOPs.
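A schematic of per-token depth routing (the sigmoid gate and fixed capacity are simplifications of the capacity-limited router in the paper):

import torch

def mod_block(x, router, block, capacity):
    # x: (seq, d_model). The router scores every token; only the `capacity`
    # highest-scoring tokens pass through the block, the rest take the residual
    # bypass and cost zero FLOPs in this layer.
    scores = router(x).squeeze(-1)                 # (seq,) one scalar per token
    chosen = scores.topk(capacity).indices
    out = x.clone()
    # Scale the block output by the (sigmoid) router score so the router
    # receives a gradient through the tokens it selects.
    out[chosen] = x[chosen] + torch.sigmoid(scores[chosen]).unsqueeze(-1) * block(x[chosen])
    return out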
Key Takeaways
What to remember for interviews
1. MoE replaces a single FFN with N expert FFNs — a learned router selects the top-k experts per token, leaving the rest idle.
2. Compute stays constant (only k of N experts run per token), but total model capacity scales with N — Mixtral 8x7B uses 12.9B active parameters out of 46.7B total.
3. Without a load-balancing auxiliary loss, routers collapse: one expert receives almost all tokens and the rest go untrained (expert collapse).
4. DeepSeek-V3 avoids the auxiliary loss entirely by dynamically adjusting per-expert bias terms, achieving the same balance effect with cleaner gradients.
5. Expert parallelism distributes experts across devices via all-to-all communication — efficient for high-throughput batch inference, but adds latency for single-stream requests.
6. Mixture of Depths (2024) extends the routing axis from expert selection to layer selection — tokens skip entire transformer blocks, reducing FLOPs without the limitations of early-exit.
Further Reading
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer — Shazeer et al. 2017 — the original MoE paper introducing sparsely-gated expert routing
- Mixtral of Experts — Mistral AI 2024 — open-weight MoE model with 8 experts per layer, top-2 routing
- DeepSeek-V3 Technical Report — DeepSeek 2024 — 671B MoE with auxiliary-loss-free load balancing and multi-token prediction
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity — Fedus et al. 2021 — top-1 routing with capacity factor and auxiliary balance loss; showed MoE scales to 1T+ params
- Lilian Weng — How to Train Really Large Models on Many GPUs — Covers expert parallelism, pipeline parallelism, and how MoE fits into distributed training strategies
- Unified Scaling Laws for Routed Language Models — Clark et al. 2022 — scaling laws specific to MoE: how performance scales with number of experts, active params, and total params
Mixtral 8x7B activates 12.9B of its 46.7B parameters per token. Why is the 3.6× gap between storage and compute the key design win?
Without a load-balancing loss, what steady state does the router converge to, and why?
An interviewer asks why you chose top-2 routing (Mixtral) over top-1 (Switch Transformer). What is the strongest tradeoff argument?
DeepSeek-V3 replaces the standard auxiliary balance loss with a dynamic bias on expert routing scores. What does this fix?
MoE adds all-to-all communication for expert routing. Under what workload does this overhead hurt most, and why?
The Switch Transformer uses a capacity factor C to cap tokens per expert. What happens when a token exceeds the capacity of its top-1 expert?
Interview Questions
Why does MoE achieve better performance per FLOP than dense models? ★★☆
Explain the routing mechanism in MoE — how does top-k expert selection work? ★★☆
What is the load balancing problem in MoE and how is it solved? ★★★
DeepSeek-V3 has 671B total parameters but only 37B active per token. Explain the architecture. ★★☆
Compare dense and MoE models at equal compute budget. What are the tradeoffs? ★★☆
What is expert collapse and how do auxiliary losses prevent it? ★★★
How does Mixtral's routing differ from Switch Transformer's? What are the implications? ★★★