
Transformer Math

Module 29 · Architectures

🧩 Mixture of Experts

DeepSeek-V3 has 671B params but each token only uses 37B

Dense Transformers use every parameter for every token. Mixture of Experts (MoE) breaks this constraint — a learned router selects a small subset of "expert" FFN layers per token, enabling models with hundreds of billions of parameters while keeping per-token compute fixed. This is how DeepSeek-V3 stores 671B parameters while using only 37B per token.

🎮

Expert Routing Simulator

What you're seeing

Select a token category. The router scores all 8 experts via softmax(W_g · x). The top-2 highest-scoring experts are activated (highlighted in accent). Their gate scores are renormalized to sum to 1 before weighting outputs.

What to try: switch between categories and watch which experts “specialize”. Notice that “code” and “math” share Expert 3 but activate different second experts — the router has learned overlapping representations.

Token category: Code (syntax, indentation, identifiers)

| Expert | Score | Renorm | Top-2 |
|--------|-------|--------|-------|
| 1 | 0.07 | — | |
| 2 | 0.06 | — | |
| 3 | 0.45 | 0.584 | ✓ |
| 4 | 0.04 | — | |
| 5 | 0.05 | — | |
| 6 | 0.03 | — | |
| 7 | 0.08 | — | |
| 8 | 0.32 | 0.416 | ✓ |

Router output: Expert 3 (gate 0.45) + Expert 8 (gate 0.32) — the remaining 6 experts receive zero gradient this step.
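The top-2 selection and gate renormalization can be reproduced in a few lines (scores taken from the simulator's "code" example):

```python
import numpy as np

# The 8 router scores from the simulator's "code" token example.
scores = np.array([0.07, 0.06, 0.45, 0.04, 0.05, 0.03, 0.08, 0.32])

top2 = np.argsort(scores)[-2:]              # indices of the two best experts
gates = scores[top2] / scores[top2].sum()   # renormalize winners to sum to 1

print(top2 + 1, gates)                      # experts 8 and 3, gates ≈ 0.416 / 0.584
```

Renormalizing only the selected scores (0.45 and 0.32) gives 0.45/0.77 ≈ 0.584 and 0.32/0.77 ≈ 0.416 — the numbers shown in the simulator.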
🗺️

MoE Layer Architecture

What you're seeing

A single MoE layer replacing a dense FFN. Each token passes through a lightweight router that scores all 8 experts. Only the top-2 scoring experts run — the other 6 sit idle (zero FLOPs that step). Their outputs are combined using the router's renormalized gate weights.

What to notice

The inactive experts (grey, dashed lines) still occupy GPU memory — that's the storage cost. But they contribute zero arithmetic that token, giving Mixtral 8×7B its 3.6× compute efficiency: 46.7B params stored, only 12.9B active per token.

[Diagram: Mixture-of-Experts layer (top-k = 2 of 8). Token x enters the router (W_g · x → softmax), which scores all 8 experts and selects the top-2. The two active experts run; the 6 inactive experts cost no FLOPs. Outputs are merged with renormalized gate weights: y = Σ gate_i · FFN_i(x). Mixtral 8×7B: 46.7B total params, only 12.9B active per token.]
💡

The Intuition

The core problem: expert collapse. MoE routes each token to 2 of 8 experts. Without load balancing, the router learns a shortcut: send everything to expert #1 (it's already the best, so the gradient reinforces it). The other 7 experts never train. You paid for 8× parameters but use 1×. The load balancing loss forces the router to spread tokens evenly across all experts.

The compute math: Dense 7B model — 7B params, 7B active per token. MoE Mixtral 8×7B — 46.7B params stored while activating only ~12.9B per token. You get roughly 3.6× more capacity essentially for free by storing knowledge in the experts that sit idle for any given token.
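A back-of-envelope check of these numbers (assuming the standard estimate of ~2 FLOPs per active parameter per token for a forward pass):

```python
# Back-of-envelope FLOPs comparison, assuming forward-pass cost scales
# with *active* parameters (~2 FLOPs per active param per token).
dense_active   = 7.0e9    # dense 7B: every parameter runs on every token
mixtral_active = 12.9e9   # Mixtral 8x7B: shared layers + 2 of 8 experts
mixtral_total  = 46.7e9   # Mixtral 8x7B: everything stored in memory

compute_ratio  = mixtral_active / dense_active   # extra compute vs. dense 7B
capacity_ratio = mixtral_total / mixtral_active  # params stored vs. params used

print(round(compute_ratio, 2), round(capacity_ratio, 2))
```

So Mixtral pays about 1.8× the per-token compute of a dense 7B model while storing 3.6× more parameters than it activates.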

Think of a hospital. A dense model is like one general practitioner who sees every patient — they know a bit about everything but can't go deep. An MoE model is like a hospital with specialist departments. A triage desk (the router) examines each patient and sends them to the right specialists. The hospital has far more total expertise, but each patient only sees 1-2 doctors.

The router is a simple learned linear layer: W_g · x produces a score for each expert. Softmax converts these to probabilities, and top-k selection picks the winners. The selected experts process the token in parallel, and their outputs are combined using the gating weights.

The key challenge is load balancing. Without intervention, the router collapses — sending most tokens to a few "popular" experts while the rest go unused. This is called expert collapse, and it defeats the purpose of having multiple experts. An auxiliary loss term penalizes uneven routing distributions, keeping all experts utilized.

MoE scales remarkably well. Mixtral 8×7B matches Llama-2 70B on benchmarks with far less active compute per token.

DeepSeek-V3 Architecture (2024): 671B total parameters — 256 routed experts plus 1 shared expert per MoE layer, with top-8 routing. Two key innovations set it apart. First, auxiliary-loss-free load balancing: instead of adding an explicit balance loss term (which can destabilize training), DeepSeek-V3 adds small bias terms to expert routing scores that are dynamically adjusted to keep utilization even — same effect, cleaner gradients. Second, Multi-Token Prediction (MTP) as a training objective: the model predicts multiple future tokens per step, not just the next one. This improves data efficiency and sharpens representations. Training reportedly took about 2.8M H800 GPU-hours, strikingly cheap for its capability tier. Expert parallelism distributes experts across GPUs so each GPU holds a subset of experts, and tokens are dispatched via all-to-all communication to whichever GPU hosts their assigned expert.

Expert parallelism in practice. MoE introduces a new dimension of parallelism. In a dense model, you split parameters across devices via tensor parallelism or pipeline parallelism. With MoE, you can shard experts across devices: each device holds a different subset of experts, and tokens are all-to-all routed — dynamically, over high-speed interconnects — to whichever device has their assigned expert. The downside is that every token causes at least one all-to-all communication: cheap for batch inference but latency-sensitive for interactive requests. This is why MoE excels at high-throughput batch workloads but adds overhead for single-stream queries (Clark et al., “Unified Scaling Laws for Routed Language Models,” 2022).
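The all-to-all cost can be sketched with a toy accounting simulation (the setup is assumed for illustration: 8 experts sharded 2-per-GPU over 4 GPUs, uniform top-2 routing, 4096 tokens resident on GPU 0):

```python
import numpy as np

# Toy all-to-all accounting for expert parallelism (assumed setup:
# 8 experts over 4 GPUs, 2 experts per GPU, top-2 routing).
rng = np.random.default_rng(0)
n_tokens, n_experts, n_gpus, top_k = 4096, 8, 4, 2
expert_to_gpu = np.repeat(np.arange(n_gpus), n_experts // n_gpus)  # [0,0,1,1,2,2,3,3]

# Each token picks 2 distinct experts (uniformly here, for illustration).
choices = np.array([rng.choice(n_experts, size=top_k, replace=False)
                    for _ in range(n_tokens)])

dest = expert_to_gpu[choices]           # which GPU hosts each chosen expert
remote = (dest != 0).sum()              # dispatches GPU 0 must ship elsewhere
frac_remote = remote / (n_tokens * top_k)
print(round(frac_remote, 3))            # most dispatches cross the interconnect
```

With experts spread uniformly, roughly 6 of every 8 expert assignments land on a remote GPU (~75% of dispatches), which is why interconnect bandwidth dominates MoE serving economics.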

✨ Insight · MoE is not about making models bigger — it's about decoupling capacity from compute. You store more knowledge without paying more FLOPs per token.

Quick check

Derivation

Mixtral 8x7B stores 46.7B parameters but only 12.9B are active per token. What must be true for the inactive 33.8B parameters to still contribute value?

Quick Check

Why does MoE use a load balancing loss during training?

📐

Step-by-Step Derivation

Router Gating Function

The router computes a gating score for each expert using a learned weight matrix W_g. Softmax normalizes the scores, then top-k keeps only the highest:

g(x) = softmax(W_g · x),    G_i(x) = g_i(x) if i ∈ TopK(g(x), k), else 0

💡 Tip · Only the top-k gating weights are non-zero. The selected weights are renormalized to sum to 1, ensuring the output scale is consistent regardless of which experts are chosen.

MoE Layer Output

The final output is a weighted sum of the selected experts' outputs. Each expert FFN_i is a standard FFN, and g_i is its gating weight (zero for non-selected experts):

y = Σ_{i=1}^{N} g_i · FFN_i(x)

This replaces the single FFN in a standard Transformer block. The attention layers remain dense — MoE only applies to the feed-forward layers.

Load Balancing Loss

To prevent expert collapse, an auxiliary loss encourages uniform routing. f_i is the fraction of tokens routed to expert i, P_i is the average router probability for expert i, and α (typically 0.01-0.1) controls the penalty strength:

L_balance = α · N · Σ_{i=1}^{N} f_i · P_i

The product f_i · P_i is differentiable through P_i, which provides the gradient signal (f_i involves discrete routing decisions and carries no gradient). The factor N normalizes the loss across different expert counts.

PyTorch: MoE Forward Pass

import torch
import torch.nn as nn
import torch.nn.functional as F

class FFN(nn.Module):
    """Standard two-layer MLP expert."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)

class MoELayer(nn.Module):
    def __init__(self, d_model, d_ff, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([FFN(d_model, d_ff) for _ in range(n_experts)])
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):
        scores = F.softmax(self.gate(x), dim=-1)  # [batch, seq, n_experts]
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        topk_scores = topk_scores / topk_scores.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(x)
        for k in range(self.top_k):
            expert_idx = topk_idx[..., k]
            for i, expert in enumerate(self.experts):
                mask = expert_idx == i
                if mask.any():
                    out[mask] += topk_scores[..., k:k+1][mask] * expert(x[mask])
        return out
PyTorch implementation
# MoE routing: top-k expert selection with load balancing loss
import torch
import torch.nn as nn
import torch.nn.functional as F

def moe_forward(x, gate, experts, top_k=2, balance_coeff=0.01):
    # x: (batch * seq_len, d_model)
    scores = F.softmax(gate(x), dim=-1)           # (N, n_experts)
    topk_scores, topk_idx = scores.topk(top_k, dim=-1)
    topk_scores /= topk_scores.sum(dim=-1, keepdim=True)  # renormalize

    # Load balancing loss: penalize uneven routing
    f_i = torch.zeros(len(experts), device=x.device)
    for i in range(len(experts)):
        f_i[i] = (topk_idx == i).float().mean()  # fraction routed to expert i
    P_i = scores.mean(dim=0)                       # avg router prob per expert
    balance_loss = balance_coeff * len(experts) * (f_i * P_i).sum()

    # Weighted sum of selected expert outputs
    out = torch.zeros_like(x)
    for k in range(top_k):
        for i, expert in enumerate(experts):
            mask = topk_idx[:, k] == i
            if mask.any():
                out[mask] += topk_scores[mask, k:k+1] * expert(x[mask])
    return out, balance_loss
🔧

Break It — See What Happens

- Remove the load balancing loss
- Use top-1 instead of top-2
- Remove the shared expert
- Set top-k to all experts (k = N)

Quick check

Derivation

You remove the load balancing loss and observe that after 5000 steps, 80% of tokens route to 2 of 8 experts. What is the effective model capacity, and what should you try first?

📊

Real-World Numbers

| Model | Total Params | Active Params | Experts | Top-k | Shared |
|-------|--------------|---------------|---------|-------|--------|
| Mixtral 8x7B | 46.7B | 12.9B | 8 | 2 | No |
| DeepSeek-V2 | 236B | 21B | 160 | 6 | 2 shared |
| DeepSeek-V3 | 671B | 37B | 256 | 8 | 1 shared |
| Qwen2-MoE | 14.3B | 2.7B | 60 | 4 | 4 shared |
| Switch Transformer | — | — | 2048 | 1 | No |
| GShard | — | — | 2048 | 2 | No |
✨ Insight · The trend is toward more experts with finer granularity. Switch Transformer (2021) showed top-1 with many experts works. DeepSeek-V3 (2024) uses 256 fine-grained experts with top-8 plus a shared expert — maximizing specialization while maintaining routing diversity.

Quick check

Derivation

DeepSeek-V3 has 671B total parameters but only 37B active per token. Compared to Mixtral 8x7B (46.7B total / 12.9B active), which model has the higher parameter efficiency ratio?

🚀

MoE Frontier (2024–2025)

In 2024–2025, MoE became the dominant architecture for frontier-scale models. Three design patterns emerged: fine-grained experts (256+), shared expert backbones, and aux-loss-free balancing. Below are the key models and one architectural innovation that changes the routing axis entirely.

| Model | Total Params | Active Params | Experts | Top-k | Routing Strategy |
|-------|--------------|---------------|---------|-------|------------------|
| Mixtral 8×7B (legacy) | 46.7B | 12.9B | 8 | 2 | Aux-loss balance |
| DeepSeek-V3 (Dec 2024) | 671B | 37B | 256 routed + 1 shared | 8 | Aux-loss-free bias |
| Llama 4 Scout (Apr 2025) | 109B | 17B | 16 | — | Interleaved dense/MoE |
| Llama 4 Maverick (Apr 2025) | 400B | 17B | 128 | — | Interleaved dense/MoE |
| Qwen3-235B-A22B (Apr 2025) | 235B | 22B | — | — | Thinking/non-thinking toggle |
Deep dive: DeepSeek-V3 aux-loss-free balancing

Standard MoE training adds an auxiliary load-balance loss (e.g., Mixtral) that directly penalises uneven routing. The problem: this auxiliary gradient competes with the main language modelling loss, distorting expert specialisation.

DeepSeek-V3 removes the auxiliary loss entirely. Instead, it maintains a running per-expert bias term that is added to the router logits. If an expert is over-loaded relative to the target, its bias is decremented; if under-loaded, it is incremented. The language modelling gradient never “sees” the load signal — balancing is handled as a separate, gradient-free control loop. Result: cleaner gradients, better specialisation, and comparable balance to aux-loss training.

Source: DeepSeek-V3 Technical Report, arxiv:2412.19437
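The control loop can be sketched as a toy, gradient-free simulation. Everything here is an illustrative assumption (8 experts, top-2, a router that systematically favors expert 0, step size 0.01), not DeepSeek's actual configuration:

```python
import numpy as np

# Toy sketch of aux-loss-free balancing: a per-expert bias, adjusted by a
# sign rule outside the gradient path. All constants are assumptions.
rng = np.random.default_rng(0)
n_experts, top_k, gamma = 8, 2, 0.01

def router_logits(n_tokens):
    logits = rng.normal(size=(n_tokens, n_experts))
    logits[:, 0] += 2.0            # built-in imbalance favoring expert 0
    return logits

def load(logits, bias):
    # Bias enters expert *selection* only; gate weights would use raw scores.
    idx = np.argsort(logits + bias, axis=-1)[:, -top_k:]
    return np.bincount(idx.ravel(), minlength=n_experts) / idx.size

bias = np.zeros(n_experts)
before = load(router_logits(4096), bias)

for _ in range(300):               # control loop: nudge biases toward uniform
    f = load(router_logits(1024), bias)
    bias -= gamma * np.sign(f - 1.0 / n_experts)

after = load(router_logits(4096), bias)
print(round(before.max(), 2), round(after.max(), 2))
```

Without correction, expert 0 grabs a large share of routing slots; after the bias loop runs, per-expert load is close to the uniform 1/8 — and no gradient ever flowed through the balancing mechanism.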

Deep dive: Mixture of Depths (DeepMind, Apr 2024)

Standard MoE routes tokens to different expert FFNs within a layer. Mixture of Depths (Raposo et al., 2024) routes tokens to different layers — the router decides whether each token should pass through a transformer block or skip it entirely via a residual bypass.

This is distinct from early-exit (which stops at the first confident layer): all layers are available for all tokens across all examples. The router simply allocates FLOPs where they are most needed per token per layer. On a fixed FLOP budget, MoD models match baseline performance; on equal quality, they need fewer total FLOPs.

Source: Mixture of Depths, arxiv:2404.02258
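The per-token block-skip idea can be sketched in a few lines. Assumptions in this sketch: a fixed 50% capacity, a sigmoid gate on the router score (the paper weights block output by the router score to keep routing differentiable), and a single Linear standing in for a full transformer block:

```python
import torch
import torch.nn as nn

# Minimal Mixture-of-Depths sketch: per-token routing around a block.
# Assumptions: capacity = 50% of tokens, Linear stands in for attn+FFN.
class MoDBlock(nn.Module):
    def __init__(self, d_model, capacity=0.5):
        super().__init__()
        self.block = nn.Linear(d_model, d_model)  # stand-in for a full block
        self.router = nn.Linear(d_model, 1)       # scalar score per token
        self.capacity = capacity

    def forward(self, x):                         # x: (batch, seq, d_model)
        b, s, d = x.shape
        scores = self.router(x).squeeze(-1)       # (batch, seq)
        k = max(1, int(s * self.capacity))
        top = scores.topk(k, dim=-1)              # tokens that get compute
        idx = top.indices.unsqueeze(-1).expand(b, k, d)
        sel = torch.gather(x, 1, idx)             # gather selected tokens
        gate = torch.sigmoid(top.values).unsqueeze(-1)
        upd = sel + gate * self.block(sel)        # gated residual update
        return x.scatter(1, idx, upd)             # others bypass untouched

torch.manual_seed(0)
x = torch.randn(2, 16, 32)
y = MoDBlock(32)(x)
changed = (y != x).any(-1).sum(-1)                # tokens modified per sequence
print(y.shape, changed)
```

At 50% capacity, at most 8 of the 16 tokens per sequence pass through the block; the rest flow through the residual bypass at zero FLOPs, exactly the budget-allocation behaviour MoD exploits.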

✨ Insight · Llama 4 Scout and Maverick use the same 17B active parameters despite Scout having 109B total and Maverick 400B total. Active params determine inference latency; total params determine specialisation capacity. Maverick's 128-expert pool gives far more routing diversity than Scout's 16 — at the same per-token cost.
🧠

Key Takeaways

What to remember for interviews

  1. MoE replaces a single FFN with N expert FFNs — a learned router selects the top-k experts per token, leaving the rest idle.
  2. Compute stays constant (only k of N experts run per token), but total model capacity scales with N — Mixtral 8x7B uses 12.9B active parameters out of 46.7B total.
  3. Without a load-balancing auxiliary loss, routers collapse: one expert receives almost all tokens and the rest go untrained (expert collapse).
  4. DeepSeek-V3 avoids the auxiliary loss entirely by dynamically adjusting per-expert bias terms, achieving the same balance effect with cleaner gradients.
  5. Expert parallelism distributes experts across devices via all-to-all communication — efficient for high-throughput batch inference, but adds latency for single-stream requests.
  6. Mixture of Depths (2024) extends the routing axis from expert selection to layer selection — tokens skip entire transformer blocks, reducing FLOPs without the limitations of early-exit.
📚

Further Reading

Derivation

Mixtral 8x7B activates 12.9B of its 46.7B parameters per token. Why is the 3.6× gap between storage and compute the key design win?

Derivation

Without a load-balancing loss, what steady state does the router converge to, and why?

Trade-off

An interviewer asks why you chose top-2 routing (Mixtral) over top-1 (Switch Transformer). What is the strongest tradeoff argument?

Trade-off

DeepSeek-V3 replaces the standard auxiliary balance loss with a dynamic bias on expert routing scores. What does this fix?

Trade-off

MoE adds all-to-all communication for expert routing. Under what workload does this overhead hurt most, and why?

Derivation

The Switch Transformer uses a capacity factor C to cap tokens per expert. What happens when a token exceeds the capacity of its top-1 expert?

🎯

Interview Questions


Why does MoE achieve better performance per FLOP than dense models?

★★☆
Google · Meta

Explain the routing mechanism in MoE — how does top-k expert selection work?

★★☆
Google

What is the load balancing problem in MoE and how is it solved?

★★★
Google · Meta

DeepSeek-V3 has 671B total parameters but only 37B active per token. Explain the architecture.

★★☆
Anthropic · OpenAI

Compare dense and MoE models at equal compute budget. What are the tradeoffs?

★★☆
Meta

What is expert collapse and how do auxiliary losses prevent it?

★★★
Google

How does Mixtral's routing differ from Switch Transformer's? What are the implications?

★★★
Meta