
Transformer Math

Module 29 · Architectures

🧩 Mixture of Experts

DeepSeek-V3 has 671B params but each token only uses 37B

Dense Transformers use every parameter for every token. Mixture of Experts (MoE) breaks this constraint — a learned router selects a small subset of "expert" FFN layers per token, enabling models with hundreds of billions of parameters while keeping per-token compute fixed. This is how DeepSeek-V3 stores 671B parameters while using only 37B per token.

🎮

Expert Routing Simulator

What you're seeing

Select a token category. The router scores all 8 experts via softmax(W_g · x). The top-2 highest-scoring experts are activated (highlighted in accent). Their gate scores are renormalized to sum to 1 before weighting outputs.

What to try: switch between categories and watch which experts “specialize”. Notice that “code” and “math” share Expert 3 but activate different second experts — the router has learned overlapping representations.

Token category: Code (syntax, indentation, identifiers)

| Expert | Score | Renorm | Top-2 |
|--------|-------|--------|-------|
| 1 | 0.07 | — | |
| 2 | 0.06 | — | |
| 3 | 0.45 | 0.584 | ✓ |
| 4 | 0.04 | — | |
| 5 | 0.05 | — | |
| 6 | 0.03 | — | |
| 7 | 0.08 | — | |
| 8 | 0.32 | 0.416 | ✓ |

Router output: Expert 3 (gate 0.45) + Expert 8 (gate 0.32) — the remaining 6 experts receive zero gradient this step.
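The top-2 selection and gate renormalization can be reproduced in a few lines (scores taken from the simulator's "code" example):

```python
import numpy as np

# The 8 router scores from the simulator's "code" token example.
scores = np.array([0.07, 0.06, 0.45, 0.04, 0.05, 0.03, 0.08, 0.32])

top2 = np.argsort(scores)[-2:]              # indices of the two best experts
gates = scores[top2] / scores[top2].sum()   # renormalize winners to sum to 1

print(top2 + 1, gates)                      # experts 8 and 3, gates ≈ 0.416 / 0.584
```

Renormalizing only the selected scores (0.45 and 0.32) gives 0.45/0.77 ≈ 0.584 and 0.32/0.77 ≈ 0.416 — the numbers shown in the simulator.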
🗺️

MoE Layer Architecture

What you're seeing

A single MoE layer replacing a dense FFN. Each token passes through a lightweight router that scores all 8 experts. Only the top-2 scoring experts run — the other 6 sit idle (zero FLOPs that step). Their outputs are combined using the router's renormalized gate weights.

What to notice

The inactive experts (grey, dashed lines) still occupy GPU memory — that's the storage cost. But they contribute zero arithmetic that token, giving Mixtral 8×7B its 3.6× compute efficiency: 46.7B params stored, only 12.9B active per token.

[Diagram: Mixture-of-Experts layer (top-k = 2 of 8). Token x enters the router (W_g · x → softmax), which scores all 8 experts and selects the top-2. The two active experts run; the 6 inactive experts cost no FLOPs. Outputs are merged with renormalized gate weights: y = Σ gate_i · FFN_i(x). Mixtral 8×7B: 46.7B total params, only 12.9B active per token.]
💡

The Intuition

The core problem: expert collapse. MoE routes each token to 2 of 8 experts. Without load balancing, the router learns a shortcut: send everything to expert #1 (it's already the best, so the gradient reinforces it). The other 7 experts never train. You paid for 8× parameters but use 1×. The load balancing loss forces the router to spread tokens evenly across all experts.

The compute math: Dense 7B model — 7B params, 7B active per token. MoE Mixtral 8×7B — 46.7B params stored while activating only ~12.9B per token. You get roughly 3.6× more capacity essentially for free by storing knowledge in the experts that sit idle for any given token.
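A back-of-envelope check of these numbers (assuming the standard estimate of ~2 FLOPs per active parameter per token for a forward pass):

```python
# Back-of-envelope FLOPs comparison, assuming forward-pass cost scales
# with *active* parameters (~2 FLOPs per active param per token).
dense_active   = 7.0e9    # dense 7B: every parameter runs on every token
mixtral_active = 12.9e9   # Mixtral 8x7B: shared layers + 2 of 8 experts
mixtral_total  = 46.7e9   # Mixtral 8x7B: everything stored in memory

compute_ratio  = mixtral_active / dense_active   # extra compute vs. dense 7B
capacity_ratio = mixtral_total / mixtral_active  # params stored vs. params used

print(round(compute_ratio, 2), round(capacity_ratio, 2))
```

So Mixtral pays about 1.8× the per-token compute of a dense 7B model while storing 3.6× more parameters than it activates.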

Think of a hospital. A dense model is like one general practitioner who sees every patient — they know a bit about everything but can't go deep. An MoE model is like a hospital with specialist departments. A triage desk (the router) examines each patient and sends them to the right specialists. The hospital has far more total expertise, but each patient only sees 1-2 doctors.

The router is a simple learned linear layer: W_g · x produces a score for each expert. Softmax converts these to probabilities, and top-k selection picks the winners. The selected experts process the token in parallel, and their outputs are combined using the gating weights.

The key challenge is load balancing. Without intervention, the router collapses — sending most tokens to a few "popular" experts while the rest go unused. This is called expert collapse, and it defeats the purpose of having multiple experts. An auxiliary loss term penalizes uneven routing distributions, keeping all experts utilized.

MoE scales remarkably well. Mixtral 8×7B matches Llama-2 70B on benchmarks with far less active compute per token.

DeepSeek-V3 Architecture (2024): 671B total parameters — 256 routed experts plus 1 shared expert per MoE layer, with top-8 routing. Two key innovations set it apart. First, auxiliary-loss-free load balancing: instead of adding an explicit balance loss term (which can destabilize training), DeepSeek-V3 adds small bias terms to expert routing scores that are dynamically adjusted to keep utilization even — same effect, cleaner gradients. Second, Multi-Token Prediction (MTP) as a training objective: the model predicts multiple future tokens per step, not just the next one. This improves data efficiency and sharpens representations. Training reportedly took about 2.8M H800 GPU-hours, strikingly cheap for its capability tier. Expert parallelism distributes experts across GPUs so each GPU holds a subset of experts, and tokens are dispatched via all-to-all communication to whichever GPU hosts their assigned expert.

Expert parallelism in practice. MoE introduces a new dimension of parallelism. In a dense model, you split parameters across devices via tensor parallelism or pipeline parallelism. With MoE, you can shard experts across devices: each device holds a different subset of experts, and tokens are all-to-all routed — dynamically, over high-speed interconnects — to whichever device has their assigned expert. The downside is that every token causes at least one all-to-all communication: cheap for batch inference but latency-sensitive for interactive requests. This is why MoE excels at high-throughput batch workloads but adds overhead for single-stream queries (Clark et al., “Unified Scaling Laws for Routed Language Models,” 2022).
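The all-to-all cost can be sketched with a toy accounting simulation (the setup is assumed for illustration: 8 experts sharded 2-per-GPU over 4 GPUs, uniform top-2 routing, 4096 tokens resident on GPU 0):

```python
import numpy as np

# Toy all-to-all accounting for expert parallelism (assumed setup:
# 8 experts over 4 GPUs, 2 experts per GPU, top-2 routing).
rng = np.random.default_rng(0)
n_tokens, n_experts, n_gpus, top_k = 4096, 8, 4, 2
expert_to_gpu = np.repeat(np.arange(n_gpus), n_experts // n_gpus)  # [0,0,1,1,2,2,3,3]

# Each token picks 2 distinct experts (uniformly here, for illustration).
choices = np.array([rng.choice(n_experts, size=top_k, replace=False)
                    for _ in range(n_tokens)])

dest = expert_to_gpu[choices]           # which GPU hosts each chosen expert
remote = (dest != 0).sum()              # dispatches GPU 0 must ship elsewhere
frac_remote = remote / (n_tokens * top_k)
print(round(frac_remote, 3))            # most dispatches cross the interconnect
```

With experts spread uniformly, roughly 6 of every 8 expert assignments land on a remote GPU (~75% of dispatches), which is why interconnect bandwidth dominates MoE serving economics.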

✨ Insight · MoE is not about making models bigger — it's about decoupling capacity from compute. You store more knowledge without paying more FLOPs per token.

Quick check

Derivation

Mixtral 8x7B stores 46.7B parameters but only 12.9B are active per token. What must be true for the inactive 33.8B parameters to still contribute value?

Quick Check

Why does MoE use a load balancing loss during training?

📐

Step-by-Step Derivation

Router Gating Function

The router computes a gating score for each expert using a learned weight matrix W_g. Softmax normalizes the scores, then top-k keeps only the highest:

g(x) = softmax(W_g · x),    G_i(x) = g_i(x) if i ∈ TopK(g(x), k), else 0

💡 Tip · Only the top-k gating weights are non-zero. The selected weights are renormalized to sum to 1, ensuring the output scale is consistent regardless of which experts are chosen.

MoE Layer Output

The final output is a weighted sum of the selected experts' outputs. Each expert FFN_i is a standard FFN, and g_i is its gating weight (zero for non-selected experts):

y = Σ_{i=1}^{N} g_i · FFN_i(x)

This replaces the single FFN in a standard Transformer block. The attention layers remain dense — MoE only applies to the feed-forward layers.

Load Balancing Loss

To prevent expert collapse, an auxiliary loss encourages uniform routing. f_i is the fraction of tokens routed to expert i, P_i is the average router probability for expert i, and α (typically 0.01-0.1) controls the penalty strength:

L_balance = α · N · Σ_{i=1}^{N} f_i · P_i

The product f_i · P_i is differentiable through P_i, which provides the gradient signal (f_i involves discrete routing decisions and carries no gradient). The factor N normalizes the loss across different expert counts.

PyTorch: MoE Forward Pass

import torch
import torch.nn as nn
import torch.nn.functional as F

class FFN(nn.Module):
    """Standard two-layer MLP expert."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)

class MoELayer(nn.Module):
    def __init__(self, d_model, d_ff, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([FFN(d_model, d_ff) for _ in range(n_experts)])
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):
        scores = F.softmax(self.gate(x), dim=-1)  # [batch, seq, n_experts]
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        topk_scores = topk_scores / topk_scores.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(x)
        for k in range(self.top_k):
            expert_idx = topk_idx[..., k]
            for i, expert in enumerate(self.experts):
                mask = expert_idx == i
                if mask.any():
                    out[mask] += topk_scores[..., k:k+1][mask] * expert(x[mask])
        return out
PyTorch implementation
# MoE routing: top-k expert selection with load balancing loss
import torch
import torch.nn as nn
import torch.nn.functional as F

def moe_forward(x, gate, experts, top_k=2, balance_coeff=0.01):
    # x: (batch * seq_len, d_model)
    scores = F.softmax(gate(x), dim=-1)           # (N, n_experts)
    topk_scores, topk_idx = scores.topk(top_k, dim=-1)
    topk_scores /= topk_scores.sum(dim=-1, keepdim=True)  # renormalize

    # Load balancing loss: penalize uneven routing
    f_i = torch.zeros(len(experts), device=x.device)
    for i in range(len(experts)):
        f_i[i] = (topk_idx == i).float().mean()  # fraction routed to expert i
    P_i = scores.mean(dim=0)                       # avg router prob per expert
    balance_loss = balance_coeff * len(experts) * (f_i * P_i).sum()

    # Weighted sum of selected expert outputs
    out = torch.zeros_like(x)
    for k in range(top_k):
        for i, expert in enumerate(experts):
            mask = topk_idx[:, k] == i
            if mask.any():
                out[mask] += topk_scores[mask, k:k+1] * expert(x[mask])
    return out, balance_loss
🔧

Break It — See What Happens

- Remove the load balancing loss
- Use top-1 instead of top-2
- Remove the shared expert
- Set top-k to all experts (k = N)

Quick check

Derivation

You remove the load balancing loss and observe that after 5000 steps, 80% of tokens route to 2 of 8 experts. What is the effective model capacity, and what should you try first?

📊

Real-World Numbers

| Model | Total Params | Active Params | Experts | Top-k | Shared |
|-------|--------------|---------------|---------|-------|--------|
| Mixtral 8x7B | 46.7B | 12.9B | 8 | 2 | No |
| DeepSeek-V2 | 236B | 21B | 160 | 6 | 2 shared |
| DeepSeek-V3 | 671B | 37B | 256 | 8 | 1 shared |
| Qwen2-MoE | 14.3B | 2.7B | 60 | 4 | 4 shared |
| Switch Transformer | — | — | 2048 | 1 | No |
| GShard | — | — | 2048 | 2 | No |
✨ Insight · The trend is toward more experts with finer granularity. Switch Transformer (2021) showed top-1 with many experts works. DeepSeek-V3 (2024) uses 256 fine-grained experts with top-8 plus a shared expert — maximizing specialization while maintaining routing diversity.

Quick check

Derivation

DeepSeek-V3 has 671B total parameters but only 37B active per token. Compared to Mixtral 8x7B (46.7B total / 12.9B active), which model has the higher parameter efficiency ratio?

🚀

MoE Frontier (2024–2025)

In 2024–2025, MoE became the dominant architecture for frontier-scale models. Three design patterns emerged: fine-grained experts (256+), shared expert backbones, and aux-loss-free balancing. Below are the key models and one architectural innovation that changes the routing axis entirely.

| Model | Total Params | Active Params | Experts | Top-k | Routing Strategy |
|-------|--------------|---------------|---------|-------|------------------|
| Mixtral 8×7B (legacy) | 46.7B | 12.9B | 8 | 2 | Aux-loss balance |
| DeepSeek-V3 (Dec 2024) | 671B | 37B | 256 routed + 1 shared | 8 | Aux-loss-free bias |
| Llama 4 Scout (Apr 2025) | 109B | 17B | 16 | — | Interleaved dense/MoE |
| Llama 4 Maverick (Apr 2025) | 400B | 17B | 128 | — | Interleaved dense/MoE |
| Qwen3-235B-A22B (Apr 2025) | 235B | 22B | — | — | Thinking/non-thinking toggle |
Deep dive: DeepSeek-V3 aux-loss-free balancing

Standard MoE training adds an auxiliary load-balance loss (e.g., Mixtral) that directly penalises uneven routing. The problem: this auxiliary gradient competes with the main language modelling loss, distorting expert specialisation.

DeepSeek-V3 removes the auxiliary loss entirely. Instead, it maintains a running per-expert bias term that is added to the router logits. If an expert is over-loaded relative to the target, its bias is decremented; if under-loaded, it is incremented. The language modelling gradient never “sees” the load signal — balancing is handled as a separate, gradient-free control loop. Result: cleaner gradients, better specialisation, and comparable balance to aux-loss training.

Source: DeepSeek-V3 Technical Report, arxiv:2412.19437
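The control loop can be sketched as a toy, gradient-free simulation. Everything here is an illustrative assumption (8 experts, top-2, a router that systematically favors expert 0, step size 0.01), not DeepSeek's actual configuration:

```python
import numpy as np

# Toy sketch of aux-loss-free balancing: a per-expert bias, adjusted by a
# sign rule outside the gradient path. All constants are assumptions.
rng = np.random.default_rng(0)
n_experts, top_k, gamma = 8, 2, 0.01

def router_logits(n_tokens):
    logits = rng.normal(size=(n_tokens, n_experts))
    logits[:, 0] += 2.0            # built-in imbalance favoring expert 0
    return logits

def load(logits, bias):
    # Bias enters expert *selection* only; gate weights would use raw scores.
    idx = np.argsort(logits + bias, axis=-1)[:, -top_k:]
    return np.bincount(idx.ravel(), minlength=n_experts) / idx.size

bias = np.zeros(n_experts)
before = load(router_logits(4096), bias)

for _ in range(300):               # control loop: nudge biases toward uniform
    f = load(router_logits(1024), bias)
    bias -= gamma * np.sign(f - 1.0 / n_experts)

after = load(router_logits(4096), bias)
print(round(before.max(), 2), round(after.max(), 2))
```

Without correction, expert 0 grabs a large share of routing slots; after the bias loop runs, per-expert load is close to the uniform 1/8 — and no gradient ever flowed through the balancing mechanism.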

Deep dive: Mixture of Depths (DeepMind, Apr 2024)

Standard MoE routes tokens to different expert FFNs within a layer. Mixture of Depths (Raposo et al., 2024) routes tokens to different layers — the router decides whether each token should pass through a transformer block or skip it entirely via a residual bypass.

This is distinct from early-exit (which stops at the first confident layer): all layers are available for all tokens across all examples. The router simply allocates FLOPs where they are most needed per token per layer. On a fixed FLOP budget, MoD models match baseline performance; on equal quality, they need fewer total FLOPs.

Source: Mixture of Depths, arxiv:2404.02258
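The per-token block-skip idea can be sketched in a few lines. Assumptions in this sketch: a fixed 50% capacity, a sigmoid gate on the router score (the paper weights block output by the router score to keep routing differentiable), and a single Linear standing in for a full transformer block:

```python
import torch
import torch.nn as nn

# Minimal Mixture-of-Depths sketch: per-token routing around a block.
# Assumptions: capacity = 50% of tokens, Linear stands in for attn+FFN.
class MoDBlock(nn.Module):
    def __init__(self, d_model, capacity=0.5):
        super().__init__()
        self.block = nn.Linear(d_model, d_model)  # stand-in for a full block
        self.router = nn.Linear(d_model, 1)       # scalar score per token
        self.capacity = capacity

    def forward(self, x):                         # x: (batch, seq, d_model)
        b, s, d = x.shape
        scores = self.router(x).squeeze(-1)       # (batch, seq)
        k = max(1, int(s * self.capacity))
        top = scores.topk(k, dim=-1)              # tokens that get compute
        idx = top.indices.unsqueeze(-1).expand(b, k, d)
        sel = torch.gather(x, 1, idx)             # gather selected tokens
        gate = torch.sigmoid(top.values).unsqueeze(-1)
        upd = sel + gate * self.block(sel)        # gated residual update
        return x.scatter(1, idx, upd)             # others bypass untouched

torch.manual_seed(0)
x = torch.randn(2, 16, 32)
y = MoDBlock(32)(x)
changed = (y != x).any(-1).sum(-1)                # tokens modified per sequence
print(y.shape, changed)
```

At 50% capacity, at most 8 of the 16 tokens per sequence pass through the block; the rest flow through the residual bypass at zero FLOPs, exactly the budget-allocation behaviour MoD exploits.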

✨ Insight · Llama 4 Scout and Maverick use the same 17B active parameters despite Scout having 109B total and Maverick 400B total. Active params determine inference latency; total params determine specialisation capacity. Maverick's 128-expert pool gives far more routing diversity than Scout's 16 — at the same per-token cost.
🧠

Key Takeaways

What to remember for interviews

  1. MoE replaces a single FFN with N expert FFNs — a learned router selects the top-k experts per token, leaving the rest idle.
  2. Compute stays constant (only k of N experts run per token), but total model capacity scales with N — Mixtral 8x7B uses 12.9B active parameters out of 46.7B total.
  3. Without a load-balancing auxiliary loss, routers collapse: one expert receives almost all tokens and the rest go untrained (expert collapse).
  4. DeepSeek-V3 avoids the auxiliary loss entirely by dynamically adjusting per-expert bias terms, achieving the same balance effect with cleaner gradients.
  5. Expert parallelism distributes experts across devices via all-to-all communication — efficient for high-throughput batch inference, but adds latency for single-stream requests.
  6. Mixture of Depths (2024) extends the routing axis from expert selection to layer selection — tokens skip entire transformer blocks, reducing FLOPs without the limitations of early-exit.
📚

Further Reading

Derivation

Mixtral 8x7B activates 12.9B of its 46.7B parameters per token. Why is the 3.6× gap between storage and compute the key design win?

Derivation

Without a load-balancing loss, what steady state does the router converge to, and why?

Trade-off

An interviewer asks why you chose top-2 routing (Mixtral) over top-1 (Switch Transformer). What is the strongest tradeoff argument?

Trade-off

DeepSeek-V3 replaces the standard auxiliary balance loss with a dynamic bias on expert routing scores. What does this fix?

Trade-off

MoE adds all-to-all communication for expert routing. Under what workload does this overhead hurt most, and why?

Derivation

The Switch Transformer uses a capacity factor C to cap tokens per expert. What happens when a token exceeds the capacity of its top-1 expert?

🎯

Interview Questions


Why does MoE achieve better performance per FLOP than dense models?

★★☆
Google · Meta

Explain the routing mechanism in MoE — how does top-k expert selection work?

★★☆
Google

What is the load balancing problem in MoE and how is it solved?

★★★
Google · Meta

DeepSeek-V3 has 671B total parameters but only 37B active per token. Explain the architecture.

★★☆
Anthropic · OpenAI

Compare dense and MoE models at equal compute budget. What are the tradeoffs?

★★☆
Meta

What is expert collapse and how do auxiliary losses prevent it?

★★★
Google

How does Mixtral's routing differ from Switch Transformer's? What are the implications?

★★★
Meta