
Transformer Math

Module 8 · The Transformer

🔗 LayerNorm & Residuals

Delete one line and a 96-layer model becomes untrainable


Normalization is the unsung hero of deep Transformers. LayerNorm stabilizes activations at every layer, residual connections create gradient highways, and the choice between Pre-Norm vs Post-Norm determines whether your 80-layer model trains at all. This module shows you exactly what happens to a vector as it gets normalized.

📊 Normalization Visualized

What you're seeing: Left panel shows 4 raw activations with wildly different magnitudes — exactly what LayerNorm receives. Right panel shows the same 4 values after normalization: mean is 0, standard deviation is 1, then scaled by γ and shifted by β. Below the arrow, the two placement strategies (Pre-LN vs Post-LN) show where the norm sits relative to the Attention / FFN sublayers.

What to notice: The bar heights on the right are much more uniform. Pre-LN places the norm inside the residual branch so the skip connection remains a clean gradient highway. Post-LN normalizes after adding the residual, which puts LN in the gradient path and destabilizes deep models.

[Figure: left — four raw activations with wildly different magnitudes (150, −80, 300, 10); right — the same values after LayerNorm (0.3, −0.8, 1.2, −0.1), i.e. μ = 0, σ = 1, then scale γ and shift β. Below: Pre-LN (GPT-2, Llama) places LN before Attn/FFN inside the residual branch; Post-LN (original Transformer) places LN after the residual add, putting LN in the gradient path.]
🎮 Normalization Step-by-Step

What you're seeing: A 4-element vector being normalized one step at a time. The demo computes mean, variance, normalizes, then applies learnable scale (gamma) and shift (beta).

What to try: Step through each stage to see how the values change. Toggle "RMSNorm" to see the simpler variant (no mean subtraction). Open "Adjust gamma and beta" to see how the learnable parameters reshape the distribution.

Step 0: Input vector: [2.0000, −1.0000, 4.0000, 0.5000]
💡 The Intuition

Without normalization, activations drift as they pass through layers -- values grow or shrink unpredictably, destabilizing gradients. LayerNorm acts like a checkpoint at each layer's entrance, rescaling values to a stable range before they enter the next sublayer.

Diagram 1 — What LayerNorm does to a vector

[Raw activations (varied heights, mean ≠ 0): 2.0, −1.0, 4.0, 0.5 with μ = 1.375 → after LayerNorm (mean = 0, std = 1, uniform scale): 0.34, −1.28, 1.42, −0.47.]
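To check the diagram's numbers yourself, here is a minimal sketch in plain PyTorch that walks the same vector through each stage (γ = 1 and β = 0 are assumed, matching a freshly initialized LayerNorm):

python
import torch
import torch.nn as nn

x = torch.tensor([2.0, -1.0, 4.0, 0.5])

mu = x.mean()                                  # 1.375
var = x.var(unbiased=False)                    # population variance, as LayerNorm uses
x_hat = (x - mu) / torch.sqrt(var + 1e-5)      # ≈ [0.34, -1.28, 1.42, -0.47]

gamma, beta = torch.ones(4), torch.zeros(4)    # fresh LayerNorm defaults
y = gamma * x_hat + beta

# Matches PyTorch's built-in LayerNorm on the same vector
print(torch.allclose(y, nn.LayerNorm(4)(x)))   # True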

Pre-Norm vs Post-Norm: Post-Norm (the original Transformer) normalizes after the residual add: y = LayerNorm(x + Sublayer(x)). But gradients must then pass through the LN Jacobian at every layer, causing instability in deep models. Pre-Norm (GPT-2+) applies LN before the sublayer: y = x + Sublayer(LayerNorm(x)). The residual path stays clean -- an unobstructed gradient highway.

Diagram 2 — Pre-Norm vs Post-Norm placement

[Pre-Norm (GPT-2+, Llama, Gemma — modern): x → LayerNorm → Sublayer (Attn/FFN) → + skip → y, with a clean residual path ✓. Post-Norm (Vaswani 2017 original, N=6 per stack): x → Sublayer (Attn/FFN) → + skip → LayerNorm → y, with LN blocking the gradient path ✗.]
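To make the placement difference concrete, here is a minimal, hypothetical pair of blocks in PyTorch (names like PreNormBlock and the FFN-only sublayer are illustrative, not taken from any library):

python
import torch
import torch.nn as nn

def ffn(d_model):
    return nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                         nn.Linear(4 * d_model, d_model))

class PreNormBlock(nn.Module):          # GPT-2 / Llama style
    def __init__(self, d_model):
        super().__init__()
        self.ln, self.ff = nn.LayerNorm(d_model), ffn(d_model)

    def forward(self, x):
        return x + self.ff(self.ln(x))  # LN inside the branch; the skip stays clean

class PostNormBlock(nn.Module):         # original Transformer style
    def __init__(self, d_model):
        super().__init__()
        self.ln, self.ff = nn.LayerNorm(d_model), ffn(d_model)

    def forward(self, x):
        return self.ln(x + self.ff(x))  # LN after the add; it sits on the gradient path

x = torch.randn(2, 16, 256)
print(PreNormBlock(256)(x).shape, PostNormBlock(256)(x).shape)   # both [2, 16, 256]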

RMSNorm (Llama, Gemma, Mistral): Drops the mean-centering step entirely and only divides by the root-mean-square of the vector: RMSNorm(x) = (x / RMS(x)) ⊙ γ, where RMS(x) = √(mean(x²)). It works just as well -- normalization's value comes from controlling magnitude, not centering.

Residual connections as gradient highways: With x_{i+1} = x_i + f(x_i), the gradient is ∂x_{i+1}/∂x_i = I + ∂f/∂x_i. The identity matrix guarantees gradients can flow from layer 80 to layer 1 without vanishing. Remove it, and training collapses.

Diagram 3 — Residual stream: the accumulating highway

[Residual stream x: Embed → Attention (+Δ₁) → LN → FFN (+Δ₂) → LN → Output. Each sublayer writes a small delta (Δ) that the stream accumulates additively; removing the + collapses training. Legend: residual stream (highway), sublayer delta (+Δ), LayerNorm checkpoint.]
✨ Insight · Think of residual connections as express lanes on a highway. Without them, every car (gradient) must exit at every toll booth (layer) -- most get stuck. With skip connections, gradients can take the express lane straight to early layers.
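A quick way to see the highway in action is to stack many identical sublayers and compare the gradient that reaches the input with and without the skip connection. This is a toy experiment under arbitrary assumptions (random LayerNorm→Linear→Tanh sublayers, illustrative depths), not a measurement from any real model:

python
import torch
import torch.nn as nn

def input_grad_norm(depth, use_residual, d=64):
    torch.manual_seed(0)
    layers = nn.ModuleList([
        nn.Sequential(nn.LayerNorm(d), nn.Linear(d, d), nn.Tanh())
        for _ in range(depth)
    ])
    x = torch.randn(1, d, requires_grad=True)
    h = x
    for layer in layers:
        h = h + layer(h) if use_residual else layer(h)   # toggle the skip connection
    h.sum().backward()
    return x.grad.norm().item()

for depth in (8, 32, 96):
    print(depth,
          f"with skip: {input_grad_norm(depth, True):.2e}",
          f"without: {input_grad_norm(depth, False):.2e}")
# Typically the no-skip gradient shrinks by orders of magnitude as depth grows,
# while the residual stack keeps it in a healthy range.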

Quick check

Trade-off

Before LayerNorm existed, BatchNorm was the go-to stabilizer for deep networks. What is the single most fundamental reason BatchNorm cannot be used for Transformer sequence modeling?

Quick check

Why do modern models (GPT-2+, Llama, Gemma) use Pre-Norm instead of Post-Norm?

📐 Step-by-Step Derivation

LayerNorm

Normalize each token's feature vector independently. Compute the mean and variance across the d feature dimensions, then normalize, scale, and shift:

μ = (1/d) Σᵢ xᵢ,  σ² = (1/d) Σᵢ (xᵢ − μ)²,  LayerNorm(x) = γ ⊙ (x − μ) / √(σ² + ε) + β

where γ and β are learnable d-dimensional parameters and ε is a small constant for numerical stability.
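Because the statistics are computed over the feature dimension only, each token is normalized independently of every other token in the batch and sequence. A quick check (shapes chosen arbitrarily for illustration):

python
import torch
import torch.nn as nn

ln = nn.LayerNorm(8)
x = torch.randn(2, 4, 8)            # [batch, seq, d_model]
y1 = ln(x)

x2 = x.clone()
x2[0, 3] += 100.0                   # wildly perturb a single token
y2 = ln(x2)

print(torch.allclose(y1[0, :3], y2[0, :3]))   # True -- other tokens are unaffected
print(torch.allclose(y1[1], y2[1]))           # True -- other sequences too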

RMSNorm (Llama, Gemma, Mistral)

RMSNorm skips mean-centering -- it only divides by the root-mean-square, and it has no bias parameter β:

RMSNorm(x) = (x / RMS(x)) ⊙ γ,  where RMS(x) = √((1/d) Σᵢ xᵢ² + ε)

💡 Tip · RMSNorm removes both the mean subtraction and the bias, halving the norm parameters and trimming compute by roughly 7–10% in reported experiments. Accuracy differences are negligible.
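For a concrete contrast, applying RMSNorm to the same demo vector used above (γ = 1 assumed) controls the magnitude without centering the mean:

python
import torch

x = torch.tensor([2.0, -1.0, 4.0, 0.5])
rms = torch.sqrt(torch.mean(x ** 2) + 1e-6)   # ≈ 2.305, no mean subtraction
print(x / rms)                                # ≈ [0.868, -0.434, 1.735, 0.217]
print((x / rms).mean())                       # ≈ 0.6 -- mean is not zero, and that's fine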

Residual Connection

The identity shortcut x_{i+1} = x_i + f(x_i) means ∂x_{i+1}/∂x_i = I + ∂f/∂x_i. The identity term I is the gradient highway.

Pre-Norm vs Post-Norm

Pre-Norm (modern standard): x_{i+1} = x_i + Sublayer(LayerNorm(x_i))

Post-Norm (original Transformer): x_{i+1} = LayerNorm(x_i + Sublayer(x_i))

⚠ Warning · In Pre-Norm, the residual path has no LN -- gradients backpropagate freely. In Post-Norm, gradients must pass through LN at every layer. This is why almost every model since GPT-2 uses Pre-Norm.

PyTorch: LayerNorm and RMSNorm

python
import torch
import torch.nn as nn

d_model = 512                     # example model width
x = torch.randn(2, 16, d_model)   # [batch, seq, d_model]

# Built-in LayerNorm
ln = nn.LayerNorm(d_model)        # learnable gamma, beta
output = ln(x)

# Built-in RMSNorm (PyTorch 2.4+)
rms = nn.RMSNorm(d_model)        # learnable gamma only
output = rms(x)

# Manual RMSNorm (for older PyTorch)
class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x):
        rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight

# Pre-Norm residual pattern (as it appears inside a Transformer block's forward)
x = x + self.attn(self.ln1(x))   # norm before sublayer
x = x + self.ffn(self.ln2(x))    # residual adds back original

Quick check

Trade-off

RMSNorm replaces LayerNorm’s formula with: x / RMS(x) × γ, where RMS(x) = √(mean(x²)). Two things are missing compared to LayerNorm. What are they, and what is the performance claim?

🔧 Break It -- See What Happens

Remove residual connections
Switch Pre-Norm to Post-Norm

Quick check

Derivation

The Break It panel shows what happens when residual connections are removed. What is the mathematical mechanism, and at what depth does training practically collapse without them?

📊 Real-World Numbers

Model | Norm Type | Placement | Notes
Original Transformer | LayerNorm | Post-Norm | 6 layers per stack, required LR warmup
GPT-2 | LayerNorm | Pre-Norm | First major model to switch to Pre-Norm
GPT-3 | LayerNorm | Pre-Norm | 96 layers, 175B params
Llama-2 / Llama-3 | RMSNorm | Pre-Norm | Layer count varies by model size (32–80)
Gemma | RMSNorm | Pre-Norm | Google's open model family
Mistral / Mixtral | RMSNorm | Pre-Norm | MoE architecture (Mixtral), same norm choice
✨ Insight · The trend: from LayerNorm (2017) to RMSNorm (proposed 2019, standard in Llama-era models from 2023 on). Pre-Norm became the practical default for most deep LLMs, though alternatives like DeepNorm (Wang et al. 2022) can stabilize very deep Post-LN models up to 1000 layers. The field converged because clean residual paths beat the marginal quality gain from Post-Norm in most settings.
🔬 Why Post-Norm Fails at Depth — Gradient Norm Analysis

Xiong et al. (2020) provided a theoretical analysis of why Post-Norm requires careful learning rate warmup while Pre-Norm does not. The root cause is how the gradient scale varies across layers at initialization: large near the output, attenuated at the early layers.

In Post-Norm, the output of layer i is x_{i+1} = LayerNorm(x_i + F_i(x_i)). The gradient reaching an early layer's input therefore accumulates a product of LayerNorm Jacobians across all of the layers above it:

∂L/∂x_i = ∂L/∂x_N · ∏_{j=i..N−1} [ J_LN(x_j + F_j(x_j)) · (I + ∂F_j/∂x_j) ]

Xiong et al. (2020) showed that at initialization the gradient norm near the output layers scales as O(d·√(ln d)) — large enough that a high learning rate immediately destabilizes training. The gradient magnitude grows with the model dimension d, making deep Post-Norm models extremely sensitive to the learning rate and requiring careful warmup to avoid divergence.

In Pre-Norm, x_{i+1} = x_i + F_i(LayerNorm(x_i)). The gradient decomposes cleanly:

∂x_{i+1}/∂x_i = I + ∂F_i(LayerNorm(x_i))/∂x_i

The identity term dominates at initialization (when each sublayer's contribution is still small), so the gradient that reaches any layer stays at O(1). This is why Pre-Norm models train stably from a cold start without warmup, while Post-Norm models need gradual warmup to survive the unevenly scaled gradients of the first steps.
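To get a feel for the O(d·√(ln d)) scaling, here is a two-line calculation (the hidden constant is unknown, so treat these as relative magnitudes, not literal gradient norms):

python
import math

for d in (768, 4096, 12288):                  # GPT-2 small, Llama-2 7B, GPT-3 widths
    print(d, round(d * math.sqrt(math.log(d))))
# 768 -> ~2.0e3,  4096 -> ~1.2e4,  12288 -> ~3.8e4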

Property | Post-Norm | Pre-Norm
Gradient norm at init | O(d·√(ln d)) | O(1)
Warmup required? | Yes — gradual LR increase | No — stable from step 1
Max stable depth | ~12 layers without tricks | 100+ layers (GPT-3 trains 96)
Final quality | Slightly better (when tuned) | Slightly worse, but practical
✨ Insight · DeepNorm (Wang et al. 2022) is a hybrid: up-weight the skip connection by a constant α and scale the residual-branch weight initialization down by β, then apply Post-Norm. This keeps the expected update magnitude at O(1) while retaining Post-Norm's final-quality advantage. It enables stable training of 1000-layer Transformers, but the tuning complexity means most practitioners still prefer Pre-RMSNorm.

Quick check

Derivation

Xiong et al. (2020) show Post-LN gradient norm at initialization scales as O(d√(ln d)). For GPT-3 with d=12,288, what is the approximate gradient norm, and why does Pre-LN eliminate the need for warmup?

🧠 Key Takeaways

What to remember for interviews

  1. LayerNorm normalizes each token's feature vector to mean≈0, std≈1, then rescales with learned γ and β — acting as a stability checkpoint before each sublayer.
  2. Unlike BatchNorm (which depends on other samples in the batch), LayerNorm operates independently per token, making it compatible with variable-length sequences and batch size 1.
  3. Pre-Norm (LayerNorm before the sublayer) keeps the residual path clean, allowing gradients to flow through the identity connection without distortion — this is why GPT-2, Llama, and Gemma all use Pre-Norm.
  4. RMSNorm (used in Llama, Mistral) drops mean-centering entirely and only divides by the root-mean-square, saving ~7–10% compute with negligible quality loss.
  5. Residual connections provide a gradient highway: the derivative ∂x_{i+1}/∂x_i = I + ∂f/∂x_i ensures gradients reach early layers without vanishing through deep stacks.
🧠 Recap Quiz

Trade-off

A language model processes sequences of variable length (4 to 4096 tokens) with batch size 1 at inference. Why does BatchNorm fail here while LayerNorm succeeds?

Derivation

Xiong et al. (2020) show that at initialization, Post-LN gradient norm at the output layer scales as O(d√(ln d)) while Pre-LN scales as O(1). What does this mean in practice for training a 96-layer model like GPT-3?

Trade-off

RMSNorm removes two components from LayerNorm. Identify both and explain why removing them is acceptable without measurable accuracy loss.

Derivation

Without residual connections, training a 96-layer Transformer from scratch is practically impossible. Derive the gradient flow argument: what exactly goes wrong without the identity skip?

Trade-off

The original Transformer (Vaswani 2017) used Post-Norm with 6 encoder + 6 decoder layers and required warmup. GPT-3 uses Pre-Norm with 96 layers and trains stably. What is the approximate maximum depth Post-Norm can reach without specialized tricks like DeepNorm?

Derivation

Llama-2 7B has 32 transformer layers, each with 2 RMSNorm operations (pre-attention and pre-FFN), plus a final norm. Each RMSNorm has d_model=4096 gamma parameters and no beta. How many norm parameters does this save versus full LayerNorm across the whole model?

Trade-off

DeepNorm (Wang et al. 2022) enables stable training of 1000-layer transformers using Post-Norm. What is the key modification that controls the gradient norm, and why do most practitioners still prefer Pre-RMSNorm?


🎯 Interview Questions


LayerNorm vs BatchNorm -- why do Transformers use LayerNorm?

★★☆
GoogleOpenAIAnthropic

Pre-Norm vs Post-Norm -- what are the training dynamics differences?

★★★
GoogleOpenAIAnthropic

Draw the gradient flow path for both variants. Which one has an unobstructed highway?

RMSNorm -- what is it and why do Llama/Gemma use it?

★★☆
GoogleMetaAnthropic

Why are residual connections critical for gradient flow?

★★☆
GoogleMeta

What happens if you apply LayerNorm after the residual connection instead of before?

★★★
AnthropicOpenAI