🔗 LayerNorm & Residuals
Delete one line and a 96-layer model becomes untrainable
Normalization is the unsung hero of deep Transformers. LayerNorm stabilizes activations at every layer, residual connections create gradient highways, and the choice between Pre-Norm and Post-Norm determines whether your 80-layer model trains at all. This module shows you exactly what happens to a vector as it gets normalized.
Normalization Visualized
What you're seeing: Left panel shows 4 raw activations with wildly different magnitudes — exactly what LayerNorm receives. Right panel shows the same 4 values after normalization: mean is 0, standard deviation is 1, then scaled by γ and shifted by β. Below the arrow, the two placement strategies (Pre-LN vs Post-LN) show where the norm sits relative to the Attention / FFN sublayers.
What to notice: The bar heights on the right are much more uniform. Pre-LN places the norm inside the residual branch so the skip connection remains a clean gradient highway. Post-LN normalizes after adding the residual, which puts LN in the gradient path and destabilizes deep models.
Normalization Step-by-Step
What you're seeing: A 4-element vector being normalized one step at a time. The demo computes mean, variance, normalizes, then applies learnable scale (gamma) and shift (beta).
What to try: Step through each stage to see how values change. Toggle "RMSNorm" to see the simpler variant (no mean subtraction). Open "Adjust gamma and beta" to see how the learnable parameters reshape the distribution. The short sketch below walks through the same arithmetic in code.
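To make the arithmetic concrete, here is a minimal NumPy sketch of the same steps on a hypothetical 4-element vector (the input values, gamma, and beta below are made up for illustration):

import numpy as np

x = np.array([2.0, -1.0, 7.0, 0.5])        # hypothetical raw activations
gamma, beta = 1.0, 0.0                      # learnable scale / shift at their init values

mu = x.mean()                               # Step 1: mean
var = x.var()                               # Step 2: variance
x_hat = (x - mu) / np.sqrt(var + 1e-5)      # Step 3: normalize -> mean ~0, std ~1
y = gamma * x_hat + beta                    # Step 4: learnable scale and shift

rms = np.sqrt((x ** 2).mean() + 1e-6)       # RMSNorm variant: no mean subtraction
y_rms = x / rms * gamma

print(x_hat.mean(), x_hat.std())            # ~0.0 and ~1.0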
The Intuition
Without normalization, activations drift as they pass through layers -- values grow or shrink unpredictably, destabilizing gradients. LayerNorm acts like a checkpoint at each layer's entrance, rescaling values to a stable range before they enter the next sublayer.
Diagram 1 — What LayerNorm does to a vector
Pre-Norm vs Post-Norm: Post-Norm (original Transformer) applies LN after the residual add: x_{l+1} = LN(x_l + f(x_l)). But gradients must pass through the LN Jacobian, causing instability in deep models. Pre-Norm (GPT-2+) applies LN before the sublayer: x_{l+1} = x_l + f(LN(x_l)). The residual path stays clean -- an unobstructed gradient highway.
Diagram 2 — Pre-Norm vs Post-Norm placement
RMSNorm (Llama, Gemma, Mistral): Drops the mean-centering step entirely and only divides by the root-mean-square of the vector: RMSNorm(x) = x / RMS(x) × γ, where RMS(x) = √(mean(x²) + ε). It is simpler, slightly cheaper, and works just as well -- normalization's value comes from controlling magnitude, not centering.
Residual connections as gradient highways: With x_{l+1} = x_l + f(x_l), the gradient is ∂x_{l+1}/∂x_l = I + ∂f/∂x_l. The identity matrix guarantees gradients can flow from layer 80 to layer 1 without vanishing. Remove it, and training collapses.
Diagram 3 — Residual stream: the accumulating highway
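A quick way to see the highway effect empirically is to stack a handful of identical sublayers with and without the skip connection and compare how much gradient reaches the input. This is a rough sketch under toy assumptions (random untrained Linear+Tanh sublayers standing in for attention/FFN):

import torch
import torch.nn as nn

torch.manual_seed(0)
d, depth = 64, 32
layers = nn.ModuleList([nn.Sequential(nn.Linear(d, d), nn.Tanh()) for _ in range(depth)])

def input_grad_norm(use_residual):
    x = torch.randn(1, d, requires_grad=True)
    h = x
    for f in layers:
        h = h + f(h) if use_residual else f(h)   # keep or drop the identity shortcut
    h.sum().backward()
    return x.grad.norm().item()

print("with residuals:   ", input_grad_norm(True))    # healthy gradient at the input
print("without residuals:", input_grad_norm(False))   # gradient shrinks sharply over 32 layers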
Quick check
Before LayerNorm existed, BatchNorm was the go-to stabilizer for deep networks. What is the single most fundamental reason BatchNorm cannot be used for Transformer sequence modeling?
Why do modern models (GPT-2+, Llama, Gemma) use Pre-Norm instead of Post-Norm?
Step-by-Step Derivation
LayerNorm
Normalize each token's feature vector independently. Compute mean and variance across the feature dimensions:

μ = (1/d) Σᵢ xᵢ,   σ² = (1/d) Σᵢ (xᵢ − μ)²,   LayerNorm(x) = (x − μ) / √(σ² + ε) × γ + β

where μ and σ² are computed per token over the d feature dimensions, ε is a small constant for numerical stability, and γ and β are learnable vectors of size d.
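As a sanity check on the formula, the same computation can be written out by hand and compared against PyTorch's built-in nn.LayerNorm (a sketch; the shapes are arbitrary and eps follows the PyTorch default):

import torch
import torch.nn as nn

x = torch.randn(2, 10, 512)                         # [batch, seq, d_model]
ln = nn.LayerNorm(512)                              # gamma = 1, beta = 0 at init

mu = x.mean(dim=-1, keepdim=True)                   # per-token mean over the features
var = x.var(dim=-1, unbiased=False, keepdim=True)   # per-token (biased) variance
manual = (x - mu) / torch.sqrt(var + ln.eps) * ln.weight + ln.bias

print(torch.allclose(manual, ln(x), atol=1e-5))     # True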
RMSNorm (Llama, Gemma, Mistral)
Skip mean-centering -- only divide by the root-mean-square. No bias parameter β:

RMS(x) = √((1/d) Σᵢ xᵢ² + ε),   RMSNorm(x) = x / RMS(x) × γ
Residual Connection
The identity shortcut x_{l+1} = x_l + f(x_l) means ∂x_{l+1}/∂x_l = I + ∂f/∂x_l. The identity term I is the gradient highway.
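The I + ∂f/∂x structure can be checked directly with autograd on a toy sublayer (the sublayer and sizes here are arbitrary stand-ins):

import torch
from torch.autograd.functional import jacobian

d = 8
f = torch.nn.Sequential(torch.nn.Linear(d, d), torch.nn.GELU())
x = torch.randn(d)

J_residual = jacobian(lambda v: v + f(v), x)    # Jacobian of the residual block x + f(x)
J_sublayer = jacobian(f, x)                     # Jacobian of the sublayer alone

# The residual Jacobian is the identity plus the sublayer Jacobian
print(torch.allclose(J_residual, torch.eye(d) + J_sublayer, atol=1e-5))   # True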
Pre-Norm vs Post-Norm
Pre-Norm (modern standard): x_{l+1} = x_l + f(LN(x_l))
Post-Norm (original Transformer): x_{l+1} = LN(x_l + f(x_l))
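In code, the two placements differ only in where the norm call sits in the block's forward pass. A minimal sketch of both variants (the attention/FFN sublayers and hyperparameters here are illustrative, not any particular model's):

import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model, num_heads=8):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # norm inside the branch
        x = x + self.ffn(self.ln2(x))                        # residual path stays untouched
        return x

class PostNormBlock(PreNormBlock):
    def forward(self, x):
        x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])   # norm after the add
        x = self.ln2(x + self.ffn(x))                                 # LN sits on the highway
        return x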
PyTorch: LayerNorm and RMSNorm
import torch
import torch.nn as nn

d_model = 512                        # example hidden size
x = torch.randn(2, 16, d_model)      # [batch, seq, d_model]

# Built-in LayerNorm
ln = nn.LayerNorm(d_model)           # learnable gamma, beta
output = ln(x)                       # x: [batch, seq, d_model]

# Built-in RMSNorm (PyTorch 2.4+)
rms = nn.RMSNorm(d_model)            # learnable gamma only
output = rms(x)
# Manual RMSNorm (for older PyTorch versions)
class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x):
        rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight

# Pre-Norm residual pattern (inside a Transformer block's forward)
x = x + self.attn(self.ln1(x))   # norm before sublayer
x = x + self.ffn(self.ln2(x))    # residual adds back the original

Quick check
RMSNorm replaces LayerNorm’s formula with: x / RMS(x) × γ, where RMS(x) = √(mean(x²)). Two things are missing compared to LayerNorm. What are they, and what is the performance claim?
Break It -- See What Happens
Quick check
The Break It panel shows what happens when residual connections are removed. What is the mathematical mechanism, and at what depth does training practically collapse without them?
Real-World Numbers
| Model | Norm Type | Placement | Notes |
|---|---|---|---|
| Original Transformer | LayerNorm | Post-Norm | 6 layers, required warmup |
| GPT-2 | LayerNorm | Pre-Norm | First major model to switch to Pre-Norm |
| GPT-3 | LayerNorm | Pre-Norm | 96 layers, 175B params |
| Llama-2 / Llama-3 | RMSNorm | Pre-Norm | Layer count varies by model size |
| Gemma | RMSNorm | Pre-Norm | Google's open model family |
| Mistral / Mixtral | RMSNorm | Pre-Norm | MoE architecture, same norm choice |
Why Post-Norm Fails at Depth — Gradient Norm Analysis
Xiong et al. (2020) provided a theoretical analysis of why Post-Norm requires careful learning rate warmup while Pre-Norm does not. The root cause is the scale of the gradients at initialization: in Post-Norm they are large near the output layers and shrink toward the early layers.
In Post-Norm, the output of layer l is x_{l+1} = LN(x_l + f(x_l)). The gradient with respect to an early layer's input accumulates a product of LayerNorm Jacobians across all layers:

∂x_N/∂x_1 = Π_{l=1}^{N−1} J_LN(x_l + f(x_l)) · (I + ∂f/∂x_l)
Xiong et al. (2020) showed that at initialization the gradient norm near the output layers scales as O(d√(ln d)) — large enough that a high learning rate immediately destabilizes training. The gradient magnitude grows with model dimension, making deep Post-Norm models extremely sensitive to learning rate and requiring careful warmup to avoid divergence.
In Pre-Norm, x_{l+1} = x_l + f(LN(x_l)). The gradient decomposes cleanly:

∂x_{l+1}/∂x_l = I + ∂f(LN(x_l))/∂x_l
The identity term I dominates at initialization (when ∂f/∂x_l is small), so the gradient norm stays O(1) at every layer. This is why Pre-Norm models train stably from a cold start without warmup, while Post-Norm models need gradual warmup to cope with large gradients near the output and vanishing gradients at the early layers.
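A rough way to observe this is to build a toy stack of each variant at initialization, backprop a dummy loss, and compare how gradient magnitude varies between the first and last sublayer. The sketch below uses simple Linear+GELU sublayers as stand-ins for attention/FFN, so treat the numbers as qualitative only:

import torch
import torch.nn as nn

torch.manual_seed(0)
d, depth = 256, 24

def first_last_grad(pre_norm):
    subs = nn.ModuleList([nn.Sequential(nn.Linear(d, d), nn.GELU()) for _ in range(depth)])
    norms = nn.ModuleList([nn.LayerNorm(d) for _ in range(depth)])
    h = torch.randn(8, d)
    for f, ln in zip(subs, norms):
        h = h + f(ln(h)) if pre_norm else ln(h + f(h))   # Pre-Norm vs Post-Norm placement
    h.pow(2).mean().backward()
    return subs[0][0].weight.grad.norm().item(), subs[-1][0].weight.grad.norm().item()

print("Pre-Norm  (first, last):", first_last_grad(True))
print("Post-Norm (first, last):", first_last_grad(False))
# Post-Norm typically shows a much larger last/first imbalance at initialization,
# while Pre-Norm keeps the two within a small factor of each other.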
| Property | Post-Norm | Pre-Norm |
|---|---|---|
| Gradient norm at init | O(d√(ln d)) near output | O(1) |
| Warmup required? | Yes — gradual LR increase | No — stable from step 1 |
| Max stable depth | ~12 layers without tricks | 96+ layers in practice (e.g., GPT-3) |
| Final quality | Slightly better (when tuned) | Slightly worse, but practical |
Quick check
Xiong et al. (2020) show Post-LN gradient norm at initialization scales as O(d√(ln d)). For GPT-3 with d=12,288, what is the approximate gradient norm, and why does Pre-LN eliminate the need for warmup?
Key Takeaways
What to remember for interviews
1. LayerNorm normalizes each token's feature vector to mean≈0, std≈1, then rescales with learned γ and β — acting as a stability checkpoint before each sublayer.
2. Unlike BatchNorm (which depends on other samples in the batch), LayerNorm operates independently per token, making it compatible with variable-length sequences and batch size 1.
3. Pre-Norm (LayerNorm before the sublayer) keeps the residual path clean, allowing gradients to flow through the identity connection without distortion — this is why GPT-2, Llama, and Gemma all use Pre-Norm.
4. RMSNorm (used in Llama, Mistral) drops mean-centering entirely and only divides by the root-mean-square, saving ~7–10% compute with negligible quality loss.
5. Residual connections provide a gradient highway: the derivative ∂x_{i+1}/∂x_i = I + ∂f/∂x_i ensures gradients reach early layers without vanishing through deep stacks.
Recap Quiz
A language model processes sequences of variable length (4 to 4096 tokens) with batch size 1 at inference. Why does BatchNorm fail here while LayerNorm succeeds?
Xiong et al. (2020) show that at initialization, Post-LN gradient norm at the output layer scales as O(d√(ln d)) while Pre-LN scales as O(1). What does this mean in practice for training a 96-layer model like GPT-3?
RMSNorm removes two components from LayerNorm. Identify both and explain why removing them is acceptable without measurable accuracy loss.
Without residual connections, training a 96-layer Transformer from scratch is practically impossible. Derive the gradient flow argument: what exactly goes wrong without the identity skip?
The original Transformer (Vaswani 2017) used Post-Norm with 6 encoder + 6 decoder layers and required warmup. GPT-3 uses Pre-Norm with 96 layers and trains stably. What is the approximate maximum depth Post-Norm can reach without specialized tricks like DeepNorm?
Llama-2 7B has 32 transformer layers, each with 2 RMSNorm operations (pre-attention and pre-FFN), plus a final norm. Each RMSNorm has d_model=4096 gamma parameters and no beta. How many norm parameters does this save versus full LayerNorm across the whole model?
DeepNorm (Wang et al. 2022) enables stable training of 1000-layer transformers using Post-Norm. What is the key modification that controls the gradient norm, and why do most practitioners still prefer Pre-RMSNorm?
Further Reading
- Layer Normalization — Ba et al. 2016 — the original LayerNorm paper. Normalizes across features instead of batch dimension.
- Root Mean Square Layer Normalization (RMSNorm) — Zhang & Sennrich 2019 — drops the mean-centering step for a simpler, faster norm. Used by LLaMA and Mistral.
- On Layer Normalization in the Transformer Architecture — Xiong et al. 2020 — analyzes Pre-LN vs Post-LN placement. Pre-LN enables stable training without warmup.
- The Illustrated Transformer — Jay Alammar — visual Transformer walkthrough showing where layer norm fits in the residual stream around attention and FFN blocks.
- Batch Normalization: Accelerating Deep Network Training — Ioffe & Szegedy 2015 — the original BatchNorm paper; understanding why BN works for images clarifies why LayerNorm is needed for variable-length sequences.
- DeepNorm: Scaling Transformers to 1,000 Layers — Wang et al. 2022 — combines Pre-LN and Post-LN with scaled initialization to train 1000-layer transformers stably. Used in GLM-130B.
Interview Questions
- LayerNorm vs BatchNorm -- why do Transformers use LayerNorm? (★★☆)
- Pre-Norm vs Post-Norm -- what are the training dynamics differences? Draw the gradient flow path for both variants. Which one has an unobstructed highway? (★★★)
- RMSNorm -- what is it and why do Llama/Gemma use it? (★★☆)
- Why are residual connections critical for gradient flow? (★★☆)
- What happens if you apply LayerNorm after the residual connection instead of before? (★★★)