
Transformer Math

Module 8 · The Transformer

🔗 LayerNorm & Residuals

Delete one line and a 96-layer model becomes untrainable


Normalization is the unsung hero of deep Transformers. LayerNorm stabilizes activations at every layer, residual connections create gradient highways, and the choice between Pre-Norm vs Post-Norm determines whether your 80-layer model trains at all. This module shows you exactly what happens to a vector as it gets normalized.

📊 Normalization Visualized

What you're seeing: Left panel shows 4 raw activations with wildly different magnitudes — exactly what LayerNorm receives. Right panel shows the same 4 values after normalization: mean is 0, standard deviation is 1, then scaled by γ and shifted by β. Below the arrow, the two placement strategies (Pre-LN vs Post-LN) show where the norm sits relative to the Attention / FFN sublayers.

What to notice: The bar heights on the right are much more uniform. Pre-LN places the norm inside the residual branch so the skip connection remains a clean gradient highway. Post-LN normalizes after adding the residual, which puts LN in the gradient path and destabilizes deep models.

[Figure: left — four raw activations with wildly different magnitudes (150, −80, 300, 10); right — the same values after LayerNorm (0.3, −0.8, 1.2, −0.1), i.e. μ = 0, σ = 1, then scale γ and shift β. Below: Pre-LN (GPT-2, Llama) places LN before Attn/FFN inside the residual branch; Post-LN (original Transformer) places LN after the residual add, putting LN in the gradient path.]
🎮 Normalization Step-by-Step

What you're seeing: A 4-element vector being normalized one step at a time. The demo computes mean, variance, normalizes, then applies learnable scale (gamma) and shift (beta).

What to try: Step through each stage to see how the values change. Toggle "RMSNorm" to see the simpler variant (no mean subtraction). Open "Adjust gamma and beta" to see how the learnable parameters reshape the distribution.

Step 0: Input vector: [2.0000, −1.0000, 4.0000, 0.5000]
💡 The Intuition

Without normalization, activations drift as they pass through layers -- values grow or shrink unpredictably, destabilizing gradients. LayerNorm acts like a checkpoint at each layer's entrance, rescaling values to a stable range before they enter the next sublayer.

Diagram 1 — What LayerNorm does to a vector

[Raw activations (varied heights, mean ≠ 0): 2.0, −1.0, 4.0, 0.5 with μ = 1.375 → after LayerNorm (mean = 0, std = 1, uniform scale): 0.34, −1.28, 1.42, −0.47.]
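To check the diagram's numbers yourself, here is a minimal sketch in plain PyTorch that walks the same vector through each stage (γ = 1 and β = 0 are assumed, matching a freshly initialized LayerNorm):

python
import torch
import torch.nn as nn

x = torch.tensor([2.0, -1.0, 4.0, 0.5])

mu = x.mean()                                  # 1.375
var = x.var(unbiased=False)                    # population variance, as LayerNorm uses
x_hat = (x - mu) / torch.sqrt(var + 1e-5)      # ≈ [0.34, -1.28, 1.42, -0.47]

gamma, beta = torch.ones(4), torch.zeros(4)    # fresh LayerNorm defaults
y = gamma * x_hat + beta

# Matches PyTorch's built-in LayerNorm on the same vector
print(torch.allclose(y, nn.LayerNorm(4)(x)))   # True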

Pre-Norm vs Post-Norm: Post-Norm (the original Transformer) normalizes after the residual add: y = LayerNorm(x + Sublayer(x)). But gradients must then pass through the LN Jacobian at every layer, causing instability in deep models. Pre-Norm (GPT-2+) applies LN before the sublayer: y = x + Sublayer(LayerNorm(x)). The residual path stays clean -- an unobstructed gradient highway.

Diagram 2 — Pre-Norm vs Post-Norm placement

[Pre-Norm (GPT-2+, Llama, Gemma — modern): x → LayerNorm → Sublayer (Attn/FFN) → + skip → y, with a clean residual path ✓. Post-Norm (Vaswani 2017 original, N=6 per stack): x → Sublayer (Attn/FFN) → + skip → LayerNorm → y, with LN blocking the gradient path ✗.]
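To make the placement difference concrete, here is a minimal, hypothetical pair of blocks in PyTorch (names like PreNormBlock and the FFN-only sublayer are illustrative, not taken from any library):

python
import torch
import torch.nn as nn

def ffn(d_model):
    return nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                         nn.Linear(4 * d_model, d_model))

class PreNormBlock(nn.Module):          # GPT-2 / Llama style
    def __init__(self, d_model):
        super().__init__()
        self.ln, self.ff = nn.LayerNorm(d_model), ffn(d_model)

    def forward(self, x):
        return x + self.ff(self.ln(x))  # LN inside the branch; the skip stays clean

class PostNormBlock(nn.Module):         # original Transformer style
    def __init__(self, d_model):
        super().__init__()
        self.ln, self.ff = nn.LayerNorm(d_model), ffn(d_model)

    def forward(self, x):
        return self.ln(x + self.ff(x))  # LN after the add; it sits on the gradient path

x = torch.randn(2, 16, 256)
print(PreNormBlock(256)(x).shape, PostNormBlock(256)(x).shape)   # both [2, 16, 256]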

RMSNorm (Llama, Gemma, Mistral): Drops the mean-centering step entirely and only divides by the root-mean-square of the vector: RMSNorm(x) = (x / RMS(x)) ⊙ γ, where RMS(x) = √(mean(x²)). It works just as well -- normalization's value comes from controlling magnitude, not centering.

Residual connections as gradient highways: With x_{i+1} = x_i + f(x_i), the gradient is ∂x_{i+1}/∂x_i = I + ∂f/∂x_i. The identity matrix guarantees gradients can flow from layer 80 to layer 1 without vanishing. Remove it, and training collapses.

Diagram 3 — Residual stream: the accumulating highway

[Residual stream x: Embed → Attention (+Δ₁) → LN → FFN (+Δ₂) → LN → Output. Each sublayer writes a small delta (Δ) that the stream accumulates additively; removing the + collapses training. Legend: residual stream (highway), sublayer delta (+Δ), LayerNorm checkpoint.]
✨ Insight · Think of residual connections as express lanes on a highway. Without them, every car (gradient) must exit at every toll booth (layer) -- most get stuck. With skip connections, gradients can take the express lane straight to early layers.
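A quick way to see the highway in action is to stack many identical sublayers and compare the gradient that reaches the input with and without the skip connection. This is a toy experiment under arbitrary assumptions (random LayerNorm→Linear→Tanh sublayers, illustrative depths), not a measurement from any real model:

python
import torch
import torch.nn as nn

def input_grad_norm(depth, use_residual, d=64):
    torch.manual_seed(0)
    layers = nn.ModuleList([
        nn.Sequential(nn.LayerNorm(d), nn.Linear(d, d), nn.Tanh())
        for _ in range(depth)
    ])
    x = torch.randn(1, d, requires_grad=True)
    h = x
    for layer in layers:
        h = h + layer(h) if use_residual else layer(h)   # toggle the skip connection
    h.sum().backward()
    return x.grad.norm().item()

for depth in (8, 32, 96):
    print(depth,
          f"with skip: {input_grad_norm(depth, True):.2e}",
          f"without: {input_grad_norm(depth, False):.2e}")
# Typically the no-skip gradient shrinks by orders of magnitude as depth grows,
# while the residual stack keeps it in a healthy range.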

Quick check

Trade-off

Before LayerNorm existed, BatchNorm was the go-to stabilizer for deep networks. What is the single most fundamental reason BatchNorm cannot be used for Transformer sequence modeling?

Quick check

Why do modern models (GPT-2+, Llama, Gemma) use Pre-Norm instead of Post-Norm?

📐 Step-by-Step Derivation

LayerNorm

Normalize each token's feature vector independently. Compute the mean and variance across the d feature dimensions, then normalize, scale, and shift:

μ = (1/d) Σᵢ xᵢ,  σ² = (1/d) Σᵢ (xᵢ − μ)²,  LayerNorm(x) = γ ⊙ (x − μ) / √(σ² + ε) + β

where γ and β are learnable d-dimensional parameters and ε is a small constant for numerical stability.
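Because the statistics are computed over the feature dimension only, each token is normalized independently of every other token in the batch and sequence. A quick check (shapes chosen arbitrarily for illustration):

python
import torch
import torch.nn as nn

ln = nn.LayerNorm(8)
x = torch.randn(2, 4, 8)            # [batch, seq, d_model]
y1 = ln(x)

x2 = x.clone()
x2[0, 3] += 100.0                   # wildly perturb a single token
y2 = ln(x2)

print(torch.allclose(y1[0, :3], y2[0, :3]))   # True -- other tokens are unaffected
print(torch.allclose(y1[1], y2[1]))           # True -- other sequences too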

RMSNorm (Llama, Gemma, Mistral)

RMSNorm skips mean-centering -- it only divides by the root-mean-square, and it has no bias parameter β:

RMSNorm(x) = (x / RMS(x)) ⊙ γ,  where RMS(x) = √((1/d) Σᵢ xᵢ² + ε)

💡 Tip · RMSNorm removes both the mean subtraction and the bias, halving the norm parameters and trimming compute by roughly 7–10% in reported experiments. Accuracy differences are negligible.
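For a concrete contrast, applying RMSNorm to the same demo vector used above (γ = 1 assumed) controls the magnitude without centering the mean:

python
import torch

x = torch.tensor([2.0, -1.0, 4.0, 0.5])
rms = torch.sqrt(torch.mean(x ** 2) + 1e-6)   # ≈ 2.305, no mean subtraction
print(x / rms)                                # ≈ [0.868, -0.434, 1.735, 0.217]
print((x / rms).mean())                       # ≈ 0.6 -- mean is not zero, and that's fine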

Residual Connection

The identity shortcut x_{i+1} = x_i + f(x_i) means ∂x_{i+1}/∂x_i = I + ∂f/∂x_i. The identity term I is the gradient highway.

Pre-Norm vs Post-Norm

Pre-Norm (modern standard): x_{i+1} = x_i + Sublayer(LayerNorm(x_i))

Post-Norm (original Transformer): x_{i+1} = LayerNorm(x_i + Sublayer(x_i))

⚠ Warning · In Pre-Norm, the residual path has no LN -- gradients backpropagate freely. In Post-Norm, gradients must pass through LN at every layer. This is why almost every model since GPT-2 uses Pre-Norm.

PyTorch: LayerNorm and RMSNorm

python
import torch
import torch.nn as nn

d_model = 512                     # example model width
x = torch.randn(2, 16, d_model)   # [batch, seq, d_model]

# Built-in LayerNorm
ln = nn.LayerNorm(d_model)        # learnable gamma, beta
output = ln(x)

# Built-in RMSNorm (PyTorch 2.4+)
rms = nn.RMSNorm(d_model)        # learnable gamma only
output = rms(x)

# Manual RMSNorm (for older PyTorch)
class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x):
        rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight

# Pre-Norm residual pattern (as it appears inside a Transformer block's forward)
x = x + self.attn(self.ln1(x))   # norm before sublayer
x = x + self.ffn(self.ln2(x))    # residual adds back original

Quick check

Trade-off

RMSNorm replaces LayerNorm’s formula with: x / RMS(x) × γ, where RMS(x) = √(mean(x²)). Two things are missing compared to LayerNorm. What are they, and what is the performance claim?

🔧 Break It -- See What Happens

Remove residual connections
Switch Pre-Norm to Post-Norm

Quick check

Derivation

The Break It panel shows what happens when residual connections are removed. What is the mathematical mechanism, and at what depth does training practically collapse without them?

📊 Real-World Numbers

Model | Norm Type | Placement | Notes
Original Transformer | LayerNorm | Post-Norm | 6 layers per stack, required LR warmup
GPT-2 | LayerNorm | Pre-Norm | First major model to switch to Pre-Norm
GPT-3 | LayerNorm | Pre-Norm | 96 layers, 175B params
Llama-2 / Llama-3 | RMSNorm | Pre-Norm | Layer count varies by model size (32–80)
Gemma | RMSNorm | Pre-Norm | Google's open model family
Mistral / Mixtral | RMSNorm | Pre-Norm | MoE architecture (Mixtral), same norm choice
✨ Insight · The trend: from LayerNorm (2017) to RMSNorm (proposed 2019, standard in Llama-era models from 2023 on). Pre-Norm became the practical default for most deep LLMs, though alternatives like DeepNorm (Wang et al. 2022) can stabilize very deep Post-LN models up to 1000 layers. The field converged because clean residual paths beat the marginal quality gain from Post-Norm in most settings.
🔬 Why Post-Norm Fails at Depth — Gradient Norm Analysis

Xiong et al. (2020) provided a theoretical analysis of why Post-Norm requires careful learning rate warmup while Pre-Norm does not. The root cause is how the gradient scale varies across layers at initialization: large near the output, attenuated at the early layers.

In Post-Norm, the output of layer i is x_{i+1} = LayerNorm(x_i + F_i(x_i)). The gradient reaching an early layer's input therefore accumulates a product of LayerNorm Jacobians across all of the layers above it:

∂L/∂x_i = ∂L/∂x_N · ∏_{j=i..N−1} [ J_LN(x_j + F_j(x_j)) · (I + ∂F_j/∂x_j) ]

Xiong et al. (2020) showed that at initialization the gradient norm near the output layers scales as O(d·√(ln d)) — large enough that a high learning rate immediately destabilizes training. The gradient magnitude grows with the model dimension d, making deep Post-Norm models extremely sensitive to the learning rate and requiring careful warmup to avoid divergence.

In Pre-Norm, x_{i+1} = x_i + F_i(LayerNorm(x_i)). The gradient decomposes cleanly:

∂x_{i+1}/∂x_i = I + ∂F_i(LayerNorm(x_i))/∂x_i

The identity term dominates at initialization (when each sublayer's contribution is still small), so the gradient that reaches any layer stays at O(1). This is why Pre-Norm models train stably from a cold start without warmup, while Post-Norm models need gradual warmup to survive the unevenly scaled gradients of the first steps.
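To get a feel for the O(d·√(ln d)) scaling, here is a two-line calculation (the hidden constant is unknown, so treat these as relative magnitudes, not literal gradient norms):

python
import math

for d in (768, 4096, 12288):                  # GPT-2 small, Llama-2 7B, GPT-3 widths
    print(d, round(d * math.sqrt(math.log(d))))
# 768 -> ~2.0e3,  4096 -> ~1.2e4,  12288 -> ~3.8e4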

Property | Post-Norm | Pre-Norm
Gradient norm at init | O(d·√(ln d)) | O(1)
Warmup required? | Yes — gradual LR increase | No — stable from step 1
Max stable depth | ~12 layers without tricks | 100+ layers (GPT-3 trains 96)
Final quality | Slightly better (when tuned) | Slightly worse, but practical
✨ Insight · DeepNorm (Wang et al. 2022) is a hybrid: up-weight the skip connection by a constant α and scale the residual-branch weight initialization down by β, then apply Post-Norm. This keeps the expected update magnitude at O(1) while retaining Post-Norm's final-quality advantage. It enables stable training of 1000-layer Transformers, but the tuning complexity means most practitioners still prefer Pre-RMSNorm.

Quick check

Derivation

Xiong et al. (2020) show Post-LN gradient norm at initialization scales as O(d√(ln d)). For GPT-3 with d=12,288, what is the approximate gradient norm, and why does Pre-LN eliminate the need for warmup?

🧠 Key Takeaways

What to remember for interviews

  1. LayerNorm normalizes each token's feature vector to mean≈0, std≈1, then rescales with learned γ and β — acting as a stability checkpoint before each sublayer.
  2. Unlike BatchNorm (which depends on other samples in the batch), LayerNorm operates independently per token, making it compatible with variable-length sequences and batch size 1.
  3. Pre-Norm (LayerNorm before the sublayer) keeps the residual path clean, allowing gradients to flow through the identity connection without distortion — this is why GPT-2, Llama, and Gemma all use Pre-Norm.
  4. RMSNorm (used in Llama, Mistral) drops mean-centering entirely and only divides by the root-mean-square, saving ~7–10% compute with negligible quality loss.
  5. Residual connections provide a gradient highway: the derivative ∂x_{i+1}/∂x_i = I + ∂f/∂x_i ensures gradients reach early layers without vanishing through deep stacks.
🧠 Recap Quiz

Trade-off

A language model processes sequences of variable length (4 to 4096 tokens) with batch size 1 at inference. Why does BatchNorm fail here while LayerNorm succeeds?

Derivation

Xiong et al. (2020) show that at initialization, Post-LN gradient norm at the output layer scales as O(d√(ln d)) while Pre-LN scales as O(1). What does this mean in practice for training a 96-layer model like GPT-3?

Trade-off

RMSNorm removes two components from LayerNorm. Identify both and explain why removing them is acceptable without measurable accuracy loss.

Derivation

Without residual connections, training a 96-layer Transformer from scratch is practically impossible. Derive the gradient flow argument: what exactly goes wrong without the identity skip?

Trade-off

The original Transformer (Vaswani 2017) used Post-Norm with 6 encoder + 6 decoder layers and required warmup. GPT-3 uses Pre-Norm with 96 layers and trains stably. What is the approximate maximum depth Post-Norm can reach without specialized tricks like DeepNorm?

Derivation

Llama-2 7B has 32 transformer layers, each with 2 RMSNorm operations (pre-attention and pre-FFN), plus a final norm. Each RMSNorm has d_model=4096 gamma parameters and no beta. How many norm parameters does this save versus full LayerNorm across the whole model?

Trade-off

DeepNorm (Wang et al. 2022) enables stable training of 1000-layer transformers using Post-Norm. What is the key modification that controls the gradient norm, and why do most practitioners still prefer Pre-RMSNorm?


🎯 Interview Questions


LayerNorm vs BatchNorm -- why do Transformers use LayerNorm?

★★☆
GoogleOpenAIAnthropic

Pre-Norm vs Post-Norm -- what are the training dynamics differences?

★★★
GoogleOpenAIAnthropic

Draw the gradient flow path for both variants. Which one has an unobstructed highway?

RMSNorm -- what is it and why do Llama/Gemma use it?

★★☆
GoogleMetaAnthropic

Why are residual connections critical for gradient flow?

★★☆
GoogleMeta

What happens if you apply LayerNorm after the residual connection instead of before?

★★★
AnthropicOpenAI