🏗️ High-Level Overview
GPT-3 predicts each token in 6ms — but processes the entire 96-layer forward pass to do it. Why can’t it just skip layers it already ‘knows’?
The Complete Transformer Pipeline
This interactive diagram shows every stage a token passes through in a decoder-only Transformer. Click any stage to jump to its dedicated module. Hover for a quick summary.
Architecture at a Glance
What you’re seeing: the data flow from raw tokens through one (or N stacked) Transformer blocks to output logits. Dashed accent lines are the residual bypasses — information flows around each sublayer, not just through it.
What to try: trace a single token from top to bottom. Notice that the shape of the vector never changes — every sublayer returns a vector of size d_model, added back to the stream. The model scales by increasing N (layers) and d_model (width), not by changing the shape.
What Breaks Without Each Component?
The Transformer is a sequence-to-sequence machine that converts a list of tokens into a probability distribution over the next token. Every modern LLM — GPT-4, Claude, Llama, Gemini — is built from repeated Transformer blocks (with variations like MoE or GQA). But why this particular combination of components? Each one fixes a specific failure mode.
Attention = Routing
Self-attention decides which tokens talk to which. Each token computes a query (“what am I looking for?”) and attends to keys from all previous tokens (“who has it?”). Without attention, the model is blind to context — “bank” in “river bank” and “bank account” would be indistinguishable after embedding.
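A minimal sketch of this routing step (toy shapes and random stand-in weights, not any trained model): each token's query is dotted against every key, future positions are masked, and the softmaxed scores become mixing weights over the value vectors.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x, W_q, W_k, W_v):
    """Single-head causal attention over a (seq_len, d) activation matrix."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v              # each (seq_len, d_k)
    scores = q @ k.T / (k.shape[-1] ** 0.5)           # (seq_len, seq_len): "who talks to whom"
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))  # no peeking at future tokens
    weights = F.softmax(scores, dim=-1)               # each row sums to 1: routing weights
    return weights @ v                                # mix value vectors per token

# toy example: 4 tokens, d = d_k = 8
x = torch.randn(4, 8)
W = [torch.randn(8, 8) * 0.1 for _ in range(3)]
out = causal_self_attention(x, *W)   # (4, 8)
```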
FFN = Memory
The feed-forward network applies the same MLP to each token independently. Research (Geva et al., 2021) shows FFNs behave like key-value memories: the first layer detects patterns, the second retrieves associated facts. “Paris is the capital of ___” is answered by FFN neurons that activate on “capital of France” patterns.
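A toy illustration of that key-value reading (random weights, purely for the shapes; real "keys" and "values" are learned): rows of the first matrix act as pattern detectors, the activation scores them, and the second matrix's rows are the vectors mixed by those scores.

```python
import torch
import torch.nn.functional as F

d_model, d_ff = 768, 3072
W_in = torch.randn(d_ff, d_model) * 0.02   # rows ~ "keys": pattern detectors
W_out = torch.randn(d_ff, d_model) * 0.02  # rows ~ "values": what to add to the stream

x = torch.randn(d_model)          # one token's residual-stream vector
key_scores = F.gelu(W_in @ x)     # (d_ff,): how strongly each pattern fires
ffn_out = key_scores @ W_out      # weighted sum of value rows -> (d_model,)
```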
Residual = Highway
Residual connections create a direct gradient highway from the loss back to every layer. Without them, gradients must flow through 96+ matrix multiplications — any eigenvalue slightly below 1 causes exponential vanishing. With residuals, early layers remain trainable even at GPT-3 scale.
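A quick back-of-the-envelope check of that claim (pure arithmetic, not a real training run): multiplying 96 Jacobians whose norms sit slightly below 1 shrinks the gradient geometrically, while the residual path adds an identity term at every layer.

```python
# Gradient magnitude after passing through 96 layers, assuming each layer's
# Jacobian scales the gradient by a fixed factor (toy model of the no-residual case).
depth = 96
for factor in (0.99, 0.95, 0.90):
    print(f"factor {factor}: {factor ** depth:.4f}")
# factor 0.99: 0.3810  -> survivable
# factor 0.95: 0.0072  -> nearly gone
# factor 0.90: 0.0000  -> vanished (about 4e-5)
# With residuals, each layer's Jacobian is (I + J): the identity term keeps a
# direct path back to the loss, so the product never collapses this way.
```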
LayerNorm = Stabilizer
LayerNorm rescales each token’s activation vector to mean 0, std 1 before each sublayer. Without it, activations amplify with depth — attention logits become unbounded, softmax collapses to one-hot, and gradients vanish. Pre-Norm (normalize before the sublayer, not after) keeps the residual path clean, enabling stable training without learning-rate warmup.
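To make the "mean 0, std 1" step concrete, here is LayerNorm written out by hand and checked against PyTorch (at initialization the learned γ and β are 1 and 0, so a fresh layer is pure normalization):

```python
import torch
import torch.nn as nn

x = torch.randn(768) * 50 + 10           # deliberately badly scaled activations

# Manual LayerNorm: normalize across the feature dimension of one token
mean, var = x.mean(), x.var(unbiased=False)
x_hat = (x - mean) / torch.sqrt(var + 1e-5)

ln = nn.LayerNorm(768)                   # gamma=1, beta=0 at init
print(torch.allclose(ln(x), x_hat, atol=1e-4))                  # True
print(x_hat.mean().item(), x_hat.std(unbiased=False).item())    # ~0.0, ~1.0
```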
The residual stream (Elhage et al., 2021) is the backbone: each attention head and FFN sublayer writes additively into a shared d-dimensional vector. This means individual layers can be pruned with graceful degradation — the stream carries redundant information.
Worked Example: tracing “cat” through one Transformer block
Model: GPT-2 small — d_model=768, 12 heads, d_k=d_v=64 per head.
1. Embed. The token ID for “cat” is looked up in the embedding table (vocab × 768). Add the positional embedding for position t. Result: a single vector x ∈ ℝ⁷⁶⁸ — this is the residual stream state entering the block.
2. Pre-LayerNorm. Compute x̂ = LayerNorm(x). This normalizes the 768 values to mean≈0, std≈1, then applies learned scale γ and bias β. The original x is untouched — LayerNorm only affects the copy passed into attention.
3. Multi-Head Attention. Project x̂ into 12 separate Q/K/V triplets (each 64-dim via a 768→64 linear). Each head computes scaled dot-product attention over all previous positions. The 12 outputs (each 64-dim) are concatenated → 768-dim → projected through W_O (768×768). Result: attn_out ∈ ℝ⁷⁶⁸.
4. Residual add. x ← x + attn_out. The stream for “cat” is updated in place. If attention learned nothing useful, attn_out ≈ 0 and x passes through unchanged — the residual is the “default path”.
5. Pre-LayerNorm → FFN. Normalize again: x̂ = LayerNorm(x). Pass through the FFN: 768 → 3072 → 768 with GELU activation. This is where most per-token computation happens — the FFN expands to 4× width to mix features.
6. Residual add. x ← x + ffn_out. After 12 blocks of this, the final x ∈ ℝ⁷⁶⁸ is projected to vocab size (50257 for GPT-2) and softmaxed. The probability assigned to the next token is the output.
Key shape invariant: x stays (batch, seq_len, 768) throughout all N blocks — residuals guarantee this.
Quick check
Geva et al. (2021) show FFN layers act as key-value memories. Which component plays which role?
What is the primary role of the FFN layers in a transformer?
Parameter Counting
Where do all those billions of parameters actually live? The formula is simpler than it looks — everything reduces to a few matrix-multiplication shapes.
1. Token Embedding Table
One vector of size d_model per vocabulary token. For GPT-2: 50 257 × 768 ≈ 38.6M params.
2. Attention (per layer)
4 weight matrices, each d×d
Each of the four projection matrices (Q, K, V, output) is d × d, giving 4d² parameters per layer. With h heads, the per-head size is d × d/h, but the total stays 4d².
3. FFN (per layer)
Two matrices, d × d_ff and d_ff × d; with the standard d_ff = 4d this is 8d² parameters per layer.
4. Total
Total ≈ V·d + L·(4d² + 8d²) = V·d + 12Ld², ignoring biases and LayerNorm (< 0.1% of params).
For GPT-2 Small (L=12, d=768, V=50 257): V·d + 12Ld² = 50 257 × 768 + 12 × 12 × 768² ≈ 38.6M + 84.9M ≈ 124M parameters.
GPT-2 Small worked example
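The same arithmetic in plain Python (a quick sanity check, no model weights needed; it matches the count_transformer_params formula further below):

```python
V, d, L = 50_257, 768, 12

embed = V * d                    # 38,597,376 ≈ 38.6M
attn_per_layer = 4 * d * d       #  2,359,296 (W_Q, W_K, W_V, W_O)
ffn_per_layer = 2 * d * (4 * d)  #  4,718,592 (768→3072 and 3072→768)
blocks = L * (attn_per_layer + ffn_per_layer)   # 84,934,656 ≈ 84.9M

print(f"total ≈ {(embed + blocks) / 1e6:.1f}M")
# → 123.5M (~124M once biases, LayerNorm, and positional embeddings are added)
```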
The figure closely matches the official GPT-2 paper (the simplified formula omits positional embeddings and biases, which account for <1% of total params).
The minimal Pre-Norm block below matches GPT-2, Llama, and every modern decoder-only model. Note the two residual adds and that LayerNorm always precedes the sublayer.
One Transformer block — Pre-Norm with residual connections
import torch
import torch.nn as nn
class TransformerBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x, attn_mask=None):
        # Pre-Norm: LayerNorm BEFORE sublayer, residual wraps around it
        normed = self.ln1(x)
        attn_out, _ = self.attn(normed, normed, normed, attn_mask=attn_mask)
        x = x + attn_out              # residual add
        normed = self.ln2(x)
        x = x + self.ffn(normed)      # residual add
        return x
# GPT-2 small: d_model=768, 12 heads → d_k=64 per head
block = TransformerBlock(d_model=768, n_heads=12, d_ff=3072)
x = torch.randn(1, 16, 768) # batch=1, seq_len=16, d_model=768
causal_mask = nn.Transformer.generate_square_subsequent_mask(16)
out = block(x, attn_mask=causal_mask)  # shape: (1, 16, 768)

Verify your formula against real model weights:
Parameter counting — formula vs real model
import torch
import torch.nn as nn
def count_transformer_params(vocab_size, d_model, n_layers, d_ff=None):
"""
Parameter breakdown for a standard decoder-only Transformer.
Formula: V*d + L*(4d^2 + 8d^2) = V*d + L*12d^2
"""
if d_ff is None:
d_ff = 4 * d_model # standard FFN expansion ratio
embedding = vocab_size * d_model # token embeddings only
per_layer = (
4 * d_model**2 + # attention: W_Q, W_K, W_V, W_O each d×d
2 * d_model * d_ff # FFN: d→d_ff and d_ff→d
)
total = embedding + n_layers * per_layer
return total
# GPT-2 Small: L=12, d=768, V=50257
params = count_transformer_params(
vocab_size=50257, d_model=768, n_layers=12
)
print(f"GPT-2 Small: {params / 1e6:.1f}M params") # → ~124M
# GPT-3: L=96, d=12288, V=50257
params_gpt3 = count_transformer_params(
vocab_size=50257, d_model=12288, n_layers=96
)
print(f"GPT-3: {params_gpt3 / 1e9:.1f}B params") # → ~175B
# Verify against a real model
from transformers import GPT2Model
model = GPT2Model.from_pretrained("gpt2")
real = sum(p.numel() for p in model.parameters())
print(f"GPT-2 actual: {real / 1e6:.1f}M") # → 124.4MAdvanced: SwiGLU FFN changes the formula
Llama 2/3 replace the standard 2-matrix FFN with SwiGLU, which uses three matrices (gate, up, down) at reduced width to preserve parameter count:
SwiGLU: same total params, different shape
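A sketch of the SwiGLU block and its parameter budget (the hidden width varies by model; the 11008 used below is Llama 2 7B's value, roughly 8d/3 rounded up, and the weights here are untrained stand-ins):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """SwiGLU FFN: three matrices (gate, up, down) instead of two."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

d = 4096
standard = 2 * d * (4 * d)   # two matrices at d_ff = 4d       → ~134.2M params
swiglu = 3 * d * 11008       # three matrices at reduced width → ~135.3M params
print(standard / 1e6, swiglu / 1e6)   # roughly equal parameter budgets
```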
The parameter count formula holds approximately regardless — SwiGLU shrinks d_ff (from 4d to roughly 8d/3) to compensate for the extra matrix.
Quick check
For a decoder-only model with L layers and d_model = d, which component dominates parameter count as d scales (assuming V is fixed and d_ff = 4d)?
Break It
Toggle these ablations to see what each component actually contributes. These aren’t hypothetical — real training runs with these ablations fail within the first thousand steps.
Quick check
A 96-layer model is trained without residual connections. Gradients must pass through all 96 Jacobians sequentially. If each layer's Jacobian has spectral norm 0.99, what is the approximate gradient magnitude at layer 1 relative to layer 96?
Real Numbers
Every number below comes from the original papers or official technical reports (except GPT-4, whose figures are community estimates). The scaling trend is clear: parameter counts grow by orders of magnitude from generation to generation, and training-token counts grow even faster.
| Model | Layers (L) | d_model | Heads | Params | Training Tokens |
|---|---|---|---|---|---|
| GPT-2 Small | 12 | 768 | 12 | 124M | not reported (WebText) |
| GPT-2 XL | 48 | 1600 | 25 | 1.5B | not reported (WebText) |
| GPT-3 | 96 | 12 288 | 96 | 175B | 300B |
| Llama 2 7B | 32 | 4096 | 32 | 7B | 2T |
| Llama 2 70B | 80 | 8192 | 64 | 70B | 2T |
| Llama 3 8B | 32 | 4096 | 32 | 8B | 15T |
| Llama 3 70B | 80 | 8192 | 64 | 70B | 15T |
| GPT-4 (est.) | ~120 | ~12 288 | ~96 | ~1.8T (MoE) | unknown |
GPT-4 numbers are community estimates — not confirmed by OpenAI. All other numbers are from official papers or technical reports.
Quick check
Llama 3 8B trains on 15T tokens, vastly beyond the Chinchilla-optimal ~160B. An interviewer asks: “Is this wasteful compute?” What is the correct response?
Key Takeaways
What to remember for interviews
1. Every modern LLM is a stack of identical Transformer blocks: LayerNorm → Multi-Head Attention → residual add → LayerNorm → FFN → residual add.
2. Attention routes information between token positions; FFN stores factual knowledge per-token; residuals are gradient highways; LayerNorm stabilizes activation scale.
3. The residual stream is the backbone — each sublayer writes additively into a shared d-dimensional vector, never replacing it, preserving earlier information.
4. Parameter count follows V·d + 12Ld²: GPT-2 Small (L=12, d=768, V=50 257) = 124M, GPT-3 (L=96, d=12 288) = 175B.
5. Pre-Norm (LayerNorm before each sublayer) keeps the residual path clean and enables stable training at 96+ layers without learning-rate warmup.
Further Reading
- Attention Is All You Need — Vaswani et al. 2017 — the original Transformer paper
- The Illustrated Transformer — Jay Alammar's visual guide — the gold standard for Transformer intuition
- Let's build GPT from scratch — Andrej Karpathy builds a GPT in ~2 hours with running code
- But what is a GPT? — 3Blue1Brown's visual deep dive into the Transformer architecture
- A Mathematical Framework for Transformer Circuits — Elhage et al. 2021 — the residual stream framework for understanding Transformers mechanistically
- The Illustrated GPT-2 — Jay Alammar — step-by-step walkthrough of GPT-2's decoder-only architecture, masking, and generation loop
- Explaining Transformers (Alammar) — Jay Alammar's earlier illustrated overview of encoder-decoder Transformers for sequence-to-sequence tasks
- Transformer Feed-Forward Layers Are Key-Value Memories — Geva et al. 2021 — shows FFN neurons store factual associations as key-value pairs
- Training Compute-Optimal Large Language Models (Chinchilla) — Hoffmann et al. 2022 — the scaling law showing optimal token count ≈ 20× parameter count
Interview Questions
- ★★☆ Walk through the full forward pass of a decoder-only Transformer. What happens at each stage from raw text input to next-token probability?
- ★★☆ Why do modern Transformers use Pre-Norm (LayerNorm before sublayer) instead of Post-Norm (after)?
- ★☆☆ What is the causal mask in self-attention and why is it necessary for autoregressive generation?
- ★★★ Compare the parameter count and compute distribution across the main components of a Transformer. Where do most parameters live?

Recap Quiz
GPT-2 Small has L=12, d_model=768, V=50 257. Using the formula Total = V·d + 12Ld², what is the approximate parameter count?
Llama 3 8B was trained on 15T tokens — far beyond the Chinchilla-optimal ~160B tokens. What is the primary engineering reason to over-train this way?
At what sequence length does attention compute (O(n²d)) overtake FFN compute (O(nd²)) for GPT-3 (d=12 288)?
Why does Pre-Norm (LayerNorm before each sublayer) enable stable training without learning-rate warmup, while Post-Norm requires warmup?
In a standard Transformer, what fraction of parameters live in FFN layers (vs attention) for GPT-3 style config (d_ff = 4d, full attention)?
Self-attention without positional encoding treats all token positions equivalently. What is the precise term for this property?
A model has the same L=12, d=768 config as GPT-2 Small but uses Grouped Query Attention (GQA) with 1 key-value head per 12 query heads. How does the attention parameter count change?