🧠 Induction Heads & ICL
GPT learns to copy patterns mid-training — and that single circuit explains in-context learning
GPT doesn't just predict the next token from memorized statistics — it learns to copy patterns from context mid-training. This capability emerges suddenly, not gradually, and it's powered by a specific two-head circuit called an induction head. Understanding this circuit is the closest we've come to a mechanistic explanation of in-context learning.
Olsson et al. (2022) argued, using multiple lines of evidence, that induction heads may be the mechanistic source of a large fraction — possibly the majority — of ICL behavior, and that their emergence coincides with a sharp phase transition that can be detected, measured, and causally verified in small models.
Circuit Diagram
What you're seeing: the two-head induction circuit across two transformer layers. Layer 0's previous-token head shifts token identities back by one position. Layer 1's induction head uses this shifted information to find where the current token appeared before and copy its successor.
What to try: trace the path for the second "Harry" — it queries against the shifted tokens in layer 0's output, matches "Potter" (because Potter carries "was preceded by Harry"), and reads Potter's value as the prediction.
The Intuition
The “strawberry” moment: Suppose you type "The cat sat on the mat" and then much later in the same context, "The cat sat on the" again. The model predicts mat — not because it memorized that phrase, but because it saw this sequence earlier in the same context window. How?
The answer is a two-layer attention circuit that implements exactly one rule: "If A followed B before, and I see A again, predict B."
Worked Example: Harry Potter
Context: … Harry Potter … Harry [?]
- Step 1 — Previous-token head (Layer 0): At every position i, this head copies the token at position i−1 into position i's residual stream. So position 2 ("Potter") now carries the information "I was preceded by Harry."
- Step 2 — Induction head (Layer 1): The current query is "Harry" (the second occurrence). The induction head scans all previous positions, looking for any position whose previous-token representation matches "Harry." It finds position 2 ("Potter"), which says "I was preceded by Harry." It attends to position 2 and copies its value — which is "Potter" — as the next-token prediction.
This circuit generalizes to any n-gram pattern in context. It doesn't require the model to have seen Harry Potter in training — it works on any repeated sequence, including random tokens. That's what makes it the foundation of in-context learning.
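To make the rule concrete, here is a minimal Python sketch (a toy lookup, not the attention circuit itself; the function name is ours) that applies "if A followed B before, and I see A again, predict B" to a plain list of tokens. It works just as well on repeated random tokens as on real phrases:

def induction_predict(tokens: list[str]) -> str | None:
    """Toy induction rule: find the most recent earlier occurrence of the
    final token and return the token that immediately followed it."""
    current = tokens[-1]
    # Scan backwards over earlier positions (the final token itself is excluded)
    for j in range(len(tokens) - 2, 0, -1):
        if tokens[j - 1] == current:   # token[j-1] matches the current token...
            return tokens[j]           # ...so predict its successor, token[j]
    return None                        # no earlier occurrence: the rule stays silent

print(induction_predict(["Harry", "Potter", "went", "to", "Harry"]))  # -> Potter
print(induction_predict(["x7", "q2", "z9", "x7"]))                    # -> q2 (random tokens)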
In larger models, the same mechanism operates in embedding space rather than exact token identity — enabling fuzzy induction. Instead of matching on exact tokens, the Q/K circuit matches on semantic similarity. Example: if the context showed "the doctor treated the patient" earlier, and "the" appears again, the induction head predicts semantically doctor-adjacent words — not necessarily the word "doctor" itself. This is one mechanism behind few-shot prompting: the model generalizes the in-context pattern rather than just copying format. Olsson et al. describe this as a continuum from exact induction (2-layer circuits) to abstract pattern completion (multilayer circuits built on the same foundation).
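A rough way to picture fuzzy induction in code (a toy sketch with made-up 4-dimensional embeddings, not the real Q/K weights): replace the exact-match test with a similarity score in embedding space, so the head attends to positions whose shifted previous-token vector is merely close to the current token's vector:

import torch
import torch.nn.functional as F

def fuzzy_induction_weights(query_vec: torch.Tensor,
                            prev_token_vecs: torch.Tensor) -> torch.Tensor:
    """Toy fuzzy QK match: score each past position j by the cosine similarity
    between the current token's embedding and the previous-token embedding
    (token[j-1]) that position j carries, then normalize into attention weights.

    query_vec:       (d,)   embedding of the current token
    prev_token_vecs: (T, d) embedding of token[j-1] stored at each position j
    """
    sims = F.cosine_similarity(prev_token_vecs, query_vec.unsqueeze(0), dim=-1)
    return torch.softmax(sims / 0.1, dim=0)  # temperature 0.1 sharpens the match

# Hypothetical embeddings: "physician" sits near "doctor", far from "banana".
doctor    = torch.tensor([1.0, 0.2, 0.0, 0.0])
physician = torch.tensor([0.9, 0.3, 0.1, 0.0])
banana    = torch.tensor([0.0, 0.0, 1.0, 0.5])

weights = fuzzy_induction_weights(physician, torch.stack([doctor, banana]))
print(weights)  # almost all weight on the doctor-like position, despite no exact match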
In the induction head circuit, what role does the previous-token head play?
The QK/OV Circuit
QK Circuit — Pattern Matching
The attention score A_{ij} between the current position i and a past position j determines whether the induction head "notices" a match. The query comes from the current token embedding; the key comes from the previous-token head's output — which encodes token[j−1]:

A_{ij} ∝ q_i · k_j

where q_i = W_Q x_i is derived from token[i]'s embedding and k_j (built from the previous-token head's output at j) encodes token[j−1]'s identity. So A_{ij} is large exactly when token[j−1] = token[i] — i.e., when position j holds the token that followed the last occurrence of the current token.
OV Circuit — Completion Copying
Once the induction head attends to position j, the OV (Output-Value) circuit reads token[j]'s content and writes it to the output. Since j is the position after the previous occurrence of the current token, this copies the completion:

out_i ≈ W_O W_V x_j

The matrix W_O W_V is the full OV circuit. Elhage et al. (2021) showed this can be analyzed directly as a single matrix that maps "what's at position j" to "what gets written into position i."
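Both circuits are easy to see with one-hot token vectors and identity weight matrices (purely illustrative assumptions; real heads learn dense, low-rank versions of these maps), but the QK match and OV copy behave exactly as described above:

import torch

vocab = ["Harry", "Potter", "went", "home"]

def one_hot(tok: str) -> torch.Tensor:
    v = torch.zeros(len(vocab))
    v[vocab.index(tok)] = 1.0
    return v

tokens = ["Harry", "Potter", "went", "Harry"]      # the second "Harry" is the query
x = torch.stack([one_hot(t) for t in tokens])      # (T, d) token embeddings

# Previous-token head (layer 0): position j now also carries token[j-1]'s identity.
prev = torch.zeros_like(x)
prev[1:] = x[:-1]

# QK circuit with W_Q = W_K = identity (toy assumption):
# A[i, j] = x[i] . prev[j] is large exactly when token[j-1] == token[i].
i = 3                                              # current position (second "Harry")
scores = prev @ x[i]                               # one score per past position
scores[i:] = float("-inf")                         # causal mask: attend only to the past
attn = torch.softmax(scores, dim=0)                # peaks at position 1 ("Potter")

# OV circuit with W_O W_V = identity (toy assumption): the head writes the
# attended position's token back out, copying the old completion.
out = attn @ x
print(vocab[out.argmax().item()])                  # -> Potter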
Layer Composition via Residual Stream
The two heads communicate through the residual stream. Layer 0 writes its output additively; layer 1 reads the updated stream:

x₁ = x₀ + Attn₀(x₀)
x₂ = x₁ + Attn₁(x₁)

The induction head in Attn₁ sees both the original token embedding x₀ and the previous-token head's output in x₁. This composition of two single-layer heads creates a two-layer circuit — the simplest possible form of inter-head communication.
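In code, the composition is just two additive updates to the same tensor. The sketch below is schematic: attn0 and attn1 are placeholders standing in for the previous-token head and the induction head, and the shift function is a crude stand-in for what a previous-token head computes:

import torch

def two_layer_forward(x0: torch.Tensor, attn0, attn1) -> torch.Tensor:
    """Residual-stream composition in a 2-layer attention-only model."""
    x1 = x0 + attn0(x0)   # layer 0 writes its output additively into the stream
    x2 = x1 + attn1(x1)   # layer 1 reads the updated stream: it sees x0 AND attn0's output
    return x2

# Placeholder "previous-token head": shift every position's vector forward by one.
def shift_forward(x: torch.Tensor) -> torch.Tensor:
    return torch.cat([torch.zeros_like(x[:1]), x[:-1]], dim=0)

x0 = torch.randn(6, 8)                             # (seq_len, d_model) token embeddings
print(two_layer_forward(x0, shift_forward, shift_forward).shape)  # torch.Size([6, 8])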
The Induction Score — Detecting the Circuit
To identify induction heads, Olsson et al. feed a repeated random sequence (two copies, each of length T) into the model and inspect each head's attention matrix. An induction head will place most of its attention at the diagonal offset by −(T−1) — position i attends to position i−(T−1), the token right after where the current token last appeared:

score_h = (1/|S|) Σ_{i∈S} A_h[i, i−(T−1)]

where S is the set of positions in the second copy and A_h is head h's attention matrix. A score close to 1 means the head is attending precisely at the expected offset — a strong induction signature. In GPT-2 Small, only a handful of heads (about five) score above the ~0.4 threshold, and those are the heads responsible for in-context learning.
PyTorch: Detecting Induction Heads via Attention Scores
import torch
import torch.nn.functional as F
def induction_score(attn_pattern: torch.Tensor) -> float:
"""
Measure how strongly a head shows induction behavior.
    attn_pattern: (seq_len, seq_len) attention weight matrix computed on a
    sequence made of two copies of a random sequence (each of length seq_len//2).
    An induction head attends at diagonal offset -(seq_len//2 - 1):
    position i attends to position i - (seq_len//2 - 1), the spot right
    after where the current token appeared in the first copy.
"""
seq_len = attn_pattern.shape[0]
half = seq_len // 2
# Extract the diagonal offset = -(half - 1)
# i.e., for position i in second copy, attend to position i - half + 1
offset = -(half - 1)
diag = torch.diagonal(attn_pattern, offset=offset)
return diag.mean().item()
def find_induction_heads(
model,
seq_len: int = 50,
threshold: float = 0.4,
device: str = "cpu"
) -> list[tuple[int, int]]:
"""
Run a repeated random sequence through the model and return all
(layer, head) pairs with induction score above threshold.
"""
vocab_size = model.config.vocab_size
n_layers = model.config.num_hidden_layers
n_heads = model.config.num_attention_heads
# Build a repeated random sequence: [A B C ... A B C ...]
rand_tokens = torch.randint(1, vocab_size, (1, seq_len), device=device)
tokens = torch.cat([rand_tokens, rand_tokens], dim=1) # (1, 2*seq_len)
with torch.no_grad():
outputs = model(tokens, output_attentions=True)
induction_heads = []
for layer_idx, layer_attn in enumerate(outputs.attentions):
# layer_attn: (batch, n_heads, seq, seq)
for head_idx in range(n_heads):
pattern = layer_attn[0, head_idx] # (2*seq_len, 2*seq_len)
score = induction_score(pattern)
if score > threshold:
induction_heads.append((layer_idx, head_idx))
print(f"Layer {layer_idx}, Head {head_idx}: score={score:.3f}")
    return induction_heads

PyTorch implementation (single-head variant)
# Induction head detection: repeated-sequence attention score
import torch
def induction_score_for_head(
model, layer: int, head: int, seq_len: int = 50
) -> float:
"""
Feed a repeated random sequence [A...A] of length 2*seq_len.
An induction head at (layer, head) will strongly attend at
diagonal offset -(seq_len - 1): position i attends to i-(seq_len-1),
the spot right after the previous occurrence of token[i].
Returns mean attention weight on that diagonal (0=no induction, 1=perfect).
"""
vocab = model.config.vocab_size
rand_seq = torch.randint(1, vocab, (1, seq_len))
tokens = torch.cat([rand_seq, rand_seq], dim=1) # (1, 2*seq_len)
with torch.no_grad():
out = model(tokens, output_attentions=True)
# out.attentions[layer]: (batch, n_heads, 2*seq_len, 2*seq_len)
attn = out.attentions[layer][0, head] # (2*seq_len, 2*seq_len)
offset = -(seq_len - 1)
diag = torch.diagonal(attn, offset=offset) # values on the induction diagonal
    return diag.mean().item()

Quick check
The QK circuit scores A_{ij} high when k_j encodes token[j−1] = token[i]. Which single change would make the induction head attend to the position two steps before the repeat, not one?
Break It — See What Happens
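One way to break the circuit is to ablate the detected induction heads and re-measure loss on the repeated half of a random sequence. The sketch below is a rough outline, assuming a Hugging Face GPT-2-style causal LM whose forward pass accepts a head_mask argument (1 = keep head, 0 = ablate); it reuses find_induction_heads from the code above, and exact numbers will vary by model:

import torch
import torch.nn.functional as F

def repeated_seq_loss(model, head_mask=None, seq_len: int = 50) -> float:
    """Cross-entropy loss measured only on the second copy of a repeated
    random sequence. If induction heads are ablated via head_mask, this
    loss should rise sharply."""
    vocab = model.config.vocab_size
    rand = torch.randint(1, vocab, (1, seq_len))
    tokens = torch.cat([rand, rand], dim=1)                    # (1, 2*seq_len)
    with torch.no_grad():
        logits = model(tokens, head_mask=head_mask).logits     # (1, 2*seq_len, vocab)
    # logits at position t predict token t+1; score only the second copy.
    preds = logits[0, seq_len - 1 : 2 * seq_len - 1]
    target = tokens[0, seq_len : 2 * seq_len]
    return F.cross_entropy(preds, target).item()

# Hypothetical usage: zero out the induction heads found earlier.
# heads = find_induction_heads(model)                          # [(layer, head), ...]
# mask = torch.ones(model.config.num_hidden_layers, model.config.num_attention_heads)
# for layer, head in heads:
#     mask[layer, head] = 0.0
# print(repeated_seq_loss(model), repeated_seq_loss(model, head_mask=mask))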
Quick check
When all induction heads are ablated, loss on repeated random sequences rises from ~0.1 to ~2.3. A colleague says the remaining ~0.1 residual ICL performance must come from MLP layers. Is this plausible, and why?
Real-World Numbers
| Finding | Model / Setting | Number |
|---|---|---|
| Phase change timing | Small models (2–8 layers) | Induction heads form abruptly at roughly 2B training tokens |
| Induction heads per model | GPT-2 Small (12L × 12H = 144 heads) | ~5 heads score above the 0.4 induction threshold |
| ICL contribution | 16 models studied by Olsson et al. | A large fraction — possibly the majority — of measured ICL; causal evidence strong in small attention-only models, correlational in larger ones |
| Layer depth of emergence | Minimal models | Requires at least 2 layers; layer 0 (prev-token) + layer 1 (induction) |
| Loss drop | Repeated random sequences | Loss on the repeated half falls to ~0.1 once induction heads form; ablating them sends it back to ~2.3 |
Quick check
GPT-2 Small has 12 layers × 12 heads = 144 heads total. Only ~5 score above 0.4 on the induction probe. If you wanted to preserve 90% of ICL performance while ablating as many heads as possible, what strategy follows from this finding?
Key Takeaways
What to remember for interviews
1. An induction head is a two-layer circuit: a previous-token head (L0) shifts token identities back one position, enabling an induction head (L1) to search for where the current token last appeared and copy what came after it.
2. This implements the rule "if A followed B before, predict B when you see A again" — the mechanistic foundation of in-context learning and few-shot prompting.
3. Induction heads emerge as a sharp phase transition mid-training (~2B tokens for small models), not gradually. The circuit's heads form together rather than one at a time — a sign of sudden circuit formation.
4. Ablating induction heads collapses in-context learning: loss on repeated sequences jumps roughly 23× (~0.1 → ~2.3), and the heads account for a large fraction — possibly the majority — of ICL performance across the 16 models studied.
5. In larger models, the same circuit operates in semantic embedding space, enabling fuzzy pattern matching and generalization — not just exact token copying.
Further Reading
- In-Context Learning and Induction Heads — Olsson et al. 2022 — the definitive paper showing induction heads are the mechanistic basis of in-context learning, with phase-change evidence across 16 models
- A Mathematical Framework for Transformer Circuits — Elhage et al. 2021 — introduces the QK/OV decomposition and residual stream view used throughout induction head analysis
- Understanding LSTM Networks — Olah 2015 — the gold-standard visual explainer of recurrent memory; useful context for understanding why in-context learning is surprising in attention-only models
- Tracing Attention Computation Through Feature Interactions — Kamath et al. 2025 — traces how attention QK circuits interact with features, extending induction head analysis to larger and more complex models
Interview Questions
★★☆ What is an induction head and how does it implement pattern completion?
★★★ Why do induction heads emerge as a phase change during training rather than gradually?
★★☆ How would you detect induction heads in a trained transformer? Describe the experimental setup.
★★★ Can induction heads explain generalization beyond exact copying? Give an example of fuzzy induction.
★★☆ What is the relationship between induction heads and the in-context learning loss bump?
Recap quiz
Induction Heads recap
The previous-token head in layer 0 writes token[i−1]'s identity into position i's residual stream. Why is this shift required for the induction circuit to work?
Why do both heads in the induction circuit emerge together as a sudden phase transition rather than one head forming first?
An induction score close to 1.0 on a repeated random sequence means the head attends at diagonal offset −(n−1). What does a score of ~0.05 indicate about that head?
Ablating all induction heads raises loss on repeated random sequences from ~0.1 to ~2.3. A colleague argues this proves induction heads do 95% of all language modeling work. What is the correct rebuttal?
In a 2-layer attention-only model the residual stream update is x₁ = x₀ + Attn₀(x₀), then x₂ = x₁ + Attn₁(x₁). Which layer does most of the in-context learning work and why?
Fuzzy induction in large models allows the circuit to generalize beyond exact token matching. What is the key difference in the Q/K computation that enables this?
An engineer claims they can detect when induction heads have formed during a training run without inspecting attention weights. What training-curve signal would confirm this?