🧠 Induction Heads & ICL
GPT learns to copy patterns mid-training — and that single circuit explains in-context learning
GPT doesn't just predict the next token from memorized statistics — it learns to copy patterns from context mid-training. This capability emerges suddenly, not gradually, and it's powered by a specific two-head circuit called an induction head. Understanding this circuit is the closest we've come to a mechanistic explanation of in-context learning.
Olsson et al. (2022) argued, using multiple lines of evidence, that induction heads may be the mechanistic source of a large fraction — possibly the majority — of ICL behavior, and that their emergence coincides with a sharp phase transition that can be detected, measured, and causally verified in small models.
Circuit Diagram
What you're seeing: the two-head induction circuit across two transformer layers. Layer 0's previous-token head shifts token identities back by one position. Layer 1's induction head uses this shifted information to find where the current token appeared before and copy its successor.
What to try: trace the path for the second "Harry" — it queries against the shifted tokens in layer 0's output, matches "Potter" (because Potter carries "was preceded by Harry"), and reads Potter's value as the prediction.
The Intuition
The “strawberry” moment: Suppose you type "The cat sat on the mat" and then much later in the same context, "The cat sat on the" again. The model predicts mat — not because it memorized that phrase, but because it saw this sequence earlier in the same context window. How?
The answer is a two-layer attention circuit that implements exactly one rule: "If A followed B before, and I see A again, predict B."
Worked Example: Harry Potter
Context: … Harry Potter … Harry [?]
- Step 1 — Previous-token head (Layer 0): At every position i, this head copies the token at position i−1 into position i's residual stream. So position 2 ("Potter") now carries the information "I was preceded by Harry."
- Step 2 — Induction head (Layer 1): The current query is "Harry" (the second occurrence). The induction head scans all previous positions, looking for any position whose previous-token representation matches "Harry." It finds position 2 ("Potter"), which says "I was preceded by Harry." It attends to position 2 and copies its value — which is "Potter" — as the next-token prediction.
This circuit generalizes to any n-gram pattern in context. It doesn't require the model to have seen Harry Potter in training — it works on any repeated sequence, including random tokens. That's what makes it the foundation of in-context learning.
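To make the rule concrete, here is a minimal Python sketch (a toy lookup, not the attention circuit itself; the function name is ours) that applies "if A followed B before, and I see A again, predict B" to a plain list of tokens. It works just as well on repeated random tokens as on real phrases:

def induction_predict(tokens: list[str]) -> str | None:
    """Toy induction rule: find the most recent earlier occurrence of the
    final token and return the token that immediately followed it."""
    current = tokens[-1]
    # Scan backwards over earlier positions (the final token itself is excluded)
    for j in range(len(tokens) - 2, 0, -1):
        if tokens[j - 1] == current:   # token[j-1] matches the current token...
            return tokens[j]           # ...so predict its successor, token[j]
    return None                        # no earlier occurrence: the rule stays silent

print(induction_predict(["Harry", "Potter", "went", "to", "Harry"]))  # -> Potter
print(induction_predict(["x7", "q2", "z9", "x7"]))                    # -> q2 (random tokens)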
In larger models, the same mechanism operates in embedding space rather than exact token identity — enabling fuzzy induction. Instead of matching on exact tokens, the Q/K circuit matches on semantic similarity. Example: if the context showed "the doctor treated the patient" earlier, and "the" appears again, the induction head predicts semantically doctor-adjacent words — not necessarily the word "doctor" itself. This is one mechanism behind few-shot prompting: the model generalizes the in-context pattern rather than just copying format. Olsson et al. describe this as a continuum from exact induction (2-layer circuits) to abstract pattern completion (multilayer circuits built on the same foundation).
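A rough way to picture fuzzy induction in code (a toy sketch with made-up 4-dimensional embeddings, not the real Q/K weights): replace the exact-match test with a similarity score in embedding space, so the head attends to positions whose shifted previous-token vector is merely close to the current token's vector:

import torch
import torch.nn.functional as F

def fuzzy_induction_weights(query_vec: torch.Tensor,
                            prev_token_vecs: torch.Tensor) -> torch.Tensor:
    """Toy fuzzy QK match: score each past position j by the cosine similarity
    between the current token's embedding and the previous-token embedding
    (token[j-1]) that position j carries, then normalize into attention weights.

    query_vec:       (d,)   embedding of the current token
    prev_token_vecs: (T, d) embedding of token[j-1] stored at each position j
    """
    sims = F.cosine_similarity(prev_token_vecs, query_vec.unsqueeze(0), dim=-1)
    return torch.softmax(sims / 0.1, dim=0)  # temperature 0.1 sharpens the match

# Hypothetical embeddings: "physician" sits near "doctor", far from "banana".
doctor    = torch.tensor([1.0, 0.2, 0.0, 0.0])
physician = torch.tensor([0.9, 0.3, 0.1, 0.0])
banana    = torch.tensor([0.0, 0.0, 1.0, 0.5])

weights = fuzzy_induction_weights(physician, torch.stack([doctor, banana]))
print(weights)  # almost all weight on the doctor-like position, despite no exact match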
In the induction head circuit, what role does the previous-token head play?
The QK/OV Circuit
QK Circuit — Pattern Matching
The attention score A_{ij} between the current position i and a past position j determines whether the induction head "notices" a match. The query comes from the current token embedding; the key comes from the previous-token head's output — which encodes token[j−1]:

A_{ij} ∝ q_i · k_j

where q_i = W_Q x_i is derived from token[i]'s embedding and k_j (built from the previous-token head's output at j) encodes token[j−1]'s identity. So A_{ij} is large exactly when token[j−1] = token[i] — i.e., when position j holds the token that followed the last occurrence of the current token.
OV Circuit — Completion Copying
Once the induction head attends to position j, the OV (Output-Value) circuit reads token[j]'s content and writes it to the output. Since j is the position after the previous occurrence of the current token, this copies the completion:

out_i ≈ W_O W_V x_j

The matrix W_O W_V is the full OV circuit. Elhage et al. (2021) showed this can be analyzed directly as a single matrix that maps "what's at position j" to "what gets written into position i."
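Both circuits are easy to see with one-hot token vectors and identity weight matrices (purely illustrative assumptions; real heads learn dense, low-rank versions of these maps), but the QK match and OV copy behave exactly as described above:

import torch

vocab = ["Harry", "Potter", "went", "home"]

def one_hot(tok: str) -> torch.Tensor:
    v = torch.zeros(len(vocab))
    v[vocab.index(tok)] = 1.0
    return v

tokens = ["Harry", "Potter", "went", "Harry"]      # the second "Harry" is the query
x = torch.stack([one_hot(t) for t in tokens])      # (T, d) token embeddings

# Previous-token head (layer 0): position j now also carries token[j-1]'s identity.
prev = torch.zeros_like(x)
prev[1:] = x[:-1]

# QK circuit with W_Q = W_K = identity (toy assumption):
# A[i, j] = x[i] . prev[j] is large exactly when token[j-1] == token[i].
i = 3                                              # current position (second "Harry")
scores = prev @ x[i]                               # one score per past position
scores[i:] = float("-inf")                         # causal mask: attend only to the past
attn = torch.softmax(scores, dim=0)                # peaks at position 1 ("Potter")

# OV circuit with W_O W_V = identity (toy assumption): the head writes the
# attended position's token back out, copying the old completion.
out = attn @ x
print(vocab[out.argmax().item()])                  # -> Potter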
Layer Composition via Residual Stream
The two heads communicate through the residual stream. Layer 0 writes its output additively; layer 1 reads the updated stream:

x₁ = x₀ + Attn₀(x₀)
x₂ = x₁ + Attn₁(x₁)

The induction head in Attn₁ sees both the original token embedding x₀ and the previous-token head's output in x₁. This composition of two single-layer heads creates a two-layer circuit — the simplest possible form of inter-head communication.
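In code, the composition is just two additive updates to the same tensor. The sketch below is schematic: attn0 and attn1 are placeholders standing in for the previous-token head and the induction head, and the shift function is a crude stand-in for what a previous-token head computes:

import torch

def two_layer_forward(x0: torch.Tensor, attn0, attn1) -> torch.Tensor:
    """Residual-stream composition in a 2-layer attention-only model."""
    x1 = x0 + attn0(x0)   # layer 0 writes its output additively into the stream
    x2 = x1 + attn1(x1)   # layer 1 reads the updated stream: it sees x0 AND attn0's output
    return x2

# Placeholder "previous-token head": shift every position's vector forward by one.
def shift_forward(x: torch.Tensor) -> torch.Tensor:
    return torch.cat([torch.zeros_like(x[:1]), x[:-1]], dim=0)

x0 = torch.randn(6, 8)                             # (seq_len, d_model) token embeddings
print(two_layer_forward(x0, shift_forward, shift_forward).shape)  # torch.Size([6, 8])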
The Induction Score — Detecting the Circuit
To identify induction heads, Olsson et al. feed a repeated random sequence (two copies, each of length T) into the model and inspect each head's attention matrix. An induction head will place most of its attention at the diagonal offset by −(T−1) — position i attends to position i−(T−1), the token right after where the current token last appeared:

score_h = (1/|S|) Σ_{i∈S} A_h[i, i−(T−1)]

where S is the set of positions in the second copy and A_h is head h's attention matrix. A score close to 1 means the head is attending precisely at the expected offset — a strong induction signature. In GPT-2 Small, only a handful of heads (about five) score above the ~0.4 threshold, and those are the heads responsible for in-context learning.
PyTorch: Detecting Induction Heads via Attention Scores
import torch
import torch.nn.functional as F
def induction_score(attn_pattern: torch.Tensor) -> float:
"""
Measure how strongly a head shows induction behavior.
    attn_pattern: (seq_len, seq_len) attention weight matrix computed on a
    sequence made of two copies of a random sequence (each of length seq_len//2).
    An induction head attends at diagonal offset -(seq_len//2 - 1):
    position i attends to position i - (seq_len//2 - 1), the spot right
    after where the current token appeared in the first copy.
"""
seq_len = attn_pattern.shape[0]
half = seq_len // 2
# Extract the diagonal offset = -(half - 1)
# i.e., for position i in second copy, attend to position i - half + 1
offset = -(half - 1)
diag = torch.diagonal(attn_pattern, offset=offset)
return diag.mean().item()
def find_induction_heads(
model,
seq_len: int = 50,
threshold: float = 0.4,
device: str = "cpu"
) -> list[tuple[int, int]]:
"""
Run a repeated random sequence through the model and return all
(layer, head) pairs with induction score above threshold.
"""
vocab_size = model.config.vocab_size
n_layers = model.config.num_hidden_layers
n_heads = model.config.num_attention_heads
# Build a repeated random sequence: [A B C ... A B C ...]
rand_tokens = torch.randint(1, vocab_size, (1, seq_len), device=device)
tokens = torch.cat([rand_tokens, rand_tokens], dim=1) # (1, 2*seq_len)
with torch.no_grad():
outputs = model(tokens, output_attentions=True)
induction_heads = []
for layer_idx, layer_attn in enumerate(outputs.attentions):
# layer_attn: (batch, n_heads, seq, seq)
for head_idx in range(n_heads):
pattern = layer_attn[0, head_idx] # (2*seq_len, 2*seq_len)
score = induction_score(pattern)
if score > threshold:
induction_heads.append((layer_idx, head_idx))
print(f"Layer {layer_idx}, Head {head_idx}: score={score:.3f}")
    return induction_heads

PyTorch implementation (single-head variant)
# Induction head detection: repeated-sequence attention score
import torch
def induction_score_for_head(
model, layer: int, head: int, seq_len: int = 50
) -> float:
"""
Feed a repeated random sequence [A...A] of length 2*seq_len.
An induction head at (layer, head) will strongly attend at
diagonal offset -(seq_len - 1): position i attends to i-(seq_len-1),
the spot right after the previous occurrence of token[i].
Returns mean attention weight on that diagonal (0=no induction, 1=perfect).
"""
vocab = model.config.vocab_size
rand_seq = torch.randint(1, vocab, (1, seq_len))
tokens = torch.cat([rand_seq, rand_seq], dim=1) # (1, 2*seq_len)
with torch.no_grad():
out = model(tokens, output_attentions=True)
# out.attentions[layer]: (batch, n_heads, 2*seq_len, 2*seq_len)
attn = out.attentions[layer][0, head] # (2*seq_len, 2*seq_len)
offset = -(seq_len - 1)
diag = torch.diagonal(attn, offset=offset) # values on the induction diagonal
    return diag.mean().item()

Quick check
The QK circuit scores A_{ij} high when k_j encodes token[j−1] = token[i]. Which single change would make the induction head attend to the position two steps before the repeat, not one?
Break It — See What Happens
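One way to break the circuit is to ablate the detected induction heads and re-measure loss on the repeated half of a random sequence. The sketch below is a rough outline, assuming a Hugging Face GPT-2-style causal LM whose forward pass accepts a head_mask argument (1 = keep head, 0 = ablate); it reuses find_induction_heads from the code above, and exact numbers will vary by model:

import torch
import torch.nn.functional as F

def repeated_seq_loss(model, head_mask=None, seq_len: int = 50) -> float:
    """Cross-entropy loss measured only on the second copy of a repeated
    random sequence. If induction heads are ablated via head_mask, this
    loss should rise sharply."""
    vocab = model.config.vocab_size
    rand = torch.randint(1, vocab, (1, seq_len))
    tokens = torch.cat([rand, rand], dim=1)                    # (1, 2*seq_len)
    with torch.no_grad():
        logits = model(tokens, head_mask=head_mask).logits     # (1, 2*seq_len, vocab)
    # logits at position t predict token t+1; score only the second copy.
    preds = logits[0, seq_len - 1 : 2 * seq_len - 1]
    target = tokens[0, seq_len : 2 * seq_len]
    return F.cross_entropy(preds, target).item()

# Hypothetical usage: zero out the induction heads found earlier.
# heads = find_induction_heads(model)                          # [(layer, head), ...]
# mask = torch.ones(model.config.num_hidden_layers, model.config.num_attention_heads)
# for layer, head in heads:
#     mask[layer, head] = 0.0
# print(repeated_seq_loss(model), repeated_seq_loss(model, head_mask=mask))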
Quick check
When all induction heads are ablated, loss on repeated random sequences rises from ~0.1 to ~2.3. A colleague says the remaining ~0.1 residual ICL performance must come from MLP layers. Is this plausible, and why?
Real-World Numbers
| Finding | Model / Setting | Number |
|---|---|---|
| Phase change timing | Small models (2–8 layers) | Induction heads form abruptly at roughly 2B training tokens |
| Induction heads per model | GPT-2 Small (12L × 12H = 144 heads) | ~5 heads score above the 0.4 induction threshold |
| ICL contribution | 16 models studied by Olsson et al. | A large fraction — possibly the majority — of measured ICL; causal evidence strong in small attention-only models, correlational in larger ones |
| Layer depth of emergence | Minimal models | Requires at least 2 layers; layer 0 (prev-token) + layer 1 (induction) |
| Loss drop | Repeated random sequences | Loss on the repeated half falls to ~0.1 once induction heads form; ablating them sends it back to ~2.3 |
Quick check
GPT-2 Small has 12 layers × 12 heads = 144 heads total. Only ~5 score above 0.4 on the induction probe. If you wanted to preserve 90% of ICL performance while ablating as many heads as possible, what strategy follows from this finding?
Key Takeaways
What to remember for interviews
1. An induction head is a two-layer circuit: a previous-token head (L0) shifts token identities back one position, enabling an induction head (L1) to search for where the current token last appeared and copy what came after it.
2. This implements the rule "if A followed B before, predict B when you see A again" — the mechanistic foundation of in-context learning and few-shot prompting.
3. Induction heads emerge as a sharp phase transition mid-training (~2B tokens for small models), not gradually. The circuit's heads form together rather than one at a time — a sign of sudden circuit formation.
4. Ablating induction heads collapses in-context learning: loss on repeated sequences jumps roughly 23× (~0.1 → ~2.3), and the heads account for a large fraction — possibly the majority — of ICL performance across the 16 models studied.
5. In larger models, the same circuit operates in semantic embedding space, enabling fuzzy pattern matching and generalization — not just exact token copying.
Further Reading
- In-Context Learning and Induction Heads — Olsson et al. 2022 — the definitive paper showing induction heads are the mechanistic basis of in-context learning, with phase-change evidence across 16 models
- A Mathematical Framework for Transformer Circuits — Elhage et al. 2021 — introduces the QK/OV decomposition and residual stream view used throughout induction head analysis
- Understanding LSTM Networks — Olah 2015 — the gold-standard visual explainer of recurrent memory; useful context for understanding why in-context learning is surprising in attention-only models
- Tracing Attention Computation Through Feature Interactions — Kamath et al. 2025 — traces how attention QK circuits interact with features, extending induction head analysis to larger and more complex models
Interview Questions
★★☆ What is an induction head and how does it implement pattern completion?
★★★ Why do induction heads emerge as a phase change during training rather than gradually?
★★☆ How would you detect induction heads in a trained transformer? Describe the experimental setup.
★★★ Can induction heads explain generalization beyond exact copying? Give an example of fuzzy induction.
★★☆ What is the relationship between induction heads and the in-context learning loss bump?
Recap quiz
Induction Heads recap
The previous-token head in layer 0 writes token[i−1]'s identity into position i's residual stream. Why is this shift required for the induction circuit to work?
Why do both heads in the induction circuit emerge together as a sudden phase transition rather than one head forming first?
An induction score close to 1.0 on a repeated random sequence means the head attends at diagonal offset −(n−1). What does a score of ~0.05 indicate about that head?
Ablating all induction heads raises loss on repeated random sequences from ~0.1 to ~2.3. A colleague argues this proves induction heads do 95% of all language modeling work. What is the correct rebuttal?
In a 2-layer attention-only model the residual stream update is x₁ = x₀ + Attn₀(x₀), then x₂ = x₁ + Attn₁(x₁). Which layer does most of the in-context learning work and why?
Fuzzy induction in large models allows the circuit to generalize beyond exact token matching. What is the key difference in the Q/K computation that enables this?
An engineer claims they can detect when induction heads have formed during a training run without inspecting attention weights. What training-curve signal would confirm this?