🔬 Mechanistic Interpretability
Anthropic found a single Claude feature that fires on mentions of the Golden Gate Bridge, and clamping it causally steers Claude to bring up the bridge in nearly every response.
Anthropic traced parts of Claude's internal computation using attribution graphs. They followed a multi-step reasoning example — asking about the capital of Texas — and watched as intermediate features activated sequentially, ultimately writing the answer. The method captures a fraction of the total computation, not a complete end-to-end trace.
This module is a hands-on companion to the Interpretability module. That module explains what superposition and SAEs are. This one shows how to actually do mechanistic interpretability research — training SAEs, running attribution patching, tracing circuits, and steering model behavior.
SAE Architecture
What you're seeing
A Sparse Autoencoder trained on a layer's residual stream activations. The encoder projects from d_model (8, simplified) into a much wider feature space d_sae (24 here; real SAEs use 32–256× expansion). An L1 penalty forces most feature activations to zero — only a handful “light up” for any given input. The decoder reconstructs the original activation from those sparse features; each decoder column is one learned “feature direction”.
What to notice
Three green features are active (Capitals, Math, Code) out of 24 — that's the sparsity in action. The loss combines reconstruction fidelity (||x − x̂||²) and a sparsity penalty (λ||f||₁). DeepMind's Gemma Scope SAEs achieve >99% reconstruction with most features firing <1% of the time.
SAE Training: How It Actually Works
The theory is in the Interpretability module. Here's what you actually build. An SAE has two parts: an encoder that maps model activations to a much wider space, and a decoder that maps back. The decoder column directions are the features.
SAE architecture
f = ReLU(W_enc(x − b_dec)), x̂ = W_dec f + b_dec, with d_sae = expansion × d_model (much wider than the residual stream). Most entries of f are zero (L1 penalty). Each column of W_dec is one feature direction.
Real hyperparameters from Anthropic papers
| Hyperparameter | Typical value | What it controls |
|---|---|---|
| expansion_factor | 32×–256× | How many more features than model dimensions; higher = more capacity but slower |
| λ (L1 coefficient) | 1e-3 – 1e-1 | Sparsity vs. reconstruction trade-off; tune for an L0 of 20–100 features per token (see the L0 sketch below) |
| training_tokens | ~1B+ | Sufficient coverage for features to specialize; fewer tokens → more dead features |
| learning_rate | 1e-4 – 5e-5 | Adam optimizer; cosine decay |
| target_layer | Middle layers | Middle layers have richest representations; Anthropic used residual stream post-MLP |
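To check that λ actually lands in the 20–100 L0 target from the table, count non-zero encoder outputs per token on held-out activations. A minimal sketch; the commented usage assumes the SparseAutoencoder class defined in the training loop below:

import torch

def mean_l0(f: torch.Tensor, eps: float = 1e-8) -> float:
    """Average number of active (non-zero) features per token.

    f: [n_tokens, d_sae] feature activations from the SAE encoder.
    If L0 is far above the 20-100 target, raise the L1 coefficient;
    if it is far below (or many features are dead), lower it.
    """
    return (f.abs() > eps).float().sum(dim=-1).mean().item()

# Hypothetical usage with the SparseAutoencoder defined below:
# _, f = sae(heldout_activations)            # f: [B, d_sae]
# print(f"L0 = {mean_l0(f):.1f} active features per token")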
PyTorch: Minimal SAE Training Loop
import torch
import torch.nn as nn
from torch.optim import Adam
class SparseAutoencoder(nn.Module):
def __init__(self, d_model: int, expansion: int = 64):
super().__init__()
d_sae = d_model * expansion
self.W_enc = nn.Linear(d_model, d_sae, bias=True)
self.W_dec = nn.Linear(d_sae, d_model, bias=True)
self.relu = nn.ReLU()
# Normalize decoder columns to unit norm
self._normalize_decoder()
def _normalize_decoder(self):
with torch.no_grad():
norms = self.W_dec.weight.norm(dim=0, keepdim=True).clamp(min=1e-8)
self.W_dec.weight.div_(norms)
def forward(self, x: torch.Tensor):
# Center around decoder bias before encoding
x_cent = x - self.W_dec.bias
f = self.relu(self.W_enc(x_cent)) # sparse features
x_hat = self.W_dec(f) # reconstruction
return x_hat, f
def sae_loss(x, x_hat, f, lam: float = 5e-3):
recon = (x - x_hat).pow(2).mean() # MSE reconstruction
sparsity = f.abs().mean() # L1 on features
return recon + lam * sparsity, recon, sparsity
# Training loop sketch
sae = SparseAutoencoder(d_model=4096, expansion=64)
opt = Adam(sae.parameters(), lr=2e-4)
for activations in dataloader: # activations: [B, d_model]
opt.zero_grad()
x_hat, f = sae(activations)
loss, recon, sparse = sae_loss(activations, x_hat, f, lam=5e-3)
loss.backward()
opt.step()
    sae._normalize_decoder()  # keep decoder cols unit norm
Quick check
You observe 30% dead features after training. Lowering the L1 coefficient hasn't helped. What is the most likely root cause?
The Intuition: MRI-ing a Language Model
Imagine you could MRI a language model's brain while it thinks. Not just watch inputs and outputs, but trace every intermediate step — which concepts activated, in what order, and how they caused the final answer.
That's what Anthropic did with Claude 3.5 Haiku in their 2025 Biology paper. The experiment:
Circuit trace: “What is the capital of Texas?” (Anthropic, 2025)
Token "Texas" is read
The residual stream at the "Texas" token position starts activating features.
"Texas-state" feature fires
An SAE feature learned to recognize Texas as a US state activates strongly. It encodes geographic context.
"Austin" (capital) feature activates
The state feature propagates, causing the capital-city feature for Austin to fire — the model is building the answer.
"Austin" written to output
The Austin feature writes its direction to the residual stream at the answer position, and the model decodes "Austin".
Steering validation: two-hop geography
Anthropic asked: “What is the capital of the state containing Dallas?” The circuit activated a Texas feature which then output Austin. To confirm causality, researchers intervened mid-computation — swapping the Texas feature activation for a California feature activation. The model output changed to Sacramento. The feature, not the input token, drove the answer.
This is the mechanistic interpretability litmus test: if swapping feature F for feature G at the right layer changes the output from answer(F) to answer(G), you have found the causal bottleneck. Average circuits like this span 2.3 hops (Ameisen et al. 2025).
Three methods form the practical toolkit:
| Method | What it does | Cost |
|---|---|---|
| Sparse Autoencoder (SAE) | Decomposes activations into named, monosemantic features | Train once, reuse |
| Activation Patching | Swap activations between runs to measure causal effect | O(N) forward passes |
| Attribution Patching | Gradient approximation of causal effect — fast circuit tracing | 1 forward + 1 backward |
Why does the SAE decoder matrix represent interpretable features?
Attribution Patching & Circuit Tracing
Method 1: Activation Patching (exact, slow)
Run the model on a clean input and a corrupted input. For each component c_i, swap its activation from the clean run into the corrupted run and measure the change in the output metric M:

ΔM_i = M(corrupted run with clean c_i patched in) − M(corrupted run)
Large effect = component causally matters. Requires one forward pass per component — O(N) total.
Method 2: Attribution Patching (approximate, fast)
First-order Taylor approximation. The attribution of feature f_i on the output metric M is:

attr(f_i) ≈ (∂M/∂f_i) · (f_i^clean − f_i^corrupt)
Gradient tells you sensitivity; activation difference tells you magnitude. One forward + one backward pass covers all components. Accuracy is within ~90% of full patching on most circuits.
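A minimal sketch of attribution patching over whole-layer outputs, assuming a scalar metric_fn as in the activation-patching code later in this section; the layer_names argument and the caching scheme are illustrative, not any particular library's API:

import torch

def attribution_patching_scores(model, clean_tokens, corrupt_tokens,
                                layer_names, metric_fn):
    """Approximate each layer's causal effect with one forward + one backward pass.

    attr(layer) ≈ sum of ∂M/∂(corrupt activation) · (clean − corrupt activation).
    Assumes the clean and corrupt prompts are token-aligned.
    """
    modules = dict(model.named_modules())

    # 1) Clean run: cache activations, no gradients needed.
    clean_cache, handles = {}, []
    def save_hook(name):
        def fn(module, inp, out):
            clean_cache[name] = (out[0] if isinstance(out, tuple) else out).detach()
        return fn
    for name in layer_names:
        handles.append(modules[name].register_forward_hook(save_hook(name)))
    with torch.no_grad():
        model(clean_tokens)
    for h in handles:
        h.remove()

    # 2) Corrupted run: keep activations in the autograd graph.
    corrupt_cache, handles = {}, []
    def keep_hook(name):
        def fn(module, inp, out):
            corrupt_cache[name] = out[0] if isinstance(out, tuple) else out
        return fn
    for name in layer_names:
        handles.append(modules[name].register_forward_hook(keep_hook(name)))
    metric = metric_fn(model(corrupt_tokens))
    grads = torch.autograd.grad(metric, [corrupt_cache[n] for n in layer_names])
    for h in handles:
        h.remove()

    # 3) First-order attribution: gradient × activation difference, summed per layer.
    return {name: (g * (clean_cache[name] - corrupt_cache[name])).sum().item()
            for name, g in zip(layer_names, grads)}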
Method 3: Circuit Tracing (SAE + attribution combined)
Replace MLP and attention outputs with their SAE decompositions. Now run attribution patching over the features instead of raw neurons. This gives a directed graph where nodes are named features and edges are causal attributions.
Prune edges below a threshold to get the sparse subgraph — the “circuit”. Each node is interpretable (SAE feature), each edge is a measured causal strength.
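As a toy illustration (my own sketch, not the paper's implementation), the pruning step is just a threshold over per-edge attribution scores:

def prune_to_circuit(edge_attributions: dict, threshold: float = 0.01) -> dict:
    """Keep only edges whose absolute attribution clears the threshold.

    edge_attributions maps (source_feature, target_feature) -> attribution score;
    the surviving edges form the sparse 'circuit' subgraph.
    """
    return {edge: w for edge, w in edge_attributions.items() if abs(w) >= threshold}

# Hypothetical feature-to-feature scores from attribution patching:
edges = {("Dallas-city", "Texas-state"): 0.61,
         ("Texas-state", "capital:Austin"): 0.83,
         ("say-a-capital", "capital:Austin"): 0.44,
         ("comma-token", "capital:Austin"): 0.002}
circuit = prune_to_circuit(edges, threshold=0.01)   # drops the near-zero edge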
Cross-Layer Transcoders (CLTs) — the 2025 upgrade
Standard SAEs are trained per-layer. Ameisen et al. (2025) instead train Cross-Layer Transcoders: each CLT reads from the residual stream at its own layer but outputs contributions to all subsequent MLP layers. This lets a single feature represent a concept that persists and propagates across depth, rather than needing separate features at each layer.
The edge weight from source feature s to target feature t is:

A_{s→t} ≈ a_s · Σ_paths d_sᵀ J e_t

where a_s is the source feature's activation, J is the Jacobian through attention and residual connections, and the sum runs over the decoder-to-encoder paths (source decoder direction d_s into target encoder direction e_t). The loss uses JumpReLU (not standard ReLU) plus a tanh-based sparsity penalty to reduce dead features.
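A structural sketch of the cross-layer idea (a simplification with plain ReLU, not the Ameisen et al. implementation): one encoder reads the residual stream at its layer, and separate decoders write a contribution to every subsequent MLP layer's output.

import torch
import torch.nn as nn

class CrossLayerTranscoderSketch(nn.Module):
    """Reads the residual stream at `read_layer`; predicts all later MLP outputs."""
    def __init__(self, d_model: int, d_features: int, read_layer: int, n_layers: int):
        super().__init__()
        self.read_layer = read_layer
        self.encoder = nn.Linear(d_model, d_features)
        # One decoder per downstream layer this feature bank writes to.
        self.decoders = nn.ModuleDict({
            str(layer): nn.Linear(d_features, d_model, bias=False)
            for layer in range(read_layer, n_layers)
        })
        self.act = nn.ReLU()   # the real CLT uses JumpReLU plus a tanh sparsity penalty

    def forward(self, resid: torch.Tensor) -> dict:
        # resid: [B, seq, d_model] residual stream at read_layer
        f = self.act(self.encoder(resid))          # sparse features [B, seq, d_features]
        # The same feature activations contribute to every subsequent MLP layer.
        return {int(layer): dec(f) for layer, dec in self.decoders.items()}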
PyTorch: Basic Activation Patching
import torch
from contextlib import contextmanager
@contextmanager
def patch_activation(model, layer_name: str, patch_value: torch.Tensor):
"""Context manager to swap one layer's output mid-forward-pass."""
hooks = []
def hook_fn(module, input, output):
return patch_value # replace with clean-run activation
handle = dict(model.named_modules())[layer_name].register_forward_hook(hook_fn)
hooks.append(handle)
try:
yield
finally:
for h in hooks:
h.remove()
def activation_patching_score(model, clean_tokens, corrupt_tokens, layer_name,
clean_cache, metric_fn):
"""
Measure how much layer_name causally matters for the metric.
metric_fn(logits) -> scalar (e.g., logit diff between two tokens)
"""
# Baseline: corrupted run
with torch.no_grad():
corrupt_logits = model(corrupt_tokens)
baseline = metric_fn(corrupt_logits)
# Patched: corrupted run but swap in the clean activation
clean_act = clean_cache[layer_name]
with torch.no_grad():
with patch_activation(model, layer_name, clean_act):
patched_logits = model(corrupt_tokens)
patched = metric_fn(patched_logits)
    return (patched - baseline).item()  # positive = component helps
PyTorch implementation
# Logit lens: project intermediate residual stream to vocab space
import torch
def logit_lens(model, tokens: torch.Tensor):
"""
At each layer, unembed the residual stream directly to get
a probability distribution over vocabulary — no more processing.
Shows what the model 'thinks' the next token is at each depth.
"""
    unembed = model.lm_head           # nn.Linear(d_model, vocab); weight shape (vocab, d_model)
ln_f = model.transformer.ln_f # final layer norm
residual_stream = []
def save_residual(module, inp, out):
# out[0] is the hidden state after this transformer block
h = out[0] if isinstance(out, tuple) else out
residual_stream.append(h.detach().clone())
hooks = [block.register_forward_hook(save_residual)
for block in model.transformer.h]
with torch.no_grad():
model(tokens)
for h in hooks:
h.remove()
# Project each layer's residual stream through the unembedding
layer_logits = []
for h in residual_stream:
normed = ln_f(h) # apply final norm
logits = normed @ unembed.weight.T # (B, seq, vocab)
layer_logits.append(logits[:, -1, :].softmax(-1)) # last position
    return layer_logits  # list of (B, vocab) — one per layer
Quick check
You need to rank the causal importance of all 10M CLT features in a single experiment. Which method is feasible?
Break It — Feature Steering in Action
The “Golden Gate Claude” experiment (Anthropic, 2024) demonstrated that SAE features are causally real. Researchers found a feature that activates for text about the Golden Gate Bridge, then clamped it to 10x its maximum activation during inference.
Normal response (feature = 0)
User: What are some things to do in San Francisco?
“Some great options include visiting Fisherman's Wharf, exploring Golden Gate Park, the Painted Ladies, and the ferry to Alcatraz.”
Steered response (feature clamped 10x)
User: What are some things to do in San Francisco?
“Oh, the Golden Gate Bridge, obviously! And near the Golden Gate Bridge you'll find... the Golden Gate Bridge visitor center, Golden Gate Bridge overlooks, and the Golden Gate Bridge gift shop.”
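Mechanically, the same kind of intervention can be reproduced on an open-weights model with a trained SAE and a forward hook. A hedged sketch: the layer index, feature index, and clamp value below are placeholders, and it reuses the SparseAutoencoder class from earlier in this module, not Anthropic's internal tooling.

import torch

def make_clamp_hook(sae, feature_idx: int, clamp_value: float):
    """Forward hook that pins one SAE feature in a block's residual-stream output."""
    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output
        x_hat, f = sae(resid)                       # decompose into sparse features
        f = f.clone()
        f[..., feature_idx] = clamp_value           # e.g. 10x the feature's observed max
        steered = sae.W_dec(f) + (resid - x_hat)    # re-decode, keep the SAE's reconstruction error
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook

# Hypothetical usage on one transformer block of an open model:
# handle = model.transformer.h[20].register_forward_hook(
#     make_clamp_hook(sae, feature_idx=1234, clamp_value=10 * observed_max))
# ...generate text, then handle.remove()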
Quick check
The Golden Gate Bridge steering works by clamping one feature. What happens if you clamp 10 unrelated features simultaneously at 10x?
Real Numbers
| Finding | Source | Details |
|---|---|---|
| SAEs with 1M, 4M, and 34M features | Scaling Monosemanticity (2024) | Claude 3 Sonnet, middle layers; the 1M-feature run maps d_model 4,096 → d_sae 1,048,576; the largest dictionary has ~34M features |
| ~1B tokens for SAE training | Bricken et al. (2023) | Enough to cover rare features; fewer tokens leave features underspecialized |
| 10M CLT features, L0 ≈ 88 | Circuit Tracing methods (2025) | Largest Cross-Layer Transcoder has 10M features total; mean L0 sparsity is 88 active features per token across the corpus |
| Completeness 0.80, replacement 0.61 | Circuit Tracing methods (2025) | 0.80 = fraction of important inputs explained by the graph; 0.61 = fraction of end-to-end computation explained by CLT features. Reconstruction error ~11.5% normalized mean. |
| Spearman ρ = 0.72 perturbation agreement | Circuit Tracing methods (2025) | Perturbation validation: Spearman ρ = 0.72 for feature-to-feature influence; cosine similarity ~0.80 for predicted vs. observed feature activations |
| Multi-step reasoning chains | Biology paper (2025) | Dallas → Texas feature → Austin; swap Texas for California mid-computation and output shifts to Sacramento. Feature, not token, drives the answer. |
| Cross-lingual feature sharing | Biology paper (2025) | Concepts like “banana” and “wedding” converge to shared features across English, French, Chinese — concepts exist before language. Middle layers are most multilingual; English has “mechanistic privilege” as the default language. |
| Advance planning in poetry | Biology paper (2025) | In ~50% of rhyming poems, the model activates candidate end-of-line rhyme words before writing the line that sets up the rhyme. Steering those features to a different word changes the chosen rhyme 70% of the time. |
Quick check
Graph completeness is 0.80 and replacement score is 0.61. Why is there a gap between these two numbers?
Hands-On Tooling
Mechanistic interpretability is unusually accessible for a frontier research area. Three tools let you start exploring today:
- Neuronpedia — browse 50M+ SAE features across GPT-2, Gemma, Llama, and DeepSeek-R1. Search by concept, inspect top activations, and steer model behavior interactively in your browser.
- TransformerLens — Python library for hooking into any layer of a transformer, extracting activations, and running activation patching experiments. The standard tool for mech interp research, commonly used to reproduce classic results like induction heads.
- ARENA Chapter 1 — structured coding exercises covering induction heads (which detect and continue repeated patterns in context), superposition, SAE training, and circuit analysis. The closest thing to a university course in mech interp.
Quick check
You use TransformerLens on a model trained on 500M tokens. Induction head experiments show no clear induction behavior. Most likely explanation?
Key Takeaways
What to remember for interviews
1. Circuit tracing = SAE decomposition + attribution patching. Name the features with SAEs, measure causality with gradients, prune to a sparse graph. The 2025 CLT approach uses 10M features with 88 active per token (L0 sparsity), achieving a graph completeness score of 0.80.
2. SAEs work by learning an overcomplete basis (32x–256x wider) with an L1 sparsity penalty. CLTs upgrade this: features read from one layer but write to all subsequent MLP layers, so one feature can represent a concept that persists across depth.
3. Attribution patching is activation patching's faster cousin: attr(f_i) ≈ (∂M/∂f_i) · (f_i^clean − f_i^corrupt). One forward + one backward pass covers all features instead of O(N) passes.
4. Feature steering causally confirms interpretations. Dallas → Texas → Austin circuit: swap Texas for California mid-computation, output becomes Sacramento. Poetry planning: model activates rhyme-word candidates ~50% of the time before writing the setup line; steering succeeds 70% of the time.
5. Current limits: replacement score is 0.61 (CLT features explain ~61% of end-to-end computation); graphs succeed on only ~25% of attempted prompts; attention decomposition is harder than MLP; human labeling introduces its own errors.
Recap quiz
Mechanistic Interpretability recap
An SAE trained with expansion factor 4x achieves 99% reconstruction but dense features. What should you change first?
The CLT replacement score is 0.61 on Claude 3.5 Haiku. What does this imply for using attribution graphs in production safety work?
Researchers steer a “California” feature mid-computation on the prompt about the capital of the state containing Dallas. The model outputs Sacramento. What does this confirm?
Attribution patching scores all N components in O(1) passes. Activation patching scores them in O(N) passes. When would you still prefer activation patching?
Scaling Monosemanticity extracted ~34M features from Claude 3 Sonnet using 256x expansion. Why not use 1000x to extract even more features?
Induction heads emerge around 2B training tokens, with loss on repeated random sequences dropping from ~2.3 to ~0.1. What makes this a phase change rather than gradual learning?
The Biology paper finds ~50% advance planning in rhyming poems and 70% steering success. Does this prove the model is “consciously planning”?
Further Reading
- Circuit Tracing: Revealing Computational Graphs in Language Models — Ameisen et al. 2025 — combining SAEs with attribution patching to trace full computational circuits in Claude 3.5 Haiku
- On the Biology of a Large Language Model — Lindsey et al. 2025 — probing Claude 3.5 Haiku's internal mechanisms: multi-step reasoning, planning, and multilingual feature sharing
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet — Templeton et al. 2024 — dictionary learning at scale finds ~34M features in a production frontier model
- Towards Monosemanticity: Decomposing Language Models with Dictionary Learning — Bricken et al. 2023 — the first successful SAE decomposition of a one-layer transformer; established the field
- Toy Models of Superposition — Elhage et al. 2022 — controlled experiments showing how and why neural networks encode more features than dimensions
- When Models Manipulate Manifolds — Gurnee et al. 2025 — studying how models use linebreaks and whitespace as geometric pivots in activation space
- Chris Olah — Neural Networks, Manifolds, and Topology — Olah 2014 — the foundational visual intuition for how neural networks transform data through manifold operations
- 3Blue1Brown — How might LLMs store facts (Chapter 7) — Grant Sanderson 2024 — visual walkthrough of how MLP layers in transformers store and retrieve facts, with connections to superposition and sparse autoencoders.
- Neel Nanda — How to Become a Mechanistic Interpretability Researcher — Nanda 2023 — comprehensive guide to getting started in mech interp research, with recommended papers, exercises, and learning path.
- Neuronpedia — Interactive SAE Feature Explorer — Open-source platform for exploring 50M+ SAE features across GPT-2, Gemma, Llama, and more — search, visualize activations, and steer model behavior interactively.
- ARENA — Mechanistic Interpretability Exercises — Hands-on coding tutorials for transformer interpretability — TransformerLens, induction heads, superposition, SAEs, and circuit analysis.
Interview Questions
- Walk through training a sparse autoencoder on transformer activations. What are the key hyperparameters? ★★☆
- How does attribution patching differ from activation patching? When would you use each? ★★★
- What is the L1/reconstruction trade-off in SAE training and how do you pick the right sparsity coefficient? ★★☆
- Describe how you would find the circuit responsible for a specific model behavior (e.g., gendered pronoun resolution). ★★★
- What are the limitations of current mechanistic interpretability methods? What can't they tell us? ★★☆