🔬 Mechanistic Interpretability
Anthropic found a single Claude feature that fires on mentions of the Golden Gate Bridge, and clamping it causally steers Claude to bring up the bridge in nearly every response.
Anthropic traced parts of Claude's internal computation using attribution graphs. They followed a multi-step reasoning example — asking about the capital of Texas — and watched as intermediate features activated sequentially, ultimately writing the answer. The method captures a fraction of the total computation, not a complete end-to-end trace.
This module is a hands-on companion to the Interpretability module. That module explains what superposition and SAEs are. This one shows how to actually do mechanistic interpretability research — training SAEs, running attribution patching, tracing circuits, and steering model behavior.
SAE Architecture
What you're seeing
A Sparse Autoencoder trained on a layer's residual stream activations. The encoder projects from d_model (8, simplified) into a much wider feature space d_sae (24 here; real SAEs use 32–256× expansion). An L1 penalty forces most feature activations to zero — only a handful “light up” for any given input. The decoder reconstructs the original activation from those sparse features; each decoder column is one learned “feature direction”.
What to notice
Three green features are active (Capitals, Math, Code) out of 24 — that's the sparsity in action. The loss combines reconstruction fidelity (||x − x̂||²) and a sparsity penalty (λ||f||₁). DeepMind's Gemma Scope SAEs achieve >99% reconstruction with most features firing <1% of the time.
SAE Training: How It Actually Works
The theory is in the Interpretability module. Here's what you actually build. An SAE has two parts: an encoder that maps model activations to a much wider space, and a decoder that maps back. The decoder column directions are the features.
SAE architecture
f = ReLU(W_enc(x − b_dec)), x̂ = W_dec f + b_dec, with d_sae = expansion × d_model (much wider than the residual stream). Most entries of f are zero (L1 penalty). Each column of W_dec is one feature direction.
Real hyperparameters from Anthropic papers
| Hyperparameter | Typical value | What it controls |
|---|---|---|
| expansion_factor | 32×–256× | How many more features than model dimensions; higher = more capacity but slower |
| λ (L1 coefficient) | 1e-3 – 1e-1 | Sparsity vs. reconstruction trade-off; tune for an L0 of 20–100 features per token (see the L0 sketch below) |
| training_tokens | ~1B+ | Sufficient coverage for features to specialize; fewer tokens → more dead features |
| learning_rate | 1e-4 – 5e-5 | Adam optimizer; cosine decay |
| target_layer | Middle layers | Middle layers have richest representations; Anthropic used residual stream post-MLP |
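To check that λ actually lands in the 20–100 L0 target from the table, count non-zero encoder outputs per token on held-out activations. A minimal sketch; the commented usage assumes the SparseAutoencoder class defined in the training loop below:

import torch

def mean_l0(f: torch.Tensor, eps: float = 1e-8) -> float:
    """Average number of active (non-zero) features per token.

    f: [n_tokens, d_sae] feature activations from the SAE encoder.
    If L0 is far above the 20-100 target, raise the L1 coefficient;
    if it is far below (or many features are dead), lower it.
    """
    return (f.abs() > eps).float().sum(dim=-1).mean().item()

# Hypothetical usage with the SparseAutoencoder defined below:
# _, f = sae(heldout_activations)            # f: [B, d_sae]
# print(f"L0 = {mean_l0(f):.1f} active features per token")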
PyTorch: Minimal SAE Training Loop
import torch
import torch.nn as nn
from torch.optim import Adam
class SparseAutoencoder(nn.Module):
def __init__(self, d_model: int, expansion: int = 64):
super().__init__()
d_sae = d_model * expansion
self.W_enc = nn.Linear(d_model, d_sae, bias=True)
self.W_dec = nn.Linear(d_sae, d_model, bias=True)
self.relu = nn.ReLU()
# Normalize decoder columns to unit norm
self._normalize_decoder()
def _normalize_decoder(self):
with torch.no_grad():
norms = self.W_dec.weight.norm(dim=0, keepdim=True).clamp(min=1e-8)
self.W_dec.weight.div_(norms)
def forward(self, x: torch.Tensor):
# Center around decoder bias before encoding
x_cent = x - self.W_dec.bias
f = self.relu(self.W_enc(x_cent)) # sparse features
x_hat = self.W_dec(f) # reconstruction
return x_hat, f
def sae_loss(x, x_hat, f, lam: float = 5e-3):
recon = (x - x_hat).pow(2).mean() # MSE reconstruction
sparsity = f.abs().mean() # L1 on features
return recon + lam * sparsity, recon, sparsity
# Training loop sketch
sae = SparseAutoencoder(d_model=4096, expansion=64)
opt = Adam(sae.parameters(), lr=2e-4)
for activations in dataloader: # activations: [B, d_model]
opt.zero_grad()
x_hat, f = sae(activations)
loss, recon, sparse = sae_loss(activations, x_hat, f, lam=5e-3)
loss.backward()
opt.step()
    sae._normalize_decoder()  # keep decoder cols unit norm
Quick check
You observe 30% dead features after training. Lowering the L1 coefficient hasn't helped. What is the most likely root cause?
The Intuition: MRI-ing a Language Model
Imagine you could MRI a language model's brain while it thinks. Not just watch inputs and outputs, but trace every intermediate step — which concepts activated, in what order, and how they caused the final answer.
That's what Anthropic did with Claude 3.5 Haiku in their 2025 Biology paper. The experiment:
Circuit trace: “What is the capital of Texas?” (Anthropic, 2025)
Token "Texas" is read
The residual stream at the "Texas" token position starts activating features.
"Texas-state" feature fires
An SAE feature learned to recognize Texas as a US state activates strongly. It encodes geographic context.
"Austin" (capital) feature activates
The state feature propagates, causing the capital-city feature for Austin to fire — the model is building the answer.
"Austin" written to output
The Austin feature writes its direction to the residual stream at the answer position, and the model decodes "Austin".
Steering validation: two-hop geography
Anthropic asked: “What is the capital of the state containing Dallas?” The circuit activated a Texas feature which then output Austin. To confirm causality, researchers intervened mid-computation — swapping the Texas feature activation for a California feature activation. The model output changed to Sacramento. The feature, not the input token, drove the answer.
This is the mechanistic interpretability litmus test: if swapping feature F for feature G at the right layer changes the output from answer(F) to answer(G), you have found the causal bottleneck. Average circuits like this span 2.3 hops (Ameisen et al. 2025).
Three methods form the practical toolkit:
| Method | What it does | Cost |
|---|---|---|
| Sparse Autoencoder (SAE) | Decomposes activations into named, monosemantic features | Train once, reuse |
| Activation Patching | Swap activations between runs to measure causal effect | O(N) forward passes |
| Attribution Patching | Gradient approximation of causal effect — fast circuit tracing | 1 forward + 1 backward |
Why does the SAE decoder matrix represent interpretable features?
Attribution Patching & Circuit Tracing
Method 1: Activation Patching (exact, slow)
Run the model on a clean input and a corrupted input. For each component c_i, swap its activation from the clean run into the corrupted run and measure the change in the output metric M:

ΔM_i = M(corrupted run with clean c_i patched in) − M(corrupted run)
Large effect = component causally matters. Requires one forward pass per component — O(N) total.
Method 2: Attribution Patching (approximate, fast)
First-order Taylor approximation. The attribution of feature f_i on the output metric M is:

attr(f_i) ≈ (∂M/∂f_i) · (f_i^clean − f_i^corrupt)
Gradient tells you sensitivity; activation difference tells you magnitude. One forward + one backward pass covers all components. Accuracy is within ~90% of full patching on most circuits.
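A minimal sketch of attribution patching over whole-layer outputs, assuming a scalar metric_fn as in the activation-patching code later in this section; the layer_names argument and the caching scheme are illustrative, not any particular library's API:

import torch

def attribution_patching_scores(model, clean_tokens, corrupt_tokens,
                                layer_names, metric_fn):
    """Approximate each layer's causal effect with one forward + one backward pass.

    attr(layer) ≈ sum of ∂M/∂(corrupt activation) · (clean − corrupt activation).
    Assumes the clean and corrupt prompts are token-aligned.
    """
    modules = dict(model.named_modules())

    # 1) Clean run: cache activations, no gradients needed.
    clean_cache, handles = {}, []
    def save_hook(name):
        def fn(module, inp, out):
            clean_cache[name] = (out[0] if isinstance(out, tuple) else out).detach()
        return fn
    for name in layer_names:
        handles.append(modules[name].register_forward_hook(save_hook(name)))
    with torch.no_grad():
        model(clean_tokens)
    for h in handles:
        h.remove()

    # 2) Corrupted run: keep activations in the autograd graph.
    corrupt_cache, handles = {}, []
    def keep_hook(name):
        def fn(module, inp, out):
            corrupt_cache[name] = out[0] if isinstance(out, tuple) else out
        return fn
    for name in layer_names:
        handles.append(modules[name].register_forward_hook(keep_hook(name)))
    metric = metric_fn(model(corrupt_tokens))
    grads = torch.autograd.grad(metric, [corrupt_cache[n] for n in layer_names])
    for h in handles:
        h.remove()

    # 3) First-order attribution: gradient × activation difference, summed per layer.
    return {name: (g * (clean_cache[name] - corrupt_cache[name])).sum().item()
            for name, g in zip(layer_names, grads)}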
Method 3: Circuit Tracing (SAE + attribution combined)
Replace MLP and attention outputs with their SAE decompositions. Now run attribution patching over the features instead of raw neurons. This gives a directed graph where nodes are named features and edges are causal attributions.
Prune edges below a threshold to get the sparse subgraph — the “circuit”. Each node is interpretable (SAE feature), each edge is a measured causal strength.
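As a toy illustration (my own sketch, not the paper's implementation), the pruning step is just a threshold over per-edge attribution scores:

def prune_to_circuit(edge_attributions: dict, threshold: float = 0.01) -> dict:
    """Keep only edges whose absolute attribution clears the threshold.

    edge_attributions maps (source_feature, target_feature) -> attribution score;
    the surviving edges form the sparse 'circuit' subgraph.
    """
    return {edge: w for edge, w in edge_attributions.items() if abs(w) >= threshold}

# Hypothetical feature-to-feature scores from attribution patching:
edges = {("Dallas-city", "Texas-state"): 0.61,
         ("Texas-state", "capital:Austin"): 0.83,
         ("say-a-capital", "capital:Austin"): 0.44,
         ("comma-token", "capital:Austin"): 0.002}
circuit = prune_to_circuit(edges, threshold=0.01)   # drops the near-zero edge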
Cross-Layer Transcoders (CLTs) — the 2025 upgrade
Standard SAEs are trained per-layer. Ameisen et al. (2025) instead train Cross-Layer Transcoders: each CLT reads from the residual stream at its own layer but outputs contributions to all subsequent MLP layers. This lets a single feature represent a concept that persists and propagates across depth, rather than needing separate features at each layer.
The edge weight from source feature s to target feature t is:

A_{s→t} ≈ a_s · Σ_paths d_sᵀ J e_t

where a_s is the source feature's activation, J is the Jacobian through attention and residual connections, and the sum runs over the decoder-to-encoder paths (source decoder direction d_s into target encoder direction e_t). The loss uses JumpReLU (not standard ReLU) plus a tanh-based sparsity penalty to reduce dead features.
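A structural sketch of the cross-layer idea (a simplification with plain ReLU, not the Ameisen et al. implementation): one encoder reads the residual stream at its layer, and separate decoders write a contribution to every subsequent MLP layer's output.

import torch
import torch.nn as nn

class CrossLayerTranscoderSketch(nn.Module):
    """Reads the residual stream at `read_layer`; predicts all later MLP outputs."""
    def __init__(self, d_model: int, d_features: int, read_layer: int, n_layers: int):
        super().__init__()
        self.read_layer = read_layer
        self.encoder = nn.Linear(d_model, d_features)
        # One decoder per downstream layer this feature bank writes to.
        self.decoders = nn.ModuleDict({
            str(layer): nn.Linear(d_features, d_model, bias=False)
            for layer in range(read_layer, n_layers)
        })
        self.act = nn.ReLU()   # the real CLT uses JumpReLU plus a tanh sparsity penalty

    def forward(self, resid: torch.Tensor) -> dict:
        # resid: [B, seq, d_model] residual stream at read_layer
        f = self.act(self.encoder(resid))          # sparse features [B, seq, d_features]
        # The same feature activations contribute to every subsequent MLP layer.
        return {int(layer): dec(f) for layer, dec in self.decoders.items()}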
PyTorch: Basic Activation Patching
import torch
from contextlib import contextmanager
@contextmanager
def patch_activation(model, layer_name: str, patch_value: torch.Tensor):
"""Context manager to swap one layer's output mid-forward-pass."""
hooks = []
def hook_fn(module, input, output):
return patch_value # replace with clean-run activation
handle = dict(model.named_modules())[layer_name].register_forward_hook(hook_fn)
hooks.append(handle)
try:
yield
finally:
for h in hooks:
h.remove()
def activation_patching_score(model, clean_tokens, corrupt_tokens, layer_name,
clean_cache, metric_fn):
"""
Measure how much layer_name causally matters for the metric.
metric_fn(logits) -> scalar (e.g., logit diff between two tokens)
"""
# Baseline: corrupted run
with torch.no_grad():
corrupt_logits = model(corrupt_tokens)
baseline = metric_fn(corrupt_logits)
# Patched: corrupted run but swap in the clean activation
clean_act = clean_cache[layer_name]
with torch.no_grad():
with patch_activation(model, layer_name, clean_act):
patched_logits = model(corrupt_tokens)
patched = metric_fn(patched_logits)
    return (patched - baseline).item()  # positive = component helps
PyTorch implementation
# Logit lens: project intermediate residual stream to vocab space
import torch
def logit_lens(model, tokens: torch.Tensor):
"""
At each layer, unembed the residual stream directly to get
a probability distribution over vocabulary — no more processing.
Shows what the model 'thinks' the next token is at each depth.
"""
    unembed = model.lm_head           # nn.Linear(d_model, vocab); weight shape (vocab, d_model)
ln_f = model.transformer.ln_f # final layer norm
residual_stream = []
def save_residual(module, inp, out):
# out[0] is the hidden state after this transformer block
h = out[0] if isinstance(out, tuple) else out
residual_stream.append(h.detach().clone())
hooks = [block.register_forward_hook(save_residual)
for block in model.transformer.h]
with torch.no_grad():
model(tokens)
for h in hooks:
h.remove()
# Project each layer's residual stream through the unembedding
layer_logits = []
for h in residual_stream:
normed = ln_f(h) # apply final norm
logits = normed @ unembed.weight.T # (B, seq, vocab)
layer_logits.append(logits[:, -1, :].softmax(-1)) # last position
    return layer_logits  # list of (B, vocab) — one per layer
Quick check
You need to rank the causal importance of all 10M CLT features in a single experiment. Which method is feasible?
Break It — Feature Steering in Action
The “Golden Gate Claude” experiment (Anthropic, 2024) demonstrated that SAE features are causally real. Researchers found a feature that activates for text about the Golden Gate Bridge, then clamped it to 10x its maximum activation during inference.
Normal response (feature = 0)
User: What are some things to do in San Francisco?
“Some great options include visiting Fisherman's Wharf, exploring Golden Gate Park, the Painted Ladies, and the ferry to Alcatraz.”
Steered response (feature clamped 10x)
User: What are some things to do in San Francisco?
“Oh, the Golden Gate Bridge, obviously! And near the Golden Gate Bridge you'll find... the Golden Gate Bridge visitor center, Golden Gate Bridge overlooks, and the Golden Gate Bridge gift shop.”
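Mechanically, the same kind of intervention can be reproduced on an open-weights model with a trained SAE and a forward hook. A hedged sketch: the layer index, feature index, and clamp value below are placeholders, and it reuses the SparseAutoencoder class from earlier in this module, not Anthropic's internal tooling.

import torch

def make_clamp_hook(sae, feature_idx: int, clamp_value: float):
    """Forward hook that pins one SAE feature in a block's residual-stream output."""
    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output
        x_hat, f = sae(resid)                       # decompose into sparse features
        f = f.clone()
        f[..., feature_idx] = clamp_value           # e.g. 10x the feature's observed max
        steered = sae.W_dec(f) + (resid - x_hat)    # re-decode, keep the SAE's reconstruction error
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook

# Hypothetical usage on one transformer block of an open model:
# handle = model.transformer.h[20].register_forward_hook(
#     make_clamp_hook(sae, feature_idx=1234, clamp_value=10 * observed_max))
# ...generate text, then handle.remove()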
Quick check
The Golden Gate Bridge steering works by clamping one feature. What happens if you clamp 10 unrelated features simultaneously at 10x?
Real Numbers
| Finding | Source | Details |
|---|---|---|
| SAEs with 1M, 4M, and 34M features | Scaling Monosemanticity (2024) | Claude 3 Sonnet, middle layers; the 1M-feature run maps d_model 4,096 → d_sae 1,048,576; the largest dictionary has ~34M features |
| ~1B tokens for SAE training | Bricken et al. (2023) | Enough to cover rare features; fewer tokens leave features underspecialized |
| 10M CLT features, L0 ≈ 88 | Circuit Tracing methods (2025) | Largest Cross-Layer Transcoder has 10M features total; mean L0 sparsity is 88 active features per token across the corpus |
| Completeness 0.80, replacement 0.61 | Circuit Tracing methods (2025) | 0.80 = fraction of important inputs explained by the graph; 0.61 = fraction of end-to-end computation explained by CLT features. Reconstruction error ~11.5% normalized mean. |
| Spearman ρ = 0.72 perturbation agreement | Circuit Tracing methods (2025) | Perturbation validation: Spearman ρ = 0.72 for feature-to-feature influence; cosine similarity ~0.80 for predicted vs. observed feature activations |
| Multi-step reasoning chains | Biology paper (2025) | Dallas → Texas feature → Austin; swap Texas for California mid-computation and output shifts to Sacramento. Feature, not token, drives the answer. |
| Cross-lingual feature sharing | Biology paper (2025) | Concepts like “banana” and “wedding” converge to shared features across English, French, Chinese — concepts exist before language. Middle layers are most multilingual; English has “mechanistic privilege” as the default language. |
| Advance planning in poetry | Biology paper (2025) | In ~50% of rhyming poems, the model activates candidate end-of-line rhyme words before writing the line that sets up the rhyme. Steering those features to a different word changes the chosen rhyme 70% of the time. |
Quick check
Graph completeness is 0.80 and replacement score is 0.61. Why is there a gap between these two numbers?
Hands-On Tooling
Mechanistic interpretability is unusually accessible for a frontier research area. Three tools let you start exploring today:
- Neuronpedia — browse 50M+ SAE features across GPT-2, Gemma, Llama, and DeepSeek-R1. Search by concept, inspect top activations, and steer model behavior interactively in your browser.
- TransformerLens — Python library for hooking into any layer of a transformer, extracting activations, and running activation patching experiments. The standard tool for mech interp research, commonly used to reproduce classic results like induction heads.
- ARENA Chapter 1 — structured coding exercises covering induction heads (which detect and continue repeated patterns in context), superposition, SAE training, and circuit analysis. The closest thing to a university course in mech interp.
Quick check
You use TransformerLens on a model trained on 500M tokens. Induction head experiments show no clear induction behavior. Most likely explanation?
Key Takeaways
What to remember for interviews
1. Circuit tracing = SAE decomposition + attribution patching. Name the features with SAEs, measure causality with gradients, prune to a sparse graph. The 2025 CLT approach uses 10M features with 88 active per token (L0 sparsity), achieving a graph completeness score of 0.80.
2. SAEs work by learning an overcomplete basis (32x–256x wider) with an L1 sparsity penalty. CLTs upgrade this: features read from one layer but write to all subsequent MLP layers, so one feature can represent a concept that persists across depth.
3. Attribution patching is activation patching's faster cousin: attr(f_i) ≈ (∂M/∂f_i) · (f_i^clean − f_i^corrupt). One forward + one backward pass covers all features instead of O(N) passes.
4. Feature steering causally confirms interpretations. Dallas → Texas → Austin circuit: swap Texas for California mid-computation, output becomes Sacramento. Poetry planning: model activates rhyme-word candidates ~50% of the time before writing the setup line; steering succeeds 70% of the time.
5. Current limits: replacement score is 0.61 (CLT features explain ~61% of end-to-end computation); graphs succeed on only ~25% of attempted prompts; attention decomposition is harder than MLP; human labeling introduces its own errors.
Recap quiz
Mechanistic Interpretability recap
An SAE trained with expansion factor 4x achieves 99% reconstruction but dense features. What should you change first?
The CLT replacement score is 0.61 on Claude 3.5 Haiku. What does this imply for using attribution graphs in production safety work?
Researchers steer a “California” feature mid-computation on the prompt about the capital of the state containing Dallas. The model outputs Sacramento. What does this confirm?
Attribution patching scores all N components in O(1) passes. Activation patching scores them in O(N) passes. When would you still prefer activation patching?
Scaling Monosemanticity extracted ~34M features from Claude 3 Sonnet using 256x expansion. Why not use 1000x to extract even more features?
Induction heads emerge around 2B training tokens, with loss on repeated random sequences dropping from ~2.3 to ~0.1. What makes this a phase change rather than gradual learning?
The Biology paper finds ~50% advance planning in rhyming poems and 70% steering success. Does this prove the model is “consciously planning”?
Further Reading
- Circuit Tracing: Revealing Computational Graphs in Language Models — Ameisen et al. 2025 — combining SAEs with attribution patching to trace full computational circuits in Claude 3.5 Haiku
- On the Biology of a Large Language Model — Lindsey et al. 2025 — probing Claude 3.5 Haiku's internal mechanisms: multi-step reasoning, planning, and multilingual feature sharing
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet — Templeton et al. 2024 — dictionary learning at scale finds ~34M features in a production frontier model
- Towards Monosemanticity: Decomposing Language Models with Dictionary Learning — Bricken et al. 2023 — the first successful SAE decomposition of a one-layer transformer; established the field
- Toy Models of Superposition — Elhage et al. 2022 — controlled experiments showing how and why neural networks encode more features than dimensions
- When Models Manipulate Manifolds — Gurnee et al. 2025 — studying how models use linebreaks and whitespace as geometric pivots in activation space
- Chris Olah — Neural Networks, Manifolds, and Topology — Olah 2014 — the foundational visual intuition for how neural networks transform data through manifold operations
- 3Blue1Brown — How might LLMs store facts (Chapter 7) — Grant Sanderson 2024 — visual walkthrough of how MLP layers in transformers store and retrieve facts, with connections to superposition and sparse autoencoders.
- Neel Nanda — How to Become a Mechanistic Interpretability Researcher — Nanda 2023 — comprehensive guide to getting started in mech interp research, with recommended papers, exercises, and learning path.
- Neuronpedia — Interactive SAE Feature Explorer — Open-source platform for exploring 50M+ SAE features across GPT-2, Gemma, Llama, and more — search, visualize activations, and steer model behavior interactively.
- ARENA — Mechanistic Interpretability Exercises — Hands-on coding tutorials for transformer interpretability — TransformerLens, induction heads, superposition, SAEs, and circuit analysis.
Interview Questions
- Walk through training a sparse autoencoder on transformer activations. What are the key hyperparameters? ★★☆
- How does attribution patching differ from activation patching? When would you use each? ★★★
- What is the L1/reconstruction trade-off in SAE training and how do you pick the right sparsity coefficient? ★★☆
- Describe how you would find the circuit responsible for a specific model behavior (e.g., gendered pronoun resolution). ★★★
- What are the limitations of current mechanistic interpretability methods? What can't they tell us? ★★☆