🔬 Interpretability
Anthropic found a 'Golden Gate Bridge' feature inside Claude
We can build powerful transformers, but can we understand what they compute? Mechanistic interpretability reverse-engineers the internal algorithms of neural networks — finding human-interpretable features, circuits, and computations inside models that were learned from data alone.
Recent breakthroughs — sparse autoencoders, activation patching, and circuit tracing — are turning "black box" models into something closer to understandable programs.
Interactive Sandbox
Activation Patching Demo
What you're seeing: A 12-layer transformer is run on a factual prompt ("The Eiffel Tower is in"). We corrupt the residual stream at every layer, then patch in the clean activations one layer at a time and measure how much of the correct answer is recovered.
What to try: Click each layer bar to see which layers are most causally responsible for this factual recall. Notice how recovery peaks in the upper-middle to late layers.
Click a layer bar above to inspect its causal role.
The Intuition
The residual stream is a shared communication bus running through the transformer. Each layer reads from it and writes back additively. Attention heads move information between token positions (copying, routing). MLPs process information at each position, acting as key-value memories that store learned associations.
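The key-value-memory view of an MLP can be made concrete in a few lines. The sketch below is illustrative rather than any real model's weights: it uses random matrices, toy dimensions, and ReLU in place of the GELU most transformers use. Each row of W_in acts as a "key" that the residual vector is matched against; the match strength gates how much of the corresponding W_out column (the "value") gets written back to the stream.

import torch

d_model, d_mlp = 16, 64                     # toy sizes (real models use e.g. 768 / 3072)
W_in = torch.randn(d_mlp, d_model) * 0.1    # each row is a "key" direction
W_out = torch.randn(d_model, d_mlp) * 0.1   # each column is a "value" vector

x = torch.randn(d_model)                    # residual-stream vector at one position

key_scores = torch.relu(W_in @ x)           # how strongly x matches each key
mlp_out = W_out @ key_scores                # weighted sum of value vectors written to the stream

# Same computation written as an explicit key-value sum
manual = sum(key_scores[i] * W_out[:, i] for i in range(d_mlp))
assert torch.allclose(mlp_out, manual, atol=1e-4)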
Superposition is the core challenge: networks represent more features than they have dimensions. A 768-dimensional residual stream might encode thousands of distinct concepts as nearly-orthogonal directions. This works because real-world features are sparse — most are inactive for any given input. But it means individual neurons are polysemantic — a single neuron fires for multiple unrelated concepts, making it impossible to read off what any neuron "means."
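A toy sketch in the spirit of Toy Models of Superposition makes the geometry tangible — the dimensions, feature count, and sparsity level below are arbitrary illustrative choices. Thousands of random directions in a 128-dimensional space are far from exactly orthogonal, yet as long as only a few features are active at once, reading any one of them back out incurs only small interference.

import torch

torch.manual_seed(0)
d, n_features = 128, 2000                      # many more features than dimensions

# Random unit vectors are nearly (but not exactly) orthogonal in high dimensions
dirs = torch.nn.functional.normalize(torch.randn(n_features, d), dim=1)
cos = dirs @ dirs.T
cos.fill_diagonal_(0)
print("max |cos| between distinct features:", cos.abs().max().item())  # well below 1, but not 0

# A sparse input: only k of the 2000 features are "on"
k = 5
active = torch.randperm(n_features)[:k]
x = dirs[active].sum(dim=0)                    # superposed representation of k features

# Reading out with a feature's own direction: signal ~1 plus small interference
print("active feature readout:  ", (dirs[active[0]] @ x).item())
inactive = next(i for i in range(n_features) if i not in set(active.tolist()))
print("inactive feature readout:", (dirs[inactive] @ x).item())        # near 0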
Sparse autoencoders (SAEs) crack superposition by learning an overcomplete basis. They decompose the model's activations into a much larger set of features (e.g., 32x–256x the residual-stream width) with a sparsity constraint: only a handful of features activate for any given input. Each learned feature tends to be monosemantic — representing one coherent concept like "Golden Gate Bridge" or "code has a bug."
Probing classifiers are a complementary, cheaper interpretability tool. The idea: train a simple linear classifier on top of frozen intermediate activations to predict some property (e.g., part-of-speech tag, entity type, sentiment, or whether a statement is true). If the linear probe achieves high accuracy, the representation must linearly encode that property — the information is "in there" in a geometrically accessible way. Hewitt & Manning (2019) showed that syntactic parse-tree distances are linearly encoded in BERT's representations, despite BERT never being trained on parse trees. Probing is useful for hypothesis testing ("does layer 12 encode negation?") but has a key limitation: high probe accuracy only tells you the information exists; it does not tell you whether the model uses it for its predictions. Causal interventions (activation patching) are needed to establish use.
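Here is a minimal probing sketch. It assumes GPT-2 via Hugging Face transformers, an arbitrary choice of layer 8, mean-pooled hidden states, and four made-up sentiment examples — all placeholders; a real probe would use a proper labeled dataset and a held-out split.

import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

# Toy labeled data for the property we want to probe (placeholder examples)
texts  = ["the movie was great", "the movie was awful", "I loved it", "I hated it"]
labels = [1, 0, 1, 0]

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()

layer = 8                                   # which layer's activations to probe (arbitrary)
feats = []
with torch.no_grad():
    for t in texts:
        out = model(**tok(t, return_tensors="pt"), output_hidden_states=True)
        # Mean-pool the chosen layer's hidden states into one vector per text
        feats.append(out.hidden_states[layer].mean(dim=1).squeeze(0))

X = torch.stack(feats).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)   # the probe is just a linear classifier
print("probe accuracy:", probe.score(X, labels))           # trivially high on 4 training examples
# High accuracy shows the property is linearly decodable from this layer;
# it does NOT show the model uses that information for its predictions.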
Quick check
Why does superposition become worse (more features, more interference) as neural networks grow wider and are trained on more data?
Activation Patching Pipeline
What you're seeing: Activation patching tests whether a specific layer's activations causally determine the model's output.
What to try: Follow the three steps — run clean, run corrupted, then patch the clean activation back in. If the output recovers, that layer is causally responsible.
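The sketch below mirrors that three-step loop on GPT-2 with Hugging Face forward hooks. The prompts, the "corrupt the subject" choice, and patching only the final token position (to sidestep length mismatches between the two prompts) are all simplifying assumptions; fuller single-layer versions of the same idea appear later on this page.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

clean   = tok("The Eiffel Tower is in", return_tensors="pt").input_ids
corrupt = tok("The Colosseum is in", return_tensors="pt").input_ids    # corrupted subject
paris   = tok(" Paris", return_tensors="pt").input_ids[0, 0]

def answer_prob(ids, hook_layer=None, patch=None):
    """P(' Paris') at the last position, optionally patching one block's output."""
    handle = None
    if hook_layer is not None:
        def hook(mod, inp, out):
            h = out[0].clone()
            h[:, -1, :] = patch[:, -1, :]      # patch only the final position
            return (h,) + out[1:]
        handle = model.transformer.h[hook_layer].register_forward_hook(hook)
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    if handle is not None:
        handle.remove()
    return logits.softmax(-1)[paris].item()

# Step 1: run clean and cache every block's output
clean_acts, handles = [], []
for blk in model.transformer.h:
    handles.append(blk.register_forward_hook(lambda m, i, o: clean_acts.append(o[0])))
clean_p = answer_prob(clean)
for h in handles:
    h.remove()

# Step 2: run corrupted (baseline); Step 3: patch each layer's clean activation back in
corrupt_p = answer_prob(corrupt)
for layer, act in enumerate(clean_acts):
    patched_p = answer_prob(corrupt, hook_layer=layer, patch=act)
    recovery = (patched_p - corrupt_p) / (clean_p - corrupt_p)
    print(f"layer {layer:2d}  recovery = {recovery:.2f}")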
Why do SAEs use an overcomplete (wider-than-input) hidden layer?
Step-by-Step Derivation
Residual Stream Update
Each layer reads from and writes to the residual stream additively. x^(l) is the residual at layer l; with pre-LN, the attention sublayer writes first, then the MLP:

x_mid   = x^(l)  + Attn(LN_1(x^(l)))
x^(l+1) = x_mid  + MLP(LN_2(x_mid))

In practice each sublayer has its own residual connection and layer norm (pre-LN shown above). The interpretability shorthand often writes x^(l+1) = x^(l) + Attn(x^(l)) + MLP(x^(l)) to emphasize that both components add to the same stream, but Attn and MLP are applied sequentially, not in parallel from the same input.
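A minimal sketch of a pre-LN block in this residual-stream form — toy dimensions, PyTorch's built-in MultiheadAttention, and no masking or dropout are simplifications, but the two additive writes match the update above.

import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """One transformer layer as two additive writes to the residual stream."""
    def __init__(self, d_model=64, n_heads=4, d_mlp=256):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_mlp), nn.GELU(), nn.Linear(d_mlp, d_model))

    def forward(self, x):                       # x: (batch, seq, d_model) residual stream
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out                        # attention writes into the stream
        x = x + self.mlp(self.ln2(x))           # MLP writes into the stream
        return x

x = torch.randn(1, 8, 64)                       # toy stream: batch 1, 8 positions
print(PreLNBlock()(x).shape)                    # torch.Size([1, 8, 64])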
SAE Encoding
The encoder maps a d-dimensional activation x to a sparse, overcomplete feature vector f of size n_features (typically n_features >> d). Center the input around the decoder bias first:

f = ReLU(W_enc (x - b_dec) + b_enc)

ReLU ensures features are non-negative. Most entries of f will be zero due to the sparsity penalty in the loss — only a few features fire per input.
SAE Reconstruction
The decoder reconstructs the original activation from the sparse features. Each column of W_dec is a feature direction in activation space:

x_hat = W_dec f + b_dec
SAE Training Loss
Reconstruction fidelity plus sparsity. The coefficient lambda controls the sparsity-fidelity tradeoff:

L = ||x - x_hat||_2^2 + lambda * ||f||_1
PyTorch: Simple Sparse Autoencoder
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        # n_features >> d_model (overcomplete)
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model, bias=True)
        self.relu = nn.ReLU()

    def forward(self, x):
        # Center input around decoder bias
        x_centered = x - self.decoder.bias
        # Encode to sparse features
        f = self.relu(self.encoder(x_centered))
        # Reconstruct
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, lam=1e-2):
    """Reconstruction + L1 sparsity."""
    recon = (x - x_hat).pow(2).mean()
    sparse = f.abs().mean()
    return recon + lam * sparse

PyTorch: Activation Patching
# Activation patching: is layer L causally important?
clean_acts = {}

def save_hook(module, input, output):
    clean_acts[module] = output.clone()

# Run clean, save activations at layer L
handle = model.layer[L].register_forward_hook(save_hook)
clean_out = model(clean_input)
handle.remove()  # stop caching before the corrupted runs

# Run corrupted, patch in the clean activation at layer L
def patch_hook(module, input, output):
    return clean_acts[module]

handle = model.layer[L].register_forward_hook(patch_hook)
patched_out = model(corrupted_input)
handle.remove()

# Baseline: run corrupted input without patching
corrupt_out = model(corrupted_input)

# If patched_out ≈ clean_out, layer L is causally responsible
recovery = 1 - (patched_out - clean_out).norm() / (corrupt_out - clean_out).norm()

PyTorch implementation
# Activation patching via forward hooks
import torch

def run_with_patch(model, tokens, layer, patch_tensor):
    """Run model but replace one layer's residual-stream output."""
    hooks = []
    def hook(module, inp, out):
        # out is a tuple in many HuggingFace models; patch the hidden state
        hidden = out[0] if isinstance(out, tuple) else out
        hidden = patch_tensor.to(hidden.device)
        return ((hidden,) + out[1:]) if isinstance(out, tuple) else hidden
    hooks.append(layer.register_forward_hook(hook))
    with torch.no_grad():
        logits = model(tokens).logits
    for h in hooks:
        h.remove()
    return logits

def logit_diff(logits, correct_tok, wrong_tok):
    """Metric: log-prob(correct) - log-prob(wrong) at last position."""
    last = logits[:, -1, :]
    return (last[:, correct_tok] - last[:, wrong_tok]).item()

Quick check
In the SAE loss L = ||x - x_hat||^2 + lambda * ||f||_1, if you want features that each fire for a single concept, which loss term does the heavy lifting and what is the risk of over-weighting it?
Break It — See What Happens
Quick check
A safety team runs a probe on layer 12 and finds 91% accuracy predicting “harmful intent” in user prompts. They conclude the model reliably detects harmful intent at layer 12. What is wrong with this conclusion?
Real-World Numbers
| Work | Team | Key Result |
|---|---|---|
| Scaling Monosemanticity | Anthropic (2024) | ~34M interpretable features extracted from Claude 3 Sonnet; found abstract, multilingual, safety-relevant features |
| OpenAI SAE on GPT-4 | OpenAI (2024) | 16M features extracted from GPT-4; demonstrated that SAEs scale to the largest production models |
| Induction Heads | Anthropic (2022) | Heads that emerge abruptly in a training phase change; their arrival coincides with the jump in in-context learning ability |
| Golden Gate Claude | Anthropic (2024) | A single SAE feature clamped to a high activation during inference; model inserts Golden Gate Bridge references into all outputs |
| Circuit Tracing | Anthropic (2025) | Full computational graphs traced on Claude 3.5 Haiku; revealed multi-step reasoning, planning, and multilingual feature sharing |
Quick check
Anthropic extracted ~34M features from Claude 3 Sonnet. OpenAI extracted 16M from GPT-4. What does the difference in scale imply about interpretability coverage?
Key Takeaways
What to remember for interviews
1. Superposition lets neural networks represent more features than dimensions by encoding them as nearly-orthogonal directions — enabling polysemantic neurons that fire for multiple unrelated concepts.
2. Sparse autoencoders (SAEs) crack superposition by learning an overcomplete basis (32x–256x wider) with an L1 sparsity penalty, decomposing polysemantic neurons into monosemantic features.
3. The residual stream is a shared communication bus: each layer reads from and writes to it additively, making transformers shallow-and-wide rather than deep-and-sequential.
4. Anthropic's Scaling Monosemanticity (2024) extracted ~34M interpretable features from Claude 3 Sonnet, including safety-relevant features for deception and sycophancy that causally affect model behavior.
5. Probing classifiers reveal whether information exists in representations; causal activation patching is needed to confirm whether the model actually uses that information.
Recap Quiz
Mechanistic Interpretability recap
A 512-dim residual stream encodes thousands of concepts. Which mechanism makes this possible, and what side-effect does it create?
An SAE trained on a 768-dim residual stream uses a 32× expansion factor. What is the hidden-layer size, and why is this ratio necessary rather than, say, 2×?
Induction heads emerge abruptly during training. What is the empirical marker of their arrival, and why does it matter for in-context learning?
A probe achieves 94% accuracy predicting entity gender from layer 8 activations. What can you conclude, and what can you NOT conclude?
An SAE is trained with loss L = ||x - x_hat||^2 + lambda * ||f||_1. What happens if lambda is set too large, and what fails if it is set too small?
Anthropic's circuit tracing on Claude 3.5 Haiku replaced MLP and attention outputs with SAE feature decompositions, then pruned weak attribution edges. What does the resulting “circuit” represent?
Golden Gate Claude was created by artificially activating one SAE feature during inference. What does this demonstrate about SAE features beyond interpretability?
Further Reading
- A Mathematical Framework for Transformer Circuits — Elhage et al. 2021 — reverse-engineering transformer computations as interpretable circuits
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet — Templeton et al. 2024 — dictionary learning at scale to find interpretable features in large models
- Toy Models of Superposition — Elhage et al. 2022 — understanding how neural networks represent more features than dimensions
- Towards Monosemanticity: Decomposing Language Models with Dictionary Learning — Bricken et al. 2023 — sparse autoencoders on a one-layer transformer find thousands of interpretable features; the predecessor to scaling monosemanticity
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small — Wang et al. 2022 — end-to-end circuit analysis of a real linguistic capability; the canonical example of mechanistic interpretability on a real model
- 3Blue1Brown — But what is a GPT? (YouTube) — Visual intuition for what transformer attention heads actually compute — useful foundation before diving into circuit-level interpretability
- Circuit Tracing: Revealing Computational Graphs in Language Models — Ameisen et al. 2025 — combining SAEs with attribution patching to trace full computational circuits in Claude 3.5 Haiku
- On the Biology of a Large Language Model — Lindsey et al. 2025 — probing Claude 3.5 Haiku's internal mechanisms: multi-step reasoning, planning, and multilingual feature sharing
- Emotion Concepts and their Function in a Large Language Model — Sofroniew et al. 2026 — investigating how emotion concept representations form and function inside Claude Sonnet 4.5
- Emergent Introspective Awareness in Large Language Models — Lindsey 2025 — evidence of introspective awareness where models can report on their own internal representations
- Chris Olah's Blog — Neural Networks, Manifolds, and Topology — Olah 2014 — foundational visual intuition for how neural networks transform data through manifold operations
- 3Blue1Brown — How might LLMs store facts (Chapter 7) — Grant Sanderson 2024 — visual deep dive into MLP layers as key-value memories, superposition, and why individual neurons are hard to interpret.
- Neuronpedia — Interactive SAE Feature Explorer — Open-source platform for exploring 50M+ sparse autoencoder features across GPT-2, Gemma, Llama — hands-on companion to the theory in this module.
Interview Questions
What is superposition and why does it make interpretability hard? ★★☆
How do sparse autoencoders work and what do they find? ★★☆
What are induction heads and why do they matter for in-context learning? ★★★
Explain the residual stream view of transformers. ★★☆
How does circuit tracing work? What has it revealed? ★★★
What is polysemanticity and how do SAEs address it? ★★☆
How could interpretability improve AI safety? ★★☆
What did scaling monosemanticity find in Claude 3 Sonnet? ★★★