🔬 Interpretability
Anthropic found a 'Golden Gate Bridge' feature inside Claude
We can build powerful transformers, but can we understand what they compute? Mechanistic interpretability reverse-engineers the internal algorithms of neural networks — finding human-interpretable features, circuits, and computations inside models that were learned from data alone.
Recent breakthroughs — sparse autoencoders, activation patching, and circuit tracing — are turning "black box" models into something closer to understandable programs.
Interactive Sandbox
Activation Patching Demo
What you're seeing: A 12-layer transformer is run on a factual prompt ("The Eiffel Tower is in"). We corrupt the residual stream at every layer, then patch in the clean activations one layer at a time and measure how much of the correct answer is recovered.
What to try: Click each layer bar to see which layers are most causally responsible for this factual recall. Notice how recovery peaks in the upper-middle to late layers.
Click a layer bar above to inspect its causal role.
The Intuition
The residual stream is a shared communication bus running through the transformer. Each layer reads from it and writes back additively. Attention heads move information between token positions (copying, routing). MLPs process information at each position, acting as key-value memories that store learned associations.
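The key-value-memory view of an MLP can be made concrete in a few lines. The sketch below is illustrative rather than any real model's weights: it uses random matrices, toy dimensions, and ReLU in place of the GELU most transformers use. Each row of W_in acts as a "key" that the residual vector is matched against; the match strength gates how much of the corresponding W_out column (the "value") gets written back to the stream.

import torch

d_model, d_mlp = 16, 64                     # toy sizes (real models use e.g. 768 / 3072)
W_in = torch.randn(d_mlp, d_model) * 0.1    # each row is a "key" direction
W_out = torch.randn(d_model, d_mlp) * 0.1   # each column is a "value" vector

x = torch.randn(d_model)                    # residual-stream vector at one position

key_scores = torch.relu(W_in @ x)           # how strongly x matches each key
mlp_out = W_out @ key_scores                # weighted sum of value vectors written to the stream

# Same computation written as an explicit key-value sum
manual = sum(key_scores[i] * W_out[:, i] for i in range(d_mlp))
assert torch.allclose(mlp_out, manual, atol=1e-4)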
Superposition is the core challenge: networks represent more features than they have dimensions. A 768-dimensional residual stream might encode thousands of distinct concepts as nearly-orthogonal directions. This works because real-world features are sparse — most are inactive for any given input. But it means individual neurons are polysemantic — a single neuron fires for multiple unrelated concepts, making it impossible to read off what any neuron "means."
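A toy sketch in the spirit of Toy Models of Superposition makes the geometry tangible — the dimensions, feature count, and sparsity level below are arbitrary illustrative choices. Thousands of random directions in a 128-dimensional space are far from exactly orthogonal, yet as long as only a few features are active at once, reading any one of them back out incurs only small interference.

import torch

torch.manual_seed(0)
d, n_features = 128, 2000                      # many more features than dimensions

# Random unit vectors are nearly (but not exactly) orthogonal in high dimensions
dirs = torch.nn.functional.normalize(torch.randn(n_features, d), dim=1)
cos = dirs @ dirs.T
cos.fill_diagonal_(0)
print("max |cos| between distinct features:", cos.abs().max().item())  # well below 1, but not 0

# A sparse input: only k of the 2000 features are "on"
k = 5
active = torch.randperm(n_features)[:k]
x = dirs[active].sum(dim=0)                    # superposed representation of k features

# Reading out with a feature's own direction: signal ~1 plus small interference
print("active feature readout:  ", (dirs[active[0]] @ x).item())
inactive = next(i for i in range(n_features) if i not in set(active.tolist()))
print("inactive feature readout:", (dirs[inactive] @ x).item())        # near 0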
Sparse autoencoders (SAEs) crack superposition by learning an overcomplete basis. They decompose the model's activations into a much larger set of features (e.g., 32x–256x the residual-stream width) with a sparsity constraint: only a handful of features activate for any given input. Each learned feature tends to be monosemantic — representing one coherent concept like "Golden Gate Bridge" or "code has a bug."
Probing classifiers are a complementary, cheaper interpretability tool. The idea: train a simple linear classifier on top of frozen intermediate activations to predict some property (e.g., part-of-speech tag, entity type, sentiment, or whether a statement is true). If the linear probe achieves high accuracy, the representation must linearly encode that property — the information is "in there" in a geometrically accessible way. Hewitt & Manning (2019) showed that syntactic parse-tree distances are linearly encoded in BERT's representations, despite BERT never being trained on parse trees. Probing is useful for hypothesis testing ("does layer 12 encode negation?") but has a key limitation: high probe accuracy only tells you the information exists; it does not tell you whether the model uses it for its predictions. Causal interventions (activation patching) are needed to establish use.
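Here is a minimal probing sketch. It assumes GPT-2 via Hugging Face transformers, an arbitrary choice of layer 8, mean-pooled hidden states, and four made-up sentiment examples — all placeholders; a real probe would use a proper labeled dataset and a held-out split.

import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

# Toy labeled data for the property we want to probe (placeholder examples)
texts  = ["the movie was great", "the movie was awful", "I loved it", "I hated it"]
labels = [1, 0, 1, 0]

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()

layer = 8                                   # which layer's activations to probe (arbitrary)
feats = []
with torch.no_grad():
    for t in texts:
        out = model(**tok(t, return_tensors="pt"), output_hidden_states=True)
        # Mean-pool the chosen layer's hidden states into one vector per text
        feats.append(out.hidden_states[layer].mean(dim=1).squeeze(0))

X = torch.stack(feats).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)   # the probe is just a linear classifier
print("probe accuracy:", probe.score(X, labels))           # trivially high on 4 training examples
# High accuracy shows the property is linearly decodable from this layer;
# it does NOT show the model uses that information for its predictions.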
Quick check
Why does superposition become worse (more features, more interference) as neural networks grow wider and are trained on more data?
Activation Patching Pipeline
What you're seeing: Activation patching tests whether a specific layer's activations causally determine the model's output.
What to try: Follow the three steps — run clean, run corrupted, then patch the clean activation back in. If the output recovers, that layer is causally responsible.
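The sketch below mirrors that three-step loop on GPT-2 with Hugging Face forward hooks. The prompts, the "corrupt the subject" choice, and patching only the final token position (to sidestep length mismatches between the two prompts) are all simplifying assumptions; fuller single-layer versions of the same idea appear later on this page.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

clean   = tok("The Eiffel Tower is in", return_tensors="pt").input_ids
corrupt = tok("The Colosseum is in", return_tensors="pt").input_ids    # corrupted subject
paris   = tok(" Paris", return_tensors="pt").input_ids[0, 0]

def answer_prob(ids, hook_layer=None, patch=None):
    """P(' Paris') at the last position, optionally patching one block's output."""
    handle = None
    if hook_layer is not None:
        def hook(mod, inp, out):
            h = out[0].clone()
            h[:, -1, :] = patch[:, -1, :]      # patch only the final position
            return (h,) + out[1:]
        handle = model.transformer.h[hook_layer].register_forward_hook(hook)
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    if handle is not None:
        handle.remove()
    return logits.softmax(-1)[paris].item()

# Step 1: run clean and cache every block's output
clean_acts, handles = [], []
for blk in model.transformer.h:
    handles.append(blk.register_forward_hook(lambda m, i, o: clean_acts.append(o[0])))
clean_p = answer_prob(clean)
for h in handles:
    h.remove()

# Step 2: run corrupted (baseline); Step 3: patch each layer's clean activation back in
corrupt_p = answer_prob(corrupt)
for layer, act in enumerate(clean_acts):
    patched_p = answer_prob(corrupt, hook_layer=layer, patch=act)
    recovery = (patched_p - corrupt_p) / (clean_p - corrupt_p)
    print(f"layer {layer:2d}  recovery = {recovery:.2f}")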
Why do SAEs use an overcomplete (wider-than-input) hidden layer?
Step-by-Step Derivation
Residual Stream Update
Each layer reads from and writes to the residual stream additively. x^(l) is the residual at layer l; with pre-LN, the attention sublayer writes first, then the MLP:

x_mid   = x^(l)  + Attn(LN_1(x^(l)))
x^(l+1) = x_mid  + MLP(LN_2(x_mid))

In practice each sublayer has its own residual connection and layer norm (pre-LN shown above). The interpretability shorthand often writes x^(l+1) = x^(l) + Attn(x^(l)) + MLP(x^(l)) to emphasize that both components add to the same stream, but Attn and MLP are applied sequentially, not in parallel from the same input.
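A minimal sketch of a pre-LN block in this residual-stream form — toy dimensions, PyTorch's built-in MultiheadAttention, and no masking or dropout are simplifications, but the two additive writes match the update above.

import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """One transformer layer as two additive writes to the residual stream."""
    def __init__(self, d_model=64, n_heads=4, d_mlp=256):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_mlp), nn.GELU(), nn.Linear(d_mlp, d_model))

    def forward(self, x):                       # x: (batch, seq, d_model) residual stream
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out                        # attention writes into the stream
        x = x + self.mlp(self.ln2(x))           # MLP writes into the stream
        return x

x = torch.randn(1, 8, 64)                       # toy stream: batch 1, 8 positions
print(PreLNBlock()(x).shape)                    # torch.Size([1, 8, 64])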
SAE Encoding
The encoder maps a d-dimensional activation x to a sparse, overcomplete feature vector f of size n_features (typically n_features >> d). Center the input around the decoder bias first:

f = ReLU(W_enc (x - b_dec) + b_enc)

ReLU ensures features are non-negative. Most entries of f will be zero due to the sparsity penalty in the loss — only a few features fire per input.
SAE Reconstruction
The decoder reconstructs the original activation from the sparse features. Each column of W_dec is a feature direction in activation space:

x_hat = W_dec f + b_dec
SAE Training Loss
Reconstruction fidelity plus sparsity. The coefficient lambda controls the sparsity-fidelity tradeoff:

L = ||x - x_hat||_2^2 + lambda * ||f||_1
PyTorch: Simple Sparse Autoencoder
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        # n_features >> d_model (overcomplete)
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model, bias=True)
        self.relu = nn.ReLU()

    def forward(self, x):
        # Center input around decoder bias
        x_centered = x - self.decoder.bias
        # Encode to sparse features
        f = self.relu(self.encoder(x_centered))
        # Reconstruct
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, lam=1e-2):
    """Reconstruction + L1 sparsity."""
    recon = (x - x_hat).pow(2).mean()
    sparse = f.abs().mean()
    return recon + lam * sparse

PyTorch: Activation Patching
# Activation patching: is layer L causally important?
clean_acts = {}

def save_hook(module, input, output):
    clean_acts[module] = output.clone()

# Run clean, save activations at layer L
handle = model.layer[L].register_forward_hook(save_hook)
clean_out = model(clean_input)
handle.remove()  # stop caching before the corrupted runs

# Run corrupted, patch in the clean activation at layer L
def patch_hook(module, input, output):
    return clean_acts[module]

handle = model.layer[L].register_forward_hook(patch_hook)
patched_out = model(corrupted_input)
handle.remove()

# Baseline: run corrupted input without patching
corrupt_out = model(corrupted_input)

# If patched_out ≈ clean_out, layer L is causally responsible
recovery = 1 - (patched_out - clean_out).norm() / (corrupt_out - clean_out).norm()

PyTorch implementation
# Activation patching via forward hooks
import torch

def run_with_patch(model, tokens, layer, patch_tensor):
    """Run model but replace one layer's residual-stream output."""
    hooks = []
    def hook(module, inp, out):
        # out is a tuple in many HuggingFace models; patch the hidden state
        hidden = out[0] if isinstance(out, tuple) else out
        hidden = patch_tensor.to(hidden.device)
        return ((hidden,) + out[1:]) if isinstance(out, tuple) else hidden
    hooks.append(layer.register_forward_hook(hook))
    with torch.no_grad():
        logits = model(tokens).logits
    for h in hooks:
        h.remove()
    return logits

def logit_diff(logits, correct_tok, wrong_tok):
    """Metric: log-prob(correct) - log-prob(wrong) at last position."""
    last = logits[:, -1, :]
    return (last[:, correct_tok] - last[:, wrong_tok]).item()

Quick check
In the SAE loss L = ||x - x_hat||^2 + lambda * ||f||_1, if you want features that each fire for a single concept, which loss term does the heavy lifting and what is the risk of over-weighting it?
Break It — See What Happens
Quick check
A safety team runs a probe on layer 12 and finds 91% accuracy predicting “harmful intent” in user prompts. They conclude the model reliably detects harmful intent at layer 12. What is wrong with this conclusion?
Real-World Numbers
| Work | Team | Key Result |
|---|---|---|
| Scaling Monosemanticity | Anthropic (2024) | ~34M interpretable features extracted from Claude 3 Sonnet; found abstract, multilingual, safety-relevant features |
| OpenAI SAE on GPT-4 | OpenAI (2024) | 16M features extracted from GPT-4; demonstrated that SAEs scale to the largest production models |
| Induction Heads | Anthropic (2022) | Heads that emerge abruptly in a training phase change; their arrival coincides with the jump in in-context learning ability |
| Golden Gate Claude | Anthropic (2024) | A single SAE feature clamped to a high activation during inference; model inserts Golden Gate Bridge references into all outputs |
| Circuit Tracing | Anthropic (2025) | Full computational graphs traced on Claude 3.5 Haiku; revealed multi-step reasoning, planning, and multilingual feature sharing |
Quick check
Anthropic extracted ~34M features from Claude 3 Sonnet. OpenAI extracted 16M from GPT-4. What does the difference in scale imply about interpretability coverage?
Key Takeaways
What to remember for interviews
1. Superposition lets neural networks represent more features than dimensions by encoding them as nearly-orthogonal directions — enabling polysemantic neurons that fire for multiple unrelated concepts.
2. Sparse autoencoders (SAEs) crack superposition by learning an overcomplete basis (32x–256x wider) with an L1 sparsity penalty, decomposing polysemantic neurons into monosemantic features.
3. The residual stream is a shared communication bus: each layer reads from and writes to it additively, making transformers shallow-and-wide rather than deep-and-sequential.
4. Anthropic's Scaling Monosemanticity (2024) extracted ~34M interpretable features from Claude 3 Sonnet, including safety-relevant features for deception and sycophancy that causally affect model behavior.
5. Probing classifiers reveal whether information exists in representations; causal activation patching is needed to confirm whether the model actually uses that information.
Recap Quiz
Mechanistic Interpretability recap
A 512-dim residual stream encodes thousands of concepts. Which mechanism makes this possible, and what side-effect does it create?
An SAE trained on a 768-dim residual stream uses a 32× expansion factor. What is the hidden-layer size, and why is this ratio necessary rather than, say, 2×?
Induction heads emerge abruptly during training. What is the empirical marker of their arrival, and why does it matter for in-context learning?
A probe achieves 94% accuracy predicting entity gender from layer 8 activations. What can you conclude, and what can you NOT conclude?
An SAE is trained with loss L = ||x - x_hat||^2 + lambda * ||f||_1. What happens if lambda is set too large, and what fails if it is set too small?
Anthropic's circuit tracing on Claude 3.5 Haiku replaced MLP and attention outputs with SAE feature decompositions, then pruned weak attribution edges. What does the resulting “circuit” represent?
Golden Gate Claude was created by artificially activating one SAE feature during inference. What does this demonstrate about SAE features beyond interpretability?
Further Reading
- A Mathematical Framework for Transformer Circuits — Elhage et al. 2021 — reverse-engineering transformer computations as interpretable circuits
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet — Templeton et al. 2024 — dictionary learning at scale to find interpretable features in large models
- Toy Models of Superposition — Elhage et al. 2022 — understanding how neural networks represent more features than dimensions
- Towards Monosemanticity: Decomposing Language Models with Dictionary Learning — Bricken et al. 2023 — sparse autoencoders on a one-layer transformer find thousands of interpretable features; the predecessor to scaling monosemanticity
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small — Wang et al. 2022 — end-to-end circuit analysis of a real linguistic capability; the canonical example of mechanistic interpretability on a real model
- 3Blue1Brown — But what is a GPT? (YouTube) — Visual intuition for what transformer attention heads actually compute — useful foundation before diving into circuit-level interpretability
- Circuit Tracing: Revealing Computational Graphs in Language Models — Ameisen et al. 2025 — combining SAEs with attribution patching to trace full computational circuits in Claude 3.5 Haiku
- On the Biology of a Large Language Model — Lindsey et al. 2025 — probing Claude 3.5 Haiku's internal mechanisms: multi-step reasoning, planning, and multilingual feature sharing
- Emotion Concepts and their Function in a Large Language Model — Sofroniew et al. 2026 — investigating how emotion concept representations form and function inside Claude Sonnet 4.5
- Emergent Introspective Awareness in Large Language Models — Lindsey 2025 — evidence of introspective awareness where models can report on their own internal representations
- Chris Olah's Blog — Neural Networks, Manifolds, and Topology — Olah 2014 — foundational visual intuition for how neural networks transform data through manifold operations
- 3Blue1Brown — How might LLMs store facts (Chapter 7) — Grant Sanderson 2024 — visual deep dive into MLP layers as key-value memories, superposition, and why individual neurons are hard to interpret.
- Neuronpedia — Interactive SAE Feature Explorer — Open-source platform for exploring 50M+ sparse autoencoder features across GPT-2, Gemma, Llama — hands-on companion to the theory in this module.
Interview Questions
What is superposition and why does it make interpretability hard? ★★☆
How do sparse autoencoders work and what do they find? ★★☆
What are induction heads and why do they matter for in-context learning? ★★★
Explain the residual stream view of transformers. ★★☆
How does circuit tracing work? What has it revealed? ★★★
What is polysemanticity and how do SAEs address it? ★★☆
How could interpretability improve AI safety? ★★☆
What did scaling monosemanticity find in Claude 3 Sonnet? ★★★