
Transformer Math

Module 43 · Trust & Evaluation

🔬 Interpretability

Anthropic found a 'Golden Gate Bridge' feature inside Claude


We can build powerful transformers, but can we understand what they compute? Mechanistic interpretability reverse-engineers the internal algorithms of neural networks — finding human-interpretable features, circuits, and computations inside models that were learned from data alone.

Recent breakthroughs — sparse autoencoders, activation patching, and circuit tracing — are turning "black box" models into something closer to understandable programs.

🎮

Interactive Sandbox

Activation Patching Demo

What you're seeing: A 12-layer transformer is run on a factual prompt ("The Eiffel Tower is in"). We corrupt the residual stream at every layer, then patch in the clean activations one layer at a time and measure how much of the correct answer is recovered.
What to try: Click each layer bar to see which layers are most causally responsible for this factual recall. Notice how recovery peaks in the upper-middle to late layers.


💡

The Intuition

The residual stream is a shared communication bus running through the transformer. Each layer reads from it and writes back additively. Attention heads move information between token positions (copying, routing). MLPs process information at each position, acting as key-value memories that store learned associations.

Superposition is the core challenge: a network represents more features than it has dimensions. A 768-dimensional residual stream might encode thousands of distinct concepts as nearly-orthogonal directions. This works because real-world features are sparse — most are inactive for any given input. But it means individual neurons are polysemantic — a single neuron fires for multiple unrelated concepts, making it impossible to read off what any neuron "means."
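A quick numerical sketch (not from the original text) of why this works: random directions in high-dimensional space are nearly orthogonal, so far more "features" than dimensions can coexist with little pairwise interference.

```python
# Demo: random directions in high dimensions are nearly orthogonal,
# so a d-dim space can host many more than d features with low interference.
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 768, 10_000                 # many more "features" than dimensions

# Random unit vectors standing in for feature directions
dirs = rng.standard_normal((n_features, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

# Interference = cosine similarity between distinct directions
sims = dirs[:500] @ dirs[:500].T            # check a 500x500 sample of pairs
np.fill_diagonal(sims, 0.0)
max_interference = np.abs(sims).max()
print(f"max |cos| among sampled pairs: {max_interference:.3f}")
```

Even the worst pair stays weakly correlated, which is what lets sparse features share the space without corrupting each other.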

Sparse autoencoders (SAEs) crack superposition by learning an overcomplete basis. They decompose the model's activations into a much larger set of features (e.g., 32x–256x the residual width) with a sparsity constraint: only a handful of features activate for any given input. Each learned feature tends to be monosemantic — representing one coherent concept like "Golden Gate Bridge" or "code has a bug."

✨ Insight · Think of superposition like a crowded party where everyone talks at once. You can't understand any single voice (polysemantic neuron). SAEs are like directional microphones — they isolate individual speakers (monosemantic features) from the noise.

Probing classifiersare a complementary, cheaper interpretability tool. The idea: train a simple linear classifier on top of frozen intermediate activations to predict some property (e.g., part-of-speech tag, entity type, sentiment, or whether a statement is true). If the linear probe achieves high accuracy, the representation must linearly encode that property — the information is "in there" in a geometrically accessible way. Hewitt & Manning (2019) showed that syntactic parse-tree distances are linearly encoded in BERT's representations, despite BERT never being trained on parse trees. Probing is useful for hypothesis testing ("does layer 12 encode negation?") but has a key limitation: high probe accuracy only tells you the information exists; it does not tell you whether the model uses it for its predictions. Causal interventions (activation patching) are needed to establish use.
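A minimal probing sketch, using synthetic stand-in "activations" with a property planted along one direction (the data, dimensions, and concept direction here are all illustrative, not from any real model):

```python
# Linear probe on frozen "activations": if a property is linearly encoded,
# a simple linear classifier can read it out.
import torch

torch.manual_seed(0)
d, n = 64, 2000
concept = torch.randn(d)
concept /= concept.norm()                       # planted concept direction

labels = torch.randint(0, 2, (n,))
# Synthetic activations: noise, plus the property along one direction
acts = torch.randn(n, d) + 3.0 * labels[:, None] * concept

probe = torch.nn.Linear(d, 1)                   # linear probe: no hidden layers
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    logits = probe(acts).squeeze(-1)
    loss = torch.nn.functional.binary_cross_entropy_with_logits(
        logits, labels.float())
    loss.backward()
    opt.step()

acc = ((probe(acts).squeeze(-1) > 0).long() == labels).float().mean().item()
print(f"probe accuracy: {acc:.2f}")  # high accuracy: property is linearly decodable
```

High probe accuracy here shows only that the information is geometrically accessible — exactly the limitation the paragraph above describes.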

✨ Insight · The interpretability ladder: Probing (is info there?) → Activation patching (is it used?) → Circuit tracing (how is it computed?). Each level gives strictly more information but costs exponentially more compute.

Quick check

Trade-off

Why does superposition become worse (more features, more interference) as neural networks grow wider and are trained on more data?

🗺️

Activation Patching Pipeline

What you're seeing: Activation patching tests whether a specific layer's activations causally determine the model's output.

What to try: Follow the three steps — run clean, run corrupted, then patch the clean activation back in. If the output recovers, that layer is causally responsible.

  1. Clean run: normal input; save activations.
  2. Corrupted run: replace one input; run again.
  3. Patch & measure: swap in one clean activation; measure the output Δ.

If the output recovers, that activation is causally important.
Quick Check

Why do SAEs use an overcomplete (wider-than-input) hidden layer?

📐

Step-by-Step Derivation

Residual Stream Update

Each layer reads from and writes to the residual stream additively. Let $x_\ell$ be the residual at layer $\ell$:

$$x'_\ell = x_\ell + \mathrm{Attn}(\mathrm{LN}_1(x_\ell)), \qquad x_{\ell+1} = x'_\ell + \mathrm{MLP}(\mathrm{LN}_2(x'_\ell))$$

In practice each sublayer has its own residual connection and layer norm (pre-LN shown above). The interpretability shorthand often writes $x_{\ell+1} = x_\ell + \mathrm{Attn}(x_\ell) + \mathrm{MLP}(x_\ell)$ to emphasize that both components add to the same stream, but Attn and MLP are applied sequentially, not in parallel from the same input.
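The update described above can be sketched with stock PyTorch modules (layer sizes and module choices here are illustrative, not a real model):

```python
# Sketch of one pre-LN transformer layer: every sublayer output is ADDED
# to a single running residual tensor, never overwriting it.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, seq = 16, 8
ln1, ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
attn = nn.MultiheadAttention(d_model, num_heads=2, batch_first=True)
mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                    nn.Linear(4 * d_model, d_model))

x = torch.randn(1, seq, d_model)            # residual stream entering layer l
h = ln1(x)
attn_out, _ = attn(h, h, h)                 # attention moves info across positions
x = x + attn_out                            # additive write-back to the stream
x = x + mlp(ln2(x))                         # MLP processes each position in place
print(x.shape)                              # stream shape is preserved throughout
```

Because every write is additive, later analysis can decompose the stream into per-component contributions — the basis for patching and circuit tracing.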

SAE Encoding

The encoder maps a $d$-dimensional activation $x$ to a sparse, overcomplete feature space of dimension $n \gg d$ (typically $n = 32d$ to $256d$). Center the input around the decoder bias first:

$$f = \mathrm{ReLU}\big(W_{\mathrm{enc}}(x - b_{\mathrm{dec}}) + b_{\mathrm{enc}}\big)$$

ReLU ensures features are non-negative. Most entries of $f$ will be zero due to the sparsity penalty in the loss — only a few features fire per input.

SAE Reconstruction

The decoder reconstructs the original activation from the sparse features. Each column of $W_{\mathrm{dec}}$ is a feature direction in activation space:

$$\hat{x} = W_{\mathrm{dec}} f + b_{\mathrm{dec}}$$

SAE Training Loss

Reconstruction fidelity plus sparsity. The coefficient $\lambda$ controls the sparsity-fidelity tradeoff:

$$\mathcal{L} = \|x - \hat{x}\|_2^2 + \lambda \|f\|_1$$

💡 Tip · Too high a $\lambda$ kills reconstruction — features become too sparse to capture the signal. Too low a $\lambda$ loses interpretability — features become polysemantic again.

PyTorch: Simple Sparse Autoencoder

python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        # n_features >> d_model (overcomplete)
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model, bias=True)
        self.relu = nn.ReLU()

    def forward(self, x):
        # Center input around decoder bias
        x_centered = x - self.decoder.bias
        # Encode to sparse features
        f = self.relu(self.encoder(x_centered))
        # Reconstruct
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, lam=1e-2):
    """Reconstruction + L1 sparsity."""
    recon = (x - x_hat).pow(2).mean()
    sparse = f.abs().mean()
    return recon + lam * sparse
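
A minimal training-loop sketch for an SAE like the one above, on random stand-in activations (real use would stream activations out of a model's residual stream; the condensed class here mirrors the definition above so the snippet runs standalone):

```python
# Minimal SAE training loop on synthetic stand-in activations.
import torch
import torch.nn as nn

torch.manual_seed(0)

class SparseAutoencoder(nn.Module):
    """Condensed copy of the SAE defined above."""
    def __init__(self, d_model, n_features):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)
    def forward(self, x):
        f = torch.relu(self.encoder(x - self.decoder.bias))
        return self.decoder(f), f

sae = SparseAutoencoder(d_model=64, n_features=512)  # 8x overcomplete (demo scale)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(4096, 64)  # stand-in; real use: residual-stream activations

for step in range(200):
    batch = acts[torch.randint(0, len(acts), (256,))]
    x_hat, f = sae(batch)
    loss = (batch - x_hat).pow(2).mean() + 1e-2 * f.abs().mean()  # recon + L1
    opt.zero_grad()
    loss.backward()
    opt.step()

_, f = sae(acts[:256])
print("mean active features per input:", (f > 0).float().sum(-1).mean().item())
```

On real activations (which are far sparser than Gaussian noise), the active-feature count drops to a handful per input — the monosemanticity regime.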

PyTorch: Activation Patching

python
# Activation patching: is layer L causally important?
clean_acts = {}

def save_hook(module, inp, out):
    clean_acts[module] = out.clone()

# 1. Run the clean input, save activations at layer L
handle = model.layer[L].register_forward_hook(save_hook)
clean_out = model(clean_input)
handle.remove()  # detach so the save hook doesn't fire on later runs

# 2. Run the corrupted input, patching in the clean activation at layer L
def patch_hook(module, inp, out):
    return clean_acts[module]

handle = model.layer[L].register_forward_hook(patch_hook)
patched_out = model(corrupted_input)
handle.remove()

# 3. Baseline: run the corrupted input without patching
corrupt_out = model(corrupted_input)

# If patched_out ≈ clean_out, layer L is causally responsible
recovery = 1 - (patched_out - clean_out).norm() / (corrupt_out - clean_out).norm()

PyTorch: Patching Helper and Logit-Diff Metric

python
# Activation patching via forward hooks
import torch

def run_with_patch(model, tokens, layer, patch_tensor):
    """Run model but replace one layer's residual-stream output."""
    hooks = []
    def hook(module, inp, out):
        # out is a tuple in many HuggingFace models; patch the hidden state
        if isinstance(out, tuple):
            return (patch_tensor.to(out[0].device),) + out[1:]
        return patch_tensor.to(out.device)
    hooks.append(layer.register_forward_hook(hook))
    with torch.no_grad():
        logits = model(tokens).logits
    for h in hooks:
        h.remove()
    return logits

def logit_diff(logits, correct_tok, wrong_tok):
    """Metric: logit(correct) - logit(wrong) at the last position.

    Equals the log-prob difference, since log-softmax subtracts the
    same normalizer from both logits.
    """
    last = logits[:, -1, :]
    return (last[:, correct_tok] - last[:, wrong_tok]).item()
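
A toy sanity check of the logit-difference metric (the helper is restated so the snippet runs standalone; the tensor values are arbitrary):

```python
# Toy check of logit_diff on a hand-built logits tensor.
import torch

def logit_diff(logits, correct_tok, wrong_tok):
    """Same helper as above, restated so this snippet runs standalone."""
    last = logits[:, -1, :]
    return (last[:, correct_tok] - last[:, wrong_tok]).item()

logits = torch.zeros(1, 3, 5)   # (batch, seq, vocab)
logits[0, -1, 2] = 4.0          # "correct" token strongly favored
logits[0, -1, 3] = 1.0          # "wrong" token weakly favored
print(logit_diff(logits, correct_tok=2, wrong_tok=3))  # 3.0
```

A positive value means the model prefers the correct answer; patching experiments track how much of this gap each layer restores.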

Quick check

Derivation

In the SAE loss L = ||x - x_hat||^2 + lambda * ||f||_1, if you want features that each fire for a single concept, which loss term does the heavy lifting and what is the risk of over-weighting it?

🔧

Break It — See What Happens

Remove sparsity penalty (lambda = 0)
Use too few SAE features (e.g., same width as model)
Interpret Individual Neurons (No SAEs)
Trust Probing Classifiers Alone

Quick check

Trade-off

A safety team runs a probe on layer 12 and finds 91% accuracy predicting “harmful intent” in user prompts. They conclude the model reliably detects harmful intent at layer 12. What is wrong with this conclusion?

📊

Real-World Numbers

Work | Team | Key Result
---|---|---
Scaling Monosemanticity | Anthropic (2024) | ~34M features extracted from Claude 3 Sonnet; found abstract, multilingual, safety-relevant features
OpenAI SAE on GPT-4 | OpenAI (2024) | 16M features extracted from GPT-4; demonstrated that SAEs scale to the largest production models
Induction Heads | Anthropic (2022) | Identified the attention-head circuit behind in-context pattern copying; its abrupt emergence coincides with a jump in in-context learning
Golden Gate Claude | Anthropic (2024) | Clamping a single SAE feature made the model insert Golden Gate Bridge references into all outputs
Circuit Tracing | Anthropic (2025) | Full computational graphs traced on Claude 3.5 Haiku; revealed multi-step reasoning, planning, and multilingual feature sharing
✨ Insight · Interpretability is moving fast: from toy models (2022) to production frontier models (2024-2025). The core toolkit — SAEs + circuit tracing — now works at scale, but we still can't interpret full model behavior end-to-end on arbitrary inputs.

Quick check

Trade-off

Anthropic extracted ~34M features from Claude 3 Sonnet. OpenAI extracted 16M from GPT-4. What does the difference in scale imply about interpretability coverage?

🧠

Key Takeaways

What to remember for interviews

  1. Superposition lets neural networks represent more features than dimensions by encoding them as nearly-orthogonal directions — enabling polysemantic neurons that fire for multiple unrelated concepts.
  2. Sparse autoencoders (SAEs) crack superposition by learning an overcomplete basis (32x–256x wider) with an L1 sparsity penalty, decomposing polysemantic neurons into monosemantic features.
  3. The residual stream is a shared communication bus: each layer reads from and writes to it additively, making transformers shallow-and-wide rather than deep-and-sequential.
  4. Anthropic's Scaling Monosemanticity (2024) extracted ~34M interpretable features from Claude 3 Sonnet, including safety-relevant features for deception and sycophancy that causally affect model behavior.
  5. Probing classifiers reveal whether information exists in representations; causal activation patching is needed to confirm whether the model actually uses that information.
🧠

Recap Quiz

🧠

Mechanistic Interpretability recap

Trade-off

A 512-dim residual stream encodes thousands of concepts. Which mechanism makes this possible, and what side-effect does it create?

Derivation

An SAE trained on a 768-dim residual stream uses a 32× expansion factor. What is the hidden-layer size, and why is this ratio necessary rather than, say, 2×?

Derivation

Induction heads emerge abruptly during training. What is the empirical marker of their arrival, and why does it matter for in-context learning?

Trade-off

A probe achieves 94% accuracy predicting entity gender from layer 8 activations. What can you conclude, and what can you NOT conclude?

Trade-off

An SAE is trained with loss L = ||x - x_hat||^2 + lambda * ||f||_1. What happens if lambda is set too large, and what fails if it is set too small?

Derivation

Anthropic's circuit tracing on Claude 3.5 Haiku replaced MLP and attention outputs with SAE feature decompositions, then pruned weak attribution edges. What does the resulting “circuit” represent?

Trade-off

Golden Gate Claude was created by artificially activating one SAE feature during inference. What does this demonstrate about SAE features beyond interpretability?

🎯

Interview Questions


What is superposition and why does it make interpretability hard?

★★☆
AnthropicOpenAI

How do sparse autoencoders work and what do they find?

★★☆
Anthropic

What are induction heads and why do they matter for in-context learning?

★★★
AnthropicGoogle

Explain the residual stream view of transformers.

★★☆
Anthropic

How does circuit tracing work? What has it revealed?

★★★
Anthropic

What is polysemanticity and how do SAEs address it?

★★☆
AnthropicOpenAI

How could interpretability improve AI safety?

★★☆
AnthropicOpenAI

What did scaling monosemanticity find in Claude 3 Sonnet?

★★★
Anthropic