
Transformer Math

Module 65 · Trust & Evaluation

🔬 Mechanistic Interpretability

Anthropic found a single Claude feature that fires only on mentions of the Golden Gate Bridge — and clamping it causally steers Claude's behavior to mention the bridge in every response.

Status:

Anthropic traced parts of Claude's internal computation using attribution graphs. They followed a multi-step reasoning example — asking about the capital of Texas — and watched as intermediate features activated sequentially, ultimately writing the answer. The method captures a fraction of the total computation, not a complete end-to-end trace.

This module is a hands-on companion to the Interpretability module. That module explains what superposition and SAEs are. This one shows how to actually do mechanistic interpretability research — training SAEs, running attribution patching, tracing circuits, and steering model behavior.

🗺️

SAE Architecture

What you're seeing

A Sparse Autoencoder trained on a layer's residual stream activations. The encoder projects from d_model (8, simplified) into a much wider feature space d_sae (24 here; real SAEs use 32–256× expansion). An L1 penalty forces most feature activations to zero — only a handful “light up” for any given input. The decoder reconstructs the original activation from those sparse features; each decoder column is one learned “feature direction”.

What to notice

Three green features are active (Capitals, Math, Code) out of 24 — that's the sparsity in action. The loss combines reconstruction fidelity (||x − x̂||²) and a sparsity penalty (λ||f||₁). DeepMind's Gemma Scope SAEs achieve >99% reconstruction with most features firing <1% of the time.

[Diagram: Sparse Autoencoder (SAE) architecture — activation x (d = 8) → W_enc [d → d_sae] → f = ReLU(W_enc · x + b), d_sae = 24 (3×), most entries ≈ 0 → W_dec [d_sae → d] → reconstruction x̂ (d = 8). Loss = ||x − x̂||² + λ||f||₁; the L1 penalty forces sparsity. Active features shown: Capitals, Math, Code.]
🔬

SAE Training: How It Actually Works

The theory is in the Interpretability module. Here's what you actually build. An SAE has two parts: an encoder that maps model activations to a much wider space, and a decoder that maps back. The decoder column directions are the features.

SAE architecture

x ∈ ℝ^d
model activation
W_enc
f ∈ ℝ^d_sae
sparse features (ReLU)
W_dec
x̂ ∈ ℝ^d
reconstruction

d_sae is much wider than d — real SAEs use 32–256× expansion. Most entries of f are zero (L1 penalty). Each column of W_dec is one feature direction.

Real hyperparameters from Anthropic papers

| Hyperparameter | Typical value | What it controls |
|---|---|---|
| expansion_factor | 32–256× | How many more features than model dimensions; higher = more capacity but slower |
| λ (L1 coefficient) | 1e-3 – 1e-1 | Sparsity vs. reconstruction trade-off; tune for L0 of 20–100 features per token |
| training_tokens | ~1B | Sufficient coverage for features to specialize; fewer tokens → more dead features |
| learning_rate | 1e-4 – 5e-5 | Adam optimizer; cosine decay |
| target_layer | Middle layers | Middle layers have richest representations; Anthropic used residual stream post-MLP |
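Tuning λ against the L0 target above is easiest if you measure sparsity directly during training. A minimal sketch — the function name and toy values are illustrative, not from Anthropic's code:

```python
import torch

def l0_per_token(f: torch.Tensor, eps: float = 1e-6) -> float:
    """Mean number of active SAE features per token.

    f: feature activations, shape [n_tokens, d_sae].
    Tune λ so this lands in the 20-100 range from the table above.
    """
    return (f > eps).float().sum(dim=-1).mean().item()

# Toy example: 4 tokens, 10 features, mostly zeros
f = torch.zeros(4, 10)
f[0, [1, 3]] = 0.5      # 2 active
f[1, [2]] = 1.0         # 1 active
f[2, [0, 4, 7]] = 0.2   # 3 active
f[3, [8, 9]] = 0.9      # 2 active
print(l0_per_token(f))  # 2.0
```

If L0 is too high, raise λ; if it is too low (or dead features accumulate), lower it.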

PyTorch: Minimal SAE Training Loop

python
import torch
import torch.nn as nn
from torch.optim import Adam

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, expansion: int = 64):
        super().__init__()
        d_sae = d_model * expansion
        self.W_enc = nn.Linear(d_model, d_sae, bias=True)
        self.W_dec = nn.Linear(d_sae, d_model, bias=True)
        self.relu = nn.ReLU()
        # Normalize decoder columns to unit norm
        self._normalize_decoder()

    def _normalize_decoder(self):
        with torch.no_grad():
            norms = self.W_dec.weight.norm(dim=0, keepdim=True).clamp(min=1e-8)
            self.W_dec.weight.div_(norms)

    def forward(self, x: torch.Tensor):
        # Center around decoder bias before encoding
        x_cent = x - self.W_dec.bias
        f = self.relu(self.W_enc(x_cent))      # sparse features
        x_hat = self.W_dec(f)                  # reconstruction
        return x_hat, f

def sae_loss(x, x_hat, f, lam: float = 5e-3):
    recon = (x - x_hat).pow(2).mean()          # MSE reconstruction
    sparsity = f.abs().mean()                   # L1 on features
    return recon + lam * sparsity, recon, sparsity

# Training loop sketch
sae = SparseAutoencoder(d_model=4096, expansion=64)
opt = Adam(sae.parameters(), lr=2e-4)

for activations in dataloader:              # activations: [B, d_model]
    opt.zero_grad()
    x_hat, f = sae(activations)
    loss, recon, sparse = sae_loss(activations, x_hat, f, lam=5e-3)
    loss.backward()
    opt.step()
    sae._normalize_decoder()                # keep decoder cols unit norm
💡 Tip · The key insight: after training, each column of W_dec is a feature direction in the model's activation space. If column 4,721 consistently activates for “Golden Gate Bridge” text, that direction is the Golden Gate Bridge feature. You can read features by checking what inputs maximize each column's activation.
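The tip above — reading a feature by finding what maximizes it — can be sketched in a few lines. This is an illustrative sketch, not Anthropic's pipeline; `encode` stands in for the trained SAE's encoder, and the toy data is made up:

```python
import torch

def top_activating_examples(encode, activations, texts, feature_idx, k=3):
    """Rank text snippets by how strongly one SAE feature fires on them.

    encode: maps [N, d_model] activations -> [N, d_sae] sparse features
    activations: one model activation per snippet
    Returns the k snippets with the highest activation of feature_idx.
    """
    with torch.no_grad():
        f = encode(activations)
    scores = f[:, feature_idx]
    idx = scores.topk(min(k, len(texts))).indices.tolist()
    return [(texts[i], round(scores[i].item(), 3)) for i in idx]

# Toy stand-in "encoder": feature 0 just reads the first dimension
encode = lambda x: torch.relu(x)
acts = torch.tensor([[0.1, 0.0], [0.9, 0.0], [0.5, 0.0]])
texts = ["cat", "Golden Gate Bridge", "bridge"]
top = top_activating_examples(encode, acts, texts, feature_idx=0, k=2)
print(top)  # [('Golden Gate Bridge', 0.9), ('bridge', 0.5)]
```

In practice you run this over millions of cached activations and inspect the top snippets by hand (or with an LLM labeler) to name the feature.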

Quick check

Trade-off

You observe 30% dead features after training. Lowering the L1 coefficient hasn't helped. What is the most likely root cause?
💡

The Intuition: MRI-ing a Language Model

Imagine you could MRI a language model's brain while it thinks. Not just watch inputs and outputs, but trace every intermediate step — which concepts activated, in what order, and how they caused the final answer.

That's what Anthropic did with Claude 3.5 Haiku in their 2025 Biology paper. The experiment:

Circuit trace: “What is the capital of Texas?” (Anthropic, 2025)

1

Token "Texas" is read

The residual stream at the "Texas" token position starts activating features.

2

"Texas-state" feature fires

An SAE feature learned to recognize Texas as a US state activates strongly. It encodes geographic context.

3

"Austin" (capital) feature activates

The state feature propagates, causing the capital-city feature for Austin to fire — the model is building the answer.

4

"Austin" written to output

The Austin feature writes its direction to the residual stream at the answer position, and the model decodes "Austin".

Steering validation: two-hop geography

Anthropic asked: “What is the capital of the state containing Dallas?” The circuit activated a Texas feature which then output Austin. To confirm causality, researchers intervened mid-computation — swapping the Texas feature activation for a California feature activation. The model output changed to Sacramento. The feature, not the input token, drove the answer.

This is the mechanistic interpretability litmus test: if swapping feature F for feature G at the right layer changes the output from answer(F) to answer(G), you have found the causal bottleneck. Average circuits like this span 2.3 hops (Ameisen et al. 2025).

✨ Insight · This is circuit tracing — the core method. It combines sparse autoencoders (to name the features) with attribution patching (to measure which features caused which). The result is a computational graph you can read, not a black box.
⚠ Warning · Hallucination mechanism (Biology paper, 2025): Circuit analysis found “known answer” features that suppress the model's default refusal circuit. When a question is asked about something the model genuinely knows, these features fire and allow the answer to proceed. Hallucination occurs when these “known answer” features activate despite the model not actually having sufficient knowledge — confidence gates open when they shouldn't.

Three methods form the practical toolkit:

| Method | What it does | Cost |
|---|---|---|
| Sparse Autoencoder (SAE) | Decomposes activations into named, monosemantic features | Train once, reuse |
| Activation Patching | Swap activations between runs to measure causal effect | O(N) forward passes |
| Attribution Patching | Gradient approximation of causal effect — fast circuit tracing | 1 forward + 1 backward |
Quick Check

Why does the SAE decoder matrix represent interpretable features?

📐

Attribution Patching & Circuit Tracing

Method 1: Activation Patching (exact, slow)

Run the model on a clean input and a corrupted input. For each component c, swap its activation from the clean run into the corrupted run and measure the change in the output metric M:

Δ_c = M(corrupt run with clean activation at c) − M(corrupt run)

A large effect means the component causally matters. Requires one forward pass per component — O(N) total.

Method 2: Attribution Patching (approximate, fast)

First-order Taylor approximation. The attribution of feature f_i on the output metric M is:

attr(f_i) ≈ (∂M/∂f_i) · (f_i^clean − f_i^corrupt)

The gradient tells you sensitivity; the activation difference tells you magnitude. One forward + one backward pass covers all components. Accuracy is within ~90% of full patching on most circuits.
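As a toy sketch of this formula — the metric and feature values below are made up — the attribution of every feature falls out of a single backward pass:

```python
import torch

def attribution_scores(metric_fn, f_clean: torch.Tensor,
                       f_corrupt: torch.Tensor) -> torch.Tensor:
    """First-order estimate of each feature's causal effect:
    attr_i ≈ (∂M/∂f_i at corrupt) * (f_clean_i − f_corrupt_i).
    One backward pass scores every feature at once.
    """
    f = f_corrupt.clone().requires_grad_(True)
    metric_fn(f).backward()                  # fills f.grad with ∂M/∂f
    return f.grad * (f_clean - f_corrupt)    # elementwise attribution

# Toy metric: output depends only on features 0 and 2
metric = lambda f: 3.0 * f[0] + 1.0 * f[2]
f_clean = torch.tensor([1.0, 5.0, 2.0])
f_corrupt = torch.tensor([0.0, 9.0, 1.0])
attr = attribution_scores(metric, f_clean, f_corrupt)
print(attr)  # feature 1 scores 0 despite its large clean/corrupt diff
```

In a real experiment `metric_fn` would run the model forward from the patched feature activations (e.g. a logit difference), but the one-backward-pass structure is the same.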

Method 3: Circuit Tracing (SAE + attribution combined)

Replace MLP and attention outputs with their SAE decompositions, then run attribution patching over the feature graph instead of neurons. This gives a directed graph where nodes are named features and edges are causal attributions.

Prune edges below a threshold to get the sparse subgraph — the “circuit”. Each node is interpretable (an SAE feature), and each edge is a measured causal strength.

Cross-Layer Transcoders (CLTs) — the 2025 upgrade

Standard SAEs are trained per-layer. Ameisen et al. (2025) instead train Cross-Layer Transcoders: each CLT reads from the residual stream at its own layer but outputs contributions to all subsequent MLP layers. This lets a single feature represent a concept that persists and propagates across depth, rather than needing separate features at each layer.

The edge weight from source feature s to target feature t is approximately:

w_{s→t} = a_s · (d_sᵀ J e_t)

where a_s is the source feature's activation, d_s its decoder direction, e_t the target feature's encoder row, J is the Jacobian through attention and residual connections, and the sum runs over the decoder-to-encoder paths. Loss uses JumpReLU (not standard ReLU) plus a tanh-based sparsity penalty to reduce dead features.
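A toy numeric sketch of this edge weight, under the simplifying assumption that the path Jacobian is available as an explicit matrix (at real scale it never is — it is applied implicitly via backward passes):

```python
import torch

def clt_edge_weight(a_s: float, d_s: torch.Tensor,
                    J: torch.Tensor, e_t: torch.Tensor) -> torch.Tensor:
    """Source activation a_s, times the linear path from the source
    feature's decoder direction d_s, through the Jacobian J
    (attention + residual), into the target feature's encoder row e_t."""
    return a_s * (d_s @ J @ e_t)

# Toy: identity Jacobian, so the edge is just a_s * <d_s, e_t>
d_s = torch.tensor([1.0, 0.0])
e_t = torch.tensor([0.5, 0.5])
J = torch.eye(2)
w = clt_edge_weight(2.0, d_s, J, e_t)
print(w.item())  # 1.0
```

The intuition: an edge is strong when the source feature fires hard *and* its write direction, after passing through the intervening linear maps, aligns with what the target feature reads.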

💡 Tip · CLT limitations to know for interviews: Attribution graphs succeed on only ~25% of attempted prompts (the rest are too complex or ambiguous to yield clean sparse graphs). The replacement model — which substitutes CLT features for the original MLP computations — matches the model's top-1 next-token prediction on ~50% of a filtered evaluation set (prompts where the base model predicts correctly with confidence below 80%), with ~11.5% normalized mean reconstruction error.

PyTorch: Basic Activation Patching

python
import torch
from contextlib import contextmanager

@contextmanager
def patch_activation(model, layer_name: str, patch_value: torch.Tensor):
    """Context manager to swap one layer's output mid-forward-pass."""
    hooks = []
    def hook_fn(module, input, output):
        return patch_value   # replace with clean-run activation
    handle = dict(model.named_modules())[layer_name].register_forward_hook(hook_fn)
    hooks.append(handle)
    try:
        yield
    finally:
        for h in hooks:
            h.remove()

def activation_patching_score(model, clean_tokens, corrupt_tokens, layer_name,
                               clean_cache, metric_fn):
    """
    Measure how much layer_name causally matters for the metric.
    metric_fn(logits) -> scalar (e.g., logit diff between two tokens)
    """
    # Baseline: corrupted run
    with torch.no_grad():
        corrupt_logits = model(corrupt_tokens)
    baseline = metric_fn(corrupt_logits)

    # Patched: corrupted run but swap in the clean activation
    clean_act = clean_cache[layer_name]
    with torch.no_grad():
        with patch_activation(model, layer_name, clean_act):
            patched_logits = model(corrupt_tokens)
    patched = metric_fn(patched_logits)

    return (patched - baseline).item()  # positive = component helps
PyTorch: Logit Lens

python
# Logit lens: project intermediate residual stream to vocab space
import torch

def logit_lens(model, tokens: torch.Tensor):
    """
    At each layer, unembed the residual stream directly to get
    a probability distribution over vocabulary — no more processing.
    Shows what the model 'thinks' the next token is at each depth.
    """
    unembed = model.lm_head          # W_U: (d_model, vocab)
    ln_f = model.transformer.ln_f    # final layer norm

    residual_stream = []

    def save_residual(module, inp, out):
        # out[0] is the hidden state after this transformer block
        h = out[0] if isinstance(out, tuple) else out
        residual_stream.append(h.detach().clone())

    hooks = [block.register_forward_hook(save_residual)
             for block in model.transformer.h]

    with torch.no_grad():
        model(tokens)
    for h in hooks:
        h.remove()

    # Project each layer's residual stream through the unembedding
    layer_logits = []
    for h in residual_stream:
        normed = ln_f(h)                         # apply final norm
        logits = normed @ unembed.weight.T        # (B, seq, vocab)
        layer_logits.append(logits[:, -1, :].softmax(-1))  # last position
    return layer_logits  # list of (B, vocab) — one per layer

Quick check

Derivation

You need to rank the causal importance of all 10M CLT features in a single experiment. Which method is feasible?
🔧

Break It — Feature Steering in Action

The “Golden Gate Claude” experiment (Anthropic, 2024) demonstrated that SAE features are causally real. Researchers found a feature that activates for text about the Golden Gate Bridge, then clamped it to 10x its maximum activation during inference.

Clamp 'Golden Gate Bridge' feature to 10x activation

Normal response (feature = 0)

User: What are some things to do in San Francisco?

“Some great options include visiting Fisherman's Wharf, exploring Golden Gate Park, the Painted Ladies, and the ferry to Alcatraz.”

Steered response (feature clamped 10x)

User: What are some things to do in San Francisco?

“Oh, the Golden Gate Bridge, obviously! And near the Golden Gate Bridge you'll find... the Golden Gate Bridge visitor center, Golden Gate Bridge overlooks, and the Golden Gate Bridge gift shop.”

✨ Insight · Feature steering is the interpretability equivalent of unit testing: if the feature truly represents a concept, amplifying it should reliably inject that concept into outputs. It does. This is how we know SAE features are real computational objects, not just post-hoc labels.
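A minimal sketch of the clamping intervention itself, assuming you already have the feature's unit decoder direction and its current activation on the input (names and toy values are illustrative):

```python
import torch

def steer_with_feature(x: torch.Tensor, d_feature: torch.Tensor,
                       f_current: float, f_clamp: float) -> torch.Tensor:
    """Clamp one SAE feature in a residual-stream activation.

    x: activation [d_model]; d_feature: unit decoder direction [d_model]
    f_current: the feature's activation on this input
    f_clamp: target value (Golden Gate Claude used ~10x the observed max)

    Removes the feature's current contribution and writes back the
    clamped one, leaving all other directions untouched.
    """
    return x + (f_clamp - f_current) * d_feature

x = torch.tensor([1.0, 2.0, 3.0])
d = torch.tensor([0.0, 1.0, 0.0])  # toy feature direction
steered = steer_with_feature(x, d, f_current=0.5, f_clamp=5.0)
print(steered)  # only the feature's direction moves: [1.0, 6.5, 3.0]
```

In a full pipeline this function would run inside a forward hook at the SAE's layer, applied at every token position during generation.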

Quick check

Trade-off

The Golden Gate Bridge steering works by clamping one feature. What happens if you clamp 10 unrelated features simultaneously at 10x?
📊

Real Numbers

| Finding | Source | Details |
|---|---|---|
| SAEs up to ~34M features | Scaling Monosemanticity (2024) | Claude 3 Sonnet, 256× expansion (d_model 4,096 → d_sae 1,048,576), middle layers |
| ~1B tokens for SAE training | Bricken et al. (2023) | Enough to cover rare features; fewer tokens leaves features underspecialized |
| 10M CLT features, L0 ≈ 88 | Circuit Tracing methods (2025) | Largest Cross-Layer Transcoder has 10M features total; mean L0 sparsity is 88 active features per token across the corpus |
| Completeness 0.80, replacement 0.61 | Circuit Tracing methods (2025) | 0.80 = fraction of important inputs explained by the graph; 0.61 = fraction of end-to-end computation explained by CLT features. Reconstruction error ~11.5% normalized mean |
| Perturbation validation | Circuit Tracing methods (2025) | Spearman ρ = 0.72 for feature-to-feature influence; cosine similarity ~0.80 for predicted vs. observed feature activations |
| Multi-step reasoning chains | Biology paper (2025) | Dallas → Texas feature → Austin; swap Texas for California mid-computation and output shifts to Sacramento. Feature, not token, drives the answer |
| Cross-lingual feature sharing | Biology paper (2025) | Concepts like “banana” and “wedding” converge to shared features across English, French, Chinese — concepts exist before language. Middle layers are most multilingual; English has “mechanistic privilege” as the default language |
| Poetry planning | Biology paper (2025) | In ~50% of rhyming poems, the model activates candidate end-of-line rhyme words before writing the line that sets up the rhyme. Steering those features to a different word changes the chosen rhyme 70% of the time |
⚠ Warning · The “biology” framing is intentional but cautious. Anthropic draws analogies to neuroscience (circuits, features, neurons) but emphasizes these are mechanistic descriptions, not claims about consciousness or intent. The features are real computational objects; what they “mean” is inferred by humans looking at activation patterns.

Quick check

Derivation

Graph completeness is 0.80 and replacement score is 0.61. Why is there a gap between these two numbers?
🛠️

Hands-On Tooling

Mechanistic interpretability is unusually accessible for a frontier research area. Three tools let you start exploring today:

  • Neuronpedia — browse 50M+ SAE features across GPT-2, Gemma, Llama, and DeepSeek-R1. Search by concept, inspect top activations, and steer model behavior interactively in your browser.
  • TransformerLens — Python library for hooking into any layer of a transformer, extracting activations, and running activation patching experiments. The standard tool for mech interp research. Built around the same framework used to discover induction heads.
  • ARENA Chapter 1 — structured coding exercises covering induction heads, superposition, SAE training, and circuit analysis. The closest thing to a university course in mech interp.
💡 Tip · Start with Neuronpedia to build intuition for what SAE features look like, then move to TransformerLens when you want to run your own experiments. ARENA exercises bridge the gap between “I understand the theory” and “I can find circuits myself.”

Quick check

Derivation

You use TransformerLens on a model trained on 500M tokens. Induction head experiments show no clear induction behavior. Most likely explanation?
🧠

Key Takeaways

What to remember for interviews

  1. Circuit tracing = SAE decomposition + attribution patching. Name the features with SAEs, measure causality with gradients, prune to a sparse graph. The 2025 CLT approach uses 10M features with 88 active per token (L0 sparsity), achieving a graph completeness score of 0.80.
  2. SAEs work by learning an overcomplete basis (32x–256x wider) with an L1 sparsity penalty. CLTs upgrade this: features read from one layer but write to all subsequent MLP layers, so one feature can represent a concept that persists across depth.
  3. Attribution patching is activation patching's faster cousin: attr(f_i) ≈ (∂M/∂f_i) · (f_i^clean − f_i^corrupt). One forward + one backward pass covers all features instead of O(N) passes.
  4. Feature steering causally confirms interpretations. Dallas → Texas → Austin circuit: swap Texas for California mid-computation, output becomes Sacramento. Poetry planning: model activates rhyme-word candidates ~50% of the time before writing the setup line; steering succeeds 70% of the time.
  5. Current limits: replacement score is 0.61 (CLT features explain ~61% of end-to-end computation); graphs succeed on only ~25% of attempted prompts; attention decomposition is harder than MLP; human labeling introduces its own errors.
🧠

Recap quiz

🧠

Mechanistic Interpretability recap

Trade-off

An SAE trained with a 4x expansion factor achieves 99% reconstruction but dense features. What should you change first?
Derivation

The CLT replacement score is 0.61 on Claude 3.5 Haiku. What does this imply for using attribution graphs in production safety work?
Trade-off

Researchers steer a “California” feature mid-computation on the prompt about the capital of the state containing Dallas. The model outputs Sacramento. What does this confirm?
Trade-off

Attribution patching scores all N components in O(1) passes. Activation patching scores them in O(N) passes. When would you still prefer activation patching?
Derivation

Scaling Monosemanticity extracted ~34M features from Claude 3 Sonnet using 256x expansion. Why not use 1000x to extract even more features?
Derivation

Induction heads emerge around 2B training tokens, with loss on repeated random sequences dropping from ~2.3 to ~0.1. What makes this a phase change rather than gradual learning?
Trade-off

The Biology paper finds ~50% advance planning in rhyming poems and 70% steering success. Does this prove the model is “consciously planning”?
📚

Further Reading

🎯

Interview Questions


Walk through training a sparse autoencoder on transformer activations. What are the key hyperparameters?

★★☆
Anthropic

How does attribution patching differ from activation patching? When would you use each?

★★★
Anthropic, Google

What is the L1/reconstruction trade-off in SAE training and how do you pick the right sparsity coefficient?

★★☆
Anthropic

Describe how you would find the circuit responsible for a specific model behavior (e.g., gendered pronoun resolution).

★★★
Anthropic, OpenAI

What are the limitations of current mechanistic interpretability methods? What can't they tell us?

★★☆
Anthropic, Google, OpenAI