
Transformer Math

Module 7 · The Transformer

⚙️ FFN & Activations

Where does GPT store the fact that Paris is in France?


Attention gets all the press, but most of a Transformer's parameters — and much of its stored knowledge — live in the Feed-Forward Network. Every token passes through an expand-then-compress bottleneck that stores factual knowledge, applies nonlinear transformations, and acts as a learned key-value memory. Understanding the FFN is understanding where the model actually "knows" things.

🔬

FFN Architecture

What you're seeing: The three-phase FFN pipeline used in every Transformer layer. An input vector of dimension d_model = 4096 is projected up to 4× its size (d_ff = 16384), passed through a nonlinearity (GELU or SwiGLU), then projected back down. The two weight matrices W₁ and W₂ together account for ~67% of all Transformer parameters. The SwiGLU variant adds a third gate matrix V that multiplies element-wise before W₂.

Diagram: Input (d_model = 4096) → W₁ (d × 4d) → GELU/SwiGLU (d_ff = 16384) → W₂ (4d × d) → Output (d_model = 4096). W₁ + W₂ ≈ 134M parameters, ~67% of the block's parameters.
🎮

FFN Expand-Compress Visualization

What you are seeing: An input vector of dimension 4 is expanded to a higher-dimensional hidden space, passed through an activation function, then compressed back to dimension 4. Node brightness shows activation magnitude. Dead (zero) neurons appear gray.

What to try: Switch between SwiGLU, ReLU, and No Activation to see how each shapes the hidden representation. Notice how ReLU zeros out many neurons while SwiGLU keeps smooth gating.

💡

The Intuition

The FFN in each Transformer layer performs the same operation on every token independently: expand to a higher dimension, apply a nonlinearity, then compress back. Why?

Diagram: FFN Expand-Compress Architecture (original Transformer dimensions)

Input (d = 512) → W₁ + σ(·) → Hidden (d = 2048, 4×) → W₂ → Output (d = 512)

Input (512) expands 4× to hidden (2048), activation gates which neurons fire, then contracts back to 512. Same weight matrices for every token.

Diagram: Position-wise — same FFN weights, each token independently

Tokens "The", "cat", "sat" each pass through the same FFN (same W₁, W₂) to produce their own h′. No cross-token interaction — each position is processed independently.
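A minimal sketch of this independence property (illustrative dimensions and a plain GELU FFN, not any specific model): changing one token's vector leaves every other position's output untouched, because the same weights are applied to each position separately.

import torch
import torch.nn as nn

d_model, d_ff = 512, 2048
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

x = torch.randn(1, 3, d_model)            # three tokens: "The", "cat", "sat"
y = ffn(x)

x2 = x.clone()
x2[0, 1] = torch.randn(d_model)           # perturb only the middle token
y2 = ffn(x2)

print(torch.allclose(y[0, 0], y2[0, 0]))  # True — position 0 is unaffected
print(torch.allclose(y[0, 2], y2[0, 2]))  # True — position 2 is unaffected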

FFN as Key-Value Memory (Geva et al., 2021): each row of W₁ acts as a "key" that matches a pattern in the input. Each column of W₂ is the corresponding "value" — the information retrieved when that key matches. The activation function gates which memories fire. This is why you can edit factual knowledge by modifying specific columns of W₂ (the idea behind ROME and MEMIT).

Why 67% of parameters? With d_ff = 4·d_model, each FFN layer has 2 · d · 4d = 8d² parameters. Attention has 4d² (Q, K, V, O projections). That's a 2:1 ratio — FFN dominates.
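A quick back-of-the-envelope check of that ratio (plain arithmetic, using the d_model = 4096 figures from the diagram above):

d = 4096
ffn_params  = 2 * d * (4 * d)   # W₁ (d × 4d) + W₂ (4d × d) = 8d² = 134,217,728 ≈ 134M
attn_params = 4 * d * d         # Q, K, V, O projections = 4d²
print(ffn_params / (ffn_params + attn_params))   # 0.666... → FFN ≈ 67% of the block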

Diagram: Activation functions — GELU suppresses small negatives smoothly; ReLU hard-zeros them

Plot: f(x) for x ∈ [−3, 3], comparing ReLU, GELU, and Swish (the SwiGLU gate).

GELU allows small negative values to pass (smooth taper), unlike ReLU's hard zero. Swish dips slightly below zero for small negative inputs, behaves almost linearly for large positive values, and is used as the gate in SwiGLU.
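A small numerical comparison of the three activations at a few points (a sketch using PyTorch's built-in gelu and silu; silu is the Swish x·σ(x) used as the SwiGLU gate):

import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(F.relu(x))   # [ 0.0000,  0.0000, 0.0000, 0.5000, 2.0000] — hard zero below 0
print(F.gelu(x))   # ≈ [-0.0455, -0.1543, 0.0000, 0.3457, 1.9545] — smooth taper on small negatives
print(F.silu(x))   # ≈ [-0.2384, -0.1888, 0.0000, 0.3112, 1.7616] — Swish, nearly linear for large x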

SwiGLU activation (Shazeer, 2020): Modern transformers replaced ReLU/GELU with SwiGLU, which adds a multiplicative gate: SwiGLU(x) = (Swish(xW₁) ⊙ xV) W₂. The gate learns which features to suppress. To compensate for the third weight matrix, the hidden dimension shrinks from 4·d_model to roughly (8/3)·d_model ≈ 2.67·d_model.

✨ Insight · Think of the FFN as a giant lookup table. Attention figures out WHAT to look at (routing). The FFN then processes each token through its memory bank — retrieving facts, applying transformations, and updating the representation. Attention moves information between tokens; FFN transforms information within each token.

Quick check

Derivation

In the FFN key-value memory view, what determines whether a memory slot j is “retrieved” for a given input token?

Quick Check

Where does a Transformer primarily store factual knowledge like 'Paris is the capital of France'?

📐

Step-by-Step Derivation

Standard FFN (Original Transformer)

Two linear transformations with a nonlinearity in between, applied independently to each token position. The nonlinearity is essential: by the universal approximation theorem, a two-layer MLP can approximate any continuous function given enough hidden units — but only with a non-linear activation.

FFN(x) = max(0, xW₁ + b₁) W₂ + b₂, where W₁ has shape d_model × d_ff, W₂ has shape d_ff × d_model, and typically d_ff = 4 · d_model.
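A direct transcription of this formula into PyTorch (a sketch with the original paper's ReLU and bias terms, and the original Transformer's 512/2048 dimensions):

import torch
import torch.nn as nn

class StandardFFN(nn.Module):
    """FFN(x) = max(0, xW1 + b1) W2 + b2 — applied to each position independently."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # expand: d_model → d_ff
        self.w2 = nn.Linear(d_ff, d_model)   # compress: d_ff → d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(torch.relu(self.w1(x)))

x = torch.randn(2, 10, 512)      # (batch, seq_len, d_model)
print(StandardFFN()(x).shape)    # torch.Size([2, 10, 512]) — same shape as input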

SwiGLU FFN (Llama, PaLM, Mistral)

Adds a gating mechanism with a third weight matrix V. The Swish activation is Swish(x) = x · σ(x), where σ is the sigmoid function:

SwiGLU_FFN(x) = (Swish(xW₁) ⊙ xV) W₂

💡 Tip · Three matrices (W₁, V, W₂) instead of two. To keep parameter count constant, reduce d_ff from 4·d_model to (8/3)·d_model. Llama-2 7B uses d_ff = 11008 for d_model = 4096 (≈ 2.69×).

Parameter Count Comparison

Component      | Matrices   | Parameters
Attention      | Q, K, V, O | 4d²
FFN (standard) | W₁, W₂     | 8d²
FFN (SwiGLU)   | W₁, V, W₂  | 3 × d × (8d/3) = 8d²

In both cases, FFN has 2× the parameters of attention per layer. With biases excluded (modern practice), FFN holds 8d² / (8d² + 4d²) = 2/3 ≈ 67% of each Transformer block's parameters.
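You can sanity-check the SwiGLU row with Llama-2 7B's actual dimensions (d_model = 4096, d_ff = 11008, as in the code and table below): the three-matrix count lands within ~1% of the standard 8d².

d, d_ff = 4096, 11008
standard = 2 * d * (4 * d)    # W1 + W2 at 4× expansion = 8d² = 134,217,728
swiglu   = 3 * d * d_ff       # W1 + V + W2 at d_ff ≈ 8d/3 = 135,266,304
print(swiglu / standard)      # ≈ 1.008 — essentially parameter-neutral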

PyTorch implementation
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU_FFN(nn.Module):
    """FFN with SwiGLU activation (Llama-2, Mistral, PaLM)."""
    def __init__(self, d_model: int, d_ff: int | None = None):
        super().__init__()
        if d_ff is None:
            d_ff = int(2 * (4 * d_model) / 3)
            d_ff = 256 * ((d_ff + 255) // 256)  # round up to the nearest multiple of 256

        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate proj
        self.v  = nn.Linear(d_model, d_ff, bias=False)   # up proj
        self.w2 = nn.Linear(d_ff, d_model, bias=False)   # down proj

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: (Swish(xW1) ⊙ xV) W2
        return self.w2(F.silu(self.w1(x)) * self.v(x))

# Example: Llama-2 7B dimensions
ffn = SwiGLU_FFN(d_model=4096, d_ff=11008)
x = torch.randn(1, 128, 4096)  # (batch, seq_len, d_model)
out = ffn(x)                   # (1, 128, 4096) — same shape as input

Quick check

Derivation

GPT-2 Small has d_model=768 and L=12 layers. Each FFN block has two matrices (W1: 768×3072, W2: 3072×768). What is the total FFN parameter count across all layers?

🔧

Break It — See What Happens

Replace SwiGLU with ReLU
Remove FFN entirely

Empirically, a substantial fraction of ReLU FFN neurons end up permanently dead — their pre-activation is always negative, so they contribute nothing to any forward pass. SwiGLU's gating eliminates this hard zero, keeping all neurons active with learned suppression instead.
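A sketch of how you could measure dead neurons in practice (note: the randomly initialized W₁ below will show essentially none — dead neurons emerge during training; the point is the measurement itself, and the 768/3072 dimensions are just illustrative):

import torch
import torch.nn as nn

d_model, d_ff = 768, 3072
w1 = nn.Linear(d_model, d_ff)              # first matrix of a ReLU FFN

x = torch.randn(10_000, d_model)           # a large batch of token representations
pre_act = w1(x)                            # pre-activations, shape (10000, d_ff)
never_fires = (pre_act <= 0).all(dim=0)    # neurons whose pre-activation is never positive
print(never_fires.float().mean().item())   # fraction of "dead" neurons over this batch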

📊

Real-World Numbers

Model          | d_model | d_ff  | Ratio | Activation
GPT-2 (1.5B)   | 1600    | 6400  | 4.0×  | GELU
Llama-2 (7B)   | 4096    | 11008 | 2.69× | SwiGLU
Llama-2 (70B)  | 8192    | 28672 | 3.5×  | SwiGLU
PaLM (540B)    | 18432   | 73728 | 4.0×  | SwiGLU
Mixtral (8x7B) | 4096    | 14336 | 3.5×  | SwiGLU (8 experts, top-2)
✨ Insight · Notice the trend: the original 4× expansion ratio (GPT-2) gave way to ~2.7× with SwiGLU (Llama-2 7B) to keep parameter count constant with three matrices. Larger models like Llama-2 70B and PaLM use higher ratios because they can afford the extra parameters. Mixtral replaces each dense FFN with 8 sparse expert FFNs and routes each token to its top-2 experts — the same active compute as a ~13B dense model, with ~47B parameters in total.

Quick check

Derivation

Mixtral-8x7B has 8 FFN experts per layer, routing each token to top-2. If d_model=4096, d_ff=14336 per expert, and there are 32 layers, what fraction of FFN parameters is active per forward pass?

🔬

FFN as Key-Value Memory (Geva et al. 2021)

Geva et al. (2021) showed that the two-layer FFN can be interpreted as a neural key-value memory. Decompose the operation as FFN(x) = f(xW₁) W₂ = Σ_j f(x · k_j) · v_j, where k_j is row j of W₁ and v_j is column j of W₂:

Each row of W₁ is a key: it fires (produces a large positive pre-activation) when the input matches a learned pattern. The corresponding column of W₂ is the associated value: the information added to the residual stream when that key fires. The full FFN output is a weighted sum of value columns, where the weights are the key scores after the activation function f.

Concretely, with a 4× expansion the FFN stores d_ff = 4·d_model key-value pairs per layer. A Llama-2 7B layer (d_model = 4096, d_ff = 11008) contains 11,008 memory slots per layer, times 32 layers = 352,256 total memories.

Example: single memory slot j

key_j   = W1[j, :]               # pattern detector — fires for specific input patterns
score_j = f(x @ key_j)           # how strongly this key fires (f = the FFN activation)
value_j = W2[:, j]               # what to add to the residual stream when the key fires
output  = Σ_j score_j * value_j  # FFN output = weighted sum of memories

Empirical evidence supports this interpretation: individual FFN neurons activate for interpretable semantic categories (e.g., a neuron that fires for country names, or for present-tense verbs). The value vectors of high-activation neurons often correspond to tokens that the model would predict in that context. This is why knowledge editing techniques (ROME, MEMIT) target FFN columns directly — they are the “storage slots” where factual associations live.
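A toy illustration of where such an edit lands (this is not the ROME algorithm, which solves for a targeted rank-one update; it only shows that the "storage slot" is a single column of W₂ — the slot index and new vector here are made up):

import torch
import torch.nn as nn

d_model, d_ff = 4096, 11008
w2 = nn.Linear(d_ff, d_model, bias=False)  # the down-projection of one FFN layer

j = 1234                                   # hypothetical memory-slot index
new_value = torch.randn(d_model)           # hypothetical new "value" vector to store

with torch.no_grad():
    w2.weight[:, j] = new_value            # w2.weight is (d_model, d_ff); column j is value_j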

✨ Insight · The key-value memory view explains a counterintuitive observation: larger FFN width improves factual recall more than it improves reasoning. More memory slots = more facts stored. Attention, by contrast, improves relational reasoning (who attends to whom): it primarily routes information between positions, while FFN layers are a major locus of factual knowledge storage.

Memory-slot counts here are illustrative calculations for the cited architectures, not a claim that each FFN neuron maps to exactly one clean human-readable fact.

Quick check

Trade-off

You want to change a model’s belief from “The Eiffel Tower is in Rome” to “The Eiffel Tower is in Paris.” Based on the KV-memory view, which weight matrix should you edit, and which entry in that matrix?

🧠

Key Takeaways

What to remember for interviews

  1. The FFN expands each token's representation to a higher-dimensional space (4× by default), applies a nonlinearity, then compresses back — giving each position independent computation after attention routing.
  2. FFN layers hold ~67% of a Transformer's parameters (8d² per block vs 4d² for attention) and dominate compute at short sequence lengths.
  3. The FFN acts as a key-value memory: W1 rows are 'keys' matching input patterns, W2 columns are 'values' storing retrieved knowledge — this is why factual knowledge can be surgically edited in FFN weights (ROME, MEMIT).
  4. SwiGLU (used in Llama, PaLM) replaces GELU with a multiplicative gate that learns to suppress or amplify features; to keep parameter count constant, the hidden dim shrinks from 4× to ~2.67×.
  5. In MoE architectures (Mixtral), dense FFNs are replaced with sparse expert FFNs — each token is routed to top-k experts, activating only a fraction of total parameters per forward pass.
🧠

Recap Quiz

Derivation

A Transformer block has d_model=1024. Attention (Q,K,V,O) has 4d² params; FFN (W1,W2, no bias) has 8d². What fraction of the block’s params does FFN hold?

Trade-off

The original Transformer used a 4x FFN expansion (d_ff = 4 × d_model). Why 4x specifically, and what happens if you use 2x or 8x?

Derivation

SwiGLU uses three matrices (W1, V, W2) instead of two. To keep FFN parameter count equal to a standard 4x FFN (8d²), what hidden dimension d_ff should you use, and what value did Llama-2 7B actually pick (rounded up to a hardware-friendly multiple of 256) for d_model=4096?

Trade-off

Geva et al. (2021) showed FFN layers are key-value memories. ROME (Meng et al. 2022) edits factual knowledge by modifying W2. Why W2 and not W1?

Trade-off

Mixtral-8x7B replaces each FFN with 8 expert FFNs and routes each token to the top-2 experts. Total parameters are ~47B but active parameters per forward pass are ~13B. Why are experts always FFN blocks and not attention blocks?

Trade-off

A ReLU FFN with d_ff=16,384 neurons shows that ~40% of neurons are always zero after training. What does this reveal about how the FFN stores information?


🎯

Interview Questions


Why do FFN layers contain ~67% of a Transformer's parameters while attention has only ~33%? What does each component contribute?

★★☆ · Google, OpenAI

Explain SwiGLU and why it replaced GELU/ReLU in modern Transformers. What is the parameter cost?

★★☆ · Google, Meta

What is the 'FFN as key-value memory' hypothesis (Geva et al.)? What evidence supports it?

★★★ · Anthropic, Google

What is the expansion ratio in FFN and why is 4x standard? How does it change with SwiGLU?

★★☆ · Google, Meta

How does Mixture of Experts (MoE) relate to FFN layers? Why are experts always FFN blocks?

★★☆ · Google, Meta

If you remove the FFN layers entirely from a Transformer, what happens? What about replacing them with linear layers (no activation)?

★★☆ · OpenAI, Anthropic