⚙️ FFN & Activations
Where does GPT store the fact that Paris is in France?
Attention gets all the press, but most of a Transformer's parameters — and its factual knowledge — live in the Feed-Forward Network. Every token passes through an expand-then-compress bottleneck that stores factual knowledge, applies nonlinear transformations, and acts as a learned key-value memory. Understanding the FFN is understanding where the model actually "knows" things.
FFN Architecture
What you're seeing: The three-phase FFN pipeline used in every Transformer layer. An input vector of dimension d_model = 4096 is projected up to a hidden dimension several times its size, passed through a nonlinearity (GELU or SwiGLU), then projected back down. The two weight matrices W₁ and W₂ together account for ~67% of all Transformer parameters. The SwiGLU variant adds a third gate matrix V that multiplies element-wise before W₂.
FFN Expand-Compress Visualization
What you are seeing: An input vector of dimension 4 is expanded to a higher-dimensional hidden space, passed through an activation function, then compressed back to dimension 4. Node brightness shows activation magnitude. Dead (zero) neurons appear gray.
What to try: Switch between SwiGLU, ReLU, and No Activation to see how each shapes the hidden representation. Notice how ReLU zeros out many neurons while SwiGLU keeps smooth gating.
The Intuition
The FFN in each Transformer layer performs the same operation on every token independently: expand to a higher dimension, apply a nonlinearity, then compress back. Why?
Diagram: FFN Expand-Compress Architecture (GPT-2 dimensions)
Input (512) expands 4× to hidden (2048), activation gates which neurons fire, then contracts back to 512. Same weight matrices for every token.
Diagram: Position-wise — same FFN weights, each token independently
FFN as Key-Value Memory (Geva et al., 2021): each row of W₁ acts as a "key" that matches a pattern in the input. Each column of W₂ is the corresponding "value" — the information retrieved when that key matches. The activation function gates which memories fire. This is why you can edit factual knowledge by modifying specific columns of W₂ (the approach behind ROME and MEMIT).
Why 67% of parameters? With d_ff = 4·d_model, each FFN layer has 2·d·(4d) = 8d² parameters. Attention has 4d² (Q, K, V, O projections). That's a 2:1 ratio — FFN dominates.
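The 2:1 ratio can be checked directly with per-layer counts (biases excluded):

```python
# Per-layer parameter count, biases excluded — a quick arithmetic check.
d = 4096                       # d_model
d_ff = 4 * d                   # standard 4x expansion

attn = 4 * d * d               # Q, K, V, O projections: 4 * d^2
ffn = d * d_ff + d_ff * d      # W1 (d x 4d) + W2 (4d x d) = 8 * d^2

print(f"attention: {attn:,}")          # 67,108,864
print(f"FFN:       {ffn:,}")           # 134,217,728
print(f"ratio:     {ffn / attn:.1f}x") # 2.0x
```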
Diagram: Activation functions — GELU suppresses small negatives smoothly; ReLU hard-zeros them
GELU allows small negative values to pass (smooth taper), unlike ReLU's hard zero. Swish grows approximately linearly for large positive values (approaching x from below) and is used as the gate in SwiGLU.
SwiGLU activation (Shazeer, 2020): Modern transformers replaced ReLU/GELU with SwiGLU, which adds a multiplicative gate: SwiGLU(x) = (Swish(xW₁) ⊙ xV) W₂. The gate learns which features to suppress. To compensate for the third weight matrix, the hidden dimension shrinks from 4× to ~2.67× (8/3 · d_model).
Quick check
In the FFN key-value memory view, what determines whether a memory slot j is “retrieved” for a given input token?
Where does a Transformer primarily store factual knowledge like 'Paris is the capital of France'?
Step-by-Step Derivation
Standard FFN (Original Transformer)
Two linear transformations with a nonlinearity in between, applied independently to each token position:

FFN(x) = f(xW₁ + b₁)W₂ + b₂

The nonlinearity f is essential: by the universal approximation theorem, a two-layer MLP can approximate any continuous function given enough hidden units — but only with a non-linear activation.

Where W₁ ∈ ℝ^(d_model × d_ff), W₂ ∈ ℝ^(d_ff × d_model), and d_ff = 4·d_model typically.
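A minimal PyTorch sketch of the standard FFN (class name is illustrative; dimensions match the GPT-2 diagram above, 512 → 2048 → 512):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GELU_FFN(nn.Module):
    """Standard position-wise FFN: expand 4x, apply GELU, compress back."""
    def __init__(self, d_model: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, 4 * d_model)   # 512 -> 2048
        self.w2 = nn.Linear(4 * d_model, d_model)   # 2048 -> 512

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.gelu(self.w1(x)))

ffn = GELU_FFN(512)
x = torch.randn(2, 16, 512)   # (batch, seq_len, d_model)
out = ffn(x)
print(out.shape)              # torch.Size([2, 16, 512]) — same shape out
```

Because the same weights act on the last dimension only, each token position is processed independently — the "position-wise" property.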
SwiGLU FFN (Llama, PaLM, Mistral)
Adds a gating mechanism with a third weight matrix V. The Swish activation is Swish(x) = x · σ(x), where σ is the sigmoid function:

FFN_SwiGLU(x) = (Swish(xW₁) ⊙ xV) W₂
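Swish with β = 1 ships in PyTorch as SiLU; a quick sanity check of the definition:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-4.0, 4.0, 9)
swish = x * torch.sigmoid(x)              # Swish(x) = x * sigmoid(x)
assert torch.allclose(swish, F.silu(x))   # PyTorch provides Swish (beta=1) as SiLU
print(swish)  # small negatives pass through (Swish(-1) ≈ -0.27), unlike ReLU's hard zero
```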
Parameter Count Comparison
| Component | Matrices | Parameters |
|---|---|---|
| Attention | Q, K, V, O | 4d² |
| FFN (standard) | W₁, W₂ | 8d² |
| FFN (SwiGLU) | W₁, V, W₂ | 3 × d × (8d/3) = 8d² |
In both cases, FFN has 2× the parameters of attention per layer. With biases excluded (modern practice), FFN holds ~67% (two-thirds) of each Transformer block's parameters.
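The table's arithmetic, checked in plain Python — the 8d/3 shrink plus the round-up to a hardware-friendly multiple of 256, which reproduces Llama-2 7B's width:

```python
d = 4096

# Standard FFN: W1 (d x 4d) + W2 (4d x d) = 8 d^2
standard = 2 * d * (4 * d)              # 134,217,728

# SwiGLU: shrink the hidden dim to ~8d/3, then round up to a multiple of 256
d_ff = int(2 * (4 * d) / 3)             # 10922
d_ff = 256 * ((d_ff + 255) // 256)      # 11008 — Llama-2 7B's actual width
swiglu = 3 * d * d_ff                   # W1, V, W2: 135,266,304

print(swiglu / standard)                # ~1.008 — essentially the same budget
```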
PyTorch implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
class SwiGLU_FFN(nn.Module):
"""FFN with SwiGLU activation (Llama-2, Mistral, PaLM)."""
def __init__(self, d_model: int, d_ff: int | None = None):
super().__init__()
if d_ff is None:
d_ff = int(2 * (4 * d_model) / 3)
            d_ff = 256 * ((d_ff + 255) // 256)  # round up to a multiple of 256
self.w1 = nn.Linear(d_model, d_ff, bias=False) # gate proj
self.v = nn.Linear(d_model, d_ff, bias=False) # up proj
self.w2 = nn.Linear(d_ff, d_model, bias=False) # down proj
def forward(self, x: torch.Tensor) -> torch.Tensor:
# SwiGLU: (Swish(xW1) ⊙ xV) W2
return self.w2(F.silu(self.w1(x)) * self.v(x))
# Example: Llama-2 7B dimensions
ffn = SwiGLU_FFN(d_model=4096, d_ff=11008)
x = torch.randn(1, 128, 4096) # (batch, seq_len, d_model)
out = ffn(x)  # (1, 128, 4096) — same shape as input

Quick check
GPT-2 Small has d_model=768 and L=12 layers. Each FFN block has two matrices (W1: 768×3072, W2: 3072×768). What is the total FFN parameter count across all layers?
Break It — See What Happens
Empirically, a sizable fraction of ReLU FFN neurons in trained networks end up "dead" — their pre-activation is always negative, so they contribute nothing to any forward pass. SwiGLU's gating eliminates this hard zero, keeping all neurons active with learned suppression instead.
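A sketch of how dead units can be measured: count hidden neurons whose pre-activation never goes positive over a stream of inputs. At random initialization virtually none are dead — dead units emerge during training — so a real audit would run this over a trained checkpoint's activations:

```python
import torch

torch.manual_seed(0)
d, d_ff = 64, 256
W1 = torch.randn(d, d_ff) / d ** 0.5        # toy FFN up-projection (random init)
x = torch.randn(10_000, d)                  # stand-in for many token activations
pre = x @ W1                                # pre-activations, shape (10000, 256)
dead = (pre > 0).sum(dim=0) == 0            # neurons that never fire on this stream
print(f"dead: {int(dead.sum())} / {d_ff}")  # ~0 at init; dead units appear in training
```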
Real-World Numbers
| Model | d_model | d_ff | Ratio | Activation |
|---|---|---|---|---|
| GPT-2 XL (1.5B) | 1600 | 6400 | 4.0x | GELU |
| Llama-2 (7B) | 4096 | 11008 | 2.69x | SwiGLU |
| Llama-2 (70B) | 8192 | 28672 | 3.5x | SwiGLU |
| PaLM (540B) | 18432 | 73728 | 4.0x | SwiGLU |
| Mixtral (8x7B) | 4096 | 14336 | 3.5x | SwiGLU (8 experts, top-2) |
Quick check
Mixtral-8x7B has 8 FFN experts per layer, routing each token to top-2. If d_model=4096, d_ff=14336 per expert, and there are 32 layers, what fraction of FFN parameters is active per forward pass?
FFN as Key-Value Memory (Geva et al. 2021)
Geva et al. (2021) showed that the two-layer FFN can be interpreted as a key-value memory. Decompose the operation as a sum over d_ff memory slots: FFN(x) = Σⱼ f(x · kⱼ) · vⱼ, where kⱼ is the j-th key and vⱼ the j-th value.
Each row of W₁ is a key: it fires (produces a large positive pre-activation) when the input matches a learned pattern. The corresponding column of W₂ is the associated value: the information added to the residual stream when that key fires. The full FFN output is a weighted sum of value columns, where the weights are the key scores after the activation function.
Concretely, if d_model = 512 and d_ff = 2048, then with 4× expansion the FFN stores 2048 key-value pairs. A Llama-2 7B layer (d_model = 4096, d_ff = 11008) contains 11,008 memory slots per layer, times 32 layers = 352,256 total memories.
Example: single memory slot j
key_j = W1[j, :] # pattern detector — fires for specific input patterns
score_j = f(x @ key_j)  # how strongly this key fires (f = the FFN's activation)
value_j = W2[:, j] # what to add to residual stream when key fires
output = Σ_j score_j * value_j # weighted sum of memories
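The slot pseudocode above can be verified numerically. Following its convention (W1 stored as d_ff × d_model, with ReLU as a stand-in activation), the slot-by-slot sum reproduces the matrix form exactly:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, d_ff = 8, 32
W1 = torch.randn(d_ff, d)   # rows are keys
W2 = torch.randn(d, d_ff)   # columns are values
x = torch.randn(d)

# Matrix form: the usual FFN forward pass
matrix_form = W2 @ F.relu(W1 @ x)

# Slot form: weighted sum of value columns, one per memory
slot_form = sum(F.relu(W1[j] @ x) * W2[:, j] for j in range(d_ff))

assert torch.allclose(matrix_form, slot_form, atol=1e-4)
```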
Empirical evidence supports this interpretation: individual FFN neurons activate for interpretable semantic categories (e.g., a neuron that fires for country names, or for present-tense verbs). The value vectors of high-activation neurons often correspond to tokens that the model would predict in that context. This is why knowledge editing techniques (ROME, MEMIT) target FFN columns directly — they are the “storage slots” where factual associations live.
Memory-slot counts here are illustrative calculations for the cited architectures, not a claim that each FFN neuron maps to exactly one clean human-readable fact.
Quick check
You want to change a model’s belief from “The Eiffel Tower is in Rome” to “The Eiffel Tower is in Paris.” Based on the KV-memory view, which weight matrix should you edit, and which entry in that matrix?
Key Takeaways
What to remember for interviews
1. The FFN expands each token's representation to a higher-dimensional space (4× by default), applies a nonlinearity, then compresses back — giving each position independent computation after attention routing.
2. FFN layers hold ~67% of a Transformer's parameters (8d² per block vs 4d² for attention) and dominate compute at short sequence lengths.
3. The FFN acts as a key-value memory: W1 rows are 'keys' matching input patterns, W2 columns are 'values' storing retrieved knowledge — this is why factual knowledge can be surgically edited in FFN weights (ROME, MEMIT).
4. SwiGLU (used in Llama, PaLM) replaces GELU with a multiplicative gate that learns to suppress or amplify features; to keep parameter count constant, the hidden dim shrinks from 4× to ~2.67×.
5. In MoE architectures (Mixtral), dense FFNs are replaced with sparse expert FFNs — each token is routed to top-k experts, activating only a fraction of total parameters per forward pass.
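The top-k routing idea can be sketched in a few lines (names are illustrative; a real MoE router like Mixtral's adds load-balancing losses and per-expert capacity limits):

```python
import torch

torch.manual_seed(0)
n_experts, top_k, d_model = 8, 2, 16
router = torch.nn.Linear(d_model, n_experts, bias=False)  # learned gating network

x = torch.randn(4, d_model)                  # 4 tokens
scores, idx = router(x).topk(top_k, dim=-1)  # pick top-2 experts per token
weights = torch.softmax(scores, dim=-1)      # renormalize over the chosen experts
print(idx)                                   # expert indices, shape (4, 2)
print(weights.sum(dim=-1))                   # mixing weights sum to 1 per token
```

Each token's FFN output is then the weight-mixed output of its two chosen experts; the other six experts' parameters sit idle for that token.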
Recap Quiz
A Transformer block has d_model=1024. Attention (Q,K,V,O) has 4d² params; FFN (W1,W2, no bias) has 8d². What fraction of the block’s params does FFN hold?
The original Transformer used a 4x FFN expansion (d_ff = 4 × d_model). Why 4x specifically, and what happens if you use 2x or 8x?
SwiGLU uses three matrices (W1, V, W2) instead of two. To keep FFN parameter count equal to a standard 4x FFN (8d²), what hidden dimension d_ff should you use, and what value did Llama-2 7B actually pick (rounded up to a hardware-friendly multiple of 256) for d_model=4096?
Geva et al. (2021) showed FFN layers are key-value memories. ROME (Meng et al. 2022) edits factual knowledge by modifying W2. Why W2 and not W1?
Mixtral-8x7B replaces each FFN with 8 expert FFNs and routes each token to the top-2 experts. Total parameters are ~47B but active parameters per forward pass are ~13B. Why are experts always FFN blocks and not attention blocks?
A ReLU FFN with d_ff=16,384 neurons shows that ~40% of neurons are always zero after training. What does this reveal about how the FFN stores information?
Further Reading
- GLU Variants Improve Transformer — Shazeer 2020 — shows SwiGLU and GeGLU outperform standard ReLU FFNs. SwiGLU is now the default in LLaMA and PaLM.
- Transformer Feed-Forward Layers Are Key-Value Memories — Geva et al. 2021 — interprets FFN layers as implicit key-value stores where keys match input patterns and values store output distributions.
- LLM Visualization — Brendan Bycroft — 3D walkthrough showing the FFN block's two linear layers and activation function in a real GPT model.
- The Illustrated Transformer — Jay Alammar — Visual walkthrough of the FFN sublayer and how it complements the attention block within each Transformer layer.
- Mixture of Experts Explained (Hugging Face blog) — How Mixtral replaces dense FFNs with sparse MoE layers — concrete explanation of routing, expert selection, and capacity factors.
- 3Blue1Brown — Transformers (What they are and what they do) — Visual breakdown of the MLP (FFN) layers in Transformers — intuition for what the expansion and contraction matrices compute.
- 3Blue1Brown — How might LLMs store facts (Chapter 7) — Grant Sanderson 2024 — visual explanation of how MLP layers act as key-value memories storing factual associations, with the connection to superposition.
Interview Questions
★★☆ Why do FFN layers contain ~67% of a Transformer's parameters while attention has only ~33%? What does each component contribute?
★★☆ Explain SwiGLU and why it replaced GELU/ReLU in modern Transformers. What is the parameter cost?
★★★ What is the 'FFN as key-value memory' hypothesis (Geva et al.)? What evidence supports it?
★★☆ What is the expansion ratio in FFN and why is 4x standard? How does it change with SwiGLU?
★★☆ How does Mixture of Experts (MoE) relate to FFN layers? Why are experts always FFN blocks?
★★☆ If you remove the FFN layers entirely from a Transformer, what happens? What about replacing them with linear layers (no activation)?