
Transformer Math

Module 7 · The Transformer

⚙️ FFN & Activations

Where does GPT store the fact that Paris is in France?


Attention gets all the press, but most of a Transformer's parameters — and much of its stored knowledge — live in the Feed-Forward Network. Every token passes through an expand-then-compress bottleneck that stores factual knowledge, applies nonlinear transformations, and acts as a learned key-value memory. Understanding the FFN is understanding where the model actually "knows" things.

🔬

FFN Architecture

What you're seeing: The three-phase FFN pipeline used in every Transformer layer. An input vector of dimension d_model = 4096 is projected up to 4× its size (d_ff = 16384), passed through a nonlinearity (GELU or SwiGLU), then projected back down. The two weight matrices W₁ and W₂ together account for ~67% of all Transformer parameters. The SwiGLU variant adds a third gate matrix V that multiplies element-wise before W₂.

Diagram: Input (d_model = 4096) → W₁ (d × 4d) → GELU/SwiGLU (d_ff = 16384) → W₂ (4d × d) → Output (d_model = 4096). W₁ + W₂ ≈ 134M parameters, ~67% of the block's parameters.
🎮

FFN Expand-Compress Visualization

What you are seeing: An input vector of dimension 4 is expanded to a higher-dimensional hidden space, passed through an activation function, then compressed back to dimension 4. Node brightness shows activation magnitude. Dead (zero) neurons appear gray.

What to try: Switch between SwiGLU, ReLU, and No Activation to see how each shapes the hidden representation. Notice how ReLU zeros out many neurons while SwiGLU keeps smooth gating.

💡

The Intuition

The FFN in each Transformer layer performs the same operation on every token independently: expand to a higher dimension, apply a nonlinearity, then compress back. Why?

Diagram: FFN Expand-Compress Architecture (original Transformer dimensions)

Input (d = 512) → W₁ + σ(·) → Hidden (d = 2048, 4×) → W₂ → Output (d = 512)

Input (512) expands 4× to hidden (2048), activation gates which neurons fire, then contracts back to 512. Same weight matrices for every token.

Diagram: Position-wise — same FFN weights, each token independently

Tokens "The", "cat", "sat" each pass through the same FFN (same W₁, W₂) to produce their own h′. No cross-token interaction — each position is processed independently.
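A minimal sketch of this independence property (illustrative dimensions and a plain GELU FFN, not any specific model): changing one token's vector leaves every other position's output untouched, because the same weights are applied to each position separately.

import torch
import torch.nn as nn

d_model, d_ff = 512, 2048
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

x = torch.randn(1, 3, d_model)            # three tokens: "The", "cat", "sat"
y = ffn(x)

x2 = x.clone()
x2[0, 1] = torch.randn(d_model)           # perturb only the middle token
y2 = ffn(x2)

print(torch.allclose(y[0, 0], y2[0, 0]))  # True — position 0 is unaffected
print(torch.allclose(y[0, 2], y2[0, 2]))  # True — position 2 is unaffected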

FFN as Key-Value Memory (Geva et al., 2021): each row of W₁ acts as a "key" that matches a pattern in the input. Each column of W₂ is the corresponding "value" — the information retrieved when that key matches. The activation function gates which memories fire. This is why you can edit factual knowledge by modifying specific columns of W₂ (the idea behind ROME and MEMIT).

Why 67% of parameters? With d_ff = 4·d_model, each FFN layer has 2 · d · 4d = 8d² parameters. Attention has 4d² (Q, K, V, O projections). That's a 2:1 ratio — FFN dominates.
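A quick back-of-the-envelope check of that ratio (plain arithmetic, using the d_model = 4096 figures from the diagram above):

d = 4096
ffn_params  = 2 * d * (4 * d)   # W₁ (d × 4d) + W₂ (4d × d) = 8d² = 134,217,728 ≈ 134M
attn_params = 4 * d * d         # Q, K, V, O projections = 4d²
print(ffn_params / (ffn_params + attn_params))   # 0.666... → FFN ≈ 67% of the block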

Diagram: Activation functions — GELU suppresses small negatives smoothly; ReLU hard-zeros them

Plot: f(x) for x ∈ [−3, 3], comparing ReLU, GELU, and Swish (the SwiGLU gate).

GELU allows small negative values to pass (smooth taper), unlike ReLU's hard zero. Swish dips slightly below zero for small negative inputs, behaves almost linearly for large positive values, and is used as the gate in SwiGLU.
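A small numerical comparison of the three activations at a few points (a sketch using PyTorch's built-in gelu and silu; silu is the Swish x·σ(x) used as the SwiGLU gate):

import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(F.relu(x))   # [ 0.0000,  0.0000, 0.0000, 0.5000, 2.0000] — hard zero below 0
print(F.gelu(x))   # ≈ [-0.0455, -0.1543, 0.0000, 0.3457, 1.9545] — smooth taper on small negatives
print(F.silu(x))   # ≈ [-0.2384, -0.1888, 0.0000, 0.3112, 1.7616] — Swish, nearly linear for large x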

SwiGLU activation (Shazeer, 2020): Modern transformers replaced ReLU/GELU with SwiGLU, which adds a multiplicative gate: SwiGLU(x) = (Swish(xW₁) ⊙ xV) W₂. The gate learns which features to suppress. To compensate for the third weight matrix, the hidden dimension shrinks from 4·d_model to roughly (8/3)·d_model ≈ 2.67·d_model.

✨ Insight · Think of the FFN as a giant lookup table. Attention figures out WHAT to look at (routing). The FFN then processes each token through its memory bank — retrieving facts, applying transformations, and updating the representation. Attention moves information between tokens; FFN transforms information within each token.

Quick check

Derivation

In the FFN key-value memory view, what determines whether a memory slot j is “retrieved” for a given input token?

Quick Check

Where does a Transformer primarily store factual knowledge like 'Paris is the capital of France'?

📐

Step-by-Step Derivation

Standard FFN (Original Transformer)

Two linear transformations with a nonlinearity in between, applied independently to each token position. The nonlinearity is essential: by the universal approximation theorem, a two-layer MLP can approximate any continuous function given enough hidden units — but only with a non-linear activation.

FFN(x) = max(0, xW₁ + b₁) W₂ + b₂, where W₁ has shape d_model × d_ff, W₂ has shape d_ff × d_model, and typically d_ff = 4 · d_model.
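A direct transcription of this formula into PyTorch (a sketch with the original paper's ReLU and bias terms, and the original Transformer's 512/2048 dimensions):

import torch
import torch.nn as nn

class StandardFFN(nn.Module):
    """FFN(x) = max(0, xW1 + b1) W2 + b2 — applied to each position independently."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # expand: d_model → d_ff
        self.w2 = nn.Linear(d_ff, d_model)   # compress: d_ff → d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(torch.relu(self.w1(x)))

x = torch.randn(2, 10, 512)      # (batch, seq_len, d_model)
print(StandardFFN()(x).shape)    # torch.Size([2, 10, 512]) — same shape as input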

SwiGLU FFN (Llama, PaLM, Mistral)

Adds a gating mechanism with a third weight matrix V. The Swish activation is Swish(x) = x · σ(x), where σ is the sigmoid function:

SwiGLU_FFN(x) = (Swish(xW₁) ⊙ xV) W₂

💡 Tip · Three matrices (W₁, V, W₂) instead of two. To keep parameter count constant, reduce d_ff from 4·d_model to (8/3)·d_model. Llama-2 7B uses d_ff = 11008 for d_model = 4096 (≈ 2.69×).

Parameter Count Comparison

Component      | Matrices   | Parameters
Attention      | Q, K, V, O | 4d²
FFN (standard) | W₁, W₂     | 8d²
FFN (SwiGLU)   | W₁, V, W₂  | 3 × d × (8d/3) = 8d²

In both cases, FFN has 2× the parameters of attention per layer. With biases excluded (modern practice), FFN holds 8d² / (8d² + 4d²) = 2/3 ≈ 67% of each Transformer block's parameters.
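You can sanity-check the SwiGLU row with Llama-2 7B's actual dimensions (d_model = 4096, d_ff = 11008, as in the code and table below): the three-matrix count lands within ~1% of the standard 8d².

d, d_ff = 4096, 11008
standard = 2 * d * (4 * d)    # W1 + W2 at 4× expansion = 8d² = 134,217,728
swiglu   = 3 * d * d_ff       # W1 + V + W2 at d_ff ≈ 8d/3 = 135,266,304
print(swiglu / standard)      # ≈ 1.008 — essentially parameter-neutral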

PyTorch implementation
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU_FFN(nn.Module):
    """FFN with SwiGLU activation (Llama-2, Mistral, PaLM)."""
    def __init__(self, d_model: int, d_ff: int | None = None):
        super().__init__()
        if d_ff is None:
            d_ff = int(2 * (4 * d_model) / 3)
            d_ff = 256 * ((d_ff + 255) // 256)  # round up to the nearest multiple of 256

        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate proj
        self.v  = nn.Linear(d_model, d_ff, bias=False)   # up proj
        self.w2 = nn.Linear(d_ff, d_model, bias=False)   # down proj

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: (Swish(xW1) ⊙ xV) W2
        return self.w2(F.silu(self.w1(x)) * self.v(x))

# Example: Llama-2 7B dimensions
ffn = SwiGLU_FFN(d_model=4096, d_ff=11008)
x = torch.randn(1, 128, 4096)  # (batch, seq_len, d_model)
out = ffn(x)                   # (1, 128, 4096) — same shape as input

Quick check

Derivation

GPT-2 Small has d_model=768 and L=12 layers. Each FFN block has two matrices (W1: 768×3072, W2: 3072×768). What is the total FFN parameter count across all layers?

🔧

Break It — See What Happens

Replace SwiGLU with ReLU
Remove FFN entirely

Empirically, a substantial fraction of ReLU FFN neurons end up permanently dead — their pre-activation is always negative, so they contribute nothing to any forward pass. SwiGLU's gating eliminates this hard zero, keeping all neurons active with learned suppression instead.
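A sketch of how you could measure dead neurons in practice (note: the randomly initialized W₁ below will show essentially none — dead neurons emerge during training; the point is the measurement itself, and the 768/3072 dimensions are just illustrative):

import torch
import torch.nn as nn

d_model, d_ff = 768, 3072
w1 = nn.Linear(d_model, d_ff)              # first matrix of a ReLU FFN

x = torch.randn(10_000, d_model)           # a large batch of token representations
pre_act = w1(x)                            # pre-activations, shape (10000, d_ff)
never_fires = (pre_act <= 0).all(dim=0)    # neurons whose pre-activation is never positive
print(never_fires.float().mean().item())   # fraction of "dead" neurons over this batch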

📊

Real-World Numbers

Model          | d_model | d_ff  | Ratio | Activation
GPT-2 (1.5B)   | 1600    | 6400  | 4.0×  | GELU
Llama-2 (7B)   | 4096    | 11008 | 2.69× | SwiGLU
Llama-2 (70B)  | 8192    | 28672 | 3.5×  | SwiGLU
PaLM (540B)    | 18432   | 73728 | 4.0×  | SwiGLU
Mixtral (8x7B) | 4096    | 14336 | 3.5×  | SwiGLU (8 experts, top-2)
✨ Insight · Notice the trend: the original 4× expansion ratio (GPT-2) gave way to ~2.7× with SwiGLU (Llama-2 7B) to keep parameter count constant with three matrices. Larger models like Llama-2 70B and PaLM use higher ratios because they can afford the extra parameters. Mixtral replaces each dense FFN with 8 sparse expert FFNs and routes each token to its top-2 experts — the same active compute as a ~13B dense model, with ~47B parameters in total.

Quick check

Derivation

Mixtral-8x7B has 8 FFN experts per layer, routing each token to top-2. If d_model=4096, d_ff=14336 per expert, and there are 32 layers, what fraction of FFN parameters is active per forward pass?

🔬

FFN as Key-Value Memory (Geva et al. 2021)

Geva et al. (2021) showed that the two-layer FFN can be interpreted as a neural key-value memory. Decompose the operation as FFN(x) = f(xW₁) W₂ = Σ_j f(x · k_j) · v_j, where k_j is row j of W₁ and v_j is column j of W₂:

Each row of W₁ is a key: it fires (produces a large positive pre-activation) when the input matches a learned pattern. The corresponding column of W₂ is the associated value: the information added to the residual stream when that key fires. The full FFN output is a weighted sum of value columns, where the weights are the key scores after the activation function f.

Concretely, with a 4× expansion the FFN stores d_ff = 4·d_model key-value pairs per layer. A Llama-2 7B layer (d_model = 4096, d_ff = 11008) contains 11,008 memory slots per layer, times 32 layers = 352,256 total memories.

Example: single memory slot j

key_j   = W1[j, :]               # pattern detector — fires for specific input patterns
score_j = f(x @ key_j)           # how strongly this key fires (f = the FFN activation)
value_j = W2[:, j]               # what to add to the residual stream when the key fires
output  = Σ_j score_j * value_j  # FFN output = weighted sum of memories

Empirical evidence supports this interpretation: individual FFN neurons activate for interpretable semantic categories (e.g., a neuron that fires for country names, or for present-tense verbs). The value vectors of high-activation neurons often correspond to tokens that the model would predict in that context. This is why knowledge editing techniques (ROME, MEMIT) target FFN columns directly — they are the “storage slots” where factual associations live.
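A toy illustration of where such an edit lands (this is not the ROME algorithm, which solves for a targeted rank-one update; it only shows that the "storage slot" is a single column of W₂ — the slot index and new vector here are made up):

import torch
import torch.nn as nn

d_model, d_ff = 4096, 11008
w2 = nn.Linear(d_ff, d_model, bias=False)  # the down-projection of one FFN layer

j = 1234                                   # hypothetical memory-slot index
new_value = torch.randn(d_model)           # hypothetical new "value" vector to store

with torch.no_grad():
    w2.weight[:, j] = new_value            # w2.weight is (d_model, d_ff); column j is value_j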

✨ Insight · The key-value memory view explains a counterintuitive observation: larger FFN width improves factual recall more than it improves reasoning. More memory slots = more facts stored. Attention, by contrast, improves relational reasoning (who attends to whom): it primarily routes information between positions, while FFN layers are a major locus of factual knowledge storage.

Memory-slot counts here are illustrative calculations for the cited architectures, not a claim that each FFN neuron maps to exactly one clean human-readable fact.

Quick check

Trade-off

You want to change a model’s belief from “The Eiffel Tower is in Rome” to “The Eiffel Tower is in Paris.” Based on the KV-memory view, which weight matrix should you edit, and which entry in that matrix?

🧠

Key Takeaways

What to remember for interviews

  1. The FFN expands each token's representation to a higher-dimensional space (4× by default), applies a nonlinearity, then compresses back — giving each position independent computation after attention routing.
  2. FFN layers hold ~67% of a Transformer's parameters (8d² per block vs 4d² for attention) and dominate compute at short sequence lengths.
  3. The FFN acts as a key-value memory: W1 rows are 'keys' matching input patterns, W2 columns are 'values' storing retrieved knowledge — this is why factual knowledge can be surgically edited in FFN weights (ROME, MEMIT).
  4. SwiGLU (used in Llama, PaLM) replaces GELU with a multiplicative gate that learns to suppress or amplify features; to keep parameter count constant, the hidden dim shrinks from 4× to ~2.67×.
  5. In MoE architectures (Mixtral), dense FFNs are replaced with sparse expert FFNs — each token is routed to top-k experts, activating only a fraction of total parameters per forward pass.
🧠

Recap Quiz

Derivation

A Transformer block has d_model=1024. Attention (Q,K,V,O) has 4d² params; FFN (W1,W2, no bias) has 8d². What fraction of the block’s params does FFN hold?

Trade-off

The original Transformer used a 4x FFN expansion (d_ff = 4 × d_model). Why 4x specifically, and what happens if you use 2x or 8x?

Derivation

SwiGLU uses three matrices (W1, V, W2) instead of two. To keep FFN parameter count equal to a standard 4x FFN (8d²), what hidden dimension d_ff should you use, and what value did Llama-2 7B actually pick (rounded up to a hardware-friendly multiple of 256) for d_model=4096?

Trade-off

Geva et al. (2021) showed FFN layers are key-value memories. ROME (Meng et al. 2022) edits factual knowledge by modifying W2. Why W2 and not W1?

Trade-off

Mixtral-8x7B replaces each FFN with 8 expert FFNs and routes each token to the top-2 experts. Total parameters are ~47B but active parameters per forward pass are ~13B. Why are experts always FFN blocks and not attention blocks?

Trade-off

A ReLU FFN with d_ff=16,384 neurons shows that ~40% of neurons are always zero after training. What does this reveal about how the FFN stores information?


🎯

Interview Questions


Why do FFN layers contain ~67% of a Transformer's parameters while attention has only ~33%? What does each component contribute?

★★☆ · Google, OpenAI

Explain SwiGLU and why it replaced GELU/ReLU in modern Transformers. What is the parameter cost?

★★☆ · Google, Meta

What is the 'FFN as key-value memory' hypothesis (Geva et al.)? What evidence supports it?

★★★ · Anthropic, Google

What is the expansion ratio in FFN and why is 4x standard? How does it change with SwiGLU?

★★☆ · Google, Meta

How does Mixture of Experts (MoE) relate to FFN layers? Why are experts always FFN blocks?

★★☆ · Google, Meta

If you remove the FFN layers entirely from a Transformer, what happens? What about replacing them with linear layers (no activation)?

★★☆ · OpenAI, Anthropic