Transformer Math

📋 Formula Cheat Sheet

5-minute pre-interview review — all key formulas on one page

Core Attention

Scaled Dot-Product Attention
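With queries Q, keys K, values V, and key dimension d_k, the standard form is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

The 1/√d_k scaling keeps the logit variance roughly constant so the softmax does not saturate.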

Q, K, V Projections
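Queries, keys, and values are linear projections of the input X:

$$Q = X W^{Q}, \qquad K = X W^{K}, \qquad V = X W^{V}$$

with $X \in \mathbb{R}^{n \times d_{\mathrm{model}}}$, $W^{Q}, W^{K} \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}$, and $W^{V} \in \mathbb{R}^{d_{\mathrm{model}} \times d_v}$.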

Multi-Head Attention
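Each of the h heads applies the same attention with its own projection matrices:

$$\mathrm{head}_i = \mathrm{Attention}\bigl(X W_i^{Q},\; X W_i^{K},\; X W_i^{V}\bigr), \qquad i = 1, \dots, h$$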

Multi-Head Output
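The heads are concatenated and mixed by an output projection $W^{O} \in \mathbb{R}^{h d_v \times d_{\mathrm{model}}}$:

$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}$$

Typically $d_k = d_v = d_{\mathrm{model}} / h$, so total compute matches a single full-width head.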

Feed-Forward Network

Standard FFN
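A two-layer MLP applied position-wise, usually with $d_{\mathrm{ff}} = 4\, d_{\mathrm{model}}$ (ReLU shown; GELU is a common substitute):

$$\mathrm{FFN}(x) = \max(0,\; x W_1 + b_1)\, W_2 + b_2$$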

SwiGLU (modern)
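A gated variant used in many recent LLMs; with $\mathrm{Swish}(z) = z\,\sigma(z)$ and $\odot$ the element-wise product,

$$\mathrm{SwiGLU}(x) = \bigl(\mathrm{Swish}(x W_1) \odot x W_3\bigr)\, W_2$$

Because of the third weight matrix, $d_{\mathrm{ff}}$ is often reduced to roughly $\tfrac{8}{3} d_{\mathrm{model}}$ so the parameter count stays comparable to a standard 4× FFN.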

Layer Normalization

LayerNorm
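With mean μ and variance σ² taken over the feature dimension, and learned scale γ and bias β:

$$\mathrm{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \odot \gamma + \beta$$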

RMSNorm
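Drops the mean subtraction and bias, normalizing only by the root-mean-square of the activations:

$$\mathrm{RMSNorm}(x) = \frac{x}{\mathrm{RMS}(x)} \odot \gamma, \qquad \mathrm{RMS}(x) = \sqrt{\frac{1}{d} \sum_{i=1}^{d} x_i^2 + \epsilon}$$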

Pre-Norm Residual
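Modern decoders normalize before each sub-layer and add the result back to the residual stream:

$$x \leftarrow x + \mathrm{Attn}\bigl(\mathrm{Norm}(x)\bigr), \qquad x \leftarrow x + \mathrm{FFN}\bigl(\mathrm{Norm}(x)\bigr)$$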

Positional Encoding

Sinusoidal PE
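For position pos and dimension index i:

$$PE_{(pos,\, 2i)} = \sin\!\left(\frac{pos}{10000^{2i / d_{\mathrm{model}}}}\right), \qquad PE_{(pos,\, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i / d_{\mathrm{model}}}}\right)$$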

RoPE
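Rotary embeddings rotate each (2i, 2i+1) pair of query/key dimensions by an angle proportional to the position m:

$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}, \qquad \theta_i = 10000^{-2i / d_{\mathrm{head}}}$$

so the dot product $q_m^{\top} k_n$ depends only on the relative offset m − n.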

Training & Loss

Cross-Entropy Loss
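The autoregressive language-modeling loss, averaged over N predicted tokens:

$$\mathcal{L} = -\frac{1}{N} \sum_{t=1}^{N} \log p_\theta\!\left(x_t \mid x_{<t}\right)$$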

Perplexity
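The exponential of the mean cross-entropy (in nats):

$$\mathrm{PPL} = \exp(\mathcal{L})$$

A loss of 2.0 nats per token, for example, corresponds to a perplexity of e² ≈ 7.39.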

KV Cache

Incremental Attention
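At decode step t, only the new token's query is computed; it attends to the cached keys and values of all earlier positions:

$$\mathrm{Attention}(q_t, K_{1:t}, V_{1:t}) = \mathrm{softmax}\!\left(\frac{q_t K_{1:t}^{\top}}{\sqrt{d_k}}\right) V_{1:t}$$

so each step costs O(t · d) rather than recomputing the full O(t² · d) attention.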

Memory Calculation
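KV cache size, where the leading factor of 2 covers keys plus values:

$$\mathrm{Memory} = 2 \cdot n_{\mathrm{layers}} \cdot n_{\mathrm{kv\_heads}} \cdot d_{\mathrm{head}} \cdot \mathrm{seq\_len} \cdot \mathrm{batch} \cdot \mathrm{bytes\ per\ element}$$

A minimal sketch of the same arithmetic, with LLaMA-2-7B-like shapes assumed for illustration (32 layers, 32 KV heads, head dim 128, fp16):

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch, bytes_per_elt=2):
    # Factor of 2 = keys + values; bytes_per_elt=2 assumes fp16/bf16 storage.
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elt

# Assumed LLaMA-2-7B-like shapes: 32 layers, 32 KV heads, head dim 128.
print(kv_cache_bytes(32, 32, 128, seq_len=4096, batch=1) / 2**30)  # 2.0 GiB
```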

Key Numbers

Params per Transformer layer
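For a standard layer with $d_{\mathrm{ff}} = 4\, d_{\mathrm{model}}$, ignoring biases and norm parameters:

$$P_{\mathrm{attn}} = 4\, d_{\mathrm{model}}^2, \qquad P_{\mathrm{ffn}} = 2\, d_{\mathrm{model}} d_{\mathrm{ff}} = 8\, d_{\mathrm{model}}^2, \qquad P_{\mathrm{layer}} \approx 12\, d_{\mathrm{model}}^2$$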