📋 Formula Cheat Sheet
5-minute pre-interview review — all key formulas on one page
Core Attention
Scaled Dot-Product Attention
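The standard form; dividing by √d_k keeps the logits at a scale where softmax gradients stay healthy:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$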
Q, K, V Projections
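Linear projections of the input X ∈ ℝ^(n × d_model):

$$Q = XW^Q, \qquad K = XW^K, \qquad V = XW^V$$

with W^Q, W^K ∈ ℝ^(d_model × d_k) and W^V ∈ ℝ^(d_model × d_v).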
Multi-Head Attention
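Each head runs scaled dot-product attention on its own projections:

$$\mathrm{head}_i = \mathrm{Attention}\!\left(XW_i^Q,\; XW_i^K,\; XW_i^V\right)$$

In the common setup, d_k = d_v = d_model / h for h heads.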
Multi-Head Output
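Heads are concatenated and mixed by an output projection W^O ∈ ℝ^(h·d_v × d_model):

$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O$$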
Feed-Forward Network
Standard FFN
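Two linear layers around a pointwise nonlinearity (ReLU shown; GELU is a common substitute), with hidden width typically 4·d_model:

$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$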
SwiGLU (modern)
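The gated variant used in Llama-family models (matrix naming varies by codebase; here W_1 gates, W_3 is the up-projection, W_2 projects back down):

$$\mathrm{FFN}_{\mathrm{SwiGLU}}(x) = \left(\mathrm{Swish}(xW_1) \odot xW_3\right)W_2, \qquad \mathrm{Swish}(z) = z\,\sigma(z)$$

Hidden width is usually shrunk to about (8/3)·d_model so the extra matrix keeps the parameter count comparable to a standard FFN.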
Layer Normalization
LayerNorm
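Normalize over the feature dimension, then rescale and shift with learned γ and β:

$$\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta, \qquad \mu = \frac{1}{d}\sum_{i=1}^{d} x_i, \quad \sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2$$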
RMSNorm
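Drops the mean-centering and the β shift; cheaper to compute and works as well in practice:

$$\mathrm{RMSNorm}(x) = \gamma \odot \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}}$$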
Pre-Norm Residual
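Normalization applied inside the residual branch, the dominant arrangement in modern LLMs:

$$x_{l+1} = x_l + \mathrm{Sublayer}\!\left(\mathrm{Norm}(x_l)\right)$$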
Positional Encoding
Sinusoidal PE
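The fixed (non-learned) encodings from the original Transformer paper, for position pos and dimension-pair index i:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$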
RoPE
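Rotary embeddings rotate each consecutive pair of query/key dimensions by a position-dependent angle, so attention scores depend only on the relative offset m − n:

$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}, \qquad \theta_i = 10000^{-2i/d}$$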
Training & Loss
Cross-Entropy Loss
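Mean negative log-likelihood of each next token under the model:

$$\mathcal{L} = -\frac{1}{N}\sum_{t=1}^{N} \log p_\theta\!\left(x_t \mid x_{<t}\right)$$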
Perplexity
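Exponentiated cross-entropy; read it as the model's effective per-token branching factor (lower is better):

$$\mathrm{PPL} = e^{\mathcal{L}} = \exp\!\left(-\frac{1}{N}\sum_{t=1}^{N} \log p_\theta(x_t \mid x_{<t})\right)$$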
KV Cache
Incremental Attention
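At decode step t only the new query q_t is computed; the cached keys and values K_{1:t}, V_{1:t} are reused, so each step costs O(t) instead of O(t²):

$$\mathrm{Attention}(q_t, K_{1:t}, V_{1:t}) = \mathrm{softmax}\!\left(\frac{q_t K_{1:t}^\top}{\sqrt{d_k}}\right) V_{1:t}$$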
Memory Calculation
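Two tensors (K and V) per layer, per cached token, where n_kv is the number of KV heads, L the sequence length, B the batch size, and b the bytes per element:

$$\mathrm{bytes} = 2 \cdot n_{\mathrm{layers}} \cdot n_{\mathrm{kv}} \cdot d_{\mathrm{head}} \cdot L \cdot B \cdot b$$

A minimal sketch of the same arithmetic in Python, assuming an fp16 cache (2 bytes per element) and Llama-2-7B-like shapes; the shape values are illustrative:

```python
# KV-cache size: 2 tensors (K and V) per layer, per cached token.
def kv_cache_bytes(n_layers: int, n_kv_heads: int, d_head: int,
                   seq_len: int, batch: int = 1,
                   bytes_per_elem: int = 2) -> int:
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem

# Llama-2-7B-like shapes: 32 layers, 32 KV heads, d_head = 128, fp16.
# ~0.5 MB per token, so a 4096-token sequence caches 2 GiB.
print(kv_cache_bytes(32, 32, 128, seq_len=4096) / 2**30)  # -> 2.0
```

With grouped-query attention, n_kv < n_heads, which shrinks the cache proportionally.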
Key Numbers
Params per Transformer layer
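Ignoring biases and norm parameters, with d = d_model and a standard 4d-wide FFN: attention contributes 4d² (W^Q, W^K, W^V, W^O) and the FFN 8d² (a d×4d up-projection plus a 4d×d down-projection):

$$\mathrm{params\ per\ layer} \approx \underbrace{4d^2}_{\mathrm{attention}} + \underbrace{8d^2}_{\mathrm{FFN}} = 12\,d^2$$

Sanity check: 12 · 4096² · 32 layers ≈ 6.4B, roughly the non-embedding parameter count of a 7B-class model.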