Transformer Math

📋 Formula Cheat Sheet

5-minute pre-interview review — all key formulas on one page

Core Attention

Scaled Dot-Product Attention
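With queries Q, keys K, values V, and key dimension d_k, the standard form is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

The 1/√d_k scaling keeps the logit variance roughly constant so the softmax does not saturate.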

Q, K, V Projections
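Queries, keys, and values are linear projections of the input X:

$$Q = X W^{Q}, \qquad K = X W^{K}, \qquad V = X W^{V}$$

with $X \in \mathbb{R}^{n \times d_{\mathrm{model}}}$, $W^{Q}, W^{K} \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}$, and $W^{V} \in \mathbb{R}^{d_{\mathrm{model}} \times d_v}$.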

Multi-Head Attention
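Each of the h heads applies the same attention with its own projection matrices:

$$\mathrm{head}_i = \mathrm{Attention}\bigl(X W_i^{Q},\; X W_i^{K},\; X W_i^{V}\bigr), \qquad i = 1, \dots, h$$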

Multi-Head Output
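The heads are concatenated and mixed by an output projection $W^{O} \in \mathbb{R}^{h d_v \times d_{\mathrm{model}}}$:

$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}$$

Typically $d_k = d_v = d_{\mathrm{model}} / h$, so total compute matches a single full-width head.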

Feed-Forward Network

Standard FFN
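A two-layer MLP applied position-wise, usually with $d_{\mathrm{ff}} = 4\, d_{\mathrm{model}}$ (ReLU shown; GELU is a common substitute):

$$\mathrm{FFN}(x) = \max(0,\; x W_1 + b_1)\, W_2 + b_2$$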

SwiGLU (modern)
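A gated variant used in many recent LLMs; with $\mathrm{Swish}(z) = z\,\sigma(z)$ and $\odot$ the element-wise product,

$$\mathrm{SwiGLU}(x) = \bigl(\mathrm{Swish}(x W_1) \odot x W_3\bigr)\, W_2$$

Because of the third weight matrix, $d_{\mathrm{ff}}$ is often reduced to roughly $\tfrac{8}{3} d_{\mathrm{model}}$ so the parameter count stays comparable to a standard 4× FFN.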

Layer Normalization

LayerNorm
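With mean μ and variance σ² taken over the feature dimension, and learned scale γ and bias β:

$$\mathrm{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \odot \gamma + \beta$$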

RMSNorm
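Drops the mean subtraction and bias, normalizing only by the root-mean-square of the activations:

$$\mathrm{RMSNorm}(x) = \frac{x}{\mathrm{RMS}(x)} \odot \gamma, \qquad \mathrm{RMS}(x) = \sqrt{\frac{1}{d} \sum_{i=1}^{d} x_i^2 + \epsilon}$$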

Pre-Norm Residual
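Modern decoders normalize before each sub-layer and add the result back to the residual stream:

$$x \leftarrow x + \mathrm{Attn}\bigl(\mathrm{Norm}(x)\bigr), \qquad x \leftarrow x + \mathrm{FFN}\bigl(\mathrm{Norm}(x)\bigr)$$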

Positional Encoding

Sinusoidal PE
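For position pos and dimension index i:

$$PE_{(pos,\, 2i)} = \sin\!\left(\frac{pos}{10000^{2i / d_{\mathrm{model}}}}\right), \qquad PE_{(pos,\, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i / d_{\mathrm{model}}}}\right)$$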

RoPE
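Rotary embeddings rotate each (2i, 2i+1) pair of query/key dimensions by an angle proportional to the position m:

$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}, \qquad \theta_i = 10000^{-2i / d_{\mathrm{head}}}$$

so the dot product $q_m^{\top} k_n$ depends only on the relative offset m − n.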

Training & Loss

Cross-Entropy Loss
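The autoregressive language-modeling loss, averaged over N predicted tokens:

$$\mathcal{L} = -\frac{1}{N} \sum_{t=1}^{N} \log p_\theta\!\left(x_t \mid x_{<t}\right)$$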

Perplexity
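The exponential of the mean cross-entropy (in nats):

$$\mathrm{PPL} = \exp(\mathcal{L})$$

A loss of 2.0 nats per token, for example, corresponds to a perplexity of e² ≈ 7.39.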

KV Cache

Incremental Attention
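At decode step t, only the new token's query is computed; it attends to the cached keys and values of all earlier positions:

$$\mathrm{Attention}(q_t, K_{1:t}, V_{1:t}) = \mathrm{softmax}\!\left(\frac{q_t K_{1:t}^{\top}}{\sqrt{d_k}}\right) V_{1:t}$$

so each step costs O(t · d) rather than recomputing the full O(t² · d) attention.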

Memory Calculation
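KV cache size, where the leading factor of 2 covers keys plus values:

$$\mathrm{Memory} = 2 \cdot n_{\mathrm{layers}} \cdot n_{\mathrm{kv\_heads}} \cdot d_{\mathrm{head}} \cdot \mathrm{seq\_len} \cdot \mathrm{batch} \cdot \mathrm{bytes\ per\ element}$$

A minimal sketch of the same arithmetic, with LLaMA-2-7B-like shapes assumed for illustration (32 layers, 32 KV heads, head dim 128, fp16):

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch, bytes_per_elt=2):
    # Factor of 2 = keys + values; bytes_per_elt=2 assumes fp16/bf16 storage.
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elt

# Assumed LLaMA-2-7B-like shapes: 32 layers, 32 KV heads, head dim 128.
print(kv_cache_bytes(32, 32, 128, seq_len=4096, batch=1) / 2**30)  # 2.0 GiB
```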

Key Numbers

Params per Transformer layer
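For a standard layer with $d_{\mathrm{ff}} = 4\, d_{\mathrm{model}}$, ignoring biases and norm parameters:

$$P_{\mathrm{attn}} = 4\, d_{\mathrm{model}}^2, \qquad P_{\mathrm{ffn}} = 2\, d_{\mathrm{model}} d_{\mathrm{ff}} = 8\, d_{\mathrm{model}}^2, \qquad P_{\mathrm{layer}} \approx 12\, d_{\mathrm{model}}^2$$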