
Transformer Math

Glossary

Auto-extracted from all 84 modules. Use this to look up acronyms or jump to where a concept is first introduced.

150 terms indexed · 84 modules scanned

LLM-friendly plain text at /glossary-data.txt

A

How big? How much data? Chinchilla has the answer

FineWeb, filtering, dedup — data quality beats data quantity

The full o1 model scored 83.3% on AIME 2024 with consensus@64 (74.4% pass@1) and 94.8% on MATH (vs.

For example, log-softmax = x_i - log(sum(exp(x_j))) is computed as x_i - (max_x + log(sum(exp(x_j - max_x)))) — subtracting the max before exp prevents overflow.
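
A minimal NumPy sketch of the same max-subtraction trick; the function name stable_log_softmax is illustrative, not from any module.

import numpy as np

def stable_log_softmax(x):
    # Subtracting the max before exp() prevents overflow; the result is
    # mathematically identical to x - log(sum(exp(x))).
    m = np.max(x)
    return x - (m + np.log(np.sum(np.exp(x - m))))

logits = np.array([1000.0, 999.0, 0.0])
print(stable_log_softmax(logits))  # finite values; the naive formula would overflow in exp()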

(2) Online: embed query, ANN search (top-k=20), rerank with cross-encoder (top-5), stuff into LLM context.

Dual state systems: React context for UI, module state for services

DDP, ZeRO, FSDP — training across thousands of GPUs

How big? How much data? Chinchilla has the answer

Quality loss is 1–3% on typical benchmarks (GPTQ, AWQ) but can be higher on chain-of-thought reasoning tasks where precision on intermediate tokens matters.

vLLM wins on: hardware portability (AMD, AWS Neuron), faster model updates (no engine recompile), Python-native API, and operational simplicity.

B

He init accounts for batch normalization inserted after each ReLU, which would otherwise restore variance to 1 automatically.

Jay Alammar — how BERT reuses the Transformer encoder with bidirectional attention and masked language modeling, producing contextual embeddings that outperform static word vectors.

BPE, vocabulary size, and why GPT can't count letters

The complete Transformer pipeline — from raw text to next-token prediction

C

Anthropic's CAI paper (Bai et al.

Live Web Fetch (conditional) (typical external HTTP round-trip; depends on publisher CDN; qualitative range). Triggered only for freshness-critical queries (news, prices, sports scores).

Think → Act → Observe — the reasoning loop

A trajectory evaluator scores the full decision chain, not just the final diff.

CLAUDE.md explicitly flags this distinction: “not permutation-invariant.” The sinusoidal PE uses base 10000 in the divisor term.

Judge calibration, regression gating, launch criteria, eval ops

report +2pp accuracy on ImageNet with ViT-G/14 CLIP fine-tunes using model soups vs best single run.

BERT's bidirectional attention makes [CLS] a function of the full sentence; GPT-2's causal mask means each token sees only past context.

Turning token IDs into meaningful vectors

Multi-modal frontier serving — TPU stack, 1M-token attention, safety classifier chain

The sub-agent needs to search for CSS files — that history is noise.

Tiling, IO-awareness, and O(N) memory attention

D

Yannic Kilcher — DALL·E 2 / Diffusion Models Explained (YouTube) Visual walkthrough of latent diffusion, CLIP guidance, and how modern text-to-image systems combine these ideas.

DDP (DistributedDataParallel) replicates the entire model on each GPU and only synchronizes gradients via all-reduce after each backward pass.

Denoising Diffusion Probabilistic Models (Ho et al., 2020) — the foundational DDPM paper that revived diffusion models for image generation.

Multilingual models trained on Chinese data (e.g., Qwen, DeepSeek) use tokenizers with more CJK merges to reduce this gap.

RLHF/DPO teaches which responses humans prefer among those the model can already produce.

Hurts: the 394M extra params are re-fetched from DRAM on every generated token, creating a bandwidth bottleneck.

F

Amazon Working Backwards — PR/FAQ + 6-Pager Not an ML piece, but the discipline of writing the customer-facing press release before the design doc is the methodological backbone of this module.

The complete Transformer pipeline — from raw text to next-token prediction

This is why long-context models are challenging, and why Flash Attention (O(n) memory) is crucial.

BPE, vocabulary size, and why GPT can't count letters

MHA uses h× more FLOPS, giving a strictly larger compute budget per forward pass.

Judge calibration, regression gating, launch criteria, eval ops

An order-invariant win rate confirms the signal is real, not an artifact of position bias.

Three isolation layers: (1) Overlay filesystem — reads from the real FS but writes go to a temporary copy-on-write layer.

FSDP (Fully Sharded Data Parallel) shards model parameters, gradients, AND optimizer states across GPUs (equivalent to ZeRO Stage 3).

G

But word order matters ('dog bites man' vs 'man bites dog').

Because GDPR and CCPA require a documented retention and deletion policy; a system with no TTL has no retention policy and is a compliance blocker for regulated Enterprise customers.

Both W1 and W2 store keys; the GELU activation retrieves the matching value. But GELU is a nonlinear gate, not a retrieval mechanism.

Matrix multiplication, weight initialization, and the universal approximation theorem

The complete Transformer pipeline — from raw text to next-token prediction

Quality loss is 1–3% on typical benchmarks (GPTQ, AWQ) but can be higher on chain-of-thought reasoning tasks where precision on intermediate tokens matters.

64 dimensions cannot be loaded into GPU VRAM; 8192 is the architectural maximum for Transformers.

The complete Transformer pipeline — from raw text to next-token prediction

GRPO (Group Relative Policy Optimization) replaces the learned value function with a simple group-based baseline: for each prompt, sample K completions, compute their rewards, and use the group mean as the baseline.
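
A rough sketch of that group-relative baseline; the normalization by the group standard deviation is a common variant and is an assumption here, not a quote from the module.

import numpy as np

def group_relative_advantages(rewards):
    # rewards: K scalar rewards for K sampled completions of one prompt.
    # The group mean replaces the learned value function as the baseline.
    r = np.asarray(rewards, dtype=float)
    adv = r - r.mean()
    return adv / (r.std() + 1e-8)  # optional scale normalization

print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # positive for above-average completions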

H

FlashAttention keeps O(n) HBM memory without changing FLOPs.

Index structures (IVF, HNSW) add overhead on top.

A production pipeline: (1) Crawl — Common Crawl or custom scraper, (2) Extraction — HTML to text with boilerplate removal (trafilatura, resiliparse), (3) Language filtering — fasttext lid model, keep target languages, (4) Quality filtering — perplexity filter (small LM trained on Wikipedia), heur...

Serving stacks, continuous batching, latency vs throughput, vLLM, and API design

Tokenizer Summary — Hugging Face docs HuggingFace reference covering every major algorithm — WordPiece, Unigram, BPE, byte-level BPE — with concrete examples of merge rules.

Generate N completions per prompt using the SFT model, score each with a reward model or human eval, keep only the best.

I

The complete Transformer pipeline — from raw text to next-token prediction

Function calling, MCP, A2A — connecting agents to the world

Classifier or heuristic scores: perplexity filter (trained on Wikipedia), TF-IDF similarity to reference corpus, text length, symbol-to-word ratio.

opt-sgd-vs-adam-generalization: Training a ResNet-50 on ImageNet, a researcher finds SGD with momentum achieves 77.0% top-1 accuracy while AdamW achieves 76.1%.

DDP, ZeRO, FSDP — training across thousands of GPUs

SFT Dataset Reference Points: InstructGPT (Ouyang et al., 2022) used ~13,000 human-written SFT demonstrations collected from ~40 labelers, and RLHF on 33K preference pairs — enough that its 1.3B RLHF model was preferred by human evaluators over the 175B GPT-3 base. Llama-2 Chat used ~27,540 vendor-colle...

The core of Transformers — derive this on a whiteboard

The complete Transformer pipeline — from raw text to next-token prediction

J

Serving stacks, continuous batching, latency vs throughput, vLLM, and API design

K

Turning token IDs into meaningful vectors

It adds a KL penalty against a uniform prior, equivalent to label smoothing in the logit space.

The complete Transformer pipeline — from raw text to next-token prediction

L

4) N identical blocks, each: LayerNorm → Multi-Head Attention (with causal mask) → residual add → LayerNorm → FFN (SwiGLU) → residual add.
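
A minimal PyTorch sketch of one such pre-norm block; the layer sizes are illustrative, and torch's built-in multi-head attention stands in for a from-scratch implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d, hidden):
        super().__init__()
        self.w_gate = nn.Linear(d, hidden, bias=False)
        self.w_up = nn.Linear(d, hidden, bias=False)
        self.w_down = nn.Linear(hidden, d, bias=False)
    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class PreNormBlock(nn.Module):
    def __init__(self, d=512, n_heads=8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d)
        self.ffn = SwiGLU(d, 8 * d // 3)  # the "8d/3" SwiGLU sizing mentioned elsewhere in the glossary
    def forward(self, x):
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # mask future positions
        h = self.ln1(x)                                                      # LayerNorm
        a, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)      # masked multi-head attention
        x = x + a                                                            # residual add
        x = x + self.ffn(self.ln2(x))                                        # LayerNorm -> FFN -> residual add
        return x

print(PreNormBlock()(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])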

LIMA (Less Is More for Alignment, 2023) showed that a 65B Llama model fine-tuned on just 1,000 high-quality examples performed comparably to models trained on 52K+ examples (Alpaca, Databricks-dolly).

For a typical LLM: ~65% of parameters are in FFN layers (two large weight matrices per block: d→4d and 4d→d, or 8d/3 for SwiGLU).

But for small models the ratio is much higher: a 125M param model with 50K vocab and d=768 already has 38M embedding params = 31% of total.
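
A quick arithmetic check of that 31% figure, assuming a GPT-2-small-like shape (12 blocks, d = 768, 50K vocab, 4d FFN, output head tied or ignored):

d, vocab, layers = 768, 50_000, 12

embed = vocab * d                  # token embedding: ~38.4M
attn_per_block = 4 * d * d         # W_Q, W_K, W_V, W_O
ffn_per_block = 2 * 4 * d * d      # d -> 4d and 4d -> d
blocks = layers * (attn_per_block + ffn_per_block)

total = embed + blocks             # ~123M, close to the quoted 125M
print(f"embedding share: {embed / total:.0%}")  # ~31%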

The complete Transformer pipeline — from raw text to next-token prediction

Forgetting during fine-tuning is addressed by fine-tuning methods (LoRA, regularization), not by more pretraining tokens.

Stability comes from controlling magnitude, not centering — empirical results show negligible perplexity difference.
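
Assuming this entry contrasts RMSNorm with LayerNorm, a minimal NumPy sketch of the difference: RMSNorm controls magnitude only, LayerNorm also centers.

import numpy as np

def rmsnorm(x, gain=1.0, eps=1e-6):
    # No mean subtraction: only the root-mean-square magnitude is controlled.
    return gain * x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def layernorm(x, gain=1.0, bias=0.0, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    return gain * (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps) + bias

x = np.random.randn(4, 8) + 3.0   # shifted activations
print(rmsnorm(x).mean(), layernorm(x).mean())  # RMSNorm keeps the offset, LayerNorm removes it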

Model Context Protocol Specification The open standard for connecting AI agents to external tools and data sources.

M

Chain-of-thought, o1, DeepSeek-R1, test-time compute

The core of Transformers — derive this on a whiteboard

Chain-of-thought, o1, DeepSeek-R1, test-time compute

Standard MHA: W_Q + W_K + W_V + W_O = 4d² attention params.

1989) states: a single hidden-layer MLP with a non-polynomial activation can approximate any continuous function on a compact set to arbitrary precision, given enough neurons.

2024 evolution — MLA (Multi-Head Latent Attention): 93.3% KV cache reduction vs MHA

The complete Transformer pipeline — from raw text to next-token prediction

Evaluate on benchmarks (MMLU, HumanEval, MT-Bench) and human preference tests.

Per-head compute cost and why MQA/GQA were invented.

Ground LLM outputs in real data — reduce hallucination

Sub-second: ANN search <50ms, reranking <100ms, LLM streaming for perceived latency.

MSE on probabilities would (1) not properly penalize confident wrong answers (assigning 0.01 vs 0.001 to the correct token matters a lot), (2) not correspond to the log-likelihood objective we actually want to maximize, (3) have worse gradient properties — cross-entropy gradients are proportional...

For each prompt, generate N completions (e.g., N=64) from the current model, score each with a reward model, and keep only the highest-scoring one.
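
A minimal sketch of that loop; generate and score are placeholders for the current model and the reward model, not real APIs.

import random

def best_of_n(prompt, generate, score, n=64):
    completions = [generate(prompt) for _ in range(n)]       # sample N completions
    return max(completions, key=lambda c: score(prompt, c))  # keep only the highest-scoring one

# Toy usage with stand-in functions:
print(best_of_n("2+2=", lambda p: str(random.randint(0, 9)),
                lambda p, c: -abs(int(c) - 4), n=16))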

MTTR dominated by detection, not rollback.

N

(2) Check NCCL topology — one rank may have a bad NIC, cross-switch connection, or PCIe bottleneck.

The likely gap: the upgrade was evaluated on standard relevance benchmarks (NDCG, MRR) where the new model improved, but citation precision is a downstream metric — it depends on what the LLM does with the reranked documents, not just which documents score highest.

(2) Online: embed query, ANN search (top-k=20), rerank with cross-encoder (top-5), stuff into LLM context.

BPE, vocabulary size, and why GPT can't count letters

Company Lens — same design, different pushes (/module/dr-case-notebooklm)

Browser / mobile app — submits text prompt, polls or subscribes for result; Auth, per-tenant rate limiting, step-budget enforcement by tier; Text-level NSFW / policy check — fast classifier rejects before any GPU time is spent; Three-lane priority queue: Pro → Paid → Free.

Chain rule, computation graphs, and autograd — how gradients flow backward

O

Production models handle this via: (1) dynamic resolution — resize to multiple supported resolutions based on aspect ratio; (2) image tiling — split high-res images into crops, each encoded separately; (3) token compression — use a perceiver resampler or pooling to reduce visual tokens from hundr...

Architecture: (1) Offline pipeline: chunk documents (512 tokens, 50 token overlap), embed with a bi-encoder (e.g., BGE-large), store in vector DB (Pinecone/Qdrant/pgvector).
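
A compressed sketch of both stages, with an in-memory cosine search standing in for the vector DB and ANN index (chunk sizes taken from the entry above; everything else is illustrative):

import numpy as np

def chunk(tokens, size=512, overlap=50):
    # Offline: fixed-size chunks with overlap, ready to be embedded by the bi-encoder.
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

def retrieve(query_vec, index, k=20):
    # Online: brute-force cosine similarity stands in for the ANN search;
    # a real pipeline would then rerank the top-k with a cross-encoder.
    sims = index @ query_vec / (np.linalg.norm(index, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    return np.argsort(-sims)[:k]

index = np.random.randn(1000, 768)   # stand-in for stored chunk embeddings
print(len(chunk(list(range(2000)))), retrieve(np.random.randn(768), index)[:5])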

The real win from FlashAttention is enabling longer sequences (100K+) that were previously OOM — that is a qualitative capability gain, not just throughput.

(Note: the widely cited ~$5.5M figure is for DeepSeek-V3 pretraining, not R1.) ORM scores only the final answer: correct = 1, wrong = 0.

The GPU spends most time loading model weights and KV cache from HBM, not computing.

P

This is why serving GPT-3-scale models for many concurrent users requires PagedAttention or similar — each concurrent request consumes ~10 GB just for KV cache at max context.
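
The ~10 GB figure can be sanity-checked with a short calculation; the shape numbers below assume a GPT-3-175B-like configuration (96 layers, d_model 12288) with fp16 K/V values:

layers, d_model, bytes_fp16, seq_len = 96, 12288, 2, 2048
per_token = 2 * layers * d_model * bytes_fp16        # 2 for K and V
print(f"{per_token / 1e6:.1f} MB per token, "
      f"{per_token * seq_len / 1e9:.1f} GB at {seq_len} tokens")  # ~4.7 MB/token, ~9.7 GB/request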

One head isn't enough — each head learns different patterns

React reconciler for terminals, Yoga flexbox, ANSI rendering, and keyboard focus

Sinusoidal PE: no learned parameters, generalizes to unseen lengths in theory (but poorly in practice), deterministic.
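
A minimal NumPy implementation of that scheme, using the base-10000 divisor noted above; even dimensions get sine, odd dimensions cosine.

import numpy as np

def sinusoidal_pe(seq_len, d, base=10000.0):
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d, 2)[None, :]
    angles = pos / base ** (i / d)      # pos / base^(2i/d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

print(sinusoidal_pe(128, 64).shape)     # (128, 64); any length works, nothing is learned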

random web data, (5) Deduplication — exact dedup (URL + hash) then fuzzy dedup (MinHash LSH at document level, n-gram dedup at paragraph level), (6) PII removal — regex + NER for emails, phone numbers, (7) Toxicity filtering — classifier, but careful not to remove all discussion of sensitive topi...

A PM says the judge is superhuman.

Step 2: Multi-head attention computes Q, K, V projections, applies causal mask, computes weighted sum.

Align the model to human preferences using either PPO (with a reward model) or DPO (direct preference pairs).

(2) Use pass@k: probability of at least one success in k attempts, estimated as 1 - C(n-c, k)/C(n, k) where c = number of successes in n trials.
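
That estimator in code form; n is the number of samples drawn, c how many passed, k the budget being evaluated:

from math import comb

def pass_at_k(n, c, k):
    # Unbiased estimate of P(at least one of k samples passes) = 1 - C(n-c, k) / C(n, k).
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=200, c=13, k=10))  # chance that at least one of 10 attempts succeeds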

Add: (1) Zod input validation before execution, (2) permission resolution (deny rules → allow rules → PreToolUse hook → user prompt), (3) timeout per tool, (4) error isolation via Promise.allSettled so one failed Grep doesn't abort the rest.

BPE, vocabulary size, and why GPT can't count letters

Q

For factual QA or structured tasks, greedy can work fine since there's often one correct answer.

Attention has no sense of order — how do we fix that?

BPE, vocabulary size, and why GPT can't count letters

Sub-agents need: (1) Context isolation — fresh QueryEngine with clean message history, not polluted by parent's 100K+ token conversation.

R

Small model drafts, big model verifies — parallel generation

With r=16 on a 4096x4096 matrix, you train 131K params instead of 16.8M — a 128x reduction.
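
A NumPy sketch of where those numbers come from; the factor names and the alpha scaling follow common LoRA conventions and are assumptions here, not quotes from the module.

import numpy as np

d, r = 4096, 16
W = np.random.randn(d, d)            # frozen pretrained weight: d*d = 16.8M params
A = np.random.randn(d, r) * 0.01     # trainable factors: 2 * d * r = 131K params total
B = np.zeros((r, d))                 # zero init so the adapter starts as a no-op

def lora_forward(x, alpha=32):
    # Effective weight is W + (alpha / r) * A @ B; only A and B are trained.
    return x @ W + (alpha / r) * (x @ A @ B)

print(lora_forward(np.random.randn(2, d)).shape, A.size + B.size)  # (2, 4096) 131072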

System prompts, few-shot, structured output, tool schemas

However, ReLU kills ~half the neurons (negative outputs become 0), halving the effective variance.
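
A quick second-moment check of that factor-of-2 logic (shapes are illustrative): He's 2/fan_in variance restores the post-ReLU signal scale, while plain 1/fan_in lets it shrink by half each layer.

import numpy as np

fan_in, fan_out, n = 1024, 256, 10_000
x = np.random.randn(n, fan_in)                                     # E[x^2] ~ 1
relu = lambda z: np.maximum(z, 0.0)

W_he = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)    # He: Var(w) = 2 / fan_in
W_unit = np.random.randn(fan_in, fan_out) * np.sqrt(1.0 / fan_in)  # no ReLU correction

print(np.mean(relu(x @ W_he) ** 2))    # ~1.0: the 2x compensates for ReLU zeroing half the outputs
print(np.mean(relu(x @ W_unit) ** 2))  # ~0.5: the signal scale halves at every layer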

Step 4: Another LayerNorm on the result.

RLHF can help with quality but does not specifically target the train/inference distribution mismatch.

System prompts, few-shot, structured output, tool schemas

3) Positional encoding (sinusoidal or RoPE) adds position information.

S

The key insight: pre-training already teaches the model almost everything — SFT just teaches the format and style of interaction.

Matrix multiplication, weight initialization, and the universal approximation theorem

A synchronous router adds its latency directly to TTFT.

Batch size is 1 per step — weights load from HBM but arithmetic intensity is ~1 FLOP/byte. With batch=1 the GPU loads each weight matrix from HBM to do just a few multiply-adds, leaving arithmetic units starved while memory bandwidth is saturated.
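
The ~1 FLOP/byte number falls out of a one-line calculation (fp16 weights assumed; the accelerator balance point of roughly 100+ FLOP/byte is an order-of-magnitude estimate):

d = 4096
flops = 2 * d * d             # one matrix-vector multiply: a multiply and an add per weight
bytes_moved = 2 * d * d       # every fp16 weight (2 bytes) streams in from HBM
print(flops / bytes_moved)    # 1.0 FLOP/byte, far below what keeps the tensor cores busy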

The core of Transformers — derive this on a whiteboard

The core of Transformers — derive this on a whiteboard

When the API returns 'prompt too long', the agent triggers reactive compaction: summarize older messages (keeping recent tool results intact), replace the originals with the summary, and retry.

The paper reports 2–4× higher throughput than prior systems (FasterTransformer, Orca) at the same latency; much larger headline numbers seen elsewhere depend on the specific static-batching baseline being compared against.

Chain-of-thought, o1, DeepSeek-R1, test-time compute

The complete Transformer pipeline — from raw text to next-token prediction

T

Softmax is translation-invariant but not scale-invariant.
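
A short demonstration of that property:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

x = np.array([1.0, 2.0, 3.0])
print(softmax(x + 100) - softmax(x))   # ~0: adding a constant changes nothing (translation-invariant)
print(softmax(2 * x) - softmax(x))     # nonzero: scaling sharpens the distribution (not scale-invariant)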

Classifier or heuristic scores: perplexity filter (trained on Wikipedia), TF-IDF similarity to reference corpus, text length, symbol-to-word ratio.

The core of Transformers — derive this on a whiteboard

~125 TFLOPS effective per A100 (1.25e14 FLOPS); A100-hours = C / 1.25e14 / 3600; training cost = A100-hours × $2 per A100-hour.

How big should the model be?

Strategy: (1) Tensor parallelism (TP=8) within each 8-GPU node — splits individual layers across GPUs connected by fast NVLink (~600 GB/s).

Serving stacks, continuous batching, latency vs throughput, vLLM, and API design

Turning token IDs into meaningful vectors

The cache has a TTL (typically 5 min) so low-traffic endpoints may not benefit.

U

Trajectory eval, tool accuracy, and why agent eval is harder

random web data, (5) Deduplication — exact dedup (URL + hash) then fuzzy dedup (MinHash LSH at document level, n-gram dedup at paragraph level), (6) PII removal — regex + NER for emails, phone numbers, (7) Toxicity filtering — classifier, but careful not to remove all discussion of sensitive topi...

Critical for UX — users see output immediately instead of waiting for the full response.

V

64 dimensions cannot be loaded into GPU VRAM; 8192 is the architectural maximum for Transformers.

X

Serving stacks, continuous batching, latency vs throughput, vLLM, and API design

Y

Combine fine-tuned models without retraining

Andrej Karpathy — The spelled-out intro to neural networks (YouTube) 2.5-hour walkthrough building micrograd step by step — every chain rule application shown explicitly.