
Transformer Math

Glossary

Auto-extracted from all 84 modules. Use this to look up acronyms or jump to where a concept is first introduced.

150 terms indexed · 84 modules scanned

LLM-friendly plain text at /glossary-data.txt

A

How big? How much data? Chinchilla has the answer

FineWeb, filtering, dedup — data quality beats data quantity

The full o1 model scored 83.3% on AIME 2024 with consensus@64 (74.4% pass@1) and 94.8% on MATH (vs.

For example, log-softmax = x_i - log(sum(exp(x_j))) is computed as x_i - (max_x + log(sum(exp(x_j - max_x)))) — subtracting the max before exp prevents overflow.
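
A minimal NumPy sketch of the same max-subtraction trick; the function name stable_log_softmax is illustrative, not from any module.

import numpy as np

def stable_log_softmax(x):
    # Subtracting the max before exp() prevents overflow; the result is
    # mathematically identical to x - log(sum(exp(x))).
    m = np.max(x)
    return x - (m + np.log(np.sum(np.exp(x - m))))

logits = np.array([1000.0, 999.0, 0.0])
print(stable_log_softmax(logits))  # finite values; the naive formula would overflow in exp()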

(2) Online: embed query, ANN search (top-k=20), rerank with cross-encoder (top-5), stuff into LLM context.

Dual state systems: React context for UI, module state for services

DDP, ZeRO, FSDP — training across thousands of GPUs

How big? How much data? Chinchilla has the answer

Quality loss is 1–3% on typical benchmarks (GPTQ, AWQ) but can be higher on chain-of-thought reasoning tasks where precision on intermediate tokens matters.

vLLM wins on: hardware portability (AMD, AWS Neuron), faster model updates (no engine recompile), Python-native API, and operational simplicity.

B

He init accounts for batch normalization inserted after each ReLU, which would otherwise restore variance to 1 automatically.

Jay Alammar — how BERT reuses the Transformer encoder with bidirectional attention and masked language modeling, producing contextual embeddings that outperform static word vectors.

BPE, vocabulary size, and why GPT can't count letters

The complete Transformer pipeline — from raw text to next-token prediction

C

Anthropic's CAI paper (Bai et al.

Live Web Fetch (conditional) (typical external HTTP round-trip; depends on publisher CDN; qualitative range). Triggered only for freshness-critical queries (news, prices, sports scores).

Think → Act → Observe — the reasoning loop

A trajectory evaluator scores the full decision chain, not just the final diff.

CLAUDE.md explicitly flags this distinction: “not permutation-invariant.” The sinusoidal PE uses base 10000 in the divisor term.

Judge calibration, regression gating, launch criteria, eval ops

report +2pp accuracy on ImageNet with ViT-G/14 CLIP fine-tunes using model soups vs best single run.

BERT's bidirectional attention makes [CLS] a function of the full sentence; GPT-2's causal mask means each token sees only past context.

Turning token IDs into meaningful vectors

Multi-modal frontier serving — TPU stack, 1M-token attention, safety classifier chain

The sub-agent needs to search for CSS files — that history is noise.

Tiling, IO-awareness, and O(N) memory attention

D

Yannic Kilcher — DALL·E 2 / Diffusion Models Explained (YouTube) Visual walkthrough of latent diffusion, CLIP guidance, and how modern text-to-image systems combine these ideas.

DDP (DistributedDataParallel) replicates the entire model on each GPU and only synchronizes gradients via all-reduce after each backward pass.

Denoising Diffusion Probabilistic Models (Ho et al., 2020) — the foundational DDPM paper that revived diffusion models for image generation.

Multilingual models trained on Chinese data (e.g., Qwen, DeepSeek) use tokenizers with more CJK merges to reduce this gap.

RLHF/DPO teaches which responses humans prefer among those the model can already produce.

Hurts: the 394M extra params are re-fetched from DRAM on every generated token, creating a bandwidth bottleneck.

F

Amazon Working Backwards — PR/FAQ + 6-Pager Not an ML piece, but the discipline of writing the customer-facing press release before the design doc is the methodological backbone of this module.

The complete Transformer pipeline — from raw text to next-token prediction

This is why long-context models are challenging, and why Flash Attention (O(n) memory) is crucial.

BPE, vocabulary size, and why GPT can't count letters

MHA uses h× more FLOPS, giving a strictly larger compute budget per forward pass.

Judge calibration, regression gating, launch criteria, eval ops

An order-invariant win rate confirms the signal is real, not an artifact of position bias.

Three isolation layers: (1) Overlay filesystem — reads from the real FS but writes go to a temporary copy-on-write layer.

FSDP (Fully Sharded Data Parallel) shards model parameters, gradients, AND optimizer states across GPUs (equivalent to ZeRO Stage 3).

G

But word order matters ('dog bites man' vs 'man bites dog').

Because GDPR and CCPA require a documented retention and deletion policy; a system with no TTL has no retention policy and is a compliance blocker for regulated Enterprise customers.

Both W1 and W2 store keys; the GELU activation retrieves the matching value. But GELU is a nonlinear gate, not a retrieval mechanism.

Matrix multiplication, weight initialization, and the universal approximation theorem

The complete Transformer pipeline — from raw text to next-token prediction

Quality loss is 1–3% on typical benchmarks (GPTQ, AWQ) but can be higher on chain-of-thought reasoning tasks where precision on intermediate tokens matters.

64 dimensions cannot be loaded into GPU VRAM; 8192 is the architectural maximum for Transformers.

The complete Transformer pipeline — from raw text to next-token prediction

GRPO (Group Relative Policy Optimization) replaces the learned value function with a simple group-based baseline: for each prompt, sample K completions, compute their rewards, and use the group mean as the baseline.
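
A rough sketch of that group-relative baseline; the normalization by the group standard deviation is a common variant and is an assumption here, not a quote from the module.

import numpy as np

def group_relative_advantages(rewards):
    # rewards: K scalar rewards for K sampled completions of one prompt.
    # The group mean replaces the learned value function as the baseline.
    r = np.asarray(rewards, dtype=float)
    adv = r - r.mean()
    return adv / (r.std() + 1e-8)  # optional scale normalization

print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # positive for above-average completions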

H

FlashAttention keeps O(n) HBM memory without changing FLOPs.

Index structures (IVF, HNSW) add overhead on top.

A production pipeline: (1) Crawl — Common Crawl or custom scraper, (2) Extraction — HTML to text with boilerplate removal (trafilatura, resiliparse), (3) Language filtering — fasttext lid model, keep target languages, (4) Quality filtering — perplexity filter (small LM trained on Wikipedia), heur...

Serving stacks, continuous batching, latency vs throughput, vLLM, and API design

Tokenizer Summary — Hugging Face docs HuggingFace reference covering every major algorithm — WordPiece, Unigram, BPE, byte-level BPE — with concrete examples of merge rules.

Generate N completions per prompt using the SFT model, score each with a reward model or human eval, keep only the best.

I

The complete Transformer pipeline — from raw text to next-token prediction

Function calling, MCP, A2A — connecting agents to the world

Classifier or heuristic scores: perplexity filter (trained on Wikipedia), TF-IDF similarity to reference corpus, text length, symbol-to-word ratio.

opt-sgd-vs-adam-generalization: Training a ResNet-50 on ImageNet, a researcher finds SGD with momentum achieves 77.0% top-1 accuracy while AdamW achieves 76.1%.

DDP, ZeRO, FSDP — training across thousands of GPUs

SFT Dataset Reference Points: InstructGPT (Ouyang et al., 2022) used ~13,000 human-written SFT demonstrations collected from ~40 labelers, and RLHF on 33K preference pairs — enough that its 1.3B RLHF model was preferred by human evaluators over the 175B GPT-3 base. Llama-2 Chat used ~27,540 vendor-colle...

The core of Transformers — derive this on a whiteboard

The complete Transformer pipeline — from raw text to next-token prediction

J

Serving stacks, continuous batching, latency vs throughput, vLLM, and API design

K

Turning token IDs into meaningful vectors

It adds a KL penalty against a uniform prior, equivalent to label smoothing in the logit space.

The complete Transformer pipeline — from raw text to next-token prediction

L

4) N identical blocks, each: LayerNorm → Multi-Head Attention (with causal mask) → residual add → LayerNorm → FFN (SwiGLU) → residual add.
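
A minimal PyTorch sketch of one such pre-norm block; the layer sizes are illustrative, and torch's built-in multi-head attention stands in for a from-scratch implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d, hidden):
        super().__init__()
        self.w_gate = nn.Linear(d, hidden, bias=False)
        self.w_up = nn.Linear(d, hidden, bias=False)
        self.w_down = nn.Linear(hidden, d, bias=False)
    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class PreNormBlock(nn.Module):
    def __init__(self, d=512, n_heads=8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d)
        self.ffn = SwiGLU(d, 8 * d // 3)  # the "8d/3" SwiGLU sizing mentioned elsewhere in the glossary
    def forward(self, x):
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # mask future positions
        h = self.ln1(x)                                                      # LayerNorm
        a, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)      # masked multi-head attention
        x = x + a                                                            # residual add
        x = x + self.ffn(self.ln2(x))                                        # LayerNorm -> FFN -> residual add
        return x

print(PreNormBlock()(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])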

LIMA (Less Is More for Alignment, 2023) showed that a 65B Llama model fine-tuned on just 1,000 high-quality examples performed comparably to models trained on 52K+ examples (Alpaca, Databricks-dolly).

For a typical LLM: ~65% of parameters are in FFN layers (two large weight matrices per block: d→4d and 4d→d, or 8d/3 for SwiGLU).

But for small models the ratio is much higher: a 125M param model with 50K vocab and d=768 already has 38M embedding params = 31% of total.
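
A quick arithmetic check of that 31% figure, assuming a GPT-2-small-like shape (12 blocks, d = 768, 50K vocab, 4d FFN, output head tied or ignored):

d, vocab, layers = 768, 50_000, 12

embed = vocab * d                  # token embedding: ~38.4M
attn_per_block = 4 * d * d         # W_Q, W_K, W_V, W_O
ffn_per_block = 2 * 4 * d * d      # d -> 4d and 4d -> d
blocks = layers * (attn_per_block + ffn_per_block)

total = embed + blocks             # ~123M, close to the quoted 125M
print(f"embedding share: {embed / total:.0%}")  # ~31%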

The complete Transformer pipeline — from raw text to next-token prediction

Forgetting during fine-tuning is addressed by fine-tuning methods (LoRA, regularization), not by more pretraining tokens.

Stability comes from controlling magnitude, not centering — empirical results show negligible perplexity difference.
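
Assuming this entry contrasts RMSNorm with LayerNorm, a minimal NumPy sketch of the difference: RMSNorm controls magnitude only, LayerNorm also centers.

import numpy as np

def rmsnorm(x, gain=1.0, eps=1e-6):
    # No mean subtraction: only the root-mean-square magnitude is controlled.
    return gain * x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def layernorm(x, gain=1.0, bias=0.0, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    return gain * (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps) + bias

x = np.random.randn(4, 8) + 3.0   # shifted activations
print(rmsnorm(x).mean(), layernorm(x).mean())  # RMSNorm keeps the offset, LayerNorm removes it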

Model Context Protocol Specification The open standard for connecting AI agents to external tools and data sources.

M

Chain-of-thought, o1, DeepSeek-R1, test-time compute

The core of Transformers — derive this on a whiteboard

Chain-of-thought, o1, DeepSeek-R1, test-time compute

Standard MHA: W_Q + W_K + W_V + W_O = 4d² attention params.

1989) states: a single hidden-layer MLP with a non-polynomial activation can approximate any continuous function on a compact set to arbitrary precision, given enough neurons.

2024 evolution — MLA (Multi-Head Latent Attention): 93.3% KV cache reduction vs MHA

The complete Transformer pipeline — from raw text to next-token prediction

Evaluate on benchmarks (MMLU, HumanEval, MT-Bench) and human preference tests.

Per-head compute cost and why MQA/GQA were invented.

Ground LLM outputs in real data — reduce hallucination

Sub-second: ANN search <50ms, reranking <100ms, LLM streaming for perceived latency.

MSE on probabilities would (1) not properly penalize confident wrong answers (assigning 0.01 vs 0.001 to the correct token matters a lot), (2) not correspond to the log-likelihood objective we actually want to maximize, (3) have worse gradient properties — cross-entropy gradients are proportional...

For each prompt, generate N completions (e.g., N=64) from the current model, score each with a reward model, and keep only the highest-scoring one.
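
A minimal sketch of that loop; generate and score are placeholders for the current model and the reward model, not real APIs.

import random

def best_of_n(prompt, generate, score, n=64):
    completions = [generate(prompt) for _ in range(n)]       # sample N completions
    return max(completions, key=lambda c: score(prompt, c))  # keep only the highest-scoring one

# Toy usage with stand-in functions:
print(best_of_n("2+2=", lambda p: str(random.randint(0, 9)),
                lambda p, c: -abs(int(c) - 4), n=16))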

MTTR dominated by detection, not rollback.

N

(2) Check NCCL topology — one rank may have a bad NIC, cross-switch connection, or PCIe bottleneck.

The likely gap: the upgrade was evaluated on standard relevance benchmarks (NDCG, MRR) where the new model improved, but citation precision is a downstream metric — it depends on what the LLM does with the reranked documents, not just which documents score highest.

(2) Online: embed query, ANN search (top-k=20), rerank with cross-encoder (top-5), stuff into LLM context.

BPE, vocabulary size, and why GPT can't count letters

Company Lens — same design, different pushes (/module/dr-case-notebooklm)

Browser / mobile app — submits text prompt, polls or subscribes for result; Auth, per-tenant rate limiting, step-budget enforcement by tier; Text-level NSFW / policy check — fast classifier rejects before any GPU time is spent; Three-lane priority queue: Pro → Paid → Free.

Chain rule, computation graphs, and autograd — how gradients flow backward

O

Production models handle this via: (1) dynamic resolution — resize to multiple supported resolutions based on aspect ratio; (2) image tiling — split high-res images into crops, each encoded separately; (3) token compression — use a perceiver resampler or pooling to reduce visual tokens from hundr...

Architecture: (1) Offline pipeline: chunk documents (512 tokens, 50 token overlap), embed with a bi-encoder (e.g., BGE-large), store in vector DB (Pinecone/Qdrant/pgvector).
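
A compressed sketch of both stages, with an in-memory cosine search standing in for the vector DB and ANN index (chunk sizes taken from the entry above; everything else is illustrative):

import numpy as np

def chunk(tokens, size=512, overlap=50):
    # Offline: fixed-size chunks with overlap, ready to be embedded by the bi-encoder.
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

def retrieve(query_vec, index, k=20):
    # Online: brute-force cosine similarity stands in for the ANN search;
    # a real pipeline would then rerank the top-k with a cross-encoder.
    sims = index @ query_vec / (np.linalg.norm(index, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    return np.argsort(-sims)[:k]

index = np.random.randn(1000, 768)   # stand-in for stored chunk embeddings
print(len(chunk(list(range(2000)))), retrieve(np.random.randn(768), index)[:5])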

The real win from FlashAttention is enabling longer sequences (100K+) that were previously OOM — that is a qualitative capability gain, not just throughput.

(Note: the widely cited ~$5.5M figure is for DeepSeek-V3 pretraining, not R1.) ORM scores only the final answer: correct = 1, wrong = 0.

The GPU spends most time loading model weights and KV cache from HBM, not computing.

P

This is why serving GPT-3-scale models for many concurrent users requires PagedAttention or similar — each concurrent request consumes ~10 GB just for KV cache at max context.
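
The ~10 GB figure can be sanity-checked with a short calculation; the shape numbers below assume a GPT-3-175B-like configuration (96 layers, d_model 12288) with fp16 K/V values:

layers, d_model, bytes_fp16, seq_len = 96, 12288, 2, 2048
per_token = 2 * layers * d_model * bytes_fp16        # 2 for K and V
print(f"{per_token / 1e6:.1f} MB per token, "
      f"{per_token * seq_len / 1e9:.1f} GB at {seq_len} tokens")  # ~4.7 MB/token, ~9.7 GB/request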

One head isn't enough — each head learns different patterns

React reconciler for terminals, Yoga flexbox, ANSI rendering, and keyboard focus

Sinusoidal PE: no learned parameters, generalizes to unseen lengths in theory (but poorly in practice), deterministic.
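
A minimal NumPy implementation of that scheme, using the base-10000 divisor noted above; even dimensions get sine, odd dimensions cosine.

import numpy as np

def sinusoidal_pe(seq_len, d, base=10000.0):
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d, 2)[None, :]
    angles = pos / base ** (i / d)      # pos / base^(2i/d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

print(sinusoidal_pe(128, 64).shape)     # (128, 64); any length works, nothing is learned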

random web data, (5) Deduplication — exact dedup (URL + hash) then fuzzy dedup (MinHash LSH at document level, n-gram dedup at paragraph level), (6) PII removal — regex + NER for emails, phone numbers, (7) Toxicity filtering — classifier, but careful not to remove all discussion of sensitive topi...

A PM says the judge is superhuman.

Step 2: Multi-head attention computes Q, K, V projections, applies causal mask, computes weighted sum.

Align the model to human preferences using either PPO (with a reward model) or DPO (direct preference pairs).

(2) Use pass@k: probability of at least one success in k attempts, estimated as 1 - C(n-c, k)/C(n, k) where c = number of successes in n trials.
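
That estimator in code form; n is the number of samples drawn, c how many passed, k the budget being evaluated:

from math import comb

def pass_at_k(n, c, k):
    # Unbiased estimate of P(at least one of k samples passes) = 1 - C(n-c, k) / C(n, k).
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=200, c=13, k=10))  # chance that at least one of 10 attempts succeeds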

Add: (1) Zod input validation before execution, (2) permission resolution (deny rules → allow rules → PreToolUse hook → user prompt), (3) timeout per tool, (4) error isolation via Promise.allSettled so one failed Grep doesn't abort the rest.

BPE, vocabulary size, and why GPT can't count letters

Q

For factual QA or structured tasks, greedy can work fine since there's often one correct answer.

Attention has no sense of order — how do we fix that?

BPE, vocabulary size, and why GPT can't count letters

Sub-agents need: (1) Context isolation — fresh QueryEngine with clean message history, not polluted by parent's 100K+ token conversation.

R

Small model drafts, big model verifies — parallel generation

With r=16 on a 4096x4096 matrix, you train 131K params instead of 16.8M — a 128x reduction.
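
A NumPy sketch of where those numbers come from; the factor names and the alpha scaling follow common LoRA conventions and are assumptions here, not quotes from the module.

import numpy as np

d, r = 4096, 16
W = np.random.randn(d, d)            # frozen pretrained weight: d*d = 16.8M params
A = np.random.randn(d, r) * 0.01     # trainable factors: 2 * d * r = 131K params total
B = np.zeros((r, d))                 # zero init so the adapter starts as a no-op

def lora_forward(x, alpha=32):
    # Effective weight is W + (alpha / r) * A @ B; only A and B are trained.
    return x @ W + (alpha / r) * (x @ A @ B)

print(lora_forward(np.random.randn(2, d)).shape, A.size + B.size)  # (2, 4096) 131072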

System prompts, few-shot, structured output, tool schemas

However, ReLU kills ~half the neurons (negative outputs become 0), halving the effective variance.
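
A quick second-moment check of that factor-of-2 logic (shapes are illustrative): He's 2/fan_in variance restores the post-ReLU signal scale, while plain 1/fan_in lets it shrink by half each layer.

import numpy as np

fan_in, fan_out, n = 1024, 256, 10_000
x = np.random.randn(n, fan_in)                                     # E[x^2] ~ 1
relu = lambda z: np.maximum(z, 0.0)

W_he = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)    # He: Var(w) = 2 / fan_in
W_unit = np.random.randn(fan_in, fan_out) * np.sqrt(1.0 / fan_in)  # no ReLU correction

print(np.mean(relu(x @ W_he) ** 2))    # ~1.0: the 2x compensates for ReLU zeroing half the outputs
print(np.mean(relu(x @ W_unit) ** 2))  # ~0.5: the signal scale halves at every layer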

Step 4: Another LayerNorm on the result.

RLHF can help with quality but does not specifically target the train/inference distribution mismatch.

System prompts, few-shot, structured output, tool schemas

3) Positional encoding (sinusoidal or RoPE) adds position information.

S

The key insight: pre-training already teaches the model almost everything — SFT just teaches the format and style of interaction.

Matrix multiplication, weight initialization, and the universal approximation theorem

A synchronous router adds its latency directly to TTFT.

Batch size is 1 per step — weights load from HBM but arithmetic intensity is ~1 FLOP/byte. With batch=1 the GPU loads each weight matrix from HBM to do just a few multiply-adds, leaving arithmetic units starved while memory bandwidth is saturated.
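
The ~1 FLOP/byte number falls out of a one-line calculation (fp16 weights assumed; the accelerator balance point of roughly 100+ FLOP/byte is an order-of-magnitude estimate):

d = 4096
flops = 2 * d * d             # one matrix-vector multiply: a multiply and an add per weight
bytes_moved = 2 * d * d       # every fp16 weight (2 bytes) streams in from HBM
print(flops / bytes_moved)    # 1.0 FLOP/byte, far below what keeps the tensor cores busy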

The core of Transformers — derive this on a whiteboard

The core of Transformers — derive this on a whiteboard

When the API returns 'prompt too long', the agent triggers reactive compaction: summarize older messages (keeping recent tool results intact), replace the originals with the summary, and retry.

The paper reports 2–4× higher throughput than prior systems (FasterTransformer, Orca) at the same latency; much larger headline numbers seen elsewhere depend on the specific static-batching baseline being compared against.

Chain-of-thought, o1, DeepSeek-R1, test-time compute

The complete Transformer pipeline — from raw text to next-token prediction

T

Softmax is translation-invariant but not scale-invariant.
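
A short demonstration of that property:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

x = np.array([1.0, 2.0, 3.0])
print(softmax(x + 100) - softmax(x))   # ~0: adding a constant changes nothing (translation-invariant)
print(softmax(2 * x) - softmax(x))     # nonzero: scaling sharpens the distribution (not scale-invariant)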

Classifier or heuristic scores: perplexity filter (trained on Wikipedia), TF-IDF similarity to reference corpus, text length, symbol-to-word ratio.

The core of Transformers — derive this on a whiteboard

~125 TFLOPS effective per A100 (1.25e14 FLOPS); A100-hours = C / 1.25e14 / 3600; training cost = A100-hours × $2 per A100-hour.

How big should the model be?

Strategy: (1) Tensor parallelism (TP=8) within each 8-GPU node — splits individual layers across GPUs connected by fast NVLink (~600 GB/s).

Serving stacks, continuous batching, latency vs throughput, vLLM, and API design

Turning token IDs into meaningful vectors

The cache has a TTL (typically 5 min) so low-traffic endpoints may not benefit.

U

Trajectory eval, tool accuracy, and why agent eval is harder

random web data, (5) Deduplication — exact dedup (URL + hash) then fuzzy dedup (MinHash LSH at document level, n-gram dedup at paragraph level), (6) PII removal — regex + NER for emails, phone numbers, (7) Toxicity filtering — classifier, but careful not to remove all discussion of sensitive topi...

Critical for UX — users see output immediately instead of waiting for the full response.

V

64 dimensions cannot be loaded into GPU VRAM; 8192 is the architectural maximum for Transformers.

X

Serving stacks, continuous batching, latency vs throughput, vLLM, and API design

Y

Combine fine-tuned models without retraining

Andrej Karpathy — The spelled-out intro to neural networks (YouTube) 2.5-hour walkthrough building micrograd step by step — every chain rule application shown explicitly.