Glossary
Auto-extracted from all 84 modules. Use this to look up acronyms or jump to where a concept is first introduced.
LLM-friendly plain text at /glossary-data.txt
A
How big? How much data? Chinchilla has the answer
FineWeb, filtering, dedup — data quality beats data quantity
The full o1 model scored 83.3% on AIME 2024 with consensus@64 (74.4% pass@1) and 94.8% on MATH (vs.
For example, log-softmax = x_i - log(sum(exp(x_j))) is computed as x_i - (max_x + log(sum(exp(x_j - max_x)))) — subtracting the max before exp prevents overflow.
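A minimal NumPy sketch of this max-subtraction trick (names are illustrative, not from the module):
import numpy as np

def log_softmax(x):
    # Subtracting the max before exp() keeps every exponent <= 0, so nothing overflows;
    # the result equals x - log(sum(exp(x))) exactly.
    m = np.max(x)
    return x - (m + np.log(np.sum(np.exp(x - m))))

print(log_softmax(np.array([1000.0, 999.0, 0.0])))  # finite values; naive exp(1000) would overflow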
(2) Online: embed query, ANN search (top-k=20), rerank with cross-encoder (top-5), stuff into LLM context.
Dual state systems: React context for UI, module state for services
DDP, ZeRO, FSDP — training across thousands of GPUs
How big? How much data? Chinchilla has the answer
Quality loss is 1–3% on typical benchmarks (GPTQ, AWQ) but can be higher on chain-of-thought reasoning tasks where precision on intermediate tokens matters.
vLLM wins on: hardware portability (AMD, AWS Neuron), faster model updates (no engine recompile), Python-native API, and operational simplicity.
B
BatchNorm
🧮 MLP & Matmul →He init accounts for batch normalization inserted after each ReLU, which would otherwise restore variance to 1 automatically.
BERT
📊 Embeddings →Jay Alammar — how BERT reuses the Transformer encoder with bidirectional attention and masked language modeling, producing contextual embeddings that outperform static word vectors.
BLEU
🔤 Tokenization →BPE, vocabulary size, and why GPT can't count letters
The complete Transformer pipeline — from raw text to next-token prediction
C
Anthropic's CAI paper (Bai et al.)
Live Web Fetch (conditional): a typical external HTTP round-trip; latency depends on the publisher CDN, so only a qualitative range is given. Triggered only for freshness-critical queries (news, prices, sports scores).
ChatGPT
🤖 Agents & ReAct →Think → Act → Observe — the reasoning loop
A trajectory evaluator scores the full decision chain, not just the final diff.
CLAUDE
📍 Positional Encoding →CLAUDE.md explicitly flags this distinction: “not permutation-invariant.” The sinusoidal PE uses base 10000 in the divisor term.
Judge calibration, regression gating, launch criteria, eval ops
report +2pp accuracy on ImageNet with ViT-G/14 CLIP fine-tunes using model soups vs best single run.
BERT's bidirectional attention makes [CLS] a function of the full sentence; GPT-2's causal mask means each token sees only past context.
Turning token IDs into meaningful vectors
Multi-modal frontier serving — TPU stack, 1M-token attention, safety classifier chain
The sub-agent needs to search for CSS files — that history is noise.
Tiling, IO-awareness, and O(N) memory attention
D
Yannic Kilcher — DALL·E 2 / Diffusion Models Explained (YouTube) Visual walkthrough of latent diffusion, CLIP guidance, and how modern text-to-image systems combine these ideas.
DDP (DistributedDataParallel) replicates the entire model on each GPU and only synchronizes gradients via all-reduce after each backward pass.
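A minimal PyTorch sketch of that pattern, assuming a torchrun launch that sets LOCAL_RANK (model and data here are placeholders):
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                # one process per GPU
rank = int(os.environ["LOCAL_RANK"])
model = torch.nn.Linear(1024, 1024).to(rank)   # full replica on every GPU
model = DDP(model, device_ids=[rank])          # registers all-reduce hooks on backward

opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = model(torch.randn(8, 1024, device=rank)).pow(2).mean()
loss.backward()                                # gradients are all-reduced across ranks here
opt.step()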
Denoising Diffusion Probabilistic Models (Ho et al., 2020) — the foundational DDPM paper that revived diffusion models for image generation.
DeepSeek
🔤 Tokenization →Multilingual models trained on Chinese data (e.g., Qwen, DeepSeek) use tokenizers with more CJK merges to reduce this gap.
RLHF/DPO teaches which responses humans prefer among those the model can already produce.
DRAM
📊 Embeddings →Hurts: the 394M extra params are re-fetched from DRAM on every generated token, creating a bandwidth bottleneck.
F
Amazon Working Backwards — PR/FAQ + 6-Pager. Not an ML piece, but the discipline of writing the customer-facing press release before the design doc is the methodological backbone of this module.
The complete Transformer pipeline — from raw text to next-token prediction
FlashAttention
🎯 Self-Attention →This is why long-context models are challenging, and why Flash Attention (O(n) memory) is crucial.
FLOP
🔤 Tokenization →BPE, vocabulary size, and why GPT can't count letters
MHA uses h× more FLOPS, giving a strictly larger compute budget per forward pass.
Judge calibration, regression gating, launch criteria, eval ops
An order-invariant win rate confirms the signal is real, not an artifact of position bias.
Three isolation layers: (1) Overlay filesystem — reads from the real FS but writes go to a temporary copy-on-write layer.
FSDP (Fully Sharded Data Parallel) shards model parameters, gradients, AND optimizer states across GPUs (equivalent to ZeRO Stage 3).
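For contrast, a minimal FSDP wrapping sketch (same torchrun assumptions as the DDP example above):
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
model = FSDP(torch.nn.Linear(1024, 1024).cuda())  # params, grads, and optimizer state sharded across ranks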
G
But word order matters ('dog bites man' vs 'man bites dog').
Because GDPR and CCPA require a documented retention and deletion policy; a system with no TTL has no retention policy and is a compliance blocker for regulated Enterprise customers.
Both W1 and W2 store keys; the GELU activation retrieves the matching value. GELU is a nonlinear gate, not a retrieval mechanism.
GEMM
🧮 MLP & Matmul →Matrix multiplication, weight initialization, and the universal approximation theorem
The complete Transformer pipeline — from raw text to next-token prediction
GPTQ
📈 Scaling Laws →Quality loss is 1–3% on typical benchmarks (GPTQ, AWQ) but can be higher on chain-of-thought reasoning tasks where precision on intermediate tokens matters.
64 dimensions cannot be loaded into GPU VRAM; 8192 is the architectural maximum for Transformers.
The complete Transformer pipeline — from raw text to next-token prediction
GRPO (Group Relative Policy Optimization) replaces the learned value function with a simple group-based baseline: for each prompt, sample K completions, compute their rewards, and use the group mean as the baseline.
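A small sketch of the group-relative advantage described here (normalizing by the group std is a common extra step, not stated above):
import numpy as np

def grpo_advantages(rewards):
    # rewards: K scalar rewards for K completions of the same prompt.
    # The group mean stands in for a learned value function as the baseline.
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # above-average completions get positive advantage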
H
FlashAttention keeps O(n) HBM memory without changing FLOPs.
Index structures (IVF, HNSW) add overhead on top.
A production pipeline: (1) Crawl — Common Crawl or custom scraper, (2) Extraction — HTML to text with boilerplate removal (trafilatura, resiliparse), (3) Language filtering — fasttext lid model, keep target languages, (4) Quality filtering — perplexity filter (small LM trained on Wikipedia), heur...
Serving stacks, continuous batching, latency vs throughput, vLLM, and API design
HuggingFace
🔤 Tokenization →Tokenizer Summary — Hugging Face docs. HuggingFace reference covering every major algorithm — WordPiece, Unigram, BPE, byte-level BPE — with concrete examples of merge rules.
HumanEval
🎯 SFT & Post-Training Pipeline →Generate N completions per prompt using the SFT model, score each with a reward model or human eval, keep only the best.
I
The complete Transformer pipeline — from raw text to next-token prediction
Function calling, MCP, A2A — connecting agents to the world
Classifier or heuristic scores: perplexity filter (trained on Wikipedia), TF-IDF similarity to reference corpus, text length, symbol-to-word ratio.
ImageNet
📐 Optimizers →opt-sgd-vs-adam-generalization: Training a ResNet-50 on ImageNet, a researcher finds SGD with momentum achieves 77.0% top-1 accuracy while AdamW achieves 76.1%.
InfiniBand
🖥️ Distributed Training →DDP, ZeRO, FSDP — training across thousands of GPUs
InstructGPT
🔧 Fine-tuning & LoRA →SFT Dataset Reference Points: InstructGPT (Ouyang et al., 2022) used ~13,000 human-written SFT demonstrations collected from ~40 labelers, plus RLHF on 33K preference pairs — enough that its 1.3B RLHF model was preferred by human evaluators over the 175B GPT-3 base. Llama-2 Chat used ~27,540 vendor-colle...
The core of Transformers — derive this on a whiteboard
The complete Transformer pipeline — from raw text to next-token prediction
J
Serving stacks, continuous batching, latency vs throughput, vLLM, and API design
K
Turning token IDs into meaningful vectors
It adds a KL penalty against a uniform prior, equivalent to label smoothing in the logit space.
The complete Transformer pipeline — from raw text to next-token prediction
L
LayerNorm
🏗️ High-Level Overview →4) N identical blocks, each: LayerNorm → Multi-Head Attention (with causal mask) → residual add → LayerNorm → FFN (SwiGLU) → residual add.
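A schematic PyTorch forward for one such pre-norm block (attn and ffn stand in for the causal multi-head attention and SwiGLU FFN modules):
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d, attn, ffn):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.attn, self.ffn = attn, ffn

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # LayerNorm -> causal MHA -> residual add
        x = x + self.ffn(self.ln2(x))   # LayerNorm -> FFN (SwiGLU) -> residual add
        return x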
LIMA (Less Is More for Alignment, 2023) showed that a 65B Llama model fine-tuned on just 1,000 high-quality examples performed comparably to models trained on 52K+ examples (Alpaca, Databricks-dolly).
For a typical LLM: ~65% of parameters are in FFN layers (two large weight matrices per block: d→4d and 4d→d, or 8d/3 for SwiGLU).
But for small models the ratio is much higher: a 125M param model with 50K vocab and d=768 already has 38M embedding params = 31% of total.
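A quick check of that embedding fraction (assumes a single tied embedding matrix and GPT-2-small-like dimensions):
vocab, d, total = 50_000, 768, 125e6
emb = vocab * d                                                           # tied input/output embedding matrix
print(f"{emb / 1e6:.1f}M embedding params = {emb / total:.0%} of total")  # ~38.4M, ~31%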
The complete Transformer pipeline — from raw text to next-token prediction
Forgetting during fine-tuning is addressed by fine-tuning methods (LoRA, regularization), not by more pretraining tokens.
Stability comes from controlling magnitude, not centering — empirical results show negligible perplexity difference.
Model Context Protocol Specification The open standard for connecting AI agents to external tools and data sources.
M
Chain-of-thought, o1, DeepSeek-R1, test-time compute
The core of Transformers — derive this on a whiteboard
Chain-of-thought, o1, DeepSeek-R1, test-time compute
Standard MHA: W_Q + W_K + W_V + W_O = 4d² attention params.
The universal approximation theorem (1989) states: a single hidden-layer MLP with a non-polynomial activation can approximate any continuous function on a compact set to arbitrary precision, given enough neurons.
2024 evolution — MLA (Multi-Head Latent Attention): 93.3% KV cache reduction vs MHA
The complete Transformer pipeline — from raw text to next-token prediction
Evaluate on benchmarks (MMLU, HumanEval, MT-Bench) and human preference tests.
Per-head compute cost and why MQA/GQA were invented.
Ground LLM outputs in real data — reduce hallucination
Sub-second: ANN search <50ms, reranking <100ms, LLM streaming for perceived latency.
MSE on probabilities would (1) not properly penalize confident wrong answers (assigning 0.01 vs 0.001 to the correct token matters a lot), (2) not correspond to the log-likelihood objective we actually want to maximize, (3) have worse gradient properties — cross-entropy gradients are proportional...
For each prompt, generate N completions (e.g., N=64) from the current model, score each with a reward model, and keep only the highest-scoring one.
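The same best-of-N loop as a Python sketch (generate and reward_model are placeholders for the actual sampling and scoring calls):
def best_of_n(prompt, generate, reward_model, n=64):
    # Rejection sampling: draw n completions, keep the one the reward model scores highest.
    completions = [generate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in completions]
    return completions[scores.index(max(scores))]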
MTTR dominated by detection, not rollback.
N
(2) Check NCCL topology — one rank may have a bad NIC, cross-switch connection, or PCIe bottleneck.
The likely gap: the upgrade was evaluated on standard relevance benchmarks (NDCG, MRR) where the new model improved, but citation precision is a downstream metric — it depends on what the LLM does with the reranked documents, not just which documents score highest.
(2) Online: embed query, ANN search (top-k=20), rerank with cross-encoder (top-5), stuff into LLM context.
BPE, vocabulary size, and why GPT can't count letters
NotebookLM
🔭 Case: Design Perplexity →/module/dr-case-notebooklm Company Lens — same design, different pushes. Retrieval q...
Browser / mobile app — submits text prompt, polls or subscribes for result. Auth, per-tenant rate limiting, step-budget enforcement by tier. Text-level NSFW / policy check — fast classifier rejects before any GPU time is spent. Three-lane priority queue: Pro → Paid → Free.
NVIDIA
🔙 Backpropagation →Chain rule, computation graphs, and autograd — how gradients flow backward
O
Production models handle this via: (1) dynamic resolution — resize to multiple supported resolutions based on aspect ratio; (2) image tiling — split high-res images into crops, each encoded separately; (3) token compression — use a perceiver resampler or pooling to reduce visual tokens from hundr...
Architecture: (1) Offline pipeline: chunk documents (512 tokens, 50 token overlap), embed with a bi-encoder (e.g., BGE-large), store in vector DB (Pinecone/Qdrant/pgvector).
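A toy sketch of both stages, using brute-force cosine search in place of a real vector DB and an embed() placeholder for the bi-encoder:
import numpy as np

def chunk(tokens, size=512, overlap=50):
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

def build_index(chunks, embed):
    return np.stack([embed(c) for c in chunks])        # offline: one vector per chunk

def retrieve(query, index, chunks, embed, k=20):
    q = embed(query)
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q) + 1e-8)
    return [chunks[i] for i in np.argsort(-sims)[:k]]  # online: top-k chunks by cosine similarity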
The real win from FlashAttention is enabling longer sequences (100K+) that were previously OOM — that is a qualitative capability gain, not just throughput.
(Note: the widely cited ~$5.5M figure is for DeepSeek-V3 pretraining, not R1.) ORM scores only the final answer: correct = 1, wrong = 0.
The GPU spends most time loading model weights and KV cache from HBM, not computing.
P
PagedAttention
🎯 Self-Attention →This is why serving GPT-3-scale models for many concurrent users requires PagedAttention or similar — each concurrent request consumes ~10 GB just for KV cache at max context.
One head isn't enough — each head learns different patterns
React reconciler for terminals, Yoga flexbox, ANSI rendering, and keyboard focus
Sinusoidal PE: no learned parameters, generalizes to unseen lengths in theory (but poorly in practice), deterministic.
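A minimal NumPy sketch of the sinusoidal table (base 10000 in the divisor, as noted above; assumes an even d_model):
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe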
random web data, (5) Deduplication — exact dedup (URL + hash) then fuzzy dedup (MinHash LSH at document level, n-gram dedup at paragraph level), (6) PII removal — regex + NER for emails, phone numbers, (7) Toxicity filtering — classifier, but careful not to remove all discussion of sensitive topi...
A PM says the judge is superhuman.
Step 2: Multi-head attention computes Q, K, V projections, applies causal mask, computes weighted sum.
Align the model to human preferences using either PPO (with a reward model) or DPO (direct preference pairs).
(2) Use pass@k: probability of at least one success in k attempts, estimated as 1 - C(n-c, k)/C(n, k) where c = number of successes in n trials.
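That estimator written out directly:
from math import comb

def pass_at_k(n, c, k):
    # Probability that at least one of k attempts succeeds, given c successes in n samples.
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=200, c=10, k=10))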
PreToolUse
🔧 Tool System →Add: (1) Zod input validation before execution, (2) permission resolution (deny rules → allow rules → PreToolUse hook → user prompt), (3) timeout per tool, (4) error isolation via Promise.allSettled so one failed Grep doesn't abort the rest.
PyTorch
🔤 Tokenization →BPE, vocabulary size, and why GPT can't count letters
Q
For factual QA or structured tasks, greedy can work fine since there's often one correct answer.
Attention has no sense of order — how do we fix that?
BPE, vocabulary size, and why GPT can't count letters
QueryEngine
⚙️ Agent Harness Architecture →Sub-agents need: (1) Context isolation — fresh QueryEngine with clean message history, not polluted by parent's 100K+ token conversation.
R
Small model drafts, big model verifies — parallel generation
With r=16 on a 4096x4096 matrix, you train 131K params instead of 16.8M — a 128x reduction.
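The arithmetic behind that reduction:
d, r = 4096, 16
full = d * d              # 16,777,216 params in the frozen weight matrix
lora = r * (d + d)        # A (r x d) + B (d x r) adapters = 131,072 trainable params
print(full / lora)        # 128.0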
System prompts, few-shot, structured output, tool schemas
ReLU
🧮 MLP & Matmul →However, ReLU kills ~half the neurons (negative outputs become 0), halving the effective variance.
Step 4: Another LayerNorm on the result.
RLHF can help with quality but does not specifically target the train/inference distribution mismatch.
System prompts, few-shot, structured output, tool schemas
3) Positional encoding (sinusoidal or RoPE) adds position information.
S
The key insight: pre-training already teaches the model almost everything — SFT just teaches the format and style of interaction.
Matrix multiplication, weight initialization, and the universal approximation theorem
A synchronous router adds its latency directly to TTFT.
Batch size is 1 per step — weights load from HBM but arithmetic intensity is ~1 FLOP/byte. With batch=1 the GPU loads each weight matrix from HBM to do just a few multiply-adds, leaving arithmetic units starved while memory bandwidth is saturated.
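A back-of-the-envelope check of that ~1 FLOP/byte figure for fp16 decode at batch=1 (the 7B parameter count is an arbitrary example size):
n_params = 7e9
bytes_per_token = n_params * 2            # fp16 weights streamed from HBM once per generated token
flops_per_token = 2 * n_params            # one multiply-add per weight
print(flops_per_token / bytes_per_token)  # = 1.0 FLOP/byte, far below what keeps a GPU compute-bound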
The core of Transformers — derive this on a whiteboard
The core of Transformers — derive this on a whiteboard
When the API returns 'prompt too long', the agent triggers reactive compaction: summarize older messages (keeping recent tool results intact), replace the originals with the summary, and retry.
The paper reports 2–4× higher throughput than prior systems (FasterTransformer, Orca) at the same latency; much larger headline numbers seen elsewhere depend on the specific static-batching baseline being compared against.
Chain-of-thought, o1, DeepSeek-R1, test-time compute
SwiGLU
🏗️ High-Level Overview →The complete Transformer pipeline — from raw text to next-token prediction
T
Softmax is translation-invariant but not scale-invariant.
Classifier or heuristic scores: perplexity filter (trained on Wikipedia), TF-IDF similarity to reference corpus, text length, symbol-to-word ratio.
TFLOP
🎯 Self-Attention →The core of Transformers — derive this on a whiteboard
TFLOPS
📈 Scaling Laws →~125 TFLOPS effective per A100 (1.25e14 FLOPS); A100-hours = C / 1.25e14 / 3600; training cost (@ $2/A100-hour) = A100-hours × 2. How big should the model be?
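The same estimate as a small Python helper (125 TFLOPS effective and $2/A100-hour are the module's assumptions; C = 6·N·D is the standard compute approximation):
def training_cost_usd(total_flops, flops_per_gpu=1.25e14, usd_per_gpu_hour=2.0):
    gpu_hours = total_flops / flops_per_gpu / 3600
    return gpu_hours * usd_per_gpu_hour

print(training_cost_usd(6 * 7e9 * 140e9))  # a 7B model on 140B tokens: roughly $26k under these assumptions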
Strategy: (1) Tensor parallelism (TP=8) within each 8-GPU node — splits individual layers across GPUs connected by fast NVLink (~600 GB/s).
Serving stacks, continuous batching, latency vs throughput, vLLM, and API design
TTFT
📊 Embeddings →Turning token IDs into meaningful vectors
The cache has a TTL (typically 5 min) so low-traffic endpoints may not benefit.
U
Trajectory eval, tool accuracy, and why agent eval is harder
random web data, (5) Deduplication — exact dedup (URL + hash) then fuzzy dedup (MinHash LSH at document level, n-gram dedup at paragraph level), (6) PII removal — regex + NER for emails, phone numbers, (7) Toxicity filtering — classifier, but careful not to remove all discussion of sensitive topi...
Critical for UX — users see output immediately instead of waiting for the full response.
V
VRAM
📊 Embeddings →64 dimensions cannot be loaded into GPU VRAM; 8192 is the architectural maximum for Transformers.
X
Serving stacks, continuous batching, latency vs throughput, vLLM, and API design
Y
Combine fine-tuned models without retraining
YouTube
🔙 Backpropagation →Andrej Karpathy — The spelled-out intro to neural networks (YouTube) 2.5-hour walkthrough building micrograd step by step — every chain rule application shown explicitly.