AGI: How big? How much data? Chinchilla has the answer
AI: FineWeb, filtering, dedup — data quality beats data quantity
AIME: The full o1 model scored 83.3% on AIME 2024 with consensus@64 (74.4% pass@1) and 94.8% on MATH (vs.
AMP: For example, log-softmax = x_i - log(sum(exp(x_j))) is computed as x_i - (max_x + log(sum(exp(x_j - max_x)))) — subtracting the max before exp prevents overflow. (Sketch below.)
ANN: (2) Online: embed query, ANN search (top-k=20), rerank with cross-encoder (top-5), stuff into LLM context.
ANSI: Dual state systems: React context for UI, module state for services
API: DDP, ZeRO, FSDP — training across thousands of GPUs
ARC: How big? How much data? Chinchilla has the answer
AWQ: Quality loss is 1–3% on typical benchmarks (GPTQ, AWQ) but can be higher on chain-of-thought reasoning tasks where precision on intermediate tokens matters.
AWS: vLLM wins on: hardware portability (AMD, AWS Neuron), faster model updates (no engine recompile), Python-native API, and operational simplicity.
BatchNorm: He init accounts for batch normalization inserted after each ReLU, which would otherwise restore variance to 1 automatically.
BERT: Jay Alammar — how BERT reuses the Transformer encoder with bidirectional attention and masked language modeling, producing contextual embeddings that outperform static word vectors.
BLEU: BPE, vocabulary size, and why GPT can't count letters
BPE: The complete Transformer pipeline — from raw text to next-token prediction
CAI: Anthropic's CAI paper (Bai et al.
CDN: Live Web Fetch (conditional): typical external HTTP round-trip; depends on publisher CDN; qualitative range. Triggered only for freshness-critical queries (news, prices, sports scores).
ChatGPT: Think → Act → Observe — the reasoning loop
CI: A trajectory evaluator scores the full decision chain, not just the final diff.
CLAUDE: CLAUDE.md explicitly flags this distinction: “not permutation-invariant.” The sinusoidal PE uses base 10000 in the divisor term.
CLI: Judge calibration, regression gating, launch criteria, eval ops
CLIP: Model soups report +2pp accuracy on ImageNet with ViT-G/14 CLIP fine-tunes vs the best single run.
CLS: BERT's bidirectional attention makes [CLS] a function of the full sentence; GPT-2's causal mask means each token sees only past context.
CPU: Turning token IDs into meaningful vectors
CSAM: Multi-modal frontier serving — TPU stack, 1M-token attention, safety classifier chain
CSS: The sub-agent needs to search for CSS files — that history is noise.
CUDA: Tiling, IO-awareness, and O(N) memory attention
DALL: Yannic Kilcher — DALL·E 2 / Diffusion Models Explained (YouTube): visual walkthrough of latent diffusion, CLIP guidance, and how modern text-to-image systems combine these ideas.
DDP: DDP (DistributedDataParallel) replicates the entire model on each GPU and only synchronizes gradients via all-reduce after each backward pass.
DDPM: Denoising Diffusion Probabilistic Models (Ho et al., 2020) — the foundational DDPM paper that revived diffusion models for image generation.
DeepSeek: Multilingual models trained on Chinese data (e.g., Qwen, DeepSeek) use tokenizers with more CJK merges to reduce this gap.
DPO: RLHF/DPO teaches which responses humans prefer among those the model can already produce.
DRAM: Hurts: the 394M extra params are re-fetched from DRAM on every generated token, creating a bandwidth bottleneck.
FAQ: Amazon Working Backwards — PR/FAQ + 6-Pager. Not an ML piece, but the discipline of writing the customer-facing press release before the design doc is the methodological backbone of this module.
FFN: The complete Transformer pipeline — from raw text to next-token prediction
FlashAttention: This is why long-context models are challenging, and why FlashAttention (O(n) memory) is crucial.
FLOP: BPE, vocabulary size, and why GPT can't count letters
FLOPS: MHA uses h× more FLOPS, giving a strictly larger compute budget per forward pass.
FN: Judge calibration, regression gating, launch criteria, eval ops
FP: An order-invariant win rate confirms the signal is real, not an artifact of position bias.
FS: Three isolation layers: (1) Overlay filesystem — reads from the real FS but writes go to a temporary copy-on-write layer.
FSDP: FSDP (Fully Sharded Data Parallel) shards model parameters, gradients, AND optimizer states across GPUs (equivalent to ZeRO Stage 3).
GB: But word order matters ('dog bites man' vs 'man bites dog').
GDPR: Because GDPR and CCPA require a documented retention and deletion policy; a system with no TTL has no retention policy and is a compliance blocker for regulated Enterprise customers.
GELU: Both W1 and W2 store keys; the GELU activation retrieves the matching value. GELU is a nonlinear gate, not a retrieval mechanism.
GEMM: Matrix multiplication, weight initialization, and the universal approximation theorem
GPT: The complete Transformer pipeline — from raw text to next-token prediction
GPTQ: Quality loss is 1–3% on typical benchmarks (GPTQ, AWQ) but can be higher on chain-of-thought reasoning tasks where precision on intermediate tokens matters.
GPU: 64 dimensions cannot be loaded into GPU VRAM; 8192 is the architectural maximum for Transformers.
GQA: The complete Transformer pipeline — from raw text to next-token prediction
GRPO: GRPO (Group Relative Policy Optimization) replaces the learned value function with a simple group-based baseline: for each prompt, sample K completions, compute their rewards, and use the group mean as the baseline. (Sketch below.)
HBM: FlashAttention keeps O(n) HBM memory without changing FLOPs.
HNSW: Index structures (IVF, HNSW) add overhead on top.
HTML: A production pipeline: (1) Crawl — Common Crawl or custom scraper, (2) Extraction — HTML to text with boilerplate removal (trafilatura, resiliparse), (3) Language filtering — fasttext lid model, keep target languages, (4) Quality filtering — perplexity filter (small LM trained on Wikipedia), heur...
HTTP: Serving stacks, continuous batching, latency vs throughput, vLLM, and API design
HuggingFace: Tokenizer Summary (Hugging Face docs): reference covering every major algorithm — WordPiece, Unigram, BPE, byte-level BPE — with concrete examples of merge rules.
HumanEval: Generate N completions per prompt using the SFT model, score each with a reward model or human eval, keep only the best.
ID: The complete Transformer pipeline — from raw text to next-token prediction
IDE: Function calling, MCP, A2A — connecting agents to the world
IDF: Classifier or heuristic scores: perplexity filter (trained on Wikipedia), TF-IDF similarity to reference corpus, text length, symbol-to-word ratio.
ImageNet: Training a ResNet-50 on ImageNet, a researcher finds that SGD with momentum achieves 77.0% top-1 accuracy while AdamW achieves 76.1%.
InfiniBand: DDP, ZeRO, FSDP — training across thousands of GPUs
InstructGPT: SFT Dataset Reference Points. InstructGPT (Ouyang et al., 2022) used ~13,000 human-written SFT demonstrations collected from ~40 labelers; RLHF on 33K preference pairs — enough that its 1.3B RLHF model was preferred by human evaluators over the 175B GPT-3 base. Llama-2 Chat used ~27,540 vendor-colle...
IO: The core of Transformers — derive this on a whiteboard
IS: The complete Transformer pipeline — from raw text to next-token prediction
JSON: Serving stacks, continuous batching, latency vs throughput, vLLM, and API design
KB: Turning token IDs into meaningful vectors
KL: It adds a KL penalty against a uniform prior, equivalent to label smoothing in the logit space.
KV: The complete Transformer pipeline — from raw text to next-token prediction
LayerNorm: 4) N identical blocks, each: LayerNorm → Multi-Head Attention (with causal mask) → residual add → LayerNorm → FFN (SwiGLU) → residual add.
LIMA: LIMA (Less Is More for Alignment, 2023) showed that a 65B Llama model fine-tuned on just 1,000 high-quality examples performed comparably to models trained on 52K+ examples (Alpaca, Databricks-dolly).
LLM: For a typical LLM: ~65% of parameters are in FFN layers (two large weight matrices per block: d→4d and 4d→d, or 8d/3 for SwiGLU).
LM: But for small models the ratio is much higher: a 125M param model with 50K vocab and d=768 already has 38M embedding params = 31% of total.
LN: The complete Transformer pipeline — from raw text to next-token prediction
LoRA: Forgetting during fine-tuning is addressed by fine-tuning methods (LoRA, regularization), not by more pretraining tokens.
LR: Stability comes from controlling magnitude, not centering — empirical results show negligible perplexity difference.
LSP: Model Context Protocol Specification: the open standard for connecting AI agents to external tools and data sources.
MATH: Chain-of-thought, o1, DeepSeek-R1, test-time compute
MB: The core of Transformers — derive this on a whiteboard
MCP: Chain-of-thought, o1, DeepSeek-R1, test-time compute
MHA: Standard MHA: W_Q + W_K + W_V + W_O = 4d² attention params.
ML: The universal approximation theorem (1989) states: a single hidden-layer MLP with a non-polynomial activation can approximate any continuous function on a compact set to arbitrary precision, given enough neurons.
MLA: 2024 evolution — MLA (Multi-Head Latent Attention): 93.3% KV cache reduction vs MHA
MLP: The complete Transformer pipeline — from raw text to next-token prediction
MMLU: Evaluate on benchmarks (MMLU, HumanEval, MT-Bench) and human preference tests.
MQA: Per-head compute cost and why MQA/GQA were invented.
MRR: Ground LLM outputs in real data — reduce hallucination
MS: Sub-second: ANN search <50ms, reranking <100ms, LLM streaming for perceived latency.
MSE: MSE on probabilities would (1) not properly penalize confident wrong answers (assigning 0.01 vs 0.001 to the correct token matters a lot), (2) not correspond to the log-likelihood objective we actually want to maximize, (3) have worse gradient properties — cross-entropy gradients are proportional...
MT: For each prompt, generate N completions (e.g., N=64) from the current model, score each with a reward model, and keep only the highest-scoring one.
MTTR: MTTR dominated by detection, not rollback.
NCCL: (2) Check NCCL topology — one rank may have a bad NIC, cross-switch connection, or PCIe bottleneck.
NDCG: The likely gap: the upgrade was evaluated on standard relevance benchmarks (NDCG, MRR) where the new model improved, but citation precision is a downstream metric — it depends on what the LLM does with the reranked documents, not just which documents score highest.
NLI: (2) Online: embed query, ANN search (top-k=20), rerank with cross-encoder (top-5), stuff into LLM context.
NLP: BPE, vocabulary size, and why GPT can't count letters
NotebookLM: /module/dr-case-notebooklm. Company Lens — same design, different pushes. Retrieval q...
NSFW: Browser / mobile app — submits text prompt, polls or subscribes for result. Auth, per-tenant rate limiting, step-budget enforcement by tier. Text-level NSFW / policy check — fast classifier rejects before any GPU time is spent. Three-lane priority queue: Pro → Paid → Free.
NVIDIA: Chain rule, computation graphs, and autograd — how gradients flow backward
OCR: Production models handle this via: (1) dynamic resolution — resize to multiple supported resolutions based on aspect ratio; (2) image tiling — split high-res images into crops, each encoded separately; (3) token compression — use a perceiver resampler or pooling to reduce visual tokens from hundr...
OK: Architecture: (1) Offline pipeline: chunk documents (512 tokens, 50 token overlap), embed with a bi-encoder (e.g., BGE-large), store in vector DB (Pinecone/Qdrant/pgvector).
OOM: The real win from FlashAttention is enabling longer sequences (100K+) that were previously OOM — that is a qualitative capability gain, not just throughput.
ORM: (Note: the widely cited ~$5.5M figure is for DeepSeek-V3 pretraining, not R1.) ORM scores only the final answer: correct = 1, wrong = 0.
OS: The GPU spends most time loading model weights and KV cache from HBM, not computing.
PagedAttention: This is why serving GPT-3-scale models for many concurrent users requires PagedAttention or similar — each concurrent request consumes ~10 GB just for KV cache at max context. (Sketch below.)
PaLM: One head isn't enough — each head learns different patterns
PDF: React reconciler for terminals, Yoga flexbox, ANSI rendering, and keyboard focus
PE: Sinusoidal PE: no learned parameters, generalizes to unseen lengths in theory (but poorly in practice), deterministic.
PII: random web data, (5) Deduplication — exact dedup (URL + hash) then fuzzy dedup (MinHash LSH at document level, n-gram dedup at paragraph level), (6) PII removal — regex + NER for emails, phone numbers, (7) Toxicity filtering — classifier, but careful not to remove all discussion of sensitive topi...
PM: A PM says the judge is superhuman.
PPL: Step 2: Multi-head attention computes Q, K, V projections, applies causal mask, computes weighted sum.
PPO: Align the model to human preferences using either PPO (with a reward model) or DPO (direct preference pairs).
PR: (2) Use pass@k: probability of at least one success in k attempts, estimated as 1 - C(n-c, k)/C(n, k) where c = number of successes in n trials. (Sketch below.)
PreToolUse: Add: (1) Zod input validation before execution, (2) permission resolution (deny rules → allow rules → PreToolUse hook → user prompt), (3) timeout per tool, (4) error isolation via Promise.allSettled so one failed Grep doesn't abort the rest.
PyTorch: BPE, vocabulary size, and why GPT can't count letters
QA: For factual QA or structured tasks, greedy can work fine since there's often one correct answer.
QK: Attention has no sense of order — how do we fix that?
QPS: BPE, vocabulary size, and why GPT can't count letters
QueryEngine: Sub-agents need: (1) Context isolation — fresh QueryEngine with clean message history, not polluted by parent's 100K+ token conversation.
RAG: Small model drafts, big model verifies — parallel generation
RAM: With r=16 on a 4096x4096 matrix, you train 131K params instead of 16.8M — a 128x reduction. (Sketch below.)
ReAct: System prompts, few-shot, structured output, tool schemas
ReLU: However, ReLU kills ~half the neurons (negative outputs become 0), halving the effective variance.
RL: Step 4: Another LayerNorm on the result.
RLHF: RLHF can help with quality but does not specifically target the train/inference distribution mismatch.
ROI: System prompts, few-shot, structured output, tool schemas
RoPE: 3) Positional encoding (sinusoidal or RoPE) adds position information.
SFT: The key insight: pre-training already teaches the model almost everything — SFT just teaches the format and style of interaction.
SGD: Matrix multiplication, weight initialization, and the universal approximation theorem
SLA: A synchronous router adds its latency directly to TTFT.
SLO: Batch size is 1 per step — weights load from HBM but arithmetic intensity is ~1 FLOP/byte. With batch=1 the GPU loads each weight matrix from HBM to do just a few multiply-adds, leaving arithmetic units starved while memory bandwidth is saturated.
SOTA: The core of Transformers — derive this on a whiteboard
SRAM: The core of Transformers — derive this on a whiteboard
SRE: When the API returns 'prompt too long', the agent triggers reactive compaction: summarize older messages (keeping recent tool results intact), replace the originals with the summary, and retry.
SSE: The paper reports 2–4× higher throughput than prior systems (FasterTransformer, Orca) at the same latency; much larger headline numbers seen elsewhere depend on the specific static-batching baseline being compared against.
SWE: Chain-of-thought, o1, DeepSeek-R1, test-time compute
SwiGLU: The complete Transformer pipeline — from raw text to next-token prediction
TB: Softmax is translation-invariant but not scale-invariant.
TF: Classifier or heuristic scores: perplexity filter (trained on Wikipedia), TF-IDF similarity to reference corpus, text length, symbol-to-word ratio.
TFLOP: The core of Transformers — derive this on a whiteboard
TFLOPS: ~125 TFLOPS effective per A100 (1.25e14 FLOPS); A100-hours = C / 1.25e14 / 3600; training cost (@ $2/A100-hour) = A100-hours × 2. (Sketch below.)
TP: Strategy: (1) Tensor parallelism (TP=8) within each 8-GPU node — splits individual layers across GPUs connected by fast NVLink (~600 GB/s).
TPU: Serving stacks, continuous batching, latency vs throughput, vLLM, and API design
TTFT: Turning token IDs into meaningful vectors
TTL: The cache has a TTL (typically 5 min) so low-traffic endpoints may not benefit.
UI: Trajectory eval, tool accuracy, and why agent eval is harder
URL: random web data, (5) Deduplication — exact dedup (URL + hash) then fuzzy dedup (MinHash LSH at document level, n-gram dedup at paragraph level), (6) PII removal — regex + NER for emails, phone numbers, (7) Toxicity filtering — classifier, but careful not to remove all discussion of sensitive topi...
UX: Critical for UX — users see output immediately instead of waiting for the full response.
VRAM: 64 dimensions cannot be loaded into GPU VRAM; 8192 is the architectural maximum for Transformers.
XLA: Serving stacks, continuous batching, latency vs throughput, vLLM, and API design
YAML: Combine fine-tuned models without retraining
YouTube: Andrej Karpathy — The spelled-out intro to neural networks (YouTube): 2.5-hour walkthrough building micrograd step by step — every chain rule application shown explicitly.
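A minimal NumPy sketch of the max-subtraction trick quoted in the AMP entry above (illustrative code, not taken from the source):

```python
import numpy as np

def log_softmax(x: np.ndarray) -> np.ndarray:
    """Stable log-softmax: x_i - (max_x + log(sum(exp(x_j - max_x))))."""
    max_x = np.max(x)                        # subtract the max so exp() never overflows
    shifted = x - max_x
    return shifted - np.log(np.sum(np.exp(shifted)))

logits = np.array([1000.0, 1001.0, 1002.0])  # naive exp(1000.0) would overflow to inf
print(log_softmax(logits))                   # ~[-2.41, -1.41, -0.41]
```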
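The GRPO entry describes replacing the learned value function with a group-based baseline. A toy sketch, assuming (as in the original GRPO formulation) that advantages are also normalized by the group's standard deviation:

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """rewards: shape (K,), scores for K completions of one prompt.
    Advantage = (reward - group mean) / group std, so no critic network is needed."""
    baseline = rewards.mean()                # the group mean plays the role of the value function
    return (rewards - baseline) / (rewards.std() + eps)

rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])  # e.g. binary correctness of K=8 samples
print(group_relative_advantages(rewards))
```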
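A back-of-envelope check of the ~10 GB KV-cache figure in the PagedAttention entry, under assumed GPT-3-like dimensions (96 layers, d_model = 12288, 2048-token context, fp16); these dimensions are assumptions, not stated in the entry:

```python
def kv_cache_bytes(n_layers: int, d_model: int, seq_len: int, bytes_per_elem: int = 2) -> int:
    """K and V each store seq_len * d_model values per layer, hence the leading factor of 2."""
    return 2 * n_layers * seq_len * d_model * bytes_per_elem

gb = kv_cache_bytes(n_layers=96, d_model=12288, seq_len=2048) / 1e9
print(f"{gb:.1f} GB per request at max context")  # ~9.7 GB, consistent with the ~10 GB figure
```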
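The PR entry quotes the unbiased pass@k estimator 1 - C(n-c, k)/C(n, k); a small standalone helper (the example numbers are made up):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability of at least one success in k draws, given c successes observed in n samples."""
    if n - c < k:            # fewer than k failures exist, so any k draws contain a success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=200, c=30, k=10))  # ~0.81 for a 15% per-sample success rate
```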
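A worked check of the LoRA arithmetic in the RAM entry: a rank-16 adapter on a 4096x4096 weight trains 2*d*r parameters instead of d*d.

```python
d, r = 4096, 16
full_params = d * d          # 16,777,216 parameters in the frozen weight matrix
lora_params = 2 * d * r      # A (d x r) plus B (r x d) = 131,072 trainable parameters
print(full_params, lora_params, full_params // lora_params)  # 16777216 131072 128
```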
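The TFLOPS entry is a flattened cost-calculator widget; an equivalent Python sketch with the same assumptions ($2 per A100-hour, ~125 effective TFLOPS). C is the total training FLOPs, filled in here with the standard C ≈ 6·N·D approximation purely for illustration:

```python
def training_cost_usd(total_flops: float, flops_per_gpu: float = 1.25e14,
                      usd_per_gpu_hour: float = 2.0) -> float:
    """Convert a training compute budget into A100-hours, then into dollars."""
    gpu_hours = total_flops / flops_per_gpu / 3600
    return gpu_hours * usd_per_gpu_hour

C = 6 * 7e9 * 1.4e12          # e.g. a 7B-parameter model trained on 1.4T tokens
print(f"${training_cost_usd(C):,.0f}")  # roughly $260,000 under these assumptions
```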