AGI: How big? How much data? Chinchilla has the answer
AI: FineWeb, filtering, dedup — data quality beats data quantity
AIME: The full o1 model scored 83.3% on AIME 2024 with consensus@64 (74.4% pass@1) and 94.8% on MATH (vs.
AMP: For example, log-softmax = x_i - log(sum(exp(x_j))) is computed as x_i - (max_x + log(sum(exp(x_j - max_x)))) — subtracting the max before exp prevents overflow. (Sketch below.)
ANN: (2) Online: embed query, ANN search (top-k=20), rerank with cross-encoder (top-5), stuff into LLM context.
ANSI: Dual state systems: React context for UI, module state for services
API: DDP, ZeRO, FSDP — training across thousands of GPUs
ARC: How big? How much data? Chinchilla has the answer
AWQ: Quality loss is 1–3% on typical benchmarks (GPTQ, AWQ) but can be higher on chain-of-thought reasoning tasks where precision on intermediate tokens matters.
AWS: vLLM wins on: hardware portability (AMD, AWS Neuron), faster model updates (no engine recompile), Python-native API, and operational simplicity.
BatchNorm: He init accounts for batch normalization inserted after each ReLU, which would otherwise restore variance to 1 automatically.
BERT: Jay Alammar — how BERT reuses the Transformer encoder with bidirectional attention and masked language modeling, producing contextual embeddings that outperform static word vectors.
BLEU: BPE, vocabulary size, and why GPT can't count letters
BPE: The complete Transformer pipeline — from raw text to next-token prediction
CAI: Anthropic's CAI paper (Bai et al.
CDN: Live Web Fetch (conditional): typical external HTTP round-trip; depends on publisher CDN; qualitative range. Triggered only for freshness-critical queries (news, prices, sports scores).
ChatGPT: Think → Act → Observe — the reasoning loop
CI: A trajectory evaluator scores the full decision chain, not just the final diff.
CLAUDE: CLAUDE.md explicitly flags this distinction: “not permutation-invariant.” The sinusoidal PE uses base 10000 in the divisor term.
CLI: Judge calibration, regression gating, launch criteria, eval ops
CLIP: Model soups report +2pp accuracy on ImageNet with ViT-G/14 CLIP fine-tunes vs the best single run.
CLS: BERT's bidirectional attention makes [CLS] a function of the full sentence; GPT-2's causal mask means each token sees only past context.
CPU: Turning token IDs into meaningful vectors
CSAM: Multi-modal frontier serving — TPU stack, 1M-token attention, safety classifier chain
CSS: The sub-agent needs to search for CSS files — that history is noise.
CUDA: Tiling, IO-awareness, and O(N) memory attention
DALL: Yannic Kilcher — DALL·E 2 / Diffusion Models Explained (YouTube): visual walkthrough of latent diffusion, CLIP guidance, and how modern text-to-image systems combine these ideas.
DDP: DDP (DistributedDataParallel) replicates the entire model on each GPU and only synchronizes gradients via all-reduce after each backward pass.
DDPM: Denoising Diffusion Probabilistic Models (Ho et al., 2020) — the foundational DDPM paper that revived diffusion models for image generation.
DeepSeek: Multilingual models trained on Chinese data (e.g., Qwen, DeepSeek) use tokenizers with more CJK merges to reduce this gap.
DPO: RLHF/DPO teaches which responses humans prefer among those the model can already produce.
DRAM: Hurts: the 394M extra params are re-fetched from DRAM on every generated token, creating a bandwidth bottleneck.
FAQ: Amazon Working Backwards — PR/FAQ + 6-Pager. Not an ML piece, but the discipline of writing the customer-facing press release before the design doc is the methodological backbone of this module.
FFN: The complete Transformer pipeline — from raw text to next-token prediction
FlashAttention: This is why long-context models are challenging, and why FlashAttention (O(n) memory) is crucial.
FLOP: BPE, vocabulary size, and why GPT can't count letters
FLOPS: MHA uses h× more FLOPS, giving a strictly larger compute budget per forward pass.
FN: Judge calibration, regression gating, launch criteria, eval ops
FP: An order-invariant win rate confirms the signal is real, not an artifact of position bias.
FS: Three isolation layers: (1) Overlay filesystem — reads from the real FS but writes go to a temporary copy-on-write layer.
FSDP: FSDP (Fully Sharded Data Parallel) shards model parameters, gradients, AND optimizer states across GPUs (equivalent to ZeRO Stage 3).
GB: But word order matters ('dog bites man' vs 'man bites dog').
GDPR: Because GDPR and CCPA require a documented retention and deletion policy; a system with no TTL has no retention policy and is a compliance blocker for regulated Enterprise customers.
GELU: Both W1 and W2 store keys; the GELU activation retrieves the matching value. GELU is a nonlinear gate, not a retrieval mechanism.
GEMM: Matrix multiplication, weight initialization, and the universal approximation theorem
GPT: The complete Transformer pipeline — from raw text to next-token prediction
GPTQ: Quality loss is 1–3% on typical benchmarks (GPTQ, AWQ) but can be higher on chain-of-thought reasoning tasks where precision on intermediate tokens matters.
GPU: 64 dimensions cannot be loaded into GPU VRAM; 8192 is the architectural maximum for Transformers.
GQA: The complete Transformer pipeline — from raw text to next-token prediction
GRPO: GRPO (Group Relative Policy Optimization) replaces the learned value function with a simple group-based baseline: for each prompt, sample K completions, compute their rewards, and use the group mean as the baseline. (Sketch below.)
HBM: FlashAttention keeps O(n) HBM memory without changing FLOPs.
HNSW: Index structures (IVF, HNSW) add overhead on top.
HTML: A production pipeline: (1) Crawl — Common Crawl or custom scraper, (2) Extraction — HTML to text with boilerplate removal (trafilatura, resiliparse), (3) Language filtering — fasttext lid model, keep target languages, (4) Quality filtering — perplexity filter (small LM trained on Wikipedia), heur...
HTTP: Serving stacks, continuous batching, latency vs throughput, vLLM, and API design
HuggingFace: Tokenizer Summary (Hugging Face docs): reference covering every major algorithm — WordPiece, Unigram, BPE, byte-level BPE — with concrete examples of merge rules.
HumanEval: Generate N completions per prompt using the SFT model, score each with a reward model or human eval, keep only the best.
ID: The complete Transformer pipeline — from raw text to next-token prediction
IDE: Function calling, MCP, A2A — connecting agents to the world
IDF: Classifier or heuristic scores: perplexity filter (trained on Wikipedia), TF-IDF similarity to reference corpus, text length, symbol-to-word ratio.
ImageNet: Training a ResNet-50 on ImageNet, a researcher finds that SGD with momentum achieves 77.0% top-1 accuracy while AdamW achieves 76.1%.
InfiniBand: DDP, ZeRO, FSDP — training across thousands of GPUs
InstructGPT: SFT Dataset Reference Points. InstructGPT (Ouyang et al., 2022) used ~13,000 human-written SFT demonstrations collected from ~40 labelers; RLHF on 33K preference pairs — enough that its 1.3B RLHF model was preferred by human evaluators over the 175B GPT-3 base. Llama-2 Chat used ~27,540 vendor-colle...
IO: The core of Transformers — derive this on a whiteboard
IS: The complete Transformer pipeline — from raw text to next-token prediction
JSON: Serving stacks, continuous batching, latency vs throughput, vLLM, and API design
KB: Turning token IDs into meaningful vectors
KL: It adds a KL penalty against a uniform prior, equivalent to label smoothing in the logit space.
KV: The complete Transformer pipeline — from raw text to next-token prediction
LayerNorm: 4) N identical blocks, each: LayerNorm → Multi-Head Attention (with causal mask) → residual add → LayerNorm → FFN (SwiGLU) → residual add.
LIMA: LIMA (Less Is More for Alignment, 2023) showed that a 65B Llama model fine-tuned on just 1,000 high-quality examples performed comparably to models trained on 52K+ examples (Alpaca, Databricks-dolly).
LLM: For a typical LLM: ~65% of parameters are in FFN layers (two large weight matrices per block: d→4d and 4d→d, or 8d/3 for SwiGLU).
LM: But for small models the ratio is much higher: a 125M param model with 50K vocab and d=768 already has 38M embedding params = 31% of total.
LN: The complete Transformer pipeline — from raw text to next-token prediction
LoRA: Forgetting during fine-tuning is addressed by fine-tuning methods (LoRA, regularization), not by more pretraining tokens.
LR: Stability comes from controlling magnitude, not centering — empirical results show negligible perplexity difference.
LSP: Model Context Protocol Specification: the open standard for connecting AI agents to external tools and data sources.
MATH: Chain-of-thought, o1, DeepSeek-R1, test-time compute
MB: The core of Transformers — derive this on a whiteboard
MCP: Chain-of-thought, o1, DeepSeek-R1, test-time compute
MHA: Standard MHA: W_Q + W_K + W_V + W_O = 4d² attention params.
ML: The universal approximation theorem (1989) states: a single hidden-layer MLP with a non-polynomial activation can approximate any continuous function on a compact set to arbitrary precision, given enough neurons.
MLA: 2024 evolution — MLA (Multi-Head Latent Attention): 93.3% KV cache reduction vs MHA
MLP: The complete Transformer pipeline — from raw text to next-token prediction
MMLU: Evaluate on benchmarks (MMLU, HumanEval, MT-Bench) and human preference tests.
MQA: Per-head compute cost and why MQA/GQA were invented.
MRR: Ground LLM outputs in real data — reduce hallucination
MS: Sub-second: ANN search <50ms, reranking <100ms, LLM streaming for perceived latency.
MSE: MSE on probabilities would (1) not properly penalize confident wrong answers (assigning 0.01 vs 0.001 to the correct token matters a lot), (2) not correspond to the log-likelihood objective we actually want to maximize, (3) have worse gradient properties — cross-entropy gradients are proportional...
MT: For each prompt, generate N completions (e.g., N=64) from the current model, score each with a reward model, and keep only the highest-scoring one.
MTTR: MTTR dominated by detection, not rollback.
NCCL: (2) Check NCCL topology — one rank may have a bad NIC, cross-switch connection, or PCIe bottleneck.
NDCG: The likely gap: the upgrade was evaluated on standard relevance benchmarks (NDCG, MRR) where the new model improved, but citation precision is a downstream metric — it depends on what the LLM does with the reranked documents, not just which documents score highest.
NLI: (2) Online: embed query, ANN search (top-k=20), rerank with cross-encoder (top-5), stuff into LLM context.
NLP: BPE, vocabulary size, and why GPT can't count letters
NotebookLM: /module/dr-case-notebooklm. Company Lens — same design, different pushes. Retrieval q...
NSFW: Browser / mobile app — submits text prompt, polls or subscribes for result. Auth, per-tenant rate limiting, step-budget enforcement by tier. Text-level NSFW / policy check — fast classifier rejects before any GPU time is spent. Three-lane priority queue: Pro → Paid → Free.
NVIDIA: Chain rule, computation graphs, and autograd — how gradients flow backward
OCR: Production models handle this via: (1) dynamic resolution — resize to multiple supported resolutions based on aspect ratio; (2) image tiling — split high-res images into crops, each encoded separately; (3) token compression — use a perceiver resampler or pooling to reduce visual tokens from hundr...
OK: Architecture: (1) Offline pipeline: chunk documents (512 tokens, 50 token overlap), embed with a bi-encoder (e.g., BGE-large), store in vector DB (Pinecone/Qdrant/pgvector).
OOM: The real win from FlashAttention is enabling longer sequences (100K+) that were previously OOM — that is a qualitative capability gain, not just throughput.
ORM: (Note: the widely cited ~$5.5M figure is for DeepSeek-V3 pretraining, not R1.) ORM scores only the final answer: correct = 1, wrong = 0.
OS: The GPU spends most time loading model weights and KV cache from HBM, not computing.
PagedAttention: This is why serving GPT-3-scale models for many concurrent users requires PagedAttention or similar — each concurrent request consumes ~10 GB just for KV cache at max context. (Sketch below.)
PaLM: One head isn't enough — each head learns different patterns
PDF: React reconciler for terminals, Yoga flexbox, ANSI rendering, and keyboard focus
PE: Sinusoidal PE: no learned parameters, generalizes to unseen lengths in theory (but poorly in practice), deterministic.
PII: random web data, (5) Deduplication — exact dedup (URL + hash) then fuzzy dedup (MinHash LSH at document level, n-gram dedup at paragraph level), (6) PII removal — regex + NER for emails, phone numbers, (7) Toxicity filtering — classifier, but careful not to remove all discussion of sensitive topi...
PM: A PM says the judge is superhuman.
PPL: Step 2: Multi-head attention computes Q, K, V projections, applies causal mask, computes weighted sum.
PPO: Align the model to human preferences using either PPO (with a reward model) or DPO (direct preference pairs).
PR: (2) Use pass@k: probability of at least one success in k attempts, estimated as 1 - C(n-c, k)/C(n, k) where c = number of successes in n trials. (Sketch below.)
PreToolUse: Add: (1) Zod input validation before execution, (2) permission resolution (deny rules → allow rules → PreToolUse hook → user prompt), (3) timeout per tool, (4) error isolation via Promise.allSettled so one failed Grep doesn't abort the rest.
PyTorch: BPE, vocabulary size, and why GPT can't count letters
QA: For factual QA or structured tasks, greedy can work fine since there's often one correct answer.
QK: Attention has no sense of order — how do we fix that?
QPS: BPE, vocabulary size, and why GPT can't count letters
QueryEngine: Sub-agents need: (1) Context isolation — fresh QueryEngine with clean message history, not polluted by parent's 100K+ token conversation.
RAG: Small model drafts, big model verifies — parallel generation
RAM: With r=16 on a 4096x4096 matrix, you train 131K params instead of 16.8M — a 128x reduction. (Sketch below.)
ReAct: System prompts, few-shot, structured output, tool schemas
ReLU: However, ReLU kills ~half the neurons (negative outputs become 0), halving the effective variance.
RL: Step 4: Another LayerNorm on the result.
RLHF: RLHF can help with quality but does not specifically target the train/inference distribution mismatch.
ROI: System prompts, few-shot, structured output, tool schemas
RoPE: 3) Positional encoding (sinusoidal or RoPE) adds position information.
SFT: The key insight: pre-training already teaches the model almost everything — SFT just teaches the format and style of interaction.
SGD: Matrix multiplication, weight initialization, and the universal approximation theorem
SLA: A synchronous router adds its latency directly to TTFT.
SLO: Batch size is 1 per step — weights load from HBM but arithmetic intensity is ~1 FLOP/byte. With batch=1 the GPU loads each weight matrix from HBM to do just a few multiply-adds, leaving arithmetic units starved while memory bandwidth is saturated.
SOTA: The core of Transformers — derive this on a whiteboard
SRAM: The core of Transformers — derive this on a whiteboard
SRE: When the API returns 'prompt too long', the agent triggers reactive compaction: summarize older messages (keeping recent tool results intact), replace the originals with the summary, and retry.
SSE: The paper reports 2–4× higher throughput than prior systems (FasterTransformer, Orca) at the same latency; much larger headline numbers seen elsewhere depend on the specific static-batching baseline being compared against.
SWE: Chain-of-thought, o1, DeepSeek-R1, test-time compute
SwiGLU: The complete Transformer pipeline — from raw text to next-token prediction
TB: Softmax is translation-invariant but not scale-invariant.
TF: Classifier or heuristic scores: perplexity filter (trained on Wikipedia), TF-IDF similarity to reference corpus, text length, symbol-to-word ratio.
TFLOP: The core of Transformers — derive this on a whiteboard
TFLOPS: ~125 TFLOPS effective per A100 (1.25e14 FLOPS); A100-hours = C / 1.25e14 / 3600; training cost (@ $2/A100-hour) = A100-hours × 2. (Sketch below.)
TP: Strategy: (1) Tensor parallelism (TP=8) within each 8-GPU node — splits individual layers across GPUs connected by fast NVLink (~600 GB/s).
TPU: Serving stacks, continuous batching, latency vs throughput, vLLM, and API design
TTFT: Turning token IDs into meaningful vectors
TTL: The cache has a TTL (typically 5 min) so low-traffic endpoints may not benefit.
UI: Trajectory eval, tool accuracy, and why agent eval is harder
URL: random web data, (5) Deduplication — exact dedup (URL + hash) then fuzzy dedup (MinHash LSH at document level, n-gram dedup at paragraph level), (6) PII removal — regex + NER for emails, phone numbers, (7) Toxicity filtering — classifier, but careful not to remove all discussion of sensitive topi...
UX: Critical for UX — users see output immediately instead of waiting for the full response.
VRAM: 64 dimensions cannot be loaded into GPU VRAM; 8192 is the architectural maximum for Transformers.
XLA: Serving stacks, continuous batching, latency vs throughput, vLLM, and API design
YAML: Combine fine-tuned models without retraining
YouTube: Andrej Karpathy — The spelled-out intro to neural networks (YouTube): 2.5-hour walkthrough building micrograd step by step — every chain rule application shown explicitly.
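A minimal NumPy sketch of the max-subtraction trick quoted in the AMP entry above (illustrative code, not taken from the source):

```python
import numpy as np

def log_softmax(x: np.ndarray) -> np.ndarray:
    """Stable log-softmax: x_i - (max_x + log(sum(exp(x_j - max_x))))."""
    max_x = np.max(x)                        # subtract the max so exp() never overflows
    shifted = x - max_x
    return shifted - np.log(np.sum(np.exp(shifted)))

logits = np.array([1000.0, 1001.0, 1002.0])  # naive exp(1000.0) would overflow to inf
print(log_softmax(logits))                   # ~[-2.41, -1.41, -0.41]
```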
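The GRPO entry describes replacing the learned value function with a group-based baseline. A toy sketch, assuming (as in the original GRPO formulation) that advantages are also normalized by the group's standard deviation:

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """rewards: shape (K,), scores for K completions of one prompt.
    Advantage = (reward - group mean) / group std, so no critic network is needed."""
    baseline = rewards.mean()                # the group mean plays the role of the value function
    return (rewards - baseline) / (rewards.std() + eps)

rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])  # e.g. binary correctness of K=8 samples
print(group_relative_advantages(rewards))
```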
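A back-of-envelope check of the ~10 GB KV-cache figure in the PagedAttention entry, under assumed GPT-3-like dimensions (96 layers, d_model = 12288, 2048-token context, fp16); these dimensions are assumptions, not stated in the entry:

```python
def kv_cache_bytes(n_layers: int, d_model: int, seq_len: int, bytes_per_elem: int = 2) -> int:
    """K and V each store seq_len * d_model values per layer, hence the leading factor of 2."""
    return 2 * n_layers * seq_len * d_model * bytes_per_elem

gb = kv_cache_bytes(n_layers=96, d_model=12288, seq_len=2048) / 1e9
print(f"{gb:.1f} GB per request at max context")  # ~9.7 GB, consistent with the ~10 GB figure
```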
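The PR entry quotes the unbiased pass@k estimator 1 - C(n-c, k)/C(n, k); a small standalone helper (the example numbers are made up):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability of at least one success in k draws, given c successes observed in n samples."""
    if n - c < k:            # fewer than k failures exist, so any k draws contain a success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=200, c=30, k=10))  # ~0.81 for a 15% per-sample success rate
```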
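A worked check of the LoRA arithmetic in the RAM entry: a rank-16 adapter on a 4096x4096 weight trains 2*d*r parameters instead of d*d.

```python
d, r = 4096, 16
full_params = d * d          # 16,777,216 parameters in the frozen weight matrix
lora_params = 2 * d * r      # A (d x r) plus B (r x d) = 131,072 trainable parameters
print(full_params, lora_params, full_params // lora_params)  # 16777216 131072 128
```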
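The TFLOPS entry is a flattened cost-calculator widget; an equivalent Python sketch with the same assumptions ($2 per A100-hour, ~125 effective TFLOPS). C is the total training FLOPs, filled in here with the standard C ≈ 6·N·D approximation purely for illustration:

```python
def training_cost_usd(total_flops: float, flops_per_gpu: float = 1.25e14,
                      usd_per_gpu_hour: float = 2.0) -> float:
    """Convert a training compute budget into A100-hours, then into dollars."""
    gpu_hours = total_flops / flops_per_gpu / 3600
    return gpu_hours * usd_per_gpu_hour

C = 6 * 7e9 * 1.4e12          # e.g. a 7B-parameter model trained on 1.4T tokens
print(f"${training_cost_usd(C):,.0f}")  # roughly $260,000 under these assumptions
```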