ML internals, every number defended.
A working engineer’s study notebook for the AI era — 84 modules on transformer internals, training, inference, agent systems, and production design reviews. Every claim cited, every formula derived, every dollar priced. Built for L6+ ML / AI engineering loops at DeepMind, Anthropic, Meta, OpenAI — and for studying alongside your favorite LLM.
LLM-friendly: every module also available as raw markdown at /raw/<id>.md — llms.txt for AI ingestion · changelog for what’s changed.
Guided Learning Paths
New to Transformers
Build understanding from the ground up
Interview Sprint
High-yield topics for ML interviews
AI Engineering
Build production agent systems
🏗️ The Transformer (10)
High-Level Overview
The complete Transformer pipeline — from raw text to next-token prediction
GPT-3 predicts each token in 6ms — but processes the entire 96-layer forward pass to do it. Why can’t it just skip layers it already ‘knows’?
Tokenization
BPE, vocabulary size, and why GPT can't count letters
Why can't GPT count letters in "strawberry"?
Embeddings
Turning token IDs into meaningful vectors
Why is 'king' - 'man' + 'woman' = 'queen'?
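A minimal sketch of the analogy as vector arithmetic, using made-up 3-D vectors chosen purely for illustration (real embeddings have hundreds of dimensions and are learned, not hand-written):

```python
# Toy illustration of "king - man + woman ≈ queen" with made-up 3-D vectors.
# Real embeddings have hundreds of dimensions; these numbers are hypothetical.

def add(a, b): return [x + y for x, y in zip(a, b)]
def sub(a, b): return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / (norm(a) * norm(b))

vectors = {
    "king":  [0.9, 0.8, 0.1],   # royalty-ish, male-ish
    "man":   [0.1, 0.9, 0.0],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.9, 0.1, 0.9],   # royalty-ish, female-ish
}

target = add(sub(vectors["king"], vectors["man"]), vectors["woman"])
best = max((w for w in vectors if w != "king"), key=lambda w: cosine(target, vectors[w]))
print(best)  # 'queen': the nearest neighbor of king - man + woman
```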
Positional Encoding
Attention has no sense of order — how do we fix that?
"cat ate fish" = "fish ate cat" without this
MLP & Matmul
Matrix multiplication, weight initialization, and the universal approximation theorem
A 2-layer MLP can approximate any function — so why do we need 96 layers?
Self-Attention
The core of Transformers — derive this on a whiteboard
GPT-4 reads ‘The trophy didn’t fit in the suitcase because it was too big.’ What does ‘it’ refer to? Humans know instantly. Without attention, models can’t.
Multi-Head Attention
One head isn't enough — each head learns different patterns
One head looks at the current word, another at the sentence 12 positions back, a third at syntactic structure. Why do you need 32 of them in parallel?
FFN & Activations
Where 67% of parameters live — and what they memorize
Where does GPT store the fact that Paris is in France?
LayerNorm & Residuals
The glue that makes deep transformers trainable
Delete one line and a 96-layer model becomes untrainable
The Full Forward Pass
Watch a token travel through a complete Transformer block
What happens to the word "cat" in 0.003 seconds?
🎓 Training (13)
Backpropagation
Chain rule, computation graphs, and autograd — how gradients flow backward
Naive finite differences would need 175 billion forward passes to compute GPT-3's gradients. Backprop does it in one backward pass. Here's the trick.
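A toy illustration of the gap, assuming a trivial quadratic loss rather than anything GPT-sized: finite differences needs one extra forward pass per parameter, while the analytic (reverse-mode) gradient comes out in a single pass.

```python
# Why naive finite differences scales with parameter count: one extra forward
# pass per parameter, versus a single analytic (reverse-mode) gradient.
# Toy loss: L(w) = sum(w_i^2), whose exact gradient is 2*w.

def loss(w):
    return sum(x * x for x in w)

def grad_finite_diff(w, eps=1e-6):
    base = loss(w)
    grads = []
    for i in range(len(w)):              # len(w) extra forward passes
        bumped = list(w)
        bumped[i] += eps
        grads.append((loss(bumped) - base) / eps)
    return grads

def grad_analytic(w):
    return [2 * x for x in w]            # one pass, the "backprop" answer

w = [0.5, -1.0, 2.0]
print(grad_finite_diff(w))   # ≈ [1.0, -2.0, 4.0], after 3 extra forward passes
print(grad_analytic(w))      # exactly [1.0, -2.0, 4.0], in one pass
```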
Optimizers
SGD → Momentum → Adam → AdamW, learning rate schedules, and weight decay
AdamW fixes a 5-year-old bug in Adam that silently hurts generalization
Pre-training & Loss
Next-token prediction, cross-entropy, and perplexity
GPT-3 trained on 300B tokens and never saw a single labeled example — yet learned grammar, facts, math, and reasoning from one loss function.
Data Curation
FineWeb, filtering, dedup — data quality beats data quantity
LIMA trained on 1,000 examples and matched GPT-3.5
Scaling Laws
How big? How much data? Chinchilla has the answer
Why Llama-2 beats GPT-3 with less than half the parameters
GPU & Mixed Precision
CUDA memory hierarchy, fp16/bf16/fp8, loss scaling, and torch.autocast
bf16 training uses half the memory with zero accuracy loss — why wasn't this the default?
Distributed Training
DDP, ZeRO, FSDP — training across thousands of GPUs
A 70B-param model needs 280 GB just to hold weights at fp32 — no single GPU exists with that memory. Here's how training across 1,000 GPUs stays in sync.
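A quick back-of-envelope check of that 280 GB figure, plus the usual mixed-precision Adam accounting (the 16 bytes-per-parameter rule of thumb is an assumption, not a number from this module):

```python
# Back-of-envelope memory for a 70B-parameter model (illustrative accounting).
params = 70e9

weights_fp32 = params * 4                      # 4 bytes per fp32 weight
print(f"fp32 weights alone: {weights_fp32 / 1e9:.0f} GB")      # ~280 GB

# Typical mixed-precision Adam accounting (~16 bytes/param: fp16 weights +
# fp16 grads + fp32 master weights + two fp32 Adam moments), a common rule
# of thumb rather than an exact number for any specific stack.
training_state = params * 16
print(f"full training state: {training_state / 1e9:.0f} GB")   # ~1,120 GB
print(f"H100s (80 GB) just to hold it: {training_state / 80e9:.0f}+")
```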
Fine-tuning & LoRA
Adapt a model with 0.1% of parameters
LoRA trains 0.1% of parameters and matches full fine-tuning
SFT & Post-Training Pipeline
Loss masking, chat templates, rejection sampling, distillation
InstructGPT’s SFT used 13K labeled examples — and that alone beat GPT-3 (175B base) on human preference. Why does so little labeled data do so much?
RL Foundations
MDPs, policy gradient, PPO — the math before RLHF
REINFORCE was invented in 1992. 30 years later it’s training the world’s most capable AI — with one fundamental addition: a baseline.
RLHF & Reward Models
Teaching models what humans prefer — the 3-stage pipeline
OpenAI fine-tuned InstructGPT on 13,000 labeled examples and it beat GPT-3 (175B) on human preference. Smaller dataset, smaller model, better outputs — because RLHF teaches preference, not prediction.
DPO, GRPO & Alternatives
Skip the reward model — direct preference optimization
DPO skips the reward model and PPO entirely — yet matches RLHF quality. Here's the math that makes 2 stages collapse into 1.
Model Merging
Combine fine-tuned models without retraining
Three fine-tuned models. SLERP into one. The merge wins on tasks none of them mastered alone — and you spent zero GPU-hours.
⚡ Inference (6)
KV Cache & Memory
Why generation is memory-bound and how to fix it
Without caching, generating one extra token costs as much as reprocessing the entire prompt. Here's how to fix it.
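A rough sketch of what the cache actually holds, using Llama-2-7B-like shapes as an assumption (32 layers, 32 KV heads, head dimension 128, fp16):

```python
# KV-cache size per token: 2 (K and V) * layers * kv_heads * head_dim * bytes.
# Shapes are an assumption (roughly Llama-2-7B-like), not any specific deployment.
layers, kv_heads, head_dim, bytes_per = 32, 32, 128, 2   # fp16

per_token = 2 * layers * kv_heads * head_dim * bytes_per
print(f"{per_token / 1024:.0f} KB of cache per token")                 # 512 KB

seq_len = 4096
print(f"{per_token * seq_len / 1e9:.1f} GB for a 4K-token context")    # ~2.1 GB

# Without the cache, every new token would recompute K and V for all previous
# tokens, so one extra token costs roughly a full prompt re-encode.
```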
Flash Attention
Tiling, IO-awareness, and O(N) memory attention
Same FLOPs, 2-4x faster — by never writing the N² attention matrix
Sampling & Decoding
Temperature, top-k, top-p — how the model picks the next token
Temperature 0 = always 'the', temperature 2 = sometimes 'banana'
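A minimal sketch of temperature in action, with a made-up four-word vocabulary and hand-picked logits:

```python
import math, random

# Temperature rescales logits before softmax. Toy logits over a made-up
# 4-word vocabulary; the numbers are illustrative only.
vocab  = ["the", "cat", "pizza", "banana"]
logits = [4.0, 2.0, 0.5, 0.1]

def sample(logits, temperature):
    if temperature == 0:                       # greedy: argmax, always "the"
        return vocab[logits.index(max(logits))]
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]   # numerically stable softmax
    probs = [e / sum(exps) for e in exps]
    return random.choices(vocab, weights=probs, k=1)[0]

print(sample(logits, 0))                         # 'the', every time
print([sample(logits, 2.0) for _ in range(5)])   # flatter distribution: 'banana' shows up
```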
Quantization
INT8, INT4, GPTQ, AWQ — shrink models without losing quality
4-bit Llama-70B fits in 35 GB — down from 140 GB
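Those numbers are straight bytes-per-weight arithmetic; a quick check, ignoring the few percent of overhead real GPTQ/AWQ checkpoints add for scales and zero-points:

```python
params = 70e9

fp16 = params * 2        # 2 bytes per weight
int4 = params * 0.5      # 4 bits per weight

print(f"fp16: {fp16 / 1e9:.0f} GB")   # 140 GB: ~2x A100-80GB just for the weights
print(f"int4: {int4 / 1e9:.0f} GB")   # 35 GB:  the weights fit on a single 40/48 GB card
# Real GPTQ/AWQ checkpoints add a few percent for per-group scales and zero-points.
```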
Speculative Decoding
Small model drafts, big model verifies — parallel generation
Small model guesses 5 tokens, big model checks all 5 at once
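A minimal sketch of the loop, using greedy verification instead of the full rejection-sampling correction, with tiny stub functions standing in for the draft and target models (everything here is a placeholder, not the real algorithm's sampling math):

```python
# Minimal speculative-decoding sketch with greedy verification (the real
# algorithm uses a rejection-sampling correction so the output matches the
# target model's distribution). draft_next / target_next are toy stand-ins.

def draft_next(ctx):  return (sum(ctx) + 1) % 50    # cheap "small model"
def target_next(ctx): return (sum(ctx) + 1) % 47    # expensive "big model"

def speculative_step(ctx, k=5):
    # 1) The small model drafts k tokens autoregressively (cheap).
    draft = []
    for _ in range(k):
        draft.append(draft_next(ctx + draft))
    # 2) The big model verifies the draft; in a real system all k positions
    #    are scored in one batched forward pass (simulated sequentially here).
    accepted = []
    for tok in draft:
        expected = target_next(ctx + accepted)
        if tok == expected:
            accepted.append(tok)        # agreement: a token for free
        else:
            accepted.append(expected)   # disagreement: keep the big model's token, stop
            break
    return accepted

print(speculative_step([1, 2, 3]))  # [7, 14, 28, 9]: 4 tokens from one verification sweep
```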
LLM Deployment
Serving stacks, continuous batching, latency vs throughput, vLLM, and API design
Continuous batching serves 23x more requests than static batching on the same GPU
🧩 Architectures (6)
Mixture of Experts
More parameters, same compute — the secret behind DeepSeek
DeepSeek-V3 has 671B params but each token only uses 37B
Vision Transformers & CLIP
Patch embeddings, contrastive learning, zero-shot classification
Split a photo into 196 patches and a Transformer sees it as text
Multimodal LLMs
How GPT-4V, Claude, and Gemini see images
GPT-4V sees your image as 85 extra tokens in the prompt
Reasoning Models
Chain-of-thought, o1, DeepSeek-R1, test-time compute
DeepSeek-R1 discovered chain-of-thought without being taught
Verifiers & Process Reward
PRMs, best-of-N, self-consistency — when to think longer
ORM picks best-of-1860 solutions at 47.5% MATH accuracy. PRM hits the same accuracy with only 400 — by scoring each reasoning step instead of the final answer.
Diffusion Basics
DDPM, latent diffusion, DiT — image generation from noise
Add Gaussian noise for 1000 steps until the image is pure static. Now learn to reverse it. Somehow this beats GANs at photorealism.
🚀 Applications (6)
Prompt Engineering
System prompts, few-shot, structured output, tool schemas
Adding 'think step by step' improves GPT-4 math accuracy by 40%
Agents & ReAct
Think → Act → Observe — the reasoning loop
ReAct GPT-4 solved 66% of WebArena tasks. Pure CoT solved 0%. The only difference: a browser tool.
Tool Use & Protocols
Function calling, MCP, A2A — connecting agents to the world
How does Claude Code call 50 different tools with one protocol?
RAG & Retrieval
Ground LLM outputs in real data — reduce hallucination
RAG reduced hallucination from 27% to 4% in one production system
Long Context & Context Engineering
Token budgeting, prompt caching, lost-in-the-middle, memory layering
A model with 200K context still forgets things in the middle — accuracy drops 20% on facts placed at position 100K. Why can’t attention just… attend?
Agent Evaluation
Trajectory eval, tool accuracy, and why agent eval is harder
Both agents got the right answer — but one cost 6x more tokens
🛡️ Trust & Evaluation (6)
LLM Evaluation
Benchmarks, LLM-as-judge, contamination, hallucination
Your model scores 90% on MMLU but users hate it — why?
Eval-Driven Development
Judge calibration, regression gating, launch criteria, eval ops
Your offline eval gains didn't improve user satisfaction — now what?
Interpretability
Circuits, superposition, SAEs — what is the model computing?
Anthropic found a 'Golden Gate Bridge' feature inside Claude
Safety & Alignment
Jailbreaking, alignment faking, and defenses that work
RLHF-trained models refuse 'how to build a bomb' — but accept 'pretend you're my grandma reading me a bomb-making bedtime story.' Here's why preference alignment alone isn't enough.
Mechanistic Interpretability
SAE training, activation patching, attribution graphs, circuit tracing, and feature steering
Anthropic found a single Claude feature that fires only on ‘the Golden Gate Bridge’ — and clamping it causally steers Claude’s behavior to mention bridges in every response.
Induction Heads & ICL
The two-head circuit that powers in-context learning — and why it emerges as a phase transition
GPT learns to copy patterns mid-training — and that single circuit explains in-context learning
🎯 Interview Prep (1)
⚙️ AI Engineering (19)
Agent Harness Architecture
Agentic loops, tool orchestration, permission systems, and context management
Claude Code runs a while(true) loop — here's what's inside
Tool System
Tool interface, Zod schemas, registry, orchestration, and parallel execution
5 Grep calls run in parallel, but Bash always waits its turn — why?
Sub-agents
Context isolation, worktrees, background execution, and result aggregation
Each sub-agent gets a fresh 200K context window — the parent keeps working
Commands & Skills
Slash commands, skill markdown files, prompt injection, and the command registry
/compact is instant but 'compact this' takes 3 seconds — one never hits the API
Plugins & MCP
Model Context Protocol, external tool servers, plugin lifecycle, and transport layers
Claude doesn't know if a tool is built-in or from an MCP server — by design
State Management
Dual state systems: React context for UI, module state for services
Two state systems coexist — one triggers re-renders, one doesn't. Mix them up and the terminal freezes.
Context Compaction
Auto-compact, reactive compact, microcompact, context collapse, and token budgets
At 80% context usage, the agent silently summarizes its own history to keep going
Terminal UI (Ink)
React reconciler for terminals, Yoga flexbox, ANSI rendering, and keyboard focus
It's React — but instead of DOM nodes, it writes ANSI escape codes to stdout
Memory System
File-based persistent memory, memory types, auto-save triggers, and cross-session recall
Claude remembers you're a senior engineer — across sessions, without a database
Hooks & Permissions
PreToolUse/PostToolUse hooks, 5-layer permission hierarchy, and safety gates
A shell script you wrote can veto any tool call before Claude even sees the result
Prompt Engineering (System)
System prompt assembly, cache boundary optimization, dynamic sections, and prompt variants
The system prompt has a secret boundary — everything before it is cached, everything after is fresh
Configuration & Schemas
Settings.json, Zod validation, feature flags, MDM policies, and config hierarchy
Zod validates every key at startup — one typo in settings.json blocks the entire CLI from booting.
Bridges & IDE Integration
WebSocket bridge, VS Code/JetBrains extensions, permission callbacks, and message routing
A WebSocket reconnect feels instantaneous to the user — but rebuilds the entire IDE state in 3 round trips. Here’s why that’s a design constraint, not a bug.
Streaming & API Layer
Async generators, queryModelWithStreaming, SSE parsing, and backpressure
Tokens appear one by one because five async generators pipe data like Unix pipes
Error Recovery
Reactive compact retry, max output tokens escalation, abort handling, and graceful degradation
The API says 'prompt too long' — the agent silently compacts and retries before you notice
Speculative Execution
Parallel speculation, overlay filesystems, safe tool subsets, and acceptance criteria
While you're still typing, a speculative agent already searched the codebase for you
Coordinator/Worker Pattern
Multi-agent coordination, restricted tool sets, environment gating, and task distribution
The coordinator writes prompts, not code — it manages a team of worker agents
Session Persistence
Session JSON, /resume reconstruction, message history, file snapshots, and attribution
Close the terminal, reopen it, type --resume — the conversation continues exactly where you left off
Cost Tracking & Budgets
Token counting, budget limits, per-model pricing, rate limit handling, and spend alerts
Claude Code emits cost events on every API response. Miss one and a runaway agent burns $200 before the budget gate fires.
📐 Design Reviews (17)
The Design Doc
Working backwards from the SLO — an annotated, worked design doc for a real ML endpoint
A system with p99=500ms needs 3× the GPUs of one with p99=1s. The right SLO choice is the entire design — yet most engineers write the SLO last.
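One way such a multiplier can arise, sketched with a deliberately crude single-replica M/M/1 queueing model (real LLM serving with batching behaves differently; the service time and traffic below are assumptions picked for illustration):

```python
import math

# Crude M/M/1 sketch of why a tighter p99 multiplies GPU count.
# In M/M/1 the response time is Exp(mu - lam), so p99 = ln(100) / (mu - lam).
# service_time and total_qps are illustrative assumptions, not measured numbers.

service_time = 0.087                 # seconds per request on one replica (mu ≈ 11.5 req/s)
mu = 1 / service_time
total_qps = 1000

def replicas_needed(p99_target):
    lam_max = mu - math.log(100) / p99_target   # max per-replica arrival rate
    if lam_max <= 0:
        raise ValueError("p99 target below what one replica can ever hit")
    return total_qps / lam_max

slow = replicas_needed(1.0)    # p99 = 1s
fast = replicas_needed(0.5)    # p99 = 500ms
print(f"{slow:.0f} vs {fast:.0f} replicas -> {fast / slow:.1f}x more GPUs")
```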
Cost Accounting & Eval-Driven Design
Cost-per-bad-day, LLM-judge rubrics, golden-set sizing — design flows from the eval, not the architecture
A model that scores 90% on your offline eval can drop 15% on user satisfaction — because the eval measured the wrong thing. Write the eval before the architecture.
Case: Design ChatGPT
Multi-tenant chat — SLOs, model routing, conversation state
Two billion messages a day — where does the money actually go?
Case: Design Perplexity
RAG + live web search — freshness, citations, retrieval fusion
Half retrieval system, half LLM — which half should you optimize first?
Case: Design Claude Code / Cursor
Coding agent at scale — context builder, tools, sandboxing
The model is cheap. The context is what costs you.
Case: Design Midjourney
Multi-tenant diffusion — queueing, step budgets, content safety, GPU economics
A 50-step generation that fails at step 48 still costs you 48 steps
Case: Design TikTok For-You Ranking
Two-tower retrieval + ranker + feature store — classical ML@scale canon
Why the Explore/Exploit slider matters more than the model
Case: Design an Embeddings Platform
Pinterest-style — backfill, drift, model upgrades, serving with HNSW
The day you change your embedding model, every index goes stale
Case: Design Llama Training Infra
Data pipeline + checkpoint management + failure-tolerant orchestration
At 16K GPUs, a GPU fails every 3 hours — design for it
Case: Design an Agent Platform
Multi-agent infra — sandboxing, tool registries, trajectory eval, spend control
An agent that spawns agents — where does the budget live?
Case: Design Gemini
Multi-modal frontier serving — TPU stack, 1M-token attention, safety classifier chain
1M-token context is cheap to promise, expensive to serve — here's the bill
Case: Design NotebookLM
Long-context RAG over user docs — source-pinned citations, audio-overview pipeline
Upload 50 PDFs, ask one question — which half of the stack wins?
Case: Design Sora
Text-to-video at scale — diffusion transformer GPU economics, safety on generative video
A 10-second clip costs more GPU-hours than your laptop's lifetime
Case: Design Character.ai
Consumer LLM at scale — MQA, int8, trained-from-scratch, sub-$1/user/month cost floor
20B tokens served per day on a consumer-priced subscription — how?
Compare: RAG Systems
Perplexity vs NotebookLM vs ChatGPT-search vs Phind — retriever, grounding, citation side-by-side
Same question, four systems, four answers — whose retriever wins?
Compare: SLO ↔ Cost
Interactive sensitivity — slide p99, watch GPU count, $/req, and cache hit-rate move together
Cut p99 latency in half — how much more expensive does it get?
Compare: Failure-Mode Taxonomy
One master table of every failure mode across 14 real systems — with detect→escalate→rollback playbooks
The 3am page happens — you have 30 seconds to pick the right lever
Interview Focus by Company
Google / DeepMind: Attention derivation, RL fundamentals (MDPs), distributed training, ML system design
Anthropic: Safety, Constitutional AI, interpretability, RLHF depth, first principles
OpenAI: Scaling laws, reasoning models, LLM serving at scale, process reward models
Meta: Vision Transformers, multimodal, distributed training, MoE
Databricks: RAG at scale, serving optimization, data curation, evaluation