Transformer Math

ML internals, defended every number.

A working engineer’s study notebook for the AI era — 84 modules on transformer internals, training, inference, agent systems, and production design reviews. Every claim cited, every formula derived, every dollar priced. Built for L6+ ML / AI engineering loops at DeepMind, Anthropic, Meta, OpenAI — and for studying alongside your favorite LLM.

LLM-friendly: every module is also available as raw markdown at /raw/<id>.md · llms.txt for AI ingestion · changelog for what’s changed.

🏗️ The Transformer (10)

🏗️#0

High-Level Overview

The complete Transformer pipeline — from raw text to next-token prediction

GPT-3 predicts each token in 6ms — but processes the entire 96-layer forward pass to do it. Why can’t it just skip layers it already ‘knows’?

🔤#1

Tokenization

BPE, vocabulary size, and why GPT can't count letters

Why can't GPT count letters in "strawberry"?
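A toy greedy longest-match tokenizer makes the teaser concrete (the vocabulary here is invented, not GPT's real BPE merges): the model receives whole-chunk token IDs, so individual letters are invisible to it.

```python
# Toy illustration (not GPT's real vocabulary): a greedy longest-match
# tokenizer splits "strawberry" into multi-letter chunks, so the model
# sees token IDs, never individual letters.
vocab = {"straw": 0, "berry": 1, "st": 2, "raw": 3, "b": 4, "e": 5, "r": 6, "y": 7}

def tokenize(text, vocab):
    """Greedy longest-prefix match, a simple stand-in for BPE segmentation."""
    tokens = []
    while text:
        for end in range(len(text), 0, -1):   # try the longest prefix first
            if text[:end] in vocab:
                tokens.append(text[:end])
                text = text[end:]
                break
        else:
            raise ValueError("untokenizable input")
    return tokens

print(tokenize("strawberry", vocab))  # ['straw', 'berry'] — two opaque IDs
# Counting the r's means looking inside chunks the model never sees as letters.
```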

📊#2

Embeddings

Turning token IDs into meaningful vectors

Why is 'king' - 'man' + 'woman' = 'queen'?
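A minimal sketch of the analogy, using hand-crafted 3-dimensional vectors rather than trained embeddings (real models use hundreds of dimensions, and the arithmetic only holds approximately):

```python
import numpy as np

# Hand-crafted toy embeddings; dimensions loosely mean [royalty, male, female].
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.8, 0.1]),
    "woman": np.array([0.1, 0.1, 0.8]),
}

def nearest(vec, emb, exclude):
    """Return the word whose embedding has highest cosine similarity to vec."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(vec, emb[w]))

v = emb["king"] - emb["man"] + emb["woman"]   # royalty stays, gender flips
print(nearest(v, emb, exclude={"king", "man", "woman"}))  # queen
```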

📍#3

Positional Encoding

Attention has no sense of order — how do we fix that?

"cat ate fish" = "fish ate cat" without this
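One standard fix is the sinusoidal encoding from the original Transformer paper, which gives every position a unique fingerprint vector that gets added to the token embedding; a minimal implementation:

```python
import math

def sinusoidal_pe(pos, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]

# Positions 1 and 2 get distinct vectors, so "cat ate fish" != "fish ate cat".
assert sinusoidal_pe(1, 8) != sinusoidal_pe(2, 8)
print(sinusoidal_pe(0, 4))  # [0.0, 1.0, 0.0, 1.0] — sin(0), cos(0) pairs
```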

🧮#4

MLP & Matmul

Matrix multiplication, weight initialization, and the universal approximation theorem

A 2-layer MLP can approximate any function — so why do we need 96 layers?

🎯#5

Self-Attention

The core of Transformers — derive this on a whiteboard

GPT-4 reads ‘The trophy didn’t fit in the suitcase because it was too big.’ What does ‘it’ refer to? Humans know instantly. Without attention, models can’t.
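The mechanism that resolves 'it' fits in a few lines. This numpy sketch of scaled dot-product attention uses random weights (the shapes are illustrative, not GPT-4's):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """softmax(Q K^T / sqrt(d_k)) V: each output row is a weighted mix of all
    value vectors — how 'it' pulls in information from 'trophy' and 'suitcase'."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                       # 5 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8) — same shape in, same shape out
```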

🧠#6

Multi-Head Attention

One head isn't enough — each head learns different patterns

One head looks at the current word, another at the sentence 12 positions back, a third at syntactic structure. Why do you need 32 of them in parallel?
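Mechanically, "32 heads in parallel" is mostly a reshape: split d_model into per-head slices, attend in each slice independently, concatenate back. A shape-level sketch (sizes are illustrative):

```python
import numpy as np

d_model, n_heads, n_tokens = 64, 8, 10
X = np.random.default_rng(0).normal(size=(n_tokens, d_model))

# Split the 64-dim representation into 8 independent 8-dim views,
# one per head; each head runs its own attention over its slice.
heads = X.reshape(n_tokens, n_heads, d_model // n_heads).transpose(1, 0, 2)
print(heads.shape)  # (8, 10, 8): 8 heads, each seeing every token in 8 dims
```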

⚙️#7

FFN & Activations

Where 67% of parameters live — and what they memorize

Where does GPT store the fact that Paris is in France?
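The "67%" in the subtitle falls out of a back-of-envelope parameter count per layer (GPT-style assumptions: d_ff = 4·d_model, biases ignored):

```python
# Per-layer parameter count for a GPT-style block.
d = 12288                      # d_model for GPT-3 175B
attn = 4 * d * d               # W_q, W_k, W_v, W_o — each d x d
ffn = 2 * d * (4 * d)          # up-projection d -> 4d plus down-projection 4d -> d
frac = ffn / (attn + ffn)      # 8d^2 / 12d^2 = 2/3
print(f"FFN share of layer params: {frac:.0%}")  # 67%
```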

🔗#8

LayerNorm & Residuals

The glue that makes deep transformers trainable

Delete one line and a 96-layer model becomes untrainable
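LayerNorm itself is a short function: normalize each token's activation vector to zero mean and unit variance, then rescale. A numpy sketch (the "one line" whose deletion hurts most is the residual `x + sublayer(x)`, not shown here):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Per-token normalization; keeping activations in a fixed range is part
    of what lets gradients survive 96 layers."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(), y.std())  # ≈ 0.0 and ≈ 1.0
```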

🏗️#9

The Full Forward Pass

Watch a token travel through a complete Transformer block

What happens to the word "cat" in 0.003 seconds?

🎓 Training (13)

🔙#10

Backpropagation

Chain rule, computation graphs, and autograd — how gradients flow backward

Naive finite differences would need 175 billion forward passes to compute GPT-3's gradients. Backprop does it in one. Here's the trick.
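The scaling difference is easy to see on a toy loss: finite differences need one perturbed forward pass per parameter, while the chain rule yields every partial derivative at once.

```python
# Toy "loss": f(w) = sum(w_i^2), so df/dw_i = 2*w_i by the chain rule.
def f(w):
    return sum(wi * wi for wi in w)

def grad_analytic(w):
    """All gradients in one pass — what backprop generalizes to any graph."""
    return [2 * wi for wi in w]

def grad_finite_diff(w, h=1e-6):
    """One perturbed forward pass per parameter — 175B passes for GPT-3."""
    g = []
    for i in range(len(w)):
        w2 = list(w)
        w2[i] += h
        g.append((f(w2) - f(w)) / h)
    return g

w = [1.0, -2.0, 3.0]
print(grad_analytic(w))  # [2.0, -4.0, 6.0]; finite differences agree, slowly
```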

📐#11

Optimizers

SGD → Momentum → Adam → AdamW, learning rate schedules, and weight decay

AdamW fixes a 5-year-old bug in Adam that silently hurts generalization
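The fix is one line. In Adam, L2 regularization was folded into the gradient and then distorted by the adaptive scaling; AdamW applies weight decay directly to the weight instead. A single-parameter sketch:

```python
import math

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update on a scalar parameter."""
    m = b1 * m + (1 - b1) * g              # first-moment EMA
    v = b2 * v + (1 - b2) * g * g          # second-moment EMA
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)   # Adam step
    w = w - lr * wd * w                    # decoupled decay — the AdamW change
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
w, m, v = adamw_step(w, g=0.5, m=m, v=v, t=1)
print(w)  # slightly below 0.999: Adam step, then a small direct shrink
```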

📉#12

Pre-training & Loss

Next-token prediction, cross-entropy, and perplexity

GPT-3 trained on 300B tokens and never saw a single labeled example — yet learned grammar, facts, math, and reasoning from one loss function.
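That one loss function is cross-entropy on the next token, and perplexity is just its exponential. A worked toy example:

```python
import math

def cross_entropy(probs, target):
    """Negative log-probability the model assigned to the true next token."""
    return -math.log(probs[target])

# Model's predicted distribution over a 4-token toy vocabulary.
probs = [0.7, 0.1, 0.1, 0.1]
loss = cross_entropy(probs, target=0)
print(f"loss = {loss:.3f}, perplexity = {math.exp(loss):.3f}")
# Perplexity = exp(loss): the model is as uncertain as a uniform choice
# over that many tokens — here about 1.43, i.e. fairly confident.
```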

🗃️#13

Data Curation

FineWeb, filtering, dedup — data quality beats data quantity

LIMA trained on 1,000 examples and matched GPT-3.5

📈#14

Scaling Laws

How big? How much data? Chinchilla has the answer

Why Llama-2 beats GPT-3 with half the parameters
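The Chinchilla rule of thumb, roughly 20 training tokens per parameter with training compute ≈ 6·N·D FLOPs, makes the comparison quantitative (these are approximations, not exact laws):

```python
def optimal_tokens(n_params, tokens_per_param=20):
    """Chinchilla-style compute-optimal token budget (rule of thumb)."""
    return tokens_per_param * n_params

def train_flops(n_params, n_tokens):
    """Standard estimate: ~6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

gpt3_params = 175e9
print(f"GPT-3 'should' have seen {optimal_tokens(gpt3_params)/1e12:.1f}T tokens "
      f"(it saw 0.3T)")                    # heavily under-trained by this rule
print(f"Training compute: {train_flops(gpt3_params, 300e9):.2e} FLOPs")
```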

🔥#15

GPU & Mixed Precision

CUDA memory hierarchy, fp16/bf16/fp8, loss scaling, and torch.autocast

bf16 training uses half the memory with zero accuracy loss — why wasn't this the default?

🖥️#16

Distributed Training

DDP, ZeRO, FSDP — training across thousands of GPUs

A 70B-param model needs 280 GB just to hold weights at fp32 — no single GPU exists with that memory. Here's how training across 1,000 GPUs stays in sync.
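The 280 GB figure is simple arithmetic, and full training state is roughly 4x worse once gradients and Adam's two moment buffers are counted (a common rule of thumb; activations come on top):

```python
# Why a 70B model can't train on one GPU.
params = 70e9
bytes_fp32 = 4
weights_gb = params * bytes_fp32 / 1e9
train_state_gb = weights_gb * 4        # weights + gradients + Adam m and v
print(f"weights: {weights_gb:.0f} GB, fp32 training state: {train_state_gb:.0f} GB")
# 280 GB of weights alone vs. 80 GB on an H100 — ZeRO/FSDP shard this state.
```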

🔧#17

Fine-tuning & LoRA

Adapt a model with 0.1% of parameters

LoRA trains 0.1% of parameters and matches full fine-tuning
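The parameter count behind the claim: LoRA freezes a d×d weight and trains only two low-rank factors A (d×r) and B (r×d), i.e. 2·d·r new parameters. With typical sizes (illustrative values below):

```python
# LoRA trainable-parameter fraction for one adapted weight matrix.
d, r = 4096, 8                 # hidden size and LoRA rank (typical values)
full = d * d                   # frozen original matrix
lora = 2 * d * r               # trainable A (d x r) and B (r x d)
print(f"trainable fraction per adapted matrix: {lora / full:.2%}")  # 0.39%
```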

🎯#18

SFT & Post-Training Pipeline

Loss masking, chat templates, rejection sampling, distillation

InstructGPT’s SFT used 13K labeled examples — and that alone beat GPT-3 (175B base) on human preference. Why does so little labeled data do so much?
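Part of the answer is loss masking: SFT computes cross-entropy only on the assistant's tokens, so every labeled example teaches response behavior rather than prompt reproduction. A toy sketch (the loss values are made up):

```python
# Mask out prompt tokens so only response tokens contribute to the SFT loss.
losses = [2.1, 1.8, 0.9, 0.7, 0.5]   # per-token cross-entropy (toy numbers)
mask   = [0,   0,   1,   1,   1]     # 0 = prompt token, 1 = assistant token
masked = sum(l * m for l, m in zip(losses, mask)) / sum(mask)
print(f"loss over response tokens only: {masked:.3f}")  # 0.700
```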

🎮#19

RL Foundations

MDPs, policy gradient, PPO — the math before RLHF

REINFORCE was invented in 1992. 30 years later it’s training the world’s most capable AI — with one fundamental addition: a baseline.
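Why the baseline matters: subtracting a constant b from the reward leaves the expected policy gradient unchanged (since E[∇log π] = 0 under the policy), but it can shrink the per-sample term's magnitude dramatically. A numeric sketch of that second moment with made-up rewards:

```python
# Second moment of the per-sample REINFORCE term (reward - baseline),
# with and without a mean-reward baseline. Lower second moment ≈ lower
# variance of the gradient estimate.
rewards = [10.0, 11.0, 9.0, 12.0, 8.0]
b = sum(rewards) / len(rewards)                           # baseline = mean = 10
second_moment_raw  = sum(r * r for r in rewards) / len(rewards)
second_moment_base = sum((r - b) ** 2 for r in rewards) / len(rewards)
print(second_moment_raw, second_moment_base)  # 102.0 vs 2.0 — a 51x reduction
```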

🎯#20

RLHF & Reward Models

Teaching models what humans prefer — the 3-stage pipeline

OpenAI fine-tuned InstructGPT on 13,000 labeled examples and it beat GPT-3 (175B) on human preference. Smaller dataset, smaller model, better outputs — because RLHF teaches preference, not prediction.

⚖️#21

DPO, GRPO & Alternatives

Skip the reward model — direct preference optimization

DPO skips the reward model and PPO entirely — yet matches RLHF quality. Here's the math that makes 2 stages collapse into 1.
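The collapsed objective fits in one function: the loss is computed directly from log-probabilities of the chosen and rejected answers under the policy and a frozen reference model, with no reward model or PPO rollouts. A scalar sketch (the log-prob values are illustrative):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: -log sigmoid(beta * (policy log-ratio margin over the reference))."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))

# The policy prefers the chosen answer more than the reference does -> loss
# drops below log(2), the value at indifference.
loss = dpo_loss(logp_chosen=-10.0, logp_rejected=-12.0,
                ref_chosen=-11.0, ref_rejected=-11.0)
print(loss)
```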

🧬#22

Model Merging

Combine fine-tuned models without retraining

Three fine-tuned models. SLERP into one. The merge wins on tasks none of them mastered alone — and you spent zero GPU-hours.
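SLERP itself is a short formula: interpolate along the hypersphere between two weight vectors instead of averaging them, which preserves norm better than a plain lerp. A 2-d sketch (real merges apply this per weight tensor):

```python
import numpy as np

def slerp(a, b, t):
    """Spherical linear interpolation between vectors a and b at fraction t."""
    a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
    omega = np.arccos(np.clip(a_n @ b_n, -1.0, 1.0))   # angle between them
    if omega < 1e-8:                                   # nearly parallel: lerp
        return (1 - t) * a + t * b
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mid = slerp(a, b, 0.5)
print(mid, np.linalg.norm(mid))   # stays on the unit circle: norm ≈ 1.0
```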

Inference (6)

🧩 Architectures (6)

🚀 Applications (6)

🛡️ Trust & Evaluation (6)

🎯 Interview Prep (1)

⚙️ AI Engineering (19)

⚙️#46

Agent Harness Architecture

Agentic loops, tool orchestration, permission systems, and context management

Claude Code runs a while(true) loop — here's what's inside

🔧#47

Tool System

Tool interface, Zod schemas, registry, orchestration, and parallel execution

5 Grep calls run in parallel, but Bash always waits its turn — why?

🤖#48

Sub-agents

Context isolation, worktrees, background execution, and result aggregation

Each sub-agent gets a fresh 200K context window — the parent keeps working

📝#49

Commands & Skills

Slash commands, skill markdown files, prompt injection, and the command registry

/compact is instant but 'compact this' takes 3 seconds — one never hits the API

🔌#50

Plugins & MCP

Model Context Protocol, external tool servers, plugin lifecycle, and transport layers

Claude doesn't know if a tool is built-in or from an MCP server — by design

🗄️#51

State Management

Dual state systems: React context for UI, module state for services

Two state systems coexist — one triggers re-renders, one doesn't. Mix them up and the terminal freezes.

🗜️#52

Context Compaction

Auto-compact, reactive compact, microcompact, context collapse, and token budgets

At 80% context usage, the agent silently summarizes its own history to keep going

🖥️#53

Terminal UI (Ink)

React reconciler for terminals, Yoga flexbox, ANSI rendering, and keyboard focus

It's React — but instead of DOM nodes, it writes ANSI escape codes to stdout

🧠#54

Memory System

File-based persistent memory, memory types, auto-save triggers, and cross-session recall

Claude remembers you're a senior engineer — across sessions, without a database

🔒#55

Hooks & Permissions

PreToolUse/PostToolUse hooks, 5-layer permission hierarchy, and safety gates

A shell script you wrote can veto any tool call before Claude even sees the result

📋#56

Prompt Engineering (System)

System prompt assembly, cache boundary optimization, dynamic sections, and prompt variants

The system prompt has a secret boundary — everything before it is cached, everything after is fresh

#57

Configuration & Schemas

Settings.json, Zod validation, feature flags, MDM policies, and config hierarchy

Zod validates every key at startup — one typo in settings.json blocks the entire CLI from booting.

🌉#58

Bridges & IDE Integration

WebSocket bridge, VS Code/JetBrains extensions, permission callbacks, and message routing

A WebSocket reconnect drops to 0ms perceived latency for the user — but rebuilds the entire IDE state in 3 round trips. Here’s why that’s a design constraint, not a bug.

🌊#59

Streaming & API Layer

Async generators, queryModelWithStreaming, SSE parsing, and backpressure

Tokens appear one by one because five async generators pipe data like Unix pipes

🛟#60

Error Recovery

Reactive compact retry, max output tokens escalation, abort handling, and graceful degradation

The API says 'prompt too long' — the agent silently compacts and retries before you notice

🔮#61

Speculative Execution

Parallel speculation, overlay filesystems, safe tool subsets, and acceptance criteria

While you're still typing, a speculative agent already searched the codebase for you

👔#62

Coordinator/Worker Pattern

Multi-agent coordination, restricted tool sets, environment gating, and task distribution

The coordinator writes prompts, not code — it manages a team of worker agents

💾#63

Session Persistence

Session JSON, /resume reconstruction, message history, file snapshots, and attribution

Close the terminal, reopen it, type --resume — the conversation continues exactly where you left off

💰#64

Cost Tracking & Budgets

Token counting, budget limits, per-model pricing, rate limit handling, and spend alerts

Claude Code emits cost events on every API response. Miss one and a runaway agent burns $200 before the budget gate fires.

📐 Design Reviews (17)

📐#67

The Design Doc

Working backwards from the SLO — an annotated, worked design doc for a real ML endpoint

A system with p99=500ms costs 3× more GPU than one with p99=1s. The right SLO choice is the entire design — yet most engineers write the SLO last.

💰#68

Cost Accounting & Eval-Driven Design

Cost-per-bad-day, LLM-judge rubrics, golden-set sizing — design flows from the eval, not the architecture

A model that scores 90% on your offline eval can drop 15% on user satisfaction — because the eval measured the wrong thing. Write the eval before the architecture.

💬#69

Case: Design ChatGPT

Multi-tenant chat — SLOs, model routing, conversation state

Two billion messages a day — where does the money actually go?

🔭#70

Case: Design Perplexity

RAG + live web search — freshness, citations, retrieval fusion

Half retrieval system, half LLM — which half should you optimize first?

🤖#71

Case: Design Claude Code / Cursor

Coding agent at scale — context builder, tools, sandboxing

The model is cheap. The context is what costs you.

🎨#72

Case: Design Midjourney

Multi-tenant diffusion — queueing, step budgets, content safety, GPU economics

A 50-step generation that fails at step 48 still costs you 48 steps

📱#73

Case: Design TikTok For-You Ranking

Two-tower retrieval + ranker + feature store — classical ML@scale canon

Why the Explore/Exploit slider matters more than the model

🧭#74

Case: Design an Embeddings Platform

Pinterest-style — backfill, drift, model upgrades, serving with HNSW

The day you change your embedding model, every index goes stale

🔥#75

Case: Design Llama Training Infra

Data pipeline + checkpoint management + failure-tolerant orchestration

At 16K GPUs, a GPU fails every 3 hours — design for it

🏗️#76

Case: Design an Agent Platform

Multi-agent infra — sandboxing, tool registries, trajectory eval, spend control

An agent that spawns agents — where does the budget live?

💎#77

Case: Design Gemini

Multi-modal frontier serving — TPU stack, 1M-token attention, safety classifier chain

1M-token context is cheap to promise, expensive to serve — here's the bill

📓#78

Case: Design NotebookLM

Long-context RAG over user docs — source-pinned citations, audio-overview pipeline

Upload 50 PDFs, ask one question — which half of the stack wins?

🎬#79

Case: Design Sora

Text-to-video at scale — diffusion transformer GPU economics, safety on generative video

A 10-second clip costs more GPU-hours than your laptop's lifetime

🎭#80

Case: Design Character.ai

Consumer LLM at scale — MQA, int8, trained-from-scratch, sub-$1/user/month cost floor

20B tokens served per day on a consumer-priced subscription — how?

🧮#81

Compare: RAG Systems

Perplexity vs NotebookLM vs ChatGPT-search vs Phind — retriever, grounding, citation side-by-side

Same question, four systems, four answers — whose retriever wins?

⚖️#82

Compare: SLO ↔ Cost

Interactive sensitivity — slide p99, watch GPU count, $/req, and cache hit-rate move together

Cut p99 latency in half — how much more expensive does it get?

🧯#83

Compare: Failure-Mode Taxonomy

One master table of every failure mode across 14 real systems — with detect→escalate→rollback playbooks

The 3am page happens — you have 30 seconds to pick the right lever

Interview Focus by Company

Google / DeepMind Attention derivation, RL fundamentals (MDPs), distributed training, ML system design

Anthropic Safety, Constitutional AI, interpretability, RLHF depth, first principles

OpenAI Scaling laws, reasoning models, LLM serving at scale, process reward models

Meta Vision Transformers, multimodal, distributed training, MoE

Databricks RAG at scale, serving optimization, data curation, evaluation