Transformer Math

ML internals, defended every number.

A working engineer’s study notebook for the AI era — 84 modules on transformer internals, training, inference, agent systems, and production design reviews. Every claim cited, every formula derived, every dollar priced. Built for L6+ ML / AI engineering loops at DeepMind, Anthropic, Meta, OpenAI — and for studying alongside your favorite LLM.

LLM-friendly: every module is also available as raw markdown at /raw/<id>.md · llms.txt for AI ingestion · changelog for what’s changed.

🏗️ The Transformer (10)

🏗️#0

High-Level Overview

The complete Transformer pipeline — from raw text to next-token prediction

GPT-3 predicts each token in 6ms — but processes the entire 96-layer forward pass to do it. Why can’t it just skip layers it already ‘knows’?

🔤#1

Tokenization

BPE, vocabulary size, and why GPT can't count letters

Why can't GPT count letters in "strawberry"?
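A toy greedy longest-match tokenizer makes the teaser concrete (the vocabulary here is invented, not GPT's real BPE merges): the model receives whole-chunk token IDs, so individual letters are invisible to it.

```python
# Toy illustration (not GPT's real vocabulary): a greedy longest-match
# tokenizer splits "strawberry" into multi-letter chunks, so the model
# sees token IDs, never individual letters.
vocab = {"straw": 0, "berry": 1, "st": 2, "raw": 3, "b": 4, "e": 5, "r": 6, "y": 7}

def tokenize(text, vocab):
    """Greedy longest-prefix match, a simple stand-in for BPE segmentation."""
    tokens = []
    while text:
        for end in range(len(text), 0, -1):   # try the longest prefix first
            if text[:end] in vocab:
                tokens.append(text[:end])
                text = text[end:]
                break
        else:
            raise ValueError("untokenizable input")
    return tokens

print(tokenize("strawberry", vocab))  # ['straw', 'berry'] — two opaque IDs
# Counting the r's means looking inside chunks the model never sees as letters.
```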

📊#2

Embeddings

Turning token IDs into meaningful vectors

Why is 'king' - 'man' + 'woman' = 'queen'?
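A minimal sketch of the analogy, using hand-crafted 3-dimensional vectors rather than trained embeddings (real models use hundreds of dimensions, and the arithmetic only holds approximately):

```python
import numpy as np

# Hand-crafted toy embeddings; dimensions loosely mean [royalty, male, female].
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.8, 0.1]),
    "woman": np.array([0.1, 0.1, 0.8]),
}

def nearest(vec, emb, exclude):
    """Return the word whose embedding has highest cosine similarity to vec."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(vec, emb[w]))

v = emb["king"] - emb["man"] + emb["woman"]   # royalty stays, gender flips
print(nearest(v, emb, exclude={"king", "man", "woman"}))  # queen
```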

📍#3

Positional Encoding

Attention has no sense of order — how do we fix that?

"cat ate fish" = "fish ate cat" without this
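One standard fix is the sinusoidal encoding from the original Transformer paper, which gives every position a unique fingerprint vector that gets added to the token embedding; a minimal implementation:

```python
import math

def sinusoidal_pe(pos, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]

# Positions 1 and 2 get distinct vectors, so "cat ate fish" != "fish ate cat".
assert sinusoidal_pe(1, 8) != sinusoidal_pe(2, 8)
print(sinusoidal_pe(0, 4))  # [0.0, 1.0, 0.0, 1.0] — sin(0), cos(0) pairs
```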

🧮#4

MLP & Matmul

Matrix multiplication, weight initialization, and the universal approximation theorem

A 2-layer MLP can approximate any function — so why do we need 96 layers?

🎯#5

Self-Attention

The core of Transformers — derive this on a whiteboard

GPT-4 reads ‘The trophy didn’t fit in the suitcase because it was too big.’ What does ‘it’ refer to? Humans know instantly. Without attention, models can’t.
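The mechanism that resolves 'it' fits in a few lines. This numpy sketch of scaled dot-product attention uses random weights (the shapes are illustrative, not GPT-4's):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """softmax(Q K^T / sqrt(d_k)) V: each output row is a weighted mix of all
    value vectors — how 'it' pulls in information from 'trophy' and 'suitcase'."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                       # 5 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8) — same shape in, same shape out
```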

🧠#6

Multi-Head Attention

One head isn't enough — each head learns different patterns

One head looks at the current word, another at the sentence 12 positions back, a third at syntactic structure. Why do you need 32 of them in parallel?
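Mechanically, "32 heads in parallel" is mostly a reshape: split d_model into per-head slices, attend in each slice independently, concatenate back. A shape-level sketch (sizes are illustrative):

```python
import numpy as np

d_model, n_heads, n_tokens = 64, 8, 10
X = np.random.default_rng(0).normal(size=(n_tokens, d_model))

# Split the 64-dim representation into 8 independent 8-dim views,
# one per head; each head runs its own attention over its slice.
heads = X.reshape(n_tokens, n_heads, d_model // n_heads).transpose(1, 0, 2)
print(heads.shape)  # (8, 10, 8): 8 heads, each seeing every token in 8 dims
```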

⚙️#7

FFN & Activations

Where 67% of parameters live — and what they memorize

Where does GPT store the fact that Paris is in France?
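The "67%" in the subtitle falls out of a back-of-envelope parameter count per layer (GPT-style assumptions: d_ff = 4·d_model, biases ignored):

```python
# Per-layer parameter count for a GPT-style block.
d = 12288                      # d_model for GPT-3 175B
attn = 4 * d * d               # W_q, W_k, W_v, W_o — each d x d
ffn = 2 * d * (4 * d)          # up-projection d -> 4d plus down-projection 4d -> d
frac = ffn / (attn + ffn)      # 8d^2 / 12d^2 = 2/3
print(f"FFN share of layer params: {frac:.0%}")  # 67%
```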

🔗#8

LayerNorm & Residuals

The glue that makes deep transformers trainable

Delete one line and a 96-layer model becomes untrainable
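LayerNorm itself is a short function: normalize each token's activation vector to zero mean and unit variance, then rescale. A numpy sketch (the "one line" whose deletion hurts most is the residual `x + sublayer(x)`, not shown here):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Per-token normalization; keeping activations in a fixed range is part
    of what lets gradients survive 96 layers."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(), y.std())  # ≈ 0.0 and ≈ 1.0
```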

🏗️#9

The Full Forward Pass

Watch a token travel through a complete Transformer block

What happens to the word "cat" in 0.003 seconds?

🎓 Training (13)

🔙#10

Backpropagation

Chain rule, computation graphs, and autograd — how gradients flow backward

Naive finite differences would need 175 billion forward passes to compute GPT-3's gradients. Backprop does it in one. Here's the trick.
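The scaling difference is easy to see on a toy loss: finite differences need one perturbed forward pass per parameter, while the chain rule yields every partial derivative at once.

```python
# Toy "loss": f(w) = sum(w_i^2), so df/dw_i = 2*w_i by the chain rule.
def f(w):
    return sum(wi * wi for wi in w)

def grad_analytic(w):
    """All gradients in one pass — what backprop generalizes to any graph."""
    return [2 * wi for wi in w]

def grad_finite_diff(w, h=1e-6):
    """One perturbed forward pass per parameter — 175B passes for GPT-3."""
    g = []
    for i in range(len(w)):
        w2 = list(w)
        w2[i] += h
        g.append((f(w2) - f(w)) / h)
    return g

w = [1.0, -2.0, 3.0]
print(grad_analytic(w))  # [2.0, -4.0, 6.0]; finite differences agree, slowly
```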

📐#11

Optimizers

SGD → Momentum → Adam → AdamW, learning rate schedules, and weight decay

AdamW fixes a 5-year-old bug in Adam that silently hurts generalization
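The fix is one line. In Adam, L2 regularization was folded into the gradient and then distorted by the adaptive scaling; AdamW applies weight decay directly to the weight instead. A single-parameter sketch:

```python
import math

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update on a scalar parameter."""
    m = b1 * m + (1 - b1) * g              # first-moment EMA
    v = b2 * v + (1 - b2) * g * g          # second-moment EMA
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)   # Adam step
    w = w - lr * wd * w                    # decoupled decay — the AdamW change
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
w, m, v = adamw_step(w, g=0.5, m=m, v=v, t=1)
print(w)  # slightly below 0.999: Adam step, then a small direct shrink
```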

📉#12

Pre-training & Loss

Next-token prediction, cross-entropy, and perplexity

GPT-3 trained on 300B tokens and never saw a single labeled example — yet learned grammar, facts, math, and reasoning from one loss function.
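That one loss function is cross-entropy on the next token, and perplexity is just its exponential. A worked toy example:

```python
import math

def cross_entropy(probs, target):
    """Negative log-probability the model assigned to the true next token."""
    return -math.log(probs[target])

# Model's predicted distribution over a 4-token toy vocabulary.
probs = [0.7, 0.1, 0.1, 0.1]
loss = cross_entropy(probs, target=0)
print(f"loss = {loss:.3f}, perplexity = {math.exp(loss):.3f}")
# Perplexity = exp(loss): the model is as uncertain as a uniform choice
# over that many tokens — here about 1.43, i.e. fairly confident.
```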

🗃️#13

Data Curation

FineWeb, filtering, dedup — data quality beats data quantity

LIMA trained on 1,000 examples and matched GPT-3.5

📈#14

Scaling Laws

How big? How much data? Chinchilla has the answer

Why Llama-2 beats GPT-3 with half the parameters
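The Chinchilla rule of thumb, roughly 20 training tokens per parameter with training compute ≈ 6·N·D FLOPs, makes the comparison quantitative (these are approximations, not exact laws):

```python
def optimal_tokens(n_params, tokens_per_param=20):
    """Chinchilla-style compute-optimal token budget (rule of thumb)."""
    return tokens_per_param * n_params

def train_flops(n_params, n_tokens):
    """Standard estimate: ~6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

gpt3_params = 175e9
print(f"GPT-3 'should' have seen {optimal_tokens(gpt3_params)/1e12:.1f}T tokens "
      f"(it saw 0.3T)")                    # heavily under-trained by this rule
print(f"Training compute: {train_flops(gpt3_params, 300e9):.2e} FLOPs")
```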

🔥#15

GPU & Mixed Precision

CUDA memory hierarchy, fp16/bf16/fp8, loss scaling, and torch.autocast

bf16 training uses half the memory with zero accuracy loss — why wasn't this the default?

🖥️#16

Distributed Training

DDP, ZeRO, FSDP — training across thousands of GPUs

A 70B-param model needs 280 GB just to hold weights at fp32 — no single GPU exists with that memory. Here's how training across 1,000 GPUs stays in sync.
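The 280 GB figure is simple arithmetic, and full training state is roughly 4x worse once gradients and Adam's two moment buffers are counted (a common rule of thumb; activations come on top):

```python
# Why a 70B model can't train on one GPU.
params = 70e9
bytes_fp32 = 4
weights_gb = params * bytes_fp32 / 1e9
train_state_gb = weights_gb * 4        # weights + gradients + Adam m and v
print(f"weights: {weights_gb:.0f} GB, fp32 training state: {train_state_gb:.0f} GB")
# 280 GB of weights alone vs. 80 GB on an H100 — ZeRO/FSDP shard this state.
```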

🔧#17

Fine-tuning & LoRA

Adapt a model with 0.1% of parameters

LoRA trains 0.1% of parameters and matches full fine-tuning
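The parameter count behind the claim: LoRA freezes a d×d weight and trains only two low-rank factors A (d×r) and B (r×d), i.e. 2·d·r new parameters. With typical sizes (illustrative values below):

```python
# LoRA trainable-parameter fraction for one adapted weight matrix.
d, r = 4096, 8                 # hidden size and LoRA rank (typical values)
full = d * d                   # frozen original matrix
lora = 2 * d * r               # trainable A (d x r) and B (r x d)
print(f"trainable fraction per adapted matrix: {lora / full:.2%}")  # 0.39%
```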

🎯#18

SFT & Post-Training Pipeline

Loss masking, chat templates, rejection sampling, distillation

InstructGPT’s SFT used 13K labeled examples — and that alone beat GPT-3 (175B base) on human preference. Why does so little labeled data do so much?
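Part of the answer is loss masking: SFT computes cross-entropy only on the assistant's tokens, so every labeled example teaches response behavior rather than prompt reproduction. A toy sketch (the loss values are made up):

```python
# Mask out prompt tokens so only response tokens contribute to the SFT loss.
losses = [2.1, 1.8, 0.9, 0.7, 0.5]   # per-token cross-entropy (toy numbers)
mask   = [0,   0,   1,   1,   1]     # 0 = prompt token, 1 = assistant token
masked = sum(l * m for l, m in zip(losses, mask)) / sum(mask)
print(f"loss over response tokens only: {masked:.3f}")  # 0.700
```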

🎮#19

RL Foundations

MDPs, policy gradient, PPO — the math before RLHF

REINFORCE was invented in 1992. 30 years later it’s training the world’s most capable AI — with one fundamental addition: a baseline.
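Why the baseline matters: subtracting a constant b from the reward leaves the expected policy gradient unchanged (since E[∇log π] = 0 under the policy), but it can shrink the per-sample term's magnitude dramatically. A numeric sketch of that second moment with made-up rewards:

```python
# Second moment of the per-sample REINFORCE term (reward - baseline),
# with and without a mean-reward baseline. Lower second moment ≈ lower
# variance of the gradient estimate.
rewards = [10.0, 11.0, 9.0, 12.0, 8.0]
b = sum(rewards) / len(rewards)                           # baseline = mean = 10
second_moment_raw  = sum(r * r for r in rewards) / len(rewards)
second_moment_base = sum((r - b) ** 2 for r in rewards) / len(rewards)
print(second_moment_raw, second_moment_base)  # 102.0 vs 2.0 — a 51x reduction
```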

🎯#20

RLHF & Reward Models

Teaching models what humans prefer — the 3-stage pipeline

OpenAI fine-tuned InstructGPT on 13,000 labeled examples and it beat GPT-3 (175B) on human preference. Smaller dataset, smaller model, better outputs — because RLHF teaches preference, not prediction.

⚖️#21

DPO, GRPO & Alternatives

Skip the reward model — direct preference optimization

DPO skips the reward model and PPO entirely — yet matches RLHF quality. Here's the math that makes 2 stages collapse into 1.
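The collapsed objective fits in one function: the loss is computed directly from log-probabilities of the chosen and rejected answers under the policy and a frozen reference model, with no reward model or PPO rollouts. A scalar sketch (the log-prob values are illustrative):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: -log sigmoid(beta * (policy log-ratio margin over the reference))."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))

# The policy prefers the chosen answer more than the reference does -> loss
# drops below log(2), the value at indifference.
loss = dpo_loss(logp_chosen=-10.0, logp_rejected=-12.0,
                ref_chosen=-11.0, ref_rejected=-11.0)
print(loss)
```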

🧬#22

Model Merging

Combine fine-tuned models without retraining

Three fine-tuned models. SLERP into one. The merge wins on tasks none of them mastered alone — and you spent zero GPU-hours.
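SLERP itself is a short formula: interpolate along the hypersphere between two weight vectors instead of averaging them, which preserves norm better than a plain lerp. A 2-d sketch (real merges apply this per weight tensor):

```python
import numpy as np

def slerp(a, b, t):
    """Spherical linear interpolation between vectors a and b at fraction t."""
    a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
    omega = np.arccos(np.clip(a_n @ b_n, -1.0, 1.0))   # angle between them
    if omega < 1e-8:                                   # nearly parallel: lerp
        return (1 - t) * a + t * b
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mid = slerp(a, b, 0.5)
print(mid, np.linalg.norm(mid))   # stays on the unit circle: norm ≈ 1.0
```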

Inference (6)

🧩 Architectures (6)

🚀 Applications (6)

🛡️ Trust & Evaluation (6)

🎯 Interview Prep (1)

⚙️ AI Engineering (19)

⚙️#46

Agent Harness Architecture

Agentic loops, tool orchestration, permission systems, and context management

Claude Code runs a while(true) loop — here's what's inside

🔧#47

Tool System

Tool interface, Zod schemas, registry, orchestration, and parallel execution

5 Grep calls run in parallel, but Bash always waits its turn — why?

🤖#48

Sub-agents

Context isolation, worktrees, background execution, and result aggregation

Each sub-agent gets a fresh 200K context window — the parent keeps working

📝#49

Commands & Skills

Slash commands, skill markdown files, prompt injection, and the command registry

/compact is instant but 'compact this' takes 3 seconds — one never hits the API

🔌#50

Plugins & MCP

Model Context Protocol, external tool servers, plugin lifecycle, and transport layers

Claude doesn't know if a tool is built-in or from an MCP server — by design

🗄️#51

State Management

Dual state systems: React context for UI, module state for services

Two state systems coexist — one triggers re-renders, one doesn't. Mix them up and the terminal freezes.

🗜️#52

Context Compaction

Auto-compact, reactive compact, microcompact, context collapse, and token budgets

At 80% context usage, the agent silently summarizes its own history to keep going

🖥️#53

Terminal UI (Ink)

React reconciler for terminals, Yoga flexbox, ANSI rendering, and keyboard focus

It's React — but instead of DOM nodes, it writes ANSI escape codes to stdout

🧠#54

Memory System

File-based persistent memory, memory types, auto-save triggers, and cross-session recall

Claude remembers you're a senior engineer — across sessions, without a database

🔒#55

Hooks & Permissions

PreToolUse/PostToolUse hooks, 5-layer permission hierarchy, and safety gates

A shell script you wrote can veto any tool call before Claude even sees the result

📋#56

Prompt Engineering (System)

System prompt assembly, cache boundary optimization, dynamic sections, and prompt variants

The system prompt has a secret boundary — everything before it is cached, everything after is fresh

#57

Configuration & Schemas

Settings.json, Zod validation, feature flags, MDM policies, and config hierarchy

Zod validates every key at startup — one typo in settings.json blocks the entire CLI from booting.

🌉#58

Bridges & IDE Integration

WebSocket bridge, VS Code/JetBrains extensions, permission callbacks, and message routing

A WebSocket reconnect drops to 0ms perceived latency for the user — but rebuilds the entire IDE state in 3 round trips. Here’s why that’s a design constraint, not a bug.

🌊#59

Streaming & API Layer

Async generators, queryModelWithStreaming, SSE parsing, and backpressure

Tokens appear one by one because five async generators pipe data like Unix pipes

🛟#60

Error Recovery

Reactive compact retry, max output tokens escalation, abort handling, and graceful degradation

The API says 'prompt too long' — the agent silently compacts and retries before you notice

🔮#61

Speculative Execution

Parallel speculation, overlay filesystems, safe tool subsets, and acceptance criteria

While you're still typing, a speculative agent already searched the codebase for you

👔#62

Coordinator/Worker Pattern

Multi-agent coordination, restricted tool sets, environment gating, and task distribution

The coordinator writes prompts, not code — it manages a team of worker agents

💾#63

Session Persistence

Session JSON, /resume reconstruction, message history, file snapshots, and attribution

Close the terminal, reopen it, type --resume — the conversation continues exactly where you left off

💰#64

Cost Tracking & Budgets

Token counting, budget limits, per-model pricing, rate limit handling, and spend alerts

Claude Code emits cost events on every API response. Miss one and a runaway agent burns $200 before the budget gate fires.

📐 Design Reviews (17)

📐#67

The Design Doc

Working backwards from the SLO — an annotated, worked design doc for a real ML endpoint

A system with p99=500ms costs 3× more GPU than one with p99=1s. The right SLO choice is the entire design — yet most engineers write the SLO last.

💰#68

Cost Accounting & Eval-Driven Design

Cost-per-bad-day, LLM-judge rubrics, golden-set sizing — design flows from the eval, not the architecture

A model that scores 90% on your offline eval can drop 15% on user satisfaction — because the eval measured the wrong thing. Write the eval before the architecture.

💬#69

Case: Design ChatGPT

Multi-tenant chat — SLOs, model routing, conversation state

Two billion messages a day — where does the money actually go?

🔭#70

Case: Design Perplexity

RAG + live web search — freshness, citations, retrieval fusion

Half retrieval system, half LLM — which half should you optimize first?

🤖#71

Case: Design Claude Code / Cursor

Coding agent at scale — context builder, tools, sandboxing

The model is cheap. The context is what costs you.

🎨#72

Case: Design Midjourney

Multi-tenant diffusion — queueing, step budgets, content safety, GPU economics

A 50-step generation that fails at step 48 still costs you 48 steps

📱#73

Case: Design TikTok For-You Ranking

Two-tower retrieval + ranker + feature store — classical ML@scale canon

Why the Explore/Exploit slider matters more than the model

🧭#74

Case: Design an Embeddings Platform

Pinterest-style — backfill, drift, model upgrades, serving with HNSW

The day you change your embedding model, every index goes stale

🔥#75

Case: Design Llama Training Infra

Data pipeline + checkpoint management + failure-tolerant orchestration

At 16K GPUs, a GPU fails every 3 hours — design for it

🏗️#76

Case: Design an Agent Platform

Multi-agent infra — sandboxing, tool registries, trajectory eval, spend control

An agent that spawns agents — where does the budget live?

💎#77

Case: Design Gemini

Multi-modal frontier serving — TPU stack, 1M-token attention, safety classifier chain

1M-token context is cheap to promise, expensive to serve — here's the bill

📓#78

Case: Design NotebookLM

Long-context RAG over user docs — source-pinned citations, audio-overview pipeline

Upload 50 PDFs, ask one question — which half of the stack wins?

🎬#79

Case: Design Sora

Text-to-video at scale — diffusion transformer GPU economics, safety on generative video

A 10-second clip costs more GPU-hours than your laptop's lifetime

🎭#80

Case: Design Character.ai

Consumer LLM at scale — MQA, int8, trained-from-scratch, sub-$1/user/month cost floor

20B tokens served per day on a consumer-priced subscription — how?

🧮#81

Compare: RAG Systems

Perplexity vs NotebookLM vs ChatGPT-search vs Phind — retriever, grounding, citation side-by-side

Same question, four systems, four answers — whose retriever wins?

⚖️#82

Compare: SLO ↔ Cost

Interactive sensitivity — slide p99, watch GPU count, $/req, and cache hit-rate move together

Cut p99 latency in half — how much more expensive does it get?

🧯#83

Compare: Failure-Mode Taxonomy

One master table of every failure mode across 14 real systems — with detect→escalate→rollback playbooks

The 3am page happens — you have 30 seconds to pick the right lever

Interview Focus by Company

Google / DeepMind Attention derivation, RL fundamentals (MDPs), distributed training, ML system design

Anthropic Safety, Constitutional AI, interpretability, RLHF depth, first principles

OpenAI Scaling laws, reasoning models, LLM serving at scale, process reward models

Meta Vision Transformers, multimodal, distributed training, MoE

Databricks RAG at scale, serving optimization, data curation, evaluation