# Transformer Math

> Tier-1 ML/AI engineering interview prep covering Transformers, Training, Inference, Architectures, Applications, Trust & Evaluation, AI Engineering, and Production Design Reviews. Every number cited, arithmetic verified.

> Source: https://personal-wiki.pages.dev | Full text: https://personal-wiki.pages.dev/llms-full.txt | Raw modules: https://personal-wiki.pages.dev/raw/.md

## The Transformer

- [High-Level Overview](https://personal-wiki.pages.dev/raw/transformer-overview.md): The complete Transformer pipeline — from raw text to next-token prediction
- [Tokenization](https://personal-wiki.pages.dev/raw/tokenization.md): BPE, vocabulary size, and why GPT can't count letters
- [Embeddings](https://personal-wiki.pages.dev/raw/embeddings.md): Turning token IDs into meaningful vectors
- [Positional Encoding](https://personal-wiki.pages.dev/raw/positional-encoding.md): Attention has no sense of order — how do we fix that?
- [MLP & Matmul](https://personal-wiki.pages.dev/raw/mlp-fundamentals.md): Matrix multiplication, weight initialization, and the universal approximation theorem
- [Self-Attention](https://personal-wiki.pages.dev/raw/attention.md): The core of Transformers — derive this on a whiteboard
- [Multi-Head Attention](https://personal-wiki.pages.dev/raw/multi-head.md): One head isn't enough — each head learns different patterns
- [FFN & Activations](https://personal-wiki.pages.dev/raw/ffn.md): Where 67% of parameters live — and what they memorize
- [LayerNorm & Residuals](https://personal-wiki.pages.dev/raw/layernorm.md): The glue that makes deep transformers trainable
- [The Full Forward Pass](https://personal-wiki.pages.dev/raw/forward-pass.md): Watch a token travel through a complete Transformer block

## Training

- [Backpropagation](https://personal-wiki.pages.dev/raw/backpropagation.md): Chain rule, computation graphs, and autograd — how gradients flow backward
- [Optimizers](https://personal-wiki.pages.dev/raw/optimizers.md): SGD → Momentum → Adam → AdamW, learning rate schedules, and weight decay
- [Pre-training & Loss](https://personal-wiki.pages.dev/raw/pretraining.md): Next-token prediction, cross-entropy, and perplexity
- [Data Curation](https://personal-wiki.pages.dev/raw/data-curation.md): FineWeb, filtering, dedup — data quality beats data quantity
- [Scaling Laws](https://personal-wiki.pages.dev/raw/scaling-laws.md): How big? How much data? Chinchilla has the answer
- [GPU & Mixed Precision](https://personal-wiki.pages.dev/raw/gpu-precision.md): CUDA memory hierarchy, fp16/bf16/fp8, loss scaling, and torch.autocast
- [Distributed Training](https://personal-wiki.pages.dev/raw/distributed-training.md): DDP, ZeRO, FSDP — training across thousands of GPUs
- [Fine-tuning & LoRA](https://personal-wiki.pages.dev/raw/fine-tuning.md): Adapt a model with 0.1% of parameters
- [SFT & Post-Training Pipeline](https://personal-wiki.pages.dev/raw/sft-post-training.md): Loss masking, chat templates, rejection sampling, distillation
- [RL Foundations](https://personal-wiki.pages.dev/raw/rl-foundations.md): MDPs, policy gradient, PPO — the math before RLHF
- [RLHF & Reward Models](https://personal-wiki.pages.dev/raw/rlhf.md): Teaching models what humans prefer — the 3-stage pipeline
- [DPO, GRPO & Alternatives](https://personal-wiki.pages.dev/raw/dpo.md): Skip the reward model — direct preference optimization

## Inference

- [KV Cache & Memory](https://personal-wiki.pages.dev/raw/kv-cache.md): Why generation is memory-bound and how to fix it
- [Flash Attention](https://personal-wiki.pages.dev/raw/flash-attention.md): Tiling, IO-awareness, and O(N) memory attention
- [Sampling & Decoding](https://personal-wiki.pages.dev/raw/sampling.md): Temperature, top-k, top-p — how the model picks the next token
- [Quantization](https://personal-wiki.pages.dev/raw/quantization.md): INT8, INT4, GPTQ, AWQ — shrink models without losing quality
- [Speculative Decoding](https://personal-wiki.pages.dev/raw/speculative-decoding.md): Small model drafts, big model verifies — parallel generation
- [LLM Deployment](https://personal-wiki.pages.dev/raw/llm-deployment.md): Serving stacks, continuous batching, latency vs throughput, vLLM, and API design

## Architectures

- [Mixture of Experts](https://personal-wiki.pages.dev/raw/moe.md): More parameters, same compute — the secret behind DeepSeek
- [Vision Transformers & CLIP](https://personal-wiki.pages.dev/raw/vit.md): Patch embeddings, contrastive learning, zero-shot classification
- [Multimodal LLMs](https://personal-wiki.pages.dev/raw/multimodal.md): How GPT-4V, Claude, and Gemini see images
- [Reasoning Models](https://personal-wiki.pages.dev/raw/reasoning.md): Chain-of-thought, o1, DeepSeek-R1, test-time compute
- [Verifiers & Process Reward](https://personal-wiki.pages.dev/raw/verifier-prm.md): PRMs, best-of-N, self-consistency — when to think longer
- [Diffusion Basics](https://personal-wiki.pages.dev/raw/diffusion.md): DDPM, latent diffusion, DiT — image generation from noise

## Applications

- [Prompt Engineering](https://personal-wiki.pages.dev/raw/prompt-engineering.md): System prompts, few-shot, structured output, tool schemas
- [Agents & ReAct](https://personal-wiki.pages.dev/raw/agents.md): Think → Act → Observe — the reasoning loop
- [Tool Use & Protocols](https://personal-wiki.pages.dev/raw/tool-use.md): Function calling, MCP, A2A — connecting agents to the world
- [RAG & Retrieval](https://personal-wiki.pages.dev/raw/rag.md): Ground LLM outputs in real data — reduce hallucination
- [Long Context & Context Engineering](https://personal-wiki.pages.dev/raw/long-context.md): Token budgeting, prompt caching, lost-in-the-middle, memory layering
- [Agent Evaluation](https://personal-wiki.pages.dev/raw/agent-eval.md): Trajectory eval, tool accuracy, and why agent eval is harder

## Trust & Evaluation

- [LLM Evaluation](https://personal-wiki.pages.dev/raw/evaluation.md): Benchmarks, LLM-as-judge, contamination, hallucination
- [Eval-Driven Development](https://personal-wiki.pages.dev/raw/eval-ops.md): Judge calibration, regression gating, launch criteria, eval ops
- [Interpretability](https://personal-wiki.pages.dev/raw/interpretability.md): Circuits, superposition, SAEs — what is the model computing?
- [Safety & Alignment](https://personal-wiki.pages.dev/raw/safety.md): Jailbreaking, alignment faking, and defenses that work
- [Mechanistic Interpretability](https://personal-wiki.pages.dev/raw/mech-interp.md): SAE training, activation patching, attribution graphs, circuit tracing, and feature steering
- [Induction Heads & ICL](https://personal-wiki.pages.dev/raw/induction-heads.md): The two-head circuit that powers in-context learning — and why it emerges as a phase transition

## Interview Prep

- [PyTorch Debugging](https://personal-wiki.pages.dev/raw/pytorch-debugging.md): NaN loss, double softmax, missing zero_grad — spot the bug

## AI Engineering

- [Agent Harness Architecture](https://personal-wiki.pages.dev/raw/agent-harness.md): Agentic loops, tool orchestration, permission systems, and context management
- [Tool System](https://personal-wiki.pages.dev/raw/tool-system.md): Tool interface, Zod schemas, registry, orchestration, and parallel execution
- [Sub-agents](https://personal-wiki.pages.dev/raw/sub-agents.md): Context isolation, worktrees, background execution, and result aggregation
- [Commands & Skills](https://personal-wiki.pages.dev/raw/commands-skills.md): Slash commands, skill markdown files, prompt injection, and the command registry
- [Plugins & MCP](https://personal-wiki.pages.dev/raw/plugins-mcp.md): Model Context Protocol, external tool servers, plugin lifecycle, and transport layers
- [State Management](https://personal-wiki.pages.dev/raw/state-management.md): Dual state systems: React context for UI, module state for services
- [Context Compaction](https://personal-wiki.pages.dev/raw/context-compaction.md): Auto-compact, reactive compact, microcompact, context collapse, and token budgets
- [Terminal UI (Ink)](https://personal-wiki.pages.dev/raw/terminal-ui.md): React reconciler for terminals, Yoga flexbox, ANSI rendering, and keyboard focus
- [Memory System](https://personal-wiki.pages.dev/raw/memory-system.md): File-based persistent memory, memory types, auto-save triggers, and cross-session recall
- [Hooks & Permissions](https://personal-wiki.pages.dev/raw/hooks-permissions.md): PreToolUse/PostToolUse hooks, 5-layer permission hierarchy, and safety gates
- [Prompt Engineering (System)](https://personal-wiki.pages.dev/raw/prompt-cache.md): System prompt assembly, cache boundary optimization, dynamic sections, and prompt variants
- [Configuration & Schemas](https://personal-wiki.pages.dev/raw/config-schemas.md): Settings.json, Zod validation, feature flags, MDM policies, and config hierarchy
- [Bridges & IDE Integration](https://personal-wiki.pages.dev/raw/bridges.md): WebSocket bridge, VS Code/JetBrains extensions, permission callbacks, and message routing
- [Streaming & API Layer](https://personal-wiki.pages.dev/raw/streaming-api.md): Async generators, queryModelWithStreaming, SSE parsing, and backpressure
- [Error Recovery](https://personal-wiki.pages.dev/raw/error-recovery.md): Reactive compact retry, max output tokens escalation, abort handling, and graceful degradation
- [Speculative Execution](https://personal-wiki.pages.dev/raw/speculative-execution.md): Parallel speculation, overlay filesystems, safe tool subsets, and acceptance criteria
- [Coordinator/Worker Pattern](https://personal-wiki.pages.dev/raw/coordinator-worker.md): Multi-agent coordination, restricted tool sets, environment gating, and task distribution
- [Session Persistence](https://personal-wiki.pages.dev/raw/session-persistence.md): Session JSON, /resume reconstruction, message history, file snapshots, and attribution
- [Cost Tracking & Budgets](https://personal-wiki.pages.dev/raw/cost-tracking.md): Token counting, budget limits, per-model pricing, rate limit handling, and spend alerts

## Design Reviews

- [The Design Doc](https://personal-wiki.pages.dev/raw/dr-methodology.md): Working backwards from the SLO — an annotated, worked design doc for a real ML endpoint
- [Cost Accounting & Eval-Driven Design](https://personal-wiki.pages.dev/raw/dr-cost-and-eval.md): Cost-per-bad-day, LLM-judge rubrics, golden-set sizing — design flows from the eval, not the architecture
- [Case: Design ChatGPT](https://personal-wiki.pages.dev/raw/dr-case-chatgpt.md): Multi-tenant chat — SLOs, model routing, conversation state
- [Case: Design Perplexity](https://personal-wiki.pages.dev/raw/dr-case-perplexity.md): RAG + live web search — freshness, citations, retrieval fusion
- [Case: Design Claude Code / Cursor](https://personal-wiki.pages.dev/raw/dr-case-coding-agent.md): Coding agent at scale — context builder, tools, sandboxing
- [Case: Design Midjourney](https://personal-wiki.pages.dev/raw/dr-case-image-gen.md): Multi-tenant diffusion — queueing, step budgets, content safety, GPU economics
- [Case: Design TikTok For-You Ranking](https://personal-wiki.pages.dev/raw/dr-case-feed-ranking.md): Two-tower retrieval + ranker + feature store — classical ML@scale canon
- [Case: Design an Embeddings Platform](https://personal-wiki.pages.dev/raw/dr-case-embeddings-platform.md): Pinterest-style — backfill, drift, model upgrades, serving with HNSW
- [Case: Design Llama Training Infra](https://personal-wiki.pages.dev/raw/dr-case-training-infra.md): Data pipeline + checkpoint management + failure-tolerant orchestration
- [Case: Design an Agent Platform](https://personal-wiki.pages.dev/raw/dr-case-agent-platform.md): Multi-agent infra — sandboxing, tool registries, trajectory eval, spend control
- [Case: Design Gemini](https://personal-wiki.pages.dev/raw/dr-case-gemini.md): Multi-modal frontier serving — TPU stack, 1M-token attention, safety classifier chain
- [Case: Design NotebookLM](https://personal-wiki.pages.dev/raw/dr-case-notebooklm.md): Long-context RAG over user docs — source-pinned citations, audio-overview pipeline
- [Case: Design Sora](https://personal-wiki.pages.dev/raw/dr-case-sora.md): Text-to-video at scale — diffusion transformer GPU economics, safety on generative video
- [Case: Design Character.ai](https://personal-wiki.pages.dev/raw/dr-case-characterai.md): Consumer LLM at scale — MQA, int8, trained-from-scratch, sub-$1/user/month cost floor
- [Compare: RAG Systems](https://personal-wiki.pages.dev/raw/dr-compare-rag.md): Perplexity vs NotebookLM vs ChatGPT-search vs Phind — retriever, grounding, citation side-by-side
- [Compare: SLO ↔ Cost](https://personal-wiki.pages.dev/raw/dr-compare-slo-cost.md): Interactive sensitivity — slide p99, watch GPU count, $/req, and cache hit-rate move together
- [Compare: Failure-Mode Taxonomy](https://personal-wiki.pages.dev/raw/dr-compare-failure-taxonomy.md): One master table of every failure mode across 14 real systems — with detect→escalate→rollback playbooks

## Optional

- [Full concatenated text](https://personal-wiki.pages.dev/llms-full.txt): All 83 modules in one file (~500K–1M tokens). Suitable for Claude/Gemini long-context ingestion.
- [Sitemap](https://personal-wiki.pages.dev/sitemap.xml): Full URL list for crawler discovery.