# Transformer Math

> Tier-1 ML/AI engineering interview prep covering Transformers, Training, Inference, Architectures, Applications, Trust & Evaluation, AI Engineering, and Production Design Reviews. Every number cited, arithmetic verified.

> Source: https://personal-wiki.pages.dev | Full text: https://personal-wiki.pages.dev/llms-full.txt | Raw modules: https://personal-wiki.pages.dev/raw/.md

## The Transformer

- [High-Level Overview](https://personal-wiki.pages.dev/raw/transformer-overview.md): The complete Transformer pipeline — from raw text to next-token prediction
- [Tokenization](https://personal-wiki.pages.dev/raw/tokenization.md): BPE, vocabulary size, and why GPT can't count letters
- [Embeddings](https://personal-wiki.pages.dev/raw/embeddings.md): Turning token IDs into meaningful vectors
- [Positional Encoding](https://personal-wiki.pages.dev/raw/positional-encoding.md): Attention has no sense of order — how do we fix that?
- [MLP & Matmul](https://personal-wiki.pages.dev/raw/mlp-fundamentals.md): Matrix multiplication, weight initialization, and the universal approximation theorem
- [Self-Attention](https://personal-wiki.pages.dev/raw/attention.md): The core of Transformers — derive this on a whiteboard
- [Multi-Head Attention](https://personal-wiki.pages.dev/raw/multi-head.md): One head isn't enough — each head learns different patterns
- [FFN & Activations](https://personal-wiki.pages.dev/raw/ffn.md): Where 67% of parameters live — and what they memorize
- [LayerNorm & Residuals](https://personal-wiki.pages.dev/raw/layernorm.md): The glue that makes deep transformers trainable
- [The Full Forward Pass](https://personal-wiki.pages.dev/raw/forward-pass.md): Watch a token travel through a complete Transformer block

## Training

- [Backpropagation](https://personal-wiki.pages.dev/raw/backpropagation.md): Chain rule, computation graphs, and autograd — how gradients flow backward
- [Optimizers](https://personal-wiki.pages.dev/raw/optimizers.md): SGD → Momentum → Adam → AdamW, learning rate schedules, and weight decay
- [Pre-training & Loss](https://personal-wiki.pages.dev/raw/pretraining.md): Next-token prediction, cross-entropy, and perplexity
- [Data Curation](https://personal-wiki.pages.dev/raw/data-curation.md): FineWeb, filtering, dedup — data quality beats data quantity
- [Scaling Laws](https://personal-wiki.pages.dev/raw/scaling-laws.md): How big? How much data? Chinchilla has the answer
- [GPU & Mixed Precision](https://personal-wiki.pages.dev/raw/gpu-precision.md): CUDA memory hierarchy, fp16/bf16/fp8, loss scaling, and torch.autocast
- [Distributed Training](https://personal-wiki.pages.dev/raw/distributed-training.md): DDP, ZeRO, FSDP — training across thousands of GPUs
- [Fine-tuning & LoRA](https://personal-wiki.pages.dev/raw/fine-tuning.md): Adapt a model with 0.1% of parameters
- [SFT & Post-Training Pipeline](https://personal-wiki.pages.dev/raw/sft-post-training.md): Loss masking, chat templates, rejection sampling, distillation
- [RL Foundations](https://personal-wiki.pages.dev/raw/rl-foundations.md): MDPs, policy gradient, PPO — the math before RLHF
- [RLHF & Reward Models](https://personal-wiki.pages.dev/raw/rlhf.md): Teaching models what humans prefer — the 3-stage pipeline
- [DPO, GRPO & Alternatives](https://personal-wiki.pages.dev/raw/dpo.md): Skip the reward model — direct preference optimization

## Inference

- [KV Cache & Memory](https://personal-wiki.pages.dev/raw/kv-cache.md): Why generation is memory-bound and how to fix it
- [Flash Attention](https://personal-wiki.pages.dev/raw/flash-attention.md): Tiling, IO-awareness, and O(N) memory attention
- [Sampling & Decoding](https://personal-wiki.pages.dev/raw/sampling.md): Temperature, top-k, top-p — how the model picks the next token
- [Quantization](https://personal-wiki.pages.dev/raw/quantization.md): INT8, INT4, GPTQ, AWQ — shrink models without losing quality
- [Speculative Decoding](https://personal-wiki.pages.dev/raw/speculative-decoding.md): Small model drafts, big model verifies — parallel generation
- [LLM Deployment](https://personal-wiki.pages.dev/raw/llm-deployment.md): Serving stacks, continuous batching, latency vs throughput, vLLM, and API design

## Architectures

- [Mixture of Experts](https://personal-wiki.pages.dev/raw/moe.md): More parameters, same compute — the secret behind DeepSeek
- [Vision Transformers & CLIP](https://personal-wiki.pages.dev/raw/vit.md): Patch embeddings, contrastive learning, zero-shot classification
- [Multimodal LLMs](https://personal-wiki.pages.dev/raw/multimodal.md): How GPT-4V, Claude, and Gemini see images
- [Reasoning Models](https://personal-wiki.pages.dev/raw/reasoning.md): Chain-of-thought, o1, DeepSeek-R1, test-time compute
- [Verifiers & Process Reward](https://personal-wiki.pages.dev/raw/verifier-prm.md): PRMs, best-of-N, self-consistency — when to think longer
- [Diffusion Basics](https://personal-wiki.pages.dev/raw/diffusion.md): DDPM, latent diffusion, DiT — image generation from noise

## Applications

- [Prompt Engineering](https://personal-wiki.pages.dev/raw/prompt-engineering.md): System prompts, few-shot, structured output, tool schemas
- [Agents & ReAct](https://personal-wiki.pages.dev/raw/agents.md): Think → Act → Observe — the reasoning loop
- [Tool Use & Protocols](https://personal-wiki.pages.dev/raw/tool-use.md): Function calling, MCP, A2A — connecting agents to the world
- [RAG & Retrieval](https://personal-wiki.pages.dev/raw/rag.md): Ground LLM outputs in real data — reduce hallucination
- [Long Context & Context Engineering](https://personal-wiki.pages.dev/raw/long-context.md): Token budgeting, prompt caching, lost-in-the-middle, memory layering
- [Agent Evaluation](https://personal-wiki.pages.dev/raw/agent-eval.md): Trajectory eval, tool accuracy, and why agent eval is harder

## Trust & Evaluation

- [LLM Evaluation](https://personal-wiki.pages.dev/raw/evaluation.md): Benchmarks, LLM-as-judge, contamination, hallucination
- [Eval-Driven Development](https://personal-wiki.pages.dev/raw/eval-ops.md): Judge calibration, regression gating, launch criteria, eval ops
- [Interpretability](https://personal-wiki.pages.dev/raw/interpretability.md): Circuits, superposition, SAEs — what is the model computing?
- [Safety & Alignment](https://personal-wiki.pages.dev/raw/safety.md): Jailbreaking, alignment faking, and defenses that work
- [Mechanistic Interpretability](https://personal-wiki.pages.dev/raw/mech-interp.md): SAE training, activation patching, attribution graphs, circuit tracing, and feature steering
- [Induction Heads & ICL](https://personal-wiki.pages.dev/raw/induction-heads.md): The two-head circuit that powers in-context learning — and why it emerges as a phase transition

## Interview Prep

- [PyTorch Debugging](https://personal-wiki.pages.dev/raw/pytorch-debugging.md): NaN loss, double softmax, missing zero_grad — spot the bug

## AI Engineering

- [Agent Harness Architecture](https://personal-wiki.pages.dev/raw/agent-harness.md): Agentic loops, tool orchestration, permission systems, and context management
- [Tool System](https://personal-wiki.pages.dev/raw/tool-system.md): Tool interface, Zod schemas, registry, orchestration, and parallel execution
- [Sub-agents](https://personal-wiki.pages.dev/raw/sub-agents.md): Context isolation, worktrees, background execution, and result aggregation
- [Commands & Skills](https://personal-wiki.pages.dev/raw/commands-skills.md): Slash commands, skill markdown files, prompt injection, and the command registry
- [Plugins & MCP](https://personal-wiki.pages.dev/raw/plugins-mcp.md): Model Context Protocol, external tool servers, plugin lifecycle, and transport layers
- [State Management](https://personal-wiki.pages.dev/raw/state-management.md): Dual state systems: React context for UI, module state for services
- [Context Compaction](https://personal-wiki.pages.dev/raw/context-compaction.md): Auto-compact, reactive compact, microcompact, context collapse, and token budgets
- [Terminal UI (Ink)](https://personal-wiki.pages.dev/raw/terminal-ui.md): React reconciler for terminals, Yoga flexbox, ANSI rendering, and keyboard focus
- [Memory System](https://personal-wiki.pages.dev/raw/memory-system.md): File-based persistent memory, memory types, auto-save triggers, and cross-session recall
- [Hooks & Permissions](https://personal-wiki.pages.dev/raw/hooks-permissions.md): PreToolUse/PostToolUse hooks, 5-layer permission hierarchy, and safety gates
- [Prompt Engineering (System)](https://personal-wiki.pages.dev/raw/prompt-cache.md): System prompt assembly, cache boundary optimization, dynamic sections, and prompt variants
- [Configuration & Schemas](https://personal-wiki.pages.dev/raw/config-schemas.md): Settings.json, Zod validation, feature flags, MDM policies, and config hierarchy
- [Bridges & IDE Integration](https://personal-wiki.pages.dev/raw/bridges.md): WebSocket bridge, VS Code/JetBrains extensions, permission callbacks, and message routing
- [Streaming & API Layer](https://personal-wiki.pages.dev/raw/streaming-api.md): Async generators, queryModelWithStreaming, SSE parsing, and backpressure
- [Error Recovery](https://personal-wiki.pages.dev/raw/error-recovery.md): Reactive compact retry, max output tokens escalation, abort handling, and graceful degradation
- [Speculative Execution](https://personal-wiki.pages.dev/raw/speculative-execution.md): Parallel speculation, overlay filesystems, safe tool subsets, and acceptance criteria
- [Coordinator/Worker Pattern](https://personal-wiki.pages.dev/raw/coordinator-worker.md): Multi-agent coordination, restricted tool sets, environment gating, and task distribution
- [Session Persistence](https://personal-wiki.pages.dev/raw/session-persistence.md): Session JSON, /resume reconstruction, message history, file snapshots, and attribution
- [Cost Tracking & Budgets](https://personal-wiki.pages.dev/raw/cost-tracking.md): Token counting, budget limits, per-model pricing, rate limit handling, and spend alerts

## Design Reviews

- [The Design Doc](https://personal-wiki.pages.dev/raw/dr-methodology.md): Working backwards from the SLO — an annotated, worked design doc for a real ML endpoint
- [Cost Accounting & Eval-Driven Design](https://personal-wiki.pages.dev/raw/dr-cost-and-eval.md): Cost-per-bad-day, LLM-judge rubrics, golden-set sizing — design flows from the eval, not the architecture
- [Case: Design ChatGPT](https://personal-wiki.pages.dev/raw/dr-case-chatgpt.md): Multi-tenant chat — SLOs, model routing, conversation state
- [Case: Design Perplexity](https://personal-wiki.pages.dev/raw/dr-case-perplexity.md): RAG + live web search — freshness, citations, retrieval fusion
- [Case: Design Claude Code / Cursor](https://personal-wiki.pages.dev/raw/dr-case-coding-agent.md): Coding agent at scale — context builder, tools, sandboxing
- [Case: Design Midjourney](https://personal-wiki.pages.dev/raw/dr-case-image-gen.md): Multi-tenant diffusion — queueing, step budgets, content safety, GPU economics
- [Case: Design TikTok For-You Ranking](https://personal-wiki.pages.dev/raw/dr-case-feed-ranking.md): Two-tower retrieval + ranker + feature store — classical ML@scale canon
- [Case: Design an Embeddings Platform](https://personal-wiki.pages.dev/raw/dr-case-embeddings-platform.md): Pinterest-style — backfill, drift, model upgrades, serving with HNSW
- [Case: Design Llama Training Infra](https://personal-wiki.pages.dev/raw/dr-case-training-infra.md): Data pipeline + checkpoint management + failure-tolerant orchestration
- [Case: Design an Agent Platform](https://personal-wiki.pages.dev/raw/dr-case-agent-platform.md): Multi-agent infra — sandboxing, tool registries, trajectory eval, spend control
- [Case: Design Gemini](https://personal-wiki.pages.dev/raw/dr-case-gemini.md): Multi-modal frontier serving — TPU stack, 1M-token attention, safety classifier chain
- [Case: Design NotebookLM](https://personal-wiki.pages.dev/raw/dr-case-notebooklm.md): Long-context RAG over user docs — source-pinned citations, audio-overview pipeline
- [Case: Design Sora](https://personal-wiki.pages.dev/raw/dr-case-sora.md): Text-to-video at scale — diffusion transformer GPU economics, safety on generative video
- [Case: Design Character.ai](https://personal-wiki.pages.dev/raw/dr-case-characterai.md): Consumer LLM at scale — MQA, int8, trained-from-scratch, sub-$1/user/month cost floor
- [Compare: RAG Systems](https://personal-wiki.pages.dev/raw/dr-compare-rag.md): Perplexity vs NotebookLM vs ChatGPT-search vs Phind — retriever, grounding, citation side-by-side
- [Compare: SLO ↔ Cost](https://personal-wiki.pages.dev/raw/dr-compare-slo-cost.md): Interactive sensitivity — slide p99, watch GPU count, $/req, and cache hit-rate move together
- [Compare: Failure-Mode Taxonomy](https://personal-wiki.pages.dev/raw/dr-compare-failure-taxonomy.md): One master table of every failure mode across 14 real systems — with detect→escalate→rollback playbooks

## Optional

- [Full concatenated text](https://personal-wiki.pages.dev/llms-full.txt): All 83 modules in one file (~500K–1M tokens). Suitable for Claude/Gemini long-context ingestion.
- [Sitemap](https://personal-wiki.pages.dev/sitemap.xml): Full URL list for crawler discovery.