
Transformer Math

Sources & Bibliography

Every paper, blog post, video, and code repository cited across all 84 modules — deduplicated and searchable.

345 total sources · 345 unique sources · 84 modules covered · 4.1 sources per module (average)

345 sources currently indexed across the wiki.

📄 Papers (137)

Azar et al. 2024 — introduces IPO to address DPO's tendency to overfit preference data.

Schaeffer et al. 2023 — argues apparent emergence is an artifact of discontinuous evaluation metrics, not a fundamental property of scale.

Michel et al. 2019 — shows most attention heads are redundant and can be pruned with minimal quality loss. Challenges assumptions about head count.

Vaswani et al. 2017 — the paper that introduced scaled dot-product attention and the Transformer architecture.

Wu et al., 2023 — a framework for multi-agent conversations with customizable agents.

Protects salient weight channels identified by activation magnitudes, achieving better quality than round-to-nearest at 4-bit.

The original BatchNorm paper — understanding why BN works for images clarifies why LayerNorm is needed for variable-length sequences.

(Ho & Salimans, 2022) — the CFG paper. Explains how to trade sample diversity for prompt fidelity without a separate classifier model.

Sumers et al., 2023 — formal taxonomy of agent memory: working, episodic, semantic, and procedural. The framework behind the file-based memory design.

Khattab & Zaharia 2020 — the architecture that keeps per-token embeddings and uses MaxSim scoring. Relevant to understanding why single-vector bi-encoders are the retrieval floor, not the ceiling.

Introduces the LR range test — the fastest practical method to find a good learning rate.

Drop 90% of task-vector deltas at random, rescale survivors by 1/(1-p), then merge — enables combining many models with near-zero accuracy loss.
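A minimal sketch of the drop-and-rescale step described above, in PyTorch; `dare_sparsify` and `dare_merge` are illustrative names, and task vectors are assumed to be plain tensors (fine-tuned weights minus base weights).

```python
import torch

def dare_sparsify(task_vector: torch.Tensor, drop_prob: float = 0.9) -> torch.Tensor:
    """Randomly zero a fraction of delta entries, then rescale survivors by 1/(1-p)."""
    keep = (torch.rand_like(task_vector) >= drop_prob).float()
    return task_vector * keep / (1.0 - drop_prob)

def dare_merge(base: torch.Tensor, task_vectors: list[torch.Tensor], drop_prob: float = 0.9) -> torch.Tensor:
    """Merge several fine-tuned models by adding their sparsified deltas to the shared base weights."""
    merged = base.clone()
    for tv in task_vectors:
        merged += dare_sparsify(tv, drop_prob)
    return merged
```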

The paper that introduced AdamW — shows why L2 regularization and weight decay are NOT equivalent in adaptive optimizers.

Lee et al. 2022 — rigorous study showing deduplication improves perplexity and reduces verbatim memorization risk across multiple LMs.

Wang et al. 2022 — combines Pre-LN and Post-LN with scaled initialization to train 1000-layer transformers stably. Used in GLM-130B.

Chen et al. 2024 — introduces Multi-Head Latent Attention (MLA), compressing KV cache to a shared low-rank latent with 93.3% reduction vs MHA.

DeepSeek 2024 — 671B MoE with auxiliary-loss-free load balancing and multi-token prediction

He et al. 2015 — extends Xavier init for ReLU activations by accounting for the halved variance from the rectification. Standard init for modern deep nets.

(Ho et al., 2020) — the foundational DDPM paper that revived diffusion models for image generation.

The DPR paper that established dual-encoder dense retrieval as the production baseline. The retrieval recall numbers here are the standard against which all Perplexity-style systems are measured.

The DPO paper — eliminates the reward model by directly optimizing policy from preference pairs.

Foundational paper showing that arithmetic on fine-tuned weight differences enables multi-task editing, forgetting, and analogy without retraining.

A practitioner paper on batching strategy, LoRA switching, and GPU utilization for diffusion serving at scale. The most directly relevant systems paper for this case study.

Mikolov et al. 2013 — introduced Skip-gram and CBOW, the foundation of modern word embeddings.

Narayanan et al. 2021 — combining data, tensor, and pipeline parallelism (3D parallelism) with concrete recipes for scaling to trillions of parameters.

The vLLM paper — virtual memory paging for KV cache, eliminating fragmentation and enabling continuous batching.

Tay et al., 2020 — covers KV cache and attention computation optimizations that make prompt caching possible.

Wei et al. 2022 — documents capabilities that appear unpredictably at scale, raising questions about whether scaling produces continuous or discontinuous improvements.

Shazeer 2019 — multi-query attention reduces KV cache size by sharing one KV head across all query heads. Direct precursor to GQA.

The original RLHF paper applying reward learning from human preferences to language models.

Alayrac et al. 2022 — few-shot multimodal learning with interleaved image-text inputs

Flash Attention v2 — improved work partitioning across warps and thread blocks for up to 2x additional speedup.

Shah et al. 2024 — Flash Attention v3 for H100, using async TMA and FP8 tensor cores to achieve 1.5-2x speedup over v2.

The original Flash Attention paper — tiling attention computation to exploit GPU SRAM, achieving 2-4x speedup.

Micikevicius et al. 2022 — defines E4M3 and E5M2 FP8 formats and training recipes. The format used by DeepSeek-V3 and H100 tensor cores.

Park et al., 2023 — memory retrieval, reflection, and planning for believable agent behavior.

Shazeer 2020 — shows SwiGLU and GeGLU outperform standard ReLU FFNs. SwiGLU is now the default in LLaMA and PaLM.

Patil et al., 2023 — improving LLM accuracy in API call generation via retrieval-augmented training.

Huang et al. 2019 — introduces micro-batching to reduce pipeline bubble overhead, the foundation of modern pipeline parallelism.

One-shot weight quantization using approximate second-order information, enabling 3-4 bit models with minimal quality loss.

Grouped-query attention — interpolates between MHA and MQA to reduce KV cache size with minimal quality loss.
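A minimal sketch of how a grouped-query layer reuses a small number of KV heads for many query heads by repeating them along the head dimension; the shapes and function name are illustrative.

```python
import torch

def expand_kv_for_gqa(kv: torch.Tensor, num_query_heads: int) -> torch.Tensor:
    """kv: [batch, kv_heads, seq, head_dim]. Each KV head serves a group of query heads."""
    batch, kv_heads, seq, head_dim = kv.shape
    group_size = num_query_heads // kv_heads          # e.g. 32 query heads / 8 KV heads = groups of 4
    return kv.repeat_interleave(group_size, dim=1)    # -> [batch, num_query_heads, seq, head_dim]
```

Only the `kv_heads` keys and values are cached; the repeat happens at attention time, which is where the KV-cache saving comes from.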

Schulman et al. 2016 — introduces GAE, the variance-reduction technique that PPO relies on for stable RLHF training.
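A minimal sketch of the GAE recursion for a single terminated episode, assuming lists of per-step rewards and value estimates; hyperparameter names follow the usual gamma/lambda convention.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Exponentially weighted sum of TD residuals, computed backwards through the episode."""
    advantages = [0.0] * len(rewards)
    next_value, running = 0.0, 0.0        # bootstrap value of 0 after the terminal step
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # one-step TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```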

(Rombach et al., 2022) — introduced latent diffusion, the architecture behind Stable Diffusion.

Liang et al. 2022 — multi-metric evaluation framework covering accuracy, fairness, robustness, and more

Luo et al. 2024 — Monte Carlo tree search to automatically generate step-level labels for PRM training without human annotation

Ram et al., 2023 — dynamically retrieving only the relevant chunks rather than keeping the full context, the retrieval complement to compaction.

Greshake et al., 2023 — how attackers inject instructions via tool results; the threat model behind PreToolUse deny rules and input sanitization.

Chen et al. 2023 — scaling ViT to 6B parameters and aligning with LLMs; shows dynamic resolution handling for OCR-heavy tasks

Wang et al. 2022 — end-to-end circuit analysis of a real linguistic capability; the canonical example of mechanistic interpretability on a real model

Zheng et al. 2023 — LLM-as-judge evaluation and Elo-based human preference ranking

Per-channel INT2 KV cache quantization — 2.35x memory reduction with negligible quality loss, enabling longer contexts on the same GPU.

Brown et al. 2020 — the GPT-3 paper showing that large-scale pretraining enables strong few-shot task performance without fine-tuning.

Recent survey of LLM-based multi-agent architectures and coordination patterns.

Ba et al. 2016 — the original LayerNorm paper. Normalizes across features instead of batch dimension.

Dubois et al. 2024 — length-controlled win rates to debias automatic evaluators; addresses verbosity inflation in GPT-4 judge scoring


Lightman et al. (2023). Process reward models outperform outcome reward models by scoring each reasoning step.

Zhou et al. 2023 — 1,000 carefully curated examples match far larger datasets. Quality over quantity for SFT.

Detailed post-training recipe: SFT (27K examples), rejection sampling, 5 rounds of RLHF.

Li et al. 2024 — single model handles single-image, multi-image, and video tasks; shows how to unify vision tasks with one instruction-tuned model

Kim et al., 2023 — DAG-based parallel function call planning, the same dependency-aware parallelism coordinators use to maximize worker throughput.

Dettmers et al. 2022 — mixed-precision INT8 quantization that handles outlier activations separately, enabling near-lossless 8-bit inference for 175B+ models.

Chen et al. 2023 — extends context window from 4K to 100K using shifted sparse attention during fine-tuning.

The original LoRA paper — freeze base weights, train low-rank decomposition matrices A and B.
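A minimal sketch of the idea, assuming a plain PyTorch linear layer; the effective weight is W + BA with W frozen (the scaling factor and dropout are omitted for brevity).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                         # freeze the pre-trained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at step 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.A.T @ self.B.T       # W x + B A x
```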

Liu et al., 2023 — LLMs struggle with information in the middle of long contexts, motivating compaction.

The foundational HNSW paper. Read Section 4 on layered graph construction and Section 5 on query complexity — essential for justifying M, ef_construction, and ef_search tradeoffs in an interview.

Hendrycks et al. 2020 — 57-subject benchmark testing broad knowledge and reasoning

Packer et al., 2023 — virtual context management with paging, a complementary approach to compaction.

The primary source for Llama-scale training infrastructure at Meta. Section 3 on pre-training covers the 3D-parallel strategy, checkpoint policies, and failure-recovery design that this case study is grounded in.

Hong et al., 2023 — structured multi-agent collaboration with role-based task decomposition.

Nguyen et al. 2024 — min-p sets a dynamic floor at p_min × max_prob, automatically adapting to the distribution without the fixed-cutoff fragility of top-p.
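A minimal sketch of the rule, assuming a 1-D tensor of next-token probabilities; the function name is illustrative.

```python
import torch

def min_p_filter(probs: torch.Tensor, p_min: float = 0.05) -> torch.Tensor:
    """Keep tokens with probability >= p_min * max probability, then renormalize."""
    threshold = p_min * probs.max()
    kept = torch.where(probs >= threshold, probs, torch.zeros_like(probs))
    return kept / kept.sum()

# next_token = torch.multinomial(min_p_filter(probs), num_samples=1)
```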

Mistral AI 2024 — open-weight MoE model with 8 experts per layer, top-2 routing

Average fine-tuned checkpoint weights from the same pre-trained base to reach flatter loss regions and improve accuracy over any single run.
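A minimal sketch of uniform souping over checkpoints that share one pre-trained base; it assumes state dicts with identical keys and shapes.

```python
import torch

def uniform_soup(state_dicts: list[dict]) -> dict:
    """Element-wise average of parameter tensors across fine-tuned checkpoints."""
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }
```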

Sanh et al., 2021. Shows that instruction fine-tuning on diverse tasks dramatically improves zero-shot prompting — why modern models follow instructions without few-shot examples.

Sennrich et al. 2016 — the paper that introduced Byte Pair Encoding for NLP tokenization.

Xiong et al. 2020 — analyzes Pre-LN vs Post-LN placement. Pre-LN enables stable training without warmup.

The online softmax algorithm that makes Flash Attention's tiled, single-pass attention computation possible.
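A minimal sketch of the streaming update: a running max and a running sum are corrected as each new score arrives, so the full row never has to be materialized at once (pure-Python illustration, not the fused kernel).

```python
import math

def online_softmax(scores):
    """One-pass softmax normalizer via a running max m and a rescaled running sum s."""
    m, s = float("-inf"), 0.0
    for x in scores:
        m_new = max(m, x)
        s = s * math.exp(m - m_new) + math.exp(x - m_new)   # rescale old sum, add new term
        m = m_new
    return [math.exp(x - m) / s for x in scores]
```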

Mitra et al. 2023 — smaller models trained on carefully synthesized step-by-step reasoning data can match much larger models; shows why data quality beats scale for SFT.

Hong et al. 2024 — combines SFT and preference optimization in a single pass using an odds ratio penalty, eliminating the need for a reference model entirely.

Shazeer et al. 2017 — the original MoE paper introducing sparsely-gated expert routing

First-principles analysis of memory bandwidth vs. compute bottlenecks in large model serving. The paper that grounds the batch-size/latency tradeoff in hardware arithmetic.

The PPO paper — clipped surrogate objective that became the default RL algorithm for RLHF.

Production lessons from scaling FSDP across thousands of GPUs at Meta.

4-bit NormalFloat quantization + LoRA adapters, enabling 65B model fine-tuning on a single 48GB GPU.

The evaluation framework for RAG systems — faithfulness, answer relevance, context precision, context recall. The eval metrics used in the cross-system comparison in this module are grounded in RAGAS.

Yao et al., 2022 — interleaving reasoning traces and actions for grounded decision-making.

Shinn et al., 2023 — agents that reflect on failures and improve across episodes.

Lewis et al., 2020 — the foundational RAG paper; the MEMORY.md index pattern is a lightweight, file-based approximation of RAG-style selective retrieval.

Coste et al. 2023 — shows that reward hacking can be reduced by ensembling multiple reward models, with quantitative analysis of the overoptimization curve.

Su et al. 2021 — RoPE encodes relative position via rotation, enabling length extrapolation used by LLaMA and Mistral.
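A minimal sketch of the rotation, assuming an input of shape (seq, dim) with an even dim; applied to both queries and keys, the subsequent dot product depends only on relative position. This is the interleaved-pair variant and is illustrative only.

```python
import torch

def rope(x: torch.Tensor, base: float = 10_000.0) -> torch.Tensor:
    """Rotate each (even, odd) channel pair by an angle proportional to position."""
    seq, dim = x.shape
    pos = torch.arange(seq, dtype=torch.float32)[:, None]                       # (seq, 1)
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)       # (dim/2,)
    angles = pos * freqs                                                        # (seq, dim/2)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * angles.cos() - x2 * angles.sin()
    out[:, 1::2] = x1 * angles.sin() + x2 * angles.cos()
    return out
```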

Zhang & Sennrich 2019 — drops the mean-centering step for a simpler, faster norm. Used by LLaMA and Mistral.

The paper that formally defines the cost-quality trade-off in LLM routing. Introduces the APGR/CGPT metrics and shows that a trained classifier-router can match 95% of GPT-4 quality at 40% of the cost.

(Peebles & Xie, 2023) — replaced U-Net with a Vision Transformer, showing clean scaling laws for diffusion.

Xu et al. 2023 — comprehensive survey comparing LoRA, adapters, prefix tuning, and prompt tuning across benchmarks.

Kaplan et al. 2020 — empirical power laws relating compute, data, and parameters to loss. Foundation of modern training budgets.

Snell et al. (2024). When to think longer vs. use a bigger model.

(Esser et al., 2024) — Stability's Stable Diffusion 3 paper: rectified flow training with MMDiT. FLUX.1 uses the same rectified-flow family but is a separate model from Black Forest Labs with no equivalent paper.

Liu et al., 2023 — extending context windows, reducing but not eliminating the need for compaction.

Wang et al., 2022. Sample multiple CoT reasoning paths and take the majority vote. Simple technique, significant accuracy gains.

The method behind Alpaca — use a model to generate its own instruction-tuning data.

Asai et al. 2023 — model learns to decide when to retrieve and to critique its own outputs with special reflection tokens.

Kudo & Richardson 2018 — unigram language model tokenizer used by T5, mT5, XLNet, and many multilingual models.

Zheng et al. 2023 — RadixAttention for KV cache reuse across requests, 5x throughput on multi-turn workloads

The canonical paper on LLM-judge calibration. If you take one idea: the judge needs its own eval, and that eval is a human-labeled subsample you refresh on a schedule.

The foundational paper for understanding why the LLM judge evaluating your RAG system needs its own calibration. Essential reading before designing any groundedness or citation-precision eval.

Meng et al. 2024 — removes the reference model from DPO entirely, using sequence-average log-probability as the implicit reward.

Migrates quantization difficulty from activations to weights via per-channel smoothing, enabling W8A8 quantization.

Patel et al. 2023 — separates prefill and decode across GPU clusters for optimal hardware utilization

Yang et al., 2024 — production agent harness design lessons from solving real GitHub issues; ACI (agent-computer interface) design principles.

Princeton / Chicago, 2023 — the benchmark that made coding-agent trajectory eval rigorous. Essential reading for any team designing agent eval harnesses.

Fedus et al. 2021 — top-1 routing with capacity factor and auxiliary balance loss; showed MoE scales to 1T+ params

Gunasekar et al. 2023 — Microsoft's "Textbooks Are All You Need" (phi-1): carefully curated, textbook-quality data lets a small model rival much larger ones.

Introduces nucleus sampling (top-p) — dynamically truncates the vocabulary to the smallest set covering probability p.
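A minimal sketch of the truncation step, assuming a 1-D tensor of probabilities; illustrative, not a library API.

```python
import torch

def top_p_filter(probs: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    """Keep the smallest set of tokens whose cumulative probability reaches p, then renormalize."""
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=0)
    cutoff = int(torch.searchsorted(cumulative, p).item()) + 1   # include the token that crosses p
    kept = torch.zeros_like(probs)
    kept[sorted_idx[:cutoff]] = probs[sorted_idx[:cutoff]]
    return kept / kept.sum()
```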

Gao et al. 2020 — influential open dataset combining 22 diverse sources. Used to train GPT-Neo and GPT-J.

Lester et al. 2021 — prompt tuning trains only soft prompt tokens while freezing the entire model. Matches full fine-tuning at 11B scale.

Trim-Elect Sign-Disjoint Merge: three-step algorithm that cuts inter-task interference by resolving parameter sign conflicts before averaging.

Schick et al., 2023 — training LLMs to decide when and how to call external tools.

Press et al. 2022 — adds linear bias to attention scores instead of positional embeddings. Enables length extrapolation.
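A minimal sketch of the bias matrix added to the attention logits: each head penalizes distant keys linearly, with head-specific slopes following the geometric schedule used in the paper (shapes are illustrative).

```python
import torch

def alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    """Returns a (heads, q, k) bias of -slope_h * (i - j) for causal positions j <= i."""
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).clamp(max=0)     # j - i, <= 0 in the causal region
    return slopes[:, None, None] * distance[None, :, :]       # add to attention logits before softmax
```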

Hoffmann et al. 2022 — proved most LLMs were undertrained. Optimal ratio: ~20 tokens per parameter.
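A back-of-the-envelope illustration of the ratio, using the common 6ND approximation for training FLOPs; the 70B model size is a hypothetical example.

```python
params = 70e9                                # hypothetical 70B-parameter model
optimal_tokens = 20 * params                 # ~20 tokens per parameter -> ~1.4T tokens
train_flops = 6 * params * optimal_tokens    # common 6 * N * D estimate of training compute
print(f"tokens ~ {optimal_tokens:.2e}, training FLOPs ~ {train_flops:.2e}")
```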

Geva et al. 2021 — interprets FFN layers as implicit key-value stores where keys match input patterns and values store output distributions.

Yao et al. (2023). Structured search over reasoning paths using BFS/DFS — the algorithmic foundation for o1-style test-time search.

Lambert et al. 2024 — end-to-end open post-training recipe covering data curation, SFT, DPO, and RLVR with full ablations and reproducible results.

Typical sampling — selects tokens whose information content is close to the expected information, producing more human-like text.

Clark et al. 2022 — scaling laws specific to MoE: how performance scales with number of experts, active params, and total params

Zou et al. 2023 — gradient-based suffix optimization that transfers across GPT-4, Claude, and Gemini; motivates the need for input classifiers

Press & Wolf 2017 — shows tying input and output embedding weights improves perplexity and saves parameters.
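A minimal sketch of the tying idiom in PyTorch: the output projection reuses the input embedding matrix, so the vocabulary projection is learned once.

```python
import torch.nn as nn

vocab_size, d_model = 50_000, 768
embedding = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = embedding.weight   # tied: one matrix serves both input lookup and output logits
```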

Liu et al. 2023 — visual instruction tuning connecting a vision encoder to an LLM

Peng et al. 2023 — extends RoPE context windows without full fine-tuning via interpolation. Technique used by Llama models for context extension.

Tunstall et al. 2023 — the first widely-adopted DPO model (Mistral-7B base), with a concrete recipe for distilled preference data generation.

✍️ Blogs (35)

Karpathy 2019 — systematic approach to debugging and training neural networks from scratch

Hornik, Stinchcombe, White 1989 — proves that a single hidden-layer MLP can approximate any continuous function on a compact domain given enough neurons.

Olah & Carter 2016 — the original visual explainer of attention mechanisms before transformers.

The practitioner post that reframed eval-first design for the AI engineering generation. The trajectory eval section of this module follows Hamel's framework.

In-depth posts on preference learning, RLHF variants, and alignment techniques including DPO analysis.

Comprehensive posts on RLHF, reward modeling, and alignment — covers reward hacking, Goodhart's law, and related failure modes.

Rigorous treatment of the chain rule, Jacobians, and multivariate gradient flow.

Survey of jailbreak techniques, prompt injection, and defenses — GCG suffixes, many-shot, and gradient-based attacks

Comprehensive survey of decoding strategies including temperature, top-k, top-p, and beam search — with analysis of quality tradeoffs.

Comprehensive overview of vision-language model architectures — from dual encoders (CLIP) to decoder-only multimodal LLMs

Covers expert parallelism, pipeline parallelism, and how MoE fits into distributed training strategies

Detailed breakdown of distributed training and inference strategies — parallelism, memory, and communication tradeoffs with worked examples.

Comprehensive survey of agent components: planning, memory, tool use, and multi-agent coordination.

Comprehensive derivation of REINFORCE, Actor-Critic, and PPO — essential prerequisite for understanding the RLHF training loop.

Exhaustive survey of prompting techniques: zero-shot, few-shot, CoT, self-consistency, ToT, ReAct, and automatic prompt optimization.

The canonical survey of RAG architectures — covers bi-encoders, cross-encoders, fusion-in-decoder, and long-context approaches. Essential background for defending any retrieval design choice.

Covers the spectrum from self-consistency (no verifier) through ORM to PRM — useful for understanding the tradeoffs in test-time compute strategies

Comprehensive deep-dive covering DDPM, score matching, DDIM, and the connection to stochastic differential equations.

Deep dive on context window extensions and memory-augmented transformers — the research landscape that motivates runtime compaction.

Step-by-step 3D walkthrough of a GPT model — trace every tensor through the forward pass.

A real-world case study of the explore/exploit tradeoff, diversity re-ranking, and the product/ML boundary in a large-scale feed system.

Intuitive introduction to RoPE with derivations — explains why rotation in complex space encodes relative position elegantly.

The clearest public explanation of the online/offline feature store architecture, backfill strategies, and train/serve skew. Written by practitioners who built the Uber Michelangelo feature store.

Jay Alammar — how BERT reuses the Transformer encoder with bidirectional attention and masked language modeling for pretraining.

Visual walkthrough of DeepSeek-R1's training recipe.

Jay Alammar — step-by-step visual breakdown of the full Stable Diffusion pipeline: text encoder, UNet denoiser, VAE decoder, and CLIP guidance.

Interactive visual explanation of GPT-2 running live in the browser — great for seeing attention weights.

Benchmark-grounded comparison of ANN algorithms with real recall/latency/memory numbers. Use this to back up the HNSW justification in the architecture deep dive.

🎬 Videos (15)

Animated geometric intuition for why the chain rule distributes gradient across a computation graph.

Visual intuition for what transformer attention heads actually compute — useful foundation before diving into circuit-level interpretability

Visual explanation of how token embeddings work and how meaning is encoded in high-dimensional vector space — geometric intuition.

Grant Sanderson 2024 — visual explanation of how MLP layers act as key-value memories storing factual associations, with the connection to superposition.

Visual intuition for how neurons, layers, and matrix multiplication combine to form a neural network. The gold standard for building geometric intuition.

Visual breakdown of the MLP (FFN) layers in Transformers — intuition for what the expansion and contraction matrices compute.

Codes scaled dot-product self-attention from scratch in ~50 lines of PyTorch — essential companion for internalizing Q, K, V.

Karpathy builds an MLP character-level language model from scratch — covers weight init, batch norm, learning rate tuning, and the vanishing gradient problem.

2.5-hour walkthrough building micrograd step by step — every chain rule application shown explicitly.

Covers the SFT and RLHF fine-tuning pipeline end-to-end, including practical data requirements and training tips.

The canonical talk on why averages and even p95 lie, and why p99/p99.9 are the only metrics that capture the user's experience. The HDR histogram argument is mandatory background for SLO design.

Visual walkthrough of latent diffusion, CLIP guidance, and how modern text-to-image systems combine these ideas.

💻 Code (17)

100-line autodiff engine and neural network library — the clearest possible implementation of backprop from scratch.

The cleanest GPT implementation in ~300 lines of PyTorch — read model.py to see the exact forward pass used in practice.

The older, imperative approach to terminal UIs — useful comparison point to understand what React TUI improves on.

Step-by-step walkthrough of building a React custom renderer — the same techniques Ink uses to target the terminal instead of the DOM.

HuggingFace 2024 — accessible walkthrough of the full FineWeb pipeline: Common Crawl ingestion, quality scoring, MinHash dedup, and ablation results. Best pedagogical resource on web-scale data curation.

The terminal React renderer whose components are adapted by ink-compat for cross-platform rendering.

Official list of reference MCP server implementations — filesystem, GitHub, Postgres, Puppeteer, and more.

Production-grade open-source library implementing SLERP, TIES, DARE, Task Arithmetic, and Model Soups on HuggingFace models.

How Mixtral replaces dense FFNs with sparse MoE layers — concrete explanation of routing, expert selection, and capacity factors.

Production BPE tokenizer powering GPT-4. Fast Rust implementation with Python bindings.

Hugging Face library implementing LoRA, prefix tuning, prompt tuning, and other PEFT methods.

The official package for building custom React renderers — the foundation Ink is built on.

The gold standard for terminal session persistence — background processes, detach/attach, and named sessions; the UX model that Claude Code's session persistence mirrors.

HuggingFace reference covering every major algorithm — WordPiece, Unigram, BPE, byte-level BPE — with concrete examples of merge rules.

A minimal state library for React — similar memoized selector pattern, different implementation.

🏢 Industry (30)

Anthropic 2024 — evidence that models can strategically fake alignment during training

Practical guide to designing task-specific evals, calibrating LLM judges, and building regression gates for Claude-based applications

Official reference for API error types, status codes, and recommended handling strategies.

Current pricing for all Claude models — input, output, and cached token rates.

Official docs for custom slash commands in Claude Code — creating, organizing, and distributing skills.

Claude-specific guidance on system prompts, XML tags for structure, extended thinking, and common pitfalls.

Official docs for streaming message responses via Server-Sent Events.

Anthropic. Practical guide to prompt design for agentic systems, tool use patterns, and orchestration.

Official reference for the allow/deny rule syntax, hook configuration, and the five-layer permission hierarchy.

Google Cloud cost optimization — the same principles (metering, budgets, alerts) apply to LLM spend.

Anthropic 2025 — defending against universal jailbreaks with constitution-trained input/output classifiers

The paper that introduced the two-tower architecture for candidate generation at scale. Still the reference implementation for user-tower + item-tower + ANN retrieval.

Command/Query Responsibility Segregation with event sourcing — the pattern behind session replay: store events, not snapshots, then replay to reconstruct state.

Product-level documentation for NotebookLM's Gemini 1.5 Pro long-context approach. Pair with Google I/O 2024 talks on Vertex AI Matching Engine.

OpenAI 2023 — safety evaluations and capabilities of GPT-4 with vision

Radford et al. 2019 — decoder-only forward pass. Demonstrates that a single left-to-right pass can perform many NLP tasks.

Anthropic 2024 — exploiting long context windows to jailbreak LLMs with many in-context examples

Dean & Ghemawat, 2004 — the original coordinator/worker pattern for distributed computation.

The reference design for LLM tool interfaces — parallel function calling, strict mode, and tool_choice options.

Technical details on RL-trained reasoning and safety evaluations.

OpenAI, 2024. Demonstrates learned test-time compute allocation via internal chain-of-thought reasoning.

Comprehensive guide covering tactics for getting better results: writing clear instructions, providing reference text, and splitting complex tasks.

Practical deep RL resource covering policy gradients, PPO, and SAC with working implementations — the best entry point before diving into RLHF.

How prompt caching works, cache breakpoints, and cost implications for agent systems.

🔗 Other (111)

The canonical rule for config management — strict separation of config from code, environment-variable-first, directly applicable to agent settings design.

Hewitt et al., 1973 — the Actor Model that underpins modern multi-agent message passing.

The browser/Node.js API for cooperative cancellation — the mechanism behind Ctrl+C propagation through the streaming pipeline.

The structural design pattern at the heart of the bridge — converting one interface (QueryEngine) into multiple frontend-specific interfaces.

52K instruction-response pairs generated via Self-Instruct from GPT-3.5. Popularized SFT for open-source.

The paper that defined SLO-driven design at scale. Section 4 on the latency-at-p99.9 requirement and its architectural implications is the playbook this module derives from.

Not an ML piece, but the discipline of writing the customer-facing press release before the design doc is the methodological backbone of this module.

The definitive pedagogical guide to SGD, momentum, AdaGrad, RMSProp, Adam, and their variants with clear math.

The low-level sequences that all TUI libraries emit — understanding CSI codes (cursor movement, color, erase) demystifies what the reconciler produces.

The enterprise device management protocol used to deploy organization-wide settings policies.

Hands-on coding tutorials for transformer interpretability — TransformerLens, induction heads, superposition, SAEs, and circuit analysis.

GNU Bash programmable completion — the shell mechanism that agent CLIs mirror for /command tab completion.

Live benchmark for LLM tool-calling accuracy — shows which models handle nested schemas, parallel calls, and error recovery best.

How CPUs predict which branch to take — the same predict-execute-verify pattern applies to agent speculation.

The open standard for embedding AI-generation provenance metadata in images. Relevant to the OpenAI company-lens discussion on watermarking and traceability.

The production distributed task queue — the software engineering analog of the coordinator/worker pattern, with retries, priorities, and result backends.

Chapter 4 on evaluation is the textbook reference. The cost-accounting chapter reframes LLM unit economics around request shape, not just token count.

The canonical post on production LLM engineering. The hallucination and evaluation sections ground the citation-precision and groundedness design choices in this module.

Chapter 7 on feature pipelines and Chapter 10 on infrastructure cover the embedding lifecycle — freshness, serving, versioning — at the right abstraction level for a senior design interview.

Practitioner overview of the economic and operational realities of large-scale training — goodput, failure modes, and the org structure implications of running a cluster at this scale.

Olah 2015 — the clearest visual explanation of backprop as reverse-mode autodiff on computational graphs, with step-by-step chain rule diagrams.

Olah 2014 — visual intuition for how neural networks learn representations of language, connecting word embeddings to deeper network features.

Olah 2014 — how neural networks warp data manifolds to make them linearly separable, foundational for understanding embedding geometry.

The pattern for preventing cascading failures when a downstream service is unhealthy.

Ameisen et al. 2025 — combining SAEs with attribution patching to trace full computational circuits in Claude 3.5 Haiku

The OS-level COW primitive that makes fork cheap — the same copy-on-write principle applied to filesystem overlays in agent speculative execution.

Lindsey 2025 — evidence of introspective awareness where models can report on their own internal representations

Sofroniew et al. 2026 — investigating how emotion concept representations form and function inside Claude Sonnet 4.5

The most thorough practitioner guide to LLM-as-judge design — rubric construction, bias mitigation, calibration.

Practitioner-grade breakdown of retrieval, ranking, and re-ranking patterns at scale. The canonical starting point for recsys system design.

Martin Fowler — storing state as a sequence of events, the pattern behind session replay.

AWS architecture blog on backoff strategies — full jitter outperforms equal jitter and decorrelated jitter.
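A minimal sketch of the full-jitter schedule described above: each retry sleeps a uniform random amount between 0 and the capped exponential backoff (the function name and the retry wrapper are illustrative).

```python
import random
import time

def full_jitter_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full jitter: uniform sample in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))

# Illustrative retry loop around a flaky call:
# for attempt in range(5):
#     try:
#         return flaky_call()              # hypothetical call
#     except TransientError:               # hypothetical exception type
#         time.sleep(full_jitter_delay(attempt))
```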

The design pattern that inspired AI tool hooks — shell scripts triggered by lifecycle events.

The git primitives for saving and restoring working state — the lightweight alternative to overlay FS for speculative edits confined to git-tracked files.

The git primitive that enables parallel sub-agents to operate on the same repo without file conflicts.

Pennington et al. 2014 — combines count-based and predictive methods for learning word vectors.

The M/M/1 queueing argument and the 70% utilization cap are made explicit here. The error-budget math in the SLO chapter pairs with this module's queueing deep dive.

The feature flag service used for build-time dead code elimination and gradual rollouts.

Copy-on-write immutable update library — used in agent state managers to safely update deeply nested AppState without mutation bugs.

Olsson et al. 2022 — induction heads as the mechanism behind in-context learning, emerging as a phase change during training.

The transport protocol underlying MCP communication between client and server.

The widely-adopted reference implementation of the tool-call loop — useful comparison to Claude Code's agent loop.

Inspiration for MCP — a similar protocol for editor-language server communication.

The industry standard for feature flag systems — covers build-time vs. runtime flags, targeting rules, and rollout strategies used in agent feature gating.

Live benchmark tracking price, throughput, and latency across all major LLM providers — the reference for model selection decisions in cost-aware agents.

Step-by-step guide to building and connecting your first MCP server — covers stdio and Streamable HTTP transports.

Reference for async function*, for await...of, and the async iteration protocol.

The open standard for connecting AI agents to external tools and data sources.

Nanda 2023 — comprehensive guide to getting started in mech interp research, with recommended papers, exercises, and learning path.

Open-source platform for exploring 50M+ sparse autoencoder features across GPT-2, Gemma, Llama — hands-on companion to the theory in this module.

Official Node.js guide on backpressure — the mechanism that prevents unbounded memory growth when the consumer is slower than the producer.

Official NVIDIA guide covering loss scaling, tensor cores, and the AMP workflow with benchmarks.

Official docs for TensorRT-LLM — NVIDIA's optimized LLM inference library for its GPUs.

The auth standard underlying MCP's authorization for remote servers.

Lindsey et al. 2025 — probing Claude 3.5 Haiku's internal computations.

The open standard for emitting cost, latency, and token metrics — the instrumentation layer beneath production LLM cost dashboards.

Yu et al. 2022 — the continuous batching paper (Orca), showing iteration-level scheduling that keeps GPUs busy instead of waiting for the longest sequence.

The kernel filesystem that inspired the copy-on-write pattern used in speculative execution.

Security risks specific to LLM applications — including prompt injection and insecure tool use.

Phind's description of their code-specialized retrieval pipeline, domain-weighted re-ranking, and function-level citation design.

How open-source serving frameworks implement automatic prefix caching — the same mechanism Anthropic exposes via cache_control breakpoints.

The foundational security principle behind the permission hierarchy — grant minimum necessary access.

Official deep-dive into how autograd builds the computation graph, handles in-place ops, and propagates gradients — essential for debugging gradient issues

Debugging torch.compile graph breaks, dynamic shapes, and recompilations — increasingly important for modern training pipelines

Reference for torch.distributed, NCCL backend, process groups, and the DDP/FSDP/RPC APIs that underpin every production training stack.

PyTorch docs — common issues with memory, parallelism, and reproducibility

The gold standard for a single React tree rendering to multiple native targets — the same multi-renderer problem Claude Code's frontends face.

Two persistence strategies (snapshot vs append-only) that mirror the session save tradeoffs.

The canonical memoized selector library — the pattern behind AppState slice subscriptions that prevent unnecessary re-renders.

The definitive RL textbook covering MDPs, policy gradients, temporal-difference learning, and more.

The book that codified circuit breakers, bulkheads, and timeouts — the stability patterns directly applied in agent error recovery.

The original 1986 Nature paper that popularized backprop for training multi-layer networks.

Templeton et al. 2024 — dictionary learning at scale to find interpretable features in large models

Accessible walkthrough of merge algorithms with intuitive diagrams and concrete Mergekit YAML examples.

The thesis-length argument that the gap between ML design and ML-in-production is owned by the eval harness, not the model.

The argument that online/offline parity is the hardest SLO to enforce in an embedding platform. Directly relevant to the eval and canary sections of this module.

Hard-won practitioner lessons on tool-use reliability, prompt design for tool selection, and the gap between benchmark performance and real-world correctness.

The real-world consequences of CPU speculative execution gone wrong — illustrates why side-effect isolation (overlay FS) is non-negotiable before committing speculative work.

The CPU architecture concept — predict the branch, execute speculatively, commit or rollback.

The WAL mechanism that enables concurrent reads during writes — relevant to session checkpoint design.

The standard behind text/event-stream — event types, data fields, reconnection.

Primary source for Stable Diffusion architecture notes, SDXL improvements, and the open-weight model family that forms the technical baseline for most independent diffusion services.

The Python equivalent of Ink with CSS-style layout — useful cross-language comparison of the reactive TUI approach.

a16z analysis of cost structures in production LLM applications.

Bricken et al. 2023 — sparse autoencoders on a one-layer transformer find thousands of interpretable features; the predecessor to scaling monosemanticity

Elhage et al. 2022 — understanding how neural networks represent more features than dimensions

Kamath et al. 2025 — traces how attention QK circuits interact with features, extending induction head analysis to larger and more complex models

The TypeScript operator that validates config objects against a schema without widening the type — the compile-time complement to Zod's runtime validation.

Olah 2015 — the gold-standard visual explainer of recurrent memory; useful context for understanding why in-context learning is surprising in attention-only models

Glorot & Bengio 2010 — derives the √(2/(fan_in+fan_out)) initialization by analyzing variance flow through layers. The theoretical foundation for Xavier init.

Official docs for building VS Code extensions — the primary IDE integration surface for Claude Code.

How VS Code isolates extensions in a separate process — the same isolation model used by the Claude Code IDE bridge to sandbox the engine from the editor.

The protocol spec underlying IDE-to-engine communication in bridge mode.

The browser standard for backpressure-aware streaming — ReadableStream, WritableStream, and the pipe chain that async generators implement natively.

Gurnee et al. 2025 — studying how models use linebreaks and whitespace as geometric pivots in activation space

Formal state machines for agent control flow — the alternative to ad-hoc React state for tracking agent lifecycle (idle → running → waiting → done).

The metadata format used in memory files — structured data at the top of markdown documents.

The YAML spec underlying skill frontmatter — understanding anchors, block scalars, and type coercion prevents subtle parsing bugs.

The runtime validation library used for config schemas — bridges the gap between TypeScript types and runtime data.