
Transformer Math

Sources & Bibliography

Every paper, blog post, video, and code repository cited across all 84 modules — deduplicated and searchable.

345 total sources · 345 unique sources · 84 modules covered · 4.1 sources per module (average)

345 sources currently indexed across the wiki.

📄 Papers (137)

Azar et al. 2024 — introduces IPO to address DPO's tendency to overfit preference data.

Schaeffer et al. 2023 — argues apparent emergence is an artifact of discontinuous evaluation metrics, not a fundamental property of scale.

Michel et al. 2019 — shows most attention heads are redundant and can be pruned with minimal quality loss. Challenges assumptions about head count.

Vaswani et al. 2017 — the paper that introduced scaled dot-product attention and the Transformer architecture.

Wu et al., 2023 — a framework for multi-agent conversations with customizable agents.

Protects salient weight channels identified by activation magnitudes, achieving better quality than round-to-nearest at 4-bit.

The original BatchNorm paper — understanding why BN works for images clarifies why LayerNorm is needed for variable-length sequences.

(Ho & Salimans, 2022) — the CFG paper. Explains how to trade sample diversity for prompt fidelity without a separate classifier model.

Sumers et al., 2023 — formal taxonomy of agent memory: working, episodic, semantic, and procedural. The framework behind the file-based memory design.

Khattab & Zaharia 2020 — the architecture that keeps per-token embeddings and uses MaxSim scoring. Relevant to understanding why single-vector bi-encoders are the retrieval floor, not the ceiling.

Introduces the LR range test — the fastest practical method to find a good learning rate.

Drop 90% of task-vector deltas at random, rescale survivors by 1/(1-p), then merge — enables combining many models with near-zero accuracy loss.
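A minimal sketch of the drop-and-rescale step described above, in PyTorch; `dare_sparsify` and `dare_merge` are illustrative names, and task vectors are assumed to be plain tensors (fine-tuned weights minus base weights).

```python
import torch

def dare_sparsify(task_vector: torch.Tensor, drop_prob: float = 0.9) -> torch.Tensor:
    """Randomly zero a fraction of delta entries, then rescale survivors by 1/(1-p)."""
    keep = (torch.rand_like(task_vector) >= drop_prob).float()
    return task_vector * keep / (1.0 - drop_prob)

def dare_merge(base: torch.Tensor, task_vectors: list[torch.Tensor], drop_prob: float = 0.9) -> torch.Tensor:
    """Merge several fine-tuned models by adding their sparsified deltas to the shared base weights."""
    merged = base.clone()
    for tv in task_vectors:
        merged += dare_sparsify(tv, drop_prob)
    return merged
```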

The paper that introduced AdamW — shows why L2 regularization and weight decay are NOT equivalent in adaptive optimizers.

Lee et al. 2022 — rigorous study showing deduplication improves perplexity and reduces verbatim memorization risk across multiple LMs.

Wang et al. 2022 — combines Pre-LN and Post-LN with scaled initialization to train 1000-layer transformers stably. Used in GLM-130B.

Chen et al. 2024 — introduces Multi-Head Latent Attention (MLA), compressing KV cache to a shared low-rank latent with 93.3% reduction vs MHA.

DeepSeek 2024 — 671B MoE with auxiliary-loss-free load balancing and multi-token prediction

He et al. 2015 — extends Xavier init for ReLU activations by accounting for the halved variance from the rectification. Standard init for modern deep nets.

(Ho et al., 2020) — the foundational DDPM paper that revived diffusion models for image generation.

The DPR paper that established dual-encoder dense retrieval as the production baseline. The retrieval recall numbers here are the standard against which all Perplexity-style systems are measured.

The DPO paper — eliminates the reward model by directly optimizing policy from preference pairs.

Foundational paper showing that arithmetic on fine-tuned weight differences enables multi-task editing, forgetting, and analogy without retraining.

A practitioner paper on batching strategy, LoRA switching, and GPU utilization for diffusion serving at scale. The most directly relevant systems paper for this case study.

Mikolov et al. 2013 — introduced Skip-gram and CBOW, the foundation of modern word embeddings.

Narayanan et al. 2021 — combining data, tensor, and pipeline parallelism (3D parallelism) with concrete recipes for scaling to trillions of parameters.

The vLLM paper — virtual memory paging for KV cache, eliminating fragmentation and enabling continuous batching.

Tay et al., 2020 — covers KV cache and attention computation optimizations that make prompt caching possible.

Wei et al. 2022 — documents capabilities that appear unpredictably at scale, raising questions about whether scaling produces continuous or discontinuous improvements.

Shazeer 2019 — multi-query attention reduces KV cache size by sharing one KV head across all query heads. Direct precursor to GQA.

The original RLHF paper applying reward learning from human preferences to language models.

Alayrac et al. 2022 — few-shot multimodal learning with interleaved image-text inputs

Flash Attention v2 — improved work partitioning across warps and thread blocks for up to 2x additional speedup.

Shah et al. 2024 — Flash Attention v3 for H100, using async TMA and FP8 tensor cores to achieve 1.5-2x speedup over v2.

The original Flash Attention paper — tiling attention computation to exploit GPU SRAM, achieving 2-4x speedup.

Micikevicius et al. 2022 — defines E4M3 and E5M2 FP8 formats and training recipes. The format used by DeepSeek-V3 and H100 tensor cores.

Park et al., 2023 — memory retrieval, reflection, and planning for believable agent behavior.

Shazeer 2020 — shows SwiGLU and GeGLU outperform standard ReLU FFNs. SwiGLU is now the default in LLaMA and PaLM.

Patil et al., 2023 — improving LLM accuracy in API call generation via retrieval-augmented training.

Huang et al. 2019 — introduces micro-batching to reduce pipeline bubble overhead, the foundation of modern pipeline parallelism.

One-shot weight quantization using approximate second-order information, enabling 3-4 bit models with minimal quality loss.

Grouped-query attention — interpolates between MHA and MQA to reduce KV cache size with minimal quality loss.
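A minimal sketch of how a grouped-query layer reuses a small number of KV heads for many query heads by repeating them along the head dimension; the shapes and function name are illustrative.

```python
import torch

def expand_kv_for_gqa(kv: torch.Tensor, num_query_heads: int) -> torch.Tensor:
    """kv: [batch, kv_heads, seq, head_dim]. Each KV head serves a group of query heads."""
    batch, kv_heads, seq, head_dim = kv.shape
    group_size = num_query_heads // kv_heads          # e.g. 32 query heads / 8 KV heads = groups of 4
    return kv.repeat_interleave(group_size, dim=1)    # -> [batch, num_query_heads, seq, head_dim]
```

Only the `kv_heads` keys and values are cached; the repeat happens at attention time, which is where the KV-cache saving comes from.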

Schulman et al. 2016 — introduces GAE, the variance-reduction technique that PPO relies on for stable RLHF training.
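A minimal sketch of the GAE recursion for a single terminated episode, assuming lists of per-step rewards and value estimates; hyperparameter names follow the usual gamma/lambda convention.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Exponentially weighted sum of TD residuals, computed backwards through the episode."""
    advantages = [0.0] * len(rewards)
    next_value, running = 0.0, 0.0        # bootstrap value of 0 after the terminal step
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # one-step TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```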

(Rombach et al., 2022) — introduced latent diffusion, the architecture behind Stable Diffusion.

Liang et al. 2022 — multi-metric evaluation framework covering accuracy, fairness, robustness, and more

Luo et al. 2024 — Monte Carlo tree search to automatically generate step-level labels for PRM training without human annotation

Ram et al., 2023 — dynamically retrieving only the relevant chunks rather than keeping the full context, the retrieval complement to compaction.

Greshake et al., 2023 — how attackers inject instructions via tool results; the threat model behind PreToolUse deny rules and input sanitization.

Chen et al. 2023 — scaling ViT to 6B parameters and aligning with LLMs; shows dynamic resolution handling for OCR-heavy tasks

Wang et al. 2022 — end-to-end circuit analysis of a real linguistic capability; the canonical example of mechanistic interpretability on a real model

Zheng et al. 2023 — LLM-as-judge evaluation and Elo-based human preference ranking

Per-channel INT2 KV cache quantization — 2.35x memory reduction with negligible quality loss, enabling longer contexts on the same GPU.

Brown et al. 2020 — the GPT-3 paper showing that large-scale pretraining enables strong few-shot task performance without fine-tuning.

Recent survey of LLM-based multi-agent architectures and coordination patterns.

Ba et al. 2016 — the original LayerNorm paper. Normalizes across features instead of batch dimension.

Dubois et al. 2024 — length-controlled win rates to debias automatic evaluators; addresses verbosity inflation in GPT-4 judge scoring


Lightman et al. (2023). Process reward models outperform outcome reward models by scoring each reasoning step.

Zhou et al. 2023 — 1,000 carefully curated examples match far larger datasets. Quality over quantity for SFT.

Detailed post-training recipe: SFT (27K examples), rejection sampling, 5 rounds of RLHF.

Li et al. 2024 — single model handles single-image, multi-image, and video tasks; shows how to unify vision tasks with one instruction-tuned model

Kim et al., 2023 — DAG-based parallel function call planning, the same dependency-aware parallelism coordinators use to maximize worker throughput.

Dettmers et al. 2022 — mixed-precision INT8 quantization that handles outlier activations separately, enabling near-lossless 8-bit inference for 175B+ models.

Chen et al. 2023 — extends context window from 4K to 100K using shifted sparse attention during fine-tuning.

The original LoRA paper — freeze base weights, train low-rank decomposition matrices A and B.
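A minimal sketch of the idea, assuming a plain PyTorch linear layer; the effective weight is W + BA with W frozen (the scaling factor and dropout are omitted for brevity).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                         # freeze the pre-trained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at step 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.A.T @ self.B.T       # W x + B A x
```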

Liu et al., 2023 — LLMs struggle with information in the middle of long contexts, motivating compaction.

The foundational HNSW paper. Read Section 4 on layered graph construction and Section 5 on query complexity — essential for justifying M, ef_construction, and ef_search tradeoffs in an interview.

Hendrycks et al. 2020 — 57-subject benchmark testing broad knowledge and reasoning

Packer et al., 2023 — virtual context management with paging, a complementary approach to compaction.

The primary source for Llama-scale training infrastructure at Meta. Section 3 on pre-training covers the 3D-parallel strategy, checkpoint policies, and failure-recovery design that this case study is grounded in.

Hong et al., 2023 — structured multi-agent collaboration with role-based task decomposition.

Nguyen et al. 2024 — min-p sets a dynamic floor at p_min × max_prob, automatically adapting to the distribution without the fixed-cutoff fragility of top-p.
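A minimal sketch of the rule, assuming a 1-D tensor of next-token probabilities; the function name is illustrative.

```python
import torch

def min_p_filter(probs: torch.Tensor, p_min: float = 0.05) -> torch.Tensor:
    """Keep tokens with probability >= p_min * max probability, then renormalize."""
    threshold = p_min * probs.max()
    kept = torch.where(probs >= threshold, probs, torch.zeros_like(probs))
    return kept / kept.sum()

# next_token = torch.multinomial(min_p_filter(probs), num_samples=1)
```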

Mistral AI 2024 — open-weight MoE model with 8 experts per layer, top-2 routing

Average fine-tuned checkpoint weights from the same pre-trained base to reach flatter loss regions and improve accuracy over any single run.
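A minimal sketch of uniform souping over checkpoints that share one pre-trained base; it assumes state dicts with identical keys and shapes.

```python
import torch

def uniform_soup(state_dicts: list[dict]) -> dict:
    """Element-wise average of parameter tensors across fine-tuned checkpoints."""
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }
```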

Sanh et al., 2021. Shows that instruction fine-tuning on diverse tasks dramatically improves zero-shot prompting — why modern models follow instructions without few-shot examples.

Sennrich et al. 2016 — the paper that introduced Byte Pair Encoding for NLP tokenization.

Xiong et al. 2020 — analyzes Pre-LN vs Post-LN placement. Pre-LN enables stable training without warmup.

The online softmax algorithm that makes Flash Attention's tiled, single-pass attention computation possible.
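A minimal sketch of the streaming update: a running max and a running sum are corrected as each new score arrives, so the full row never has to be materialized at once (pure-Python illustration, not the fused kernel).

```python
import math

def online_softmax(scores):
    """One-pass softmax normalizer via a running max m and a rescaled running sum s."""
    m, s = float("-inf"), 0.0
    for x in scores:
        m_new = max(m, x)
        s = s * math.exp(m - m_new) + math.exp(x - m_new)   # rescale old sum, add new term
        m = m_new
    return [math.exp(x - m) / s for x in scores]
```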

Mitra et al. 2023 — smaller models trained on carefully synthesized step-by-step reasoning data can match much larger models; shows why data quality beats scale for SFT.

Hong et al. 2024 — combines SFT and preference optimization in a single pass using an odds ratio penalty, eliminating the need for a reference model entirely.

Shazeer et al. 2017 — the original MoE paper introducing sparsely-gated expert routing

First-principles analysis of memory bandwidth vs. compute bottlenecks in large model serving. The paper that grounds the batch-size/latency tradeoff in hardware arithmetic.

The PPO paper — clipped surrogate objective that became the default RL algorithm for RLHF.

Production lessons from scaling FSDP across thousands of GPUs at Meta.

4-bit NormalFloat quantization + LoRA adapters, enabling 65B model fine-tuning on a single 48GB GPU.

The evaluation framework for RAG systems — faithfulness, answer relevance, context precision, context recall. The eval metrics used in the cross-system comparison in this module are grounded in RAGAS.

Yao et al., 2022 — interleaving reasoning traces and actions for grounded decision-making.

Shinn et al., 2023 — agents that reflect on failures and improve across episodes.

Lewis et al., 2020 — the foundational RAG paper; the MEMORY.md index pattern is a lightweight, file-based approximation of RAG-style selective retrieval.

Coste et al. 2023 — shows that reward hacking can be reduced by ensembling multiple reward models, with quantitative analysis of the overoptimization curve.

Su et al. 2021 — RoPE encodes relative position via rotation, enabling length extrapolation used by LLaMA and Mistral.
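A minimal sketch of the rotation, assuming an input of shape (seq, dim) with an even dim; applied to both queries and keys, the subsequent dot product depends only on relative position. This is the interleaved-pair variant and is illustrative only.

```python
import torch

def rope(x: torch.Tensor, base: float = 10_000.0) -> torch.Tensor:
    """Rotate each (even, odd) channel pair by an angle proportional to position."""
    seq, dim = x.shape
    pos = torch.arange(seq, dtype=torch.float32)[:, None]                       # (seq, 1)
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)       # (dim/2,)
    angles = pos * freqs                                                        # (seq, dim/2)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * angles.cos() - x2 * angles.sin()
    out[:, 1::2] = x1 * angles.sin() + x2 * angles.cos()
    return out
```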

Zhang & Sennrich 2019 — drops the mean-centering step for a simpler, faster norm. Used by LLaMA and Mistral.

The paper that formally defines the cost-quality trade-off in LLM routing. Introduces the APGR/CGPT metrics and shows that a trained classifier-router can match 95% of GPT-4 quality at 40% of the cost.

(Peebles & Xie, 2023) — replaced U-Net with a Vision Transformer, showing clean scaling laws for diffusion.

Xu et al. 2023 — comprehensive survey comparing LoRA, adapters, prefix tuning, and prompt tuning across benchmarks.

Kaplan et al. 2020 — empirical power laws relating compute, data, and parameters to loss. Foundation of modern training budgets.

Snell et al. (2024). When to think longer vs. use a bigger model.

(Esser et al., 2024) — Stability's Stable Diffusion 3 paper: rectified flow training with MMDiT. FLUX.1 uses the same rectified-flow family but is a separate model from Black Forest Labs with no equivalent paper.

Liu et al., 2023 — extending context windows, reducing but not eliminating the need for compaction.

Wang et al., 2022. Sample multiple CoT reasoning paths and take the majority vote. Simple technique, significant accuracy gains.

The method behind Alpaca — use a model to generate its own instruction-tuning data.

Asai et al. 2023 — model learns to decide when to retrieve and to critique its own outputs with special reflection tokens.

Kudo & Richardson 2018 — unigram language model tokenizer used by T5, mT5, XLNet, and many multilingual models.

Zheng et al. 2023 — RadixAttention for KV cache reuse across requests, 5x throughput on multi-turn workloads

The canonical paper on LLM-judge calibration. If you take one idea: the judge needs its own eval, and that eval is a human-labeled subsample you refresh on a schedule.

The foundational paper for understanding why the LLM judge evaluating your RAG system needs its own calibration. Essential reading before designing any groundedness or citation-precision eval.

Meng et al. 2024 — removes the reference model from DPO entirely, using sequence-average log-probability as the implicit reward.

Migrates quantization difficulty from activations to weights via per-channel smoothing, enabling W8A8 quantization.

Patel et al. 2023 — separates prefill and decode across GPU clusters for optimal hardware utilization

Yang et al., 2024 — production agent harness design lessons from solving real GitHub issues; ACI (agent-computer interface) design principles.

Princeton / Chicago, 2023 — the benchmark that made coding-agent trajectory eval rigorous. Essential reading for any team designing agent eval harnesses.

Fedus et al. 2021 — top-1 routing with capacity factor and auxiliary balance loss; showed MoE scales to 1T+ params

Gunasekar et al. 2023 — Microsoft's "Textbooks Are All You Need" (phi-1): carefully curated, textbook-quality data lets a small model rival much larger ones.

Introduces nucleus sampling (top-p) — dynamically truncates the vocabulary to the smallest set covering probability p.
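A minimal sketch of the truncation step, assuming a 1-D tensor of probabilities; illustrative, not a library API.

```python
import torch

def top_p_filter(probs: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    """Keep the smallest set of tokens whose cumulative probability reaches p, then renormalize."""
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=0)
    cutoff = int(torch.searchsorted(cumulative, p).item()) + 1   # include the token that crosses p
    kept = torch.zeros_like(probs)
    kept[sorted_idx[:cutoff]] = probs[sorted_idx[:cutoff]]
    return kept / kept.sum()
```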

Gao et al. 2020 — influential open dataset combining 22 diverse sources. Used to train GPT-Neo and GPT-J.

Lester et al. 2021 — prompt tuning trains only soft prompt tokens while freezing the entire model. Matches full fine-tuning at 11B scale.

Trim-Elect Sign-Disjoint Merge: three-step algorithm that cuts inter-task interference by resolving parameter sign conflicts before averaging.

Schick et al., 2023 — training LLMs to decide when and how to call external tools.

Press et al. 2022 — adds linear bias to attention scores instead of positional embeddings. Enables length extrapolation.
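A minimal sketch of the bias matrix added to the attention logits: each head penalizes distant keys linearly, with head-specific slopes following the geometric schedule used in the paper (shapes are illustrative).

```python
import torch

def alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    """Returns a (heads, q, k) bias of -slope_h * (i - j) for causal positions j <= i."""
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).clamp(max=0)     # j - i, <= 0 in the causal region
    return slopes[:, None, None] * distance[None, :, :]       # add to attention logits before softmax
```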

Hoffmann et al. 2022 — proved most LLMs were undertrained. Optimal ratio: ~20 tokens per parameter.
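A back-of-the-envelope illustration of the ratio, using the common 6ND approximation for training FLOPs; the 70B model size is a hypothetical example.

```python
params = 70e9                                # hypothetical 70B-parameter model
optimal_tokens = 20 * params                 # ~20 tokens per parameter -> ~1.4T tokens
train_flops = 6 * params * optimal_tokens    # common 6 * N * D estimate of training compute
print(f"tokens ~ {optimal_tokens:.2e}, training FLOPs ~ {train_flops:.2e}")
```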

Geva et al. 2021 — interprets FFN layers as implicit key-value stores where keys match input patterns and values store output distributions.

Yao et al. (2023). Structured search over reasoning paths using BFS/DFS — the algorithmic foundation for o1-style test-time search.

Lambert et al. 2024 — end-to-end open post-training recipe covering data curation, SFT, DPO, and RLVR with full ablations and reproducible results.

Typical sampling — selects tokens whose information content is close to the expected information, producing more human-like text.

Clark et al. 2022 — scaling laws specific to MoE: how performance scales with number of experts, active params, and total params

Zou et al. 2023 — gradient-based suffix optimization that transfers across GPT-4, Claude, and Gemini; motivates the need for input classifiers

Press & Wolf 2017 — shows tying input and output embedding weights improves perplexity and saves parameters.
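A minimal sketch of the tying idiom in PyTorch: the output projection reuses the input embedding matrix, so the vocabulary projection is learned once.

```python
import torch.nn as nn

vocab_size, d_model = 50_000, 768
embedding = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = embedding.weight   # tied: one matrix serves both input lookup and output logits
```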

Liu et al. 2023 — visual instruction tuning connecting a vision encoder to an LLM

Peng et al. 2023 — extends RoPE context windows without full fine-tuning via interpolation. Technique used by Llama models for context extension.

Tunstall et al. 2023 — the first widely-adopted DPO model (Mistral-7B base), with a concrete recipe for distilled preference data generation.

✍️ Blogs (35)

Karpathy 2019 — systematic approach to debugging and training neural networks from scratch

Hornik, Stinchcombe, White 1989 — proves that a single hidden-layer MLP can approximate any continuous function on a compact domain given enough neurons.

Olah & Carter 2016 — the original visual explainer of attention mechanisms before transformers.

The practitioner post that reframed eval-first design for the AI engineering generation. The trajectory eval section of this module follows Hamel's framework.

In-depth posts on preference learning, RLHF variants, and alignment techniques including DPO analysis.

Comprehensive posts on RLHF, reward modeling, and alignment — covers reward hacking, Goodhart's law, and related failure modes.

Rigorous treatment of the chain rule, Jacobians, and multivariate gradient flow.

Survey of jailbreak techniques, prompt injection, and defenses — GCG suffixes, many-shot, and gradient-based attacks

Comprehensive survey of decoding strategies including temperature, top-k, top-p, and beam search — with analysis of quality tradeoffs.

Comprehensive overview of vision-language model architectures — from dual encoders (CLIP) to decoder-only multimodal LLMs

Covers expert parallelism, pipeline parallelism, and how MoE fits into distributed training strategies

Detailed breakdown of distributed training and inference strategies — parallelism, memory, and communication tradeoffs with worked examples.

Comprehensive survey of agent components: planning, memory, tool use, and multi-agent coordination.

Comprehensive derivation of REINFORCE, Actor-Critic, and PPO — essential prerequisite for understanding the RLHF training loop.

Exhaustive survey of prompting techniques: zero-shot, few-shot, CoT, self-consistency, ToT, ReAct, and automatic prompt optimization.

The canonical survey of RAG architectures — covers bi-encoders, cross-encoders, fusion-in-decoder, and long-context approaches. Essential background for defending any retrieval design choice.

Covers the spectrum from self-consistency (no verifier) through ORM to PRM — useful for understanding the tradeoffs in test-time compute strategies

Comprehensive deep-dive covering DDPM, score matching, DDIM, and the connection to stochastic differential equations.

Deep dive on context window extensions and memory-augmented transformers — the research landscape that motivates runtime compaction.

Step-by-step 3D walkthrough of a GPT model — trace every tensor through the forward pass.

A real-world case study of the explore/exploit tradeoff, diversity re-ranking, and the product/ML boundary in a large-scale feed system.

Intuitive introduction to RoPE with derivations — explains why rotation in complex space encodes relative position elegantly.

The clearest public explanation of the online/offline feature store architecture, backfill strategies, and train/serve skew. Written by practitioners who built the Uber Michelangelo feature store.

Jay Alammar — how BERT reuses the Transformer encoder with bidirectional attention and masked language modeling for pretraining.

Visual walkthrough of DeepSeek-R1's training recipe.

Jay Alammar — step-by-step visual breakdown of the full Stable Diffusion pipeline: text encoder, UNet denoiser, VAE decoder, and CLIP guidance.

Interactive visual explanation of GPT-2 running live in the browser — great for seeing attention weights.

Benchmark-grounded comparison of ANN algorithms with real recall/latency/memory numbers. Use this to back up the HNSW justification in the architecture deep dive.

🎬 Videos (15)

Animated geometric intuition for why the chain rule distributes gradient across a computation graph.

Visual intuition for what transformer attention heads actually compute — useful foundation before diving into circuit-level interpretability

Visual explanation of how token embeddings work and how meaning is encoded in high-dimensional vector space — geometric intuition.

Grant Sanderson 2024 — visual explanation of how MLP layers act as key-value memories storing factual associations, with the connection to superposition.

Visual intuition for how neurons, layers, and matrix multiplication combine to form a neural network. The gold standard for building geometric intuition.

Visual breakdown of the MLP (FFN) layers in Transformers — intuition for what the expansion and contraction matrices compute.

Codes scaled dot-product self-attention from scratch in ~50 lines of PyTorch — essential companion for internalizing Q, K, V.

Karpathy builds an MLP character-level language model from scratch — covers weight init, batch norm, learning rate tuning, and the vanishing gradient problem.

2.5-hour walkthrough building micrograd step by step — every chain rule application shown explicitly.

Covers the SFT and RLHF fine-tuning pipeline end-to-end, including practical data requirements and training tips.

The canonical talk on why averages and even p95 lie, and why p99/p99.9 are the only metrics that capture the user's experience. The HDR histogram argument is mandatory background for SLO design.

Visual walkthrough of latent diffusion, CLIP guidance, and how modern text-to-image systems combine these ideas.

💻 Code (17)

100-line autodiff engine and neural network library — the clearest possible implementation of backprop from scratch.

The cleanest GPT implementation in ~300 lines of PyTorch — read model.py to see the exact forward pass used in practice.

The older, imperative approach to terminal UIs — useful comparison point to understand what React TUI improves on.

Step-by-step walkthrough of building a React custom renderer — the same techniques Ink uses to target the terminal instead of the DOM.

HuggingFace 2024 — accessible walkthrough of the full FineWeb pipeline: Common Crawl ingestion, quality scoring, MinHash dedup, and ablation results. Best pedagogical resource on web-scale data curation.

The terminal React renderer whose components are adapted by ink-compat for cross-platform rendering.

Official list of reference MCP server implementations — filesystem, GitHub, Postgres, Puppeteer, and more.

Production-grade open-source library implementing SLERP, TIES, DARE, Task Arithmetic, and Model Soups on HuggingFace models.

How Mixtral replaces dense FFNs with sparse MoE layers — concrete explanation of routing, expert selection, and capacity factors.

Production BPE tokenizer powering GPT-4. Fast Rust implementation with Python bindings.

Hugging Face library implementing LoRA, prefix tuning, prompt tuning, and other PEFT methods.

The official package for building custom React renderers — the foundation Ink is built on.

The gold standard for terminal session persistence — background processes, detach/attach, and named sessions; the UX model that Claude Code's session persistence mirrors.

HuggingFace reference covering every major algorithm — WordPiece, Unigram, BPE, byte-level BPE — with concrete examples of merge rules.

A minimal state library for React — similar memoized selector pattern, different implementation.

🏢 Industry (30)

Anthropic 2024 — evidence that models can strategically fake alignment during training

Practical guide to designing task-specific evals, calibrating LLM judges, and building regression gates for Claude-based applications

Official reference for API error types, status codes, and recommended handling strategies.

Current pricing for all Claude models — input, output, and cached token rates.

Official docs for custom slash commands in Claude Code — creating, organizing, and distributing skills.

Claude-specific guidance on system prompts, XML tags for structure, extended thinking, and common pitfalls.

Official docs for streaming message responses via Server-Sent Events.

Anthropic. Practical guide to prompt design for agentic systems, tool use patterns, and orchestration.

Official reference for the allow/deny rule syntax, hook configuration, and the five-layer permission hierarchy.

Google Cloud cost optimization — the same principles (metering, budgets, alerts) apply to LLM spend.

Anthropic 2025 — defending against universal jailbreaks with constitution-trained input/output classifiers

The paper that introduced the two-tower architecture for candidate generation at scale. Still the reference implementation for user-tower + item-tower + ANN retrieval.

Command/Query Responsibility Segregation with event sourcing — the pattern behind session replay: store events, not snapshots, then replay to reconstruct state.

Product-level documentation for NotebookLM's Gemini 1.5 Pro long-context approach. Pair with Google I/O 2024 talks on Vertex AI Matching Engine.

OpenAI 2023 — safety evaluations and capabilities of GPT-4 with vision

Radford et al. 2019 — decoder-only forward pass. Demonstrates that a single left-to-right pass can perform many NLP tasks.

Anthropic 2024 — exploiting long context windows to jailbreak LLMs with many in-context examples

Dean & Ghemawat, 2004 — the original coordinator/worker pattern for distributed computation.

The reference design for LLM tool interfaces — parallel function calling, strict mode, and tool_choice options.

Technical details on RL-trained reasoning and safety evaluations.

OpenAI, 2024. Demonstrates learned test-time compute allocation via internal chain-of-thought reasoning.

Comprehensive guide covering tactics for getting better results: writing clear instructions, providing reference text, and splitting complex tasks.

Practical deep RL resource covering policy gradients, PPO, and SAC with working implementations — the best entry point before diving into RLHF.

How prompt caching works, cache breakpoints, and cost implications for agent systems.

🔗 Other (111)

The canonical rule for config management — strict separation of config from code, environment-variable-first, directly applicable to agent settings design.

Hewitt et al., 1973 — the Actor Model that underpins modern multi-agent message passing.

The browser/Node.js API for cooperative cancellation — the mechanism behind Ctrl+C propagation through the streaming pipeline.

The structural design pattern at the heart of the bridge — converting one interface (QueryEngine) into multiple frontend-specific interfaces.

52K instruction-response pairs generated via Self-Instruct from GPT-3.5. Popularized SFT for open-source.

The paper that defined SLO-driven design at scale. Section 4 on the latency-at-p99.9 requirement and its architectural implications is the playbook this module derives from.

Not an ML piece, but the discipline of writing the customer-facing press release before the design doc is the methodological backbone of this module.

The definitive pedagogical guide to SGD, momentum, AdaGrad, RMSProp, Adam, and their variants with clear math.

The low-level sequences that all TUI libraries emit — understanding CSI codes (cursor movement, color, erase) demystifies what the reconciler produces.

The enterprise device management protocol used to deploy organization-wide settings policies.

Hands-on coding tutorials for transformer interpretability — TransformerLens, induction heads, superposition, SAEs, and circuit analysis.

GNU Bash programmable completion — the shell mechanism that agent CLIs mirror for /command tab completion.

Live benchmark for LLM tool-calling accuracy — shows which models handle nested schemas, parallel calls, and error recovery best.

How CPUs predict which branch to take — the same predict-execute-verify pattern applies to agent speculation.

The open standard for embedding AI-generation provenance metadata in images. Relevant to the OpenAI company-lens discussion on watermarking and traceability.

The production distributed task queue — the software engineering analog of the coordinator/worker pattern, with retries, priorities, and result backends.

Chapter 4 on evaluation is the textbook reference. The cost-accounting chapter reframes LLM unit economics around request shape, not just token count.

The canonical post on production LLM engineering. The hallucination and evaluation sections ground the citation-precision and groundedness design choices in this module.

Chapter 7 on feature pipelines and Chapter 10 on infrastructure cover the embedding lifecycle — freshness, serving, versioning — at the right abstraction level for a senior design interview.

Practitioner overview of the economic and operational realities of large-scale training — goodput, failure modes, and the org structure implications of running a cluster at this scale.

Olah 2015 — the clearest visual explanation of backprop as reverse-mode autodiff on computational graphs, with step-by-step chain rule diagrams.

Olah 2014 — visual intuition for how neural networks learn representations of language, connecting word embeddings to deeper network features.

Olah 2014 — how neural networks warp data manifolds to make them linearly separable, foundational for understanding embedding geometry.

The pattern for preventing cascading failures when a downstream service is unhealthy.

Ameisen et al. 2025 — combining SAEs with attribution patching to trace full computational circuits in Claude 3.5 Haiku

The OS-level COW primitive that makes fork cheap — the same copy-on-write principle applied to filesystem overlays in agent speculative execution.

Lindsey 2025 — evidence of introspective awareness where models can report on their own internal representations

Sofroniew et al. 2026 — investigating how emotion concept representations form and function inside Claude Sonnet 4.5

The most thorough practitioner guide to LLM-as-judge design — rubric construction, bias mitigation, calibration.

Practitioner-grade breakdown of retrieval, ranking, and re-ranking patterns at scale. The canonical starting point for recsys system design.

Martin Fowler — storing state as a sequence of events, the pattern behind session replay.

AWS architecture blog on backoff strategies — full jitter outperforms equal jitter and decorrelated jitter.
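A minimal sketch of the full-jitter schedule described above: each retry sleeps a uniform random amount between 0 and the capped exponential backoff (the function name and the retry wrapper are illustrative).

```python
import random
import time

def full_jitter_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full jitter: uniform sample in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))

# Illustrative retry loop around a flaky call:
# for attempt in range(5):
#     try:
#         return flaky_call()              # hypothetical call
#     except TransientError:               # hypothetical exception type
#         time.sleep(full_jitter_delay(attempt))
```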

The design pattern that inspired AI tool hooks — shell scripts triggered by lifecycle events.

The git primitives for saving and restoring working state — the lightweight alternative to overlay FS for speculative edits confined to git-tracked files.

The git primitive that enables parallel sub-agents to operate on the same repo without file conflicts.

Pennington et al. 2014 — combines count-based and predictive methods for learning word vectors.

The M/M/1 queueing argument and the 70% utilization cap are made explicit here. The error-budget math in the SLO chapter pairs with this module's queueing deep dive.

The feature flag service used for build-time dead code elimination and gradual rollouts.

Copy-on-write immutable update library — used in agent state managers to safely update deeply nested AppState without mutation bugs.

Olsson et al. 2022 — induction heads as the mechanism behind in-context learning, emerging as a phase change during training.

The transport protocol underlying MCP communication between client and server.

The widely-adopted reference implementation of the tool-call loop — useful comparison to Claude Code's agent loop.

Inspiration for MCP — a similar protocol for editor-language server communication.

The industry standard for feature flag systems — covers build-time vs. runtime flags, targeting rules, and rollout strategies used in agent feature gating.

Live benchmark tracking price, throughput, and latency across all major LLM providers — the reference for model selection decisions in cost-aware agents.

Step-by-step guide to building and connecting your first MCP server — covers stdio and Streamable HTTP transports.

Reference for async function*, for await...of, and the async iteration protocol.

The open standard for connecting AI agents to external tools and data sources.

Nanda 2023 — comprehensive guide to getting started in mech interp research, with recommended papers, exercises, and learning path.

Open-source platform for exploring 50M+ sparse autoencoder features across GPT-2, Gemma, Llama — hands-on companion to the theory in this module.

Official Node.js guide on backpressure — the mechanism that prevents unbounded memory growth when the consumer is slower than the producer.

Official NVIDIA guide covering loss scaling, tensor cores, and the AMP workflow with benchmarks.

Official docs for TensorRT-LLM — NVIDIA's optimized LLM inference library for its GPUs.

The auth standard underlying MCP's authorization for remote servers.

Lindsey et al. 2025 — probing Claude 3.5 Haiku's internal computations.

The open standard for emitting cost, latency, and token metrics — the instrumentation layer beneath production LLM cost dashboards.

Yu et al. 2022 — the continuous batching paper (Orca), showing iteration-level scheduling that keeps GPUs busy instead of waiting for the longest sequence.

The kernel filesystem that inspired the copy-on-write pattern used in speculative execution.

Security risks specific to LLM applications — including prompt injection and insecure tool use.

Phind's description of their code-specialized retrieval pipeline, domain-weighted re-ranking, and function-level citation design.

How open-source serving frameworks implement automatic prefix caching — the same mechanism Anthropic exposes via cache_control breakpoints.

The foundational security principle behind the permission hierarchy — grant minimum necessary access.

Official deep-dive into how autograd builds the computation graph, handles in-place ops, and propagates gradients — essential for debugging gradient issues

Debugging torch.compile graph breaks, dynamic shapes, and recompilations — increasingly important for modern training pipelines

Reference for torch.distributed, NCCL backend, process groups, and the DDP/FSDP/RPC APIs that underpin every production training stack.

PyTorch docs — common issues with memory, parallelism, and reproducibility

The gold standard for a single React tree rendering to multiple native targets — the same multi-renderer problem Claude Code's frontends face.

Two persistence strategies (snapshot vs append-only) that mirror the session save tradeoffs.

The canonical memoized selector library — the pattern behind AppState slice subscriptions that prevent unnecessary re-renders.

The definitive RL textbook covering MDPs, policy gradients, temporal-difference learning, and more.

The book that codified circuit breakers, bulkheads, and timeouts — the stability patterns directly applied in agent error recovery.

The original 1986 Nature paper that popularized backprop for training multi-layer networks.

Templeton et al. 2024 — dictionary learning at scale to find interpretable features in large models

Accessible walkthrough of merge algorithms with intuitive diagrams and concrete Mergekit YAML examples.

The thesis-length argument that the gap between ML design and ML-in-production is owned by the eval harness, not the model.

The argument that online/offline parity is the hardest SLO to enforce in an embedding platform. Directly relevant to the eval and canary sections of this module.

Hard-won practitioner lessons on tool-use reliability, prompt design for tool selection, and the gap between benchmark performance and real-world correctness.

The real-world consequences of CPU speculative execution gone wrong — illustrates why side-effect isolation (overlay FS) is non-negotiable before committing speculative work.

The CPU architecture concept — predict the branch, execute speculatively, commit or rollback.

The WAL mechanism that enables concurrent reads during writes — relevant to session checkpoint design.

The standard behind text/event-stream — event types, data fields, reconnection.

Primary source for Stable Diffusion architecture notes, SDXL improvements, and the open-weight model family that forms the technical baseline for most independent diffusion services.

The Python equivalent of Ink with CSS-style layout — useful cross-language comparison of the reactive TUI approach.

a16z analysis of cost structures in production LLM applications.

Bricken et al. 2023 — sparse autoencoders on a one-layer transformer find thousands of interpretable features; the predecessor to scaling monosemanticity

Elhage et al. 2022 — understanding how neural networks represent more features than dimensions

Kamath et al. 2025 — traces how attention QK circuits interact with features, extending induction head analysis to larger and more complex models

The TypeScript operator that validates config objects against a schema without widening the type — the compile-time complement to Zod's runtime validation.

Olah 2015 — the gold-standard visual explainer of recurrent memory; useful context for understanding why in-context learning is surprising in attention-only models

Glorot & Bengio 2010 — derives the √(2/(fan_in+fan_out)) initialization by analyzing variance flow through layers. The theoretical foundation for Xavier init.

Official docs for building VS Code extensions — the primary IDE integration surface for Claude Code.

How VS Code isolates extensions in a separate process — the same isolation model used by the Claude Code IDE bridge to sandbox the engine from the editor.

The protocol spec underlying IDE-to-engine communication in bridge mode.

The browser standard for backpressure-aware streaming — ReadableStream, WritableStream, and the pipe chain that async generators implement natively.

Gurnee et al. 2025 — studying how models use linebreaks and whitespace as geometric pivots in activation space

Formal state machines for agent control flow — the alternative to ad-hoc React state for tracking agent lifecycle (idle → running → waiting → done).

The metadata format used in memory files — structured data at the top of markdown documents.

The YAML spec underlying skill frontmatter — understanding anchors, block scalars, and type coercion prevents subtle parsing bugs.

The runtime validation library used for config schemas — bridges the gap between TypeScript types and runtime data.