📐 The Design Doc
A system with p99 = 500 ms needs roughly 3× the GPU capacity of one with p99 = 1 s. The right SLO choice is the entire design — yet most engineers write the SLO last.
Most ML system-design content teaches a framework. Senior interviews test whether you can write the design doc live.
This module is a worked example: POST /v1/complete for a paying enterprise tier, from blank page to signed-off doc. Copy the order of thinking, then compare it with the Cost & Eval module and the SLO vs. Cost comparison.
Step 1 — Requirements & SLOs
The Working-Backwards prompt
Before touching architecture, write what the customer sees. One paragraph, no engineering words. If you can't write it, you don't understand the problem yet.
The SLO table (always a table)
| Metric | Target | Why this value |
|---|---|---|
| p50 TTFT | — | Below human reading start time for typical prompts |
| p95 TTFT | 800 ms | Tail users should still feel the product as “instant” |
| p95 full completion (256 tok) | 3 s | Streaming hides most of this; 3 s is the hard ceiling |
| Availability | 99.9% / month | Matches enterprise SaaS norm |
| Cost ceiling | $3 / 1M output tok | Derived from retail price × (1 − 35% target gross margin) |
Quick check
The SLO table sets availability at 99.9%/month. How many minutes of downtime does this budget allow per calendar month?
Step 2 — Eval Harness (design first)
This is the section every candidate skips and every senior reviewer circles. You cannot design a system you cannot measure. Write the eval before the architecture — it will change how you design.
What we're measuring
- Correctness — does the completion satisfy the customer intent? Approximated by LLM-judge on a golden set of ~500 prompt/ideal-output pairs, with an initial human-reviewed calibration round on a 50-example subsample to anchor the judge.
- Latency — p50/p95 TTFT and per-token latency, by prompt-length bucket and by tier. One aggregate number hides the regression you actually care about.
- Reliability — request-success rate excluding client errors. Broken out by failure mode (timeout, OOM, safety refusal).
- Safety — refusal rate on an adversarial set and false-refusal rate on a benign-but-sensitive set. Asymmetric costs: a false refusal annoys; a true violation ends the contract.
- Unit economics — cost per successful completion, segmented by tenant. This becomes the gate for the cost SLO above.
Sizing the golden set
For a pass/fail gate, the 95% confidence-interval half-width is w ≈ 1.96 × √(p̂(1−p̂) / n), where p̂ is the expected pass rate and n is the sample size. At p̂ = 0.80 and n = 200, w ≈ 1.96 × √(0.16 / 200) ≈ 0.055, so the CI is roughly ±5.5 pp — adequate for a top-level gate. At p̂ = 0.90 the same n gives ±4.2 pp; at p̂ = 0.95 it narrows to ±3.0 pp, because binomial variance peaks at p̂ = 0.50 and shrinks toward the extremes.
Note: for multi-cohort drill-downs (per tier, per prompt-length bucket), do not multiply a single pool size by the number of cells. Each cell has its own base rate and therefore its own required n. A cell where the easy-prompt tier passes at 95% needs far fewer examples to pin a ±3 pp CI than a cell where the adversarial tier passes at 60%. Size each cell independently, then sum — the aggregate is usually 2–4× higher than the naive “200 × cells” estimate would predict.
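A minimal sketch of the sizing math in Python; the cohort names and pass rates in the example are hypothetical.

```python
import math

def ci_halfwidth(p_hat: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation half-width of a 95% CI on a pass rate."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

def n_for_halfwidth(p_hat: float, w: float, z: float = 1.96) -> int:
    """Smallest n whose CI half-width is <= w at expected pass rate p_hat."""
    return math.ceil((z / w) ** 2 * p_hat * (1 - p_hat))

# Reproduce the numbers above.
print(ci_halfwidth(0.80, 200))  # ~0.055 -> +/-5.5 pp
print(ci_halfwidth(0.90, 200))  # ~0.042 -> +/-4.2 pp
print(ci_halfwidth(0.95, 200))  # ~0.030 -> +/-3.0 pp

# Size each cell independently for a +/-3 pp gate, then sum.
cells = {"easy-tier": 0.95, "chat": 0.85, "adversarial": 0.60}
print({c: n_for_halfwidth(p, 0.03) for c, p in cells.items()})
# ~{easy-tier: 203, chat: 545, adversarial: 1025}: ~3x the naive 200 x 3
```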
The offline/online bridge
An offline eval win that doesn't show up online is a signal you're measuring the wrong thing. Spec the bridge up front: which online metric does each offline metric predict, and what's the expected effect size?
Step 3 — Back-of-Envelope
Now — and only now — the math. Seed the calculator with the SLO you just wrote. The default below is deliberately under-provisioned for the 1,000-QPS target at 70B; move the sliders and find the inflection where the GPU-count curve bends.
| Quantity | Value |
|---|---|
| Model weights (FP16) | 140 GB |
| KV cache / request | 257.7 MB (768 tokens) |
| Tokens/sec per GPU | 300 |
| Effective QPS (after cache) | 700 |
| GPUs needed | 600 |
| GPU memory usage | 60% |
| Compute utilization | 100% |
| Est. p95 latency | 3.56 s |
| Bottleneck | Compute/bandwidth |
| Monthly cost | $876,000 |
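A sketch that reproduces the calculator's outputs. The model config (a Llama-70B-class layout: 80 layers, 8 grouped-query KV heads, head dim 128) and the blended $2/GPU-hour rate are assumptions inferred to match the table, not values stated by it.

```python
import math

def kv_mb_per_request(tokens=768, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V tensors, per token, per layer, FP16
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens / 1e6

def gpus_needed(qps=1000, cache_hit=0.30, out_tokens=256, tok_s_per_gpu=300):
    effective_qps = qps * (1 - cache_hit)          # 700 after prefix-cache hits
    return math.ceil(effective_qps * out_tokens / tok_s_per_gpu)

def monthly_cost_usd(gpus, usd_per_gpu_hour=2.0, hours_per_month=730):
    return gpus * usd_per_gpu_hour * hours_per_month

print(kv_mb_per_request())    # ~251.7 MB, close to the table's 257.7 MB
print(gpus_needed())          # 598, rounded up to 600 in the table
print(monthly_cost_usd(600))  # 876000.0
```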
Step 4 — Architecture
What you’re seeing: a generic LLM serving stack with eight components — load balancer, router, prefix cache, decode workers, KV store, autoscaler, monitoring, and object store — arranged by data-flow order from request ingress to token streaming. What to try: identify the prefix cache node and trace which components it short-circuits — that path is where 30–60% of real-world inference cost is saved.
Eight components, one reason for each to exist. In a real doc, every box has a paragraph justifying its inclusion against the SLO table.
Production LLM Serving Architecture
Justification table (abbreviated)
- API Gateway — auth + per-tenant rate limit + token-budget enforcement. Needed because the cost SLO is per-tenant, not global.
- Request Queue — decouples gateway from GPU pool, enables fair scheduling across tenants. Needed because without it, one hot tenant starves the rest.
- Model Router — sends cheap/short prompts to a small model, expensive prompts to the flagship. Needed because the cost SLO can't be met with a single model.
- Model Server with PagedAttention — continuous batching, prefix cache, KV-cache paging. Needed because the throughput SLO can't be met with naive batching.
- Streaming response — SSE back to the client. Needed because the TTFT SLO would otherwise be dominated by full completion latency.
Step 5 — Deep dives on the two risky components
Deep-dive the two components with the highest downside if they're wrong. Here, that's the router and the serving engine.
Deep dive A — Model Router (the cost lever)
Approach
The router's job is to sort every incoming prompt into one of two buckets — “small model sufficient” or “flagship required” — before the request touches a GPU. The standard mechanism is a lightweight embedding-based binary classifier: encode the prompt with a frozen sentence encoder (e.g., a 100M-parameter text-embedding-3-small class model), then pass the embedding through a shallow MLP or logistic head trained on ~5K labeled (prompt, quality-delta) pairs. Quality delta is defined as the LLM-judge score on the flagship's response minus the LLM-judge score on the small model's response — if that gap exceeds a threshold (e.g., 0.15 on a 0–1 scale), the example is labeled “hard.” The RouteLLM paper (Ong et al. 2024) benchmarks several router architectures — SW-ranking, BERT-based, and matrix-factorization — and finds that even a simple causal LLM router trained on preference data achieves substantial cost savings relative to always-flagship on MMLU-style benchmarks (the 40–85% range cited in the real-world example below).
The classifier latency budget is tight: router inference must complete within ~10 ms (P99) so it doesn't add to the TTFT SLO. This is achievable with a sub-100M encoder and shallow head on CPU, or a batched GPU call amortized across the request queue. A confidence gate is required: if P(hard) falls in [0.40, 0.60], default to flagship rather than guess. Start with a wider 0.35/0.65 band and tighten based on shadow-eval data.
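A minimal sketch of that decision under the assumptions above; `embed` is a hypothetical stand-in for the frozen sentence encoder, and the head weights would come from training on the labeled quality-delta pairs.

```python
import numpy as np

def embed(prompt: str) -> np.ndarray:
    """Hypothetical stand-in for a frozen ~100M-param sentence encoder."""
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.normal(size=384)

def p_hard(prompt: str, w: np.ndarray, b: float) -> float:
    """Logistic head over the frozen embedding: P(flagship required)."""
    return float(1.0 / (1.0 + np.exp(-(embed(prompt) @ w + b))))

def route(prompt: str, w: np.ndarray, b: float, lo=0.35, hi=0.65) -> str:
    p = p_hard(prompt, w, b)
    if p >= hi:
        return "flagship"
    if p <= lo:
        return "small"
    return "flagship"  # confidence gate: inside the uncertainty band, don't guess

w, b = np.zeros(384), 0.0  # placeholder weights; train on (prompt, quality-delta) pairs
print(route("Summarize this contract clause in one sentence.", w, b))
```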
Trade-off
The non-obvious trade-off is not accuracy vs. cost — it's routing latency vs. batch coherence. A synchronous on-path classifier adds latency before batching begins. Moving it off the hot path (async pre-classification at queue insertion) forces separate “easy” and “hard” sub-queues drained by the GPU scheduler independently. That hides routing latency in queue time and keeps batch purity high, but doubles scheduler complexity: two KV-cache pools, two admission-control loops, and two separate preemption surfaces. The rule is: run the classifier synchronously only if its P99 stays under 15 ms; otherwise pre-classify at queue insertion and accept the dual-pool cost.
Failure mode
Classifier distribution drift. When the product ships a new use case (e.g., code generation added to a previously chat-only endpoint), the new prompt embeddings fall outside the training manifold, and the classifier systematically mis-routes them — typically toward “easy” because novel prompts don't match the hard-pattern vocabulary. The result is a silent quality regression: the cost dashboard looks fine while the LLM-judge score for the new cohort tanks. The cost-SLO alert never fires; you only see it if you segment the quality dashboard by prompt type.
Detection metric
Continuously shadow-sample 5% of small-model-routed requests, replay through the flagship, and compute the daily quality-delta P75. Alert when P75 exceeds 0.12 (flagship consistently outscoring small model above the training threshold). Secondary signal: if more than 25% of requests land in the [0.40, 0.60] uncertainty band, the classifier is encountering out-of-distribution inputs and a retrain is overdue — don't wait for the quality-delta alarm.
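A sketch of the two drift signals as a daily batch job, assuming the shadow-sample quality deltas and per-request router confidences are already collected; the thresholds are the ones named above.

```python
import numpy as np

def router_drift_signals(quality_deltas, p_hards, lo=0.40, hi=0.60):
    """quality_deltas: judge(flagship) - judge(small) on the 5% shadow sample.
    p_hards: router confidence for every routed request that day."""
    p_hards = np.asarray(p_hards)
    p75 = float(np.percentile(quality_deltas, 75))
    band_share = float(np.mean((p_hards > lo) & (p_hards < hi)))
    return {
        "quality_delta_p75": p75,
        "alert_quality": p75 > 0.12,     # flagship consistently outscoring small
        "band_share": band_share,
        "alert_ood": band_share > 0.25,  # out-of-distribution: retrain overdue
    }
```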
Mitigation
Monthly retrain cycle: pull the last 30 days of shadow-labeled traffic, merge with the original seed set, retrain only the classifier head (keep encoder frozen). Annotate newly detected prompt clusters manually — 20–50 labels per cluster — before adding to training; this is where the labeling budget goes, not on re-annotating already-covered traffic. Secondary effect: each retrain shifts the decision boundary, temporarily increasing flagship usage for 24–48 hours post-deploy. Schedule router deploys outside peak traffic windows.
Real-world example
RouteLLM (Ong et al., arXiv 2406.18665) reports 40–85% API cost reduction depending on quality-threshold setting, with GPT-4-class quality maintained at the 50th-percentile cost point across MMLU, MT-Bench, and MATH benchmarks. The key empirical finding relevant to our design: preference-data-trained routers generalize better than embedding classifiers when the easy/hard distinction is task-dependent rather than length-dependent — directly applicable here since short enterprise prompts can still require flagship reasoning (e.g., one-line math problems or terse policy queries).
Deep dive B — Model Server (the latency lever)
Approach
The serving engine is vLLM with PagedAttention (Kwon et al. 2023). The central insight: KV-cache memory fragmentation is structurally identical to the OS virtual-memory paging problem. The KV cache is divided into fixed-size blocks; a block table maps each sequence's logical KV positions to physical GPU memory blocks allocated on demand and freed on sequence completion. Near-zero internal fragmentation results. The vLLM team reports up to 24× the throughput of naive HuggingFace Transformers serving and 1.7× over Orca (the prior continuous-batching baseline) on 13B/175B models on A100s; gains are largest when output lengths vary widely, because that's exactly when pre-allocation wastes the most memory.
Two additional features compound the gain: (1) continuous batching — new requests join the batch at any iteration step, keeping GPU utilization near 100% under sustained load; (2) prefix caching — system prompts shared across tenant requests are stored as reusable KV blocks. At the 30% cache-hit-rate design assumption, prefix caching eliminates ~30% of prefill FLOPS on the cached portion, reducing p50 TTFT by roughly 15–20% for the enterprise tier where system prompts are long and repeated.
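A toy sketch of the block-table idea (not vLLM's actual implementation): blocks are allocated only when a sequence writes into a new one and are returned to the pool on completion, so internal fragmentation is bounded by one partial block per sequence.

```python
class KvBlockPool:
    def __init__(self, num_blocks: int, block_tokens: int = 16):
        self.free = list(range(num_blocks))    # physical block ids
        self.block_tokens = block_tokens
        self.table: dict[str, list[int]] = {}  # seq_id -> physical blocks
        self.tokens: dict[str, int] = {}       # seq_id -> tokens written

    def append_token(self, seq_id: str) -> bool:
        n = self.tokens.get(seq_id, 0)
        if n % self.block_tokens == 0:         # current block full: allocate another
            if not self.free:
                return False                   # pool exhausted -> preemption path
            self.table.setdefault(seq_id, []).append(self.free.pop())
        self.tokens[seq_id] = n + 1
        return True

    def release(self, seq_id: str) -> None:    # sequence done: blocks reusable at once
        self.free.extend(self.table.pop(seq_id, []))
        self.tokens.pop(seq_id, None)
```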
Trade-off
The non-obvious trade-off with continuous batching is prefill-decode interference. When a long-context request enters the batch mid-stream, its prefill phase (compute-intensive, many tokens at once) competes with the decode phases of already-running requests (memory-bandwidth-bound, one token per request per step). On an 8-way tensor-parallel A100 group, a 4K-token prefill occupies the compute units for roughly 170–230 ms (back-of-envelope: 4096 tokens × 70B parameters × 2 FLOPs/parameter ≈ 5.7 × 10¹⁴ FLOPs ÷ (8 × 312 TFLOP/s) ≈ 230 ms at full utilization), during which all decode slots stall. The p95 TTFT for every short request queued behind it spikes by hundreds of milliseconds — toward the 1–2 s regression quantified in Step 6. Chunked prefill — splitting the 4K prefill into 512-token chunks interleaved with decode steps — eliminates the stall at the cost of a modestly longer prefill completion time, since each chunk waits its turn between decode steps. Full prefill/decode disaggregation onto separate GPU pools becomes worth the cross-pool KV-transfer overhead when prefill share crosses approximately 30% of total GPU-seconds (per Sarathi-Serve, Agrawal et al. 2023).
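The stall arithmetic as a function, under the same assumptions (8-way tensor parallelism, dense BF16 peak, perfect utilization):

```python
def prefill_ms(prompt_tokens: int, params=70e9, tp_gpus=8, tflops_per_gpu=312, mfu=1.0):
    """Compute-bound prefill time: 2 FLOPs per parameter per token."""
    flops = prompt_tokens * params * 2
    return flops / (tp_gpus * tflops_per_gpu * 1e12 * mfu) * 1e3

print(prefill_ms(4096))  # ~230 ms: the full-prefill stall
print(prefill_ms(512))   # ~29 ms per chunk under chunked prefill
```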
Failure mode
KV-cache exhaustion under bursty load. Under PagedAttention, KV blocks are allocated lazily per sequence. A sudden 10× burst from one enterprise tenant allocates blocks faster than completing sequences free them. When the block table fills, vLLM preempts the lowest-priority sequences — introducing recomputation latency of hundreds of milliseconds per preempted sequence — and cascading TTFT regressions follow for every other tenant. The failure is silent until it cascades: a single burst tenant can consume 40% of the block table before the first preemption fires, by which time every tenant's p95 TTFT has already regressed past SLO.
Detection metric
Export vllm:gpu_cache_usage_perc as a Prometheus gauge sampled every 5 seconds. Set a yellow alert at 75% (throttle burst-tenant admission) and a red alert at 90% (hard-stop all new admissions, return HTTP 429 with Retry-After). Secondary signal: per-tenant block-table share — if any single tenant holds more than 40% of allocated blocks, the fair-share scheduler is not enforcing its weight. vLLM exposes per-sequence block counts via its engine stats endpoint; wire it into the tenant-level dashboard alongside queue-depth and TTFT p95.
Mitigation
Two-layer defense: (1) Admission control at the queue layer — when cache usage crosses 75%, stop admitting new sequences from tenants whose 5-minute request rate exceeds 3× their baseline, while continuing to admit normal-rate tenants. This limits blast radius to the bursting tenant without global degradation. (2) Preemption policy hardening — configure recompute (not swap-to-CPU) for sequences under 256 generated tokens; recompute is faster than a CPU memory round-trip for short sequences. Secondary effect to watch: aggressive admission control raises queue-wait latency for burst tenants, which triggers their client-side timeouts and produces a retry wave that worsens the burst. Set the 429 Retry-After header to 2× the current p50 queue-drain time, not a fixed value.
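A sketch of the two-layer admission decision; the thresholds come from the detection section above, and the Retry-After rule follows the mitigation text.

```python
def admission_decision(cache_usage: float, tenant_rate: float,
                       tenant_baseline: float, p50_drain_s: float) -> dict:
    reject = {"admit": False, "status": 429, "retry_after_s": 2 * p50_drain_s}
    if cache_usage >= 0.90:                    # red: hard-stop all new admissions
        return reject
    if cache_usage >= 0.75 and tenant_rate > 3 * tenant_baseline:
        return reject                          # yellow: throttle burst tenants only
    return {"admit": True}

print(admission_decision(0.80, tenant_rate=45.0, tenant_baseline=10.0, p50_drain_s=1.5))
# {'admit': False, 'status': 429, 'retry_after_s': 3.0}
```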
Real-world example
The vLLM paper (Kwon et al., arXiv 2309.06180) ablates the preemption path directly: under the ShareGPT workload (highly variable output lengths), recomputation outperforms swap-to-CPU for the short sequences that dominate, keeping preemption overhead small enough to preserve the 2–4× throughput gains. The paper also quantifies prefix-sharing: for multi-turn conversations, KV-cache memory usage drops by up to 55% — directly validating the prefix-cache decision for our enterprise tier where system prompts are long and reused across many requests per session.
You've drafted the requirements, the eval, the capacity math, and the architecture. The interviewer asks for ‘one more deep dive.’ Which component do you pick?
Step 6 — Break It
Three failure modes, with the detection and mitigation already spec'd. Every failure should map back to a metric in the eval harness.
Remove the request queue
A single hot tenant saturates the GPU pool, starving everyone else. Without queue-level fair-share scheduling, per-tenant rate limits at the gateway only control request count, not GPU-seconds — a cheap tenant sending 10 RPS of 2048-token generations starves a paying tenant sending 100 RPS of 128-token generations. Detection: tenant-level latency skew in the eval harness. Mitigation: fair-share scheduler with weighted GPU-second budgets per tier (sketched after this list).
Remove the model router
Send 100% of traffic to the flagship model. Cost per 1M output tokens roughly triples, breaching the $3 ceiling. Gross margin turns negative on the free tier within a month. Detection: unit-economics dashboard crosses cost SLO. Mitigation: restore routing — but also reconsider whether the free tier is a viable product shape.
Remove chunked prefill
Requests process prefill atomically, no interleaving with decode. A single 4K-token prompt blocks all decode slots for ~200 ms. p95 TTFT for everyone queued behind it regresses to 1–2 seconds, depending on queue depth. Detection: TTFT p95 alarm on long-prompt cohort. Mitigation: chunked prefill, or in severe cases prefill/decode disaggregation onto separate GPU pools.
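The fair-share scheduler from the first failure mode, as a minimal sketch: serve the non-empty tenant queue with the lowest GPU-seconds consumed relative to its tier weight. Tenant names and weights here are hypothetical.

```python
def pick_next_tenant(queues: dict, used_gpu_s: dict, weights: dict):
    eligible = [t for t, q in queues.items() if q]  # tenants with waiting work
    if not eligible:
        return None
    # Lowest weighted usage goes first, so GPU-seconds (not request count) is fair.
    return min(eligible, key=lambda t: used_gpu_s[t] / weights[t])

queues = {"enterprise": ["req-a"], "free": ["req-b", "req-c"]}
used_gpu_s = {"enterprise": 120.0, "free": 90.0}
weights = {"enterprise": 4.0, "free": 1.0}
print(pick_next_tenant(queues, used_gpu_s, weights))  # enterprise: 30 < 90 weighted
```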
Quick check
The model router is removed and all traffic routes to the flagship model. What is the first metric to breach SLO, and what is the approximate cost multiplier?
Step 7 — What does a bad day cost?
Reliability is a dollar number, not a percentage. The design doc is signed off when these three numbers are written down and agreed.
- 30-minute partial outage (routing broken, all traffic to flagship) — cost overrun roughly equals the delta between flagship and small-model cost × 30-minute traffic volume. The number itself is small; the interesting question is how long it takes to detect. Detectable within 2 minutes by cost-SLO alert — if you wait a full day, the same delta is closer to $130K.
- Full regional outage (15 minutes) — SLA credit math: assume a $50K MRR enterprise account on a standard tiered SLA where <99.9% monthly availability triggers a 10% service credit. Monthly availability = (43,200 − 15) / 43,200 ≈ 99.965%, which stays above the 99.9% threshold — so a single 15-minute outage does not trigger the credit tier at all for a monthly-measured SLA. To breach the 99.9% tier you need >43 minutes of downtime in the calendar month. If this is the second incident that month and total downtime crosses 44 minutes, the credit fires: 10% × $50K MRR = $5,000 per account. With 20 affected enterprise accounts, that is $100K in credits — plus the engineering opportunity cost of the response, which at a 5-person team × 4 hours × a blended eng cost of ~$150/hr is another $3K. The key insight: a single short outage is cheap; the credit trigger is a threshold function, not a linear percentage, so a second incident in the same month is 10–100× more expensive than the first.
- Silent quality regression (4 hours before eval catches it) — worked churn math: assume 20% of hard queries silently route to the small model for 4 hours (14,400 s) at 1,000 QPS = 2.88M affected completions. Assign a 2% probability that any affected enterprise user files a complaint, and a 5% churn probability per complaint on a $50K ACV account. Expected churn cost: 0.02 × 0.05 × $50K × (number of affected accounts). For 10 affected accounts: 0.02 × 0.05 × $50K × 10 = $500 in expected ACV — seemingly small, but detection-window sensitivity changes everything. If the window is 4 hours (no shadow eval), the regression affects 10 accounts before correction. If the window is 5 minutes (continuous shadow eval at 5% sample rate), the regression is caught after ~17K affected completions and reaches <1 account before mitigation — a 10× reduction in exposure.
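The credit threshold and the exposure window as code, reproducing the figures above; MRR, account count, and credit rate are the assumptions stated in the bullets, and the last function computes raw window exposure (not the sampled-detection count).

```python
def sla_credit_usd(downtime_min: float, month_min=43_200, threshold=0.999,
                   credit_rate=0.10, mrr=50_000, accounts=20):
    availability = (month_min - downtime_min) / month_min
    return credit_rate * mrr * accounts if availability < threshold else 0.0

print(sla_credit_usd(15))  # 0.0     -> single 15-min outage stays above 99.9%
print(sla_credit_usd(44))  # 100000  -> threshold function: second incident pays

def affected_completions(window_s: float, qps=1000, hard_share=0.20):
    return window_s * qps * hard_share

print(affected_completions(4 * 3600))  # 2,880,000 over a 4-hour blind window
print(affected_completions(300))       # 60,000 exposed in a 5-minute window
```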
Quick check
A router regression sends 20% of hard queries to the small model for 30 minutes at 1,000 QPS. Compared to catching it in 2 minutes, how much larger is the affected-completion count?
Step 8 — Company Lens (same doc, different pushes)
Google's push
Expect the interviewer to drill on the scheduler and the multi-tenant sharing story. “What's the queueing policy? How do you handle noisy neighbors? What's the Borg/Kubernetes integration?” Google's L6+ design bar is heavy on scheduling theory and large-scale systems primitives — lean in, and treat the queue as a real component with its own design section.
Anthropic's push
Expect the interviewer to drill on safety as a first-class SLO wired into the serving design — not a post-hoc filter. The specific Anthropic framing is Constitutional AI and the Responsible Scaling Policy (RSP): safety levels (ASL-2, ASL-3) gate which model versions can be deployed and under what serving constraints. The canonical drill question is: “How does your serving design surface a safety regression before deployment?” The expected answer names eval-as-policy gating: a safety eval suite (adversarial prompts, jailbreak probes, false-refusal benchmarks) is a required CI gate before any model weight update reaches the serving fleet. If the safety eval fails, the rollout blocks regardless of latency or throughput improvements. Expect follow-up on how you separate false-refusal rate (too strict) from true-violation rate (too permissive), and how you handle the asymmetric cost: one missed true violation can end the enterprise contract; a 1% false-refusal rate erodes user trust more slowly but is measured daily. Be ready to name your specific safety metrics, their thresholds, and who owns the threshold calibration decision.
OpenAI's push
Expect drill-down on reliability and eval integration at scale. “How do you measure refusal quality across model updates? How does RLHF or fine-tuning interact with your serving pipeline's rollout strategy? What's your rollback plan if a model update regresses refusal behavior at the p99 tail?” OpenAI's design bar emphasizes staged rollouts (canary → shadow → production), with a quality eval gate at each stage. The key distinction from Google: OpenAI focuses more on model-update cadence and the operational complexity of frequent weight swaps than on multi-tenant scheduling primitives. Be ready to describe how you version KV-cache prefixes across model versions (they are not compatible across weight updates) and what your cache-warming strategy is for a new model version before it takes full traffic.
Meta / Databricks's push
Expect pressure on cost per query, open-source serving stack choices, and how you'd run this on your own hardware rather than behind someone else's API. Expect to justify vLLM vs TRT-LLM vs SGLang by the workload profile you described, not by brand loyalty.
Key Takeaways
What to remember for interviews
1. Write the customer-facing paragraph before any engineering word. If you can't, you don't understand the problem yet.
2. Write the eval harness before the architecture. Design flows from measurement, not the other way around.
3. Capacity math before boxes. Numbers tell you how much architecture the problem deserves.
4. Two deep dives — the cost lever and the latency lever. Everything else is 'I'd follow the same structure.'
5. Break-it analysis maps every failure to a metric. If you can't measure the break, the mitigation is vibes.
6. A bad day has a dollar number. Reliability is a price tag, not a percentage.
Interview Questions
- You're given 30 minutes to design `/v1/complete` for a paying enterprise tier. What are the first four numbers you write on the whiteboard, and in what order? (★★☆)
- An interviewer says 'forget the SLO for a second, just tell me the architecture.' What's the correct response? (★★★)
- Your capacity math says you need 200 GPUs. Your budget is 60. What do you cut first? (★★★)
- You ship the design doc. Two weeks in, p95 TTFT regresses from 420 ms to 900 ms. Your doc said 'continuous batching with vLLM.' Where do you look first? (★★★)
- An interviewer asks you to justify a decision your design doc made. You realize you can't. What do you do? (★★☆)
Recap quiz
Design-review methodology recap
In the design-review methodology, what is the correct order of the first four steps, and why does eval come before architecture?
A p95 TTFT SLO of 800 ms forces a different architecture than a 5-second batch-completion SLO. What is the specific structural difference?
The design doc skips Step 2 (eval harness) and jumps to architecture. What is the most likely downstream consequence, and which failure mode exemplifies it?
A model router classifier runs synchronously on the hot path. At what P99 latency threshold should you move it off-path to async pre-classification at queue insertion?
Under PagedAttention, a single enterprise tenant bursts to 10× their baseline QPS. What is the failure cascade, and what metric fires the yellow alert?
A 15-minute regional outage occurs. The enterprise SLA guarantees 99.9% monthly availability with a 10% service credit on breach. Does the credit trigger?
A silent quality regression runs for 4 hours at 1,000 QPS before eval catches it. How does shrinking the detection window from 4 hours to 5 minutes change the number of affected enterprise accounts?
Further Reading
- Jeff Dean — Building Software Systems at Google and Lessons Learned — The original 'back-of-envelope numbers every engineer should know' talk. Pair with Dean's 2009 latency numbers — the practice of thinking in numbers before boxes.
- Amazon Working Backwards — PR/FAQ + 6-Pager — Not an ML piece, but the discipline of writing the customer-facing press release before the design doc is the methodological backbone of this module.
- Chip Huyen — Designing Machine Learning Systems (O'Reilly 2022) — The canonical ML system-design textbook. Chapter 1 on business objectives is the framework chapter candidates keep ignoring at their own cost.
- Shreya Shankar — Operationalizing ML — The thesis-length argument that the gap between ML design and ML-in-production is owned by the eval harness, not the model.
- Hamel Husain — Your AI Product Needs Evals — The practitioner post that converted a generation of AI engineers to eval-first design. Required reading before writing any LLM design doc.