📐 The Design Doc
A system with p99 = 500 ms needs roughly 3× the GPU capacity of one with p99 = 1 s. The right SLO choice is the entire design — yet most engineers write the SLO last.
Most ML system-design content teaches a framework. Senior interviews test whether you can write the design doc live.
This module is a worked example: POST /v1/complete for a paying enterprise tier, from blank page to signed-off doc. Copy the order of thinking, then compare it with the Cost & Eval module and the SLO vs. Cost comparison.
Step 1 — Requirements & SLOs
The Working-Backwards prompt
Before touching architecture, write what the customer sees. One paragraph, no engineering words. If you can't write it, you don't understand the problem yet.
The SLO table (always a table)
| Metric | Target | Why this value |
|---|---|---|
| p50 TTFT | — | Below human reading start time for typical prompts |
| p95 TTFT | 800 ms | Tail users should still feel the product as “instant” |
| p95 full completion (256 tok) | 3 s | Streaming hides most of this; 3 s is the hard ceiling |
| Availability | 99.9% / month | Matches enterprise SaaS norm |
| Cost ceiling | $3 / 1M output tok | Derived from retail price × (1 − 35% target gross margin) |
Quick check
The SLO table sets availability at 99.9%/month. How many minutes of downtime does this budget allow per calendar month?
Step 2 — Eval Harness (design first)
This is the section every candidate skips and every senior reviewer circles. You cannot design a system you cannot measure. Write the eval before the architecture — it will change how you design.
What we're measuring
- Correctness — does the completion satisfy the customer intent? Approximated by LLM-judge on a golden set of ~500 prompt/ideal-output pairs, with an initial human-reviewed calibration round on a 50-example subsample to anchor the judge.
- Latency — p50/p95 TTFT and per-token latency, by prompt-length bucket and by tier. One aggregate number hides the regression you actually care about.
- Reliability — request-success rate excluding client errors. Broken out by failure mode (timeout, OOM, safety refusal).
- Safety — refusal rate on an adversarial set and false-refusal rate on a benign-but-sensitive set. Asymmetric costs: a false refusal annoys; a true violation ends the contract.
- Unit economics — cost per successful completion, segmented by tenant. This becomes the gate for the cost SLO above.
Sizing the golden set
For a pass/fail gate, the 95% confidence-interval half-width is w ≈ 1.96 × √(p̂(1−p̂) / n), where p̂ is the expected pass rate and n is the sample size. At p̂ = 0.80 and n = 200, w ≈ 1.96 × √(0.16 / 200) ≈ 0.055, so the CI is roughly ±5.5 pp — adequate for a top-level gate. At p̂ = 0.90 the same n gives ±4.2 pp; at p̂ = 0.95 it narrows to ±3.0 pp, because binomial variance peaks at p̂ = 0.50 and shrinks toward the extremes.
Note: for multi-cohort drill-downs (per tier, per prompt-length bucket), do not multiply a single pool size by the number of cells. Each cell has its own base rate and therefore its own required n. A cell where the easy-prompt tier passes at 95% needs far fewer examples to pin a ±3 pp CI than a cell where the adversarial tier passes at 60%. Size each cell independently, then sum — the aggregate is usually 2–4× higher than the naive “200 × cells” estimate would predict.
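A minimal sketch of the sizing math in Python; the cohort names and pass rates in the example are hypothetical.

```python
import math

def ci_halfwidth(p_hat: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation half-width of a 95% CI on a pass rate."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

def n_for_halfwidth(p_hat: float, w: float, z: float = 1.96) -> int:
    """Smallest n whose CI half-width is <= w at expected pass rate p_hat."""
    return math.ceil((z / w) ** 2 * p_hat * (1 - p_hat))

# Reproduce the numbers above.
print(ci_halfwidth(0.80, 200))  # ~0.055 -> +/-5.5 pp
print(ci_halfwidth(0.90, 200))  # ~0.042 -> +/-4.2 pp
print(ci_halfwidth(0.95, 200))  # ~0.030 -> +/-3.0 pp

# Size each cell independently for a +/-3 pp gate, then sum.
cells = {"easy-tier": 0.95, "chat": 0.85, "adversarial": 0.60}
print({c: n_for_halfwidth(p, 0.03) for c, p in cells.items()})
# ~{easy-tier: 203, chat: 545, adversarial: 1025}: ~3x the naive 200 x 3
```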
The offline/online bridge
An offline eval win that doesn't show up online is a signal you're measuring the wrong thing. Spec the bridge up front: which online metric does each offline metric predict, and what's the expected effect size?
Step 3 — Back-of-Envelope
Now — and only now — the math. Seed the calculator with the SLO you just wrote. The default below is deliberately under-provisioned for the 1,000-QPS target at 70B; move the sliders and find the inflection where the GPU-count curve bends.
| Quantity | Value |
|---|---|
| Model weights (FP16) | 140 GB |
| KV cache / request | 257.7 MB (768 tokens) |
| Tokens/sec per GPU | 300 |
| Effective QPS (after cache) | 700 |
| GPUs needed | 600 |
| GPU memory usage | 60% |
| Compute utilization | 100% |
| Est. p95 latency | 3.56 s |
| Bottleneck | Compute/bandwidth |
| Monthly cost | $876,000 |
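A sketch that reproduces the calculator's outputs. The model config (a Llama-70B-class layout: 80 layers, 8 grouped-query KV heads, head dim 128) and the blended $2/GPU-hour rate are assumptions inferred to match the table, not values stated by it.

```python
import math

def kv_mb_per_request(tokens=768, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V tensors, per token, per layer, FP16
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens / 1e6

def gpus_needed(qps=1000, cache_hit=0.30, out_tokens=256, tok_s_per_gpu=300):
    effective_qps = qps * (1 - cache_hit)          # 700 after prefix-cache hits
    return math.ceil(effective_qps * out_tokens / tok_s_per_gpu)

def monthly_cost_usd(gpus, usd_per_gpu_hour=2.0, hours_per_month=730):
    return gpus * usd_per_gpu_hour * hours_per_month

print(kv_mb_per_request())    # ~251.7 MB, close to the table's 257.7 MB
print(gpus_needed())          # 598, rounded up to 600 in the table
print(monthly_cost_usd(600))  # 876000.0
```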
Step 4 — Architecture
What you’re seeing: a generic LLM serving stack with eight components — load balancer, router, prefix cache, decode workers, KV store, autoscaler, monitoring, and object store — arranged by data-flow order from request ingress to token streaming. What to try: identify the prefix cache node and trace which components it short-circuits — that path is where 30–60% of real-world inference cost is saved.
Eight components, one reason for each to exist. In a real doc, every box has a paragraph justifying its inclusion against the SLO table.
Production LLM Serving Architecture
Justification table (abbreviated)
- API Gateway — auth + per-tenant rate limit + token-budget enforcement. Needed because the cost SLO is per-tenant, not global.
- Request Queue — decouples gateway from GPU pool, enables fair scheduling across tenants. Needed because without it, one hot tenant starves the rest.
- Model Router — sends cheap/short prompts to a small model, expensive prompts to the flagship. Needed because the cost SLO can't be met with a single model.
- Model Server with PagedAttention — continuous batching, prefix cache, KV-cache paging. Needed because the throughput SLO can't be met with naive batching.
- Streaming response — SSE back to the client. Needed because the TTFT SLO would otherwise be dominated by full completion latency.
Step 5 — Deep dives on the two risky components
Deep-dive the two components with the highest downside if they're wrong. Here, that's the router and the serving engine.
Deep dive A — Model Router (the cost lever)
Approach
The router's job is to sort every incoming prompt into one of two buckets — “small model sufficient” or “flagship required” — before the request touches a GPU. The standard mechanism is a lightweight embedding-based binary classifier: encode the prompt with a frozen sentence encoder (e.g., a 100M-parameter text-embedding-3-small class model), then pass the embedding through a shallow MLP or logistic head trained on ~5K labeled (prompt, quality-delta) pairs. Quality delta is defined as the LLM-judge score on the flagship's response minus the LLM-judge score on the small model's response — if that gap exceeds a threshold (e.g., 0.15 on a 0–1 scale), the example is labeled “hard.” The RouteLLM paper (Ong et al. 2024) benchmarks several router architectures — SW-ranking, BERT-based, and matrix-factorization — and finds that even a simple causal LLM router trained on preference data achieves substantial cost savings relative to always-flagship on MMLU-style benchmarks (the 40–85% range cited in the real-world example below).
The classifier latency budget is tight: router inference must complete within ~10 ms (P99) so it doesn't add to the TTFT SLO. This is achievable with a sub-100M encoder and shallow head on CPU, or a batched GPU call amortized across the request queue. A confidence gate is required: if P(hard) falls in [0.40, 0.60], default to flagship rather than guess. Start with a wider 0.35/0.65 band and tighten based on shadow-eval data.
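A minimal sketch of that decision under the assumptions above; `embed` is a hypothetical stand-in for the frozen sentence encoder, and the head weights would come from training on the labeled quality-delta pairs.

```python
import numpy as np

def embed(prompt: str) -> np.ndarray:
    """Hypothetical stand-in for a frozen ~100M-param sentence encoder."""
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.normal(size=384)

def p_hard(prompt: str, w: np.ndarray, b: float) -> float:
    """Logistic head over the frozen embedding: P(flagship required)."""
    return float(1.0 / (1.0 + np.exp(-(embed(prompt) @ w + b))))

def route(prompt: str, w: np.ndarray, b: float, lo=0.35, hi=0.65) -> str:
    p = p_hard(prompt, w, b)
    if p >= hi:
        return "flagship"
    if p <= lo:
        return "small"
    return "flagship"  # confidence gate: inside the uncertainty band, don't guess

w, b = np.zeros(384), 0.0  # placeholder weights; train on (prompt, quality-delta) pairs
print(route("Summarize this contract clause in one sentence.", w, b))
```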
Trade-off
The non-obvious trade-off is not accuracy vs. cost — it's routing latency vs. batch coherence. A synchronous on-path classifier adds latency before batching begins. Moving it off the hot path (async pre-classification at queue insertion) forces separate “easy” and “hard” sub-queues drained by the GPU scheduler independently. That hides routing latency in queue time and keeps batch purity high, but doubles scheduler complexity: two KV-cache pools, two admission-control loops, and two separate preemption surfaces. The rule is: run the classifier synchronously only if its P99 stays under 15 ms; otherwise pre-classify at queue insertion and accept the dual-pool cost.
Failure mode
Classifier distribution drift. When the product ships a new use case (e.g., code generation added to a previously chat-only endpoint), the new prompt embeddings fall outside the training manifold, and the classifier systematically mis-routes them — typically toward “easy” because novel prompts don't match the hard-pattern vocabulary. The result is a silent quality regression: the cost dashboard looks fine while the LLM-judge score for the new cohort tanks. The cost-SLO alert never fires; you only see it if you segment the quality dashboard by prompt type.
Detection metric
Continuously shadow-sample 5% of small-model-routed requests, replay through the flagship, and compute the daily quality-delta P75. Alert when P75 exceeds 0.12 (flagship consistently outscoring small model above the training threshold). Secondary signal: if more than 25% of requests land in the [0.40, 0.60] uncertainty band, the classifier is encountering out-of-distribution inputs and a retrain is overdue — don't wait for the quality-delta alarm.
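A sketch of the two drift signals as a daily batch job, assuming the shadow-sample quality deltas and per-request router confidences are already collected; the thresholds are the ones named above.

```python
import numpy as np

def router_drift_signals(quality_deltas, p_hards, lo=0.40, hi=0.60):
    """quality_deltas: judge(flagship) - judge(small) on the 5% shadow sample.
    p_hards: router confidence for every routed request that day."""
    p_hards = np.asarray(p_hards)
    p75 = float(np.percentile(quality_deltas, 75))
    band_share = float(np.mean((p_hards > lo) & (p_hards < hi)))
    return {
        "quality_delta_p75": p75,
        "alert_quality": p75 > 0.12,     # flagship consistently outscoring small
        "band_share": band_share,
        "alert_ood": band_share > 0.25,  # out-of-distribution: retrain overdue
    }
```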
Mitigation
Monthly retrain cycle: pull the last 30 days of shadow-labeled traffic, merge with the original seed set, retrain only the classifier head (keep encoder frozen). Annotate newly detected prompt clusters manually — 20–50 labels per cluster — before adding to training; this is where the labeling budget goes, not on re-annotating already-covered traffic. Secondary effect: each retrain shifts the decision boundary, temporarily increasing flagship usage for 24–48 hours post-deploy. Schedule router deploys outside peak traffic windows.
Real-world example
RouteLLM (Ong et al., arXiv 2406.18665) reports 40–85% API cost reduction depending on quality-threshold setting, with GPT-4-class quality maintained at the 50th-percentile cost point across MMLU, MT-Bench, and MATH benchmarks. The key empirical finding relevant to our design: preference-data-trained routers generalize better than embedding classifiers when the easy/hard distinction is task-dependent rather than length-dependent — directly applicable here since short enterprise prompts can still require flagship reasoning (e.g., one-line math problems or terse policy queries).
Deep dive B — Model Server (the latency lever)
Approach
The serving engine is vLLM with PagedAttention (Kwon et al. 2023). The central insight: KV-cache memory fragmentation is structurally identical to the OS virtual-memory paging problem. The KV cache is divided into fixed-size blocks; a block table maps each sequence's logical KV positions to physical GPU memory blocks allocated on demand and freed on sequence completion. Near-zero internal fragmentation results. The vLLM team reports up to 24× the throughput of naive HuggingFace Transformers serving and 1.7× over Orca (the prior continuous-batching baseline) on 13B/175B models on A100s; gains are largest when output lengths vary widely, because that's exactly when pre-allocation wastes the most memory.
Two additional features compound the gain: (1) continuous batching — new requests join the batch at any iteration step, keeping GPU utilization near 100% under sustained load; (2) prefix caching — system prompts shared across tenant requests are stored as reusable KV blocks. At the 30% cache-hit-rate design assumption, prefix caching eliminates ~30% of prefill FLOPS on the cached portion, reducing p50 TTFT by roughly 15–20% for the enterprise tier where system prompts are long and repeated.
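A toy sketch of the block-table idea (not vLLM's actual implementation): blocks are allocated only when a sequence writes into a new one and are returned to the pool on completion, so internal fragmentation is bounded by one partial block per sequence.

```python
class KvBlockPool:
    def __init__(self, num_blocks: int, block_tokens: int = 16):
        self.free = list(range(num_blocks))    # physical block ids
        self.block_tokens = block_tokens
        self.table: dict[str, list[int]] = {}  # seq_id -> physical blocks
        self.tokens: dict[str, int] = {}       # seq_id -> tokens written

    def append_token(self, seq_id: str) -> bool:
        n = self.tokens.get(seq_id, 0)
        if n % self.block_tokens == 0:         # current block full: allocate another
            if not self.free:
                return False                   # pool exhausted -> preemption path
            self.table.setdefault(seq_id, []).append(self.free.pop())
        self.tokens[seq_id] = n + 1
        return True

    def release(self, seq_id: str) -> None:    # sequence done: blocks reusable at once
        self.free.extend(self.table.pop(seq_id, []))
        self.tokens.pop(seq_id, None)
```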
Trade-off
The non-obvious trade-off with continuous batching is prefill-decode interference. When a long-context request enters the batch mid-stream, its prefill phase (compute-intensive, many tokens at once) competes with the decode phases of already-running requests (memory-bandwidth-bound, one token per request per step). On an 8-way tensor-parallel A100 group, a 4K-token prefill occupies the compute units for roughly 170–230 ms (back-of-envelope: 4096 tokens × 70B parameters × 2 FLOPs/parameter ≈ 5.7 × 10¹⁴ FLOPs ÷ (8 × 312 TFLOP/s) ≈ 230 ms at full utilization), during which all decode slots stall. The p95 TTFT for every short request queued behind it spikes by hundreds of milliseconds — toward the 1–2 s regression quantified in Step 6. Chunked prefill — splitting the 4K prefill into 512-token chunks interleaved with decode steps — eliminates the stall at the cost of a modestly longer prefill completion time, since each chunk waits its turn between decode steps. Full prefill/decode disaggregation onto separate GPU pools becomes worth the cross-pool KV-transfer overhead when prefill share crosses approximately 30% of total GPU-seconds (per Sarathi-Serve, Agrawal et al. 2023).
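The stall arithmetic as a function, under the same assumptions (8-way tensor parallelism, dense BF16 peak, perfect utilization):

```python
def prefill_ms(prompt_tokens: int, params=70e9, tp_gpus=8, tflops_per_gpu=312, mfu=1.0):
    """Compute-bound prefill time: 2 FLOPs per parameter per token."""
    flops = prompt_tokens * params * 2
    return flops / (tp_gpus * tflops_per_gpu * 1e12 * mfu) * 1e3

print(prefill_ms(4096))  # ~230 ms: the full-prefill stall
print(prefill_ms(512))   # ~29 ms per chunk under chunked prefill
```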
Failure mode
KV-cache exhaustion under bursty load. Under PagedAttention, KV blocks are allocated lazily per sequence. A sudden 10× burst from one enterprise tenant allocates blocks faster than completing sequences free them. When the block table fills, vLLM preempts the lowest-priority sequences — introducing recomputation latency of hundreds of milliseconds per preempted sequence — and cascading TTFT regressions follow for every other tenant. The failure is silent until it cascades: a single burst tenant can consume 40% of the block table before the first preemption fires, by which time every tenant's p95 TTFT has already regressed past SLO.
Detection metric
Export vllm:gpu_cache_usage_perc as a Prometheus gauge sampled every 5 seconds. Set a yellow alert at 75% (throttle burst-tenant admission) and a red alert at 90% (hard-stop all new admissions, return HTTP 429 with Retry-After). Secondary signal: per-tenant block-table share — if any single tenant holds more than 40% of allocated blocks, the fair-share scheduler is not enforcing its weight. vLLM exposes per-sequence block counts via its engine stats endpoint; wire it into the tenant-level dashboard alongside queue-depth and TTFT p95.
Mitigation
Two-layer defense: (1) Admission control at the queue layer — when cache usage crosses 75%, stop admitting new sequences from tenants whose 5-minute request rate exceeds 3× their baseline, while continuing to admit normal-rate tenants. This limits blast radius to the bursting tenant without global degradation. (2) Preemption policy hardening — configure recompute (not swap-to-CPU) for sequences under 256 generated tokens; recompute is faster than a CPU memory round-trip for short sequences. Secondary effect to watch: aggressive admission control raises queue-wait latency for burst tenants, which triggers their client-side timeouts and produces a retry wave that worsens the burst. Set the 429 Retry-After header to 2× the current p50 queue-drain time, not a fixed value.
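A sketch of the two-layer admission decision; the thresholds come from the detection section above, and the Retry-After rule follows the mitigation text.

```python
def admission_decision(cache_usage: float, tenant_rate: float,
                       tenant_baseline: float, p50_drain_s: float) -> dict:
    reject = {"admit": False, "status": 429, "retry_after_s": 2 * p50_drain_s}
    if cache_usage >= 0.90:                    # red: hard-stop all new admissions
        return reject
    if cache_usage >= 0.75 and tenant_rate > 3 * tenant_baseline:
        return reject                          # yellow: throttle burst tenants only
    return {"admit": True}

print(admission_decision(0.80, tenant_rate=45.0, tenant_baseline=10.0, p50_drain_s=1.5))
# {'admit': False, 'status': 429, 'retry_after_s': 3.0}
```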
Real-world example
The vLLM paper (Kwon et al., arXiv 2309.06180) ablates the preemption path directly: under the ShareGPT workload (highly variable output lengths), recomputation outperforms swap-to-CPU for the short sequences that dominate, keeping preemption overhead small enough to preserve the 2–4× throughput gains. The paper also quantifies prefix-sharing: for multi-turn conversations, KV-cache memory usage drops by up to 55% — directly validating the prefix-cache decision for our enterprise tier where system prompts are long and reused across many requests per session.
You've drafted the requirements, the eval, the capacity math, and the architecture. The interviewer asks for ‘one more deep dive.’ Which component do you pick?
Step 6 — Break It
Three failure modes, with the detection and mitigation already spec'd. Every failure should map back to a metric in the eval harness.
Remove the request queue
A single hot tenant saturates the GPU pool, starving everyone else. Without queue-level fair-share scheduling, per-tenant rate limits at the gateway only control request count, not GPU-seconds — a cheap tenant sending 10 RPS of 2048-token generations starves a paying tenant sending 100 RPS of 128-token generations. Detection: tenant-level latency skew in the eval harness. Mitigation: fair-share scheduler with weighted GPU-second budgets per tier (sketched after this list).
Remove the model router
Send 100% of traffic to the flagship model. Cost per 1M output tokens roughly triples, breaching the $3 ceiling. Gross margin turns negative on the free tier within a month. Detection: unit-economics dashboard crosses cost SLO. Mitigation: restore routing — but also reconsider whether the free tier is a viable product shape.
Remove chunked prefill
Requests process prefill atomically, no interleaving with decode. A single 4K-token prompt blocks all decode slots for ~200 ms. p95 TTFT for everyone queued behind it regresses to 1–2 seconds, depending on queue depth. Detection: TTFT p95 alarm on long-prompt cohort. Mitigation: chunked prefill, or in severe cases prefill/decode disaggregation onto separate GPU pools.
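The fair-share scheduler from the first failure mode, as a minimal sketch: serve the non-empty tenant queue with the lowest GPU-seconds consumed relative to its tier weight. Tenant names and weights here are hypothetical.

```python
def pick_next_tenant(queues: dict, used_gpu_s: dict, weights: dict):
    eligible = [t for t, q in queues.items() if q]  # tenants with waiting work
    if not eligible:
        return None
    # Lowest weighted usage goes first, so GPU-seconds (not request count) is fair.
    return min(eligible, key=lambda t: used_gpu_s[t] / weights[t])

queues = {"enterprise": ["req-a"], "free": ["req-b", "req-c"]}
used_gpu_s = {"enterprise": 120.0, "free": 90.0}
weights = {"enterprise": 4.0, "free": 1.0}
print(pick_next_tenant(queues, used_gpu_s, weights))  # enterprise: 30 < 90 weighted
```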
Quick check
The model router is removed and all traffic routes to the flagship model. What is the first metric to breach SLO, and what is the approximate cost multiplier?
Step 7 — What does a bad day cost?
Reliability is a dollar number, not a percentage. The design doc is signed off when these three numbers are written down and agreed.
- 30-minute partial outage (routing broken, all traffic to flagship) — cost overrun roughly equals the delta between flagship and small-model cost × 30-minute traffic volume. The number itself is small; the interesting question is how long it takes to detect. Detectable within 2 minutes by cost-SLO alert — if you wait a full day, the same delta is closer to $130K.
- Full regional outage (15 minutes) — SLA credit math: assume a $50K MRR enterprise account on a standard tiered SLA where <99.9% monthly availability triggers a 10% service credit. Monthly availability = (43,200 − 15) / 43,200 ≈ 99.965%, which stays above the 99.9% threshold — so a single 15-minute outage does not trigger the credit tier at all for a monthly-measured SLA. To breach the 99.9% tier you need >43 minutes of downtime in the calendar month. If this is the second incident that month and total downtime crosses 44 minutes, the credit fires: 10% × $50K MRR = $5,000 per account. With 20 affected enterprise accounts, that is $100K in credits — plus the engineering opportunity cost of the response, which at a 5-person team × 4 hours × a blended eng cost of ~$150/hr is another $3K. The key insight: a single short outage is cheap; the credit trigger is a threshold function, not a linear percentage, so a second incident in the same month is 10–100× more expensive than the first.
- Silent quality regression (4 hours before eval catches it) — worked churn math: assume 20% of hard queries silently route to the small model for 4 hours (14,400 s) at 1,000 QPS = 2.88M affected completions. Assign a 2% probability that any affected enterprise user files a complaint, and a 5% churn probability per complaint on a $50K ACV account. Expected churn cost: 0.02 × 0.05 × $50K × (number of affected accounts). For 10 affected accounts: 0.02 × 0.05 × $50K × 10 = $500 in expected ACV — seemingly small, but detection-window sensitivity changes everything. If the window is 4 hours (no shadow eval), the regression affects 10 accounts before correction. If the window is 5 minutes (continuous shadow eval at 5% sample rate), the regression is caught after ~17K affected completions and reaches <1 account before mitigation — a 10× reduction in exposure.
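The credit threshold and the exposure window as code, reproducing the figures above; MRR, account count, and credit rate are the assumptions stated in the bullets, and the last function computes raw window exposure (not the sampled-detection count).

```python
def sla_credit_usd(downtime_min: float, month_min=43_200, threshold=0.999,
                   credit_rate=0.10, mrr=50_000, accounts=20):
    availability = (month_min - downtime_min) / month_min
    return credit_rate * mrr * accounts if availability < threshold else 0.0

print(sla_credit_usd(15))  # 0.0     -> single 15-min outage stays above 99.9%
print(sla_credit_usd(44))  # 100000  -> threshold function: second incident pays

def affected_completions(window_s: float, qps=1000, hard_share=0.20):
    return window_s * qps * hard_share

print(affected_completions(4 * 3600))  # 2,880,000 over a 4-hour blind window
print(affected_completions(300))       # 60,000 exposed in a 5-minute window
```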
Quick check
A router regression sends 20% of hard queries to the small model for 30 minutes at 1,000 QPS. Compared to catching it in 2 minutes, how much larger is the affected-completion count?
Step 8 — Company Lens (same doc, different pushes)
Google's push
Expect the interviewer to drill on the scheduler and the multi-tenant sharing story. “What's the queueing policy? How do you handle noisy neighbors? What's the Borg/Kubernetes integration?” Google's L6+ design bar is heavy on scheduling theory and large-scale systems primitives — lean in, and treat the queue as a real component with its own design section.
Anthropic's push
Expect the interviewer to drill on safety as a first-class SLO wired into the serving design — not a post-hoc filter. The specific Anthropic framing is Constitutional AI and the Responsible Scaling Policy (RSP): safety levels (ASL-2, ASL-3) gate which model versions can be deployed and under what serving constraints. The canonical drill question is: “How does your serving design surface a safety regression before deployment?” The expected answer names eval-as-policy gating: a safety eval suite (adversarial prompts, jailbreak probes, false-refusal benchmarks) is a required CI gate before any model weight update reaches the serving fleet. If the safety eval fails, the rollout blocks regardless of latency or throughput improvements. Expect follow-up on how you separate false-refusal rate (too strict) from true-violation rate (too permissive), and how you handle the asymmetric cost: one missed true violation can end the enterprise contract; a 1% false-refusal rate erodes user trust more slowly but is measured daily. Be ready to name your specific safety metrics, their thresholds, and who owns the threshold calibration decision.
OpenAI's push
Expect drill-down on reliability and eval integration at scale. “How do you measure refusal quality across model updates? How does RLHF or fine-tuning interact with your serving pipeline's rollout strategy? What's your rollback plan if a model update regresses refusal behavior at the p99 tail?” OpenAI's design bar emphasizes staged rollouts (canary → shadow → production), with a quality eval gate at each stage. The key distinction from Google: OpenAI focuses more on model-update cadence and the operational complexity of frequent weight swaps than on multi-tenant scheduling primitives. Be ready to describe how you version KV-cache prefixes across model versions (they are not compatible across weight updates) and what your cache-warming strategy is for a new model version before it takes full traffic.
Meta / Databricks's push
Expect pressure on cost per query, open-source serving stack choices, and how you'd run this on your own hardware rather than behind someone else's API. Expect to justify vLLM vs TRT-LLM vs SGLang by the workload profile you described, not by brand loyalty.
Key Takeaways
What to remember for interviews
1. Write the customer-facing paragraph before any engineering word. If you can't, you don't understand the problem yet.
2. Write the eval harness before the architecture. Design flows from measurement, not the other way around.
3. Capacity math before boxes. Numbers tell you how much architecture the problem deserves.
4. Two deep dives — the cost lever and the latency lever. Everything else is 'I'd follow the same structure.'
5. Break-it analysis maps every failure to a metric. If you can't measure the break, the mitigation is vibes.
6. A bad day has a dollar number. Reliability is a price tag, not a percentage.
Interview Questions
- You're given 30 minutes to design `/v1/complete` for a paying enterprise tier. What are the first four numbers you write on the whiteboard, and in what order? (★★☆)
- An interviewer says 'forget the SLO for a second, just tell me the architecture.' What's the correct response? (★★★)
- Your capacity math says you need 200 GPUs. Your budget is 60. What do you cut first? (★★★)
- You ship the design doc. Two weeks in, p95 TTFT regresses from 420 ms to 900 ms. Your doc said 'continuous batching with vLLM.' Where do you look first? (★★★)
- An interviewer asks you to justify a decision your design doc made. You realize you can't. What do you do? (★★☆)
Recap quiz
Design-review methodology recap
In the design-review methodology, what is the correct order of the first four steps, and why does eval come before architecture?
A p95 TTFT SLO of 800 ms forces a different architecture than a 5-second batch-completion SLO. What is the specific structural difference?
The design doc skips Step 2 (eval harness) and jumps to architecture. What is the most likely downstream consequence, and which failure mode exemplifies it?
A model router classifier runs synchronously on the hot path. At what P99 latency threshold should you move it off-path to async pre-classification at queue insertion?
Under PagedAttention, a single enterprise tenant bursts to 10× their baseline QPS. What is the failure cascade, and what metric fires the yellow alert?
A 15-minute regional outage occurs. The enterprise SLA guarantees 99.9% monthly availability with a 10% service credit on breach. Does the credit trigger?
A silent quality regression runs for 4 hours at 1,000 QPS before eval catches it. How does shrinking the detection window from 4 hours to 5 minutes change the number of affected enterprise accounts?
Further Reading
- Jeff Dean — Building Software Systems at Google and Lessons Learned — The original 'back-of-envelope numbers every engineer should know' talk. Pair with Dean's 2009 latency numbers — the practice of thinking in numbers before boxes.
- Amazon Working Backwards — PR/FAQ + 6-Pager — Not an ML piece, but the discipline of writing the customer-facing press release before the design doc is the methodological backbone of this module.
- Chip Huyen — Designing Machine Learning Systems (O'Reilly 2022) — The canonical ML system-design textbook. Chapter 1 on business objectives is the framework chapter candidates keep ignoring at their own cost.
- Shreya Shankar — Operationalizing ML — The thesis-length argument that the gap between ML design and ML-in-production is owned by the eval harness, not the model.
- Hamel Husain — Your AI Product Needs Evals — The practitioner post that converted a generation of AI engineers to eval-first design. Required reading before writing any LLM design doc.