💬 Case: Design ChatGPT
Two billion messages a day — where does the money actually go?
ChatGPT is a multi-tier consumer AI system: free must be cheap, Plus must feel faster, and Enterprise must be isolatable. The design problem is serving all three on one fleet without losing cost or quality control. The reusable lessons are storage tiering, routing, and evals that catch quality regressions as reliably as latency regressions.
Requirements & SLOs
Working backwards from the user
SLO table — four tiers, separate targets
| Metric | Target | Why this value |
|---|---|---|
| p50 TTFT — all tiers | Sub-300 ms feels instantaneous; above 500 ms users describe the product as “slow” in UX research (qualitative, per general HCI literature on interaction latency) | |
| p95 TTFT — Plus & Enterprise | Premium tier tail users should feel priority treatment; a paying customer waiting >1 s for first token is a churn signal | |
| p95 TTFT — Free tier | Free users tolerate more latency; separate target prevents the large free-tier volume from masking Plus regressions | |
| Median streaming token rate | Comfortably exceeds , or roughly 5-6 tokens/s in plain English. Below 30 tok/s the stream feels like the model is “thinking” rather than delivering | |
| Availability | 99.9% / month | ; Enterprise contracts carry 99.95% with SLA credits (per Google SRE Book error-budget framework) |
| Free tier message cap | (community estimate) | Enough for a real work session; tight enough to make Plus attractive; exact value is P&L-adjusted continuously by OpenAI |
| Cost ceiling — free message | On the order of fractions of a cent (community estimate) | Must stay below the advertising-equivalent CAC amortization; exact number is internal; model routing is the primary lever |
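To make the availability row concrete, here is the error-budget arithmetic as a minimal Python sketch, assuming a 30-day month; the ~43-minute figure is the one used in the quick check below.

```python
# Error budget implied by a monthly availability SLO, assuming a 30-day month.
def monthly_error_budget_minutes(slo: float, days: int = 30) -> float:
    return (1 - slo) * days * 24 * 60

print(monthly_error_budget_minutes(0.999))   # 43.2 min  (99.9%, consumer tiers)
print(monthly_error_budget_minutes(0.9995))  # 21.6 min  (99.95%, Enterprise SLA)
```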
Quick check
ChatGPT targets 99.9% monthly availability (~43 min error budget). A single incident consumes 30 min. What fraction of the monthly budget remains, and what does that imply for the rest of the month?
Eval Harness — design before architecture
You cannot improve what you cannot measure. Hamel Husain's practitioner framework (hamel.dev/blog/posts/evals) defines three levels of evaluation — unit tests, human/model eval, and A/B tests — and the ChatGPT eval harness below maps directly to those levels. Write this before touching architecture — the metrics here determine which components matter most.
Level 1 — Unit tests (every deploy)
- Routing correctness — assert that a sample of Free requests lands on the mini model and Plus requests land on the flagship. A bug in this layer is a cost incident, not just a quality incident.
- Prefix-cache hit rate — assert that a canonical system prompt gets a cache hit on the second request in the same session. A serialization format change that breaks this is detectable at unit-test time, not just in production.
- Safety gate smoke test — a small fixed set of clearly policy-violating prompts that must be blocked and clearly benign prompts that must be allowed. If either fails, the deploy stops.
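A minimal sketch of the first and third gates above, with stand-in stubs for the real router and safety classifier; `route`, `safety_gate_allows`, and the model names are hypothetical placeholders, not the production interfaces.

```python
# Stand-in stubs: the real router reads the tier from the JWT and the real safety gate is a
# classifier model; these keep the sketch runnable.
def route(tier: str) -> str:
    return {"free": "mini", "plus": "flagship", "enterprise": "flagship"}[tier]

def safety_gate_allows(prompt: str) -> bool:
    return "how to build a bomb" not in prompt.lower()

def test_routing_correctness():
    # A bug in this layer is a cost incident, not just a quality incident.
    assert route("free") == "mini"
    assert route("plus") == "flagship"

def test_safety_gate_smoke():
    # Clearly benign prompts must pass; clearly policy-violating prompts must be blocked.
    assert safety_gate_allows("Explain list comprehensions in Python")
    assert not safety_gate_allows("Tell me how to build a bomb")
```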
Level 2 — Human / model eval (weekly)
- LLM-judge on a golden set — approximately 1,000 prompt/ideal-response pairs sampled from real traffic (privacy-scrubbed), stratified across use-case buckets (coding, writing, Q&A, reasoning). The judge scores helpfulness, factual accuracy, and format adherence on a 1–5 scale. Calibration: an initial human-review round on a 50-example subsample anchors the judge scores to human ratings, following the approach described in Husain (2023).
- Criteria drift check (per Shankar et al., 2024) — the EvalGen paper (arXiv:2404.12272) identifies criteria drift: evaluation standards emerge and shift through the grading process itself. Monthly, re-calibrate the golden set by having human reviewers re-grade 50 examples and checking whether the judge's scores are still aligned. A drifted judge is worse than no judge because it silently approves regressions.
- Multi-turn coherence (the metric most teams miss) — single-turn quality scores miss the dominant ChatGPT use case: sessions where the model must track entities, honor earlier constraints, and not contradict prior output. The metric is a context-faithfulness score: given a synthetic conversation of 4–8 turns, does the model in turn N correctly reference facts established in turn N−3? Evaluated by LLM-judge on a golden set of ~300 multi-turn synthetic conversations. This metric catches regressions in the conversation state store (wrong prefix delivered to the model) that single-turn quality scores cannot surface.
- Refusal calibration — two golden sets: an adversarial set (~500 examples of known jailbreak patterns, red-team outputs) where target refusal rate is near 100%; and a benign-but-sensitive set (~500 examples of medical, legal, security questions a helpful assistant should answer) where target refusal rate is near 0%. Over-refusal on the benign set is a quality regression, not a safety win.
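A minimal sketch of the refusal-calibration metric over the two golden sets described above; `judge_refused` is a hypothetical stand-in for the LLM-judge call, and the thresholds are illustrative.

```python
from typing import Callable

def refusal_rate(responses: list[str], judge_refused: Callable[[str], bool]) -> float:
    return sum(judge_refused(r) for r in responses) / len(responses)

def refusal_calibration_ok(adversarial: list[str], benign: list[str],
                           judge_refused: Callable[[str], bool]) -> bool:
    # Near-100% refusal on known jailbreaks, near-0% on benign-but-sensitive questions.
    # Over-refusal on the benign set is treated as a quality regression, not a safety win.
    return (refusal_rate(adversarial, judge_refused) >= 0.98
            and refusal_rate(benign, judge_refused) <= 0.02)
```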
Online bridge
An offline eval win that does not show up in online metrics is a signal you are measuring the wrong thing. Specify the bridge as a hypothesis to validate: offline LLM-judge quality score should predict 7-day retention; multi-turn coherence score should predict average turns-per-session. Validate by running cohort analysis after each major model update — if an offline improvement does not move the online metric within two weeks, golden-set composition needs revision.
Quick check
A deploy causes GPT-4o to refuse to write Python for-loops, labeling them potentially harmful. At which eval tier should this regression be caught before it reaches production?
Back-of-Envelope
ChatGPT processes on the order of 2 billion messages a day — roughly 23,000 QPS averaged over 24 hours, with a peak-to-average ratio on the order of 2x, giving a peak of roughly 46,000 QPS. Not all traffic goes to the flagship model. Below we size the flagship-model tier at 30,000 QPS to represent the combined load seen by a large serving model, with a 30% prefix-cache hit rate on the system prompt (a large, repeated prefix that is the prime caching candidate). The table below shows the resulting fleet sizing and where the bottleneck lands.
| Quantity | Value |
|---|---|
| Model weights (FP16) | 140 GB |
| KV cache / request | 515.4 MB (1,536 tokens) |
| Tokens/sec per GPU | 600 |
| Effective QPS (after cache) | 21,000 |
| GPUs needed | 17,922 |
| GPU memory usage | 61% |
| Compute utilization | 100% |
| Est. p95 latency | 3.56 s |
| Bottleneck | Compute/bandwidth |
| Cost / month | $45,790,710 |
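The table's headline numbers follow from a few lines of arithmetic. A minimal sketch, assuming 512 output tokens per request and $3.5/GPU-hour (the community-estimate figures used in the baseline below); treating a cache hit as removing a request's share of GPU work is the same simplification the calculator makes.

```python
QPS             = 30_000   # flagship-tier requests/s before caching
CACHE_HIT       = 0.30     # prefix-cache hit rate on the system prompt
TOK_PER_REQ     = 512      # assumed average output tokens per request
TOK_PER_GPU_SEC = 600      # sustained tokens/s per GPU (from the table above)
GPU_HOURLY_COST = 3.5      # $/GPU-hour (community estimate)

effective_qps = QPS * (1 - CACHE_HIT)                          # 21,000
gpus_needed   = effective_qps * TOK_PER_REQ / TOK_PER_GPU_SEC  # ~17,920
monthly_cost  = gpus_needed * GPU_HOURLY_COST * 730            # ~$45.8M

print(f"{effective_qps:,.0f} QPS -> {gpus_needed:,.0f} GPUs -> ${monthly_cost:,.0f}/month")
```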
Baseline: ChatGPT flagship fleet — 5000 GPUs @ $3.5/hr at p99 3000 ms, 30,000 QPS, 50% cache hit.
How tightening the p99 SLO drives GPU spend — and how prefix-cache hit rate is the single highest-leverage knob. Baseline figures are community estimates. As the SLO tightens, the GPU count curve steepens: halving the latency target roughly doubles the fleet.
| Quantity | Value |
|---|---|
| Effective QPS (after cache) | 15,000 |
| Latency-batch factor | 1.00× |
| GPUs needed | 5,000 (+0% latency vs baseline) |
| Hourly burn | $17,500 (+0% vs baseline) |
| Cost / request | $0.00016 |
| Monthly burn (24×7) | $12,775,000 |
| Bottleneck | Balanced |
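A minimal sketch of why the hit rate is the highest-leverage knob. Per-GPU throughput here (1,536 tok/s) is backed out of the baseline figures (15,000 effective QPS × 512 tok/request ÷ 5,000 GPUs); 512 tokens/request is the same assumption used elsewhere in this section.

```python
def gpus_needed(qps=30_000, cache_hit=0.50, tok_per_req=512, tok_per_gpu_sec=1536):
    return qps * (1 - cache_hit) * tok_per_req / tok_per_gpu_sec

for hit in (0.0, 0.30, 0.50, 0.70):
    fleet = gpus_needed(cache_hit=hit)
    print(f"hit rate {hit:.0%}: {fleet:,.0f} GPUs, ${fleet * 3.5:,.0f}/hr")
# 0% -> 10,000 GPUs; 30% -> 7,000; 50% -> 5,000 (baseline); 70% -> 3,000.
# Every point of hit rate is prefill GPU-seconds the fleet never has to spend.
```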
Quick check
ChatGPT Plus targets a p95 TTFT of ~800 ms. If prefill runs at ~5K tokens/sec on an H100 for a 7B-class model, what is the maximum prompt length that fits inside the 800 ms budget with zero queuing delay?
Architecture
The ChatGPT-specific topology has two model fleets, two safety classifiers (pre- and post-model), a conversation store on its own tier, and an async RLHF feedback queue.
ChatGPT Multi-Tier Serving Architecture
The two-safety-classifier design and the conversation store are ChatGPT-specific additions to the generic topology; the table below walks through each component's role.
Component justification — ChatGPT-specific notes
| Component | Why it exists | ChatGPT-specific note |
|---|---|---|
| API Gateway | Auth, per-tier rate limits, token-budget enforcement | Also assembles the conversation prefix from the store and injects the system prompt — the natural prefix-cache boundary is established here |
| Conversation Store | Holds accumulated history for multi-turn coherence and KV-cache reuse across turns | Two-tier: Redis hot path (30-min sessions) + object-storage durable layer with per-tier TTLs; prefix must be byte-stable for cache hits |
| Safety Classifier (pre) | Blocks clearly policy-violating requests before GPU consumption | Fast binary gate, sub-10 ms; saves full model cost on the easiest blocks; runs for all tiers equally |
| Model Router | Routes traffic to the appropriate model fleet to meet the cost SLO | Free → mini (community name: GPT-4o-mini); Plus/Enterprise → flagship; shadow-judges 5% of mini responses; canary-deploys threshold changes before full rollout |
| Flagship / Mini Model Fleets | Serve tokens with PagedAttention and continuous batching (per vLLM, arXiv:2309.06180) | Prefix cache on the shared system prompt is tracked as a business metric — the cache hit rate directly reduces prefill GPU-seconds |
| Safety Classifier (post) | Catches outputs that passed the pre-model gate but were made harmful by conversation context | Higher latency budget than pre-model (runs after generation); blocks prompt injections and adversarial suffixes that only become harmful in context |
| RLHF Feedback Queue | Collects thumbs-up/down and edit signals for reward-model retraining | Async, off the serving hot path; architecturally coupled because the data collected here re-trains the model that the router shadow-judges |
Quick check
Why does a conversation store outage have a higher blast radius than a single model-server node failure?
Deep dives on the two load-bearing components
Two components matter most here: the conversation store and the router. Their failures hit the most SLOs at once.
Deep dive A — Conversation State Store
Approach
The conversation store is a two-tier architecture: a Redis cluster (or Redis-compatible in-memory store) serves as the hot path, holding sessions active in the last 30–60 minutes. The key is the session ID; the value is the serialized conversation prefix — the byte sequence that will be delivered to the model server as the KV-cache seed on the next turn. Write-through to a durable object-storage layer (S3 or equivalent) gives crash-resilience without requiring Redis durability (which would hurt write throughput). A versioning field on each conversation record allows the model server to detect staleness before prefilling.
TTL policies are tier-governed: Free tier conversations expire after 30 days of inactivity; Plus and Team after 90 days; Enterprise conversations are audit-locked with configurable retention, cannot be deleted without an explicit account action, and must survive a compliance audit. The TTL is not a cosmetic setting — it is the primary lever for keeping the storage cost bounded as the user base grows. Without TTLs, histories accumulate indefinitely, storage grows without bound, and there is no documented retention policy to point to in a compliance audit.
KV-cache reuse across turns — the non-obvious constraint
The conversation store is not just a data store; it is a prefix-stability contract. The model server's prefix cache (a core feature of vLLM-class engines, per arXiv:2309.06180) works by hashing the byte sequence of the input prefix and looking it up in a cache table. If the prefix on turn N is byte-identical to what was delivered on turn N−1 plus the new assistant turn appended, the prior turn's KV states are reused and only the new tokens are prefilled. If anything changes — a whitespace normalization, a system prompt version bump, a conversation serialization format drift — the hash misses and the model server re-prefills from scratch, burning GPU-seconds proportional to the full conversation length.
At a 10-turn session averaging 512 tokens per turn, a full re-prefill on turn 10 costs roughly 5,000 tokens of prefill work. On an H100 with a 70B model, a 5K-token prefill takes on the order of hundreds of milliseconds per 1K-token batch (inferred — verify against your serving stack). That translates to single-digit seconds of GPU time per affected session — or equivalently, the GPU-seconds that could have served many other users. The conversation store is therefore the guardian of the prefix-cache contract, and any change to how conversations are serialized must be treated as a potentially cache-invalidating event requiring a cache-warmup period.
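A minimal sketch of the prefix-stability contract: the cache key is derived from the exact bytes delivered to the model server, so any serialization drift is a cache-invalidating event. `serialize_prefix` is a hypothetical stand-in for the conversation store's serializer, not the real format.

```python
import hashlib
import json

def serialize_prefix(system_prompt: str, turns: list[tuple[str, str]]) -> bytes:
    # Canonical form: fixed key order, no incidental whitespace, explicit version tag.
    return json.dumps({"v": 1, "system": system_prompt, "turns": turns},
                      separators=(",", ":"), ensure_ascii=False).encode()

def cache_key(prefix_bytes: bytes) -> str:
    return hashlib.sha256(prefix_bytes).hexdigest()

a = cache_key(serialize_prefix("You are a helpful assistant.", [("user", "hi")]))
b = cache_key(serialize_prefix("You are a helpful assistant. ", [("user", "hi")]))
print(a == b)  # False: a single trailing space forces a full re-prefill of the session
```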
Trade-off — consistency vs. latency on writes
The non-obvious trade-off is write consistency. If the Redis write completes before the durable-tier write, a Redis node failure between the two writes loses the turn permanently. If the durable write must complete before Redis is updated, every turn carries the latency of the object-storage round trip (typically 50–200 ms to cloud storage, which is within the TTFT budget but measurable). The practical resolution is a write-ahead log on the durable tier first, followed by the Redis write — Redis is the fast-read replica, not the source of truth. This adds roughly one object-storage write latency to each turn, but eliminates the loss window. The secondary effect is that Redis eviction under memory pressure is now safe: a cache miss falls back to the durable tier without data loss, paying only the cold-read latency cost.
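A minimal sketch of the write ordering described above, durable tier first and Redis second, so Redis stays a fast read replica rather than the source of truth. The object-store client and key layout are hypothetical; the Redis calls are standard commands.

```python
import json
import time

def append_turn(session_id: str, turn: dict, object_store, redis) -> None:
    record = {"session": session_id, "ts": time.time(), "turn": turn}
    # 1. Durable write first (object-storage write-ahead log). Costs ~50-200 ms but closes
    #    the window in which a Redis node failure could lose the turn permanently.
    object_store.put(f"conversations/{session_id}/{record['ts']:.6f}.json",
                     json.dumps(record).encode())
    # 2. Then update the hot path. Eviction or a crash here only costs a cold read later.
    redis.rpush(f"session:{session_id}:turns", json.dumps(turn))
    redis.expire(f"session:{session_id}:turns", 30 * 60)  # hot-path window, not retention TTL
```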
Failure modes
- Cache miss / cold read — Redis unavailable or session evicted under memory pressure. Every turn pays full prefill cost for the entire accumulated context. TTFT regresses by the prefill time for the full history: roughly 200 ms per 1,000 tokens of accumulated history (inferred from vLLM throughput figures, arXiv:2309.06180). A 10-turn session with 5K tokens of history adds ~1 second of prefill penalty per request — a Plus-tier SLO breach on its own.
- State corruption — partial write during a network partition leaves a truncated conversation prefix. The model sees missing turns and produces incoherent output. If the truncated context drops a user-specified constraint (“don't include code”), the model violates it on the next turn, which is a visible quality regression, not just a latency regression.
- Full store outage — every active multi-turn session simultaneously loses context coherence. Unlike a model-server node failure (which loses one in-flight request), a store outage hits every concurrent session at once. The cascade: all sessions send full history as raw tokens; the model fleet sees a simultaneous surge of long-context prefill requests; KV-cache memory pressure spikes; if admission control is not tuned for this scenario, the queues fill and p95 TTFT for all tiers breaches simultaneously.
Detection metric
The primary detection signal is the prefix-cache hit rate on the model-server fleet, tracked per tier and per conversation-length bucket. A sudden drop to zero on a specific tier points to a serialization change that invalidated the cache; a gradual drop across all tiers points to memory pressure on Redis causing excessive evictions. The secondary signal is the average prefill token count per request: it should stay roughly constant for a stable traffic mix; a spike means sessions are arriving without cached prefixes.
Mitigation
Redis cluster with read replicas eliminates single-node failure as a blast-radius event. Eviction policy set to LRU with a minimum memory headroom threshold (e.g., never exceed 85% utilization) prevents sudden eviction storms under traffic spikes. The versioned-prefix scheme allows the model server to detect a length mismatch — if the delivered prefix is shorter than the stored session length, fall back to a cold prefill rather than returning garbage. On a full store outage, a graceful degradation mode re-enables single-turn serving (no multi-turn coherence) while the store recovers, rather than blocking all requests.
Real-world example
Discord's 2017 Redis migration (discord.com/blog/how-discord-stores-billions-of-messages) is the closest published analog: a Redis eviction storm under memory pressure cascades into a thundering-herd of backend database reads. The ChatGPT version of this is a thundering-herd of full-history prefills on the GPU fleet — the cost vector is GPU-seconds instead of database IOPS, but the cascade mechanism is identical. The mitigation (circuit-breaker admission control, graceful fallback to cold-serve) is standard in Redis-backed web systems and directly applicable here.
Deep dive B — Model Router
Approach
The model router has two layers. The primary layer is tier-based: Free-tier and Team requests route to the smaller model (community name: GPT-4o-mini); Plus and Enterprise requests route to the flagship. This is a hard mapping — it can be implemented as a lookup table keyed by the authenticated tier in the JWT, with sub-millisecond latency.
The secondary layer is a lightweight request classifier that runs within the Plus tier to handle capacity overflow: when the flagship queue depth exceeds a threshold, requests classified as “easy” (short Q&A, simple writing help, factual lookups) can be served by the smaller model with a user-visible degradation banner, rather than timing out silently. The classifier is trained on ~5K human-labeled easy/hard examples, uses a small embedding model to featurize the prompt, and runs in under 5 ms to stay off the TTFT critical path. The fallback chain is: flagship → mini with banner → capacity-exceeded message with retry-after header. There is no silent fallback; every downgrade is visible to the user and logged for the eval harness.
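A minimal sketch of the two-layer decision and the explicit fallback chain; `classify_easy`, the queue-depth input, and the threshold value are hypothetical stand-ins for the components described above.

```python
def classify_easy(prompt: str) -> bool:
    # Stand-in for the <5 ms embedding classifier trained on ~5K labeled examples.
    return len(prompt) < 200

def route_request(tier: str, prompt: str, flagship_queue_depth: int,
                  overflow_threshold: int = 200) -> dict:
    # Layer 1: hard tier mapping, keyed by the authenticated tier in the JWT.
    if tier in ("free", "team"):
        return {"model": "mini", "degraded": False}
    # Layer 2: Plus/Enterprise go to the flagship unless it is capacity-constrained.
    if flagship_queue_depth <= overflow_threshold:
        return {"model": "flagship", "degraded": False}
    if classify_easy(prompt):
        # Downgrade is user-visible (banner) and logged for the eval harness.
        return {"model": "mini", "degraded": True, "banner": True}
    # Final step in the chain: capacity-exceeded with retry-after, never a silent timeout.
    return {"model": None, "degraded": True, "retry_after_s": 30}
```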
The quality verifier shadow-judge closes the feedback loop: 5% of responses from mini-routed requests are asynchronously scored by the flagship as a judge. If the quality gap between the two model tracks widens beyond a threshold (measured as the delta in LLM-judge scores), the easy/hard classifier threshold re-calibrates to route more traffic to the flagship. This is the mechanism by which a model update that widens the quality gap is automatically reflected in routing policy without requiring a manual threshold change.
Trade-off — cost-per-miss math
The non-obvious trade-off is the cost-per-miss calculation. A routing miss has two types: a false downgrade (Plus request routed to mini when it should go to flagship) and a false upgrade (Free request routed to flagship when it should go to mini). The asymmetry is large: a false upgrade costs approximately the cost delta between the two models per request. With a 70B flagship and a 13B mini, the cost ratio is on the order of 5–10x (community estimate based on relative parameter counts and memory bandwidth requirements). If the free tier contributes roughly 20K of the ~23K average QPS (community estimate), even a 1% false-upgrade rate at 20K free-tier QPS means 200 requests/second being served at 5–10x their intended cost — roughly $500K/day at scale.
The false-downgrade cost is asymmetric: a Plus user getting a mini response loses quality, which is a retention risk (and therefore a revenue risk), but does not cause a direct cost overrun. The routing threshold should therefore be calibrated asymmetrically: err on the side of upgrading ambiguous Plus requests to the flagship, and err on the side of keeping Free requests on mini even for harder-looking prompts. The shadow-judge metric is the operational signal that tells the team when the error rate in either direction is crossing the acceptable threshold.
Failure modes
- Classifier drift after a model update — the easy/hard classifier was trained on prompts graded against the prior model version. After a model update, the new model's difficulty distribution shifts. What was “easy” for the prior model may be “hard” for the new one (or vice versa). Detection: shadow-judge score divergence — if the mini quality score on the “easy” cohort drops after a model update, the classifier labels are stale. Mitigation: trigger a classifier retraining run on post-update traffic within 48 hours of a new model deploy.
- Router regression — the costly failure mode — a config push, feature-flag flip, or canary-weight bug routes all free-tier traffic to the flagship for 30 minutes. At ~20K QPS of free-tier traffic, 512 output tokens per request, and a cost delta of approximately $5/M tokens between flagship and mini (community estimate), the cost overrun is: 20,000 req/s × 1,800 s × 512 tok/req × $5/10⁶ tok ≈ $92K — close to six figures for a single 30-minute incident. A cost-SLO alarm on per-tier GPU-spend fires within 2 minutes if configured; without one, the first signal is the billing dashboard the next morning.
- Shadow-judge latency budget exceeded — the shadow-judge runs asynchronously and should not affect serving latency. If the async queue depth grows (flagship is capacity-constrained), shadow-judge results become stale. The threshold re-calibration lags behind the actual quality gap. The mitigation is to dedicate a small fraction of flagship capacity (on the order of 2–5%) to the shadow-judge queue with a separate rate limit, so it cannot be starved by serving traffic.
Detection metric
Three metrics, in priority order: (1) per-tier GPU-spend rate — a cost alarm that fires within 2 minutes of a routing regression; (2) shadow-judge score delta — the quality gap between mini-routed and flagship-routed responses, tracked as a rolling 1-hour average; (3) per-tier queue depth — a leading indicator of capacity pressure before it shows up in TTFT. Watch queue-depth p95 per tier: a single burst of long-context Plus requests can starve regular Plus traffic if the scheduler's priority weights are stale.
Mitigation
Canary routing: roll routing threshold changes to 1% of traffic for 10 minutes and watch the cost and shadow-judge alarms before full rollout. Automated rollback: if per-tier cost-SLO breach is detected, the routing config reverts to the prior version without human intervention. Monthly classifier retraining: a scheduled pipeline pulls the past 30 days of shadow-judge-labeled traffic, adds human-reviewed examples from the escalation queue, and retrains the easy/hard classifier. The retrained classifier is canary-deployed using the same 1% process.
Real-world example
The vLLM team's v0.6.0 release notes (vllm.ai/blog) report a 2.7x throughput improvement and roughly 5x lower time per output token (TPOT) from the chunked-prefill improvements in that release — a reminder that serving engine updates can change the easy/hard boundary substantially. A prompt that took 400 ms of prefill before the update may take 150 ms after, which reclassifies it from “hard (flagship only)” to “easy (mini viable)” from a latency-budget perspective. This is exactly the kind of model/infrastructure co-evolution that makes monthly classifier retraining necessary rather than optional.
Quick check
Why does the conversation state store carry a higher blast radius than an individual model-server node?
Break It
Three failure modes, each mapping to a metric in the eval harness. If you cannot measure the break, the mitigation is guesswork.
Remove prefix caching
Every request re-prefills the system prompt and full conversation history from scratch. At roughly 30K QPS with a 1,000-token system prompt and an average 4-turn history (approximately 2,000 tokens of prior turns), each request adds ~3,000 tokens of prefill versus the cached baseline. The vLLM paper (arXiv:2309.06180) reports a 2–4x throughput improvement from prefix caching on repeated prefixes. That 2–4x translates directly to fleet size: the flagship fleet would need 2–4x as many GPUs to serve the same QPS with the same latency. Alternatively, p95 TTFT regresses by the added prefill time — on the order of 500 ms–1 s per request for a session at turn 5. Detection: prefix-cache hit rate metric drops to 0%; cost-per-session metric spikes 2–4x. Mitigation: restore caching; audit what changed the prefix (system-prompt update, serialization format change) and ensure prefix bytes are stable across turns.
Disable conversation-store TTLs
Conversation histories accumulate indefinitely. Storage cost grows linearly with the user base. At 100M monthly active users with an average 50 KB of conversation history, that is 5 TB of object-storage accumulation with no upper bound — manageable financially in object storage, but now every retrieval can return arbitrarily long contexts, pressuring the model server with 10K+ token prefills for old-but-active accounts. The compliance cost is the larger issue: most privacy regulations (GDPR, CCPA) require a documented retention and deletion policy. A system with no TTL has no retention policy, which is a compliance blocker for any regulated-industry Enterprise customer. Detection: storage cost trend alert; compliance audit finding (not a real-time metric — a lagging indicator). Mitigation: restore per-tier TTLs; add a user-initiated deletion flow that satisfies right-to-erasure requirements.
Remove the model router
Route 100% of traffic to the flagship model. A 70B flagship costs roughly 5–10x more per token than a 7–13B mini model (community estimate based on relative parameter counts and memory bandwidth). If the mini model handles 60–70% of ChatGPT's free-tier volume, removing the router multiplies the cost of those requests by 5–10x. The free tier becomes immediately unprofitable at scale — not marginally, but by an order of magnitude. Within days, the cost model collapses. The free tier is either removed or rate-limited to near-zero, destroying the top-of-funnel for Plus conversion. Detection: per-tier cost-per-session metric crosses the ceiling within hours. Mitigation: restore routing; verify that the quality delta between mini and flagship is actually perceptible on the tasks the free tier does most (short Q&A, simple writing). If the quality gap has closed since the routing threshold was last calibrated, the threshold should be more aggressive.
Quick check
Disabling conversation-store TTLs is primarily a compliance risk, not just a storage cost problem. Why?
What does a bad day cost?
Reliability is a dollar number, not a percentage. The following estimates are back-of-envelope — the goal is order-of-magnitude accuracy. All model cost figures are community estimates unless cited.
| Incident | Detected in 2 min | Detected in 4 h | Detection lever (10x sensitivity) |
|---|---|---|---|
| Router regression — free tier routes to flagship | ~$6K (2 min × 20K QPS × 512 tok × $5/M delta; community estimate on cost delta) | ~$720K (240 min × same rate — the same arithmetic at full window) | Per-tier GPU-spend alarm within 2 min vs. next-day billing dashboard — 120x cost difference |
| Prefix-cache invalidation — system prompt format changed | ~2% TTFT regression on active sessions; small fleet overage (session-level, recoverable in minutes once cache warms) | 4 h of 2–4x prefill cost across all active sessions; Plus-tier SLO breach for the full window; potential SLA credit trigger on Enterprise | Cache-hit-rate alarm (fires in <1 min) vs. no alarm (first signal is p95 TTFT report next morning) — qualitatively ~100x cost window difference |
| Safety classifier false-positive storm — benign requests refused | Small direct cost (refused requests skip GPU); social media posts begin within minutes of the storm | A 4-hour over-refusal event generates more support volume and brand damage than a 15-min availability outage; recovery requires a public post-mortem | Benign-but-sensitive golden set in CI (fires before deploy) vs. user-reported complaints — deploy-time detection vs. post-incident |
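A minimal sketch of the detection-window arithmetic in the first row, using the same community-estimate inputs (20K free-tier QPS, 512 output tokens per request, ~$5/M-token cost delta between flagship and mini).

```python
def misroute_cost(window_minutes: float, qps: float = 20_000,
                  tok_per_req: float = 512, delta_per_mtok: float = 5.0) -> float:
    return window_minutes * 60 * qps * tok_per_req * delta_per_mtok / 1e6

print(f"  2 min: ${misroute_cost(2):>10,.0f}")    # ~$6K   (alarm fires, config reverted)
print(f" 30 min: ${misroute_cost(30):>10,.0f}")   # ~$92K  (the router-regression failure mode)
print(f"240 min: ${misroute_cost(240):>10,.0f}")  # ~$737K (next-morning billing review, ~120x)
```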
ChatGPT On-call Runbook
Model router misroutes free-tier traffic to flagship
MTTR p50 / p99: 7 min / 20 min. Blast radius: all free-tier users receive flagship-quality responses; GPU spend spikes 3–5× on the over-served tier within minutes (community estimate on tier cost delta).
- 1. Detect: Per-tier GPU-spend alarm fires within 2 min at 2× baseline; token-throughput ratio alert (free:paid) deviates >20% from the 7-day moving average.
- 2. Escalate: Page serving on-call. Check the router config diff against the last deploy. Confirm via the tier-split metric dashboard before rollback.
- 3. Rollback: Revert the router config to the previous version (blue/green flip, ~5 min). Validate that the tier split restores to the target ratio. MTTR is dominated by detection, not rollback.
- 4. Post: Add a pre-deploy integration test that routes a synthetic free-tier token through the router and asserts the response comes from the small-model tier. Gate all router deploys on this test.
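A minimal sketch of the pre-deploy gate from step 4, written as a test against a staging endpoint; `staging_client`, `make_synthetic_token`, and the `served_by` response field are hypothetical names, not a real API.

```python
def test_free_tier_routes_to_mini(staging_client):
    # Push a synthetic free-tier request through the real router in staging and assert the
    # response metadata names the small-model tier. A failure here blocks the router deploy.
    resp = staging_client.submit_chat(
        token=make_synthetic_token(tier="free"),  # hypothetical test-token factory
        prompt="What is 2 + 2?",
    )
    assert resp.served_by == "mini", (
        f"free-tier request served by {resp.served_by!r}: blocking deploy (cost incident risk)"
    )
```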
Safety classifier false-positive storm during news event
MTTR p50 / p99: 15 min / 60 min. Blast radius: benign requests refused at an elevated rate (estimated 5–20% of traffic during the storm). Users hit refusal messages; social media posts begin within minutes.
- 1. Detect: Benign golden-set refusal-rate alarm fires if the refusal rate on synthetic non-harmful probes exceeds 2× baseline for 60 s. Secondary: user-visible error-rate spike.
- 2. Escalate: Page safety on-call + policy lead. Do not roll back the classifier silently — log the storm start time for post-incident audit. Notify comms if the rate exceeds 10% for >5 min.
- 3. Rollback: Shadow-mode the new classifier (pass traffic to the old weights, compare outputs). If the old weights are clean, cut over. If there is no fallback, disable the classifier layer and temporarily accept a higher under-refusal (false-negative) rate — over-refusal is a worse user experience than mild under-refusal at scale.
- 4. Post: Expand the benign golden set with query variants from the news event. Add a canary deploy stage that holds 1% of traffic for 10 min with refusal-rate monitoring before full rollout.
KV-cache OOM during viral prompt (unusually long context)
MTTR p50 / p99: 5 min / 25 min. Blast radius: serving pods OOM-kill active sessions. Affected users see mid-stream disconnects. Cache-eviction cascades can degrade adjacent sessions if not isolated per pod.
- 1. Detect: Pod memory-utilization alarm triggers at 85%; KV-cache eviction-rate counter spikes above the rolling baseline. Correlated with a p99 TTFT increase.
- 2. Escalate: Page infra on-call. Identify the prompt pattern driving the spike (log top-N input-length percentiles). Activate admission control to cap input length for new sessions.
- 3. Rollback: Enable a request-level input-length cap (e.g., 32K tokens max for non-Enterprise). Restart OOMed pods. For Enterprise sessions, route to a dedicated long-context pool if capacity exists.
- 4. Post: Add the input-length distribution to the capacity model. Set a KV-cache high-water-mark alarm at 70% so there is headroom before OOM. Investigate whether MQA/GQA can reduce the per-session KV footprint.
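A minimal sketch of the per-session KV-cache footprint behind the 70% high-water mark in step 4. The layer and head counts are illustrative assumptions for a large GQA model, not any specific model's published specs.

```python
def kv_cache_bytes(tokens: int, n_layers: int = 80, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    # Two tensors (K and V) per layer, each n_kv_heads * head_dim values per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * tokens

for ctx in (1_536, 8_192, 16_384, 32_768):
    print(f"{ctx:>6} tokens ~ {kv_cache_bytes(ctx) / 2**20:,.0f} MiB per session")
# Footprint grows linearly with context length, which is why input-length caps and
# grouped-query attention (GQA/MQA) are the levers named in the post-incident step.
```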
For a direct comparison of how this runbook pattern differs under multi-turn agent workloads, see the Gemini case study and the Character.AI session-persistence design.
Company Lens — same system, different drills
OpenAI's drill
Expect the interviewer to drill on the routing-economics loop and model lifecycle management. Specific questions you should be able to answer: “How do you know when to update the routing threshold — what is the triggering signal and the deployment process?” (Answer: shadow-judge score delta alarm → canary routing deploy → auto-rollback on cost-SLO breach.) “When you ship a new model and retire the old one, how do you migrate without a conversation-store invalidation event that wipes every user's prefix cache?” (Answer: maintain both model versions behind the same router during a migration window; flush the prefix cache gradually by tier rather than all at once.) “How does the RLHF feedback from thumbs-down signals reach the reward model without contaminating the training set with adversarially-prompted negative labels?” OpenAI interviewers are acutely sensitive to the gap between the training-time world (clean batches, infinite budget) and the serving-time world (adversarial users, cost ceiling). Safety is a design constraint, not a sidebar — expect questions on how the safety classifiers integrate with the serving pipeline rather than bolt on.
Google's drill
Expect heavy emphasis on fair-share scheduling and multi-tenant queueing. Specific questions: “How do you prevent a single hot tenant from starving paying customers?” (Answer: per-tenant queue-depth p95 monitored in real time; fair-share weighted by GPU-second budgets so a burst above quota gets deprioritised rather than blocked, preserving SLO for other tenants.) “What is the Borg/Kubernetes integration for your GPU fleet — how does the scheduler know a prefill pod needs priority over a decode pod?” (Answer: prefill and decode pods carry different resource classes; the scheduler enforces disaggregated-prefill quotas per tier with preemption.) Google L6+ bar is heavy on scheduling theory — expect Borg-style gang-scheduling and bin-packing tradeoffs, not just serving metrics.
Anthropic's drill
Expect deep focus on the safety-helpfulness tradeoff as a measurable loss function, not a policy dial. Specific questions: “How does your system handle a sequence of requests that are each individually benign but collectively constitute policy evasion?” (Answer: the conversation store's multi-turn context is fed to the safety classifier, not just the current turn — context-aware classification requires the full prefix, not a single message.) “How do you measure whether Constitutional AI revisions are improving safety without degrading helpfulness?” (Answer: the benign-but-sensitive golden set measures false-positive refusal rate; the revision diff log feeds the eval harness to detect hallucinations introduced during revision; the shadow-run at 5% of production traffic gives a real-time signal.) “If a user complains that the model refused a legitimate medical question, how does that signal reach the calibration team?” (Answer: the benign-but-sensitive set should include a medical bucket; user-reported false refusals trigger a human-review annotation that feeds the next calibration round.) Anthropic's bar treats CAI (arXiv:2212.08073) as a first-class architectural component, not an inference-time add-on.
Key Takeaways
What to remember for interviews
- 1. Separate p95 TTFT targets per tier — collapsing Free and Plus into one p95 hides regressions on the revenue-critical tier behind the larger free-tier volume.
- 2. Design the eval harness before the architecture: multi-turn coherence and per-tier refusal rate are metrics that determine which components matter most, per Husain (hamel.dev) and Shankar et al. (arXiv:2404.12272).
- 3. Prefix caching on the system prompt is a first-class cost lever — vLLM reports 2–4x throughput improvement from prefix reuse (arXiv:2309.06180); a 30% hit rate translates directly to 30% fewer prefill GPU-seconds.
- 4. The conversation store has higher blast radius than a model-server node: its failure degrades every concurrent session at once and triggers a KV-cache miss storm on the GPU fleet.
- 5. The shadow-judge pattern (score 5% of mini-routed responses via the flagship asynchronously) closes the routing quality feedback loop without full human review on every model update.
- 6. Detection window sensitivity is the reliability investment: a per-tier cost alarm that fires in 2 minutes vs. a next-day billing review is a 120x difference in incident cost on a router regression.
Recap quiz
Defend the design end-to-end
Why does ChatGPT's ~2 billion messages/day figure imply a minimum cluster size in the tens of thousands of H100s, rather than hundreds?
Without chunked prefill, a single 8K-token prompt can stall all other decode streams on the same GPU for ~400 ms. Why is the latency impact on other users asymmetric — i.e., why does one long prefill hurt many short completions more than vice versa?
A ChatGPT Plus session accumulates a 16K-token context. Relative to an 8K-token context, how does KV-cache memory footprint scale, and what is the practical serving implication?
ChatGPT free tier enforces a rate cap (~40 msgs / 3 h). Which system-design mechanism is most likely enforcing this per-user limit at scale across a distributed fleet?
Constitutional AI uses a self-critique-and-revision loop to improve harmlessness. What is the primary latency cost of applying this technique at ChatGPT inference time versus training time?
The 1,000-token system prompt is byte-identical across all requests on a session. A serialization format change makes the prefix byte-unstable. What is the immediate cost consequence on the GPU fleet?
A router bug sends free-tier traffic to the flagship for 30 minutes (~20K free-tier QPS). Using a $5/M-token cost delta and 512 output tokens/request, what is the approximate cost overrun?
Detecting the same router regression in 2 minutes costs ~$6K; detecting it after 4 hours costs ~$720K. What is the detection-window sensitivity ratio, and what does it imply?
Anthropic's interviewer asks how the safety pipeline handles individually-benign turns that collectively constitute policy evasion. Which architectural choice makes detection possible?
Interview Questions
- ★★★ ChatGPT free tier routes to a cheaper model; Plus routes to the flagship. A naive implementation hard-codes this in the gateway. What is the single worst failure mode of that design, and how do you detect and fix it?
- ★★★ Design the conversation state store for ChatGPT. What are the three failure modes it must survive, and why is its blast radius higher than a model-server node failure?
- ★★☆ The CFO asks why prefix caching on the system prompt matters to the bottom line. Give a number-backed answer.
- ★★★ p95 TTFT regresses from 400 ms to 1,100 ms after a traffic spike. No model changes shipped. Where do you look, in what order?
- ★★★ Anthropic's interviewer asks: 'How does Constitutional AI change the architecture of the safety pipeline in a ChatGPT-like system?' Walk through the integration decisions.
Further Reading
- vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention (Kwon et al., 2023) — The foundational serving-engine paper. PagedAttention and continuous batching are the two primitives the conversation-store and routing deep dives rely on most heavily. The 2–4x throughput gain cited in this module comes from the paper abstract.
- Anthropic — Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022) — The Constitutional AI paper. Directly cited in the safety-pipeline integration section and the Anthropic company-lens deep dive.
- Hamel Husain — Your AI Product Needs Evals — The practitioner post that made eval-first design mainstream. The three-level eval framework (unit tests → human/model eval → A/B) cited in the eval harness section comes from Husain's taxonomy.
- Shankar et al. — Who Validates the Validators? (2024, arXiv:2404.12272) — EvalGen paper on criteria drift — the finding that evaluation standards emerge through the grading process itself. Directly motivates the monthly human-calibration cadence in the eval harness.
- Google SRE Book — Service Level Objectives (Chapter 4) — The canonical SLO engineering reference. The error-budget math in the incident-cost section and the per-tier SLO separation are grounded in the framework described here.
- vLLM Blog — Production Serving Patterns — The vLLM team's engineering blog covering chunked prefill, disaggregated prefill/decode, and KV-cache offloading. The 2.7x throughput and 5x TPOT claims in the chunked-prefill discussion come from v0.6.0 release notes referenced here.