💬 Case: Design ChatGPT
Two billion messages a day — where does the money actually go?
ChatGPT is a multi-tier consumer AI system: free must be cheap, Plus must feel faster, and Enterprise must be isolatable. The design problem is serving all three on one fleet without losing cost or quality control. The reusable lessons are storage tiering, routing, and evals that catch quality regressions as reliably as latency regressions.
Requirements & SLOs
Working backwards from the user
SLO table — four tiers, separate targets
| Metric | Target | Why this value |
|---|---|---|
| p50 TTFT — all tiers | Sub-300 ms feels instantaneous; above 500 ms users describe the product as “slow” in UX research (qualitative, per general HCI literature on interaction latency) | |
| p95 TTFT — Plus & Enterprise | Premium tier tail users should feel priority treatment; a paying customer waiting >1 s for first token is a churn signal | |
| p95 TTFT — Free tier | Free users tolerate more latency; separate target prevents the large free-tier volume from masking Plus regressions | |
| Median streaming token rate | Comfortably exceeds , or roughly 5-6 tokens/s in plain English. Below 30 tok/s the stream feels like the model is “thinking” rather than delivering | |
| Availability | 99.9% / month | ; Enterprise contracts carry 99.95% with SLA credits (per Google SRE Book error-budget framework) |
| Free tier message cap | (community estimate) | Enough for a real work session; tight enough to make Plus attractive; exact value is P&L-adjusted continuously by OpenAI |
| Cost ceiling — free message | On the order of fractions of a cent (community estimate) | Must stay below the advertising-equivalent CAC amortization; exact number is internal; model routing is the primary lever |
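To make the availability row concrete, here is the error-budget arithmetic as a minimal Python sketch, assuming a 30-day month; the ~43-minute figure is the one used in the quick check below.

```python
# Error budget implied by a monthly availability SLO, assuming a 30-day month.
def monthly_error_budget_minutes(slo: float, days: int = 30) -> float:
    return (1 - slo) * days * 24 * 60

print(monthly_error_budget_minutes(0.999))   # 43.2 min  (99.9%, consumer tiers)
print(monthly_error_budget_minutes(0.9995))  # 21.6 min  (99.95%, Enterprise SLA)
```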
Quick check
ChatGPT targets 99.9% monthly availability (~43 min error budget). A single incident consumes 30 min. What fraction of the monthly budget remains, and what does that imply for the rest of the month?
Eval Harness — design before architecture
You cannot improve what you cannot measure. Hamel Husain's practitioner framework (hamel.dev/blog/posts/evals) defines three levels of evaluation — unit tests, human/model eval, and A/B tests — and the ChatGPT eval harness below maps directly to those levels. Write this before touching architecture — the metrics here determine which components matter most.
Level 1 — Unit tests (every deploy)
- Routing correctness — assert that a sample of Free requests lands on the mini model and Plus requests land on the flagship. A bug in this layer is a cost incident, not just a quality incident.
- Prefix-cache hit rate — assert that a canonical system prompt gets a cache hit on the second request in the same session. A serialization format change that breaks this is detectable at unit-test time, not just in production.
- Safety gate smoke test — a small fixed set of clearly policy-violating prompts that must be blocked and clearly benign prompts that must be allowed. If either fails, the deploy stops.
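A minimal sketch of the first and third gates above, with stand-in stubs for the real router and safety classifier; `route`, `safety_gate_allows`, and the model names are hypothetical placeholders, not the production interfaces.

```python
# Stand-in stubs: the real router reads the tier from the JWT and the real safety gate is a
# classifier model; these keep the sketch runnable.
def route(tier: str) -> str:
    return {"free": "mini", "plus": "flagship", "enterprise": "flagship"}[tier]

def safety_gate_allows(prompt: str) -> bool:
    return "how to build a bomb" not in prompt.lower()

def test_routing_correctness():
    # A bug in this layer is a cost incident, not just a quality incident.
    assert route("free") == "mini"
    assert route("plus") == "flagship"

def test_safety_gate_smoke():
    # Clearly benign prompts must pass; clearly policy-violating prompts must be blocked.
    assert safety_gate_allows("Explain list comprehensions in Python")
    assert not safety_gate_allows("Tell me how to build a bomb")
```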
Level 2 — Human / model eval (weekly)
- LLM-judge on a golden set — approximately 1,000 prompt/ideal-response pairs sampled from real traffic (privacy-scrubbed), stratified across use-case buckets (coding, writing, Q&A, reasoning). The judge scores helpfulness, factual accuracy, and format adherence on a 1–5 scale. Calibration: an initial human-review round on a 50-example subsample anchors the judge scores to human ratings, following the approach described in Husain (2023).
- Criteria drift check (per Shankar et al., 2024) — the EvalGen paper (arXiv:2404.12272) identifies criteria drift: evaluation standards emerge and shift through the grading process itself. Monthly, re-calibrate the golden set by having human reviewers re-grade 50 examples and checking whether the judge's scores are still aligned. A drifted judge is worse than no judge because it silently approves regressions.
- Multi-turn coherence (the metric most teams miss) — single-turn quality scores miss the dominant ChatGPT use case: sessions where the model must track entities, honor earlier constraints, and not contradict prior output. The metric is a context-faithfulness score: given a synthetic conversation of 4–8 turns, does the model in turn N correctly reference facts established in turn N−3? Evaluated by LLM-judge on a golden set of ~300 multi-turn synthetic conversations. This metric catches regressions in the conversation state store (wrong prefix delivered to the model) that single-turn quality scores cannot surface.
- Refusal calibration — two golden sets: an adversarial set (~500 examples of known jailbreak patterns, red-team outputs) where target refusal rate is near 100%; and a benign-but-sensitive set (~500 examples of medical, legal, security questions a helpful assistant should answer) where target refusal rate is near 0%. Over-refusal on the benign set is a quality regression, not a safety win.
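A minimal sketch of the refusal-calibration metric over the two golden sets described above; `judge_refused` is a hypothetical stand-in for the LLM-judge call, and the thresholds are illustrative.

```python
from typing import Callable

def refusal_rate(responses: list[str], judge_refused: Callable[[str], bool]) -> float:
    return sum(judge_refused(r) for r in responses) / len(responses)

def refusal_calibration_ok(adversarial: list[str], benign: list[str],
                           judge_refused: Callable[[str], bool]) -> bool:
    # Near-100% refusal on known jailbreaks, near-0% on benign-but-sensitive questions.
    # Over-refusal on the benign set is treated as a quality regression, not a safety win.
    return (refusal_rate(adversarial, judge_refused) >= 0.98
            and refusal_rate(benign, judge_refused) <= 0.02)
```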
Online bridge
An offline eval win that does not show up in online metrics is a signal you are measuring the wrong thing. Specify the bridge as a hypothesis to validate: offline LLM-judge quality score should predict 7-day retention; multi-turn coherence score should predict average turns-per-session. Validate by running cohort analysis after each major model update — if an offline improvement does not move the online metric within two weeks, golden-set composition needs revision.
Quick check
A deploy causes GPT-4o to refuse to write Python for-loops, labeling them potentially harmful. At which eval tier should this regression be caught before it reaches production?
Back-of-Envelope
ChatGPT processes on the order of 2 billion messages a day — roughly 23,000 QPS averaged over 24 hours, with a peak-to-average ratio on the order of 2x, giving a peak of roughly 46,000 QPS. Not all traffic goes to the flagship model. Below we size the flagship-model tier at 30,000 QPS to represent the combined load seen by a large serving model, with a 30% prefix-cache hit rate on the system prompt (a large, repeated prefix that is the prime caching candidate). The table below shows the resulting fleet sizing and where the bottleneck lands.
| Quantity | Value |
|---|---|
| Model weights (FP16) | 140 GB |
| KV cache / request | 515.4 MB (1,536 tokens) |
| Tokens/sec per GPU | 600 |
| Effective QPS (after cache) | 21,000 |
| GPUs needed | 17,922 |
| GPU memory usage | 61% |
| Compute utilization | 100% |
| Est. p95 latency | 3.56 s |
| Bottleneck | Compute/bandwidth |
| Cost / month | $45,790,710 |
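The table's headline numbers follow from a few lines of arithmetic. A minimal sketch, assuming 512 output tokens per request and $3.5/GPU-hour (the community-estimate figures used in the baseline below); treating a cache hit as removing a request's share of GPU work is the same simplification the calculator makes.

```python
QPS             = 30_000   # flagship-tier requests/s before caching
CACHE_HIT       = 0.30     # prefix-cache hit rate on the system prompt
TOK_PER_REQ     = 512      # assumed average output tokens per request
TOK_PER_GPU_SEC = 600      # sustained tokens/s per GPU (from the table above)
GPU_HOURLY_COST = 3.5      # $/GPU-hour (community estimate)

effective_qps = QPS * (1 - CACHE_HIT)                          # 21,000
gpus_needed   = effective_qps * TOK_PER_REQ / TOK_PER_GPU_SEC  # ~17,920
monthly_cost  = gpus_needed * GPU_HOURLY_COST * 730            # ~$45.8M

print(f"{effective_qps:,.0f} QPS -> {gpus_needed:,.0f} GPUs -> ${monthly_cost:,.0f}/month")
```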
Baseline: ChatGPT flagship fleet — 5000 GPUs @ $3.5/hr at p99 3000 ms, 30,000 QPS, 50% cache hit.
How tightening the p99 SLO drives GPU spend — and how prefix-cache hit rate is the single highest-leverage knob. Baseline figures are community estimates. As the SLO tightens, the GPU count curve steepens: halving the latency target roughly doubles the fleet.
| Quantity | Value |
|---|---|
| Effective QPS (after cache) | 15,000 |
| Latency-batch factor | 1.00× |
| GPUs needed | 5,000 (+0% latency vs baseline) |
| Hourly burn | $17,500 (+0% vs baseline) |
| Cost / request | $0.00016 |
| Monthly burn (24×7) | $12,775,000 |
| Bottleneck | Balanced |
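A minimal sketch of why the hit rate is the highest-leverage knob. Per-GPU throughput here (1,536 tok/s) is backed out of the baseline figures (15,000 effective QPS × 512 tok/request ÷ 5,000 GPUs); 512 tokens/request is the same assumption used elsewhere in this section.

```python
def gpus_needed(qps=30_000, cache_hit=0.50, tok_per_req=512, tok_per_gpu_sec=1536):
    return qps * (1 - cache_hit) * tok_per_req / tok_per_gpu_sec

for hit in (0.0, 0.30, 0.50, 0.70):
    fleet = gpus_needed(cache_hit=hit)
    print(f"hit rate {hit:.0%}: {fleet:,.0f} GPUs, ${fleet * 3.5:,.0f}/hr")
# 0% -> 10,000 GPUs; 30% -> 7,000; 50% -> 5,000 (baseline); 70% -> 3,000.
# Every point of hit rate is prefill GPU-seconds the fleet never has to spend.
```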
Quick check
ChatGPT Plus targets a p95 TTFT of ~800 ms. If prefill runs at ~5K tokens/sec on an H100 for a 7B-class model, what is the maximum prompt length that fits inside the 800 ms budget with zero queuing delay?
Architecture
The ChatGPT-specific topology has two model fleets, two safety classifiers (pre- and post-model), a conversation store on its own tier, and an async RLHF feedback queue.
ChatGPT Multi-Tier Serving Architecture
The two-safety-classifier design and the conversation store are ChatGPT-specific additions to the generic topology; the table below walks through each component's role.
Component justification — ChatGPT-specific notes
| Component | Why it exists | ChatGPT-specific note |
|---|---|---|
| API Gateway | Auth, per-tier rate limits, token-budget enforcement | Also assembles the conversation prefix from the store and injects the system prompt — the natural prefix-cache boundary is established here |
| Conversation Store | Holds accumulated history for multi-turn coherence and KV-cache reuse across turns | Two-tier: Redis hot path (30-min sessions) + object-storage durable layer with per-tier TTLs; prefix must be byte-stable for cache hits |
| Safety Classifier (pre) | Blocks clearly policy-violating requests before GPU consumption | Fast binary gate, sub-10 ms; saves full model cost on the easiest blocks; runs for all tiers equally |
| Model Router | Routes traffic to the appropriate model fleet to meet the cost SLO | Free → mini (community name: GPT-4o-mini); Plus/Enterprise → flagship; shadow-judges 5% of mini responses; canary-deploys threshold changes before full rollout |
| Flagship / Mini Model Fleets | Serve tokens with PagedAttention and continuous batching (per vLLM, arXiv:2309.06180) | Prefix cache on the shared system prompt is tracked as a business metric — the cache hit rate directly reduces prefill GPU-seconds |
| Safety Classifier (post) | Catches outputs that passed the pre-model gate but were made harmful by conversation context | Higher latency budget than pre-model (runs after generation); blocks prompt injections and adversarial suffixes that only become harmful in context |
| RLHF Feedback Queue | Collects thumbs-up/down and edit signals for reward-model retraining | Async, off the serving hot path; architecturally coupled because the data collected here re-trains the model that the router shadow-judges |
Quick check
Why does a conversation store outage have a higher blast radius than a single model-server node failure?
Deep dives on the two load-bearing components
Two components matter most here: the conversation store and the router. Their failures hit the most SLOs at once.
Deep dive A — Conversation State Store
Approach
The conversation store is a two-tier architecture: a Redis cluster (or Redis-compatible in-memory store) serves as the hot path, holding sessions active in the last 30–60 minutes. The key is the session ID; the value is the serialized conversation prefix — the byte sequence that will be delivered to the model server as the KV-cache seed on the next turn. Write-through to a durable object-storage layer (S3 or equivalent) gives crash-resilience without requiring Redis durability (which would hurt write throughput). A versioning field on each conversation record allows the model server to detect staleness before prefilling.
TTL policies are tier-governed: Free tier conversations expire after 30 days of inactivity; Plus and Team after 90 days; Enterprise conversations are audit-locked with configurable retention, cannot be deleted without an explicit account action, and must survive a compliance audit. The TTL is not a cosmetic setting — it is the primary lever for keeping the storage cost bounded as the user base grows. Without TTLs, histories accumulate indefinitely, storage grows without bound, and there is no documented retention policy to point to in a compliance audit.
KV-cache reuse across turns — the non-obvious constraint
The conversation store is not just a data store; it is a prefix-stability contract. The model server's prefix cache (a core feature of vLLM-class engines, per arXiv:2309.06180) works by hashing the byte sequence of the input prefix and looking it up in a cache table. If the prefix on turn N is byte-identical to what was delivered on turn N−1 plus the new assistant turn appended, the prior turn's KV states are reused and only the new tokens are prefilled. If anything changes — a whitespace normalization, a system prompt version bump, a conversation serialization format drift — the hash misses and the model server re-prefills from scratch, burning GPU-seconds proportional to the full conversation length.
At a 10-turn session averaging 512 tokens per turn, a full re-prefill on turn 10 costs roughly 5,000 tokens of prefill work. On an H100 with a 70B model, a 5K-token prefill takes on the order of hundreds of milliseconds per 1K-token batch (inferred — verify against your serving stack). That translates to single-digit seconds of GPU time per affected session — or equivalently, the GPU-seconds that could have served many other users. The conversation store is therefore the guardian of the prefix-cache contract, and any change to how conversations are serialized must be treated as a potentially cache-invalidating event requiring a cache-warmup period.
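A minimal sketch of the prefix-stability contract: the cache key is derived from the exact bytes delivered to the model server, so any serialization drift is a cache-invalidating event. `serialize_prefix` is a hypothetical stand-in for the conversation store's serializer, not the real format.

```python
import hashlib
import json

def serialize_prefix(system_prompt: str, turns: list[tuple[str, str]]) -> bytes:
    # Canonical form: fixed key order, no incidental whitespace, explicit version tag.
    return json.dumps({"v": 1, "system": system_prompt, "turns": turns},
                      separators=(",", ":"), ensure_ascii=False).encode()

def cache_key(prefix_bytes: bytes) -> str:
    return hashlib.sha256(prefix_bytes).hexdigest()

a = cache_key(serialize_prefix("You are a helpful assistant.", [("user", "hi")]))
b = cache_key(serialize_prefix("You are a helpful assistant. ", [("user", "hi")]))
print(a == b)  # False: a single trailing space forces a full re-prefill of the session
```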
Trade-off — consistency vs. latency on writes
The non-obvious trade-off is write consistency. If the Redis write completes before the durable-tier write, a Redis node failure between the two writes loses the turn permanently. If the durable write must complete before Redis is updated, every turn carries the latency of the object-storage round trip (typically 50–200 ms to cloud storage, which is within the TTFT budget but measurable). The practical resolution is a write-ahead log on the durable tier first, followed by the Redis write — Redis is the fast-read replica, not the source of truth. This adds roughly one object-storage write latency to each turn, but eliminates the loss window. The secondary effect is that Redis eviction under memory pressure is now safe: a cache miss falls back to the durable tier without data loss, paying only the cold-read latency cost.
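A minimal sketch of the write ordering described above, durable tier first and Redis second, so Redis stays a fast read replica rather than the source of truth. The object-store client and key layout are hypothetical; the Redis calls are standard commands.

```python
import json
import time

def append_turn(session_id: str, turn: dict, object_store, redis) -> None:
    record = {"session": session_id, "ts": time.time(), "turn": turn}
    # 1. Durable write first (object-storage write-ahead log). Costs ~50-200 ms but closes
    #    the window in which a Redis node failure could lose the turn permanently.
    object_store.put(f"conversations/{session_id}/{record['ts']:.6f}.json",
                     json.dumps(record).encode())
    # 2. Then update the hot path. Eviction or a crash here only costs a cold read later.
    redis.rpush(f"session:{session_id}:turns", json.dumps(turn))
    redis.expire(f"session:{session_id}:turns", 30 * 60)  # hot-path window, not retention TTL
```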
Failure modes
- Cache miss / cold read — Redis unavailable or session evicted under memory pressure. Every turn pays full prefill cost for the entire accumulated context. TTFT regresses by the prefill time for the full history: roughly 200 ms per 1,000 tokens of accumulated history (inferred from vLLM throughput figures, arXiv:2309.06180). A 10-turn session with 5K tokens of history adds ~1 second of prefill penalty per request — a Plus-tier SLO breach on its own.
- State corruption — partial write during a network partition leaves a truncated conversation prefix. The model sees missing turns and produces incoherent output. If the truncated context drops a user-specified constraint (“don't include code”), the model violates it on the next turn, which is a visible quality regression, not just a latency regression.
- Full store outage — every active multi-turn session simultaneously loses context coherence. Unlike a model-server node failure (which loses one in-flight request), a store outage hits every concurrent session at once. The cascade: all sessions send full history as raw tokens; the model fleet sees a simultaneous surge of long-context prefill requests; KV-cache memory pressure spikes; if admission control is not tuned for this scenario, the queues fill and p95 TTFT for all tiers breaches simultaneously.
Detection metric
The primary detection signal is the prefix-cache hit rate on the model-server fleet, tracked per tier and per conversation-length bucket. A sudden drop to zero on a specific tier points to a serialization change that invalidated the cache; a gradual drop across all tiers points to memory pressure on Redis causing excessive evictions. The secondary signal is the average prefill token count per request: it should stay roughly constant for a stable traffic mix; a spike means sessions are arriving without cached prefixes.
Mitigation
Redis cluster with read replicas eliminates single-node failure as a blast-radius event. Eviction policy set to LRU with a minimum memory headroom threshold (e.g., never exceed 85% utilization) prevents sudden eviction storms under traffic spikes. The versioned-prefix scheme allows the model server to detect a length mismatch — if the delivered prefix is shorter than the stored session length, fall back to a cold prefill rather than returning garbage. On a full store outage, a graceful degradation mode re-enables single-turn serving (no multi-turn coherence) while the store recovers, rather than blocking all requests.
Real-world example
Discord's 2017 Redis migration (discord.com/blog/how-discord-stores-billions-of-messages) is the closest published analog: a Redis eviction storm under memory pressure cascades into a thundering-herd of backend database reads. The ChatGPT version of this is a thundering-herd of full-history prefills on the GPU fleet — the cost vector is GPU-seconds instead of database IOPS, but the cascade mechanism is identical. The mitigation (circuit-breaker admission control, graceful fallback to cold-serve) is standard in Redis-backed web systems and directly applicable here.
Deep dive B — Model Router
Approach
The model router has two layers. The primary layer is tier-based: Free-tier and Team requests route to the smaller model (community name: GPT-4o-mini); Plus and Enterprise requests route to the flagship. This is a hard mapping — it can be implemented as a lookup table keyed by the authenticated tier in the JWT, with sub-millisecond latency.
The secondary layer is a lightweight request classifier that runs within the Plus tier to handle capacity overflow: when the flagship queue depth exceeds a threshold, requests classified as “easy” (short Q&A, simple writing help, factual lookups) can be served by the smaller model with a user-visible degradation banner, rather than timing out silently. The classifier is trained on ~5K human-labeled easy/hard examples, uses a small embedding model to featurize the prompt, and runs in under 5 ms to stay off the TTFT critical path. The fallback chain is: flagship → mini with banner → capacity-exceeded message with retry-after header. There is no silent fallback; every downgrade is visible to the user and logged for the eval harness.
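A minimal sketch of the two-layer decision and the explicit fallback chain; `classify_easy`, the queue-depth input, and the threshold value are hypothetical stand-ins for the components described above.

```python
def classify_easy(prompt: str) -> bool:
    # Stand-in for the <5 ms embedding classifier trained on ~5K labeled examples.
    return len(prompt) < 200

def route_request(tier: str, prompt: str, flagship_queue_depth: int,
                  overflow_threshold: int = 200) -> dict:
    # Layer 1: hard tier mapping, keyed by the authenticated tier in the JWT.
    if tier in ("free", "team"):
        return {"model": "mini", "degraded": False}
    # Layer 2: Plus/Enterprise go to the flagship unless it is capacity-constrained.
    if flagship_queue_depth <= overflow_threshold:
        return {"model": "flagship", "degraded": False}
    if classify_easy(prompt):
        # Downgrade is user-visible (banner) and logged for the eval harness.
        return {"model": "mini", "degraded": True, "banner": True}
    # Final step in the chain: capacity-exceeded with retry-after, never a silent timeout.
    return {"model": None, "degraded": True, "retry_after_s": 30}
```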
The quality verifier shadow-judge closes the feedback loop: 5% of responses from mini-routed requests are asynchronously scored by the flagship as a judge. If the quality gap between the two model tracks widens beyond a threshold (measured as the delta in LLM-judge scores), the easy/hard classifier threshold re-calibrates to route more traffic to the flagship. This is the mechanism by which a model update that widens the quality gap is automatically reflected in routing policy without requiring a manual threshold change.
Trade-off — cost-per-miss math
The non-obvious trade-off is the cost-per-miss calculation. A routing miss has two types: a false downgrade (Plus request routed to mini when it should go to flagship) and a false upgrade (Free request routed to flagship when it should go to mini). The asymmetry is large: a false upgrade costs approximately the cost delta between the two models per request. With a 70B flagship and a 13B mini, the cost ratio is on the order of 5–10x (community estimate based on relative parameter counts and memory bandwidth requirements). If the free tier contributes roughly 20K of the ~23K average QPS (community estimate), even a 1% false-upgrade rate at 20K free-tier QPS means 200 requests/second being served at 5–10x their intended cost — roughly $500K/day at scale.
The false-downgrade cost is asymmetric: a Plus user getting a mini response loses quality, which is a retention risk (and therefore a revenue risk), but does not cause a direct cost overrun. The routing threshold should therefore be calibrated asymmetrically: err on the side of upgrading ambiguous Plus requests to the flagship, and err on the side of keeping Free requests on mini even for harder-looking prompts. The shadow-judge metric is the operational signal that tells the team when the error rate in either direction is crossing the acceptable threshold.
Failure modes
- Classifier drift after a model update — the easy/hard classifier was trained on prompts graded against the prior model version. After a model update, the new model's difficulty distribution shifts. What was “easy” for the prior model may be “hard” for the new one (or vice versa). Detection: shadow-judge score divergence — if the mini quality score on the “easy” cohort drops after a model update, the classifier labels are stale. Mitigation: trigger a classifier retraining run on post-update traffic within 48 hours of a new model deploy.
- Router regression — the costly failure mode — a config push, feature-flag flip, or canary-weight bug routes all free-tier traffic to the flagship for 30 minutes. At ~20K QPS of free-tier traffic, 512 output tokens per request, and a cost delta of approximately $5/M tokens between flagship and mini (community estimate), the cost overrun is: 20,000 req/s × 1,800 s × 512 tok/req × $5/10⁶ tok ≈ $92K — close to six figures for a single 30-minute incident. A cost-SLO alarm on per-tier GPU-spend fires within 2 minutes if configured; without one, the first signal is the billing dashboard the next morning.
- Shadow-judge latency budget exceeded — the shadow-judge runs asynchronously and should not affect serving latency. If the async queue depth grows (flagship is capacity-constrained), shadow-judge results become stale. The threshold re-calibration lags behind the actual quality gap. The mitigation is to dedicate a small fraction of flagship capacity (on the order of 2–5%) to the shadow-judge queue with a separate rate limit, so it cannot be starved by serving traffic.
Detection metric
Three metrics, in priority order: (1) per-tier GPU-spend rate — a cost alarm that fires within 2 minutes of a routing regression; (2) shadow-judge score delta — the quality gap between mini-routed and flagship-routed responses, tracked as a rolling 1-hour average; (3) per-tier queue depth — a leading indicator of capacity pressure before it shows up in TTFT. Watch queue-depth p95 per tier: a single burst of long-context Plus requests can starve regular Plus traffic if the scheduler's priority weights are stale.
Mitigation
Canary routing: roll routing threshold changes to 1% of traffic for 10 minutes and watch the cost and shadow-judge alarms before full rollout. Automated rollback: if per-tier cost-SLO breach is detected, the routing config reverts to the prior version without human intervention. Monthly classifier retraining: a scheduled pipeline pulls the past 30 days of shadow-judge-labeled traffic, adds human-reviewed examples from the escalation queue, and retrains the easy/hard classifier. The retrained classifier is canary-deployed using the same 1% process.
Real-world example
The vLLM team's v0.6.0 release notes (vllm.ai/blog) report a 2.7x throughput improvement and roughly 5x lower time per output token (TPOT) from the chunked-prefill improvements in that release — a reminder that serving engine updates can change the easy/hard boundary substantially. A prompt that took 400 ms of prefill before the update may take 150 ms after, which reclassifies it from “hard (flagship only)” to “easy (mini viable)” from a latency-budget perspective. This is exactly the kind of model/infrastructure co-evolution that makes monthly classifier retraining necessary rather than optional.
Quick check
Why does the conversation state store carry a higher blast radius than an individual model-server node?
Break It
Three failure modes, each mapping to a metric in the eval harness. If you cannot measure the break, the mitigation is guesswork.
Remove prefix caching
Every request re-prefills the system prompt and full conversation history from scratch. At roughly 30K QPS with a 1,000-token system prompt and an average 4-turn history (approximately 2,000 tokens of prior turns), each request adds ~3,000 tokens of prefill versus the cached baseline. The vLLM paper (arXiv:2309.06180) reports a 2–4x throughput improvement from prefix caching on repeated prefixes. That 2–4x translates directly to fleet size: the flagship fleet would need 2–4x as many GPUs to serve the same QPS with the same latency. Alternatively, p95 TTFT regresses by the added prefill time — on the order of 500 ms–1 s per request for a session at turn 5. Detection: prefix-cache hit rate metric drops to 0%; cost-per-session metric spikes 2–4x. Mitigation: restore caching; audit what changed the prefix (system-prompt update, serialization format change) and ensure prefix bytes are stable across turns.
Disable conversation-store TTLs
Conversation histories accumulate indefinitely. Storage cost grows linearly with the user base. At 100M monthly active users with an average 50 KB of conversation history, that is 5 TB of object-storage accumulation with no upper bound — manageable financially in object storage, but now every retrieval can return arbitrarily long contexts, pressuring the model server with 10K+ token prefills for old-but-active accounts. The compliance cost is the larger issue: most privacy regulations (GDPR, CCPA) require a documented retention and deletion policy. A system with no TTL has no retention policy, which is a compliance blocker for any regulated-industry Enterprise customer. Detection: storage cost trend alert; compliance audit finding (not a real-time metric — a lagging indicator). Mitigation: restore per-tier TTLs; add a user-initiated deletion flow that satisfies right-to-erasure requirements.
Remove the model router
Route 100% of traffic to the flagship model. A 70B flagship costs roughly 5–10x more per token than a 7–13B mini model (community estimate based on relative parameter counts and memory bandwidth). If the mini model handles 60–70% of ChatGPT's free-tier volume, removing the router multiplies the cost of those requests by 5–10x. The free tier becomes immediately unprofitable at scale — not marginally, but by an order of magnitude. Within days, the cost model collapses. The free tier is either removed or rate-limited to near-zero, destroying the top-of-funnel for Plus conversion. Detection: per-tier cost-per-session metric crosses the ceiling within hours. Mitigation: restore routing; verify that the quality delta between mini and flagship is actually perceptible on the tasks the free tier does most (short Q&A, simple writing). If the quality gap has closed since the routing threshold was last calibrated, the threshold should be more aggressive.
Quick check
Disabling conversation-store TTLs is primarily a compliance risk, not just a storage cost problem. Why?
What does a bad day cost?
Reliability is a dollar number, not a percentage. The following estimates are back-of-envelope — the goal is order-of-magnitude accuracy. All model cost figures are community estimates unless cited.
| Incident | Detected in 2 min | Detected in 4 h | Detection lever (10x sensitivity) |
|---|---|---|---|
| Router regression — free tier routes to flagship | ~$6K (2 min × 20K QPS × 512 tok × $5/M delta; community estimate on cost delta) | ~$720K (240 min × same rate — the same arithmetic at full window) | Per-tier GPU-spend alarm within 2 min vs. next-day billing dashboard — 120x cost difference |
| Prefix-cache invalidation — system prompt format changed | ~2% TTFT regression on active sessions; small fleet overage (session-level, recoverable in minutes once cache warms) | 4 h of 2–4x prefill cost across all active sessions; Plus-tier SLO breach for the full window; potential SLA credit trigger on Enterprise | Cache-hit-rate alarm (fires in <1 min) vs. no alarm (first signal is p95 TTFT report next morning) — qualitatively ~100x cost window difference |
| Safety classifier false-positive storm — benign requests refused | Small direct cost (refused requests skip GPU); social media posts begin within minutes of the storm | A 4-hour over-refusal event generates more support volume and brand damage than a 15-min availability outage; recovery requires a public post-mortem | Benign-but-sensitive golden set in CI (fires before deploy) vs. user-reported complaints — deploy-time detection vs. post-incident |
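A minimal sketch of the detection-window arithmetic in the first row, using the same community-estimate inputs (20K free-tier QPS, 512 output tokens per request, ~$5/M-token cost delta between flagship and mini).

```python
def misroute_cost(window_minutes: float, qps: float = 20_000,
                  tok_per_req: float = 512, delta_per_mtok: float = 5.0) -> float:
    return window_minutes * 60 * qps * tok_per_req * delta_per_mtok / 1e6

print(f"  2 min: ${misroute_cost(2):>10,.0f}")    # ~$6K   (alarm fires, config reverted)
print(f" 30 min: ${misroute_cost(30):>10,.0f}")   # ~$92K  (the router-regression failure mode)
print(f"240 min: ${misroute_cost(240):>10,.0f}")  # ~$737K (next-morning billing review, ~120x)
```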
ChatGPT On-call Runbook
Model router misroutes free-tier traffic to flagship
MTTR p50 / p99: 7 min / 20 min. Blast radius: all free-tier users receive flagship-quality responses; GPU spend spikes 3–5× on the over-served tier within minutes (community estimate on tier cost delta).
- 1. Detect: Per-tier GPU-spend alarm fires within 2 min at 2× baseline; token-throughput ratio alert (free:paid) deviates >20% from the 7-day moving average.
- 2. Escalate: Page serving on-call. Check the router config diff against the last deploy. Confirm via the tier-split metric dashboard before rollback.
- 3. Rollback: Revert the router config to the previous version (blue/green flip, ~5 min). Validate that the tier split restores to the target ratio. MTTR is dominated by detection, not rollback.
- 4. Post: Add a pre-deploy integration test that routes a synthetic free-tier token through the router and asserts the response comes from the small-model tier. Gate all router deploys on this test.
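A minimal sketch of the pre-deploy gate from step 4, written as a test against a staging endpoint; `staging_client`, `make_synthetic_token`, and the `served_by` response field are hypothetical names, not a real API.

```python
def test_free_tier_routes_to_mini(staging_client):
    # Push a synthetic free-tier request through the real router in staging and assert the
    # response metadata names the small-model tier. A failure here blocks the router deploy.
    resp = staging_client.submit_chat(
        token=make_synthetic_token(tier="free"),  # hypothetical test-token factory
        prompt="What is 2 + 2?",
    )
    assert resp.served_by == "mini", (
        f"free-tier request served by {resp.served_by!r}: blocking deploy (cost incident risk)"
    )
```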
Safety classifier false-positive storm during news event
MTTR p50 / p99: 15 min / 60 min. Blast radius: benign requests refused at an elevated rate (estimated 5–20% of traffic during the storm). Users hit refusal messages; social media posts begin within minutes.
- 1. Detect: Benign golden-set refusal-rate alarm fires if the refusal rate on synthetic non-harmful probes exceeds 2× baseline for 60 s. Secondary: user-visible error-rate spike.
- 2. Escalate: Page safety on-call + policy lead. Do not roll back the classifier silently — log the storm start time for post-incident audit. Notify comms if the rate exceeds 10% for >5 min.
- 3. Rollback: Shadow-mode the new classifier (pass traffic to the old weights, compare outputs). If the old weights are clean, cut over. If there is no fallback, disable the classifier layer and temporarily accept a higher under-refusal (false-negative) rate — over-refusal is a worse user experience than mild under-refusal at scale.
- 4. Post: Expand the benign golden set with query variants from the news event. Add a canary deploy stage that holds 1% of traffic for 10 min with refusal-rate monitoring before full rollout.
KV-cache OOM during viral prompt (unusually long context)
MTTR p50 / p99: 5 min / 25 min. Blast radius: serving pods OOM-kill active sessions. Affected users see mid-stream disconnects. Cache-eviction cascades can degrade adjacent sessions if not isolated per pod.
- 1. Detect: Pod memory-utilization alarm triggers at 85%; KV-cache eviction-rate counter spikes above the rolling baseline. Correlated with a p99 TTFT increase.
- 2. Escalate: Page infra on-call. Identify the prompt pattern driving the spike (log top-N input-length percentiles). Activate admission control to cap input length for new sessions.
- 3. Rollback: Enable a request-level input-length cap (e.g., 32K tokens max for non-Enterprise). Restart OOMed pods. For Enterprise sessions, route to a dedicated long-context pool if capacity exists.
- 4. Post: Add the input-length distribution to the capacity model. Set a KV-cache high-water-mark alarm at 70% so there is headroom before OOM. Investigate whether MQA/GQA can reduce the per-session KV footprint.
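A minimal sketch of the per-session KV-cache footprint behind the 70% high-water mark in step 4. The layer and head counts are illustrative assumptions for a large GQA model, not any specific model's published specs.

```python
def kv_cache_bytes(tokens: int, n_layers: int = 80, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    # Two tensors (K and V) per layer, each n_kv_heads * head_dim values per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * tokens

for ctx in (1_536, 8_192, 16_384, 32_768):
    print(f"{ctx:>6} tokens ~ {kv_cache_bytes(ctx) / 2**20:,.0f} MiB per session")
# Footprint grows linearly with context length, which is why input-length caps and
# grouped-query attention (GQA/MQA) are the levers named in the post-incident step.
```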
For a direct comparison of how this runbook pattern differs under multi-turn agent workloads, see the Gemini case study and the Character.AI session-persistence design.
Company Lens — same system, different drills
OpenAI's drill
Expect the interviewer to drill on the routing-economics loop and model lifecycle management. Specific questions you should be able to answer: “How do you know when to update the routing threshold — what is the triggering signal and the deployment process?” (Answer: shadow-judge score delta alarm → canary routing deploy → auto-rollback on cost-SLO breach.) “When you ship a new model and retire the old one, how do you migrate without a conversation-store invalidation event that wipes every user's prefix cache?” (Answer: maintain both model versions behind the same router during a migration window; flush the prefix cache gradually by tier rather than all at once.) “How does the RLHF feedback from thumbs-down signals reach the reward model without contaminating the training set with adversarially-prompted negative labels?” OpenAI interviewers are acutely sensitive to the gap between the training-time world (clean batches, infinite budget) and the serving-time world (adversarial users, cost ceiling). Safety is a design constraint, not a sidebar — expect questions on how the safety classifiers integrate with the serving pipeline rather than bolt on.
Google's drill
Expect heavy emphasis on fair-share scheduling and multi-tenant queueing. Specific questions: “How do you prevent a single hot tenant from starving paying customers?” (Answer: per-tenant queue-depth p95 monitored in real time; fair-share weighted by GPU-second budgets so a burst above quota gets deprioritised rather than blocked, preserving SLO for other tenants.) “What is the Borg/Kubernetes integration for your GPU fleet — how does the scheduler know a prefill pod needs priority over a decode pod?” (Answer: prefill and decode pods carry different resource classes; the scheduler enforces disaggregated-prefill quotas per tier with preemption.) Google L6+ bar is heavy on scheduling theory — expect Borg-style gang-scheduling and bin-packing tradeoffs, not just serving metrics.
Anthropic's drill
Expect deep focus on the safety-helpfulness tradeoff as a measurable loss function, not a policy dial. Specific questions: “How does your system handle a sequence of requests that are each individually benign but collectively constitute policy evasion?” (Answer: the conversation store's multi-turn context is fed to the safety classifier, not just the current turn — context-aware classification requires the full prefix, not a single message.) “How do you measure whether Constitutional AI revisions are improving safety without degrading helpfulness?” (Answer: the benign-but-sensitive golden set measures false-positive refusal rate; the revision diff log feeds the eval harness to detect hallucinations introduced during revision; the shadow-run at 5% of production traffic gives a real-time signal.) “If a user complains that the model refused a legitimate medical question, how does that signal reach the calibration team?” (Answer: the benign-but-sensitive set should include a medical bucket; user-reported false refusals trigger a human-review annotation that feeds the next calibration round.) Anthropic's bar treats CAI (arXiv:2212.08073) as a first-class architectural component, not an inference-time add-on.
Key Takeaways
What to remember for interviews
- 1. Separate p95 TTFT targets per tier — collapsing Free and Plus into one p95 hides regressions on the revenue-critical tier behind the larger free-tier volume.
- 2. Design the eval harness before the architecture: multi-turn coherence and per-tier refusal rate are metrics that determine which components matter most, per Husain (hamel.dev) and Shankar et al. (arXiv:2404.12272).
- 3. Prefix caching on the system prompt is a first-class cost lever — vLLM reports 2–4x throughput improvement from prefix reuse (arXiv:2309.06180); a 30% hit rate translates directly to 30% fewer prefill GPU-seconds.
- 4. The conversation store has higher blast radius than a model-server node: its failure degrades every concurrent session at once and triggers a KV-cache miss storm on the GPU fleet.
- 5. The shadow-judge pattern (score 5% of mini-routed responses via the flagship asynchronously) closes the routing quality feedback loop without full human review on every model update.
- 6. Detection window sensitivity is the reliability investment: a per-tier cost alarm that fires in 2 minutes vs. a next-day billing review is a 120x difference in incident cost on a router regression.
Recap quiz
Defend the design end-to-end
Why does ChatGPT's ~2 billion messages/day figure imply a minimum cluster size in the tens of thousands of H100s, rather than hundreds?
Without chunked prefill, a single 8K-token prompt can stall all other decode streams on the same GPU for ~400 ms. Why is the latency impact on other users asymmetric — i.e., why does one long prefill hurt many short completions more than vice versa?
A ChatGPT Plus session accumulates a 16K-token context. Relative to an 8K-token context, how does KV-cache memory footprint scale, and what is the practical serving implication?
ChatGPT free tier enforces a rate cap (~40 msgs / 3 h). Which system-design mechanism is most likely enforcing this per-user limit at scale across a distributed fleet?
Constitutional AI uses a self-critique-and-revision loop to improve harmlessness. What is the primary latency cost of applying this technique at ChatGPT inference time versus training time?
The 1,000-token system prompt is byte-identical across all requests on a session. A serialization format change makes the prefix byte-unstable. What is the immediate cost consequence on the GPU fleet?
A router bug sends free-tier traffic to the flagship for 30 minutes (~20K free-tier QPS). Using a $5/M-token cost delta and 512 output tokens/request, what is the approximate cost overrun?
Detecting the same router regression in 2 minutes costs ~$6K; detecting it after 4 hours costs ~$720K. What is the detection-window sensitivity ratio, and what does it imply?
Anthropic's interviewer asks how the safety pipeline handles individually-benign turns that collectively constitute policy evasion. Which architectural choice makes detection possible?
Interview Questions
- ★★★ ChatGPT free tier routes to a cheaper model; Plus routes to the flagship. A naive implementation hard-codes this in the gateway. What is the single worst failure mode of that design, and how do you detect and fix it?
- ★★★ Design the conversation state store for ChatGPT. What are the three failure modes it must survive, and why is its blast radius higher than a model-server node failure?
- ★★☆ The CFO asks why prefix caching on the system prompt matters to the bottom line. Give a number-backed answer.
- ★★★ p95 TTFT regresses from 400 ms to 1,100 ms after a traffic spike. No model changes shipped. Where do you look, in what order?
- ★★★ Anthropic's interviewer asks: 'How does Constitutional AI change the architecture of the safety pipeline in a ChatGPT-like system?' Walk through the integration decisions.
Further Reading
- vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention (Kwon et al., 2023) — The foundational serving-engine paper. PagedAttention and continuous batching are the two primitives the conversation-store and routing deep dives rely on most heavily. The 2–4x throughput gain cited in this module comes from the paper abstract.
- Anthropic — Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022) — The Constitutional AI paper. Directly cited in the safety-pipeline integration section and the Anthropic company-lens deep dive.
- Hamel Husain — Your AI Product Needs Evals — The practitioner post that made eval-first design mainstream. The three-level eval framework (unit tests → human/model eval → A/B) cited in the eval harness section comes from Husain's taxonomy.
- Shankar et al. — Who Validates the Validators? (2024, arXiv:2404.12272) — EvalGen paper on criteria drift — the finding that evaluation standards emerge through the grading process itself. Directly motivates the monthly human-calibration cadence in the eval harness.
- Google SRE Book — Service Level Objectives (Chapter 4) — The canonical SLO engineering reference. The error-budget math in the incident-cost section and the per-tier SLO separation are grounded in the framework described here.
- vLLM Blog — Production Serving Patterns — The vLLM team's engineering blog covering chunked prefill, disaggregated prefill/decode, and KV-cache offloading. The 2.7x throughput and 5x TPOT claims in the chunked-prefill discussion come from v0.6.0 release notes referenced here.