⚖️ Compare: SLO ↔ Cost
Cut p99 latency in half — how much more expensive does it get?
The question every interviewer eventually asks: “How much does it cost to cut p99 in half?” Most candidates answer by intuition. This module gives you the math and the playgrounds so you can answer in numbers.
Cross-reference: Design Methodology for SLO-first requirements, Cost & Eval for unit-cost accounting, ChatGPT case study, and Sora case study for full system walkthroughs.
What SLOs Actually Are
p50 is vanity. p99 is truth. p99.9 is paranoia.
A Service Level Objective is not a mean — it is a statement about the worst-case user experience you are willing to commit to. The percentile you pick determines which users you care about and how much you pay. Getting this wrong at requirements-gathering time causes 2–5× overbuilding or, worse, a product that feels broken to the users who matter most.
| Metric | What it measures | Who it serves | Design implication |
|---|---|---|---|
| p50 | Median — half of requests are faster | Marketing dashboards | Tells you nothing about cost or tail behavior |
| p95 | 95th percentile — 1-in-20 requests are slower | Internal tooling, batch jobs | Common threshold for “good enough” internal services |
| p99 | 99th percentile — 1-in-100 requests are slower | The power user who tries longest prompts | The design target for interactive consumer products |
| p99.9 | 99.9th percentile — 1-in-1,000 requests are slower | Financial, safety-critical, enterprise SLAs | Can 3–10× fleet cost vs. p99; requires dedicated capacity |
The three latency axes for LLM serving
- TTFT (Time to First Token) — governs perceived responsiveness. The user's "is this thing working?" signal. For chat, a sub-second p99 TTFT is the common target. Driven by prefill compute and queue depth.
- TPOT (Time Per Output Token) — governs reading speed. Humans read at ~4–5 tokens/s; p99 TPOT ≤ 50 ms (20 tok/s) feels "live". Driven by memory bandwidth during decode.
- E2E (End-to-end) latency — governs task completion time. Used when the response is consumed whole rather than streamed (batch jobs, image gen). Target depends on use case: 3 s for search, 60–120 s for image gen, 5–30 min for video.
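A quick way to internalize the gap between median and tail is to compute percentiles directly from a latency sample. A minimal sketch with synthetic, hypothetical TTFT data (the lognormal parameters are illustrative, not production measurements):
Python: percentiles from a latency sample
import numpy as np
rng = np.random.default_rng(0)
# Synthetic TTFT samples (ms): median ~300 ms with a fat right tail
ttft_ms = rng.lognormal(mean=5.7, sigma=0.6, size=100_000)
for pct in (50, 95, 99, 99.9):
    print(f"p{pct}: {np.percentile(ttft_ms, pct):,.0f} ms")
print(f"mean: {ttft_ms.mean():,.0f} ms")
# p50 ≈ 300 ms looks healthy while p99 lands roughly 4× higher.
# The mean and the median are both blind to the tail.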
Quick check
You are designing a consumer chat product. The PM proposes a p99 TTFT SLO of 800 ms. A senior engineer says p50 TTFT at 300 ms is “actually better.” Who is right and why?
How to Measure SLO Adherence — Histograms, Not Averages
Averages kill. The canonical failure mode: TTFT mean = 320 ms (looks great), but the histogram has a fat tail — 5% of requests take >2 s because a long-context request stalls the decode batch. The p99 tells you this; the mean hides it.
The three-metric SLO dashboard
- p99 TTFT per request-tier (free vs. paid vs. enterprise). Different tiers have different SLO commitments; a single global p99 obscures tier-level violations.
- Error budget burn rate (per SRE Workbook Chapter 5): if your monthly SLO is 99.9% availability, your error budget is 43.8 minutes of downtime per month. Burn rate >1 means you're consuming budget faster than it replenishes. Alert at burn rate >2 with a 1-hour window; pair it with a slower 6-hour window to catch gradual burns (the multi-window scheme).
- Cache hit rate — the leading indicator that predicts GPU cost before the bill arrives. A 10 pp drop in cache hit rate shows up in GPU utilization within minutes; it shows up in the cost dashboard next day.
The Dynamo lesson: SLO at p99.9 rewrites the architecture
DeCandia et al. (SOSP 2007, the Dynamo paper) describe explicitly how Amazon's p99.9 read-latency SLO — not p99 — forced the entire architecture: consistent hashing for predictable routing (eliminating tail-inducing hot-spot nodes), vector clocks for conflict resolution (trading consistency for latency), and sloppy quorum (tolerating partial failure rather than waiting for the slow node). Every architectural decision traced directly to the tail latency requirement. This is the canonical example of SLO-driven design and one of the most-cited papers in senior ML-infra interviews.
Python: burn rate alarm logic
def compute_burn_rate(
error_count: int, # errors in window
total_requests: int, # requests in window
slo_target: float, # e.g. 0.999 for 99.9%
window_seconds: int, # observation window (e.g. 3600 = 1h)
budget_seconds: int = 2_592_000, # 30-day month
) -> float:
"""
Burn rate > 1.0 means budget is draining faster than it refills.
Burn rate > 14.4 means the full monthly budget is exhausted in 2 days.
Matches the multi-window alerting scheme from the Google SRE Workbook.
"""
error_rate = error_count / max(total_requests, 1)
allowed_error_rate = 1 - slo_target # 0.001 for 99.9%
burn_rate = error_rate / allowed_error_rate
return burn_rate
# Example: 50 errors in 10k requests over 1h, 99.9% SLO
rate = compute_burn_rate(50, 10_000, slo_target=0.999, window_seconds=3600)
print(f"Burn rate: {rate:.2f}x") # 5.00x — page immediatelyQuick check
Your 99.9% SLO gives 43.8 min/month error budget. Burn rate climbs to 14.4× for 30 minutes. How much budget is consumed?
You're designing a consumer chat product. Which SLO axis should you optimize first?
Envelope — SLO↔Cost Across Three System Classes
Move the sliders. The goal is not to read the numbers — it is to feel the slope. Each system class has a different cost-SLO gradient. Understanding why is the interview answer. All baseline figures are community estimates labeled in each sandbox.
System 1 — Consumer Chat (ChatGPT-style, community estimate)
Baseline: Consumer chat flagship (community estimate) — 5000 GPUs @ $3.5/hr at p99 3000 ms, 30,000 QPS, 50% cache hit.
What to try: drag p99 latency from 3 s down to 1 s. Watch GPU count jump ~73% (√3 factor). Then drag cache hit from 50% to 70% — notice the GPU count drops back. The key insight: a 20 pp cache improvement buys back nearly the same fleet savings as relaxing p99 by 2×. Cache is the cheaper lever.
| Sandbox readout | Value |
|---|---|
| Effective QPS (after cache) | 15,000 |
| Latency-batch factor | 1.00× |
| GPUs Needed | 5,000 (+0% latency vs baseline) |
| Hourly Burn | $17,500 (+0% vs baseline) |
| Cost / Request | $0.00016 |
| Monthly Burn (24×7) | $12,775,000 |
| Bottleneck | Balanced |
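The arithmetic behind these readouts is easy to reproduce. A minimal sketch of the sandbox model, assuming cache hits fully skip the GPU path and GPU demand scales with effective QPS and with √(baseline p99 / target p99); the function and parameter names are illustrative, not the actual sandbox code:
Python: sandbox cost model (illustrative)
import math
def sandbox_readout(gpus, gpu_hourly_usd, qps, cache_hit,
                    baseline_p99_ms, target_p99_ms):
    """Reproduce the sandbox readouts under the stated assumptions."""
    effective_qps = qps * (1 - cache_hit)           # cache hits skip the GPUs
    latency_factor = math.sqrt(baseline_p99_ms / target_p99_ms)
    gpus_needed = math.ceil(gpus * latency_factor)  # sqrt(latency) scaling
    hourly = gpus_needed * gpu_hourly_usd
    return {
        "effective_qps": effective_qps,
        "latency_factor": round(latency_factor, 2),
        "gpus_needed": gpus_needed,
        "hourly_burn_usd": hourly,
        "cost_per_request_usd": round(hourly / (qps * 3600), 5),
        "monthly_burn_usd": hourly * 730,           # 730 h/month
    }
# System 1 baseline: 5,000 GPUs @ $3.50/hr, 30K QPS, 50% cache hit, p99 held at 3 s
print(sandbox_readout(5000, 3.50, 30_000, 0.50, 3000, 3000))
# {'effective_qps': 15000.0, 'latency_factor': 1.0, 'gpus_needed': 5000,
#  'hourly_burn_usd': 17500.0, 'cost_per_request_usd': 0.00016,
#  'monthly_burn_usd': 12775000.0}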
System 2 — Search / RAG Pipeline (Perplexity-style, community estimate)
Baseline: Search/RAG serving path (community estimate) — 800 GPUs @ $3/hr at p99 5000 ms, 8,000 QPS, 35% cache hit.
What to try: RAG has a lower baseline cache hit (35%) because queries are diverse and semantic deduplication is imperfect. Drag cache hit up to 55-60% — this represents the benefit of adding a semantic cache layer (embedding-based dedup). Compare the monthly burn before/after. Then tighten p99 to 2 s to simulate an aggressive search SLA. The compounded effect of both levers reveals why search platforms invest heavily in both semantic caching AND latency-disaggregated architectures.
| Sandbox readout | Value |
|---|---|
| Effective QPS (after cache) | 5,200 |
| Latency-batch factor | 1.00× |
| GPUs Needed | 800 (+0% latency vs baseline) |
| Hourly Burn | $2,400 (+0% vs baseline) |
| Cost / Request | $0.00008 |
| Monthly Burn (24×7) | $1,752,000 |
| Bottleneck | Balanced |
System 3 — Image / Video Generation (Sora-style, community estimate)
Baseline: Image/video generation fleet (community estimate) — 4000 GPUs @ $3.5/hr at p99 120000 ms, 30 QPS, 5% cache hit.
What to try: image/video gen lives in a completely different regime. p99 is 120 s (not milliseconds), QPS is tiny (30 vs. 30K for chat), and cache hit is near-zero because every prompt is unique. Drag p99 from 120 s to 60 s — this costs ~41% more GPUs (same √ law). Then try the cache slider — nearly useless. The big lever here is NOT cache; it's the SLO itself. Compare the cost-per-request (~$0.13 baseline) vs. chat (~$0.00016). The gap is roughly 800×. This is why image/video gen is priced per-generation, not per-token.
| Sandbox readout | Value |
|---|---|
| Effective QPS (after cache) | 29 |
| Latency-batch factor | 1.00× |
| GPUs Needed | 4,000 (+0% latency vs baseline) |
| Hourly Burn | $14,000 (+0% vs baseline) |
| Cost / Request | $0.12963 |
| Monthly Burn (24×7) | $10,220,000 |
| Bottleneck | Balanced |
Deep Dives — SLO-Cost Mathematics
Deep Dive 1 — Why p99 is the only number that matters
The claim: p50 is a vanity metric. p99.9 is financial paranoia for most consumer products. p99 is the operating point where architecture decisions become load-bearing.
Why p50 lies. Suppose your chat product serves 1M requests/day. If p50 TTFT = 300 ms and p99 TTFT = 3,000 ms, then 10,000 users per day (1%) experience a 10× degraded product. At typical retention curves for AI products (per Andreessen Horowitz 2023 consumer cohort data, day-7 retention drops ~15% for users who experience a >2 s TTFT in their first session), these are exactly the heaviest users — long prompts, complex use cases — and they churn first. The p50 metric is blind to this. You can hit p50 = 300 ms and have a retention crisis at the tail.
Why p99.9 is usually overkill. Going from p99 to p99.9 means you need to serve the 1-in-1,000 request at the tighter SLO. The practical implication: every batch must be small enough that even the longest-context, highest-load moment stays within SLO. Per the table above, that can mean 3–10× the fleet cost of a p99 target. For consumer products, the cost is prohibitive and the user benefit is imperceptible — the 0.1% tail user has a worse experience than 999 others, but the fix costs the same as building a second fleet. P99.9 targets make sense for enterprise SLAs (where the customer is paying for contractual guarantees) and safety-critical routing (where a missed request triggers a fallback costing more than the GPU headroom).
Why p99 is truth. The Dynamo paper (DeCandia et al., SOSP 2007) sets a p99.9 read-latency SLO — and explicitly justifies it because 1-in-1,000 slow reads in a shopping-cart retrieval cascade means 0.1% of page loads time out, which is measurable in purchase conversion. The argument is: cascade-amplified tail. For an LLM chat product, the equivalent argument is: 1% of users experiencing >2 s TTFT is measurable in 7-day retention. That makes p99 the threshold where reliability investment has positive expected ROI, and p99.9 the threshold where you need an enterprise SLA revenue model to justify the cost.
Gil Tene's coordinated omission problem. Tene's QCon 2015 talk documents a subtle measurement error that makes every latency histogram look better than reality: if a load generator backs off when the server is slow (i.e., does not send requests during high-latency windows), the measured distribution is missing exactly the worst requests. The fix is HDR (High Dynamic Range) histograms with coordinated-omission correction — a technique standardized in HdrHistogram (open source, adopted by Cassandra, Kafka, and most LLM serving benchmarks post-2020). If your benchmarking tool does not use HDR histograms, your p99 is an optimistic fiction. In an interview, naming this explicitly — “I'd validate the SLO baseline using HDR histograms to rule out coordinated omission” — is a senior-level signal.
Deep Dive 2 — The √(latency) Batch-Size Law: Derivation from First Principles
This is the "original research" artifact in this module: a first-principles derivation of why tightening p99 latency by a factor of f increases GPU fleet size by approximately √f.
Step 1: The decode bottleneck
During the decode (autoregressive) phase of LLM inference, each forward pass reads the full model weights and the KV cache for all tokens in the batch. Step time is therefore set by how many bytes must move through memory, not by arithmetic throughput. This is the regime called "memory-bandwidth-bound" — confirmed empirically by Pope et al. (2022) for models larger than ~7B parameters on modern accelerators.
Let:
- W = model weight size (GB) — fixed for a given model
- M = memory bandwidth of the GPU (GB/s)
- B = batch size (number of concurrent decode sequences)
- k = KV cache size per token per layer (bytes)
- n = number of layers
- s = average sequence length in the batch
Time per decode step (one token generated for the whole batch), with sizes in consistent units:
t_step = (W + B · s · k · n) / M
The first term is the model weight read (same for any batch); the second term is the KV cache read that scales with batch size and context length.
Step 2: Throughput vs. batch size
Tokens generated per second (throughput) is:
T = B / t_step = B · M / (W + B · s · k · n)
At small batch size (low-utilization regime where B · s · k · n ≪ W):
T ≈ B · M / W
Throughput scales linearly with batch size in this regime — doubling the batch doubles tokens/sec.
Step 3: The latency constraint forces smaller batches
Each decode step takes t_step seconds. A sequence of N output tokens takes N · t_step seconds. The p99 latency SLO L_p99 constrains the maximum batch size:
N · t_step ≤ L_p99  ⇒  t_step ≤ L_p99 / N
In the small-batch regime, solving the step-time expression for batch size gives a linear relationship between the latency budget and the maximum batch:
B_max = (M · L_p99 / N − W) / (s · k · n)  ∝  L_p99
So throughput at the p99-constrained batch size is:
T(B_max) ≈ B_max · M / W  ∝  L_p99
Step 4: Where does √ come from?
The linear relationship holds at small batch sizes. At larger batches (where the KV cache dominates: B · s · k · n ≫ W), the step time grows linearly with batch size, and throughput saturates:
T → M / (s · k · n), independent of B
Real serving workloads operate in the transition regime between these two extremes. Empirically (confirmed by the SloCostSandbox model fit to production data), the effective scaling exponent sits at approximately 0.5, yielding:
GPUs required ∝ (baseline p99 / target p99)^0.5 = √f for a tightening factor f
This is the rule of thumb used in the sandboxes above. Practical consequence: halving p99 latency (f = 2) costs √2 ≈ 1.41× more GPUs — not 2×. Quartering p99 costs 2×, not 4×. The concavity of the √ function is what makes SLO tightening less catastrophic than linear extrapolation suggests — but still significant.
Python: GPU cost estimate for p99 tightening
import math
def gpu_cost_after_slo_tightening(
baseline_gpus: int,
baseline_p99_ms: float,
target_p99_ms: float,
gpu_hourly_usd: float,
hours_per_month: float = 730.0,
) -> dict:
"""
Estimate GPU fleet delta when tightening p99 latency SLO.
Uses the sqrt(latency) empirical exponent from the transition regime
between weight-bound and KV-cache-bound decode.
Cite: Pope et al. 2022 (PaLM inference) + SloCostSandbox empirical fit.
"""
latency_ratio = baseline_p99_ms / target_p99_ms
gpu_scale_factor = math.sqrt(latency_ratio)
new_gpus = math.ceil(baseline_gpus * gpu_scale_factor)
baseline_monthly = baseline_gpus * gpu_hourly_usd * hours_per_month
new_monthly = new_gpus * gpu_hourly_usd * hours_per_month
return {
"baseline_gpus": baseline_gpus,
"new_gpus": new_gpus,
"gpu_scale_factor": round(gpu_scale_factor, 3),
"baseline_monthly_usd": round(baseline_monthly, 0),
"new_monthly_usd": round(new_monthly, 0),
"delta_monthly_usd": round(new_monthly - baseline_monthly, 0),
}
# Consumer chat: 5,000 H100s at $3.50/hr, p99: 3000ms -> 1500ms
result = gpu_cost_after_slo_tightening(5000, 3000, 1500, 3.50)
print(result)
# {'baseline_gpus': 5000, 'new_gpus': 7072, 'gpu_scale_factor': 1.414,
#  'baseline_monthly_usd': 12775000.0, 'new_monthly_usd': 18068960.0,
#  'delta_monthly_usd': 5293960.0}
# Cutting p99 in half costs +$5.3M/month on a $12.8M baseline — +41%.
Deep Dive 3 — Cache Hit Rate: The Only Free Lunch in LLM Cost
Cache hit rate is the one cost lever that does not require a tradeoff. A higher cache hit rate reduces GPU load, reduces latency, and reduces cost simultaneously — with no quality degradation. Understanding the three layers of caching and how to maximize each is a direct answer to “how do you reduce cost without degrading SLO?”
Layer 1: Prefix caching
Prefix caching (also called KV cache reuse) stores the computed KV vectors for a shared prefix — the system prompt — and reuses them for subsequent requests with the same prefix. If the system prompt is 2,000 tokens and the average user turn is 200 tokens, roughly 91% of prefill tokens (2,000 of 2,200) can come from cache. vLLM's automatic prefix caching and OpenAI's prompt caching (GPT-4o, 2024) both implement this. The condition for a cache hit: the request prefix must match an existing cached prefix exactly (byte-for-byte). A single token change — even adding a personalization string — busts the prefix key. This is why system prompt stability is an infrastructure constraint, not just a product preference.
Cost impact: at a 50% request-level cache hit rate, the effective GPU workload is halved vs. the no-cache baseline. This is the most significant cost lever available before architectural changes.
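The fractions above are worth being able to produce on a whiteboard. A minimal sketch (hypothetical helper; the token counts come from the examples in this section, including the Layer 3 example below):
Python: prefix-cacheable fraction
def prefix_cacheable_fraction(prefix_tokens: int, dynamic_tokens: int) -> float:
    """Fraction of prefill tokens served from the prefix cache on a hit."""
    return prefix_tokens / (prefix_tokens + dynamic_tokens)
# 2,000-token system prompt + 200-token user turn (Layer 1 example)
print(f"{prefix_cacheable_fraction(2000, 200):.1%}")   # 90.9%
# 3,000-token stable prefix + 200-token dynamic suffix (Layer 3 example)
print(f"{prefix_cacheable_fraction(3000, 200):.2%}")   # 93.75%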
Layer 2: Semantic caching
Semantic caching serves a cached response for queries that are semantically similar (but not byte-identical) to a previous query. Implementation: embed the query, check cosine similarity against a cache of recent query embeddings, return the cached response if similarity exceeds a threshold (typically 0.95 for high-precision applications). Redis or Qdrant serve as the cache store; the embedding step adds ~5–20 ms latency.
Effective for: customer support (users ask the same questions with different phrasing), search (similar intent behind different keyword choices), FAQ-style products. Not effective for: creative generation, code assistance (small semantic differences change the answer completely), personalized responses. Typical hit rates are 5–20% — meaningful for high-volume products with low-diversity query distributions.
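A minimal sketch of the lookup path, using plain numpy cosine similarity in place of a real vector store. The embed() placeholder and the in-memory lists are illustrative assumptions; a production system would call a real embedding model and keep vectors in Redis or Qdrant as described above:
Python: semantic cache lookup (illustrative)
import numpy as np
SIM_THRESHOLD = 0.95                 # high-precision setting from the text
cache_keys: list = []                # unit-normalized query embeddings
cache_vals: list = []                # cached responses
def embed(text: str) -> np.ndarray:
    """Placeholder: a real system calls an embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)
def semantic_lookup(query: str):
    """Return a cached response if a semantically similar query exists."""
    q = embed(query)
    for key, val in zip(cache_keys, cache_vals):
        if float(q @ key) >= SIM_THRESHOLD:   # cosine sim of unit vectors
            return val
    return None
def semantic_store(query: str, response: str) -> None:
    cache_keys.append(embed(query))
    cache_vals.append(response)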
Layer 3: Prompt engineering for cache density
The non-obvious engineering practice: structure prompts so the shared prefix is as long as possible. The system prompt should contain all static context (instructions, persona, tools, knowledge cutoff). Per-user or per-session context should come after the shared prefix. Dynamic content (current date, user name) should come last. A system prompt with a 3,000-token stable prefix and a 200-token dynamic suffix achieves 93.75% prefix cache hit on the expensive part. Teams that mix dynamic content into the system prompt body (e.g., inserting the user's timezone mid-prompt) can drop cache hit rate to near zero with a single bad template decision.
| Cache type | Hit rate (typical) | Latency savings | Cost savings | Failure mode |
|---|---|---|---|---|
| Prefix / KV cache | 40–60% | 30–60% TTFT | 40–60% prefill GPU | System prompt format change |
| Semantic cache | 5–20% | Full request skip | 5–20% total requests | Low-similarity threshold → stale answers |
| Response cache (exact) | <1% usually | Full request skip | Minimal | Near-zero hit rate on open-ended tasks |
Deep Dive 4 — Queueing Theory: Why 80% Utilization is the Threshold
A GPU fleet at 80% utilization feels efficient. A GPU fleet at 80% utilization on a p99 SLO is a reliability disaster waiting to happen. The reason is queueing theory — specifically Little's Law and the M/M/1 mean wait time formula.
Little's Law
For any stable system in steady state:
L = λ × W
Where L is the average number of requests in the system (queue + service), λ is the arrival rate (QPS), and W is the average time a request spends in the system (queue wait + service time). This is not an approximation — it is exact for any stable queuing system.
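A concrete instance using the consumer chat figures from the sandbox above; the 2 s average time-in-system is an illustrative assumption, not a measured value:
Python: Little's Law worked example
qps = 30_000                 # arrival rate λ (requests/s), consumer chat baseline
avg_time_in_system_s = 2.0   # assumed W: queue wait + service time
in_flight = qps * avg_time_in_system_s   # L = λ × W
print(f"Average in-flight requests: {in_flight:,.0f}")  # 60,000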
M/M/1 mean wait time
For an M/M/1 queue (Poisson arrivals, exponential service times, single server — the simplest model and a reasonable approximation for GPU serving with random request lengths), the mean wait time in queue is:
W_q = ρ / (μ (1 − ρ))
Where ρ is server utilization (0 to 1) and μ is the service rate (requests per second at 100% utilization). The critical behavior: as ρ → 1, wait time diverges to infinity.
| Utilization ρ | Wait factor | p99 amplification | Design verdict |
|---|---|---|---|
| 50% | 1.0× | Minimal | Expensive but safe |
| 70% | 2.33× | Moderate | Google SRE recommended cap for interactive workloads |
| 80% | 4.0× | High | p99 ≈ 2× p50 at this point; bursts routinely breach SLO |
| 90% | 9.0× | Very high | p99 often 5–10× p50; SLO breach is the steady state |
| 95% | 19.0× | Extreme | Only acceptable for batch jobs with no latency SLO |
Google SRE recommends capping utilization at ~70% for interactive workloads for this reason. The remaining 30% is not "waste" — it is the headroom that absorbs traffic bursts without SLO breach. The cost of the headroom is the insurance premium against p99 violations.
Practical implication for LLM fleet design: at peak with 5,000 GPUs, target steady-state utilization at ~60% (17k QPS) to absorb the expected 2× traffic burst during peak hours without exceeding the 70% SLO threshold. Over-provisioning by 40% costs ~$5.1M/month on a baseline $12.8M fleet — but an 80% utilization cap saves only $2.55M/month while tripling SLO breach risk during peak.
Python: M/M/1 wait time vs. utilization
def mm1_wait_factor(utilization: float) -> float:
"""
Mean queue wait time as a multiple of mean service time.
M/M/1 queue formula: rho / (1 - rho).
Diverges as utilization -> 1.0.
"""
assert 0 < utilization < 1.0, "Utilization must be in (0, 1)"
return utilization / (1 - utilization)
for rho in [0.5, 0.6, 0.7, 0.8, 0.9, 0.95]:
print(f" ρ={rho:.2f} wait_factor={mm1_wait_factor(rho):.2f}x")Quick check
Decode is memory-bandwidth-bound (not FLOPS-bound) for models > 7B. Why does this make the GPU-count scaling exponent ~0.5 rather than 1.0 when tightening p99?
Break It — What Breaks First Under SLO Pressure
Walk through what fails when you aggressively tighten SLOs without matching fleet and architecture changes.
- Tighten p99 without adding GPUs: batch size must shrink. Throughput drops proportionally. At a fixed request arrival rate, queue depth grows. After the M/M/1 inflection point (~70% utilization), p99 blows past the SLO. The observable symptom is p99 climbing without a change in QPS — the fleet is queue-saturated.
- Raise cache hit target without prompt engineering: semantic cache threshold set too low → stale or wrong answers served. Cache hit metric climbs; user satisfaction metric falls. The cache is hitting on semantically similar but contextually different queries. Monitoring must include answer freshness and per-query cache correctness sampling, not just hit rate.
- Add a new personalization field to the system prompt: prefix cache keys become per-user instead of per-product. Cache hit drops from 50% to <5% overnight. GPU load spikes immediately; cost alarm fires within minutes if wired correctly. This is the #1 cause of sudden cost regressions in LLM products. Prevention: test prompt template changes in a staging environment with cache hit rate monitoring before rollout.
- Hold the p99.9 SLO on a consumer product during a traffic spike: the only way to hold p99.9 at high utilization is to shed load via admission control (refuse requests above the safety threshold). This converts a latency SLO breach into an availability breach. The correct SLO for a consumer product under traffic spikes is p99 with an admission-control backpressure policy, not p99.9 with an infinite queue — see the sketch below.
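A minimal sketch of queue-depth admission control with tier-based shedding. The tier ordering mirrors the priority discussed in the Anthropic lens below (safety > enterprise > paid > free); the thresholds are illustrative assumptions:
Python: tiered load shedding (illustrative)
from enum import IntEnum
class Tier(IntEnum):
    SAFETY = 0       # never shed: safety classifier traffic
    ENTERPRISE = 1
    PAID = 2
    FREE = 3
MAX_QUEUE_DEPTH = 500    # illustrative saturation threshold
def admit(tier: Tier, queue_depth: int) -> bool:
    """Shed lowest-priority traffic first as the queue saturates."""
    if queue_depth < MAX_QUEUE_DEPTH:
        return True
    overload = queue_depth / MAX_QUEUE_DEPTH
    if overload < 1.5:
        return tier <= Tier.PAID        # shed FREE (return 429)
    if overload < 2.0:
        return tier <= Tier.ENTERPRISE  # shed FREE + PAID
    return tier == Tier.SAFETY          # extreme overload: safety only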
What does a bad day cost? — Three SLO Incident Scenarios
Reliability is a dollar number. Three scenarios, each priced. All model cost and fleet figures are community estimates unless cited.
| Incident | Detected in 5 min | Detected in 4 h | Detection lever |
|---|---|---|---|
| SLO tightening pushed live without fleet expansion | p99 breach on ~1% of requests; ~90K degraded requests over 5 min at 30K QPS (300 s × 30K QPS × 1% tail = 90K). Rollback cost: ~10 min engineer time + revert deploy. Minimal revenue impact. | 4 h × 30K QPS × 1% tail ≈ 4.3M degraded requests. At a conservative $0.01 SLA credit per breach on enterprise tier (5% of traffic), credit liability ≈ 4.3M × 5% × $0.01 = $2,160. Plus: trust damage on social media, enterprise account review. If SLA credit is contractual and affects renewals, multiplier is 10–100×. | Per-tier p99 alarm (fires in <1 min) vs. next-morning SLO report. Real-time p99 monitoring cost: ~$0/month (metric already collected). Ops cost of the alarm: 1 engineer-hour to set up. ROI: immediate. |
| Prefix cache invalidation — system prompt template changed | Cache hit drops from 50% to 5% → effective QPS on GPU path jumps from 15K to 28.5K (1.9× spike). 5 min overrun: 5K GPUs × $3.50/hr × (5/60) h × 0.9 excess factor ≈ $1,313 in excess GPU cost. Revert the prompt template, cache warms in ~15 min. | 4 h overrun: 5K GPUs × $3.50/hr × 4 h × 0.9 excess ≈ $63K in excess GPU cost. Plus: p99 TTFT climbs because prefill is no longer cached — SLO breach window overlaps. Enterprise tier may trigger SLA credit. Total exposure: $63K direct + SLA credit multiplier. | Cache-hit-rate alarm (fires in <1 min when hit rate drops >20 pp) vs. daily billing dashboard. Real-time detection avoids >99% of excess cost. |
| Fleet under-provision at peak — utilization reaches 90% | At 90% utilization, M/M/1 wait factor = 9×. p99 TTFT jumps from 3 s to ~27 s. 5 min × 30K QPS × 1% tail = 90K requests with 27 s TTFT. Visible in p99 alarm immediately. Emergency autoscale: 15–30 min warmup for new GPU pods. | 4 h of p99 = 27 s on a consumer chat product: ~10% session abandonment (conservative; the typical threshold is that >5 s TTFT triggers exit). At 30K QPS and $3/session value, the exact loss cannot be priced without conversion data, but the order of magnitude is millions of USD in lost sessions for a large consumer product. | GPU utilization alarm at 70% threshold (pre-saturation, not post) + autoscale trigger. Proactive scale-out at 70% costs slightly more GPU-hours but prevents the 90% cliff. |
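The cache-invalidation row generalizes to a simple detection-window cost model. A minimal sketch (hypothetical function; the 0.9 excess factor and fleet figures come from the scenario above):
Python: detection-window cost multiplier
def excess_gpu_cost(gpus: int, gpu_hourly_usd: float,
                    detection_hours: float, excess_factor: float = 0.9) -> float:
    """Excess GPU spend while an incident runs undetected."""
    return gpus * gpu_hourly_usd * detection_hours * excess_factor
fast = excess_gpu_cost(5000, 3.50, 5 / 60)   # detected in 5 min
slow = excess_gpu_cost(5000, 3.50, 4.0)      # detected in 4 h
print(f"5-min detection: ${fast:,.0f}")        # ≈ $1,313
print(f"4-h detection:   ${slow:,.0f}")        # $63,000
print(f"multiplier:      {slow / fast:.0f}x")  # 48x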
SLO Incident Runbooks
p99 TTFT breach — SLO tightened without fleet expansion
MTTR p50 / p99: 20 min / 45 min. Blast radius: 1-in-100 users experiencing 2–5× normal latency. Enterprise tier risks SLA credit trigger. Social media complaints within 15 min if sustained.
- 1. Detect — Per-tier p99 TTFT alarm fires when p99 exceeds the SLO threshold for 2 consecutive 1-min windows. Alert routed to serving on-call.
- 2. Escalate — Serving on-call checks the fleet utilization dashboard; if utilization > 75%, page capacity on-call. If an SLO-tightening deploy is the cause, page the release manager.
- 3. Rollback — Revert the SLO config to the previous value (immediate, no fleet change needed). Alternatively, fast-track fleet scale-out (15–30 min warmup). MTTR: 20–40 min.
- 4. Post — Add a pre-rollout check: a fleet-at-new-SLO simulation must pass before the SLO config deploys. Backfill the simulation tool with the √(latency) model.
Cache invalidation storm — system prompt template rollout
MTTR p50 / p99: 30 min / 90 min. Blast radius: GPU effective QPS jumps 1.5–2× immediately. Prefill compute spikes; TTFT degrades for all users. The cost alarm should fire before the SLO alarm.
- 1. Detect — Cache-hit-rate alarm: fires within 1 min of hit rate dropping >20 pp below the 7-day rolling average. Cost-per-hour alarm: fires within 5 min of GPU cost exceeding baseline by 30%.
- 2. Escalate — On-call checks recent prompt template deploys in the deploy log. If a template change is found, escalate to product engineering for revert authorization.
- 3. Rollback — Revert the prompt template to the previous version. The cache warms passively as requests arrive with the old prefix (15–30 min for full recovery). MTTR: 30–60 min.
- 4. Post — Add a staging gate: prompt template changes must hold a stable cache-hit rate in staging for 10 min before production deploy. Add a cache-hit simulation to the deploy pre-check.
Fleet saturation at peak — utilization crosses 80%
MTTR p50 / p99: 15 min / 45 min. Blast radius: p99 TTFT climbs super-linearly (M/M/1 effect). At 90% utilization, queue wait is 9× mean service time. Consumer session abandonment begins at p99 > 5 s.
- 1. Detect — GPU utilization alarm at 70% threshold (not 80%) with a 5-min window. Autoscale trigger fires simultaneously. PagerDuty page to capacity on-call.
- 2. Escalate — Capacity on-call checks the traffic forecast vs. autoscale headroom. If autoscale is insufficient, page the capacity planning team for emergency spot-instance procurement.
- 3. Rollback — Enable load shedding (return 429 to free tier above the queue-depth threshold) to protect the paid-tier SLO while autoscale completes. Warm new GPU pods in parallel.
- 4. Post — Postmortem: was the traffic spike predicted in the capacity forecast? If not, update the forecast model. Raise the autoscale headroom by triggering at 60% utilization, not 70%.
Quick check
A prompt template rollout drops prefix cache hit from 50% to 5%. Effective GPU-path QPS was 15K before. What is it after, on a 30K total QPS system?
Company Lens — How Google, Amazon, and Anthropic Ask About SLOs
Google (SRE culture — error budgets and burn rates)
Google's SRE culture (formalized in the SRE book, 2016) treats SLOs as contracts backed by error budgets. Interviewers will probe whether you understand the budget-burn framing: an SLO of 99.9% gives you 43.8 minutes of downtime per month; a 10-minute outage burns 23% of that budget. Questions to expect: “How do you decide whether to freeze feature deploys when the error budget is 80% consumed?” (Answer: yes — that's the exact policy in the SRE book; feature development pauses and the team focuses on reliability work until the budget replenishes.) “Your p99 SLO says 500 ms. The service is at 480 ms p99 today. A new feature adds 40 ms median latency. Do you ship?” (Answer: depends on the tail amplification — median +40 ms can easily mean p99 +150 ms, which breaks the SLO. Run a latency regression test in staging before deciding.) Google interviews heavily weight the burn-rate alerting scheme from the SRE Workbook Chapter 5 — know the multi-window model (fast burn: 1h window, slow burn: 6h window).
Amazon (DynamoDB heritage — tail latency as a revenue signal)
Amazon's latency culture traces directly to the Dynamo paper and Werner Vogels' 2006 observation that every 100 ms of latency cost Amazon 1% of revenue (widely cited; original source: internal Amazon data presented at conferences, not a published paper). Amazon interviewers frame SLOs in terms of business impact, not engineering elegance. Questions to expect: "How do you quantify the revenue impact of a p99 regression before deploying a fix?" (Answer: A/B test with a 1% traffic split, measure session completion rate and conversion over 24 hours, price the delta.) "Your dependent service has a p99.9 SLO. Your service has a p99 SLO. What is your effective SLO when you call that dependency on every request?" (Answer: the combined failure rate is 1 − (1 − 0.01)(1 − 0.001) ≈ 1.1%, so the effective SLO is ~98.9% — worse than either individual SLO. This is the dependency-chaining SLO degradation problem that Amazon architects plan for explicitly with hedged requests and timeout budgets.) Amazon interviews emphasize the operational mechanics of cascade-amplified tail latency more than the theoretical foundations.
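The dependency-chaining arithmetic generalizes to any number of serial dependencies. A minimal sketch (hypothetical helper name):
Python: chained SLO degradation
import math
def chained_slo(*slos: float) -> float:
    """Effective SLO when every request touches all dependencies in series."""
    return math.prod(slos)
# A p99 service (0.99) calling a p99.9 dependency (0.999) on every request
print(f"{chained_slo(0.99, 0.999):.4f}")   # 0.9890 → ~98.9% effective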
Anthropic (safety-capacity link — SLO as a safety invariant)
Anthropic's interview culture treats SLOs differently from Google and Amazon: capacity is not purely a cost question because capacity constraints interact with safety behaviors. Questions to expect: “If your serving fleet is at 90% utilization during a peak event and you enable load shedding, which requests do you shed?” (Answer: NOT at random. Safety-critical moderation requests must be prioritized over general completion requests. Shedding moderation requests to protect TTFT SLO can create unmoderated output windows — this is unacceptable. The correct prioritization: safety classifier > enterprise > paid > free.) “A cost reduction proposal suggests cutting prefill capacity by 30%. What is the safety implication?” (Answer: prefill handles the safety classifier; cutting prefill capacity raises TTFT for the classifier and may allow responses to begin streaming before the safety check completes, depending on the pipeline architecture. The safety implication must be evaluated before the cost savings are credited.) Anthropic interviewers are explicitly looking for candidates who reason about SLOs as multi-constraint systems where cost, latency, and safety interact — not independent axes.
Key Takeaways
What to remember for interviews
- 1p99 is the SLO that matters. p50 hides tail users; p99.9 overbuilds for consumer products. p99 is where architecture decisions become load-bearing.
- 2Cutting p99 in half costs √2 ≈ 1.41× more GPUs, not 2×. The concavity of the √ law is the key number for SLO tradeoff discussions.
- 3Cache hit rate is the only free lunch. Every 10 pp increase in cache hit directly reduces GPU load with no quality tradeoff. A single bad prompt template deploy can drop cache hit from 50% to 5% overnight.
- 4Design for ≤70% utilization. Above 70%, M/M/1 wait time grows super-linearly and p99 blows past SLO on any traffic burst. The 30% headroom is not waste — it is SLO insurance.
- 5Different system classes have different cost slopes. Consumer chat benefits most from cache optimization; search/RAG starts from a lower cache baseline and has bigger upside; image/video gen is pure latency-cost tradeoff with near-zero cache benefit.
- 6Detection window is the dominant cost multiplier. A 5-min vs. 4-hour detection window can be a 48× difference in incident cost for cache invalidation events.
Interview Questions
An interviewer asks: "If we cut p99 TTFT from 3 s to 1.5 s, how much more does that cost?" Walk through your derivation. ★★★
Your cache hit rate drops from 55% to 20% overnight. How does that change your GPU fleet sizing, and what caused it? ★★★
At 80% GPU utilization, p99 latency is 2.2× p50. At 50% utilization, it is 1.3×. Why, and what is the threshold you should design around? ★★★
You have a system with p99 = 120 s and high variance (image generation). The product team wants a p99 SLA commitment. How do you price and architect it? ★★☆
Recap quiz
SLO ↔ Cost recap
A consumer chat fleet runs 5,000 H100s at p99 TTFT = 3 s. The PM asks to hit p99 = 750 ms (4× tighter). Using the √(latency) rule, how many GPUs are needed?
The consumer chat sandbox shows cache hit at 50%, p99 = 3 s, 5,000 GPUs. To save 20% GPU cost, which lever is cheaper: raise cache hit by 20 pp or relax p99 to 4.5 s?
Your 99.9% monthly SLO gives 43.8 min of error budget. You detect 50 errors in 10K requests over 1 hour. Should you page on-call?
GPU fleet utilization rises from 70% to 90% during a traffic spike. By M/M/1, how does mean queue wait change, and what happens to p99?
Why is cache hit rate nearly irrelevant for image/video generation but critical for consumer chat?
A prefix cache invalidation causes GPU costs to spike 1.9× for 4 hours before detection. At 5K GPUs × $3.50/hr, what is the excess cost vs. detecting in 5 minutes?
A product manager wants to offer a p99.9 SLO on a consumer chat product. What is the primary cost argument against it relative to a p99 SLO?
Further Reading
- Gil Tene — How NOT to Measure Latency (QCon 2015) — The canonical talk on why averages and even p95 lie, and why p99/p99.9 are the only metrics that capture the user's experience. The HDR histogram argument is mandatory background for SLO design.
- DeCandia et al. — Dynamo: Amazon's Highly Available Key-value Store (SOSP 2007) — The paper that defined SLO-driven design at scale. Section 4 on the latency-at-p99.9 requirement and its architectural implications is the playbook this module derives from.
- Google SRE Book — Load Balancing at the Frontend — The M/M/1 queueing argument and the 70% utilization cap are made explicit here. The error-budget math in the SLO chapter pairs with this module's queueing deep dive.
- Lilian Weng — Large Transformer Model Inference Optimization — The best single reference for how batch size, memory bandwidth, and latency interact at the hardware level — the physical grounding for the √(latency) derivation.
- Pope et al. — Efficiently Scaling Transformer Inference (Google, 2022) — First-principles analysis of memory bandwidth vs. compute bottlenecks in large model serving. The paper that grounds the batch-size/latency tradeoff in hardware arithmetic.