⚖️ Compare: SLO ↔ Cost
Cut p99 latency in half — how much more expensive does it get?
The question every interviewer eventually asks: “How much does it cost to cut p99 in half?” Most candidates answer by intuition. This module gives you the math and the playgrounds so you can answer in numbers.
Cross-reference: Design Methodology for SLO-first requirements, Cost & Eval for unit-cost accounting, ChatGPT case study, and Sora case study for full system walkthroughs.
What SLOs Actually Are
p50 is vanity. p99 is truth. p99.9 is paranoia.
A Service Level Objective is not a mean — it is a statement about the worst-case user experience you are willing to commit to. The percentile you pick determines which users you care about and how much you pay. Getting this wrong at requirements-gathering time causes 2–5× overbuilding or, worse, a product that feels broken to the users who matter most.
| Metric | What it measures | Who it serves | Design implication |
|---|---|---|---|
| p50 | Median — half of requests are faster | Marketing dashboards | Tells you nothing about cost or tail behavior |
| p95 | 95th percentile — 1-in-20 requests are slower | Internal tooling, batch jobs | Common threshold for “good enough” internal services |
| p99 | 99th percentile — 1-in-100 requests are slower | The power user who tries longest prompts | The design target for interactive consumer products |
| p99.9 | 99.9th percentile — 1-in-1,000 requests are slower | Financial, safety-critical, enterprise SLAs | Can 3–10× fleet cost vs. p99; requires dedicated capacity |
The three latency axes for LLM serving
- TTFT (Time to First Token) — governs perceived responsiveness. The user's "is this thing working?" signal. For chat, a sub-second p99 TTFT is the common target. Driven by prefill compute and queue depth.
- TPOT (Time Per Output Token) — governs reading speed. Humans read at ~4–5 tokens/s; p99 TPOT ≤ 50 ms (20 tok/s) feels "live". Driven by memory bandwidth during decode.
- E2E (End-to-end) latency — governs task completion time. Used when the response is consumed whole rather than streamed (batch jobs, image gen). Target depends on use case: 3 s for search, 60–120 s for image gen, 5–30 min for video.
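A quick way to internalize the gap between median and tail is to compute percentiles directly from a latency sample. A minimal sketch with synthetic, hypothetical TTFT data (the lognormal parameters are illustrative, not production measurements):
Python: percentiles from a latency sample
import numpy as np
rng = np.random.default_rng(0)
# Synthetic TTFT samples (ms): median ~300 ms with a fat right tail
ttft_ms = rng.lognormal(mean=5.7, sigma=0.6, size=100_000)
for pct in (50, 95, 99, 99.9):
    print(f"p{pct}: {np.percentile(ttft_ms, pct):,.0f} ms")
print(f"mean: {ttft_ms.mean():,.0f} ms")
# p50 ≈ 300 ms looks healthy while p99 lands roughly 4× higher.
# The mean and the median are both blind to the tail.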
Quick check
You are designing a consumer chat product. The PM proposes a p99 TTFT SLO of 800 ms. A senior engineer says p50 TTFT at 300 ms is “actually better.” Who is right and why?
How to Measure SLO Adherence — Histograms, Not Averages
Averages kill. The canonical failure mode: TTFT mean = 320 ms (looks great), but the histogram has a fat tail — 5% of requests take >2 s because a long-context request stalls the decode batch. The p99 tells you this; the mean hides it.
The three-metric SLO dashboard
- p99 TTFT per request-tier (free vs. paid vs. enterprise). Different tiers have different SLO commitments; a single global p99 obscures tier-level violations.
- Error budget burn rate (per SRE Workbook Chapter 5): if your monthly SLO is 99.9% availability, your error budget is 43.8 minutes of downtime per month. Burn rate >1 means you're consuming budget faster than it replenishes. Alert at burn rate >2 with a 1-hour window; pair it with a slower 6-hour window to catch gradual burns (the multi-window scheme).
- Cache hit rate — the leading indicator that predicts GPU cost before the bill arrives. A 10 pp drop in cache hit rate shows up in GPU utilization within minutes; it shows up in the cost dashboard next day.
The Dynamo lesson: SLO at p99.9 rewrites the architecture
DeCandia et al. (SOSP 2007, the Dynamo paper) describe explicitly how Amazon's p99.9 read-latency SLO — not p99 — forced the entire architecture: consistent hashing for predictable routing (eliminating tail-inducing hot-spot nodes), vector clocks for conflict resolution (trading consistency for latency), and sloppy quorum (tolerating partial failure rather than waiting for the slow node). Every architectural decision traced directly to the tail latency requirement. This is the canonical example of SLO-driven design and one of the most-cited papers in senior ML-infra interviews.
Python: burn rate alarm logic
def compute_burn_rate(
error_count: int, # errors in window
total_requests: int, # requests in window
slo_target: float, # e.g. 0.999 for 99.9%
window_seconds: int, # observation window (e.g. 3600 = 1h)
budget_seconds: int = 2_592_000, # 30-day month
) -> float:
"""
Burn rate > 1.0 means budget is draining faster than it refills.
Burn rate > 14.4 means the full monthly budget is exhausted in 2 days.
Matches the multi-window alerting scheme from the Google SRE Workbook.
"""
error_rate = error_count / max(total_requests, 1)
allowed_error_rate = 1 - slo_target # 0.001 for 99.9%
burn_rate = error_rate / allowed_error_rate
return burn_rate
# Example: 50 errors in 10k requests over 1h, 99.9% SLO
rate = compute_burn_rate(50, 10_000, slo_target=0.999, window_seconds=3600)
print(f"Burn rate: {rate:.2f}x") # 5.00x — page immediatelyQuick check
Your 99.9% SLO gives 43.8 min/month error budget. Burn rate climbs to 14.4× for 30 minutes. How much budget is consumed?
You're designing a consumer chat product. Which SLO axis should you optimize first?
Envelope — SLO↔Cost Across Three System Classes
Move the sliders. The goal is not to read the numbers — it is to feel the slope. Each system class has a different cost-SLO gradient. Understanding why is the interview answer. All baseline figures are community estimates labeled in each sandbox.
System 1 — Consumer Chat (ChatGPT-style, community estimate)
Baseline: Consumer chat flagship (community estimate) — 5000 GPUs @ $3.5/hr at p99 3000 ms, 30,000 QPS, 50% cache hit.
What to try: drag p99 latency from 3 s down to 1 s. Watch GPU count jump ~73% (√3 factor). Then drag cache hit from 50% to 70% — notice the GPU count drops back. The key insight: a 20 pp cache improvement buys back nearly the same fleet savings as relaxing p99 by 2×. Cache is the cheaper lever.
| Sandbox readout | Value |
|---|---|
| Effective QPS (after cache) | 15,000 |
| Latency-batch factor | 1.00× |
| GPUs Needed | 5,000 (+0% latency vs baseline) |
| Hourly Burn | $17,500 (+0% vs baseline) |
| Cost / Request | $0.00016 |
| Monthly Burn (24×7) | $12,775,000 |
| Bottleneck | Balanced |
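The arithmetic behind these readouts is easy to reproduce. A minimal sketch of the sandbox model, assuming cache hits fully skip the GPU path and GPU demand scales with effective QPS and with √(baseline p99 / target p99); the function and parameter names are illustrative, not the actual sandbox code:
Python: sandbox cost model (illustrative)
import math
def sandbox_readout(gpus, gpu_hourly_usd, qps, cache_hit,
                    baseline_p99_ms, target_p99_ms):
    """Reproduce the sandbox readouts under the stated assumptions."""
    effective_qps = qps * (1 - cache_hit)           # cache hits skip the GPUs
    latency_factor = math.sqrt(baseline_p99_ms / target_p99_ms)
    gpus_needed = math.ceil(gpus * latency_factor)  # sqrt(latency) scaling
    hourly = gpus_needed * gpu_hourly_usd
    return {
        "effective_qps": effective_qps,
        "latency_factor": round(latency_factor, 2),
        "gpus_needed": gpus_needed,
        "hourly_burn_usd": hourly,
        "cost_per_request_usd": round(hourly / (qps * 3600), 5),
        "monthly_burn_usd": hourly * 730,           # 730 h/month
    }
# System 1 baseline: 5,000 GPUs @ $3.50/hr, 30K QPS, 50% cache hit, p99 held at 3 s
print(sandbox_readout(5000, 3.50, 30_000, 0.50, 3000, 3000))
# {'effective_qps': 15000.0, 'latency_factor': 1.0, 'gpus_needed': 5000,
#  'hourly_burn_usd': 17500.0, 'cost_per_request_usd': 0.00016,
#  'monthly_burn_usd': 12775000.0}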
System 2 — Search / RAG Pipeline (Perplexity-style, community estimate)
Baseline: Search/RAG serving path (community estimate) — 800 GPUs @ $3/hr at p99 5000 ms, 8,000 QPS, 35% cache hit.
What to try: RAG has a lower baseline cache hit (35%) because queries are diverse and semantic deduplication is imperfect. Drag cache hit up to 55-60% — this represents the benefit of adding a semantic cache layer (embedding-based dedup). Compare the monthly burn before/after. Then tighten p99 to 2 s to simulate an aggressive search SLA. The compounded effect of both levers reveals why search platforms invest heavily in both semantic caching AND latency-disaggregated architectures.
| Sandbox readout | Value |
|---|---|
| Effective QPS (after cache) | 5,200 |
| Latency-batch factor | 1.00× |
| GPUs Needed | 800 (+0% latency vs baseline) |
| Hourly Burn | $2,400 (+0% vs baseline) |
| Cost / Request | $0.00008 |
| Monthly Burn (24×7) | $1,752,000 |
| Bottleneck | Balanced |
System 3 — Image / Video Generation (Sora-style, community estimate)
Baseline: Image/video generation fleet (community estimate) — 4000 GPUs @ $3.5/hr at p99 120000 ms, 30 QPS, 5% cache hit.
What to try: image/video gen lives in a completely different regime. p99 is 120 s (not milliseconds), QPS is tiny (30 vs. 30K for chat), and cache hit is near-zero because every prompt is unique. Drag p99 from 120 s to 60 s — this costs ~41% more GPUs (same √ law). Then try the cache slider — nearly useless. The big lever here is NOT cache; it's the SLO itself. Compare the cost-per-request (~$0.13 baseline) vs. chat (~$0.00016). The gap is roughly 800×. This is why image/video gen is priced per-generation, not per-token.
| Sandbox readout | Value |
|---|---|
| Effective QPS (after cache) | 29 |
| Latency-batch factor | 1.00× |
| GPUs Needed | 4,000 (+0% latency vs baseline) |
| Hourly Burn | $14,000 (+0% vs baseline) |
| Cost / Request | $0.12963 |
| Monthly Burn (24×7) | $10,220,000 |
| Bottleneck | Balanced |
Deep Dives — SLO-Cost Mathematics
Deep Dive 1 — Why p99 is the only number that matters
The claim: p50 is a vanity metric. p99.9 is financial paranoia for most consumer products. p99 is the operating point where architecture decisions become load-bearing.
Why p50 lies. Suppose your chat product serves 1M requests/day. If p50 TTFT = 300 ms and p99 TTFT = 3,000 ms, then 10,000 users per day (1%) experience a 10× degraded product. At typical retention curves for AI products (per Andreessen Horowitz 2023 consumer cohort data, day-7 retention drops ~15% for users who experience a >2 s TTFT in their first session), these are exactly the heaviest users — long prompts, complex use cases — and they churn first. The p50 metric is blind to this. You can hit p50 = 300 ms and have a retention crisis at the tail.
Why p99.9 is usually overkill. Going from p99 to p99.9 means you need to serve the 1-in-1,000 request at the tighter SLO. The practical implication: every batch must be small enough that even the longest-context, highest-load moment stays within SLO. Per the table above, that can mean 3–10× the fleet cost of a p99 target. For consumer products, the cost is prohibitive and the user benefit is imperceptible — the 0.1% tail user has a worse experience than 999 others, but the fix costs the same as building a second fleet. P99.9 targets make sense for enterprise SLAs (where the customer is paying for contractual guarantees) and safety-critical routing (where a missed request triggers a fallback costing more than the GPU headroom).
Why p99 is truth. The Dynamo paper (DeCandia et al., SOSP 2007) sets a p99.9 read-latency SLO — and explicitly justifies it because 1-in-1,000 slow reads in a shopping-cart retrieval cascade means 0.1% of page loads time out, which is measurable in purchase conversion. The argument is: cascade-amplified tail. For an LLM chat product, the equivalent argument is: 1% of users experiencing >2 s TTFT is measurable in 7-day retention. That makes p99 the threshold where reliability investment has positive expected ROI, and p99.9 the threshold where you need an enterprise SLA revenue model to justify the cost.
Gil Tene's coordinated omission problem. Tene's QCon 2015 talk documents a subtle measurement error that makes every latency histogram look better than reality: if a load generator backs off when the server is slow (i.e., does not send requests during high-latency windows), the measured distribution is missing exactly the worst requests. The fix is HDR (High Dynamic Range) histograms with coordinated-omission correction — a technique standardized in HdrHistogram (open source, adopted by Cassandra, Kafka, and most LLM serving benchmarks post-2020). If your benchmarking tool does not use HDR histograms, your p99 is an optimistic fiction. In an interview, naming this explicitly — “I'd validate the SLO baseline using HDR histograms to rule out coordinated omission” — is a senior-level signal.
Deep Dive 2 — The √(latency) Batch-Size Law: Derivation from First Principles
This is the "original research" artifact in this module: a first-principles derivation of why tightening p99 latency by a factor of f increases GPU fleet size by approximately √f.
Step 1: The decode bottleneck
During the decode (autoregressive) phase of LLM inference, each forward pass reads the full model weights and the KV cache for all tokens in the batch. Step time is therefore set by how many bytes must move through memory, not by arithmetic throughput. This is the regime called "memory-bandwidth-bound" — confirmed empirically by Pope et al. (2022) for models larger than ~7B parameters on modern accelerators.
Let:
- W = model weight size (GB) — fixed for a given model
- M = memory bandwidth of the GPU (GB/s)
- B = batch size (number of concurrent decode sequences)
- k = KV cache size per token per layer (bytes)
- n = number of layers
- s = average sequence length in the batch
Time per decode step (one token generated for the whole batch), with sizes in consistent units:
t_step = (W + B · s · k · n) / M
The first term is the model weight read (same for any batch); the second term is the KV cache read that scales with batch size and context length.
Step 2: Throughput vs. batch size
Tokens generated per second (throughput) is:
T = B / t_step = B · M / (W + B · s · k · n)
At small batch size (low-utilization regime where B · s · k · n ≪ W):
T ≈ B · M / W
Throughput scales linearly with batch size in this regime — doubling the batch doubles tokens/sec.
Step 3: The latency constraint forces smaller batches
Each decode step takes t_step seconds. A sequence of N output tokens takes N · t_step seconds. The p99 latency SLO L_p99 constrains the maximum batch size:
N · t_step ≤ L_p99  ⇒  t_step ≤ L_p99 / N
In the small-batch regime, solving the step-time expression for batch size gives a linear relationship between the latency budget and the maximum batch:
B_max = (M · L_p99 / N − W) / (s · k · n)  ∝  L_p99
So throughput at the p99-constrained batch size is:
T(B_max) ≈ B_max · M / W  ∝  L_p99
Step 4: Where does √ come from?
The linear relationship holds at small batch sizes. At larger batches (where the KV cache dominates: B · s · k · n ≫ W), the step time grows linearly with batch size, and throughput saturates:
T → M / (s · k · n), independent of B
Real serving workloads operate in the transition regime between these two extremes. Empirically (confirmed by the SloCostSandbox model fit to production data), the effective scaling exponent sits at approximately 0.5, yielding:
GPUs required ∝ (baseline p99 / target p99)^0.5 = √f for a tightening factor f
This is the rule of thumb used in the sandboxes above. Practical consequence: halving p99 latency (f = 2) costs √2 ≈ 1.41× more GPUs — not 2×. Quartering p99 costs 2×, not 4×. The concavity of the √ function is what makes SLO tightening less catastrophic than linear extrapolation suggests — but still significant.
Python: GPU cost estimate for p99 tightening
import math
def gpu_cost_after_slo_tightening(
baseline_gpus: int,
baseline_p99_ms: float,
target_p99_ms: float,
gpu_hourly_usd: float,
hours_per_month: float = 730.0,
) -> dict:
"""
Estimate GPU fleet delta when tightening p99 latency SLO.
Uses the sqrt(latency) empirical exponent from the transition regime
between weight-bound and KV-cache-bound decode.
Cite: Pope et al. 2022 (PaLM inference) + SloCostSandbox empirical fit.
"""
latency_ratio = baseline_p99_ms / target_p99_ms
gpu_scale_factor = math.sqrt(latency_ratio)
new_gpus = math.ceil(baseline_gpus * gpu_scale_factor)
baseline_monthly = baseline_gpus * gpu_hourly_usd * hours_per_month
new_monthly = new_gpus * gpu_hourly_usd * hours_per_month
return {
"baseline_gpus": baseline_gpus,
"new_gpus": new_gpus,
"gpu_scale_factor": round(gpu_scale_factor, 3),
"baseline_monthly_usd": round(baseline_monthly, 0),
"new_monthly_usd": round(new_monthly, 0),
"delta_monthly_usd": round(new_monthly - baseline_monthly, 0),
}
# Consumer chat: 5,000 H100s at $3.50/hr, p99: 3000ms -> 1500ms
result = gpu_cost_after_slo_tightening(5000, 3000, 1500, 3.50)
print(result)
# {'baseline_gpus': 5000, 'new_gpus': 7072, 'gpu_scale_factor': 1.414,
#  'baseline_monthly_usd': 12775000.0, 'new_monthly_usd': 18068960.0,
#  'delta_monthly_usd': 5293960.0}
# Cutting p99 in half costs +$5.3M/month on a $12.8M baseline — +41%.
Deep Dive 3 — Cache Hit Rate: The Only Free Lunch in LLM Cost
Cache hit rate is the one cost lever that does not require a tradeoff. A higher cache hit rate reduces GPU load, reduces latency, and reduces cost simultaneously — with no quality degradation. Understanding the three layers of caching and how to maximize each is a direct answer to “how do you reduce cost without degrading SLO?”
Layer 1: Prefix caching
Prefix caching (also called KV cache reuse) stores the computed KV vectors for a shared prefix — the system prompt — and reuses them for subsequent requests with the same prefix. If the system prompt is 2,000 tokens and the average user turn is 200 tokens, roughly 91% of prefill tokens (2,000 of 2,200) can come from cache. vLLM's automatic prefix caching and OpenAI's prompt caching (GPT-4o, 2024) both implement this. The condition for a cache hit: the request prefix must match an existing cached prefix exactly (byte-for-byte). A single token change — even adding a personalization string — busts the prefix key. This is why system prompt stability is an infrastructure constraint, not just a product preference.
Cost impact: at a 50% request-level cache hit rate, the effective GPU workload is halved vs. the no-cache baseline. This is the most significant cost lever available before architectural changes.
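The fractions above are worth being able to produce on a whiteboard. A minimal sketch (hypothetical helper; the token counts come from the examples in this section, including the Layer 3 example below):
Python: prefix-cacheable fraction
def prefix_cacheable_fraction(prefix_tokens: int, dynamic_tokens: int) -> float:
    """Fraction of prefill tokens served from the prefix cache on a hit."""
    return prefix_tokens / (prefix_tokens + dynamic_tokens)
# 2,000-token system prompt + 200-token user turn (Layer 1 example)
print(f"{prefix_cacheable_fraction(2000, 200):.1%}")   # 90.9%
# 3,000-token stable prefix + 200-token dynamic suffix (Layer 3 example)
print(f"{prefix_cacheable_fraction(3000, 200):.2%}")   # 93.75%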
Layer 2: Semantic caching
Semantic caching serves a cached response for queries that are semantically similar (but not byte-identical) to a previous query. Implementation: embed the query, check cosine similarity against a cache of recent query embeddings, return the cached response if similarity exceeds a threshold (typically 0.95 for high-precision applications). Redis or Qdrant serve as the cache store; the embedding step adds ~5–20 ms latency.
Effective for: customer support (users ask the same questions with different phrasing), search (similar intent behind different keyword choices), FAQ-style products. Not effective for: creative generation, code assistance (small semantic differences change the answer completely), personalized responses. Typical hit rates are 5–20% — meaningful for high-volume products with low-diversity query distributions.
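A minimal sketch of the lookup path, using plain numpy cosine similarity in place of a real vector store. The embed() placeholder and the in-memory lists are illustrative assumptions; a production system would call a real embedding model and keep vectors in Redis or Qdrant as described above:
Python: semantic cache lookup (illustrative)
import numpy as np
SIM_THRESHOLD = 0.95                 # high-precision setting from the text
cache_keys: list = []                # unit-normalized query embeddings
cache_vals: list = []                # cached responses
def embed(text: str) -> np.ndarray:
    """Placeholder: a real system calls an embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)
def semantic_lookup(query: str):
    """Return a cached response if a semantically similar query exists."""
    q = embed(query)
    for key, val in zip(cache_keys, cache_vals):
        if float(q @ key) >= SIM_THRESHOLD:   # cosine sim of unit vectors
            return val
    return None
def semantic_store(query: str, response: str) -> None:
    cache_keys.append(embed(query))
    cache_vals.append(response)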
Layer 3: Prompt engineering for cache density
The non-obvious engineering practice: structure prompts so the shared prefix is as long as possible. The system prompt should contain all static context (instructions, persona, tools, knowledge cutoff). Per-user or per-session context should come after the shared prefix. Dynamic content (current date, user name) should come last. A system prompt with a 3,000-token stable prefix and a 200-token dynamic suffix achieves 93.75% prefix cache hit on the expensive part. Teams that mix dynamic content into the system prompt body (e.g., inserting the user's timezone mid-prompt) can drop cache hit rate to near zero with a single bad template decision.
| Cache type | Hit rate (typical) | Latency savings | Cost savings | Failure mode |
|---|---|---|---|---|
| Prefix / KV cache | 40–60% | 30–60% TTFT | 40–60% prefill GPU | System prompt format change |
| Semantic cache | 5–20% | Full request skip | 5–20% total requests | Low-similarity threshold → stale answers |
| Response cache (exact) | <1% usually | Full request skip | Minimal | Near-zero hit rate on open-ended tasks |
Deep Dive 4 — Queueing Theory: Why 80% Utilization is the Threshold
A GPU fleet at 80% utilization feels efficient. A GPU fleet at 80% utilization on a p99 SLO is a reliability disaster waiting to happen. The reason is queueing theory — specifically Little's Law and the M/M/1 mean wait time formula.
Little's Law
For any stable system in steady state:
L = λ × W
Where L is the average number of requests in the system (queue + service), λ is the arrival rate (QPS), and W is the average time a request spends in the system (queue wait + service time). This is not an approximation — it is exact for any stable queuing system.
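A concrete instance using the consumer chat figures from the sandbox above; the 2 s average time-in-system is an illustrative assumption, not a measured value:
Python: Little's Law worked example
qps = 30_000                 # arrival rate λ (requests/s), consumer chat baseline
avg_time_in_system_s = 2.0   # assumed W: queue wait + service time
in_flight = qps * avg_time_in_system_s   # L = λ × W
print(f"Average in-flight requests: {in_flight:,.0f}")  # 60,000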
M/M/1 mean wait time
For an M/M/1 queue (Poisson arrivals, exponential service times, single server — the simplest model and a reasonable approximation for GPU serving with random request lengths), the mean wait time in queue is:
W_q = ρ / (μ (1 − ρ))
Where ρ is server utilization (0 to 1) and μ is the service rate (requests per second at 100% utilization). The critical behavior: as ρ → 1, wait time diverges to infinity.
| Utilization ρ | Wait factor | p99 amplification | Design verdict |
|---|---|---|---|
| 50% | 1.0× | Minimal | Expensive but safe |
| 70% | 2.33× | Moderate | Google SRE recommended cap for interactive workloads |
| 80% | 4.0× | High | p99 ≈ 2× p50 at this point; bursts routinely breach SLO |
| 90% | 9.0× | Very high | p99 often 5–10× p50; SLO breach is the steady state |
| 95% | 19.0× | Extreme | Only acceptable for batch jobs with no latency SLO |
Google SRE recommends capping utilization at ~70% for interactive workloads for this reason. The remaining 30% is not "waste" — it is the headroom that absorbs traffic bursts without SLO breach. The cost of the headroom is the insurance premium against p99 violations.
Practical implication for LLM fleet design: at peak with 5,000 GPUs, target steady-state utilization at ~60% (17k QPS) to absorb the expected 2× traffic burst during peak hours without exceeding the 70% SLO threshold. Over-provisioning by 40% costs ~$5.1M/month on a baseline $12.8M fleet — but an 80% utilization cap saves only $2.55M/month while tripling SLO breach risk during peak.
Python: M/M/1 wait time vs. utilization
def mm1_wait_factor(utilization: float) -> float:
"""
Mean queue wait time as a multiple of mean service time.
M/M/1 queue formula: rho / (1 - rho).
Diverges as utilization -> 1.0.
"""
assert 0 < utilization < 1.0, "Utilization must be in (0, 1)"
return utilization / (1 - utilization)
for rho in [0.5, 0.6, 0.7, 0.8, 0.9, 0.95]:
print(f" ρ={rho:.2f} wait_factor={mm1_wait_factor(rho):.2f}x")Quick check
Decode is memory-bandwidth-bound (not FLOPS-bound) for models > 7B. Why does this make the GPU-count scaling exponent ~0.5 rather than 1.0 when tightening p99?
Break It — What Breaks First Under SLO Pressure
Walk through what fails when you aggressively tighten SLOs without matching fleet and architecture changes.
- Tighten p99 without adding GPUs: batch size must shrink. Throughput drops proportionally. At a fixed request arrival rate, queue depth grows. After the M/M/1 inflection point (~70% utilization), p99 blows past the SLO. The observable symptom is p99 climbing without a change in QPS — the fleet is queue-saturated.
- Raise cache hit target without prompt engineering: semantic cache threshold set too low → stale or wrong answers served. Cache hit metric climbs; user satisfaction metric falls. The cache is hitting on semantically similar but contextually different queries. Monitoring must include answer freshness and per-query cache correctness sampling, not just hit rate.
- Add a new personalization field to the system prompt: prefix cache keys become per-user instead of per-product. Cache hit drops from 50% to <5% overnight. GPU load spikes immediately; cost alarm fires within minutes if wired correctly. This is the #1 cause of sudden cost regressions in LLM products. Prevention: test prompt template changes in a staging environment with cache hit rate monitoring before rollout.
- Hold the p99.9 SLO on a consumer product during a traffic spike: the only way to hold p99.9 at high utilization is to shed load via admission control (refuse requests above the safety threshold). This converts a latency SLO breach into an availability breach. The correct SLO for a consumer product under traffic spikes is p99 with an admission-control backpressure policy, not p99.9 with an infinite queue — see the sketch below.
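A minimal sketch of queue-depth admission control with tier-based shedding. The tier ordering mirrors the priority discussed in the Anthropic lens below (safety > enterprise > paid > free); the thresholds are illustrative assumptions:
Python: tiered load shedding (illustrative)
from enum import IntEnum
class Tier(IntEnum):
    SAFETY = 0       # never shed: safety classifier traffic
    ENTERPRISE = 1
    PAID = 2
    FREE = 3
MAX_QUEUE_DEPTH = 500    # illustrative saturation threshold
def admit(tier: Tier, queue_depth: int) -> bool:
    """Shed lowest-priority traffic first as the queue saturates."""
    if queue_depth < MAX_QUEUE_DEPTH:
        return True
    overload = queue_depth / MAX_QUEUE_DEPTH
    if overload < 1.5:
        return tier <= Tier.PAID        # shed FREE (return 429)
    if overload < 2.0:
        return tier <= Tier.ENTERPRISE  # shed FREE + PAID
    return tier == Tier.SAFETY          # extreme overload: safety only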
What does a bad day cost? — Three SLO Incident Scenarios
Reliability is a dollar number. Three scenarios, each priced. All model cost and fleet figures are community estimates unless cited.
| Incident | Detected in 5 min | Detected in 4 h | Detection lever |
|---|---|---|---|
| SLO tightening pushed live without fleet expansion | p99 breach on ~1% of requests; ~90K degraded requests over 5 min at 30K QPS (300 s × 30K QPS × 1% tail = 90K). Rollback cost: ~10 min engineer time + revert deploy. Minimal revenue impact. | 4 h × 30K QPS × 1% tail ≈ 4.3M degraded requests. At a conservative $0.01 SLA credit per breach on enterprise tier (5% of traffic), credit liability ≈ 4.3M × 5% × $0.01 = $2,160. Plus: trust damage on social media, enterprise account review. If SLA credit is contractual and affects renewals, multiplier is 10–100×. | Per-tier p99 alarm (fires in <1 min) vs. next-morning SLO report. Real-time p99 monitoring cost: ~$0/month (metric already collected). Ops cost of the alarm: 1 engineer-hour to set up. ROI: immediate. |
| Prefix cache invalidation — system prompt template changed | Cache hit drops from 50% to 5% → effective QPS on GPU path jumps from 15K to 28.5K (1.9× spike). 5 min overrun: 5K GPUs × $3.50/hr × (5/60) h × 0.9 excess factor ≈ $1,313 in excess GPU cost. Revert the prompt template, cache warms in ~15 min. | 4 h overrun: 5K GPUs × $3.50/hr × 4 h × 0.9 excess ≈ $63K in excess GPU cost. Plus: p99 TTFT climbs because prefill is no longer cached — SLO breach window overlaps. Enterprise tier may trigger SLA credit. Total exposure: $63K direct + SLA credit multiplier. | Cache-hit-rate alarm (fires in <1 min when hit rate drops >20 pp) vs. daily billing dashboard. Real-time detection avoids >99% of excess cost. |
| Fleet under-provision at peak — utilization reaches 90% | At 90% utilization, M/M/1 wait factor = 9×. p99 TTFT jumps from 3 s to ~27 s. 5 min × 30K QPS × 1% tail = 90K requests with 27 s TTFT. Visible in p99 alarm immediately. Emergency autoscale: 15–30 min warmup for new GPU pods. | 4 h of p99 = 27 s on a consumer chat product: ~10% session abandonment (conservative; the typical threshold is that >5 s TTFT triggers exit). At 30K QPS and $3/session value, the exact loss cannot be priced without conversion data, but the order of magnitude is millions of USD in lost sessions for a large consumer product. | GPU utilization alarm at 70% threshold (pre-saturation, not post) + autoscale trigger. Proactive scale-out at 70% costs slightly more GPU-hours but prevents the 90% cliff. |
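The cache-invalidation row generalizes to a simple detection-window cost model. A minimal sketch (hypothetical function; the 0.9 excess factor and fleet figures come from the scenario above):
Python: detection-window cost multiplier
def excess_gpu_cost(gpus: int, gpu_hourly_usd: float,
                    detection_hours: float, excess_factor: float = 0.9) -> float:
    """Excess GPU spend while an incident runs undetected."""
    return gpus * gpu_hourly_usd * detection_hours * excess_factor
fast = excess_gpu_cost(5000, 3.50, 5 / 60)   # detected in 5 min
slow = excess_gpu_cost(5000, 3.50, 4.0)      # detected in 4 h
print(f"5-min detection: ${fast:,.0f}")        # ≈ $1,313
print(f"4-h detection:   ${slow:,.0f}")        # $63,000
print(f"multiplier:      {slow / fast:.0f}x")  # 48x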
SLO Incident Runbooks
p99 TTFT breach — SLO tightened without fleet expansion
MTTR p50 / p99: 20 min / 45 min. Blast radius: 1-in-100 users experiencing 2–5× normal latency. Enterprise tier risks SLA credit trigger. Social media complaints within 15 min if sustained.
- 1. Detect — Per-tier p99 TTFT alarm fires when p99 exceeds the SLO threshold for 2 consecutive 1-min windows. Alert routed to serving on-call.
- 2. Escalate — Serving on-call checks the fleet utilization dashboard; if utilization > 75%, page capacity on-call. If an SLO-tightening deploy is the cause, page the release manager.
- 3. Rollback — Revert the SLO config to the previous value (immediate, no fleet change needed). Alternatively, fast-track fleet scale-out (15–30 min warmup). MTTR: 20–40 min.
- 4. Post — Add a pre-rollout check: a fleet-at-new-SLO simulation must pass before the SLO config deploys. Backfill the simulation tool with the √(latency) model.
Cache invalidation storm — system prompt template rollout
MTTR p50 / p99: 30 min / 90 min. Blast radius: GPU effective QPS jumps 1.5–2× immediately. Prefill compute spikes; TTFT degrades for all users. The cost alarm should fire before the SLO alarm.
- 1. Detect — Cache-hit-rate alarm: fires within 1 min of hit rate dropping >20 pp below the 7-day rolling average. Cost-per-hour alarm: fires within 5 min of GPU cost exceeding baseline by 30%.
- 2. Escalate — On-call checks recent prompt template deploys in the deploy log. If a template change is found, escalate to product engineering for revert authorization.
- 3. Rollback — Revert the prompt template to the previous version. The cache warms passively as requests arrive with the old prefix (15–30 min for full recovery). MTTR: 30–60 min.
- 4. Post — Add a staging gate: prompt template changes must hold a stable cache-hit rate in staging for 10 min before production deploy. Add a cache-hit simulation to the deploy pre-check.
Fleet saturation at peak — utilization crosses 80%
MTTR p50 / p99: 15 min / 45 min. Blast radius: p99 TTFT climbs super-linearly (M/M/1 effect). At 90% utilization, queue wait is 9× mean service time. Consumer session abandonment begins at p99 > 5 s.
- 1. Detect — GPU utilization alarm at 70% threshold (not 80%) with a 5-min window. Autoscale trigger fires simultaneously. PagerDuty page to capacity on-call.
- 2. Escalate — Capacity on-call checks the traffic forecast vs. autoscale headroom. If autoscale is insufficient, page the capacity planning team for emergency spot-instance procurement.
- 3. Rollback — Enable load shedding (return 429 to free tier above the queue-depth threshold) to protect the paid-tier SLO while autoscale completes. Warm new GPU pods in parallel.
- 4. Post — Postmortem: was the traffic spike predicted in the capacity forecast? If not, update the forecast model. Raise the autoscale headroom by triggering at 60% utilization, not 70%.
Quick check
A prompt template rollout drops prefix cache hit from 50% to 5%. Effective GPU-path QPS was 15K before. What is it after, on a 30K total QPS system?
Company Lens — How Google, Amazon, and Anthropic Ask About SLOs
Google (SRE culture — error budgets and burn rates)
Google's SRE culture (formalized in the SRE book, 2016) treats SLOs as contracts backed by error budgets. Interviewers will probe whether you understand the budget-burn framing: an SLO of 99.9% gives you 43.8 minutes of downtime per month; a 10-minute outage burns 23% of that budget. Questions to expect: “How do you decide whether to freeze feature deploys when the error budget is 80% consumed?” (Answer: yes — that's the exact policy in the SRE book; feature development pauses and the team focuses on reliability work until the budget replenishes.) “Your p99 SLO says 500 ms. The service is at 480 ms p99 today. A new feature adds 40 ms median latency. Do you ship?” (Answer: depends on the tail amplification — median +40 ms can easily mean p99 +150 ms, which breaks the SLO. Run a latency regression test in staging before deciding.) Google interviews heavily weight the burn-rate alerting scheme from the SRE Workbook Chapter 5 — know the multi-window model (fast burn: 1h window, slow burn: 6h window).
Amazon (DynamoDB heritage — tail latency as a revenue signal)
Amazon's latency culture traces directly to the Dynamo paper and Werner Vogels' 2006 observation that every 100 ms of latency cost Amazon 1% of revenue (widely cited; original source: internal Amazon data presented at conferences, not a published paper). Amazon interviewers frame SLOs in terms of business impact, not engineering elegance. Questions to expect: "How do you quantify the revenue impact of a p99 regression before deploying a fix?" (Answer: A/B test with a 1% traffic split, measure session completion rate and conversion over 24 hours, price the delta.) "Your dependent service has a p99.9 SLO. Your service has a p99 SLO. What is your effective SLO when you call that dependency on every request?" (Answer: the combined failure rate is 1 − (1 − 0.01)(1 − 0.001) ≈ 1.1%, so the effective SLO is ~98.9% — worse than either individual SLO. This is the dependency-chaining SLO degradation problem that Amazon architects plan for explicitly with hedged requests and timeout budgets.) Amazon interviews emphasize the operational mechanics of cascade-amplified tail latency more than the theoretical foundations.
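The dependency-chaining arithmetic generalizes to any number of serial dependencies. A minimal sketch (hypothetical helper name):
Python: chained SLO degradation
import math
def chained_slo(*slos: float) -> float:
    """Effective SLO when every request touches all dependencies in series."""
    return math.prod(slos)
# A p99 service (0.99) calling a p99.9 dependency (0.999) on every request
print(f"{chained_slo(0.99, 0.999):.4f}")   # 0.9890 → ~98.9% effective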
Anthropic (safety-capacity link — SLO as a safety invariant)
Anthropic's interview culture treats SLOs differently from Google and Amazon: capacity is not purely a cost question because capacity constraints interact with safety behaviors. Questions to expect: “If your serving fleet is at 90% utilization during a peak event and you enable load shedding, which requests do you shed?” (Answer: NOT at random. Safety-critical moderation requests must be prioritized over general completion requests. Shedding moderation requests to protect TTFT SLO can create unmoderated output windows — this is unacceptable. The correct prioritization: safety classifier > enterprise > paid > free.) “A cost reduction proposal suggests cutting prefill capacity by 30%. What is the safety implication?” (Answer: prefill handles the safety classifier; cutting prefill capacity raises TTFT for the classifier and may allow responses to begin streaming before the safety check completes, depending on the pipeline architecture. The safety implication must be evaluated before the cost savings are credited.) Anthropic interviewers are explicitly looking for candidates who reason about SLOs as multi-constraint systems where cost, latency, and safety interact — not independent axes.
Key Takeaways
What to remember for interviews
- 1p99 is the SLO that matters. p50 hides tail users; p99.9 overbuilds for consumer products. p99 is where architecture decisions become load-bearing.
- 2Cutting p99 in half costs √2 ≈ 1.41× more GPUs, not 2×. The concavity of the √ law is the key number for SLO tradeoff discussions.
- 3Cache hit rate is the only free lunch. Every 10 pp increase in cache hit directly reduces GPU load with no quality tradeoff. A single bad prompt template deploy can drop cache hit from 50% to 5% overnight.
- 4Design for ≤70% utilization. Above 70%, M/M/1 wait time grows super-linearly and p99 blows past SLO on any traffic burst. The 30% headroom is not waste — it is SLO insurance.
- 5Different system classes have different cost slopes. Consumer chat benefits most from cache optimization; search/RAG starts from a lower cache baseline and has bigger upside; image/video gen is pure latency-cost tradeoff with near-zero cache benefit.
- 6Detection window is the dominant cost multiplier. A 5-min vs. 4-hour detection window can be a 48× difference in incident cost for cache invalidation events.
Interview Questions
An interviewer asks: "If we cut p99 TTFT from 3 s to 1.5 s, how much more does that cost?" Walk through your derivation. ★★★
Your cache hit rate drops from 55% to 20% overnight. How does that change your GPU fleet sizing, and what caused it? ★★★
At 80% GPU utilization, p99 latency is 2.2× p50. At 50% utilization, it is 1.3×. Why, and what is the threshold you should design around? ★★★
You have a system with p99 = 120 s and high variance (image generation). The product team wants a p99 SLA commitment. How do you price and architect it? ★★☆
Recap quiz
SLO ↔ Cost recap
A consumer chat fleet runs 5,000 H100s at p99 TTFT = 3 s. The PM asks to hit p99 = 750 ms (4× tighter). Using the √(latency) rule, how many GPUs are needed?
The consumer chat sandbox shows cache hit at 50%, p99 = 3 s, 5,000 GPUs. To save 20% GPU cost, which lever is cheaper: raise cache hit by 20 pp or relax p99 to 4.5 s?
Your 99.9% monthly SLO gives 43.8 min of error budget. You detect 50 errors in 10K requests over 1 hour. Should you page on-call?
GPU fleet utilization rises from 70% to 90% during a traffic spike. By M/M/1, how does mean queue wait change, and what happens to p99?
Why is cache hit rate nearly irrelevant for image/video generation but critical for consumer chat?
A prefix cache invalidation causes GPU costs to spike 1.9× for 4 hours before detection. At 5K GPUs × $3.50/hr, what is the excess cost vs. detecting in 5 minutes?
A product manager wants to offer a p99.9 SLO on a consumer chat product. What is the primary cost argument against it relative to a p99 SLO?
Further Reading
- Gil Tene — How NOT to Measure Latency (QCon 2015) — The canonical talk on why averages and even p95 lie, and why p99/p99.9 are the only metrics that capture the user's experience. The HDR histogram argument is mandatory background for SLO design.
- DeCandia et al. — Dynamo: Amazon's Highly Available Key-value Store (SOSP 2007) — The paper that defined SLO-driven design at scale. Section 4 on the latency-at-p99.9 requirement and its architectural implications is the playbook this module derives from.
- Google SRE Book — Load Balancing at the Frontend — The M/M/1 queueing argument and the 70% utilization cap are made explicit here. The error-budget math in the SLO chapter pairs with this module's queueing deep dive.
- Lilian Weng — Large Transformer Model Inference Optimization — The best single reference for how batch size, memory bandwidth, and latency interact at the hardware level — the physical grounding for the √(latency) derivation.
- Pope et al. — Efficiently Scaling Transformer Inference (Google, 2022) — First-principles analysis of memory bandwidth vs. compute bottlenecks in large model serving. The paper that grounds the batch-size/latency tradeoff in hardware arithmetic.