
Transformer Math

Module 77 · Design Reviews

💎 Case: Design Gemini

1M-token context is cheap to promise, expensive to serve — here's the bill


Gemini reflects Google's distinct bets: long context, multimodality, TPU serving, and a heavier safety/control plane. It is not GPT-4 on Google hardware. Each bet changes the cost structure and creates a different failure mode, which is why this system is useful in interviews.

📋

Requirements & SLOs

Working backwards from the user

“A Gemini user opens Google AI Studio or the Gemini app, pastes a 200-page PDF, and asks a question. They expect an answer in under five seconds. A developer queries the API with a vision request containing ten product images; they expect sub-second time-to-first-token so their app feels responsive. An enterprise customer running Gemini via Vertex AI expects contractual uptime and data-residency guarantees. The free tier should handle casual use without a credit card. The service must work for code generation, image captioning, long-document Q&A, and real-time voice — simultaneously, on the same infrastructure.”

SLO table — Gemini multi-tier targets

Metric | Target | Why this value
p50 TTFT — short context (<32K tokens) | <400 ms | Sub-400 ms feels interactive; Flash tier targets <200 ms for the majority of short queries (community estimate based on measured Gemini Flash latencies)
p99 TTFT — long context (128K–1M tokens) | 4–8 s | 1M-token prefill at 1K tok/ms theoretical throughput = 1,000 ms minimum; real systems add scheduling + KV-cache load; 4–8 s p99 reflects production-observed behavior (community estimate)
p99 TTFT — multi-modal (image input) | 1,500 ms | Image encoding adds 50–150 ms per batch (inferred from ViT benchmarks on comparable TPU); 10-image query ≈ 15 ms parallel + prefill overhead
Median decode token rate | Faster than reading speed | Flash tier targets the higher end; Pro trades rate for batch efficiency (community estimate)
Availability — Gemini API (paid) | 99.9% (~43 min downtime/month) | Vertex AI Enterprise adds 99.95% with SLA credits, per the Google SRE error-budget framework
Cost ceiling — long-context query (1M tokens) | ~$0.48/turn on the input side (at 90% prefix-cache hit) | Gemini 2.5 Pro standard pricing for prompts >200K tokens is $2.50/M input and $0.25/M cached input as of April 2026; with a 90% prefix-cache hit on repeated context, 0.1M × $2.50/M + 0.9M × $0.25/M ≈ $0.48/turn
Quality floor — MMLU / MATH (community benchmark) | Pro ≥85% MMLU; Flash ≥75% MMLU | Minimum bar to claim frontier-model status (per Google Gemini report, 2023); Flash must not drop below 75% MMLU to remain competitive in the low-cost tier
✨ Insight · Non-obvious SLO choice: separate latency targets by context length. Most serving systems define a single p99 TTFT. Gemini cannot — the difference between a 1K-token and 1M-token query is three orders of magnitude in prefill compute. A single p99 number would be dominated by the long-context tail and would mask regressions on the short-context path that serves 95%+ of queries. The right design is separate SLO buckets: <32K, 32K–128K, 128K–1M. This follows directly from the Google SRE Book (Chapter 4) recommendation to define SLOs for distinct user populations and workload classes, not aggregate service behavior.

Quick check

Derivation

Gemini's p99 TTFT for 1M-token context is 4,000–8,000 ms. If the 99.9% monthly SLO allows 43 min downtime and a single long-context OOM incident lasts 5 min, how many such incidents exhaust the monthly budget?
🧪

Eval Harness — design before architecture

Gemini's eval problem is harder than a text-only system because quality now spans modalities (does the image description match the image?), reasoning depth (does the thinking-mode answer outperform the standard answer on hard math?), and context fidelity (does the answer to turn 10 correctly reference a constraint stated in turn 2, from a 500K-token context?). Each dimension requires a separate golden set. Design all three before touching architecture — the eval gaps will show up as production incidents if you skip them.

Level 1 — Unit tests (every deploy)

  • Routing correctness — assert that short-context queries land on Flash, long-context queries on Pro, and thinking-flagged queries activate the thinking budget path. A routing regression that sends Pro traffic to Flash is silent in latency metrics but obvious in quality scores.
  • Image-token count — assert that a canonical test image produces exactly the expected token count (±5%). A change to the image encoder that silently increases token count inflates billing and can push queries over context limits without warning. (A test sketch of this check and the next follows this list.)
  • Prefix-cache hit rate on canonical system prompt — two identical consecutive API calls with the same system prompt must show a cache hit on the second call. A serialization format change that breaks KV-cache key computation is detectable at this level.
  • Safety gate smoke test — a fixed set of clearly policy-violating inputs (blocked) and clearly benign medical/legal inputs (allowed). A deploy that raises the false-positive rate on benign medical queries is a silent quality regression with real user impact.
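
A minimal sketch of two of these Level-1 gates in pytest style. The helpers are illustrative stubs standing in for whatever internal serving client the team exposes (not a real Gemini API); the 258-token expectation comes from the patch-grid math in Deep Dive 2.

```python
# Minimal sketch of two Level-1 deploy gates. The helpers are illustrative stubs
# standing in for the real serving client; only the assertions matter.

EXPECTED_IMAGE_TOKENS = 258   # 16x16 patch grid + 2 special tokens (see Deep Dive 2)
TOLERANCE = 0.05              # +/-5% drift budget from the spec above

def count_image_tokens(image_path):
    # Stub: a real test would call the production image tokenizer on a golden image.
    return 258

_prefix_cache = set()

def serve(system_prompt, user):
    # Stub: a real test would issue two API calls and read the cache-hit flag
    # from the response metadata.
    hit = system_prompt in _prefix_cache
    _prefix_cache.add(system_prompt)
    return {"cache_hit": hit}

def test_canonical_image_token_count():
    tokens = count_image_tokens("golden/canonical_224x224.jpg")
    assert abs(tokens - EXPECTED_IMAGE_TOKENS) <= EXPECTED_IMAGE_TOKENS * TOLERANCE

def test_prefix_cache_hit_on_second_identical_call():
    serve("canonical system prompt", "ping")
    assert serve("canonical system prompt", "ping")["cache_hit"]
```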

Level 2 — LLM-judge eval (weekly)

  • Multi-modal golden set (~500 image + text pairs) — stratified across: product images with captions, scientific diagrams with explanations, charts with quantitative questions, and handwritten text transcription. LLM judge scores grounding accuracy (does the answer reference the correct visual element?), factual accuracy, and response completeness.
  • Long-context needle-in-haystack test — following the methodology in the Gemini 1.5 paper (arXiv:2403.05530), inject a specific fact at a random position in a 128K, 512K, and 1M-token context and ask the model to retrieve it. Pass rate at each context length is the metric. A regression at 512K but not 128K points to a KV-cache eviction policy bug, not a model regression.
  • Thinking-mode delta score — for a hard math/reasoning golden set (AIME, AMC, competition-style problems), compute the quality delta between standard mode and thinking mode at a fixed thinking-token budget (e.g., 4K tokens). A positive delta confirms the thinking path adds value. A near-zero delta means the budget is too low or the model has not learned to use it effectively at that depth.
  • Over-refusal calibration — two golden sets: adversarial (~500 known policy violations, target near-100% refusal) and benign-but-sensitive (~500 medical, legal, security queries, target near-0% refusal). The over-refusal rate on the benign set is the primary safety-quality tradeoff metric. A 2% false-positive rate at 50K QPS / 5% medical share = 180,000 wrongly blocked queries per hour (the arithmetic is sketched below).
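
A two-line back-of-envelope for that false-positive figure, using this section's own assumptions (peak QPS, medical share, FP rate); none of these are measured Gemini numbers.

```python
# Back-of-envelope for the over-refusal figure above; all inputs are this
# section's assumptions, not measured Gemini numbers.
qps = 50_000          # peak queries per second
medical_share = 0.05  # fraction of traffic that is benign medical/legal/security
fp_rate = 0.02        # wrong-refusal rate on that slice

wrong_blocks_per_sec = qps * medical_share * fp_rate    # 50 per second
wrong_blocks_per_hour = wrong_blocks_per_sec * 3600     # 180,000 per hour
print(wrong_blocks_per_sec, wrong_blocks_per_hour)
```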
Quick Check

Gemini's long-context needle-in-haystack test passes at 128K tokens but fails at 512K. Which system component is most likely responsible?

Live-traffic shadow eval

A 5% sample of production queries (privacy-scrubbed) is routed through both the current model and a candidate model in shadow mode. LLM-judge compares the two responses on the same input. This catches regressions that the static golden set misses because it samples real-world distribution including new query types that emerged after the golden set was created. The shadow eval result must move in the same direction as the golden-set eval before any model promotion to production — if they diverge, the golden set has distribution drift and needs refreshing.

🧮

Back-of-Envelope Capacity

Google Search handles ~8.5 billion queries per day (per Statista, 2024). Gemini integration means a material fraction of those queries will route through the language model pipeline. Even if only 5% become AI-augmented, that is 425M LLM queries/day, or roughly 5,000 QPS average with a 10x peak-to-average ratio at 50,000 QPS. The numbers below are (community estimates) derived from public pricing, observed latency benchmarks, and TPU v5p performance specs. They use Gemini 2.5 Pro as the current public flagship proxy; Flash and Flash-Lite run materially cheaper per query.

TPU chip equivalence note: A single TPU v5p pod contains 8,960 chips (per Google Cloud documentation). Each chip has ~460 TFLOPS BF16. For H100 comparison, an H100 SXM delivers ~989 TFLOPS BF16. One H100 ≈ 2.15 TPU v5p chips in raw FLOP terms. The baseline below uses “TPU chip equivalents” normalized to H100 pricing ($4/hr is approximate for on-demand H100; TPU v5p pricing from Google Cloud is $4.20/chip-hour as of 2025 — using $4 as the conservative estimate for this sandbox).

Baseline, Gemini Pro flagship fleet (community estimate): 10,000 TPU-chip-equivalents at ~$4/hr, 4,000 ms p99 TTFT for long-context queries, 50,000 peak QPS, and a 40% prefix-cache hit rate on conversation history. Tightening the latency SLO or losing cache hit rate moves fleet cost directly; the readout below is the baseline point.

p99 latency target | 4,000 ms
Peak QPS | 50,000
Cache hit rate | 40%
Effective QPS (after cache) | 30,000
Latency-batch factor | 1.00×
Chip-equivalents needed | 10,000
Hourly burn | $40,000
Cost / request | ~$0.00022
Monthly burn (24×7) | ~$29,200,000
Bottleneck | Balanced
⚠ Warning · Gotcha: The 40% cache hit rate assumes most queries are multi-turn sessions where the conversation prefix is already cached. For single-turn API queries (no prior context), cache hit rate drops near 0% — the effective fleet cost for a pure single-turn workload at these QPS is 1.67x higher than the baseline shows.
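
The same baseline as a small calculator, so the sensitivity in the gotcha above can be checked directly; all inputs are the (community estimate) figures from this section, not Google numbers.

```python
# The capacity baseline as a toy calculator (community-estimate inputs, not Google's).
chips = 10_000            # TPU-chip equivalents
chip_hourly = 4.0         # $/chip-hour
peak_qps = 50_000
cache_hit = 0.40          # prefix-cache hit rate on conversation history

hourly_burn = chips * chip_hourly                   # $40,000 / hr
monthly_burn = hourly_burn * 730                    # ~= $29.2M at 24x7
effective_qps = peak_qps * (1 - cache_hit)          # 30,000 QPS reach the prefill path
cost_per_request = hourly_burn / (peak_qps * 3600)  # ~= $0.00022

# Gotcha above: pure single-turn traffic (0% cache hit) needs
# 1 / (1 - 0.40) ~= 1.67x the prefill capacity of this baseline.
single_turn_factor = 1 / (1 - cache_hit)
print(hourly_burn, monthly_burn, effective_qps,
      round(cost_per_request, 5), round(single_turn_factor, 2))
```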

Reverse-engineered cost breakdown (original research artifact)

The table below reconstructs Gemini's per-query economics from first principles. All assumptions are explicit; none are copy-pasted from any published source. The goal is to understand what levers Google has to reduce cost — not to guess their actual P&L.

Cost component | Assumption | $ / 1M input tokens | Notes
Prefill compute (HBM hit) | ~1K tok/s per chip on TPU v5p (a ~1,000-chip slice delivers the 1K tok/ms aggregate); $4/chip-hr | $1.11 | 1M tokens at ~1K tok/s per chip ≈ 1,000 chip-seconds, i.e. ~1,000 ms wall-clock prefill on a ~1,000-chip slice; 1,000 chip-s × ($4 / 3,600 s) ≈ $1.11/M tokens
KV-cache DRAM tier | ~2 KB/token/layer × 128 layers × 1M tokens ≈ 256 GB | $0.18 | 256 GB DRAM at ~$0.10/GB-hr (server DRAM amortized, community estimate); a 1M-token context held for a ~25 s average session → 256 × 0.10 × (25/3600) ≈ $0.18 per session
Safety classifier chain | 3 passes (pre + post + fine-grain); 5 ms each on a dedicated chip | $0.017 | 15 ms total classifier time; at 1K avg input tokens it is negligible per query; at 1M tokens it is 15 ms / 1,000 ms × $1.11 ≈ $0.017
Image encoding (10 images) | 15 ms parallel on 1 TPU slice | $0.000017 | 15 ms × ($4 / 3,600 s) ≈ $0.000017/query; amortized to rounding error per 1M text tokens unless the query is image-heavy (≥100 images), where it grows to ~$0.0017
Total (no cache) | Sum of the rows above | ≈ $1.31 | $1.11 + $0.18 + $0.017 ≈ $1.31. Public Gemini 2.5 Pro input pricing is $1.25–$2.50/M depending on prompt length as of April 2026; if this cost model is directionally right, the remaining spread has to come from internal TPU economics or monetization mix — exact gross margin is not public
Total (40% cache hit) | Prefill cost reduced by the cache hit rate | ≈ $0.86 | Prefill: $1.11 × (1 − 0.40) ≈ $0.67; others unchanged; total ≈ $0.86. Cache is the dominant cost lever — each 10-percentage-point improvement saves $1.11 × 0.10 ≈ $0.11/M tokens
⚠ Warning · Assumptions in the above table: All compute efficiency figures are (community estimate) derived from public TPU v5p specs and measured API latencies. Google's actual cost structure is proprietary. The implied margin is a floor — it does not include networking, cooling, datacenter amortization, or team costs. The table's value is the relative magnitudes and sensitivities, not the absolute numbers.
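
The cost table as a small sketch, using only the table's own (community estimate) components, which makes the cache-hit lever easy to sweep.

```python
# Per-1M-input-token cost model from the table above; every constant is the
# table's own (community estimate) assumption, not a Google figure.
PREFILL_NO_CACHE = 1.11    # $/M input tokens, prefill compute
KV_DRAM_TIER     = 0.18    # $ per 1M-token session (DRAM residency), counted per M tokens
SAFETY_CHAIN     = 0.017   # $/M tokens at 1M-token context
IMAGE_ENCODING   = 0.00002 # negligible unless the query is image-heavy

def cost_per_million_tokens(cache_hit_rate):
    prefill = PREFILL_NO_CACHE * (1 - cache_hit_rate)
    return prefill + KV_DRAM_TIER + SAFETY_CHAIN + IMAGE_ENCODING

print(round(cost_per_million_tokens(0.0), 2))   # ~1.31 (no cache)
print(round(cost_per_million_tokens(0.4), 2))   # ~0.86 (baseline 40% hit)
# Each 10-point cache-hit improvement saves 1.11 * 0.10 ~= $0.11 / M tokens.
```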

Quick check

Derivation

The cost model shows prefill at $1.11/M tokens and cache-hit prefill saving 40% → $0.67/M. If single-turn API traffic (0% cache hit) suddenly replaces multi-turn sessions at 50K QPS, by what factor does the effective fleet cost increase?
🏛️

Architecture

The diagram below shows the Gemini-specific topology: three model tiers (Flash-Lite / Flash / Pro), a parallel multimodal encoder, a long-context store with KV tiering, a thinking-budget path, and a three-stage safety classifier chain. Compare to the ChatGPT architecture — the primary structural differences are the TPU-native encoder path and the KV-cache tiering required for 1M-token contexts.

Gemini Multi-Modal Serving Architecture

Three-tier model routing, a parallel multimodal encoder, a three-stage safety classifier, and a thinking-budget path are the Gemini-specific additions to the generic topology.

Nodes: Client (SSE / WebSocket) · API Gateway · Multimodal Encoder · Safety Classifier (pre) · Long-Context Store · Model Router · Gemini Flash Fleet · Gemini Pro Fleet · Thinking Budget · Safety Classifier (post) · Response Stream · Human Feedback Queue

Component justification — Gemini-specific notes

Component | Why it exists | Failure if removed
Multimodal Encoder (parallel) | Image/video/audio must be converted to token embeddings before the language model can process them; runs in parallel with text prefill to hide encoding latency | Serial encoding adds 50–150 ms per image batch to TTFT; at 10 images, this doubles p99 TTFT on multi-modal queries
Three-tier KV cache | 1M-token context ≈ 256 GB of KV state per session (~2 KB/token/layer × 128 layers × 1M tokens); HBM cannot hold this for more than a handful of concurrent sessions, so a DRAM warm tier + SSD cold tier are required | Without tiering, >3 concurrent 1M-token sessions OOM the serving node; admission control must reject additional sessions rather than degrade gracefully
Thinking Budget enforcer | Gemini 2.5 public pricing counts thinking tokens inside the output-token bill; without a per-request cap, a single hard query can add 32K+ billed output/thinking tokens, turning a cheap Flash-tier request into something much closer to a Pro request (a sketch of the enforcer follows this table) | Unbounded thinking: runaway cost on adversarial inputs, unbounded p99 latency, and billing unpredictability that breaks enterprise customers' cost controls
Safety classifier chain (3-stage) | Pre-filter (fast, high-recall) blocks clear violations before expensive generation; post-filter (context-aware) catches context-dependent harms the pre-filter cannot see; fine-grain (optional, targeted) is applied to specific categories (CSAM, bio-hazard) with near-zero tolerance | Single-stage: either too slow (if context-aware) or too many false negatives (if fast); three stages is the cost-quality Pareto: cheap for clear cases, expensive only for ambiguous ones
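
A toy sketch of the thinking-budget enforcer referenced in the table above. The delimiter strings and the streaming interface are assumptions for illustration, not Gemini's internals; a production enforcer would also signal the decoder to stop spending compute on reasoning tokens rather than merely dropping them.

```python
# Toy thinking-budget enforcer: count tokens emitted inside the reasoning
# scratchpad and force a transition to output mode once the budget is spent.
# Delimiters and the token-stream interface are illustrative assumptions.
THINK_START, THINK_END = "<think>", "</think>"

def enforce_thinking_budget(token_stream, budget):
    """Pass tokens through; once `budget` thinking tokens have been emitted,
    close the scratchpad and drop any further reasoning tokens."""
    thinking = False
    capped = False
    spent = 0
    for tok in token_stream:
        if tok == THINK_START:
            thinking = True
            yield tok
        elif tok == THINK_END:
            thinking = False
            if not capped:
                yield tok        # if capped, the closing delimiter was already injected
            capped = False
        elif thinking:
            if capped:
                continue         # over budget: discard further reasoning tokens
            spent += 1
            yield tok
            if spent >= budget:
                yield THINK_END  # force the transition to output mode
                capped = True
        else:
            yield tok

# Toy stream: 4 reasoning tokens against a 2-token budget.
stream = [THINK_START, "r1", "r2", "r3", "r4", THINK_END, "final", "answer"]
print(list(enforce_thinking_budget(iter(stream), budget=2)))
# ['<think>', 'r1', 'r2', '</think>', 'final', 'answer']
```

The counter in this loop is exactly what the Failure-3 unit test later in the module asserts against: an off-by-one in delimiter handling silently truncates every thinking-mode response.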

Quick check

Derivation

An H100 has 80 GB HBM. At ~1 GB KV per 1M-token session (GQA), and reserving 30 GB for model weights, how many concurrent 1M-token sessions fit before admission control must engage?
🔬

Deep Dives


Deep Dive 1 — The 1M-Token Context Window: KV Cache Math and What It Actually Costs

The Gemini 1.5 paper (arXiv:2403.05530) demonstrated a 1M-token context window with >99% recall in needle-in-haystack tests across the full window. This is architecturally significant not because of the model capability — but because of what it does to the serving cost structure.

KV cache memory derivation. For a transformer with L layers, key/value head dimension d, and H KV heads, each input token requires 2 × L × d × H × bytes_per_element bytes of KV cache — the factor of 2 covers key and value. For Gemini Pro (architecture not publicly disclosed; using a comparable 540B-scale model as a (community estimate) proxy: L ≈ 118 layers, d = 128, H = 16 KV heads for grouped-query attention, float16 = 2 bytes):

Raw fp16 KV (per token): 2 × 118 layers × 16 KV heads × 128 dim × 2 bytes = 966,656 bytes ≈ 944 KB per token
1M tokens × 944 KB ≈ 944 GB per session (raw, uncompressed)
After INT4 KV quantization (×0.25), sliding-window K/V keeping the last ~256K tokens dense (×0.25), and cross-user prefix sharing of the system+document prelude (×0.04 amortized):
944 GB × 0.25 × 0.25 × 0.04 ≈ 2.4 GB, i.e. on the order of 1 GB of effective KV state per active session (the working figure used below)

Note: with full multi-head attention (H_kv = H_q = 64), the raw number quadruples to ~3.8 MB/token and ~3.8 TB per session. Grouped-query attention (GQA), described in Ainslie et al. (2023, arXiv:2305.13245), reduces KV heads by a factor of H_q / H_kv (4× under the assumptions above), but GQA alone does not bring 1M-token serving into HBM range — the quantization + sliding-window + prefix-sharing stack does. The Gemini report confirms use of multi-query variants but does not disclose exact group size or the production compression stack (labeled as proprietary).
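
The derivation above as a sketch, using the same (community estimate) architecture constants; the GQA-vs-MHA ratio and the compression product fall out directly.

```python
# KV-cache footprint under the (community estimate) constants above.
layers, kv_heads, head_dim, bytes_fp16 = 118, 16, 128, 2

bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16   # K and V planes
raw_gb = bytes_per_token * 1_000_000 / 1e9                        # 1M-token session

int4 = 0.25           # INT4 KV quantization
window = 0.25         # keep only the last ~256K of 1M tokens dense
prefix_share = 0.04   # cross-user sharing of the system+document prelude (amortized)
effective_gb = raw_gb * int4 * window * prefix_share

mha_multiplier = 64 / kv_heads   # full MHA (64 KV heads) vs GQA (16 KV heads)
print(bytes_per_token, round(raw_gb), round(effective_gb, 1), mha_multiplier)
# 966656 967 2.4 4.0
# (raw ~= the ~944 GB above, modulo KB/KiB rounding; effective is ~2.4 GB,
#  i.e. order ~1 GB per active session as used in the rest of the module)
```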

Three-tier KV cache architecture. At ~1 GB effective per session, an H100 with 80 GB HBM can hold at most ~50 concurrent 1M-token sessions in HBM alone — after reserving ~30 GB for model weights. That is a hard capacity ceiling. The solution is tiered KV cache: HBM holds the most-recently-accessed layers (hot tier), DRAM holds warm context (accessed in the last few turns), and NVMe SSD holds cold context (earlier in the conversation). On cache miss, the serving node fetches from DRAM (~100 ns) or SSD (~100 µs), introducing a latency penalty proportional to the number of evicted layers that must be reloaded.

Positional encoding: YaRN extends beyond training length. Standard RoPE (Rotary Position Embedding) degrades catastrophically on sequences longer than the training context length because the rotary frequencies have not been calibrated for those positions. YaRN (Yet Another RoPE extensioN, Peng et al., 2023, arXiv:2309.00071) applies a temperature-scaled interpolation that spreads the positional encoding smoothly over the extended range. Gemini 1.5's training methodology is proprietary, but the paper acknowledges that extending to 1M tokens required specialized positional encoding treatment beyond standard RoPE (labeled in the paper as a training-time adjustment without architectural specifics).

Real-world incident: the DRAM tier eviction bug. In a hypothetical (but architecturally realistic) failure: a serving node receives a 1M-token session. The hot HBM tier holds layers 0–80. Layers 81–117 are evicted to DRAM. On turn 5, the user asks a question about context from turn 1 — content that is now in layers 81+. The eviction policy incorrectly marks those layers as cold (due to a recency-tracking bug) and moves them to SSD. Fetching from SSD adds 100 µs per layer × 37 cold layers = 3.7 ms per attention step. For a 500-token output, the total decode latency penalty is 500 steps × 3.7 ms = 1.85 seconds — a p99 end-to-end latency regression of nearly 2 seconds with no model change. Detection: monitor KV-cache hit rate separately by tier; an SSD-hit spike without an HBM-size change is the signal. This failure mode is unique to long-context systems and does not appear in ChatGPT-scale case studies.
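
A quick check of the eviction-bug arithmetic above; the 100 µs SSD fetch time is this section's assumption.

```python
# Arithmetic behind the eviction-bug incident above.
ssd_fetch_s = 100e-6      # per layer fetched from the cold tier (assumption)
cold_layers = 37          # layers 81-117 wrongly demoted to SSD
output_tokens = 500

per_step_s = ssd_fetch_s * cold_layers        # ~0.0037 s per decode step
total_penalty_s = per_step_s * output_tokens  # ~1.85 s over the full response
print(per_step_s * 1000, total_penalty_s)     # ~3.7 ms, ~1.85 s
```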

Deep Dive 2 — Multi-Modal Fusion: Image Token Budget and the 258-Token Question

The Google Gemini technical report states that images are represented as a fixed token count in the context window. The Gemini API documentation specifies approximately 258 tokens per standard (non-tiled) image. This number is not arbitrary — it corresponds to a 16×16 patch grid plus two special tokens.

Patch grid math. A ViT-style encoder divides an image into non-overlapping patches. For a 224×224 input image with 14×14 patches, the grid is 16×16 = 256 patches. Adding 2 special tokens (image start/end or CLS + SEP) gives 258. Larger images (e.g., 1024×1024) can be tiled — Google's documentation mentions variable token counts for high-resolution inputs. At the maximum, a high-resolution image can consume ~1,290 tokens (5× tiling). For a 10-image product catalog at standard resolution: 10 × 258 = 2,580 image tokens. That is 2.6K of a 32K context window — manageable. But at high resolution or with video (24 fps × N seconds × 258 tokens/frame), the context window fills rapidly.

Video token budget example. A 10-second video at 1 frame per second = 10 frames × 258 tokens = 2,580 tokens. At 24 fps: 240 frames × 258 tokens = 61,920 tokens — nearly 62K tokens for 10 seconds of video. This is a quadratic attention cost and a linear KV-cache cost. For a 1-minute video at 24 fps: 1,440 frames × 258 ≈ 372K tokens — consuming more than a third of the 1M-token window. The system must decide whether to downsample frame rate (lossy) or charge the full token cost (expensive). Google's approach is adaptive frame sampling — the encoder selects the most semantically distinct frames rather than sampling uniformly (described qualitatively in the Gemini 1.5 paper; exact algorithm is proprietary, labeled as (inferred from paper)).
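
The patch-grid and video budgets above as a sketch; the 258-token constant and the frame rates are this section's assumptions, and the results are upper bounds since real serving uses adaptive frame sampling.

```python
# Patch-grid and video token budgets from the two paragraphs above.
image_size, patch = 224, 14
patches = (image_size // patch) ** 2      # 16 x 16 = 256
image_tokens = patches + 2                # + start/end specials = 258

catalog_tokens = 10 * image_tokens        # 10-image catalog ~= 2,580 tokens

def video_tokens(seconds, fps, tokens_per_frame=image_tokens):
    return int(seconds * fps * tokens_per_frame)

print(image_tokens, catalog_tokens)            # 258 2580
print(video_tokens(10, 24))                    # 61920  (10 s at 24 fps)
print(video_tokens(60, 24) / 1_000_000)        # ~0.37 of a 1M-token window
```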

Cross-modal attention vs late fusion. Gemini's architecture uses a natively multimodal approach — image tokens are interleaved with text tokens in the same context window and attend to each other via standard self-attention from the earliest layers (per the Gemini report). This is architecturally different from late-fusion models (e.g., LLaVA-style) that encode images separately and inject the embedding at a single layer. Native multimodal attention enables richer cross-modal grounding but means the full attention matrix must cover both modalities, making the cost proportional to (text_tokens + image_tokens)^2 in the naive case — mitigated by flash attention and GQA.

Real-world incident: image token count inflation. Google's Vision API had a documented incident in 2023 where a change to the image resizing pipeline caused images to be upscaled before tokenization, inflating token counts by 4–5x for common web images. In a Gemini-equivalent system, this would: (1) silently exceed context limits for users sending multiple images; (2) inflate billing by 4–5x per image without user notification; (3) increase GPU memory usage 4–5x per image, reducing maximum concurrent sessions. The unit-test mitigation is an exact-token-count assertion on a canonical test image — catching this class of regression before production.

Deep Dive 3 — TPU v5p vs H100: Different Performance Cliffs and What They Mean for System Design

Google's primary inference hardware is TPU v5p (and the newer v6e “Trillium”). OpenAI and most external inference providers use NVIDIA H100. These are not interchangeable — they have different performance cliffs that change system design decisions.

TPU v5p vs H100 comparison (community estimates from public specs):

Dimension | TPU v5p | H100 SXM | Implication
Peak BF16 TFLOPS | ~460 | ~989 | H100 is ~2.15x faster on raw compute; favors compute-bound workloads (large batches)
HBM bandwidth | 2.76 TB/s | 3.35 TB/s | H100 advantage smaller on bandwidth; memory-bound decode phase more similar
Inter-chip interconnect | ICI: 4.8 Tb/s (pod) | NVLink: 900 GB/s (8-GPU server) | TPU ICI scales to multi-thousand-chip pods with near-flat bandwidth; H100 NVLink is server-scoped. TPU wins at 1,000+-chip tensor parallelism
Memory per chip | ~95 GB HBM | 80 GB HBM | TPU v5p has ~19% more HBM; slightly larger KV-cache hot tier per chip
Batch size cliff | Optimal: batch 64–256 | Optimal: batch 32–128 | TPU benefits more from larger batches; forces different scheduling decisions for low-QPS workloads

The practical cliff: TPU decode throughput degrades at small batch sizes. TPU's matrix unit is optimized for large, regular tensor shapes. For autoregressive decode with batch size 1 (common in low-latency consumer scenarios), the TPU's systolic array is severely underutilized — each decode step processes a 1×d_model matrix multiplication instead of a B×d_model one. H100 handles small batches more efficiently due to its more flexible execution model. This means TPU inference is more cost-efficient at high QPS (large batches) and less efficient at low-latency single-request scenarios. Google's response: continuous batching that aggregates individual decode steps across multiple concurrent requests into a single large-batch decode step — the same technique as vLLM's continuous batching, but implemented in XLA.
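
A toy continuous-batching loop to make the scheduling idea concrete: admit new requests between decode steps, run one batched forward pass over all active sequences, and retire finished ones. This is a simplified sketch of the generic technique, not Google's XLA scheduler; the class and function names are illustrative.

```python
# Toy continuous-batching loop: every decode step is one batched forward pass
# over all active sequences (a B x d_model problem instead of 1 x d_model),
# new requests are admitted between steps, finished ones retire immediately.
from collections import deque

class Seq:
    def __init__(self, prompt, max_new):
        self.prompt, self.tokens, self.max_new = prompt, [], max_new

def decode_step(batch):
    # Stand-in for one batched forward pass; returns one new token per sequence.
    return [f"tok{len(s.tokens)}" for s in batch]

def serve_continuously(waiting, max_batch):
    active = []
    while waiting or active:
        while waiting and len(active) < max_batch:   # admit between steps
            active.append(waiting.popleft())
        for s, tok in zip(active, decode_step(active)):
            s.tokens.append(tok)
        active = [s for s in active if len(s.tokens) < s.max_new]  # retire finished

requests = deque(Seq(f"p{i}", max_new=3 + i % 2) for i in range(5))
serve_continuously(requests, max_batch=4)
```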

Real-world incident: TPU pod configuration change causing decode regression. Google published a paper (Jouppi et al., 2023, “TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings”) describing how inter-chip link (ICI) topology changes between TPU generations can require recompilation of XLA compute graphs. In a serving system, a fleet upgrade from v4 to v5p requires re-profiling the optimal tensor parallel degree and batch size — a configuration that was optimal on v4 (e.g., 4-way tensor parallel, batch 64) may be suboptimal on v5p (optimal: 8-way TP, batch 128). A fleet upgrade without re-profiling can deliver 30–40% lower throughput than the hardware upgrade suggests, temporarily degrading SLOs during the transition window.

💥

Break It — Failure Scenarios

Four failure modes that are specific to Gemini's design choices. Each one is a direct consequence of the architectural bets in the previous sections.

Failure 1: 1M-token context KV-cache eviction storm

Trigger: A viral use case (e.g., “analyze entire GitHub repo”) suddenly creates hundreds of concurrent 1M-token sessions. HBM fills within seconds. DRAM warm tier fills within minutes. Eviction pressure cascades to SSD.

Blast radius: All concurrent long-context sessions see 2–10 second decode latency spikes as KV cache reloads from SSD. The serving node CPU is pinned on I/O management. Short-context queries on the same node are starved if compute is shared.

Fix: Admission control — cap concurrent 1M-token sessions per node to HBM capacity / (1 GB per session). Queue excess sessions with a user-visible wait indicator rather than degrading all sessions simultaneously. Separate node pools for long-context and short-context queries.
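
A minimal sketch of that admission-control policy, using the module's working figures (80 GB HBM, ~30 GB for weights, ~1 GB KV per 1M-token session); the class and method names are illustrative, not a real serving API.

```python
# Sketch of the Failure-1 fix: cap concurrent 1M-token sessions per node at
# (HBM - weights) / per-session KV, and queue the rest instead of degrading everyone.
HBM_GB, WEIGHTS_GB, KV_PER_SESSION_GB = 80, 30, 1.0

MAX_LONG_SESSIONS = int((HBM_GB - WEIGHTS_GB) // KV_PER_SESSION_GB)   # 50

class LongContextAdmission:
    def __init__(self, cap=MAX_LONG_SESSIONS):
        self.cap, self.active, self.queue = cap, 0, []

    def admit(self, session_id):
        if self.active < self.cap:
            self.active += 1
            return "ADMIT"
        self.queue.append(session_id)   # user-visible wait, not fleet-wide degradation
        return "QUEUE"

    def release(self, session_id):
        self.active -= 1
        if self.queue:                  # promote the oldest queued session
            self.queue.pop(0)
            self.active += 1
```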

Failure 2: Multimodal encoder crash causes silent text-only fallback

Trigger: A bug in the image encoder (OOM, segfault, CUDA/XLA error) causes the encoder pod to crash-loop. The gateway, not knowing the encoder is down, strips image tokens and forwards the text-only query to the language model.

Blast radius: All multi-modal queries silently receive text-only responses. The language model answers “I see you attached an image” but cannot describe it. Users are not informed that image processing failed. Billing still occurs for the full query. Trust damage is severe if discovered in production by users rather than by monitoring.

Fix: The gateway must assert encoder liveness before accepting a multi-modal request. If the encoder is unhealthy, return a 503 with an explicit error (“image processing temporarily unavailable”) rather than silent fallback. Never silently degrade multi-modal to text-only without user notification.

Failure 3: Thinking budget token counter off-by-one causes response truncation

Trigger: A deploy changes the thinking token delimiter — the special tokens that bound the reasoning scratchpad. The token counter miscounts the boundary tokens, triggering the stop-thinking signal 100 tokens early on every request using the thinking path.

Blast radius: All thinking-mode responses are truncated mid-reasoning. The model transitions to output mode with incomplete reasoning, producing superficially plausible but logically incomplete answers. This passes safety checks and latency checks. The only signal is the quality delta going to near-zero on thinking vs standard mode (the thinking-mode delta score from the eval harness).

Fix: Unit test: assert that the token counter matches the expected thinking token count on a canonical test prompt after every deploy. Monitor thinking-mode vs standard-mode quality delta in production; any sudden collapse is a signal.

Failure 4: Safety classifier FP spike on medical queries after retraining

Trigger: A safety classifier retraining run improves CSAM recall by 0.5% but introduces a spurious correlation with medical terminology — certain drug names now trigger the pre-filter at high confidence. The eval golden set did not include clinical-grade medical queries.

Blast radius: At 50K QPS, 5% medical share, 2% new FP rate: 50 wrong blocks/second, 180,000 users/hour. Medical professionals using the API for research are disproportionately impacted. The failure is invisible to latency metrics and appears as a quality regression only in the over-refusal golden-set eval.

Fix: Never ship a safety classifier update without running the benign-but-sensitive golden set (medical, legal, security queries). Gate the deploy on over-refusal rate staying within 0.5 percentage points of the prior model. If it regresses, roll back the classifier update, not the generation model.

Quick check

Trade-off

A deploy changes the thinking token delimiter. The budget counter fires 100 tokens early on every thinking-mode request. What metric catches this before user complaints accumulate?
💰

Incident Cost Ledger

Three incident types priced with detection-window sensitivity. All revenue figures are (community estimate) derived from reported Gemini API pricing and estimated query volumes — not internal Google data. Arithmetic verified. Detection window matters: a 1-minute vs 30-minute detection gap is a 30x cost multiplier.

Incident | Revenue at risk / min | Detection window | Total cost @ detection | Assumptions
Long-context attention spike (OOM → serving restart) | $950 / min | 5 min (p95 restart + health-check) | $4,750 | Community revenue-at-risk proxy: assume a $500M annualized Gemini run rate; $500,000,000 ÷ 525,600 min/year ≈ $951/min, rounded to $950/min. This row is priced as user-visible revenue at risk during outage time, not as raw inference cost
Safety classifier FP spike on medical queries | $0 revenue loss; trust cost | 60 min (requires a golden-set eval cycle) | ~180,000 wrong blocks + ~$36,000/month recurring churn | 50 wrong blocks/s × 3,600 s ≈ 180,000 wrongly blocked queries over the 60-min detection window. At a 1% churn rate among ~180,000 affected users and a $20/mo subscription value: 180,000 × 0.01 × $20 = $36,000/month in recurring churn
TPU fleet degradation (50% throughput drop) | $475 / min | 15 min (hardware alert + oncall response) | $7,125 | 50% throughput drop → serving capacity halved → 50% of requests see 2x queue wait or rejection; effective revenue at risk = 50% of baseline: $950/min × 0.5 = $475/min; over 15 min: $475 × 15 = $7,125
🚨

Gemini On-call Runbook

Long-context attention spike (OOM → serving restart)

MTTR p50 / p99: 8 min / 25 min

Blast radius: All 1M-token sessions on the affected serving node time out or return 503. Short-context sessions on shared nodes see increased queuing latency as traffic re-routes to healthy nodes.

  1. Detect. Alert: node OOM kill rate > 0 for 2 consecutive minutes. Secondary: p99 TTFT for the long-context bucket exceeds 10,000 ms. Both alerts must be wired separately — OOM kills do not always surface as latency regressions if the load balancer removes the node before the latency spike is observed.
  2. Escalate. Page the serving on-call within 1 minute of the OOM alert. First responder checks: (1) is this one node or a fleet-wide pattern? (2) is there a new query pattern (sudden 1M-token traffic spike vs baseline)? (3) is it a deploy-induced regression (check the deploy log for the last 4 hours)?
  3. Rollback. If deploy-induced: roll back to the prior serving image (target MTTR: 8 min). If traffic-spike-induced: activate admission control to cap 1M-token sessions per node; reduce the cap until the OOM rate returns to zero. Long-term: size the KV-cache tier for the p99 session memory, not the average.
  4. Post. Add a KV-cache memory utilization alarm at 80% HBM (not 100% — give headroom for admission control to engage). Add a per-node 1M-token session count metric. Update the capacity model to include the p99 concurrent 1M-token sessions from the incident traffic profile.

Safety classifier FP spike on medical queries

MTTR p50 / p99: 15 min / 45 min

Blast radius: Medical, legal, and scientific professional users receive unexpected refusals. Impact is latent — users do not always report wrong refusals, they just switch to a competitor. The blast radius is detected via over-refusal rate in the golden-set eval, not via user complaints.

  1. Detect. Alert: over-refusal rate on the benign-sensitive golden set exceeds baseline + 1%. This requires the eval pipeline to run at minimum every 6 hours post-deploy (not just weekly). Without this alert, the failure can persist for days before user complaints accumulate to a visible signal.
  2. Escalate. Page the safety team (not the serving on-call — this is a model artifact, not an infrastructure failure). The safety team reviews which input categories are triggering the spike by running the golden set through the new classifier with logging enabled.
  3. Rollback. Roll back the safety classifier to the prior version (serving infrastructure should support independent versioning of the safety classifier and the generation model). Target MTTR: 15 min for rollback. Redeploy the generation model update (if any) with the prior safety classifier.
  4. Post. Expand the benign-sensitive golden set to include the query types that triggered the FP spike. Add a pre-deploy gate: safety classifier updates must show an over-refusal rate within 0.5pp of baseline on the full golden set before promotion to production.

TPU fleet degradation (50% throughput drop)

MTTR p50 / p99: 12 min / 40 min

Blast radius: All model tiers on the affected TPU pods see 2x queue depth and 2x TTFT. Admission control starts rejecting requests when queue exceeds the configured cap. Users see 429 rate limit errors even though they are within their quota — because capacity, not quota, is the binding constraint.

  1. Detect. Alert: model-server throughput (tokens/s per chip) drops below 60% of baseline for 3 consecutive minutes. Secondary: request rejection rate (429s) rises above 0.1% of traffic. The hardware health dashboard must be linked in the runbook — some TPU degradations are caused by a specific ICI link failure that the hardware team can identify in under 2 minutes.
  2. Escalate. Page the infrastructure on-call immediately. First question: is this a hardware failure (ICI link, HBM error, thermal) or software (XLA graph regression, driver issue)? Hardware: route to the GCE infrastructure team. Software: check the last 4 hours of driver/runtime deploy history.
  3. Rollback. For hardware failure: drain affected TPU pods from the load balancer and route traffic to healthy pods. For software regression: roll back the XLA compilation or TPU runtime to the prior version. In both cases, activate the capacity overage buffer (pre-provisioned spare TPU pods at 20% of baseline — see capacity runbook).
  4. Post. Add a chip-level health metric to the capacity dashboard. Establish a minimum spare capacity ratio (recommendation: 25% buffer) so a 50% throughput drop can be absorbed without user-facing rejections. Review whether the XLA graph recompilation after a fleet upgrade was adequately tested before production rollout.

Quick check

Derivation

TPU fleet degrades to 50% throughput. Effective revenue at risk = $475/min. Detected at 15 min → $7,125 total. If detection were delayed to 60 min, what is the total cost?
🏢

Company Lens — What Each Interviewer Actually Asks

Google (DeepMind / Google Brain / Serving Infra)

  • TPU fluency is table stakes. Google interviewers expect candidates to understand TPU pod topology, ICI interconnect, and XLA compilation — not at the chip design level, but at the “how does this affect my serving design” level. Being able to explain why a batch size of 1 is catastrophically inefficient on TPU (systolic array underutilization) is the minimum bar.
  • Long-context at scale is the differentiating Google topic. Expect a question of the form: “Design the KV cache for a system that must serve 1M-token sessions at 50K QPS. Walk through the memory math, the tiering decision, and the eviction policy.” Correct answer must include HBM/DRAM/SSD tiering, grouped-query attention reducing KV size, and admission control as the safety valve.
  • Safety-first framing is expected. Google interviewers come from an organization with Google-scale regulatory exposure. They expect safety to be designed in, not bolted on — meaning the candidate should mention the three-stage classifier chain, over-refusal as a quality metric (not just a safety metric), and the golden-set eval cadence unprompted. Treating safety as an afterthought is a signal of mismatch.
  • Multi-modal system design distinguishes Gemini candidates. Be prepared to walk through the 258-tokens-per-image derivation from patch grid math. Interviewers from the Gemini serving team will probe whether you understand why image token count inflation is a billing and context-window bug, not just a latency bug.

Anthropic

  • Constitutional AI integration. Anthropic interviewers will ask how you would integrate their Constitutional AI approach into a Gemini-like safety pipeline. Key tension: CAI adds a self-critique loop (extra generation pass) that doubles latency for the fraction of queries that enter the revision path. The correct answer discusses the gating condition, revision round cap, and the shadow-run strategy to calibrate the gate threshold before production deployment.
  • Uncertainty and calibration. Anthropic cares about epistemic honesty. In a design interview, candidates who label proprietary system internals as estimates and caveat uncertainty are viewed more favorably than those who present community estimates as facts. Explicitly label every Gemini architecture claim as (community estimate) or (per Google paper X) when discussing it.

OpenAI / Meta

  • Gemini vs GPT-4 architectural comparison. OpenAI interviewers for senior roles may ask you to compare the two systems. Key differentiators: native multi-modal training (Gemini) vs vision-adapter approach (GPT-4V); TPU serving vs H100 serving; 1M-token context (Gemini) vs 128K (GPT-4); thinking as a first-class billable resource (Gemini 2.5) vs internal chain-of-thought (OpenAI o1). See also the ChatGPT case study for the GPT-4 side of the comparison.
  • Cost-per-query sensitivity analysis. Meta ML interviewers, coming from a cost-sensitive ads infrastructure background, frequently ask for a number-backed cost-per-query derivation and a sensitivity analysis: “Which variable has the most leverage on cost?” Correct answer: prefix-cache hit rate, because each 10-point improvement reduces prefill compute (the largest cost component) by 10%. Followed by context length — a 128K-token query is <13% the cost of a 1M-token query. Model tier selection (Flash vs Pro) is the most coarse-grained lever but has the largest range (roughly 10x cost difference).
💡 Tip · Cross-study connections. This module connects directly to the NotebookLM case study (long-context retrieval augmentation, same 1M-token window applied to document Q&A) and the Sora case study (multi-modal generation — video tokens as first-class inputs, same patch-grid tokenization math applied to video). If you've studied all three, you can describe Google's multi-modal strategy as a coherent stack: Gemini as the reasoning layer, NotebookLM as the long-context application layer, and the video understanding capability as the sensory input layer.
🎯

Key Takeaways

What to remember for interviews

  1. 1M-token context is a KV-cache engineering problem first, a model capability problem second. Without HBM/DRAM/SSD tiering + grouped-query attention, 1M-token serving at scale is not economically feasible. Design the cache tier before the model tier.
  2. Multi-modal token budget is a billing and context-window constraint, not just a quality one. Each image ≈ 258 tokens (Google blog). A 10-image query consumes ~2,580 tokens; 10 s of 24 fps video consumes ~62K tokens. Unit-test the token count on every image encoder deploy to prevent silent billing inflation.
  3. TPU's systolic array is batch-hungry: a batch size of 1 severely underutilizes it, but its ICI interconnect scales to multi-thousand-chip pods with near-flat bandwidth — making it superior to H100 NVLink for 1,000+-chip tensor parallelism. Continuous batching is the mandatory operational pattern for TPU inference serving.
  4. Safety classifiers have a three-stage cost structure: fast pre-filter (cheap, high-recall), context-aware post-filter (expensive, high-precision), fine-grain targeted classifier (used only for near-zero-tolerance categories). Over-refusal on benign queries is a quality metric, not a safety metric — a 2% FP rate at 50K QPS / 5% medical share blocks 180,000 users per hour.
  5. Thinking-mode test-time compute reprices reasoning as a first-class billable resource. The budget enforcer must classify thinking tokens vs output tokens correctly (an off-by-one in delimiter detection truncates responses), and the quality delta between thinking and non-thinking mode is the primary eval metric to validate the path adds value at the configured budget cap.
🎓

Interview Challenges


Gemini's 1M-token context window is real but serving it profitably is hard. Derive the minimum prefix-cache hit rate needed so the cost per 1M-token query stays below $10 (use publicly available API pricing as a reference point). What architectural components make or break that number?

★★★
Google · Anthropic

A Gemini multi-modal query arrives with a 10-image product catalog (each ~512KB JPEG). Walk through the full serving path, identifying the two highest-latency steps and how you bound them.

★★☆
Google · Meta

Gemini 2.5 Thinking charges separately for thinking tokens. Design the serving-side token budget enforcer: what does it check, when does it fire, and what happens if the model tries to exceed the budget mid-generation?

★★★
Google · OpenAI

Your team's safety post-classifier has a 2% false-positive rate on medical queries. That means 2% of legitimate doctor-patient research questions are refused. At 50K QPS and 5% medical query share, how many users per hour are wrongly blocked? What is the right architectural fix?

★★☆
Google · Anthropic
🧠

Recap quiz

Derivation

Gemini uses ~258 tokens per standard image. A ViT-style encoder uses 14×14 patches on a 224×224 input. Where do the 258 tokens come from?
Derivation

Using grouped-query attention (16 KV heads, 128-dim each, float16, 118 layers), what is the approximate KV cache footprint per 1M-token session?
Derivation

At 50K QPS, 5% medical share, 2% false-positive rate: how many users are wrongly blocked per hour, and why is this not just a “rounding error”?
Trade-off

TPU's systolic array is severely underutilized at batch size 1 during autoregressive decode. What is Google's primary mitigation and why does it work for TPU specifically?
Derivation

The cost model shows ~$1.11/M tokens prefill (no cache) dropping to ~$0.67/M at 40% cache hit. Each 10-point improvement in cache hit rate saves how much, and what architectural property makes this the dominant cost lever?
Derivation

Gemini 2.5 Thinking bills reasoning tokens at the output-token rate. A Flash-tier request has a thinking budget cap of 8K tokens. What is the worst-case cost multiplier compared to a standard Flash request with no thinking?
Trade-off

A 1M-token session's KV cache is tiered across HBM → DRAM → SSD. An eviction policy bug moves layers 81–117 to SSD prematurely. What is the observable latency signature, and how do you distinguish it from a model regression?
Trade-off

Why does Gemini need separate p99 TTFT SLO buckets (<32K, 32K–128K, 128K–1M tokens) instead of a single fleet-wide p99?
📚

Further Reading