
Transformer Math

Module 70 · Design Reviews

🔭 Case: Design Perplexity

Half retrieval system, half LLM — which half should you optimize first?

Status:

Perplexity is an answer engine, not chat. Retrieval and citations dominate; the LLM is the summarizer. Design retrieval first, because trust is usually lost on the first bad citation, not the first weak phrasing.

📋

Requirements & SLOs

The Working-Backwards paragraph

Write the customer-facing sentence before any engineering word.

“A user types a question and receives a synthesized answer with inline citations in under two seconds. Every factual claim in the answer links to a real source that actually supports that claim. For news and prices, the answer reflects information from the past fifteen minutes. When sources are sparse or contradictory, the system says so — it does not fabricate a confident answer from thin evidence.”

SLO table

| Metric | Target | Why this value |
|---|---|---|
| p95 first-result latency | ≤ 2 s | Slower than a pure chat product because retrieval + rerank adds ~400–600 ms to the critical path; streaming hides the rest |
| Citation precision | ≥ 95% | Does the cited source actually support the claim? Measured via NLI entailment on a sampled production stream — this is the core trust metric |
| Groundedness score | ≥ 90% | Fraction of atomic claims in the answer traceable to a retrieved source; below 90% means the LLM is filling gaps from training memory |
| Freshness SLO (news / prices) | Index lag ≤ 15 min | Contractual for Pro tier; the freshness SLO drives the live-web fetch path design — index lag > 15 min triggers live fetch regardless of classifier output |
| Refusal/uncertainty behavior | Calibrated disclosure, not hard refusal | On low-evidence queries, the system discloses uncertainty inline rather than refusing; refusal trains users to distrust the product for edge cases |
| Availability | Standard enterprise-grade SaaS | Partial shard degradation counts as degraded-not-down unless citation precision falls below the floor |
✨ Insight · Margin note. Notice that the latency SLO is looser than a pure chat product (≤ 2 s vs ~800 ms). This is intentional: retrieval adds budget. Trying to hit 800 ms would force you to cut the reranker — and the reranker is where citation precision lives. Do not let a naive latency target kill the quality component that defines your product.

Quick check

Trade-off

The freshness SLO is “index lag ≤15 min for news.” When does this SLO force the live web fetch path to activate, even if the query classifier is uncertain?

🧪

Eval Harness (design first)

This is the most important section for Perplexity — and the one most candidates skip. Unlike a chat product where generation quality is the primary eval target, here you need four independent eval axes, and they are not equally important.

(a) Groundedness — NLI-style claim check

Extract atomic claims from the generated answer. For each claim, use a natural language inference (NLI) model to check whether the cited source entails that claim, contradicts it, or is neutral. Groundedness = entailed claims / total claims. Threshold: ≥90%. This is the eval that catches the LLM drifting to its training memory when retrieved sources are weak — the worst invisible failure mode.
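The scoring step reduces to a ratio over NLI labels. A minimal sketch, assuming claim extraction and the NLI model run upstream and hand this function precomputed labels (the function name and label strings are illustrative, not any specific library's API):

```python
from typing import List, Tuple

def groundedness_score(claim_labels: List[Tuple[str, str]],
                       threshold: float = 0.90) -> Tuple[float, bool]:
    """claim_labels: (claim, nli_label) pairs, where nli_label is one of
    'entailed', 'neutral', 'contradicted' as judged against retrieved sources.
    Returns (score, passes_threshold)."""
    if not claim_labels:
        return 0.0, False  # no extractable claims: fail, don't vacuously pass
    entailed = sum(1 for _, label in claim_labels if label == "entailed")
    score = entailed / len(claim_labels)
    return score, score >= threshold
```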

(b) Citation precision

Distinct from groundedness: groundedness asks whether any source supports the claim; citation precision asks whether the specific source cited inline supports it. A system can be highly grounded but have poor citation precision if the citation binder mis-assigns sources to the wrong claims. Measure via the same NLI pipeline, but restricted to the (claim, cited-source) pair rather than the full retrieved context.
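The distinction is mechanical once you have an entailment oracle: citation precision scores the specific (claim, cited-source) pair, groundedness scores the claim against any retrieved source. A toy sketch — the `entails` callable stands in for the NLI model:

```python
def citation_precision(cited_claims, entails):
    """cited_claims: (claim, cited_source_id) pairs.
    Scores only the inline citation, not the full retrieved context."""
    if not cited_claims:
        return 0.0
    ok = sum(1 for claim, src in cited_claims if entails(claim, src))
    return ok / len(cited_claims)

def groundedness(cited_claims, retrieved_ids, entails):
    """A claim is grounded if *any* retrieved source entails it."""
    if not cited_claims:
        return 0.0
    ok = sum(1 for claim, _ in cited_claims
             if any(entails(claim, s) for s in retrieved_ids))
    return ok / len(cited_claims)

# A mis-assigning binder: claim "x" is supported by s1 but cites s2.
entails = lambda c, s: (c, s) in {("x", "s1"), ("y", "s2")}
claims = [("x", "s2"), ("y", "s2")]
```

On this toy example groundedness is 1.0 while citation precision is 0.5 — exactly the grounded-but-misbound case described above.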

(c) Retrieval recall@k on golden queries

Maintain a golden query set of ~500 queries where the correct source documents are known (manually annotated). Measure whether those source documents appear in the top-k retrieved results (Recall@5, Recall@10). This eval decouples retrieval quality from generation quality — if Recall@5 is 60%, no amount of LLM improvement will fix citation precision. Retrieval failure is the root cause; generation quality improvement is noise.
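A sketch of the golden-set recall computation; here a query counts as a hit if any of its annotated source documents appears in the top-k (a per-document variant is equally common — pick one convention and keep it fixed):

```python
def recall_at_k(golden, retrieved, k=5):
    """golden: {query: set of relevant doc ids (manual annotation)}
    retrieved: {query: ranked list of doc ids from the retriever}"""
    hits = sum(1 for q, relevant in golden.items()
               if relevant & set(retrieved.get(q, [])[:k]))
    return hits / len(golden)

golden = {"q1": {"d1"}, "q2": {"d9"}}
retrieved = {"q1": ["d3", "d1", "d8"], "q2": ["d2", "d3"]}
```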

(d) End-to-end task success

A curated benchmark of ~200 query/expected-answer pairs, judged by an LLM evaluator calibrated against a 50-example human-reviewed subsample. Measures whether the full pipeline (retrieval → reranking → generation → citation binding) produces a correct and useful answer. This is the metric tied to user satisfaction in A/B tests — but it is a lagging indicator. Groundedness and citation precision are leading indicators that predict task success before a user ever sees the output.

⚠ Warning · Shreya Shankar's “Who Validates the Validators” problem. Your LLM judge for groundedness and citation precision is itself an LLM. It needs calibration: run the judge on a 50-example human-labeled subsample and measure judge accuracy before trusting its scores at scale. An uncalibrated judge that overestimates groundedness by 8 pp gives you a false sense of security — and in a system where citations are the trust signal, false security has direct user-trust consequences.
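The calibration gate itself is simple arithmetic — a sketch, assuming the judge and the human reviewers emit comparable binary labels on the 50-example subsample:

```python
def judge_accuracy(judge_labels, human_labels):
    """Agreement rate between the LLM judge and human labels on the subsample."""
    assert judge_labels and len(judge_labels) == len(human_labels)
    return sum(j == h for j, h in zip(judge_labels, human_labels)) / len(judge_labels)

def trust_judge_at_scale(judge_labels, human_labels, floor=0.90):
    """Only trust the judge's production-scale scores if subsample accuracy
    clears the floor; otherwise recalibrate before reading its dashboards."""
    return judge_accuracy(judge_labels, human_labels) >= floor
```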

The offline/online bridge for Perplexity

The offline eval metric that most reliably predicts user retention is citation precision — not task success, not generation quality. When running A/B experiments, track citation precision as the primary guardrail metric. A variant that improves task success by 2% but drops citation precision by 5% should be rejected: users will feel the citation break before the task-success improvement shows in retention numbers.

🧮

Back-of-Envelope

The QPS target for Perplexity-scale is lower than a pure chat product — retrieval-heavy queries are more expensive end-to-end, so the same revenue supports fewer concurrent LLM requests. Seed the calculator at 200 QPS with a 70B flagship and a 10% prefix cache hit rate.

Critical caveat: the calculator models only the LLM serving cost. For Perplexity, this is not the full picture.

Scenario: Perplexity answer engine: 200 QPS target, 70B flagship model. Input tokens are large (1,024) because the retrieved context (typically 3–5 chunks, ~800 tokens) is stuffed into the prompt before generation. Cache hit rate is low (10%) because query diversity is high — few users ask the exact same question twice.
| Parameter / Output | Value |
|---|---|
| Model Size | 70B |
| GPU Type | A100-80GB |
| QPS Target | 200 |
| Input Tokens | 1,024 |
| Output Tokens | 512 |
| Cache Hit Rate | 10% |
| Model Weights (FP16) | 140 GB |
| KV Cache / Request | 515.4 MB (1,536 tokens) |
| Tokens/sec per GPU | 300 |
| Effective QPS (after cache) | 180 |
| GPUs Needed | 309 |
| Cost / Month | $451,140 |
| Est. p95 Latency | 3.56 s |
| Bottleneck | Compute/Bandwidth |

GPU Memory Usage: 60% · Compute Utilization: 100% · Monthly Cost: $451,140
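The table's arithmetic can be reproduced with a decode-bound sketch. Assumptions here are mine, not the calculator's internals: cache hits skip the GPU entirely, throughput is output-token-bound, and the A100 rate is ~$2/hr (implied by $451,140 ÷ 309 GPUs ÷ 730 h); the table's 309 GPUs vs. the 308 computed below is headroom rounding.

```python
import math

def llm_serving_estimate(qps, cache_hit, output_tokens,
                         tokens_per_gpu_s, gpu_hourly_usd, hours=730):
    effective_qps = qps * (1 - cache_hit)          # cache hits never reach the GPU
    decode_tok_s = effective_qps * output_tokens   # aggregate decode throughput needed
    gpus = math.ceil(decode_tok_s / tokens_per_gpu_s)
    monthly_cost = gpus * gpu_hourly_usd * hours
    return effective_qps, gpus, monthly_cost

eff_qps, gpus, monthly = llm_serving_estimate(
    qps=200, cache_hit=0.10, output_tokens=512,
    tokens_per_gpu_s=300, gpu_hourly_usd=2.0)
```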

⚠ Warning · Gotcha: This calculator does not include retrieval or reranking cost. At Perplexity scale, the embedding model (for query encoding), vector search, and cross-encoder reranker together represent a significant fraction of per-request spend — community estimates put retrieval + reranking at 30–50% of total per-query cost. The LLM line item is the visible cost; retrieval is the hidden one.
✨ Insight · The LLM is cheap; retrieval is the bill. A common mistake in Perplexity-style system design is treating the LLM as the dominant cost center and optimizing there first. In practice, at steady-state scale, the embedding model for query encoding, the vector index serving layer, and the cross-encoder reranker together account for a substantial fraction of per-request spend (community estimates: 30–50% of total per-query cost). Before cutting the LLM size to save money, check whether the reranker can be made leaner or the cache hit rate can be improved.

Baseline: Perplexity search-RAG path: 800 GPUs @ $3/hr at p99 5,000 ms, 8,000 QPS, 35% cache hit.

The RAG path has a fundamentally different SLO shape than pure chat: a 5 s p99 budget allows larger batches, but low cache hit rates (35%, community estimate) mean most requests hit the GPU cold. Tighten the SLO to 2 s and watch the GPU count nearly triple — the latency budget is the primary lever here, not QPS.

| Parameter / Output | Value |
|---|---|
| p99 Latency Target | 5,000 ms |
| Peak QPS | 8,000 |
| Cache Hit Rate | 35% |
| Effective QPS (after cache) | 5,200 |
| Latency-batch factor | 1.00× |
| GPUs Needed | 800 (+0% latency vs baseline) |
| Hourly Burn | $2,400 (+0% vs baseline) |
| Cost / Request | $0.00008 |
| Monthly Burn (24×7) | $1,752,000 |
| Bottleneck | Balanced |
⚠ Warning · Gotcha: Cache hit rate is structurally lower for a search product than for chat. Query diversity is near-maximal — almost no two users ask the exact same question. The 35% baseline already assumes semantic-cache deduplication across paraphrases. Without it, effective hit rate drops to sub-10%, and the GPU requirement scales accordingly.
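A toy sketch of semantic-cache lookup: the cache key is a query embedding, and paraphrases hit when cosine similarity clears a threshold. The threshold value and toy vectors below are illustrative; production uses a real embedding model and an ANN index rather than a linear scan:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # (query_embedding, cached answer + citations)

    def get(self, emb):
        best = max(self.entries, key=lambda e: cosine(emb, e[0]), default=None)
        if best is not None and cosine(emb, best[0]) >= self.threshold:
            return best[1]  # hit: retrieval, rerank, and LLM are all skipped
        return None

    def put(self, emb, answer):
        self.entries.append((emb, answer))

cache = SemanticCache()
cache.put([1.0, 0.0], "cached answer + citations")
```

A near-paraphrase embedding hits; an unrelated query misses and pays for the full stack.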
🏛️

Architecture

The RAG preset below shows the production topology; the component justification table describes each component's role.

Perplexity-Style Answer Engine Architecture

The critical path runs: Gateway → Retriever → Reranker → LLM Server → Citation Binder.

Components: Client · API Gateway · Retriever · Vector Store · Live Web Fetch · Reranker · LLM Server · Citation Binder

Component justification table

| Component | Latency budget | Why it exists |
|---|---|---|
| API Gateway | < 10 ms (typical for in-datacenter L7 proxy) | Auth, per-tier rate limiting, query logging, freshness classifier routing. Needed because free and Pro tiers have different latency and freshness SLOs. |
| Hybrid Retriever (dense + BM25) | 50–100 ms (HNSW ANN p50; see FAISS benchmarks at ann-benchmarks.com) | Dense (bi-encoder embeddings) captures semantic similarity; BM25 sparse retrieval captures exact-match long-tail queries (entity names, product codes, rare terms). Reciprocal rank fusion combines the two result lists without a learned weighting parameter. |
| Vector Store | included above | HNSW-backed approximate nearest-neighbor index, sharded by content domain or recency tier. Needs tenant-aware sharding for Pro-tier custom corpus features. Online re-indexing pipeline for freshness. |
| Cross-Encoder Reranker | ~300 ms (typically on T4-class GPU scoring top-50 passages; derived: 50 × ~6 ms/pair inference, minus batching gains) | Takes top-50 from hybrid retrieval, re-scores each (query, passage) pair with a cross-encoder model (full cross-attention), outputs top-5. This is where citation precision is won or lost — the reranker is the last retrieval filter before the LLM sees the context. |
| Live Web Fetch (conditional) | 200–500 ms (typical external HTTP round-trip; depends on publisher CDN; qualitative range) | Triggered only for freshness-critical queries (news, prices, sports scores). Adds latency and upstream publisher rate-limit risk. The freshness SLO (≤15 min index lag) drives when this path is required vs. the index path being sufficient. |
| LLM Server (with prefix cache) | TTFT-dominated; p95 ≈ 3.5 s full completion (CapacityCalculator derivation above; 70B on A100, 200 QPS) | Receives the assembled prompt: system instructions + top-5 retrieved chunks + query. Prefix cache saves compute on the static system-prompt portion. Output is streamed as it generates — TTFT is what matters; full completion is hidden by streaming. |
| Citation Binder | < 50 ms (string-match or embedding lookup over top-5 passages; CPU-bound) | Attaches source IDs to generated spans. Two approaches: in-prompt tagging (LLM told to output citation tokens inline) vs. post-hoc alignment (citation spans extracted from the completed output). The binder is a first-class component because citation precision is a first-class SLO. |
| Semantic Cache | < 5 ms (hit) (in-process embedding lookup; cache hit skips all downstream stages) | Caches full (answer + citations) for high-frequency queries using embedding similarity as the cache key. Hit rate is low (~10%) compared to chat products because answer engine queries are more diverse, but the cost savings on cache hits are large — the entire retrieval + rerank + LLM stack is skipped. |

Quick check

Trade-off

A PM proposes skipping the cross-encoder reranker on free-tier queries to save 300–400 ms of latency. The engineering team pushes back. What is the most defensible technical reason?

🔬

Deep Dives


Two components carry most of the product risk: the retrieval stack and the citation binder.

Deep dive A — Hybrid retriever + reranker stack

Approach

The retrieval stack runs in two stages deliberately. Stage 1 is the recall stage: a hybrid of dense bi-encoder retrieval and BM25 sparse retrieval returns the top-50 candidate passages from the full index in 50–100 ms. Stage 2 is the precision stage: a cross-encoder reranker re-scores each of those 50 (query, passage) pairs with full cross-attention, selecting the top-5 that actually go into the LLM prompt.

Dense embedding model choice. A larger embedding model (e.g., OpenAI text-embedding-3-large, 3072-dim) retrieves more accurately than smaller open alternatives (e.g., BGE-base, 768-dim), but at 4× the embedding inference cost per query. The production trade-off: use the larger model for Pro tier and a distilled open-source model (e.g., BGE-large, 1024-dim) for the free tier. Measure the quality delta on your golden query set — if Recall@5 only improves 2–3 pp, the cost difference is not worth it.

BM25 sparse for long-tail and exact-match. Dense retrieval systematically underperforms on rare entity names, technical identifiers, and exact-match queries where the user typed the same words that appear in the source. BM25 (or equivalent inverted index) catches these. The two result lists are merged via reciprocal rank fusion (RRF): each document's combined score is the sum of 1/(k + rank) across both lists, where k is a smoothing constant (typically 60). RRF requires no learned weighting parameter — it is robust to scale differences between the two rankers. This matters because the score distributions from dense cosine similarity and BM25 TF-IDF are not on the same scale; any learned combiner would overfit to your training distribution.
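RRF is only a few lines; a minimal sketch, with k = 60 as the conventional smoothing constant:

```python
from collections import defaultdict

def reciprocal_rank_fusion(*ranked_lists, k=60):
    """Merge ranked doc-id lists: score(d) = sum of 1/(k + rank) over the
    lists d appears in, rank starting at 1. No learned weights, so the
    incomparable score scales of dense cosine and BM25 never meet."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion(["d1", "d2", "d3"],   # dense ranking
                               ["d3", "d1", "d4"])   # BM25 ranking
```

Documents appearing high in both lists ("d1", "d3") outrank documents appearing in only one.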

Cross-encoder reranker on top-50 → top-5. The cross-encoder (e.g., Cohere Rerank, or an open-source model like bge-reranker-large) receives the full (query, passage) pair and scores them with cross-attention — seeing both simultaneously. This is significantly more accurate than bi-encoder dot product at the cost of O(k) inference calls per query. Applying it to top-50 only keeps the latency budget manageable (typically ~300–400 ms on a T4-class GPU, derived from ~50 pairs × ~6–8 ms per pair, offset by batching). Never apply the cross-encoder to the full index — that is the role of the fast ANN retriever.

Trade-off (the non-obvious one)

The non-obvious trade-off is not latency vs. quality — it is diversity vs. precision. The cross-encoder optimizes relevance; it does not optimize for whether the top-5 passages collectively cover the full answer. If the reranker surfaces five passages that all say the same thing about a topic, the LLM will produce a narrowly-sourced answer with high citation precision but low completeness. For multi-faceted queries (e.g., “compare X and Y on dimensions A, B, C”), you may need to inject diversity explicitly — either by capping the number of passages from any single source URL in the top-5, or by using a maximal-marginal-relevance (MMR) re-selection step after reranking. The right choice depends on whether citation precision or answer completeness is the higher-priority SLO for your product.
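The MMR re-selection step can be sketched as below; `rel` (reranker relevance) and `sim` (passage-passage similarity) are stand-in callables for the real scorers, and `lam` controls the relevance/diversity trade:

```python
def mmr_select(candidates, rel, sim, n=5, lam=0.7):
    """Greedy maximal-marginal-relevance pick from reranked candidates:
    each step takes the argmax of lam*relevance - (1-lam)*max-similarity
    to anything already selected."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < n:
        best = max(pool, key=lambda d: lam * rel(d)
                   - (1 - lam) * max((sim(d, s) for s in selected), default=0.0))
        selected.append(best)
        pool.remove(best)
    return selected

# Toy data: "a" and "b" are near-duplicate passages; "c" is less relevant but novel.
relevance = {"a": 0.9, "b": 0.85, "c": 0.5}
pair_sim = {frozenset({"a", "b"}): 1.0}
picked = mmr_select(["a", "b", "c"],
                    rel=relevance.get,
                    sim=lambda x, y: pair_sim.get(frozenset({x, y}), 0.0),
                    n=2)
```

Pure relevance ranking would pick the duplicates ["a", "b"]; MMR picks the novel passage instead.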

Failure mode

Reranker drift on model upgrades. If the reranker model is updated and quietly changes its scoring distribution, the new top-5 may systematically favor certain content types or recency signals — degrading citation precision without triggering any retrieval-recall alert (because the new model might have higher Recall@50, just a different top-5 composition). This is the “Who Validates the Validators” problem applied to retrieval: the metric used to qualify the reranker (NDCG on the benchmark) does not measure the downstream metric you actually care about (citation precision in production).

Detection metric

Shadow-eval citation precision: run the new reranker model in parallel on 5% of live traffic, compare citation precision (NLI entailment on the sampled stream) against the production model. Alert threshold: if shadow precision falls more than 2 pp below the production baseline, block promotion. This needs to run for at least 24 hours to cover the daily query distribution shift (news traffic spikes in the morning; technical queries spike at night).
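The gate itself is trivial — the cost lives in the NLI sampling pipeline that feeds it. A sketch in percent units, with the 2 pp threshold from the text:

```python
def shadow_gate(prod_precision_pct, shadow_precision_pct, max_drop_pp=2.0):
    """Block reranker promotion if the shadow model's citation precision falls
    more than max_drop_pp points below the production baseline."""
    return (prod_precision_pct - shadow_precision_pct) <= max_drop_pp
```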

Mitigation

Block reranker upgrades behind the shadow-eval gate. Secondary effect to watch: a new reranker that scores differently may change which sources dominate the top-5. If the new model consistently favors lower-authority sources (e.g., social media over academic or news sources), citation credibility degrades even if NLI entailment is unchanged — users perceive source authority, not just entailment scores.

Real-world example

The Cohere team's Rerank blog post documents 10–15 NDCG point improvements on MS MARCO when adding a cross-encoder reranker over dense retrieval alone — and the complementary result is that removing the reranker costs that same precision gap. Eugene Yan's LLM patterns post documents the same two-stage pattern (recall then precision) as the production-grade baseline for RAG systems and notes that single-stage retrieval reliably underperforms on citation-heavy use cases.

Deep dive B — Citation binder

Approach

The citation binder is the component that attaches a source ID to each factual claim in the generated answer. Two production approaches exist, with meaningfully different quality characteristics.

In-prompt tagging (preferred for accuracy). The LLM is instructed in the system prompt to output inline citation tokens (e.g., [1], [2]) keyed to source IDs passed in the context. The model learns to cite as it generates — the citation token appears immediately after the claim it supports. Citations are semantically grounded at generation time; the model is deciding citation placement while it has full attention over both the claim being generated and the source passages. Disadvantage: requires the model to be fine-tuned or carefully prompted to reliably produce citation tokens in the right position; citation precision degrades when the model hallucinates a citation index that does not exist in the provided sources. A citation token like [7] when only 5 sources were provided is a silent hallucination — the binder must handle this gracefully (strip the token, log it, alert if the rate exceeds 1%).
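The graceful-handling step (strip, log, alert) can be sketched with a regex over `[n]`-style tokens — the token format is the one used in this example, not a universal standard:

```python
import re

CITE = re.compile(r"\[(\d+)\]")

def sanitize_citations(answer: str, n_sources: int):
    """Strip citation tokens whose index points outside the provided sources
    and report the hallucinated-citation rate for alerting (>1% should page)."""
    tokens = CITE.findall(answer)
    bad = sum(1 for t in tokens if not (1 <= int(t) <= n_sources))
    cleaned = CITE.sub(
        lambda m: m.group(0) if 1 <= int(m.group(1)) <= n_sources else "",
        answer)
    rate = bad / len(tokens) if tokens else 0.0
    return cleaned, rate
```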

Post-hoc alignment (simpler, lower accuracy). Generate the answer first without citation tokens. Then a separate alignment step identifies claim spans in the output and matches each span to the most similar retrieved source chunk via embedding similarity or NLI entailment scoring. Works with any base LLM without prompt engineering changes — useful when you cannot fine-tune the model. The critical disadvantage: a claim that was genuinely synthesized from training memory (not from any retrieved source) will still receive a citation if the NLI model finds a plausible-but-wrong match. This inflates apparent citation precision while masking groundedness failures — you appear to have high citation coverage while the LLM is quietly hallucinating.

Trade-off (the non-obvious one)

The non-obvious trade-off in in-prompt tagging is between citation granularity and generation fluency. A model that cites at the sentence level produces high-precision citations but generates choppy prose interrupted by citation tokens. A model that cites at the paragraph level produces fluent prose but loses the precision mapping of which specific sentence relied on which source — making it harder for users to verify claims and harder for your eval pipeline to score citation precision accurately. The right granularity is claim-level (each independently verifiable assertion), but getting a model to reliably identify claim boundaries at generation time requires fine-tuning, not just prompting.

Failure mode

Silent binder breakage on LLM upgrades. A new LLM version is deployed. The previous model emitted [1] style tokens; the new model was instruction-tuned to use [Source 1] or footnote-style markers. The binder's regex parser sees zero recognized citation tokens, falls back to post-hoc alignment, and produces lower-precision citations — or produces uncited output entirely. The generation quality metrics improve (the new model is better), but citation precision plummets. This fails silently: the LLM output looks fine; the binder fails downstream, invisible to standard LLM quality metrics.

Detection metric

Monitor citation-token parse failure rate in the binder as a first-class operational metric. Alert threshold: >1% of responses returning zero parseable citation tokens (outside of known low-evidence query types). Also track citation precision on the golden query set continuously — a drop that appears within hours of an LLM deployment is the signature of a binder compatibility break, not a retrieval issue.

Mitigation

Treat every LLM upgrade as requiring a citation-format compatibility test before promotion: run the new model on the golden query set, verify citation token parse rate ≥ 99%, verify citation precision ≥ 95% on the sample. Gate promotion on both checks. Secondary effect: adding citation format to the pre-deployment smoke test suite makes LLM upgrades safer across all downstream components, not just the binder.
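The two checks combine into a single promotion gate; the floor values are the ones stated above:

```python
def promote_llm_version(parse_rate, citation_precision,
                        parse_floor=0.99, precision_floor=0.95):
    """Pre-promotion compatibility gate, run on the golden query set:
    both the citation-token parse rate and citation precision must clear
    their floors, or the new model version is held back."""
    return parse_rate >= parse_floor and citation_precision >= precision_floor
```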

A practical hardening step often missed: version-pin the citation prompt template alongside the LLM model version in your deployment config. When a rollback is needed, rolling back the model without rolling back the prompt produces a new mismatch. Treat (model version, prompt version, binder parser version) as a triple that must always move together — deploy as a unit, rollback as a unit. This eliminates an entire class of silent binder failures that arise from partial rollbacks.

Real-world example

LlamaIndex's citation eval framework (documented in their citation evaluation docs) shows that post-hoc alignment consistently scores 5–10 pp lower on citation faithfulness than in-prompt tagging when evaluated against the same retrieved context — confirming that the choice of binder architecture has a measurable, non-trivial effect on the product's core trust metric. The gap widens on multi-hop queries where the LLM synthesizes a conclusion from two sources simultaneously: post-hoc alignment can only attach one source to the synthesized claim, while in-prompt tagging can emit [1][2] inline as the model generates the fused sentence. Eugene Yan's LLM patterns survey identifies the citation binder as a first-class design component in production RAG systems, not a post-processing afterthought — and notes that the most common production mistake is treating it as a string-formatting step rather than a component with its own eval gate and monitoring.

✨ Insight · Margin note. Most candidates deep-dive the LLM selection. The LLM is the least differentiating component — any sufficiently capable model can summarize five retrieved chunks. The reranker and citation binder are where Perplexity's product quality actually lives. Deep dive those.
Quick Check

Your eval says LLM generation quality improved 3% after upgrading the model, but user satisfaction is flat for the week after launch. What do you investigate first?

🔧

Break It

Three failure modes, each mappable to a metric in the eval harness. Every break should have a detection path — if you can't measure the break, the mitigation is guesswork.

Remove the cross-encoder reranker

Top-50 from the hybrid retriever feeds directly into the LLM without reranking. The LLM now receives a noisier, less-precise top-5 (selected by simple score threshold rather than cross-encoder precision). Citation precision drops measurably — community benchmarks (e.g., MS MARCO two-stage retrieval, Nogueira et al. 2020 and the Cohere Rerank blog) suggest 10–15 precision-point regression when removing a reranker from a two-stage RAG pipeline. Groundedness score regresses in parallel because the LLM is filling gaps with training memory when the retrieved context is less relevant. Detection: citation precision metric crosses the 95% floor; NLI groundedness drops below 90%. Mitigation: restore the reranker; if latency is the concern, run the reranker on a dedicated low-latency CPU fleet rather than GPU — cross-encoders are not as compute-intensive as the LLM.

Remove the freshness tier

All queries served from the static index with no live web fetch. For general knowledge queries: no visible impact. For news, sports, prices, and breaking events: users immediately receive stale answers — yesterday's stock price, last week's election result, a product price from three months ago. The trust collapse is disproportionate to the query volume because freshness failures are the most visible failure mode — a user who asks “what is the score right now?” and receives a two-day-old answer immediately notices and attributes it to the product overall. Detection: freshness-critical query cohort satisfaction drops in user feedback; post-hoc timestamp analysis of cited sources shows lag > 15 min SLO. Mitigation: restore live fetch for the freshness-critical query classifier output; consider a tiered freshness SLO disclosure so users know which queries are live-sourced vs. index-sourced.

Swap the LLM without refreshing the citation prompt

A new model version is deployed. The system prompt instructs the LLM to emit inline citation tokens like [1], [2]. The new model was trained with a different citation format (e.g., [Source 1] or footnote-style). Citation tokens are now malformed, unparseable, or missing entirely. The citation binder silently produces uncited answers or garbled source links. This fails silently — the generation quality may be identical or better, but citations break without any error signal in the LLM output itself. Detection: citation-token parse failure rate in the binder should be a monitored metric; set an alert for >1% parse failure. Citation precision metric crosses the floor within hours if the golden-set eval runs continuously. Mitigation: treat LLM upgrades as requiring a citation format compatibility test before promotion; add citation-token parsing to the pre-deployment smoke test suite.

Quick check

Trade-off

The freshness tier is removed. News queries are served from the static index. The trust collapse is described as “disproportionate to the query volume.” Why is a freshness failure more damaging per-query than a citation precision failure?

💸

What Does a Bad Day Cost?

Three failure modes with worked-math cost rows. Assumptions are labeled — replace with your actuals on the whiteboard. The detection-window column is the key insight: for silent quality regressions, cost scales with detection lag, not incident duration.

Each incident below is costed at a 2-minute and a 4-hour detection window, with the detection ratio between them.

Incident 1: HNSW shard offline — affected queries receive degraded answers with no error to the user
- Detected in 2 min: 2 min × 200 QPS × 5% shard share = 1,200 degraded queries. Assumption: shard covers 5% of the index; 200 QPS steady state. Engineering cost: on-call response (~$500 loaded eng-hour × 0.5 h = $250). Trust cost: ~1,200 users affected, low virality risk.
- Detected in 4 h: 240 min × 200 QPS × 5% = 144,000 degraded queries. 120× more users receive wrong citations. At ~2% of users sharing negative experiences publicly (assumption), that is ~2,880 public signals — measurable brand damage.
- Detection ratio: 120×

Incident 2: Reranker upgrade silently regresses citation precision 95% → 85%
- Detected in 2 min: 2 min × 200 QPS × 10 pp drop × $0.002 trust-damage proxy ≈ $5. Trust-damage proxy: $0.002/degraded-query ≈ Pro-tier churn value × churn probability per bad citation (assumption: Pro ARPU ~$20/mo, 1% churn probability per bad-citation encounter = $0.20/user; sampled at 1% of traffic = $0.002/query). With 2-min detection, rollback lands before meaningful user exposure.
- Detected in 4 h: 240 min × 200 QPS × 10 pp drop × $0.002 ≈ $576. Same math, 120× more queries exposed. Additionally: the regression may persist for days if NLI sampling is absent — multiplying the window further. A 24-hour undetected regression at these assumptions costs ~$3,500 in trust-damage proxy, plus non-linear churn compounding once users encounter multiple bad citations.
- Detection ratio: 120× (or ~720× if undetected for 24 h)

Incident 3: Publisher rate-limit breaks the real-time news freshness SLO
- Detected in 2 min: news-query share × 200 QPS × 2 min = 15% × 200 × 120 s = 3,600 stale news queries. Assumption: news queries are ~15% of traffic. Contractual exposure: the Pro-tier SLO breach clock starts immediately. Engineering cost: incident response + publisher negotiation.
- Detected in 4 h: 15% × 200 × 14,400 s = 432,000 stale news queries. At 4 h, a major breaking news event (election result, market crash) may have been fully served stale. PR risk is non-linear — one viral screenshot of a wrong election result costs more than the SLO breach math suggests.
- Detection ratio: 120×
⚠ Warning · The hardest SLO to operationalize is citation precision. Latency and availability fire binary alerts. Citation precision requires continuous sampling, an NLI inference pipeline on production traffic, and a calibrated judge — all running at non-trivial cost. The reranker regression row above shows why this is worth building: a 4-hour detection window vs. a 2-minute detection window is a 120× cost multiplier, and the cost compounds non-linearly if the regression persists for days. The organizations that skip citation-precision monitoring discover the regression from a viral tweet, not a dashboard.
🚨

Perplexity On-call Runbook

Index staleness after upstream crawler rate-limit

MTTR p50 / p99: 20 min / 90 min

Blast radius: News and finance queries return documents up to 24 h stale. Pro-tier freshness SLO (≤15 min lag for news) breached immediately. High virality risk if a major breaking event is served stale.

  1. Detect: Index-lag monitor compares the crawl-ingest timestamp of the top-10 news domains against wall clock. Alarm at lag > 30 min. Secondary: freshness-intent query share spiking with unchanged answer age.
  2. Escalate: Page crawler on-call and search infra lead. Identify which publisher domains are rate-limited (crawler error log). Do not surface stale results silently — add an “information may be outdated” badge in the UI while recovery runs.
  3. Rollback: Switch news-intent queries to a backup crawler with a separate rate-limit budget. If unavailable: disable live-fetch for the affected domains and serve vector-index results with explicit staleness disclosure. Negotiate an emergency rate-limit increase with the publisher.
  4. Post: Add a per-domain crawl-lag SLO with automatic publisher failover. Implement an index-age badge in the answer UI so users can self-assess freshness. Add freshness-intent golden queries to CI so regressions surface before deploy.

Reranker latency spike on adversarial long queries

MTTR p50 / p99: 10 min / 35 min

Blast radius: Cross-encoder reranker times out on queries with >50 retrieved candidates and >500-token query length. Affected requests fall back to BM25-only ranking — citation precision drops 15–20 pp (community estimate). Concentrated in power users who write detailed queries.

  1. Detect: reranker p99 latency alarm at 2× baseline (community estimate: ~300 ms baseline → alarm at 600 ms). Correlated spike in the fallback-path counter.
  2. Escalate: page search on-call. Confirm via the reranker latency heatmap whether the spike is query-length-correlated or uniform. If correlated: activate the query-length cap on the reranker input.
  3. Rollback: apply the reranker input cap: truncate the candidate list to top-20 and the query to 200 tokens for the duration. Precision degrades slightly but latency stabilizes. Full rollback requires a reranker model rollout — plan for a 30-min window.
  4. Post: add an adversarial long-query stress test to reranker CI. Implement a soft cap with graceful degradation (not a hard timeout). Profile whether dynamic candidate-count scaling based on query complexity is feasible.
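The emergency input cap in the rollback step is deliberately dumb: hard truncation, no model change. A minimal sketch under that assumption (function name and token representation are illustrative):

```python
MAX_CANDIDATES = 20     # emergency cap from the runbook: top-20 candidates
MAX_QUERY_TOKENS = 200  # emergency cap from the runbook: 200-token query


def cap_reranker_input(
    query_tokens: list[str], candidates: list[str]
) -> tuple[list[str], list[str]]:
    """Truncate reranker input during an incident so latency stabilizes.

    Precision degrades slightly (the reranker sees fewer candidates and less
    query context), but the cross-encoder stays on the fast path instead of
    timing out into BM25-only fallback.
    """
    return query_tokens[:MAX_QUERY_TOKENS], candidates[:MAX_CANDIDATES]
```

The "soft cap" in the post-incident step would replace these constants with limits scaled by current reranker load.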

Citation drift in browsing mode — source URL returns 404 after answer is cached

MTTR p50 / p99: 30 min / 4 h

Blast radius: Users click citations and hit dead links. Trust damage is disproportionate: a broken citation is more visible than a wrong answer. Affects cached answers for queries where the source was ephemeral (news redirect, paywall shift).

  1. Detect: async citation-health checker re-fetches the top-3 citations for a 1% sample of served answers every 10 min. 404-rate alarm at >5% of checked citations.
  2. Escalate: page search infra. Identify whether the dead-link rate is domain-wide (publisher site restructure) or scattered (individual article deletions). Prioritize by answer-impression volume.
  3. Rollback: for cached answers with dead citations, re-fetch the query live and regenerate with fresh sources. For a systematic publisher restructure: update URL-normalization rules in the crawler and reindex the affected domain. Surface "source unavailable" inline when the re-fetch fails.
  4. Post: add a citation-health check to the answer caching pipeline — only cache answers whose top-3 citations return HTTP 200. Set a TTL on cached answers proportional to domain stability (news: 1 h; reference: 7 days).
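The detect step's core metric is a 404 rate over sampled citations. A minimal sketch with the HTTP fetcher injected so the logic is testable without network access (all names are illustrative; a real checker would use async HEAD requests with timeouts):

```python
ALARM_404_RATE = 0.05  # runbook threshold: alarm at >5% dead citations


def dead_citation_rate(fetch_status, sampled_answers, top_k=3):
    """Re-fetch the top-k citation URLs of sampled answers; return the 404 rate.

    fetch_status: injected callable url -> HTTP status code (e.g. a HEAD
    request wrapper in production, a stub in tests).
    """
    statuses = [
        fetch_status(url)
        for answer in sampled_answers
        for url in answer["citations"][:top_k]
    ]
    if not statuses:
        return 0.0
    return sum(s == 404 for s in statuses) / len(statuses)
```

Run against a 1% sample of served answers every 10 minutes and alarm when the rate exceeds `ALARM_404_RATE`.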

Compare this freshness-driven runbook with the document-grounding challenges in NotebookLM's closed-corpus design and the broader taxonomy in RAG architecture comparison.

Quick check

Derivation

The incident-cost table shows a 120× multiplier between 2-minute and 4-hour detection windows. This is the same ratio as ChatGPT's routing regression. What does this structural invariant reveal about detection investment?
🏢

Company Lens — same design, different pushes

Google's push

Expect the interviewer to drill on freshness and sharding at web scale. “How do you index the web and keep embeddings fresh when the corpus is tens of billions of documents? What is your sharding strategy for the vector index at 10 B+ passages? Have you considered Spanner-backed embedding stores for consistency? How do you handle embedding model upgrades when a billion vectors need to be recomputed?” Google's bar for infrastructure depth is high — treat the vector store as a first-class distributed systems problem with its own design section, not a managed service you plug in.

Anthropic's push

Expect drill-down on refusal behavior and preventing confident-but-wrong answers. “What is the right behavior on a query where only one low-quality source is retrieved? How do you prevent the model from synthesizing a confident answer from insufficient evidence? How is refusal calibrated — when is uncertainty disclosure appropriate versus unhelpful? What does your groundedness eval actually measure, and how do you know the NLI judge isn't systematically wrong?” Anthropic cares deeply about calibrated uncertainty — the difference between a system that is honest about what it doesn't know and one that generates plausible text regardless.

Meta / OpenAI's push

Expect the drill to center on open-source serving stack choices and cost-per-query economics on owned hardware. The specific question pattern: “You're serving a 70B reranker and a 70B generator on your own A100 fleet. How do you decide between vLLM and TensorRT-LLM for each? What is your cost-per-query target, and how does it change when you switch from continuous batching to chunked prefill?”

The non-obvious answer: vLLM (community estimate: dominant for rapid iteration and broad model support) and TRT-LLM (community estimate: 1.5–2× higher throughput on fixed NVIDIA hardware via kernel fusion and quantization, per NVIDIA TRT-LLM benchmarks) make different trade-offs. For the LLM generator — high query diversity, output-token-heavy workloads — vLLM's PagedKV and continuous batching are a natural fit, and iteration speed matters. For the cross-encoder reranker — fixed input length, no output generation, latency-sensitive — TRT-LLM's kernel fusion and INT8 quantization can cut per-query latency by 30–40% on the same hardware, which directly improves the 300–400 ms reranker budget. The cost-per-query math on owned hardware: at ~$2/A100-hr (on-prem amortized assumption), a 70B reranker processing 50 passages per query at 200 QPS consumes roughly 1 GPU continuously — $2/hr ÷ (200 QPS × 3,600 s/hr) ≈ $0.0000028 per query, or about $0.003 per thousand queries, just for reranking. Quantizing to INT8 cuts this roughly in half. This is the lever Meta and OpenAI interviewers want to see you reach for — not a managed API, but the hardware efficiency math.
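The cost arithmetic above is worth being able to reproduce on a whiteboard: dollars per hour divided by queries per hour. A minimal sketch (the function name is illustrative):

```python
def cost_per_query(gpu_hourly_usd: float, gpus: float, qps: float) -> float:
    """Amortized GPU cost per query: fleet dollars per hour / queries per hour."""
    return (gpu_hourly_usd * gpus) / (qps * 3600)


# 1 GPU at $2/hr serving 200 QPS of reranking
rerank = cost_per_query(2.0, gpus=1, qps=200)   # ≈ $0.0000028/query
int8 = rerank / 2                               # INT8 quantization roughly halves it
```

At these numbers the reranker costs about $0.003 per thousand queries; the same formula scales linearly if the workload actually needs a multi-GPU fleet.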

🧠

Key Takeaways

What to remember for interviews

  1. Retrieval quality dominates model quality for answer engines. Optimize the retrieval stack — hybrid retriever, reranker — before touching the LLM.
  2. Write the eval harness first, before architecture. Groundedness (NLI claim check) and citation precision are the leading indicators that predict user trust.
  3. The LLM judge evaluating your groundedness needs its own calibration (Shreya Shankar's "Who Validates the Validators"). An uncalibrated judge gives false confidence.
  4. The cross-encoder reranker is where citation precision is won or lost. It operates on top-50 → top-5, applying full cross-attention; never skip it to save latency without measuring the quality cost.
  5. LLM upgrades are implicit changes to the citation binder. Require a citation-format compatibility test on every LLM version change — silent binder breakage is the sneakiest regression type.
  6. Silent quality regressions — especially citation precision drops — require continuous NLI sampling of production traffic to detect. They won't show up in uptime metrics.
🧠

Recap quiz

Trade-off

Perplexity's p95 first-result SLO is 1.5 s, looser than a pure chat product at ~800 ms. Which component makes tightening to 800 ms dangerous for product quality?
Derivation

A Perplexity-style system runs at 200 QPS. The cross-encoder reranker consumes roughly 1 GPU (T4) continuously. At $2/GPU-hr on-prem amortized, what is the reranker's cost per query, and which optimization halves it without changing the retrieval stack?
Derivation

Citation precision silently regresses from 95% to 85% after a reranker upgrade. The regression is undetected for 4 hours at 200 QPS. How many queries receive degraded citations, and why doesn't uptime monitoring catch this?
Trade-off

Dense bi-encoder retrieval and BM25 sparse retrieval are fused via reciprocal rank fusion (RRF). What is the non-obvious reason RRF is preferred over a learned score combiner for this fusion?

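For reference while reasoning about this question, RRF itself is tiny. A minimal sketch (k = 60 is the conventional smoothing constant; the function name is illustrative):

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over rankers of 1 / (k + rank_d).

    Purely rank-based, so BM25 scores and cosine similarities never need to
    share a scale, and there is nothing to train or retrain.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Note that a document appearing in both lists (here "b" and "c") accumulates score from each, which is exactly the agreement signal the fusion rewards.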
Trade-off

A new LLM is deployed. Generation quality improves, but citation precision drops from 95% to 60% within 2 hours. The most likely root cause is:

Trade-off

A Perplexity-style cache achieves only 5–15% hit rate versus ~35% for a chat product. The root cause is structural, not a tuning problem. Which architectural property of an answer engine drives this?

Derivation

A system reports 92% groundedness but only 80% citation precision. Is this possible, and what does it reveal about the system's architecture?
🎯

Interview Questions


Your eval shows LLM generation quality increased 3% after swapping to a larger model, but user satisfaction is flat. Where do you look first?

★★☆
Anthropic · Google

Design the freshness subsystem for Perplexity. How do you decide which queries trigger a live web fetch versus serving from the vector index?

★★★
Google · OpenAI

The cross-encoder reranker was upgraded. Citation precision silently dropped from 95% to 85%. How was this not caught before it reached production?

★★★
Anthropic · Google

Perplexity's vector index has 10B passages. An HNSW shard goes offline. What does the user experience? How should the system degrade gracefully?

★★★
Google

On low-evidence queries ('what are the symptoms of X rare disease in children?'), Perplexity has insufficient retrieved documents. What should the system do?

★★☆
Anthropic
📚

Further Reading