🔭 Case: Design Perplexity
Half retrieval system, half LLM — which half should you optimize first?
Perplexity is an answer engine, not chat. Retrieval and citations dominate; the LLM is the summarizer. Design retrieval first, because trust is usually lost on the first bad citation, not the first weak phrasing.
Requirements & SLOs
The Working-Backwards paragraph
Write the customer-facing sentence before any engineering word.
SLO table
| Metric | Target | Why this value |
|---|---|---|
| p95 first-result latency | ~1.5 s | Slower than a pure chat product because retrieval + rerank adds ~400–600 ms to the critical path; streaming hides the rest |
| Citation precision | ≥ 95% | Does the cited source actually support the claim? Measured via NLI entailment on a sampled production stream — this is the core trust metric |
| Groundedness score | ≥ 90% | Fraction of atomic claims in the answer traceable to a retrieved source; below 90% means the LLM is filling gaps from training memory |
| Freshness SLO (news / prices) | Index lag ≤ 15 min | Contractual for Pro tier; the freshness SLO drives the live-web fetch path design — index lag > 15 min triggers live fetch regardless of classifier output |
| Refusal/uncertainty behavior | Calibrated disclosure, not hard refusal | On low-evidence queries, the system discloses uncertainty inline rather than refusing; refusal trains users to distrust the product for edge cases |
| Availability | Standard enterprise-grade SaaS | Partial shard degradation counts as degraded-not-down unless citation precision falls below the floor |
Quick check
The freshness SLO is “index lag ≤15 min for news.” When does this SLO force the live web fetch path to activate, even if the query classifier is uncertain?
Eval Harness (design first)
This is the most important section for Perplexity — and the one most candidates skip. Unlike a chat product where generation quality is the primary eval target, here you need four independent eval axes, and they are not equally important.
(a) Groundedness — NLI-style claim check
Extract atomic claims from the generated answer. For each claim, use a natural language inference (NLI) model to check whether the cited source entails that claim, contradicts it, or is neutral. Groundedness = entailed claims / total claims. Threshold: ≥90%. This is the eval that catches the LLM drifting to its training memory when retrieved sources are weak — the worst invisible failure mode.
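A minimal sketch of that scoring loop, assuming you already have an atomic-claim extractor and an NLI model behind simple callables; `extract_claims` and `nli` below are placeholders, not a specific library API.

```python
# Placeholder signatures (assumptions, not a specific library):
#   extract_claims(answer: str) -> list[str]   # atomic-claim extraction, e.g. an LLM prompt
#   nli(premise: str, hypothesis: str) -> str  # "entailment" | "neutral" | "contradiction"

GROUNDEDNESS_FLOOR = 0.90  # the >=90% threshold from the SLO table

def groundedness(answer: str, sources: list[str], extract_claims, nli) -> float:
    """Fraction of atomic claims entailed by at least one retrieved source."""
    claims = extract_claims(answer)
    if not claims:
        return 1.0  # nothing to verify; treating this as vacuously grounded is a policy choice
    entailed = sum(
        1 for claim in claims
        if any(nli(src, claim) == "entailment" for src in sources)
    )
    return entailed / len(claims)
```

Citation precision (next) reuses the same loop, but scores each claim only against the source cited for it rather than the full retrieved context.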
(b) Citation precision
Distinct from groundedness: groundedness asks whether any source supports the claim; citation precision asks whether the specific source cited inline supports it. A system can be highly grounded but have poor citation precision if the citation binder mis-assigns sources to the wrong claims. Measure via the same NLI pipeline, but restricted to the (claim, cited-source) pair rather than the full retrieved context.
(c) Retrieval recall@k on golden queries
Maintain a golden query set of ~500 queries where the correct source documents are known (manually annotated). Measure whether those source documents appear in the top-k retrieved results (Recall@5, Recall@10). This eval decouples retrieval quality from generation quality — if Recall@5 is 60%, no amount of LLM improvement will fix citation precision. Retrieval failure is the root cause; generation quality improvement is noise.
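A sketch of the Recall@k computation over the golden set; `retrieve` and the golden-set shape are assumptions standing in for whatever retrieval entry point and annotation format you actually use.

```python
def recall_at_k(golden_set, retrieve, k: int = 5) -> float:
    """golden_set: iterable of (query, annotated_relevant_doc_ids) pairs.
    retrieve(query, top_k) -> results exposing a .id attribute (assumed interface)."""
    hits, total = 0, 0
    for query, relevant_ids in golden_set:
        retrieved_ids = {doc.id for doc in retrieve(query, top_k=k)}
        # Count a hit if any annotated source document shows up in the top-k
        # (a stricter all-docs variant is equally valid; pick one and keep it fixed).
        hits += int(bool(retrieved_ids & set(relevant_ids)))
        total += 1
    return hits / total if total else 0.0
```

Track Recall@5 and Recall@10 separately: Recall@5 bounds what the reranker can surface; the gap to Recall@10 tells you whether the fix belongs in retrieval or reranking.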
(d) End-to-end task success
A curated benchmark of ~200 query/expected-answer pairs, judged by an LLM evaluator calibrated against a 50-example human-reviewed subsample. Measures whether the full pipeline (retrieval → reranking → generation → citation binding) produces a correct and useful answer. This is the metric tied to user satisfaction in A/B tests — but it is a lagging indicator. Groundedness and citation precision are leading indicators that predict task success before a user ever sees the output.
The offline/online bridge for Perplexity
The offline eval metric that most reliably predicts user retention is citation precision — not task success, not generation quality. When running A/B experiments, track citation precision as the primary guardrail metric. A variant that improves task success by 2% but drops citation precision by 5% should be rejected: users will feel the citation break before the task-success improvement shows in retention numbers.
Back-of-Envelope
The QPS target for Perplexity-scale is lower than a pure chat product — retrieval-heavy queries are more expensive end-to-end, so the same revenue supports fewer concurrent LLM requests. Seed the calculator at 200 QPS with a 70B flagship and a 10% prefix cache hit rate (lower than chat because query diversity is higher — few users repeat the exact same question).
Critical caveat: the calculator models only the LLM serving cost. For Perplexity, this is not the full picture.
| Metric | Value |
|---|---|
| Model Weights (FP16) | 140 GB |
| KV Cache / Request | 515.4 MB (1536 tokens) |
| Tokens/sec per GPU | 300 |
| Effective QPS (after cache) | 180 |
| GPUs Needed | 309 |
| GPU Memory Usage | 60% |
| Compute Utilization | 100% |
| Est. p95 Latency | 3.56 s |
| Bottleneck | Compute/Bandwidth |
| Monthly Cost | $451,140 |
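A rough reproduction of those headline numbers. The per-request output length (~515 tokens) and the ~$2/A100-hr amortized rate are assumptions back-derived from the table, not published figures; swap in your own actuals.

```python
import math

# Assumptions back-derived from the table above (replace with your actuals):
QPS = 200
PREFIX_CACHE_HIT = 0.10
AVG_OUTPUT_TOKENS = 515            # per-request decode length (assumption)
TOKENS_PER_SEC_PER_GPU = 300
GPU_HOUR_COST = 2.00               # on-prem amortized A100 rate (assumption)
HOURS_PER_MONTH = 730

effective_qps = QPS * (1 - PREFIX_CACHE_HIT)                              # 180
decode_tokens_per_sec = effective_qps * AVG_OUTPUT_TOKENS                 # 92,700
gpus_needed = math.ceil(decode_tokens_per_sec / TOKENS_PER_SEC_PER_GPU)   # 309
monthly_cost = gpus_needed * GPU_HOUR_COST * HOURS_PER_MONTH              # 451,140

print(f"GPUs: {gpus_needed}, monthly cost: ${monthly_cost:,.0f}")
```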
Baseline: Perplexity search-RAG path — 800 GPUs @ $3/hr at p99 5000 ms, 8,000 QPS, 35% cache hit.
The RAG path has a fundamentally different SLO shape than pure chat: a 5 s p99 budget allows larger batches, but low cache hit rates (35%, community estimate) mean most requests hit the GPU cold. Tighten the SLO to 2 s and watch the GPU count nearly triple — the latency budget is the primary lever here, not QPS.
| Metric | Value |
|---|---|
| Effective QPS (after cache) | 5,200 |
| Latency-batch factor | 1.00× |
| GPUs Needed | 800 (+0% vs baseline) |
| Hourly Burn | $2,400 (+0% vs baseline) |
| Cost / Request | $0.00008 |
| Monthly Burn (24×7) | $1,752,000 |
| Bottleneck | Balanced |
Architecture
Perplexity-Style Answer Engine Architecture
The critical path runs: Gateway → Retriever → Reranker → LLM Server → Citation Binder. Each component's role and latency budget appears in the justification table below.
Component justification table
| Component | Latency budget | Why it exists |
|---|---|---|
| API Gateway | < 10 ms (typical for in-datacenter L7 proxy) | Auth, per-tier rate limiting, query logging, freshness classifier routing. Needed because free and Pro tiers have different latency and freshness SLOs. |
| Hybrid Retriever (dense + BM25) | 50–100 ms (HNSW ANN p50; see FAISS benchmarks at ann-benchmarks.com) | Dense (bi-encoder embeddings) captures semantic similarity; BM25 sparse retrieval captures exact-match long-tail queries (entity names, product codes, rare terms). Reciprocal rank fusion combines the two result lists without a learned weighting parameter. |
| Vector Store | included above | HNSW-backed approximate nearest-neighbor index, sharded by content domain or recency tier. Needs tenant-aware sharding for Pro-tier custom corpus features. Online re-indexing pipeline for freshness. |
| Cross-Encoder Reranker | 300–400 ms (typically on a T4-class GPU scoring top-50 passages; derived: 50 × ~6 ms/pair inference, minus batching gains) | Takes top-50 from hybrid retrieval, re-scores each (query, passage) pair with a cross-encoder model (full cross-attention), outputs top-5. This is where citation precision is won or lost — the reranker is the last retrieval filter before the LLM sees the context. |
| Live Web Fetch (conditional) | 200–500 ms (typical external HTTP round-trip; depends on publisher CDN; qualitative range) | Triggered only for freshness-critical queries (news, prices, sports scores). Adds latency and upstream publisher rate-limit risk. The freshness SLO (≤15 min index lag) drives when this path is required vs. the index path being sufficient. |
| LLM Server (with prefix cache) | (CapacityCalculator derivation above; 70B on A100, 200 QPS) | Receives the assembled prompt: system instructions + top-5 retrieved chunks + query. Prefix cache saves compute on the static system-prompt portion. Output is streamed as it generates — TTFT is what matters; full completion is hidden by streaming. |
| Citation Binder | < 50 ms (string-match or embedding lookup over top-5 passages; CPU-bound) | Attaches source IDs to generated spans. Two approaches: in-prompt tagging (LLM told to output citation tokens inline) vs. post-hoc alignment (citation spans extracted from the completed output). The binder is a first-class component because citation precision is a first-class SLO. |
| Semantic Cache | < 5 ms (hit) (in-process embedding lookup; cache hit skips all downstream stages) | Caches full (answer + citations) for high-frequency queries using embedding similarity as the cache key. Hit rate is low (5–15%) compared to chat products because answer engine queries are more diverse, but the cost savings on cache hits are large — the entire retrieval + rerank + LLM stack is skipped. |
Quick check
A PM proposes skipping the cross-encoder reranker on free-tier queries to save 300–400 ms of latency. The engineering team pushes back. What is the most defensible technical reason?
Deep Dives
Two components carry most of the product risk: the retrieval stack and the citation binder.
Deep dive A — Hybrid retriever + reranker stack
Approach
The retrieval stack runs in two stages deliberately. Stage 1 is the recall stage: a hybrid of dense bi-encoder retrieval and BM25 sparse retrieval returns the top-50 candidate passages from the full index in 50–100 ms. Stage 2 is the precision stage: a cross-encoder reranker re-scores each of those 50 (query, passage) pairs with full cross-attention, selecting the top-5 that actually go into the LLM prompt.
Dense embedding model choice. A larger embedding model (e.g., OpenAI text-embedding-3-large, 3072-dim) retrieves more accurately than smaller open alternatives (e.g., BGE-base, 768-dim), but at 4× the embedding inference cost per query. The production trade-off: use the larger model for Pro tier and a distilled open-source model (e.g., BGE-large, 1024-dim) for the free tier. Measure the quality delta on your golden query set — if Recall@5 only improves 2–3 pp, the cost difference is not worth it.
BM25 sparse for long-tail and exact-match. Dense retrieval systematically underperforms on rare entity names, technical identifiers, and exact-match queries where the user typed the same words that appear in the source. BM25 (or equivalent inverted index) catches these. The two result lists are merged via reciprocal rank fusion (RRF): each document's combined score is the sum of 1/(k + rank) across both lists, where k is a smoothing constant (typically 60). RRF requires no learned weighting parameter — it is robust to scale differences between the two rankers. This matters because the score distributions from dense cosine similarity and BM25 TF-IDF are not on the same scale; any learned combiner would overfit to your training distribution.
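A minimal RRF implementation matching that description; it operates purely on ranks, so the dense and BM25 score scales never have to be reconciled.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60, top_n: int = 50) -> list[str]:
    """Fuse ranked lists of doc IDs (e.g., dense top-50 and BM25 top-50) by reciprocal rank."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Contribution depends only on rank, never on the raw retriever score,
            # so no cross-retriever score calibration is needed.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)[:top_n]

# fused_top50 = rrf_fuse([dense_doc_ids, bm25_doc_ids])
```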
Cross-encoder reranker on top-50 → top-5. The cross-encoder (e.g., Cohere Rerank, or an open-source model like bge-reranker-large) receives the full (query, passage) pair and scores them with cross-attention — seeing both simultaneously. This is significantly more accurate than bi-encoder dot product at the cost of O(k) inference calls per query. Applying it to top-50 only keeps the latency budget manageable (typically 300–400 ms on a T4-class GPU, derived from ~50 pairs × ~6–8 ms per pair, offset by batching). Never apply the cross-encoder to the full index — that is the role of the fast ANN retriever.
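A sketch of the stage-2 rerank using the sentence-transformers CrossEncoder API; the model name is one public example, not a claim about what any production system serves.

```python
from sentence_transformers import CrossEncoder

# Example public checkpoint; substitute whatever reranker you actually serve.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, passages: list[str], top_k: int = 5) -> list[str]:
    # One (query, passage) forward pass per candidate: O(k) inference, which is why
    # this stage only ever sees the retriever's top-50, never the full index.
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [p for p, _ in ranked[:top_k]]
```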
Trade-off (the non-obvious one)
The non-obvious trade-off is not latency vs. quality — it is diversity vs. precision. The cross-encoder optimizes relevance; it does not optimize for whether the top-5 passages collectively cover the full answer. If the reranker surfaces five passages that all say the same thing about a topic, the LLM will produce a narrowly-sourced answer with high citation precision but low completeness. For multi-faceted queries (e.g., “compare X and Y on dimensions A, B, C”), you may need to inject diversity explicitly — either by capping the number of passages from any single source URL in the top-5, or by using a maximal-marginal-relevance (MMR) re-selection step after reranking. The right choice depends on whether citation precision or answer completeness is the higher-priority SLO for your product.
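If you take the MMR route, a minimal re-selection pass over the reranker's candidates looks roughly like the sketch below; the embeddings and the λ value are assumptions you would tune against your own completeness metric.

```python
import numpy as np

def mmr_select(cand_vecs: np.ndarray, relevance: np.ndarray,
               top_k: int = 5, lam: float = 0.7) -> list[int]:
    """cand_vecs: (n, d) unit-normalized passage embeddings; relevance: (n,) reranker scores.
    Returns indices of the selected passages, most relevant-but-novel first."""
    selected: list[int] = []
    remaining = list(range(len(relevance)))
    while remaining and len(selected) < top_k:
        def mmr_score(i: int) -> float:
            # Redundancy = similarity to the closest already-selected passage.
            redundancy = max((float(cand_vecs[i] @ cand_vecs[j]) for j in selected), default=0.0)
            return lam * float(relevance[i]) - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```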
Failure mode
Reranker drift on model upgrades. If the reranker model is updated and quietly changes its scoring distribution, the new top-5 may systematically favor certain content types or recency signals — degrading citation precision without triggering any retrieval-recall alert (because the new model might have higher Recall@50, just a different top-5 composition). This is the “Who Validates the Validators” problem applied to retrieval: the metric used to qualify the reranker (NDCG on the benchmark) does not measure the downstream metric you actually care about (citation precision in production).
Detection metric
Shadow-eval citation precision: run the new reranker model in parallel on 5% of live traffic, compare citation precision (NLI entailment on the sampled stream) against the production model. Alert threshold: if shadow precision falls more than 2 pp below the production baseline, block promotion. This needs to run for at least 24 hours to cover the daily query distribution shift (news traffic spikes in the morning; technical queries spike at night).
Mitigation
Block reranker upgrades behind the shadow-eval gate. Secondary effect to watch: a new reranker that scores differently may change which sources dominate the top-5. If the new model consistently favors lower-authority sources (e.g., social media over academic or news sources), citation credibility degrades even if NLI entailment is unchanged — users perceive source authority, not just entailment scores.
Real-world example
The Cohere team's Rerank blog post documents 10–15 NDCG point improvements on MS MARCO when adding a cross-encoder reranker over dense retrieval alone — and the complementary result is that removing the reranker costs that same precision gap. Eugene Yan's LLM patterns post documents the same two-stage pattern (recall then precision) as the production-grade baseline for RAG systems and notes that single-stage retrieval reliably underperforms on citation-heavy use cases.
Deep dive B — Citation binder
Approach
The citation binder is the component that attaches a source ID to each factual claim in the generated answer. Two production approaches exist, with meaningfully different quality characteristics.
In-prompt tagging (preferred for accuracy). The LLM is instructed in the system prompt to output inline citation tokens (e.g., [1], [2]) keyed to source IDs passed in the context. The model learns to cite as it generates — the citation token appears immediately after the claim it supports. Citations are semantically grounded at generation time; the model is deciding citation placement while it has full attention over both the claim being generated and the source passages. Disadvantage: requires the model to be fine-tuned or carefully prompted to reliably produce citation tokens in the right position; citation precision degrades when the model hallucinates a citation index that does not exist in the provided sources. A citation token like [7] when only 5 sources were provided is a silent hallucination — the binder must handle this gracefully (strip the token, log it, alert if the rate exceeds 1%).
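A defensive parse step for those inline tokens, sketched under the assumption that citations look like `[n]`; it strips out-of-range indices and counts them so the parse-failure alert described later has something to measure.

```python
import re

CITATION_TOKEN = re.compile(r"\[(\d+)\]")

def bind_citations(answer: str, num_sources: int) -> tuple[str, set[int], int]:
    """Returns (cleaned_answer, cited_source_indices, dropped_token_count)."""
    cited: set[int] = set()
    dropped = 0

    def check(match: re.Match) -> str:
        nonlocal dropped
        idx = int(match.group(1))
        if 1 <= idx <= num_sources:
            cited.add(idx)
            return match.group(0)   # keep a valid token
        dropped += 1                # e.g. [7] with only 5 sources: strip it, but count it
        return ""

    cleaned = CITATION_TOKEN.sub(check, answer)
    return cleaned, cited, dropped  # emit `dropped` to monitoring; alert if the rate exceeds ~1%
```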
Post-hoc alignment (simpler, lower accuracy). Generate the answer first without citation tokens. Then a separate alignment step identifies claim spans in the output and matches each span to the most similar retrieved source chunk via embedding similarity or NLI entailment scoring. Works with any base LLM without prompt engineering changes — useful when you cannot fine-tune the model. The critical disadvantage: a claim that was genuinely synthesized from training memory (not from any retrieved source) will still receive a citation if the NLI model finds a plausible-but-wrong match. This inflates apparent citation precision while masking groundedness failures — you appear to have high citation coverage while the LLM is quietly hallucinating.
Trade-off (the non-obvious one)
The non-obvious trade-off in in-prompt tagging is between citation granularity and generation fluency. A model that cites at the sentence level produces high-precision citations but generates choppy prose interrupted by citation tokens. A model that cites at the paragraph level produces fluent prose but loses the precision mapping of which specific sentence relied on which source — making it harder for users to verify claims and harder for your eval pipeline to score citation precision accurately. The right granularity is claim-level (each independently verifiable assertion), but getting a model to reliably identify claim boundaries at generation time requires fine-tuning, not just prompting.
Failure mode
Silent binder breakage on LLM upgrades. A new LLM version is deployed. The previous model emitted [1] style tokens; the new model was instruction-tuned to use [Source 1] or footnote-style markers. The binder's regex parser sees zero recognized citation tokens, falls back to post-hoc alignment, and produces lower-precision citations — or produces uncited output entirely. The generation quality metrics improve (the new model is better), but citation precision plummets. This fails silently: the LLM output looks fine; the binder fails downstream, invisible to standard LLM quality metrics.
Detection metric
Monitor citation-token parse failure rate in the binder as a first-class operational metric. Alert threshold: >1% of responses returning zero parseable citation tokens (outside of known low-evidence query types). Also track citation precision on the golden query set continuously — a drop that appears within hours of an LLM deployment is the signature of a binder compatibility break, not a retrieval issue.
Mitigation
Treat every LLM upgrade as requiring a citation-format compatibility test before promotion: run the new model on the golden query set, verify citation token parse rate ≥ 99%, verify citation precision ≥ 95% on the sample. Gate promotion on both checks. Secondary effect: adding citation format to the pre-deployment smoke test suite makes LLM upgrades safer across all downstream components, not just the binder.
A practical hardening step often missed: version-pin the citation prompt template alongside the LLM model version in your deployment config. When a rollback is needed, rolling back the model without rolling back the prompt produces a new mismatch. Treat (model version, prompt version, binder parser version) as a triple that must always move together — deploy as a unit, rollback as a unit. This eliminates an entire class of silent binder failures that arise from partial rollbacks.
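One way to make that rule concrete, sketched as a frozen release object; the class and field names and the deploy call are hypothetical, and the invariant is simply that the triple deploys and rolls back as a unit.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AnswerPipelineRelease:           # hypothetical name
    model_version: str                 # the LLM build
    prompt_version: str                # citation prompt template matching that model's format
    binder_parser_version: str         # parser that understands that template's tokens

def deploy(release: AnswerPipelineRelease) -> None:
    # Stand-in for your real deployment call. The invariant: it only accepts the whole
    # triple, so a partial rollback (model without prompt, prompt without parser) is
    # unrepresentable in the config.
    print(f"deploying {release}")

CURRENT = AnswerPipelineRelease("llm-2025-03", "cite-prompt-v7", "binder-parser-v7")
PREVIOUS = AnswerPipelineRelease("llm-2025-01", "cite-prompt-v6", "binder-parser-v6")

deploy(PREVIOUS)   # rollback = redeploying an older triple, never one field of it
```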
Real-world example
LlamaIndex's citation eval framework (documented in their citation evaluation docs) shows that post-hoc alignment consistently scores 5–10 pp lower on citation faithfulness than in-prompt tagging when evaluated against the same retrieved context — confirming that the choice of binder architecture has a measurable, non-trivial effect on the product's core trust metric. The gap widens on multi-hop queries where the LLM synthesizes a conclusion from two sources simultaneously: post-hoc alignment can only attach one source to the synthesized claim, while in-prompt tagging can emit [1][2] inline as the model generates the fused sentence. Eugene Yan's LLM patterns survey identifies the citation binder as a first-class design component in production RAG systems, not a post-processing afterthought — and notes that the most common production mistake is treating it as a string-formatting step rather than a component with its own eval gate and monitoring.
Quick check
Your eval says LLM generation quality improved 3% after upgrading the model, but user satisfaction is flat for the week after launch. What do you investigate first?
Break It
Three failure modes, each mappable to a metric in the eval harness. Every break should have a detection path — if you can't measure the break, the mitigation is guesswork.
Remove the cross-encoder reranker
Top-50 from the hybrid retriever feeds directly into prompt assembly without reranking. The LLM now receives a noisier, less precise top-5 (selected by raw retrieval score rather than cross-encoder precision). Citation precision drops measurably — community benchmarks (e.g., MS MARCO two-stage retrieval, Nogueira et al. 2020, and the Cohere Rerank blog) suggest a 10–15 precision-point regression when removing the reranker from a two-stage RAG pipeline. Groundedness regresses in parallel because the LLM fills gaps with training memory when the retrieved context is less relevant. Detection: citation precision crosses the 95% floor; NLI groundedness drops below 90%. Mitigation: restore the reranker; if latency is the concern, run the reranker on a dedicated low-latency CPU fleet rather than GPU — cross-encoders are not as compute-intensive as the LLM.
Remove the freshness tier
All queries served from the static index with no live web fetch. For general knowledge queries: no visible impact. For news, sports, prices, and breaking events: users immediately receive stale answers — yesterday's stock price, last week's election result, a product price from three months ago. The trust collapse is disproportionate to the query volume because freshness failures are the most visible failure mode — a user who asks “what is the score right now?” and receives a two-day old answer immediately notices and attributes it to the product overall. Detection: freshness-critical query cohort satisfaction drops in user feedback; post-hoc timestamp analysis of cited sources shows lag > 15 min SLO. Mitigation: restore live fetch for the freshness-critical query classifier output; consider a tiered freshness SLO disclosure so users know which queries are live-sourced vs. index-sourced.
Swap the LLM without refreshing the citation prompt
A new model version is deployed. The system prompt instructs the LLM to emit inline citation tokens like [1], [2]. The new model was trained with a different citation format (e.g., [Source 1] or footnote-style). Citation tokens are now malformed, unparseable, or missing entirely. The citation binder silently produces uncited answers or garbled source links. This fails silently — the generation quality may be identical or better, but citations break without any error signal in the LLM output itself. Detection: citation-token parse failure rate in the binder should be a monitored metric; set an alert for >1% parse failure. Citation precision metric crosses the floor within hours if the golden-set eval runs continuously. Mitigation: treat LLM upgrades as requiring a citation format compatibility test before promotion; add citation-token parsing to the pre-deployment smoke test suite.
Quick check
The freshness tier is removed. News queries are served from the static index. The trust collapse is described as “disproportionate to the query volume.” Why is a freshness failure more damaging per-query than a citation precision failure?
What Does a Bad Day Cost?
Three failure modes with worked-math cost rows. Assumptions are labeled — replace with your actuals on the whiteboard. The detection-window column is the key insight: for silent quality regressions, cost scales with detection lag, not incident duration.
| Incident | Detected in 2 min | Detected in 4 h | Detection ratio |
|---|---|---|---|
| HNSW shard offline — affected queries receive degraded answers with no error to user | 2 min × 200 QPS × 5% shard share = 1,200 degraded queries Assumption: shard covers 5% of index; 200 QPS steady state. Engineering cost: on-call response (~$500 loaded eng-hour × 0.5 h = $250). Trust cost: ~1,200 users affected, low virality risk. | 240 min × 200 QPS × 5% = 144,000 degraded queries 120× more users receive wrong citations. At ~2% of users sharing negative experiences publicly (assumption), that is ~2,880 public signals — measurable brand damage. | 120× |
| Reranker upgrade silently regresses citation precision 95% → 85% | 2 min × 200 QPS × 10 pp drop × $0.002 trust-damage proxy ≈ $5 Trust-damage proxy: $0.002/degraded-query ≈ Pro-tier churn value × churn probability per bad citation (assumption: Pro ARPU ~$20/mo, 1% churn probability per bad-citation encounter = $0.20/user; sampled at 1% of traffic = $0.002/query). With 2-min detection, rollback before meaningful user exposure. | 240 min × 200 QPS × 10 pp drop × $0.002 ≈ $576 Same math, 120× more queries exposed. Additionally: the regression may persist for days if NLI sampling is absent — multiplying the window further. A 24-hour undetected regression at these assumptions costs ~$3,500 in trust-damage proxy, plus non-linear churn compounding once users encounter multiple bad citations. | 120× (or ~720× if undetected for 24 h) |
| Publisher rate-limit breaks real-time news freshness SLO | Affected: news-query share × 200 QPS × 2 min = 15% × 200 × 120 s = 3,600 stale news queries Assumption: news queries ~15% of traffic. Contractual exposure: Pro-tier SLO breach clock starts immediately. Engineering cost: incident response + publisher negotiation. | 15% × 200 × 14,400 s = 432,000 stale news queries At 4 h, a major breaking news event (election result, market crash) may have been fully served stale. PR risk is non-linear — one viral screenshot of a wrong election result costs more than the SLO breach math suggests. | 120× |
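The middle row's arithmetic, parameterized by detection lag; all constants are the table's stated assumptions.

```python
QPS = 200
PRECISION_DROP = 0.10         # 95% -> 85%: 10 pp of queries receive a degraded citation
TRUST_COST_PER_QUERY = 0.002  # $/degraded query (ARPU x churn probability x sampling, per the table)

def trust_damage(detection_minutes: float) -> float:
    degraded = detection_minutes * 60 * QPS * PRECISION_DROP
    return degraded * TRUST_COST_PER_QUERY

for minutes in (2, 240, 1_440):   # 2 min, 4 h, 24 h
    print(f"{minutes:>5} min undetected -> ${trust_damage(minutes):,.0f}")
# ~$5, ~$576, ~$3,456: cost scales with detection lag, not with how long the fix takes.
```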
Perplexity On-call Runbook
Index staleness after upstream crawler rate-limit
MTTR p50 / p99: 20 min / 90 min. Blast radius: News and finance queries return documents up to 24 h stale. Pro-tier freshness SLO (≤15 min lag for news) breached immediately. High virality risk if a major breaking event is served stale.
- 1. Detect: Index-lag monitor compares crawl-ingest timestamp of top-10 news domains against wall clock. Alarm at lag >30 min. Secondary: freshness-intent query share spiking with unchanged answer age.
- 2. Escalate: Page crawler on-call and search infra lead. Identify which publisher domains are rate-limited (crawler error log). Do not surface stale results silently — add an “information may be outdated” badge in the UI while recovery runs.
- 3. Rollback: Switch news-intent queries to a backup crawler with separate rate-limit budget. If unavailable: disable live-fetch for the affected domains and serve vector-index results with explicit staleness disclosure. Negotiate emergency rate-limit increase with publisher.
- 4. Post: Add per-domain crawl-lag SLO with automatic publisher failover. Implement index-age badge in answer UI so users can self-assess freshness. Add freshness-intent golden queries to CI so regressions surface before deploy.
Reranker latency spike on adversarial long queries
MTTR p50 / p99: 10 min / 35 min. Blast radius: Cross-encoder reranker times out on queries with >50 retrieved candidates and >500-token query length. Affected requests fall back to BM25-only ranking — citation precision drops 15–20 pp (community estimate). Concentrated in power users who write detailed queries.
- 1. Detect: Reranker p99 latency alarm at 2× baseline (community estimate: ~300 ms baseline → alarm at 600 ms). Correlated spike in fallback-path counter.
- 2. Escalate: Page search on-call. Confirm via reranker latency heatmap whether the spike is query-length-correlated or uniform. If correlated: activate query-length cap on the reranker input.
- 3. Rollback: Apply the reranker input cap (truncate candidate list to top-20 and query to 200 tokens for the duration). Precision degrades slightly but latency stabilizes. Full rollback requires reranker model rollout — plan for a 30-min window.
- 4. Post: Add adversarial long-query stress test to reranker CI. Implement soft cap with graceful degradation (not hard timeout). Profile whether dynamic candidate-count scaling based on query complexity is feasible.
Citation drift in browsing mode — source URL returns 404 after answer is cached
MTTR p50 / p99: 30 min / 4 h. Blast radius: Users click citations and hit dead links. Trust damage is disproportionate: a broken citation is more visible than a wrong answer. Affects cached answers for queries where the source was ephemeral (news redirect, paywall shift).
- 1. Detect: Async citation-health checker re-fetches top-3 citations for a 1% sample of served answers every 10 min. 404-rate alarm at >5% of checked citations.
- 2. Escalate: Page search infra. Identify whether the dead-link rate is domain-wide (publisher site restructure) or scattered (individual article deletions). Prioritize by answer-impression volume.
- 3. Rollback: For cached answers with dead citations, re-fetch the query live and regenerate with fresh sources. For systematic publisher restructure: update URL-normalization rules in the crawler and reindex the affected domain. Surface “source unavailable” inline when re-fetch fails.
- 4. Post: Add citation-health check to the answer caching pipeline — only cache answers whose top-3 citations return HTTP 200. Set a TTL on cached answers proportional to domain stability (news: 1 h; reference: 7 days).
Compare this freshness-driven runbook with the document-grounding challenges in NotebookLM's closed-corpus design and the broader taxonomy in RAG architecture comparison.
Quick check
The incident-cost table shows a 120× multiplier between 2-minute and 4-hour detection windows. This is the same ratio as ChatGPT's routing regression. What does this structural invariant reveal about detection investment?
Company Lens — same design, different pushes
Google's push
Expect the interviewer to drill on freshness and sharding at web scale. “How do you index the web and keep embeddings fresh when the corpus is tens of billions of documents? What is your sharding strategy for the vector index at 10 B+ passages? Have you considered Spanner-backed embedding stores for consistency? How do you handle embedding model upgrades when a billion vectors need to be recomputed?” Google's bar for infrastructure depth is high — treat the vector store as a first-class distributed systems problem with its own design section, not a managed service you plug in.
Anthropic's push
Expect drill-down on refusal behavior and preventing confident-but-wrong answers. “What is the right behavior on a query where only one low-quality source is retrieved? How do you prevent the model from synthesizing a confident answer from insufficient evidence? How is refusal calibrated — when is uncertainty disclosure appropriate versus unhelpful? What does your groundedness eval actually measure, and how do you know the NLI judge isn't systematically wrong?” Anthropic cares deeply about calibrated uncertainty — the difference between a system that is honest about what it doesn't know and one that generates plausible text regardless.
Meta / OpenAI's push
Expect the drill to center on open-source serving stack choices and cost-per-query economics on owned hardware. The specific question pattern: “You're serving a 70B reranker and a 70B generator on your own A100 fleet. How do you decide between vLLM and TensorRT-LLM for each? What is your cost-per-query target, and how does it change when you switch from continuous batching to chunked prefill?”
The non-obvious answer: vLLM (community estimate: dominant for rapid iteration and broad model support) and TRT-LLM (community estimate: 1.5–2× higher throughput on fixed NVIDIA hardware via kernel fusion and quantization, per NVIDIA TRT-LLM benchmarks) make different trade-offs. For the LLM generator — high query diversity, output-token-heavy workloads — vLLM's PagedAttention and continuous batching are a natural fit and iteration speed matters. For the cross-encoder reranker — fixed input length, no output generation, latency-sensitive — TRT-LLM's kernel fusion and INT8 quantization can cut per-query latency by 30–40% on the same hardware, which directly improves the 300–400 ms reranker budget. The cost-per-query math on owned hardware: at ~$2/A100-hr (on-prem amortized assumption), a 70B reranker processing 50 passages per query at 200 QPS consumes roughly 1 GPU continuously — $2/hr ÷ (200 QPS × 3,600 s/hr) ≈ $0.000003/query just for reranking. Quantizing to INT8 cuts this roughly in half. This is the lever Meta and OpenAI interviewers want to see you reach for — not a managed API, but the hardware efficiency math.
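The same arithmetic as a short script, with the 1-GPU and $2/hr figures carried over as assumptions from the paragraph above.

```python
GPU_HOURLY_COST = 2.00   # on-prem amortized rate (assumption)
RERANK_GPUS = 1          # "roughly 1 GPU continuously" from the paragraph above (assumption)
QPS = 200

queries_per_hour = QPS * 3_600                                     # 720,000
cost_per_query = RERANK_GPUS * GPU_HOURLY_COST / queries_per_hour  # ~$0.0000028
print(f"reranker cost/query: ${cost_per_query:.7f}")
print(f"after INT8 (~2x throughput): ${cost_per_query / 2:.7f}")
```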
Key Takeaways
What to remember for interviews
- 1. Retrieval quality dominates model quality for answer engines. Optimize the retrieval stack — hybrid retriever, reranker — before touching the LLM.
- 2. Write the eval harness first, before architecture. Groundedness (NLI claim check) and citation precision are the leading indicators that predict user trust.
- 3. The LLM judge evaluating your groundedness needs its own calibration (Shreya Shankar's 'Who Validates the Validators'). An uncalibrated judge gives false confidence.
- 4. The cross-encoder reranker is where citation precision is won or lost. It operates on top-50 → top-5, applying full cross-attention; never skip it to save latency without measuring the quality cost.
- 5. LLM upgrades are implicit changes to the citation binder. Require a citation-format compatibility test on every LLM version change — silent binder breakage is the sneakiest regression type.
- 6. Silent quality regressions — especially citation precision drops — require continuous NLI sampling of production traffic to detect. They won't show up in uptime metrics.
Recap quiz
Perplexity's p95 first-result SLO is 1.5 s, looser than a pure chat product at ~800 ms. Which component makes tightening to 800 ms dangerous for product quality?
A Perplexity-style system runs at 200 QPS. The cross-encoder reranker consumes roughly 1 GPU (T4) continuously. At $2/GPU-hr on-prem amortized, what is the reranker's cost per query, and which optimization halves it without changing the retrieval stack?
Citation precision silently regresses from 95% to 85% after a reranker upgrade. The regression is undetected for 4 hours at 200 QPS. How many queries receive degraded citations, and why doesn't uptime monitoring catch this?
Dense bi-encoder retrieval and BM25 sparse retrieval are fused via reciprocal rank fusion (RRF). What is the non-obvious reason RRF is preferred over a learned score combiner for this fusion?
A new LLM is deployed. Generation quality improves, but citation precision drops from 95% to 60% within 2 hours. The most likely root cause is:
A Perplexity-style cache achieves only 5–15% hit rate versus ~35% for a chat product. The root cause is structural, not a tuning problem. Which architectural property of an answer engine drives this?
A system reports 92% groundedness but only 80% citation precision. Is this possible, and what does it reveal about the system's architecture?
Interview Questions
- ★★☆ Your eval shows LLM generation quality increased 3% after swapping to a larger model, but user satisfaction is flat. Where do you look first?
- ★★★ Design the freshness subsystem for Perplexity. How do you decide which queries trigger a live web fetch versus serving from the vector index?
- ★★★ The cross-encoder reranker was upgraded. Citation precision silently dropped from 95% to 85%. How was this not caught before it reached production?
- ★★★ Perplexity's vector index has 10B passages. An HNSW shard goes offline. What does the user experience? How should the system degrade gracefully?
- ★★☆ On low-evidence queries ('what are the symptoms of X rare disease in children?'), Perplexity has insufficient retrieved documents. What should the system do?
Further Reading
- Perplexity Engineering Blog — Primary source for Perplexity's architecture choices — retrieval stack, index design, and pro-tier feature decisions from the team that built it.
- Shreya Shankar — Who Validates the Validators? Towards LLM-Assisted Evaluation — The foundational paper for understanding why the LLM judge evaluating your RAG system needs its own calibration. Essential reading before designing any groundedness or citation-precision eval.
- Dense Passage Retrieval for Open-Domain Question Answering (Karpukhin et al., 2020) — The DPR paper that established dual-encoder dense retrieval as the production baseline. The retrieval recall numbers here are the standard against which all Perplexity-style systems are measured.
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction — Khattab & Zaharia 2020 — the architecture that keeps per-token embeddings and uses MaxSim scoring. Relevant to understanding why single-vector bi-encoders are the retrieval floor, not the ceiling.
- Eugene Yan — Patterns for Building LLM-Based Systems & Products — Practitioner-level survey of RAG, evals, guardrails, and citation patterns. The sections on retrieval, memory, and guardrails map directly to the Perplexity design problem.
- Chip Huyen — Building LLM Applications for Production — The canonical post on production LLM engineering. The hallucination and evaluation sections ground the citation-precision and groundedness design choices in this module.