🧭 Case: Design an Embeddings Platform
The day you change your embedding model, every index goes stale
Most engineers treat embeddings as an encoding problem. The senior insight is different: the day you change your embedding model, every index in the system goes stale simultaneously. Old and new vectors are not comparable. Upgrade therefore means migration, not redeploy.
That one constraint drives the whole platform: a dual-write migration layer, a backfill pipeline, and an eval harness per consumer. This case study is a Pinterest-inspired embeddings platform (architecture inferred from the Pinterest Engineering blog) serving four internal consumers — search, recsys, ads, and dedup — each with its own recall target and latency SLO.
Requirements & SLOs
Working backwards from consumers
Consumer SLO table
| Consumer | Recall@k target | Query latency p95 | Why this value |
|---|---|---|---|
| Search | 0.85 recall@10 | <50ms | High-recall long-tail; 0.85 matches human-rated relevance baseline (inferred from Pinterest Unifying Visual Embeddings blog — search relevance baseline) |
| Recsys | 0.75 recall@50 | <50ms | Candidate pool is large (50 items); tolerable miss rate lower than search (community estimate) |
| Ads | 0.90 recall@5 | <30ms | Direct revenue impact per miss; tightest SLO, dedicated capacity (community estimate) |
| Dedup | 0.95 precision@1 | async batch | False dedup (removing a unique item) is worse than false negative (community estimate) |
Platform-level SLOs
| Metric | Target | Why this value |
|---|---|---|
| Embedding freshness | | Viral content must appear in search before it peaks |
| Online query p95 | | ANN lookup is one leg of a multi-stage ranking pipeline; must leave budget for the ranker |
| Migration window | <7 days | Dual-write doubles storage cost; beyond that, infra cost exceeds migration value |
| Online/offline parity | Recall delta <2pp | Offline eval must predict online regression; >2pp gap means the eval set is wrong |
| Availability | | Platform is shared infra; downtime degrades all consumers simultaneously |
Quick check
Ads requires <30ms p95 and 0.90 recall@5. Search needs only <50ms p95 and 0.85 recall@10. What is the key reason ads gets dedicated capacity rather than sharing the search pool?
Eval Harness
The eval harness must answer a single question before any migration proceeds: does the new model improve recall@k for every consumer on their golden query set? An aggregate win that hides a per-consumer regression is a failed migration.
Offline: recall@k on curated golden queries
- Golden query set per consumer — 500–1,000 queries per consumer, curated from production traffic and human-labeled with relevant items. Sized so a 2pp recall change is detectable at 95% confidence. Separate sets per consumer because a query that is “relevant” for recsys may not be for search.
- A/B between old and new embedding model — embed the same golden set with both models, run ANN against both indexes, compare recall@k side by side (a minimal sketch follows this list). Never compare against a held-out model trained on different data without controlling for index freshness.
- Semantic drift detection on historical embeddings — embed a fixed canary set (1K items, stable over time) with both models; report the average cosine similarity between old and new embeddings. High similarity (>0.95) means the model change is incremental; low similarity (<0.80) signals a full-corpus backfill is necessary before any recall comparison is valid.
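To make the per-consumer gate concrete, here is a minimal recall@k comparison sketch in Python. The `search()` method on the index objects and the triple format of the golden set are hypothetical interfaces, not a real library API; recall uses the capped-denominator variant so recall@10 can reach 1.0 even when a query has more than 10 labeled relevant items.

```python
import numpy as np

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Capped recall@k: hits in the top-k over min(k, |relevant|)."""
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / min(k, len(relevant_ids))

def compare_models(golden_set, old_index, new_index, k=10):
    """A/B one consumer's golden queries against both indexes.

    golden_set: list of (query_vec_old, query_vec_new, relevant_ids) triples.
    Each query is embedded by BOTH models, and each index is searched with
    the vector from its own model's space -- old and new vectors are never
    compared directly.
    """
    old_r = [recall_at_k(old_index.search(q_old, k), rel, k)
             for q_old, _, rel in golden_set]
    new_r = [recall_at_k(new_index.search(q_new, k), rel, k)
             for _, q_new, rel in golden_set]
    delta_pp = 100 * (np.mean(new_r) - np.mean(old_r))
    return float(np.mean(old_r)), float(np.mean(new_r)), float(delta_pp)
```

The migration gate is then a non-negative delta for every consumer individually, never for the aggregate.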
Online: canary traffic during migration
- Canary traffic split — route 5% of each consumer's traffic to the new index during the migration window. Compare downstream engagement metrics (CTR for ads, session depth for recsys) against control. This is the only reliable parity check — offline recall is necessary but not sufficient.
- Online/offline parity gate — if the offline recall delta predicts a +3pp improvement but the canary shows no engagement lift, halt migration and audit the eval set. The golden queries may not represent production distribution.
Back-of-Envelope (Embedder Cost)
The numbers below are seeded for the embedder workload: 10M new items per day at 200 tokens each. During a migration window, both the old and new index must be written simultaneously — storage and serving cost approximately double.
| Quantity | Seeded value |
|---|---|
| Model weights (FP16) | 14 GB |
| KV cache / request | 27.0 MB (201 tokens) |
| Tokens/sec per GPU | 2,400 |
| Effective QPS (after cache) | 116 |
| GPUs needed | 1 |
| GPU memory usage | 21% |
| Compute utilization | 100% |
| Est. p95 latency | 0.20s |
| Cost / month | $1,460 |
| Bottleneck | Balanced |
Baseline: Pinterest embeddings serve — 150 GPUs @ $2/hr at p99 80 ms, 200,000 QPS, 70% cache hit.
Fleet cost sensitivity for a Pinterest-scale embeddings platform. Fleet size and cost are community estimates — Pinterest does not publish embeddings infrastructure details.
| Quantity | Value |
|---|---|
| Effective QPS (after cache) | 60,000 |
| Latency-batch factor | 1.00× |
| GPUs needed | 150 (+0% latency vs baseline) |
| Hourly burn | $300 (+0% vs baseline) |
| Cost / request | ≈$0.0000014 ($300/hr ÷ 216M requests/hr) |
| Monthly burn (24×7) | $219,000 |
| Bottleneck | Balanced |
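The fleet arithmetic is worth being able to reproduce on a whiteboard. Below is a minimal throughput-bound sizing sketch; it deliberately ignores the batching and utilization assumptions an interactive calculator applies, so treat the constants as placeholders and the outputs as starting points rather than matches for the seeded tables above.

```python
import math

def embedder_fleet(items_per_day, tokens_per_item, tokens_per_sec_per_gpu,
                   gpu_hourly_usd=2.0, backfill_factor=1.0):
    """Throughput-bound embedder fleet sizing. Simplified: real GPU throughput
    varies by an order of magnitude with batch size, sequence length, and
    model size, so tokens_per_sec_per_gpu is the big unknown."""
    token_rate = items_per_day * tokens_per_item / 86_400      # tokens/sec sustained
    gpus = math.ceil(backfill_factor * token_rate / tokens_per_sec_per_gpu)
    monthly_usd = gpus * gpu_hourly_usd * 24 * 365 / 12        # ~730 GPU-hours/month
    return gpus, round(monthly_usd)

# Seeded workload: 10M items/day at 200 tokens each. During a migration,
# backfill roughly doubles sustained embedder throughput (backfill_factor ~= 2).
print(embedder_fleet(10_000_000, 200, tokens_per_sec_per_gpu=2_400))
print(embedder_fleet(10_000_000, 200, 2_400, backfill_factor=2.0))
```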
Architecture
Eight components. The two that distinguish this from a naive vector-store design are the Dual-Write Migration Layer and the Eval / Backfill Pipeline — both exist solely to handle model upgrades safely.
Embeddings Platform Architecture
Dual-write layer and eval/backfill pipeline are the migration-critical components.
Component justification table
| Component | Why it exists | What breaks without it |
|---|---|---|
| Client Services | Different consumers, different SLOs — unified behind a single platform API | Each team builds its own embedding infra; no shared migration path |
| Platform API | Online + async endpoints; routes to right embedder version per consumer | Consumers call embedder directly; version coordination is manual |
| Embedder Service | GPU-accelerated batch encoding with high throughput; versioned model artifacts | Each consumer runs its own ad-hoc encoding, duplicating GPU cost; no versioning means non-reproducible embeddings |
| HNSW Index Store (sharded) | Sub-linear ANN lookup; shard by item type or ID hash for load distribution | Flat scan at 10M+ items violates latency SLO; single shard creates hot spots |
| Dual-Write Migration Layer | Writes to old + new index during model upgrade; enables blended reads and atomic cutover | Skipping this means the new index misses all items ingested during backfill; recall regresses overnight |
| Eval / Backfill Pipeline | Re-embeds full corpus with new model; computes recall@k gates; triggers cutover | No objective criterion for cutover; team ships on vibes, not eval |
| Consumer Services | Downstream search / recsys / ads systems read from their own HNSW shard replica | Shared index means one consumer's load degrades another's recall |
| Semantic Drift Monitor | Detects silent model regressions by tracking cosine similarity distributions over time | A bad model push corrupts 10M embeddings silently; detected only by downstream engagement drop |
Quick check
A team proposes skipping the Dual-Write Migration Layer and instead deploying the new embedder, letting the backfill rebuild the new index before cutting over. What is the critical flaw?
Deep Dives
Two components need explicit design here: migration and index maintenance. The rest is standard distributed systems work.
Deep Dive A — Dual-Write Migration Strategy
Approach
The dual-write layer is active for the entire migration window (target: <7 days). The core mechanism is a write fan-out at the ingestion point: every item the Embedder Service processes is written to both the old index and the new index atomically from the perspective of the inbound item stream. This guarantees that the new index never falls behind on fresh content — only on the existing corpus, which the offline backfill pipeline catches up. The state machine proceeds in five phases:
- Dual-write on: every item ingested after migration start is written to both indexes. The old index remains the read source (100% weight). The new index accumulates fresh items but has zero coverage on historical corpus.
- Backfill runs: the offline pipeline re-embeds the full corpus in priority order — recency first, because recent items have the highest query probability. At 10M items/day steady state and a 100M-item corpus, backfill throughput must exceed roughly 15M items/day sustained (100M ÷ 7-day window), on top of the 10M/day fresh stream. This is a throughput target, not just a latency target — it drives the embedder GPU count for the migration window.
- Blended read phase: once new index coverage passes 50%, the retrieval layer blends results from both indexes using a sliding weight controlled by a feature flag per consumer: day 1 at 10% new, day 3 at 50% new, day 6 at 90% new. Blending is at result-set level (merge and re-rank by similarity score), not request level (routing); see the sketch after this list. Consumer eval runs continuously during this window on the golden query set to gate each weight increment.
- Cutover gate: flip to 100% new index only when new index recall@k meets or exceeds old index on all consumer golden query sets, AND canary traffic (5% of production) shows no engagement regression (<0.5pp CTR drop for ads, no session depth regression for recsys).
- Decommission: old index offline, dual-write layer removed, storage reclaimed within 24 hours of cutover. The 24-hour grace period allows a fast rollback if the cutover reveals a production issue not caught by the canary.
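A minimal sketch of the result-set-level blend from the blended read phase, assuming a hypothetical index interface whose `search()` returns (item_id, similarity) pairs with higher-is-better scores:

```python
def blended_read(query_vec_old, query_vec_new, old_index, new_index,
                 new_weight, k):
    """Merge and re-rank results from both indexes with a sliding weight.

    new_weight slides 0.1 -> 0.9 over the migration window via a per-consumer
    feature flag. Each index is queried with the vector from its own model's
    space -- old and new vectors are never compared directly; only ranked
    result sets are blended.
    """
    scored = {}
    for item_id, score in old_index.search(query_vec_old, k):
        scored[item_id] = max(scored.get(item_id, 0.0), (1 - new_weight) * score)
    for item_id, score in new_index.search(query_vec_new, k):
        # Dual-written fresh items appear in both; keep the stronger signal.
        scored[item_id] = max(scored.get(item_id, 0.0), new_weight * score)
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

Setting `new_weight` to 0.0 is the per-consumer rollback lever described under Mitigation below.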
Trade-off: backfill throughput vs. migration window cost
The non-obvious trade-off is not “how fast can you backfill” but “what does each day of dual-write cost vs. the risk of cutting over with incomplete coverage.” Each day of dual-write carries doubled index storage, plus the embedder GPU time to process the backfill, and that storage cost scales linearly with migration window length. The temptation is to cut the window short by accepting partial coverage (e.g., cutting over at 90% corpus coverage). The non-obvious danger: the missing 10% is not random — it is the oldest, least-recently-updated items, which are precisely the long-tail catalog content that search depends on for rare queries. A 90%-coverage cutover produces normal aggregate recall but catastrophic recall on long-tail queries.
Failure mode
Index skew during dual-write: a network partition or embedder crash causes a write to succeed on the old index but fail on the new index, creating silent divergence. The new index silently underrepresents items that arrived during the outage. Because the old index is still the primary read source during early phases, this divergence is invisible — users see correct results. But when the blended weight shifts to 50%+ new, those missing items drop out of results.
Detection metric
Per-index write success rate, tracked per shard per minute. Alert threshold: new index write success rate falls below 99.5% of old index write success rate for more than 2 minutes. Secondary check: consistency job samples 1% of items hourly, compares both indexes, alerts if item-level divergence exceeds 0.5%. A diverged item is re-queued for the embedder with high priority.
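A sketch of that hourly consistency job, with `contains()`, `requeue_for_embedding()`, and `page_oncall()` as hypothetical interfaces standing in for the real index client and alerting hooks:

```python
import random

def consistency_check(item_ids, old_index, new_index,
                      sample_rate=0.01, max_divergence=0.005):
    """Hourly job: sample 1% of items, verify presence in both indexes,
    and re-queue divergent items for high-priority re-embedding."""
    n = max(1, int(len(item_ids) * sample_rate))
    sample = random.sample(item_ids, n)
    diverged = [i for i in sample
                if old_index.contains(i) != new_index.contains(i)]
    for item_id in diverged:
        requeue_for_embedding(item_id, priority="high")
    rate = len(diverged) / n
    if rate > max_divergence:
        page_oncall(f"dual-write divergence {rate:.2%} > {max_divergence:.2%}")
    return rate
```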
Mitigation
The blended-read feature flag is the rollback lever — any consumer can be reverted to 100% old index in seconds, independent of other consumers, with no infrastructure change. Secondary-effect callout: reverting one consumer (e.g., ads) while keeping others on the new index means the dual-write layer must remain active and storage costs continue to accrue until the slowest consumer completes migration. Budget for this in the 7-day window estimate.
Real-world example
Pinterest's Unifying Visual Embeddings migration (inferred from the Pinterest Engineering blog) unified multiple per-product visual embedding models into a single platform serving visual search, related pins, and ad retrieval. The migration required that all three consumers cut over from their own bespoke indexes to the shared platform index simultaneously, because the unified model produced a shared embedding space — partial migration would mean consumers were comparing vectors from different spaces. The lesson applied here: if any consumer cannot complete within the migration window, the entire migration must be deferred, not partially shipped. The per-consumer eval gate exists precisely to catch this before dual-write storage costs compound.
Deep Dive B — HNSW Index Sharding & Rebuild Cadence
Approach
HNSW (Hierarchical Navigable Small World) is the dominant production ANN algorithm because it achieves sub-linear query time via a layered proximity graph. The top layers are sparse long-range links for fast traversal; the bottom layer is a dense neighborhood graph for precision. At M=16 (max neighbors per node) and ef_construction=200 (search width during build), HNSW reaches high recall at millisecond-scale query latencies on million-item corpora (per Malkov & Yashunin 2018). The operational challenge is not building the index — it is maintaining recall quality over time as the live corpus changes, and distributing query load across shards fairly.
Sharding strategy
Shard by item ID hash, not content cluster. Content-based sharding (e.g., by category or embedding cluster centroid) is tempting for locality but creates hot shards during viral events: all queries for the viral topic land on the shard whose centroid is nearest, saturating it while other shards sit idle. Hash-based sharding distributes items and queries uniformly in expectation.
- Hot/cold tiers: recent items (last 30 days) on GPU/DRAM hot shards; older items on CPU-RAM cold shards. The majority of real-time queries hit recent content — a viral pin from today, not a pin from 2019. The hot shard absorbs the latency-sensitive load; cold shards serve long-tail catalog queries at a more relaxed latency budget (<100ms vs. <30ms).
- Per-consumer shard replicas: each consumer owns a dedicated replica of each shard, with ef_search tuned to its recall target. Ads replica: ef_search=128 for 0.90 recall@5. Recsys replica: ef_search=64 for 0.75 recall@50. One consumer cannot saturate another's shard. This is the critical isolation invariant — a shared shard with a single ef_search must be set to the tightest SLO (ads), which over-serves recsys and wastes GPU cycles.
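A minimal sketch of the per-consumer ef_search split using the open-source hnswlib library (dimension, corpus size, and random vectors are placeholders):

```python
import hnswlib
import numpy as np

dim, max_items = 256, 100_000                       # placeholder sizes
corpus = np.random.rand(10_000, dim).astype(np.float32)

# Build-time parameters are fixed per shard: M=16, ef_construction=200.
shard = hnswlib.Index(space="cosine", dim=dim)
shard.init_index(max_elements=max_items, ef_construction=200, M=16)
shard.add_items(corpus)

# ef_search (set_ef) is a per-index, query-time knob in hnswlib -- which is
# exactly why each consumer needs its own replica: one shared index would
# force a single ef on everyone, pinned to the tightest SLO.
ads_replica = shard        # in production: a separately loaded copy per consumer
ads_replica.set_ef(128)    # tuned toward 0.90 recall@5
labels, dists = ads_replica.knn_query(
    np.random.rand(1, dim).astype(np.float32), k=5)

recsys_ef = 64             # 0.75 recall@50 tolerates a narrower search width
```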
Rebuild cadence
HNSW is append-friendly: new items can be inserted online without a full rebuild. Deletions are the problem. HNSW does not support true node removal — deleted items remain as “tombstones” in the graph. Traversal still visits tombstone nodes, wasting ef_search budget on items that will never be returned. As tombstone rate climbs, recall degrades silently — the graph is intact, queries succeed, but a growing fraction of traversal steps are wasted on deleted items.
- Rebuild trigger: tombstone rate >10% of shard item count, OR recall@k on the golden query set drops more than 2pp from the post-build baseline. The tombstone rate is cheap to compute (item count delta); the recall check is expensive but catches drift that tombstone rate misses (e.g., distribution shift without deletion).
- Rebuild procedure: build new index offline from clean corpus (exclude tombstones), swap atomically via the dual-write layer, decommission old. Rebuild time scales with corpus size and M: at M=16, ef_construction=200, a 50M-item shard rebuilds in approximately 30–60 minutes on a 32-core CPU node (community estimate based on Weaviate HNSW benchmarks). Schedule rebuilds during off-peak windows; hot shards can serve from the old index during rebuild via the same dual-write swap mechanism used for model migrations.
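The rebuild decision reduces to a cheap predicate evaluated per shard. A sketch, with `deleted_count` and `item_count` as hypothetical shard-metadata fields; the 8% alert threshold (below the 10% hard trigger) matches the proactive-scheduling lesson in the Weaviate example below:

```python
def should_rebuild(shard, golden_recall_now, golden_recall_baseline,
                   tombstone_alert=0.08, recall_drop_pp=2.0):
    """Rebuild trigger: tombstone rate OR golden-set recall drift.
    Alert at 8% tombstones so rebuilds are scheduled off-peak
    rather than forced mid-incident at the 10% hard limit."""
    tombstone_rate = shard.deleted_count / shard.item_count
    recall_drop = 100 * (golden_recall_baseline - golden_recall_now)
    return tombstone_rate > tombstone_alert or recall_drop > recall_drop_pp
```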
Failure mode: hot shard during viral event
Even with hash-based partitioning, a viral event can overload a single shard if the viral content hashes disproportionately to one shard (hash collision clusters are rare but possible at scale), or if a flood of writes for viral items arrives in a short window and the shard's write-lock serializes insertions. The manifestation: per-shard CPU climbs to 100% and that shard's query latency spikes, while other shards look healthy. Aggregate p95 latency looks acceptable because it is diluted across shards — this is why aggregate metrics are insufficient.
Detection metric
Per-shard query latency P95, tracked per consumer per minute. Alert threshold: any shard P95 exceeds 3× its 7-day rolling baseline for more than 3 minutes. Secondary: per-shard CPU utilization alert at 80% sustained >5 minutes.
Mitigation
Shard-split: detect the hot shard, rehash its item ID range into two sub-shards (even/odd split), and use the backfill pipeline to populate both sub-shards within hours. The dual-write layer routes new writes to both sub-shards during the split. Secondary-effect callout: a shard split increases the total shard count, which increases the fan-out for scatter-gather queries — every query now touches more shards. Monitor per-query scatter-gather latency after a split; if fan-out cost exceeds the savings from load reduction, the split threshold was triggered too early and the alert threshold should be raised.
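A sketch of the split-window write routing. One detail matters in practice: Python's built-in `hash()` is salted per process, so a persistent router needs a stable hash such as CRC32. `insert()` on the shard objects is a hypothetical interface:

```python
import zlib

def sub_shard_of(item_id: int) -> int:
    """Even/odd split on a stable secondary hash (same answer in every process)."""
    return zlib.crc32(str(item_id).encode()) & 1

def route_write(item_id: int, hot_shard, sub_a, sub_b) -> None:
    """During the split window, write to the parent shard (still serving reads)
    AND to the item's sub-shard, so neither copy falls behind before cutover."""
    hot_shard.insert(item_id)
    target = sub_a if sub_shard_of(item_id) == 0 else sub_b
    target.insert(item_id)
```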
Real-world example
The Weaviate HNSW benchmark documents rebuild timing for HNSW at varying corpus sizes and M values. At M=16 on a 10M-vector dataset, full index rebuild completes in under 10 minutes on modern hardware, but at 100M vectors the build time grows super-linearly (graph construction is O(N log N)) and the rebuild window must be planned into the operational schedule — it cannot be triggered reactively during a production incident. The operational lesson: set tombstone-rate alerts at 8% (not 10%) so rebuilds are scheduled proactively, not triggered in the middle of peak traffic.
Quick check
Why is an embedding model upgrade categorically different from upgrading a generation model (e.g., swapping GPT-4 for GPT-4o)?
Break It
Three failure modes that every senior reviewer will probe. Each maps to a measurable SLO breach.
Skip dual-write during model upgrade
You deploy the new embedder and start writing all new items to only the new index. But the old index had 10M items; the new index has zero. For the first day, every search query that would have returned an existing item returns nothing — recall@k drops to near zero for catalog content. By the time backfill completes (3–5 days), users have experienced a catastrophic regression. Detection: recall@k on golden queries alarms within minutes of cutover. Mitigation: mandatory dual-write protocol; cutover is gated on backfill coverage >99%, not just model deployment.
Skip eval during migration (deploy on vibes)
The new model scores better on the paper's benchmark (MTEB), so the team cuts over without running consumer-specific recall@k. MTEB measures general text similarity; the production corpus is images or product descriptions with domain-specific vocabulary. The new model improves recsys recall by +3pp but drops ads recall by 4pp. Ads revenue falls. The regression is only detected via business metrics 48 hours later. Detection: per-consumer recall@k gates on golden queries before cutover; these would have caught the ads regression. Mitigation: offline eval gates are mandatory; public benchmarks are signals, not gates.
Single-shard HNSW for all consumers
All four consumers share one HNSW index with one ef_search setting. During a viral event, search QPS spikes 5x. The shared shard saturates; latency climbs from 15ms to 250ms. The high-latency requests are not evenly distributed — ads is the most latency-sensitive but shares the same overloaded shard. Fairness across consumers breaks: some see 250ms, others see 15ms depending on when their request arrived. Detection: per-consumer P95 latency alert (aggregate hides the skew). Mitigation: dedicated per-consumer shard replicas with independent capacity; ads replica is never shared.
What does a bad day cost?
Incidents have a dollar number. These three scenarios are the most likely failure modes and their approximate cost.
- Mid-migration rollback (new model deemed not ready at day 5) — 5 days of dual-write doubles index storage and embedder throughput. At 10M items/day × 5 days × 2x = 100M extra embeddings computed, plus 150GB temporary storage. Engineering opportunity cost of a senior team for the week. Estimate: low six figures in wasted compute + engineering time. Root cause prevention: run offline eval earlier in the process; don't start dual-write until offline eval shows the new model meets recall gates.
- Embedder bad push corrupts 10M items (silent normalization bug) — a dependency update changes L2 normalization, producing embeddings with different magnitude. All ANN lookups silently degrade because cosine similarity is now computing against unnormalized vectors. 10M items must be re-embedded and re-indexed; backfill takes 24–48 hours. During that window, recall@k is 20–30% below SLO for all consumers. Cost: 24h of degraded ads performance (revenue impact proportional to miss rate) plus backfill compute. Mitigation: the semantic drift monitor would catch this within 1 hour by comparing canary item embeddings against reference; deploy-time embedding sanity check (compare 1K canary items vs saved reference vectors, alert if mean cosine similarity < 0.99).
- Shard outage: one HNSW shard goes down (VM failure or OOM) — because different consumers have different shard affinities, the outage impacts consumers unevenly. Ads and search may be fine if their shards are healthy; recsys is dark if its shard is down. This is a silent partial degradation — the platform appears up in aggregate metrics, but recsys candidates are empty. Downstream: the ranker receives fewer candidates than expected and silently returns lower-quality recommendations. Detection window without per-consumer monitoring: hours, via engagement metric drop. With per-consumer latency alerting: minutes. Mitigation: per-consumer shard replicas with hot standby; failure of primary promotes replica within 30 seconds.
Detection window sensitivity: 10× cost delta
The embedder bad-push scenario above (silent normalization bug, 10M items re-embedding required) illustrates why detection window is a dollar number, not a percentage. Concretely:
| Detection scenario | Re-embed cost | Ads miss cost | Total approx. cost |
|---|---|---|---|
| Drift monitor fires at 1h | 10M items × $0.0001/item = $1,000 | 1h degraded ads recall (~$5K–$20K at scale, community estimate) | ~$6K–$21K |
| Detected at 48h via engagement metrics | 10M items × $0.0001/item = $1,000 (same re-embed) | 48h degraded ads recall ≈ 48× the 1h cost | ~$240K–$960K (40–46× higher) |
The re-embed cost is the same in both rows — the corpus damage is identical. The difference is purely the detection window. The drift monitor (canary set of 1K items, cosine similarity check every 5 minutes, alert if mean similarity drops below 0.99) costs effectively nothing to run. The 40–46× cost delta is the return on investment for that monitor. Re-embed unit cost ($0.0001/item) is a back-of-envelope at embedding API pricing; actual cost depends on whether you run on-prem GPU or cloud inference.
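A sketch of that monitor. One addition beyond the spec above: cosine similarity is scale-invariant, so a pure L2-normalization bug would pass a cosine-only check; tracking the norm ratio alongside it catches magnitude drift. `embed_fn` and `alert()` are hypothetical hooks, and vectors are assumed nonzero:

```python
import numpy as np

def drift_check(embed_fn, canary_items, reference_vecs,
                cos_threshold=0.99, norm_tolerance=0.05):
    """Re-embed the fixed 1K-item canary set and compare to saved references.
    Runs every 5 minutes; cost is negligible next to the incident it prevents."""
    new_vecs = np.asarray([embed_fn(item) for item in canary_items])
    ref_norms = np.linalg.norm(reference_vecs, axis=1, keepdims=True)
    new_norms = np.linalg.norm(new_vecs, axis=1, keepdims=True)

    mean_cos = float(np.mean(np.sum((reference_vecs / ref_norms) *
                                    (new_vecs / new_norms), axis=1)))
    norm_ratio = float(np.mean(new_norms / ref_norms))

    if mean_cos < cos_threshold:
        alert(f"semantic drift: mean canary cosine {mean_cos:.4f} < {cos_threshold}")
    if abs(norm_ratio - 1.0) > norm_tolerance:
        alert(f"magnitude drift: mean norm ratio {norm_ratio:.3f}")  # catches the L2 bug
    return mean_cos, norm_ratio
```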
On-call Runbook
Embedding model swap breaks HNSW ANN recall
MTTR p50 / p99: 30 min–2 h for rollback; 24–48 h for full backfill recovery if partial write occurred
Blast radius: All consumers see recall@k regression post-migration; ads miss rate increases; search quality drops; downstream rankers receive worse candidates
- 1. Detect: Canary recall@k check: compare 1K reference item embeddings against saved baseline after every deploy; alert if mean cosine similarity <0.99
- 2. Escalate: On-call ML Infra + Embeddings Platform team; if ads miss rate exceeds threshold, loop in Revenue Eng as P1
- 3. Rollback: Halt dual-write; restore old index as primary; re-point all consumers to old model checkpoint via feature flag
- 4. Post: Enforce offline recall gate before starting dual-write; require A/B shadow traffic on 1% of consumers for 24 h before full cutover
Backfill job stuck on corrupt shard
MTTR p50 / p99: 1–4 h to detect, skip, and resume from checkpoint
Blast radius: Backfill stalls; dual-write window extends beyond planned budget; storage cost doubles for the extended window; migration SLO breached
- 1. Detect: Backfill progress monitor: items-processed/hour drops >50% vs expected throughput fires P2 within 15 min
- 2. Escalate: On-call Data Infra + Embeddings Platform team; if stuck >1 h, escalate to P1 and notify capacity team of extended dual-write cost
- 3. Rollback: Skip corrupt shard; re-queue from last valid checkpoint; run integrity check on shard before retrying
- 4. Post: Add per-shard checksum validation before backfill starts; implement idempotent backfill with resume-from-checkpoint capability
Query-side drift after upstream vocab change
MTTR p50 / p99: 2–6 h to detect (long-tail queries have lower traffic); 1–2 h to apply query-side rollback
Blast radius: Search and recsys queries tokenized differently from indexed items; ANN recall drops for long-tail queries; high-frequency queries unaffected (masking the issue)
- 1. Detect: Per-consumer recall@k monitor: long-tail query bucket recall drops >5pp vs 7-day baseline fires P2; semantic drift monitor on canary query set
- 2. Escalate: On-call ML Infra + Search Infra teams; if ads impacted, escalate to P1 immediately
- 3. Rollback: Pin query-side tokenizer to pre-change version via config flag; re-embed query path only (cheaper than full corpus backfill)
- 4. Post: Add query/index vocab parity check to CI pipeline; require joint deploy of query encoder and index encoder changes
Quick check
The embedder bad-push scenario shows ~$6K–$21K total cost at 1-hour detection vs ~$240K–$960K at 48-hour detection. The re-embed cost ($1,000) is the same in both rows. What tool produces the 1-hour detection?
Company Lens
Meta's push — scale, sharding, and infra primitives
Expect the interviewer to push on shard count, failure domains, and how you'd run this on 1B+ items. Meta's culture values systems that scale 10x without architecture changes. “What happens at 100M items per day? At 10B total items? How many shards do you need and how do you rebalance them? What's the cross-datacenter replication story?” The right posture: start with the sharding invariants (hash-partitioned, hot/cold tiered), then derive the shard count from storage and latency constraints, then describe the rebalancing protocol using the same dual-write mechanism.
Google's push — evaluation rigor and systems primitives
Expect drill-down on the eval methodology. “How do you size the golden query set? How do you control for position bias in human labels? What statistical test determines that the new model is significantly better, not just better on this sample? How does the online canary translate offline recall into an engagement prediction?” Google's ML infra bar is heavy on measurement discipline — the eval harness is not a one-liner, it's a designed system. Also expect questions on the Bigtable / Spanner / Colossus primitives you'd use for the index store and migration metadata.
Key Takeaways
What to remember for interviews
- 1. An embedding model upgrade is a migration event, not a deployment event — every index goes stale simultaneously because the vector space changes.
- 2. Dual-write to both old and new indexes during migration; blended reads with a sliding weight give you a rollback lever at any point during the transition.
- 3. Eval gates are per-consumer, not aggregate — a new model that improves recsys but degrades ads recall is a failed migration even if the mean is better.
- 4. Shard by item ID hash, not content cluster; content-based sharding creates viral-event hot spots that break per-consumer SLA fairness.
- 5. The semantic drift monitor is the early-warning system for silent model regressions — track cosine similarity of canary item embeddings daily.
- 6. Online/offline parity is a platform SLO, not a nice-to-have; if offline recall doesn't predict online engagement, the golden query set is wrong.
Interview Questions
- ★★★ You upgrade your embedding model. All existing HNSW indexes are now stale. How do you plan the migration without regressing search quality overnight?
- ★★☆ You have four internal consumers (search, recsys, ads, dedup) sharing the same embedding platform. How do you design SLOs that satisfy all four without over-provisioning for the most demanding one?
- ★★★ The semantic drift monitor fires an alert — the cosine similarity distribution of new embeddings has shifted relative to last month's. What is the first question you ask, and what are the two most likely root causes?
- ★★☆ An interviewer asks why you chose HNSW over an exact k-NN index or a flat FAISS index. Give a number-backed answer.
- ★★★ Describe the 'hot shard' failure mode for the HNSW index during a viral event, and how your architecture mitigates it.
Recap quiz
Embeddings Platform recap
- An embedding model upgrade differs from swapping a generation model primarily because it requires a migration protocol. What is the root cause of this constraint?
- Ads gets the tightest recall@k (0.90@5) and lowest latency SLO (<30ms) among four consumers. What drives both constraints compared to search and recsys?
- Why is the migration window capped at 7 days, rather than running dual-write until the team is confident in the new index?
- A silent normalization bug corrupts 10M embeddings. The re-embed cost is $1,000 regardless of when you detect it. What drives the 40–46× total cost difference between 1-hour and 48-hour detection?
- Why does this platform shard the HNSW index by item ID hash rather than by content cluster (e.g., nearest centroid)?
- The new embedding model improves recsys recall by +3pp but drops ads recall by 4pp. The migration should be:
- An HNSW index has a 12% tombstone rate (deleted items still present as graph nodes). What is the primary operational consequence?
Further Reading
- Pinterest Engineering — Unifying Visual Embeddings for Visual Search at Pinterest — Primary source for Pinterest's production embedding platform design. Covers multi-consumer architecture, index sharding, and the migration strategy that inspired this case study.
- Malkov & Yashunin — Efficient and Robust Approximate Nearest Neighbor Search Using HNSW (2018) — The foundational HNSW paper. Read Section 4 on layered graph construction and Section 5 on query complexity — essential for justifying M, ef_construction, and ef_search tradeoffs in an interview.
- Eugene Yan — Patterns for Building LLM-Based Systems & Products — Eugene's pragmatic take on embedding pipelines as retrieval infrastructure. The 'evals before architecture' principle here directly parallels the eval-first approach in this module.
- Chip Huyen — Designing Machine Learning Systems (O'Reilly 2022) — Chapter 7 on feature pipelines and Chapter 10 on infrastructure cover the embedding lifecycle — freshness, serving, versioning — at the right abstraction level for a senior design interview.
- Weaviate Engineering Blog — HNSW vs. Flat Index Performance — Benchmark-grounded comparison of ANN algorithms with real recall/latency/memory numbers. Use this to back up the HNSW justification in the architecture deep dive.
- Shreya Shankar — Who Validates the Validators? Verifying Parity in ML Pipelines — The argument that online/offline parity is the hardest SLO to enforce in an embedding platform. Directly relevant to the eval and canary sections of this module.