
Transformer Math

Module 73 · Design Reviews

📱 Case: Design TikTok For-You Ranking

Why the Explore/Exploit slider matters more than the model


The TikTok For-You feed is the canonical ML@scale interview problem. The naive framing is “build a better ranker.” The senior framing is different: explore/exploit matters more than the model. It is a product lever that controls long-run retention and creator health. ML owns candidate quality; Product owns the setpoint.

The second insight is that offline recsys metrics are weak proxies for online retention. NDCG can go up while retention goes down. That is why counterfactual logging, multi-guardrail A/Bs, and feature-store parity are load-bearing.

📋

Requirements & SLOs

Working backwards from the user

“A user opens TikTok and sees a feed of videos that feel personally chosen within a fraction of a second. New videos from creators they follow appear in the feed within minutes of upload. The feed never feels repetitive or trapped in a single topic. Creators — including new ones with zero history — can reach new audiences. The platform surface is free of harmful content even when a novel category of harm appears.”

SLO table

  • p99 end-to-end rank latency. Target: <200 ms. Why: feed scroll is a real-time UX; visible lag above 200 ms causes users to swipe away or stop scrolling.
  • Freshness: new content to first impression. Target: <5 min. Why: viral content must propagate fast; creator trust depends on seeing new uploads surface quickly.
  • Personalization recall (held-out set). Target: tracked; gap monitored. Why: NDCG@k and recall@k on a held-out impression log; used to detect retrieval regressions, not as an absolute target.
  • Diversity floor. Target: no topic >40% of top-20 slots (example threshold — exact value A/B tested per deployment; no public TikTok source). Why: prevents filter bubbles; the threshold is a product decision tuned on 30-day retention cohorts, not a fixed engineering constant.
  • Availability. Target: 99.99% / month. Why (derivation): 99.99% allows 0.01% downtime × 43,200 min/month = 4.32 min/month of unplanned outage. Feed is the core revenue surface; a single 5-minute full outage at 100K QPS and a $3 CPM is roughly $90K in lost ad impressions, making even 4.32 min/month expensive. Compare: 99.9% = 43 min/month, acceptable for less latency-sensitive API tiers.
  • Creator fairness. Target: new creator first-impression rate tracked weekly. Why: platform health; creator cohort retention depends on perceived fairness of initial distribution.
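The availability row's arithmetic fits in a few lines. A minimal sketch follows; the 100K QPS and $3 CPM figures are the same assumptions used in the incident-cost section later in this case, not published TikTok numbers.

```python
# Downtime budget and full-outage revenue loss (assumed figures: 100K QPS, $3 CPM).

MINUTES_PER_MONTH = 30 * 24 * 60          # 43,200 minutes in a 30-day month

def downtime_budget_minutes(availability: float) -> float:
    """Unplanned-downtime budget per month for a given availability target."""
    return MINUTES_PER_MONTH * (1.0 - availability)

def outage_revenue_loss(qps: float, outage_minutes: float, cpm_usd: float) -> float:
    """Lost ad revenue for a full outage: every impression in the window is lost."""
    lost_impressions = qps * outage_minutes * 60
    return lost_impressions / 1000 * cpm_usd

print(downtime_budget_minutes(0.9999))        # ~4.32 min/month
print(downtime_budget_minutes(0.999))         # ~43.2 min/month
print(outage_revenue_loss(100_000, 5, 3.0))   # ~$90,000 for a 5-minute full outage
```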
✨ Insight · The setpoint as a product SLO. The explore/exploit ratio does not appear in the table above as a fixed number — and that is intentional. It is a variable controlled by Product, tuned via A/B experiments on retention and creator health metrics. ML teams that treat it as a model hyperparameter and tune it on offline NDCG will optimize it in the wrong direction. Offline NDCG rewards exploit (known-good items); long-run retention often rewards explore (novel content that prevents filter-bubble fatigue). The SLO table is where Product declares the goal; the setpoint is one dial they turn to achieve it. The candidate generation stage relies on the same ANN index infrastructure covered in the Embeddings Platform case study. For a cross-system failure taxonomy comparison, see Failure Taxonomy Compare.
🧪

Eval Harness — Offline, Online, and Counterfactual

Recsys eval is where teams most often fool themselves. The offline metrics are seductive — NDCG@k and AUC are easy to compute and easy to improve. The problem is that they are computed on the historical impression log, which was filtered by the old ranker. The new ranker retrieves items the old one never showed; those items have no labels. This is the counterfactual gap, and it means an NDCG improvement can coexist with an online retention regression.

Offline metrics

  • AUC on engagement labels — binary label (did the user watch >50% of the video?). Fast to compute, correlated with click-through quality. Caveat: AUC rewards the rank order of items that were shown; items never shown get no gradient.
  • NDCG@k — discounted cumulative gain at position k (typically k = 10 or 20). Rewards putting high-relevance items higher in the list. Same counterfactual gap applies.
  • Calibration on fresh-item hold-out — a separate eval set composed only of items uploaded in the last 24 hours. Specifically tests whether the cold-start mitigation is working. Many teams skip this; it is the first signal that viral content propagation is broken.
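For concreteness, a minimal NDCG@k computation over a logged impression list (a sketch; the binary label convention and helper names are illustrative). Note that it can only score items that were actually shown, which is exactly the counterfactual gap described above.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: relevance discounted by log2(position + 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG@k = DCG of the served order / DCG of the ideal (sorted) order."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance labels in the order the ranker served them (1 = watched >50%, 0 = skipped).
served = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
print(ndcg_at_k(served, k=10))
```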

Online A/B — multiple guardrail metrics

  • Primary: 7-day user retention — the metric the business cares about. Requires at least 2 weeks of experiment runtime to detect a meaningful cohort effect.
  • Guardrail: average session length — a proxy for per-session engagement. A ranker change that boosts retention while tanking session length may be exploiting a novelty effect.
  • Guardrail: creator fairness — new creator first-impression rate in the treatment vs control. A ranker that increases engagement by concentrating impressions on established creators fails this guardrail.
  • Guardrail: safety policy violation rate — any regression here is a P0 regardless of retention improvement. Measured via a classifier on served content, not self-reported.

Counterfactual logging

Every ranking request logs: the candidate set, the scores assigned by the ranker, the position each item was placed, and the user's actual engagement. This enables post-hoc inverse propensity scoring (IPS) to de-bias offline metrics: re-weight each impression by the inverse of the probability that it was shown under the logging policy. Counterfactual logs also enable re-ranking experiments without a live rollout — you can apply a new re-ranker to historical logged candidate sets and estimate its effect before it ships.
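A minimal sketch of the IPS re-weighting this logging enables. The record fields (reward, logging_prob, new_policy_prob) are an assumed log schema, and the clipping constants are illustrative; production estimators typically clip or self-normalize the weights to control variance.

```python
def ips_estimate(logged_impressions):
    """Inverse propensity estimate of a new policy's reward from logged impressions.

    Each record carries: the observed reward (e.g., watched >50%), the probability
    the logging policy showed the item (logging_prob), and the probability the new
    policy would have shown it (new_policy_prob). Tiny propensities dominate the
    estimate, hence the floor and the weight cap.
    """
    total, n = 0.0, 0
    for rec in logged_impressions:
        weight = rec["new_policy_prob"] / max(rec["logging_prob"], 1e-6)  # propensity floor
        total += min(weight, 10.0) * rec["reward"]                        # weight cap
        n += 1
    return total / n if n else 0.0
```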

⚠ Warning · The bitter experience of recsys eval. The field has decades of evidence that offline NDCG improvements do not reliably translate to online retention gains. The YouTube two-tower paper noted this explicitly: the most important signal was whether the model improved live A/B metrics, not offline numbers. Design the eval harness to treat offline metrics as regression detectors (did something break?) and online A/B as the source of truth for improvements.

Quick check

Derivation

Inverse Propensity Scoring (IPS) re-weights offline eval examples. What problem does it correct, and what does it require that most teams don't have?

🧮

Back-of-Envelope

The ranking pipeline is not a large language model, but we can use the capacity calculator to size the heavy ranker's GPU fleet. Scenario: at feed-refresh time (TikTok-scale peak), each request triggers one candidate-generation call (two-tower ANN), one heavy-ranker call scoring ~500 candidates, and one re-rank pass. The heavy ranker is a dense model — far smaller than a 7B LLM. The calculator below is seeded at 7B to give a worst-case upper bound; read the gotcha below before treating the numbers as real.

Scenario: Feed ranker worst-case sizing: 100K QPS, each request scores ~500 candidates. Input token count is a rough proxy for the feature vector passed to the ranker per candidate batch.
Inputs: model size 7B; GPU type A100-80GB; QPS target 100,000; input tokens 256; output tokens 16; cache hit rate 0%.

Outputs: model weights (FP16) 14 GB; KV cache per request 36.5 MB (272 tokens); tokens/sec per GPU 2,400; effective QPS (after cache) 100,000; GPUs needed 667; est. p95 latency 1.15 s; bottleneck: compute/bandwidth; GPU memory usage 24%; compute utilization 100%; monthly cost $973,820.

⚠ Warning · Gotcha: The ranker is much smaller than 7B — a production feed ranker is typically under 500M parameters, which is an order of magnitude cheaper per request than the 7B estimate shown. The real constraint is not GPU FLOPS but feature-store read latency and ANN index lookup time. This calculator overstates per-request GPU cost by 5–10x for a real feed ranker.
✨ Insight · The real bottleneck is not the ranker. At this scale, the GPU budget for the heavy ranker is manageable because the model is small and the computation per request (scoring ~500 candidates) is highly parallelizable. The harder engineering problem is the feature-store read latency: every request needs to assemble real-time user features (last-N interactions) plus item features (freshness score, engagement rate) for all candidates within the end-to-end latency budget. Optimizing feature-store read latency — batching reads, pre-computing hot user embeddings, sharding by user ID — is where the real capacity work lives.

Baseline: TikTok For-You ranker: 200 GPUs @ $2/hr at p99 150 ms, 500,000 QPS, 20% cache hit.

Fleet cost sensitivity for a TikTok-scale feed ranker. All fleet and cost numbers are community estimates — TikTok does not publish ranker infrastructure details.

Inputs: p99 latency target 150 ms; peak QPS 500,000; cache hit rate 20%.

Outputs: effective QPS (after cache) 400,000; latency-batch factor 1.00×; GPUs needed 200 (+0% vs baseline); hourly burn $400 (+0% vs baseline); cost per request ~$0.00000 (rounds to zero at the displayed precision); monthly burn (24×7) $292,000; bottleneck: balanced.
⚠ Warning · Gotcha: Feed ranking cost per request is extremely low because the model is small (<500M params, community estimate) and requests are heavily batched. The 20% cache-hit rate models pre-fetched candidate lists for users with predictable scroll patterns. Unlike diffusion or LLM serving, GPU cost is not the binding constraint here — feature-store I/O and ANN index memory are.
🏛️

Architecture

The TikTok For-You pipeline follows a classic three-stage funnel: candidate generation (billions of items → ~1,000 candidates) → heavy ranking (1,000 → ordered top-50) → diversity and policy re-ranking (top-50 → final feed). The architecture below adds the recsys-specific components that most system-design diagrams omit: the request scorer that assembles user state and the explore/exploit setpoint, the feature store with its online/offline parity obligation, and the diversity re-ranker as a separate stage from the relevance ranker.

TikTok For-You Ranking Pipeline

Client → Edge API → Request Scorer → Candidate Generator → Feature Store → Heavy Ranker → Diversity / Policy Re-ranker → Response

Note: the Feature Store sits in the middle because both candidate generation and heavy ranking read from it.

Component justification table

  • Request Scorer. Why it exists: assembles user context before the main pipeline touches it. Recsys-specific note: holds the explore/exploit setpoint; a product config, not a model weight.
  • Two-Tower Candidate Generator. Why it exists: narrows billions of items to ~1,000 via ANN on learned embeddings. Note: online index insertion required to satisfy the 5-min freshness SLO for new items.
  • Feature Store. Why it exists: single source of truth for online and offline features. Note: online/offline parity is a first-class requirement; skew silently degrades ranking quality.
  • Heavy Ranker. Why it exists: dense cross-feature scoring on all candidates — captures feature interactions the two-tower model cannot. Note: the bottleneck is feature reads, not GPU FLOPs.
  • Diversity / Policy Re-ranker. Why it exists: enforces non-relevance constraints: topic diversity, creator fairness floors, safety policy caps. Note: explicitly separate from the relevance ranker — entangling them makes policy changes require model re-training.
  • Counterfactual Log (in Response). Why it exists: records the ranked candidate set and scores alongside subsequent user engagement. Note: enables post-hoc A/B analysis; without it, every experiment requires a live rollout.

Quick check

Trade-off

Why must the diversity/policy re-ranker be a separate stage rather than a regularization term in the heavy ranker's loss function?

🔬

Deep Dives


Two components deserve the deep dive: the two-tower retriever and the feature store.

Deep dive A — Two-tower candidate generator + ANN index

Approach

The two-tower architecture (introduced at scale by Covington et al., YouTube RecSys 2016) trains a user tower and an item tower with separate parameters. At training time the towers are jointly optimized via a softmax over in-batch negatives: the inner product of a user embedding and a positive item embedding should score higher than all other items in the batch. At inference time the item tower runs offline and writes all item embeddings into an approximate nearest-neighbor (ANN) index — HNSW or ScaNN — that supports search over billions of items via quantized vectors and graph traversal. The user tower runs online per request, embedding the user's real-time feature vector, and issues an ANN query to return the top-K candidates (~1,000 in this design; the pattern follows Instagram Engineering and Covington et al.). The key design constraint: because retrieval is inner-product search, the user and item towers cannot capture cross-feature interaction terms — those are deferred to the heavy ranker. This is an intentional layering, not a limitation to fix.
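A minimal sketch of the in-batch softmax objective described above, written against PyTorch. The embedding normalization and temperature are illustrative choices, not a claim about how any production system trains its towers.

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(user_emb: torch.Tensor, item_emb: torch.Tensor,
                          temperature: float = 0.05) -> torch.Tensor:
    """Two-tower training objective with in-batch negatives.

    user_emb, item_emb: [batch, dim] outputs of the user and item towers, where
    row i of item_emb is the positive item for row i of user_emb. Every other
    item in the batch acts as a negative for that user.
    """
    user_emb = F.normalize(user_emb, dim=-1)
    item_emb = F.normalize(item_emb, dim=-1)
    logits = user_emb @ item_emb.T / temperature           # [batch, batch] similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)                 # softmax over in-batch negatives

# Example with random tower outputs (batch 32, embedding dim 64):
loss = in_batch_softmax_loss(torch.randn(32, 64), torch.randn(32, 64))
```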

Trade-off

The non-obvious trade-off is between index freshness and ANN graph quality. HNSW maintains a navigable small-world graph; inserting a new node online updates only the local neighborhood. The graph's global recall — the fraction of the true top-K neighbors that the ANN search finds — degrades slightly with each incremental insert because the global structure is never globally re-optimized. A batch-rebuilt index is globally optimal; an online-insert index accumulates small-world violations over time. In practice, a rolling partial rebuild (e.g., rebuild the graph for items older than 24 hours nightly, while online inserts handle the fresh-item tail) achieves both freshness SLO and acceptable ANN recall. Teams that skip the rolling rebuild see recall degradation after months of online inserts — manifesting as a slow drift in retrieval quality that is hard to attribute without explicit recall benchmarks against a brute-force index.
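The recall benchmark mentioned above can be as simple as brute-force inner-product search on a sampled query set, run on a nightly cadence. A sketch; the array shapes and sampling strategy are assumptions.

```python
import numpy as np

def recall_at_k(ann_results: np.ndarray, queries: np.ndarray,
                corpus: np.ndarray, k: int) -> float:
    """Fraction of true top-k inner-product neighbors that the ANN index returned.

    ann_results: [num_queries, k] corpus row ids returned by the ANN index under test.
    Brute-force search over the full corpus is the ground truth; run it on a
    sampled query set, since it is O(num_queries * corpus_size).
    """
    scores = queries @ corpus.T                        # [num_queries, corpus_size]
    true_topk = np.argsort(-scores, axis=1)[:, :k]     # exact top-k ids per query
    hits = sum(len(set(a) & set(t)) for a, t in zip(ann_results, true_topk))
    return hits / (len(queries) * k)
```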

Failure mode

A video uploaded 2 minutes ago has no engagement history. The item tower can produce an embedding from content features (visual, audio, text), but if the ANN index rebuild runs hourly (common in batch architectures), the new item is absent from the index for up to 59 minutes — missing the <5 min freshness SLO by an order of magnitude. A secondary failure: even with online insertion, if the item tower's content encoder was not trained on the new item's modality (e.g., a new vertical-format video style with no training examples), the embedding lands in a low-density region of the embedding space and is never retrieved, regardless of index freshness.

Detection metric

Fresh-item retrieval rate: the fraction of videos uploaded in the last 5 minutes that appear in at least one user's candidate set within the next 5 minutes. Alert threshold: drop below 80% of baseline. Complement with the periodic ANN recall benchmark against a brute-force index described above, so a freshness regression can be distinguished from a retrieval-quality regression.

Mitigation

  • Online index insertion — new items are added to the HNSW graph incrementally at upload time. Requires a concurrent-write-safe ANN implementation; ScaNN and HNSW v0.7+ both support this. Insertion latency is typically under 5 ms per item; secondary-effect callout: high upload spikes (viral moments, live events) can cause insertion queue backpressure — shard the insertion workers by content category to bound tail latency.
  • Content embedding as cold-start signal — the item tower uses video content features (visual, audio, text caption) extracted via a pre-trained encoder as a proxy for missing engagement history. Secondary-effect callout: if the pre-trained encoder is updated less frequently than the item tower, embedding space drift causes the cold-start embeddings to be incompatible with warm-item embeddings, degrading nearest-neighbor coherence for new items.
  • Explore budget injection — the Request Scorer reserves a configurable fraction of each candidate set for fresh-item slots (age < N hours), bypassing ANN retrieval entirely for that fraction (a sketch follows this list). This decouples the freshness SLO from index rebuild cadence. Secondary-effect callout: explore slots reduce the effective candidate pool for the heavy ranker, which increases the variance of per-session quality; set the explore budget conservatively (5–15%) and A/B test the setpoint, not the mechanism.
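A minimal sketch of the explore-budget injection from the last bullet. The slot count, the 10% default, and all names are illustrative; in practice the fraction is the product-owned setpoint discussed earlier.

```python
import random

def assemble_candidates(ann_candidates: list[str], fresh_pool: list[str],
                        slots: int = 1000, explore_fraction: float = 0.10,
                        rng: random.Random = random.Random(0)) -> list[str]:
    """Reserve a fraction of the candidate set for fresh items, bypassing ANN retrieval.

    ann_candidates: items returned by the two-tower ANN query, relevance-ordered.
    fresh_pool:     items uploaded in the last N hours, regardless of embedding quality.
    The heavy ranker scores both groups; the re-ranker orders the final feed.
    """
    n_explore = min(int(slots * explore_fraction), len(fresh_pool))
    explore = rng.sample(fresh_pool, n_explore)
    chosen = set(explore)
    exploit = [c for c in ann_candidates if c not in chosen][: slots - n_explore]
    return exploit + explore
```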

Real-world example

Covington et al. (YouTube, RecSys 2016) document the exact cold-start issue in Section 4.2: the item tower is pre-computed nightly; new videos are handled by a separate “new video” feature that injects them into candidate sets via a dedicated freshness pool. This is the published precedent for the explore-budget injection pattern above — the mechanism predates the HNSW-online-insert solution and is simpler to operate at the cost of lower personalization fidelity for cold-start items. For systems where personalization quality on new items is critical (short-form video where a single cold-start video may define a creator's trajectory), online insertion is the correct upgrade path.

Deep dive B — Feature store with online/offline parity

The feature store serves features to two very different consumers: the training pipeline (which reads in bulk from a batch store, tolerating minutes to hours of latency) and the online serving pipeline (which reads per-request from a low-latency KV store, requiring sub-5 ms p99). If these two paths compute the same feature differently, the model trains on distribution A and serves on distribution B — the definition of train/serve skew.

Feature tiers

  • Real-time (sub-second) — user's last-N interactions (watch events, likes, skips) computed from an event stream (Kafka + Flink), written to Redis. These features change with every swipe; stale values here directly degrade ranking for the current session.
  • Near-real-time (minutes) — item freshness score, trending topic signals, creator upload rate. Computed in micro-batch or streaming aggregation. Written to the feature store and served from a faster read tier (Redis or Cassandra with a hot cache).
  • Batch (hourly to daily) — user long-term interest embeddings, creator authority scores, item content embeddings. Expensive to compute; served from a read-optimized store (BigTable, Cassandra).
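A sketch of how the three tiers compose into one feature row per candidate at request time. The store interfaces and key schemas are assumptions (plain mappings stand in for Redis, Cassandra, and BigTable), and in production the reads are batched per store rather than issued per candidate.

```python
from typing import Mapping, Sequence

def assemble_features(user_id: str, candidate_ids: Sequence[str],
                      realtime_store: Mapping, nearline_store: Mapping,
                      batch_store: Mapping) -> list[dict]:
    """Join the three feature tiers for one ranking request.

    realtime_store: last-N interactions keyed by user (Redis in the text above).
    nearline_store: freshness / trending signals keyed by item (minutes old).
    batch_store:    long-term embeddings and authority scores (hourly to daily).
    Reads should be one batched round trip per store, not one call per candidate.
    """
    user_rt = realtime_store.get(f"user:{user_id}", {})
    user_lt = batch_store.get(f"user:{user_id}", {})
    rows = []
    for item_id in candidate_ids:
        rows.append({
            "user_last_n": user_rt.get("last_n", []),
            "user_long_term": user_lt.get("interest_emb"),
            "item_freshness": nearline_store.get(f"item:{item_id}", {}).get("freshness", 0.0),
            "item_embedding": batch_store.get(f"item:{item_id}", {}).get("embedding"),
        })
    return rows
```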

Train/serve skew — the silent failure mode

The failure: the training pipeline backfills the user watch-count feature from the batch store (hourly snapshots). The serving pipeline reads from the real-time store (per-event updates). For a user who watched 50 videos in the last hour, the batch store shows the value from an hour ago; the real-time store shows the current value. The model learned the relationship at training-time feature values; at serve time, the distribution is shifted. Offline NDCG looks healthy because eval uses the training features; online retention quietly regresses. The standard defenses (a parity-check sketch follows the list):

  • Feature logging — log the exact feature values served at inference time alongside the request. This is the ground truth for diagnosing skew.
  • Parity dashboards — daily comparison of the distribution of each feature in the training data vs the serving log. Alert if mean or variance shifts by more than a configurable threshold. Catch skew before it accumulates.
  • Backfill strategy — when adding a new feature, backfill it historically using the same computation path that will be used at serve time. Never backfill from a different source than the online path; this is the source of most parity bugs.
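A minimal sketch of the daily parity comparison from the second bullet. The relative-shift thresholds are illustrative and would be tuned per feature; the served values must come from feature logging at inference time, not from re-reading the store afterwards.

```python
import numpy as np

def parity_alert(train_values: np.ndarray, served_values: np.ndarray,
                 mean_tol: float = 0.10, var_tol: float = 0.25) -> bool:
    """Compare one feature's distribution in training data vs the serving log.

    Fires if the relative mean or variance shift exceeds the threshold.
    Run per feature, per day, so skew is caught within ~24 hours of introduction.
    """
    eps = 1e-9
    mean_shift = abs(train_values.mean() - served_values.mean()) / (abs(train_values.mean()) + eps)
    var_shift = abs(train_values.var() - served_values.var()) / (train_values.var() + eps)
    return mean_shift > mean_tol or var_shift > var_tol
```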
✨ Insight · Why the diversity re-ranker is separate. It is tempting to add diversity constraints directly into the heavy ranker's loss function (e.g., a diversity regularization term). Resist this. Entangling relevance and diversity in a single model means every policy change — a new safety rule, a new creator-fairness target — requires re-training and re-deploying the ranker. A separate re-ranker is deterministic, fast, and policy-configurable without ML involvement. The division of labor: ML maximizes relevance; the re-ranker applies constraints. This mirrors the Product/ML ownership boundary in the explore/exploit setpoint.
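To make the division of labor concrete, a sketch of a deterministic topic-cap re-ranker. The 40% cap mirrors the example threshold from the SLO table; all names are illustrative, and a real re-ranker would also apply safety caps and creator-fairness floors.

```python
def rerank_with_topic_cap(scored: list[tuple[str, str, float]],
                          slots: int = 20, topic_cap: float = 0.40) -> list[str]:
    """Greedy pass over (item_id, topic, relevance_score) tuples.

    Skips an item when its topic already fills the cap among chosen slots.
    Policy changes here are config edits, not model re-training.
    """
    chosen, topic_counts = [], {}
    max_per_topic = max(1, int(slots * topic_cap))
    for item_id, topic, _score in sorted(scored, key=lambda x: -x[2]):
        if topic_counts.get(topic, 0) >= max_per_topic:
            continue                      # cap reached for this topic; fall through to next item
        chosen.append(item_id)
        topic_counts[topic] = topic_counts.get(topic, 0) + 1
        if len(chosen) == slots:
            break
    return chosen
```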
Quick Check

Your PM says the explore/exploit ratio should be increased from 10% to 20% to help new creators. Which of the following correctly describes who makes this decision and how it is validated?

🔧

Break It

Three surgical removals, each mapping to a metric in the eval harness.

Remove the diversity re-ranker

The heavy ranker optimizes for predicted engagement, which is strongly correlated with topic familiarity. Without the diversity re-ranker enforcing a topic-mix floor, the feed rapidly converges to the 2–3 topics a user engaged with most recently. Observed behavior: within 48 hours of removing diversity enforcement, filter-bubble formation is measurable in cohort data — users' topic diversity (entropy of watched categories) drops sharply. 30-day retention starts declining by day 10 as the feed feels “narrow.” The effect is delayed because users don't immediately churn; they reduce session frequency gradually. Detection: topic-diversity metric on served feeds, tracked daily. Creator fairness metric also degrades — mid-tier creators in non-dominant topics lose impressions. Mitigation: restore the re-ranker; the relevance ranker has no mechanism to enforce diversity on its own.

Remove counterfactual logging

Without counterfactual logs (candidate set + scores + user engagement), every ranking experiment requires a live A/B rollout. This has three consequences. First, the experiment cycle time doubles or triples — you cannot pre-screen ideas offline before committing to a live test. Second, you lose the ability to do post-hoc policy re-ranking: applying a new diversity rule to historical data to estimate its effect before it ships. Third, model debugging becomes much harder — when a cohort's retention drops, you cannot inspect what the ranker scored vs what the user actually watched. Detection: absence of counterfactual logs is a process gap, not a metric regression — which makes it easy to defer. Teams that skip it pay the cost compounded over every subsequent experiment. Mitigation: counterfactual logging is cheap (log IDs and scores, not full feature vectors) and should be part of the day-one serving pipeline, not a retrofit.

Skip feature-store online/offline parity checks

This is the most dangerous removal because it produces no immediate failure signal. Offline NDCG improves as you train new models (the training pipeline reads features that look reasonable). Online retention quietly regresses as the serving pipeline reads a different feature distribution than the model was trained on. Observed pattern: a team adds a new real-time feature (user watch-count in the last hour), backfills it for training from a batch snapshot (hourly), but serves it from the event stream (per-event). The model trains on the batch distribution (lower values, smoothed); serves on the streaming distribution (higher values, spiky). The engagement predictions are systematically biased. Offline NDCG looks great. Online retention regresses by a few percent — within normal noise, easy to miss for 2–3 weeks. Detection: parity dashboard comparing training-feature distributions to serving-feature distributions. Aim to catch skew within 24 hours of introducing it. Mitigation: feature logging + daily distribution comparison. Build this before adding real-time features, not after.

Quick check

Trade-off

Counterfactual logging is removed to cut storage costs. What is the first concrete capability you lose, and what is the cheapest alternative?

💸

What Does a Bad Day Cost?

Three failure modes, with the P-level and detection window for each.

Worked example: feature-store lag at 100K QPS

Assume 100,000 QPS, a $3 CPM, and a feed incident where freshness scores go stale, causing the ranker to serve lower-relevance content. Model a 15% engagement-rate drop during the outage window, which translates to a 15% reduction in monetizable impressions.

  • Detection window 5 min (alert fires promptly): impacted impressions 100K × 300 s × 0.15 = 4.5M lost; revenue impact 4.5M ÷ 1,000 × $3 ≈ $13,500
  • Detection window 30 min (P2 detection SLO): impacted impressions 100K × 1,800 s × 0.15 = 27M lost; revenue impact 27M ÷ 1,000 × $3 ≈ $81,000
  • Detection window 2 hr (missed by daily checks): impacted impressions 100K × 7,200 s × 0.15 = 108M lost; revenue impact 108M ÷ 1,000 × $3 ≈ $324,000

Detection-window sensitivity: 5 min vs 2 hr = 24× cost difference — well above the 10× gate required for reliability to be a dollar number, not a percentage. The implication: a real-time freshness-lag alert (p99 feature age > 2× expected cadence) that pages on-call in under 5 minutes pays for itself in the first incident.
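The same arithmetic as a small helper, using only the assumptions stated above (100K QPS, $3 CPM, 15% engagement drop):

```python
def incident_cost_usd(qps: float, detection_minutes: float,
                      engagement_drop: float = 0.15, cpm_usd: float = 3.0) -> float:
    """Revenue impact of a freshness-lag incident: lost monetizable impressions x CPM."""
    lost_impressions = qps * detection_minutes * 60 * engagement_drop
    return lost_impressions / 1000 * cpm_usd

for minutes in (5, 30, 120):
    print(minutes, round(incident_cost_usd(100_000, minutes)))   # 13500, 81000, 324000
```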

  • Cold-start feature-store lag (creator fairness incident) — a streaming aggregation job falls behind during a traffic spike; item freshness scores in the feature store go stale. New videos from the last 30 minutes receive near-zero freshness scores and are not retrieved by the candidate generator. Creator first-upload success rate drops sharply for a cohort. Creators report via support tickets and social media within hours. This is a P2 / high-severity incident that may become P1 if sustained past 2 hours, due to reputational damage in the creator community. Detection: feature freshness lag alert on the streaming pipeline; creator fairness metric drop in the serving dashboard. MTTR is typically hours if the pipeline is well-monitored.
  • Ranker deploy regresses silently (retention damage) — a new ranker ships with train/serve skew on a recently-added feature. Offline NDCG shows improvement; online A/B guardrails are measuring over a 1-week window. The skew is subtle enough to evade the A/B guardrails for the first 7 days. By the time the 14-day cohort retention data arrives, 2 weeks of production traffic has been served by the regressed model. This is the most expensive failure mode in recsys because it is delayed and cohort-scoped — only users who were active during the rollout window are affected, and their retention curve is damaged in a way that cannot be reversed by rolling back the model. Prevention: feature parity dashboard, shadow scoring, and a 14-day minimum A/B runtime for any ranker change touching real-time features.
  • Safety-policy bug lets harmful content dominate (reputational P0) — a policy configuration change in the diversity/policy re-ranker inadvertently removes a cap on a category of content that was being suppressed. The heavy ranker, optimizing for predicted engagement, immediately fills the vacuum with the previously-suppressed content. This is detectable within minutes via the safety policy violation rate metric — but only if that metric is actively monitored in real-time (not on a daily dashboard check). The blast radius is large: the content appears in the For-You feeds of users who did not seek it out, which is the worst-case distribution for reputational harm. Fix: automated rollback of the policy re-ranker config when the violation rate crosses a threshold; manual review before any policy config change ships to production.
⚠ Warning · The diversity re-ranker is your safety circuit breaker. Because it sits between the relevance ranker and the user, it is the correct place to enforce policy constraints, creator-fairness floors, and content caps. Putting safety logic in the relevance ranker couples two concerns that should evolve independently — a policy change should not require a model re-train.
🚨

On-call Runbook

Feature store stale after schema migration

MTTR p50 / p99: 30 min–2 h depending on rollback complexity and schema diff size

Blast radius: Ranker receives zero or malformed features for all requests; recommendation quality collapses; creator freshness SLO breached

  1. Detect: Feature freshness lag alert: p99 feature age >2× expected cadence fires P2; creator fairness metric drop visible on serving dashboard within 5 min
  2. Escalate: On-call SRE + Feature Platform team; escalate to P1 if freshness lag sustained >30 min (creator SLO breach)
  3. Rollback: Roll back schema migration on feature store; serve from last known-good snapshot until migration is re-validated
  4. Post: Add pre-migration canary: run 5-min shadow read on new schema before cutting over; add per-feature staleness alert by feature group

Model regression drops CTR at peak hour

MTTR p50 / p99: 5–15 min for rollback; hours to confirm recovery in engagement metrics

Blast radius: All users on regressed ranker see lower-relevance feed; engagement rate drops; A/B guardrails may not catch for 7–14 days if window is too short

  1. Detect: Online CTR monitor: >3% relative drop vs control bucket over 1 h triggers P2; shadow scorer parity check at deploy time
  2. Escalate: On-call ML Infra + Recsys team; if >5% CTR drop or policy violation, escalate to P1 and loop in product
  3. Rollback: Immediate rollback to previous model checkpoint via feature flag; rollback completes within one deployment cycle (~5 min)
  4. Post: Require 14-day A/B minimum for any ranker touching real-time features; add train/serve feature parity dashboard to deploy checklist

Retrieval index skew after cold-start user flood

MTTR p50 / p99: 1–4 h to restore embedding warm-up pipeline; 24 h to fully recover cold-start cohort quality

Blast radius: New users receive near-random recommendations; explore/exploit setpoint pulls toward exploitation with no history; creator fairness drops for new-content surfacing

  1. Detect: Cold-start cohort session-length alert: new-user median watch time drops >20% vs rolling 7-day baseline fires P2
  2. Escalate: On-call Recsys + Data Infra team; loop in Growth team if new-user activation rate also drops
  3. Rollback: Switch cold-start users to popularity-based fallback ranker; restore user-embedding warm-up job
  4. Post: Pre-compute warm-up embeddings for new users from first 3 interactions; add cold-start cohort monitoring as a permanent SLO

Quick check

Derivation

A feature-store lag drops engagement 15% at 100K QPS and $3 CPM. At what detection window does the incident cost cross $100,000?

🏢

Company Lens

Meta's push (Reels / Feed ranking)

Meta has done this problem at scale across multiple surfaces (News Feed, Reels, Stories). Expect the interviewer to drill on scale and fairness simultaneously: “How do you ensure creator fairness across demographic groups, not just follower-count cohorts? What is your definition of fairness in a recsys context, and how do you operationalize it?” Meta's design bar also includes safety integration — the interviewer will expect you to know where the safety classifier sits in the pipeline (pre-ranker at the candidate level and post-ranker in the re-ranker) and why both are needed. Additionally: how do you handle cross-surface feature sharing (a user's Reels engagement informing their Feed ranking) without blowing up your feature store architecture?

Google's push (YouTube, Search personalization)

Google's design bar is heavy on the theoretical foundations: the two-tower paper came from YouTube, and interviewers will expect you to understand its limitations — specifically, why the inner-product retrieval formulation means the user and item towers cannot capture the interaction terms the heavy ranker can. Expect questions on the offline/online evaluation gap (“how do you know your NDCG improvement is real?”) and on the systems primitives: how does the ANN index scale to billions of items, what is the indexing latency SLO, and how does online index insertion interact with the ANN graph's quality guarantees? Google also cares about the ranking theory: why does position bias in training data corrupt your ranker, and how do you correct for it without counterfactual data?

🧠

Key Takeaways

What to remember for interviews

  1. Explore/exploit ratio is a product lever, not an ML hyperparameter. ML owns the quality of explore candidates; Product owns the setpoint.
  2. Offline NDCG is a regression detector, not a measure of improvement. Online A/B with retention and creator fairness guardrails is the source of truth.
  3. The diversity re-ranker must be separate from the relevance ranker — entangling them makes policy changes require model re-training.
  4. Feature-store online/offline parity is a first-class requirement, not an ops concern. Train/serve skew is the most common silent failure in production recsys.
  5. Counterfactual logging is cheap to build and expensive to retrofit. Log candidate sets and scores from day one.
  6. Cold-start is an architectural problem, not a model problem. Online ANN index insertion and explore-budget allocation are the correct fixes, not a fancier item tower.
🎯

Interview Questions


The PM asks to 'increase explore/exploit to boost new creator growth.' What questions do you ask before touching a single model parameter, and how do you frame the tradeoff?

★★☆
MetaGoogle

Offline NDCG@10 improved by 1.5 points in your candidate generator experiment. The online A/B shows flat retention and a small drop in creator fairness. Explain why, and what you do next.

★★★
MetaGoogle

Design the feature store for a TikTok-scale feed ranker. What features live in which tier, and what is the failure mode if online/offline feature parity breaks?

★★★
MetaDatabricks

A new video goes viral within 10 minutes of upload. Your ranker gives it near-zero relevance scores. What architectural components are failing, and what is the fix?

★★☆
MetaGoogle

Databricks asks: how do you structure the ML training pipeline so that a new ranker version can be shadow-tested, compared to the champion, and promoted — without taking the feature store offline or requiring a full data re-backfill?

★★★
DatabricksMeta
🧠

Recap quiz


Feed Ranking recap

Trade-off

Who owns the explore/exploit setpoint in a TikTok-scale feed system, and what metric validates a change to it?

Derivation

Why can't the two-tower candidate generator capture the same signal as the heavy ranker, even with a much larger model?

Trade-off

A team adds a real-time feature (user watch-count in the last hour) trained from hourly batch snapshots but served from a per-event stream. Offline NDCG improves. What happens online, and why?

Derivation

A batch ANN index rebuilds hourly. A new video is uploaded. What is the maximum delay before it can appear in any user's candidate set, and which SLO does this violate?

Trade-off

The diversity re-ranker is removed from the pipeline as a latency optimization. The heavy ranker stays. What is the primary long-term failure mode?

Derivation

A new candidate generator retrieves different items than the old one. NDCG@10 on the offline eval set improves by 1.5 points. Why might this overstate the real improvement?

Derivation

At 100K QPS and $3 CPM, a feature-store lag causes a 15% engagement drop. Detection in 5 min costs ~$13,500; detection in 2 hr costs ~$324,000. What is the cost multiplier?

📚

Further Reading