
Transformer Math

Module 68 · Design Reviews

💰 Cost Accounting & Eval-Driven Design

A model that scores 90% on your offline eval can drop 15% on user satisfaction — because the eval measured the wrong thing. Write the eval before the architecture.


Every Part 9 design lives or dies on two axes: cost and eval. This module makes both quantitative and ties them to real decisions. Read it next to the Methodology module and the SLO vs. Cost comparison.

🧪

Part 1 — Eval-Driven Design (write the eval first)

The load-bearing claim

You cannot design a system you cannot measure. Write the eval harness before the architecture. This is not a rhetorical flourish; it's the single most predictive habit in shipped AI systems (Shankar 2024, Husain 2023, Yan 2023).

The four-part eval spec

  1. What are we measuring? One or two primary metrics, not ten. For a chat product: task success rate + safety refusal precision. For RAG: groundedness + citation accuracy.
  2. How do we measure it? Concrete rubric with example-level guidance. LLM judge prompt printed verbatim in the doc. Human-labeled calibration set of 50 examples.
  3. On what? Golden set sized for statistical power on the primary metric, stratified by traffic cohort.
  4. When does it run? Pre-merge blocking eval, daily regression eval, weekly stratified deep-dive. Different cadences for different costs.
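A minimal sketch of how the four-part spec can be pinned down as a single reviewable artifact; the field names, defaults, and paths here are illustrative assumptions, not a standard schema.

Python: eval spec as an artifact

python
from dataclasses import dataclass, field

@dataclass
class EvalSpec:
    """Illustrative container for the four-part eval spec (not a standard schema)."""
    primary_metrics: list[str]                    # 1. What: one or two metrics, not ten
    judge_prompt_path: str                        # 2. How: versioned judge prompt, printed verbatim in the doc
    calibration_set_size: int = 50                #    human-labeled examples the judge is calibrated against
    golden_set_size: int = 1537                   # 3. On what: sized for statistical power (see the math below)
    cohorts: list[str] = field(default_factory=list)
    cadences: dict = field(default_factory=lambda: {
        "pre_merge": "blocking",                  # 4. When: different cadences for different costs
        "daily": "regression",
        "weekly": "stratified deep-dive",
    })

chat_spec = EvalSpec(
    primary_metrics=["task_success_rate", "safety_refusal_precision"],
    judge_prompt_path="prompts/judge_v3.md",      # hypothetical path
    cohorts=["free_tier", "pro_tier", "long_context"],
)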
✨ Insight · The judge is a system with its own eval. An LLM-as-judge is a model whose output drives your go/no-go decisions. It deserves the same scrutiny as the production model — calibration against human labels, drift monitoring, and a refresh schedule — rather than being trusted out-of-the-box. Shankar's “Who Validates the Validators” (arXiv 2404.12272) documents what happens when you skip this.

Golden-set sizing — the math

For a binary quality metric (pass/fail) with expected pass rate p measured on n examples, the 95% confidence interval half-width on the observed pass rate is approximately:

h ≈ 1.96 × √( p(1−p) / n )

So for p = 0.8, n = 200: h ≈ 1.96 × √(0.8 × 0.2 / 200) ≈ 0.055 — a ±5.5 pp interval. To resolve a 2 pp change you need roughly n ≈ 1.96² × 0.8 × 0.2 / 0.02² ≈ 1,537 examples. If you care about sub-cohorts (per tier, per language, per prompt shape), multiply by the number of cells.

Python: golden-set sizing

python
import math

def golden_set_size(
    expected_pass_rate: float,   # e.g. 0.80
    target_half_width: float,    # e.g. 0.02 for ±2 pp
    n_cohorts: int = 1,
    confidence_z: float = 1.96,  # 95% CI
) -> int:
    """Size a golden set for a binary pass/fail metric.

    Multiplies by n_cohorts when you want independent power
    within each stratum (per-tier, per-language, etc.).
    """
    p = expected_pass_rate
    per_cohort = (confidence_z ** 2) * p * (1 - p) / (target_half_width ** 2)
    return int(math.ceil(per_cohort * n_cohorts))

# Example: 80% pass rate, ±2pp, 4 cohorts
print(golden_set_size(0.80, 0.02, n_cohorts=4))
# -> 6147

Quick check

Derivation

At p = 0.8 with n = 200 examples, the CI half-width is ±5.5 pp. A 3-cohort breakdown multiplies sample needs how?

Quick Check

You have 200 human-label hours to spend. What's the highest-leverage allocation?

🔬

Part 2 — Deep Dives on the Two Hardest Eval Problems


Deep Dive A — LLM-Judge Calibration Pipeline

Approach

The default LLM-as-judge is a prompted model that returns a score or a verdict. Two architectural families exist, and they sit at different points on the cost-recall curve:

  • Embedding-similarity judges compute cosine similarity between a reference response embedding and the candidate embedding using a shared encoder (e.g., text-embedding-3-large or a fine-tuned bi-encoder). Cheap — roughly 1/50th the cost of an LLM call — and fast enough to run at 100% coverage. But they collapse on semantically equivalent paraphrases that differ in safety or factual precision, because embedding similarity captures surface meaning, not logical entailment. Use them as a first-pass filter, not a final gate.
  • Rubric-based LLM judges prompt a capable model with a structured rubric and a few-shot calibration set of human-labeled examples. They catch semantic quality failures that embeddings miss, but at 20–50× the token cost. The judge prompt must be versioned; the rubric must have positive and negative examples for every criterion; and the judge's output must be structured (JSON with per-criterion scores, not a free-form paragraph) so downstream tooling can aggregate by failure mode.

A production-grade pipeline layers them: embeddings for bulk coverage, LLM judge for the cases where embeddings express low confidence (cosine sim in the 0.6–0.8 range where calibration shows high disagreement with humans). The boundary threshold is itself a calibrated parameter, not a round number.
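A minimal sketch of that layered pipeline, assuming an embedding function and a rubric-based judge are supplied by the caller; the function names and the exact 0.6–0.8 escalation band are illustrative stand-ins for calibrated parameters.

Python: layered judge (embedding filter + LLM escalation)

python
import numpy as np

def layered_judge(reference: str, candidate: str, embed, llm_judge,
                  low: float = 0.6, high: float = 0.8) -> dict:
    """Embedding first pass at 100% coverage; rubric LLM judge only in the disagreement band.

    embed: callable str -> vector (a bi-encoder, assumed provided).
    llm_judge: callable (reference, candidate) -> dict of per-criterion scores (assumed provided).
    """
    a = np.asarray(embed(reference), dtype=float)
    b = np.asarray(embed(candidate), dtype=float)
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    if cos >= high:
        return {"verdict": "pass", "source": "embedding", "cosine": cos}
    if cos <= low:
        return {"verdict": "fail", "source": "embedding", "cosine": cos}
    # Borderline case: pay the 20-50x token cost of the rubric judge only here.
    return {**llm_judge(reference, candidate), "source": "llm_judge", "cosine": cos}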

Trade-off (the non-obvious one)

The obvious trade-off is cost vs. coverage. The non-obvious one is judge-model co-drift: when you use the same model family for both production and judging, a model update that improves the production model also silently shifts the judge's preferences. The judge starts rating the new model's outputs as better — not because they are better by the original rubric, but because the judge now shares the new model's stylistic tendencies. You see an eval win without any real quality improvement. This is distinct from ordinary judge drift (where the judge drifts and the production model is stable) — it is correlated drift, and it is invisible to the standard calibration-check workflow unless you explicitly maintain a frozen-model reference judge.

The mitigation is to pin the judge model version and only upgrade it on a deliberate schedule with a full re-calibration pass — never allow judge and production model to update in the same deploy.

Failure mode

Calibration rot. The human-labeled calibration set was drawn from production traffic at a point in time. Six months later, user intent distribution has shifted (new features, new user segments, seasonal query patterns), but the calibration set hasn't. The judge is still calibrated to old traffic. Spearman between judge verdicts and fresh human labels drops from 0.85 (calibration day) to 0.62 (six months later) without anyone noticing, because the Spearman calculation itself isn't run again. The judge continues to report high-confidence verdicts; they are quietly wrong on the new traffic distribution.

Detection metric (named, with threshold)

Track Spearman ρ between LLM judge verdicts and a rolling human-labeled spot-check sample. Production teams typically set an alert threshold of ρ < 0.75 — below that, the judge's ranking of better-vs-worse responses is no longer reliable enough to gate launches. The spot-check sample should be 50–100 examples drawn from the current week's production traffic, not the original calibration set. Frequency: monthly at minimum; weekly during active training runs.

Secondary metric: judge confidence calibration — the fraction of high-confidence verdicts (score ≥ 4/5 or "strongly pass") that are confirmed correct by humans. If the judge assigns high confidence to 70% of examples but only 60% of those are confirmed by humans, the judge is overconfident and the effective detection threshold needs to be raised.
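A sketch of both detection metrics as one periodic spot-check, assuming judge and human scores are paired on the same fresh examples; the ρ < 0.75 alert comes from the text, while the confirmation-rate floor is an assumption since no exact overconfidence threshold is given.

Python: judge reliability spot-check

python
from scipy.stats import spearmanr

def judge_spot_check(judge_scores, human_scores, high_conf_mask, human_pass_mask,
                     rho_alert: float = 0.75, confirm_floor: float = 0.85) -> dict:
    """Monthly (weekly during training runs) check on 50-100 fresh production examples.

    judge_scores / human_scores: paired ratings on the same current-traffic sample.
    high_conf_mask: judge verdicts scored >= 4/5 or 'strongly pass'.
    human_pass_mask: human confirmation for the same examples.
    """
    rho, _ = spearmanr(judge_scores, human_scores)
    confirmed = [h for j, h in zip(high_conf_mask, human_pass_mask) if j]
    confirm_rate = sum(confirmed) / len(confirmed) if confirmed else float("nan")
    alerts = []
    if rho < rho_alert:
        alerts.append(f"ranking unreliable: Spearman rho {rho:.2f} < {rho_alert}")
    if confirmed and confirm_rate < confirm_floor:
        alerts.append(f"overconfident: only {confirm_rate:.0%} of high-confidence verdicts confirmed")
    return {"rho": float(rho), "confirm_rate": confirm_rate, "alerts": alerts}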

Mitigation

Implement a calibration refresh schedule: draw a 50-example sample from each calendar month's production traffic, have two human annotators label them independently (inter-annotator agreement > 0.80 Cohen's κ required before including the batch), and run a Spearman check against the judge. If ρ drops below 0.75, block all judge-gated launches until the judge prompt is updated and re-calibrated. The secondary effect to watch: a prompt update that fixes calibration on the newest traffic slice can degrade calibration on older slices if the rubric is over-fit to recent examples. Keep a held-out validation set from six months ago and verify that the updated judge doesn't regress on it.
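A sketch of the monthly refresh gate, assuming scikit-learn and SciPy are available; the κ > 0.80 inter-annotator requirement and the ρ < 0.75 launch block come from the text, and the return strings are illustrative.

Python: monthly calibration-batch gate

python
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr

def monthly_refresh_gate(annotator_a, annotator_b, judge_scores, human_consensus,
                         kappa_min: float = 0.80, rho_min: float = 0.75) -> str:
    """Gate for each calendar month's 50-example calibration batch.

    annotator_a / annotator_b: independent labels from the two human annotators.
    judge_scores / human_consensus: paired scores on the accepted batch.
    """
    kappa = cohen_kappa_score(annotator_a, annotator_b)
    if kappa < kappa_min:
        return f"reject batch: inter-annotator kappa {kappa:.2f} < {kappa_min}"
    rho, _ = spearmanr(judge_scores, human_consensus)
    if rho < rho_min:
        return f"block judge-gated launches: rho {rho:.2f} < {rho_min}"
    # Before shipping any judge-prompt update, also re-check the held-out six-month-old slice.
    return "calibration healthy"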

For judge-model co-drift specifically: pin the judge to a specific model version checkpoint and treat judge upgrades as their own release event, with a full calibration report before each upgrade goes live.

✨ Insight · Real-world example. Shankar et al. (2024) “Who Validates the Validators?” is the canonical empirical study of LLM-judge reliability in production. The paper instruments Spearman ρ between four judge-model configurations (GPT-4, Llama-70B, and two rubric variants) and human raters across 2,200 labeled examples, finding that agreement ranges from ρ = 0.47 to ρ = 0.84 — a nearly 2× spread. The paper's central finding for practitioners: no judge works well out-of-the-box; all require domain-specific calibration sets and regular refresh. The cost-recall tradeoff between embedding and LLM judges is documented in Figure 3 of the paper.

Deep Dive B — Cost-Router Design for Eval Economics

Approach

A cost-router classifies incoming requests into model tiers before they reach a serving pool. The canonical design has three tiers: cheap (small open-weight or fine-tuned model, on the order of $0.5–1/1M output tokens), mid (mid-size API model, on the order of $3–5/1M), and flagship (largest available model, on the order of $15–30/1M) — a price spread of more than an order of magnitude, making the fraction of traffic routed to the cheap tier the single largest cost lever available to the serving team. The router's job is to send the maximum fraction of traffic to the cheap tier without degrading quality below the eval gate threshold. RouteLLM (Ong et al., 2024, arXiv 2406.18665) formally defines this as the cost-quality Pareto frontier problem and evaluates four router architectures — a similarity-weighted ranker, a BERT-class classifier, a causal LLM router, and a matrix factorization approach trained on pairwise preference data — all measurably better than random routing but still leaving substantial room for improvement on specialized domains.

In production, the router is a lightweight inference endpoint (latency budget: <20 ms p95, since it sits on the critical path before the model call). It outputs a tier assignment plus a confidence score. Low-confidence assignments — typically the bottom 20% by confidence — are escalated to the next tier rather than committed, because the expected quality cost of a wrong cheap-tier assignment on a borderline request exceeds the savings from avoiding the mid-tier call.
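A sketch of the tier-assignment logic with low-confidence escalation; the classifier itself is assumed to exist as the <20 ms endpoint described above, and the confidence floor is a calibrated parameter rather than a constant.

Python: tier routing with escalation

python
TIERS = ("cheap", "mid", "flagship")

def route(request_features, classifier, confidence_floor: float) -> str:
    """Assign a tier, escalating low-confidence assignments one tier up.

    classifier: lightweight endpoint returning (tier, confidence); assumed provided.
    confidence_floor is typically set so that roughly the bottom 20% of assignments
    by confidence escalate rather than commit.
    """
    tier, confidence = classifier(request_features)
    if confidence < confidence_floor and tier != "flagship":
        # Expected quality cost of a wrong cheap-tier call exceeds the mid-tier savings.
        return TIERS[TIERS.index(tier) + 1]
    return tier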

Trade-off (the non-obvious one)

The obvious trade-off is router latency vs. cost savings: a slower router that classifies better is worse than a fast router that sometimes mis-routes, because mis-routing costs are bounded by the tier delta, while latency costs compound at every request. At 1K QPS, a 50 ms router consumes 50 GPU-seconds of router compute every second — more than the cost of a marginal mis-route.

The non-obvious trade-off is router drift under distribution shift. The router was trained on historical traffic; its accuracy degrades when the request distribution shifts (new product features, new user segments, seasonal variation). Unlike production model drift — which shows up in quality metrics — router drift manifests as cost drift: the cheap-tier fraction silently drops (more borderline requests, more escalations) or quality drops (the router confidently assigns hard requests to the cheap tier). Both failure modes are slow and linear, not bursty, so they slip past alert thresholds calibrated for incident-style regressions. The Martian routing-as-a-service product addresses this with continuous online learning on recent traffic, at the cost of adding a dependency on an external preference-data pipeline.

Failure mode

Silent flagship bleed. After a product update that changes the request distribution (say, a new long-context feature goes live), the router starts assigning 20% more traffic to the flagship tier because the classifier's “hard” decision boundary was calibrated on shorter prompts. At 1K QPS and $66K/day baseline, a 20% flagship bleed adds $119K/day in overrun (0.2 × $66K × 9×, where 9× is the excess above baseline cost on the regressed slice). The cost accumulates over days before crossing any alert threshold unless per-tier cost-share is monitored continuously.

Detection metric (named, with threshold)

Track per-tier traffic share (cheap%, mid%, flagship%) as a primary signal with a 1-hour rolling average. Alert when flagship% rises more than 5 percentage points above the 30-day baseline. Secondary: router confidence distribution — the p10 of confidence scores across all routed requests. If p10 drops below 0.55 (from a healthy baseline of ~0.72), the classifier is operating in uncertain territory on a large fraction of traffic, indicating distribution shift. Tertiary: shadow a 5% sample of cheap-tier completions through the flagship judge on a 24-hour delay; alert if quality delta widens by >3 pp relative to the rolling baseline.
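A sketch of the three detection signals as a single check, assuming tier shares are expressed as fractions and the quality delta is measured in percentage points on the delayed shadow sample; thresholds are the ones named above.

Python: router drift alerts

python
def router_drift_alerts(flagship_share_1h: float, flagship_share_30d: float,
                        confidence_p10: float, shadow_quality_delta_pp: float) -> list[str]:
    """Evaluate the primary, secondary, and tertiary router-drift signals."""
    alerts = []
    if flagship_share_1h - flagship_share_30d > 0.05:          # primary: per-tier traffic share
        alerts.append("flagship share >5 pp above 30-day baseline")
    if confidence_p10 < 0.55:                                  # secondary: healthy baseline ~0.72
        alerts.append("router p10 confidence < 0.55 — likely distribution shift")
    if shadow_quality_delta_pp > 3.0:                          # tertiary: 5% cheap-tier shadow sample, 24 h delay
        alerts.append("cheap-tier quality delta widened >3 pp vs. rolling baseline")
    return alerts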

Mitigation

Implement shadow traffic + canary gating for router updates. When the router is retrained on fresh traffic, run it in shadow mode for 48 hours alongside the live router — compare per-tier distributions and quality on the shadow sample before promoting. The canary gate requires: (1) per-tier traffic share within ±3 pp of the live router, (2) quality delta <1 pp vs. live router on the shadow judge sample, (3) router p95 latency within the 20 ms budget. Only if all three pass does the retrained router promote.
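A sketch of the promotion decision, assuming the live and shadow statistics have already been aggregated over the 48-hour canary window; the dictionary field names are illustrative.

Python: canary promotion gate

python
def promote_retrained_router(live: dict, shadow: dict) -> bool:
    """48-hour canary gate: the retrained router promotes only if all three criteria pass."""
    share_ok = all(abs(shadow["tier_share"][t] - live["tier_share"][t]) <= 0.03
                   for t in ("cheap", "mid", "flagship"))                           # within ±3 pp of live
    quality_ok = live["shadow_judge_quality"] - shadow["shadow_judge_quality"] < 0.01  # <1 pp worse than live
    latency_ok = shadow["p95_latency_ms"] <= 20.0                                   # inside the 20 ms budget
    return share_ok and quality_ok and latency_ok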

Secondary effect to watch: shadow traffic routing adds a second full model call for 5% of requests on the shadow slice, which inflates cost during the canary window. Budget this explicitly; it's roughly 5% × (flagship cost − cheap cost) per request, which is small but not zero, and it spikes if the canary window extends past 48 hours.

For per-tier cost-share alerts: instrument a cost-attribution pipeline that segments spend by router tier in real time. Alert pages when flagship cost-share crosses 110% of the 7-day moving average, regardless of absolute cost level — this catches slow drift before it becomes a P1 incident.

✨ Insight · Real-world example. RouteLLM (Ong et al., 2024) is the most rigorous public evaluation of classifier-gated model routing. The paper benchmarks four router architectures on MT-Bench, MMLU, and GSM8K, measuring the cost-quality frontier for each. Key result: on MT-Bench, the matrix factorization router achieves a 2× cost reduction with <5% quality degradation vs. always routing to GPT-4. The paper also demonstrates that all router architectures degrade under distribution shift between training and test domains — the routers trained on chatbot-arena data underperform by 8–12 pp quality on coding-heavy benchmarks — which is the same drift failure mode described above. Martian (a commercial routing-as-a-service product) extends the RouteLLM approach with online retraining but does not publish accuracy numbers for its production router.
💰

Part 3 — Cost Accounting (price the bad day, not the request)

Three cost metrics, three decisions

| Metric | Used for | Aggregation |
| --- | --- | --- |
| Steady-state unit cost | Capacity planning, pricing | $ / 1M output tokens |
| Cost per bad day | Incident severity, on-call SLA | $ during a defined incident window |
| Cost per user retained | Product strategy, tier design | $ / (cohort retained at 90 days) |

The bad-day formula

Cost_bad-day ≈ (affected tokens per second) × Δc × T

where T is the detection + mitigation window (often dominated by detection), and Δc is the per-token cost delta between baseline and incident state. This formula has one design implication: reduce the detection window. A 5-minute detection beats a cost-optimized steady state when amortized over incidents — the detection-window gap, not the per-token rate, is the dominant cost variable.
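A worked sketch of the formula under the Part 5 router-bleed assumptions; the two calls reproduce the 5-minute and 4-hour rows of the incident table in Part 5.

Python: bad-day cost vs. detection window

python
def bad_day_cost(qps: float, avg_out_tokens: int, price_per_mtok: float,
                 regressed_fraction: float, cost_multiplier: float,
                 detect_minutes: float) -> float:
    """Cost_bad-day ≈ (affected tokens per second) × Δc × T for a tier-bleed incident."""
    affected_tokens_per_sec = qps * avg_out_tokens * regressed_fraction
    # Excess over baseline on the regressed slice: (multiplier - 1) × baseline price per token.
    delta_per_token = price_per_mtok * (cost_multiplier - 1) / 1e6
    return affected_tokens_per_sec * delta_per_token * detect_minutes * 60

# 1K QPS, 256 output tokens, $3/1M, 20% of traffic bleeding to a 10x flagship tier:
print(round(bad_day_cost(1000, 256, 3.0, 0.20, 10, detect_minutes=5), 2))    # ≈ $415
print(round(bad_day_cost(1000, 256, 3.0, 0.20, 10, detect_minutes=240), 2))  # ≈ $19,900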

⚠ Warning · The reliability-is-a-dollar-number reframe. A 99.9% SLO allows roughly 43 minutes of downtime per month — yet that budget hides the asymmetry between a 10-second blip and a 4-hour partial regression. Pricing both in dollars surfaces what the SLO obscures: partial regressions are often the expensive incidents, not the full outages.

Quick check

Derivation

At 1K QPS, 256 avg output tokens, $3/1M output tokens — what is the baseline daily cost, and which variable has the largest leverage for reducing incident cost?

🔗

Part 4 — The Offline/Online Bridge

Every offline eval metric should predict an online outcome. If it doesn't, it's a vanity metric — useful for impressing the team, useless for launch decisions.

The bridge contract

  • For each offline metric, state the online metric it predicts and the expected effect size.
  • Measure the correlation on the first N launches (see the sketch after this list). Metrics that don't correlate get deprecated; they're attractive nuisances.
  • When they disagree (“offline says +5%, online is flat”), the most common root cause is one of three identifiable failures: (a) Golden-set staleness — the golden set was sampled from a traffic slice where the old model was already strong; the new model gains on cases the old model already handled well, which don't drive user behavior. Fix by re-sampling the golden set from the most recent 7 days of production traffic before each major eval. (b) Judge-model co-drift — as described in Deep Dive A, the judge has silently shifted preferences to favor the new model's style, inflating the offline win; Spearman ρ on a fresh human-labeled spot-check will drop below 0.75, confirming the diagnosis. (c) Metric-outcome mismatch — you are measuring response quality (does the answer sound correct?) while users are measuring task completion (did I get what I came for?). The Shankar et al. 2024 dataset provides a concrete example: judges rated verbose responses 0.4 points higher on a 5-point scale, but user session length was 12% shorter after those responses, because verbosity correlated with over-hedging that left the question unanswered.
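A minimal sketch of the correlation bookkeeping for one offline metric; the launch-count minimum and the retention threshold are illustrative assumptions, since the text only says to measure correlation over the first N launches.

Python: offline/online bridge check

python
from scipy.stats import spearmanr

def bridge_check(offline_deltas: list[float], online_deltas: list[float],
                 min_launches: int = 8, rho_keep: float = 0.5) -> str:
    """Correlate an offline metric's pre-launch delta with the online metric it claims to predict.

    One paired entry per launch; min_launches and rho_keep are illustrative defaults.
    """
    if len(offline_deltas) < min_launches:
        return "insufficient launches — keep collecting"
    rho, _ = spearmanr(offline_deltas, online_deltas)
    return "keep metric" if rho >= rho_keep else f"deprecate (attractive nuisance): rho {rho:.2f}"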
✨ Insight · The eval that's too green. When the offline eval consistently shows bigger wins than the online test, you almost certainly have a selection bias in the golden set — it over-represents cases where your model is already strong. Fix by re-sampling from recent production traffic.
💸

Part 5 — Incident-Cost Ledger (price three failure modes)

Reliability is a dollar number, not a percentage. Every design review should have these three rows priced before sign-off. At least one row must demonstrate detection-window sensitivity of 10× or more.

Row 1 — Router drift, 20% flagship bleed

Setup: Product runs at 1K QPS, $3/1M output tokens, 256 avg output tokens. Baseline daily cost = 1K × 86,400 × 256 × $3/1M = $66K/day (derived). A router drift incident routes 20% of traffic to the flagship (10× cost).

Overrun math: . The (10−1) = 9× factor is the excess above baseline on the regressed slice — not 10×, because baseline already prices routing at $3; the flagship costs $3 × 10 = $30, so the delta per regressed request is $27 vs. $3, a ratio of 9× the baseline cost on that slice.

| Detection window | Cost of the incident | Note |
| --- | --- | --- |
| 5 minutes | $119K × (5/1440) ≈ $415 | Per-tier cost-share alert on 1-hour rolling average |
| 4 hours | $119K × (4/24) ≈ $19,900 | End-of-shift review catches it |
| 24 hours (full day) | $119K | Next-day billing review catches it |

The gap between a 5-minute and a 4-hour detection window is roughly 48× in incident cost ($415 vs. ≈$19,900). This is the canonical argument for real-time per-tier cost-share monitoring rather than end-of-day billing review.

Row 2 — Silent quality regression via judge-model co-drift

Setup: A model update ships that also updates the judge model (same provider, same family). Judge-model co-drift silently inflates scores for the new model. The production model is actually worse on a key cohort — say, long-context queries — but the judge's Spearman ρ against humans has dropped from 0.82 to 0.61 on that cohort.

User impact (quantitative): If long-context queries are 15% of traffic and the quality regression causes a 12% increase in user abandonment on that cohort (inferred from session length drop and thumbs-down rate — the concrete proxy Shankar et al. use), then approximately 0.15 × 0.12 = 1.8% of total sessions are abandoned that would previously have succeeded. At 1K QPS, that is roughly 1,500 failed sessions per day. Translating to revenue depends on the product's conversion rate, but even at $0.10/session in LTV, that is $150/day of silent churn compounding week over week.

Dollar precision: The churn figure is qualitatively indicative — exact LTV depends on product pricing. The user-impact math (% abandonment × % traffic) is derived and can be checked; the $0.10/session figure is a placeholder you replace with your product's actual LTV. The point is that even modest LTV numbers compound to meaningful weekly losses before any alert fires, because there is no per-cohort quality SLO to breach.

Detection gap: No cost alert fires (costs are flat — the model update didn't change routing). Only a per-cohort Spearman ρ refresh against fresh human labels would catch this within a week. This is why monthly judge calibration is the minimum viable cadence, not the recommended one.

Row 3 — Golden-set rot

Setup: The golden set was sampled from production traffic 6 months ago, when the product had a different feature set. A new long-context capability has since become 20% of traffic, but 0% of the golden set covers it. The eval now measures quality on query types that represent only 80% of production traffic.

Quantified drift: If the new capability has a 15% lower quality pass rate than the legacy queries (reasonable for an immature feature), and it constitutes 20% of traffic, then the true production quality is 0.8 × (legacy rate) + 0.2 × (legacy rate − 15pp) = legacy rate − 3pp. The golden-set eval reports legacy rate unchanged. The eval is overstating production quality by approximately 3 percentage points — enough to approve launches that would fail if measured against real traffic.

Detection: Track the fraction of production traffic covered by at least one golden-set example in the same query cluster. Alert when coverage drops below 80% of current traffic distribution (measured by embedding-cluster overlap between recent traffic samples and golden-set items). Schedule a golden-set refresh when coverage drops to 75%.
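One way to approximate that coverage metric is nearest-neighbor cosine similarity over normalized embeddings; the similarity cut-off standing in for "same query cluster" is an assumption to tune against your own clustering, not a value from the text.

Python: golden-set coverage of current traffic

python
import numpy as np

def golden_set_coverage(traffic_embs: np.ndarray, golden_embs: np.ndarray,
                        same_cluster_sim: float = 0.75) -> float:
    """Fraction of recent traffic with at least one golden-set item in the same query cluster.

    Both embedding matrices are assumed L2-normalized, one row per example.
    """
    sims = traffic_embs @ golden_embs.T              # cosine similarity for normalized rows
    covered = sims.max(axis=1) >= same_cluster_sim   # nearest golden-set neighbor is close enough
    return float(covered.mean())                     # alert below 0.80; schedule a refresh at 0.75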

Quick check

Trade-off

A 99.9% SLO allows ~43 min/month of downtime. Why does pricing incidents in dollars reveal a risk the SLO percentage hides?

🏢

Part 6 — Company Lens (same concepts, different drills)

Senior interviewers at different companies will zoom into different sub-problems. Know the drill before you walk in.

Anthropic drill

The question you will get: “Walk me through how you would set up a judge-calibration pipeline for a model that handles safety-sensitive queries. How do you size the calibration set, how often do you refresh it, and how do you detect when the judge has drifted on the safety cohort specifically?”

The shape of the answer: Start with the asymmetry: false negatives on safety are more expensive than false positives, so the calibration set over-represents borderline cases (the “shadow zone” where the judge is uncertain), not the bulk of traffic. Golden-set sizing math: use the Wilson score interval formula sized for the safety cohort specifically, not the aggregate. At 95% CI and a target half-width of ±3pp on a 95% expected pass rate, you need n ≈ 200 examples — but 80% of those should come from the borderline slice, not random traffic. Calibration refresh: quarterly at minimum, triggered immediately whenever a model update touches the safety training. Judge-model co-drift detection on the safety cohort: run a Spearman ρ check against fresh human labels every time the judge model version changes, before any judge-gated launch is approved. The non-obvious insight: safety eval is where judge-model co-drift is most dangerous, because safety training changes the model's refusal style, which changes the judge's learned preference for what a “good” response looks like — exactly the co-drift pattern described in Deep Dive A.

Google drill

The question you will get: “You're running eval at 10K+ QPS across multiple internal tenants sharing a Borg cluster. How do you do cost accounting per tenant when the underlying GPUs are shared, and how do you ensure one tenant's eval workload doesn't starve another's production traffic?”

The shape of the answer: Two separate problems. Cost attribution: instrument at the request level, not the cluster level. Each eval request carries a tenant tag and a workload class (eval vs. production). The fair-share scheduler (Borg's priority + quota system, inferred from public Google SRE documentation) enforces priority weights: production jobs run at priority 100, eval jobs at priority 50. GPU-second consumption is logged per tenant-tag and aggregated hourly; billing is computed as (tenant eval GPU-seconds) / (total cluster GPU-seconds) × (cluster cost per hour). The non-obvious issue at 10K+ QPS is priority inversion: a long eval job that holds a large KV-cache allocation blocks memory from production jobs even if its CPU priority is lower. Mitigation: enforce a KV-cache quota per workload class, not just a CPU priority — production workloads get 70% of KV-cache capacity, eval gets 30%, regardless of CPU priority. Per-cohort cost attribution at this scale requires that the eval orchestrator emits structured logs with tenant, model-tier, and request-shape fields — aggregate in a streaming pipeline (Dataflow-style), not in a batch job, so the attribution is visible within 5 minutes of the eval run.
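A minimal sketch of the hourly attribution formula under the assumptions above (structured logs already carry tenant and workload-class tags); the streaming aggregation itself is omitted, and the record shape is illustrative.

Python: per-tenant eval cost attribution

python
from collections import defaultdict

def attribute_eval_cost(records, cluster_cost_per_hour: float) -> dict:
    """One hour of structured logs -> per-tenant eval spend.

    records: iterable of (tenant, workload_class, gpu_seconds). Implements
    (tenant eval GPU-seconds / total cluster GPU-seconds) × cluster cost per hour.
    """
    eval_secs, total_secs = defaultdict(float), 0.0
    for tenant, workload_class, gpu_seconds in records:
        total_secs += gpu_seconds
        if workload_class == "eval":
            eval_secs[tenant] += gpu_seconds
    if total_secs == 0:
        return {}
    return {t: round(s / total_secs * cluster_cost_per_hour, 2) for t, s in eval_secs.items()}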

🧠

Key Takeaways

What to remember for interviews

  1. Write the eval before the architecture. Design flows from measurement.
  2. The LLM judge is a system with its own eval. Calibrate against humans on a schedule; track Spearman ρ monthly.
  3. Judge-model co-drift is the hardest failure mode: both production model and judge update together, inflating offline scores silently.
  4. Golden-set size is a statistical-power problem, not a round-number aesthetic.
  5. Router drift under distribution shift is a cost drift, not a quality regression — only per-tier traffic-share monitoring catches it.
  6. Three cost metrics (unit, incident, retention) for three decisions. One blended number hides the important asymmetries.
  7. Shorten the detection window before optimizing steady-state cost — a 5-min detection vs. 4-hour detection is a 48× cost difference on a router-drift incident.
  8. Every offline metric must predict an online outcome. Disagreement diagnoses to one of three root causes: golden-set staleness, judge co-drift, or metric-outcome mismatch.
🧠

Recap quiz

🧠

Cost & Eval recap

Trade-off

Shankar et al. (2024) found judge agreement with humans ranges from ρ = 0.47 to ρ = 0.84 depending on judge model and rubric. What is the primary operational implication for a shipping team?

Derivation

For a binary quality metric with p = 0.8, a 200-example golden set gives ±5.5 pp confidence. To resolve a 2 pp change, roughly how many examples are needed?

Trade-off

LLM judges show position bias in 10–30% of pairwise comparisons. A team decides to swap candidate order on 50% of eval pairs. What does this fix, and what does it leave unfixed?

Derivation

A router-drift incident bleeds 20% of traffic to the 10× flagship at 1K QPS and $3/1M output tokens (256 avg tokens). Detected in 5 min vs. 4 hours — what is the cost ratio?

Trade-off

RouteLLM shows a BERT-class classifier achieves 95% of GPT-4 quality at ~40% of cost on MT-Bench. What is the PRIMARY design constraint that determines whether this result holds in production?

Trade-off

Your offline eval reports a 5% quality win; online A/B test shows flat user satisfaction. Which root cause is hardest to detect without an explicit per-cohort check?

Trade-off

A product with $0.003/request average cost absorbs a 10× tail incident. Which cost metric correctly quantifies incident severity, and what does a blended average hide?

🎯

Interview Questions


Your offline LLM-judge eval says a new model is 5% better. After launch, user satisfaction is flat. What's wrong with the eval?

★★★
Anthropic · OpenAI

Calculate cost-per-bad-day for a product at 1K QPS, $3/1M output tokens, 256 avg output tokens, if a regression routes 20% of traffic to the flagship instead of the cheap model (flagship costs 10x).

★★★
OpenAI · Meta

You have a budget for 200 human-labeled eval examples. How do you allocate them across cohorts?

★★☆
Anthropic · Google

Why is 'cost per request' a misleading north-star metric for an AI product?

★★★
Anthropic · OpenAI

You're asked to add 'hallucination rate' as a launch gate. What's the obvious answer and the better one?

★★★
OpenAI · Google
📚

Further Reading