📊 LLM Evaluation
Your model scores 90% on MMLU but users hate it — why?
How do you know if your model is good? Evaluation is the hardest unsolved problem in LLMs — and every company asks about it in interviews.
There are three main evaluation approaches: automated benchmarks, LLM-as-judge, and human evaluation. Each trades off cost, speed, and reliability.
The Intuition
Imagine you built a customer support chatbot. How do you know it's good? You need three things: (1) automated benchmarks to catch regressions fast, (2) LLM-as-judge for nuanced quality, (3) human eval for the final call. Each trades speed for accuracy — you use all three in production.
| What you're measuring | Best eval method | Why |
|---|---|---|
| Factual accuracy | Benchmarks (MMLU, TruthfulQA) | Automated, reproducible, fast |
| Response quality | LLM-as-judge (GPT-4 eval) | Nuanced, scales, cheap vs human |
| User satisfaction | Human eval (A/B test) | Ground truth, but slow and expensive |
Three approaches to evaluating LLMs, each with failure modes:
- Benchmarks (MMLU, HumanEval) — standardized tests. Fast and reproducible, but can be gamed and don't capture real-world usefulness
- LLM-as-judge — use GPT-4 to rate outputs. Scalable and cheap, but has biases (verbosity, position, self-enhancement)
- Human evaluation — the gold standard. Captures nuance, but slow, expensive, and humans disagree (~70-80% inter-annotator agreement)
In practice, production systems combine all three: benchmarks for regression testing, LLM-as-judge for scaling, human eval for calibration.
How LLM-as-judge actually works (G-Eval). The naive approach — asking an LLM to rate a response 1-10 — produces noisy, biased scores. The structured approach used in G-Eval (Liu et al., 2023) is: (1) give the judge an explicit evaluation form with step-by-step criteria, (2) ask the judge to output a chain-of-thought reasoning before scoring, (3) compute the final score as the probability-weighted sum over score tokens (using the logprobs of "1" through "5"), not just the argmax. The logprob-weighted score is far more stable than raw text output — a single bad sample doesn't dominate. G-Eval achieves ~0.51 Spearman correlation with human judgments on summarization (compared to ~0.40 for ROUGE-2) — better than ROUGE but still far from perfect agreement with human judges.
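To make step (3) concrete, here is a minimal sketch of logprob-weighted scoring against the OpenAI API. The function name is ours, and it assumes the judge prompt instructs the model to answer with a single digit from 1 to 5 as its first output token:

```python
import math
from openai import OpenAI

def logprob_weighted_score(judge_prompt: str, model: str = "gpt-4o") -> float:
    """G-Eval-style scoring: expectation over score tokens, not argmax.

    Assumes the judge prompt asks for a single digit 1-5 as the first
    (and only) output token.
    """
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": judge_prompt}],
        max_tokens=1,
        temperature=0,
        logprobs=True,
        top_logprobs=20,  # wide enough to capture all five score tokens
    )
    # Distribution over the first output token
    top = resp.choices[0].logprobs.content[0].top_logprobs
    probs = {
        t.token.strip(): math.exp(t.logprob)
        for t in top
        if t.token.strip() in {"1", "2", "3", "4", "5"}
    }
    if not probs:
        raise ValueError("Judge placed no mass on any score token")
    total = sum(probs.values())  # renormalize over valid score tokens only
    return sum(int(tok) * p / total for tok, p in probs.items())
```

The expectation smooths out single-sample noise: a judge that puts 60% on "4" and 40% on "5" scores 4.4 rather than flipping between two integers across runs.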
Quick check
An LLM judge rates response A above B in 72% of 500 pairwise trials. Before concluding A is better, which control is most critical?
What is benchmark contamination and why is it a problem?
Step-by-Step Derivation
Perplexity — Language Model Quality
Perplexity measures how "surprised" the model is by the test data. Lower perplexity = better language modeling:

$$\mathrm{PPL} = \exp(\mathcal{L}) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right)$$

where $\mathcal{L}$ is the cross-entropy loss. Intuitively: a model with PPL=10 is as uncertain as randomly choosing among 10 equally likely options.
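As a sanity check on the formula, a tiny helper (ours, not from any library) that converts per-token log-probabilities into perplexity:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """PPL = exp(mean negative log-likelihood); expects natural-log inputs."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model uniformly torn between 10 options on every token assigns each
# one logprob ln(1/10) -> PPL exactly 10, matching the intuition above.
print(perplexity([math.log(1 / 10)] * 4))  # -> 10.0
```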
Pass@k — Code Evaluation
The probability that at least 1 of k generated code samples passes all test cases:

$$\text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$

where $n$ is the total number of samples generated and $c$ is the number of correct samples. HumanEval reports this metric with $k = 1, 10, 100$.
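In code, this is usually computed in the numerically stable product form from the HumanEval paper (Chen et al., 2021); the sketch below follows that reference implementation:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), as a stable product."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# pass@1 reduces to the raw success rate c/n:
print(pass_at_k(100, 25, 1))  # -> 0.25
```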
Elo Rating — Head-to-Head Comparison
Compare models head-to-head (used by Chatbot Arena). After each match, update ratings based on expected vs. actual outcome:

$$R' = R + K(S - E), \qquad E = \frac{1}{1 + 10^{(R_{\text{opp}} - R)/400}}$$

where $E$ is the expected score, $S$ is the actual score (1 = win, 0 = loss), and $K$ is the update magnitude.
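A minimal sketch of one update step. $K=32$ is an illustrative default; note that Chatbot Arena's published ratings come from fitting all votes jointly rather than from sequential updates like this:

```python
def elo_update(r_a: float, r_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """One Elo update after an A-vs-B match (score_a: 1 win, 0 loss, 0.5 tie)."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# A 1300-rated model beating a 1200-rated one gains little:
# the win was already ~64% expected.
print(elo_update(1300, 1200, 1.0))  # -> (~1311.5, ~1188.5)
```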
Contamination — The Silent Score Inflator
When test data leaks into training data, benchmark scores become meaningless. This is increasingly common as models train on web-scale data that includes benchmark datasets:
- Detection: n-gram overlap analysis, canary strings, rephrased benchmarks (see the sketch after this list)
- Mitigation: use held-out private test sets, temporal splits (post-training-cutoff data), dynamic benchmarks
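A rough illustration of the n-gram overlap check referenced above. Whitespace tokenization and $n=13$ (the window size used in GPT-3-era contamination reports) are simplifying choices; real pipelines tokenize properly and scan at corpus scale:

```python
def ngram_overlap(train_text: str, test_text: str, n: int = 13) -> float:
    """Fraction of test-set n-grams that appear verbatim in the training text."""
    def ngrams(text: str) -> set[tuple[str, ...]]:
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    test_grams = ngrams(test_text)
    if not test_grams:
        return 0.0  # test text shorter than n tokens
    return len(test_grams & ngrams(train_text)) / len(test_grams)
```

Exact 13-gram matches rarely occur by chance in natural text, so any benchmark item with non-trivial overlap deserves investigation.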
Quick check
A model generates n=10 samples per problem. For a hard problem, only c=2 are correct. Which k gives pass@k closest to 50%?
Python: Simple Eval Harness
import json
from openai import OpenAI
def evaluate_model(model: str, test_cases: list[dict]) -> dict:
"""Run eval suite and compute pass rates by category."""
client = OpenAI()
results = {"correct": 0, "total": 0, "by_category": {}}
for case in test_cases:
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": case["prompt"]}],
temperature=0, # deterministic for eval
)
answer = response.choices[0].message.content
correct = case["check_fn"](answer, case["expected"])
results["total"] += 1
results["correct"] += int(correct)
cat = case.get("category", "general")
results["by_category"].setdefault(cat, {"correct": 0, "total": 0})
results["by_category"][cat]["total"] += 1
results["by_category"][cat]["correct"] += int(correct)
results["accuracy"] = results["correct"] / results["total"]
    return results

Python: LLM-as-Judge Prompt Template (G-Eval style)
JUDGE_SYSTEM = """You are an expert evaluator. Score the response on the following criteria.
Think step by step before assigning a score."""
JUDGE_TEMPLATE = """
## Task
{task_description}
## User Query
{query}
## Response to Evaluate
{response}
## Evaluation Rubric
Score each dimension 1-5 (5 = best):
**Faithfulness** — Does every claim in the response follow from the provided context?
1 = Hallucinated facts unrelated to context
5 = Every claim is directly supported by context
**Relevance** — Does the response address what the user actually asked?
1 = Completely off-topic
5 = Directly answers the question with appropriate scope
**Completeness** — Are all parts of the question addressed?
1 = Misses key aspects of the query
5 = Covers all relevant aspects
## Chain-of-Thought Reasoning
Think through each dimension before scoring:
## Scores (JSON)
{{"faithfulness": <1-5>, "relevance": <1-5>, "completeness": <1-5>}}
"""
def llm_judge(query: str, response: str, context: str, task_desc: str) -> dict:
"""Call GPT-4 as judge, parse JSON scores from response."""
from openai import OpenAI
client = OpenAI()
prompt = JUDGE_TEMPLATE.format(
task_description=task_desc,
query=query,
response=response,
# context injected into task_description in practice
)
result = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": JUDGE_SYSTEM},
{"role": "user", "content": prompt},
],
temperature=0,
response_format={"type": "json_object"},
)
import json
    return json.loads(result.choices[0].message.content)

Python: BLEU and ROUGE Computation
# pip install nltk rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
def compute_bleu(reference: str, hypothesis: str) -> float:
"""Compute sentence-level BLEU-4. Range [0, 1]."""
ref_tokens = reference.lower().split()
hyp_tokens = hypothesis.lower().split()
smoothie = SmoothingFunction().method1 # avoid 0 on short sentences
return sentence_bleu([ref_tokens], hyp_tokens, smoothing_function=smoothie)
def compute_rouge(reference: str, hypothesis: str) -> dict:
"""Compute ROUGE-1, ROUGE-2, ROUGE-L F1 scores."""
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, hypothesis)
return {
"rouge1": round(scores["rouge1"].fmeasure, 4),
"rouge2": round(scores["rouge2"].fmeasure, 4),
"rougeL": round(scores["rougeL"].fmeasure, 4),
}
# Example: summarization quality check
reference = "The transformer uses self-attention to process sequences in parallel."
hypothesis = "Transformers apply attention mechanisms across the full sequence simultaneously."
print("BLEU-4:", compute_bleu(reference, hypothesis))
# → BLEU-4: 0.089 (low — different wording, same meaning — BLEU misses this)
print("ROUGE:", compute_rouge(reference, hypothesis))
# → {'rouge1': 0.35, 'rouge2': 0.08, 'rougeL': 0.25}
# ROUGE-1 is better but still low — use BERTScore for semantic similarity

Worked Example — Evaluating a RAG System
Setup: You have a RAG chatbot over a company knowledge base. You run 100 test queries with known ground-truth answers and retrieved document sets. Three metrics tell different stories:
| Metric | What it measures | How to compute | Your score |
|---|---|---|---|
| Retrieval Relevance | Did we fetch the right docs? | Recall@k: fraction of ground-truth docs in top-k results | 88% |
| Answer Correctness | Is the answer factually right? | LLM judge compares answer vs. ground truth, 1-5 scale | 71% |
| Faithfulness | Does the answer stay grounded in retrieved context? | NLI model checks each claim for entailment vs. context | 54% |
Root cause analysis on the 46 faithfulness failures:
- 22 cases: model added plausible-sounding details not present in retrieved docs (e.g., specific version numbers, dates)
- 14 cases: model synthesized across multiple docs and introduced reasoning errors in the synthesis step
- 10 cases: retrieved docs were relevant but incomplete — model "filled the gap" with memorized knowledge instead of saying "I don't know"
Fix: Add a faithfulness gate — after generation, run an NLI check on each sentence. Flag any claim with entailment probability < 0.7 for either abstention or human review. This drops faithfulness failures from 46% to ~12% in practice, at the cost of ~15% of responses being flagged as uncertain.
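A sketch of that gate, assuming a Hugging Face MNLI-style cross-encoder; the model choice, threshold, and sentence splitting are all illustrative:

```python
# pip install transformers torch
from transformers import pipeline

# Illustrative model choice — any MNLI-style cross-encoder works here
nli = pipeline("text-classification", model="roberta-large-mnli")

def faithfulness_gate(context: str, answer_sentences: list[str],
                      threshold: float = 0.7) -> list[str]:
    """Return answer sentences that the retrieved context does not entail."""
    flagged = []
    for sent in answer_sentences:
        # Premise = retrieved context, hypothesis = generated claim
        scores = nli({"text": context, "text_pair": sent}, top_k=None)
        entail = next(s["score"] for s in scores if s["label"] == "ENTAILMENT")
        if entail < threshold:
            flagged.append(sent)  # route to abstention or human review
    return flagged
```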
Quick check
You remove contamination detection from your eval pipeline to cut build time by 20 minutes. A model's MMLU score jumps from 82% to 90%. What is the most likely explanation?
Real-World Numbers
| Model | MMLU | HumanEval pass@1 | Arena Elo |
|---|---|---|---|
| GPT-4 | ~86% | ~67% | 1,250+ |
| Claude 3 Opus | ~87% | ~70% | 1,240+ |
| Llama-3 70B | ~82% | ~50% | 1,200+ |
| Human expert | ~90% | — | — |
Chatbot Arena Elo Ratings (lmsys.org, early 2025)
| Model | Arena Elo | Notes |
|---|---|---|
| GPT-4o | ~1,287 | Highest among GPT-4 variants |
| Claude 3.5 Sonnet | ~1,271 | Strong coding + reasoning |
| Gemini 1.5 Pro | ~1,254 | Long context specialist |
| Llama-3 70B | ~1,213 | Best open-weight model |
Elo ratings are based on blind human preference votes. A 100-point Elo gap corresponds to ~64% win rate in head-to-head comparisons. Scores shift as more votes accumulate — treat as approximate.
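The 64% figure follows directly from the expected-score formula in the derivation above:

$$E = \frac{1}{1 + 10^{-100/400}} = \frac{1}{1 + 10^{-0.25}} \approx 0.64$$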
Human vs. Automated Eval Agreement Rates
| Comparison | Agreement rate | Source |
|---|---|---|
| Human–human preference | ~73–80% | Chatbot Arena inter-annotator |
| GPT-4-as-judge vs. human | ~80–85% | MT-Bench paper (Zheng et al. 2023) |
| ROUGE-2 vs. human (summarization) | ~0.40 Spearman | G-Eval paper (Liu et al. 2023) |
| G-Eval logprob score vs. human | ~0.51 Spearman | G-Eval paper (Liu et al. 2023) |
Key insight: GPT-4-as-judge reaches near human–human agreement (80–85% vs. 73–80%), making it a viable proxy when calibrated. Raw ROUGE has far lower correlation — never use it as a primary quality signal for free-form generation.
Frontier Benchmarks (2024–2025) — as of 2025-Q1; scores shift fast
Interviewers at Anthropic, Google DeepMind, and OpenAI regularly ask “what’s the current frontier?” — knowing HLE and ARC-AGI-2 numbers signals you track the field in real time.
| Benchmark | What it tests | SOTA (2025-Q1) | Human baseline | Status |
|---|---|---|---|---|
| HLE (Humanity's Last Exam) | Expert-level multi-domain (2,500 Qs, 100+ fields) | ~26% (deep-research agents, per public eval) | ~100% (domain experts) | Active |
| ARC-AGI-2 | Fluid intelligence vs. memorization (Chollet, Mar 2025) | <5% (frontier models at launch) | ~98% (humans) | Active |
| GPQA Diamond | 198 PhD-level "Google-proof" science Qs | ~87% (o3, per public eval) | ~65% (PhD experts) | Saturating |
| GPQA | Graduate-level science (PhD-hard) | ~75% (o3) | ~65% (domain experts) | Saturating |
| SWE-Bench Verified | Real GitHub issues resolved | ~50% (top agents, per public leaderboards) | ~100% (human dev) | Active |
| MATH | Competition-style math problems | ~97% (o3) | ~40% (high schoolers) | Saturated |
| ARC-AGI | Abstract pattern reasoning (original) | ~88% (o3-high) | ~98% (humans) | Saturated → ARC-AGI-2 |
Scores as of 2025-Q1; GPQA/SWE-Bench from early 2025 reports; ARC-AGI o3 from December 2024 OpenAI release; HLE leaderboard at agi.safe.ai; ARC-AGI-2 leaderboard at arcprize.org. All proprietary model internals labeled (estimated) or (per public eval).
Quick check
GPT-4-as-judge agrees with humans 80-85% of the time. Human-human agreement is 73-80%. A PM says “the judge is superhuman.” What is wrong with that claim?
Key Takeaways
What to remember for interviews
1. No single eval method is sufficient — production systems layer benchmarks, LLM-as-judge, and human eval.
2. Benchmark contamination silently inflates scores: always verify with rephrased or held-out test sets.
3. LLM judges must be calibrated against human labels (target ≥0.7 Pearson) before shipping.
4. G-Eval's logprob-weighted scoring reaches ~0.51 Spearman with humans on summarization — better than ROUGE-2's ~0.40, but not a replacement for human eval.
5. For RAG systems, measure retrieval (Recall@k) and generation (faithfulness) separately — high retrieval does not prevent hallucination.
6. GPT-4-as-judge agrees with humans 80–85% of the time, near the 73–80% human–human agreement ceiling.
Recap quiz
Evaluation recap
Your LLM judge shows 82% agreement with human raters overall, but the team suspects position bias. Which experiment most directly isolates that bias?
You generate n=20 samples per HumanEval problem and observe c=16 correct. What is pass@1 using the unbiased estimator?
G-Eval achieves ~0.51 Spearman with humans; ROUGE-2 achieves ~0.40 on the same summarization task. A reviewer argues G-Eval is “production-ready” because it beats ROUGE. What is the strongest counterargument?
A new model scores 91% on MMLU but drops to 78% on a rephrased version of the same questions. What is the most defensible interpretation?
Your team ships a new prompt template and re-runs the 80-example golden set. All 80 pass. A colleague says “regression gate passed, ship it.” What is the key risk?
SWE-Bench Verified SOTA is ~50%. A team argues this means agents can handle half of real GitHub issues autonomously in production. What is the key flaw in that reasoning?
In a RAG eval, retrieval Recall@5 is 88% but faithfulness is 54%. Where should you invest engineering effort first?
Further Reading
- Measuring Massive Multitask Language Understanding (MMLU) — Hendrycks et al. 2020 — 57-subject benchmark testing broad knowledge and reasoning
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — Zheng et al. 2023 — LLM-as-judge evaluation and Elo-based human preference ranking
- Holistic Evaluation of Language Models (HELM) — Liang et al. 2022 — multi-metric evaluation framework covering accuracy, fairness, robustness, and more
- Hamel Husain — Your AI Product Needs Evals — The definitive practitioner guide to building eval pipelines: golden datasets, LLM judges, regression gates, and CI integration
- Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators — Dubois et al. 2024 — length-controlled win rates to debias automatic evaluators; addresses verbosity inflation in GPT-4 judge scoring
- Anthropic — Developing Evaluations for Claude — Practical guide to designing task-specific evals, calibrating LLM judges, and building regression gates for Claude-based applications
Interview Questions
- How would you evaluate hallucination in production? ★★★
  - Follow-up: Is hallucination binary (yes/no) or is there a spectrum?
- What is LLM-as-judge? What are its failure modes? ★★★
- How do you detect benchmark contamination? ★★☆
- Design an evaluation suite for a customer-facing chatbot. ★★★
- What is the MMLU benchmark and what does it measure? ★☆☆
- How would you set up A/B testing for LLM outputs? ★★☆