
Module 41 · Trust & Evaluation

📊 LLM Evaluation

Your model scores 90% on MMLU but users hate it — why?


How do you know if your model is good? Evaluation is the hardest unsolved problem in LLMs — and every company asks about it in interviews.

The diagram below shows three main evaluation approaches: automated benchmarks, LLM-as-judge, and human evaluation. Each has tradeoffs between cost, speed, and reliability.

Figure: LLM Evaluation Approaches. Model output feeds three evaluators: Benchmarks (MMLU, HumanEval) produce an accuracy %, LLM-as-Judge (e.g., GPT-4 rates the output) produces a 1-5 rating, and Human Eval (the gold standard) produces a preference. The three signals combine into a model score and decision.
💡

The Intuition

Imagine you built a customer support chatbot. How do you know it's good? You need three things: (1) automated benchmarks to catch regressions fast, (2) LLM-as-judge for nuanced quality, (3) human eval for the final call. Each trades speed for accuracy — you use all three in production.

| What you're measuring | Best eval method | Why |
|---|---|---|
| Factual accuracy | Benchmarks (MMLU, TruthfulQA) | Automated, reproducible, fast |
| Response quality | LLM-as-judge (GPT-4 eval) | Nuanced, scales, cheap vs human |
| User satisfaction | Human eval (A/B test) | Ground truth, but slow and expensive |

Three approaches to evaluating LLMs, each with failure modes:

  • Benchmarks (MMLU, HumanEval) — standardized tests. Fast and reproducible, but can be gamed and don't capture real-world usefulness
  • LLM-as-judge — use GPT-4 to rate outputs. Scalable and cheap, but has biases (verbosity, position, self-enhancement)
  • Human evaluation — the gold standard. Captures nuance, but slow, expensive, and humans disagree (~70-80% inter-annotator agreement)

In practice, production systems combine all three: benchmarks for regression testing, LLM-as-judge for scaling, human eval for calibration.

How LLM-as-judge actually works (G-Eval). The naive approach — asking an LLM to rate a response 1-10 — produces noisy, biased scores. The structured approach used in G-Eval (Liu et al., 2023) is: (1) give the judge an explicit evaluation form with step-by-step criteria, (2) ask the judge to output a chain-of-thought reasoning before scoring, (3) compute the final score as the probability-weighted sum over score tokens (using the logprobs of "1" through "5"), not just the argmax. The logprob-weighted score is far more stable than raw text output — a single bad sample doesn't dominate. G-Eval achieves ~0.51 Spearman correlation with human judgments on summarization (compared to ~0.40 for ROUGE-2) — better than ROUGE but still far from perfect agreement with human judges.
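
A minimal sketch of step (3), assuming the OpenAI Python client and a judge prompt that ends exactly where the single score token should appear; the function name and the 1-5 wiring are illustrative, not G-Eval's reference code:

python
import math
from openai import OpenAI

def logprob_weighted_score(judge_prompt: str, model: str = "gpt-4o") -> float:
    """G-Eval-style score: expectation over the score tokens '1'..'5',
    weighted by the judge's token probabilities rather than the argmax."""
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": judge_prompt}],
        max_tokens=1,      # the very next token should be the score digit
        temperature=0,
        logprobs=True,
        top_logprobs=20,   # wide enough to include all five score tokens
    )
    top = resp.choices[0].logprobs.content[0].top_logprobs
    # Keep probability mass on valid score tokens, renormalize, take expectation.
    mass = {t.token: math.exp(t.logprob) for t in top if t.token in {"1", "2", "3", "4", "5"}}
    if not mass:
        raise ValueError("judge did not emit a score token")
    total = sum(mass.values())
    return sum(int(tok) * p for tok, p in mass.items()) / total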

💡 Tip · Calibration is mandatory before deploying LLM judges. Collect 200-500 human-labeled examples. Compute Pearson/Spearman correlation between judge scores and human scores. If correlation is below 0.7, the judge is miscalibrated — add rubric details, few-shot examples of edge cases, or switch judge models. Never ship an uncalibrated judge into production.
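
The calibration check itself is short; a sketch assuming you already have paired judge and human scores for the same examples (the threshold mirrors the tip above):

python
# pip install scipy
from scipy.stats import pearsonr, spearmanr

def check_judge_calibration(judge_scores: list[float],
                            human_scores: list[float],
                            threshold: float = 0.7) -> dict:
    """Correlate judge scores with human labels on the same examples."""
    pearson, _ = pearsonr(judge_scores, human_scores)
    spearman, _ = spearmanr(judge_scores, human_scores)
    return {
        "pearson": round(pearson, 3),
        "spearman": round(spearman, 3),
        # Below threshold: refine the rubric, add few-shot edge cases,
        # or switch judge models before shipping.
        "calibrated": pearson >= threshold,
    }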

Quick check

Trade-off

An LLM judge rates response A above B in 72% of 500 pairwise trials. Before concluding A is better, which control is most critical?

Quick Check

What is benchmark contamination and why is it a problem?

📐

Step-by-Step Derivation

Perplexity — Language Model Quality

Perplexity measures how "surprised" the model is by the test data. Lower perplexity = better language modeling:

$$\mathrm{PPL} = e^{\mathcal{L}} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right)$$

where $\mathcal{L}$ is the cross-entropy loss. Intuitively: a model with PPL=10 is as uncertain as randomly choosing among 10 equally likely options.
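
A minimal PyTorch sketch of this formula; the uniform-logits example recovers the intuition that PPL equals the number of equally likely options:

python
# pip install torch
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """PPL = exp(mean token-level cross-entropy).
    logits: (num_tokens, vocab_size), targets: (num_tokens,)"""
    nll = F.cross_entropy(logits, targets)  # mean negative log-likelihood (nats)
    return torch.exp(nll).item()

# Uniform uncertainty over 10 options gives PPL = 10:
logits = torch.zeros(100, 10)               # equal score for every option
targets = torch.randint(0, 10, (100,))
print(perplexity(logits, targets))          # → 10.0 (up to float precision)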

✨ Insight · Perplexity limitations: It measures language modeling quality, not task performance. A model with lower PPL is not necessarily better at following instructions or being helpful. Use PPL for comparing base models, not chat models.

Pass@k — Code Evaluation

The probability that at least 1 of k generated code samples passes all test cases:

$$\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$

where $n$ is the total number of samples generated and $c$ is the number of correct samples. HumanEval reports this metric for $k \in \{1, 10, 100\}$.
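
The standard implementation of this estimator (from the HumanEval paper, Chen et al. 2021) computes the ratio of binomial coefficients as a numerically stable product:

python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: a correct one is guaranteed
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

print(pass_at_k(100, 25, 1))  # → 0.25; for k=1 the estimator reduces to c/n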

Elo Rating — Head-to-Head Comparison

Compare models head-to-head (used by Chatbot Arena). After each match, update ratings based on expected vs actual outcome:

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad R_A' = R_A + K\,(S_A - E_A)$$

where $E_A$ is the expected score, $S_A$ is the actual score (1 = win, 0 = loss), and $K$ is the update magnitude.
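
A minimal sketch of one rating update; K = 32 is a common default, and Chatbot Arena reportedly fits ratings with a Bradley-Terry-style procedure over all votes rather than sequential updates:

python
def elo_update(r_a: float, r_b: float, s_a: float, k: float = 32.0) -> tuple[float, float]:
    """One Elo update. s_a: 1.0 = A wins, 0.0 = A loses, 0.5 = tie."""
    e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))  # expected score for A
    r_a_new = r_a + k * (s_a - e_a)
    r_b_new = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# A 100-point gap implies a ~64% expected win rate for the stronger model:
print(1.0 / (1.0 + 10 ** (-100 / 400)))  # → 0.640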

💡 Tip · Chatbot Arena (lmsys.org) uses blind human preference voting with Elo ratings. It's considered the most reliable open evaluation — but requires thousands of human votes per model.

Contamination — The Silent Score Inflator

When test data leaks into training data, benchmark scores become meaningless. This is increasingly common as models train on web-scale data that includes benchmark datasets:

  • Detection: n-gram overlap analysis, canary strings, rephrased benchmarks
  • Mitigation: use held-out private test sets, temporal splits (post-training-cutoff data), dynamic benchmarks
⚠ Warning · Key insight: If a model scores 90% on a benchmark but struggles with rephrased versions of the same questions, it likely memorized the benchmark rather than learning the underlying knowledge.
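
A minimal overlap check along those lines; whitespace tokenization and the 13-gram window (echoing the GPT-3 report's contamination analysis) are simplifying assumptions:

python
def ngram_contamination(train_corpus: str, test_item: str, n: int = 13) -> float:
    """Fraction of the test item's n-grams that also occur in training text."""
    def ngrams(text: str) -> set[tuple[str, ...]]:
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    test_grams = ngrams(test_item)
    if not test_grams:
        return 0.0  # item shorter than n tokens
    return len(test_grams & ngrams(train_corpus)) / len(test_grams)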

Quick check

Derivation

A model generates n=10 samples per problem. For a hard problem, only c=2 are correct. Which k gives pass@k closest to 50%?


Python: Simple Eval Harness

python
from openai import OpenAI

def evaluate_model(model: str, test_cases: list[dict]) -> dict:
    """Run eval suite and compute pass rates by category."""
    client = OpenAI()
    results = {"correct": 0, "total": 0, "by_category": {}}

    for case in test_cases:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
            temperature=0,  # deterministic for eval
        )
        answer = response.choices[0].message.content
        correct = case["check_fn"](answer, case["expected"])

        results["total"] += 1
        results["correct"] += int(correct)
        cat = case.get("category", "general")
        results["by_category"].setdefault(cat, {"correct": 0, "total": 0})
        results["by_category"][cat]["total"] += 1
        results["by_category"][cat]["correct"] += int(correct)

    results["accuracy"] = results["correct"] / results["total"]
    return results

Python: LLM-as-Judge Prompt Template (G-Eval style)

python
JUDGE_SYSTEM = """You are an expert evaluator. Score the response on the following criteria.
Think step by step before assigning a score."""

JUDGE_TEMPLATE = """
## Task
{task_description}

## User Query
{query}

## Response to Evaluate
{response}

## Evaluation Rubric
Score each dimension 1-5 (5 = best):

**Faithfulness** — Does every claim in the response follow from the provided context?
1 = Hallucinated facts unrelated to context
5 = Every claim is directly supported by context

**Relevance** — Does the response address what the user actually asked?
1 = Completely off-topic
5 = Directly answers the question with appropriate scope

**Completeness** — Are all parts of the question addressed?
1 = Misses key aspects of the query
5 = Covers all relevant aspects

## Chain-of-Thought Reasoning
Think through each dimension before scoring:

## Scores (JSON)
{{"faithfulness": <1-5>, "relevance": <1-5>, "completeness": <1-5>}}
"""

import json
from openai import OpenAI

def llm_judge(query: str, response: str, context: str, task_desc: str) -> dict:
    """Call GPT-4o as judge and parse the JSON scores from its reply."""
    client = OpenAI()

    prompt = JUDGE_TEMPLATE.format(
        # Ground the judge: append the retrieved context to the task description
        task_description=f"{task_desc}\n\n## Context\n{context}",
        query=query,
        response=response,
    )
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM},
            {"role": "user", "content": prompt},
        ],
        temperature=0,
        response_format={"type": "json_object"},  # force parseable JSON output
    )
    return json.loads(result.choices[0].message.content)

Python: BLEU and ROUGE Computation

python
# pip install nltk rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def compute_bleu(reference: str, hypothesis: str) -> float:
    """Compute sentence-level BLEU-4. Range [0, 1]."""
    ref_tokens = reference.lower().split()
    hyp_tokens = hypothesis.lower().split()
    smoothie = SmoothingFunction().method1  # avoid 0 on short sentences
    return sentence_bleu([ref_tokens], hyp_tokens, smoothing_function=smoothie)

def compute_rouge(reference: str, hypothesis: str) -> dict:
    """Compute ROUGE-1, ROUGE-2, ROUGE-L F1 scores."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = scorer.score(reference, hypothesis)
    return {
        "rouge1": round(scores["rouge1"].fmeasure, 4),
        "rouge2": round(scores["rouge2"].fmeasure, 4),
        "rougeL": round(scores["rougeL"].fmeasure, 4),
    }

# Example: summarization quality check
reference = "The transformer uses self-attention to process sequences in parallel."
hypothesis = "Transformers apply attention mechanisms across the full sequence simultaneously."

print("BLEU-4:", compute_bleu(reference, hypothesis))
# → BLEU-4: 0.089  (low — different wording, same meaning — BLEU misses this)

print("ROUGE:", compute_rouge(reference, hypothesis))
# → {'rouge1': 0.35, 'rouge2': 0.08, 'rougeL': 0.25}
# ROUGE-1 is better but still low — use BERTScore for semantic similarity
🔬

Worked Example — Evaluating a RAG System

Setup: You have a RAG chatbot over a company knowledge base. You run 100 test queries with known ground-truth answers and retrieved document sets. Three metrics tell different stories:

| Metric | What it measures | How to compute | Your score |
|---|---|---|---|
| Retrieval Relevance | Did we fetch the right docs? | Recall@k: fraction of ground-truth docs in top-k results | 88% |
| Answer Correctness | Is the answer factually right? | LLM judge compares answer vs. ground truth, 1-5 scale | 71% |
| Faithfulness | Does the answer stay grounded in retrieved context? | NLI model checks each claim for entailment vs. context | 54% |
⚠ Warning · The trap: Retrieval Relevance is 88% — the pipeline looks healthy. But Faithfulness is only 54%. The model retrieved the right documents then hallucinated beyond them in nearly half of responses. High retrieval recall does not prevent generation-level hallucination — you must measure both layers independently.

Root cause analysis on the 46 faithfulness failures:

  • 22 cases: model added plausible-sounding details not present in retrieved docs (e.g., specific version numbers, dates)
  • 14 cases: model synthesized across multiple docs and introduced reasoning errors in the synthesis step
  • 10 cases: retrieved docs were relevant but incomplete — model "filled the gap" with memorized knowledge instead of saying "I don't know"

Fix: Add a faithfulness gate — after generation, run an NLI check on each sentence. Flag any claim with entailment probability < 0.7 for either abstention or human review. This drops faithfulness failures from 46% to ~12% in practice, at the cost of ~15% of responses being flagged as uncertain.
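
A sketch of such a gate with an off-the-shelf MNLI classifier from Hugging Face transformers; the model choice, label names, and upstream sentence splitting are assumptions, and production systems often use purpose-built faithfulness checkers:

python
# pip install transformers torch
from transformers import pipeline

nli = pipeline("text-classification", model="facebook/bart-large-mnli")

def faithfulness_gate(context: str, answer_sentences: list[str],
                      threshold: float = 0.7) -> list[str]:
    """Return answer sentences whose entailment probability against the
    retrieved context falls below the threshold (flag for abstention/review)."""
    flagged = []
    for sent in answer_sentences:
        scores = nli({"text": context, "text_pair": sent}, top_k=None)
        entail_p = next(s["score"] for s in scores if s["label"] == "entailment")
        if entail_p < threshold:
            flagged.append(sent)
    return flagged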

💡 Tip · Lesson for RAG evals: Always measure retrieval and generation separately. A two-layer eval (Recall@k for retrieval + faithfulness for generation) catches failures that a single "answer correctness" metric will miss entirely.
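
The retrieval half of that two-layer eval is only a few lines; the doc IDs here are illustrative:

python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of ground-truth relevant docs that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)

print(recall_at_k(["d3", "d7", "d1", "d9", "d2"], {"d1", "d2", "d4"}, k=5))  # → 0.667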
🔥

Break It

Each of these shortcuts breaks a real eval pipeline in a characteristic way:

  • Remove contamination checks
  • Use a single LLM judge without calibration
  • Use only accuracy as your eval metric
  • Trust benchmark scores at face value
  • Skip human eval entirely

Quick check

Trade-off

You remove contamination detection from your eval pipeline to cut build time by 20 minutes. A model's MMLU score jumps from 82% to 90%. What is the most likely explanation?
📊

Real-World Numbers

| Model | HumanEval pass@1 | Arena Elo |
|---|---|---|
| GPT-4 | ~67% | 1,250+ |
| Claude 3 Opus | ~70% | 1,240+ |
| Llama-3 70B | ~50% | 1,200+ |
⚠ Warning · Human eval cost: $5-50 per annotation depending on task complexity. A single model evaluation round with 500 annotations can cost $2,500-25,000. This is why LLM-as-judge is so appealing — but it must be calibrated against human judgments.

Chatbot Arena Elo Ratings (lmsys.org, early 2025)

| Model | Arena Elo | Notes |
|---|---|---|
| GPT-4o | ~1,287 | Highest among GPT-4 variants |
| Claude 3.5 Sonnet | ~1,271 | Strong coding + reasoning |
| Gemini 1.5 Pro | ~1,254 | Long context specialist |
| Llama-3 70B | ~1,213 | Best open-weight model |

Elo ratings are based on blind human preference votes. A 100-point Elo gap corresponds to ~64% win rate in head-to-head comparisons. Scores shift as more votes accumulate — treat as approximate.

Human vs. Automated Eval Agreement Rates

| Comparison | Agreement rate | Source |
|---|---|---|
| Human–human preference | 73–80% | Chatbot Arena inter-annotator |
| GPT-4-as-judge vs. human | 80–85% | MT-Bench paper (Zheng et al. 2023) |
| ROUGE-2 vs. human (summarization) | ~0.40 Spearman | G-Eval paper (Liu et al. 2023) |
| G-Eval logprob score vs. human | ~0.51 Spearman | G-Eval paper (Liu et al. 2023) |

Key insight: GPT-4-as-judge reaches near human–human agreement (80–85% vs. 73–80%), making it a viable proxy when calibrated. Raw ROUGE has far lower correlation — never use it as a primary quality signal for free-form generation.

Frontier Benchmarks (2024–2025) — as of 2025-Q1; scores shift fast

Interviewers at Anthropic, Google DeepMind, and OpenAI regularly ask “what’s the current frontier?” — knowing HLE and ARC-AGI-2 numbers signals you track the field in real time.

| Benchmark | What it tests | SOTA (2025-Q1) | Human baseline | Status |
|---|---|---|---|---|
| HLE | Expert-level multi-domain (2,500 Qs, 100+ fields) | 9–15% | ~100% (domain experts) | Active |
| ARC-AGI-2 | Fluid intelligence vs. memorization (Chollet, Mar 2025) | ~37.6% | ~98% (humans) | Active |
| GPQA Diamond | 198 PhD-level “Google-proof” science Qs | n/a | ~65% (PhD experts) | Saturating |
| GPQA | Graduate-level science (PhD-hard) | ~75% (o3) | ~65% (domain experts) | Saturating |
| SWE-Bench | Real GitHub issues resolved | ~50% (Verified) | ~100% (human dev) | Active |
| MATH | Competition-style math problems | ~97% (o3) | ~40% (high schoolers) | Saturated |
| ARC-AGI | Abstract pattern reasoning (original) | ~88% (o3-high) | ~98% (humans) | Saturated → ARC-AGI-2 |

Scores as of 2025-Q1; GPQA/SWE-Bench from early 2025 reports; ARC-AGI o3 from December 2024 OpenAI release; HLE leaderboard at agi.safe.ai; ARC-AGI-2 leaderboard at arcprize.org. All proprietary model internals labeled (estimated) or (per public eval).

💡 Tip · Why interviewers ask about frontier benchmarks: Knowing that HLE top models score only 9–15% while ARC-AGI-2 sits at ~37.6% signals you understand the gap between “saturated benchmarks” (MMLU, MATH) and genuinely hard open problems. ARC-AGI-2 was designed specifically to resist the memorization that let o3 reach 88% on ARC-AGI-1. GPQA Diamond is now saturating — interviewers use HLE and ARC-AGI-2 as the new frontier signal.

Quick check

Trade-off

GPT-4-as-judge agrees with humans 80-85% of the time. Human-human agreement is 73-80%. A PM says "the judge is superhuman." What is wrong with that claim?
🧠

Key Takeaways

What to remember for interviews

  1. No single eval method is sufficient — production systems layer benchmarks, LLM-as-judge, and human eval.
  2. Benchmark contamination silently inflates scores: always verify with rephrased or held-out test sets.
  3. LLM judges must be calibrated against human labels (target ≥0.7 Pearson) before shipping.
  4. G-Eval's logprob-weighted scoring reaches ~0.51 Spearman with humans on summarization — better than ROUGE-2's ~0.40, but not a replacement for human eval.
  5. For RAG systems, measure retrieval (Recall@k) and generation (faithfulness) separately — high retrieval does not prevent hallucination.
  6. GPT-4-as-judge agrees with humans 80–85% of the time, near the 73–80% human–human agreement ceiling.
🧠

Recap quiz

🧠

Evaluation recap

Trade-off

Your LLM judge shows 82% agreement with human raters overall, but the team suspects position bias. Which experiment most directly isolates that bias?

Derivation

You generate n=20 samples per HumanEval problem and observe c=16 correct. What is pass@1 using the unbiased estimator?

Trade-off

G-Eval achieves ~0.51 Spearman with humans; ROUGE-2 achieves ~0.40 on the same summarization task. A reviewer argues G-Eval is "production-ready" because it beats ROUGE. What is the strongest counterargument?
Trade-off

A new model scores 91% on MMLU but drops to 78% on a rephrased version of the same questions. What is the most defensible interpretation?

Derivation

Your team ships a new prompt template and re-runs the 80-example golden set. All 80 pass. A colleague says "regression gate passed, ship it." What is the key risk?
Trade-off

SWE-Bench Verified SOTA is ~50%. A team argues this means agents can handle half of real GitHub issues autonomously in production. What is the key flaw in that reasoning?

Trade-off

In a RAG eval, retrieval Recall@5 is 88% but faithfulness is 54%. Where should you invest engineering effort first?


🎯

Interview Questions


How would you evaluate hallucination in production?

★★★
OpenAI · Anthropic

Is hallucination binary (yes/no) or is there a spectrum?

What is LLM-as-judge? What are its failure modes?

★★★
Anthropic · OpenAI

How do you detect benchmark contamination?

★★☆
OpenAI · Google

Design an evaluation suite for a customer-facing chatbot.

★★★
Google · Databricks

What is the MMLU benchmark and what does it measure?

★☆☆
OpenAI

How would you set up A/B testing for LLM outputs?

★★☆
GoogleDatabricks