📊 LLM Evaluation
Your model scores 90% on MMLU but users hate it — why?
How do you know if your model is good? Evaluation is the hardest unsolved problem in LLMs — and every company asks about it in interviews.
There are three main evaluation approaches: automated benchmarks, LLM-as-judge, and human evaluation. Each trades off cost, speed, and reliability.
The Intuition
Imagine you built a customer support chatbot. How do you know it's good? You need three things: (1) automated benchmarks to catch regressions fast, (2) LLM-as-judge for nuanced quality, (3) human eval for the final call. Each trades speed for accuracy — you use all three in production.
| What you're measuring | Best eval method | Why |
|---|---|---|
| Factual accuracy | Benchmarks (MMLU, TruthfulQA) | Automated, reproducible, fast |
| Response quality | LLM-as-judge (GPT-4 eval) | Nuanced, scales, cheap vs human |
| User satisfaction | Human eval (A/B test) | Ground truth, but slow and expensive |
Three approaches to evaluating LLMs, each with failure modes:
- Benchmarks (MMLU, HumanEval) — standardized tests. Fast and reproducible, but can be gamed and don't capture real-world usefulness
- LLM-as-judge — use GPT-4 to rate outputs. Scalable and cheap, but has biases (verbosity, position, self-enhancement)
- Human evaluation — the gold standard. Captures nuance, but slow, expensive, and humans disagree (~70-80% inter-annotator agreement)
In practice, production systems combine all three: benchmarks for regression testing, LLM-as-judge for scaling, human eval for calibration.
How LLM-as-judge actually works (G-Eval). The naive approach — asking an LLM to rate a response 1-10 — produces noisy, biased scores. The structured approach used in G-Eval (Liu et al., 2023) is: (1) give the judge an explicit evaluation form with step-by-step criteria, (2) ask the judge to output a chain-of-thought reasoning before scoring, (3) compute the final score as the probability-weighted sum over score tokens (using the logprobs of "1" through "5"), not just the argmax. The logprob-weighted score is far more stable than raw text output — a single bad sample doesn't dominate. G-Eval achieves ~0.51 Spearman correlation with human judgments on summarization (compared to ~0.40 for ROUGE-2) — better than ROUGE but still far from perfect agreement with human judges.
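To make step (3) concrete, here is a minimal sketch of logprob-weighted scoring against the OpenAI API. The function name is ours, and it assumes the judge prompt instructs the model to answer with a single digit from 1 to 5 as its first output token:

```python
import math
from openai import OpenAI

def logprob_weighted_score(judge_prompt: str, model: str = "gpt-4o") -> float:
    """G-Eval-style scoring: expectation over score tokens, not argmax.

    Assumes the judge prompt asks for a single digit 1-5 as the first
    (and only) output token.
    """
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": judge_prompt}],
        max_tokens=1,
        temperature=0,
        logprobs=True,
        top_logprobs=20,  # wide enough to capture all five score tokens
    )
    # Distribution over the first output token
    top = resp.choices[0].logprobs.content[0].top_logprobs
    probs = {
        t.token.strip(): math.exp(t.logprob)
        for t in top
        if t.token.strip() in {"1", "2", "3", "4", "5"}
    }
    if not probs:
        raise ValueError("Judge placed no mass on any score token")
    total = sum(probs.values())  # renormalize over valid score tokens only
    return sum(int(tok) * p / total for tok, p in probs.items())
```

The expectation smooths out single-sample noise: a judge that puts 60% on "4" and 40% on "5" scores 4.4 rather than flipping between two integers across runs.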
Quick check
An LLM judge rates response A above B in 72% of 500 pairwise trials. Before concluding A is better, which control is most critical?
What is benchmark contamination and why is it a problem?
Step-by-Step Derivation
Perplexity — Language Model Quality
Perplexity measures how "surprised" the model is by the test data. Lower perplexity = better language modeling:

$$\mathrm{PPL} = \exp(\mathcal{L}) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right)$$

where $\mathcal{L}$ is the cross-entropy loss. Intuitively: a model with PPL=10 is as uncertain as randomly choosing among 10 equally likely options.
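As a sanity check on the formula, a tiny helper (ours, not from any library) that converts per-token log-probabilities into perplexity:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """PPL = exp(mean negative log-likelihood); expects natural-log inputs."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model uniformly torn between 10 options on every token assigns each
# one logprob ln(1/10) -> PPL exactly 10, matching the intuition above.
print(perplexity([math.log(1 / 10)] * 4))  # -> 10.0
```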
Pass@k — Code Evaluation
The probability that at least 1 of k generated code samples passes all test cases:

$$\text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$

where $n$ is the total number of samples generated and $c$ is the number of correct samples. HumanEval reports this metric with $k = 1, 10, 100$.
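In code, this is usually computed in the numerically stable product form from the HumanEval paper (Chen et al., 2021); the sketch below follows that reference implementation:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), as a stable product."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# pass@1 reduces to the raw success rate c/n:
print(pass_at_k(100, 25, 1))  # -> 0.25
```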
Elo Rating — Head-to-Head Comparison
Compare models head-to-head (used by Chatbot Arena). After each match, update ratings based on expected vs. actual outcome:

$$R' = R + K(S - E), \qquad E = \frac{1}{1 + 10^{(R_{\text{opp}} - R)/400}}$$

where $E$ is the expected score, $S$ is the actual score (1 = win, 0 = loss), and $K$ is the update magnitude.
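A minimal sketch of one update step. $K=32$ is an illustrative default; note that Chatbot Arena's published ratings come from fitting all votes jointly rather than from sequential updates like this:

```python
def elo_update(r_a: float, r_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """One Elo update after an A-vs-B match (score_a: 1 win, 0 loss, 0.5 tie)."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# A 1300-rated model beating a 1200-rated one gains little:
# the win was already ~64% expected.
print(elo_update(1300, 1200, 1.0))  # -> (~1311.5, ~1188.5)
```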
Contamination — The Silent Score Inflator
When test data leaks into training data, benchmark scores become meaningless. This is increasingly common as models train on web-scale data that includes benchmark datasets:
- Detection: n-gram overlap analysis, canary strings, rephrased benchmarks (see the sketch after this list)
- Mitigation: use held-out private test sets, temporal splits (post-training-cutoff data), dynamic benchmarks
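A rough illustration of the n-gram overlap check referenced above. Whitespace tokenization and $n=13$ (the window size used in GPT-3-era contamination reports) are simplifying choices; real pipelines tokenize properly and scan at corpus scale:

```python
def ngram_overlap(train_text: str, test_text: str, n: int = 13) -> float:
    """Fraction of test-set n-grams that appear verbatim in the training text."""
    def ngrams(text: str) -> set[tuple[str, ...]]:
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    test_grams = ngrams(test_text)
    if not test_grams:
        return 0.0  # test text shorter than n tokens
    return len(test_grams & ngrams(train_text)) / len(test_grams)
```

Exact 13-gram matches rarely occur by chance in natural text, so any benchmark item with non-trivial overlap deserves investigation.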
Quick check
A model generates n=10 samples per problem. For a hard problem, only c=2 are correct. Which k gives pass@k closest to 50%?
Python: Simple Eval Harness
import json
from openai import OpenAI
def evaluate_model(model: str, test_cases: list[dict]) -> dict:
"""Run eval suite and compute pass rates by category."""
client = OpenAI()
results = {"correct": 0, "total": 0, "by_category": {}}
for case in test_cases:
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": case["prompt"]}],
temperature=0, # deterministic for eval
)
answer = response.choices[0].message.content
correct = case["check_fn"](answer, case["expected"])
results["total"] += 1
results["correct"] += int(correct)
cat = case.get("category", "general")
results["by_category"].setdefault(cat, {"correct": 0, "total": 0})
results["by_category"][cat]["total"] += 1
results["by_category"][cat]["correct"] += int(correct)
results["accuracy"] = results["correct"] / results["total"]
    return results

Python: LLM-as-Judge Prompt Template (G-Eval style)
JUDGE_SYSTEM = """You are an expert evaluator. Score the response on the following criteria.
Think step by step before assigning a score."""
JUDGE_TEMPLATE = """
## Task
{task_description}
## User Query
{query}
## Response to Evaluate
{response}
## Evaluation Rubric
Score each dimension 1-5 (5 = best):
**Faithfulness** — Does every claim in the response follow from the provided context?
1 = Hallucinated facts unrelated to context
5 = Every claim is directly supported by context
**Relevance** — Does the response address what the user actually asked?
1 = Completely off-topic
5 = Directly answers the question with appropriate scope
**Completeness** — Are all parts of the question addressed?
1 = Misses key aspects of the query
5 = Covers all relevant aspects
## Chain-of-Thought Reasoning
Think through each dimension before scoring:
## Scores (JSON)
{{"faithfulness": <1-5>, "relevance": <1-5>, "completeness": <1-5>}}
"""
def llm_judge(query: str, response: str, context: str, task_desc: str) -> dict:
"""Call GPT-4 as judge, parse JSON scores from response."""
from openai import OpenAI
client = OpenAI()
prompt = JUDGE_TEMPLATE.format(
task_description=task_desc,
query=query,
response=response,
# context injected into task_description in practice
)
result = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": JUDGE_SYSTEM},
{"role": "user", "content": prompt},
],
temperature=0,
response_format={"type": "json_object"},
)
import json
    return json.loads(result.choices[0].message.content)

Python: BLEU and ROUGE Computation
# pip install nltk rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
def compute_bleu(reference: str, hypothesis: str) -> float:
"""Compute sentence-level BLEU-4. Range [0, 1]."""
ref_tokens = reference.lower().split()
hyp_tokens = hypothesis.lower().split()
smoothie = SmoothingFunction().method1 # avoid 0 on short sentences
return sentence_bleu([ref_tokens], hyp_tokens, smoothing_function=smoothie)
def compute_rouge(reference: str, hypothesis: str) -> dict:
"""Compute ROUGE-1, ROUGE-2, ROUGE-L F1 scores."""
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, hypothesis)
return {
"rouge1": round(scores["rouge1"].fmeasure, 4),
"rouge2": round(scores["rouge2"].fmeasure, 4),
"rougeL": round(scores["rougeL"].fmeasure, 4),
}
# Example: summarization quality check
reference = "The transformer uses self-attention to process sequences in parallel."
hypothesis = "Transformers apply attention mechanisms across the full sequence simultaneously."
print("BLEU-4:", compute_bleu(reference, hypothesis))
# → BLEU-4: 0.089 (low — different wording, same meaning — BLEU misses this)
print("ROUGE:", compute_rouge(reference, hypothesis))
# → {'rouge1': 0.35, 'rouge2': 0.08, 'rougeL': 0.25}
# ROUGE-1 is better but still low — use BERTScore for semantic similarity

Worked Example — Evaluating a RAG System
Setup: You have a RAG chatbot over a company knowledge base. You run 100 test queries with known ground-truth answers and retrieved document sets. Three metrics tell different stories:
| Metric | What it measures | How to compute | Your score |
|---|---|---|---|
| Retrieval Relevance | Did we fetch the right docs? | Recall@k: fraction of ground-truth docs in top-k results | 88% |
| Answer Correctness | Is the answer factually right? | LLM judge compares answer vs. ground truth, 1-5 scale | 71% |
| Faithfulness | Does the answer stay grounded in retrieved context? | NLI model checks each claim for entailment vs. context | 54% |
Root cause analysis on the 46 faithfulness failures:
- 22 cases: model added plausible-sounding details not present in retrieved docs (e.g., specific version numbers, dates)
- 14 cases: model synthesized across multiple docs and introduced reasoning errors in the synthesis step
- 10 cases: retrieved docs were relevant but incomplete — model "filled the gap" with memorized knowledge instead of saying "I don't know"
Fix: Add a faithfulness gate — after generation, run an NLI check on each sentence. Flag any claim with entailment probability < 0.7 for either abstention or human review. This drops faithfulness failures from 46% to ~12% in practice, at the cost of ~15% of responses being flagged as uncertain.
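A sketch of that gate, assuming a Hugging Face MNLI-style cross-encoder; the model choice, threshold, and sentence splitting are all illustrative:

```python
# pip install transformers torch
from transformers import pipeline

# Illustrative model choice — any MNLI-style cross-encoder works here
nli = pipeline("text-classification", model="roberta-large-mnli")

def faithfulness_gate(context: str, answer_sentences: list[str],
                      threshold: float = 0.7) -> list[str]:
    """Return answer sentences that the retrieved context does not entail."""
    flagged = []
    for sent in answer_sentences:
        # Premise = retrieved context, hypothesis = generated claim
        scores = nli({"text": context, "text_pair": sent}, top_k=None)
        entail = next(s["score"] for s in scores if s["label"] == "ENTAILMENT")
        if entail < threshold:
            flagged.append(sent)  # route to abstention or human review
    return flagged
```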
Quick check
You remove contamination detection from your eval pipeline to cut build time by 20 minutes. A model's MMLU score jumps from 82% to 90%. What is the most likely explanation?
Real-World Numbers
| Model | MMLU | HumanEval pass@1 | Arena Elo |
|---|---|---|---|
| GPT-4 | ~86% | ~67% | 1,250+ |
| Claude 3 Opus | ~87% | ~70% | 1,240+ |
| Llama-3 70B | ~82% | ~50% | 1,200+ |
| Human expert | ~90% | — | — |
Chatbot Arena Elo Ratings (lmsys.org, early 2025)
| Model | Arena Elo | Notes |
|---|---|---|
| GPT-4o | ~1,287 | Highest among GPT-4 variants |
| Claude 3.5 Sonnet | ~1,271 | Strong coding + reasoning |
| Gemini 1.5 Pro | ~1,254 | Long context specialist |
| Llama-3 70B | ~1,213 | Best open-weight model |
Elo ratings are based on blind human preference votes. A 100-point Elo gap corresponds to ~64% win rate in head-to-head comparisons. Scores shift as more votes accumulate — treat as approximate.
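The 64% figure follows directly from the expected-score formula in the derivation above:

$$E = \frac{1}{1 + 10^{-100/400}} = \frac{1}{1 + 10^{-0.25}} \approx 0.64$$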
Human vs. Automated Eval Agreement Rates
| Comparison | Agreement rate | Source |
|---|---|---|
| Human–human preference | ~73–80% | Chatbot Arena inter-annotator |
| GPT-4-as-judge vs. human | ~80–85% | MT-Bench paper (Zheng et al. 2023) |
| ROUGE-2 vs. human (summarization) | ~0.40 Spearman | G-Eval paper (Liu et al. 2023) |
| G-Eval logprob score vs. human | ~0.51 Spearman | G-Eval paper (Liu et al. 2023) |
Key insight: GPT-4-as-judge reaches near human–human agreement (80–85% vs. 73–80%), making it a viable proxy when calibrated. Raw ROUGE has far lower correlation — never use it as a primary quality signal for free-form generation.
Frontier Benchmarks (2024–2025) — as of 2025-Q1; scores shift fast
Interviewers at Anthropic, Google DeepMind, and OpenAI regularly ask “what’s the current frontier?” — knowing HLE and ARC-AGI-2 numbers signals you track the field in real time.
| Benchmark | What it tests | SOTA (2025-Q1) | Human baseline | Status |
|---|---|---|---|---|
| HLE (Humanity's Last Exam) | Expert-level multi-domain (2,500 Qs, 100+ fields) | ~26% (deep-research agents, per public eval) | ~100% (domain experts) | Active |
| ARC-AGI-2 | Fluid intelligence vs. memorization (Chollet, Mar 2025) | <5% (frontier models at launch) | ~98% (humans) | Active |
| GPQA Diamond | 198 PhD-level "Google-proof" science Qs | ~87% (o3, per public eval) | ~65% (PhD experts) | Saturating |
| GPQA | Graduate-level science (PhD-hard) | ~75% (o3) | ~65% (domain experts) | Saturating |
| SWE-Bench Verified | Real GitHub issues resolved | ~50% (top agents, per public leaderboards) | ~100% (human dev) | Active |
| MATH | Competition-style math problems | ~97% (o3) | ~40% (high schoolers) | Saturated |
| ARC-AGI | Abstract pattern reasoning (original) | ~88% (o3-high) | ~98% (humans) | Saturated → ARC-AGI-2 |
Scores as of 2025-Q1; GPQA/SWE-Bench from early 2025 reports; ARC-AGI o3 from December 2024 OpenAI release; HLE leaderboard at agi.safe.ai; ARC-AGI-2 leaderboard at arcprize.org. All proprietary model internals labeled (estimated) or (per public eval).
Quick check
GPT-4-as-judge agrees with humans 80-85% of the time. Human-human agreement is 73-80%. A PM says “the judge is superhuman.” What is wrong with that claim?
Key Takeaways
What to remember for interviews
1. No single eval method is sufficient — production systems layer benchmarks, LLM-as-judge, and human eval.
2. Benchmark contamination silently inflates scores: always verify with rephrased or held-out test sets.
3. LLM judges must be calibrated against human labels (target ≥0.7 Pearson) before shipping.
4. G-Eval's logprob-weighted scoring reaches ~0.51 Spearman with humans on summarization — better than ROUGE-2's ~0.40, but not a replacement for human eval.
5. For RAG systems, measure retrieval (Recall@k) and generation (faithfulness) separately — high retrieval does not prevent hallucination.
6. GPT-4-as-judge agrees with humans 80–85% of the time, near the 73–80% human–human agreement ceiling.
Recap quiz
Evaluation recap
Your LLM judge shows 82% agreement with human raters overall, but the team suspects position bias. Which experiment most directly isolates that bias?
You generate n=20 samples per HumanEval problem and observe c=16 correct. What is pass@1 using the unbiased estimator?
G-Eval achieves ~0.51 Spearman with humans; ROUGE-2 achieves ~0.40 on the same summarization task. A reviewer argues G-Eval is “production-ready” because it beats ROUGE. What is the strongest counterargument?
A new model scores 91% on MMLU but drops to 78% on a rephrased version of the same questions. What is the most defensible interpretation?
Your team ships a new prompt template and re-runs the 80-example golden set. All 80 pass. A colleague says “regression gate passed, ship it.” What is the key risk?
SWE-Bench Verified SOTA is ~50%. A team argues this means agents can handle half of real GitHub issues autonomously in production. What is the key flaw in that reasoning?
In a RAG eval, retrieval Recall@5 is 88% but faithfulness is 54%. Where should you invest engineering effort first?
Further Reading
- Measuring Massive Multitask Language Understanding (MMLU) — Hendrycks et al. 2020 — 57-subject benchmark testing broad knowledge and reasoning
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — Zheng et al. 2023 — LLM-as-judge evaluation and Elo-based human preference ranking
- Holistic Evaluation of Language Models (HELM) — Liang et al. 2022 — multi-metric evaluation framework covering accuracy, fairness, robustness, and more
- Hamel Husain — Your AI Product Needs Evals — The definitive practitioner guide to building eval pipelines: golden datasets, LLM judges, regression gates, and CI integration
- Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators — Dubois et al. 2024 — length-controlled win rates to debias automatic evaluators; addresses verbosity inflation in GPT-4 judge scoring
- Anthropic — Developing Evaluations for Claude — Practical guide to designing task-specific evals, calibrating LLM judges, and building regression gates for Claude-based applications
Interview Questions
- How would you evaluate hallucination in production? ★★★
  - Follow-up: Is hallucination binary (yes/no) or is there a spectrum?
- What is LLM-as-judge? What are its failure modes? ★★★
- How do you detect benchmark contamination? ★★☆
- Design an evaluation suite for a customer-facing chatbot. ★★★
- What is the MMLU benchmark and what does it measure? ★☆☆
- How would you set up A/B testing for LLM outputs? ★★☆