🧪 Agent Evaluation
Both agents got the right answer — but one cost 6x more tokens
Evaluating a standalone LLM is hard enough. Evaluating agents is fundamentally harder — you are no longer scoring a single output, but an entire trajectory of decisions, tool calls, and recoveries. An agent that gets the right answer through luck is not the same as one that reasons its way there efficiently.
Evaluating Agent Trajectories
What you're seeing: An agent eval pipeline has two distinct evaluation layers — trajectory (process) and outcome (result). The non-deterministic, multi-step nature of agents makes both layers necessary.
What to notice: The same task can yield different trajectories on different runs. Outcome eval alone misses the difference between Agent A (3 correct steps) and Agent B (7 chaotic steps that happen to finish correctly).
Same Task, Two Agents — Spot the Difference
Both agents solve the same task correctly. But Agent B is far worse — can you see why?
Agent A — Score: 0.95
Agent B — Score: 0.31
The Intuition
LLM eval asks: “Is this output correct?” One input, one output, one score. Agent eval asks: “Was this sequence of decisions good?” Multiple steps, tool interactions, branching paths, and non-deterministic outcomes.
Why agent eval is harder — four reasons:
- Multi-step decisions — a 20-step trajectory has exponentially more failure modes than a single generation
- Non-determinism — the same agent on the same task may take a different path every run (temperature, tool latency, environment state); see the pass@k sketch after this list
- Tool interactions — agents have side effects: file writes, API calls, database mutations. A wrong tool call cannot be unscored like a wrong token
- Process vs outcome — an agent that deletes the test file to make tests pass “succeeds” by outcome but fails by process
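Non-determinism is why pass rates must be estimated, not observed once. A minimal sketch of the standard unbiased pass@k estimator (the combinatorial form popularized by HumanEval-style sampling), assuming each task has been run n times:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: given n total runs with c successes, estimate the
    probability that at least one of k fresh runs would succeed."""
    if n - c < k:
        return 1.0  # fewer than k failures: every k-sample contains a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 runs, 4 successes -> pass@3
print(round(pass_at_k(n=10, c=4, k=3), 3))  # 0.833

With 10 runs and 4 successes, pass@3 ≈ 0.83; a single run would have reported either extreme.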
Three dimensions of agent quality:
- Functional correctness — did the agent complete the task? (binary gate)
- Process quality — were the intermediate steps reasonable, safe, and well-planned?
- Efficiency — how many steps, tokens, and dollars did it cost?
Trajectory evaluation scores the full decision chain, not just the final answer. Record every tool call, file read, edit, and search. Compare against optimal (shortest successful) paths. Penalize backtracks, unnecessary calls, and dangerous actions.
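As a sketch of that penalty structure (the Action fields, the danger flag, and the 0.05 per-redundancy penalty are illustrative assumptions, not a standard):

from dataclasses import dataclass

@dataclass
class Action:
    tool: str                # e.g. "read_file", "run_tests"
    target: str              # file path, URL, query, ...
    dangerous: bool = False  # flagged upstream by a denylist or policy check

def process_score(actions: list[Action], optimal_steps: int) -> float:
    """Illustrative penalty-based process score in [0, 1]."""
    if any(a.dangerous for a in actions):
        return 0.0  # hard gate: one dangerous action zeroes process quality
    seen: set[tuple[str, str]] = set()
    redundant = 0
    for a in actions:
        key = (a.tool, a.target)
        if key in seen:
            redundant += 1  # repeated identical call counts as wasted work
        seen.add(key)
    length_ratio = min(optimal_steps / max(len(actions), 1), 1.0)
    return max(0.0, length_ratio - 0.05 * redundant)

traj = [
    Action("read_file", "a.py"),
    Action("read_file", "a.py"),   # backtrack: identical re-read
    Action("edit_file", "a.py"),
    Action("run_tests", "suite"),
]
print(process_score(traj, optimal_steps=3))  # 3/4 minus one redundancy penalty = 0.70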
Regression testing for agents uses golden test sets (50-200 curated tasks), snapshot testing (diff trajectories across runs), behavioral contracts (“must not exceed 3 API calls for simple lookups”), and flakiness tracking (per-task pass rate over time).
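Two of those four pieces fit in a few lines. A sketch of a behavioral-contract check and per-task flakiness tracking, with an invented contract table and threshold:

from collections import defaultdict

def check_contract(task_type: str, api_calls: int) -> bool:
    """Behavioral contract: simple lookups must not exceed 3 API calls.
    The limits table here is an illustrative assumption."""
    limits = {"simple_lookup": 3}
    return api_calls <= limits.get(task_type, float("inf"))

# Flakiness tracking: per-task pass history over repeated runs
runs: dict[str, list[bool]] = defaultdict(list)

def record(task_id: str, passed: bool) -> None:
    runs[task_id].append(passed)

def flaky_tasks(threshold: float = 0.2) -> list[str]:
    """Tasks whose pass rate sits strictly between threshold and
    1 - threshold are neither reliably passing nor reliably failing."""
    flaky = []
    for task_id, results in runs.items():
        rate = sum(results) / len(results)
        if threshold < rate < 1 - threshold:
            flaky.append(task_id)
    return flaky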
Process Reward Models (PRMs) — rewarding the right steps. Outcome-only evaluation rewards any path to the correct answer, including dangerous shortcuts. PRMs assign a reward to every intermediate step, not just the final output. OpenAI's Let's Verify Step by Step (Lightman et al., 2023) shows that process supervision — training with per-step rewards — produces more reliable reasoning: on MATH competition problems, a PRM-guided best-of-N search reached 78.2% versus 49.6% for outcome-only selection — a 29-point gap using the same underlying model. The evaluation implication: if you only score final answers during eval, you cannot detect whether your agent reasons correctly or just gets lucky. PRM-style step scoring during evaluation (even with an LLM-as-judge per step) catches dangerous shortcuts before they compound across longer tasks.
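A sketch of what PRM-style step scoring can look like at eval time; judge_step is a placeholder interface for your own LLM-judge call, not a real API:

def judge_step(task: str, history: list[str], step: str) -> float:
    """Placeholder: ask an LLM judge to rate one reasoning/tool step in
    [0, 1] given the task and the trajectory so far. Wire in your own
    model call here; this interface is an assumption."""
    raise NotImplementedError

def prm_style_eval(task: str, steps: list[str], min_ok: float = 0.3) -> dict:
    """Score every intermediate step, not just the final answer. One
    low-scoring step (e.g. 'delete the failing test') fails the
    trajectory even when the outcome looks correct."""
    scores = [judge_step(task, steps[:i], s) for i, s in enumerate(steps)]
    return {
        "step_scores": scores,
        "weakest_step": min(scores, default=0.0),
        "process_pass": bool(scores) and min(scores) >= min_ok,
    }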
Quick check
An agent reaches the correct final answer via 15 wasted steps and one dangerous file deletion. Which eval catches the danger?
An agent completes a task in 20 steps. The optimal path is 5 steps. It made 15 tool calls, 9 of which were correct. What is its efficiency score?
Key Metrics
Task Completion Rate
The most basic metric — what fraction of tasks does the agent complete successfully?
completion_rate = tasks_completed / total_tasks
Step Efficiency
Ratio of the optimal number of steps (from human solutions or shortest successful trajectory) to actual steps taken. A score of 1.0 is perfect; lower means wasted work:
efficiency = min(optimal_steps / actual_steps, 1.0)
Tool Selection Accuracy
Of all tool calls the agent made, how many were correct choices for the given state? Penalizes both wrong tools and unnecessary calls:
tool_acc = correct_tool_calls / total_tool_calls
Composite Trajectory Score
Weighted combination of the three dimensions. Weights are tuned per deployment — safety-critical agents weight process quality higher:
T = w₁·completion + w₂·efficiency + w₃·tool_acc
Cost-Normalized Score
Quality per dollar — critical for production deployment where budget is finite:
cost_normalized = T / cost_usd
Python: Trajectory Evaluator
from dataclasses import dataclass

@dataclass
class TrajectoryResult:
    completed: bool
    actual_steps: int
    optimal_steps: int
    correct_tool_calls: int
    total_tool_calls: int
    tokens_used: int

def evaluate_trajectory(
    result: TrajectoryResult,
    w1: float = 0.5,   # completion weight
    w2: float = 0.3,   # efficiency weight
    w3: float = 0.2,   # tool accuracy weight
    cost_per_token: float = 3e-6,  # ~$3/M tokens
) -> dict:
    """Score a single agent trajectory."""
    completion = 1.0 if result.completed else 0.0
    efficiency = min(result.optimal_steps / max(result.actual_steps, 1), 1.0)
    tool_acc = result.correct_tool_calls / max(result.total_tool_calls, 1)

    # Composite trajectory score
    trajectory_score = w1 * completion + w2 * efficiency + w3 * tool_acc

    # Cost-normalized score (quality per dollar)
    cost = result.tokens_used * cost_per_token
    cost_normalized = trajectory_score / max(cost, 1e-9)

    return {
        "completion": completion,
        "efficiency": efficiency,
        "tool_accuracy": tool_acc,
        "trajectory_score": trajectory_score,
        "cost_usd": cost,
        "cost_normalized": cost_normalized,
    }

# Example: agent solves task in 12 steps (optimal: 5), 8/10 correct tools
result = TrajectoryResult(
    completed=True, actual_steps=12, optimal_steps=5,
    correct_tool_calls=8, total_tool_calls=10, tokens_used=15000,
)
scores = evaluate_trajectory(result)
# => trajectory_score=0.785, efficiency=0.417, cost_normalized=17.44
Quick check
An agent completes a task (completion=1.0), efficiency=0.42, tool_acc=0.80. With weights w₁=0.5, w₂=0.3, w₃=0.2, what is its composite trajectory score T?
Quick check
A team runs each eval task exactly once and uses the binary pass/fail as their regression signal. What is the primary failure mode?
Real-World Numbers
| Benchmark | Representative Score | Human Baseline | What It Measures |
|---|---|---|---|
| SWE-bench Verified | ~50% (top agents) | ~78% | End-to-end software engineering — real GitHub issues, must produce passing patches |
| WebArena | 14.4% (GPT-4) | 78.2% | Web navigation — 812 tasks across 5 real websites (Reddit, GitLab, shopping) |
| AgentBench | varies | — | Generalization breadth — 8 environments (OS, DB, web, games, knowledge graph) |
| HumanEval (agent) | — | — | Code generation — agents with tool use (run tests, iterate) surpass single-shot LLMs |
| Chatbot Arena (LLM preference) | ELO ~1250+ | human pref | Head-to-head preference — users choose which agent response they prefer |
2024–2025 SOTA Update
WebArena scores have climbed from GPT-4's 14.4% baseline as CUA systems (Anthropic Computer Use, OpenAI Operator) apply RL post-training on browser tasks; the human baseline remains 78.2%. On OSWorld, top models score ~15% vs human 72% — computer-use evals now test perception accuracy and UI grounding, not just JSON-clean tool calls. A model that outputs the right function name but clicks the wrong pixel fails. This shifts eval from schema-conformance checks to visual grounding metrics: correct element identification rate, pixel-distance error, and task-completion under dynamic UI state changes.
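Those grounding metrics reduce to simple geometry once predicted click points are logged against ground-truth element boxes. A sketch with illustrative function names:

import math

def pixel_distance_error(pred_xy: tuple[float, float],
                         true_xy: tuple[float, float]) -> float:
    """Euclidean distance (pixels) between the predicted click point and
    the center of the ground-truth UI element."""
    return math.dist(pred_xy, true_xy)

def inside_bbox(pred_xy: tuple[float, float],
                bbox: tuple[float, float, float, float]) -> bool:
    """bbox = (x_min, y_min, x_max, y_max) of the target element."""
    x, y = pred_xy
    x0, y0, x1, y1 = bbox
    return x0 <= x <= x1 and y0 <= y <= y1

def element_identification_rate(hits: list[bool]) -> float:
    """Fraction of actions whose click landed inside the correct element."""
    return sum(hits) / max(len(hits), 1)

# Right function name, wrong pixel: the click misses the button's box
print(inside_bbox((412, 305), (420, 290, 480, 320)))  # False -> counts as a miss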
Key Takeaways
What to remember for interviews
1. Agent evaluation must score full trajectories, not just final answers — an agent that solves a task in 47 steps can be worse than one that fails cleanly in 3.
2. Non-determinism requires running N≥5 trials per task and reporting pass@k or mean±variance, never a single-run pass/fail.
3. Trajectory score combines three dimensions: functional correctness (binary gate), process quality (step reasoning), and efficiency (tokens/steps/cost).
4. Process Reward Models (PRMs) assign rewards to every intermediate step, catching flawed reasoning that happens to reach the correct answer — PRM-guided selection hit 78.2% on MATH vs 49.6% for outcome-only.
5. Regression testing for agents needs golden test sets, snapshot trajectory diffing, behavioral contracts, and per-task flakiness tracking across model updates.
Further Reading
- AgentBench: Evaluating LLMs as Agents — Liu et al. 2023 — multi-dimensional benchmark for evaluating LLM agent capabilities
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues? — Jimenez et al. 2023 — benchmark for evaluating code agents on real software engineering tasks
- WebArena: A Realistic Web Environment for Building Autonomous Agents — Zhou et al. 2023 — realistic web-based benchmark for autonomous agent evaluation
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments — Xie et al. 2024 — agents operate on a real desktop OS with screenshots; top models score ~15% vs human 72%
- Tau-bench: A Benchmark for Tool-Agent-User Interaction in Real World Domains — Yao et al. 2024 — measures whether agents correctly follow policies, handle user pushback, and maintain consistency across multi-turn tool interactions
- DeepEval — Open Source LLM Evaluation Framework — Python framework with 14+ metrics (faithfulness, hallucination, tool correctness), trajectory evaluation support, and CI integration
- Hamel Husain — Your AI Product Needs Evals — Practical guide to building eval pipelines for agent products — golden datasets, LLM judges, regression gates, and CI integration
Recap quiz
Agent Eval recap
An agent deletes the failing test file and all remaining tests pass. Which evaluation approach alone would catch this failure?
An agent solves a task in 20 steps; the optimal path is 5 steps. What is its step-efficiency score?
On MATH competition problems, PRM-guided best-of-N selection reached 78.2% accuracy. Outcome-reward-only selection hit 49.6%. What does this 29-point gap tell an evaluator?
An agent is run 10 times on a task; it succeeds 4 times. What is the estimated pass@3?
SWE-bench Verified shows top agents at ~50% and humans at ~78%. Which explanation best accounts for the remaining gap?
A safety-critical agent (e.g., one that can write to prod databases) uses T = w₁·completion + w₂·efficiency + w₃·tool_acc. Which weight allocation is most appropriate?
GPT-4 agents score 14.4% on WebArena vs human 78.2% across 812 tasks. The gap is far larger than on SWE-bench (~50% vs ~78%). What best explains the difference in gap size?
A task passes 3 out of 5 agent runs with no code changes between runs. How should this be classified?
Interview Questions
Why is evaluating agents harder than evaluating LLMs? ★★☆
Design a trajectory evaluation system for a code agent. ★★★
How do you handle non-determinism in agent evaluation? ★★☆
What is the difference between outcome-based and process-based evaluation? ★★☆
How would you build a regression testing pipeline for agents? ★★★
Compare SWE-bench, WebArena, and AgentBench — what does each measure? ★★☆
How do you evaluate tool selection correctness in multi-tool agents? ★★★
Design cost-aware evaluation: how do you balance quality vs efficiency? ★★★
How would you calibrate an LLM-as-judge so trajectory scores correlate with human raters? ★★★