
Transformer Math

Module 40 · Applications

🧪 Agent Evaluation

Both agents got the right answer — but one cost 6x more tokens

Evaluating a standalone LLM is hard enough. Evaluating agents is fundamentally harder — you are no longer scoring a single output, but an entire trajectory of decisions, tool calls, and recoveries. An agent that gets the right answer through luck is not the same as one that reasons its way there efficiently.

📋

Evaluating Agent Trajectories

What you're seeing: An agent eval pipeline has two distinct evaluation layers — trajectory (process) and outcome (result). The non-deterministic, multi-step nature of agents makes both layers necessary.

What to notice: The same task can yield different trajectories on different runs. Outcome eval alone misses the difference between Agent A (3 correct steps) and Agent B (7 chaotic steps that happen to finish correctly).

[Diagram: the agent loop: Task (“Find the bug”) → Think (reason) → Act (tool call) → Observe (tool result), repeated until done, then Result (final answer). Trajectory eval asks whether each step was reasonable (process: non-deterministic paths, partial credit, tool failures, side effects); outcome eval asks whether the final answer is right (result). Single-turn LLM eval scores one output with one score; multi-step agents are harder.]
🎮

Same Task, Two Agents — Spot the Difference

Both agents solve the same task correctly. But Agent B is far worse — can you see why?

Agent A — Score: 0.95

Step 1: search("weather API docs") → found OpenWeather API
Step 2: get_weather(city="Tokyo") → 22°C, sunny
Step 3: return "Tokyo: 22°C and sunny" ✓
3 steps · correct tool · 450 tokens · 1.2s

Agent B — Score: 0.31

Step 1: search("Tokyo") → Wikipedia article about Tokyo
Step 2: search("Tokyo weather today") → news articles
Step 3: search("current temperature Tokyo") → blog post
Step 4: get_weather(city="Tokyo, Japan") → error: invalid format
Step 5: get_weather(city="Tokyo") → 22°C, sunny
Step 6: search("is 22 celsius warm") → unnecessary
Step 7: return "Tokyo: 22°C and sunny" ✓
7 steps · wrong tools first · 2,800 tokens · 8.4s
✨ Insight · Both agents got the right answer — but Agent B used 6x more tokens, 7x more time, and picked the wrong tools 3 times. Outcome-only evaluation gives both 100%. Trajectory evaluation catches the difference.
💡

The Intuition

LLM eval asks: “Is this output correct?” One input, one output, one score. Agent eval asks: “Was this sequence of decisions good?” Multiple steps, tool interactions, branching paths, and non-deterministic outcomes.

Why agent eval is harder — four reasons:

  • Multi-step decisions — a 20-step trajectory has exponentially more failure modes than a single generation
  • Non-determinism — the same agent on the same task may take a different path every run (temperature, tool latency, environment state)
  • Tool interactions — agents have side effects: file writes, API calls, database mutations. A wrong tool call cannot be undone the way a wrong token can; the side effect has already happened
  • Process vs outcome — an agent that deletes the test file to make tests pass “succeeds” by outcome but fails by process

Three dimensions of agent quality:

  • Functional correctness — did the agent complete the task? (binary gate)
  • Process quality — were the intermediate steps reasonable, safe, and well-planned?
  • Efficiency — how many steps, tokens, and dollars did it cost?

Trajectory evaluation scores the full decision chain, not just the final answer. Record every tool call, file read, edit, and search. Compare against optimal (shortest successful) paths. Penalize backtracks, unnecessary calls, and dangerous actions.
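
A minimal sketch of that bookkeeping, assuming a simple per-step record; the `TrajectoryStep` fields and penalty weights are illustrative, not a standard:

python
from dataclasses import dataclass

@dataclass
class TrajectoryStep:
    tool: str                # tool invoked at this step, e.g. "search"
    ok: bool                 # did the call succeed?
    necessary: bool          # judged necessary (rubric or LLM-as-judge)
    dangerous: bool = False  # destructive side effect (file delete, DB write)

def score_process(steps: list[TrajectoryStep], optimal_steps: int) -> float:
    """Score the full decision chain: start from step efficiency,
    then penalize unnecessary calls, failures (backtracks), and
    dangerous actions. Penalty weights here are toy values."""
    efficiency = min(optimal_steps / max(len(steps), 1), 1.0)
    penalty = (0.05 * sum(not s.necessary for s in steps)   # wasted work
               + 0.05 * sum(not s.ok for s in steps)        # backtracks
               + 0.30 * sum(s.dangerous for s in steps))    # safety hits
    return max(efficiency - penalty, 0.0)

# Agent B above: 7 steps vs an optimal 3, four unnecessary calls, one
# failed call -> 3/7 - (4 + 1) * 0.05 ≈ 0.18 under these toy weights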

Regression testing for agents uses golden test sets (50–200 curated tasks), snapshot testing (diffing trajectories across runs), behavioral contracts (“must not exceed 3 API calls for simple lookups”), and flakiness tracking (per-task pass rate over time).
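
A sketch of two of those mechanisms, per-task flakiness tracking and one behavioral contract, assuming pass/fail results and tool-call lists are already being logged (all names here are illustrative):

python
from collections import defaultdict

# run_log[task_id] -> pass/fail results across repeated runs of the golden set
run_log: dict[str, list[bool]] = defaultdict(list)

def record_run(task_id: str, passed: bool) -> None:
    run_log[task_id].append(passed)

def flaky_tasks(min_runs: int = 5) -> dict[str, float]:
    """Tasks whose pass rate sits strictly between 0 and 1 are flaky:
    they fail intermittently with no code change in between."""
    flaky = {}
    for task, runs in run_log.items():
        if len(runs) >= min_runs:
            rate = sum(runs) / len(runs)
            if 0.0 < rate < 1.0:
                flaky[task] = rate
    return flaky

def check_contract(tool_calls: list[str], task_kind: str) -> bool:
    """Behavioral contract: simple lookups must not exceed 3 API calls."""
    return task_kind != "simple_lookup" or len(tool_calls) <= 3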

✨ Insight · An agent that gets the right answer in 47 steps is worse than one that fails in 3. Efficiency is not optional — it is a core quality signal.

Process Reward Models (PRMs) — rewarding the right steps. Outcome-only evaluation rewards any path to the correct answer, including dangerous shortcuts. PRMs assign a reward to every intermediate step, not just the final output. OpenAI's Let's Verify Step by Step (Lightman et al., 2023) shows that process supervision — training with per-step rewards — produces more reliable reasoning: on MATH competition problems, a PRM-guided best-of-N search reached 78.2% versus 49.6% for outcome-only selection — a 29-point gap using the same underlying model. The evaluation implication: if you only score final answers during eval, you cannot detect whether your agent reasons correctly or just gets lucky. PRM-style step scoring during evaluation (even with an LLM-as-judge per step) catches dangerous shortcuts before they compound across longer tasks.
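
A sketch of PRM-style best-of-N selection; the `judge` callable is a stand-in for a trained PRM head or a per-step LLM-as-judge, and the log-product aggregation follows Lightman et al.:

python
import math
from typing import Callable

def prm_score(steps: list[str], judge: Callable[[str], float]) -> float:
    """Aggregate per-step correctness probabilities. Lightman et al.
    score a solution as the product of its step probabilities, which
    is this sum in log space."""
    return sum(math.log(max(judge(step), 1e-9)) for step in steps)

def best_of_n(candidates: list[list[str]],
              judge: Callable[[str], float]) -> list[str]:
    """Pick the candidate whose *process* scores highest, not the one
    whose final answer merely looks plausible."""
    return max(candidates, key=lambda steps: prm_score(steps, judge))

# Toy stand-in judge so the sketch runs: rates explicit steps higher
toy_judge = lambda step: min(1.0, 0.5 + 0.01 * len(step))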

Quick check

Trade-off

An agent reaches the correct final answer via 15 wasted steps and one dangerous file deletion. Which eval catches the danger?

Quick check

An agent completes a task in 20 steps. The optimal path is 5 steps. It made 15 tool calls, 9 of which were correct. What is its efficiency score?

📐

Key Metrics

Task Completion Rate

The most basic metric — what fraction of tasks does the agent complete successfully?

Step Efficiency

Ratio of the optimal number of steps (from human solutions or the shortest successful trajectory) to actual steps taken. A score of 1.0 is perfect; lower means wasted work:

$$\text{efficiency} = \min\left(\frac{S_{\text{optimal}}}{S_{\text{actual}}},\ 1.0\right)$$

Tool Selection Accuracy

Of all tool calls the agent made, how many were correct choices for the given state? Penalizes both wrong tools and unnecessary calls:

$$\text{tool\_acc} = \frac{\text{correct tool calls}}{\text{total tool calls}}$$

Composite Trajectory Score

Weighted combination of the three dimensions. Weights are tuned per deployment — safety-critical agents weight process quality higher:

$$T = w_1 \cdot \text{completion} + w_2 \cdot \text{efficiency} + w_3 \cdot \text{tool\_acc}$$

💡 Tip · Common weights: $w_1 = 0.5$, $w_2 = 0.3$, $w_3 = 0.2$ for general agents. For safety-critical agents, increase $w_3$ (tool accuracy) significantly.

Cost-Normalized Score

Quality per dollar — critical for production deployment where budget is finite:

$$\text{cost\_normalized} = \frac{T}{\text{cost in USD}}$$

Python: Trajectory Evaluator

python
from dataclasses import dataclass

@dataclass
class TrajectoryResult:
    completed: bool
    actual_steps: int
    optimal_steps: int
    correct_tool_calls: int
    total_tool_calls: int
    tokens_used: int

def evaluate_trajectory(
    result: TrajectoryResult,
    w1: float = 0.5,  # completion weight
    w2: float = 0.3,  # efficiency weight
    w3: float = 0.2,  # tool accuracy weight
    cost_per_token: float = 3e-6,  # ~$3/M tokens
) -> dict:
    """Score a single agent trajectory."""
    completion = 1.0 if result.completed else 0.0
    efficiency = min(result.optimal_steps / max(result.actual_steps, 1), 1.0)
    tool_acc = result.correct_tool_calls / max(result.total_tool_calls, 1)

    # Composite trajectory score
    trajectory_score = w1 * completion + w2 * efficiency + w3 * tool_acc

    # Cost-normalized score (quality per dollar)
    cost = result.tokens_used * cost_per_token
    cost_normalized = trajectory_score / max(cost, 1e-9)

    return {
        "completion": completion,
        "efficiency": efficiency,
        "tool_accuracy": tool_acc,
        "trajectory_score": trajectory_score,
        "cost_usd": cost,
        "cost_normalized": cost_normalized,
    }

# Example: agent solves task in 12 steps (optimal: 5), 8/10 correct tools
result = TrajectoryResult(
    completed=True, actual_steps=12, optimal_steps=5,
    correct_tool_calls=8, total_tool_calls=10, tokens_used=15000,
)
scores = evaluate_trajectory(result)
# => trajectory_score=0.785, efficiency=0.417, cost_normalized=17.44

Quick check

Derivation

An agent completes a task (completion=1.0), efficiency=0.42, tool_acc=0.80. With weights w₁=0.5, w₂=0.3, w₃=0.2, what is its composite trajectory score T?

🔧

Break It — See What Happens

  • Only evaluate final output (ignore trajectory)
  • Use deterministic eval on non-deterministic agents
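
The second item is the classic trap: a single run cannot distinguish a solid pass from a lucky one. The standard fix is N repeated trials per task scored with the unbiased pass@k estimator introduced for code evals (Chen et al., 2021); a minimal version:

python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: the probability that at least one of k samples
    drawn from n recorded runs (c of which succeeded) passes.

    pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=4, k=3))  # 10 runs, 4 successes -> ~0.833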

Quick check

Trade-off

A team runs each eval task exactly once and uses the binary pass/fail as their regression signal. What is the primary failure mode?

📊

Real-World Numbers

| Benchmark | Representative Score | Human Baseline | What It Measures |
|---|---|---|---|
| SWE-bench Verified | ~50% (top agents) | ~78% | End-to-end software engineering — real GitHub issues, must produce passing patches |
| WebArena | 14.4% (GPT-4) | 78.2% | Web navigation — 812 tasks across 5 real websites (Reddit, GitLab, shopping) |
| AgentBench | varies | — | Generalization breadth — 8 environments (OS, DB, web, games, knowledge graph) |
| HumanEval (agent) | — | — | Code generation — agents with tool use (run tests, iterate) surpass single-shot LLMs |
| Chatbot Arena (LLM preference) | ELO ~1250+ | human pref | Head-to-head preference — users choose which agent response they prefer |
✨ Insight · The gap between agents and humans is largest on tasks requiring real-world grounding (WebArena: 14% vs 78%) and smallest on well-defined code tasks (HumanEval: agents with iteration actually beat single-shot human performance). The harder the environment, the more agent eval matters.

2024–2025 SOTA Update

WebArena SOTA has climbed well above GPT-4's 14.4% baseline as CUA systems (Anthropic Computer Use, OpenAI Operator) apply RL post-training on browser tasks; the human baseline remains 78.2%. On newer computer-use benchmarks (human baseline ~72%), evals now test perception accuracy and UI grounding, not just JSON-clean tool calls. A model that outputs the right function name but clicks the wrong pixel fails. This shifts eval from schema-conformance checks to visual grounding metrics: correct element identification rate, pixel-distance error, and task completion under dynamic UI state changes.
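
A sketch of the two simplest grounding metrics named above, assuming each step logs which UI element the agent acted on and the pixel coordinates of the click (this data format is an assumption):

python
import math

def element_id_rate(pred_ids: list[str], gold_ids: list[str]) -> float:
    """Fraction of steps where the agent acted on the correct UI element."""
    return sum(p == g for p, g in zip(pred_ids, gold_ids)) / len(gold_ids)

def mean_pixel_error(pred_xy: list[tuple[float, float]],
                     gold_xy: list[tuple[float, float]]) -> float:
    """Mean Euclidean distance between predicted clicks and target centers."""
    return sum(math.dist(p, g) for p, g in zip(pred_xy, gold_xy)) / len(gold_xy)

print(element_id_rate(["btn_submit", "nav_home"], ["btn_submit", "nav_back"]))  # 0.5
print(mean_pixel_error([(100, 40)], [(103, 44)]))  # 5.0 px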

🧠

Key Takeaways

What to remember for interviews

  1. Agent evaluation must score full trajectories, not just final answers — an agent that solves a task in 47 steps is worse than one that fails cleanly in 3.
  2. Non-determinism requires running N≥5 trials per task and reporting pass@k or mean±variance, never a single-run pass/fail.
  3. Trajectory score combines three dimensions: functional correctness (binary gate), process quality (step reasoning), and efficiency (tokens/steps/cost).
  4. Process Reward Models (PRMs) assign rewards to every intermediate step, catching flawed reasoning that happens to reach the correct answer — PRM-guided selection hit 78.2% on MATH vs 49.6% for outcome-only.
  5. Regression testing for agents needs golden test sets, snapshot trajectory diffing, behavioral contracts, and per-task flakiness tracking across model updates.

🧠

Recap quiz


Trade-off

An agent deletes the failing test file and all remaining tests pass. Which evaluation approach alone would catch this failure?

An agent deletes the failing test file and all remaining tests pass. Which evaluation approach alone would catch this failure?
Derivation

An agent solves a task in 20 steps; the optimal path is 5 steps. What is its step-efficiency score?

Trade-off

On MATH competition problems, PRM-guided best-of-N selection reached 78.2% accuracy. Outcome-reward-only selection hit 49.6%. What does this 29-point gap tell an evaluator?

Derivation

An agent is run 10 times on a task; it succeeds 4 times. What is the estimated pass@3?

Trade-off

SWE-bench Verified shows top agents at ~50% and humans at ~78%. Which explanation best accounts for the remaining gap?

Trade-off

A safety-critical agent (e.g., one that can write to prod databases) uses T = w₁·completion + w₂·efficiency + w₃·tool_acc. Which weight allocation is most appropriate?

Trade-off

GPT-4 agents score 14.4% on WebArena vs human 78.2% across 812 tasks. The gap is far larger than on SWE-bench (~50% vs ~78%). What best explains the difference in gap size?

Trade-off

A task passes 3 out of 5 agent runs with no code changes between runs. How should this be classified?

🎯

Interview Questions


Why is evaluating agents harder than evaluating LLMs?

★★☆
Google · Anthropic

Design a trajectory evaluation system for a code agent.

★★★
Google · OpenAI

How do you handle non-determinism in agent evaluation?

★★☆
Google · Anthropic

What is the difference between outcome-based and process-based evaluation?

★★☆
Anthropic · OpenAI

How would you build a regression testing pipeline for agents?

★★★
Google · Databricks

Compare SWE-bench, WebArena, and AgentBench — what does each measure?

★★☆
Google · Meta

How do you evaluate tool selection correctness in multi-tool agents?

★★★
Google · OpenAI

Design cost-aware evaluation: how do you balance quality vs efficiency?

★★★
Google · Anthropic

How would you calibrate an LLM-as-judge so trajectory scores correlate with human raters?

★★★
Anthropic · OpenAI · Google