🧪 Agent Evaluation
Both agents got the right answer — but one cost 6x more tokens
Evaluating a standalone LLM is hard enough. Evaluating agents is fundamentally harder — you are no longer scoring a single output, but an entire trajectory of decisions, tool calls, and recoveries. An agent that gets the right answer through luck is not the same as one that reasons its way there efficiently.
Evaluating Agent Trajectories
What you're seeing: An agent eval pipeline has two distinct evaluation layers — trajectory (process) and outcome (result). The non-deterministic, multi-step nature of agents makes both layers necessary.
What to notice: The same task can yield different trajectories on different runs. Outcome eval alone misses the difference between Agent A (3 correct steps) and Agent B (7 chaotic steps that happen to finish correctly).
Same Task, Two Agents — Spot the Difference
Both agents solve the same task correctly. But Agent B is far worse — can you see why?
Agent A — Score: 0.95
Agent B — Score: 0.31
The Intuition
LLM eval asks: “Is this output correct?” One input, one output, one score. Agent eval asks: “Was this sequence of decisions good?” Multiple steps, tool interactions, branching paths, and non-deterministic outcomes.
Why agent eval is harder — four reasons:
- Multi-step decisions — a 20-step trajectory has exponentially more failure modes than a single generation
- Non-determinism — the same agent on the same task may take a different path every run (temperature, tool latency, environment state); see the pass@k sketch after this list
- Tool interactions — agents have side effects: file writes, API calls, database mutations. A wrong tool call cannot be unscored like a wrong token
- Process vs outcome — an agent that deletes the test file to make tests pass “succeeds” by outcome but fails by process
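Non-determinism is why pass rates must be estimated, not observed once. A minimal sketch of the standard unbiased pass@k estimator (the combinatorial form popularized by HumanEval-style sampling), assuming each task has been run n times:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: given n total runs with c successes, estimate the
    probability that at least one of k fresh runs would succeed."""
    if n - c < k:
        return 1.0  # fewer than k failures: every k-sample contains a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 runs, 4 successes -> pass@3
print(round(pass_at_k(n=10, c=4, k=3), 3))  # 0.833

With 10 runs and 4 successes, pass@3 ≈ 0.83; a single run would have reported either extreme.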
Three dimensions of agent quality:
- Functional correctness — did the agent complete the task? (binary gate)
- Process quality — were the intermediate steps reasonable, safe, and well-planned?
- Efficiency — how many steps, tokens, and dollars did it cost?
Trajectory evaluation scores the full decision chain, not just the final answer. Record every tool call, file read, edit, and search. Compare against optimal (shortest successful) paths. Penalize backtracks, unnecessary calls, and dangerous actions.
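As a sketch of that penalty structure (the Action fields, the danger flag, and the 0.05 per-redundancy penalty are illustrative assumptions, not a standard):

from dataclasses import dataclass

@dataclass
class Action:
    tool: str                # e.g. "read_file", "run_tests"
    target: str              # file path, URL, query, ...
    dangerous: bool = False  # flagged upstream by a denylist or policy check

def process_score(actions: list[Action], optimal_steps: int) -> float:
    """Illustrative penalty-based process score in [0, 1]."""
    if any(a.dangerous for a in actions):
        return 0.0  # hard gate: one dangerous action zeroes process quality
    seen: set[tuple[str, str]] = set()
    redundant = 0
    for a in actions:
        key = (a.tool, a.target)
        if key in seen:
            redundant += 1  # repeated identical call counts as wasted work
        seen.add(key)
    length_ratio = min(optimal_steps / max(len(actions), 1), 1.0)
    return max(0.0, length_ratio - 0.05 * redundant)

traj = [
    Action("read_file", "a.py"),
    Action("read_file", "a.py"),   # backtrack: identical re-read
    Action("edit_file", "a.py"),
    Action("run_tests", "suite"),
]
print(process_score(traj, optimal_steps=3))  # 3/4 minus one redundancy penalty = 0.70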
Regression testing for agents uses golden test sets (50-200 curated tasks), snapshot testing (diff trajectories across runs), behavioral contracts (“must not exceed 3 API calls for simple lookups”), and flakiness tracking (per-task pass rate over time).
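Two of those four pieces fit in a few lines. A sketch of a behavioral-contract check and per-task flakiness tracking, with an invented contract table and threshold:

from collections import defaultdict

def check_contract(task_type: str, api_calls: int) -> bool:
    """Behavioral contract: simple lookups must not exceed 3 API calls.
    The limits table here is an illustrative assumption."""
    limits = {"simple_lookup": 3}
    return api_calls <= limits.get(task_type, float("inf"))

# Flakiness tracking: per-task pass history over repeated runs
runs: dict[str, list[bool]] = defaultdict(list)

def record(task_id: str, passed: bool) -> None:
    runs[task_id].append(passed)

def flaky_tasks(threshold: float = 0.2) -> list[str]:
    """Tasks whose pass rate sits strictly between threshold and
    1 - threshold are neither reliably passing nor reliably failing."""
    flaky = []
    for task_id, results in runs.items():
        rate = sum(results) / len(results)
        if threshold < rate < 1 - threshold:
            flaky.append(task_id)
    return flaky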
Process Reward Models (PRMs) — rewarding the right steps. Outcome-only evaluation rewards any path to the correct answer, including dangerous shortcuts. PRMs assign a reward to every intermediate step, not just the final output. OpenAI's Let's Verify Step by Step (Lightman et al., 2023) shows that process supervision — training with per-step rewards — produces more reliable reasoning: on MATH competition problems, a PRM-guided best-of-N search reached 78.2% versus 49.6% for outcome-only selection — a 29-point gap using the same underlying model. The evaluation implication: if you only score final answers during eval, you cannot detect whether your agent reasons correctly or just gets lucky. PRM-style step scoring during evaluation (even with an LLM-as-judge per step) catches dangerous shortcuts before they compound across longer tasks.
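A sketch of what PRM-style step scoring can look like at eval time; judge_step is a placeholder interface for your own LLM-judge call, not a real API:

def judge_step(task: str, history: list[str], step: str) -> float:
    """Placeholder: ask an LLM judge to rate one reasoning/tool step in
    [0, 1] given the task and the trajectory so far. Wire in your own
    model call here; this interface is an assumption."""
    raise NotImplementedError

def prm_style_eval(task: str, steps: list[str], min_ok: float = 0.3) -> dict:
    """Score every intermediate step, not just the final answer. One
    low-scoring step (e.g. 'delete the failing test') fails the
    trajectory even when the outcome looks correct."""
    scores = [judge_step(task, steps[:i], s) for i, s in enumerate(steps)]
    return {
        "step_scores": scores,
        "weakest_step": min(scores, default=0.0),
        "process_pass": bool(scores) and min(scores) >= min_ok,
    }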
Quick check
An agent reaches the correct final answer via 15 wasted steps and one dangerous file deletion. Which eval catches the danger?
An agent completes a task in 20 steps. The optimal path is 5 steps. It made 15 tool calls, 9 of which were correct. What is its efficiency score?
Key Metrics
Task Completion Rate
The most basic metric — what fraction of tasks does the agent complete successfully?
completion_rate = tasks_completed / total_tasks
Step Efficiency
Ratio of the optimal number of steps (from human solutions or shortest successful trajectory) to actual steps taken. A score of 1.0 is perfect; lower means wasted work:
efficiency = min(optimal_steps / actual_steps, 1.0)
Tool Selection Accuracy
Of all tool calls the agent made, how many were correct choices for the given state? Penalizes both wrong tools and unnecessary calls:
tool_acc = correct_tool_calls / total_tool_calls
Composite Trajectory Score
Weighted combination of the three dimensions. Weights are tuned per deployment — safety-critical agents weight process quality higher:
T = w₁·completion + w₂·efficiency + w₃·tool_acc
Cost-Normalized Score
Quality per dollar — critical for production deployment where budget is finite:
cost_normalized = T / cost_usd
Python: Trajectory Evaluator
from dataclasses import dataclass

@dataclass
class TrajectoryResult:
    completed: bool
    actual_steps: int
    optimal_steps: int
    correct_tool_calls: int
    total_tool_calls: int
    tokens_used: int

def evaluate_trajectory(
    result: TrajectoryResult,
    w1: float = 0.5,   # completion weight
    w2: float = 0.3,   # efficiency weight
    w3: float = 0.2,   # tool accuracy weight
    cost_per_token: float = 3e-6,  # ~$3/M tokens
) -> dict:
    """Score a single agent trajectory."""
    completion = 1.0 if result.completed else 0.0
    efficiency = min(result.optimal_steps / max(result.actual_steps, 1), 1.0)
    tool_acc = result.correct_tool_calls / max(result.total_tool_calls, 1)

    # Composite trajectory score
    trajectory_score = w1 * completion + w2 * efficiency + w3 * tool_acc

    # Cost-normalized score (quality per dollar)
    cost = result.tokens_used * cost_per_token
    cost_normalized = trajectory_score / max(cost, 1e-9)

    return {
        "completion": completion,
        "efficiency": efficiency,
        "tool_accuracy": tool_acc,
        "trajectory_score": trajectory_score,
        "cost_usd": cost,
        "cost_normalized": cost_normalized,
    }

# Example: agent solves task in 12 steps (optimal: 5), 8/10 correct tools
result = TrajectoryResult(
    completed=True, actual_steps=12, optimal_steps=5,
    correct_tool_calls=8, total_tool_calls=10, tokens_used=15000,
)
scores = evaluate_trajectory(result)
# => trajectory_score=0.785, efficiency=0.417, cost_normalized=17.44
Quick check
An agent completes a task (completion=1.0), efficiency=0.42, tool_acc=0.80. With weights w₁=0.5, w₂=0.3, w₃=0.2, what is its composite trajectory score T?
Quick check
A team runs each eval task exactly once and uses the binary pass/fail as their regression signal. What is the primary failure mode?
Real-World Numbers
| Benchmark | Representative Score | Human Baseline | What It Measures |
|---|---|---|---|
| SWE-bench Verified | ~50% (top agents) | ~78% | End-to-end software engineering — real GitHub issues, must produce passing patches |
| WebArena | 14.4% (GPT-4) | 78.2% | Web navigation — 812 tasks across 5 real websites (Reddit, GitLab, shopping) |
| AgentBench | varies | — | Generalization breadth — 8 environments (OS, DB, web, games, knowledge graph) |
| HumanEval (agent) | — | — | Code generation — agents with tool use (run tests, iterate) surpass single-shot LLMs |
| Chatbot Arena (LLM preference) | ELO ~1250+ | human pref | Head-to-head preference — users choose which agent response they prefer |
2024–2025 SOTA Update
WebArena scores have climbed from GPT-4's 14.4% baseline as CUA systems (Anthropic Computer Use, OpenAI Operator) apply RL post-training on browser tasks; the human baseline remains 78.2%. On OSWorld, top models score ~15% vs human 72% — computer-use evals now test perception accuracy and UI grounding, not just JSON-clean tool calls. A model that outputs the right function name but clicks the wrong pixel fails. This shifts eval from schema-conformance checks to visual grounding metrics: correct element identification rate, pixel-distance error, and task-completion under dynamic UI state changes.
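Those grounding metrics reduce to simple geometry once predicted click points are logged against ground-truth element boxes. A sketch with illustrative function names:

import math

def pixel_distance_error(pred_xy: tuple[float, float],
                         true_xy: tuple[float, float]) -> float:
    """Euclidean distance (pixels) between the predicted click point and
    the center of the ground-truth UI element."""
    return math.dist(pred_xy, true_xy)

def inside_bbox(pred_xy: tuple[float, float],
                bbox: tuple[float, float, float, float]) -> bool:
    """bbox = (x_min, y_min, x_max, y_max) of the target element."""
    x, y = pred_xy
    x0, y0, x1, y1 = bbox
    return x0 <= x <= x1 and y0 <= y <= y1

def element_identification_rate(hits: list[bool]) -> float:
    """Fraction of actions whose click landed inside the correct element."""
    return sum(hits) / max(len(hits), 1)

# Right function name, wrong pixel: the click misses the button's box
print(inside_bbox((412, 305), (420, 290, 480, 320)))  # False -> counts as a miss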
Key Takeaways
What to remember for interviews
1. Agent evaluation must score full trajectories, not just final answers — an agent that solves a task in 47 steps can be worse than one that fails cleanly in 3.
2. Non-determinism requires running N≥5 trials per task and reporting pass@k or mean±variance, never a single-run pass/fail.
3. Trajectory score combines three dimensions: functional correctness (binary gate), process quality (step reasoning), and efficiency (tokens/steps/cost).
4. Process Reward Models (PRMs) assign rewards to every intermediate step, catching flawed reasoning that happens to reach the correct answer — PRM-guided selection hit 78.2% on MATH vs 49.6% for outcome-only.
5. Regression testing for agents needs golden test sets, snapshot trajectory diffing, behavioral contracts, and per-task flakiness tracking across model updates.
Further Reading
- AgentBench: Evaluating LLMs as Agents — Liu et al. 2023 — multi-dimensional benchmark for evaluating LLM agent capabilities
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues? — Jimenez et al. 2023 — benchmark for evaluating code agents on real software engineering tasks
- WebArena: A Realistic Web Environment for Building Autonomous Agents — Zhou et al. 2023 — realistic web-based benchmark for autonomous agent evaluation
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments — Xie et al. 2024 — agents operate on a real desktop OS with screenshots; top models score ~15% vs human 72%
- Tau-bench: A Benchmark for Tool-Agent-User Interaction in Real World Domains — Yao et al. 2024 — measures whether agents correctly follow policies, handle user pushback, and maintain consistency across multi-turn tool interactions
- DeepEval — Open Source LLM Evaluation Framework — Python framework with 14+ metrics (faithfulness, hallucination, tool correctness), trajectory evaluation support, and CI integration
- Hamel Husain — Your AI Product Needs Evals — Practical guide to building eval pipelines for agent products — golden datasets, LLM judges, regression gates, and CI integration
Recap quiz
Agent Eval recap
An agent deletes the failing test file and all remaining tests pass. Which evaluation approach alone would catch this failure?
An agent solves a task in 20 steps; the optimal path is 5 steps. What is its step-efficiency score?
On MATH competition problems, PRM-guided best-of-N selection reached 78.2% accuracy. Outcome-reward-only selection hit 49.6%. What does this 29-point gap tell an evaluator?
An agent is run 10 times on a task; it succeeds 4 times. What is the estimated pass@3?
SWE-bench Verified shows top agents at ~50% and humans at ~78%. Which explanation best accounts for the remaining gap?
A safety-critical agent (e.g., one that can write to prod databases) uses T = w₁·completion + w₂·efficiency + w₃·tool_acc. Which weight allocation is most appropriate?
GPT-4 agents score 14.4% on WebArena vs human 78.2% across 812 tasks. The gap is far larger than on SWE-bench (~50% vs ~78%). What best explains the difference in gap size?
A task passes 3 out of 5 agent runs with no code changes between runs. How should this be classified?
Interview Questions
Why is evaluating agents harder than evaluating LLMs? ★★☆
Design a trajectory evaluation system for a code agent. ★★★
How do you handle non-determinism in agent evaluation? ★★☆
What is the difference between outcome-based and process-based evaluation? ★★☆
How would you build a regression testing pipeline for agents? ★★★
Compare SWE-bench, WebArena, and AgentBench — what does each measure? ★★☆
How do you evaluate tool selection correctness in multi-tool agents? ★★★
Design cost-aware evaluation: how do you balance quality vs efficiency? ★★★
How would you calibrate an LLM-as-judge so trajectory scores correlate with human raters? ★★★