Engineering Notes

Thoughts and Ideas on AI by Muthukrishnan

Agent Evaluation and Benchmarking for Measuring What Matters

19 Feb 2026

Agent evaluation is one of the hardest unsolved problems in the field and one of the most important. Without rigorous evaluation, you’re flying blind. This article covers the principles, metrics, benchmarks, and practical frameworks for measuring agent performance systematically.

Concept Introduction

Agent evaluation differs from a traditional ML benchmark because agents are not graded on a single right answer. They perform multi-step tasks, use tools, make judgment calls, and recover from mistakes. You need a richer framework: not just “did you get the right answer?” but “did you take reasonable steps, use resources efficiently, and handle surprises gracefully?”

Agent evaluation differs from standard model evaluation in several key ways:

The core challenge is that agent performance is a multi-dimensional surface, not a single number.

Metrics and Measurement

The Agent Evaluation Hierarchy

Agent performance decomposes into multiple layers, each capturing a different aspect of quality:

┌─────────────────────────────┐
│     Task Success Rate       │  ← Did the agent complete the goal?
├─────────────────────────────┤
│     Trajectory Quality      │  ← Was the path reasonable?
├─────────────────────────────┤
│     Efficiency Metrics      │  ← Cost, latency, tool calls
├─────────────────────────────┤
│     Safety & Reliability    │  ← Errors, hallucinations, harm
└─────────────────────────────┘

Key Metrics

Success metrics:

Trajectory metrics:

Cost metrics:

Safety metrics:

Design Patterns & Architectures

A reusable evaluation framework follows a standard architecture:

graph LR
    A[Task Suite] --> B[Agent Under Test]
    B --> C[Sandbox Environment]
    C --> D[Trajectory Logger]
    D --> E[Evaluator]
    E --> F[Metrics Report]
    F --> G[Comparison Dashboard]
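
The Trajectory Logger stage in the diagram above could look something like the following. This is a minimal sketch, not the article's implementation: the TrajectoryEvent schema, the event kinds ("llm_call", "tool_call"), and the JSONL export are all assumptions chosen for illustration.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class TrajectoryEvent:
    """One recorded step (hypothetical schema, assumed for this sketch)."""
    step: int
    kind: str                  # e.g. "llm_call", "tool_call", "error"
    payload: dict[str, Any]
    timestamp: float


@dataclass
class TrajectoryLogger:
    """Collects events in memory so an evaluator can score the whole path."""
    task_id: str
    events: list[TrajectoryEvent] = field(default_factory=list)

    def log(self, kind: str, **payload: Any) -> None:
        # Steps are numbered from 1 in the order they are logged.
        self.events.append(
            TrajectoryEvent(len(self.events) + 1, kind, payload, time.time())
        )

    def count(self, kind: str) -> int:
        # Number of events of one kind, e.g. how many tool calls were made.
        return sum(1 for e in self.events if e.kind == kind)

    def to_jsonl(self) -> str:
        # One JSON object per line, convenient for later offline evaluation.
        return "\n".join(
            json.dumps({"task_id": self.task_id, **asdict(e)}) for e in self.events
        )
```

The agent (or a thin wrapper around it) calls log() at each step, for example logger.log("tool_call", name="search"). Keeping the logger separate from the evaluator means the same recorded trajectories can be re-scored later with a different rubric.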
  

Key design decisions:

When ground truth is hard to define (open-ended tasks, creative output), use a separate LLM to evaluate agent output. This LLM-as-Judge approach works like this:

graph TD
    A[Agent Output] --> B[Judge LLM]
    C[Rubric / Criteria] --> B
    D[Reference Answer] --> B
    B --> E[Structured Score]
  

This pattern is powerful but introduces its own biases: judge LLMs tend to prefer verbose outputs and have position bias (favoring the first option presented).
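
One common way to reduce position bias in pairwise judging is to ask the judge twice with the candidates swapped and accept a verdict only if it survives the swap. The sketch below assumes a hypothetical judge callable that returns "first", "second", or "tie"; in practice it would wrap an LLM call, and the two toy judges exist only to show the behavior.

```python
from typing import Callable

# Hypothetical judge signature: (rubric, first, second) -> "first" | "second" | "tie".
PairwiseJudge = Callable[[str, str, str], str]


def debiased_compare(judge: PairwiseJudge, rubric: str, a: str, b: str) -> str:
    """Return "a", "b", or "tie" after asking the judge in both orders."""
    forward = judge(rubric, a, b)    # a is shown first
    backward = judge(rubric, b, a)   # b is shown first

    # Translate positional verdicts back to the candidates they refer to.
    forward_pick = {"first": "a", "second": "b"}.get(forward, "tie")
    backward_pick = {"first": "b", "second": "a"}.get(backward, "tie")

    # A verdict only counts if both orderings agree on the same candidate.
    return forward_pick if forward_pick == backward_pick else "tie"


def always_first(rubric: str, first: str, second: str) -> str:
    """Toy judge with pure position bias: it always prefers the first slot."""
    return "first"


def prefers_shorter(rubric: str, first: str, second: str) -> str:
    """Toy order-independent judge: it picks the shorter answer."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) < len(second) else "second"
```

With the purely position-biased judge, every comparison collapses to a tie, which is the desired outcome: the bias no longer decides the winner. This costs two judge calls per comparison and does nothing about verbosity bias, which needs its own control (for example, a rubric item that penalizes padding).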

Practical Application

A minimal evaluation harness centers on three pieces: an EvalTask dataclass (task ID, instruction, success validator), an EvalResult dataclass (pass/fail, step count, token usage, latency), and an AgentEvaluator class that loops over tasks, runs each one n times, and aggregates pass@k, mean steps, and mean tokens into a metrics dict. The raw Anthropic SDK is the best fit here. No orchestration framework is needed, because evaluation observes agent behavior from the outside rather than composing agents together. Data flows from a task list into repeated agent invocations; each trajectory is collected via callbacks and fed into the aggregation step. The evaluator’s _check_success method accepts either an exact-match string or a callable validator, which makes it easy to plug in test-suite runners or semantic similarity checks for open-ended tasks.
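
The article does not say how pass@k is computed from the n runs. A common choice is the unbiased estimator popularized by the HumanEval code-generation benchmark: given c successes in n runs, pass@k = 1 - C(n - c, k) / C(n, k). The function below is a sketch of that estimator; the name pass_at_k and its argument checks are assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k sampled runs succeeds) from c successes in n runs."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the ways to draw k runs that all failed.
    # It is 0 when fewer than k failures exist, which gives pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 this reduces to the plain pass rate c / n. With 5 runs and 2 successes, pass@1 is 0.4 and pass@3 is 0.9, which shows why a pass@k figure should always be reported together with its k.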

Try it

Using the raw Anthropic SDK, build an AgentEvaluator with EvalTask and EvalResult dataclasses.
EvalTask holds: task_id, instruction, optional expected_output string, optional check_fn callable, max_steps, timeout.
EvalResult holds: task_id, success bool, steps_taken, total_tokens, latency_seconds, trajectory list, error string.
AgentEvaluator.run_eval(tasks, n_runs=5) runs each task n times, collects trajectories via callbacks,
and returns a dict of pass_rate, mean_steps, mean_tokens, mean_latency per task_id.
Include inline comments explaining each aggregation step. Make the code runnable end-to-end.
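
For comparison, here is one possible shape of the harness the prompt describes. It is a sketch under stated assumptions: the agent is passed in as a plain callable (agent_fn, a hypothetical interface) so the evaluator stays independent of any SDK, the step and timeout budgets are checked after the run instead of interrupting it, and echo_agent is a stand-in that makes no model calls.

```python
import time
from dataclasses import dataclass, field
from statistics import mean
from typing import Any, Callable, Optional


@dataclass
class EvalTask:
    task_id: str
    instruction: str
    expected_output: Optional[str] = None
    check_fn: Optional[Callable[[str], bool]] = None
    max_steps: int = 10
    timeout: float = 60.0


@dataclass
class EvalResult:
    task_id: str
    success: bool
    steps_taken: int
    total_tokens: int
    latency_seconds: float
    trajectory: list[dict[str, Any]] = field(default_factory=list)
    error: Optional[str] = None


# Assumed agent interface: receives the task and a per-step callback,
# returns (final_output, total_tokens). A real agent would call the model here.
StepCallback = Callable[[dict[str, Any]], None]
AgentFn = Callable[[EvalTask, StepCallback], tuple[str, int]]


class AgentEvaluator:
    def __init__(self, agent_fn: AgentFn) -> None:
        self.agent_fn = agent_fn

    def _check_success(self, task: EvalTask, output: str) -> bool:
        # A callable validator takes priority over exact string match.
        if task.check_fn is not None:
            return bool(task.check_fn(output))
        if task.expected_output is not None:
            return output.strip() == task.expected_output.strip()
        return False  # without a validator a run cannot be marked as passed

    def _run_once(self, task: EvalTask) -> EvalResult:
        trajectory: list[dict[str, Any]] = []
        output, tokens, error = "", 0, None
        start = time.perf_counter()
        try:
            # The agent reports each step through the callback.
            output, tokens = self.agent_fn(task, trajectory.append)
        except Exception as exc:  # a crashed run counts as a failure
            error = repr(exc)
        latency = time.perf_counter() - start
        # Budgets are enforced after the fact in this sketch.
        if error is None and (len(trajectory) > task.max_steps or latency > task.timeout):
            error = "budget exceeded"
        success = error is None and self._check_success(task, output)
        return EvalResult(task.task_id, success, len(trajectory), tokens,
                          latency, trajectory, error)

    def run_eval(self, tasks: list[EvalTask], n_runs: int = 5) -> dict[str, dict[str, float]]:
        if n_runs < 1:
            raise ValueError("n_runs must be at least 1")
        report: dict[str, dict[str, float]] = {}
        for task in tasks:
            runs = [self._run_once(task) for _ in range(n_runs)]
            report[task.task_id] = {
                # Fraction of the n runs that passed the validator.
                "pass_rate": mean(1.0 if r.success else 0.0 for r in runs),
                # Averages include failed runs, so costly failures stay visible.
                "mean_steps": mean(r.steps_taken for r in runs),
                "mean_tokens": mean(r.total_tokens for r in runs),
                "mean_latency": mean(r.latency_seconds for r in runs),
            }
        return report


def echo_agent(task: EvalTask, on_step: StepCallback) -> tuple[str, int]:
    """Stand-in agent: logs two steps and returns the instruction upper-cased."""
    on_step({"type": "llm_call"})
    on_step({"type": "final"})
    return task.instruction.upper(), 42
```

Calling AgentEvaluator(echo_agent).run_eval([EvalTask("t1", "hi", expected_output="HI")]) yields a pass_rate of 1.0 for t1. To evaluate a real agent, replace echo_agent with a function that runs the model loop and calls on_step for every model or tool call.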

Latest Developments & Research

SWE-bench Evolution (2024–2025)

SWE-bench, introduced by Jimenez et al. (2024), has become the de facto standard for coding agent evaluation. Key developments:

GAIA and General-Purpose Evaluation (2024)

GAIA (Mialon et al., 2024) tests whether agents can answer questions that require real-world tool use (web browsing, file manipulation, calculation). Even top systems score under 75% on Level 1 questions, revealing how far agents are from robust general capability.

Emerging Directions

Open Problems

Cross-Disciplinary Insight

Agent evaluation has a deep parallel in psychometrics, the science of measuring human cognitive abilities. Key concepts transfer directly:

The lesson from a century of psychometrics: measurement is a science, not an afterthought. The same rigor should apply to agent evaluation.