Engineering Notes

Thoughts and Ideas on AI by Muthukrishnan

Agent Evaluation and Benchmarking for Measuring What Matters

19 Feb 2026

You built an AI agent. It works on your demos. But is it actually good? Can it handle real-world complexity? Will it break on edge cases? Agent evaluation is one of the hardest unsolved problems in the field — and one of the most important. Without rigorous evaluation, you’re flying blind. This article covers the principles, metrics, benchmarks, and practical frameworks for measuring agent performance systematically.

1. Concept Introduction

Simple Explanation

Think of agent evaluation like grading a student. A multiple-choice exam (traditional ML benchmarks) tests one narrow skill. But agents are more like interns — they perform multi-step tasks, use tools, make judgment calls, and recover from mistakes. You need a richer evaluation framework: not just “did you get the right answer?” but “did you take reasonable steps, use resources efficiently, and handle surprises gracefully?”

Technical Detail

Agent evaluation differs from standard model evaluation in several key ways: tasks span multi-step trajectories rather than single input-output pairs, runs are non-deterministic, the agent interacts with (and can modify) its environment, and success often has no single ground-truth answer.

The core challenge: agent performance is a multi-dimensional surface, not a single number.

2. Historical & Theoretical Context

Evaluation has always been the backbone of AI progress. The history follows a clear arc of increasing complexity: from static, single-answer datasets, to broad LLM knowledge benchmarks, to interactive agent benchmarks that score entire trajectories.

The shift to agent evaluation reflects Goodhart’s Law in action: once static benchmarks became optimization targets, models saturated them without the underlying capabilities generalizing, so the field needed harder, more realistic evaluations. Agent benchmarks aim for ecological validity — measuring performance in conditions that resemble real use.

This connects to a deep idea from measurement theory: the act of measuring shapes what gets optimized. Choose the wrong metric, and you’ll build the wrong agent.

3. Metrics and Measurement

The Agent Evaluation Hierarchy

Agent performance decomposes into multiple layers, each capturing a different aspect of quality:

┌─────────────────────────────┐
│     Task Success Rate       │  ← Did the agent complete the goal?
├─────────────────────────────┤
│     Trajectory Quality      │  ← Was the path reasonable?
├─────────────────────────────┤
│     Efficiency Metrics      │  ← Cost, latency, tool calls
├─────────────────────────────┤
│     Safety & Reliability    │  ← Errors, hallucinations, harm
└─────────────────────────────┘

Key Metrics

Success metrics: pass rate over repeated runs, pass@k (did at least one of k attempts succeed?), and run-to-run consistency. A minimal sketch of the pass@k estimator appears just after these definitions.

Trajectory metrics: number of steps taken, redundant or invalid tool calls, and whether the agent recovered from errors along the way.

Cost metrics: total tokens consumed, wall-clock latency, and tool calls per task, all of which translate directly into dollar cost.

Safety metrics: unhandled errors, hallucinated tool calls or facts, and harmful or irreversible actions taken in the environment.
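
To make the success metric concrete, here is a minimal sketch of the standard unbiased pass@k estimator (the combinatorial form popularized by code-generation evaluations); the function name and example numbers are illustrative, not part of any particular framework.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n runs of a task with c successes.

    pass@k = 1 - C(n - c, k) / C(n, k): the chance that at least one of k
    attempts drawn from the n observed runs was a success.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=5, c=2, k=1))  # 0.4, the plain pass rate
print(pass_at_k(n=5, c=2, k=3))  # 0.9, at least one success in 3 attempts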

4. Design Patterns & Architectures

Pattern: The Evaluation Harness

A reusable evaluation framework follows a standard architecture:

graph LR
    A[Task Suite] --> B[Agent Under Test]
    B --> C[Sandbox Environment]
    C --> D[Trajectory Logger]
    D --> E[Evaluator]
    E --> F[Metrics Report]
    F --> G[Comparison Dashboard]
  

Key design decisions:

  - Isolation: give every run a fresh sandbox so one attempt cannot contaminate the next (see the sketch after this list)
  - Repetition: run each task several times, because agent behavior is non-deterministic
  - Logging: capture the full trajectory (tool calls, tokens, intermediate outputs), not just the final answer
  - Scoring: prefer programmatic checks where the outcome is verifiable; fall back to LLM-as-Judge for open-ended output
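
As a concrete illustration of the isolation decision, here is a minimal sketch that gives each attempt its own throwaway working directory. The fixture copy and the workdir argument are assumptions about how your agent receives its environment; adapt them to your setup.

import shutil
import tempfile
from pathlib import Path

def run_in_sandbox(agent_factory, task, fixture_dir: Path):
    """Run one evaluation attempt inside a disposable working directory."""
    # TemporaryDirectory deletes everything on exit, so side effects from one
    # run can never leak into the next.
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp) / "workspace"
        shutil.copytree(fixture_dir, workdir)  # fresh copy of the task's fixture files
        agent = agent_factory()
        # `workdir` is a hypothetical parameter; point the agent at its
        # environment however your implementation expects.
        return agent.run(task.instruction, workdir=workdir)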

Pattern: LLM-as-Judge

When ground truth is hard to define (open-ended tasks, creative output), use a separate LLM to evaluate agent output:

graph TD
    A[Agent Output] --> B[Judge LLM]
    C[Rubric / Criteria] --> B
    D[Reference Answer] --> B
    B --> E[Structured Score]
  

This pattern is powerful but introduces its own biases — judge LLMs tend to prefer verbose outputs and have position bias (favoring the first option presented).
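
A minimal sketch of the pattern, assuming a generic call_llm(prompt) helper that returns the judge model’s text; the helper and the rubric wording are illustrative, not tied to any particular framework.

import json

RUBRIC = """Score the agent's answer from 1 to 5 on each criterion:
- completeness: does it cover everything the task asked for?
- accuracy: is every claim consistent with the reference answer?
- conciseness: is it free of padding and repetition?
Return ONLY a JSON object such as {"completeness": 3, "accuracy": 4, "conciseness": 5}."""

def judge(agent_output: str, reference: str, call_llm) -> dict:
    """Ask a separate judge LLM to grade one output against the rubric."""
    prompt = (
        f"{RUBRIC}\n\nReference answer:\n{reference}\n\n"
        f"Agent answer:\n{agent_output}"
    )
    raw = call_llm(prompt)   # any chat-completion client works here
    return json.loads(raw)   # structured score, e.g. {"completeness": 4, ...}

To blunt the biases noted above, keep the rubric criteria explicit, cap answer length in the prompt, and, when comparing two outputs head-to-head, score each pair twice with the order swapped and average the results.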

5. Practical Application

Here’s a practical evaluation framework you can use today:

import json
import time
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalTask:
    task_id: str
    instruction: str
    expected_output: str | None = None
    check_fn: Callable | None = None  # Custom validator
    max_steps: int = 20
    timeout_seconds: float = 120.0

@dataclass
class EvalResult:
    task_id: str
    success: bool
    steps_taken: int
    total_tokens: int
    latency_seconds: float
    trajectory: list = field(default_factory=list)
    error: str | None = None

class AgentEvaluator:
    def __init__(self, agent_factory: Callable):
        self.agent_factory = agent_factory

    def run_eval(self, tasks: list[EvalTask], n_runs: int = 5) -> dict:
        all_results = []

        for task in tasks:
            task_results = []
            for run_idx in range(n_runs):
                result = self._run_single(task, run_idx)
                task_results.append(result)
            all_results.append((task.task_id, task_results))

        return self._compute_metrics(all_results)

    def _run_single(self, task: EvalTask, run_idx: int) -> EvalResult:
        agent = self.agent_factory()
        trajectory = []
        start = time.time()

        error = None
        try:
            response = agent.run(
                task.instruction,
                max_steps=task.max_steps,
                callbacks=[lambda event: trajectory.append(event)],
            )
            success = self._check_success(task, response)
        except Exception as exc:
            # The exception variable is cleared after the except block in
            # Python 3, so capture the message here rather than afterwards.
            success = False
            error = str(exc)

        return EvalResult(
            task_id=task.task_id,
            success=success,
            steps_taken=len(trajectory),
            total_tokens=sum(t.get("tokens", 0) for t in trajectory),
            latency_seconds=time.time() - start,
            trajectory=trajectory,
            error=error,
        )

    def _check_success(self, task: EvalTask, response) -> bool:
        if task.check_fn:
            return task.check_fn(response)
        if task.expected_output:
            return task.expected_output.strip() == str(response).strip()
        return False

    def _compute_metrics(self, all_results) -> dict:
        metrics = {}
        for task_id, results in all_results:
            successes = sum(1 for r in results if r.success)
            n = len(results)
            metrics[task_id] = {
                "pass_rate": successes / n,
                "mean_steps": sum(r.steps_taken for r in results) / n,
                "mean_tokens": sum(r.total_tokens for r in results) / n,
                "mean_latency": sum(r.latency_seconds for r in results) / n,
                "all_runs": [r.__dict__ for r in results],
            }
        return metrics

Usage with a task suite:

tasks = [
    EvalTask(
        task_id="file_search",
        instruction="Find all Python files containing 'TODO' and list them.",
        check_fn=lambda r: "utils.py" in r and "main.py" in r,
    ),
    EvalTask(
        task_id="bug_fix",
        instruction="Fix the off-by-one error in sort_items().",
        check_fn=lambda r: run_test_suite("test_sort.py"),
    ),
]

evaluator = AgentEvaluator(agent_factory=create_my_agent)
results = evaluator.run_eval(tasks, n_runs=5)

for task_id, m in results.items():
    print(f"{task_id}: pass@5={m['pass_rate']:.0%}, "
          f"avg_steps={m['mean_steps']:.1f}, "
          f"avg_tokens={m['mean_tokens']:.0f}")

6. Comparisons & Tradeoffs

Major Agent Benchmarks

Benchmark | Domain | Tasks | Metric | Strength | Limitation
--- | --- | --- | --- | --- | ---
SWE-bench | Software engineering | 2,294 GitHub issues | % resolved | Real-world tasks | Only Python repos
GAIA | General assistant | 466 questions | Exact match accuracy | Diverse, hard | Small task count
AgentBench | Multi-domain | 8 environments | Domain-specific | Broad coverage | Complex setup
WebArena | Web navigation | 812 tasks | Task success rate | Realistic web tasks | Brittle to UI changes
SWE-bench Verified | Software engineering | 500 human-verified | % resolved | High-quality labels | Smaller subset
τ-bench | Customer service | Tool-use tasks | Success rate | Tests tool reliability | Narrow domain

Evaluation Method Tradeoffs

Method | Cost | Reproducibility | Best for
--- | --- | --- | ---
Programmatic checks (exact match, test suites, check_fn) | Low | High | Tasks with verifiable outcomes
LLM-as-Judge | Medium | Medium (judge bias, prompt sensitivity) | Open-ended or creative outputs
Human review | High | Low (rater variance) | Calibrating rubrics and spot-checking judges

7. Latest Developments & Research

SWE-bench Evolution (2024–2025)

SWE-bench, introduced by Jimenez et al. (2024), has become the de facto standard for coding agent evaluation. Key developments include SWE-bench Verified, a 500-task subset with human-verified problem statements and tests (see the table above), and SWE-bench Lite, a smaller slice intended for cheaper, faster iteration. Resolution rates that began in the single digits have climbed steeply as agent scaffolding matured, which has intensified scrutiny of contamination and of what the benchmark actually measures.

GAIA and General-Purpose Evaluation (2024)

GAIA (Mialon et al., 2024) tests whether agents can answer questions that require real-world tool use — web browsing, file manipulation, calculation. Even top systems score under 75% on Level 1 (simplest) questions, revealing how far agents are from robust general capability.

Emerging Directions

  - Trajectory-level evaluation that scores the path the agent took, not just the final outcome
  - Domain-specific suites such as τ-bench that stress tool reliability under realistic constraints
  - More careful use of LLM-as-Judge, with explicit rubrics and controls for verbosity and position bias

Open Problems

  - Benchmark contamination: tasks, fixes, and solutions leak into training data and inflate scores
  - Cost and reproducibility: non-deterministic agents need many runs before numbers are trustworthy
  - Goodhart effects: optimizing for a leaderboard eventually stops improving the capability it was meant to measure
  - Ecological validity: even realistic task suites lag behind the messiness of production deployments

8. Cross-Disciplinary Insight

Agent evaluation has a deep parallel in psychometrics — the science of measuring human cognitive abilities. Key concepts transfer directly: reliability (does the same agent get the same score on repeated runs?), validity (does the benchmark measure the capability you actually care about?), and standardization (are conditions controlled enough that scores are comparable across agents?).

The lesson from a century of psychometrics: measurement is a science, not an afterthought. The same rigor should apply to agent evaluation.
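
Reliability has a direct computational analogue: report an interval, not a point estimate, for every pass rate. Below is a minimal sketch using the Wilson score interval; the choice of interval and the function name are illustrative, not prescribed by any benchmark.

from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate over n runs."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# Four passes out of five runs looks like 80%, but the interval is wide:
print(wilson_interval(4, 5))  # roughly (0.38, 0.96)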

9. Daily Challenge

Exercise: Build a Mini Agent Benchmark

Create a small evaluation suite (3–5 tasks) for a tool-using agent. Each task should:

  1. Have a clear, automatically verifiable success condition
  2. Require at least 2 tool calls to solve
  3. Include one task where the agent must recover from a tool error

Implement it using the AgentEvaluator pattern above, and measure: pass rate per task over at least 5 runs, mean steps, tokens, and latency per run, and how often the error-recovery task succeeds compared with the straightforward ones.

Stretch goal: Add an LLM-as-Judge evaluator for one open-ended task (e.g., “Summarize this document”) with a rubric covering completeness, accuracy, and conciseness.

10. References & Further Reading

Papers

  - Jimenez, C. E., et al. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024.
  - Mialon, G., et al. "GAIA: A Benchmark for General AI Assistants." ICLR 2024.

Tools & Frameworks

Blog Posts & Resources


Key Takeaways

  1. One number is never enough: Evaluate success rate, efficiency, cost, and safety together
  2. Run multiple times: Agent non-determinism demands statistical evaluation with confidence intervals
  3. Sandbox everything: Agents modify their environment — isolate evaluation runs completely
  4. Match your benchmark to your use case: SWE-bench scores don’t predict customer service performance
  5. Evaluate trajectories, not just outcomes: A correct answer reached through unsafe actions is still a failure
  6. Beware Goodhart’s Law: Any metric you optimize will eventually stop measuring what you care about
  7. Start small: A custom 10-task eval suite for your specific domain beats a generic benchmark every time