Evals: How to Evaluate Agents
Preface
Evaluating agents is messy. Traditional software is deterministic — same input, same output. Agents don’t work that way. They reason in loops, call tools, hit APIs, and sometimes just change their minds. Run the same query twice and you’ll often get two different answers.
That means you can’t lean on the old testing playbook. Unit tests and e2e tests give you predictability. Agent evals don’t.
The “three gulfs” framing and the analysis cycle are concepts I first learned from Hamel Husain and Shreya Shankar's course AI Evals for Engineers & PMs. This post is my applied notes — how I’ve been implementing those ideas on my own workloads and what’s been working in practice.

A mental model I’ve found useful is the idea of three gulfs. It’s not a formal taxonomy — just a way to break down where things usually fall apart. To make it concrete, let’s run with an example: you’re building a trip-planning agent.
The 3 Gulfs of Agent Evaluation
Gulf of Comprehension
Developer ↔ Data
First, do you even understand the inputs your agent will face? Some users write: “I’m going to Paris, what should I do on the weekend?” Others type: “Find me a warm place.” Both are “travel queries,” but they require very different reasoning downstream. Comprehension work is about mapping that input space: clustering real queries, spotting outliers, checking if your dataset matches reality. If you don’t do this, you’re flying blind.
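To make “mapping the input space” less abstract, here’s a minimal sketch of one way to do it: embed real user queries, cluster them, and eyeball the clusters against your eval dataset. The embedding model, the cluster count, and the sample queries here are arbitrary choices for illustration, not a prescription.

```python
# Sketch: map the input space by clustering real user queries.
# Model choice, cluster count, and the sample queries are illustrative.
from collections import defaultdict

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

queries = [
    "I'm going to Paris, what should I do on the weekend?",
    "Find me a warm place.",
    "Cheap trip to Barcelona in March",
    "Plan a 50th anniversary trip for retirees",
    # ... real production queries go here
]

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(queries)
labels = KMeans(n_clusters=3, random_state=0).fit_predict(embeddings)

clusters = defaultdict(list)
for query, label in zip(queries, labels):
    clusters[label].append(query)

# Eyeball each cluster: does your eval dataset cover all of them?
for label, members in clusters.items():
    print(f"cluster {label}: {members}")
```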
Gulf of Specification
Developer ↔ Agent
Next, can you turn fuzzy human intent into something the agent can act on? Take “cheap trip.” If you don’t define cheap (say, < $500 flight, < $100 hotel), the agent will happily make up its own definition. That’s a spec failure, not a model failure. Specification is all about clarity: tight prompts, clear tool interfaces, and shared definitions. If you don’t nail this, the agent’s behavior will drift.
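One way to close this gap is to bake the definition into the tool interface itself, so the agent never gets to improvise one. Here’s a minimal sketch: `BudgetConstraints` and `search_trips` are hypothetical names, and the thresholds just mirror the definition above.

```python
# Sketch: pin down "cheap" as explicit, typed constraints instead of vibes.
# BudgetConstraints and search_trips are hypothetical names for illustration.
from dataclasses import dataclass


@dataclass
class BudgetConstraints:
    max_flight_usd: float = 500.0   # "cheap" flight, per the spec
    max_hotel_usd: float = 100.0    # "cheap" hotel, per night


def search_trips(destination: str, budget: BudgetConstraints) -> list[dict]:
    """Tool the agent calls. The budget is part of the signature, so the
    agent can't silently substitute its own idea of 'cheap'."""
    # ... call your flight/hotel APIs here, filtering on budget ...
    return []


# State the same definition in the system prompt in plain language, e.g.
# "A trip is 'cheap' if the flight is under $500 and the hotel under $100/night."
```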
Gulf of Generalization
Agent ↔ Data
Finally, even if you’ve mapped inputs and nailed intent, can your agent handle the long tail?
Say you’ve already accounted for retirees as a user group. The intent is clear: “Plan a 50th anniversary trip for retirees.” And yet, the agent still suggests base jumping and nightclubs.
- Not comprehension — you knew retirees exist.
- Not specification — the prompt was explicit.
- This is generalization: the agent defaulted to youth-centric patterns instead of adapting.
This is the gulf that hurts the most, because it catches you by surprise in production — you thought you had it covered.
Why this matters
As builders, we end up wearing all three hats:
- Comprehension → mapping messy inputs.
- Specification → tightening intent and structure.
- Generalization → stress-testing behavior on the long tail.
The real work of evaluation is systematically bridging those gulfs. In the next section, I’ll share the approach I’ve been using over the last couple of months to do exactly that.
How it works
When I first learned about evals, I thought of them as unit tests. Coming from an engineering background, that was the closest analogy I had. But after a couple of months of actually living with them, I’ve started to see evals differently: they’re less about “tests that pass or fail” and more about a cycle of diagnosing errors, fixing them, and reassessing.
Here’s the rough loop:
- Analyze: collect failures from agent runs and break them down.
- Measure: quantify them with clear metrics.
- Improve: fix prompts, tool descriptions, or even the underlying code.
The key: don’t just run evals and call it a day. The real value is in understanding errors, fixing them, and then re-running to see if the changes hold up.
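To make the loop concrete, here’s what it can look like as a tiny harness: a set of binary evaluators (one per failure category), a pass rate per category, and a re-run after each fix. `Trace` and the evaluator signature are assumptions about your own logging setup, nothing prescribed.

```python
# Sketch: the analyze -> measure -> improve loop as a tiny harness.
# `Trace` and the evaluator signature are assumptions about your own logging.
from typing import Callable

Trace = dict  # one logged agent run: user request, tool calls, final answer, etc.


def pass_rate(traces: list[Trace], evaluator: Callable[[Trace], bool]) -> float:
    """Fraction of traces that pass a single binary evaluator."""
    results = [evaluator(trace) for trace in traces]
    return sum(results) / len(results) if results else 0.0


def run_evals(
    traces: list[Trace], evaluators: dict[str, Callable[[Trace], bool]]
) -> dict[str, float]:
    """One evaluator per failure category (categories come from the Analyze step)."""
    return {name: pass_rate(traces, fn) for name, fn in evaluators.items()}


# Usage: run_evals(traces, evaluators) before a fix, apply the fix,
# then run it again on the same traces (or a fresh batch) and compare.
```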
Analyze
This is where the gulf of comprehension shows up. The job is to systematically map out the space of failures. And it’s iterative — you usually need two passes or so before your clusters settle.
Practical process:
- Pull 50–100 traces from recent runs.
- Apply open coding: short labels that capture why the trace failed. Be specific. For multi-turn traces, label the first error that occurred.
  - Example 1: “Trip cost exceeded user’s stated budget by >2x.”
  - Example 2: “Destination misinterpreted: user asked for Barcelona, response gave Ibiza.”
- Move to axial coding: cluster open codes into categories.
  - Example 1 rolls into price-sensitivity failures.
  - Example 2 rolls into entity resolution errors.
- Repeat the process with another batch of traces to consolidate clusters and make sure they’re stable.
By the end, you’ve got a taxonomy of errors that actually reflects your agent’s behavior in the wild.
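Here’s a minimal sketch of the open → axial roll-up, assuming you’ve already labeled each failed trace with an open code. The specific codes and the mapping are illustrative, not a fixed taxonomy.

```python
# Sketch: roll open codes up into axial categories and count them.
from collections import Counter

# Open codes: one short, specific label per failed trace (first error only).
open_codes = [
    "Trip cost exceeded user's stated budget by >2x",
    "Destination misinterpreted: user asked for Barcelona, response gave Ibiza",
    "Trip cost exceeded user's stated budget by >2x",
    # ...
]

# Axial coding: map each open code to a broader category.
axial_map = {
    "Trip cost exceeded user's stated budget by >2x": "price-sensitivity failure",
    "Destination misinterpreted: user asked for Barcelona, response gave Ibiza": "entity resolution error",
}

category_counts = Counter(axial_map.get(code, "uncategorized") for code in open_codes)
print(category_counts.most_common())
# The counts tell you which categories are worth building evaluators for first.
```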
Measure
This is where generalization gets tested. Specification issues (bad prompts, unclear tool descriptions) usually don’t need evals — you just fix them. Generalization issues do: evaluators track whether the agent can consistently apply the specification across diverse inputs.
How to measure:
- Stick to binary metrics (success/fail).
- Two approaches, each sketched in code after this list:
  - Deterministic checks: write code to catch failures automatically.
    - Example: if the CRM shows the ticket is still open but the agent claimed it was resolved → fail.
  - LLM-as-a-judge: useful for axial categories that are harder to code.
    - Example: if the itinerary is not age-appropriate → fail.
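Both kinds of evaluator can stay tiny. Below is a minimal sketch of each, using the two examples above; the trace fields, the judge model, and the judge prompt are assumptions, not a fixed recipe.

```python
# Sketch: two binary evaluators, one deterministic, one LLM-as-a-judge.
# Trace fields, model name, and judge prompt are illustrative assumptions.
from openai import OpenAI

client = OpenAI()


def ticket_status_check(trace: dict) -> bool:
    """Deterministic: fail if the agent claimed resolution but the CRM says open."""
    claimed_resolved = "resolved" in trace["final_answer"].lower()
    actually_open = trace["crm_status"] == "open"
    return not (claimed_resolved and actually_open)


def age_appropriate_judge(trace: dict) -> bool:
    """LLM-as-a-judge: binary pass/fail for an axial category that's hard to code."""
    prompt = (
        "You are grading a travel itinerary.\n"
        f"User request: {trace['user_request']}\n"
        f"Itinerary: {trace['final_answer']}\n"
        "Is every activity age-appropriate for the users described in the request? "
        "Answer with exactly PASS or FAIL."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model works here
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")
```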
Improve
This part is straightforward: fix what you found, whether that’s the prompt, the tool descriptions, or the underlying code, then re-run the evaluators to see if the fix holds.
Takeaways
- Evals ≠ unit tests.
- They’re a cycle: analyze → measure → improve → repeat.
- Analysis fixes comprehension issues by mapping the input/failure space.
- Specification issues are fixed directly (prompting, tool descriptions).
- Evaluators track generalization issues — these are the ones you can’t just patch; you have to test them against diverse inputs.
- The value isn’t in the first run, but in the loop.
What’s next
In the next post, I’ll walk through how I take these evals and actually improve the agent.