Agentic Evals

Trajectory vs outcome, tool-call evaluation, reward hacking, pass^k. Also available as a free PDF download.

An agent run is a sequence of thoughts, tool calls, observations, and responses, not a single output. Scoring only the final output is a category error: a wrong answer can come from a perfect plan whose final tool returned a 503, and a right answer can come from a broken plan that reached the gold answer by luck. Pass@1 hides catastrophic inconsistency. Reward hacking is empirically common, not theoretical. These eight chapters say what to score, how to score it, and which standard benchmarks to use as the floor versus the ceiling for your own private regression set.

The opinionated frame: report a vector, not a scalar. Outcome success, sub-goal completion, pass^k at the k that matches your deployment, cost per task, recovery rate under tool failure, and policy-violation rate are the minimum honest description of an agent. Anything that collapses to one number is hiding at least one of those, and the buyer of the eval needs to know which. Agent trajectory evaluation is the condensed operational reference for this section.
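As a rough illustration of "a vector, not a scalar", the six quantities listed above could be carried around as one record rather than averaged into a single score. This is a minimal sketch, not a prescribed schema: the class name `AgentEvalReport`, every field name, and the choice of US dollars for cost are assumptions made for the example.

```python
from dataclasses import dataclass, asdict

# Hypothetical field names; all rates are fractions in [0, 1].
RATE_FIELDS = (
    "outcome_success",
    "subgoal_completion",
    "pass_hat_k",
    "recovery_rate",
    "policy_violation_rate",
)


@dataclass(frozen=True)
class AgentEvalReport:
    outcome_success: float        # fraction of tasks whose end state is correct
    subgoal_completion: float     # fraction of sub-goals completed
    pass_hat_k: float             # pass^k at the k given below
    k: int                        # k chosen to match the deployment
    cost_per_task_usd: float      # assumed unit: US dollars per task
    recovery_rate: float          # fraction of injected tool failures recovered from
    policy_violation_rate: float  # fraction of runs with a policy violation

    def __post_init__(self) -> None:
        # Reject values that cannot be valid rates, so a bad pipeline
        # step fails loudly instead of producing a plausible-looking row.
        for name in RATE_FIELDS:
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")
        if self.k < 1:
            raise ValueError("k must be at least 1")
        if self.cost_per_task_usd < 0:
            raise ValueError("cost_per_task_usd must be non-negative")


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not measured results.
    report = AgentEvalReport(0.80, 0.90, 0.40, 4, 0.12, 0.50, 0.01)
    for name, value in asdict(report).items():
        print(f"{name:>24}: {value}")
```

Keeping the record immutable and validated means any dashboard that wants a single headline number has to pick a field explicitly, which makes it visible what is being dropped.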

Chapters:

Trajectory-level evaluation. Score the path and the destination, with deterministic state checks before any judge.
Tool-call evaluation. Schema conformance, AST equivalence, and execution-grounded scoring, plus false positives on "should not have called the tool" cases.
Sub-goal decomposition. Plan-graph extraction, sub-goal completion rate, and replanning frequency as a diagnostic for self-error-detection.
Pass^k and consistency. Why pass@1 hides the failure mode that ships, the math of pass^k, and how to report a Pareto curve instead of a single number.
Agent-as-a-judge. When a tool-equipped judge beats a single-pass LLM judge, and the cases where it does not.
Reward hacking detection. Specification gaming, in-context hacking, chain-of-thought unfaithfulness, and a red-team harness you can run this week.
Environmental drift. Robustness to deprecated fields, restructured responses, injected latency, and the "evil twin" tool layer.
Regression suites for agent loops. Trajectory replay, sandboxed Docker harnesses, and diffing tool-call sequences on every model change.
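To make the pass@1 versus pass^k contrast from the chapter list concrete, here is a small sketch of the per-task estimators. It assumes the commonly used definitions: pass@k is the probability that at least one of k sampled trials succeeds, and pass^k is the probability that all k succeed, both estimated from n recorded trials of which c succeeded. The function names are hypothetical, and the chapter itself may define or estimate these differently.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k trials succeeds) from c successes in n trials."""
    _check(n, c, k)
    # 1 minus the chance that a random size-k subset contains only failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k trials succeed) from c successes in n trials."""
    _check(n, c, k)
    # Chance that a random size-k subset contains only successes.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Illustrative counts only: one task, 8 trials, 6 successes.
    n, c = 8, 6
    for k in (1, 2, 4):
        print(k, round(pass_at_k(n, c, k), 3), round(pass_hat_k(n, c, k), 3))
```

With these illustrative counts, both metrics equal 0.75 at k = 1, but at k = 4 pass@k rises to 1.0 while pass^k falls to about 0.214. That gap is the inconsistency a pass@1 number alone does not show. A benchmark-level figure would average the per-task estimate over tasks.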

Task-shaped playbooks for the newest agent surfaces (deep research, multi-agent systems, MCP servers, agent memory) live under /task-types.