Agentic Evals
An agent run is a sequence of thoughts, tool calls, observations, and responses, not a single output. Scoring only the final answer is a category error: a wrong answer can come from a sound plan whose last tool call returned a 503, and a right answer can come from a broken plan that reached the gold answer by luck. Pass@1 hides catastrophic inconsistency across repeated runs of the same task. Reward hacking is empirically common, not theoretical. These eight chapters cover what to score, how to score it, and which standard benchmarks to treat as the floor versus the ceiling for your own private regression set.
The opinionated frame: report a vector, not a scalar. Outcome success, sub-goal completion, pass^k at the k that matches your deployment, cost per task, recovery rate under tool failure, and policy-violation rate are the minimum honest description of an agent. Anything that collapses to one number is hiding at least one of those, and the buyer of the eval needs to know which.
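As a rough illustration of what "a vector, not a scalar" could look like in a harness, here is a minimal sketch. The `RunRecord` schema, the `EvalVector` name, and the `summarize` helper are hypothetical, not part of any particular framework. It assumes one run per task (so mean cost per run equals cost per task) and takes pass^k as an input, since that metric needs repeated trials per task.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RunRecord:
    """One agent run on one task (hypothetical schema)."""
    succeeded: bool          # final outcome check passed
    subgoals_done: int       # sub-goals verified complete
    subgoals_total: int      # sub-goals defined for the task
    cost_usd: float          # total spend for the run
    tool_failures: int       # tool failures the agent encountered
    recoveries: int          # failures it recovered from
    policy_violations: int   # policy checks it tripped


@dataclass(frozen=True)
class EvalVector:
    """The six numbers reported together, never collapsed into one."""
    outcome_success: float
    subgoal_completion: float
    pass_hat_k: float
    cost_per_task_usd: float
    recovery_rate: Optional[float]  # None when no tool failure was observed
    policy_violation_rate: float


def summarize(runs: Sequence[RunRecord], pass_hat_k: float) -> EvalVector:
    """Aggregate per-run records into the report vector."""
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    total_subgoals = sum(r.subgoals_total for r in runs)
    total_failures = sum(r.tool_failures for r in runs)
    return EvalVector(
        outcome_success=sum(r.succeeded for r in runs) / n,
        # Micro-average: every sub-goal counts equally across tasks.
        subgoal_completion=(
            sum(r.subgoals_done for r in runs) / total_subgoals
            if total_subgoals else 0.0
        ),
        pass_hat_k=pass_hat_k,
        cost_per_task_usd=sum(r.cost_usd for r in runs) / n,
        recovery_rate=(
            sum(r.recoveries for r in runs) / total_failures
            if total_failures else None
        ),
        # Fraction of runs with at least one violation.
        policy_violation_rate=sum(r.policy_violations > 0 for r in runs) / n,
    )
```

Keeping `recovery_rate` as `None` when no failure was observed, rather than defaulting to 0 or 1, avoids reporting a number the eval never actually measured.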
Chapters:
- Trajectory-level evaluation. Score the path and the destination, with deterministic state checks before any judge.
- Tool-call evaluation. Schema conformance, AST equivalence, and execution-grounded scoring; false-positive rates on cases where the agent should not have called a tool. Full content lands in Phase 7.
- Sub-goal decomposition. Plan-graph extraction, sub-goal completion rate, and replanning frequency as a diagnostic for self-error-detection. Full content lands in Phase 7.
- Pass^k and consistency. Why pass@1 hides the failure mode that ships, the math of pass^k, and how to report a Pareto curve instead of a single number.
- Agent-as-a-judge. When a tool-equipped judge beats a single-pass LLM judge, and the cases where it does not. Full content lands in Phase 7.
- Reward hacking detection. Specification gaming, in-context hacking, chain-of-thought unfaithfulness, and a red-team harness you can run this week.
- Environmental drift. Robustness to deprecated fields, restructured responses, injected latency, and the "evil twin" tool layer. Full content lands in Phase 7.
- Regression suites for agent loops. Trajectory replay, sandboxed Docker harnesses, and diffing tool-call sequences on every model change. Full content lands in Phase 7.
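To make the pass@1 versus pass^k distinction from the chapter list concrete, here is a minimal sketch using a common unbiased combinatorial estimator: given n independent trials of a task with c successes, pass@k estimates the chance that at least one of k sampled trials succeeds, while pass^k estimates the chance that all k succeed. The function names are illustrative, and the chapter itself may use a different estimator.

```python
from math import comb
from typing import Sequence, Tuple


def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k trials succeeds) from c successes in n trials."""
    _check(n, c, k)
    # comb(n - c, k) is 0 when fewer than k failures exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k trials succeed) from c successes in n trials."""
    _check(n, c, k)
    # comb(c, k) is 0 when fewer than k successes exist.
    return comb(c, k) / comb(n, k)


def mean_pass_hat_k(tasks: Sequence[Tuple[int, int]], k: int) -> float:
    """Average pass^k over tasks, each given as (n_trials, n_successes)."""
    if not tasks:
        raise ValueError("need at least one task")
    return sum(pass_hat_k(n, c, k) for n, c in tasks) / len(tasks)
```

For a task that succeeds in 6 of 8 trials, `pass_at_k(8, 6, 1)` is 0.75 and `pass_at_k(8, 6, 4)` is 1.0, but `pass_hat_k(8, 6, 4)` is 15/70, roughly 0.214. An agent that looks 75% reliable on a single attempt completes four attempts in a row only about a fifth of the time, which is the inconsistency that pass@1 hides.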