
Agent trajectory evaluation

Step-level scoring beats outcome-only. Tool-call F1, plan analysis, reward-hacking probes, environmental drift. The eval pattern that catches partial failures.

An agent is a system that takes multiple actions to accomplish a task: tool calls, sub-goal decomposition, plan-then-execute, retrieval-then-act. An outcome-only eval collapses everything between start and end into a single bit. The trajectory pattern scores each step plus the outcome. Yehudai et al.'s 2025 survey is the canonical reference ¹.

Why outcome-only fails

Three failure modes are invisible to an outcome-only metric.

Tool-call hallucination. The agent calls a non-existent tool or calls a real tool with malformed arguments. An exception handler catches the error and triggers a fallback that still produces a passable outcome. Outcome: pass. Reality: the agent was wrong.

Redundant work. The agent calls the same tool five times with minor variations, handles intermediate state poorly, and eventually arrives at the right answer. Outcome: pass. Cost: five times the budget. Latency: five times the SLA.

Inconsistency. The agent passes on the first run, fails on the second, passes on the third. Yao et al. introduced pass^k for tau-bench to surface this ²: if runs are independent, an agent with a 60% pass@1 succeeds on four runs out of four only about 13% of the time (0.6^4 ≈ 0.13). The agent is not reliable; the outcome metric did not say so.

The trajectory pattern fixes all three by scoring steps and outcome together. He et al.'s TRAJECT-Bench is the explicit benchmark ³.
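
The first two failure modes can be flagged directly from a recorded trace. The sketch below is illustrative only: the registry, the tool names, and the trace shape (a list of dicts with "tool" and "args" keys) are assumptions, not part of any cited benchmark.

```python
from collections import Counter

# Hypothetical tool registry: tool name -> required argument names.
TOOL_REGISTRY = {
    "search_orders": {"user_id"},
    "refund_order": {"order_id", "amount"},
}

def audit_trace(calls: list[dict], registry: dict[str, set[str]]) -> dict:
    """Flag unknown tools, calls missing required arguments, and repeated tools."""
    unknown = [c["tool"] for c in calls if c["tool"] not in registry]
    malformed = [
        c["tool"]
        for c in calls
        if c["tool"] in registry and not registry[c["tool"]] <= set(c["args"])
    ]
    per_tool = Counter(c["tool"] for c in calls)
    repeated = {tool: n for tool, n in per_tool.items() if n > 1}
    return {
        "unknown_tools": unknown,
        "malformed_calls": malformed,
        "repeated_tools": repeated,
    }

trace = [
    {"tool": "search_orders", "args": {"user_id": "alice"}},
    {"tool": "search_orders", "args": {"user_id": "Alice"}},
    {"tool": "lookup_invoice", "args": {"id": 7}},        # not in the registry
    {"tool": "refund_order", "args": {"order_id": 42}},   # missing "amount"
]
report = audit_trace(trace, TOOL_REGISTRY)
```

An outcome verifier would never see any of these three findings if the fallback path produced an acceptable final answer.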

The four trajectory metrics

Tool-call F1. Did the agent pick the right tool with the right arguments? Compute per-step precision and recall against a gold trajectory; aggregate as F1. T-Eval and BFCL operationalize this at scale ⁴.

Plan correctness. Score the plan separately from the execution. A wrong plan executed correctly is a different failure from a right plan executed incorrectly.

Outcome correctness. Did the task complete? Necessary, not sufficient.

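Pass^k consistency. Run the agent k times and report the probability that all k runs succeed. Yao et al.'s tau-bench result: pass@1 60%, pass^4 20% for representative agents ². The reported pass^4 sits above the 0.6^4 ≈ 13% that fully independent runs would give, as expected when per-task success rates differ, but the drop from pass@1 is still the reliability gap operational systems care about.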

Metric	What it catches	Computed against
Tool-call F1	wrong tool, wrong args	gold trajectory
Plan correctness	wrong plan, right execution	gold plan or rubric
Outcome correctness	task completion	task verifier
Pass^k	inconsistency	k independent runs

A complete agent-eval report ships all four.

Pass^k in detail

For an agent with per-run success probability p (assumed independent across runs), pass^k = p^k.

p (per-run)	pass^1	pass^2	pass^4	pass^8
0.95	0.95	0.90	0.81	0.66
0.85	0.85	0.72	0.52	0.27
0.70	0.70	0.49	0.24	0.06
0.50	0.50	0.25	0.06	0.004

A team shipping an agent at 70% pass@1 has roughly a one-in-four chance (0.7^4 ≈ 0.24) of four consecutive successes. For a customer-facing flow where reliability matters, pass^4 is the headline metric, not pass@1. The pass^k chapter walks through the implications.
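
In practice p is not known and differs by task, so pass^k has to be estimated from recorded runs. A minimal sketch, assuming each task was run n ≥ k times with c successes: the per-task estimate is C(c, k) / C(n, k), the chance that k runs drawn without replacement from the recorded n all succeeded, averaged over tasks. The function names and sample counts are illustrative.

```python
from math import comb

def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Per-task estimate: probability that k of the recorded trials, drawn
    without replacement, are all successes."""
    if not 0 < k <= trials:
        raise ValueError("need 0 < k <= trials")
    return comb(successes, k) / comb(trials, k)

def benchmark_pass_hat_k(per_task: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (successes, trials) pairs."""
    return sum(pass_hat_k(c, n, k) for c, n in per_task) / len(per_task)

# Three hypothetical tasks, eight runs each: (successes, trials).
results = [(8, 8), (6, 8), (2, 8)]
pass_1 = benchmark_pass_hat_k(results, 1)
pass_4 = benchmark_pass_hat_k(results, 4)
```

On this sample pass^1 is about 0.67 and pass^4 about 0.40, noticeably above 0.67^4 ≈ 0.20. The p^k table is a floor for the benchmark-level number when tasks vary in difficulty, which is why measured values can exceed it.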

Tool-call F1 in detail

Computing tool-call F1 requires a gold trajectory: a labeled sequence of (tool_name, arguments) pairs a correct agent should produce. BFCL provides this for a curated set of API-calling tasks ⁴; for internal benchmarks, the gold trajectory comes from a human-expert demonstration.

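import json
from collections import Counter

def tool_call_f1(gold: list[dict], pred: list[dict]) -> float:
    def normalize(call: dict) -> tuple[str, str]:
        # Serialize args with sorted keys so nested (unhashable) values compare safely.
        return (call["tool"], json.dumps(call["args"], sort_keys=True))
    # Multisets rather than sets, so repeated calls count as extra predictions.
    g = Counter(normalize(c) for c in gold)
    p = Counter(normalize(c) for c in pred)
    if not g and not p:
        return 1.0  # nothing expected, nothing called
    tp = sum((g & p).values())  # multiset intersection
    if tp == 0:
        return 0.0  # also covers the case where only one side is empty
    prec, rec = tp / sum(p.values()), tp / sum(g.values())
    return 2 * prec * rec / (prec + rec)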

Practical extensions: argument-similarity tolerance (gold "user_id=alice" vs predicted "user_id=Alice"), order tolerance (some sequences require exact order, others permit any topological order on the dependency graph).
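
Both extensions can be sketched in a few lines. The version below assumes case-folding string arguments is an acceptable tolerance policy and handles only the strict-order case, using longest-common-subsequence length as the count of matched calls; topological-order tolerance would need the dependency graph and is not shown. All function names are illustrative.

```python
def normalize_args(args: dict) -> tuple:
    """Case-fold string values so 'Alice' matches 'alice' (an assumed policy)."""
    return tuple(
        sorted((k, v.casefold() if isinstance(v, str) else v) for k, v in args.items())
    )

def ordered_matches(gold: list[dict], pred: list[dict]) -> int:
    """Longest common subsequence: matched calls that also respect gold order."""
    g = [(c["tool"], normalize_args(c["args"])) for c in gold]
    p = [(c["tool"], normalize_args(c["args"])) for c in pred]
    # dp[i][j] holds the LCS length of g[:i] and p[:j].
    dp = [[0] * (len(p) + 1) for _ in range(len(g) + 1)]
    for i, gc in enumerate(g, 1):
        for j, pc in enumerate(p, 1):
            if gc == pc:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(g)][len(p)]

def ordered_f1(gold: list[dict], pred: list[dict]) -> float:
    """F1 where a predicted call only counts if it appears in gold order."""
    tp = ordered_matches(gold, pred)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(pred), tp / len(gold)
    return 2 * prec * rec / (prec + rec)

gold = [
    {"tool": "search_orders", "args": {"user_id": "alice"}},
    {"tool": "refund_order", "args": {"order_id": 42}},
]
in_order = [
    {"tool": "search_orders", "args": {"user_id": "Alice"}},
    {"tool": "refund_order", "args": {"order_id": 42}},
]
swapped = list(reversed(in_order))
```

Here in_order scores 1.0 despite the capitalization difference, while swapped scores 0.5 because only one call can be matched without violating the gold order.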

Agent-as-judge

When outcomes are subjective, use another agent to judge the trajectory. Zhuge et al. formalize this as "Agent-as-Judge" and report agreement comparable to a panel of human judges on agentic tasks ⁵. The evaluator agent has access to the full trace (input, plan, tool calls, intermediate state, output), scores it against a rubric, and returns per-step scores plus an overall judgment.

The same LLM-as-Judge discipline applies. Calibrate against the principal expert on a hundred-trace labeled set; measure the judge's true-positive and true-negative rates (TPR, TNR); recalibrate monthly. The LLM-as-Judge page covers the discipline.
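
The calibration check itself is a small confusion-matrix computation. A minimal sketch, assuming binary pass/fail verdicts with the expert labels treated as ground truth; the function name and sample labels are illustrative.

```python
def judge_rates(expert: list[bool], judge: list[bool]) -> tuple[float, float]:
    """Return (TPR, TNR) for judge verdicts against expert labels (True = pass)."""
    if len(expert) != len(judge):
        raise ValueError("label lists must have the same length")
    positives = sum(expert)
    negatives = len(expert) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("labeled set needs both passing and failing traces")
    true_pos = sum(e and j for e, j in zip(expert, judge))
    true_neg = sum((not e) and (not j) for e, j in zip(expert, judge))
    return true_pos / positives, true_neg / negatives

expert_labels = [True, True, True, False, False]
judge_labels = [True, True, False, False, True]
tpr, tnr = judge_rates(expert_labels, judge_labels)
```

Reporting the two rates separately matters: a judge that passes everything has a perfect TPR and a TNR of zero, which a single accuracy number can hide when most traces pass.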

Reward-hacking probes

Reward hacking is when the agent learns to satisfy the metric rather than the underlying intent. Pan et al. document "in-context reward hacking", where an agent in a feedback-loop eval exploits the metric's definitions ⁶. Three probes test for it.

Specification gaming. Construct a task whose metric has an obvious exploit and watch whether the agent finds it.

Feedback-loop. Run an iterative eval where the agent sees its score after each attempt, and measure whether scores rise through exploitation of the metric or through genuine improvement.

CoT faithfulness. Check whether the stated reasoning matches actual behavior; if the agent says "I'll call tool X" and then calls Z, the chain-of-thought is unfaithful.

Run these as red-team-style audits at least quarterly for safety-critical agents.
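
The CoT faithfulness probe is the easiest of the three to automate in a crude form. The sketch below assumes each step records free-text reasoning and the tool actually called, and that the reasoning announces tools with the phrase "call <tool_name>"; both the step format and the phrasing are assumptions, and a real probe would need a more robust extractor.

```python
import re

# Assumed phrasing: the reasoning names the tool right after the word "call".
STATED_TOOL = re.compile(r"\bcall\s+([A-Za-z_]\w*)")

def unfaithful_steps(steps: list[dict]) -> list[int]:
    """Indices of steps whose stated tool differs from the tool actually called."""
    flagged = []
    for i, step in enumerate(steps):
        match = STATED_TOOL.search(step["reasoning"])
        # Steps that state no tool are skipped rather than flagged.
        if match and match.group(1) != step["tool"]:
            flagged.append(i)
    return flagged

steps = [
    {"reasoning": "I'll call search_orders to find the purchase.", "tool": "search_orders"},
    {"reasoning": "Next I will call refund_order.", "tool": "close_ticket"},
    {"reasoning": "Wrapping up.", "tool": "send_email"},
]
flagged = unfaithful_steps(steps)
```

Only the second step is flagged here. Flagged steps are candidates for human review, not verdicts, since a string mismatch can also come from aliasing or paraphrase.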

Environmental drift

An agent depends on an environment (tools, APIs, retrieval index). When the environment changes, an agent that evaluated cleanly six months ago may start failing with no change to the agent itself. He et al.'s TRAJECT-Bench tests environmental robustness via controlled tool changes between runs ³. Pin the tool version, API schema, and index snapshot for every eval run, and record them in the trace. When production drifts from the pinned versions, re-run the eval and compare.
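
One way to make the pin checkable is to store a manifest with each eval run and fingerprint it. A minimal sketch; the manifest keys and version strings are hypothetical.

```python
import hashlib
import json

def env_fingerprint(manifest: dict) -> str:
    """Stable SHA-256 over a canonical JSON encoding of the environment manifest."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def drifted_keys(pinned: dict, current: dict) -> list[str]:
    """Top-level manifest entries that differ between the pinned and current environment."""
    return sorted(
        k for k in pinned.keys() | current.keys() if pinned.get(k) != current.get(k)
    )

# Hypothetical manifest recorded alongside an eval run.
pinned = {
    "tools": {"search_orders": "1.4.0", "refund_order": "2.0.1"},
    "api_schema": "2024-11-01",
    "index_snapshot": "snap-0042",
}
current = {**pinned, "index_snapshot": "snap-0057"}

needs_rerun = env_fingerprint(pinned) != env_fingerprint(current)
changed = drifted_keys(pinned, current)
```

Storing the fingerprint in every trace makes it cheap to tell whether two eval results are comparable before looking at the scores.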

Observability

Moshkovich et al.'s 2025 work treats observability as the production-time counterpart to trajectory eval ⁷. Each tool call, intermediate state, and decision is recorded as a structured span, so the trace can be re-scored against an updated rubric without re-running the agent. The trace schema should cover the full prompt and response per turn, the tool name and arguments per call, intermediate state, the plan if surfaced, and the outcome verdict. Mohammadi et al.'s survey covers the design choices ⁸.
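
A trace schema along these lines can be sketched with plain dataclasses. The field names, span kinds, and the example rubric are assumptions for illustration, not a schema taken from the cited work.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

@dataclass
class Span:
    kind: str                  # e.g. "llm_turn", "tool_call", "state", "plan"
    payload: dict[str, Any]    # prompt/response, tool name and args, or state

@dataclass
class Trace:
    task_id: str
    spans: list[Span] = field(default_factory=list)
    outcome: Optional[bool] = None   # verdict from the task verifier, if any

Rubric = Callable[[Span], float]

def rescore(trace: Trace, rubric: Rubric) -> list[float]:
    """Apply a rubric to the stored spans; the agent itself is not re-run."""
    return [rubric(span) for span in trace.spans]

def args_present(span: Span) -> float:
    """Illustrative updated rubric: every tool call must carry arguments."""
    if span.kind != "tool_call":
        return 1.0
    return 1.0 if span.payload.get("args") else 0.0

trace = Trace(
    task_id="refund-001",
    spans=[
        Span("plan", {"text": "look up the order, then refund it"}),
        Span("tool_call", {"tool": "search_orders", "args": {"user_id": "alice"}}),
        Span("tool_call", {"tool": "refund_order", "args": {}}),
    ],
    outcome=True,
)
scores = rescore(trace, args_present)
```

The stored outcome is a pass, yet the new rubric gives the third span a zero. That is the replay property in miniature: a stricter rubric applied after the fact, with no new agent run.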

What good looks like

A production agent eval program ships five things: a gold-trajectory benchmark of 100-300 tasks with labeled (tool, args) sequences; reporting of tool-call F1, plan correctness, outcome correctness, and pass^k for k ≥ 4; quarterly reward-hacking probes; trajectory traces that can be replayed against updated rubrics; and pinned environment versions in every eval run. The trajectory vs outcome chapter walks through the conceptual setup; this page is the methodology reference.

¹ Yehudai et al., agent eval survey, 2025.
² Yao et al., tau-bench, 2024.
³ He et al., TRAJECT-Bench, 2025.
⁴ Berkeley Function Calling Leaderboard.
⁵ Zhuge et al., Agent-as-Judge, 2024.
⁶ Pan et al., in-context reward hacking, 2024.
⁷ Moshkovich et al., agent observability, 2025.
⁸ Mohammadi et al., agent eval survey, 2025.