Trajectory-level evaluation

Outcome-only scoring can rank agents in the wrong order. Score the trajectory as well as the result.

An agent run is a sequence: thought, tool call, observation, response, repeat. A correct final answer can ride on a broken plan that hit the right cell of the search space by luck. A wrong final answer can come from a perfect plan whose last tool call returned a 503. Scoring only the last step of that sequence is the easiest way to ship an agent that looks fine on the leaderboard and collapses the moment your tools behave differently in production.

The argument of this page is that you need two grades, not one: outcome success and trajectory quality. They disagree often enough that picking the wrong one will rank your candidate models in the opposite order from the one your users would prefer. TRAJECT-Bench shows that headline outcome-success rankings flip when you score the path instead of the destination ¹. Two recent surveys say the same thing in different words: single-turn accuracy is degenerate for agents because it ignores planning, tool selection, long-horizon reliability, and the failure-recovery behaviour that determines whether a deployment is operable ² ³.

What "trajectory" means

The unit you score is the span tree: the root request, the model's plan, every tool call with its arguments, every tool response, every model turn in between, and the final response. In OpenTelemetry terms, the trajectory is the trace; each step is a span; the per-step attributes are the arguments, the result, the latency, the token count. If you cannot reconstruct the trajectory from your traces, you cannot do trajectory-level eval; instrumentation is the prerequisite.
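As a minimal sketch of that prerequisite, the snippet below rebuilds an ordered trajectory from flat span records. The field names (`span_id`, `parent_id`, `start_ns`, `name`) are assumptions chosen for illustration, not the export schema of any particular OpenTelemetry SDK; adapt them to whatever your exporter emits.

```python
from collections import defaultdict

def build_trajectory(spans: list[dict]) -> list[dict]:
    """Return spans in depth-first, start-time order, each tagged with its depth.

    Raises ValueError when the trace has no single root or contains orphaned
    spans, because a broken tree cannot be scored.
    """
    by_id = {s["span_id"]: s for s in spans}
    children = defaultdict(list)
    roots = []
    for s in spans:
        parent = s.get("parent_id")
        if parent is None:
            roots.append(s)
        elif parent in by_id:
            children[parent].append(s)
        else:
            raise ValueError(f"orphaned span {s['span_id']}: parent {parent} missing")
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root span, found {len(roots)}")

    ordered = []

    def walk(span: dict, depth: int) -> None:
        ordered.append({**span, "depth": depth})
        for child in sorted(children[span["span_id"]], key=lambda c: c["start_ns"]):
            walk(child, depth + 1)

    walk(roots[0], 0)
    return ordered

# Hypothetical spans, deliberately listed out of order.
spans = [
    {"span_id": "c", "parent_id": "a", "start_ns": 30, "name": "tool:refund"},
    {"span_id": "a", "parent_id": None, "start_ns": 0, "name": "request"},
    {"span_id": "b", "parent_id": "a", "start_ns": 10, "name": "tool:lookup_order"},
]
steps = build_trajectory(spans)
```

Failing loudly on orphaned spans is a deliberate choice in this sketch: a trace with missing parents usually means instrumentation gaps, and scoring it would grade the tracing rather than the agent.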

A useful mental split, lifted from the Yehudai survey ² and reflected in benchmarks like AgentBench ⁴ and GAIA ⁵:

Plan quality. Did the agent decompose the task correctly before acting? Are there obvious missing sub-goals or wrong dependencies?
Tool selection. Did it pick a reasonable tool at each step, including the decision not to call a tool when answering from memory was correct?
Tool argument correctness. When it called the tool, were the arguments grounded in the user's request and the prior context?
Error recovery. When a tool returned an error, did the agent retry, replan, or escalate, instead of silently fabricating a result?
Final answer. Did the response actually solve the user's task?

You score these independently because they fail independently. A model that picks the right tools but supplies hallucinated arguments fails differently from a model that picks the wrong tool every time but happens to reach the right answer through compensating retries.

TIP

If the trajectory-level score and the outcome score never disagree on your set, your trajectory scorer is too loose. Real agents pass for the wrong reasons and fail for the wrong reasons. Rubric tightening, not metric removal, is the response.
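A quick way to check this on your own set is to tabulate the two grades against each other. The sketch below assumes each run has already been reduced to two booleans; `trajectory_pass` is a hypothetical summary flag (for example, no missed sub-goals and no policy violations), not something defined elsewhere on this page.

```python
from collections import Counter

def agreement_table(runs: list[dict]) -> Counter:
    """Count runs in each (outcome_pass, trajectory_pass) cell."""
    return Counter((r["outcome_pass"], r["trajectory_pass"]) for r in runs)

def disagreement_rate(runs: list[dict]) -> float:
    """Fraction of runs where the two grades differ."""
    if not runs:
        return 0.0
    table = agreement_table(runs)
    return (table[(True, False)] + table[(False, True)]) / len(runs)

# Hypothetical results for five runs.
runs = [
    {"outcome_pass": True, "trajectory_pass": True},
    {"outcome_pass": True, "trajectory_pass": False},   # passed for the wrong reasons
    {"outcome_pass": False, "trajectory_pass": True},   # good plan, tool failed
    {"outcome_pass": False, "trajectory_pass": False},
    {"outcome_pass": True, "trajectory_pass": True},
]
table = agreement_table(runs)
rate = disagreement_rate(runs)
```

A rate of exactly zero across a realistic set is the warning sign the tip describes.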

Why outcome-only ranks models in the wrong order

Three concrete reasons.

First, many agent tasks have more than one route to success. A retail customer-support task in τ-bench has tens of paths that end at "refund processed", but most of them violate at least one policy ⁶. Outcome scoring rewards "the database ended in the right state". Operationally, you also need "no policy violations along the way". Two agents with identical outcome scores can have wildly different policy-violation rates; only the trajectory score distinguishes them.

Second, contamination and overfitting hide in outcome metrics. GAIA-style benchmarks contain widely-known reference answers ⁵. An agent that has memorized the final answer will hit it from a plan that no domain expert would endorse. A judge that walks the trajectory catches the gap between "answer is right" and "answer was earned"; an outcome metric cannot.

Third, error recovery matters more in production than in static evals because production tools fail. TRAJECT-Bench injects shaped failures into 1,228 production APIs and shows that the outcome-success ranking shuffles substantially once recovery behaviour counts ¹. If your model is excellent at first-try success but never recovers from a 429, it will win on a clean benchmark and lose on a real workload.
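You can approximate this in your own harness by wrapping tools so that a fraction of calls fail. The wrapper below is a hypothetical sketch, not the TRAJECT-Bench harness: the function name, the error shape, and the default status codes are all assumptions.

```python
import random
from typing import Any, Callable

def with_injected_failures(
    tool: Callable[..., Any],
    failure_rate: float,
    rng: random.Random,
    status_codes: tuple[int, ...] = (429, 503),
) -> Callable[..., Any]:
    """Wrap a tool so a fraction of calls return a structured error instead."""
    def wrapped(*args: Any, **kwargs: Any) -> Any:
        if rng.random() < failure_rate:
            # Assumed error shape: a dict with a truthy "error" key.
            return {"error": True, "status": rng.choice(status_codes), "injected": True}
        return tool(*args, **kwargs)
    return wrapped

def lookup_order(order_id: str) -> dict:
    """Stand-in for a real tool."""
    return {"order_id": order_id, "status": "delivered"}

rng = random.Random(0)  # seeded so a run can be replayed
flaky_lookup = with_injected_failures(lookup_order, failure_rate=0.5, rng=rng)
results = [flaky_lookup("A-1") for _ in range(200)]
injected = sum(1 for r in results if r.get("injected"))
```

Passing a seeded `random.Random` rather than using the module-level generator keeps the failure schedule reproducible, so two models face the same injected errors.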

State-check functions: the minimum viable trajectory scorer

Before you reach for an agent-as-judge, write deterministic state checks. They are cheaper, faster, and catch the largest single class of failure: the agent claimed success but the world is in the wrong state. The Anthropic guide to building effective agents argues for the simplest harness that does the job; for evaluation, the simplest harness is a sandbox and an assertion ⁷.

The pattern is: run the agent against a sandboxed environment, snapshot the relevant state at the end, and check it against an expected state. The check returns a boolean for outcome success plus a structured record of which sub-goals fired, in what order, and which post-conditions hold.

"""
Minimum-viable trajectory scorer for a retail-support agent.
Run the agent in a sandbox, then check the resulting state.
"""
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Trajectory:
    user_request: str
    steps: list[dict]   # each step: {tool, args, result, latency_ms}
    final_response: str

@dataclass
class TaskCase:
    name: str
    user_request: str
    setup: Callable[["Sandbox"], None]
    check_outcome: Callable[["Sandbox"], bool]
    expected_subgoals: list[str]      # ordered sub-goals
    forbidden_actions: list[str] = field(default_factory=list)

@dataclass
class TrajectoryScore:
    outcome_pass: bool
    subgoal_hits: list[str]
    subgoal_misses: list[str]
    policy_violations: list[str]
    redundant_calls: int
    recovered_from_error: bool

def score(traj: Trajectory, case: TaskCase, sandbox) -> TrajectoryScore:
    tools_called = [s["tool"] for s in traj.steps]

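    # Match expected sub-goals as an ordered subsequence of the tool calls:
    # a goal counts as hit only if it occurs after the previous hit.
    subgoal_hits, subgoal_misses = [], []
    cursor = 0
    for goal in case.expected_subgoals:
        try:
            cursor = tools_called.index(goal, cursor) + 1
            subgoal_hits.append(goal)
        except ValueError:
            subgoal_misses.append(goal)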

    policy_violations = [
        a for a in case.forbidden_actions if a in tools_called
    ]

    # redundant: same tool called with same args twice in a row.
    redundant_calls = 0
    for i in range(1, len(traj.steps)):
        prev, cur = traj.steps[i - 1], traj.steps[i]
        if prev["tool"] == cur["tool"] and prev["args"] == cur["args"]:
            redundant_calls += 1

    # recovery: an error step followed by a non-error call to a different
    # tool, or to the same tool with changed args.
    def is_error(step: dict) -> bool:
        result = step["result"]
        return isinstance(result, dict) and bool(result.get("error"))

    recovered_from_error = False
    for i in range(len(traj.steps) - 1):
        cur, nxt = traj.steps[i], traj.steps[i + 1]
        changed = nxt["tool"] != cur["tool"] or nxt["args"] != cur["args"]
        if is_error(cur) and changed and not is_error(nxt):
            recovered_from_error = True
            break

    return TrajectoryScore(
        outcome_pass=case.check_outcome(sandbox),
        subgoal_hits=subgoal_hits,
        subgoal_misses=subgoal_misses,
        policy_violations=policy_violations,
        redundant_calls=redundant_calls,
        recovered_from_error=recovered_from_error,
    )

That gives you six fields per run instead of one. Aggregate across your set and report the joint distribution: outcome pass-rate by sub-goal completion bucket, by policy-violation count, by recovery success. The headline is the outcome, but the diagnostic is the trajectory.
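One way to build the first of those joint views is sketched below with plain dicts that mirror the TrajectoryScore fields (for example, produced with `dataclasses.asdict`). The bucket names and thresholds are illustrative assumptions.

```python
from collections import defaultdict

def completion_bucket(score: dict) -> str:
    """Bucket a run by the fraction of expected sub-goals it completed."""
    hits = len(score["subgoal_hits"])
    total = hits + len(score["subgoal_misses"])
    frac = hits / total if total else 1.0
    if frac == 1.0:
        return "all"
    if frac >= 0.5:
        return "half_or_more"
    return "under_half"

def pass_rate_by_bucket(scores: list[dict]) -> dict[str, float]:
    """Outcome pass-rate within each sub-goal completion bucket."""
    groups = defaultdict(list)
    for s in scores:
        groups[completion_bucket(s)].append(s["outcome_pass"])
    return {bucket: sum(passes) / len(passes) for bucket, passes in groups.items()}

# Hypothetical scored runs.
scores = [
    {"outcome_pass": True,  "subgoal_hits": ["lookup", "refund"], "subgoal_misses": []},
    {"outcome_pass": True,  "subgoal_hits": ["lookup", "refund"], "subgoal_misses": []},
    {"outcome_pass": False, "subgoal_hits": ["lookup", "refund"], "subgoal_misses": []},
    {"outcome_pass": True,  "subgoal_hits": ["lookup"], "subgoal_misses": ["refund"]},
    {"outcome_pass": True,  "subgoal_hits": [], "subgoal_misses": ["lookup", "refund"]},
    {"outcome_pass": False, "subgoal_hits": [], "subgoal_misses": ["lookup", "refund"]},
]
rates = pass_rate_by_bucket(scores)
```

A non-trivial pass-rate in the `under_half` bucket is the interesting cell: those runs reached the right end state while skipping most of the expected work.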

When you need an agent-as-judge

State checks cover the deterministic-environment case. They do not cover open-ended tasks like "research this topic and summarize sources", or trajectories long enough that hand-writing sub-goal lists scales badly. For those, you need a judge that walks the trajectory and grades each step against a rubric.

The Zhuge et al. Agent-as-a-Judge work shows that a judge with its own tools, evaluating step by step on 55 realistic AI development tasks, approaches human-evaluator reliability on tasks where a single-pass LLM judge breaks down ⁸. Their finding is that single-LLM judges cannot reliably reason about long trajectories; the judge needs to act like an agent itself, opening files, running code, checking intermediate state.

In practice, three patterns work:

State-check first, judge second. If the state check passes with no missed sub-goals or policy violations, you do not need the judge. If it fails, run a judge to attribute the failure to plan, tool selection, arguments, or recovery.
Independent per-step grades. Score plan quality, tool selection, argument grounding, and recovery separately. Aggregate at the end. Joint grading of "was the whole trajectory good" produces vague labels.
The judge gets the tools too. Without tool access, a judge cannot verify "the file the agent claimed to write actually has the right contents". Read-only sandbox access to the same tool set as the agent is the minimum.
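The first two patterns can be sketched as a small gate. Everything here is hypothetical scaffolding: `judge` is any callable you supply that grades one axis of a trajectory on a 0 to 1 scale, and the axis names are assumptions based on the split described earlier on this page.

```python
from typing import Callable

AXES = ("plan", "tool_selection", "arguments", "recovery")

def evaluate(
    state_check_passed: bool,
    trajectory: list[dict],
    judge: Callable[[str, list[dict]], float],
) -> dict:
    """Run the cheap deterministic check first; grade per axis only on failure."""
    if state_check_passed:
        return {"outcome_pass": True, "axis_grades": None, "attributed_to": None}
    # Each axis is graded in its own judge call, never jointly.
    grades = {axis: judge(axis, trajectory) for axis in AXES}
    worst_axis = min(grades, key=grades.get)
    return {"outcome_pass": False, "axis_grades": grades, "attributed_to": worst_axis}

# Stub judge with fixed grades, standing in for a real tool-using judge.
calls: list[str] = []

def stub_judge(axis: str, trajectory: list[dict]) -> float:
    calls.append(axis)
    return {"plan": 0.9, "tool_selection": 0.8, "arguments": 0.2, "recovery": 0.7}[axis]

passed = evaluate(True, [], stub_judge)
failed = evaluate(False, [{"tool": "refund", "args": {"order_id": "A-1"}}], stub_judge)
```

Attributing the failure to the lowest-graded axis is one simple policy; a real harness might instead report every axis below a threshold.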

Inter-rater agreement is the first sanity check

Before any of this is automated, two domain experts should grade the same 20 trajectories. The agent-eval survey is explicit: trajectory-level annotation is 5 to 20 times more expensive per example than single-output annotation, and inter-rater κ at step granularity is regularly below 0.5 on first pass ². That is not a rater problem; it is a rubric problem. If two trained humans cannot agree on whether the plan was reasonable, no automated judge can either, and the right move is to sharpen the rubric, not to swap models.

The cheapest version of this: label 20 trajectories with two annotators, compute κ on (a) outcome success and (b) overall trajectory quality, then on each per-step axis. Step-level κ below 0.5 means the rubric for that step is under-specified. Fix the rubric before automating anything downstream.
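Cohen's κ for two annotators is the observed agreement corrected for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). A minimal sketch in plain Python follows; the labels and the 20-trajectory example data are made up for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = freq_a.keys() | freq_b.keys()
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical plan-quality labels from two annotators on 20 trajectories.
rater_a = ["ok"] * 10 + ["bad"] * 10
rater_b = ["ok"] * 7 + ["bad"] * 3 + ["ok"] * 3 + ["bad"] * 7
kappa = cohens_kappa(rater_a, rater_b)
```

In this example the annotators agree on 14 of 20 items, chance agreement is 0.5, and κ comes out at about 0.4, which by the threshold above means the rubric for that axis needs sharpening.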

What to report

For every model and every dataset, report the vector, not the scalar:

Outcome pass-rate, with a 95% confidence interval.
Sub-goal completion rate.
Mean number of redundant tool calls per run.
Recovery rate on injected errors.
Policy-violation rate (any forbidden tool / argument).
Median and p95 step count.
Median and p95 cost per run.

Anyone asking for a single number is asking you to hide one of these. Refuse the question or, at minimum, name the trade you are making.
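A sketch of assembling that vector is shown below. The per-run field names are hypothetical, the confidence interval uses the Wilson score method (one reasonable choice for a pass-rate, not the only one), and p95 uses the nearest-rank definition.

```python
import math
import statistics

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

def p95(values: list) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]

def report(runs: list[dict]) -> dict:
    n = len(runs)
    passes = sum(r["outcome_pass"] for r in runs)
    with_errors = [r for r in runs if r["errors_injected"]]
    steps = [r["step_count"] for r in runs]
    costs = [r["cost_usd"] for r in runs]
    return {
        "outcome_pass_rate": passes / n,
        "outcome_ci95": wilson_interval(passes, n),
        "subgoal_completion_rate": sum(r["subgoals_hit"] for r in runs)
        / sum(r["subgoals_expected"] for r in runs),
        "mean_redundant_calls": statistics.mean(r["redundant_calls"] for r in runs),
        "recovery_rate": sum(r["recovered"] for r in with_errors) / len(with_errors)
        if with_errors else None,
        "policy_violation_rate": sum(r["policy_violations"] > 0 for r in runs) / n,
        "median_steps": statistics.median(steps),
        "p95_steps": p95(steps),
        "median_cost_usd": statistics.median(costs),
        "p95_cost_usd": p95(costs),
    }

def run(passed, hit, redundant, violations, steps, cost, injected, recovered):
    """Build one hypothetical per-run record."""
    return {"outcome_pass": passed, "subgoals_hit": hit, "subgoals_expected": 3,
            "redundant_calls": redundant, "policy_violations": violations,
            "step_count": steps, "cost_usd": cost,
            "errors_injected": injected, "recovered": recovered}

runs = [
    run(True, 3, 0, 0, 4, 0.02, False, False),
    run(True, 2, 1, 1, 6, 0.03, True, True),
    run(False, 1, 2, 0, 9, 0.05, True, False),
    run(True, 3, 0, 0, 5, 0.02, False, False),
]
summary = report(runs)
```

With only four runs the interval is very wide, which is the point of reporting it next to the pass-rate.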

What this section covers next

The trajectory split is the frame. The chapters that follow drill into the parts. Tool-call evaluation gets its own page; pass^k handles the inconsistency problem that even a perfect trajectory scorer cannot detect from a single attempt; reward hacking is the failure mode trajectory scoring is most likely to miss when the judge itself is the target of the optimization. Read those next if you are designing a harness from scratch.

¹ He et al., "TRAJECT-Bench" (2025). https://arxiv.org/abs/2510.04550
² Yehudai et al., "Survey on Evaluation of LLM-based Agents" (2025). https://arxiv.org/abs/2503.16416
³ Mohammadi et al., "Evaluation of LLM Agents: A Survey" (2025). https://arxiv.org/abs/2507.21504
⁴ Liu et al., "AgentBench" (2023). https://arxiv.org/abs/2308.03688
⁵ Mialon et al., "GAIA" (2023). https://arxiv.org/abs/2311.12983
⁶ Yao et al., "τ-bench" (2024). https://arxiv.org/abs/2406.12045
⁷ Anthropic, "Building Effective Agents". https://www.anthropic.com/engineering/building-effective-agents
⁸ Zhuge et al., "Agent-as-a-Judge" (2024). https://arxiv.org/abs/2410.10934