Outcome-only scoring treats every failure equally: the agent either solved the task or did not. For agents that plan, that is the wrong unit. A six-step plan with five correct sub-goals and one failed branch is not the same artifact as a six-step plan that fell apart at step one. Outcome metrics rank them the same; sub-goal completion does not.
Both agent-eval surveys flag sub-goal completion as the primary planning metric and argue that, without it, prioritising fixes is guesswork [1, 2]. The benchmarks that adopt it (GAIA's multi-hop tasks, TRAJECT-Bench's plan-graph splits) consistently show that headline outcome success can rank models in the opposite order from trajectory-level correctness [3, 4].
The plan graph
A task is a directed graph. Nodes are sub-goals (sub-problems the agent must complete). Edges are dependencies (sub-goal B requires sub-goal A). The root is the user request; the leaves are atomic actions. The agent's trajectory walks the graph, sometimes in the planned order, sometimes with backtracking, sometimes with skipped branches.
For a customer-support flow, the graph might look like:
flowchart TD
A[Identify customer intent] --> B[Look up order]
A --> C[Check policy applies]
B --> D[Determine refund eligibility]
C --> D
D --> E[Issue refund or escalate]
E --> F[Send confirmation]
The agent must reach F, but F is meaningful only if D made the right decision, which is meaningful only if B and C produced correct intermediate state. An outcome metric on F flattens this to a single bit and discards the diagnostic information.
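One way to make the graph machine-checkable is a plain dependency map. The sketch below is an assumption about representation, not a prescribed format: the single-letter ids mirror the flowchart, and `upstream` and `unsupported` are hypothetical helper names.

```python
# Hypothetical encoding of the customer-support graph: each sub-goal maps to
# the sub-goals it depends on. Ids mirror the flowchart above.
PLAN_GRAPH: dict[str, list[str]] = {
    "A": [],          # Identify customer intent
    "B": ["A"],       # Look up order
    "C": ["A"],       # Check policy applies
    "D": ["B", "C"],  # Determine refund eligibility
    "E": ["D"],       # Issue refund or escalate
    "F": ["E"],       # Send confirmation
}


def upstream(graph: dict[str, list[str]], node: str) -> set[str]:
    """Return every sub-goal that `node` transitively depends on."""
    seen: set[str] = set()
    stack = list(graph[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current])
    return seen


def unsupported(graph: dict[str, list[str]], completed: set[str]) -> set[str]:
    """Completed sub-goals whose upstream dependencies were not all completed."""
    return {n for n in completed if not (upstream(graph, n) <= completed)}


# The agent reached F but never completed C or D, so E and F rest on nothing.
flagged = unsupported(PLAN_GRAPH, {"A", "B", "E", "F"})
```

An outcome metric would count this trajectory as a success; the dependency check reports that E and F were reached without the intermediate state they rely on.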
Sub-goal completion as a metric
Define three counts per trajectory:
- attempted: sub-goals the agent took at least one action on
- completed: sub-goals where the resulting state matches the annotated success criterion
- skipped: sub-goals on the canonical path that the agent never attempted
Sub-goal completion rate is completed / total, where total is the number of sub-goals on the canonical path. Sub-goal coverage is attempted / total. The difference between them tells you whether the agent is failing inside sub-goals (low completion, high coverage) or skipping them entirely (low coverage). The two failure modes need different fixes: failure inside a sub-goal points at tool-call or reasoning quality; skipping points at planning.
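A small worked example of the two rates, as a sketch under stated assumptions: the function names are hypothetical, and the 0.8 threshold in `diagnose` is an arbitrary illustration, not a recommended cut-off.

```python
def sub_goal_rates(
    attempted: set[str], completed: set[str], canonical: set[str]
) -> dict[str, float]:
    """Coverage and completion as fractions of the canonical sub-goals."""
    total = len(canonical)
    if total == 0:
        raise ValueError("canonical graph has no sub-goals")
    return {
        "coverage": len(attempted & canonical) / total,
        "completion": len(completed & canonical) / total,
    }


def diagnose(rates: dict[str, float], threshold: float = 0.8) -> str:
    """Map the two rates to a failure mode. The threshold is an assumption."""
    if rates["coverage"] < threshold:
        return "planning"   # sub-goals were skipped entirely
    if rates["completion"] < threshold:
        return "execution"  # sub-goals were attempted but failed inside
    return "ok"


canonical = {"A", "B", "C", "D", "E", "F"}

# Every sub-goal attempted, only half completed: an execution problem.
tried_all = sub_goal_rates(canonical, {"A", "B", "C"}, canonical)

# Half the sub-goals never attempted: a planning problem.
skipped_half = sub_goal_rates({"A", "B", "F"}, {"A", "B", "F"}, canonical)
```

Both trajectories have a completion rate of 0.5; only the coverage rate separates them.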
A useful refinement: weight the sub-goals. Reaching F without completing D is a policy violation, not partial credit. Annotate critical-path sub-goals separately and report critical_path_completion alongside the overall rate. Both surveys note this as the difference between "agent made progress" and "agent did the work that mattered" [1, 2].
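To illustrate why the critical-path figure is reported separately, here is a minimal sketch. Which sub-goals count as critical is an annotation decision; marking D and E as critical below is an assumption for the example, and both function names are hypothetical.

```python
def critical_path_completion(completed: set[str], critical: set[str]) -> float:
    """Fraction of critical-path sub-goals that were completed."""
    if not critical:
        raise ValueError("no critical sub-goals annotated")
    return len(completed & critical) / len(critical)


def reached_end_without_critical(
    completed: set[str], critical: set[str], terminal: str
) -> bool:
    """True when the terminal sub-goal was completed but a critical one was not."""
    return terminal in completed and not ((critical - {terminal}) <= completed)


canonical = {"A", "B", "C", "D", "E", "F"}
critical = {"D", "E"}                   # assumed annotation
completed = {"A", "B", "C", "E", "F"}   # the agent skipped D

overall = len(completed & canonical) / len(canonical)
on_critical_path = critical_path_completion(completed, critical)
violation = reached_end_without_critical(completed, critical, terminal="F")
```

The overall rate here is five out of six, which looks healthy; the critical-path rate is one out of two, and the violation flag marks the trajectory as one that sent a confirmation without an eligibility decision.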
Replanning frequency
A second metric that falls out of the plan-graph view: how often the agent revises its plan in response to feedback. An agent that detects a failed sub-goal, replans, and tries an alternative branch shows correlated success across attempts. An agent that pushes ahead with a broken plan shows near-zero replanning even when production traces contain known transient failures.
Near-zero replanning frequency in production is a strong signal that the agent is not detecting its own errors. The fix is rarely "better model"; it is usually "add a self-check sub-goal between branches" or "feed the tool response back into the planner before the next call."
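The signal described above can be sketched as a filter over traces. This assumes each step is a dict with an optional `plan` list and an `observation_ok` flag; the flag name and both function names are hypothetical, and how a failed observation is detected is left to the logging layer.

```python
def count_replans(trace: list[dict]) -> int:
    """Count steps whose plan differs from the last plan seen."""
    replans = 0
    last_plan = None
    for step in trace:
        plan = step.get("plan")
        if plan and last_plan and plan != last_plan:
            replans += 1
        if plan:
            last_plan = plan
    return replans


def has_failure(trace: list[dict]) -> bool:
    """True if any step recorded a failed observation (assumed flag)."""
    return any(step.get("observation_ok") is False for step in trace)


def pushes_ahead_rate(traces: list[list[dict]]) -> float:
    """Among traces with a failed observation, the share with zero replans."""
    failing = [t for t in traces if has_failure(t)]
    if not failing:
        return 0.0
    return sum(count_replans(t) == 0 for t in failing) / len(failing)


stubborn = [
    {"plan": ["B", "D"], "observation_ok": False},
    {"plan": ["B", "D"], "observation_ok": True},
]
adaptive = [
    {"plan": ["B", "D"], "observation_ok": False},
    {"plan": ["C", "D"], "observation_ok": True},
]
rate = pushes_ahead_rate([stubborn, adaptive])
```

A rate close to 1.0 on traces that contain known transient failures is the case worth investigating first.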
"""
Score an agent trajectory against an annotated canonical plan graph.

Each trajectory step is a dict that may carry "plan" (the agent's current list
of sub-goal ids) and "sub_goal" (the id the step works on). The canonical graph
is a dict of {sub_goal_id: SubGoal}.
"""
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SubGoal:
    id: str
    deps: list[str] = field(default_factory=list)
    critical: bool = False
    # The default check never passes, so an unannotated sub-goal cannot count as completed.
    success_check: Callable[[dict], bool] = lambda state: False


@dataclass
class TrajectoryEval:
    attempted: set[str]
    completed: set[str]
    skipped_critical: set[str]
    replanning_events: int


def score_trajectory(
    trajectory: list[dict],
    canonical: dict[str, SubGoal],
    final_state: dict,
) -> TrajectoryEval:
    attempted: set[str] = set()
    completed: set[str] = set()
    last_plan: list[str] = []
    replans = 0

    for step in trajectory:
        # A replanning event is any step whose plan differs from the last plan seen.
        plan = step.get("plan", [])
        if last_plan and plan and plan != last_plan:
            replans += 1
        last_plan = plan or last_plan

        sg = step.get("sub_goal")
        if sg and sg in canonical:
            attempted.add(sg)
            # Success is judged against the final state, not the state at this step.
            if canonical[sg].success_check(final_state):
                completed.add(sg)

    critical = {sgid for sgid, sg in canonical.items() if sg.critical}
    skipped_critical = critical - attempted
    return TrajectoryEval(
        attempted=attempted,
        completed=completed,
        skipped_critical=skipped_critical,
        replanning_events=replans,
    )


def report(evals: list[TrajectoryEval], total_sub_goals: int) -> dict:
    n = len(evals)
    if n == 0 or total_sub_goals == 0:
        raise ValueError("need at least one trajectory and one sub-goal")
    denom = n * total_sub_goals
    return {
        "coverage_rate": sum(len(e.attempted) for e in evals) / denom,
        "completion_rate": sum(len(e.completed) for e in evals) / denom,
        "critical_path_skipped_rate": sum(1 for e in evals if e.skipped_critical) / n,
        "replanning_events_per_trace": sum(e.replanning_events for e in evals) / n,
    }
The extraction is the work, not the scoring. Some agents emit explicit plans as part of their CoT; for those, the plan graph is a parse. Most agents do not. For the rest, annotate which sub-goal each tool call serves; the annotation can be rule-based (search_orders maps to "Look up order") or judge-based for ambiguous calls.
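A rule-based annotator with a judge fallback might look like the sketch below. Only the `search_orders` mapping comes from the text; the other tool names, the `tool_call` key holding a bare tool name, and the `judge` callable are assumptions for illustration.

```python
from typing import Callable, Optional

# Tool name -> sub-goal label. Only the first entry is taken from the text;
# the other two tool names are hypothetical.
TOOL_TO_SUB_GOAL: dict[str, str] = {
    "search_orders": "Look up order",
    "get_refund_policy": "Check policy applies",
    "send_email": "Send confirmation",
}


def annotate(
    trajectory: list[dict],
    rules: dict[str, str] = TOOL_TO_SUB_GOAL,
    judge: Optional[Callable[[dict], Optional[str]]] = None,
) -> list[dict]:
    """Return a copy of each step with a "sub_goal" label attached.

    Rules are tried first; the judge (for example an LLM call) is consulted
    only when no rule matches. Steps nobody can label get None.
    """
    annotated = []
    for step in trajectory:
        sub_goal = rules.get(step.get("tool_call"))
        if sub_goal is None and judge is not None:
            sub_goal = judge(step)
        annotated.append({**step, "sub_goal": sub_goal})
    return annotated


steps = [{"tool_call": "search_orders"}, {"tool_call": "run_query"}]
rule_only = annotate(steps)
with_judge = annotate(steps, judge=lambda step: "Check policy applies")
```

Keeping the rules first means the judge is only paid for on ambiguous calls, and the share of steps left as None is itself a useful measure of how much of the trajectory the annotation covers.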
Agent-as-a-judge on the plan
A single-LLM judge tends to lose track of long trajectories. Zhuge et al.'s Agent-as-a-Judge framework gives the judge its own tools and lets it walk the trajectory step by step, checking each sub-goal independently [5]. On the DevAI benchmark of 55 realistic AI-dev tasks, this method came close to human-evaluator reliability where single-LLM judges fell off.
The pattern works well for plan-graph scoring because each node is a localised judgment. Was the order looked up correctly? Yes or no, with the tool result available. Was the refund eligibility decision consistent with policy? Yes or no, with the policy document and the prior state available. Aggregate the per-node judgments into the completion rate above.
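The aggregation step can be sketched independently of any particular judge. In the example below the judge is a plain callable, and `stub_judge` is a stand-in that only checks whether a tool result exists; a real setup would replace it with a tool-equipped model call. All names here are hypothetical.

```python
from typing import Callable

# (sub_goal_id, evidence for that node) -> did the sub-goal succeed?
Judge = Callable[[str, dict], bool]


def judge_nodes(evidence_by_node: dict[str, dict], judge: Judge) -> dict[str, bool]:
    """Ask the judge one localised yes/no question per attempted sub-goal."""
    return {node: judge(node, ev) for node, ev in evidence_by_node.items()}


def completion_rate(verdicts: dict[str, bool], total_sub_goals: int) -> float:
    """Share of canonical sub-goals the judge marked as completed."""
    if total_sub_goals <= 0:
        raise ValueError("total_sub_goals must be positive")
    return sum(verdicts.values()) / total_sub_goals


def stub_judge(node: str, evidence: dict) -> bool:
    # Stand-in for an LLM judge: passes any node that has a tool result.
    return evidence.get("tool_result") is not None


evidence = {
    "B": {"tool_result": {"order_id": 1}},
    "D": {"tool_result": None},
}
verdicts = judge_nodes(evidence, stub_judge)
rate = completion_rate(verdicts, total_sub_goals=4)
```

Sub-goals that never appear in the evidence count against the rate because the denominator is the canonical total, not the number of nodes judged.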
> [!IMPORTANT]
> The workflow-versus-agent decision sharpens when sub-goal completion is visible. If a workflow has 95% completion on critical-path sub-goals and the agent has 90% with more replanning events, the workflow wins until the gap is explainable [6].
What to do this week
- Pick five high-value workflows that already have trajectories logged. Annotate each with a canonical sub-goal graph; mark which sub-goals are on the critical path.
- Run an Agent-as-a-Judge prompt over the trajectories. Compute coverage rate, completion rate, critical-path-skipped rate, and replanning events per trace.
- Compare against your current single-outcome metric. Where the rankings differ is where the planning-vs-execution failure mode is hiding.
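For the comparison in the last step, a pairwise check is enough to surface where the two metrics disagree. The workflow names and scores below are made-up illustration data, and `rank` and `rank_disagreements` are hypothetical helpers.

```python
def rank(scores: dict[str, float]) -> list[str]:
    """Names ordered from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


def rank_disagreements(
    outcome: dict[str, float], completion: dict[str, float]
) -> list[tuple[str, str]]:
    """Pairs that the two metrics order in opposite directions."""
    names = sorted(outcome)
    flipped = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            outcome_gap = outcome[a] - outcome[b]
            completion_gap = completion[a] - completion[b]
            if outcome_gap * completion_gap < 0:  # opposite signs
                flipped.append((a, b))
    return flipped


# Illustrative numbers only, not measurements.
outcome_success = {"billing": 0.70, "refunds": 0.65, "returns": 0.60}
sub_goal_completion = {"billing": 0.80, "refunds": 0.90, "returns": 0.85}

by_outcome = rank(outcome_success)
by_completion = rank(sub_goal_completion)
flipped = rank_disagreements(outcome_success, sub_goal_completion)
```

Each flipped pair is a place to read traces by hand: one workflow is landing on the right outcome with weaker intermediate work, or the other is doing the intermediate work well and failing late.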