An LLM judge that reads a 4,000-token agent trajectory and emits one pass/fail bit is doing roughly the same work as a human reviewer skimming the last paragraph. It catches obvious failures and misses the rest. Agent-as-a-Judge changes the assignment: the judge is itself an agent, with its own tools, walking the trajectory step by step and producing a judgment per node.
Zhuge et al. introduced the framework on DevAI, a benchmark of 55 realistic AI-development tasks where multi-step solutions defeated single-LLM judges. The agent-judge approached human-evaluator reliability while the flat judges fell off 1. The agent-eval surveys both flag this as the direction trajectory-level scoring is heading, and both note that single judges are not reliable on long-horizon tasks even when the underlying model is strong 2 3.
Three failure modes show up reliably when a single LLM judges a trajectory.
The first is attention dilution. The judge attends to early tokens of the trajectory and the final answer, with weaker grip on the middle. A mistake in tool call three of seven is structurally easier to miss than a mistake in tool call one or seven. The longer the trajectory, the worse the dilution.
The second is criteria substitution. Asked to judge planning, tool use, and final answer simultaneously, the judge tends to grade the final answer and back-fill consistent scores for the others. JudgeBench's results on hard reasoning show the same pattern: judges agree with humans on surface judgments and revert to coin-flip on harder ones 4.
The third is missing context. The trajectory references a policy document, a database row, an API response. A single judge sees the references, not the referents. The judge cannot check whether the policy actually says what the agent claimed it said, and so the judge accepts the claim.
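To make the missing-context gap concrete, here is a minimal sketch of the check a flat judge cannot perform: fetching the referent and confirming the quoted text is actually in it. The `fetch_doc` lookup, the `POLICIES` store, and the substring-matching rule are all illustrative assumptions; a real check would likely need fuzzier matching.

```python
from typing import Callable, Optional


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def quote_is_supported(
    quote: str,
    doc_id: str,
    fetch_doc: Callable[[str], Optional[str]],  # hypothetical read-only lookup
) -> bool:
    """Return True only if the quoted text appears in the referenced document."""
    doc = fetch_doc(doc_id)
    if doc is None:  # the referent does not exist at all
        return False
    return normalize(quote) in normalize(doc)


# Hypothetical in-memory stand-in for a read-only policy store.
POLICIES = {"refund-policy": "Refunds are available within 30 days of purchase."}

supported = quote_is_supported(
    "refunds are available within 30 days", "refund-policy", POLICIES.get
)
unsupported = quote_is_supported(
    "refunds are available within 90 days", "refund-policy", POLICIES.get
)
```

A judge that only sees the trajectory text has no way to tell the second claim from the first.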
The agent-judge pattern
Give the judge a small set of read-only tools (lookups against the same sources the agent used) and a structured per-node rubric. The judge walks the trajectory, fetches the documents the agent quoted, checks the database rows the agent updated, and emits a verdict per sub-goal.
A minimal scaffold:

```python
"""
Agent-as-a-judge: walk a trajectory and grade each step with tool access.
"""
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    sub_goal: str
    thought: str
    tool_call: dict | None    # None when the step made no tool call
    observation: dict | None  # None when the step produced no observation


@dataclass
class StepVerdict:
    sub_goal: str
    plan_reasonable: bool
    tool_choice_correct: bool
    arguments_correct: bool
    output_used_correctly: bool
    rationale: str


# A judge function grades one step, given the shared judge context.
JudgeFn = Callable[[Step, dict], StepVerdict]


def judge_trajectory(
    steps: list[Step],
    judge_fn: JudgeFn,
    read_tools: dict[str, Callable],
) -> list[StepVerdict]:
    """
    For each step, the judge gets (a) the step itself and
    (b) read-only tools to verify claims, and returns
    a structured verdict.
    """
    judge_context = {"read_tools": read_tools, "history": []}
    verdicts = []
    for step in steps:
        v = judge_fn(step, judge_context)
        verdicts.append(v)
        # Later steps can see how earlier steps were graded.
        judge_context["history"].append((step, v))
    return verdicts


def aggregate(verdicts: list[StepVerdict]) -> dict:
    n = len(verdicts)
    if n == 0:
        # An empty trajectory fails; use the same key as the non-empty case.
        return {"overall_pass": False}
    return {
        "plan_reasonable_rate": sum(v.plan_reasonable for v in verdicts) / n,
        "tool_choice_correct_rate": sum(v.tool_choice_correct for v in verdicts) / n,
        "arguments_correct_rate": sum(v.arguments_correct for v in verdicts) / n,
        "output_used_correctly_rate": sum(v.output_used_correctly for v in verdicts) / n,
        # The trajectory passes only if every step passes on every axis.
        "overall_pass": all(
            v.plan_reasonable
            and v.tool_choice_correct
            and v.arguments_correct
            and v.output_used_correctly
            for v in verdicts
        ),
    }
```
The judge prompt for each step asks four independent binary questions. Hamel's binary-with-critique discipline applies as much to agent judging as to the single-turn case 5. Four bits per step beats one fuzzy score across the whole trajectory.
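One way the four-question prompt could look, as a sketch rather than a tested prompt. The wording, the JSON reply format, and the helper names `build_step_prompt` and `parse_step_verdict` are assumptions for illustration; the axis names mirror the `StepVerdict` fields. The parser rejects anything that is not a strict boolean, so a judge that answers "mostly" fails loudly instead of being coerced.

```python
import json
from typing import Optional

AXES = (
    "plan_reasonable",
    "tool_choice_correct",
    "arguments_correct",
    "output_used_correctly",
)


def build_step_prompt(
    sub_goal: str,
    thought: str,
    tool_call: Optional[dict],
    observation: Optional[dict],
) -> str:
    """Render one step as a prompt that asks four independent binary questions."""
    return (
        "You are grading ONE step of an agent trajectory.\n"
        f"Sub-goal: {sub_goal}\n"
        f"Thought: {thought}\n"
        f"Tool call: {json.dumps(tool_call)}\n"
        f"Observation: {json.dumps(observation)}\n"
        "Answer each question independently with true or false, "
        "then give a one-sentence critique.\n"
        "Reply with JSON only, using exactly these keys: "
        + ", ".join(AXES)
        + ", rationale."
    )


def parse_step_verdict(raw: str) -> dict:
    """Validate the judge's reply: four strict booleans plus a non-empty critique."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    for axis in AXES:
        if not isinstance(data.get(axis), bool):
            raise ValueError(f"axis {axis!r} must be a JSON boolean")
    rationale = data.get("rationale")
    if not isinstance(rationale, str) or not rationale.strip():
        raise ValueError("rationale must be a non-empty string")
    return {key: data[key] for key in (*AXES, "rationale")}


# Example reply, hand-written here in place of a real model call.
sample_reply = json.dumps({
    "plan_reasonable": True,
    "tool_choice_correct": True,
    "arguments_correct": False,
    "output_used_correctly": True,
    "rationale": "The order ID passed to the tool does not match the request.",
})
verdict = parse_step_verdict(sample_reply)
```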
Calibration is mandatory
The Agent-as-a-Judge paper achieved its reliability gains after calibration against human labels on DevAI 1. Skip calibration and you get a confident judge with no measurable agreement against the ground truth. Two minimums.
First, double-label twenty trajectories step by step with two domain experts. Compute Cohen's kappa per axis (plan, tool, arguments, output use). If kappa is below 0.6 on any axis, the rubric is underspecified for that axis and the judge cannot fix it.
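Per-axis kappa for two raters with binary labels is a few lines of arithmetic. The sketch below uses the standard two-rater Cohen's kappa formula; the axis names and labels are made-up illustrative data, and returning NaN for a no-variance axis (then flagging it) is one possible convention, not the only one.

```python
import math


def cohens_kappa(labels_a: list[bool], labels_b: list[bool]) -> float:
    """Cohen's kappa for two raters assigning binary labels to the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a_true = sum(labels_a) / n
    p_b_true = sum(labels_b) / n
    # Chance agreement: both say True, or both say False.
    p_expected = p_a_true * p_b_true + (1 - p_a_true) * (1 - p_b_true)
    if p_expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return math.nan
    return (p_observed - p_expected) / (1 - p_expected)


def axes_below_threshold(kappas: dict[str, float], threshold: float = 0.6) -> list[str]:
    """Axes whose kappa is not demonstrably at or above the threshold (NaN is flagged)."""
    return [axis for axis, k in kappas.items() if not (k >= threshold)]


# Illustrative labels from two hypothetical experts on ten steps.
labels = {
    "plan": (
        [True, True, False, False, True, False, True, True, False, True],
        [True, True, False, False, True, False, True, False, False, True],
    ),
    "arguments": (
        [True, True, False, False, True, False, True, True, False, True],
        [True, False, False, True, True, False, False, True, True, True],
    ),
}
kappas = {axis: cohens_kappa(a, b) for axis, (a, b) in labels.items()}
needs_rubric_work = axes_below_threshold(kappas)
```

With these labels the plan axis clears 0.6 and the arguments axis does not, so only the arguments rubric would go back for revision.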
Second, after the rubric is stable, calibrate the judge against the human labels. Iterate until per-axis TPR and TNR clear 0.9. Frontier models can hit this on plan and tool-choice; arguments and output-use often need a second pass.
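The per-axis TPR and TNR check can be sketched as below. It assumes "positive" means the human marked the axis as passed for that step, which is a labeling convention chosen here for illustration; the sample labels are invented.

```python
def tpr_tnr(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """True-positive and true-negative rates of the judge against human labels."""
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must be the same length")
    positives = sum(human)
    negatives = len(human) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative human label")
    true_pos = sum(h and j for h, j in zip(human, judge))
    true_neg = sum((not h) and (not j) for h, j in zip(human, judge))
    return true_pos / positives, true_neg / negatives


def axes_needing_iteration(
    per_axis: dict[str, tuple[list[bool], list[bool]]],
    threshold: float = 0.9,
) -> list[str]:
    """Axes where either TPR or TNR falls below the threshold."""
    failing = []
    for axis, (human, judge) in per_axis.items():
        tpr, tnr = tpr_tnr(human, judge)
        if tpr < threshold or tnr < threshold:
            failing.append(axis)
    return failing


# Illustrative data: the judge matches humans on "plan" but not on "arguments".
calibration = {
    "plan": ([True, True, False, False], [True, True, False, False]),
    "arguments": (
        [True, True, True, True, False, False],
        [True, True, True, False, False, True],
    ),
}
to_iterate = axes_needing_iteration(calibration)
```

Reporting TPR and TNR separately matters because a judge that passes everything scores a perfect TPR while catching no failures.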
Where agent-judges still fail
JudgeBench's results on hard reasoning carry into the agent setting 4. If a sub-goal hinges on a mathematical or logical claim the judge cannot independently verify, the agent-judge is at coin-flip on that node, regardless of the rubric. For those sub-goals, use executable verification: a sandbox that runs the tool call and checks the resulting state. The judge can grade everything around the executable check; the check itself is the source of truth.
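A toy version of executable verification, assuming state can be modeled as a plain dict and tools as Python functions. The `TOOLS` registry, the `update_balance` tool, and the `tool_call` shape (`name` plus `arguments`) are hypothetical; a real sandbox would isolate the process and the data store rather than deep-copy a dict.

```python
import copy
from typing import Callable


def update_balance(state: dict, account: str, delta: int) -> None:
    """Hypothetical tool: adjust an account balance in place."""
    state["balances"][account] += delta


# Hypothetical registry mapping tool names to implementations.
TOOLS: dict[str, Callable[..., None]] = {"update_balance": update_balance}


def executable_check(tool_call: dict, pre_state: dict, expected_post_state: dict) -> bool:
    """Replay the tool call on a copy of the state and compare to the expected result."""
    sandbox_state = copy.deepcopy(pre_state)  # never touch the real state
    tool = TOOLS.get(tool_call["name"])
    if tool is None:
        return False
    try:
        tool(sandbox_state, **tool_call["arguments"])
    except Exception:
        # A call that crashes cannot have produced the expected state.
        return False
    return sandbox_state == expected_post_state


pre = {"balances": {"alice": 100}}
expected = {"balances": {"alice": 70}}

correct_call = {"name": "update_balance", "arguments": {"account": "alice", "delta": -30}}
wrong_call = {"name": "update_balance", "arguments": {"account": "alice", "delta": -3}}

correct_ok = executable_check(correct_call, pre, expected)
wrong_ok = executable_check(wrong_call, pre, expected)
```

The boolean from the check can then stand in for the judge's `arguments_correct` verdict on that node, leaving the judge to grade the plan and the use of the output.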
A second failure mode: environmental drift. TRAJECT-Bench's results show that models rank differently once the environment includes realistic adversariality (latency, error injection, response-field changes) 6. A judge calibrated on the happy path scores recovery behaviours inconsistently. Calibrate on the adversarial path too, or scope the judge to the happy path and report environmental robustness separately.
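One way to keep the two regimes apart in reporting is to tag each labeled step with the environment condition it ran under and compute judge-human agreement per condition. The record fields (`condition`, `human`, `judge`) and the condition names are assumptions for this sketch, and the records are invented.

```python
from collections import defaultdict


def agreement_by_condition(records: list[dict]) -> dict[str, float]:
    """Fraction of steps where the judge matches the human label, per condition."""
    totals: dict[str, int] = defaultdict(int)
    matches: dict[str, int] = defaultdict(int)
    for record in records:
        condition = record["condition"]
        totals[condition] += 1
        matches[condition] += int(record["human"] == record["judge"])
    return {condition: matches[condition] / totals[condition] for condition in totals}


# Illustrative labeled steps: the judge is reliable on the happy path only.
records = [
    {"condition": "happy_path", "human": True, "judge": True},
    {"condition": "happy_path", "human": False, "judge": False},
    {"condition": "happy_path", "human": True, "judge": True},
    {"condition": "happy_path", "human": True, "judge": True},
    {"condition": "error_injection", "human": True, "judge": False},
    {"condition": "error_injection", "human": False, "judge": False},
    {"condition": "error_injection", "human": True, "judge": False},
    {"condition": "error_injection", "human": False, "judge": True},
]
agreement = agreement_by_condition(records)
```

A single pooled agreement number over these records would hide the gap between the two conditions.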
What this replaces
The agent-judge does not replace human review or executable verification. It replaces three patterns that scale poorly.
- A single-LLM judge that reads the trajectory and outputs one bit. The agent-judge produces structured per-step verdicts that aggregate into the same bit plus diagnostic detail.
- A row of separate flat judges (one per axis) that disagree because they cannot see each other's reasoning. The agent-judge produces coherent per-step judgments and aggregates per axis.
- A purely outcome-based metric that ranks a trajectory with five correct sub-goals and one wrong action identically to a trajectory that fell apart at step one.
[!IMPORTANT]
An agent-judge is an agent. It can be biased, slow, and reward-hackable. Hold it to the same calibration discipline as any judge: held-out human labels, TPR and TNR per axis, recalibration when the rubric shifts 5.
What to do this week
- Pick the failure cluster on which your current judge has the lowest agreement against human labels. Build a step-level rubric with three to five binary axes.
- Give the judge read-only access to the documents and database snapshots the agent had. Wire the prompt to walk the trajectory one step at a time.
- Double-label twenty trajectories with two SMEs, compute kappa per axis, iterate the rubric until kappa clears 0.6, then iterate the judge until per-axis TPR and TNR clear 0.9.