A regression suite for a single-turn classifier is straightforward: same inputs, same expected outputs, alert on divergence. For an agent, the same playbook breaks. The same input under different sampling can take different but equally correct trajectories. A change in trajectory may be progress (the model learned a faster path) or regression (the model lost a guardrail). The harness has to surface both without forcing every diff into a binary pass/fail.
The working pattern is to pick a fixed set of production trajectories, freeze them as a replay set, and report three diffs per replay: outcome match, step-count delta, tool-call sequence diff. Each diff is independently actionable, and taken together they tell you whether the change is safe to merge.
What to put in the replay set
Pull from production, not synthetic. Synthetic trajectories miss the long tail; production trajectories are the tail. A useful default is fifty trajectories drawn from the top failure clusters identified during error analysis, plus ten "happy-path" canonical trajectories per workflow as a sanity floor 1.
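The selection rule above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the record keys (`id`, `workflow`, `outcome`, `failure_cluster`), the `top_k_clusters` cutoff, and uniform sampling within the top clusters are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def build_replay_set(trajectories, n_failures=50, n_happy_per_workflow=10,
                     top_k_clusters=5, seed=0):
    """Pick failures from the largest clusters plus a happy-path floor.

    `trajectories` is a list of dicts with hypothetical keys: 'id',
    'workflow', 'outcome', and 'failure_cluster' (None for successes).
    """
    rng = random.Random(seed)  # fixed seed keeps the replay set reproducible

    # Rank failure clusters by size and keep only the largest ones.
    failures = [t for t in trajectories if t.get("failure_cluster")]
    counts = Counter(t["failure_cluster"] for t in failures)
    top = {name for name, _ in counts.most_common(top_k_clusters)}
    pool = [t for t in failures if t["failure_cluster"] in top]
    picked = rng.sample(pool, min(n_failures, len(pool)))

    # Sanity floor: a few successful trajectories per workflow.
    by_workflow = defaultdict(list)
    for t in trajectories:
        if t["outcome"] == "success":
            by_workflow[t["workflow"]].append(t)
    for workflow in sorted(by_workflow):
        items = by_workflow[workflow]
        picked.extend(rng.sample(items, min(n_happy_per_workflow, len(items))))
    return picked
```

Fixing the seed matters more than the exact sampling scheme: the replay set has to stay frozen between runs for the diffs to mean anything.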
For each trajectory in the replay set, store:
- The full input (user message, any system context).
- The trajectory as logged (steps with thoughts, tool calls, observations).
- The final state of every mutated external system at the time of capture.
- The user-visible outcome (success, failure, escalation).
- Annotations: which sub-goals were on the critical path, which actions were considered correct.
The annotations are the work. Without them the harness can only report "the trajectory changed"; with them it can report "the trajectory changed and the critical sub-goal is still completed" or "the trajectory changed and the critical sub-goal is now skipped."
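One way to hold the fields listed above is a small frozen record that serialises to one JSON line per trajectory. The class name, field names, and the JSONL choice are assumptions for illustration; the fields themselves mirror the list.

```python
import json
from dataclasses import asdict, dataclass, field

# The three user-visible outcomes named in the text.
OUTCOMES = {"success", "failure", "escalation"}


@dataclass(frozen=True)
class ReplayRecord:
    trajectory_id: str
    inputs: dict       # user message plus any system context
    steps: list        # logged steps: thought, tool call, observation
    final_state: dict  # snapshot of every mutated external system
    outcome: str       # one of OUTCOMES
    critical_subgoals: list = field(default_factory=list)  # annotation
    correct_actions: list = field(default_factory=list)    # annotation

    def __post_init__(self):
        if self.outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome!r}")

    def is_annotated(self) -> bool:
        # Without critical sub-goals the harness can only say "it changed".
        return bool(self.critical_subgoals)


def to_jsonl_line(record: ReplayRecord) -> str:
    return json.dumps(asdict(record), sort_keys=True)


def from_jsonl_line(line: str) -> ReplayRecord:
    return ReplayRecord(**json.loads(line))
```

A check such as `is_annotated()` is useful as a gate when loading the replay set, so that unannotated trajectories are flagged rather than silently reported as bare "trajectory changed" diffs.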
The three diff axes
For each replay, run the candidate agent against the same input and a sandboxed copy of the external systems.
Outcome diff. Did the candidate reach the same user-visible outcome? This is the answer-regression bit. A change here is a hard regression candidate and needs review before merge.
Step-count delta. How many steps did the candidate take compared to the original? A delta at or near zero is normal noise; a delta of plus-three on a four-step trajectory is a planning change worth investigating. A consistently negative delta across the suite is often a real improvement (the new model needs fewer rounds); a consistently positive delta is often a regression in tool selection or self-correction.
Tool-call sequence diff. Did the candidate call the same tools in the same order? This is the most signal-rich diff and the noisiest. Use it as a starting point for review, not as a pass/fail gate. The Inspect AI harness exposes trajectory replay with structured per-step diffs out of the box 2; Phoenix, Langfuse, and Braintrust each support similar workflows with their own ergonomics 3 4 5.

```python
"""
Replay a candidate agent against a frozen trajectory and compute the
three diffs. No external deps; assumes a runner that returns
(outcome, steps, final_state).
"""
from dataclasses import dataclass
from statistics import median


@dataclass
class TrajectoryDiff:
    outcome_match: bool
    step_count_delta: int
    tool_sequence_match: bool
    critical_subgoals_intact: bool


def tool_seq(steps: list[dict]) -> tuple[str, ...]:
    # Ordered tool names; steps without a tool call are skipped.
    return tuple(s["tool_call"]["name"] for s in steps if s.get("tool_call"))


def diff(
    candidate_steps: list[dict],
    candidate_outcome: str,
    candidate_state: dict,
    frozen_steps: list[dict],
    frozen_outcome: str,
    frozen_state: dict,
    critical_subgoals: list[str],
) -> TrajectoryDiff:
    # candidate_state and frozen_state are accepted so callers can pass the
    # runner's full result; this sketch does not compare them.
    seq_cand = tool_seq(candidate_steps)
    seq_froz = tool_seq(frozen_steps)
    # Sub-goals the candidate actually touched, as tagged on its steps.
    cand_goals = {s.get("sub_goal") for s in candidate_steps if s.get("sub_goal")}
    return TrajectoryDiff(
        outcome_match=(candidate_outcome == frozen_outcome),
        step_count_delta=len(candidate_steps) - len(frozen_steps),
        tool_sequence_match=(seq_cand == seq_froz),
        critical_subgoals_intact=all(g in cand_goals for g in critical_subgoals),
    )


def summarise(diffs: list[TrajectoryDiff]) -> dict:
    if not diffs:
        raise ValueError("summarise() needs at least one diff")
    n = len(diffs)
    return {
        "outcome_match_rate": sum(d.outcome_match for d in diffs) / n,
        "critical_subgoal_match_rate": sum(d.critical_subgoals_intact for d in diffs) / n,
        "tool_seq_match_rate": sum(d.tool_sequence_match for d in diffs) / n,
        # statistics.median averages the two middle values when n is even.
        "median_step_count_delta": median(d.step_count_delta for d in diffs),
        "max_step_count_delta": max(d.step_count_delta for d in diffs),
    }
```
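The `tool_sequence_match` flag in the harness above only says that two sequences differ, not where. For human review it helps to show the edit operations between them. The following is a minimal sketch using the standard library's `difflib.SequenceMatcher`; the function name `tool_sequence_ops` and the tool names in the usage note are hypothetical.

```python
from difflib import SequenceMatcher


def tool_sequence_ops(frozen, candidate):
    """Return the non-equal edit operations that turn `frozen` into `candidate`.

    Both arguments are ordered sequences of tool names. Each result item is
    (tag, frozen_slice, candidate_slice), where tag is one of
    'replace', 'delete', or 'insert'.
    """
    # autojunk=False disables the popularity heuristic, which is only
    # relevant for long sequences but is safer to switch off explicitly.
    matcher = SequenceMatcher(a=frozen, b=candidate, autojunk=False)
    return [
        (tag, list(frozen[i1:i2]), list(candidate[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

With a frozen sequence of `lookup_order`, `check_policy`, `issue_refund` and a candidate that goes straight from `lookup_order` to `issue_refund`, the result is a single `delete` operation naming `check_policy`, which is exactly the kind of skipped step a reviewer needs to see first.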
Each of the major observability platforms has converged on something close to this pattern; the differences are ergonomics, not concept.
- Inspect AI ships trajectory replay with sandboxed Docker and over 200 pre-built evals (GAIA, SWE-bench, Cybench), and is the canonical pick for offline regression harnesses 2.
- Phoenix exposes span-level annotation and agent dashboards with parent and child relationships, which makes step-level diffs visualisable in the UI 3.
- Braintrust turns production traces into eval cases with one click, which lowers the cost of growing the replay set as new failure clusters emerge.
- Langfuse maintains hierarchical traces and integrates with its prompt and dataset stores, useful when the replay set is co-versioned with the rest of the data 4.
None of these tools relieve you of the annotation work. They make the harness cheap; the annotated replay set is still yours to build.
What constitutes a regression
Not every diff is a regression. Three rules of thumb that hold up in practice.
- Outcome mismatch on a critical-path trajectory is always a regression. The candidate has to win review explicitly before merge.
- Tool-sequence mismatch with outcome match and critical-subgoal match is a planning change. Worth investigation, not blocking. Often a model upgrade or a prompt change has discovered a shorter path; sometimes it has discovered a path that skips a safety check.
- A consistent step-count delta of more than two on a suite of roughly five-step trajectories, with outcome and sub-goals intact, often indicates the candidate is over-tooling. This is a cost regression even when the quality holds.
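The three rules can be written down as a small triage function so that PR tooling applies them consistently. This is a sketch under stated assumptions: every trajectory in the replay set is treated as critical-path, a missing critical sub-goal with a matching outcome is routed to a separate `safety_review` verdict (the rules above do not name this case explicitly), and "consistent" is read as the suite median exceeding the threshold.

```python
from statistics import median


def triage(outcome_match, critical_subgoals_intact, tool_sequence_match):
    """Map one replay's diff flags to a review verdict (hypothetical labels)."""
    if not outcome_match:
        return "block"            # rule 1: must win review explicitly
    if not critical_subgoals_intact:
        return "safety_review"    # assumption: a skipped sub-goal needs a human
    if not tool_sequence_match:
        return "planning_change"  # rule 2: investigate, do not block
    return "pass"


def is_cost_regression(step_count_deltas, threshold=2):
    """Rule 3, applied across replays whose outcome and sub-goals held.

    Assumption: 'consistent' means the median delta exceeds the threshold.
    """
    if not step_count_deltas:
        return False
    return median(step_count_deltas) > threshold
```

Keeping the per-replay verdict separate from the suite-level cost check reflects the rules themselves: the first two are about a single trajectory, the third only makes sense across many.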
> [!IMPORTANT]
> A trajectory replay where outcomes match but tool sequences diverge across half the suite is one of three things: a real improvement, a real safety regression, or a sampling artefact. Investigate; do not pass it on autopilot 6.
Environmental robustness in the replay
Run a fraction of the replays under perturbation. Inject latency or a 429 on one critical-path tool and check whether the candidate still produces a graceful failure. Environmental drift is the single largest source of "passed CI, failed production" for agent products, and pairing the replay suite with the environmental-drift perturbations is the cheapest way to expose it before merge 6.
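A perturbation of this kind can be injected by wrapping the sandboxed tool function before handing it to the candidate. The sketch below is illustrative only: `RateLimited` is a stand-in exception for an HTTP 429, and `perturb` and its parameters are hypothetical names, not part of any harness mentioned here.

```python
import time


class RateLimited(Exception):
    """Stand-in for an HTTP 429 from a tool backend (hypothetical)."""


def perturb(tool_fn, *, fail_on_calls=(), latency_s=0.0, sleep=time.sleep):
    """Wrap a tool so selected calls are delayed and/or fail with a 429.

    fail_on_calls: 1-based call numbers that should raise RateLimited.
    latency_s:     delay added before every call.
    sleep:         injectable so tests do not have to wait in real time.
    """
    calls = {"n": 0}

    def wrapped(*args, **kwargs):
        calls["n"] += 1
        if latency_s:
            sleep(latency_s)
        if calls["n"] in fail_on_calls:
            raise RateLimited("429 Too Many Requests (injected)")
        return tool_fn(*args, **kwargs)

    return wrapped
```

Failing only the first call, for example `perturb(tool, fail_on_calls={1})`, separates two behaviours worth checking: whether the candidate retries sensibly, and whether it degrades to a graceful failure when the retry budget runs out.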
What to do this week
- Pull fifty production trajectories from the top failure clusters surfaced by error analysis. Annotate critical-path sub-goals.
- Stand up the harness. Inspect AI is the fastest path; the major observability platforms each support the workflow.
- Wire the three diffs into PR review. Outcome mismatch blocks merge; tool-sequence diffs surface for human review with the trajectory rendered side by side.