Long-horizon planning and agentic loops

Multi-step decomposition, replanning, recovery. The agent works for minutes. Pass^k on real workflows is the metric that surfaces consistency, which pass@1 alone hides.

What "good" looks like

A long-horizon agent eval scores whether the agent succeeds consistently, not once. The right primary metric is pass^k: the probability that the agent succeeds on every one of k independent trials of the same task. Pass@1 is a vanity metric here, because real users do not retry forever.

What to report before shipping, on a fixed task set:

Report pass^k with k chosen to match the user experience. If a real user gives up after four attempts at the same goal, evaluate at k=4 or higher.
Report pass@1 alongside pass^k for the same task set. A fully consistent agent scores the same on both, so if pass^k is far below pass@1, the agent is inconsistent in production-meaningful ways.
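A minimal sketch of how the two numbers can be computed from raw trial counts. It assumes n trials per task with c recorded successes and uses the per-task estimate C(c, k) / C(n, k), averaged over tasks, as defined in the tau-bench paper^[1]. The function names and the sample counts are illustrative, not part of any harness.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Per-task estimate of pass^k: chance that k trials drawn from the run all passed."""
    if not 0 < k <= trials:
        raise ValueError("k must satisfy 0 < k <= trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    # comb(successes, k) is 0 when successes < k, so the estimate drops to 0.
    return comb(successes, k) / comb(trials, k)


def suite_pass_hat_k(success_counts: list[int], trials: int, k: int) -> float:
    """Average the per-task estimate over the task set."""
    return sum(pass_hat_k(c, trials, k) for c in success_counts) / len(success_counts)


# Illustrative counts only: successes out of 8 trials on four tasks.
counts = [8, 8, 6, 0]
pass_at_1 = suite_pass_hat_k(counts, trials=8, k=1)  # k=1 reduces to the mean success rate
pass_hat_4 = suite_pass_hat_k(counts, trials=8, k=4)
print(f"pass@1={pass_at_1:.3f}  pass^4={pass_hat_4:.3f}")
```

Running more trials than k (here 8 trials for k=4) lets one run report several values of k from the same data.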

Why pass@1 misleads. Two agents can both hit 60% pass@1, and one can be production-ready while the other is not. The single number hides how often the agent repeats a success.
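A worked illustration of that claim, under the assumption that trials are independent and each task has a fixed success probability. Both agents and their probabilities are invented for the example.

```python
K = 8

# Hypothetical per-task success probabilities for two agents on ten tasks.
agent_a = [1.0] * 6 + [0.0] * 4  # solves six tasks every time, four never
agent_b = [0.6] * 10             # solves every task 60% of the time


def pass_at_1(task_probs: list[float]) -> float:
    """Mean single-trial success rate over the task set."""
    return sum(task_probs) / len(task_probs)


def pass_pow_k(task_probs: list[float], k: int) -> float:
    """Expected pass^k when trials are independent: mean of p**k over tasks."""
    return sum(p ** k for p in task_probs) / len(task_probs)


for name, probs in (("A", agent_a), ("B", agent_b)):
    print(f"agent {name}: pass@1={pass_at_1(probs):.2f}  pass^{K}={pass_pow_k(probs, K):.3f}")
```

Both agents score 0.60 on pass@1. Agent A keeps 0.60 at pass^8, while agent B falls to 0.6^8, roughly 0.017.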

The evidence. The Sierra tau-bench paper introduces pass^k precisely because it surfaces inconsistency that pass@1 hides. On retail tasks, the headline result is that GPT-4o pass^8 is below 25%, despite a much higher pass@1^[1]. That gap between a strong single-shot score and a weak repeated-success score is the signal to fix.

flowchart TD
    TASK["Multi-step user goal"] --> RUNS["Run k=5 or k=8 trials"]
    RUNS --> STATE["Deterministic state check per trial"]
    STATE --> P1["pass@1: mean single-trial success rate"]
    STATE --> PK["pass^k: every trial succeeds"]
    P1 --> GATE{"pass^k far below pass@1?"}
    PK --> GATE
    GATE -->|yes| FIX["Fix inconsistency: the bug"]
    GATE -->|no| SHIP["Ship candidate"]

Secondary metrics

Metric	What it catches	How to score
Steps to completion	Bloated or looping trajectories	Median and 90th percentile turns per successful task
Replanning frequency	Instability in the plan	How often the agent changes its declared plan mid-execution
Error attribution	Where the failures come from	The tau-bench framework distinguishes agent errors, tool errors, and user-side errors^[1]
Cost per successful task	Economically different products at the same pass rate	A 90% pass^k at $4 per task is a different product from a 60% pass^k at $0.40
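Two of the rows above reduce to short aggregations over per-attempt records. The sketch below is one way to compute them. The `Attempt` record and its fields are hypothetical, the sample data is made up, and charging failed attempts to the cost of each success is an assumption about how the metric should be defined.

```python
from dataclasses import dataclass
from statistics import median, quantiles


@dataclass(frozen=True)
class Attempt:
    passed: bool     # result of the deterministic state check
    turns: int       # agent turns taken in this attempt
    cost_usd: float  # model plus tool spend for this attempt


def steps_to_completion(attempts: list[Attempt]) -> tuple[float, float]:
    """Median and 90th-percentile turns over successful attempts (needs at least two)."""
    turns = [a.turns for a in attempts if a.passed]
    # quantiles(n=10) returns nine cut points; the last is the 90th percentile.
    p90 = quantiles(turns, n=10, method="inclusive")[-1]
    return median(turns), p90


def cost_per_successful_task(attempts: list[Attempt]) -> float:
    """Total spend, failed attempts included, divided by the number of passes."""
    passes = sum(a.passed for a in attempts)
    if passes == 0:
        raise ValueError("no successful attempts to divide by")
    return sum(a.cost_usd for a in attempts) / passes


# Made-up sample data.
attempts = [
    Attempt(True, 4, 0.30),
    Attempt(True, 5, 0.35),
    Attempt(True, 6, 0.40),
    Attempt(True, 7, 0.45),
    Attempt(True, 12, 0.90),
    Attempt(False, 20, 1.60),
]
median_turns, p90_turns = steps_to_completion(attempts)
unit_cost = cost_per_successful_task(attempts)
```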

Offline eval design

Standard benchmarks first. tau-bench retail and airline are the closest things to a standard for this category. Run both. AgentBench is broader and covers 8 environments including a SQL agent, an OS agent, and a web shopping agent^[2]; pick the subset that matches your product.

Then a domain harness. Build 20 to 50 of the most common multi-step user goals, each encoded as a deterministic state-check function (for example, did the row appear in the CRM with the expected fields). State checks are what make the eval reusable; LLM-as-a-judge (LaaJ) scoring of the final agent message is too noisy at this scope.
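A state check of this kind can be very small. The sketch below assumes the harness can read the final CRM state as a dictionary keyed by email. The `CrmRow` fields and the function name are hypothetical stand-ins for whatever your system of record exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrmRow:
    email: str
    plan: str
    seats: int


def check_contact_created(final_state: dict[str, CrmRow], expected: CrmRow) -> bool:
    """Pass only if the row exists and every field matches; no LLM involved."""
    return final_state.get(expected.email) == expected


expected = CrmRow(email="ana@example.com", plan="team", seats=5)

# The agent created the row but with the wrong seat count, so the check fails.
wrong_state = {"ana@example.com": CrmRow("ana@example.com", "team", 3)}
right_state = {"ana@example.com": expected}
```

Comparing whole rows rather than checking that a row merely exists guards against runs that complete but write the wrong values.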

Run and report. Run k=5 or k=8 trials per task and report pass@1 and pass^k together. The variance across trials is the eval; the headline is the summary.
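The run-and-report loop might look like the following sketch. It assumes exactly k trials per task, in which case pass^k is the fraction of tasks where all k trials passed. The two stand-in agents are deliberately trivial placeholders for a real agent call, and all names are illustrative.

```python
from typing import Callable

State = dict[str, bool]


def run_trials(run_agent: Callable[[int], State],
               check: Callable[[State], bool], k: int) -> list[bool]:
    """Run k trials of one task and apply the state check to each final state."""
    return [check(run_agent(trial)) for trial in range(k)]


def report(outcomes_by_task: dict[str, list[bool]]) -> dict[str, float]:
    """Summarize per-task outcomes as pass@1 and pass^k side by side."""
    tasks = list(outcomes_by_task.values())
    pass_at_1 = sum(sum(o) / len(o) for o in tasks) / len(tasks)
    pass_pow_k = sum(all(o) for o in tasks) / len(tasks)
    return {"pass@1": pass_at_1, "pass^k": pass_pow_k}


# Placeholder agents: one always finishes, one finishes on even-numbered trials only.
def steady_agent(trial: int) -> State:
    return {"done": True}


def flaky_agent(trial: int) -> State:
    return {"done": trial % 2 == 0}


def is_done(state: State) -> bool:
    return state["done"]


outcomes = {
    "update-address": run_trials(steady_agent, is_done, k=5),
    "refund-order": run_trials(flaky_agent, is_done, k=5),
}
summary = report(outcomes)
```

Keeping the per-task outcome lists, not just the summary, preserves the trial-level variance that the headline numbers compress.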

Online eval design

Capture full trajectories. Log every tool call, observation, and intermediate plan snippet, with state snapshots at known checkpoints.
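One possible shape for such a record, sketched with standard-library dataclasses. The field names are assumptions, and the JSON string stands in for a row in whatever trace store you use.

```python
import copy
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Step:
    tool: str
    arguments: dict
    observation: str
    plan_snippet: str = ""


@dataclass
class Trajectory:
    task_id: str
    attempt: int
    steps: list[Step] = field(default_factory=list)
    checkpoints: dict[str, dict] = field(default_factory=dict)

    def log_step(self, tool: str, arguments: dict, observation: str,
                 plan_snippet: str = "") -> None:
        self.steps.append(Step(tool, arguments, observation, plan_snippet))

    def snapshot(self, name: str, state: dict) -> None:
        # Deep copy so later mutations of live state do not rewrite history.
        self.checkpoints[name] = copy.deepcopy(state)

    def to_row(self) -> str:
        """Serialize the whole attempt as one JSON row."""
        return json.dumps(asdict(self))
```

The deep copy in `snapshot` is the design choice that matters: a checkpoint that aliases live state cannot be used to replay a failed run.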

Sample for review by three stratifications:

Outcome bucket (success, partial, failure).
Steps-decile (anomalously long trajectories often hide loops).
Any tool error, which is a leading indicator of pipeline drift.
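The three stratifications can be combined into a single sampling key. The sketch below is one reading of the list above, not a prescribed method: it crosses all three dimensions and draws a fixed number of trajectories per stratum. The dictionary keys (`outcome`, `steps`, `tool_error`) are assumed field names.

```python
import random
from collections import defaultdict


def steps_decile(rank: int, total: int) -> int:
    """Map a 0-based rank in the steps ordering to a decile from 0 to 9."""
    return min(9, rank * 10 // total)


def sample_for_review(trajectories: list[dict], per_stratum: int, seed: int = 0) -> list[dict]:
    """Draw up to per_stratum trajectories from each (outcome, decile, tool_error) stratum."""
    rng = random.Random(seed)  # fixed seed keeps the review queue reproducible
    ordered = sorted(trajectories, key=lambda t: t["steps"])
    strata: dict[tuple, list[dict]] = defaultdict(list)
    for rank, t in enumerate(ordered):
        key = (t["outcome"], steps_decile(rank, len(ordered)), t["tool_error"])
        strata[key].append(t)
    picked: list[dict] = []
    for key in sorted(strata):
        group = strata[key]
        picked.extend(rng.sample(group, min(per_stratum, len(group))))
    return picked
```

Crossing the dimensions produces many small strata. If review capacity is tight, sampling along each dimension separately is a reasonable alternative.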

Reserve judges for samples. LaaJ on the plan or critique is expensive; use it for sample reviews, not full traffic.

Alert on drift. Flag cost-per-task drift greater than 20% week over week, and retry-loop patterns (the same tool with the same arguments three times in a row).
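Both alerts are simple enough to sketch directly. The thresholds come from the text above. The function names and the choice to treat a cost drop as drift too are assumptions.

```python
import json


def cost_drift_alert(last_week_cost: float, this_week_cost: float,
                     threshold: float = 0.20) -> bool:
    """True when cost per task moved more than threshold week over week, in either direction."""
    if last_week_cost <= 0:
        raise ValueError("last_week_cost must be positive")
    return abs(this_week_cost - last_week_cost) / last_week_cost > threshold


def has_retry_loop(tool_calls: list[tuple[str, dict]], repeats: int = 3) -> bool:
    """True when the same tool is called with the same arguments `repeats` times in a row."""
    run = 0
    previous = None
    for tool, arguments in tool_calls:
        # Canonical JSON makes argument dicts comparable regardless of key order.
        signature = (tool, json.dumps(arguments, sort_keys=True))
        run = run + 1 if signature == previous else 1
        if run >= repeats:
            return True
        previous = signature
    return False


# Illustrative trace: the agent retries the same search three times.
looping = [("search", {"q": "order 17"})] * 3 + [("refund", {"order": 17})]
```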

Top benchmarks

tau-bench, https://github.com/sierra-research/tau-bench. Retail and airline; introduces the pass^k metric and a user-simulator strategy.
AgentBench, https://github.com/THUDM/AgentBench. 8 environments; useful for breadth.
GAIA, https://arxiv.org/abs/2311.12983. 466 questions designed for general assistants; humans hit 92%, GPT-4 with plugins around 15% in the original paper^[3].

Tools especially suited

tau-bench harness: pass^k on retail and airline tasks.
AgentBench: cross-environment breadth.
Inspect AI: research-grade trace capture with strong sandbox primitives.
Langfuse or Braintrust: production trajectory logging and trace search; both have annotation queues that work for the sample-review pattern.
OSWorld: sometimes used in this category, though it is browser- and OS-focused enough to merit its own playbook^[4].

Common pitfalls

Pitfall	Why it bites	Fix
Reporting only pass@1	The most common error in this category; it masks the inconsistency that makes agents fragile in production	Report pass^k; tau-bench's whole framing exists because of this gap^[1]
No state checkpointing	If you cannot replay a failed trajectory from a known mid-state, your eval is a black box	Snapshot state at known checkpoints so a failed run is replayable
Conflating "completed" with "completed correctly"	Many agents end with confident success messages that are wrong	Use a deterministic state check as the only pass signal
Letting LaaJ-on-trajectory replace deterministic verifiers	Judges of long multi-step traces have high variance and unstable cost	Use them on samples only

Canonical reading

tau-bench is the most useful single read because it both defines the metric this task needs and instruments a credible user simulator alongside it^[1]. The paper's conclusion is direct: "improving consistency and rule-following remains essential for practical agent applications". That sentence is the operational thesis of this whole playbook.

Day-1 starter

Encode 20 of your most common multi-step user goals as state-check functions. The function takes the agent's final state and returns pass or fail; no LLM, no judge.
Log every tool call, observation, and plan snippet to one trajectory row per attempt. Snapshot state at known checkpoints so a failed run is replayable.
Run k=5 trials per task on every release candidate; report pass@1 and pass^5 side by side. If they diverge, the agent's inconsistency is the bug to fix.

Cross-cutting links

Pass^k: the metric that catches inconsistent agents: the primary metric, derived properly.
Trajectory-level evaluation: what to score between the first tool call and the final state.
Agent trajectory evaluation: the technique reference for step-level scoring.
Langfuse: the trace store for one-row-per-trajectory logging.
Inspect AI: the harness for deterministic state-check evals.
Agent eval with τ-bench: start here; it runs the pass^k math this page recommends.