What "good" looks like
For agents that run more than a single turn, pass@1 is a vanity metric. Two agents can both hit 60% pass@1 while only one is production-ready: one may solve the same 60% of tasks every time, while the other succeeds on any given task only 60% of the time, and real users do not retry forever. The right primary metric is pass^k, the probability that the agent succeeds on every one of k independent trials of the same task. The Sierra tau-bench paper introduces the metric precisely because it surfaces inconsistency that pass@1 hides; on retail tasks the headline result is that GPT-4o pass^8 is below 25%, despite a much higher pass@1[1]. Pick k to match the user experience: if a real user gives up after four attempts at the same goal, evaluate at k=4 or higher.
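The composite picture is pass@1 alongside pass^k for the same task set. A perfectly consistent agent, one that either always passes or always fails each task, has pass^k equal to pass@1. If pass^k falls far below pass@1 as k grows, the agent can solve the tasks but does not do so reliably, which is inconsistency of the kind that matters in production. That gap is the signal to fix.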
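A minimal sketch of computing both numbers from the same trial log, assuming n recorded trials per task and the combinatorial estimator C(c, k) / C(n, k), where c is the number of successful trials; this is the form usually given for pass^k, but check it against the tau-bench code before relying on it. The task names and results are hypothetical.

```python
from math import comb


def pass_at_1(trials: dict[str, list[bool]]) -> float:
    """Per-task success rate over the recorded trials, averaged over tasks."""
    return sum(sum(r) / len(r) for r in trials.values()) / len(trials)


def pass_hat_k(trials: dict[str, list[bool]], k: int) -> float:
    """Estimate pass^k: the chance that k trials drawn without replacement
    from a task's n recorded trials all succeed, averaged over tasks."""
    total = 0.0
    for outcomes in trials.values():
        n, c = len(outcomes), sum(outcomes)
        if k > n:
            raise ValueError("k cannot exceed the number of trials per task")
        total += comb(c, k) / comb(n, k)  # comb(c, k) is 0 when c < k
    return total / len(trials)


# Hypothetical results: two tasks, four trials each.
results = {
    "refund_order": [True, True, True, True],      # consistent
    "change_address": [True, False, True, False],  # flaky
}

print(pass_at_1(results))      # 0.75
print(pass_hat_k(results, 2))  # about 0.583
print(pass_hat_k(results, 4))  # 0.5
```

The flaky task contributes 0.5 to pass@1 but nothing to pass^4, which is why the two numbers separate as k grows.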
Secondary metrics
- Steps to completion. Median and 90th percentile turns per successful task.
- Replanning frequency. How often the agent changes its declared plan mid-execution.
- Error attribution. The tau-bench framework distinguishes agent errors, tool errors, and user-side errors[1]; the breakdown tells you where to invest.
- Cost per successful task. A 90% pass^k that costs $4 per task is a different product from a 60% pass^k that costs $0.40.
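The steps and cost metrics can be sketched as follows. The `Attempt` record and its field names are hypothetical, the 90th percentile uses the nearest-rank method, and cost per successful task is assumed to mean total spend (including failed attempts) divided by the number of successes; other definitions are reasonable.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Attempt:
    task_id: str
    success: bool
    turns: int
    cost_usd: float


def secondary_metrics(attempts: list[Attempt]) -> dict[str, float]:
    wins = [a for a in attempts if a.success]
    if not wins:
        raise ValueError("no successful attempts to summarize")
    turns = sorted(a.turns for a in wins)
    # Nearest-rank p90: the ceil(0.9 * n)-th smallest value, in integer math.
    rank = (9 * len(turns) + 9) // 10
    return {
        "median_turns": median(turns),
        "p90_turns": turns[rank - 1],
        # Failed attempts still cost money, so they stay in the numerator.
        "cost_per_success_usd": sum(a.cost_usd for a in attempts) / len(wins),
    }


# Hypothetical attempts: four successes and one long failure.
attempts = [
    Attempt("refund_order", True, 4, 0.30),
    Attempt("refund_order", True, 6, 0.50),
    Attempt("change_address", False, 12, 0.90),
    Attempt("change_address", True, 5, 0.40),
    Attempt("cancel_plan", True, 9, 0.70),
]
print(secondary_metrics(attempts))
```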
Offline eval design
The tau-bench retail and airline domains are the closest things to a standard for this category. Run both. AgentBench is broader and covers 8 environments, including SQL database, operating system, and web shopping environments[2]; pick the subset that matches your product. On top of those, build a domain-specific harness: 20 to 50 of the most common multi-step user goals, each encoded as a deterministic state-check function (e.g., did the row appear in the CRM with the expected fields). State checks are what make the eval reusable; LLM-as-a-judge (LaaJ) scoring of the final agent message is too noisy at this scope.
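As an illustration of a deterministic state check, the sketch below grades the CRM example. The state layout (`crm_contacts` as a list of row dicts), the task id, and the expected field values are all hypothetical; a real harness would read the state from whatever system the agent acted on.

```python
from typing import Any, Callable

State = dict[str, Any]
StateCheck = Callable[[State], bool]


def crm_contact_created(state: State) -> bool:
    """Pass only if exactly one CRM row matches and every expected field is set."""
    expected = {"email": "dana@example.com", "plan": "pro", "status": "active"}
    rows = [
        row for row in state.get("crm_contacts", [])
        if row.get("email") == expected["email"]
    ]
    return len(rows) == 1 and all(rows[0].get(k) == v for k, v in expected.items())


# One check per user goal, keyed by a hypothetical task id.
CHECKS: dict[str, StateCheck] = {"create_pro_contact": crm_contact_created}


def grade(task_id: str, final_state: State) -> bool:
    """Look up the task's check and apply it to the final state; no LLM involved."""
    return CHECKS[task_id](final_state)


good = {"crm_contacts": [{"email": "dana@example.com", "plan": "pro", "status": "active"}]}
wrong_plan = {"crm_contacts": [{"email": "dana@example.com", "plan": "free", "status": "active"}]}
print(grade("create_pro_contact", good))        # True
print(grade("create_pro_contact", wrong_plan))  # False
```

Requiring exactly one matching row also catches agents that retry a write and leave duplicates behind.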
Run k=5 or k=8 trials per task and report pass@1 and pass^k together. The variance across trials is the eval; the headline is the summary.
Online eval design
Capture full trajectories: every tool call, observation, and intermediate plan snippet, with state snapshots at known checkpoints. Stratify the review sample three ways:
- Outcome bucket (success, partial, failure).
- Steps decile; anomalously long trajectories often hide loops.
- Any tool error, which is a leading indicator of pipeline drift.
LaaJ on the plan or critique is expensive; reserve it for sample reviews, not full traffic. Alert on cost-per-task drift greater than 20% week over week and on retry-loop patterns (the same tool called with the same arguments three times in a row).
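Both alert conditions are simple to express in code. The sketch below assumes each logged tool call is a dict with hypothetical `tool` and `args` keys holding JSON-serializable values, and it treats drift as a relative change in either direction; alerting only on increases would be an equally valid reading.

```python
import json


def has_retry_loop(tool_calls: list[dict], run_length: int = 3) -> bool:
    """True if the same tool is called with identical arguments
    run_length times in a row."""
    streak, previous = 0, None
    for call in tool_calls:
        # Serialize args with sorted keys so key order does not hide a repeat.
        key = (call["tool"], json.dumps(call["args"], sort_keys=True))
        streak = streak + 1 if key == previous else 1
        previous = key
        if streak >= run_length:
            return True
    return False


def cost_drift_alert(last_week: float, this_week: float, threshold: float = 0.20) -> bool:
    """True if mean cost per task moved by more than threshold week over week."""
    if last_week <= 0:
        raise ValueError("last week's cost per task must be positive")
    return abs(this_week - last_week) / last_week > threshold


# Hypothetical trajectory: the agent repeats the same lookup three times.
looping = [{"tool": "lookup_order", "args": {"order_id": "A17"}}] * 3
varied = [
    {"tool": "lookup_order", "args": {"order_id": "A17"}},
    {"tool": "lookup_order", "args": {"order_id": "A17"}},
    {"tool": "refund_order", "args": {"order_id": "A17"}},
]
print(has_retry_loop(looping))       # True
print(has_retry_loop(varied))        # False
print(cost_drift_alert(1.00, 1.25))  # True
```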
Top benchmarks
- The tau-bench harness for pass^k.
- AgentBench for cross-environment breadth.
- Inspect AI for research-grade trace capture with strong sandbox primitives.
- Langfuse or Braintrust for production trajectory logging and trace search; both have annotation queues that work for the sample-review pattern.
- OSWorld is sometimes used in this category, though it is browser- and OS-focused enough to merit its own playbook[4].
Common pitfalls
- Reporting only pass@1. The single most common error in this category; it masks the inconsistency that makes agents fragile in production. tau-bench's whole framing exists because of this gap[1].
- No state checkpointing. If you cannot replay a failed trajectory from a known mid-state, your eval is a black box.
- Conflating "completed" with "completed correctly". Many agents end with confident success messages that are wrong; the deterministic state check is the only way to catch this.
- Letting LaaJ-on-trajectory replace deterministic verifiers. Judges of long multi-step traces have high variance and unstable cost; use them on samples only.
Canonical reading
tau-bench is the most useful single read because it both defines the metric this task needs and instruments a credible user simulator alongside it[1]. The paper's conclusion is direct: "improving consistency and rule-following remains essential for practical agent applications." That sentence is the operational thesis of this whole playbook.
Day-1 starter
- Encode 20 of your most common multi-step user goals as state-check functions. The function takes the agent's final state and returns pass or fail; no LLM, no judge.
- Log every tool call, observation, and plan snippet to one trajectory row per attempt. Take state snapshots at known checkpoints so that a failed run is replayable.
- Run k=5 trials per task on every release candidate; report pass@1 and pass^5 side by side. If they diverge, the agent's inconsistency is the bug to fix.
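The logging step can start as small as the sketch below: one record per attempt, holding ordered steps plus named state snapshots. The `TrajectoryRow` class, its field names, and the step kinds are hypothetical, and the sketch assumes state is JSON-serializable; a production setup would more likely write to a tracing tool than to a local object.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TrajectoryRow:
    task_id: str
    trial: int
    steps: list[dict] = field(default_factory=list)
    checkpoints: dict[str, dict] = field(default_factory=dict)

    def log(self, kind: str, payload: dict) -> None:
        """Append a plan snippet, tool call, or observation in order."""
        self.steps.append({"index": len(self.steps), "kind": kind, "payload": payload})

    def snapshot(self, name: str, state: dict) -> None:
        """Store a copy of the state so later mutation cannot alter the checkpoint."""
        self.checkpoints[name] = json.loads(json.dumps(state))

    def to_json(self) -> str:
        """Serialize the whole attempt as one row."""
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical attempt: plan, one tool call, its observation, then a checkpoint.
state = {"crm_contacts": []}
row = TrajectoryRow(task_id="create_pro_contact", trial=0)
row.log("plan", {"text": "look up contact, then create it"})
row.log("tool_call", {"tool": "crm_search", "args": {"email": "dana@example.com"}})
row.log("observation", {"rows": 0})
row.snapshot("after_search", state)

state["crm_contacts"].append({"email": "dana@example.com"})  # agent keeps acting
print(row.checkpoints["after_search"])  # {'crm_contacts': []}
```

Replaying a failed run then means loading a checkpoint and re-running the agent from that point instead of from the start.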