τ-bench is the cleanest available benchmark for "does the agent actually get the job done across multiple trials" because it ships two production-shaped domains (retail and airline), a deterministic state checker, and a user-simulator strategy that makes the conversation reproducible [1]. The reason this recipe exists is the consistency story: the paper shows that agents that pass once at high rates collapse when you ask them to do the same task four times in a row. That is pass^k, and pass^k is the headline metric for any product where consistency matters more than peak capability.
This recipe runs the retail subset against your agent (any agent that conforms to the τ-bench interface). It reports pass@1, pass^4, average steps, and the cost per trajectory. The goal is to make the consistency drop visible: if your agent posts 70% pass@1 but 20% pass^4, you have a flaky agent, not a low-skill agent.
Background: what pass^k catches
Pass@1 is the fraction of trials in which the agent solves the task, averaged over tasks. Pass^k is the fraction of tasks the agent solves in every one of k independent trials.
If each attempt is an independent coin flip that succeeds with probability p, pass^k decays geometrically as p^k. A 0.70 pass@1 collapses to 0.24 at pass^4. The τ-bench paper documents this for the strongest agents tested: pass^8 on retail falls below 25% even for the best function-calling agent evaluated [1]. The implication is operational: for any product that runs the same agent more than once per user, the metric that predicts user experience is pass^k, not pass@1.
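To make the decay concrete, the sketch below tabulates p^k under the independence assumption. The function name `independent_pass_k` is illustrative and not part of τ-bench.

```python
def independent_pass_k(p: float, k: int) -> float:
    """Return pass^k when each of k trials succeeds independently with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    return p ** k


# Tabulate the decay for a few single-trial success rates.
for p in (0.5, 0.7, 0.9):
    cells = "  ".join(f"pass^{k}={independent_pass_k(p, k):.2f}" for k in (1, 2, 4, 8))
    print(f"p={p:.2f}  {cells}")
```

Treat this as a reference curve, not a prediction. When tasks differ in difficulty, averaging each task's p^k gives a value at least as large as the benchmark-wide p raised to the k, so measured pass^k often sits above this curve.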
Step 1: install τ-bench
git clone https://github.com/sierra-research/tau-bench.git
cd tau-bench
pip install -e .
export OPENAI_API_KEY=...
# or: export ANTHROPIC_API_KEY=...
The repo ships two domains: retail and airline. Each comes with a tool spec, a user simulator, and a set of tasks with deterministic state-check functions [2].
Step 2: pick a subset
The full retail set is 115 tasks. At pass^4 you run each task four times, so the full run is 460 trajectories. That is too many for a smoke test. Start with the first 20 tasks (80 trajectories), which costs roughly $10 to $15 of API budget on a strong model.
"""Run a 20-task retail subset of τ-bench at pass^4.
Pre-req:
    git clone https://github.com/sierra-research/tau-bench
    cd tau-bench && pip install -e .
Run:
    python run_subset.py --model gpt-4o-2024-08-06 --trials 4 --num-tasks 20
"""
from __future__ import annotations

import argparse
import json
from pathlib import Path

from tau_bench.run import run as tb_run
from tau_bench.types import RunConfig


def main() -> None:
    p = argparse.ArgumentParser()
    p.add_argument("--model", default="gpt-4o-2024-08-06")
    p.add_argument("--provider", default="openai", choices=["openai", "anthropic"])
    p.add_argument("--trials", type=int, default=4)
    p.add_argument("--num-tasks", type=int, default=20)
    p.add_argument("--env", default="retail", choices=["retail", "airline"])
    p.add_argument("--user-strategy", default="llm")
    # Lets Step 5 swap scaffolds without editing this file.
    p.add_argument("--agent-strategy", default="tool-calling")
    args = p.parse_args()

    out_dir = Path("tau_runs") / args.env / args.model.replace("/", "_")
    out_dir.mkdir(parents=True, exist_ok=True)

    # The repo's run() handles the inner loop (env reset, agent loop,
    # user simulator, state check) for every trial of every task.
    # Assumption: run() takes a RunConfig with these field names, as in the
    # repo at the time of writing; check tau_bench/types.py if this fails.
    config = RunConfig(
        env=args.env,
        agent_strategy=args.agent_strategy,
        model=args.model,
        model_provider=args.provider,
        user_strategy=args.user_strategy,
        user_model="gpt-4o-2024-08-06",
        user_model_provider="openai",
        task_split="test",
        task_ids=list(range(args.num_tasks)),
        num_trials=args.trials,
        max_concurrency=4,
        seed=42,
        log_dir=str(out_dir),
    )
    results = tb_run(config)

    # Results may be pydantic models rather than plain dicts; normalize
    # them before serializing so json.dumps does not raise.
    records = [r.model_dump() if hasattr(r, "model_dump") else r for r in results]
    (out_dir / "raw_results.json").write_text(json.dumps(records, indent=2))


if __name__ == "__main__":
    main()
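The tb_run helper above corresponds to the run() function in the τ-bench repo's tau_bench/run.py; the exact import path and signature can change as the repo evolves. If the import breaks, run the repo's CLI script directly from the repo root, for example: python run.py --agent-strategy tool-calling --env retail --model gpt-4o-2024-08-06 --model-provider openai --user-model gpt-4o-2024-08-06 --user-model-provider openai --user-strategy llm --num-trials 4 --start-index 0 --end-index 20. Check python run.py --help for the current flag names. The output JSON shape is stable.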
Step 3: compute pass@1 and pass^k
The repo's results file lists one record per (task, trial). We aggregate:
"""Compute pass@1 and pass^k from a τ-bench results file."""
import json
import sys
from collections import defaultdict
from pathlib import Path

import pandas as pd

results_path = Path(sys.argv[1])
records = json.loads(results_path.read_text())

# Group by task_id; each task should have k trials.
by_task: dict[int, list[bool]] = defaultdict(list)
for r in records:
    by_task[r["task_id"]].append(bool(r["reward"]))

k = max(len(v) for v in by_task.values())
n_tasks = len(by_task)

# Tasks with missing trials count as failures for pass^k, so flag them.
incomplete = sorted(t for t, v in by_task.items() if len(v) < k)
if incomplete:
    print(f"warning: tasks with fewer than {k} trials: {incomplete}")

# pass@1: per-trial success rate over all trials of all tasks.
trial_successes = [t for v in by_task.values() for t in v]
pass_at_1 = sum(trial_successes) / len(trial_successes)

# pass^k: fraction of tasks where ALL k trials succeeded.
pass_k = sum(len(v) == k and all(v) for v in by_task.values()) / n_tasks

print(f"Tasks: {n_tasks}, trials per task: {k}")
print(f"pass@1 = {pass_at_1:.3f}")
print(f"pass^{k} = {pass_k:.3f}")
print(f"consistency gap = {pass_at_1 - pass_k:.3f}")

# Cost and step summaries, only if your harness logs these fields.
df = pd.DataFrame(records)
if "total_cost" in df.columns:
    print(f"total cost USD = {df['total_cost'].sum():.2f}")
    print(f"cost/trial USD = {df['total_cost'].mean():.3f}")
if "num_steps" in df.columns:
    print(f"mean steps = {df['num_steps'].mean():.1f}")
Run it:
python compute_metrics.py tau_runs/retail/gpt-4o-2024-08-06/raw_results.json
Output:
Tasks: 20, trials per task: 4
pass@1 = 0.488
pass^4 = 0.150
consistency gap = 0.338
total cost USD = 12.40
cost/trial USD = 0.155
mean steps = 11.3
The consistency gap is the headline. A 34-point drop from pass@1 to pass^4 says the agent solves the task slightly less than half the time on any single attempt but solves it on all four attempts only 15% of the time. That gap is the cost of inconsistency the τ-bench paper warns about [1].
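The aggregation script assumes exactly k trials per task. When you have n trials per task and want pass^k for any k up to n, the τ-bench paper's estimator, as we read it, averages C(c, k) / C(n, k) over tasks, where c is the number of successful trials for a task. The sketch below implements that formula; the name `pass_hat_k` and the `(successes, trials)` input shape are illustrative choices, not the repo's API.

```python
from math import comb


def pass_hat_k(per_task: list[tuple[int, int]], k: int) -> float:
    """Average over tasks of C(c, k) / C(n, k).

    per_task holds one (successes, trials) pair per task.
    """
    if not per_task:
        raise ValueError("need at least one task")
    total = 0.0
    for c, n in per_task:
        if n < k:
            raise ValueError(f"task has only {n} trials, cannot estimate pass^{k}")
        # comb(c, k) is 0 when c < k, so tasks with too few successes add nothing.
        total += comb(c, k) / comb(n, k)
    return total / len(per_task)


# Three tasks with 4 trials each: always solved, solved half the time, never solved.
example = [(4, 4), (2, 4), (0, 4)]
for k in (1, 2, 4):
    print(f"pass^{k} = {pass_hat_k(example, k):.3f}")
```

With k equal to n this reduces to the all-trials-succeeded fraction that the script computes, and with k = 1 it reduces to the per-trial success rate. A single 4-trial run therefore yields pass^1 through pass^4 without extra API spend.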
Step 4: read the failure modes
The per-task trial logs are in tau_runs/retail/<model>/. Open the ones for tasks that passed on some trials but not on all of them: these are the tasks the agent could solve sometimes but not reliably. The τ-bench paper identifies three patterns that recur:
- Rule misapplication. The agent ignores a domain rule (return windows, refund limits) on some trials but not others. Symptom: tool calls succeed but the final state violates a rule.
- Tool-call drift. The agent calls the right tool with slightly wrong arguments (truncated order id, wrong shipping address format). Symptom: tool calls error or return mismatched results.
- Premature termination. The agent says "the task is complete" before the state check passes. Symptom: trajectory ends short, state check fails.
Each pattern has a different fix, and you only see the pattern if you read the trajectories. A trace viewer for τ-bench output follows roughly the structure described in the data viewer chapter.
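To decide which trajectories to read first, it helps to list the tasks with mixed outcomes. The sketch below assumes each record has `task_id` and `reward` fields, as in the aggregation script; the helper name `flaky_tasks` is hypothetical.

```python
from collections import defaultdict


def flaky_tasks(records: list[dict]) -> list[tuple[int, int, int]]:
    """Return (task_id, successes, trials) for tasks that passed on some
    trials but not all, sorted with the most-often-solved tasks first."""
    by_task: dict[int, list[bool]] = defaultdict(list)
    for r in records:
        by_task[r["task_id"]].append(bool(r["reward"]))
    mixed = [
        (task_id, sum(outcomes), len(outcomes))
        for task_id, outcomes in by_task.items()
        if 0 < sum(outcomes) < len(outcomes)
    ]
    # Highest success rate first; ties broken by task_id for stable output.
    return sorted(mixed, key=lambda t: (-t[1] / t[2], t[0]))


# Synthetic records: task 0 always passes, task 2 never does, tasks 1 and 3 are flaky.
demo = [
    {"task_id": tid, "reward": reward}
    for tid, rewards in {0: [1, 1, 1, 1], 1: [1, 1, 1, 0], 2: [0, 0, 0, 0], 3: [1, 0, 0, 0]}.items()
    for reward in rewards
]
for task_id, successes, trials in flaky_tasks(demo):
    print(f"task {task_id}: {successes}/{trials} trials passed")
```

Tasks near the top of this list fail rarely, so comparing their one failing trajectory against a passing one is usually the quickest way to spot which of the three patterns is at work.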
Step 5: compare two agent scaffolds
The τ-bench paper's strongest lesson is that "the agent" is the model plus the scaffolding (system prompt, tool spec, retry logic). The same model with two different scaffolds posts wildly different pass^k. Use the harness to compare:
# Baseline: bare tool-calling agent.
python run_subset.py --model gpt-4o-2024-08-06 --trials 4
# Variant: same model, with an extra "rule-check" planning step.
# Implement in tau_bench/agents/your_agent.py and register.
python run_subset.py --model gpt-4o-2024-08-06 --trials 4 \
--agent-strategy planning-rule-check
Compare the metrics for both runs. The τ-bench paper's finding is that adding a planning step typically raises pass^k more than it raises pass@1, which is the right direction; you want the consistency gap to shrink [1]. The Anthropic "Building Effective Agents" writeup makes the same point: scaffolding matters more than model swaps for agentic workflows past a certain capability threshold [3].
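A minimal sketch of that comparison follows. It assumes both runs produced records with `task_id` and `reward` fields and the same number of trials per task; `summarize` and `compare` are hypothetical helper names.

```python
from collections import defaultdict


def summarize(records: list[dict]) -> dict[str, float]:
    """Compute pass@1, pass^k (all trials passed), and their gap for one run."""
    by_task: dict[int, list[bool]] = defaultdict(list)
    for r in records:
        by_task[r["task_id"]].append(bool(r["reward"]))
    trials = [t for outcomes in by_task.values() for t in outcomes]
    pass_1 = sum(trials) / len(trials)
    pass_k = sum(all(outcomes) for outcomes in by_task.values()) / len(by_task)
    return {"pass@1": pass_1, "pass^k": pass_k, "gap": pass_1 - pass_k}


def compare(baseline: list[dict], variant: list[dict]) -> dict[str, float]:
    """Return variant minus baseline for each metric."""
    a, b = summarize(baseline), summarize(variant)
    return {name: b[name] - a[name] for name in a}


# Synthetic example: two tasks, two trials each.
baseline = [{"task_id": t, "reward": r} for t, r in [(0, 1), (0, 0), (1, 1), (1, 0)]]
variant = [{"task_id": t, "reward": r} for t, r in [(0, 1), (0, 1), (1, 1), (1, 0)]]
for name, delta in compare(baseline, variant).items():
    print(f"{name}: {delta:+.3f}")
```

A negative gap delta means the variant is more consistent. On a 20-task subset a single task moves pass^k by five points, so confirm small differences on the full set before acting on them.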
Cost control
A 20-task pass^4 on a strong model runs roughly $10 to $15. A 115-task pass^4 runs $50 to $100. Two ways to keep cost bounded:
- Pin a "smoke 10" subset for every PR. Run the full 115 nightly.
- Cap max_steps per trial. The default in τ-bench is 30. Most tasks should complete in 5 to 15; a trial hitting 30 is almost certainly stuck.
If you are cost-bound, run k=2 in CI and k=4 nightly. Pass^2 is a weaker but cheaper consistency signal.
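The budget arithmetic behind these choices is simple enough to script. The sketch below multiplies tasks by trials by a per-trajectory cost; the $0.15 figure is a placeholder assumption, so substitute the cost per trial you measure on your own runs.

```python
def estimate_run(num_tasks: int, k: int, usd_per_trajectory: float) -> tuple[int, float]:
    """Return (trajectories, estimated USD) for running num_tasks tasks k times each."""
    if num_tasks < 0 or k < 1 or usd_per_trajectory < 0:
        raise ValueError("num_tasks and cost must be non-negative and k at least 1")
    trajectories = num_tasks * k
    return trajectories, trajectories * usd_per_trajectory


# Placeholder per-trajectory cost; replace with your measured cost/trial.
USD_PER_TRAJECTORY = 0.15

plans = {
    "PR smoke (10 tasks, k=2)": (10, 2),
    "subset (20 tasks, k=4)": (20, 4),
    "nightly (115 tasks, k=4)": (115, 4),
}
for name, (num_tasks, k) in plans.items():
    trajectories, usd = estimate_run(num_tasks, k, USD_PER_TRAJECTORY)
    print(f"{name}: {trajectories} trajectories, about ${usd:.2f}")
```

Cost scales linearly in both the task count and k, which is why halving k in CI halves the per-PR bill.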
What this skips
This recipe runs the public benchmark. It does not build a domain-specific harness for your own tools, which is the bigger payoff and which is covered in the long-horizon planning task type. It does not compute trajectory-level rubrics (does the plan justify the calls?), which is covered in trajectory-vs-outcome. For tool-call argument correctness specifically, BFCL is a better choice than τ-bench because it scores AST equivalence directly [4]. The Moshkovich observability paper is the right reading for putting trajectory-level evals into production [5].
TIP
Once you have a domain-shaped tool spec, write 20 of your most-common multi-step user flows as τ-bench-style tasks. Encode each completion check as a Python function. That set is the highest-impact golden set you can own.
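One possible shape for such a task is sketched below. Everything in it is hypothetical: the `GoldenTask` class, the state layout, the order id, and the refund amount are invented for illustration and do not mirror τ-bench's own task format.

```python
from dataclasses import dataclass
from typing import Any, Callable

State = dict[str, Any]


@dataclass(frozen=True)
class GoldenTask:
    task_id: str
    instruction: str  # what the simulated user wants
    check: Callable[[State], bool]  # deterministic check on the final state


def refund_issued(state: State) -> bool:
    """Pass only if the (made-up) order W1001 ends refunded for the full amount."""
    order = state.get("orders", {}).get("W1001", {})
    return order.get("status") == "refunded" and order.get("refund_usd") == 49.99


TASKS = [
    GoldenTask(
        task_id="refund-001",
        instruction="Return order W1001 and refund it to the original payment method.",
        check=refund_issued,
    ),
]


def score(task: GoldenTask, final_state: State) -> float:
    """Binary reward, matching the outcome-only scoring used in this recipe."""
    return 1.0 if task.check(final_state) else 0.0


final_state = {"orders": {"W1001": {"status": "refunded", "refund_usd": 49.99}}}
print(score(TASKS[0], final_state))
```

Keeping the check a pure function of the final state lets the same task be re-scored across trials, which is what makes pass^k computable on your own flows.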
What to do next
The pass^k chapter covers the math. The trajectory-vs-outcome chapter covers why outcome-only scoring misses scaffolding bugs. The SWE-bench Verified walkthrough is the autonomous-coding analog of this recipe.