Pass^k: the metric that catches inconsistent agents

Pass@1 hides catastrophic inconsistency. Report pass^k or you do not know how often the agent will repeatedly succeed.

Pass@1 measures the probability that a single attempt succeeds. Pass^k measures the probability that all k attempts on the same task succeed. They sound similar and they are emphatically not. The first is the headline number on every agent leaderboard; the second is the number that predicts whether you can ship the agent to a real user. The gap between them, for the same model on the same task, can be enormous.

τ-bench introduced pass^k as a first-class metric for exactly this reason ¹. A customer-support agent at 90% pass@1 on a retail-task benchmark drops to roughly 43% at pass^8 and around 19% at pass^16 if the failures are independent. The arithmetic is unforgiving: pass^k = p^k under independence, and even with positive correlation between attempts the decay is steep enough to matter operationally. A 1-in-10 chance of going off-policy per multi-turn conversation is the difference between an internal tool and a production launch.

The math, in two lines

For a single task with per-attempt success probability p:

pass@1 = p
pass^k (independent attempts) = p^k

When attempts are positively correlated, the decay is slower, but pass^k can never exceed pass@1. The τ-bench definition is the strict one: pass^k is the fraction of tasks for which all k attempts succeed ¹. You can compute it from raw per-attempt results without any modeling assumption.
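To make the two regimes concrete, here is a minimal sketch. It assumes each task i has a fixed per-attempt success probability p_i and that attempts are independent given p_i, so the expected benchmark-level pass^k is the mean of p_i^k. The function name and both toy benchmarks are illustrative and do not come from τ-bench.

```python
from statistics import mean

def expected_pass_pow_k(task_probs: list[float], k: int) -> float:
    """Expected benchmark pass^k when task i succeeds with probability p_i per attempt."""
    return mean(p ** k for p in task_probs)

# Both toy benchmarks have pass@1 = 0.9.
uniform = [0.9] * 10               # every task fails at random 10% of the time
concentrated = [1.0] * 9 + [0.0]   # one task always fails, the rest never do

for k in (1, 4, 8, 16):
    print(k,
          round(expected_pass_pow_k(uniform, k), 3),
          round(expected_pass_pow_k(concentrated, k), 3))
```

Under these assumptions the uniform benchmark falls to about 0.43 at k=8 and 0.19 at k=16, while the concentrated one, the limiting case of perfectly correlated attempts, stays at 0.9 for every k. Same pass@1, very different pass^k.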

For a benchmark with n tasks and k attempts each, the unbiased per-task estimator is:

task_i_pass^k = 1 if all k attempts succeed, else 0
benchmark_pass^k = mean over tasks

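If you ran m > k attempts per task, use all of them instead of sub-sampling k. The HumanEval line of work gives an unbiased combinatorial estimator for pass@k; the same construction for pass^k, with c successes out of m attempts, is C(c, k) / C(m, k) averaged over tasks, which is the form the τ-bench paper uses ¹. It has lower variance than naive sub-sampling.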
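The pass^k counterpart of that combinatorial estimator replaces "at least one of k is correct" with "all k are correct": with c successes in m attempts, the chance that k attempts drawn without replacement all succeeded is C(c, k) / C(m, k). A minimal sketch follows; the function name and the toy data are illustrative.

```python
from math import comb
from statistics import mean

def unbiased_pass_pow_k(benchmark: list[list[bool]], k: int) -> float:
    """Mean over tasks of C(c, k) / C(m, k), where c of the task's m attempts succeeded."""
    estimates = []
    for attempts in benchmark:
        m, c = len(attempts), sum(attempts)
        if m < k:
            raise ValueError(f"task has {m} attempts, need at least {k}")
        estimates.append(comb(c, k) / comb(m, k))  # comb(c, k) is 0 when c < k
    return mean(estimates)

# One task with 7 of 8 successes, one task with 8 of 8.
toy = [[True] * 7 + [False], [True] * 8]
print(unbiased_pass_pow_k(toy, 4))  # uses all 8 attempts, not only the first 4
```

Unlike an all-of-the-first-k check, the result does not depend on where in the attempt list the failure happened to fall.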

Why pass@1 hides the inconsistency you ship with

Three failure modes that pass@1 cannot see and pass^k can.

First, near-misses that become real misses. An agent that succeeds 9 times out of 10 may be failing on a near-deterministic class: the same kind of input fails the same way. Pass@1 averages over the class; pass^k surfaces it because k attempts have k chances to land in it. If each attempt has a 5% chance of hitting the class, pass^8 under independence is about 66%, a long way below what a 95% pass@1 suggests.

Second, judge gaming. An agent that wins pass@1 by gaming the judge prompt (more later, in reward hacking) often wins by a narrow margin that is sensitive to small perturbations. Pass^k repeated under temperature greater than zero will catch the agent that wins one attempt and fails the next because the gaming is fragile.

Third, environmental variance. Production tools fail intermittently. An agent that is brittle to a single 429 response will pass at k=1 because most attempts do not see the 429, and fail at k=8 because at least one of eight does. TRAJECT-Bench's injected-failure suite makes this concrete; the ranking of models shuffles substantially once attempts must succeed under realistic environmental adversariality ².
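The 429 arithmetic is the same independence calculation run in reverse. As a rough sketch, assume each attempt independently hits a transient tool error with probability q, and that a brittle agent fails whenever it sees one. The 5% rate below is invented for illustration, not measured.

```python
def prob_any_transient_error(q: float, k: int) -> float:
    """Chance that at least one of k independent attempts sees a transient error."""
    return 1.0 - (1.0 - q) ** k

q = 0.05  # assumed per-attempt rate of a 429 or similar; not a measured value
for k in (1, 4, 8):
    print(f"k={k}: {prob_any_transient_error(q, k):.2f}")
```

At that assumed rate, one attempt in twenty sees an error, but roughly a third of 8-attempt runs see at least one.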

TIP

A model that gains pass@1 and loses pass^4 is a regression for production even if the leaderboard moves up. Pass@1 measures peak capability; pass^k measures reliability. Reliability is what ships.

Pareto curves, not single numbers

Once you accept that pass^k matters, you stop reporting a single benchmark number and start reporting a curve. The shape of the curve tells you what kind of agent you have.

Steep decay (pass@1 high, pass^4 collapses): a strong but brittle model. Capability is real; consistency is not. Useful for capability ceilings; bad for production gates.
Shallow decay: a model whose failures are concentrated on a small set of tasks. The capability ceiling is lower but the operating ceiling is higher.
Plateau then drop: a model that handles common cases reliably and fails identically on a hard subset. The hard subset is where to invest annotation effort.

xychart-beta
  title "pass@1 vs pass^4 vs pass^8 for three agents"
  x-axis ["Agent A", "Agent B", "Agent C"]
  y-axis "Success rate (%)" 0 --> 100
  bar [90, 65, 75]
  bar [66, 45, 65]
  bar [43, 30, 56]

Agent A has the highest pass@1 (90%) but its independence-implied pass^8 (about 43%) means more than half of all tasks see at least one failure across 8 attempts. Agent B is weaker everywhere. Agent C trails A on pass@1 (75%) but holds 56% at pass^8 because its failures are concentrated and not lottery-style; in a production setting that needs consistent multi-step behaviour, Agent C is the better deploy.

The point of the chart is not that lower decay is always better; it is that you cannot tell A from C from pass@1 alone, and the choice between them depends on whether your application is one-shot or repeated.

A runnable pass^k computation

With a single attempt per task, pass^k collapses to pass@1 and tells you nothing new; you need k>=4 trials per task at temperature greater than zero. Then compute the per-task all-succeeded indicator and average it into the benchmark-level pass^k. Here is a runnable harness skeleton you can drop into a notebook.

"""
Compute pass@1, pass^k, and the HumanEval-style unbiased pass@k from
raw per-attempt results. No external dependencies.
"""
from math import comb
from statistics import mean
from typing import Iterable

Attempt = bool  # True if the attempt succeeded
Task = list[Attempt]      # k attempts on one task
Benchmark = list[Task]    # n tasks, each with k attempts

def pass_at_1(benchmark: Benchmark) -> float:
    """Average per-attempt success across all (task, attempt) pairs."""
    flat = [a for task in benchmark for a in task]
    return mean(flat)

def strict_pass_pow_k(benchmark: Benchmark, k: int) -> float:
    """
    τ-bench-style pass^k: fraction of tasks for which the first k
    attempts ALL succeed. Requires at least k attempts per task.
    """
    bad = [t for t in benchmark if len(t) < k]
    if bad:
        raise ValueError(f"{len(bad)} task(s) have fewer than {k} attempts")
    return mean(1 if all(t[:k]) else 0 for t in benchmark)

def unbiased_pass_at_k(benchmark: Benchmark, k: int) -> float:
    """
    HumanEval-style estimator for pass@k from m>=k attempts per task.
    pass@k_hat = E_task [1 - C(m-c, k) / C(m, k)] where c is the count
    of successful attempts on the task and m is the number of attempts.
    This estimates "probability that at least one of k samples is correct"
    using all m samples without bias.
    """
    estimates = []
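    for task in benchmark:
        m = len(task)
        if m < k:
            # Without this guard, k > m would silently report 1.0.
            raise ValueError(f"task has {m} attempts, need at least {k}")
        c = sum(1 for a in task if a)
        # comb(m - c, k) is 0 when fewer than k attempts failed,
        # so the estimate is 1.0 in that case.
        estimates.append(1.0 - comb(m - c, k) / comb(m, k))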
    return mean(estimates)

def report(name: str, benchmark: Benchmark, ks: Iterable[int] = (1, 4, 8)):
    print(f"=== {name} ===")
    print(f"  pass@1 = {pass_at_1(benchmark):.3f}")
    for k in ks:
        try:
            print(f"  pass^{k} = {strict_pass_pow_k(benchmark, k):.3f}")
        except ValueError as e:
            print(f"  pass^{k} = skipped ({e})")

# Toy example: 5 tasks, k=8 attempts each.
# Agent A: 90% pass@1 but failures are independent (Bernoulli with p=0.9).
# Agent C: 60% pass@1; two tasks fail every attempt, the other three
# always succeed.
agent_a = [
    [True, True, True, True, True, True, True, False],
    [True, True, False, True, True, True, True, True],
    [True, True, True, True, True, True, False, True],
    [True, True, True, True, False, True, True, True],
    [True, True, True, True, True, True, True, True],
]
agent_c = [
    [True] * 8,
    [True] * 8,
    [True] * 8,
    [False] * 8,
    [False] * 8,
]

report("Agent A (brittle)", agent_a)
report("Agent C (concentrated)", agent_c)

Running this on the toy data: Agent A reports pass@1 = 0.90 and pass^8 = 0.20; Agent C reports pass@1 = 0.60 and pass^8 = 0.60. The numbers are toy, but the asymmetry is real: outcome on a single attempt does not constrain joint success across attempts, and a leaderboard that reports only the former is hiding the latter.

How to report pass^k in a model card

Three rules, lifted from how the τ-bench and TRAJECT-Bench papers present results ¹ ²:

Always pair pass^k with pass@1. The delta is the information.
Pick k from the deployment, not the benchmark. If your product is a 6-turn customer support flow, pass^6 is the meaningful one. If your product is a one-shot search summarizer, pass^1 is the meaningful one.
Use temperature greater than zero for at least k attempts per task, with the random seed varied per attempt. Temperature zero with the same seed makes the metric meaningless.
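The third rule is mostly bookkeeping. The sketch below shows one way to give every (task, attempt) pair its own seed. `collect_attempts`, `run_once`, and `fake_agent` are hypothetical names; a real `run_once` would pass the seed and temperature to your model call, and whether your provider honours a seed is something to verify separately.

```python
import random
from typing import Callable

def collect_attempts(
    tasks: list[str],
    run_once: Callable[[str, int, float], bool],
    k: int,
    temperature: float = 0.7,
    base_seed: int = 0,
) -> list[list[bool]]:
    """Run each task k times, giving every (task, attempt) pair its own seed."""
    results = []
    for t_idx, task in enumerate(tasks):
        attempts = []
        for a_idx in range(k):
            seed = base_seed + t_idx * k + a_idx  # distinct seed per attempt
            attempts.append(run_once(task, seed, temperature))
        results.append(attempts)
    return results

def fake_agent(task: str, seed: int, temperature: float) -> bool:
    """Stand-in for a real agent call: succeeds about 90% of the time."""
    return random.Random(seed).random() < 0.9

benchmark = collect_attempts(["refund", "exchange", "cancel"], fake_agent, k=4)
print(benchmark)
```

The return value is a list of per-task attempt lists, so it can be fed directly into a pass^k computation that expects that shape.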

The agent-eval surveys both flag pass^k as a primary planning-and-reliability metric, and both note that adopting it is the easiest single change that aligns benchmark reporting with production behaviour ³ ⁴.

Operational implications

Three implications for how you build agent products.

First, the workflow-vs-agent decision sharpens. Anthropic's guide to building effective agents argues for the simplest architecture that solves the problem, with workflows preferred over agents wherever they suffice ⁵. Pass^k is the metric that surfaces when an agentic loop is buying you peak capability at the cost of reliability. If a workflow has pass^k = 0.95 and your agent has pass@1 = 0.97 with pass^k = 0.50, the workflow wins for a production gate.

Second, replanning behaviour shows up in pass^k. An agent that detects its own failures and retries with a corrected plan has correlated successes across attempts and a flatter pass^k curve. An agent that hits the right answer on lucky tries has uncorrelated successes and a steeper curve. The shape of the curve is a diagnostic for replanning quality.

Third, contamination shows up too. Agents that memorize answers on a contaminated benchmark like GAIA ⁶ tend to win pass@1 by a wide margin and lose pass^k by a wide margin because memorization is brittle to seed perturbation. A widening pass@1 to pass^k gap on a public benchmark, where it does not appear on a private regression set, is a contamination signal.

What to do this week

For one workflow you already evaluate, run k=4 attempts per task at temperature 0.7. Compute pass@1 and pass^4. Compare to your current single-run number.
Pick k by deployment. Six-turn flows: k=6. One-shot tasks: k=1. Long-horizon research agents: k=8 or 16.
Add pass^k to your model card or release note alongside pass@1. Track it across releases. A pass^k drop is a release gate, not a footnote.
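A release gate on pass^k can be very small. The following is a sketch only: `EvalResult`, `release_allowed`, and the 0.02 tolerance are all assumptions chosen for illustration, and the right tolerance depends on how noisy your benchmark is.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    pass_at_1: float
    pass_pow_k: float

def release_allowed(prev: EvalResult, curr: EvalResult, max_drop: float = 0.02) -> bool:
    """Block the release if pass^k fell by more than max_drop, whatever pass@1 did."""
    return prev.pass_pow_k - curr.pass_pow_k <= max_drop

prev = EvalResult(pass_at_1=0.88, pass_pow_k=0.61)
curr = EvalResult(pass_at_1=0.91, pass_pow_k=0.52)  # leaderboard up, reliability down
print(release_allowed(prev, curr))  # False: the pass^k drop exceeds the tolerance
```

The gate deliberately ignores the pass@1 gain, which matches the rule that a pass^k regression is a production regression even when the headline number improves.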

¹ Yao et al., "τ-bench" (2024). https://arxiv.org/abs/2406.12045
² He et al., "TRAJECT-Bench" (2025). https://arxiv.org/abs/2510.04550
³ Yehudai et al., "Survey on Evaluation of LLM-based Agents" (2025). https://arxiv.org/abs/2503.16416
⁴ Mohammadi et al., "Evaluation of LLM Agents: A Survey" (2025). https://arxiv.org/abs/2507.21504
⁵ Anthropic, "Building Effective Agents". https://www.anthropic.com/engineering/building-effective-agents
⁶ Mialon et al., "GAIA" (2023). https://arxiv.org/abs/2311.12983