Statistical power: how big does the eval set need to be

Power is the probability of detecting the delta you care about. Pick the delta first, derive the N, then commit. Reversing the order produces evals nobody can act on.

Power is the probability that an experiment correctly detects a real effect of a given size. An eval at low power is a coin flip dressed in numbers: it can return "no significant difference" whether the difference is real or absent, and you cannot tell which scenario produced the result. The fix is to commit to the smallest delta that matters, derive the N that gives you 80% or 90% power at that delta, and stop running evals at sizes the team has not done this math for.

The headline: at a 70% baseline and under a conservative pairing assumption, distinguishing a 5-point delta with 80% power and 95% confidence needs roughly 1,300 paired inputs; distinguishing a 3-point delta needs roughly 3,400. N=200 is often where the conversation starts; N=200 is also often a coin flip on the deltas teams want to ship behind ¹. The numbers come from standard normal-approximation formulas (McNemar-style for paired designs, two-proportion for unpaired) with no exotic assumptions.

What power means

Power is the complement of the Type II error rate. At a chosen significance level (typically alpha = 0.05) and a chosen effect size (the smallest delta you care about, the "minimum detectable effect" or MDE), power is the probability that the test rejects the null when the true effect is MDE.

Three numbers fix the design:

Alpha: the false-positive rate you accept (commonly 0.05).
MDE: the smallest delta worth detecting (a quality decision, not a statistical one).
Power: the probability of detecting MDE if it is real (commonly 0.80, sometimes 0.90).

Given those three, the required N is a derivation. For binary metrics with paired inputs, the formula uses the discordant-pair rate; for unpaired, it uses the two-proportion variance. Either way, N grows quickly as MDE shrinks.
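
To make the definition concrete, the sketch below inverts the sizing question: given a fixed N, how much power does a paired binary comparison have? It uses the normal approximation to McNemar's test and ignores the negligible chance of rejecting in the wrong direction. The function name `power_paired_binary` and the 20% discordant rate are illustrative assumptions, and `statistics.NormalDist` stands in for any standard-normal implementation.

```python
import math
from statistics import NormalDist


def power_paired_binary(n: int, mde: float, discordant_rate: float = 0.20,
                        alpha: float = 0.05) -> float:
    """Approximate power of a two-sided McNemar-style test at sample size n.

    Assumes the true difference in discordant proportions equals mde.
    """
    if mde == 0 or discordant_rate <= abs(mde):
        raise ValueError("need 0 < |mde| < discordant_rate")
    std_normal = NormalDist()
    z_a = std_normal.inv_cdf(1 - alpha / 2)
    sd_null = math.sqrt(discordant_rate)            # sd of the difference under the null
    sd_alt = math.sqrt(discordant_rate - mde ** 2)  # sd under the alternative
    z = (abs(mde) * math.sqrt(n) - z_a * sd_null) / sd_alt
    return std_normal.cdf(z)


# Power to detect a 5-point delta at three eval-set sizes (assumed 20% discordance).
for n in (200, 630, 1300):
    print(n, round(power_paired_binary(n, mde=0.05), 2))
```

Under these assumptions, a 200-input set has roughly a one-in-three chance of detecting a real 5-point delta, while about 630 inputs reaches the 80% target.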

A power table

Approximate paired-design N for 95% confidence and 80% power at the MDE shown, baseline 70%, assuming a conservative 40% discordant-pair rate (the numbers stay valid when pairing is weak):

MDE	Approx. N (paired)
10 points	320
7 points	660
5 points	1,300
4 points	2,000
3 points	3,400
2 points	7,800

For unpaired (independent) designs, roughly double the N: a 5-point MDE needs about 1,380 inputs per arm, or about 2,750 in total. When a pilot shows tight pairing (the more typical 20% discordant rate), the table roughly halves: a 5-point MDE needs about 630 paired inputs instead of 1,300. The numbers are sensitive to the baseline, the pairing strength, and the variance of the metric; treat them as the right order of magnitude rather than precise targets. The Anthropic Error Bars paper derives these for evals in detail and shows the same shape across baselines ¹ ².

The shape is the message: cutting MDE in half roughly quadruples the required N. Teams that want to detect 2-point regressions need eval sets thousands of inputs deep; teams that can settle for 7-point regressions need a few hundred. Pick the MDE first.
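
The quadrupling follows from the McNemar-style normal approximation used for the paired case. As a sketch, write \delta for the MDE and p_d for the discordant-pair rate (both symbols are notation introduced here for illustration), and note that \delta^2 is small next to p_d:

```latex
n \;\approx\; \left( \frac{z_{1-\alpha/2}\,\sqrt{p_d} \;+\; z_{\text{power}}\,\sqrt{p_d - \delta^{2}}}{\delta} \right)^{2}
\;\approx\; \frac{\left( z_{1-\alpha/2} + z_{\text{power}} \right)^{2} p_d}{\delta^{2}}
\qquad (\delta^{2} \ll p_d)

% Worked instance: alpha = 0.05, power = 0.80, p_d = 0.40, delta = 0.05
n \;\approx\; \frac{(1.960 + 0.842)^{2} \times 0.40}{0.05^{2}} \;\approx\; 1{,}256

% Halving delta multiplies n by four
\frac{n(\delta/2)}{n(\delta)} \;\approx\; \frac{\delta^{2}}{(\delta/2)^{2}} \;=\; 4
```

The worked instance lands on the table's 5-point row once rounded, and the same expression shows N scaling linearly with the discordant rate, which is why tighter pairing shrinks the requirement.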

Sizing is a loop, not a one-shot calculation: when the computed N is unaffordable, the MDE or the power target is what gives.

flowchart LR
    A["Pick MDE (product decision)"] --> B["Pick alpha + power"]
    B --> C["Pilot: discordant rate"]
    C --> D["Compute N"]
    D --> E{"N affordable?"}
    E -- yes --> F["Run the eval"]
    E -- no --> G["Relax MDE or accept lower power"]
    G --> A
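
The loop in the flowchart can be sketched in a few lines. The helper below walks a list of candidate MDEs from tightest to loosest and returns the first whose required N fits the budget. The names `n_paired` and `size_eval`, the candidate list, and the budget of 1,000 inputs are all hypothetical.

```python
import math
from statistics import NormalDist


def n_paired(mde: float, discordant_rate: float, alpha: float = 0.05,
             power: float = 0.80) -> int:
    """Paired-binary N via the McNemar normal approximation."""
    std_normal = NormalDist()
    z_a = std_normal.inv_cdf(1 - alpha / 2)
    z_b = std_normal.inv_cdf(power)
    numerator = (z_a * math.sqrt(discordant_rate)
                 + z_b * math.sqrt(discordant_rate - mde ** 2))
    return math.ceil((numerator / mde) ** 2)


def size_eval(candidate_mdes, discordant_rate, budget, alpha=0.05, power=0.80):
    """Return (mde, n) for the tightest candidate MDE whose N fits the budget.

    Returns None when even the loosest MDE is unaffordable, which is the cue
    to revisit the power target or the budget itself.
    """
    for mde in sorted(candidate_mdes):  # tightest MDE first
        n = n_paired(mde, discordant_rate, alpha, power)
        if n <= budget:
            return mde, n
    return None


# Hypothetical inputs: pilot discordance of 20%, budget of 1,000 labeled inputs.
print(size_eval([0.02, 0.03, 0.05, 0.07], discordant_rate=0.20, budget=1000))
```

With these inputs the 2-point and 3-point candidates are rejected as unaffordable and the loop settles on the 5-point MDE.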

Computing the N you need

"""
Approximate sample sizes for a paired binary comparison (McNemar normal
approximation) and an unpaired two-proportion comparison (pooled-variance
normal approximation to the two-proportion z-test). Both functions take the
baseline rate, MDE, alpha, and power.
"""
import math
from scipy.stats import norm

def n_paired_binary(baseline: float, mde: float, alpha: float = 0.05, power: float = 0.80,
                     discordant_rate: float = 0.20) -> int:
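    """
    Required number of paired inputs (McNemar normal approximation).

    discordant_rate is the fraction of inputs where the two systems disagree.
    Lower rates mean tighter pairing; if you do not know, 0.15-0.25 is typical
    for production eval comparisons. baseline is accepted for signature parity
    with n_unpaired_binary but does not enter this approximation.
    """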
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    # b - c proportion equals mde under the alternative
    p_b_minus_c = mde
    p_b_plus_c = discordant_rate
    if p_b_minus_c == 0:
        raise ValueError("mde must be nonzero")
    if p_b_plus_c <= abs(p_b_minus_c):
        raise ValueError("discordant_rate must exceed |mde|")
    var_null = p_b_plus_c
    var_alt = p_b_plus_c - p_b_minus_c ** 2
    n = ((z_a * math.sqrt(var_null) + z_b * math.sqrt(var_alt)) / p_b_minus_c) ** 2
    return int(math.ceil(n))

def n_unpaired_binary(baseline: float, mde: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Required N per arm; the comparison arm is modeled at baseline - mde."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    p1 = baseline
    p2 = baseline - mde
    p_bar = (p1 + p2) / 2
    n = ((z_a * math.sqrt(2 * p_bar * (1 - p_bar)) + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
         / mde) ** 2
    return int(math.ceil(n))

# Common defaults
print(n_paired_binary(0.70, 0.05))                        # ~630 at the 20% default discordance
print(n_paired_binary(0.70, 0.05, discordant_rate=0.40))  # ~1,300, the table's conservative assumption
print(n_unpaired_binary(0.70, 0.05))                      # ~1,380 per arm, ~2,750 total
print(n_unpaired_binary(0.70, 0.03))                      # ~3,760 per arm

The paired version requires an estimate of the discordant-pair rate. If you do not know it, run a small pilot of fifty paired inputs and count the disagreements; the empirical rate is good enough for the design calculation.
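
A fifty-input pilot gives a usable point estimate, but it helps to see how much slack surrounds it. The sketch below puts a Wilson score interval around the pilot's discordant rate. The pilot counts are hypothetical, and planning against the upper end of the interval is one conservative option rather than a rule from the source.

```python
import math
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


# Hypothetical pilot: the two systems disagree on 10 of 50 paired inputs.
low, high = wilson_interval(10, 50)
print(f"pilot discordant rate 0.20, 95% interval {low:.2f} to {high:.2f}")
```

Because the required N scales almost linearly with the discordant rate, sizing against the upper end of that interval instead of the point estimate raises N by about the same proportion.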

For non-binary metrics, the calculation needs the standard deviation of the per-input difference. A pilot run gives you both the mean and the standard deviation; plug into the standard formula n = ((z_a + z_b) * sd / mde)^2 for the paired continuous case.
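
The continuous-case formula translates directly into code. This is a minimal sketch: the function name `n_paired_continuous`, the 1-5 rubric, and the pilot values are hypothetical, and the normal approximation slightly understates N when the pilot is small.

```python
import math
from statistics import NormalDist


def n_paired_continuous(sd_diff: float, mde: float, alpha: float = 0.05,
                        power: float = 0.80) -> int:
    """n = ((z_a + z_b) * sd / mde)^2 for a paired continuous metric.

    sd_diff is the standard deviation of the per-input difference between
    the two systems, estimated from a pilot.
    """
    if mde == 0 or sd_diff <= 0:
        raise ValueError("need a nonzero mde and a positive sd_diff")
    std_normal = NormalDist()
    z_a = std_normal.inv_cdf(1 - alpha / 2)
    z_b = std_normal.inv_cdf(power)
    return math.ceil(((z_a + z_b) * sd_diff / mde) ** 2)


# Hypothetical pilot: per-input score differences have sd 1.2 on a 1-5 rubric,
# and the smallest shift worth detecting is 0.1 rubric points.
print(n_paired_continuous(sd_diff=1.2, mde=0.1))
```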

Power under clustering

When users contribute multiple inputs, the effective N is smaller than the raw count. The design effect (the ratio of clustered variance to naive variance) inflates the required N by roughly 1 + (m - 1) * rho, where m is the average cluster size and rho is the intracluster correlation. Even at modest values (m = 5, rho = 0.1), the inflation is 40%. The clustered standard errors page details the variance side; the power side is the same correction applied to the N estimate ¹.

In practice, the cluster correction means an eval set of "200 inputs from 40 users" has the power of "roughly 140 inputs from independent users." If the design calls for 1,300 paired inputs, that maps to roughly 1,800 inputs if each user contributes five and they correlate moderately.
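
The arithmetic behind those two figures is short enough to check directly. The helper names below are hypothetical, and the formula assumes roughly equal cluster sizes.

```python
import math


def design_effect(avg_cluster_size: float, icc: float) -> float:
    """Design effect 1 + (m - 1) * rho for roughly equal-sized clusters."""
    return 1 + (avg_cluster_size - 1) * icc


def effective_n(raw_n: int, avg_cluster_size: float, icc: float) -> float:
    """Independent-equivalent sample size of a clustered eval set."""
    return raw_n / design_effect(avg_cluster_size, icc)


def inflated_n(required_n: int, avg_cluster_size: float, icc: float) -> int:
    """Raw inputs needed so the clustered set matches required_n independent inputs."""
    # round() guards against floating-point noise pushing ceil() up by one
    return math.ceil(round(required_n * design_effect(avg_cluster_size, icc), 9))


# 200 inputs from 40 users (five each) at an intracluster correlation of 0.1
print(round(effective_n(200, 5, 0.1)))
# A design calling for 1,300 independent paired inputs, same clustering
print(inflated_n(1300, 5, 0.1))
```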

Why teams ignore power

Three reasons recur, and all three are worth flagging.

The first is "we already have 200 inputs and we want to ship." The cost of doubling the set feels higher than the cost of an uncertain decision; in practice the next regression that ships through the underpowered eval costs more than the dataset extension would have. The Field Guide and the FAQ both make this point: the eval set is a leveraged investment ³.

The second is "the bootstrap interval is wide and we accept it." The bootstrap reports the interval honestly; the team's acceptance does not change the underlying inferential weakness. A 90% confidence interval on a 5-point delta that ranges from minus 6 to plus 16 is a coin flip, not a decision input.

The third is "we will compensate with more frequent runs." Repeated underpowered tests with naive multiple-comparison handling produce false discoveries on a timeline the team has no plan to correct. The multiple-comparisons page covers the correction; the prior question is whether the underlying test is powered.
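
To see how quickly repeated runs accumulate false discoveries, the sketch below computes the chance of at least one false positive across several uncorrected tests. It assumes the tests are independent and every null is true; repeated runs on the same eval set are correlated, so treat the output as an illustration of the trend rather than an exact rate. The function name is hypothetical.

```python
def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """P(at least one false positive) across independent tests of true nulls."""
    return 1 - (1 - alpha) ** num_tests


# Uncorrected alpha = 0.05 repeated over an increasing number of runs.
for runs in (1, 5, 10, 20):
    print(runs, round(familywise_error_rate(0.05, runs), 2))
```

Under these assumptions, ten uncorrected runs carry roughly a 40% chance of at least one spurious "significant" result.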

TIP

Pick MDE before you pick N. Picking N first and reading the MDE off the math invites the team to argue the underpowered comparison was decisive when it was not ².

Power for online A/B

Online A/B inherits the same logic with two differences. The pairing is rarely clean (users do not always see both arms), so the comparison is usually unpaired. The variance is usually larger due to clustering by user. Chatbot Arena reports model intervals from bootstrap on a sample-by-sample basis precisely because the underlying pairwise comparisons are noisy and the team wants the interval to reflect the actual signal ⁴. The same discipline applies to in-product A/B: design the comparison for the MDE that matters, and run it long enough.

The flywheel framing makes the same point at the program level: eval programs that under-invest in power produce shipping decisions that look statistical but are not ⁵.

What to do this week

For each release-gated metric, write the MDE on the spec. The number is a product decision; argue it with whoever owns the product.
Compute the required N for each metric at 80% power, alpha 0.05. If the eval set is smaller, treat that as a known underpowered state, not as something you can wish away.
Run a pilot of fifty paired inputs to estimate the discordant-pair rate or the standard deviation of the difference. Plug it into the design.

¹ Miller, "Adding Error Bars to Evals" (Anthropic), arXiv:2411.00640.
² Anthropic, "A Statistical Approach to Model Evaluations."
³ Hamel Husain and Shreya Shankar, "LLM Evals FAQ."
⁴ Chiang et al., "Chatbot Arena," arXiv:2403.04132.
⁵ Shreya Shankar, "The AI Engineering Flywheel."