Power is the probability that an experiment correctly detects a real effect of a given size. An eval at low power is a coin flip dressed in numbers: it can return "no significant difference" whether or not the difference is real, and you cannot tell which scenario produced the result. The fix is to commit to the smallest delta that matters, derive the N that gives you 80% or 90% power at that delta, and stop running evals at sizes nobody on the team has done this math for.
The headline: at a 70% baseline, distinguishing a 5-point delta with 80% power and 95% confidence needs roughly 1,300 paired inputs; distinguishing a 3-point delta needs roughly 3,400. N=200 is often where the conversation starts; N=200 is also often a coin flip on the deltas teams want to ship behind 1. The numbers come from Wilson-interval and McNemar-style formulas with no exotic assumptions.
What power means
Power is the complement of the Type II error rate. At a chosen significance level (typically alpha = 0.05) and a chosen effect size (the smallest delta you care about, the "minimum detectable effect" or MDE), power is the probability that the test rejects the null when the true effect is MDE.
Three numbers fix the design:
- Alpha: the false-positive rate you accept (commonly 0.05).
- MDE: the smallest delta worth detecting (a quality decision, not a statistical one).
- Power: the probability of detecting MDE if it is real (commonly 0.80, sometimes 0.90).
Given those three, the required N is a derivation. For binary metrics with paired inputs, the formula uses the discordant-pair rate; for unpaired, it uses the two-proportion variance. Either way, N grows quickly as MDE shrinks.
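The same relationship can be read in the other direction: fix N and ask how much power you get. The sketch below assumes the McNemar normal approximation for a paired binary metric and an assumed discordant-pair rate of 0.20 (a placeholder, not a measured value). The name `paired_power` is illustrative, and the normal quantiles come from the standard library's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def paired_power(n: int, mde: float, discordant_rate: float, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided McNemar-style test at sample size n.

    discordant_rate is the fraction of inputs where the two systems disagree;
    here it is an assumed input, normally estimated from a pilot.
    """
    if not 0 < abs(mde) < discordant_rate <= 1:
        raise ValueError("need 0 < |mde| < discordant_rate <= 1")
    std_normal = NormalDist()
    z_a = std_normal.inv_cdf(1 - alpha / 2)
    sd_null = math.sqrt(discordant_rate)            # per-input sd under the null
    sd_alt = math.sqrt(discordant_rate - mde ** 2)  # per-input sd when the delta is mde
    # Chance the statistic clears the critical value when the true delta is mde
    # (ignores the negligible chance of rejecting in the wrong direction).
    z = (abs(mde) * math.sqrt(n) - z_a * sd_null) / sd_alt
    return std_normal.cdf(z)


# Power at a 5-point MDE for a few eval-set sizes (discordant rate assumed 0.20).
for n in (200, 600, 1300):
    print(n, round(paired_power(n, mde=0.05, discordant_rate=0.20), 2))
```

Under these assumptions, N=200 at a 5-point MDE lands at roughly 35% power, and weaker pairing (a higher discordant rate) pushes it lower.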
A power table
Approximate paired-design N for 95% confidence and 80% power at the MDE shown, baseline 70%:
| MDE | Approx. N (paired) |
|---|---|
| 10 points | 320 |
| 7 points | 660 |
| 5 points | 1,300 |
| 4 points | 2,000 |
| 3 points | 3,400 |
| 2 points | 7,800 |
For unpaired (independent) designs, roughly double the N. The numbers are sensitive to the baseline, the pairing strength, and the variance of the metric; treat them as the right order of magnitude rather than precise targets. The Anthropic Error Bars paper derives these for evals in detail and shows the same shape across baselines 1 2.
The shape is the message: cutting MDE in half roughly quadruples the required N. Teams that want to detect 2-point regressions need eval sets thousands of inputs deep; teams that can settle for 7-point regressions need a few hundred. Pick the MDE first.
Computing the N you need
```python
"""
Approximate sample size for a paired binary comparison (McNemar normal
approximation) and for an unpaired two-proportion comparison (pooled-variance
normal approximation). Inputs are: baseline rate, MDE, alpha, power.
"""
import math

from scipy.stats import norm


def n_paired_binary(baseline: float, mde: float, alpha: float = 0.05, power: float = 0.80,
                    discordant_rate: float = 0.20) -> int:
    """
    Number of paired inputs needed to detect a difference of mde.

    discordant_rate is the fraction of inputs where the two systems disagree.
    Lower rates mean tighter pairing and a smaller N; if you do not know,
    0.15-0.25 is typical for production eval comparisons. baseline is kept for
    a uniform signature; the paired formula depends on it only through
    discordant_rate.
    """
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    # b - c proportion equals mde under the alternative
    p_b_minus_c = mde
    p_b_plus_c = discordant_rate
    if p_b_plus_c <= abs(p_b_minus_c):
        raise ValueError("discordant_rate must exceed |mde|")
    var_null = p_b_plus_c
    var_alt = p_b_plus_c - p_b_minus_c ** 2
    n = ((z_a * math.sqrt(var_null) + z_b * math.sqrt(var_alt)) / p_b_minus_c) ** 2
    return int(math.ceil(n))


def n_unpaired_binary(baseline: float, mde: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Number of inputs needed in each arm to detect a drop of mde from baseline."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    p1 = baseline
    p2 = baseline - mde
    p_bar = (p1 + p2) / 2
    n = ((z_a * math.sqrt(2 * p_bar * (1 - p_bar))
          + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) / mde) ** 2
    return int(math.ceil(n))


# Common defaults (paired calls use discordant_rate=0.20)
print(n_paired_binary(0.70, 0.05))    # 626
print(n_paired_binary(0.70, 0.03))    # 1742
print(n_unpaired_binary(0.70, 0.05))  # 1377 per arm
print(n_unpaired_binary(0.70, 0.03))  # 3763 per arm
# Weak pairing: a discordant rate near 0.42 gives ~1,300 at a 5-point MDE,
# close to the table's 5-point row.
print(n_paired_binary(0.70, 0.05, discordant_rate=0.42))  # 1317
```
The paired version requires an estimate of the discordant-pair rate. If you do not know it, run a small pilot of fifty paired inputs and count the disagreements; the empirical rate is good enough for the design calculation.
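One way to run that pilot count is sketched below. It assumes per-input pass/fail results for both systems on the same inputs, and the helper names and the pilot data are hypothetical. The Wilson interval is included because a fifty-input pilot leaves real uncertainty in the rate, and the required N scales almost linearly with it.

```python
import math
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple:
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def pilot_discordant_rate(passes_a: list, passes_b: list) -> tuple:
    """Return (rate, (low, high)) for the fraction of inputs where A and B disagree."""
    if not passes_a or len(passes_a) != len(passes_b):
        raise ValueError("need two non-empty result lists of equal length")
    disagreements = sum(a != b for a, b in zip(passes_a, passes_b))
    n = len(passes_a)
    return disagreements / n, wilson_interval(disagreements, n)


# Hypothetical pilot: 50 paired inputs, 10 of which the two systems score differently.
passes_a = [True] * 35 + [False] * 15
passes_b = [True] * 29 + [False] * 6 + [True] * 4 + [False] * 11
rate, (low, high) = pilot_discordant_rate(passes_a, passes_b)
print(f"discordant rate {rate:.2f}, 95% interval {low:.2f} to {high:.2f}")
```

Running the sample-size calculation at both ends of the interval shows how much the design depends on the pilot estimate.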
For non-binary metrics, the calculation needs the standard deviation of the per-input difference. A pilot run gives you both the mean and the standard deviation; plug into the standard formula n = ((z_a + z_b) * sd / mde)^2 for the paired continuous case.
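The continuous case is a direct transcription of that formula. In this sketch the function name and the pilot differences are hypothetical, and the standard deviation is the sample standard deviation of per-input differences (system B score minus system A score).

```python
import math
from statistics import NormalDist, stdev


def n_paired_continuous(sd_diff: float, mde: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Paired inputs needed to detect a mean per-input difference of mde."""
    if sd_diff <= 0 or mde <= 0:
        raise ValueError("sd_diff and mde must be positive")
    std_normal = NormalDist()
    z_a = std_normal.inv_cdf(1 - alpha / 2)
    z_b = std_normal.inv_cdf(power)
    return math.ceil(((z_a + z_b) * sd_diff / mde) ** 2)


# Hypothetical pilot: per-input score differences on a 0-1 rubric scale.
pilot_diffs = [0.10, -0.05, 0.00, 0.20, -0.15, 0.05, 0.30, -0.10, 0.00, 0.15]
sd = stdev(pilot_diffs)
print(f"pilot sd of differences: {sd:.3f}")
print("N for a 0.05 MDE:", n_paired_continuous(sd, mde=0.05))
```

A ten-input pilot is only for illustration; the standard deviation estimate itself is noisy at that size.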
Power under clustering
When users contribute multiple inputs, the effective N is smaller than the raw count. The design effect (the ratio of clustered variance to naive variance) inflates the required N by roughly 1 + (m - 1) * rho, where m is the average cluster size and rho is the intracluster correlation. Even at modest values (m = 5, rho = 0.1), the inflation is 40%. The clustered standard errors page details the variance side; the power side is the same correction applied to the N estimate 1.
In practice, the cluster correction means an eval set of "200 inputs from 40 users" has the power of "roughly 140 inputs from independent users." If the design calls for 1,300 paired inputs, that maps to roughly 1,800 inputs if each user contributes five and they correlate moderately.
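The arithmetic above fits in a few lines. This sketch assumes equal-sized clusters and a known intracluster correlation; the function names are illustrative.

```python
import math


def design_effect(avg_cluster_size: float, icc: float) -> float:
    """Variance inflation from clustering: 1 + (m - 1) * rho."""
    return 1 + (avg_cluster_size - 1) * icc


def effective_n(raw_n: int, avg_cluster_size: float, icc: float) -> float:
    """Independent-input equivalent of a clustered sample."""
    return raw_n / design_effect(avg_cluster_size, icc)


def inflate_n(required_n: int, avg_cluster_size: float, icc: float) -> int:
    """Raw inputs to collect so the clustered sample meets required_n."""
    # round() guards against float noise pushing an exact product past the integer
    return math.ceil(round(required_n * design_effect(avg_cluster_size, icc), 6))


print(design_effect(5, 0.1))            # inflation factor for m = 5, rho = 0.1
print(round(effective_n(200, 5, 0.1)))  # 200 inputs from 40 users
print(inflate_n(1300, 5, 0.1))          # raw inputs needed for a 1,300-input design
```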
Why teams ignore power
Three reasons recur, and all three are worth flagging.
The first is "we already have 200 inputs and we want to ship." The cost of doubling the set feels higher than the cost of an uncertain decision; in practice the next regression that ships through the underpowered eval costs more than the dataset extension would have. The Field Guide and the FAQ both make this point: the eval set is a leveraged investment 3.
The second is "the bootstrap interval is wide and we accept it." The bootstrap reports the interval honestly; the team's acceptance does not change the underlying inferential weakness. A 90% confidence interval on a 5-point delta that ranges from minus 6 to plus 16 is a coin flip, not a decision input.
The third is "we will compensate with more frequent runs." Repeated underpowered tests with naive multiple-comparison handling produce false discoveries on a timeline the team has no plan to correct. The multiple-comparisons page is the correction; the prior question is whether the underlying test is powered.
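To put a rough number on the repeated-runs problem: the sketch below assumes independent runs, each tested at the same alpha, with no true difference between the systems. Runs on a shared eval set are correlated, so treat the result as an illustration of the trend rather than an exact rate. The function name is illustrative.

```python
def familywise_false_positive_rate(alpha: float, runs: int) -> float:
    """Chance of at least one false positive across independent uncorrected runs."""
    if not 0 < alpha < 1 or runs < 1:
        raise ValueError("need 0 < alpha < 1 and runs >= 1")
    return 1 - (1 - alpha) ** runs


for runs in (1, 5, 10, 20):
    rate = familywise_false_positive_rate(0.05, runs)
    print(f"{runs:>2} runs: {rate:.0%} chance of at least one false positive")
```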
> [!IMPORTANT]
> Pick MDE before you pick N. Picking N first and reading the MDE off the math invites the team to argue the underpowered comparison was decisive when it was not 2.
Power for online A/B
Online A/B inherits the same logic with two differences. The pairing is rarely clean (users do not always see both arms), so the comparison is usually unpaired. The variance is usually larger due to clustering by user. Chatbot Arena reports model intervals from bootstrap on a sample-by-sample basis precisely because the underlying pairwise comparisons are noisy and the team wants the interval to reflect the actual signal 4. The same discipline applies to in-product A/B: design the comparison for the MDE that matters, and run it long enough.
The flywheel framing makes the same point at the program level: eval programs that under-invest in power produce shipping decisions that look statistical but are not 5.
What to do this week
- For each release-gated metric, write the MDE on the spec. The number is a product decision; argue it with whoever owns the product.
- Compute the required N for each metric at 80% power, alpha 0.05. If the eval set is smaller, treat that as a known underpowered state, not as something you can wish away.
- Run a pilot of fifty paired inputs to estimate the discordant-pair rate or the standard deviation of the difference. Plug it into the design.
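The checklist can be turned into a small audit script. This is a sketch under stated assumptions: the metric names, MDEs, discordant rates, and set sizes are hypothetical placeholders, and the required N uses the paired McNemar normal approximation at alpha 0.05 and 80% power.

```python
import math
from dataclasses import dataclass
from statistics import NormalDist


@dataclass
class GatedMetric:
    name: str               # hypothetical metric name
    mde: float              # smallest delta worth detecting, from the spec
    discordant_rate: float  # estimated from a pilot
    current_n: int          # paired inputs in the current eval set


def required_n(mde: float, discordant_rate: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Paired-binary sample size via the McNemar normal approximation."""
    if not 0 < mde < discordant_rate:
        raise ValueError("need 0 < mde < discordant_rate")
    std_normal = NormalDist()
    z_a = std_normal.inv_cdf(1 - alpha / 2)
    z_b = std_normal.inv_cdf(power)
    n = ((z_a * math.sqrt(discordant_rate)
          + z_b * math.sqrt(discordant_rate - mde ** 2)) / mde) ** 2
    return math.ceil(n)


def audit(metrics: list) -> list:
    """Return (name, current_n, needed_n, status) for each gated metric."""
    report = []
    for m in metrics:
        needed = required_n(m.mde, m.discordant_rate)
        status = "ok" if m.current_n >= needed else "underpowered"
        report.append((m.name, m.current_n, needed, status))
    return report


metrics = [
    GatedMetric("answer_correctness", mde=0.05, discordant_rate=0.20, current_n=200),
    GatedMetric("refusal_rate", mde=0.10, discordant_rate=0.20, current_n=400),
]
for row in audit(metrics):
    print(row)
```

Keeping the MDE in the metric definition, next to the set size, makes the underpowered state visible in the same place the release gate is configured.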