Confidence intervals: Wilson and Clopper-Pearson

Wald is wrong. Wilson is the practical default. Clopper-Pearson is the conservative exact option. With code, and the case for why N=10 evals tell you almost nothing.

A pass rate without a confidence interval is a number pretending to be a finding. Seventy percent on ten traces and seventy percent on a thousand traces are the same point estimate and very different claims. The interval is what tells you which one to trust, and which alternative numbers the data is also consistent with.

The headline opinions: never use the Wald (normal-approximation) interval; default to Wilson; reach for Clopper-Pearson when you need a guarantee that the coverage is at least the nominal level; and accept that with fewer than two hundred evals you usually cannot distinguish a 70% system from a 55% one. The math is short, the code is shorter, and the consequences of getting this wrong are several "ship it" decisions per quarter that did not deserve the confidence they got.

Why Wald is wrong

The Wald interval is the one undergraduate stats courses introduce first: $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$. It is wrong for binary eval metrics in three specific ways.

First, it collapses at the boundaries. If you observe 10/10 successes, $\hat{p} = 1$, the standard error is zero, and the interval is the single point [1.0, 1.0]. Wald tells you a 100% pass rate on ten examples is a certainty. It is not.

Second, the actual coverage is well below the nominal 95% even for moderate $n$ and moderate $\hat{p}$. The normal approximation undercovers, especially near the boundaries, so a "95% Wald interval" might actually contain the true rate 88% of the time ¹.

Third, it can extend below 0 or above 1, which is nonsensical for a proportion and confusing to readers.

Wilson and Clopper-Pearson fix all three. There is no reason to ship Wald in 2026.
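The undercoverage claim can be checked without simulation: for a fixed $n$ and true rate $p$, sum the binomial probability of every outcome whose interval contains $p$. The sketch below does that for Wald and Wilson. The function names and the $(n, p)$ pairs are illustrative choices, not taken from the cited sources.

```python
import numpy as np
from scipy import stats


def wald_bounds(k, n, z):
    """Unclipped Wald bounds for k successes out of n (k may be an array)."""
    p_hat = k / n
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se


def wilson_bounds(k, n, z):
    """Wilson score bounds for k successes out of n (k may be an array)."""
    p_hat = k / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def exact_coverage(bounds_fn, n, p, alpha=0.05):
    """Probability that the interval contains the true rate p.

    Enumerates every possible outcome k = 0..n and adds up the binomial
    probability of the outcomes whose interval covers p.
    """
    z = stats.norm.ppf(1 - alpha / 2)
    k = np.arange(n + 1)
    low, high = bounds_fn(k, n, z)
    covered = (low <= p) & (p <= high)
    return float(stats.binom.pmf(k, n, p)[covered].sum())


# Illustrative (n, p) pairs; nominal coverage is 0.95 in every row.
for n, p in [(20, 0.9), (50, 0.9), (100, 0.7)]:
    wald = exact_coverage(wald_bounds, n, p)
    wilson = exact_coverage(wilson_bounds, n, p)
    print(f"n={n:4d} p={p:.2f}  Wald={wald:.3f}  Wilson={wilson:.3f}")
```

At $n = 20$ and $p = 0.9$, the 20/20 outcome alone has probability $0.9^{20} \approx 0.12$, and its Wald interval is the single point 1.0, so Wald cannot reach 90% coverage there, let alone 95%.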

Wilson score interval

The Wilson interval is the inversion of the score test under the binomial likelihood. It does not collapse at the boundaries, it has good coverage even at small $n$, and it stays inside [0, 1] ¹. The formula:

$$
\frac{\hat{p} + \frac{z^2}{2n} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}
$$

For 95% intervals, $z \approx 1.96$. This is the default for every binary eval metric on this site.
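To make the formula concrete, here is a from-scratch sketch that uses only the Python standard library. The function name `wilson_from_formula` is a hypothetical helper for illustration; in practice the library call shown later on this page is the better choice.

```python
from math import sqrt
from statistics import NormalDist


def wilson_from_formula(successes: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Wilson score interval computed term by term from the formula above."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


low, high = wilson_from_formula(7, 10)
print(f"7/10  -> ({low:.4f}, {high:.4f})")   # roughly (0.3968, 0.8922)

low, high = wilson_from_formula(10, 10)
print(f"10/10 -> ({low:.4f}, {high:.4f})")   # lower bound is n / (n + z^2), not 1.0
```

Note that the center of the interval is pulled from $\hat{p}$ toward 0.5, which is why the interval is asymmetric around the point estimate and never collapses at 0 or 1.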

Clopper-Pearson exact interval

Clopper-Pearson is the inversion of the exact binomial test. It is constructed to guarantee that the actual coverage is at least the nominal level ¹. The price is conservatism: the interval is wider than Wilson at the same $n$ and $\hat{p}$. Use it when you have a coverage guarantee to make (a regulatory filing, a contract, an audit) or when the consequence of overstating precision is real.
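The exact interval has a well-known closed form in terms of beta-distribution quantiles. The sketch below computes it directly with `scipy.stats.beta`; the function name `clopper_pearson_from_beta` is illustrative, and the boundary cases (0 or $n$ successes) are handled explicitly because the beta parameters would otherwise be zero.

```python
from scipy import stats


def clopper_pearson_from_beta(successes: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Clopper-Pearson interval from beta quantiles.

    lower = Beta(alpha/2;     k,     n - k + 1)
    upper = Beta(1 - alpha/2; k + 1, n - k)
    with the lower bound fixed at 0 when k = 0 and the upper at 1 when k = n.
    """
    k = successes
    lower = 0.0 if k == 0 else float(stats.beta.ppf(alpha / 2, k, n - k + 1))
    upper = 1.0 if k == n else float(stats.beta.ppf(1 - alpha / 2, k + 1, n - k))
    return lower, upper


print(clopper_pearson_from_beta(7, 10))    # roughly (0.3475, 0.9333)
print(clopper_pearson_from_beta(10, 10))   # lower bound is (alpha/2) ** (1/n)
```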

Code, both intervals

The standard implementations live in statsmodels.stats.proportion and scipy.stats. Both are fewer than ten lines.

import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportion_confint

def wilson_ci(successes: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Wilson score interval. Default 95% confidence."""
    low, high = proportion_confint(successes, n, alpha=alpha, method="wilson")
    return low, high

def clopper_pearson_ci(successes: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Clopper-Pearson exact interval. Conservative; coverage at least 1 - alpha."""
    low, high = proportion_confint(successes, n, alpha=alpha, method="beta")
    return low, high

def wald_ci_for_comparison_only(successes: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Wald is shown here to demonstrate why it is wrong. Do not ship."""
    p_hat = successes / n
    z = stats.norm.ppf(1 - alpha / 2)
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - z * se), min(1.0, p_hat + z * se)


# The canonical sobering case: 7 out of 10.
print("7/10 successes")
print(f"  Wilson 95%:          {wilson_ci(7, 10)}")
print(f"  Clopper-Pearson 95%: {clopper_pearson_ci(7, 10)}")
print(f"  Wald 95% (wrong):    {wald_ci_for_comparison_only(7, 10)}")

# Output (approximate):
# Wilson 95%:          (0.3968, 0.8922)
# Clopper-Pearson 95%: (0.3475, 0.9333)
# Wald 95% (wrong):    (0.4160, 0.9840)

print("\nThe boundary case: 10/10 successes")
print(f"  Wilson 95%:          {wilson_ci(10, 10)}")
print(f"  Clopper-Pearson 95%: {clopper_pearson_ci(10, 10)}")
print(f"  Wald 95% (wrong):    {wald_ci_for_comparison_only(10, 10)}")

# Output (approximate):
# Wilson 95%:          (0.7225, 1.0000)
# Clopper-Pearson 95%: (0.6915, 1.0000)
# Wald 95% (wrong):    (1.0000, 1.0000)  # the interval is a single point

print("\nThe N=200 case: 140/200 (70% pass rate)")
print(f"  Wilson 95%:          {wilson_ci(140, 200)}")
print(f"  Clopper-Pearson 95%: {clopper_pearson_ci(140, 200)}")

# Output (approximate):
# Wilson 95%:          (0.6332, 0.7593)
# Clopper-Pearson 95%: (0.6307, 0.7621)

CITATION

Wikipedia, "Binomial proportion confidence interval." The canonical write-up with derivations of Wald, Wilson, Clopper-Pearson, Agresti-Coull, and Jeffreys. It explains why Wald undercovers and why Wilson is the practical default. en.wikipedia.org/wiki/Binomial_proportion_confidence_interval

The 7-out-of-10 lesson

Wilson on 7/10 gives roughly [0.40, 0.89]. That interval is enormous. It is also correct. It says the data is consistent with a true pass rate as low as 40% and as high as 89%, and the eval cannot discriminate among the values in between.

Teams discover this when they run a 10-trace "quick eval" and report "the new prompt scored 70% vs the old prompt's 60%." Both estimates have intervals so wide they overlap almost entirely. The change is not detectable from the data, and "ship it" is at best a coin flip. Husain and Shankar make the same point bluntly in their FAQ: tiny eval sets are vibe-checking with extra steps ².

The fix is a larger N. The width of the Wilson interval shrinks roughly with $1/\sqrt{n}$, so to halve it you quadruple the dataset. The next section is the practical sample-size guide.

When N starts to matter

There is no universal "right" sample size; the right $n$ depends on what difference you are trying to detect. A rough table of detectable differences at 95% confidence on a 70% baseline, derived from Wilson intervals:

N	Wilson 95% half-width	Smallest detectable delta vs 70% baseline
10	~25 percentage points	basically anything large enough to notice by eye
50	~12 percentage points	a 20-point shift
100	~9 percentage points	a 13-point shift
200	~6 percentage points	an 8-point shift
500	~4 percentage points	a 5-point shift
1000	~3 percentage points	a 4-point shift
2000	~2 percentage points	a 3-point shift

A few takeaways the table makes concrete. At N=10, you cannot distinguish 70% from 50%. At N=100, you cannot distinguish 70% from 65%. At N=500 the interval starts to be useful for the kinds of changes a thoughtful prompt edit produces. The N=200 threshold often comes up because that is where Wilson half-widths drop below 7 points on a 70% baseline, which is roughly the smallest delta most prompt or model changes deliver in production ³.
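The table can also be read in the other direction: given a target half-width, how many evals are needed? The sketch below scans for the smallest $n$ whose Wilson half-width at an assumed baseline rate is at or below the target. The helper names and the linear scan are illustrative choices; the baseline is treated as a continuous rate rather than an integer success count, so results are approximate.

```python
from math import sqrt
from statistics import NormalDist


def wilson_half_width(p: float, n: int, alpha: float = 0.05) -> float:
    """Half-width of the Wilson interval at observed rate p and sample size n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / (1 + z**2 / n)


def n_for_half_width(target: float, p: float = 0.70, alpha: float = 0.05,
                     n_max: int = 1_000_000) -> int:
    """Smallest n whose Wilson half-width at rate p is <= target."""
    for n in range(1, n_max + 1):
        if wilson_half_width(p, n, alpha) <= target:
            return n
    raise ValueError("target half-width not reachable within n_max")


# Half-widths at the table's sample sizes, in percentage points.
for n in [10, 50, 100, 200, 500, 1000, 2000]:
    print(f"N={n:5d}  half-width ~ {100 * wilson_half_width(0.70, n):.1f} pt")

# Halving the target half-width costs roughly four times the data.
n_6pt = n_for_half_width(0.06)
n_3pt = n_for_half_width(0.03)
print(n_6pt, n_3pt, round(n_3pt / n_6pt, 2))
```

The ratio printed on the last line should land close to 4, which is the $1/\sqrt{n}$ scaling expressed as a budget.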

Width scales with $1/\sqrt{n}$: quadrupling N halves the interval.

flowchart TD
    A["N=10: +/-25pt"] --> B["N=50: +/-12pt"]
    B --> C["N=200: +/-6pt"]
    C --> D["N=1000: +/-3pt"]

Two-sample comparison

A common downstream question is "is system A better than system B." For independent binary samples, the standard approach is a two-proportion test plus an interval on the difference. Wilson has a two-proportion analogue (Newcombe-Wilson) that handles the binary-binary comparison correctly.

from statsmodels.stats.proportion import confint_proportions_2indep, test_proportions_2indep

def compare_systems(successes_a: int, n_a: int, successes_b: int, n_b: int, alpha: float = 0.05):
    """Newcombe-Wilson interval on the difference plus a p-value."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    diff_low, diff_high = confint_proportions_2indep(
        successes_a, n_a, successes_b, n_b,
        alpha=alpha, method="newcomb",  # statsmodels spelling of Newcombe; other options: wald, score, agresti-caffo
    )
    test = test_proportions_2indep(successes_a, n_a, successes_b, n_b, method="score")
    return {
        "p_a": p_a,
        "p_b": p_b,
        "diff": p_a - p_b,
        "diff_ci": (diff_low, diff_high),
        "p_value": test.pvalue,
    }

result = compare_systems(140, 200, 120, 200)
# diff ~ 0.10; diff_ci ~ (0.01, 0.19); p_value ~ 0.04
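
For intuition about what the Newcombe-Wilson interval does, here is a hand-rolled sketch of the hybrid score construction: compute a Wilson interval for each system separately, then combine the distances from each point estimate to its own bounds. The helper names are hypothetical, and this is a sketch of the textbook construction without continuity correction, not a claim about any library's internals.

```python
from math import sqrt
from statistics import NormalDist


def _wilson(successes: int, n: int, z: float) -> tuple[float, float]:
    """Wilson score bounds for a single proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def newcombe_diff_ci(successes_a: int, n_a: int, successes_b: int, n_b: int,
                     alpha: float = 0.05) -> tuple[float, float]:
    """Newcombe hybrid score interval for p_a - p_b (independent samples)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_a, p_b = successes_a / n_a, successes_b / n_b
    low_a, high_a = _wilson(successes_a, n_a, z)
    low_b, high_b = _wilson(successes_b, n_b, z)
    diff = p_a - p_b
    # The lower bound pairs A's lower distance with B's upper distance, and vice versa.
    lower = diff - sqrt((p_a - low_a) ** 2 + (high_b - p_b) ** 2)
    upper = diff + sqrt((high_a - p_a) ** 2 + (p_b - low_b) ** 2)
    return lower, upper


low, high = newcombe_diff_ci(140, 200, 120, 200)
print(f"diff CI ~ ({low:.3f}, {high:.3f})")
```

For 140/200 versus 120/200 the lower bound sits just above zero, which is the interval's way of saying the 10-point difference is detectable but only barely.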

Two cautions. First, the comparison above assumes the two samples are independent. They usually are not: most teams reuse the same eval set across systems, which is paired data and wants McNemar's test or a paired bootstrap. The bootstrap and paired tests page covers that. Second, if the system has multiple subscores (faithfulness, conciseness, refusal rate) and you compare each, you have multiple comparisons and need the correction on the multiple comparisons page.
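
For the paired case, a minimal sketch of the exact McNemar test is below. It assumes per-item pass/fail results for both systems on the same eval set, and it uses `scipy.stats.binomtest` on the discordant pairs; statsmodels also provides a `mcnemar` function in `statsmodels.stats.contingency_tables`. The data is made up for illustration.

```python
from scipy.stats import binomtest


def mcnemar_exact(pass_a: list[bool], pass_b: list[bool]) -> tuple[int, int, float]:
    """Exact McNemar test on paired binary outcomes.

    Only items where the two systems disagree carry information. Under the
    null hypothesis each disagreement is equally likely to favour A or B,
    so the count favouring A is Binomial(discordant, 0.5).
    """
    if len(pass_a) != len(pass_b):
        raise ValueError("paired samples must have the same length")
    a_only = sum(1 for a, b in zip(pass_a, pass_b) if a and not b)
    b_only = sum(1 for a, b in zip(pass_a, pass_b) if b and not a)
    discordant = a_only + b_only
    if discordant == 0:
        return a_only, b_only, 1.0  # no disagreements, no evidence either way
    p_value = binomtest(a_only, discordant, 0.5).pvalue
    return a_only, b_only, float(p_value)


# Hypothetical 200-item eval set: 125 both pass, 15 only A passes,
# 5 only B passes, 55 both fail.
pass_a = [True] * 125 + [True] * 15 + [False] * 5 + [False] * 55
pass_b = [True] * 125 + [False] * 15 + [True] * 5 + [False] * 55

a_only, b_only, p_value = mcnemar_exact(pass_a, pass_b)
print(a_only, b_only, round(p_value, 4))
```

The design choice worth noting is that the 180 items where both systems agree drop out of the test entirely; the sample that matters is the 20 disagreements.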

The Chatbot Arena precedent

Large pairwise battle systems hit the same issue at scale and resolved it with bootstrap. Chatbot Arena reports model Elo scores with 95% bootstrap confidence intervals on every release ⁴. The interval is what lets a reader distinguish "model A is clearly above model B" from "the difference is within the noise floor." Without the interval, the leaderboard is a ranking with no meaning attached to neighbouring entries.

The Anthropic Error Bars paper makes the same point for general eval reports: every headline number deserves an interval, every comparison deserves a p-value, and any release decision should be accountable to both ³ ⁵.

When Wilson is not enough

Wilson assumes the trials are independent Bernoulli draws. For most offline eval sets, they are: the prompts are distinct, the labels are independent, the judge is the same. The assumption breaks in three settings.

The first is when the same user contributes multiple traces. Errors on one trace from a user predict errors on others. The variance is larger than Wilson reports, and the interval is too narrow. The clustered standard errors page is the correction.

The second is when the metric is a numeric score, not a binary label (a Likert rubric, a token-overlap score, a continuous quality score). Wilson does not apply; use a bootstrap percentile interval or a normal-theory interval with the sample standard deviation, depending on the distribution.

The third is when the dataset is small and the judge is noisy. The judge's own variance compounds with the sampling variance. The fix is usually larger N, multiple judges, or a paired design where the judge sees both systems' outputs and the comparison is at the judge level.
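
For the second case (a numeric score rather than a binary label), a percentile bootstrap on the mean is a reasonable starting point. The sketch below assumes independent items and uses made-up rubric scores; the function name, resample count, and seed are illustrative defaults.

```python
import numpy as np


def bootstrap_percentile_ci(scores, alpha: float = 0.05,
                            n_resamples: int = 10_000, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for the mean of a numeric eval score."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Each row is one resample of the dataset, drawn with replacement.
    idx = rng.integers(0, len(scores), size=(n_resamples, len(scores)))
    resampled_means = scores[idx].mean(axis=1)
    low, high = np.quantile(resampled_means, [alpha / 2, 1 - alpha / 2])
    return float(low), float(high)


# Hypothetical 1-5 rubric scores from 20 traces.
scores = [4, 5, 3, 4, 2, 5, 4, 3, 4, 4, 5, 3, 2, 4, 4, 3, 5, 4, 3, 4]
low, high = bootstrap_percentile_ci(scores)
print(f"mean={np.mean(scores):.2f}  95% CI ~ ({low:.2f}, {high:.2f})")
```

If the same user contributes several traces (the first case), resampling individual rows understates the variance; resampling whole users instead is the usual adjustment.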

What to put on the dashboard

The minimum acceptable presentation of a binary eval metric:

The point estimate, formatted to 1 decimal place (72.5%).
The Wilson 95% interval, formatted to 1 decimal place ([67.9%, 76.6%]).
The N (n=400).
The judge version and dataset version that produced the score.
A delta against the previous run with its own interval if the run is comparative.

A dashboard that shows 72.5% and nothing else is hiding the interval, not avoiding it. Readers fill in their own (usually too narrow). Show the interval and let the reader make calibrated decisions.
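
A small formatter makes the minimum presentation hard to skip. The sketch below renders the point estimate, Wilson interval, N, and the two version labels in one string. The function name, output layout, and the version strings `judge-v3` and `evalset-v7` are hypothetical placeholders.

```python
from math import sqrt
from statistics import NormalDist


def format_binary_metric(successes: int, n: int, judge_version: str,
                         dataset_version: str, alpha: float = 0.05) -> str:
    """Render a pass rate with its Wilson interval, N, and provenance labels."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    low, high = center - half, center + half
    return (
        f"{100 * p_hat:.1f}% [{100 * low:.1f}%, {100 * high:.1f}%] "
        f"(n={n}, judge={judge_version}, data={dataset_version})"
    )


# Hypothetical run: 290 passes out of 400 traces.
print(format_binary_metric(290, 400, "judge-v3", "evalset-v7"))
```

Returning a single string keeps the interval attached to the number wherever it is copied, which is the point of the list above.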

The cookbook page

The adding error bars cookbook walks through the end-to-end workflow: take a CSV of eval results, compute Wilson, compute a paired bootstrap, plot the interval. It is the practical companion to this page.

¹ Wikipedia, "Binomial proportion confidence interval."
² Hamel Husain and Shreya Shankar, "LLM Evals FAQ."
³ Miller, "Adding Error Bars to Evals" (Anthropic), arXiv:2411.00640.
⁴ Chiang et al., "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference," arXiv:2403.04132.
⁵ Anthropic, "A Statistical Approach to Model Evaluations."