A pass rate without a confidence interval is a number pretending to be a finding. Seventy percent on ten traces and seventy percent on a thousand traces are the same point estimate and very different claims. This page is the minimum statistical toolkit for an eval program.
Confidence intervals on binary metrics
Three intervals are commonly cited. One is wrong.
Wald (normal approximation). $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$. Collapses at boundaries, undercovers in practice, can extend below zero or above one. Never ship it [1].
Wilson score. Inverts the score test under the binomial. Does not collapse at boundaries, stays inside [0, 1], good coverage at small N. The practical default.
Clopper-Pearson exact. Inverts the exact binomial test. Conservative (wider than Wilson) but guarantees nominal coverage. Use for regulatory filings, contracts, audits.
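The boundary behaviour described above can be seen on a ten-trace sample. The following is a minimal sketch using only the standard library; the function names `wald_interval` and `wilson_interval` are illustrative, not part of any library, and the formulas are the textbook normal-approximation and Wilson score forms.

```python
import math
from statistics import NormalDist


def wald_interval(successes, n, alpha=0.05):
    """Normal-approximation interval, shown only to illustrate its failure modes."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson_interval(successes, n, alpha=0.05):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamping only guards against floating-point rounding at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


for s in (7, 9, 10):
    print(s, "Wald:", wald_interval(s, 10), "Wilson:", wilson_interval(s, 10))
```

On 10/10 the Wald interval has zero width, and on 9/10 its upper bound exceeds one; the Wilson interval stays inside [0, 1] with non-zero width in both cases.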
The 7-out-of-10 lesson: Wilson on 7/10 gives roughly [0.40, 0.89]. Most teams discover this when they ship on the difference between 70% and 60% on ten traces; the two intervals overlap almost entirely [2].
```python
from statsmodels.stats.proportion import proportion_confint

def wilson(s, n, alpha=0.05):
    return proportion_confint(s, n, alpha=alpha, method="wilson")

def clopper_pearson(s, n, alpha=0.05):
    # statsmodels names the Clopper-Pearson interval "beta"
    return proportion_confint(s, n, alpha=alpha, method="beta")
```
| N | Wilson 95% half-width on 70% | Smallest detectable delta |
|---|---|---|
| 10 | ~25 points | any large effect |
| 50 | ~13 points | 20 points |
| 100 | ~9 points | 13 points |
| 200 | ~6 points | 8 points |
| 500 | ~4 points | 5 points |
N=200 is the rough threshold where Wilson half-widths drop below seven points on a 70% baseline. Most prompt or model changes deliver four to seven points, so N=200 is the standard release-gate size [3]. The confidence intervals chapter walks the derivations.
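The half-width column can be recomputed directly from the Wilson formula. This is a small sketch, assuming a 70% observed pass rate and a 95% level; `wilson_half_width` is an illustrative helper name.

```python
import math
from statistics import NormalDist


def wilson_half_width(p, n, alpha=0.05):
    """Half-width of the Wilson score interval at observed proportion p and sample size n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / (1 + z * z / n)


half_widths = {n: wilson_half_width(0.70, n) for n in (10, 50, 100, 200, 500)}
for n, hw in half_widths.items():
    print(f"N={n:>3}: +/- {100 * hw:.1f} points")
```

The printed values should land within about a point of the rounded figures in the table.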
Bootstrap for noisy metrics
When the metric is not a clean binomial (Likert, token-overlap, embedding cosine, Elo), Wilson does not apply. Bootstrap is the general tool: draw N observations with replacement from the N originals, compute the metric on that resample, repeat for B replicates, and take the 2.5th and 97.5th percentiles of the replicate values. Chatbot Arena uses bootstrap on Bradley-Terry fits for leaderboard intervals [4].
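```python
import numpy as np

def bootstrap_ci(metric_fn, data, n_boot=1000, alpha=0.05):
    n = len(data)
    boot = np.empty(n_boot)
    for b in range(n_boot):
        # Draw n indices with replacement and recompute the metric on the resample.
        idx = np.random.randint(0, n, size=n)
        boot[b] = metric_fn([data[i] for i in idx])
    # Percentile interval: the alpha/2 and 1 - alpha/2 quantiles of the replicates.
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    # The first value is the mean of the replicates, not the metric on the original data.
    return float(boot.mean()), float(lo), float(hi)
```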
1000 replicates for headline numbers, 200 for fast triage. The dominant cost is the metric function.
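As a usage illustration on a non-binomial metric, the sketch below bootstraps the mean of a set of Likert ratings. It is a standard-library variant rather than the NumPy function above: `bootstrap_percentile_ci`, the `seed` parameter, and the ratings are all hypothetical, and the seed is there only so that a rerun reproduces the same interval.

```python
import random
import statistics


def bootstrap_percentile_ci(metric_fn, data, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for any metric; seeded so reruns give the same interval."""
    rng = random.Random(seed)
    n = len(data)
    # Each replicate draws n observations with replacement and recomputes the metric.
    boot = sorted(metric_fn(rng.choices(data, k=n)) for _ in range(n_boot))
    # Order-statistic percentiles, without interpolation.
    lo = boot[int(n_boot * alpha / 2)]
    hi = boot[int(n_boot * (1 - alpha / 2)) - 1]
    return metric_fn(data), lo, hi


# Hypothetical 1-5 Likert ratings from a judge.
ratings = [5, 4, 4, 3, 5, 2, 4, 4, 3, 5, 1, 4, 5, 3, 4, 4, 2, 5, 4, 3]
point, lo, hi = bootstrap_percentile_ci(statistics.mean, ratings)
print(f"mean rating {point:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Any callable that maps a list of observations to a number can stand in for `statistics.mean`, for example a median or a token-overlap average.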
Paired tests
Two systems evaluated on the same set are paired data, not independent samples. Scores on the same items are usually positively correlated, so treating the two runs as independent overstates the variance of the difference and hides real effects.
For binary: McNemar's test on the discordant pairs. For continuous: paired bootstrap (resample pairs, not individual scores; percentile interval on the differences). For two-proportion intervals: Newcombe-Wilson paired form or paired bootstrap.
```python
import numpy as np

def paired_bootstrap_delta(a, b, n_boot=1000, alpha=0.05):
    # a[i] and b[i] are the two systems' scores on the same item i.
    deltas = np.array(a) - np.array(b)
    boot = np.empty(n_boot)
    for k in range(n_boot):
        # Resample per-item differences, which keeps each pair intact.
        idx = np.random.randint(0, len(deltas), size=len(deltas))
        boot[k] = deltas[idx].mean()
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(deltas.mean()), float(lo), float(hi)
```
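For the binary case, McNemar's test looks only at the discordant pairs. The sketch below is one way to compute the exact (binomial) two-sided version with the standard library; `mcnemar_exact` is an illustrative name and the pass/fail data are hypothetical.

```python
from math import comb


def mcnemar_exact(a_correct, b_correct):
    """Exact two-sided McNemar test for two systems scored pass/fail on the same items."""
    # Only discordant pairs carry information about the difference.
    a_only = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    b_only = sum(1 for a, b in zip(a_correct, b_correct) if b and not a)
    n = a_only + b_only
    if n == 0:
        return a_only, b_only, 1.0
    # Under the null, each discordant pair favours either system with probability 0.5.
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)


# Hypothetical results on 100 shared items:
# 60 both pass, 10 only A passes, 2 only B passes, 28 both fail.
a = [1] * 60 + [1] * 10 + [0] * 2 + [0] * 28
b = [1] * 60 + [0] * 10 + [1] * 2 + [0] * 28
a_only, b_only, p_value = mcnemar_exact(a, b)
print(a_only, b_only, round(p_value, 4))
```

The 88 concordant items do not enter the p-value at all; only the 10-versus-2 split among the discordant pairs does.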
Clustered standard errors
Wilson and the bootstrap assume independent observations. Three eval-program patterns violate independence: the same user contributes multiple traces, multiple questions share a source document, multi-turn conversations are scored per turn.
Miller reports that clustered standard errors on popular evals can be more than three times the naive ones [3]. Under a normal approximation, a threefold understatement means a nominal 95% interval covers the true value only about half the time. The fix is a cluster bootstrap: resample clusters with replacement, not individual observations. The clustered standard errors chapter walks the math.
```python
import numpy as np

def cluster_bootstrap_ci(scores, cluster_ids, n_boot=1000, alpha=0.05):
    # Group scores by cluster (user, source document, conversation, ...).
    by_cluster = {}
    for s, c in zip(scores, cluster_ids):
        by_cluster.setdefault(c, []).append(s)
    keys = list(by_cluster.keys())
    boot = np.empty(n_boot)
    for b in range(n_boot):
        # Resample whole clusters with replacement, then pool their scores.
        idx = np.random.randint(0, len(keys), size=len(keys))
        sample = []
        for i in idx:
            sample.extend(by_cluster[keys[i]])
        boot[b] = np.mean(sample)
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(np.mean(scores)), float(lo), float(hi)
```
If any observable attribute groups observations (user, source doc, conversation, topic), cluster on that group. Skipping this step is the single most common reason eval results overstate precision.
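To see how much precision a naive interval can overstate, the sketch below compares an observation-level bootstrap with a cluster-level one on deliberately extreme data. Everything here is hypothetical: 20 conversations of 10 turns each, where every turn in a conversation shares the same outcome. It uses only the standard library, and the helper names are illustrative.

```python
import random


def percentile_ci(samples, alpha=0.05):
    """Order-statistic percentile interval, without interpolation."""
    ordered = sorted(samples)
    lo = ordered[int(len(ordered) * alpha / 2)]
    hi = ordered[int(len(ordered) * (1 - alpha / 2)) - 1]
    return lo, hi


def naive_bootstrap(scores, n_boot=2000, seed=0):
    """Resamples individual observations, ignoring the cluster structure."""
    rng = random.Random(seed)
    n = len(scores)
    return percentile_ci([sum(rng.choices(scores, k=n)) / n for _ in range(n_boot)])


def cluster_bootstrap(clusters, n_boot=2000, seed=0):
    """Resamples whole clusters, then pools their observations."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        drawn = rng.choices(clusters, k=len(clusters))
        pooled = [s for cluster in drawn for s in cluster]
        means.append(sum(pooled) / len(pooled))
    return percentile_ci(means)


# 14 conversations where every turn passes, 6 where every turn fails.
clusters = [[1] * 10 for _ in range(14)] + [[0] * 10 for _ in range(6)]
scores = [s for cluster in clusters for s in cluster]

naive_lo, naive_hi = naive_bootstrap(scores)
clus_lo, clus_hi = cluster_bootstrap(clusters)
print(f"naive   95% CI: [{naive_lo:.2f}, {naive_hi:.2f}]")
print(f"cluster 95% CI: [{clus_lo:.2f}, {clus_hi:.2f}]")
```

Both intervals are centred near 70%, but the naive one behaves as if there were 200 independent observations when there are effectively 20, so the cluster interval should come out several times wider.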
Power and sample-size planning
Statistical power is the probability of detecting a real effect of a given size. Shipping at 30% power on the deltas the team cares about means the eval misses real improvements about 70% of the time.
Rule of thumb for binary metrics at 80% power and 5% significance, on a 70% baseline: detecting a 5-point delta takes roughly N = 800 per arm, a 10-point delta roughly 200 per arm, and a 20-point delta roughly 50 per arm [5]. Halving the delta of interest quadruples the required N.
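The text does not say which formula sits behind these figures. One reading that reproduces them, offered here as an assumption, is the one-sample normal-approximation formula with the worst-case variance p(1-p) = 0.25, that is N ≈ (z_{α/2} + z_β)² · 0.25 / δ². The sketch below computes that alongside the textbook unpaired two-proportion formula for comparison; both function names are illustrative.

```python
import math
from statistics import NormalDist


def n_rule_of_thumb(delta, alpha=0.05, power=0.80, variance=0.25):
    """One-sample normal-approximation size; variance=0.25 is the worst case p(1-p)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)


def n_per_arm_two_sample(p1, p2, alpha=0.05, power=0.80):
    """Per-arm size for an unpaired two-proportion z-test (unpooled variance)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    var_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * var_sum / (p2 - p1) ** 2)


for delta in (0.05, 0.10, 0.20):
    print(delta, n_rule_of_thumb(delta), n_per_arm_two_sample(0.70, 0.70 + delta))
```

The first function gives 785, 197 and 50, in line with the rounded figures above, and shows the quadrupling as the delta halves. The unpaired two-arm formula asks for more (about 1,250 per arm for a 5-point delta from 70%), so which design the eval actually uses, paired on the same items or two independent samples, matters when planning N.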
Multiple comparisons
Reporting a release-gate decision on whichever sub-metric "scored best" out of fifteen is the classic trap. With fifteen independent tests at p < 0.05, the family-wise error rate is roughly 54%. Bonferroni (conservative) or Benjamini-Hochberg FDR (less conservative) are the standard corrections. The practical rule: pick a headline metric before looking at the data; report others with FDR as secondary.
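The 54% figure is 1 - 0.95^15. The sketch below computes it and applies both corrections to a small set of p-values. It is a plain-Python illustration with hypothetical p-values and illustrative function names; in practice a library routine would normally be used instead.

```python
def family_wise_error_rate(m, alpha=0.05):
    """Chance of at least one false positive across m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m


def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= k * alpha / m.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject


print(round(family_wise_error_rate(15), 3))

# Hypothetical p-values for five sub-metrics.
p_values = [0.001, 0.012, 0.021, 0.035, 0.30]
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

On these five values Bonferroni keeps only the smallest p-value, while Benjamini-Hochberg keeps the first four, which is the sense in which it is less conservative.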
Inter-rater agreement
Cohen's kappa for two raters on the same items. Krippendorff's alpha for more than two raters, missing data, or ordinal scales. Both correct for chance agreement. The standard bar for an eval-program calibration set is alpha or kappa above 0.80 [6, 7]. Below 0.60 the rubric is ambiguous and needs rewriting; between 0.60 and 0.80 the rubric needs another iteration; above 0.80 it is shippable.
| Statistic | When | Tooling |
|---|---|---|
| Percent agreement | never (chance-inflated) | none |
| Cohen's kappa | 2 raters, categorical | sklearn.metrics.cohen_kappa_score |
| Krippendorff's alpha | 2+ raters, any scale | krippendorff (PyPI) |
| Fleiss' kappa | 3+ raters, categorical, no missing | statsmodels.stats.inter_rater.fleiss_kappa |
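The "chance-inflated" note on percent agreement is easiest to see on an imbalanced label set. The sketch below computes both statistics from scratch for two raters; the labels are hypothetical and the function names are illustrative.

```python
from collections import Counter


def percent_agreement(r1, r2):
    """Share of items on which the two raters gave the same label."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


def cohens_kappa(r1, r2):
    """Cohen's kappa: observed agreement corrected for agreement expected by chance."""
    n = len(r1)
    p_observed = percent_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement: both raters pick the same label independently,
    # each at their own marginal rate.
    p_chance = sum(c1[label] * c2[label] for label in c1) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical labels from two raters on 20 traces, mostly "pass".
rater_a = ["pass"] * 16 + ["fail"] * 4
rater_b = ["pass"] * 15 + ["fail"] + ["pass"] * 2 + ["fail"] * 2

agreement = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(f"agreement {agreement:.2f}, kappa {kappa:.2f}")
```

The raters agree on 85% of items, yet kappa is about 0.48, because two raters who mostly say "pass" would agree 71% of the time by chance alone. On the thresholds above, this rubric would need rewriting despite the high raw agreement.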
What to put on the dashboard
Point estimate to one decimal place. Wilson 95% interval (or bootstrap percentile for non-binary). The N. The judge and dataset versions. For paired comparisons, the paired bootstrap delta and its interval. For clustered data, the cluster-aware interval. A dashboard that shows 72.5% and nothing else is hiding the interval, not avoiding it.