
Statistical methods

Wilson, Clopper-Pearson, bootstrap, paired tests, clustered standard errors, inter-rater agreement. The minimum statistics an eval program needs.

A pass rate without a confidence interval is a number pretending to be a finding. Seventy percent on ten traces and seventy percent on a thousand traces are the same point estimate and very different claims. This page is the minimum statistical toolkit for an eval program.

Confidence intervals on binary metrics

Three intervals are commonly cited. One is wrong.

Wald (normal approximation). $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$. Collapses to zero width at the boundaries (0/n or n/n), undercovers at small N, and can extend below zero or above one. Never ship it ¹.

Wilson score. Inverts the score test under the binomial. Does not collapse at boundaries, stays inside [0, 1], good coverage at small N. The practical default.

Clopper-Pearson exact. Inverts the exact binomial test. Conservative (wider than Wilson) but guarantees nominal coverage. Use for regulatory filings, contracts, audits.

The 7-out-of-10 lesson: Wilson on 7/10 gives roughly [0.40, 0.89], and Wilson on 6/10 gives roughly [0.31, 0.83]. Most teams discover this when they ship on the difference between 70% and 60% on ten traces; the two intervals overlap almost entirely ².

from statsmodels.stats.proportion import proportion_confint

def wilson(s, n, alpha=0.05): return proportion_confint(s, n, alpha=alpha, method="wilson")
def clopper_pearson(s, n, alpha=0.05): return proportion_confint(s, n, alpha=alpha, method="beta")
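
To make the contrast between Wald and Wilson concrete, here is a minimal standard-library sketch of both formulas. The function names `wald_interval` and `wilson_interval` are illustrative, not part of statsmodels; the Wald version is included only to show the failure modes described above.

```python
import math
from statistics import NormalDist


def wald_interval(successes, n, alpha=0.05):
    """Normal-approximation interval, shown for comparison only."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson_interval(successes, n, alpha=0.05):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


if __name__ == "__main__":
    for s, n in [(7, 10), (9, 10), (10, 10)]:
        print(s, n, "wald", wald_interval(s, n), "wilson", wilson_interval(s, n))
```

On 10/10 the Wald interval has zero width and on 9/10 its upper bound exceeds one, while the Wilson interval stays inside [0, 1] in both cases.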

N	Wilson 95% half-width on 70%	Smallest detectable delta
10	~25 points	any large effect
50	~13 points	20 points
100	~9 points	13 points
200	~6 points	8 points
500	~4 points	5 points

N=200 is the rough threshold where Wilson half-widths drop below seven points on a 70% baseline. Most prompt or model changes deliver four to seven points, so N=200 is the standard release-gate size ³. The confidence intervals chapter walks the derivations.

Bootstrap for noisy metrics

When the metric is not a clean binomial (Likert, token-overlap, embedding cosine, Elo), Wilson does not apply. Bootstrap is the general tool: draw N items with replacement from the N-item eval set, compute the metric on that resample, repeat B times, and take the 2.5th and 97.5th percentiles of the B values. Chatbot Arena uses bootstrap on Bradley-Terry fits for leaderboard intervals ⁴.

import numpy as np

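def bootstrap_ci(metric_fn, data, n_boot=1000, alpha=0.05, seed=None):
    # Percentile bootstrap: resample items with replacement, recompute the metric each time.
    rng = np.random.default_rng(seed)  # pass a seed for reproducible intervals
    n = len(data)
    boot = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        boot[b] = metric_fn([data[i] for i in idx])
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    # Point estimate is the metric on the original data, not the mean of the replicates.
    return float(metric_fn(data)), float(lo), float(hi)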

1000 replicates for headline numbers, 200 for fast triage. The dominant cost is the metric function.

Paired tests

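Two systems evaluated on the same set are paired data, not independent samples. Per-item scores are usually positively correlated (hard items are hard for both systems), so treating the samples as independent overstates the variance of the difference and throws away power.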

For binary: McNemar's test on the discordant pairs. For continuous: paired bootstrap (resample pairs, not individual scores; percentile interval on the differences). For two-proportion intervals: Newcombe-Wilson paired form or paired bootstrap.

def paired_bootstrap_delta(a, b, n_boot=1000, alpha=0.05):
    deltas = np.array(a) - np.array(b)
    boot = np.empty(n_boot)
    for k in range(n_boot):
        idx = np.random.randint(0, len(deltas), size=len(deltas))
        boot[k] = deltas[idx].mean()
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(deltas.mean()), float(lo), float(hi)
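
For the binary case, McNemar's test looks only at the discordant pairs: items where exactly one system passed. A minimal sketch of the exact (binomial) version follows, using only the standard library. The function name `mcnemar_exact` and its return shape are illustrative choices, not a library API.

```python
from math import comb


def mcnemar_exact(a_pass, b_pass):
    """Two-sided exact McNemar test on paired binary outcomes.

    a_pass, b_pass: equal-length sequences of 0/1 (or bool), one entry per item.
    Returns (a_only, b_only, p_value).
    """
    if len(a_pass) != len(b_pass):
        raise ValueError("paired inputs must have equal length")
    a_only = sum(1 for x, y in zip(a_pass, b_pass) if x and not y)
    b_only = sum(1 for x, y in zip(a_pass, b_pass) if y and not x)
    n = a_only + b_only
    if n == 0:
        # No discordant pairs: the data carry no evidence of a difference.
        return a_only, b_only, 1.0
    k = min(a_only, b_only)
    # Under the null, each discordant pair favours A or B with probability 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return a_only, b_only, min(1.0, 2 * tail)


if __name__ == "__main__":
    # 20 items: A wins 8, B wins 2, both pass on the remaining 10.
    a = [1] * 8 + [0] * 2 + [1] * 10
    b = [0] * 8 + [1] * 2 + [1] * 10
    print(mcnemar_exact(a, b))
```

The concordant pairs do not enter the p-value at all, which is why the discordant-pair rate, not the raw N, drives how much a paired eval can detect.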

Clustered standard errors

Wilson and the bootstrap assume independent observations. Three eval-program patterns violate independence: the same user contributes multiple traces, multiple questions share a source document, multi-turn conversations are scored per turn.

Miller reports that clustered standard errors can be more than three times the naive ones in these settings ³. With the standard error understated by that factor, a nominal 95% interval covers the truth only about half the time under a normal approximation. The fix is a cluster bootstrap: resample clusters with replacement, not individual observations. The clustered standard errors chapter walks the math.

def cluster_bootstrap_ci(scores, cluster_ids, n_boot=1000, alpha=0.05):
    by_cluster = {}
    for s, c in zip(scores, cluster_ids):
        by_cluster.setdefault(c, []).append(s)
    keys = list(by_cluster.keys())
    boot = np.empty(n_boot)
    for b in range(n_boot):
        idx = np.random.randint(0, len(keys), size=len(keys))
        sample = []
        for i in idx: sample.extend(by_cluster[keys[i]])
        boot[b] = np.mean(sample)
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(np.mean(scores)), float(lo), float(hi)

If any observable groups observations (user, source doc, conversation, topic), cluster on that group. Skipping this step is the single most common reason eval results overstate precision.

Power and sample-size planning

Statistical power is the probability of detecting a real effect of a given size. Shipping at 30% power on the deltas the team cares about means the eval misses a real improvement of that size about 70% of the time.

Rule of thumb for binary metrics at 80% power and 5% significance, on a paired design with a conservative 40% discordant-pair rate. To detect a 5-point delta on a 70% baseline, N is roughly 1,300 paired inputs (roughly 630 if a pilot shows the more typical 20% discordance; about 2,750 total for an unpaired design). A 10-point delta needs roughly 320 pairs; a 20-point delta, roughly 80 ⁵. Halving the delta of interest roughly quadruples the required N; the full derivation and the code live in statistical power.
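
As a rough check on numbers like these, the sketch below uses one common normal-approximation sample-size formula for McNemar's test, n ≈ (z₁₋α/₂·√ψ + z_power·√(ψ − δ²))² / δ², where ψ is the discordant-pair rate and δ the delta in pass rate. This is an assumption about the method, not necessarily the derivation the statistical power chapter uses, and the function name `paired_n` is illustrative.

```python
import math
from statistics import NormalDist


def paired_n(delta, discordant_rate, alpha=0.05, power=0.80):
    """Approximate number of pairs needed for McNemar's test.

    delta: difference in pass rate to detect (0.05 for five points).
    discordant_rate: expected fraction of items where the two systems disagree.
    """
    if not 0 < abs(delta) <= discordant_rate <= 1:
        raise ValueError("need 0 < |delta| <= discordant_rate <= 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    root = (z_alpha * math.sqrt(discordant_rate)
            + z_power * math.sqrt(discordant_rate - delta**2))
    return math.ceil(root**2 / delta**2)


if __name__ == "__main__":
    for delta, psi in [(0.05, 0.40), (0.05, 0.20), (0.10, 0.40), (0.20, 0.40)]:
        print(f"delta={delta:.2f} discordance={psi:.2f} pairs={paired_n(delta, psi)}")
```

Under this approximation the results land in the same range as the figures above, and the 1/δ² term is what makes halving the delta roughly quadruple N.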

Multiple comparisons

Reporting a release-gate decision on whichever sub-metric "scored best" out of fifteen is the classic trap. With fifteen independent tests at p < 0.05 and no real effect anywhere, the chance of at least one false positive (the family-wise error rate) is 1 − 0.95^15, roughly 54%. Bonferroni (conservative) or Benjamini-Hochberg FDR (less conservative) are the standard corrections. The practical rule: pick a headline metric before looking at the data; report others with FDR as secondary.
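
The family-wise arithmetic and both corrections fit in a few lines. The sketch below is a plain standard-library illustration; the function names and the example p-values are made up for demonstration.

```python
def familywise_error_rate(m, alpha=0.05):
    """Chance of at least one false positive across m independent null tests."""
    return 1 - (1 - alpha) ** m


def bonferroni(pvalues, alpha=0.05):
    """Reject only p-values at or below alpha / m."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, q=0.05):
    """Benjamini-Hochberg step-up procedure; returns one reject flag per input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * q.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * q:
            cutoff = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            reject[i] = True
    return reject


if __name__ == "__main__":
    pvals = [0.212, 0.001, 0.060, 0.008, 0.039, 0.041, 0.042, 0.074, 0.205, 0.216]
    print(round(familywise_error_rate(15), 3))
    print(sum(bonferroni(pvals)), sum(benjamini_hochberg(pvals)))
```

On these illustrative p-values Bonferroni keeps one sub-metric and Benjamini-Hochberg keeps two, which is the "less conservative" behaviour in practice.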

Inter-rater agreement

Cohen's kappa for two raters on the same items. Krippendorff's alpha for more than two raters, missing data, or ordinal scales. Both correct for chance agreement. The standard bar for an eval-program calibration set is alpha or kappa above 0.80 ⁶ ⁷. Below 0.60 the rubric is ambiguous and needs rewriting; between 0.60 and 0.80 the rubric needs another iteration; above 0.80 it is shippable.

Statistic	When	Tooling
Percent agreement	never (chance-inflated)	none
Cohen's kappa	2 raters, categorical	sklearn.metrics.cohen_kappa_score
Krippendorff's alpha	2+ raters, any scale	krippendorff (PyPI)
Fleiss' kappa	3+ raters, categorical, no missing	statsmodels
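
To see why percent agreement is chance-inflated, it helps to compute Cohen's kappa by hand on a skewed label distribution. The sketch below is a standard-library illustration (in practice `sklearn.metrics.cohen_kappa_score` from the table above does this); the function name `cohen_kappa` and the example labels are invented.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance if each rater labelled at their own base rates.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n**2
    if expected == 1.0:
        # Both raters used a single identical label: kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # 50 items: both say pass on 40, and they split on the other 10.
    a = ["pass"] * 40 + ["pass"] * 5 + ["fail"] * 5
    b = ["pass"] * 40 + ["fail"] * 5 + ["pass"] * 5
    agreement = sum(x == y for x, y in zip(a, b)) / len(a)
    print(agreement, cohen_kappa(a, b))
```

Here the raters agree on 80% of items, yet kappa is slightly negative: with both raters saying "pass" 90% of the time, 82% agreement is expected by chance alone.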

What to put on the dashboard

Point estimate to one decimal place. Wilson 95% interval (or bootstrap percentile for non-binary). The N. The judge and dataset versions. For paired comparisons, the paired bootstrap delta and its interval. For clustered data, the cluster-aware interval. A dashboard that shows 72.5% and nothing else is hiding the interval, not avoiding it.

1. Wikipedia, "Binomial proportion confidence interval."
2. Hamel Husain and Shreya Shankar, "LLM Evals FAQ."
3. Miller, "Adding Error Bars to Evals" (Anthropic), arXiv:2411.00640.
4. Chiang et al., Chatbot Arena, ICML 2024.
5. Anthropic, "A Statistical Approach to Model Evaluations."
6. Wikipedia, "Cohen's kappa."
7. Wikipedia, "Krippendorff's alpha."