Bootstrap and paired tests for noisy eval metrics

When the metric is not a clean Bernoulli, bootstrap. When the systems share inputs, pair. Both close the gap between point estimates and actionable decisions.

Wilson handles binary metrics on a single sample. Production eval comparisons rarely fit that frame cleanly. The metric is often a Likert mean, a percentile, or a ratio. The two systems usually share inputs, so the samples are paired. The judge is sometimes noisy. None of these defeat statistical comparison; they just defeat the formulas that assume independence and Bernoulli-clean data. Bootstrap and paired tests are the two tools that handle the cases Wilson does not.

The opinion: bootstrap percentile intervals as the default for any non-binary metric; paired bootstrap or McNemar's test for the standard offline A-vs-B comparison. Chatbot Arena uses bootstrap for Elo intervals at scale; the Anthropic Error Bars paper recommends paired comparisons whenever the inputs match across systems ¹ ² ³. The two patterns cover most production eval reporting that the Wilson page does not.

Why pairing matters

Two independent samples of two hundred yield a comparison with a wide interval because the input-level variance (some prompts are simply harder than others) dominates. Two paired samples of two hundred (the same inputs through both systems) cancel the input-level variance and tighten the interval substantially. For the same delta and the same n, the paired comparison's standard error is often a third to a half of the independent one, depending on how strongly the two systems' scores correlate across inputs.

The pairing is the realistic offline regime: the eval set is fixed, both systems see the same inputs, and the per-input differences are the unit of analysis. Treating the samples as independent in this setting throws away the most useful structural feature of the data and reports an overly wide interval that hides genuine improvements.
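A minimal sketch of the effect, assuming synthetic scores in which every input carries a shared difficulty term plus small system-specific noise. The variances and the helper name standard_errors are illustrative assumptions, not measurements from a real eval set.

```python
import numpy as np

def standard_errors(scores_a: np.ndarray, scores_b: np.ndarray) -> dict:
    """Standard error of mean(a) - mean(b), computed two ways."""
    n = len(scores_a)
    # Independent analysis: the two variances add and the shared input effect is ignored.
    se_independent = np.sqrt(scores_a.var(ddof=1) / n + scores_b.var(ddof=1) / n)
    # Paired analysis: work on per-input differences, so shared difficulty cancels.
    diffs = scores_a - scores_b
    se_paired = diffs.std(ddof=1) / np.sqrt(n)
    return {"independent": float(se_independent), "paired": float(se_paired)}

# Synthetic data (assumed for illustration): shared difficulty plus system noise.
rng = np.random.default_rng(0)
n = 200
difficulty = rng.normal(0.0, 0.15, size=n)  # input-level variance
scores_a = 0.80 + difficulty + rng.normal(0.0, 0.05, size=n)
scores_b = 0.77 + difficulty + rng.normal(0.0, 0.05, size=n)

se = standard_errors(scores_a, scores_b)
ratio = se["paired"] / se["independent"]
print(se, ratio)
```

With these assumed variances the ratio should land near a third; weaker correlation between the systems pushes it toward one.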

McNemar for paired binary

For a paired comparison on a binary metric (each system either passes or fails each input), McNemar's test is the textbook tool. The statistic uses only the cells where the systems disagreed: b is the count where A passed and B failed (a_only in the code), c is the count where A failed and B passed (b_only). The discordant pairs are the signal; the concordant pairs cancel out.
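For reference, the continuity-corrected form of the statistic in terms of b and c, which is the textbook formula and should correspond to the exact=False, correction=True setting:

```latex
\chi^2 = \frac{\left(\lvert b - c \rvert - 1\right)^2}{b + c}
```

Under the null of no difference this is approximately chi-squared with one degree of freedom. As a hypothetical example, b = 30 and c = 15 give (15 - 1)^2 / 45 = 196 / 45, roughly 4.36, which is just past the 3.84 cutoff for the 5% level.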

from statsmodels.stats.contingency_tables import mcnemar

def paired_binary_compare(both_pass: int, a_only: int, b_only: int, both_fail: int):
    """McNemar's test on a 2x2 paired contingency table."""
    # Rows: A pass / A fail. Columns: B pass / B fail.
    table = [[both_pass, a_only], [b_only, both_fail]]
    n = both_pass + a_only + b_only + both_fail
    # Chi-square approximation with continuity correction; prefer exact=True
    # when the discordant count (a_only + b_only) is small.
    result = mcnemar(table, exact=False, correction=True)
    return {"statistic": result.statistic, "p_value": result.pvalue,
            "delta_pp": 100 * (a_only - b_only) / n}

The delta is (a_only - b_only) / n as a fraction; multiply by 100 to express it in percentage points. McNemar's p-value answers "is the observed delta consistent with a true delta of zero?" For ship decisions, complement it with a bootstrap interval on the delta itself; p-values without intervals are half-answers.
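One way to get that interval directly from the four cell counts is sketched below. It rebuilds the per-input differences (+1, -1, or 0) and resamples inputs. The function name paired_binary_delta_ci and the example counts are hypothetical.

```python
import numpy as np

def paired_binary_delta_ci(both_pass, a_only, b_only, both_fail,
                           n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for pass_rate(A) - pass_rate(B)."""
    # Per-input difference: +1 if only A passed, -1 if only B passed, 0 if they agreed.
    diffs = np.concatenate([
        np.zeros(both_pass + both_fail),
        np.ones(a_only),
        -np.ones(b_only),
    ])
    n = len(diffs)
    rng = np.random.default_rng(seed)
    # Resample inputs with replacement; each row is one bootstrap replicate.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_deltas = diffs[idx].mean(axis=1)
    low, high = np.quantile(boot_deltas, [alpha / 2, 1 - alpha / 2])
    return {"delta": float(diffs.mean()), "ci": (float(low), float(high)), "n": n}

# Hypothetical counts, for illustration only.
result = paired_binary_delta_ci(both_pass=140, a_only=30, b_only=15, both_fail=15)
print(result)
```

The concordant inputs contribute zeros: they do not move the delta, but they do set n and therefore the width of the interval.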

Bootstrap percentile intervals

The bootstrap resamples the eval set with replacement and recomputes the statistic on each resample. For a 95% level, the percentile interval is the empirical 2.5th and 97.5th percentiles of the bootstrap distribution. The construction works for any statistic: a mean, a quantile, a difference of medians, an Elo rating, a Pareto frontier point.

"""
Paired bootstrap for two systems on the same inputs. The unit of resample
is the input, not the individual (system, input) score: each resampled input
brings both systems' scores with it, which preserves the pairing structure.
"""
import numpy as np

def paired_bootstrap(
    scores_a: np.ndarray,
    scores_b: np.ndarray,
    n_boot: int = 10000,
    alpha: float = 0.05,
    seed: int = 0,
) -> dict:
    assert len(scores_a) == len(scores_b), "paired comparison requires aligned inputs"
    rng = np.random.default_rng(seed)
    n = len(scores_a)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)
        diffs[i] = scores_a[idx].mean() - scores_b[idx].mean()
    delta = scores_a.mean() - scores_b.mean()
    low, high = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
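    # Two-sided bootstrap p-value; clamp at 1.0 in case many resampled
    # deltas are exactly zero and both tail fractions exceed one half.
    p_two_sided = min(1.0, 2 * min(
        float((diffs <= 0).mean()),
        float((diffs >= 0).mean()),
    ))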
    return {"delta": float(delta), "ci": (float(low), float(high)),
            "p_value": p_two_sided}

# Example: two systems on the same 200 prompts, judge score in [0, 1].
rng = np.random.default_rng(7)
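a = rng.beta(8, 2, size=200)        # mean ~0.80
b = rng.beta(7.5, 2.5, size=200)    # mean ~0.75 before mixing
b = 0.6 * b + 0.4 * a                # induce pairing; mean of b becomes ~0.77

result = paired_bootstrap(a, b)
# Expected delta ~ 0.03 (the mixing shrinks the 0.05 gap by a factor of 0.6),
# with a standard error near 0.008; exact numbers depend on the draw.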

The resample unit is the input. Resampling at the level of individual scores across systems would break the pairing and reproduce the wider independent-comparison interval.
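The same construction extends to statistics other than the mean. The sketch below is a generic single-sample version that accepts any callable. The helper name bootstrap_ci and both synthetic datasets (a 1-to-5 Likert score and a latency-like long-tailed metric) are assumptions for illustration.

```python
import numpy as np
from typing import Callable

def bootstrap_ci(values, statistic: Callable[[np.ndarray], float],
                 n_boot: int = 10000, alpha: float = 0.05, seed: int = 0) -> dict:
    """Percentile bootstrap interval for an arbitrary statistic of one sample."""
    values = np.asarray(values)
    rng = np.random.default_rng(seed)
    n = len(values)
    replicates = np.empty(n_boot)
    for i in range(n_boot):
        # Resample inputs with replacement and recompute the statistic.
        replicates[i] = statistic(values[rng.integers(0, n, size=n)])
    low, high = np.quantile(replicates, [alpha / 2, 1 - alpha / 2])
    return {"estimate": float(statistic(values)), "ci": (float(low), float(high))}

# Hypothetical data: Likert judge scores and a long-tailed metric.
rng = np.random.default_rng(1)
likert = rng.integers(1, 6, size=300)                        # integers 1..5
latency_ms = rng.lognormal(mean=6.0, sigma=0.5, size=300)

likert_mean = bootstrap_ci(likert, np.mean)
latency_p90 = bootstrap_ci(latency_ms, lambda x: np.quantile(x, 0.9))
print(likert_mean, latency_p90)
```

Only the statistic changes between the two calls; the resampling loop is identical.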

Bootstrap is not magic

Three caveats worth committing to.

First, the bootstrap inherits the sample's biases. A skewed eval set produces an interval that is correct given the skew; it does not correct for it. If the eval set under-represents a slice, the global bootstrap interval reports a precise estimate of a biased number. The fix is stratified sampling on the front end, not heroic resampling on the back end.

Second, ten thousand bootstrap samples is enough for most production use. Fewer than two thousand starts to introduce Monte Carlo noise in the interval endpoints; more than fifty thousand is overkill. Seed the RNG so the report is reproducible.
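A rough way to see the Monte Carlo noise is to rerun the bootstrap on the same data under different seeds and watch how much the lower endpoint moves. The per-input differences here are synthetic, and the helper names lower_endpoint and endpoint_spread are hypothetical.

```python
import numpy as np

def lower_endpoint(diffs: np.ndarray, n_boot: int, seed: int, alpha: float = 0.05) -> float:
    """Lower percentile-interval endpoint for the mean of diffs."""
    rng = np.random.default_rng(seed)
    n = len(diffs)
    idx = rng.integers(0, n, size=(n_boot, n))
    return float(np.quantile(diffs[idx].mean(axis=1), alpha / 2))

# Hypothetical per-input paired differences (fixed once; only the bootstrap seed varies).
data_rng = np.random.default_rng(42)
diffs = data_rng.normal(0.03, 0.10, size=200)

def endpoint_spread(n_boot: int, n_repeats: int = 20) -> float:
    """Standard deviation of the lower endpoint across bootstrap seeds."""
    endpoints = [lower_endpoint(diffs, n_boot, seed=s) for s in range(n_repeats)]
    return float(np.std(endpoints))

spread_small = endpoint_spread(500)
spread_large = endpoint_spread(10000)
print(spread_small, spread_large)
```

The spread at 500 resamples should be several times the spread at 10,000, since the noise shrinks roughly with the square root of n_boot.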

Third, the percentile interval can underperform near boundaries (proportions close to 0 or 1, or long-tailed distributions). The bias-corrected and accelerated (BCa) variant is the standard upgrade and is one line in scipy.stats.bootstrap. For most paired eval comparisons in production, the plain percentile interval is good enough; reach for BCa when the boundary is biting ².
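A minimal BCa sketch with scipy.stats.bootstrap, applied to the per-input differences so the pairing is preserved. The scores are synthetic and sit close to the upper boundary on purpose. The random_state argument is the seeding keyword assumed here; check it against your installed SciPy version.

```python
import numpy as np
from scipy import stats

# Hypothetical paired scores close to 1.0, where the plain percentile interval can struggle.
rng = np.random.default_rng(3)
scores_a = rng.beta(18, 2, size=200)
scores_b = 0.7 * rng.beta(16, 3, size=200) + 0.3 * scores_a
diffs = scores_a - scores_b  # one difference per input keeps the pairing

res = stats.bootstrap(
    (diffs,),                 # data is passed as a sequence of samples
    np.mean,                  # statistic; called with axis=-1 when vectorized
    n_resamples=9999,
    confidence_level=0.95,
    method="BCa",
    vectorized=True,
    random_state=np.random.default_rng(0),
)
low, high = res.confidence_interval.low, res.confidence_interval.high
print(float(diffs.mean()), float(low), float(high))
```

Switching method to "percentile" on the same call gives the plain interval, which makes it easy to check whether the correction changes the decision.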

Wilcoxon signed-rank as a paired alternative

For paired continuous metrics where you want a distribution-free test of "is the median difference zero," the Wilcoxon signed-rank test is the standard nonparametric alternative. It does not require the differences to be normal, only symmetric around zero under the null. It is more robust than a paired t-test on small samples or skewed metrics, and less informative than the bootstrap when the question is the magnitude of the delta rather than its sign.

from scipy import stats

# scores_a, scores_b: aligned per-input scores for the two systems.
res = stats.wilcoxon(scores_a, scores_b)
# res.statistic, res.pvalue

Use it as a second opinion when the paired bootstrap is showing a borderline result.

Reporting the comparison

A minimum acceptable presentation of an A-vs-B comparison:

Per-system point estimates with their own intervals.
The paired delta with its bootstrap interval (or McNemar for binary).
The p-value, with the explicit framing "consistent with zero" rather than "significant" or "not significant."
The N of paired inputs (not the count of judgments, which doubles N misleadingly).

Husain and Shankar's FAQ argues this is the only honest way to present the comparison in a model card or release note; numbers without intervals look more confident than they are ⁴.
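The checklist above can be assembled from a single set of bootstrap resamples. The sketch below is one possible layout, not a fixed format: the function names comparison_report and format_report, the field names, and the synthetic scores are all assumptions.

```python
import numpy as np

def comparison_report(scores_a, scores_b, n_boot=10000, alpha=0.05, seed=0) -> dict:
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("paired comparison requires aligned inputs")
    n = len(a)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, n, size=(n_boot, n))  # resample inputs, shared by both systems
    boot_a, boot_b = a[idx].mean(axis=1), b[idx].mean(axis=1)
    boot_delta = boot_a - boot_b

    def interval(samples):
        low, high = np.quantile(samples, [alpha / 2, 1 - alpha / 2])
        return float(low), float(high)

    p_value = min(1.0, 2 * min(float((boot_delta <= 0).mean()),
                               float((boot_delta >= 0).mean())))
    return {
        "system_a": (float(a.mean()), interval(boot_a)),
        "system_b": (float(b.mean()), interval(boot_b)),
        "delta": (float(a.mean() - b.mean()), interval(boot_delta)),
        "p_value": p_value,
        "n_paired_inputs": n,  # inputs, not judgments
    }

def format_report(r: dict) -> str:
    def line(label, key):
        est, (low, high) = r[key]
        return f"{label}: {est:.3f} [{low:.3f}, {high:.3f}]"
    low, high = r["delta"][1]
    verdict = "consistent with zero" if low <= 0.0 <= high else "not consistent with zero"
    return "\n".join([
        line("System A", "system_a"),
        line("System B", "system_b"),
        line("Paired delta (A - B)", "delta"),
        f"p = {r['p_value']:.4f}; delta is {verdict}",
        f"N paired inputs: {r['n_paired_inputs']}",
    ])

# Hypothetical scores with a shared per-input difficulty term.
rng = np.random.default_rng(11)
difficulty = rng.normal(0.0, 0.15, size=200)
scores_a = 0.80 + difficulty + rng.normal(0.0, 0.05, size=200)
scores_b = 0.77 + difficulty + rng.normal(0.0, 0.05, size=200)
report = comparison_report(scores_a, scores_b)
text = format_report(report)
print(text)
```

The per-system intervals will typically overlap even when the paired delta interval excludes zero, which is why both belong in the report.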

Where to use it

Three settings where paired comparisons and bootstrap are the right defaults.

The first is release gates. The candidate and baseline see the same eval set, so the paired bootstrap on the delta is the natural unit of decision. The release-gates page wires this in.

The second is judge calibration drift. The current judge and a candidate judge score the same labelled set; the paired delta tells you whether the candidate has drifted in a measurable direction.

The third is online A/B style reporting where the same user receives both treatments at different times. The pairing is at the user level; the bootstrap resamples users, not requests, to handle the within-user correlation. The clustered standard errors page handles the variance side of this case ⁵.
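For the user-level case, one simple sketch is to collapse each user's requests to a single per-user difference and then resample users. It assumes every user has at least one request under each treatment. The function name user_paired_bootstrap, the "A"/"B" labels, and the synthetic log are hypothetical.

```python
import numpy as np

def user_paired_bootstrap(user_ids, treatments, scores,
                          n_boot=10000, alpha=0.05, seed=0) -> dict:
    user_ids = np.asarray(user_ids)
    treatments = np.asarray(treatments)
    scores = np.asarray(scores, dtype=float)
    users = np.unique(user_ids)
    # One number per user: mean score under A minus mean score under B.
    per_user = np.empty(len(users))
    for k, u in enumerate(users):
        mask = user_ids == u
        per_user[k] = (scores[mask & (treatments == "A")].mean()
                       - scores[mask & (treatments == "B")].mean())
    rng = np.random.default_rng(seed)
    n = len(per_user)
    # Resample users, not requests, so within-user correlation stays intact.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot = per_user[idx].mean(axis=1)
    low, high = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return {"delta": float(per_user.mean()), "ci": (float(low), float(high)), "n_users": n}

# Hypothetical request log: 60 users, each with 2 to 6 requests per treatment.
rng = np.random.default_rng(5)
rows = []
for u in range(60):
    user_effect = rng.normal(0.0, 0.15)  # shared across that user's requests
    for t, base in (("A", 0.78), ("B", 0.75)):
        for _ in range(rng.integers(2, 7)):
            rows.append((u, t, base + user_effect + rng.normal(0.0, 0.05)))
user_ids, treatments, scores = zip(*rows)

result = user_paired_bootstrap(user_ids, treatments, scores)
print(result)
```

This weights every user equally regardless of request count; weighting by traffic instead is a separate design choice that changes the estimand.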

CITATION

Wikipedia, "Binomial proportion confidence interval." The canonical write-up for the binary single-sample case the bootstrap and McNemar pages extend to comparisons. en.wikipedia.org/wiki/Binomial_proportion_confidence_interval

What to do this week

Switch your A-vs-B reporting from independent two-proportion tests to the paired bootstrap or McNemar (depending on whether the metric is continuous or binary).
Set n_boot = 10000 and seed it. Reproducible intervals; predictable runtime.
Add the per-comparison N (paired inputs) to your report. The number is smaller than the count of judgments and is the honest unit of evidence.

¹ Chiang et al., "Chatbot Arena," arXiv:2403.04132.
² Miller, "Adding Error Bars to Evals" (Anthropic), arXiv:2411.00640.
³ Anthropic, "A Statistical Approach to Model Evaluations."
⁴ Hamel Husain and Shreya Shankar, "LLM Evals FAQ."
⁵ Shreya Shankar, "The AI Engineering Flywheel."