Adding error bars to your evals

Wilson intervals on pass rates, cluster-aware standard errors, and a paired McNemar test for two-system comparisons, with runnable statsmodels code.

A typical eval report says "Model A scored 61.5%, Model B scored 71.0%". The implied claim is that B is better. Whether that claim survives contact with statistics depends on three things: the sample size, whether the questions cluster (multiple questions per topic or document), and whether the comparison was paired across the same inputs. Evan Miller's "Adding Error Bars to Evals" lays out five recommendations for eval reports; this recipe implements four of them with runnable code: confidence intervals, clustered standard errors, paired analysis, and power analysis ¹ ². The fifth (reading answer distributions from next-token probabilities) only applies when you grade closed-form answers with logprob access, so it is out of scope here.

Budget about 45 minutes: five to shape your results file, ten to run the two scripts, and thirty to read the four numbers against the decisions they gate.

What you build

Two small Python modules:

make_data.py seeds a 200-example paired result set with topic clustering, so your output matches this page to the digit. Swap it for your own CSV when you are done reading.
error_bars.py reads the CSV and prints the report: Wilson 95% interval per system, naive vs clustered standard error, the paired McNemar test, and the sample size an unpaired design would have needed.

Step 1: shape the results as paired rows

One row per example, one column per system, plus the cluster id. If you ran both systems on the same inputs (you should have), the file is already paired:

example_id,topic_id,system_a,system_b
0,topic_00,1,1
1,topic_00,0,1
2,topic_01,1,0
...

The topic_id column is the one teams forget to keep. Any grouping that makes two rows more alike than two random rows counts as a cluster: questions generated from the same document, turns from the same session, problems from the same template. If you cannot name a cluster column, your standard errors are probably fine; if you can, they are probably too small ¹.
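If your harness writes one row per (example, system) instead of one row per example, the file needs a pivot before it is paired. The sketch below shows one way to do that with the standard library. The long-format column names (`system`, `passed`) and the helper `to_paired_rows` are assumptions made for illustration, not part of this recipe's scripts.

```python
import csv
import io
from collections import defaultdict

# Hypothetical long-format harness output: one row per (example, system).
LONG_CSV = """example_id,topic_id,system,passed
0,topic_00,system_a,1
0,topic_00,system_b,1
1,topic_00,system_a,0
1,topic_00,system_b,1
2,topic_01,system_a,1
"""


def to_paired_rows(long_file, systems=("system_a", "system_b")):
    """Pivot long rows into one row per example, keeping the cluster id.

    Examples that were not scored by every system are returned separately,
    because an unpaired row cannot enter a paired test.
    """
    scores_by_example = defaultdict(dict)
    topic_by_example = {}
    for row in csv.DictReader(long_file):
        ex = int(row["example_id"])
        topic_by_example[ex] = row["topic_id"]
        scores_by_example[ex][row["system"]] = int(row["passed"])

    paired, dropped = [], []
    for ex in sorted(scores_by_example):
        scores = scores_by_example[ex]
        if all(s in scores for s in systems):
            paired.append(
                {"example_id": ex, "topic_id": topic_by_example[ex],
                 **{s: scores[s] for s in systems}}
            )
        else:
            dropped.append(ex)
    return paired, dropped


paired, dropped = to_paired_rows(io.StringIO(LONG_CSV))
for row in paired:
    print(row)
print("dropped (not scored by both systems):", dropped)
```

Listing the dropped examples rather than silently discarding them makes it obvious when the two systems were not run on the same inputs.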

To follow along with reproducible numbers, generate the seeded set:

"""Generate a synthetic paired eval result set with topic clustering.

Run:
    pip install scipy statsmodels pandas numpy
    python make_data.py
"""
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

N_TOPICS = 20
PER_TOPIC = 10

rows = []
for topic in range(N_TOPICS):
    # Each topic has its own difficulty. This is what makes clustering matter.
    topic_skill = rng.normal(0, 1.2)
    for q in range(PER_TOPIC):
        ease = topic_skill + rng.normal(0, 0.8)
        p_a = 1 / (1 + np.exp(-(ease + 0.9)))   # system A
        p_b = 1 / (1 + np.exp(-(ease + 1.25)))  # system B, slightly better
        rows.append({
            "example_id": topic * PER_TOPIC + q,
            "topic_id": f"topic_{topic:02d}",
            "system_a": int(rng.random() < p_a),
            "system_b": int(rng.random() < p_b),
        })

df = pd.DataFrame(rows)
df.to_csv("results.csv", index=False)
print(f"wrote results.csv: {len(df)} rows, {N_TOPICS} topics")
print(f"system_a mean: {df.system_a.mean():.3f}")
print(f"system_b mean: {df.system_b.mean():.3f}")

Expected output:

wrote results.csv: 200 rows, 20 topics
system_a mean: 0.615
system_b mean: 0.710

Step 2: the analysis script

"""Error bars for an eval report: Wilson CI, clustered SE, paired McNemar, power.

Run:
    python error_bars.py results.csv
"""
from __future__ import annotations

import sys

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_confint, proportion_effectsize


def wilson_ci(successes: int, n: int) -> tuple[float, float]:
    lo, hi = proportion_confint(successes, n, alpha=0.05, method="wilson")
    return float(lo), float(hi)


def naive_se(scores: np.ndarray) -> float:
    return float(scores.std(ddof=1) / np.sqrt(len(scores)))


def clustered_se(scores: np.ndarray, clusters: pd.Series) -> float:
    """Cluster-robust SE of the mean: intercept-only OLS, clustered covariance."""
    X = np.ones((len(scores), 1))
    fit = sm.OLS(scores, X).fit(cov_type="cluster", cov_kwds={"groups": clusters})
    return float(fit.bse[0])


def paired_mcnemar(a: np.ndarray, b: np.ndarray) -> tuple[int, int, float]:
    a_only = int(((a == 1) & (b == 0)).sum())
    b_only = int(((a == 0) & (b == 1)).sum())
    both = int(((a == 1) & (b == 1)).sum())
    neither = int(((a == 0) & (b == 0)).sum())
    res = mcnemar([[both, a_only], [b_only, neither]], exact=True)
    return a_only, b_only, float(res.pvalue)


def required_n_unpaired(p_a: float, p_b: float) -> int:
    es = proportion_effectsize(p_a, p_b)
    n = NormalIndPower().solve_power(
        effect_size=es, alpha=0.05, power=0.8, alternative="two-sided"
    )
    return int(np.ceil(n))


def main() -> None:
    df = pd.read_csv(sys.argv[1] if len(sys.argv) > 1 else "results.csv")
    a = df["system_a"].to_numpy()
    b = df["system_b"].to_numpy()
    n = len(df)

    print(f"n = {n} examples, {df['topic_id'].nunique()} topics\n")

    for name, scores in (("A", a), ("B", b)):
        k = int(scores.sum())
        lo, hi = wilson_ci(k, n)
        se_naive = naive_se(scores)
        se_clust = clustered_se(scores, df["topic_id"])
        print(
            f"system_{name}: {k}/{n} = {k / n:.3f}  "
            f"Wilson 95% [{lo:.3f}, {hi:.3f}]  "
            f"SE naive {se_naive:.4f} | clustered {se_clust:.4f} "
            f"(x{se_clust / se_naive:.2f})"
        )

    a_only, b_only, p = paired_mcnemar(a, b)
    print(f"\npaired: A-only wins {a_only}, B-only wins {b_only}, ties {n - a_only - b_only}")
    print(f"McNemar exact p = {p:.4f}")

    delta = b.mean() - a.mean()
    print(f"\ndelta = {delta:+.3f}")
    print(
        f"unpaired N per arm to detect this delta at power 0.8: "
        f"{required_n_unpaired(a.mean(), b.mean())}"
    )


if __name__ == "__main__":
    main()

Expected output, exactly:

n = 200 examples, 20 topics

system_A: 123/200 = 0.615  Wilson 95% [0.546, 0.680]  SE naive 0.0345 | clustered 0.0499 (x1.45)
system_B: 142/200 = 0.710  Wilson 95% [0.644, 0.768]  SE naive 0.0322 | clustered 0.0464 (x1.44)

paired: A-only wins 26, B-only wins 45, ties 129
McNemar exact p = 0.0319

delta = +0.095
unpaired N per arm to detect this delta at power 0.8: 388

Step 3: read the four numbers

The Wilson intervals overlap. A [0.546, 0.680] against B [0.644, 0.768]. If you stopped at interval overlap you would call the comparison inconclusive. That read is wrong, and the rest of the report says why. Use Wilson rather than the Wald interval your spreadsheet computes by default: Wald is the textbook p ± 1.96·sqrt(p(1-p)/n) formula, and its coverage collapses at small n and extreme rates ³. On a 7-out-of-10 run, Wald reports [0.42, 0.98] while Wilson reports [0.40, 0.89]; Wald flirts with certainty that the data cannot support. The confidence intervals chapter has the full derivation.
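To see where the two intervals diverge, the formulas can be written out with the standard library alone. This is a minimal sketch of the textbook Wald and Wilson score intervals, not the statsmodels implementation; the function names are chosen here for illustration.

```python
from math import sqrt
from statistics import NormalDist


def wald_ci(k, n, alpha=0.05):
    """Textbook Wald interval: p +/- z * sqrt(p(1-p)/n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = k / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson_ci(k, n, alpha=0.05):
    """Wilson score interval: recentred toward 0.5, never leaves [0, 1]."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


for k, n in ((7, 10), (123, 200), (142, 200)):
    w_lo, w_hi = wald_ci(k, n)
    s_lo, s_hi = wilson_ci(k, n)
    print(f"{k}/{n}: Wald [{w_lo:.3f}, {w_hi:.3f}]  Wilson [{s_lo:.3f}, {s_hi:.3f}]")
```

At n = 200 the two intervals are nearly the same width; the gap that matters shows up on the 7-out-of-10 run, where the Wilson interval shifts toward 0.5 and stops well short of 1.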

Clustered SEs are 1.45 times the naive SEs. The 20 topics each contribute 10 correlated questions, so the effective sample size is much smaller than 200: a 1.45x inflation in the SE is a variance inflation of about 1.45² ≈ 2.1, which puts the effective sample size near 200 / 2.1 ≈ 95. Naive standard errors pretend every row is independent and understate uncertainty accordingly; Miller's recommendation is to cluster on the sampling unit, which here is the topic ¹. If you report the naive SE on clustered data, every downstream consumer of your eval inherits an overconfident number. The inflation factor grows with cluster size and within-cluster correlation; the clustered standard errors chapter shows cases where it passes 3x.
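For an intercept-only model the cluster-robust SE reduces to a short sum, which makes the mechanism easy to inspect. The sketch below is a hand-rolled version on a hypothetical 12-row toy set. The G/(G-1) small-sample factor is an assumption about the correction in use; libraries differ on this detail, so expect results close to, but not necessarily identical with, a given package.

```python
from collections import defaultdict
from math import sqrt


def naive_se(scores):
    """SE of the mean assuming independent rows."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return sqrt(var / n)


def clustered_se(scores, clusters):
    """Cluster-robust SE of the mean.

    Residuals are summed within each cluster before squaring, so rows that
    err in the same direction inside a cluster reinforce each other.
    Assumed small-sample factor: G / (G - 1).
    """
    n = len(scores)
    mean = sum(scores) / n
    resid_sum = defaultdict(float)
    for s, c in zip(scores, clusters):
        resid_sum[c] += s - mean
    g = len(resid_sum)
    meat = sum(r * r for r in resid_sum.values())
    return sqrt(g / (g - 1) * meat) / n


# Hypothetical toy data: 4 topics x 3 questions, topics differ in difficulty.
scores = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
clusters = ["t0"] * 3 + ["t1"] * 3 + ["t2"] * 3 + ["t3"] * 3

se_n = naive_se(scores)
se_c = clustered_se(scores, clusters)
design_effect = (se_c / se_n) ** 2
print(f"naive {se_n:.4f} | clustered {se_c:.4f} (x{se_c / se_n:.2f})")
print(f"design effect {design_effect:.2f}, effective n {len(scores) / design_effect:.1f}")
```

Squaring the SE ratio gives the design effect, and dividing n by it gives the number of independent rows the data is actually worth.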

The paired test is significant anyway. McNemar's test ignores the 129 ties and asks one question: of the 71 examples where the systems disagree, is the 45-26 split in B's favor explainable by a fair coin? At p = 0.032, below the conventional 0.05 threshold, it is not ⁴. This is the payoff of pairing: between-question difficulty (the thing that made the intervals wide) cancels out, because both systems faced the same questions. Overlapping CIs plus a significant paired test is not a contradiction; it is the expected shape when inputs are shared and per-question difficulty varies. Report the paired result.
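The "fair coin" framing can be made literal. The sketch below computes a two-sided exact binomial p-value on the discordant pairs using only the standard library. It is an illustration of the idea, with a function name chosen here, and is not a drop-in replacement for the statsmodels call.

```python
from math import comb


def mcnemar_exact(a_only, b_only):
    """Two-sided exact McNemar p-value from the discordant counts.

    Under the null, each discordant pair is a fair coin flip, so the smaller
    count follows Binomial(n, 0.5). The p-value doubles the lower tail and
    caps it at 1.
    """
    n = a_only + b_only
    k = min(a_only, b_only)
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * lower_tail)


p = mcnemar_exact(26, 45)
print(f"discordant pairs: 71, split 45-26, exact p = {p:.4f}")
```

The 129 ties never enter the calculation, which is why the test stays sharp even when both systems pass or fail most questions together.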

An unpaired design would have needed 388 examples per arm. The power calculation says that detecting a 9.5-point delta around these rates at 80% power requires 388 independent samples per system if the comparisons are not paired. The paired design detected it with 200 shared examples. When someone proposes scoring system A on one random traffic slice and system B on another, this number is the argument against.
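The per-arm figure follows from a closed-form normal approximation on Cohen's h, the arcsine-transformed difference between two proportions. The sketch below is a standard-library version of that calculation. It ignores the negligible far-tail term of the two-sided test, and the function names are chosen here for illustration.

```python
from math import asin, ceil, sqrt
from statistics import NormalDist


def cohens_h(p1, p2):
    """Effect size for two proportions on the arcsine scale."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))


def n_per_arm(p1, p2, alpha=0.05, power=0.8):
    """Samples per arm for a two-sided, two-sample z-test with equal arms.

    Solves |h| * sqrt(n / 2) = z_(1 - alpha/2) + z_(power) for n.
    """
    z = NormalDist()
    h = cohens_h(p1, p2)
    n = 2 * ((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / h) ** 2
    return ceil(n)


print(f"h = {cohens_h(0.615, 0.710):+.4f}")
print(f"n per arm = {n_per_arm(0.615, 0.710)}")
```

Because n scales with 1/h², halving the delta you want to detect roughly quadruples the required sample, which is worth checking before committing to an eval set size.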

Apply it to your own results

Replace results.csv with your data and rename columns. Three rules:

If your metric is a pass rate, this script is complete. If it is a continuous score (judge scores averaged per example), swap McNemar for a paired t-test or a paired bootstrap; the bootstrap chapter has both.
Choose the cluster column before looking at the results, and cluster at the coarsest level you sampled: document, session, or template. Clustering after peeking is p-hacking with extra steps.
Put the Wilson interval and the paired p-value in the report next to the headline rate, every time. Chatbot Arena does the equivalent at scale with bootstrapped Bradley-Terry intervals, and it is the reason their rankings move slowly and credibly ⁵.
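For the continuous-score case in the first rule, a paired percentile bootstrap can be sketched in a few lines. The judge scores below are hypothetical, and this version resamples individual examples, so it ignores clustering. With a cluster column you would resample whole clusters instead.

```python
import random


def paired_bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example differences b - a.

    Resampling the differences (not a and b separately) preserves the
    pairing, so shared per-question difficulty cancels out.
    """
    rng = random.Random(seed)
    diffs = [y - x for x, y in zip(a, b)]
    n = len(diffs)
    boot_means = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot)
    )
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, lo, hi


# Hypothetical per-example judge scores in [0, 1] for two systems.
scores_a = [0.62, 0.40, 0.75, 0.55, 0.30, 0.81, 0.47, 0.66]
scores_b = [0.70, 0.52, 0.74, 0.61, 0.45, 0.85, 0.58, 0.69]

mean_diff, lo, hi = paired_bootstrap_ci(scores_a, scores_b)
print(f"mean delta = {mean_diff:+.4f}, 95% bootstrap CI [{lo:+.4f}, {hi:+.4f}]")
```

An interval that excludes zero plays the same role here that the McNemar p-value plays for pass rates.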

What this does not do

It does not correct for running twenty comparisons and reporting the best one; see multiple comparisons. It does not handle judge noise as a variance component; calibrate the judge first with LLM-judge calibration. And it does not replace error analysis: a significant delta tells you B is better, not why. For that, go back to reading traces.

TIP

Wire error_bars.py into the end of your eval harness so every run prints the interval next to the mean. A score without an interval reads as more precise the less data it has, which is exactly backwards.

¹ Evan Miller, "Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations." arXiv 2411.00640. Source of the five recommendations, four of which this recipe implements.
² Anthropic, "A statistical approach to model evaluations." Companion post to the paper.
³ Wikipedia, "Binomial proportion confidence interval." Coverage comparison of Wald, Wilson, and Clopper-Pearson.
⁴ Wikipedia, "McNemar's test." The exact test on discordant pairs used here.
⁵ Chiang et al., "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference." arXiv 2403.04132. Bootstrapped Bradley-Terry confidence intervals in production.