A typical eval report says "Model A scored 78%, Model B scored 81%". The implied claim is that Model B is better. Whether the claim is real depends on the sample size, on whether the questions cluster (multiple questions from the same topic or document), and on whether the comparison was paired across the same input set. Anthropic's "Adding Error Bars to Evals" paper formalizes the five things every eval report should do; this recipe walks through each one with runnable code 1 2.
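To make the sample-size dependence concrete, the sketch below runs an unpaired two-proportion z-test on the 78% vs 81% figures at two hypothetical sample sizes. The function name `two_proportion_p_value` and the sample sizes of 500 and 5,000 are illustrative assumptions. The test treats questions as independent and the two runs as unpaired, which is the naive baseline that the corrections in this recipe improve on.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(p_a: float, p_b: float, n: int) -> float:
    """Two-sided p-value for an unpaired two-proportion z-test.

    Assumes each system was scored on n independent questions.
    """
    pooled = (p_a + p_b) / 2  # equal n per system, so the pooled rate is the average
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    z = abs(p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(z))


# Hypothetical sample sizes; the 78% and 81% figures come from the text above.
for n in (500, 5000):
    print(f"n={n}: p = {two_proportion_p_value(0.78, 0.81, n):.4f}")
```

Under these assumptions the same three-point gap does not reach the 0.05 threshold at 500 questions per system but does at 5,000.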
Planned outline
- Wilson interval on a single rate. Why the Wald interval breaks down at a rate like 0.90 (at small n its upper bound can exceed 1); why Clopper-Pearson is overly conservative; why Wilson is the default 3.
- Clustered standard errors. When questions group by topic or document, naive standard errors are too small, by up to roughly the square root of the cluster size when answers within a cluster are strongly correlated. Compute cluster-robust SE in statsmodels.
- Paired-difference test. When you score the same inputs with two systems, the comparison is paired. Use McNemar's test for binary outcomes and a paired bootstrap for continuous scores.
- Power analysis. Given your expected effect size, how many examples do you need to detect it at alpha 0.05 and a chosen power (0.80 is a common convention)? Implement the standard formula and apply it to a sample eval.
- Reporting template. A small Markdown table that shows mean, Wilson 95% CI, cluster-aware SE, and the paired-diff p-value.
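As a rough preview of three of the items above (the single-rate interval, the clustered standard error, and the power calculation), here is a minimal standard-library sketch. It is not the statsmodels implementation the outline plans to use. All function names and the toy inputs are assumptions made for illustration. The clustered SE applies no small-sample correction, and the sample-size formula is the usual unpaired two-proportion one.

```python
from collections import defaultdict
from math import ceil, sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96


def wald_interval(k: int, n: int, z: float = Z95) -> tuple[float, float]:
    """Normal-approximation interval; can spill outside [0, 1]."""
    p = k / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson_interval(k: int, n: int, z: float = Z95) -> tuple[float, float]:
    """Wilson score interval for k successes out of n."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def naive_se(scores: list[float]) -> float:
    """SE of the mean, treating every question as independent (variance divided by n)."""
    n = len(scores)
    mean = sum(scores) / n
    return sqrt(sum((s - mean) ** 2 for s in scores)) / n


def clustered_se(scores: list[float], clusters: list[str]) -> float:
    """Cluster-robust SE of the mean: sum residuals within each cluster, then square."""
    n = len(scores)
    mean = sum(scores) / n
    resid_by_cluster = defaultdict(float)
    for s, c in zip(scores, clusters):
        resid_by_cluster[c] += s - mean
    return sqrt(sum(r * r for r in resid_by_cluster.values())) / n


def required_n(p_a: float, p_b: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Examples per system for an unpaired, two-sided two-proportion test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p_a + p_b) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))) ** 2
    return ceil(num / (p_a - p_b) ** 2)


# Toy inputs, made up for illustration.
print("Wald   18/20:", wald_interval(18, 20))
print("Wilson 18/20:", wilson_interval(18, 20))

# Four topics of five questions each, answers perfectly correlated within a topic.
scores = [1.0] * 5 + [0.0] * 5 + [1.0] * 5 + [0.0] * 5
topics = ["t1"] * 5 + ["t2"] * 5 + ["t3"] * 5 + ["t4"] * 5
print("naive SE:", naive_se(scores), "clustered SE:", clustered_se(scores, topics))

print("n per system for 0.78 vs 0.81:", required_n(0.78, 0.81))
```

On the toy data the Wald upper bound exceeds 1 while the Wilson interval stays in range. The clustered SE exceeds the naive one by a factor equal to the square root of the cluster size, which is the worst case of perfect within-cluster correlation. Library implementations may apply finite-sample corrections, so their numbers can differ slightly from this sketch.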
What this will cover
A complete Python script that takes a CSV (example_id, topic_id, system_a_label, system_b_label, gold_label) and emits the report. Plus three failure cases (small N, high clustering, no pairing) that illustrate when each correction matters.
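The paired part of such a script might look like the sketch below. It is an assumption-laden preview, not the planned script. The rows in `TOY_CSV` are made up, the function names `mcnemar_exact` and `paired_report` are hypothetical, correctness is taken to mean an exact string match against `gold_label`, and the exact (binomial) form of McNemar's test is used because the toy sample is tiny.

```python
import csv
import io
from math import comb

# Made-up rows in the column layout described above.
TOY_CSV = """example_id,topic_id,system_a_label,system_b_label,gold_label
1,t1,yes,yes,yes
2,t1,no,yes,yes
3,t1,no,no,no
4,t2,yes,no,no
5,t2,no,yes,yes
6,t2,yes,yes,yes
7,t3,no,yes,yes
8,t3,no,no,yes
"""


def mcnemar_exact(a_only: int, b_only: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = a_only + b_only
    if n == 0:
        return 1.0  # the systems never disagree on correctness
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_report(csv_text: str) -> dict:
    """Score both systems against gold and run the paired test."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    a_ok = [r["system_a_label"] == r["gold_label"] for r in rows]
    b_ok = [r["system_b_label"] == r["gold_label"] for r in rows]
    # Only examples where exactly one system is correct inform the test.
    a_only = sum(a and not b for a, b in zip(a_ok, b_ok))
    b_only = sum(b and not a for a, b in zip(a_ok, b_ok))
    return {
        "n": len(rows),
        "acc_a": sum(a_ok) / len(rows),
        "acc_b": sum(b_ok) / len(rows),
        "a_only": a_only,
        "b_only": b_only,
        "p_value": mcnemar_exact(a_only, b_only),
    }


report = paired_report(TOY_CSV)
print("| n | acc A | acc B | A-only correct | B-only correct | McNemar p |")
print("|---|-------|-------|----------------|----------------|-----------|")
print(f"| {report['n']} | {report['acc_a']:.3f} | {report['acc_b']:.3f} "
      f"| {report['a_only']} | {report['b_only']} | {report['p_value']:.3f} |")
```

The sketch reads `topic_id` but does not use it, so it omits the cluster-aware SE and the Wilson interval. With only eight rows, even a 4-to-0 split of the discordant examples does not reach the 0.05 threshold, which is the small-N failure case in miniature.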
What this will skip in v1
Multiple-comparisons correction (separate page in statistics). Mixed-effects models (rare in eval reports; overkill for v1). Bayesian intervals (worth a follow-up).
For the chapter that builds the conceptual frame, see confidence intervals and clustered standard errors. The Bradley-Terry analysis from Chatbot Arena is the canonical worked example of paired ranking with bootstrap CIs 4.