Statistics & Rigor

Confidence intervals, bootstrap, power, inter-rater agreement.

Eval scores are estimates with error bars, not facts. A 72% pass rate on a 100-trace dataset and a 72% pass rate on a 1000-trace dataset are not the same finding, and a model that scored 71% on the same set last month did not necessarily get worse. The statistics pages here make the error bars explicit, give you the formulas to compute them honestly, and call out the patterns that fool teams into shipping changes that did not move the metric.
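To make the 100-trace versus 1000-trace contrast concrete, here is a minimal sketch of a Wilson score interval using only the Python standard library. The function name `wilson_interval` and the counts (72 of 100, 720 of 1000) are illustrative choices taken from the example above, not code from the chapters.

```python
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (illustrative helper)."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * ((p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) ** 0.5) / denom
    return center - half_width, center + half_width


# Same 72% pass rate, two dataset sizes.
small = wilson_interval(72, 100)
large = wilson_interval(720, 1000)
print(f"n=100:  {small[0]:.3f} to {small[1]:.3f}")
print(f"n=1000: {large[0]:.3f} to {large[1]:.3f}")
```

At n=100 the 95% interval runs from roughly 0.63 to 0.80; at n=1000 it narrows to roughly 0.69 to 0.75. A 71% score sits inside both intervals, which is why a one-point drop on its own is not evidence of a regression. When both runs use the same traces, a paired comparison is the sharper tool.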

The opinionated bits: use Wilson or Clopper-Pearson for binary metrics; report a confidence interval next to every headline number; assume clustered standard errors whenever traces are correlated within sessions, users, or topics; correct for multiple comparisons when you slice; and treat inter-rater agreement as a calibration step you redo monthly, not a one-shot blessing. The Anthropic Error Bars paper is the canonical reference for the production cases.¹ The condensed reference for this section is statistical methods.
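The clustering advice can be illustrated with a small sketch that compares the naive standard error of a mean with a basic cluster-robust one. This is a simplified estimator without any small-sample correction, and the function name `naive_and_clustered_se`, the session labels, and the scores are all hypothetical.

```python
from collections import defaultdict
from math import sqrt


def naive_and_clustered_se(scores, cluster_ids):
    """Return (mean, naive SE, cluster-robust SE) for per-trace scores."""
    n = len(scores)
    mean = sum(scores) / n

    # Naive SE treats every trace as independent.
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    naive_se = sqrt(variance / n)

    # Cluster-robust SE: sum residuals inside each cluster first, then square,
    # so within-cluster correlation is counted instead of averaged away.
    residual_sums = defaultdict(float)
    for score, cluster in zip(scores, cluster_ids):
        residual_sums[cluster] += score - mean
    clustered_se = sqrt(sum(r**2 for r in residual_sums.values())) / n

    return mean, naive_se, clustered_se


# Hypothetical data: 4 sessions of 5 traces, outcomes identical within a session.
scores = [1] * 5 + [0] * 5 + [1] * 5 + [0] * 5
sessions = ["s1"] * 5 + ["s2"] * 5 + ["s3"] * 5 + ["s4"] * 5

mean, naive_se, clustered_se = naive_and_clustered_se(scores, sessions)
print(f"mean={mean:.2f} naive_se={naive_se:.3f} clustered_se={clustered_se:.3f}")
```

In this deliberately extreme case the 20 traces carry the information of only 4 independent sessions, so the clustered standard error (0.25) is more than twice the naive one (about 0.115). Real data is rarely this correlated, but the direction of the error is the same.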

Chapters:

Confidence intervals: Wilson and Clopper-Pearson. Why Wald misleads, why Wilson is the practical default, and why Clopper-Pearson is the conservative exact option.
Bootstrap and paired tests. Bootstrap when the metric is not a clean Bernoulli; pair when the systems share inputs.
Statistical power and sample sizing. Pick the delta you care about first, then derive the N, not the other way around.
Clustered standard errors. When questions cluster by topic or session, naive error bars come out far too narrow.
Inter-rater agreement: Cohen's kappa, Krippendorff's alpha. Disagreement between qualified humans is information about the rubric; measure it before trusting labels.
Multiple comparisons: Bonferroni vs BH-FDR. Twenty tests at alpha 0.05 produce one false positive on average when every null hypothesis is true. Bonferroni for family-wise error control, BH-FDR for false discovery rate control.
Effect size and absolute deltas. Report the magnitude, not just the p-value; a significant 0.3-point win may not be worth shipping.
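As a worked illustration of the multiple-comparisons entry, the sketch below computes the expected false-positive count for twenty tests and applies both corrections to a made-up list of p-values. The function names and the p-values are hypothetical, and the "at least one false positive" figure additionally assumes the tests are independent.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject where p <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value is <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Twenty true-null tests at alpha 0.05.
expected_false_positives = 20 * 0.05
prob_at_least_one = 1 - (1 - 0.05) ** 20  # assumes independent tests

# Hypothetical p-values from eight slices of one eval.
p_values = [0.001, 0.008, 0.012, 0.03, 0.04, 0.2, 0.5, 0.9]
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
print(expected_false_positives, round(prob_at_least_one, 3))
print("Bonferroni rejections:", sum(bonf), "BH rejections:", sum(bh))
```

On these numbers Bonferroni keeps one slice and BH keeps three, which shows the trade: Bonferroni guards against any false positive at all, while BH tolerates a controlled fraction of false discoveries in exchange for more power.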

¹ Miller, "Adding Error Bars to Evals" (Anthropic), arXiv:2411.00640.