Eval scores are estimates with error bars, not facts. A 72% pass rate on a 100-trace dataset and a 72% pass rate on a 1000-trace dataset are not the same finding, and a one-point gap between 72% today and 71% on the same set last month does not by itself show that the model changed. The statistics pages here make the error bars explicit, give you the formulas to compute them honestly, and call out the patterns that fool teams into shipping changes that did not move the metric.
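To make the 100-trace versus 1000-trace contrast concrete, here is a minimal sketch of a 95% Wilson score interval for a binary pass rate. The function name `wilson_interval` and the pass counts (72 of 100, 720 of 1000) are illustrative assumptions, not part of any library or dataset referenced here.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% coverage. Hypothetical helper,
    written out by hand so the formula is visible.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = passes / n
    denom = 1 + z**2 / n
    # The interval is centered on a value shrunk toward 0.5, not on p_hat.
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Same headline number, very different uncertainty.
for passes, n in [(72, 100), (720, 1000)]:
    lo, hi = wilson_interval(passes, n)
    print(f"{passes}/{n}: {lo:.3f} to {hi:.3f}")
```

At 100 traces the interval spans roughly 0.63 to 0.80; at 1000 traces it narrows to roughly 0.69 to 0.75. A score of 71% sits inside both, which is why a one-point difference alone is not evidence of a change. This sketch assumes independent traces; correlated traces need wider intervals.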
The opinionated bits: use Wilson or Clopper-Pearson for binary metrics; report a confidence interval next to every headline number; assume clustered standard errors whenever traces are correlated within sessions, users, or topics; correct for multiple comparisons when you slice; and treat inter-rater agreement as a calibration step you redo monthly, not a one-shot blessing. The Anthropic Error Bars paper is the canonical reference for the production cases [1].
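As a rough illustration of why clustering matters, the sketch below compares a naive standard error of the mean with a simple cluster-robust one on toy data. The helper name `naive_and_clustered_se`, the session labels, and the scores are all hypothetical. The estimator shown sums residuals within each cluster before squaring and applies no finite-sample correction, so treat it as one common form rather than the exact formula the chapters use.

```python
import math
from collections import defaultdict


def naive_and_clustered_se(scores, cluster_ids):
    """Return (mean, naive SE, cluster-robust SE) for per-trace scores.

    Hypothetical helper. No finite-sample correction is applied to the
    clustered estimate.
    """
    n = len(scores)
    mean = sum(scores) / n

    # Naive SE: treats every trace as an independent draw.
    sample_var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    naive_se = math.sqrt(sample_var / n)

    # Clustered SE: add up residuals inside each cluster first, then square,
    # so within-cluster correlation shows up in the variance.
    residual_sums = defaultdict(float)
    for s, c in zip(scores, cluster_ids):
        residual_sums[c] += s - mean
    clustered_se = math.sqrt(sum(r**2 for r in residual_sums.values())) / n

    return mean, naive_se, clustered_se


# Toy data: 4 sessions of 5 traces each, fully correlated within a session.
scores = [1] * 15 + [0] * 5
cluster_ids = [f"session-{i // 5}" for i in range(20)]

mean, naive_se, clustered_se = naive_and_clustered_se(scores, cluster_ids)
print(f"mean={mean:.2f} naive_se={naive_se:.3f} clustered_se={clustered_se:.3f}")
```

On this toy data the clustered standard error is about twice the naive one, because the 20 traces carry closer to 4 independent observations than 20. Real eval sets are rarely this extreme, but the direction of the error is the same: ignoring clusters makes intervals too narrow.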
Chapters: