Foundational eval theory papers

What an eval is, what a benchmark is, and why rankings flip when nothing changed. The papers that built the vocabulary.

Nine papers that explain why eval design is harder than it looks. MMLU and BIG-Bench are the canonical knowledge benchmarks. HELM is the canonical attempt to evaluate more than one axis at once. The Benchmark Lottery, Stop Uploading Test Data, MMLU-Redux, and Sclar et al. on format sensitivity together show why benchmark numbers move when the model does not. Miller's Adding Error Bars makes the statistical argument that an eval run is a sample, not a measurement.
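To make the "sample, not a measurement" point concrete, here is a minimal sketch of putting a standard error and an approximate 95% interval on an eval score. It assumes questions are scored independently and that the normal approximation is reasonable for the sample size. The function name `mean_with_ci` and the 720-out-of-1,000 run are hypothetical and chosen only for illustration; they are not taken from any of the papers.

```python
import math


def mean_with_ci(scores, z=1.96):
    """Return (mean, standard_error, (low, high)) for per-question scores.

    Assumes independent questions and a normal approximation;
    z=1.96 gives an approximate 95% interval.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate a standard error")
    mean = sum(scores) / n
    # Unbiased sample variance (divides by n - 1).
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    # Standard error of the mean: sample standard deviation over sqrt(n).
    se = math.sqrt(variance / n)
    return mean, se, (mean - z * se, mean + z * se)


# Hypothetical eval run: 1,000 questions, 720 answered correctly.
scores = [1] * 720 + [0] * 280
mean, se, (low, high) = mean_with_ci(scores)
print(f"accuracy = {mean:.3f}, standard error = {se:.4f}")
print(f"approximate 95% interval = ({low:.3f}, {high:.3f})")
```

With 1,000 questions the interval spans a few percentage points, so a gap of one or two points between two models on a benchmark of this size can sit within sampling noise. If questions are grouped (for example, several questions about one passage), the independence assumption fails and this sketch will understate the uncertainty.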