Nine papers that explain why eval design is harder than it looks. MMLU and BIG-Bench are the canonical knowledge benchmarks. HELM is the canonical attempt to evaluate more than one axis at once. The Benchmark Lottery, Stop Uploading Test Data, MMLU-Redux, and Sclar et al. on format sensitivity together show why benchmark numbers move when the model does not. Miller's Adding Error Bars makes the statistical argument that an eval run is a sample, not a measurement.
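To make the "sample, not a measurement" point concrete, here is a minimal sketch that attaches a standard error and a 95% confidence interval to an accuracy score. It is an illustration, not the paper's full method: it assumes each question is scored 0 or 1 and that questions are independent draws. The function name `accuracy_with_ci` and the 1,000-question run are hypothetical.

```python
import math


def accuracy_with_ci(scores, z=1.96):
    """Return (mean, standard_error, (low, high)) for per-question scores.

    Assumes questions are independent; z=1.96 gives a normal-approximation
    95% interval.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scored questions")
    mean = sum(scores) / n
    # Sample variance with Bessel's correction (divide by n - 1).
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    # Standard error of the mean under the central limit theorem.
    sem = math.sqrt(variance / n)
    return mean, sem, (mean - z * sem, mean + z * sem)


# Hypothetical eval run: 1,000 questions, 700 answered correctly.
scores = [1] * 700 + [0] * 300
mean, sem, (low, high) = accuracy_with_ci(scores)
print(f"accuracy = {mean:.3f}, SE = {sem:.4f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

On this made-up run the interval spans about 2.8 percentage points on either side of the mean, so a one- or two-point gap between two models on a benchmark of this size would not, by itself, separate them.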