Reasoning and math papers

From MATH and Chain-of-Thought to FrontierMath, ARC-AGI-2, Humanity's Last Exam, and the faithfulness problem.

Eleven papers run from the original ARC paper to the post-o1 faithfulness work. Chollet's measure-of-intelligence framing, Hendrycks et al.'s MATH, Wei et al.'s chain-of-thought, Wang et al.'s self-consistency, and Lightman et al.'s process supervision are the load-bearing references. GPQA and Skill-Mix raised the difficulty floor. FrontierMath, ARC-AGI-2, and Humanity's Last Exam set the 2025 frontier. The Anthropic faithfulness paper is the result that anyone running judge-of-reasoning evals should read before trusting the trace.