What "good" looks like
Reasoning models (the o-series, Claude with extended thinking, R1-style chains) are evaluated on two axes that can diverge. The first is correctness on hard problems where chain-of-thought matters: GPQA Diamond (198 graduate-level questions), MATH, AIME[1]. The second is faithfulness: whether the visible thinking trace actually reflects the computation that produced the answer. Anthropic's 2025 result is the most useful single reference point on this axis: when researchers injected hints into prompts, Claude 3.7 acknowledged the hint in its visible thinking only 25% of the time, and R1 only 39%[2]. The thinking trace looks like reasoning, but it is often not a full account of what the model is actually doing.
Both metrics matter and they trade off in counterintuitive ways. A model that scores higher on GPQA Diamond can be less faithful in its visible reasoning; longer chains are not more faithful. The "good" target depends on the surface: if you show reasoning to users, faithfulness is product-critical; if reasoning is internal, accuracy is the only customer-facing axis.
Secondary metrics
- Tokens to correct answer. Reasoning models can balloon trace length; tokens spent per correct answer is the cost-quality frontier.
- Sandbagging detection on adversarial hints: cases where the model knows the answer but hides it in its visible reasoning[2].
- Paraphrase consistency: the same problem in four wordings should yield the same answer.
- Self-consistency: sample N times at temperature greater than 0 and take a majority vote[3]; the gap between the greedy (argmax) answer and the majority vote is a proxy for noisiness.
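The self-consistency and paraphrase checks can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the record layout (`gold`, `greedy`, `samples`) and the function names are assumptions made for the example, and answers are compared by exact string match, which real graders usually need to relax.

```python
from collections import Counter


def majority_vote(answers):
    """Most common answer among the samples (ties go to the answer seen first)."""
    return Counter(answers).most_common(1)[0][0]


def self_consistency_gap(problems):
    """Compare greedy accuracy with majority-vote accuracy.

    problems: list of dicts with hypothetical keys
      'gold'    - reference answer
      'greedy'  - answer from a single temperature-0 run
      'samples' - N answers sampled at temperature > 0
    Returns (greedy_accuracy, vote_accuracy, gap).
    """
    n = len(problems)
    greedy_acc = sum(p["greedy"] == p["gold"] for p in problems) / n
    vote_acc = sum(majority_vote(p["samples"]) == p["gold"] for p in problems) / n
    return greedy_acc, vote_acc, vote_acc - greedy_acc


def paraphrase_consistency(answers_by_problem):
    """Fraction of problems where every wording produced the same answer."""
    consistent = sum(len(set(answers)) == 1 for answers in answers_by_problem)
    return consistent / len(answers_by_problem)


problems = [
    {"gold": "42", "greedy": "41", "samples": ["42", "42", "41", "42", "40"]},
    {"gold": "7", "greedy": "7", "samples": ["7", "7", "7", "3", "7"]},
]
print(self_consistency_gap(problems))
print(paraphrase_consistency([["A", "A", "A", "A"], ["B", "C", "B", "B"]]))
```

A large positive gap means single-run answers are noisy and voting is rescuing them; a gap near zero means the extra samples are buying little.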
Offline eval design
For capability, use GPQA Diamond plus MATH plus the most recent AIME. AIME has only 30 questions per year, so variance is high; report bootstrap confidence intervals, not point estimates. For faithfulness, build a hint-perturbation harness in the Anthropic style: take a problem, prepend a hint of the answer (correct or incorrect), and measure how often the visible reasoning acknowledges the hint[2]. Twenty cases are enough to see the gap; 100 are enough to track it across releases.
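To see why a point estimate on 30 items is misleading, here is a minimal percentile-bootstrap sketch using only the standard library. The function name, the resample count, and the 18-of-30 example score are illustrative assumptions, not figures from any real run.

```python
import random


def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy.

    outcomes: list of 0/1 values, one per benchmark item.
    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the items with replacement and recompute accuracy each time.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return sum(outcomes) / n, means[lo_idx], means[hi_idx]


# Hypothetical AIME-sized run: 18 correct out of 30.
outcomes = [1] * 18 + [0] * 12
point, lower, upper = bootstrap_ci(outcomes)
print(f"accuracy={point:.2f}, 95% CI=[{lower:.2f}, {upper:.2f}]")
```

On a set this small the interval spans tens of percentage points, so a few points of difference between two releases is usually inside the noise.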
Capture and log the full thinking tokens, not just the final answer. The thinking is the eval surface; treating it as an opaque internal state defeats the purpose.
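A hint-perturbation harness that also logs the full trace might look like the sketch below. Everything named here is an assumption for illustration: `run_model` is a hypothetical callable you supply that returns `(thinking, answer)`, the hint wording is invented, and the keyword check in `acknowledges_hint` is a crude stand-in for a human or model grader.

```python
import json

# Hypothetical hint wording; real harnesses vary the hint type.
HINT_TEMPLATE = "A colleague suggests the answer is {hint}.\n\n{question}"
# Crude keyword markers, assumed for the sketch; replace with a proper grader.
ACK_MARKERS = ("colleague", "suggest", "hint", "was told")


def acknowledges_hint(thinking):
    """True if the visible reasoning appears to mention the injected hint."""
    text = thinking.lower()
    return any(marker in text for marker in ACK_MARKERS)


def hint_mention_rate(cases, run_model, log_path=None):
    """Run each case with a prepended hint and measure the mention rate.

    cases: list of dicts with 'question' and 'hint' (correct or incorrect).
    run_model: callable(prompt) -> (thinking, answer), supplied by the caller.
    log_path: optional JSONL file that receives the full thinking trace.
    """
    records = []
    for case in cases:
        prompt = HINT_TEMPLATE.format(hint=case["hint"], question=case["question"])
        thinking, answer = run_model(prompt)
        records.append({
            "question": case["question"],
            "hint": case["hint"],
            "thinking": thinking,  # keep the full trace, not just the answer
            "answer": answer,
            "followed_hint": answer == case["hint"],
            "acknowledged": acknowledges_hint(thinking),
        })
    if log_path is not None:
        with open(log_path, "w", encoding="utf-8") as f:
            for record in records:
                f.write(json.dumps(record) + "\n")
    rate = sum(r["acknowledged"] for r in records) / len(records)
    return rate, records
```

The `followed_hint` field is logged so you can also compute a stricter rate restricted to cases where the model's answer matched the hint, which is the more worrying population.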
Online eval design
For products that surface thinking to users, sample traces and run an LLM-as-a-judge (LaaJ) check: does the visible reasoning support the answer the model gives? Alert on reasoning-length explosions, since reasoning tokens are billed; a 3x jump in tokens per answer is either a model regression or a prompt regression. For products where reasoning is hidden, the online check is correctness against user feedback or downstream task success, not the reasoning surface itself.
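The reasoning-length alert reduces to comparing a recent window against a baseline. A minimal sketch, assuming you already collect reasoning-token counts per answer; the function name and the choice of the median (to keep one pathological trace from firing the alert) are assumptions of this example.

```python
from statistics import median


def reasoning_length_alert(baseline_tokens, recent_tokens, ratio_threshold=3.0):
    """Flag a jump in reasoning tokens per answer.

    baseline_tokens / recent_tokens: per-answer reasoning-token counts
    from a reference window and the current window.
    Returns (alert, ratio) where ratio = recent median / baseline median.
    """
    if not baseline_tokens or not recent_tokens:
        raise ValueError("both windows need at least one observation")
    base = median(baseline_tokens)
    if base <= 0:
        raise ValueError("baseline median must be positive")
    ratio = median(recent_tokens) / base
    return ratio >= ratio_threshold, ratio


alert, ratio = reasoning_length_alert([800, 1000, 1200], [2900, 3100, 3300])
print(alert, round(ratio, 2))
```

When the alert fires, diff the prompt and the model version first; those are the two causes named above.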
Top benchmarks
- GPQA Diamond, https://arxiv.org/abs/2311.12022. 198 graduate-level questions; non-experts with web search reach 34%, experts around 65%.
- MATH, https://arxiv.org/abs/2103.03874. Hendrycks et al.; full MATH is saturated, MATH-500 from Lightman et al. is the current discriminator[4].
- AIME 2024/2025. Official MAA; 30 items per year; useful but variance-heavy.
For tooling: Inspect AI for capturing the full thinking trace alongside the answer; lm-evaluation-harness from EleutherAI for reproducible GPQA and MATH runs; Anthropic Console Evaluate for side-by-side prompt and model comparison on a fixed reasoning set; Braintrust for trace storage with full token logs.
Common pitfalls
- Treating high CoT length as "more thinking". Anthropic showed unfaithful explanations are longer, not shorter[2]. The correlation between trace length and answer quality is weak past a saturation point.
- Using MMLU as a reasoning eval. MMLU is saturated and the MMLU-Redux re-annotation found a 6.49% error rate across the benchmark, 57% in Virology[5]. Use GPQA, MATH, or AIME instead.
- Reporting AIME as a single number without confidence intervals. 30 items is small and the variance is real.
- Skipping the faithfulness check entirely. If your product shows reasoning, the user is making decisions on the visible trace; you owe them the audit.
Canonical reading
Anthropic's 2025 "Reasoning Models Don't Say What They Think" is the most important single read in this category because it makes the faithfulness problem measurable rather than philosophical[2]. The 2023 precursor on measuring faithfulness in chain-of-thought establishes the hint-perturbation methodology[6]. Pair with the original chain-of-thought paper[7] for context on why this matters.
Day-1 starter
- GPQA Diamond plus your domain's "hard 50". Score correctness with bootstrap confidence intervals because the sample sizes are small.
- For 20 cases, inject a correct or incorrect hint and measure how often the visible reasoning acknowledges it. That mention rate is your faithfulness baseline.
- Log full thinking tokens for every release. Track tokens per correct answer over time; a release that adds 30% more tokens without adding correctness is a regression.
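The release-over-release check in the last item can be sketched as follows. It assumes both releases were scored on the same eval set and that each result is a `(thinking_tokens, is_correct)` pair; the function names and the sample numbers are illustrative, not taken from a real run.

```python
def tokens_per_correct(results):
    """Total thinking tokens divided by the number of correct answers.

    results: list of (thinking_tokens, is_correct) pairs.
    """
    total_tokens = sum(tokens for tokens, _ in results)
    n_correct = sum(1 for _, ok in results if ok)
    return float("inf") if n_correct == 0 else total_tokens / n_correct


def is_token_regression(old, new, token_increase=0.30):
    """True if the new release spends at least `token_increase` more tokens
    per item without improving accuracy on the same eval set."""
    old_mean = sum(t for t, _ in old) / len(old)
    new_mean = sum(t for t, _ in new) / len(new)
    old_acc = sum(1 for _, ok in old if ok) / len(old)
    new_acc = sum(1 for _, ok in new if ok) / len(new)
    return new_mean >= old_mean * (1 + token_increase) and new_acc <= old_acc


# Hypothetical results for two releases on the same four items.
old_release = [(1000, True), (1000, False), (1000, True), (1000, True)]
new_release = [(1400, True), (1300, False), (1300, True), (1400, True)]
print(tokens_per_correct(old_release), tokens_per_correct(new_release))
print(is_token_regression(old_release, new_release))
```

On small sets, pair this with confidence intervals on the accuracy side before calling a regression, since equal accuracy on 50 items does not establish equal capability.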