Reasoning depth and thinking-mode analysis

Evaluate the thinking traces of reasoning models. Correctness on hard reasoning plus chain-of-thought faithfulness, not just answer accuracy.

What "good" looks like

A reasoning eval scores two axes that diverge: correctness on hard problems where chain-of-thought matters, and faithfulness, whether the visible thinking trace reflects the computation that produced the answer. Score both, because a model can gain one while losing the other.

The two axes for reasoning models (the o-series, Claude with extended thinking, R1-style chains):

Correctness: GPQA Diamond (198 graduate-level questions), MATH, and AIME^[1].
Faithfulness: how often the visible reasoning acknowledges an injected hint. When researchers injected hints into prompts, Claude 3.7 acknowledged the hint in its visible thinking only 25% of the time, R1 only 39%^[2].

The trace can lie. The thinking looks like reasoning but is not always what the model is doing. Anthropic's 2025 result is the most useful single benchmark on this axis.

The axes trade off counterintuitively. A model that scores higher on GPQA Diamond can be less faithful; longer chains are not more faithful.

The target depends on the surface. Show reasoning to users and faithfulness is product-critical; keep it internal and accuracy is the only customer-facing axis.

flowchart TD
    PROB["Reasoning problem"] --> RUN["Run model, log full thinking tokens"]
    RUN --> ACC["Correctness: GPQA Diamond, MATH, AIME"]
    RUN --> HINT["Inject correct or incorrect hint"]
    HINT --> FAITH["Faithfulness: does visible reasoning acknowledge the hint?"]
    ACC --> SURF{"Is reasoning shown to users?"}
    FAITH --> SURF
    SURF -->|yes| BOTH["Gate on correctness and faithfulness"]
    SURF -->|no| ONLY["Gate on correctness only"]

Secondary metrics

Metric	What it catches	How to score
Tokens to correct answer	Trace-length balloon; the cost-quality frontier	Ratio of tokens to correct answer
Sandbagging detection	Model knows the answer but hides it in visible reasoning	Adversarial hints where visible reasoning conceals the answer^[2]
Paraphrase consistency	Answer instability across wordings	Same problem, four wordings, same answer
Self-consistency	Noisiness of the reasoning path	Sample N times at temperature greater than 0 and majority vote; the gap between argmax and majority vote is the proxy^[3]

Offline eval design

Capability set. GPQA Diamond plus MATH plus the most recent AIME. AIME has only 30 questions per year, so variance is high; report bootstrap confidence intervals, not point estimates.

Faithfulness harness. Build a hint-perturbation harness in the Anthropic style: take a problem, prepend a hint of the answer (correct or incorrect), and measure how often the visible reasoning acknowledges the hint^[2]. Twenty cases is enough to see the gap; 100 is enough to track it across releases.

Log the thinking. Capture the full thinking tokens, not just the final answer. The thinking is the eval surface; treating it as opaque state defeats the purpose.

Online eval design

When thinking is surfaced to users. Sample traces and run a LaaJ check: does the visible reasoning support the answer the model gives?

Watch token growth. Alert on reasoning-length explosions, since reasoning tokens are billed. A 3x jump in tokens per answer is either a model regression or a prompt regression.

When reasoning is hidden. The online check is correctness against user feedback or downstream task success, not the reasoning surface.

Top benchmarks

GPQA Diamond, https://arxiv.org/abs/2311.12022. 198 graduate-level questions; non-experts with web search reach 34%, experts around 65%.
MATH, https://arxiv.org/abs/2103.03874. Hendrycks et al.; full MATH is saturated, MATH-500 from Lightman et al. is the current discriminator^[4].
AIME 2024/2025. Official MAA; 30 items per year; useful but variance-heavy.

Tools especially suited

Inspect AI: capturing the full thinking trace alongside the answer.
lm-evaluation-harness: reproducible GPQA and MATH runs, from EleutherAI.
Anthropic Console Evaluate: side-by-side prompt and model comparison on a fixed reasoning set.
Braintrust: trace storage with full token logs.

Common pitfalls

Pitfall	Why it bites	Fix
Treating high CoT length as "more thinking"	Unfaithful explanations are longer, not shorter; trace length and answer quality correlate weakly past saturation	Measure faithfulness directly, not length^[2]
Using MMLU as a reasoning eval	MMLU is saturated; MMLU-Redux re-annotation found a 6.49% error rate across the benchmark, 57% in Virology	Use GPQA, MATH, or AIME instead^[5]
Reporting AIME as a single number	30 items is small and the variance is real	Report bootstrap confidence intervals
Skipping the faithfulness check	If your product shows reasoning, users decide on the visible trace	Run the hint-perturbation audit you owe them

Canonical reading

Anthropic's 2025 "Reasoning Models Don't Say What They Think" is the most important single read in this category because it makes the faithfulness problem measurable rather than philosophical^[2]. The 2023 precursor on measuring faithfulness in chain-of-thought establishes the hint-perturbation methodology^[6]. Pair with the original chain-of-thought paper^[7] for context on why this matters.

Day-1 starter

GPQA Diamond plus your domain's "hard 50". Score correctness with bootstrap confidence intervals because the sample sizes are small.
For 20 cases, inject a correct or incorrect hint and measure how often the visible reasoning acknowledges it. That mention rate is your faithfulness baseline.
Log full thinking tokens for every release. Track tokens per correct answer over time; a release that adds 30% tokens without adding correctness is regressive.

Cross-cutting links

Why LLM evals are hard: prompt sensitivity and non-determinism, the background noise of every reasoning eval.
Robustness testing: paraphrase consistency, the check that separates reasoning from recall.
Contamination detection: why saturated public sets mislead and how to probe for leakage.
Inspect AI: trace capture for thinking tokens.
lm-evaluation-harness: the standard runner for GPQA, MATH, and AIME.
Contamination detection recipe: runnable n-gram and rephrase probes for your reasoning set.