Effect size: Cohen's d and absolute deltas


A statistically significant 0.3-point win on a 0-100 rubric is not worth shipping. Report the magnitude, not just the p-value.

A p-value answers "is the observed delta consistent with zero." It does not answer "is the delta large enough to ship." An eval set of ten thousand inputs can detect a 0.3-point win on a 100-point rubric with p < 0.001 and still be reporting an effect nobody can feel. Effect size is the missing column: it pairs the p-value with a magnitude that makes the practical importance of the result obvious.

The opinion: every eval comparison reports the absolute delta with its interval and, for continuous metrics, a standardised effect size (Cohen's d). The p-value adjudicates "is this real"; the effect size adjudicates "is this big enough to act on." Reporting only one of the two is half the analysis.

Absolute delta first

For most production eval comparisons, the right primary number is the absolute delta in the natural units of the metric. A 4-point shift in pass rate. A 200ms shift in p95 latency. A $0.003 shift in cost per request. The team's product context decides whether each is large enough to act on, and that is the decision unit.

The reporting habit is to put the delta first and the inference second:

"Quality moved from 72% to 76%, a 4-point absolute gain. The paired bootstrap 95% interval on the delta is [2.4, 5.6]; the p-value under McNemar is 0.0003."
"Cost moved from $0.018 to $0.021, a 17% relative increase. The bootstrap interval excludes zero; the p-value is below 0.001."

The reader sees the magnitude before the inference. The Anthropic Error Bars paper uses this framing throughout; the Chatbot Arena reports follow the same shape for Elo deltas ¹ ² ³.
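
As a minimal sketch of where an interval like the one quoted above can come from, the following computes a percentile bootstrap interval on a paired delta by resampling inputs. The function name `paired_bootstrap_ci`, its defaults, and the example data are illustrative assumptions, not taken from the cited reports.

```python
import random
from statistics import mean

def paired_bootstrap_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval on mean(a) - mean(b).

    a and b are per-input scores for the same inputs, in the same order.
    Resampling is over inputs, so the pairing is preserved.
    """
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and the same length")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    boot_means = sorted(
        mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return mean(diffs), boot_means[lo_idx], boot_means[hi_idx]

# Hypothetical pass/fail outcomes on 100 shared inputs:
# 70 both pass, 8 only the candidate passes, 4 only the baseline passes, 18 both fail.
candidate = [1] * 70 + [1] * 8 + [0] * 4 + [0] * 18
baseline  = [1] * 70 + [0] * 8 + [1] * 4 + [0] * 18

delta, low, high = paired_bootstrap_ci(candidate, baseline)
print(f"delta={delta:+.3f}, 95% interval=[{low:+.3f}, {high:+.3f}]")
```

With only 100 inputs the interval around a 4-point delta is wide, which is exactly what putting the delta and its interval side by side is meant to expose.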

Cohen's d for continuous metrics

When the metric is continuous (a Likert mean, an embedding similarity, a token-overlap score) and you want a unit-free measure of how separated the two distributions are, Cohen's d standardises the delta by the pooled standard deviation.

For two independent samples:

d = (mean_a - mean_b) / pooled_sd
pooled_sd = sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))

For a paired comparison (the usual offline eval case), use the standard deviation of the per-input differences:

d_paired = mean(diff) / sd(diff)

The conventions (Cohen 1988) are widely cited and worth committing to memory:

d around 0.2: small.
d around 0.5: medium.
d around 0.8: large.

These are calibrated to behavioural research; they are not laws of nature. Production eval contexts often consider d = 0.2 actionable when the metric is high-stakes, and d = 0.5 too small to ship when the cost or latency tradeoff is large.

"""
Cohen's d for independent and paired samples. No external deps.
"""
import math
from statistics import mean, stdev

def cohens_d_independent(a: list[float], b: list[float]) -> float:
    n_a, n_b = len(a), len(b)
    if n_a < 2 or n_b < 2:
        return float("nan")
    var_a = stdev(a) ** 2
    var_b = stdev(b) ** 2
    pooled = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    if pooled == 0:
        return float("nan")
    return (mean(a) - mean(b)) / pooled

def cohens_d_paired(a: list[float], b: list[float]) -> float:
    assert len(a) == len(b)
    diffs = [x - y for x, y in zip(a, b)]
    if len(diffs) < 2 or stdev(diffs) == 0:
        return float("nan")
    return mean(diffs) / stdev(diffs)

# Example: judged scores 0-1
a = [0.82, 0.79, 0.91, 0.74, 0.88, 0.85, 0.80, 0.92, 0.76, 0.84]
b = [0.78, 0.74, 0.85, 0.72, 0.83, 0.81, 0.76, 0.88, 0.70, 0.79]
print(cohens_d_paired(a, b))   # ~3.8 (very large; the per-input differences are tightly clustered)

For binary metrics, the analogue is Cohen's h (the difference between the arcsine-transformed proportions); for ordinal scales, Cliff's delta is the standard nonparametric alternative. In practice, for binary metrics, reporting the absolute percentage-point delta with its Wilson interval is more interpretable than Cohen's h and is the default.
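
For the binary case, a rough sketch of both options follows. The helper names are hypothetical. The interval on the delta uses Newcombe's approach of combining two Wilson intervals, which assumes the two samples are independent; a paired comparison on shared inputs would need a different interval.

```python
import math

def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a single proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

def wilson_delta_interval(s1, n1, s2, n2, z=1.96):
    """Interval on p1 - p2 built from two Wilson intervals (Newcombe-style).

    Assumption: the two samples are independent.
    """
    p1, p2 = s1 / n1, s2 / n2
    l1, u1 = wilson_interval(s1, n1, z)
    l2, u2 = wilson_interval(s2, n2, z)
    delta = p1 - p2
    low = delta - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    high = delta + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return delta, low, high

# Hypothetical counts: 380/500 passes for the candidate, 360/500 for the baseline.
h = cohens_h(380 / 500, 360 / 500)
delta, low, high = wilson_delta_interval(380, 500, 360, 500)
print(f"h={h:.3f}")
print(f"delta={delta * 100:+.1f} pp, 95% interval=[{low * 100:+.1f}, {high * 100:+.1f}] pp")
```

The percentage-point line reads directly against a product threshold, while h needs a lookup table to interpret, which is the argument for making the former the default.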

Why p-value alone is misleading

A p-value depends on the sample size as well as the underlying effect. Two very different scenarios can both produce p < 0.001:

A 10-point quality delta on a sample of 60.
A 0.3-point quality delta on a sample of 60,000.

The first is a launch story. The second is a real but negligible effect that crossed an arbitrary threshold only because the eval set is large. Reporting only the p-value treats these as equivalent. The Field Guide and Applied LLMs both flag this as a recurring failure mode in growing eval programs: the team scales the eval set, the p-values shrink, and leadership reads them as larger wins than they are ⁴ ⁵.
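
The arithmetic behind the two scenarios can be checked with a short sketch. It uses a normal approximation to a paired z-test and assumes a standard deviation of 20 rubric points for the per-input differences; that value, and the helper name `two_sided_p`, are illustrative assumptions rather than figures from the source.

```python
import math

def two_sided_p(delta, sd_diff, n):
    """Two-sided p-value from a normal approximation to the paired z-test."""
    z = delta / (sd_diff / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

SD_DIFF = 20.0  # assumed sd of per-input score differences, in rubric points

for delta, n in [(10.0, 60), (0.3, 60_000)]:
    p = two_sided_p(delta, SD_DIFF, n)
    d = delta / SD_DIFF  # paired Cohen's d under the same assumption
    print(f"delta={delta:>5} n={n:>6} p={p:.5f} d_paired={d:.3f}")
```

Under that assumption both rows land below p = 0.001, but the paired d is 0.5 in the first row and 0.015 in the second. The p-value column cannot tell them apart; the effect-size column can.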

Practical importance versus statistical significance

A simple operational rule. Define the minimum detectable effect (the MDE) for each release-gated metric on the spec; that is the threshold of practical importance. A win is a delta whose lower interval bound exceeds the MDE. A regression is a loss whose upper bound is below the negative MDE. Anything in between is a "noise within the band" outcome and does not merit either a ship or a revert.

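def decision(delta: float, ci_low: float, ci_high: float, mde: float) -> str:
    # Whole interval clears the MDE on the positive side: a practical win.
    if ci_low > mde:
        return "ship"
    # Whole interval sits below the negative MDE: a practical regression.
    if ci_high < -mde:
        return "block"
    # Interval reaches past the MDE band on either side without clearing it:
    # a practically important effect cannot be ruled out yet.
    if max(ci_high, -ci_low) > mde:
        return "investigate"
    # Interval lies entirely inside [-mde, +mde].
    return "noise"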

# Example
print(decision(delta=0.04, ci_low=0.024, ci_high=0.056, mde=0.02))
# "ship": the lower CI bound (2.4 pp) exceeds the MDE (2 pp).

The framing makes the MDE explicit and shifts the conversation away from "is it significant" toward "is it large enough to matter for this product."

Reporting

A minimum acceptable presentation of an effect:

The absolute delta, in the metric's natural units.
The 95% interval on the delta.
For continuous metrics, Cohen's d (paired or independent as appropriate).
The N of the comparison.
The MDE on the product spec, so the reader can place the result against the bar ².

The p-value is fine to include. It is not enough on its own.
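
One way to make that checklist hard to skip is to build the report line from a record that requires every field. The sketch below is an assumption about how a team might do it; the `EffectReport` class, its field names, and the N of 500 are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EffectReport:
    metric: str
    delta: float                      # absolute delta, natural units
    ci_low: float                     # 95% interval on the delta
    ci_high: float
    n: int                            # N of the comparison
    mde: float                        # MDE from the product spec
    cohens_d: Optional[float] = None  # continuous metrics only
    p_value: Optional[float] = None   # optional, never the headline

    def summary(self) -> str:
        parts = [
            f"{self.metric}: delta {self.delta:+.3f}",
            f"95% CI [{self.ci_low:+.3f}, {self.ci_high:+.3f}]",
            f"N={self.n}",
            f"MDE={self.mde:.3f}",
        ]
        if self.cohens_d is not None:
            parts.append(f"d={self.cohens_d:.2f}")
        if self.p_value is not None:
            parts.append(f"p={self.p_value:.4f}")
        return "; ".join(parts)

report = EffectReport(
    "pass_rate", 0.04, 0.024, 0.056, n=500, mde=0.02, p_value=0.0003
)
print(report.summary())
```

The delta, interval, N, and MDE are required arguments, while Cohen's d and the p-value are optional, which mirrors the ordering of the list above.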

TIP

A statistically significant tiny effect and a non-significant moderate effect both fail the ship test for different reasons. Report the magnitude and the interval so the reader can see which failure (or success) you are looking at ¹.

Pairing back to power and to gates

Effect size is the bridge between the statistical-power page and the release-gates page. Power answers "given the MDE, how large must the eval set be." Effect size answers "given the eval set, how large was the observed effect." The release gate answers "given the effect and its interval, what action ships." Without effect size, the gate is reading the wrong signal.
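
To make the power side of that bridge concrete, here is a rough sketch of the textbook normal-approximation sample size for a paired comparison. The function name `required_n_paired` and the example standard deviation are assumptions for illustration; the statistical-power page may use a different method.

```python
import math
from statistics import NormalDist

def required_n_paired(mde, sd_diff, alpha=0.05, power=0.80):
    """Approximate number of paired inputs needed to detect a delta of `mde`.

    Normal approximation for a two-sided paired test:
        n = ((z_{1 - alpha/2} + z_{power}) * sd_diff / mde) ** 2
    """
    if mde <= 0 or sd_diff <= 0:
        raise ValueError("mde and sd_diff must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) * sd_diff / mde) ** 2)

# Assumed: per-input differences have sd 20 on a 100-point rubric.
for mde in (2.0, 0.3):
    print(f"MDE={mde}: n >= {required_n_paired(mde, sd_diff=20.0)}")
```

Because n scales with the inverse square of the MDE, shrinking the MDE from 2 points to 0.3 points multiplies the required eval set by roughly 44, which is why the MDE has to be a product decision made before the eval set is sized.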

The flywheel framing puts it succinctly: every release decision should be accountable to both the inference and the magnitude, with the MDE on the spec and the delta with its interval on the report ⁶.

What to do this week

Pick the metric your team treats as primary. Write the MDE on the spec; it is a product number, not a statistical one.
Add the absolute delta and its interval to your comparison report. For continuous metrics, add Cohen's d.
Replace the "p < 0.05" verdict on the dashboard with the "ship / block / investigate / noise" decision tied to the MDE.

MDEs and intervals discipline the failures that show up on their own; the failures an adversary manufactures on purpose need a different toolkit, covered in Safety and Red-Team.

1. Miller, "Adding Error Bars to Evals" (Anthropic), arXiv:2411.00640.
2. Anthropic, "A Statistical Approach to Model Evaluations."
3. Chiang et al., "Chatbot Arena," arXiv:2403.04132.
4. Applied LLMs, "What We Learned from a Year of Building with LLMs."
5. Hamel Husain and Shreya Shankar, "LLM Evals FAQ."
6. Shreya Shankar, "The AI Engineering Flywheel."