Multiple comparisons: Bonferroni vs BH-FDR

Run twenty tests at alpha 0.05 when nothing has changed and you should expect one false positive on average. Correct for it. Use Bonferroni for strict family-wise control and Benjamini-Hochberg for false-discovery control.

A release scorecard with fifteen sub-metrics, each tested at alpha 0.05, has a roughly 54% chance of declaring at least one false positive even when nothing actually changed, assuming the tests are independent. The arithmetic is 1 - (1 - 0.05)^15 ≈ 0.54. Without a correction, most release sheets will report at least one "win" that is noise. Multiple-comparison correction makes the analysis honest about how many comparisons are on the table.
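The inflation arithmetic is easy to check directly. The sketch below assumes independent tests with every null true; the function names are illustrative and not taken from any library.

```python
def family_wise_error_rate(m: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across m independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** m


def expected_false_positives(m: int, alpha: float = 0.05) -> float:
    """Expected number of false positives across m true-null tests."""
    return m * alpha


# Family sizes that match the scorecard and dashboard examples in this piece.
for m in (1, 5, 15, 20, 25):
    print(f"m={m:>2}  FWER={family_wise_error_rate(m):.3f}  "
          f"expected false positives={expected_false_positives(m):.2f}")
```

Correlated metrics inflate more slowly than this, so treat the numbers as an upper-end guide, not an exact rate.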

The two standards. Bonferroni divides the per-test alpha by the number of comparisons, gives strict family-wise error rate control, and is the right choice when any false positive is costly. Benjamini-Hochberg (BH-FDR) controls the expected false discovery rate, is less conservative, and is the right choice when the cost of missing real effects is high relative to the cost of accepting some false positives.

Why correction is necessary

Three settings where the inflation bites in eval reports.

The first is multi-dimensional scorecards. A release comparison reports quality, refusal rate, latency p50, latency p95, cost per request, and per-stratum versions of each. A dozen tests is normal; the family-wise false-positive rate is non-trivial. The Anthropic Error Bars paper makes this case explicitly: every comparison deserves an interval, and the set of comparisons deserves a correction ¹ ².

The second is per-stratum dashboards. Watching twenty intent slices, each with its own pass rate and a daily alarm, is twenty tests every day. At alpha 0.05 that is about one false alarm per day when nothing has changed, or roughly ninety over a quarter. The Field Guide and the Applied LLMs piece both note that drift dashboards without correction become alarm noise quickly ³.

The third is judge ablation. Comparing five judges against the gold set on five axes is twenty-five tests; without correction the "best judge per axis" finding is partly noise.

Bonferroni

The simplest, most conservative correction. With m comparisons and a target family-wise alpha of 0.05, test each individual comparison at 0.05 / m. The probability that at least one true null in the family is rejected is then at most 0.05, whatever the dependence between the tests.

def bonferroni(p_values: list[float], alpha: float = 0.05):
    m = len(p_values)
    threshold = alpha / m
    return [(p, p <= threshold) for p in p_values]

# Twenty comparisons; one of them p = 0.008
p = [0.74, 0.31, 0.42, 0.008, 0.55, 0.62, 0.99, 0.18, 0.50, 0.71,
     0.44, 0.20, 0.85, 0.39, 0.66, 0.92, 0.10, 0.27, 0.81, 0.05]
print(bonferroni(p))
# Threshold = 0.05 / 20 = 0.0025. Since 0.008 > 0.0025, it is NOT significant
# after Bonferroni. That is the point: Bonferroni is strict.

Use Bonferroni when a single false positive is expensive. Release gates that fire automation, regulatory filings, customer-promised metrics. The cost of running larger eval sets to clear the stricter threshold is usually less than the cost of acting on a false positive.
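How much larger the eval set has to be can be estimated with the usual normal-approximation sample-size formula. The sketch below assumes a two-sided z-test at fixed effect size and variance, where the required n is proportional to (z_{1-alpha/2} + z_{power})^2. The function name and the 80% power default are assumptions for illustration.

```python
from statistics import NormalDist


def sample_size_inflation(m: int, alpha: float = 0.05, power: float = 0.8) -> float:
    """Approximate factor by which n must grow to keep the same power after Bonferroni.

    Assumes a two-sided z-test, so required n scales with
    (z_{1 - alpha/2} + z_{power})^2 for a fixed effect size and variance.
    """
    z = NormalDist().inv_cdf
    uncorrected = z(1 - alpha / 2) + z(power)
    corrected = z(1 - alpha / (2 * m)) + z(power)
    return (corrected / uncorrected) ** 2


for m in (5, 15, 20):
    print(f"m={m:>2}: eval set needs roughly {sample_size_inflation(m):.2f}x the examples")
```

Under these assumptions a family of twenty comparisons needs a little under twice the examples, which is the cost to weigh against acting on a false positive.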

Benjamini-Hochberg FDR

The less conservative alternative. Instead of controlling the probability of any false positive, BH controls the expected proportion of false positives among the rejected hypotheses. At a target FDR of 5%, you accept that roughly one in twenty rejected hypotheses is a false positive in expectation.

The procedure sorts the p-values in increasing order and finds the largest rank k such that p_(k) <= (k / m) * Q, where Q is the target FDR. Reject the k hypotheses with the smallest p-values, including any below rank k that missed their own cutoff.

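def benjamini_hochberg(p_values: list[float], fdr: float = 0.05):
    """Return one reject flag per hypothesis, in the original order."""
    m = len(p_values)
    # Sort by p-value while remembering each hypothesis's original position.
    indexed = sorted(enumerate(p_values), key=lambda x: x[1])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * fdr.
    n_rejected = 0
    for k, (_, p) in enumerate(indexed, start=1):
        if p <= (k / m) * fdr:
            n_rejected = k
    # Reject the n_rejected smallest p-values.
    rejected = [False] * m
    for orig_idx, _ in indexed[:n_rejected]:
        rejected[orig_idx] = True
    return rejected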

# Same twenty comparisons
print(benjamini_hochberg(p))
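# Nothing is rejected here either. The smallest p-value faces the cutoff
# (1 / 20) * 0.05 = 0.0025, the same as Bonferroni, and no larger p-value
# clears its own cutoff. BH only gains power when several p-values are small.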

Use BH when you are screening for candidates to investigate further. A drift dashboard flagging which strata to look at, a per-axis ablation suggesting which dimensions of a judge to revise, an exploratory release sheet where the team will read the underlying traces before acting.
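BH's extra sensitivity shows up when several p-values in a family are small at once. The sketch below runs both procedures on a hypothetical family of ten p-values, made up for illustration and not drawn from any real eval. The function names are illustrative as well.

```python
def bonferroni_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Reject each hypothesis whose p-value is at or below alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def bh_reject(p_values: list[float], fdr: float = 0.05) -> list[bool]:
    """Benjamini-Hochberg step-up procedure; flags are in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    n_rejected = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= (rank / m) * fdr:
            n_rejected = rank
    rejected = [False] * m
    for i in order[:n_rejected]:
        rejected[i] = True
    return rejected


# Hypothetical family of ten comparisons with a cluster of small p-values.
p_cluster = [0.001, 0.008, 0.012, 0.021, 0.04, 0.30, 0.45, 0.60, 0.75, 0.90]

print("Bonferroni rejects:", sum(bonferroni_reject(p_cluster)))
print("BH rejects:        ", sum(bh_reject(p_cluster)))
```

On this family Bonferroni rejects only the smallest p-value (cutoff 0.005), while BH rejects the three smallest because 0.012 clears the rank-3 cutoff of 0.015.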

Which to use when

The choice is about the relative cost of false positives versus false negatives.

Bonferroni: family-wise error rate control. Strict; conservative. Right for any test whose rejection automatically triggers an action.
BH-FDR: false discovery rate control. Less conservative. Right for any test whose rejection triggers an investigation, not an action.

In practice, most production eval programs end up with both. Release gates use Bonferroni on the gating axes; drift dashboards use BH on the exploratory ones. The flywheel framing makes this distinction explicit: gates require strict control, alarms require sensitive screening ⁴.

Reporting under correction

A correction without a report is half the work. The minimum acceptable presentation of a corrected analysis:

The raw p-values per comparison.
The correction method and the family size (m).
The corrected threshold or the per-test BH cutoff.
The flag per comparison: pass, fail, or under investigation.

A dashboard that shows the corrected verdict without the raw p-values cannot be audited. A dashboard that shows the raw p-values without the correction invites the reader to do the inflation themselves, which is the trap the correction is supposed to prevent.
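One way to keep all four items together is to build the report as a single structure. The sketch below is a minimal illustration, not a prescribed format: the function name, field names, metric names, and p-values are hypothetical, and the mapping of a Bonferroni rejection to "fail" and a BH rejection to "under investigation" is an assumption based on the gate-versus-screen split.

```python
def corrected_report(names, p_values, method="bonferroni", level=0.05):
    """Build an auditable report: raw p-values, method, family size, cutoffs, flags."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rank = {i: r for r, i in enumerate(order, start=1)}

    if method == "bonferroni":
        cutoffs = [level / m] * m
        rejected = [p <= c for p, c in zip(p_values, cutoffs)]
        hit_label = "fail"
    elif method == "bh":
        # Per-test BH cutoff depends on the rank of each p-value.
        cutoffs = [(rank[i] / m) * level for i in range(m)]
        # Largest rank whose p-value clears its own cutoff (0 if none does).
        k = max((rank[i] for i in range(m) if p_values[i] <= cutoffs[i]), default=0)
        rejected = [rank[i] <= k for i in range(m)]
        hit_label = "under investigation"
    else:
        raise ValueError(f"unknown method: {method}")

    rows = [
        {"comparison": n, "raw_p": p, "cutoff": c, "flag": hit_label if r else "pass"}
        for n, p, c, r in zip(names, p_values, cutoffs, rejected)
    ]
    return {"method": method, "family_size": m, "level": level, "rows": rows}


# Hypothetical metric names and p-values, for illustration only.
names = ["quality", "refusal_rate", "latency_p95", "cost"]
p_vals = [0.004, 0.03, 0.20, 0.011]

report = corrected_report(names, p_vals, method="bh")
print(report["method"], "m =", report["family_size"])
for row in report["rows"]:
    print(row)
```

Because the raw p-value and its cutoff sit in the same row, a reader can verify each flag without recomputing the correction.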

Where multiple-comparison reasoning fails

Two failure modes that show up reliably.

The first is correction over the wrong family. Dividing by m = 3 on a sheet that has fifteen more comparisons elsewhere is a Bonferroni adjustment in name only. Decide the family size before you read the p-values, and include all comparisons that share a decision context.

The second is repeated testing across time without a sequential correction. A daily comparison on the same metric for a quarter is roughly ninety tests; at alpha 0.05 the chance of at least one false alarm would be about 99% if the daily tests were independent (1 - 0.95^90). The fix is either sequential probability ratio testing or a correction that accounts for the cumulative number of tests. Hamel and Shankar's FAQ flags this as the standard online-eval pitfall ⁵.
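As a rough illustration of the sequential option, here is a minimal sketch of Wald's sequential probability ratio test on a stream of pass/fail outcomes. It assumes you can name a baseline pass rate p0 and a degraded pass rate p1 in advance; those values, the function name, and the alpha and beta defaults are assumptions for the example, not recommendations from the sources cited here.

```python
import math


def sprt(outcomes, p0, p1, alpha=0.05, beta=0.2):
    """Wald's SPRT on 0/1 outcomes. H0: pass rate is p0. H1: pass rate is p1.

    Returns (decision, n_used), where decision is "accept_h0", "accept_h1",
    or "continue" if neither boundary was crossed.
    """
    upper = math.log((1 - beta) / alpha)  # crossing it accepts H1
    lower = math.log(beta / (1 - alpha))  # crossing it accepts H0
    llr = 0.0
    n = 0
    for n, x in enumerate(outcomes, start=1):
        # Log-likelihood ratio contribution of one Bernoulli observation.
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1", n
        if llr <= lower:
            return "accept_h0", n
    return "continue", n


# Hypothetical streams: baseline pass rate 0.90, degraded pass rate 0.75.
print(sprt([1] * 50, p0=0.9, p1=0.75))  # steady passes
print(sprt([0] * 50, p0=0.9, p1=0.75))  # steady failures
print(sprt([1, 0], p0=0.9, p1=0.75))    # too little evidence so far
```

The test looks at the data after every observation yet keeps its error rates near alpha and beta, which is what a fixed-threshold daily check cannot do.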

TIP

Decide whether you are gating or screening. Apply Bonferroni for gates, BH for screens, and never publish a multi-axis dashboard without the family size visible ¹.

CITATION

Wikipedia, "Binomial proportion confidence interval." The single-comparison case the correction extends to families of comparisons. en.wikipedia.org/wiki/Binomial_proportion_confidence_interval

What to do this week

Audit your release sheet. Count the number of comparisons that share a single ship decision; that is your family size.
Apply Bonferroni on the gating axes. The raw p-values still appear on the report; the corrected threshold gates the merge.
Apply BH on the screening axes (drift dashboards, exploratory ablations). Surface the BH-rejected hypotheses as "look here next," not as conclusions.

1. Miller, "Adding Error Bars to Evals" (Anthropic), arXiv:2411.00640.
2. Anthropic, "A Statistical Approach to Model Evaluations."
3. Applied LLMs, "What We Learned from a Year of Building with LLMs."
4. Shreya Shankar, "The AI Engineering Flywheel."
5. Hamel Husain and Shreya Shankar, "LLM Evals FAQ."