Wilson and Clopper-Pearson assume the eval items are independent draws. Most production eval datasets do not look like that. MMLU has multiple questions per subject; a multi-turn benchmark has multiple turns per conversation; a production sample has multiple traces per session and multiple sessions per user. Items within a cluster are correlated with each other, the effective sample size is smaller than the row count, and the naive interval is too narrow 1.
The Anthropic Error Bars paper measures the gap directly. On standard reasoning benchmarks where questions cluster by topic, clustered standard errors are typically two to three times larger than the naive ones 1. A 1-point gap that looks significant under naive SEs vanishes once the clustering correction is applied. The same paper makes the case that any eval where items group by some natural axis (topic, document, user, scenario) should default to a clustered estimator, not the textbook one.
What clustering does to variance
The standard binomial variance assumes each trial is independent: $\mathrm{Var}(\hat{p}) = p(1-p)/n$. When items cluster, the trials within a cluster are correlated, and the variance of the mean depends on both the within-cluster and between-cluster components.
Let ρ be the intra-cluster correlation, m the average cluster size, and n the total number of items. The cluster-corrected variance scales by the design effect:
$$\mathrm{DEFF} = 1 + (m-1)\rho$$
The effective sample size is $n_{\mathrm{eff}} = n/\mathrm{DEFF}$, and the standard error is computed against $n_{\mathrm{eff}}$, not n. A benchmark with 10,000 items in 200 clusters of 50 each (m=50) and intra-cluster correlation ρ=0.04 has DEFF = 1 + 49·0.04 = 2.96. The effective sample size is roughly 3,400, not 10,000, and the standard error is √2.96 ≈ 1.7× larger than the naive computation. The design effect inflates the variance; the standard error inflates by its square root. A design effect near 3 therefore means an interval about 1.7× wider, while the two-to-three-times larger standard errors the Anthropic paper reports 1 correspond to design effects of roughly 4 to 9.
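The arithmetic in this example can be checked in a few lines. This is a minimal sketch; the helper name `design_effect` is ours, and the formula assumes roughly equal cluster sizes.

```python
import math


def design_effect(avg_cluster_size: float, icc: float) -> float:
    """DEFF = 1 + (m - 1) * rho, assuming roughly equal-sized clusters."""
    return 1.0 + (avg_cluster_size - 1.0) * icc


n, m, rho = 10_000, 50, 0.04          # numbers from the example above
deff = design_effect(m, rho)
n_eff = n / deff                      # effective sample size
se_inflation = math.sqrt(deff)        # SE scales with the square root of DEFF

print(f"DEFF={deff:.2f}  n_eff={n_eff:.0f}  SE inflation={se_inflation:.2f}x")
# DEFF=2.96  n_eff=3378  SE inflation=1.72x
```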
The intuition is plain. If you sample 50 questions on chemistry from the same textbook chapter and the model gets all 50 right or all 50 wrong, you really sampled "one chapter" not "fifty questions." Naive SE counts the latter and overstates precision.
Why this matters for eval reports
Three places this trips teams up.
The first is comparing models on a benchmark like MMLU. A 1-point gap between two models on 14,000 questions looks significant against a naive SE of about 0.4 percentage points, since it is roughly 2.5 standard errors. Under clustered standard errors with DEFF=3, the SE is closer to 0.7 percentage points (0.4 × √3), the gap shrinks to about 1.4 standard errors, and it is no longer significant. The release notes still call it a "win" 2.
The second is the production case. A 2% sample of 100,000 daily traces yields 2,000 scored traces. If 1,500 of those are from 300 distinct conversations (5 turns each), the effective N is closer to 600 once you correct for within-conversation correlation. The dashboard says "2,000 traces, Wilson interval ±1.5%." The honest answer is "effective 600 traces, Wilson interval roughly ±2.7%," because the half-width grows by about √(2000/600) ≈ 1.8×.
The third is the agentic case. Trajectory-level evals score many tool calls per trajectory and many turns per session. Correlation within a trajectory is high; the same tool failure cascades through the trajectory's subsequent steps. Treating each tool call as independent gives a wildly optimistic interval.
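To make the production case concrete, the sketch below computes a Wilson interval at the nominal and the effective sample size. The pass rate of 0.86 is an assumption chosen for illustration (the text does not state one), and `wilson_interval` is our own helper implementing the standard Wilson score formula.

```python
import math


def wilson_interval(p_hat: float, n: float, z: float = 1.96) -> tuple:
    """Wilson score interval for a proportion; n may be a non-integer effective n."""
    denom = 1.0 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


p_hat = 0.86  # assumed pass rate, for illustration only
half_widths = {}
for label, n in [("nominal", 2000), ("effective", 600)]:
    low, high = wilson_interval(p_hat, n)
    half_widths[label] = (high - low) / 2
    print(f"{label}: n={n}, interval=({low:.3f}, {high:.3f}), half-width={half_widths[label]:.3f}")
```

Under this assumed pass rate the half-widths come out near 1.5 and 2.8 percentage points, so the interval on the dashboard understates the uncertainty by almost a factor of two.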
For binary outcomes, the cluster-robust variance for the overall proportion follows the Cameron-Miller framework. Let $c$ be the number of clusters, $n_g$ the size of cluster $g$, $\bar{y}_g$ the cluster mean, and $\bar{y}$ the overall mean. The cluster-robust variance is:
$$\mathrm{Var}(\bar{y}) = \frac{c}{c-1}\cdot\frac{1}{n^2}\sum_{g=1}^{c} n_g^2\,(\bar{y}_g-\bar{y})^2$$
This is the formula on which the Anthropic Error Bars paper's clustered SE rests 1. It estimates the variance of the overall mean directly from the variability of cluster means, rather than from the variability of individual items. When cluster means are similar (low intra-cluster correlation), the formula collapses to something close to the naive SE; when cluster means differ (high intra-cluster correlation), the formula correctly inflates the SE.
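A quick sanity check on the formula, under the simplifying assumption (ours, not the paper's) that every cluster has the same size $m$, so that $n_g = m$ and $n = cm$:

```latex
\mathrm{Var}(\bar{y})
  = \frac{c}{c-1}\cdot\frac{1}{(cm)^2}\sum_{g=1}^{c} m^2\,(\bar{y}_g-\bar{y})^2
  = \frac{1}{c}\cdot\frac{1}{c-1}\sum_{g=1}^{c}(\bar{y}_g-\bar{y})^2
  = \frac{s_{\bar{y}}^2}{c}
```

Here $s_{\bar{y}}^2$ is the sample variance of the $c$ cluster means. With balanced clusters the estimator treats the cluster means as the independent observations, so the item count inside each cluster only matters through how it shapes those means.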
Code, with a synthetic example
```python
import numpy as np


def cluster_robust_se(scores: np.ndarray, cluster_ids: np.ndarray) -> dict:
    """Cluster-robust standard error for a proportion or mean.

    scores: 0/1 or numeric outcomes per item.
    cluster_ids: integer label per item, identifying its cluster.
    Returns mean, naive_se, clustered_se, design_effect (DEFF), and effective_n.
    """
    scores = np.asarray(scores, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    n = len(scores)
    clusters = np.unique(cluster_ids)
    c = len(clusters)
    if c < 2:
        raise ValueError("need at least two clusters for a clustered SE")
    mean = scores.mean()

    # Naive SE assumes i.i.d. items. Use the binomial form only for 0/1 data;
    # a numeric score whose mean happens to lie in [0, 1] needs the sample std.
    if np.isin(scores, (0.0, 1.0)).all():
        naive_se = np.sqrt(mean * (1 - mean) / n)
    else:
        naive_se = scores.std(ddof=1) / np.sqrt(n)

    # Clustered SE via Cameron-Miller / sandwich estimator on cluster means.
    cluster_sums = np.array([scores[cluster_ids == g].sum() for g in clusters])
    cluster_sizes = np.array([(cluster_ids == g).sum() for g in clusters])
    cluster_means = cluster_sums / cluster_sizes

    # Variance of the overall mean from cluster means.
    var_cluster = (c / (c - 1)) * np.sum(cluster_sizes**2 * (cluster_means - mean) ** 2) / n**2
    clustered_se = np.sqrt(var_cluster)

    deff = (clustered_se / naive_se) ** 2 if naive_se > 0 else float("nan")
    n_eff = n / deff if deff > 0 else float("nan")
    return {
        "n": n,
        "n_clusters": c,
        "mean": float(mean),
        "naive_se": float(naive_se),
        "clustered_se": float(clustered_se),
        "design_effect": float(deff),
        "effective_n": float(n_eff),
        "naive_ci_95": (float(mean - 1.96 * naive_se), float(mean + 1.96 * naive_se)),
        "clustered_ci_95": (float(mean - 1.96 * clustered_se), float(mean + 1.96 * clustered_se)),
    }


# Simulate the canonical case: 200 clusters of 10 questions each,
# strong intra-cluster correlation (cluster-level baseline drives the result).
rng = np.random.default_rng(42)
n_clusters = 200
cluster_size = 10

# Per-cluster baseline rate. For Beta(a, b) rates the intra-cluster correlation
# is 1 / (a + b + 1), so Beta(2.1, 0.9) keeps the mean at 0.7 with rho = 0.25.
# (Beta(7, 3) would give rho of only about 0.09 and a design effect near 1.8.)
cluster_rates = rng.beta(2.1, 0.9, size=n_clusters)

scores = []
cluster_ids = []
for g, rate in enumerate(cluster_rates):
    cluster_scores = rng.binomial(1, rate, size=cluster_size)
    scores.extend(cluster_scores)
    cluster_ids.extend([g] * cluster_size)
scores = np.array(scores)
cluster_ids = np.array(cluster_ids)

result = cluster_robust_se(scores, cluster_ids)
for k, v in result.items():
    print(f"  {k}: {v}")

# Approximate output (exact digits depend on the random draw and NumPy version;
# in expectation DEFF is 1 + 9 * 0.25 = 3.25):
#   n: 2000
#   n_clusters: 200
#   mean: 0.702
#   naive_se: 0.0102
#   clustered_se: 0.0188
#   design_effect: 3.40
#   effective_n: 588.3
#   naive_ci_95: (0.682, 0.722)
#   clustered_ci_95: (0.666, 0.739)
```
The synthetic example illustrates the effect the Anthropic paper describes. With 2,000 items in 200 clusters and strong intra-cluster correlation, the clustered SE is roughly 1.85× the naive SE. The design effect is 3.4, meaning the effective sample size is about 588, not 2,000. The clustered interval is roughly 85% wider than the naive interval. Treating the 2,000 items as independent and reporting the narrower interval claims a precision the data does not support.
CITATION
Evan Miller, "Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations" (Anthropic), arXiv:2411.00640, 2024. Derives the clustered variance estimator for benchmark settings, shows empirically that clustered SEs are typically 2-3x the naive SEs on standard reasoning benchmarks, and recommends defaulting to clustered estimators whenever items group by topic, scenario, or session. arxiv.org/abs/2411.00640
Bootstrap as an alternative
The Cameron-Miller formula above is the closed-form solution. The block bootstrap (often called the cluster bootstrap in this setting) is the resampling alternative: resample whole clusters with replacement, recompute the metric, and take the percentile interval of the bootstrap distribution. It respects the cluster structure automatically and is asymptotically equivalent to the closed-form estimator, so the two agree closely when there are many clusters. It also handles numeric and non-binary metrics without modification.
```python
import numpy as np


def cluster_bootstrap_ci(scores: np.ndarray, cluster_ids: np.ndarray,
                         n_boot: int = 2000, alpha: float = 0.05) -> tuple[float, float]:
    """Percentile bootstrap on cluster-level resamples."""
    clusters = np.unique(cluster_ids)
    cluster_scores = {g: scores[cluster_ids == g] for g in clusters}
    rng = np.random.default_rng(0)
    boot_means = np.empty(n_boot)
    for i in range(n_boot):
        # Resample whole clusters, keeping every item of each sampled cluster.
        sampled_clusters = rng.choice(clusters, size=len(clusters), replace=True)
        sample = np.concatenate([cluster_scores[g] for g in sampled_clusters])
        boot_means[i] = sample.mean()
    return float(np.quantile(boot_means, alpha / 2)), float(np.quantile(boot_means, 1 - alpha / 2))


# scores and cluster_ids come from the synthetic example above.
low, high = cluster_bootstrap_ci(scores, cluster_ids)
# Approximate output: (0.666, 0.739), close to the closed-form clustered CI.
```
Use the closed form for proportions and means; use the block bootstrap for ratios, F1, pass^k, or any other metric that is not a simple average of per-item scores. Both correct for the same problem.
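As an illustration of the non-mean case, here is a minimal sketch of a cluster-level bootstrap for F1. The function names, the synthetic data, and the Beta(4, 2) per-cluster accuracy are all assumptions made for this example; any metric that takes resampled labels and predictions could be passed in the same way.

```python
import numpy as np


def f1_binary(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """F1 for the positive class; NaN when there are no positives at all."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else float("nan")


def cluster_bootstrap_metric(y_true, y_pred, cluster_ids, metric_fn,
                             n_boot=1000, alpha=0.05, seed=0):
    """Percentile interval for metric_fn, resampling whole clusters."""
    rng = np.random.default_rng(seed)
    clusters = np.unique(cluster_ids)
    idx_by_cluster = [np.flatnonzero(cluster_ids == g) for g in clusters]
    stats = np.empty(n_boot)
    for b in range(n_boot):
        picks = rng.integers(0, len(clusters), size=len(clusters))
        idx = np.concatenate([idx_by_cluster[k] for k in picks])
        stats[b] = metric_fn(y_true[idx], y_pred[idx])
    return (float(np.nanquantile(stats, alpha / 2)),
            float(np.nanquantile(stats, 1 - alpha / 2)))


# Synthetic data: 100 clusters of 8 items, accuracy varies by cluster.
rng = np.random.default_rng(1)
n_clusters, size = 100, 8
cluster_ids = np.repeat(np.arange(n_clusters), size)
y_true = rng.integers(0, 2, size=n_clusters * size)
per_item_acc = np.repeat(rng.beta(4, 2, size=n_clusters), size)
correct = rng.random(n_clusters * size) < per_item_acc
y_pred = np.where(correct, y_true, 1 - y_true)

point = f1_binary(y_true, y_pred)
low, high = cluster_bootstrap_metric(y_true, y_pred, cluster_ids, f1_binary)
print(f"F1 = {point:.3f}, clustered 95% CI = ({low:.3f}, {high:.3f})")
```

Resampling indices per cluster, not per item, is the only change from an ordinary bootstrap; the metric function itself does not need to know about clusters.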
When to cluster
Cluster whenever the eval items naturally group on an axis that affects the score. The common cases:
- Benchmarks with topics, like MMLU subjects or BIG-Bench tasks. Cluster by topic.
- Multi-turn sessions, where each turn is a row. Cluster by session.
- Production traces from the same user, where one user contributes many traces. Cluster by user.
- Agentic trajectories, where each step is a row. Cluster by trajectory.
- Document-level QA, where the same document generates many questions. Cluster by document.
The test for whether clustering is needed is empirical: compute the intra-cluster correlation ρ from the data. If ρ is meaningfully above zero, the clustered SE is the right report. Even ρ = 0.02 matters when clusters are large: with 50 items per cluster it gives DEFF = 1 + 49·0.02 ≈ 2. Hamel and Shankar's FAQ gives the practitioner version of the rule: "if the same person, conversation, or document shows up more than once in your dataset, your standard errors are wrong" 3.
```python
import numpy as np


def intra_cluster_correlation(scores: np.ndarray, cluster_ids: np.ndarray) -> float:
    """ANOVA-style estimator of intra-cluster correlation (Snedecor & Cochran)."""
    clusters = np.unique(cluster_ids)
    overall_mean = scores.mean()
    n = len(scores)
    c = len(clusters)
    n_g = np.array([(cluster_ids == g).sum() for g in clusters])
    cluster_means = np.array([scores[cluster_ids == g].mean() for g in clusters])

    # Between-cluster mean square.
    ms_between = np.sum(n_g * (cluster_means - overall_mean) ** 2) / (c - 1)

    # Within-cluster mean square.
    ss_within = 0.0
    for g in clusters:
        cs = scores[cluster_ids == g]
        ss_within += np.sum((cs - cs.mean()) ** 2)
    ms_within = ss_within / (n - c) if n > c else float("nan")

    # Adjusted average cluster size for unequal clusters.
    m_bar = (n - np.sum(n_g ** 2) / n) / (c - 1) if c > 1 else float("nan")

    denom = ms_between + (m_bar - 1) * ms_within
    rho = (ms_between - ms_within) / denom if denom > 0 else 0.0
    return float(max(0.0, rho))


# scores and cluster_ids come from the synthetic example above.
print(f"Intra-cluster correlation: {intra_cluster_correlation(scores, cluster_ids):.3f}")
# Approximate output: 0.24
```
When not to cluster
Two cases where the clustered SE is not the right adjustment.
The first is when the items are genuinely independent. A well-curated golden set with 200 distinct prompts, each on a different topic, written by different authors, with no shared context, behaves like 200 independent draws. Wilson is fine.
The second is when there are few clusters or the cluster sizes are very uneven. The cluster-robust estimator has finite-sample bias when the number of clusters is small (under 30) or when one cluster dominates the dataset. The Cameron-Miller paper recommends 30+ clusters as a rule of thumb. With fewer, use a bootstrap designed for few clusters, such as the wild cluster bootstrap, or report a wider interval with a note.
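A simple pre-check can flag both conditions before a clustered SE is reported. This is a sketch only: `cluster_diagnostics` is a hypothetical helper, the 30-cluster threshold comes from the rule of thumb above, and the 20% cutoff for a "dominant" cluster is an arbitrary assumption to tune for your data.

```python
import numpy as np


def cluster_diagnostics(cluster_ids, min_clusters: int = 30, max_share: float = 0.20) -> dict:
    """Flag datasets where the cluster-robust SE may be unreliable."""
    _, sizes = np.unique(np.asarray(cluster_ids), return_counts=True)
    n_clusters = int(sizes.size)
    largest_share = float(sizes.max() / sizes.sum())
    return {
        "n_clusters": n_clusters,
        "largest_cluster_share": largest_share,
        "few_clusters": n_clusters < min_clusters,      # rule of thumb: 30+
        "dominant_cluster": largest_share > max_share,  # assumed cutoff
    }


# Toy example: 5 clusters, one of which holds 60% of the items.
ids = np.array([0] * 60 + [1] * 10 + [2] * 10 + [3] * 10 + [4] * 10)
diag = cluster_diagnostics(ids)
print(diag)
```

If either flag is set, the clustered interval is still more honest than the naive one, but it deserves the note described above.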
How the Anthropic paper sets the practice
The Error Bars paper makes three recommendations the rest of this site treats as defaults 1:
- Use the clustered estimator for any benchmark with topic structure. The cost is one extra column in the dataset (a cluster ID) and one helper function in the eval harness.
- Report the design effect alongside the score. A DEFF of 3 is a flag that the effective sample size is a third of the nominal one and the comparison is harder than the row count suggests.
- When sample-sizing a new evaluation, plan with the expected DEFF. If you want a half-width of 3 percentage points at 70% pass rate and you expect a DEFF of 2.5, you need 2.5x the items you would need under i.i.d. assumptions.
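The sample-size recommendation can be sketched as follows. This uses the normal-approximation formula n = z²·p(1−p)/h² as a planning shortcut (an assumption on our part; a Wilson-based calculation would give slightly different numbers), and `required_items` is a hypothetical helper name.

```python
import math


def required_items(p: float, half_width: float, deff: float = 1.0, z: float = 1.96) -> int:
    """Items needed for a target CI half-width, inflated by the design effect."""
    n_iid = z**2 * p * (1 - p) / half_width**2
    return math.ceil(n_iid * deff)


n_iid = required_items(0.70, 0.03)                  # i.i.d. assumption
n_clustered = required_items(0.70, 0.03, deff=2.5)  # expected DEFF of 2.5

print(n_iid, n_clustered)
# 897 2241
```

The clustered requirement is 2.5× the i.i.d. one (up to rounding), which is the point of planning with the expected DEFF up front.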
What to put in the report
A clustered eval result should display:
- The point estimate.
- The clustered 95% interval.
- The naive 95% interval (so the reader can see the gap).
- The number of items, the number of clusters, and the average cluster size.
- The design effect and the effective sample size.
- The clustering axis (cluster by user_id, cluster by document_id, etc.).
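One way to render that checklist is a small formatter over the dictionary returned by `cluster_robust_se`. The sketch below is illustrative: `format_clustered_report` is a hypothetical helper, and the example values are copied from the approximate output shown earlier, not from a fresh run.

```python
def format_clustered_report(result: dict, cluster_axis: str) -> str:
    """Render the report fields listed above as plain text."""
    avg_size = result["n"] / result["n_clusters"]
    c_lo, c_hi = result["clustered_ci_95"]
    n_lo, n_hi = result["naive_ci_95"]
    lines = [
        f"point estimate:    {result['mean']:.3f}",
        f"clustered 95% CI:  ({c_lo:.3f}, {c_hi:.3f})",
        f"naive 95% CI:      ({n_lo:.3f}, {n_hi:.3f})",
        f"items / clusters:  {result['n']} / {result['n_clusters']} (avg size {avg_size:.1f})",
        f"design effect:     {result['design_effect']:.2f}",
        f"effective n:       {result['effective_n']:.0f}",
        f"clustering axis:   cluster by {cluster_axis}",
    ]
    return "\n".join(lines)


# Example values taken from the approximate output earlier in this article.
example = {
    "n": 2000, "n_clusters": 200, "mean": 0.702,
    "naive_se": 0.0102, "clustered_se": 0.0188,
    "design_effect": 3.40, "effective_n": 588.3,
    "naive_ci_95": (0.682, 0.722), "clustered_ci_95": (0.666, 0.739),
}
report = format_clustered_report(example, cluster_axis="user_id")
print(report)
```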
The Chatbot Arena papers do something analogous with their bootstrap intervals on Elo: every leaderboard entry has a 95% bootstrap interval, and the documentation is explicit about how the resampling respects vote structure 4. The point is the same: the interval should reflect the dependency structure of the data, not a textbook assumption that does not apply.
Cross-references
The confidence intervals page covers Wilson and Clopper-Pearson for the i.i.d. case. The bootstrap page covers the bootstrap mechanics in more depth. The statistical power page covers how to plan a sample size when you expect clustering. The multiple comparisons page covers the second correction every eval program needs.