Pairwise comparison asks a rater (human or LLM judge) to pick between two responses to the same query. Aggregated across thousands of pairs, those preferences are fit with a Bradley-Terry or Elo model to produce a ranked leaderboard with confidence intervals. The pattern powers Chatbot Arena, Arena-Hard-Auto, and AlpacaEval 2.
When pairwise beats absolute scoring
Two cases push toward pairwise. First, humans cannot calibrate a scalar on subjective quality: asking five raters for a 1-to-10 score produces five distributions with no shared zero point, while asking which of two is better produces a label every rater reads the same way [1]. Second, ranking many systems against each other scales better: with n systems, pairwise needs on the order of n log n battles under smart sampling, while absolute scoring needs a full labeling pass over the eval set for each system.
Pairwise is the wrong tool when a clean binary criterion exists. Direct binary is one judge call per trace, produces a confusion matrix, and forces a falsifiable cutoff. Use pairwise when the rubric is "which feels better"; use direct when the rubric is "is this correct."
Bradley-Terry and Elo
Bradley-Terry models the probability that A beats B as a logistic function of a latent skill difference: $P(A > B) = \sigma(\theta_A - \theta_B)$. Fit it by maximum likelihood on the battle log. Elo is the online update form of the same model; each battle nudges both ratings by a step proportional to the surprise, that is, the actual outcome minus the predicted win probability. Chatbot Arena reports BT max-likelihood scores on snapshots and presents them as "Elo" by convention, not as a true online Elo [2].
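To make the online form concrete, here is a minimal sketch of a single Elo update. The 400-point base-10 scale and the step size `k=32` are conventional choices assumed for illustration, not values taken from any of the leaderboards above, and the function names are hypothetical.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Predicted probability that A beats B on the conventional 400-point Elo scale."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated (r_a, r_b) after one battle; k is an assumed step size."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)  # large when the result was unexpected, small otherwise
    return r_a + delta, r_b - delta


# An even match moves each rating by k/2; an upset moves them further.
print(elo_update(1000.0, 1000.0, a_won=True))
print(elo_update(1000.0, 1200.0, a_won=True))
```

Because each update depends on the order in which battles arrive, a batch maximum-likelihood fit is usually the steadier choice for a snapshot leaderboard.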
The arena methodology changed once at scale: the team moved from informed pairing to uniform random sampling because informed pairing introduced selection bias that distorted the fit [3]. The generalization: if a private eval program adaptively picks pairs, the BT fit needs to model the sampling design or the ranking drifts.
Confidence intervals on a ranking
A leaderboard without intervals pretends every neighbouring pair is distinguishable. They almost never are. Chatbot Arena reports 95% bootstrap intervals on every model's score; the interval width at the top is typically twenty to thirty Elo points [2]. Differences smaller than the combined interval are noise.
```python
import numpy as np
from scipy.special import expit
from scipy.optimize import minimize


def fit_bt(battles, n_models):
    """Fit Bradley-Terry scores by maximum likelihood; battles is a list of (winner, loser) indices."""
    def nll(theta):
        # Pin the last model's score at 0 so the fit is identifiable.
        full = np.concatenate([theta, [0.0]])
        return -sum(np.log(expit(full[w] - full[l]) + 1e-12) for w, l in battles)

    out = minimize(nll, x0=np.zeros(n_models - 1), method="L-BFGS-B")
    return np.concatenate([out.x, [0.0]])


def bootstrap_bt(battles, n_models, n_boot=1000):
    """Resample battles with replacement; return mean score and 95% percentile bounds per model."""
    n = len(battles)
    samples = np.empty((n_boot, n_models))
    for b in range(n_boot):
        idx = np.random.randint(0, n, size=n)
        samples[b] = fit_bt([battles[i] for i in idx], n_models)
    return samples.mean(axis=0), np.percentile(samples, 2.5, axis=0), np.percentile(samples, 97.5, axis=0)
```
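One way to decide whether two neighbouring models are actually distinguishable is to bootstrap the score difference directly rather than eyeballing two separate intervals. The sketch below assumes you keep the raw bootstrap matrix (the `samples` array built inside `bootstrap_bt`) instead of only its percentiles; the helper name `separated` and the synthetic demo data are illustrative.

```python
import numpy as np


def separated(samples: np.ndarray, i: int, j: int, alpha: float = 0.05) -> bool:
    """True if the bootstrap interval for score_i - score_j excludes zero."""
    diff = samples[:, i] - samples[:, j]
    lo, hi = np.percentile(diff, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return bool(lo > 0 or hi < 0)


# Synthetic stand-in for a (n_boot, n_models) matrix of bootstrap BT scores.
rng = np.random.default_rng(0)
demo = np.column_stack([
    rng.normal(1.00, 0.05, 2000),  # model 0
    rng.normal(0.98, 0.05, 2000),  # model 1: close to model 0
    rng.normal(0.50, 0.05, 2000),  # model 2: clearly weaker
])
print(separated(demo, 0, 1), separated(demo, 0, 2))
```

Taking the difference within each resample respects the correlation between the two scores, so it can be somewhat sharper than comparing two marginal intervals.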
Position bias is non-negotiable
Pairwise judges (LLM or human) favor one slot over the other. Zheng et al. measured GPT-3.5 at roughly fifty percent first-position rate and Claude v1 at roughly seventy percent on identical content [4]. The standard mitigation: score every pair twice with the order swapped. If both orderings agree, record one win for that response. If they disagree, record a tie. The cost is two judge calls per battle, and the residual position effect is bounded by the disagreement rate.
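The swap-and-reconcile rule can be sketched in a few lines. The judge signature here, a callable that takes the query plus two responses and returns `"first"` or `"second"`, is a hypothetical interface chosen for illustration, and the two toy judges exist only to show the behaviour.

```python
from typing import Callable

# Hypothetical judge signature: (query, first_response, second_response) -> "first" or "second".
Judge = Callable[[str, str, str], str]


def swapped_verdict(judge: Judge, query: str, resp_a: str, resp_b: str) -> str:
    """Judge the pair in both orders; return "A", "B", or "tie"."""
    first_pass = judge(query, resp_a, resp_b)   # A shown in slot one
    second_pass = judge(query, resp_b, resp_a)  # B shown in slot one
    a_wins_first = first_pass == "first"
    a_wins_second = second_pass == "second"
    if a_wins_first and a_wins_second:
        return "A"
    if not a_wins_first and not a_wins_second:
        return "B"
    return "tie"  # the verdict flipped with the order


def always_first(query: str, x: str, y: str) -> str:
    """Toy judge with total position bias, for illustration only."""
    return "first"


def prefers_longer(query: str, x: str, y: str) -> str:
    """Toy judge that ignores position and picks the longer response."""
    return "first" if len(x) >= len(y) else "second"


print(swapped_verdict(always_first, "q", "short", "a longer answer"))    # tie
print(swapped_verdict(prefers_longer, "q", "short", "a longer answer"))  # B
```

A fully position-biased judge collapses to ties under this rule, which is the intended outcome: it contributes no signal instead of contributing a false one.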
Length-controlled variants
Verbosity bias stacks on position bias. AlpacaEval introduced length-controlled scoring (LC-AlpacaEval); Dubois et al. report that correlation with Chatbot Arena rose from 0.94 to 0.98 after the adjustment and that spurious rank shifts driven by output length disappeared [5]. If a pairwise program reports raw win rates without controlling for length, the leaderboard is partly measuring which model writes longer.
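The idea behind length control can be illustrated with a deliberately simplified regression: model the win probability as a logistic function of an intercept plus a length-gap term, then report the predicted win rate with the gap set to zero. This is a sketch in the spirit of the adjustment, not the AlpacaEval implementation; the function name, the `tanh` squashing, and the `scale` parameter are assumptions made for this example.

```python
import numpy as np
from scipy.optimize import minimize


def length_controlled_win_rate(wins, len_model, len_baseline, scale=100.0):
    """Fit logit P(model wins) = a + b * tanh(length_gap / scale); report sigmoid(a)."""
    y = np.asarray(wins, dtype=float)
    gap = np.asarray(len_model, dtype=float) - np.asarray(len_baseline, dtype=float)
    x = np.tanh(gap / scale)  # squashed so extreme lengths do not dominate

    def nll(params):
        a, b = params
        z = a + b * x
        # Numerically stable Bernoulli negative log-likelihood for logit z.
        return np.sum(np.logaddexp(0.0, z) - y * z)

    a_hat, _ = minimize(nll, x0=np.zeros(2), method="BFGS").x
    return 1.0 / (1.0 + np.exp(-a_hat))  # predicted win rate at zero length gap


# Toy data: the model wins mostly when it is longer than the baseline.
wins = [1] * 60 + [0] * 20 + [1] * 5 + [0] * 15
len_model = [300] * 80 + [100] * 20
len_baseline = [200] * 100
raw = float(np.mean(wins))
lc = float(length_controlled_win_rate(wins, len_model, len_baseline))
print(raw, round(lc, 3))
```

In this toy data the raw win rate is well above one half, but the length-controlled figure lands near one half because the wins are explained by the length gap rather than by the model.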
| Method | Position control | Verbosity control |
|---|---|---|
| MT-Bench | swap pairs | none |
| Chatbot Arena | uniform sampling, randomized slot | none |
| Arena-Hard-Auto | swap pairs | none |
| AlpacaEval 2 LC | swap pairs | length-control regression |
How many battles
The sample size for a Bradley-Terry leaderboard depends on the smallest skill gap you want to separate. Rough order of magnitude: one thousand battles per pair you want to resolve at 95% confidence on a moderate gap. Chatbot Arena resolves the top tier with several hundred thousand votes; a private eval comparing four prompt versions usually gets a clean ranking with a few thousand battles total [2]. Arena-Hard-Auto automates the pattern over 500 prompts and reports rank correlations against the human Arena above 0.95 [6].
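For a back-of-envelope check on that order of magnitude, a standard normal-approximation power calculation for a single pair gives a usable estimate. The sketch below assumes a two-sided test of the win rate against 0.5, ignores ties, and uses 80% power; those choices and the function names are assumptions for illustration, and the formula is undefined at a win rate of exactly 0.5.

```python
import math
from scipy.stats import norm


def elo_gap_to_win_prob(gap: float) -> float:
    """Win probability implied by a rating gap on the 400-point Elo scale."""
    return 1.0 / (1.0 + 10 ** (-gap / 400.0))


def battles_needed(p: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Battles between one pair to detect a true win rate p against 0.5 (two-sided)."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    num = (z_a * 0.5 + z_b * math.sqrt(p * (1 - p))) ** 2
    return math.ceil(num / (p - 0.5) ** 2)


# Smaller gaps need many more battles: the count grows with 1 / (p - 0.5)^2.
for gap in (20, 35, 70):
    p = elo_gap_to_win_prob(gap)
    print(gap, round(p, 3), battles_needed(p))
```

A 55% win rate, roughly a 35-point Elo gap, comes out at several hundred battles for the pair under these assumptions, which is consistent with the rough thousand-battle figure once ties and swapped-order disagreements eat into the usable sample.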
What good looks like
A pairwise program ships position swaps on every battle, with each swapped pair treated as one observation; length control reported alongside raw win rate; bootstrap 95% intervals on every model's score; and a fixed reference model kept across runs as a drift sentinel. The pairwise versus direct chapter covers the operational picks; statistical methods covers the bootstrap.