Pairwise comparison and Elo

Bradley-Terry, Elo, and Chatbot Arena methodology. When pairwise beats absolute scoring, how to control position bias, how to read a leaderboard with intervals.

Pairwise comparison asks a rater (human or LLM judge) to pick between two responses to the same query. Aggregated across thousands of pairs, those preferences fit a Bradley-Terry or Elo model and produce a ranked leaderboard with confidence intervals. The pattern powers Chatbot Arena, Arena-Hard-Auto, and AlpacaEval 2.

When pairwise beats absolute scoring

Two cases push toward pairwise. First, humans cannot calibrate a scalar on subjective quality: asking five raters for a 1-to-10 score produces five distributions with no shared zero point, while asking which of two responses is better produces a label every rater reads the same way ¹. Second, pairwise scales better when many systems must be ranked against each other without re-labeling each one: with smart sampling, ranking n systems takes O(n log n) comparisons, while absolute scoring needs a full labeling pass over the eval set for every system.

Pairwise is the wrong tool when a clean binary criterion exists. Direct binary scoring takes one judge call per trace, yields a confusion matrix, and forces a falsifiable cutoff. Use pairwise when the rubric is "which feels better"; use direct when the rubric is "is this correct."

Bradley-Terry and Elo

Bradley-Terry models the probability that A beats B as a logistic function of a latent skill difference: $P(A>B) = \sigma(\theta_A - \theta_B)$. The skills are fit by maximum likelihood on the battle log. Elo is the online update form of the same model: each battle nudges both ratings by a step proportional to the gap between the actual outcome and the predicted win probability, so an upset moves ratings more than an expected result. Chatbot Arena reports BT maximum-likelihood scores on snapshots and presents them as "Elo" by convention, not as a true online Elo ².
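A minimal sketch of the online Elo update, assuming the conventional 400-point base-10 scale and a K-factor of 32 (both are chess conventions chosen for illustration, not values taken from the Arena paper). The function names are hypothetical.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Predicted probability that A beats B on the 400-point logistic scale."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Return updated (r_a, r_b) after one battle.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a tie.
    k is the step size; 32 is an assumed default.
    """
    # The step is k times the surprise: actual outcome minus predicted outcome.
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


print(elo_update(1000.0, 1000.0, 1.0))  # evenly matched: A gains k/2
print(elo_update(1000.0, 1200.0, 1.0))  # underdog wins: A gains more than k/2
```

Because the result depends on the order in which battles arrive, replaying the same log in a different order gives different ratings. A batch Bradley-Terry fit does not have that property, which is one reason to prefer it for a snapshot leaderboard.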

The arena methodology changed once at scale: the team moved from informed pairing to uniform random sampling because informed pairing introduced selection bias that distorted the fit ³. The generalization: if a private eval program adaptively picks pairs, the BT fit needs to model the sampling design or the ranking drifts.

Confidence intervals on a ranking

A leaderboard without intervals pretends every neighbouring pair is distinguishable. They almost never are. Chatbot Arena reports 95% bootstrap intervals on every model's score; the interval width at the top is typically twenty to thirty Elo points ². A gap between two models that is smaller than their combined intervals cannot be distinguished from noise.

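import numpy as np
from scipy.special import expit
from scipy.optimize import minimize

def fit_bt(battles, n_models):
    # battles: sequence of (winner_index, loser_index) pairs; ties are not modeled.
    w, l = np.asarray(battles, dtype=int).T
    def nll(theta):
        # Pin the last model at 0 so the scores are identifiable.
        full = np.concatenate([theta, [0.0]])
        return -np.sum(np.log(expit(full[w] - full[l]) + 1e-12))
    out = minimize(nll, x0=np.zeros(n_models - 1), method="L-BFGS-B")
    # Scores are in natural-logit units; multiply by 400 / ln(10) for an Elo-style scale.
    return np.concatenate([out.x, [0.0]])

def bootstrap_bt(battles, n_models, n_boot=1000, seed=None):
    # Resample battles with replacement and refit each time.
    # Returns (mean, 2.5th percentile, 97.5th percentile) per model.
    rng = np.random.default_rng(seed)
    battles = np.asarray(battles, dtype=int)
    n = len(battles)
    samples = np.empty((n_boot, n_models))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        samples[b] = fit_bt(battles[idx], n_models)
    lo, hi = np.percentile(samples, [2.5, 97.5], axis=0)
    return samples.mean(axis=0), lo, hi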
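Comparing two per-model intervals by eye is conservative and depends on which model is pinned at zero. One alternative, sketched here under the assumption that the per-resample score matrix is kept rather than only its summary, is to take the interval of the score difference within each bootstrap draw. The function name `separable` and the synthetic matrix are illustrative, not part of any Arena tooling.

```python
import numpy as np

# Converts natural-logit BT scores to the 400-point Elo-style scale.
ELO_PER_LOGIT = 400.0 / np.log(10.0)


def separable(samples: np.ndarray, i: int, j: int, level: float = 0.95) -> tuple[bool, float, float]:
    """Bootstrap interval on the score difference between models i and j.

    samples: array of shape (n_boot, n_models) holding BT scores in logits,
             one row per bootstrap refit.
    Returns (is_separable, lower, upper) with bounds in Elo points.
    """
    diff = (samples[:, i] - samples[:, j]) * ELO_PER_LOGIT
    tail = 100.0 * (1.0 - level) / 2.0
    lower, upper = np.percentile(diff, [tail, 100.0 - tail])
    # The pair is distinguishable only if the interval excludes zero.
    return bool(lower > 0.0 or upper < 0.0), float(lower), float(upper)


# Synthetic stand-in for bootstrap output: three models, the first two nearly tied.
rng = np.random.default_rng(0)
fake = rng.normal(loc=[0.05, 0.03, -0.40], scale=0.04, size=(2000, 3))
print(separable(fake, 0, 1))  # near-tied pair: interval straddles zero
print(separable(fake, 0, 2))  # clearly separated pair: interval excludes zero
```

Differences between models are unaffected by the choice of reference model, so this check gives the same answer whichever model is pinned.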

Position bias is non-negotiable

Pairwise judges (LLM or human) favor one slot over the other. Zheng et al. showed judges the same pair in both orders and counted how often the verdict followed the first slot rather than the content: roughly fifty percent of cases for GPT-3.5 and roughly seventy percent for Claude v1 ⁴. The standard mitigation is to score every pair twice with the order swapped. If both orderings agree, record one win for that response; if they disagree, record a tie. The cost is two judge calls per battle, and the residual position effect is bounded by the disagreement rate.
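The swap-and-agree rule can be sketched as follows. The `Judge` signature is a hypothetical wrapper (query, first response, second response, returning "first" or "second"); a real judge client would need adapting to it, and the two toy judges exist only to exercise the logic.

```python
from typing import Callable

# Hypothetical judge signature: (query, first_response, second_response) -> "first" | "second"
Judge = Callable[[str, str, str], str]


def swapped_verdict(judge: Judge, query: str, resp_a: str, resp_b: str) -> str:
    """Judge the pair in both orders; return "A", "B", or "tie"."""
    forward = judge(query, resp_a, resp_b)   # A shown first
    backward = judge(query, resp_b, resp_a)  # B shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    return "tie"  # verdict flipped with the order, so treat it as position noise


# Toy judges for illustration only.
def always_first(query: str, first: str, second: str) -> str:
    return "first"


def prefers_longer(query: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"


print(swapped_verdict(always_first, "q", "short", "a longer answer"))    # tie
print(swapped_verdict(prefers_longer, "q", "short", "a longer answer"))  # B
```

A purely position-driven judge collapses to all ties, which is the desired failure mode: the tie rate is a running measure of how much position bias remains. The second toy judge shows what the swap does not catch: a content-dependent bias such as verbosity survives it.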

Length-controlled variants

Verbosity bias stacks on position bias. AlpacaEval introduced length-controlled scoring (LC-AlpacaEval); Dubois et al. report that Spearman correlation with Chatbot Arena rose from 0.94 to 0.98 after the adjustment and that spurious rank shifts driven by output length disappeared ⁵. If a pairwise program reports raw win rates without controlling for length, the leaderboard is partly measuring which model writes longer.
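The idea behind length control can be illustrated with a much-simplified regression: fit a logistic model of the win label on a quality intercept plus a length-gap term, then read off the predicted win rate at zero length gap. This is a sketch in the spirit of the length-controlled approach, not the AlpacaEval implementation; the function name, the tanh squashing, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit


def length_controlled_win_rate(wins: np.ndarray, len_diff: np.ndarray) -> tuple[float, float]:
    """Return (raw_win_rate, length_controlled_win_rate) for a model vs. a baseline.

    wins:     1.0 where the model beat the baseline, 0.0 otherwise.
    len_diff: model length minus baseline length, per battle (any unit).
    """
    # Standardize and squash the length gap so extreme outliers do not dominate.
    x = np.tanh(len_diff / (np.std(len_diff) + 1e-12))

    def nll(params: np.ndarray) -> float:
        quality, length_coef = params
        z = quality + length_coef * x
        # Negative Bernoulli log-likelihood, written with logaddexp for stability.
        return float(np.sum(np.logaddexp(0.0, z) - wins * z))

    fit = minimize(nll, x0=np.zeros(2), method="L-BFGS-B")
    quality, _ = fit.x
    # Counterfactual: predicted win rate if both outputs had equal length (x = 0).
    return float(wins.mean()), float(expit(quality))


# Synthetic data: equal true quality, but the model is usually longer
# and the simulated judge rewards length.
rng = np.random.default_rng(0)
len_diff = rng.normal(loc=200.0, scale=150.0, size=5000)
true_logit = 0.0 + 1.5 * np.tanh(len_diff / 150.0)
wins = (rng.random(5000) < expit(true_logit)).astype(float)

raw, lc = length_controlled_win_rate(wins, len_diff)
print(f"raw={raw:.2f}  length-controlled={lc:.2f}")
```

On this synthetic data the raw win rate sits well above one half even though the two systems are equally good by construction, while the length-controlled estimate returns close to one half. Reporting both numbers side by side makes the size of the verbosity effect visible.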

| Method | Position control | Verbosity control |
| --- | --- | --- |
| MT-Bench | swap pairs | none |
| Chatbot Arena | uniform sampling, randomized slot | none |
| Arena-Hard-Auto | swap pairs | none |
| AlpacaEval 2 LC | swap pairs | length-control regression |

How many battles

The sample size for a Bradley-Terry leaderboard depends on the smallest skill gap you want to separate. As a rough order of magnitude, budget one thousand battles per pair you want to resolve at 95% confidence on a moderate gap. Chatbot Arena resolves the top tier with several hundred thousand votes; a private eval comparing four prompt versions usually gets a clean ranking with a few thousand battles in total ². Arena-Hard-Auto automates the pattern over 500 prompts and reports rank correlations against the human Arena above 0.95 ⁶.
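A back-of-the-envelope check on the thousand-battle figure, assuming a single head-to-head pair, no ties, the 400-point Elo scale, and a normal approximation to the binomial. It asks only when the 95% interval on the observed win rate would exclude one half at the expected win rate, and it ignores statistical power, so treat the result as a lower bound. The function names are hypothetical.

```python
import math


def win_prob_from_elo_gap(gap: float) -> float:
    """Win probability implied by an Elo gap on the 400-point scale."""
    return 1.0 / (1.0 + 10.0 ** (-gap / 400.0))


def battles_needed(gap: float, z: float = 1.96) -> int:
    """Approximate head-to-head battles needed for the z-level interval
    on the observed win rate to exclude 0.5 (normal approximation, no ties)."""
    if gap == 0:
        raise ValueError("a zero gap can never be resolved")
    p = win_prob_from_elo_gap(gap)
    return math.ceil(z * z * p * (1.0 - p) / (p - 0.5) ** 2)


for gap in (10, 20, 50, 100):
    print(gap, battles_needed(gap))
```

Under these assumptions a gap of about twenty Elo points lands on the order of a thousand battles. The required count grows roughly with the inverse square of the gap, so halving the gap about quadruples the budget.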

What good looks like

A good pairwise program runs every battle in both orders and counts the swapped pair as one observation, reports a length-controlled win rate alongside the raw win rate, puts bootstrap 95% intervals on every model's score, and keeps a fixed reference model across runs as a drift sentinel. The pairwise versus direct chapter covers the operational picks; statistical methods covers the bootstrap.

¹ Eugene Yan, "LLM Evaluators."
² Chiang et al., Chatbot Arena, ICML 2024.
³ LMSYS, Chatbot Arena blog, 2023.
⁴ Zheng et al., MT-Bench / Chatbot Arena, NeurIPS 2023 D&B.
⁵ Dubois et al., Length-Controlled AlpacaEval, 2024.
⁶ Li et al., Arena-Hard-Auto, 2024.