Pairwise vs direct scoring

When a side-by-side ranking beats a binary verdict, the swap discipline that makes pairwise honest, and the cost table that settles the choice at ten candidates.

A judge can rank two candidate outputs against each other (pairwise) or score one output against a rubric (direct). The two modes are not interchangeable, they fail differently, and they cost differently as the candidate count grows. Pick by the shape of the criterion and the number of systems you compare. Picking wrong explains a lot of judges that look noisy in practice ¹.

The rule

Eugene Yan's heuristic is the one to memorize: pairwise for subjective criteria, direct binary for objective ones ¹. Subjective criteria (tone, helpfulness, creativity, preference) are easier to compare than to anchor; an annotator can say "A is better than B" while struggling to defend "A is a 7 out of 10". Objective criteria (factual correctness, schema validity, presence of a required disclaimer) are easier to anchor than to compare.

Violating the rule produces two characteristic failures:

Pairwise on an objective task. The comparison feels easy, but it cannot catch absolute regressions. Both candidates can fail the requirement while one of them still "wins".
Direct on a subjective task. The rubric drifts as the grader grades, scores cluster in the middle of the scale, and the number stops meaning anything. This is the same pathology that sinks Likert rubrics; the introduction covers why binary plus critique is the default.
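
The first failure is easy to reproduce with a toy setup. The sketch below is illustrative only: the disclaimer string, `direct_check`, and the length-preferring `toy_pairwise` judge are all hypothetical stand-ins, not anything from a real evaluation stack. Both candidates fail the objective requirement, yet the pairwise judge still names a winner.

```python
REQUIRED = "this is not financial advice"  # hypothetical required disclaimer


def direct_check(reply: str) -> bool:
    """Direct binary verdict: does the reply contain the required disclaimer?"""
    return REQUIRED in reply.lower()


def toy_pairwise(a: str, b: str) -> str:
    """Toy stand-in for a pairwise judge: prefers the reply with more words."""
    return "A" if len(a.split()) >= len(b.split()) else "B"


candidate_a = "Index funds are a reasonable default for most long-term savers."
candidate_b = "Buy index funds."

# The pairwise judge still produces a winner...
pairwise_winner = toy_pairwise(candidate_a, candidate_b)
# ...while the direct check shows that neither candidate clears the bar.
direct_results = {"A": direct_check(candidate_a), "B": direct_check(candidate_b)}
```

A release gate that only saw `pairwise_winner` would have no way to learn that both replies are non-compliant.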

When pairwise wins

Three properties make pairwise the right pick.

Systems are close in quality but distinguishable side by side. Direct scoring pushes near-equal responses toward the same label; pairwise forces a decision. Chatbot Arena runs this at scale: crowd users send one prompt to two anonymized models and vote, and Chiang et al. validated the resulting rankings on over 240,000 human votes ².

The criterion is preference-shaped rather than rule-shaped. "Which reply would a user rather receive" is a comparison question. "Did the reply include the policy disclaimer" is a rule question. Pairwise on rule questions wastes information; direct on preference questions invents a scale nobody can calibrate.

Many systems, one labeling budget. Pairwise outcomes feed an aggregation model (below) that ranks any number of systems from a sparse comparison graph. AlpacaEval and Arena-Hard automate this by judging every candidate against one fixed baseline model, which keeps cost linear in the candidate count ³ ⁴.

When direct wins

Direct binary wins on absolute quality bars. A factual QA bot must answer correctly; "better than last week" is not the release criterion. Direct gives you a pass rate per run, a confusion matrix against human labels, and a number a release gate can consume. Pairwise gives you a relative ordering that is silent about whether any candidate clears the bar.

Direct also wins on regression evals. "Did this prompt change break anything" is a per-trace binary question, and the calibration discipline (TPR and TNR against a human gold set) only works on direct labels.
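
A minimal sketch of that calibration check, assuming both the human gold set and the judge emit booleans where `True` means "pass" and that "pass" is treated as the positive class. The function name and the choice of positive class are assumptions made for illustration.

```python
import math


def tpr_tnr(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """Return (TPR, TNR) of judge labels measured against human gold labels."""
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    tp = sum(h and j for h, j in zip(human, judge))          # both say pass
    fn = sum(h and not j for h, j in zip(human, judge))      # judge misses a pass
    tn = sum(not h and not j for h, j in zip(human, judge))  # both say fail
    fp = sum(not h and j for h, j in zip(human, judge))      # judge misses a fail
    tpr = tp / (tp + fn) if (tp + fn) else math.nan
    tnr = tn / (tn + fp) if (tn + fp) else math.nan
    return tpr, tnr


human_gold = [True, True, True, False, False]
judge_out = [True, True, False, False, True]
tpr, tnr = tpr_tnr(human_gold, judge_out)  # 2 of 3 passes caught, 1 of 2 fails caught
```

Reporting the two rates separately matters because a judge that passes everything has a perfect TPR and a TNR of zero.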

Direct is also the cheaper instrument in the everyday case: one judge call per trace, no position swap, and the output aggregates by counting instead of model fitting.

Position bias and the swap

Pairwise judging carries position bias: the judge favors a response because of where it sits, not what it says. Zheng et al. measured it on MT-Bench: GPT-3.5 picks the first response about half the time on equivalent content, while Claude v1 picks the first position 75 percent of the time ⁵. The mitigation is mechanical. Present every pair twice, once in each order, and count a win only when the judge agrees with itself across the swap.

from dataclasses import dataclass

@dataclass
class PairwiseResult:
    winner: str  # "A", "B", or "tie"

def judge_pair(a: str, b: str, judge) -> PairwiseResult:
    """Run a pairwise judge in both orders. A clean winner agrees across the swap.

    `judge(first, second)` returns a positional label: "A" for the response
    shown first, "B" for the response shown second.
    """
    first = judge(a, b)   # "A" here means `a` won
    second = judge(b, a)  # positions swapped, so "B" here means `a` won
    if first == "A" and second == "B":
        return PairwiseResult("A")  # `a` wins regardless of position
    if first == "B" and second == "A":
        return PairwiseResult("B")  # `b` wins regardless of position
    return PairwiseResult("tie")    # verdict flipped with the swap (or was not a clean pick)

The swap is not optional. A pairwise number computed without it is part preference signal and part position artifact, and nothing downstream can separate the two. The biases page has the fifty-trace test for measuring how hard position bias is biting your specific judge.
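
One way to see how much of a pairwise number is position artifact is to run a batch of pairs in both orders and count. The sketch below is a hedged illustration, not the test described on the biases page: `position_bias_report` and its output keys are hypothetical names, and it assumes the judge returns the positional labels "A" (first shown) or "B" (second shown).

```python
from typing import Callable, Sequence

Judge = Callable[[str, str], str]  # returns "A" (first shown) or "B" (second shown)


def position_bias_report(
    pairs: Sequence[tuple[str, str]], judge: Judge
) -> dict[str, float]:
    """Run every pair in both orders and summarize how position-driven the judge is."""
    if not pairs:
        raise ValueError("need at least one pair")
    consistent = 0   # pairs where the same response wins in both orders
    first_picks = 0  # calls where the judge chose whatever was shown first
    for a, b in pairs:
        first = judge(a, b)
        second = judge(b, a)
        if (first, second) in {("A", "B"), ("B", "A")}:
            consistent += 1
        first_picks += (first == "A") + (second == "A")
    n = len(pairs)
    return {
        "consistency": consistent / n,
        "first_position_rate": first_picks / (2 * n),
    }


def always_first(a: str, b: str) -> str:
    """Toy judge that is pure position bias."""
    return "A"


def prefers_longer(a: str, b: str) -> str:
    """Toy judge that ignores position entirely."""
    return "A" if len(a) > len(b) else "B"


pairs = [("short", "a much longer reply"), ("medium reply", "hi")]
biased = position_bias_report(pairs, always_first)
unbiased = position_bias_report(pairs, prefers_longer)
```

A position-free judge lands at a first-position rate of 0.5 with full consistency; the further a real judge sits from that, the more ties the swap will produce.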

From votes to a ranking: Bradley-Terry and Elo

A pile of pairwise outcomes is not a metric. You aggregate it into per-model strengths with the Bradley-Terry model: each model gets a latent strength, and the probability that model i beats model j is i's strength divided by the sum of both. Fit by maximum likelihood, rank by strength. Raw win rate is the wrong aggregate because it rewards an easy schedule; Bradley-Terry corrects for who actually played whom.
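
Written out, with the notation chosen here for illustration ($p_i$ for the strength of model $i$, $w_{ij}$ for the number of times $i$ beat $j$), the model, its log-likelihood, and a standard fixed-point (minorization-maximization) update for fitting it are:

```latex
P(i \text{ beats } j) = \frac{p_i}{p_i + p_j}
\qquad
\ell(p) = \sum_{i \neq j} w_{ij} \log \frac{p_i}{p_i + p_j}
\qquad
p_i \leftarrow \frac{\sum_{j \neq i} w_{ij}}{\sum_{j \neq i} \dfrac{w_{ij} + w_{ji}}{p_i + p_j}}
```

For example, strengths of 0.6 and 0.2 give the stronger model a win probability of 0.6 / 0.8 = 0.75. The strengths are only identified up to a common scale, so fitters typically renormalize them after each sweep.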

import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """Fit Bradley-Terry strengths from a win matrix.

    wins[i][j] = times model i beat model j (count ties as 0.5 each).
    Returns strengths summing to 1; rank by descending strength.
    """
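    n = wins.shape[0]
    p = np.ones(n) / n  # start from uniform strengths
    for _ in range(iters):
        for i in range(n):
            total_wins = wins[i].sum()  # assumes a zero diagonal (no self-play)
            # Update denominator: games played against j, weighted by 1 / (p_i + p_j)
            denom = sum(
                (wins[i, j] + wins[j, i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            if denom > 0:  # leave models with no recorded games unchanged
                p[i] = total_wins / denom
        p /= p.sum()  # renormalize so strengths sum to 1
    return p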

Elo is the online cousin: per-game incremental updates, order-dependent, designed for human players whose strength drifts. Model checkpoints do not drift, and you hold the full game history, so the static Bradley-Terry fit is the better estimator. Chatbot Arena launched on online Elo ⁶ and moved to a Bradley-Terry maximum-likelihood fit for exactly this reason; the bootstrap confidence intervals on BT coefficients capture the real variance where the online Elo intervals ran too wide ⁷.

Two operational details from the Arena methodology are worth stealing. Report a confidence interval on every strength and refuse to declare a winner when intervals overlap. And if you control which pairs get judged, sample the pair that shrinks the widest interval fastest; the Arena's adaptive rule reached target precision in about 4,400 votes where uniform random sampling needed 6,800 ². The confidence intervals page covers the mechanics.
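
The first detail can be sketched as a percentile bootstrap over the recorded battles. This is a minimal illustration under stated assumptions, not the Arena's implementation: `fit_bt`, `bootstrap_bt`, and `separated` are hypothetical names, battles are assumed to be stored as (winner, loser) index pairs, and the small `prior` pseudo-count is an added assumption that keeps the fit defined when a resample leaves a model with zero wins.

```python
import numpy as np


def fit_bt(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """Fit Bradley-Terry strengths; wins[i, j] = times i beat j, zero diagonal."""
    n = wins.shape[0]
    games = wins + wins.T  # games[i, j] = total games between i and j
    p = np.ones(n) / n
    for _ in range(iters):
        pair_sum = p[:, None] + p[None, :]       # p_i + p_j for every pair
        denom = (games / pair_sum).sum(axis=1)   # diagonal contributes 0
        p = wins.sum(axis=1) / denom
        p /= p.sum()
    return p


def bootstrap_bt(
    battles: np.ndarray, n_models: int, n_boot: int = 200,
    seed: int = 0, prior: float = 0.1,
) -> tuple[np.ndarray, np.ndarray]:
    """Return (lower, upper) 95% percentile intervals on each model's strength.

    battles: int array of shape (n_battles, 2); each row is (winner, loser).
    """
    rng = np.random.default_rng(seed)
    samples = np.empty((n_boot, n_models))
    for b in range(n_boot):
        idx = rng.integers(0, len(battles), size=len(battles))  # resample with replacement
        wins = np.full((n_models, n_models), prior)  # pseudo-count (assumption)
        np.fill_diagonal(wins, 0.0)
        np.add.at(wins, (battles[idx, 0], battles[idx, 1]), 1.0)
        samples[b] = fit_bt(wins)
    lower, upper = np.percentile(samples, [2.5, 97.5], axis=0)
    return lower, upper


def separated(lower: np.ndarray, upper: np.ndarray, i: int, j: int) -> bool:
    """True only when the two intervals do not overlap."""
    return bool(lower[i] > upper[j] or lower[j] > upper[i])
```

With a lopsided record such as 95 wins to 5, `separated` should report a clear winner; with an even 10 to 10 record the intervals overlap and the ranking between the two should be left uncalled.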

The cost table

Direct scoring is linear in candidates. Round-robin pairwise is quadratic, and the position swap doubles it. Judge calls per prompt:

Candidates	Direct	Round-robin pairwise + swap	Fixed-baseline pairwise + swap
2	2	2	4
5	5	20	10
10	10	90	20
n	n	n(n-1)	2n

At two candidates, round-robin pairwise costs the same as direct, which is why head-to-head prompt comparisons feel free. The quadratic bite arrives with the third candidate and never leaves: on a 500-prompt suite with ten candidates, direct is 5,000 judge calls, round-robin is 45,000, and the fixed-baseline pattern (every candidate vs one anchored reference, both orders) is 10,000. That fixed-baseline pattern is the AlpacaEval and Arena-Hard architecture, and it is the only pairwise shape that survives a growing model roster ³ ⁴.
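
The table's formulas are simple enough to keep as a budgeting helper. The function and key names below are hypothetical; the arithmetic is exactly the n, n(n-1), and 2n rows above, multiplied by the number of prompts.

```python
def judge_calls(n_candidates: int, n_prompts: int = 1) -> dict[str, int]:
    """Judge calls needed under each scoring scheme from the cost table."""
    n = n_candidates
    return {
        "direct": n * n_prompts,
        # n(n-1)/2 unordered pairs, each judged in both orders
        "round_robin_swap": n * (n - 1) * n_prompts,
        # every candidate vs one fixed baseline, both orders
        "fixed_baseline_swap": 2 * n * n_prompts,
    }


suite_cost = judge_calls(n_candidates=10, n_prompts=500)
```

Running it before a candidate is added to the roster makes the quadratic growth visible while it is still cheap to change the design.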

Agreement numbers worth knowing

The literature gives you calibration anchors for how much to trust each mode.

GPT-4 as a pairwise judge reaches over 80 percent agreement with human preferences on MT-Bench, the same level at which humans agree with each other ⁵.
Arena crowd voters agree with expert raters between 72.8 and 83.1 percent of the time depending on the model pair; two experts agree with each other between 79.4 and 89.8 percent of the time on the same comparisons ². Crowd pairwise votes are noisier than expert votes, but only by five to ten points.
Controlling for response length raises AlpacaEval's Spearman correlation with Arena rankings from 0.94 to 0.98 ³. Verbosity bias is that large, and that correctable.
Arena-Hard reports 98.6 percent correlation with human preference rankings and three times MT-Bench's separability between models ⁴. A small, hard, well-chosen prompt set beats a large bland one.
The caveat on all of these: Thakur et al. show judges with high percent agreement can still assign substantially different scores, and only the largest judge models reach reasonable human alignment at all ⁸. Report kappa or a confusion matrix alongside raw agreement; the inter-rater agreement page covers which statistic to use.
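
To see why raw agreement alone can mislead, here is a small sketch that computes percent agreement and Cohen's kappa side by side. The function name and the example labels are made up for illustration; kappa is the standard chance-corrected statistic, (observed - expected) / (1 - expected).

```python
import math


def agreement_and_kappa(rater_1: list[str], rater_2: list[str]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists."""
    if len(rater_1) != len(rater_2) or not rater_1:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_1)
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    labels = set(rater_1) | set(rater_2)
    # Agreement expected by chance, from each rater's marginal label frequencies
    expected = sum((rater_1.count(l) / n) * (rater_2.count(l) / n) for l in labels)
    if expected == 1.0:
        return observed, math.nan  # kappa is undefined when both raters use one label
    return observed, (observed - expected) / (1 - expected)


# Both raters pass 9 of 10 traces but disagree on which one fails.
judge_labels = ["pass"] * 8 + ["pass", "fail"]
human_labels = ["pass"] * 8 + ["fail", "pass"]
observed, kappa = agreement_and_kappa(judge_labels, human_labels)
```

In this example the raters agree on 80 percent of traces, yet kappa is slightly negative, because two raters who each pass 90 percent of traces would agree 82 percent of the time by chance alone.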

CITATION

Chiang et al. 2024, "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference," ICML 2024. The canonical pairwise-at-scale methodology: Bradley-Terry ranking over 240K crowd votes, expert-validated vote quality, adaptive pair sampling, intervals on every rank. arxiv.org/abs/2403.04132

The decision rule

Apply in order; stop at the first match.

Write the criterion as a falsifiable sentence. If you can, use direct binary and gate releases on TPR and TNR. Most production criteria end here.
Criterion is preference-shaped and you compare exactly two systems: pairwise with the position swap, ties when the swap disagrees.
Three or more systems, or an ongoing leaderboard: pairwise against a fixed baseline, aggregate with Bradley-Terry, publish intervals, never call overlapping ranks.
The release question is "is it good enough" rather than "which is better": direct, even if you also run pairwise for product taste.
Whatever you picked: no pairwise number without the swap, no ranking without intervals, no judge without calibration against humans.

The hybrid that holds up in production is direct binary on the objective axes as the release gate, plus a small fixed-baseline pairwise suite against the current production system for the open-ended axes. Direct catches rule regressions; pairwise catches preference drift the rubric cannot name.

1. Eugene Yan, "LLM Evaluators."
2. Chiang et al., Chatbot Arena, ICML 2024.
3. Dubois et al., Length-Controlled AlpacaEval, 2024.
4. Li et al., Arena-Hard / BenchBuilder, 2024.
5. Zheng et al., MT-Bench / Chatbot Arena, NeurIPS 2023 D&B.
6. LMSYS, "Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings."
7. LMSYS, "Chatbot Arena: New models & Elo system update" (Bradley-Terry analysis).
8. Thakur et al., "Judging the Judges," 2024.