A judge can either compare two candidate outputs against each other (pairwise) or score one output against a rubric (direct). The two are not interchangeable. Pairwise is better when the criterion is subjective and humans cannot reliably anchor a scalar score; direct is better when the criterion is objective enough that a binary judgment captures it. Choosing the wrong mode accounts for many of the judges that look noisy in practice [1].
Eugene Yan's rule
The shortest heuristic, from Yan: pairwise for subjective, direct binary for objective [1]. Subjective criteria (tone, helpfulness, creativity, preference) are easier to compare than to anchor; an annotator can say "A is better than B" while struggling to say "A is a 7 out of 10". Objective criteria (factual correctness, schema validity, presence of a required disclaimer) are easier to anchor than to compare.
The two failure modes that explain most bad judges are pairwise on an objective task (the comparison feels easy but does not catch absolute regressions, since both candidates can fail the rule) and direct on a subjective task (the rubric drifts as the annotator grades, and scores cluster around the middle of the scale).
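The objective criteria named above map naturally onto binary checks. A minimal sketch, assuming the output is a string and the schema is a flat JSON object; the disclaimer text and both function names are hypothetical:

```python
import json

REQUIRED_DISCLAIMER = "This is not financial advice."  # hypothetical policy text


def has_disclaimer(output: str, disclaimer: str = REQUIRED_DISCLAIMER) -> bool:
    """Rule check: pass only if the disclaimer appears (case-insensitive match)."""
    return disclaimer.lower() in output.lower()


def is_valid_schema(output: str, required_keys: set[str]) -> bool:
    """Rule check: pass only if the output is a JSON object with every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)
```

Each check returns pass or fail with no reference to another candidate, so an absolute regression shows up even when every candidate fails the same rule.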
When pairwise beats direct
Three properties make pairwise the right pick.
The first is that systems are close in absolute quality but distinguishable side by side. Direct scoring tends to place both responses at nearly the same point on the scale; pairwise asks "which is preferred" and forces a decision. Chiang et al.'s Chatbot Arena work scales this to two hundred forty thousand votes and checks the resulting rankings against expert raters [2].
The second is that the criterion is preference-shaped rather than rule-shaped. "Which response would a user prefer" is a comparison question. "Did the response include the policy disclaimer" is a rule question. Pairwise on rule questions wastes information.
The third is that there are many systems to rank and labeling each one separately is impractical. Pairwise plus Bradley-Terry or Elo ranks any number of systems from a sparse comparison graph, provided the graph is connected. AlpacaEval and Arena-Hard automate pairwise comparison against a fixed reference model to produce a leaderboard from a small number of matches [3][4].
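To make the sparse-graph point concrete, here is a sketch of a Bradley-Terry fit using the standard minorization-maximization iteration. It is an illustration, not the code any of the cited leaderboards run, and it assumes ties are dropped, the comparison graph is connected, and every system has at least one win. The function name and input layout are hypothetical:

```python
from collections import defaultdict


def bradley_terry(wins: dict[tuple[str, str], int], iters: int = 200) -> dict[str, float]:
    """Fit Bradley-Terry strengths from win counts.

    wins[(i, j)] is the number of times i beat j. Strengths are normalized to sum to 1.
    """
    systems = sorted({s for pair in wins for s in pair})
    total_wins = defaultdict(int)
    games = defaultdict(int)  # matches played between i and j, stored symmetrically
    for (i, j), n in wins.items():
        total_wins[i] += n
        games[(i, j)] += n
        games[(j, i)] += n

    p = {s: 1.0 / len(systems) for s in systems}
    for _ in range(iters):
        new_p = {}
        for i in systems:
            # Only opponents that i actually played contribute to the update.
            denom = sum(
                games[(i, j)] / (p[i] + p[j])
                for j in systems
                if j != i and games.get((i, j), 0)
            )
            new_p[i] = total_wins[i] / denom if denom else p[i]
        norm = sum(new_p.values())
        p = {s: v / norm for s, v in new_p.items()}
    return p


# A and C never meet directly; B links them.
strengths = bradley_terry({("A", "B"): 8, ("B", "A"): 2, ("B", "C"): 7, ("C", "B"): 3})
```

In this toy graph A and C are ranked against each other only through their results against B, which is what lets a sparse set of matches order the whole field.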
When direct beats pairwise
Pairwise is more expensive than it looks. Ranking N systems takes on the order of N choose 2 comparisons per prompt (or N log N with a clever sampler), and position-swapping doubles that. Direct scoring is N judgments per prompt. For an offline regression eval where the question is "did this change pass", direct binary is the cheaper and sharper signal [5].
Direct also wins when the system has an absolute quality bar that does not depend on a baseline. A factual QA bot must answer correctly; whether it answered better than last week is a secondary question that direct scoring captures by comparing pass rates between runs.
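The cost gap is easy to tabulate. A small sketch of the per-prompt judge calls under each scheme; the function name is hypothetical, and the N log N figure is only an order-of-magnitude stand-in for a smarter sampler:

```python
import math


def judgments_per_prompt(n_systems: int) -> dict[str, int]:
    """Judge calls per prompt under each scheme, before any position-swap doubling."""
    return {
        "direct": n_systems,
        "pairwise_all_pairs": math.comb(n_systems, 2),
        "pairwise_n_log_n": math.ceil(n_systems * math.log2(n_systems)),
    }


costs = judgments_per_prompt(10)
```

For ten systems this gives 10 direct judgments against 45 all-pairs comparisons, or about 34 with an N log N sampler.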
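Comparing pass rates between runs needs some allowance for sampling noise. One way to sketch it, as an assumption rather than anything the cited sources prescribe, is a Wilson score interval on each run's pass rate, flagging a regression only when the intervals do not overlap. The function names are hypothetical:

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = passes / total
    denom = 1 + z * z / total
    center = (p_hat + z * z / (2 * total)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def regressed(baseline: tuple[int, int], candidate: tuple[int, int]) -> bool:
    """True only when the candidate's interval sits entirely below the baseline's.

    Each argument is (passes, total). Non-overlap is a conservative test.
    """
    base_lo, _ = wilson_interval(*baseline)
    _, cand_hi = wilson_interval(*candidate)
    return cand_hi < base_lo
```

With 100 examples per run, a drop from 95 passes to 70 is flagged, while a drop from 95 to 93 is not.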
Position bias and the position-swap pattern
Pairwise judging suffers from position bias: the LLM's verdict depends on the order in which the responses are shown, most often favoring the first position regardless of content. When Zheng et al. measured it on MT-Bench, Claude-v1 favored the first position around seventy percent of the time, and GPT-3.5 kept the same verdict across the two orders only about half the time [5]. The mitigation is the position-swap pattern: present each pair twice, once in each order, and count a clean win only when the judge agrees with itself across the swap.
```python
from dataclasses import dataclass


@dataclass
class PairwiseResult:
    winner: str  # "A", "B", or "tie"


def judge_pair(a: str, b: str, judge) -> PairwiseResult:
    """Run a pairwise judge in both orders. A clean winner agrees across the swap."""
    # judge(x, y) returns a positional label: "A" for the first argument, "B" for the second.
    first = judge(a, b)   # a shown first
    second = judge(b, a)  # positions swapped: a now sits in slot "B"
    if first == "A" and second == "B":
        return PairwiseResult("A")  # a wins regardless of position
    if first == "B" and second == "A":
        return PairwiseResult("B")  # b wins regardless of position
    return PairwiseResult("tie")  # judge flipped with the swap
```
The swap is not optional. A pairwise judge without position-swap reports a number that mixes position bias with preference signal, with no way to separate the two [6].
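The same two-order run can also measure how much of a judge's output is position. A minimal sketch, assuming the judge is a callable that returns "A" for the first response shown and "B" for the second; the function name and report keys are hypothetical:

```python
from collections import Counter
from typing import Callable, Iterable

Judge = Callable[[str, str], str]  # returns "A" (first shown) or "B" (second shown)


def position_bias_report(pairs: Iterable[tuple[str, str]], judge: Judge) -> dict[str, float]:
    """Run every pair in both orders and report how often the verdict tracks position."""
    counts = Counter()
    for a, b in pairs:
        first, second = judge(a, b), judge(b, a)
        if first != second:
            counts["consistent"] += 1       # the same response won in both orders
        elif first == "A":
            counts["first_position"] += 1   # whatever was shown first won both times
        else:
            counts["second_position"] += 1  # whatever was shown second won both times
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no pairs to judge")
    return {k: counts[k] / total for k in ("consistent", "first_position", "second_position")}


pairs = [("short", "a longer answer"), ("detailed reply here", "ok")]
always_first = position_bias_report(pairs, lambda x, y: "A")
by_length = position_bias_report(pairs, lambda x, y: "A" if len(x) > len(y) else "B")
```

A judge that always picks the first slot scores zero on consistency, while a judge that decides on content alone (here, a toy length rule) scores one.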
Chatbot Arena in brief
Chatbot Arena is the canonical public pairwise eval at scale: crowd users send the same prompt to two anonymized models, pick a winner, and the platform fits a Bradley-Terry ranking to the votes [2]. Rather than sampling pairs uniformly, it uses an adaptive sampling rule that favors the pairs whose comparison most reduces uncertainty in the ranking, which speeds convergence [2]. It reports bootstrapped confidence intervals on every rating and treats models whose intervals overlap as not clearly separated.
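For intuition about how a single vote moves a rating, here is the textbook online Elo update. It is a sketch only: the K-factor of 32 and the 400-point scale are conventional defaults assumed here, not values taken from the Arena, and a Bradley-Terry fit is the batch counterpart that uses all votes at once:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model (base-10 logistic, 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Apply one vote and return the updated (rating_a, rating_b)."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


new_a, new_b = elo_update(1000.0, 1000.0, a_won=True)
```

Two equally rated models move 16 points apart after one vote; an upset against a much higher-rated model moves the ratings further, because the expected score was far from the outcome.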
A hybrid that holds up
The configuration that holds up is direct for stage-1 and stage-2 evals on objective criteria, plus a small pairwise eval against the production baseline for the open-ended axes. Direct catches rule regressions; pairwise catches preference movements that direct rubrics miss.
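The hybrid can be sketched as a two-step gate. The class name, field names, and both thresholds below are hypothetical and would need tuning for any real system:

```python
from dataclasses import dataclass


@dataclass
class EvalReport:
    direct_pass_rate: float   # fraction of objective checks passed, 0..1
    pairwise_win_rate: float  # clean wins vs. the production baseline among decided pairs, 0..1


def ship_decision(report: EvalReport, min_pass_rate: float = 0.98, min_win_rate: float = 0.5) -> str:
    """Gate on the absolute bar first, then on preference movement against the baseline."""
    if report.direct_pass_rate < min_pass_rate:
        return "block: rule regression"
    if report.pairwise_win_rate < min_win_rate:
        return "block: preference regression"
    return "ship"
```

Checking the direct gate first keeps the expensive pairwise run off candidates that already fail an objective rule.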