A judge that agrees with humans 95% of the time on a rubric humans only agree on 60% of the time is not calibrated; it is matching one rater's idiosyncrasy. Inter-rater agreement is the prerequisite measurement: before you trust a labelled set as ground truth, find out whether two qualified humans agree on the labels. If they do not, the rubric is the bug, and no amount of judge calibration fixes a rubric humans cannot follow.
The operational discipline: two domain experts double-label fifty traces per rubric. Compute Cohen's kappa when there are two raters and Krippendorff's alpha when there are more. If kappa is below 0.6, do not move forward; revise the rubric and relabel. The agent-eval surveys and the validators paper both flag this as the single most common omission in production eval programs [1, 2].
Why raw agreement is misleading
A raw agreement of 90% sounds good. It can also be 90% on a class-imbalanced rubric where both raters say "pass" on 90% of the items because most items are easy and the rubric is uninformative on the rest. The raw rate is consistent with two raters who are coin-flipping on the hard cases and parking on "pass" everywhere else.
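To make the point concrete, here is a minimal sketch with made-up counts: 80 easy items where both raters say "pass", plus 20 hard items on which the raters behave like independent coin flips (5 items in each of the four pass/fail combinations). The counts and variable names are illustrative assumptions, not data from any study. The chance-corrected figure computed at the end is Cohen's kappa, which the next paragraph defines.

```python
# Hypothetical 2x2 agreement table for 100 items (illustrative counts only).
both_pass = 80 + 5      # 80 easy items plus 5 hard items where both said "pass"
a_pass_b_fail = 5
a_fail_b_pass = 5
both_fail = 5
n = both_pass + a_pass_b_fail + a_fail_b_pass + both_fail

# Raw agreement: share of items where the two labels match.
p_o = (both_pass + both_fail) / n

# Chance agreement from each rater's own marginal "pass" rate.
a_pass = (both_pass + a_pass_b_fail) / n
b_pass = (both_pass + a_fail_b_pass) / n
p_e = a_pass * b_pass + (1 - a_pass) * (1 - b_pass)

kappa = (p_o - p_e) / (1 - p_e)
print(f"raw agreement = {p_o:.2f}, chance agreement = {p_e:.2f}, kappa = {kappa:.2f}")
# prints: raw agreement = 0.90, chance agreement = 0.82, kappa = 0.44
```

Ninety percent raw agreement here corresponds to a kappa of about 0.44, because 82% agreement was already expected from the marginals alone.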
Cohen's kappa corrects for chance agreement. It measures how far the observed agreement exceeds the agreement expected if each rater guessed independently at their own marginal class rates, as a fraction of the largest excess possible. Kappa = 1 is perfect agreement; kappa = 0 is chance-level agreement; negative kappa is worse than chance. The convention adopted in most eval programs:
- Kappa below 0.4: the rubric is broken or the categories are unstable. Revise.
- Kappa 0.4 to 0.6: workable but borderline. The rubric needs sharpening.
- Kappa 0.6 to 0.8: the rubric is usable; judges can be calibrated against this gold set.
- Kappa above 0.8: high agreement; the rubric is well specified.
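The bands above are easy to encode as a small helper for eval reports. This is a sketch only: the function name and the returned strings are hypothetical, and the handling of the exact boundary values (0.4, 0.6 and 0.8) is an assumption, since the list does not say which band they belong to.

```python
import math


def kappa_band(kappa: float) -> str:
    """Map a kappa value to the bands listed above (boundary handling is an assumption)."""
    if math.isnan(kappa):
        return "undefined: kappa could not be computed"
    if kappa < 0.4:
        return "broken: revise the rubric"
    if kappa < 0.6:
        return "borderline: sharpen the rubric"
    if kappa <= 0.8:
        return "usable: calibrate judges against this gold set"
    return "high agreement: rubric is well specified"


for value in (0.25, 0.46, 0.625, 0.85):
    print(value, "->", kappa_band(value))
```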
Hamel's argument from the LLM-judge piece is that the rubric ought to be reliable before you spend time calibrating a model judge; otherwise the model is chasing one rater's signal [3].
Cohen's kappa for two raters

```python
"""
Cohen's kappa for two raters on the same items. No external deps.
"""
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must label the same items")
    n = len(rater_a)
    if n == 0:
        return float("nan")

    # Observed agreement: share of items with identical labels.
    agree = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    p_o = agree / n

    # Chance agreement from each rater's own marginal label rates.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)

    if 1 - p_e == 0:
        return float("nan")  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Example: two raters on a binary pass/fail rubric.
a = ["pass"] * 70 + ["fail"] * 30
b = ["pass"] * 65 + ["fail"] * 5 + ["pass"] * 10 + ["fail"] * 20
print(cohens_kappa(a, b))  # ~0.625 (p_o = 0.85, p_e = 0.60; just clears the 0.6 bar)
```
For binary rubrics, the formula collapses to a clean expression and gives a quick gut check during rubric design. The kappa often surfaces "I thought we agreed on 'helpful' but we never wrote down what that means" before the project commits to the rubric.
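One way to write that binary expression is in terms of the four cells of the 2x2 agreement table. The sketch below uses the standard closed form, kappa = 2(ad - bc) / ((a + b)(b + d) + (a + c)(c + d)); the function and argument names are illustrative. The cell counts in the usage line are the ones implied by the two example label lists above.

```python
def binary_kappa(both_pass: int, a_only_pass: int, b_only_pass: int, both_fail: int) -> float:
    """Cohen's kappa from the four cells of a 2x2 pass/fail agreement table.

    both_pass:   items both raters marked "pass"
    a_only_pass: rater A "pass", rater B "fail"
    b_only_pass: rater A "fail", rater B "pass"
    both_fail:   items both raters marked "fail"
    """
    a, b, c, d = both_pass, a_only_pass, b_only_pass, both_fail
    denom = (a + b) * (b + d) + (a + c) * (c + d)
    if denom == 0:
        # No items, or both raters used one identical label throughout.
        return float("nan")
    return 2 * (a * d - b * c) / denom


print(binary_kappa(65, 5, 10, 20))  # 0.625
```

Because it needs only four counts, this form is convenient for checking a rubric from a tally sheet before any labelling tooling exists.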
Krippendorff's alpha for more than two raters
When more than two raters annotate, or the data has missing cells (rater C did not see item 47), Cohen's kappa is the wrong tool. Krippendorff's alpha generalises the idea: it handles any number of raters, any level of measurement (nominal, ordinal, interval, ratio), and missing data [4].
The construction (nominal case) is:
- For every pair of raters who saw the same item, count disagreements; aggregate across items.
- Compare against the expected disagreement under random pairing across the full dataset.
- Alpha = 1 - (observed disagreement / expected disagreement).
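In symbols, the nominal-case construction can be written as follows. The notation is an assumption of this sketch: $n_{uc}$ is the number of raters who assigned label $c$ to item $u$, $m_u$ is the number of raters who labelled item $u$ (only items with $m_u \ge 2$ are counted), $n_c = \sum_u n_{uc}$, and $n = \sum_u m_u$. In Krippendorff's formulation each item's disagreeing pairs are weighted by $1/(m_u - 1)$, so that items seen by more raters do not dominate the total.

$$
D_o = \frac{1}{n} \sum_{u} \frac{1}{m_u - 1} \sum_{c \neq k} n_{uc}\, n_{uk},
\qquad
D_e = \frac{1}{n(n-1)} \sum_{c \neq k} n_c\, n_k,
\qquad
\alpha = 1 - \frac{D_o}{D_e}
$$

When every item is labelled by the same number of raters, the per-item weight is constant and $D_o$ reduces to the plain share of disagreeing rater pairs.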
A serviceable Python implementation is short for the nominal case:
```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Optional


def krippendorff_alpha_nominal(ratings: dict[str, dict[str, Optional[str]]]) -> float:
    """
    ratings = {rater_id: {item_id: label_or_None}}
    Items with fewer than two raters are skipped.
    """
    # Build per-item rating multisets, dropping missing (None) cells.
    items: dict[str, list[str]] = defaultdict(list)
    for by_item in ratings.values():
        for item, label in by_item.items():
            if label is not None:
                items[item].append(label)

    # Observed disagreement. Each item's disagreeing ordered pairs are
    # weighted by 1 / (m_u - 1), so items with more raters do not dominate
    # when some cells are missing.
    obs_num = 0.0
    label_counts: Counter = Counter()
    for labels in items.values():
        m_u = len(labels)
        if m_u < 2:
            continue
        local_counts = Counter(labels)
        label_counts.update(labels)
        disagree_pairs = 2 * sum(
            local_counts[a] * local_counts[b]
            for a, b in combinations(local_counts, 2)
        )  # ordered pairs
        obs_num += disagree_pairs / (m_u - 1)

    # Expected disagreement from the pooled label counts.
    total = sum(label_counts.values())  # number of pairable ratings
    if total < 2:
        return float("nan")
    exp_num = 2 * sum(
        label_counts[a] * label_counts[b]
        for a, b in combinations(label_counts, 2)
    )
    d_o = obs_num / total
    d_e = exp_num / (total * (total - 1))
    if d_e == 0:
        return 1.0  # only one label was ever used; treated here as full agreement
    return 1 - d_o / d_e
```
A common bar in the literature: alpha of at least 0.8 for high-stakes labelling, with alpha of at least 0.667 as an "acceptable" floor for early, exploratory work [4]. The 0.8 bar is the one we recommend for any labelled set used as a calibration target or release gate.
What to do when agreement is low
Three causes recur in practice; each maps to a different fix.
The first is rubric ambiguity. Two raters disagree because the rubric does not adjudicate the case. Fix the rubric: write examples for the borderline cases, add anchors, decompose the rubric into independent axes. Shankar's validators paper makes the case that this is the rubric authoring loop, not a side activity [2].
The second is rater training. The raters interpret the rubric differently because nobody walked them through the borderline cases. Fix the training: hold a calibration session, label twenty items together, talk through disagreements, then double-label the next fifty independently and re-measure.
The third is rater fatigue or pressure. The same rater labels a thousand items in a day and quality degrades through the run. Fix the throughput: smaller batches, rotation across raters, a quality sample (the same items relabelled later in the run).
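A quality sample can be scored with a few lines. The sketch below assumes a hypothetical layout in which a rater's first-pass labels and their later relabels are both dictionaries keyed by item ID; the function name and the trace IDs are made up. It reports the share of repeated items that received the same label both times.

```python
def self_consistency(first_pass: dict[str, str], relabel: dict[str, str]) -> float:
    """Share of relabelled items whose label matches the rater's first pass."""
    repeated = [item for item in relabel if item in first_pass]
    if not repeated:
        return float("nan")
    same = sum(1 for item in repeated if first_pass[item] == relabel[item])
    return same / len(repeated)


# Hypothetical data: three items were re-inserted late in the run.
first_pass = {"t1": "pass", "t2": "fail", "t3": "pass", "t4": "pass"}
relabel = {"t1": "pass", "t2": "pass", "t3": "pass"}
print(self_consistency(first_pass, relabel))
```

This is a raw rate, so it shares the chance-agreement weakness described earlier; on a larger quality sample the two label lists can be passed to a kappa function instead.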
> [!IMPORTANT]
> Kappa below 0.6 is a rubric problem, not a rater problem. Do not push the rubric into automation while humans cannot follow it consistently [5].
How agreement interacts with judge calibration
The standard sequence is rubric authoring, then human IRR, then judge calibration. Calibrating a judge against a rubric humans cannot follow produces a confident judge whose agreement is measured against labels that are not a stable ground truth. The judge can be re-calibrated; the rubric still cannot be followed; the headline metric is decoration.
After the rubric clears the IRR bar (kappa above 0.6, ideally 0.8 for high-stakes; alpha above 0.8 for multi-rater), the judge calibration loop is the next step. Hamel's argument is that two to three iterations against the labelled set are enough to push both TPR and TNR above 0.9 when the rubric is right [3]. The agent-eval surveys make the same point in the trajectory setting: step-level IRR is the prerequisite to step-level judge calibration [1].
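Checking a judge against the 0.9 TPR and TNR target takes only the gold labels and the judge's verdicts on the same items. The sketch below is illustrative: the function name and data are hypothetical, and treating "pass" as the positive class is an assumption (some programs treat "fail" as the class of interest; swap the `positive` argument in that case).

```python
def tpr_tnr(gold: list[str], judge: list[str], positive: str = "pass") -> tuple[float, float]:
    """True-positive and true-negative rates of judge labels against gold labels."""
    if len(gold) != len(judge):
        raise ValueError("gold and judge must label the same items")
    tp = sum(1 for g, j in zip(gold, judge) if g == positive and j == positive)
    tn = sum(1 for g, j in zip(gold, judge) if g != positive and j != positive)
    pos = sum(1 for g in gold if g == positive)
    neg = len(gold) - pos
    tpr = tp / pos if pos else float("nan")
    tnr = tn / neg if neg else float("nan")
    return tpr, tnr


# Hypothetical gold labels and judge verdicts for ten traces.
gold = ["pass"] * 6 + ["fail"] * 4
judge = ["pass"] * 5 + ["fail"] + ["fail"] * 3 + ["pass"]
tpr, tnr = tpr_tnr(gold, judge)
print(f"TPR = {tpr:.2f}, TNR = {tnr:.2f}")
```

Reporting the two rates separately matters on imbalanced rubrics, where a judge that always says "pass" scores a high TPR and a TNR of zero.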
CITATION
Wikipedia, "Cohen's kappa" and "Krippendorff's alpha." The canonical references for the formulas and bands. en.wikipedia.org/wiki/Cohen%27s_kappa and en.wikipedia.org/wiki/Krippendorff%27s_alpha
What to do this week
- Pick the rubric your judge is most reliant on. Double-label fifty items with two domain experts.
- Compute Cohen's kappa. If it lands below 0.6, do not calibrate the judge; revise the rubric and relabel.
- For multi-rater settings, switch from "average agreement" reporting to Krippendorff's alpha. The agreement bar stays at 0.8 for any rubric used as a gold set.