Calibrating your judge against humans

TPR and TNR over agreement percent. Train, dev, test splits. The 100-example minimum. Iterate the prompt until both rates clear 0.90.

A judge that has not been calibrated is a vibes generator. Calibration is the work that turns it into a measurement instrument. The unit of work is straightforward: hold out a human-labeled set, run the judge, build a confusion matrix, iterate the prompt until both true-positive and true-negative rates clear 0.90 on dev, then read once on test ¹.

The mistake every team makes on the first pass is treating "agreement percent" as the headline metric. It is not. On any real product, the PASS/FAIL split is imbalanced, and a judge that always says PASS will hit ninety percent agreement on a ninety-ten dataset while catching zero real failures.

Agreement percent lies on imbalanced data

The seed worked example illustrates the trap. Suppose your traces are ninety percent PASS and ten percent FAIL. A judge that returns PASS unconditionally gets ninety percent agreement. It also catches zero failures, which is the entire point of having a judge ¹.
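The trap is easy to reproduce in a few lines. The sketch below uses a made-up 90/10 label list (not data from the case study) and a degenerate judge that always answers PASS.

```python
# Hypothetical 90/10 dataset: 90 traces labeled PASS, 10 labeled FAIL.
truth = ["PASS"] * 90 + ["FAIL"] * 10

# A degenerate "judge" that returns PASS no matter what it sees.
pred = ["PASS"] * len(truth)

# Agreement percent: fraction of traces where judge and human match.
agreement = sum(t == p for t, p in zip(truth, pred)) / len(truth)

# Failures caught: of the human-labeled FAIL traces, how many did the judge flag?
fail_total = sum(t == "FAIL" for t in truth)
fail_caught = sum(t == "FAIL" and p == "FAIL" for t, p in zip(truth, pred))

print(f"agreement = {agreement:.0%}")                      # agreement = 90%
print(f"failures caught = {fail_caught} of {fail_total}")  # failures caught = 0 of 10
```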

The two metrics that do not lie:

True positive rate (TPR). Of the traces that are actually PASS, how many did the judge correctly label PASS? Equivalent to recall on the PASS class.
True negative rate (TNR). Of the traces that are actually FAIL, how many did the judge correctly label FAIL? Equivalent to specificity, or recall on the FAIL class.

The first attempt in the seed case study had TPR 90.1 percent and TNR 22.2 percent. Agreement looked acceptable. The judge was catching roughly one in five real dietary violations. After prompt iteration the same harness reached TPR 95.7 percent and TNR 100 percent on the test set ¹. The agreement number moved seven points. The TNR moved seventy-eight points. That is the story.

The 100-example minimum

You need at least 100 human-labeled traces before the judge metrics carry signal, and 200 is more honest. Anthropic's "Adding Error Bars to Evals" makes the case for reporting uncertainty alongside eval scores ². For a sense of scale: at N = 100 the 95 percent Wilson interval on a 0.90 success rate is roughly [0.83, 0.94], about twelve points wide, so a shift of a few points after a prompt revision sits inside the error bar.

Hamel and Shreya recommend 100 to 200 hand-labeled traces as the working floor and 500+ when you need to distinguish small effects ³. Below 100 you are tuning to noise.
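To see how the error bar narrows with more labels, you can compute the Wilson interval yourself. This is a minimal sketch using the standard Wilson score formula; the function name `wilson_interval` and the sample sizes are illustrative choices, not something from the cited sources.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Same observed rate (0.90) at several illustrative sample sizes.
for n in (30, 100, 200, 500):
    lo, hi = wilson_interval(round(0.9 * n), n)
    print(f"N={n:>3}: [{lo:.3f}, {hi:.3f}]  width={hi - lo:.3f}")
```

Keep in mind that the n for TNR is the number of human-labeled FAIL traces, not the size of the whole set, so on an imbalanced dataset its interval is wider than the headline N suggests.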

Train, dev, test, stratified

You split the labeled set three ways. Train carries the few-shot examples the judge prompt embeds. Dev is the iteration target where you read every error and revise the prompt. Test is the unbiased number you read once.

from sklearn.model_selection import train_test_split

# labeled_data is a DataFrame with columns: query, response, label.
# Roughly 15% train, 40% dev, 45% test.

train_dev, test = train_test_split(
    labeled_data,
    test_size=0.45,
    stratify=labeled_data["label"],
    random_state=42,
)

train, dev = train_test_split(
    train_dev,
    test_size=0.73,  # 0.73 * 0.55 = 0.40 of the original.
    stratify=train_dev["label"],
    random_state=42,
)

print(f"train: {len(train)}, dev: {len(dev)}, test: {len(test)}")
print("train PASS rate:", train["label"].eq("PASS").mean())
print("dev PASS rate:",   dev["label"].eq("PASS").mean())
print("test PASS rate:",  test["label"].eq("PASS").mean())

Stratification is non-optional. Without it, a small dataset routinely ends up with zero FAIL examples in dev or zero PASS examples in test, and you cannot compute the metric that matters.
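A cheap guard is to count labels per split and fail loudly if a class is missing. The sketch below is one way to do that; `check_splits` and its `min_per_class` parameter are hypothetical helpers, and the toy label lists stand in for the real `train["label"]`, `dev["label"]`, and `test["label"]` columns.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Sequence


def check_splits(
    splits: Mapping[str, Iterable[str]],
    classes: Sequence[str] = ("PASS", "FAIL"),
    min_per_class: int = 1,
) -> Dict[str, Dict[str, int]]:
    """Count labels per split; raise if any class is under min_per_class."""
    report = {}
    for name, labels in splits.items():
        counts = Counter(labels)
        report[name] = {c: counts.get(c, 0) for c in classes}
        for c in classes:
            if report[name][c] < min_per_class:
                raise ValueError(
                    f"{name} split has {report[name][c]} {c} examples; "
                    f"need at least {min_per_class}"
                )
    return report


# Toy label lists standing in for the three real splits.
toy_splits = {
    "train": ["PASS"] * 13 + ["FAIL"] * 2,
    "dev": ["PASS"] * 36 + ["FAIL"] * 4,
    "test": ["PASS"] * 41 + ["FAIL"] * 4,
}
report = check_splits(toy_splits)
for name, counts in report.items():
    print(name, counts)
```

With the DataFrames from the split code, the call would look like `check_splits({"train": train["label"], "dev": dev["label"], "test": test["label"]})`.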

The sklearn TPR/TNR snippet

The metric you want is the confusion matrix. Compute TPR and TNR directly; do not lean on macro-averaged F1 or balanced accuracy as a substitute when reporting to your team, because they smear the failure direction. You want to know which way the judge errs.

from sklearn.metrics import confusion_matrix
from typing import Sequence


def tpr_tnr(
    truth: Sequence[str],
    pred: Sequence[str],
    positive_label: str = "PASS",
    negative_label: str = "FAIL",
) -> dict:
    """Compute TPR and TNR for a binary judge.

    TPR = recall on the positive class.
    TNR = recall on the negative class.
    """
    labels = [positive_label, negative_label]
    # confusion_matrix returns rows = truth, cols = pred.
    tn, fp, fn, tp = confusion_matrix(
        truth, pred, labels=labels[::-1]  # so order is neg, pos.
    ).ravel()
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    balanced = (tpr + tnr) / 2
    return {
        "tp": int(tp), "tn": int(tn), "fp": int(fp), "fn": int(fn),
        "tpr": tpr,
        "tnr": tnr,
        "balanced_accuracy": balanced,
        "n": len(truth),
    }


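# Example, dev set with 100 labeled traces.
# run_judge is your own judge call (not defined here); it is assumed to
# return a dict with a "label" key holding "PASS" or "FAIL".
dev_truth = dev["label"].tolist()
dev_pred = [run_judge(q, r)["label"] for q, r in zip(dev["query"], dev["response"])]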

metrics = tpr_tnr(dev_truth, dev_pred)
print(f"N={metrics['n']}")
print(f"TPR = {metrics['tpr']:.1%}")
print(f"TNR = {metrics['tnr']:.1%}")
print(f"Balanced accuracy = {metrics['balanced_accuracy']:.1%}")
print(f"TP={metrics['tp']} TN={metrics['tn']} FP={metrics['fp']} FN={metrics['fn']}")

A representative first run on a fresh judge:

N=100
TPR = 90.1%
TNR = 22.2%
Balanced accuracy = 56.2%
TP=82 TN=2 FP=7 FN=9

The TPR looks fine. The TNR tells you the judge is missing roughly four out of five real failures. That is the number you take into the iteration loop.

The iteration loop

You stay in this loop on the dev split until both rates are above 0.90, then you read test once.

Run the judge on every dev trace.
Compute TPR and TNR.
Read the false positives. Did the judge call PASS on something that violated the rubric? Add a refusal-style clause to the criteria, or a new few-shot example in the FAIL bucket.
Read the false negatives. Did the judge call FAIL on something the human marked PASS? Either the human label is wrong under the rubric (fix the label) or the criteria need a clarifying "this is not a failure" clause.
Update the prompt. Re-run.
Stop when both metrics are above 0.90 on dev, then read test exactly once.
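The two reading steps in the loop above are easier if the disagreements are already sorted into buckets. Here is a minimal sketch, assuming the judge is exposed as a callable that takes a query and a response and returns a label string; `collect_judge_errors` and `toy_judge` are hypothetical names, and the toy traces are invented for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def collect_judge_errors(
    queries: Sequence[str],
    responses: Sequence[str],
    truth: Sequence[str],
    judge: Callable[[str, str], str],
    positive_label: str = "PASS",
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Run the judge over dev and bucket disagreements for manual review."""
    false_positives = []  # judge said PASS, human said FAIL
    false_negatives = []  # judge said FAIL, human said PASS
    for q, r, t in zip(queries, responses, truth):
        p = judge(q, r)
        if p == t:
            continue
        row = {"query": q, "response": r, "human": t, "judge": p}
        if p == positive_label:
            false_positives.append(row)
        else:
            false_negatives.append(row)
    return false_positives, false_negatives


# Stand-in judge for the demo: a keyword rule, not a real LLM call.
def toy_judge(query: str, response: str) -> str:
    return "FAIL" if "peanut" in response.lower() else "PASS"


queries = ["nut-free snack?", "nut-free dessert?", "vegan dinner?"]
responses = ["Try apple slices.", "Try almond cake.", "Try peanut noodles."]
truth = ["PASS", "FAIL", "PASS"]

fps, fns = collect_judge_errors(queries, responses, truth, toy_judge)
print("false positives:", fps)
print("false negatives:", fns)
```

If your judge returns a dict, as `run_judge` does in the snippet earlier, wrap it: `judge=lambda q, r: run_judge(q, r)["label"]`.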

Reading test more than once turns it into another dev split and you lose your unbiased estimate. The discipline is harder than it sounds; budget for it.

Cohen's kappa for the human side

Before you blame the judge, calibrate the humans. Have two human raters label the same 30-trace shared set, compute Cohen's kappa, and target a kappa of at least 0.80 ⁴. Below that, "the human gold set" is not a gold set; it is two people who disagree about what PASS means.

from sklearn.metrics import cohen_kappa_score

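# Two raters labeled the same 30 calibration traces.
# The trailing ... stands for the remaining labels; replace it with the
# full lists before running.
rater_a = ["PASS", "FAIL", "PASS", "PASS", "FAIL", ...]
rater_b = ["PASS", "FAIL", "FAIL", "PASS", "FAIL", ...]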
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
# Acceptable: >= 0.80. Below 0.60 the rubric is the problem, not the raters.

Shankar et al. document "criteria drift" as the failure mode where a single rater's notion of PASS shifts as they read more examples ⁵. The kappa check on a shared subset catches it.
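It helps to see why the human-side check uses kappa rather than raw agreement between raters: the same imbalance problem applies. The sketch below computes Cohen's kappa from its textbook definition (observed agreement corrected for chance agreement) on invented labels for 30 traces; `cohen_kappa` is a hand-rolled illustration, and in practice `cohen_kappa_score` from sklearn does the same job.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters."""
    if len(a) != len(b) or not a:
        raise ValueError("rater lists must be non-empty and the same length")
    n = len(a)
    # Observed agreement: fraction of items both raters labeled the same.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / n**2
    if p_e == 1.0:
        return float("nan")  # undefined when both raters use one identical label
    return (p_o - p_e) / (1 - p_e)


# Invented labels for 30 shared traces, mostly PASS.
rater_a = ["PASS"] * 25 + ["PASS"] * 2 + ["FAIL"] * 2 + ["FAIL"] * 1
rater_b = ["PASS"] * 25 + ["FAIL"] * 2 + ["PASS"] * 2 + ["FAIL"] * 1

raw = sum(x == y for x, y in zip(rater_a, rater_b)) / len(rater_a)
kappa = cohen_kappa(rater_a, rater_b)
print(f"raw agreement = {raw:.0%}")  # raw agreement = 87%
print(f"kappa = {kappa:.2f}")        # kappa = 0.26
```

Two raters who agree on 26 of 30 traces look aligned, but they agree on only one of the five traces that either of them marked FAIL, and kappa reflects that.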

Why not "judge bigger than human"?

The judge has to be at least as capable as the model that generated the trace; that does not mean smarter than the human rater. The human is your source of truth. The judge is the cheap scaler. You measure the judge against the human, not the other way around.

Thakur et al.'s "Judging the Judges" sharpens the point: lenient judges can match GPT-4's overall agreement number while being systematically wrong in the same direction as the generator, which means the judge tracks the generator's mistakes instead of catching them ⁶. The defense is a TNR you actually measured against humans, not an agreement percent against a stronger judge.

CITATION

Hamel Husain, "Creating an LLM-as-a-Judge That Drives Business Results." TPR/TNR over agreement, the train/dev/test split, the iterate-on-dev discipline, the 22.2 percent TNR pre-iteration case study, all from this post. hamel.dev/blog/posts/llm-judge

When to recalibrate

A judge is not "calibrate once and ship." Recalibrate when any of the following changes:

The generator model changes (new minor version of GPT-4o, new Claude release, new prompt). Self-preference and capability ceiling both shift.
The product surface changes (new feature, new persona, new task category). New traces are not in the distribution your judge saw.
The judge model changes. Even within a family, a point release can move the calibration.

Cadence as a default: monthly spot check (10 to 20 fresh human labels, compare to judge), quarterly full recalibration (100 to 200 new labels, full split discipline).
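The monthly spot check can be as simple as listing where the judge and the fresh human labels disagree. The sketch below is one possible shape for it; `spot_check`, the `max_disagreements` tolerance, and the rule that any missed FAIL triggers recalibration are all assumptions for illustration, not thresholds from the sources. With only 10 to 20 labels the result is a tripwire, not a measurement.

```python
from typing import Dict, List, Sequence


def spot_check(
    human: Sequence[str],
    judge: Sequence[str],
    max_disagreements: int = 1,  # hypothetical tolerance; tune to your risk
) -> Dict[str, object]:
    """Compare fresh human labels to judge labels and flag drift."""
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must be the same length")
    # Indices where the judge and the human disagree.
    disagreements: List[int] = [
        i for i, (h, j) in enumerate(zip(human, judge)) if h != j
    ]
    # Disagreements where the human said FAIL and the judge let it through.
    missed_failures = [i for i in disagreements if human[i] == "FAIL"]
    return {
        "n": len(human),
        "disagreements": disagreements,
        "missed_failures": missed_failures,
        "recalibrate": bool(missed_failures)
        or len(disagreements) > max_disagreements,
    }


# Invented spot-check batch: 15 fresh traces, 3 of them human-labeled FAIL.
human = ["PASS"] * 12 + ["FAIL"] * 3
judge = ["PASS"] * 12 + ["FAIL", "PASS", "FAIL"]
result = spot_check(human, judge)
print(result)
```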

What is next

The judge is calibrated. Now you defuse the biases the calibration set might not catch on its own: position bias, verbosity bias, self-preference, format sensitivity, and sycophancy. The biases page covers each with the empirical evidence and the mitigation. After that, the JudgeBench page covers the cases where the judge fails outright, calibration or no calibration.

¹ Hamel Husain, "Creating an LLM-as-a-Judge That Drives Business Results."
² Miller, "Adding Error Bars to Evals," Anthropic, 2024.
³ Hamel Husain & Shreya Shankar, "LLM Evals FAQ."
⁴ Wikipedia, "Cohen's kappa."
⁵ Shankar et al., "Who Validates the Validators?" UIST 2024.
⁶ Thakur et al., "Judging the Judges," GEM 2025.