LLM-judge calibration loop

Push TPR and TNR past 0.90 by iterating the rubric, not the model. A complete loop adapted from the HuggingFace LLM-as-judge cookbook.

A judge that has not been calibrated is a vibes generator. The point of this recipe is to walk the loop that turns it into a measurement instrument: read disagreements, change the rubric, re-run, watch true-positive rate and true-negative rate climb until they both clear 0.90. The shape is adapted from the HuggingFace LLM-as-judge cookbook, which reports correlation moving from 0.567 to 0.843 with the same kind of prompt iteration we do here ¹.

If you have not built your first eval yet, work through your first eval in 30 minutes before starting here. This recipe assumes you already have a labeled dev set and a working judge that needs tightening.

The loop

Every iteration runs four steps:

1. Run the current judge against the dev set.
2. Compute TPR, TNR, and Cohen's kappa.
3. Read the disagreements.
4. Edit one piece of the rubric (instructions, examples, or schema).

Stop when both TPR and TNR are at least 0.90 on dev and Cohen's kappa is at least 0.6 ². Then read the test set once and accept the numbers.
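The stopping rule can be written down as a small gate, which keeps the decision mechanical rather than a judgment call made at the end of a long session. This is a minimal sketch: `calibration_gate` is an illustrative name, its defaults are the thresholds stated above, and the numbers in the example call are made up.

```python
import math


def calibration_gate(tpr: float, tnr: float, kappa: float,
                     min_rate: float = 0.90, min_kappa: float = 0.6) -> list[str]:
    """Return the criteria that still fail; an empty list means stop iterating."""
    failures = []
    # NaN (an empty class) never counts as passing.
    if math.isnan(tpr) or tpr < min_rate:
        failures.append(f"TPR {tpr:.3f} < {min_rate}")
    if math.isnan(tnr) or tnr < min_rate:
        failures.append(f"TNR {tnr:.3f} < {min_rate}")
    if math.isnan(kappa) or kappa < min_kappa:
        failures.append(f"kappa {kappa:.3f} < {min_kappa}")
    return failures


# Illustrative numbers: TPR clears the bar, TNR and kappa do not.
print(calibration_gate(tpr=0.97, tnr=0.41, kappa=0.47))
```

Returning the list of failing criteria, rather than a bare boolean, tells you which number to chase in the next iteration.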

What to change between iterations

The four levers, in order of how much they typically move the numbers:

Lever	Typical impact	When to use
Rubric specificity	large	First pass; the rubric is under-specified
Few-shot examples	medium	Edge cases the rubric language cannot describe
Output schema	medium	Judge returns ambiguous or unparseable labels
Model swap	small	Already tight rubric, want to squeeze TNR

The first two account for most of the gain. Hamel's field guide is direct about this: prompt iteration is the cheap intervention; model swaps are the expensive last resort ³.

Step 1: split, score, and report

"""LLM-judge calibration harness.

Run:
    pip install openai pandas scikit-learn
    export OPENAI_API_KEY=...
    python calibration.py --dataset labeled.csv --iteration 0
"""
from __future__ import annotations

import argparse
import json
from pathlib import Path

import pandas as pd
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score, confusion_matrix
from sklearn.model_selection import train_test_split

CLIENT = OpenAI()
MODEL = "gpt-4o-2024-08-06"


def split(df: pd.DataFrame, seed: int = 42) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """15% train (for few-shots), 40% dev, 45% test. Stratified."""
    train_dev, test = train_test_split(
        df, test_size=0.45, stratify=df["label"], random_state=seed
    )
    train, dev = train_test_split(
        train_dev, test_size=0.73, stratify=train_dev["label"], random_state=seed
    )
    return train, dev, test


def render_few_shots(train: pd.DataFrame, n: int = 4) -> str:
    if n == 0 or train.empty:
        return ""
    bal = pd.concat(
        [
            train[train["label"] == "PASS"].head(n // 2),
            train[train["label"] == "FAIL"].head(n // 2),
        ]
    )
    blocks = []
    for _, row in bal.iterrows():
        blocks.append(
            f"QUERY: {row['query']}\n"
            f"RESPONSE: {row['response']}\n"
            f"CORRECT LABEL: {row['label']}\n"
        )
    return "EXAMPLES:\n\n" + "\n---\n".join(blocks) + "\n\n"


def judge(query: str, response: str, rubric: str, few_shots: str) -> dict:
    prompt = (
        f"You evaluate an assistant response against one rubric.\n\n"
        f"RUBRIC:\n{rubric}\n\n"
        f"{few_shots}"
        f"QUERY:\n{query}\n\n"
        f"RESPONSE:\n{response}\n\n"
        f'Return JSON: {{"label": "PASS" or "FAIL", "critique": "..."}}\n'
        f"The critique must be one sentence."
    )
    msg = CLIENT.chat.completions.create(
        model=MODEL,
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(msg.choices[0].message.content)


def score(df: pd.DataFrame, rubric: str, train: pd.DataFrame, few_shot_n: int = 4) -> dict:
    fs = render_few_shots(train, n=few_shot_n)
    preds, critiques = [], []
    for _, row in df.iterrows():
        result = judge(row["query"], row["response"], rubric, fs)
        preds.append(result["label"].upper())
        critiques.append(result["critique"])
    df = df.copy()
    df["pred"] = preds
    df["critique"] = critiques

    tn, fp, fn, tp = confusion_matrix(
        df["label"], df["pred"], labels=["FAIL", "PASS"]
    ).ravel()
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    agreement = (tp + tn) / len(df)
    kappa = cohen_kappa_score(df["label"], df["pred"])

    return {
        "n": len(df),
        "tpr": tpr,
        "tnr": tnr,
        "agreement": agreement,
        "kappa": kappa,
        "confusion": {"tp": int(tp), "fp": int(fp), "fn": int(fn), "tn": int(tn)},
        "df": df,
    }


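def main() -> None:
    p = argparse.ArgumentParser()
    p.add_argument("--dataset", required=True)
    p.add_argument("--rubric", default="rubric.txt")
    # An integer for dev iterations, or the literal "final" for the one test read.
    p.add_argument("--iteration", default="0")
    p.add_argument("--few-shot-n", type=int, default=4)
    args = p.parse_args()

    rubric = Path(args.rubric).read_text()
    final = args.iteration == "final"

    if final:
        # --dataset is the held-out file saved by an earlier dev run; score all
        # of it, with few-shots drawn from the saved train split.
        eval_df = pd.read_csv(args.dataset)
        train = pd.read_csv("train_set.csv")
        name, run_dir = "test", "final"
    else:
        train, eval_df, test = split(pd.read_csv(args.dataset))
        # The split is seeded, so these files are identical on every dev run.
        train.to_csv("train_set.csv", index=False)
        test.to_csv("test_set.csv", index=False)
        name, run_dir = "dev", f"iter_{int(args.iteration):02d}"

    print(f"Iteration {args.iteration}: rubric chars={len(rubric)}, "
          f"few_shots={args.few_shot_n}, {name}_n={len(eval_df)}")

    result = score(eval_df, rubric, train, few_shot_n=args.few_shot_n)
    print(
        f"  {name} TPR={result['tpr']:.3f}  TNR={result['tnr']:.3f}  "
        f"agreement={result['agreement']:.3f}  kappa={result['kappa']:.3f}"
    )

    out_dir = Path("calibration_runs") / run_dir
    out_dir.mkdir(parents=True, exist_ok=True)
    result["df"].to_csv(out_dir / f"{name}_predictions.csv", index=False)
    disagree = result["df"][result["df"]["label"] != result["df"]["pred"]]
    disagree.to_json(out_dir / "disagreements.json", orient="records", indent=2)
    (out_dir / "rubric.txt").write_text(rubric)

    summary = {k: v for k, v in result.items() if k != "df"}
    (out_dir / "summary.json").write_text(json.dumps(summary, indent=2))

    if final:
        return
    # Same gate as the stopping rule: both rates and kappa must clear the bar.
    if result["tpr"] >= 0.90 and result["tnr"] >= 0.90 and result["kappa"] >= 0.6:
        print("\nReady for test. Run with --dataset test_set.csv --iteration final.")
    else:
        print("\nNot yet. Read disagreements.json and tighten rubric.txt.")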


if __name__ == "__main__":
    main()
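The harness reads three columns from the labeled CSV: `query`, `response`, and `label`, with labels spelled exactly `PASS` or `FAIL`. A pre-flight check catches a misspelled label before any judge calls are spent. The sketch below uses only the standard library; `check_labeled_rows` is a hypothetical helper, not part of the script above, and the sample rows are invented.

```python
import csv
import io
from collections import Counter

REQUIRED = ("query", "response", "label")
ALLOWED = {"PASS", "FAIL"}


def check_labeled_rows(handle) -> Counter:
    """Validate columns and label spelling; return the label counts."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    counts = Counter()
    for row_no, row in enumerate(reader, start=1):
        if row["label"] not in ALLOWED:
            raise ValueError(f"data row {row_no}: unexpected label {row['label']!r}")
        counts[row["label"]] += 1
    return counts


# In-memory stand-in for labeled.csv; pass open("labeled.csv", newline="") in practice.
sample = io.StringIO(
    "query,response,label\n"
    "I'm vegan. Dinner idea?,Lentil curry with rice.,PASS\n"
    "I'm vegan. Dinner idea?,Butter chicken.,FAIL\n"
)
print(check_labeled_rows(sample))
```

The returned counts also show the class balance at a glance, which matters later when you read TNR on a small FAIL class.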

Run it once with iteration 0 to see the baseline:

python calibration.py --dataset labeled.csv --iteration 0

You will see something like:

Iteration 0: rubric chars=320, few_shots=4, dev_n=80
  dev TPR=0.974  TNR=0.412  agreement=0.838  kappa=0.469

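Agreement looks fine, but kappa says the judge is only moderately better than chance once the class imbalance is taken into account ². TNR is the problem: the judge passes more than half of the responses a human labeled FAIL.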

Step 2: read the disagreements

Open calibration_runs/iter_00/disagreements.json. The rows are the ones where the judge disagreed with your human label. Cluster them by what the judge missed.

In the seed cookbook scenario from HuggingFace, the disagreements clustered into three patterns: the rubric did not say what to do when the response was empty, the rubric did not differentiate between the assistant ignoring vs misreading the constraint, and a few examples had ambiguous human labels ¹. The first two are rubric problems; the third is a labeling problem.

A practical rule: if more than ten percent of the disagreements look ambiguous to you, the problem is noisy labels, not the judge. Pause calibration and re-label.
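One way to keep that rule honest is to tag each disagreement by hand and count the tags. The sketch below assumes you have added a `cluster` field to each record while reading disagreements.json; that field, the cluster names, and the `triage` function are all illustrative, not something the harness writes.

```python
from collections import Counter


def triage(records: list[dict], ambiguous_tag: str = "ambiguous_label",
           max_ambiguous: float = 0.10) -> dict:
    """Count hand-assigned clusters and flag a noisy label set."""
    clusters = Counter(r.get("cluster", "untagged") for r in records)
    # label=FAIL with pred=PASS is a false pass (hurts TNR); the reverse hurts TPR.
    false_pass = sum(1 for r in records if r["label"] == "FAIL" and r["pred"] == "PASS")
    false_fail = sum(1 for r in records if r["label"] == "PASS" and r["pred"] == "FAIL")
    share = clusters[ambiguous_tag] / len(records) if records else 0.0
    return {
        "clusters": clusters.most_common(),
        "false_pass": false_pass,
        "false_fail": false_fail,
        "ambiguous_share": share,
        "relabel_first": share > max_ambiguous,
    }


# Invented records in the shape of disagreements.json plus the hand-added tag.
records = [
    {"label": "FAIL", "pred": "PASS", "cluster": "empty_response"},
    {"label": "FAIL", "pred": "PASS", "cluster": "acknowledged_then_violated"},
    {"label": "FAIL", "pred": "PASS", "cluster": "acknowledged_then_violated"},
    {"label": "PASS", "pred": "FAIL", "cluster": "ambiguous_label"},
]
report = triage(records)
print(report["clusters"], report["relabel_first"])
```

In a real run you would load the records with `json.loads(Path("calibration_runs/iter_00/disagreements.json").read_text())`. The largest cluster is the one that earns the next rubric sentence.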

Step 3: tighten the rubric

Each disagreement category gets one sentence. Iteration 1 of the seed scenario:

Does the response respect every dietary restriction the user stated earlier
in the conversation? Restrictions include allergies (peanut, tree-nut,
shellfish, gluten, soy, egg, dairy), kosher, halal, vegetarian, vegan,
and keto.

A response FAILS if any ingredient (including garnishes, sauces, sides, or
seasoning) violates a stated restriction. A response also FAILS if it
acknowledges the restriction in prose but suggests an ingredient that
violates it (this counts as ignoring, not partial credit).

A response that refuses to suggest anything because the restriction makes
the request infeasible PASSES if the refusal is correct (e.g., the user
asked for a meal that cannot exist under their restrictions). It FAILS if
the refusal is unwarranted.

A response that says "I do not know" or is empty FAILS.

Re-run with iteration 1:

python calibration.py --dataset labeled.csv --rubric rubric.txt --iteration 1

Typical second-iteration shape:

Iteration 1: rubric chars=864, few_shots=4, dev_n=80
  dev TPR=0.955  TNR=0.882  agreement=0.925  kappa=0.806

TNR jumped from 0.41 to 0.88, kappa from 0.47 to 0.81. The intervention was prose, not a model swap.
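Because every run writes a summary.json under calibration_runs/, the trajectory can be tabulated rather than eyeballed. A minimal sketch, assuming the directory layout the harness produces; `load_history` and `deltas` are hypothetical helpers, and the in-memory history reuses the two example runs shown above.

```python
import json
from pathlib import Path

METRICS = ("tpr", "tnr", "agreement", "kappa")


def load_history(root: str = "calibration_runs") -> list[tuple[str, dict]]:
    """Read every iter_*/summary.json, sorted by directory name."""
    runs = []
    for path in sorted(Path(root).glob("iter_*/summary.json")):
        runs.append((path.parent.name, json.loads(path.read_text())))
    return runs


def deltas(history: list[tuple[str, dict]]) -> list[dict]:
    """Metric values per run plus the change from the previous run."""
    rows, prev = [], None
    for name, summary in history:
        row = {"run": name}
        for m in METRICS:
            row[m] = summary[m]
            row[f"d_{m}"] = None if prev is None else round(summary[m] - prev[m], 3)
        rows.append(row)
        prev = summary
    return rows


# Stand-in for load_history(), using the example numbers from the two runs above.
history = [
    ("iter_00", {"tpr": 0.974, "tnr": 0.412, "agreement": 0.838, "kappa": 0.469}),
    ("iter_01", {"tpr": 0.955, "tnr": 0.882, "agreement": 0.925, "kappa": 0.806}),
]
for row in deltas(history):
    print(row)
```

Watching the deltas side by side also surfaces trade-offs: here TNR gains 47 points while TPR gives back about two, which is the usual price of a stricter rubric.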

Step 4: when prose stops moving the numbers

You will hit a plateau around iteration two or three. The remaining disagreements are usually edge cases the rubric language cannot describe; this is where few-shot examples help. Add two PASS examples that look like FAIL on a careless read, and two FAIL examples that look like PASS on a careless read. Re-run.

If that does not move TNR another five points, the model is the ceiling. Try a stronger judge model (Sonnet 4.5 → Opus, or GPT-4o → o3) before changing the rubric again. The bigger-than-generator rule applies: the judge has to be at least as capable as the model it grades ³.
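For the few-shot step in this section, note that the harness's `render_few_shots` takes the first rows of each class, which are not necessarily the tricky ones. One option is to hand-mark the deceptive examples and select on that mark. The sketch below works on plain dicts and assumes a hypothetical boolean `hard` field that you add to train rows yourself; the example rows are invented, and the output format mirrors `render_few_shots`.

```python
def pick_hard_examples(rows: list[dict], per_class: int = 2) -> list[dict]:
    """Take up to per_class hand-marked hard rows from each label."""
    picked = []
    for label in ("PASS", "FAIL"):
        hard = [r for r in rows if r["label"] == label and r.get("hard")]
        picked.extend(hard[:per_class])
    return picked


def render(rows: list[dict]) -> str:
    """Format rows the same way the harness formats its few-shot block."""
    if not rows:
        return ""
    blocks = [
        f"QUERY: {r['query']}\nRESPONSE: {r['response']}\nCORRECT LABEL: {r['label']}\n"
        for r in rows
    ]
    return "EXAMPLES:\n\n" + "\n---\n".join(blocks) + "\n\n"


train_rows = [
    # Sounds like meat on a careless read, but is not.
    {"query": "I'm vegetarian. Sandwich idea?",
     "response": "A smoky mushroom 'bacon' BLT on sourdough.",
     "label": "PASS", "hard": True},
    # Looks safe on a careless read, but the sauce is shellfish-based.
    {"query": "Shellfish allergy. Quick noodles?",
     "response": "Tofu pad thai finished with oyster sauce.",
     "label": "FAIL", "hard": True},
    {"query": "I'm vegan. Breakfast?",
     "response": "Oatmeal with berries.",
     "label": "PASS", "hard": False},
]
print(render(pick_hard_examples(train_rows)))
```

Draw these rows from the train split only. An example lifted from dev would leak into the number you are trying to measure.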

Step 5: read test once

Once dev TPR and TNR both clear 0.90 and kappa is above 0.6, run on the test set:

python calibration.py --dataset test_set.csv --rubric rubric.txt --iteration final

If the test numbers are within five points of dev, the judge is calibrated. If they drift more than five points, the dev set was not representative; you need more data ⁴.
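The five-point check is simple enough to automate so that nobody rounds in their own favor. A minimal sketch: `drift` is a hypothetical helper that compares two summary dicts, and the numbers in the example are invented.

```python
def drift(dev: dict, test: dict, tol: float = 0.05) -> dict[str, float]:
    """Return the metrics whose dev-to-test gap exceeds tol, with the signed gap."""
    flagged = {}
    for metric in ("tpr", "tnr"):
        gap = test[metric] - dev[metric]
        if abs(gap) > tol:
            flagged[metric] = round(gap, 3)
    return flagged


# Invented numbers: TPR holds on test, TNR drops by seven points.
dev_summary = {"tpr": 0.93, "tnr": 0.91}
test_summary = {"tpr": 0.91, "tnr": 0.84}
print(drift(dev_summary, test_summary))
```

An empty result means the judge generalized. A flagged metric points at which class the dev set under-represented.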

Why agreement is not the headline

A judge that always says PASS on a 90:10 dataset gets 0.90 agreement. That is the trap. TPR and TNR are conditional rates, so class imbalance cannot hide a failure: the always-PASS judge scores TPR 1.00 but TNR 0.00. Cohen's kappa adjusts for chance agreement; on imbalanced data, kappa tells you whether agreement is meaningfully above the always-say-PASS baseline ⁵.
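The trap is easy to reproduce with a few lines of arithmetic. The sketch below computes all four numbers from scratch, without scikit-learn, so the chance-agreement term in kappa is visible; `binary_metrics` is an illustrative helper and the 90:10 dataset is synthetic.

```python
def binary_metrics(labels: list[str], preds: list[str], positive: str = "PASS") -> dict:
    pairs = list(zip(labels, preds))
    n = len(pairs)
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    tn = sum(1 for y, p in pairs if y != positive and p != positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    agreement = (tp + tn) / n
    # Chance agreement: how often human and judge would match if each labeled
    # independently at their own observed PASS and FAIL rates.
    p_chance = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = (agreement - p_chance) / (1 - p_chance) if p_chance < 1 else float("nan")
    return {
        "agreement": agreement,
        "tpr": tp / (tp + fn) if tp + fn else float("nan"),
        "tnr": tn / (tn + fp) if tn + fp else float("nan"),
        "kappa": kappa,
    }


labels = ["PASS"] * 90 + ["FAIL"] * 10
always_pass = ["PASS"] * 100
print(binary_metrics(labels, always_pass))
# {'agreement': 0.9, 'tpr': 1.0, 'tnr': 0.0, 'kappa': 0.0}
```

Agreement of 0.90 and kappa of 0.0 describe the same judge. The chance term is also 0.90 here, so nothing is left once it is subtracted.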

Metric	What it tells you	When it fails
Agreement	Overall fraction correct	Trivial on imbalanced data
TPR	Recall on PASS class	High if judge over-PASSes
TNR	Recall on FAIL class	High if judge over-FAILs
Cohen's kappa	Agreement adjusted for chance	Misleading on near-zero base rates
Pearson r	Correlation with human scores on ordinal scales	Not suited to binary labels

For binary judges, the headline is the pair (TPR, TNR). For pairwise judges, the headline is preference agreement with humans, computed across position-swapped pairs ⁵.
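For the pairwise case, one reasonable convention is to run the judge on each pair in both orders and credit it only when the two verdicts name the same underlying response and that response matches the human preference. The sketch below implements that convention; the record shape, the field names, and the choice to count inconsistent verdicts as misses are assumptions for illustration, not a fixed standard.

```python
def pairwise_agreement(items: list[dict]) -> dict:
    """Each item: {"human": "A" or "B", "judge_ab": ..., "judge_ba": ...}.

    judge_ab is the verdict ("first" or "second") with A shown first;
    judge_ba is the verdict with B shown first.
    """
    consistent = agree = 0
    for it in items:
        # Map each positional verdict back to the underlying response.
        pick_ab = "A" if it["judge_ab"] == "first" else "B"
        pick_ba = "B" if it["judge_ba"] == "first" else "A"
        if pick_ab == pick_ba:
            consistent += 1
            if pick_ab == it["human"]:
                agree += 1
    n = len(items)
    return {
        "consistency": consistent / n,   # share of pairs stable under swapping
        "agreement": agree / n,          # inconsistent verdicts count as misses
    }


# Invented verdicts; the second item shows a judge that always picks "first".
items = [
    {"human": "A", "judge_ab": "first", "judge_ba": "second"},
    {"human": "B", "judge_ab": "first", "judge_ba": "first"},
    {"human": "B", "judge_ab": "second", "judge_ba": "first"},
    {"human": "A", "judge_ab": "second", "judge_ba": "first"},
]
print(pairwise_agreement(items))
```

Reporting consistency next to agreement separates two failure modes: a judge that is stable but wrong, and a judge whose answer depends on which response it saw first.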

Sample size and error bars

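At 100 dev examples, the 95% Wilson confidence interval on a TNR of 0.90 is roughly [0.83, 0.94]; a five-point change between iterations is inside that interval ⁶. At 200 examples the interval narrows to roughly [0.85, 0.93]. Those figures treat every example as a FAIL case; TNR is computed over FAIL rows only, so on an imbalanced set the real interval is wider. The pragmatic floor is 100 labels; the honest floor is 200 ⁷.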
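The interval is cheap to compute directly, which makes it easy to plug in your own counts instead of relying on round numbers. A minimal sketch of the standard Wilson score interval at 95% confidence (z = 1.96); `wilson_interval` is an illustrative name.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a rate of successes out of n.

    For TNR, n is the number of FAIL-labeled rows and successes is the
    number of those the judge also marked FAIL.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


for n in (100, 200):
    lo, hi = wilson_interval(round(0.9 * n), n)
    print(f"n={n}: [{lo:.3f}, {hi:.3f}]")
```

Try it with the FAIL count from your own dev set. With 17 FAIL rows, for example, `wilson_interval(15, 17)` gives a much wider band than either case above.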

For the test set specifically, do not read it more than once per project. Every re-read is a peek at the held-out distribution and inflates your reported numbers. Treat the test number the way you would treat a release-gate decision: one read, one number, written down.

Productionize the loop

Once calibrated, the judge prompt and rubric go into version control. The test number gets pinned to the rubric version. Future PRs that touch the prompt have to re-run the calibration script and beat the pinned test number.

Wire it into CI by reading the labeled CSV from a fixture path and asserting TPR and TNR thresholds. The production CI integration chapter covers the tiering pattern (smoke vs full vs nightly) for keeping judge calls affordable on every PR.
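The CI assertion itself can be a few lines that read the summary the harness writes and fail the build when a rate drops below its floor. A minimal sketch, assuming the summary.json format produced above; `check_summary`, `run_ci_gate`, and the path you pass in are hypothetical, and the floors are the 0.90 thresholds from this recipe.

```python
import json
from pathlib import Path

FLOORS = {"tpr": 0.90, "tnr": 0.90}


def check_summary(summary: dict, floors: dict = FLOORS) -> None:
    """Raise AssertionError naming every metric under its floor."""
    # "not x >= floor" also catches NaN, which the harness emits for an empty class.
    failing = {m: summary[m] for m, floor in floors.items() if not summary[m] >= floor}
    assert not failing, f"judge below calibration floor: {failing}"


def run_ci_gate(summary_path: str) -> None:
    """Load a summary.json from the given path and apply the floors."""
    check_summary(json.loads(Path(summary_path).read_text()))


# Invented summary that passes both floors, so this call is silent.
check_summary({"tpr": 0.93, "tnr": 0.91, "kappa": 0.82})
```

Calling `run_ci_gate` from a test function in your suite makes a regression in the rubric show up as an ordinary failing test, with the offending metric in the message.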

What to do next

Add error bars once your labeled set is bigger than 200. If your domain is RAG-shaped, the RAG eval recipe shows how the two-layer split (retrieval vs generation) interacts with judge calibration. The biases page covers what to do about position, verbosity, and self-preference bias once you start running pairwise comparisons.

1. HuggingFace Cookbook, "LLM as a judge." Reports judge-human correlation moving 0.567 → 0.843 via prompt iteration.
2. Cohen's kappa. Wikipedia.
3. Hamel Husain, "Creating an LLM-as-a-Judge That Drives Business Results."
4. Shankar et al., "Who Validates the Validators?" UIST 2024.
5. Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," NeurIPS 2023 D&B.
6. Evan Miller, "Adding Error Bars to Evals." arXiv:2411.00640.
7. Hamel Husain & Shreya Shankar, "LLM Evals FAQ." 100 to 200 labels as the working floor.