A judge that has not been calibrated is a vibes generator. This recipe walks the loop that turns it into a measurement instrument: read disagreements, change the rubric, re-run, and watch true-positive rate and true-negative rate climb until both clear 0.90. The shape is adapted from the HuggingFace LLM-as-judge cookbook, which reports correlation with human ratings moving from 0.567 to 0.843 with the same kind of prompt iteration used here [1].
If you have not built your first eval yet, work through "Your first eval in 30 minutes" first. This recipe assumes you have a labeled dev set and a working judge that needs to be tightened.
The loop
Every iteration runs four steps:
- Run the current judge against the dev set.
- Compute TPR, TNR, Cohen's kappa.
- Read the disagreements.
- Edit one piece of the rubric (instructions, examples, or schema).
Stop when both TPR and TNR are at least 0.90 on dev and Cohen's kappa is at least 0.6 [2]. Then read the test set once and accept the numbers.
What to change between iterations
The four levers, in order of how much they typically move the numbers:
| Lever | Typical impact | When to use |
|---|---|---|
| Rubric specificity | large | First pass; the rubric is under-specified |
| Few-shot examples | medium | Edge cases the rubric language cannot describe |
| Output schema | medium | Judge returns ambiguous or unparseable labels |
| Model swap | small | Already tight rubric, want to squeeze TNR |
The first two account for most of the gain. Hamel's field guide is direct about this: prompt iteration is the cheap intervention; model swaps are the expensive last resort [3].
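As a rough illustration of the output-schema lever, the sketch below validates a judge reply before it is counted. It assumes the judge is asked for a JSON object with `label` and `critique` keys; `parse_verdict` and `ALLOWED_LABELS` are hypothetical names chosen for this example, not part of any library.

```python
import json

ALLOWED_LABELS = {"PASS", "FAIL"}  # assumed two-label schema


def parse_verdict(raw: str) -> dict:
    """Parse one judge reply; raise ValueError if it does not fit the schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("judge reply is not a JSON object")
    label = str(data.get("label", "")).strip().upper()
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {data.get('label')!r}")
    return {"label": label, "critique": str(data.get("critique", "")).strip()}


print(parse_verdict('{"label": "pass", "critique": "Meets the rubric."}'))
```

Counting rejected replies separately, rather than coercing them to FAIL, is one way to keep schema problems from being mistaken for rubric problems.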
Step 1: split, score, and report
"""LLM-judge calibration harness.
Run:
    pip install openai pandas scikit-learn anthropic
    export OPENAI_API_KEY=...
    python calibration.py --dataset labeled.csv --iteration 0
"""
from __future__ import annotations

import argparse
import json
from pathlib import Path

import pandas as pd
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score, confusion_matrix
from sklearn.model_selection import train_test_split

CLIENT = OpenAI()
MODEL = "gpt-4o-2024-08-06"


def split(df: pd.DataFrame, seed: int = 42) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """15% train (for few-shots), 40% dev, 45% test. Stratified."""
    train_dev, test = train_test_split(
        df, test_size=0.45, stratify=df["label"], random_state=seed
    )
    # 0.73 of the remaining 55% is about 40% of the full set.
    train, dev = train_test_split(
        train_dev, test_size=0.73, stratify=train_dev["label"], random_state=seed
    )
    return train, dev, test


def render_few_shots(train: pd.DataFrame, n: int = 4) -> str:
    """Render a balanced block of PASS and FAIL examples from the train split."""
    if n == 0 or train.empty:
        return ""
    bal = pd.concat(
        [
            train[train["label"] == "PASS"].head(n // 2),
            train[train["label"] == "FAIL"].head(n // 2),
        ]
    )
    blocks = []
    for _, row in bal.iterrows():
        blocks.append(
            f"QUERY: {row['query']}\n"
            f"RESPONSE: {row['response']}\n"
            f"CORRECT LABEL: {row['label']}\n"
        )
    return "EXAMPLES:\n\n" + "\n---\n".join(blocks) + "\n\n"


def judge(query: str, response: str, rubric: str, few_shots: str) -> dict:
    """Ask the judge model for one PASS/FAIL verdict plus a one-sentence critique."""
    prompt = (
        f"You evaluate an assistant response against one rubric.\n\n"
        f"RUBRIC:\n{rubric}\n\n"
        f"{few_shots}"
        f"QUERY:\n{query}\n\n"
        f"RESPONSE:\n{response}\n\n"
        f'Return JSON: {{"label": "PASS" or "FAIL", "critique": "..."}}\n'
        f"The critique must be one sentence."
    )
    msg = CLIENT.chat.completions.create(
        model=MODEL,
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(msg.choices[0].message.content)


def score(df: pd.DataFrame, rubric: str, train: pd.DataFrame, few_shot_n: int = 4) -> dict:
    """Judge every row of df and compare predictions with the human labels."""
    fs = render_few_shots(train, n=few_shot_n)
    preds, critiques = [], []
    for _, row in df.iterrows():
        result = judge(row["query"], row["response"], rubric, fs)
        preds.append(result["label"].strip().upper())
        critiques.append(result["critique"])
    df = df.copy()
    df["pred"] = preds
    df["critique"] = critiques
    # With labels=["FAIL", "PASS"], ravel() yields tn, fp, fn, tp (PASS is positive).
    tn, fp, fn, tp = confusion_matrix(
        df["label"], df["pred"], labels=["FAIL", "PASS"]
    ).ravel()
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    agreement = (tp + tn) / len(df)
    kappa = cohen_kappa_score(df["label"], df["pred"])
    return {
        "n": len(df),
        "tpr": tpr,
        "tnr": tnr,
        "agreement": agreement,
        "kappa": kappa,
        "confusion": {"tp": int(tp), "fp": int(fp), "fn": int(fn), "tn": int(tn)},
        "df": df,
    }


def main() -> None:
    p = argparse.ArgumentParser()
    p.add_argument("--dataset", required=True)
    p.add_argument("--rubric", default="rubric.txt")
    # A string tag, so both "--iteration 1" and "--iteration final" parse.
    p.add_argument("--iteration", default="0")
    p.add_argument("--few-shot-n", type=int, default=4)
    args = p.parse_args()

    df = pd.read_csv(args.dataset)
    train, dev, test = split(df)
    rubric = Path(args.rubric).read_text()
    print(f"Iteration {args.iteration}: rubric chars={len(rubric)}, "
          f"few_shots={args.few_shot_n}, dev_n={len(dev)}")
    result = score(dev, rubric, train, few_shot_n=args.few_shot_n)
    print(
        f" dev TPR={result['tpr']:.3f} TNR={result['tnr']:.3f} "
        f"agreement={result['agreement']:.3f} kappa={result['kappa']:.3f}"
    )

    # Zero-pad numeric tags (iter_00, iter_01, ...); names such as "final" pass through.
    out_dir = Path("calibration_runs") / f"iter_{args.iteration.zfill(2)}"
    out_dir.mkdir(parents=True, exist_ok=True)
    result["df"].to_csv(out_dir / "dev_predictions.csv", index=False)
    disagree = result["df"][result["df"]["label"] != result["df"]["pred"]]
    disagree.to_json(out_dir / "disagreements.json", orient="records", indent=2)
    (out_dir / "rubric.txt").write_text(rubric)
    summary = {k: v for k, v in result.items() if k != "df"}
    (out_dir / "summary.json").write_text(json.dumps(summary, indent=2))

    # Same stop rule as the prose: both rates at 0.90 or better and kappa at 0.6 or better.
    if result["tpr"] >= 0.90 and result["tnr"] >= 0.90 and result["kappa"] >= 0.6:
        print("\nReady for test. Run with --dataset test_set.csv.")
    else:
        print("\nNot yet. Read disagreements.json and tighten rubric.txt.")


if __name__ == "__main__":
    main()
Run it once with iteration 0 to see the baseline:
python calibration.py --dataset labeled.csv --iteration 0
You will see something like:
Iteration 0: rubric chars=320, few_shots=4, dev_n=80
dev TPR=0.974 TNR=0.412 agreement=0.838 kappa=0.469
Agreement looks fine, but kappa says it is only moderately better than chance given the class imbalance [2]. TNR is the problem: the judge passes more than half of the responses that humans marked FAIL.
Step 2: read the disagreements
Open calibration_runs/iter_00/disagreements.json. The rows are the ones where the judge disagreed with your human label. Cluster them by what the judge missed.
In the seed cookbook scenario from HuggingFace, the disagreements clustered into three patterns: the rubric did not say what to do when the response was empty, the rubric did not differentiate between the assistant ignoring versus misreading the constraint, and a few examples had ambiguous human labels [1]. The first two are rubric problems; the third is a labeling problem.
A practical rule: if more than ten percent of the disagreements look ambiguous to you, the problem is noisy labels, not the judge. Pause calibration and re-label.
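The ten-percent rule can be checked mechanically once each disagreement has been tagged by hand. The sketch below assumes you add a `cluster` field to every row while reading; that field, the tag names, and `summarize_clusters` are all hypothetical and not something the calibration script produces.

```python
from collections import Counter


def summarize_clusters(rows: list[dict], ambiguous_tag: str = "ambiguous_label",
                       threshold: float = 0.10) -> dict:
    """Count hand-assigned clusters and flag when ambiguous labels exceed the threshold."""
    counts = Counter(row["cluster"] for row in rows)
    share = counts[ambiguous_tag] / len(rows) if rows else 0.0
    return {
        "counts": dict(counts),
        "ambiguous_share": share,
        "relabel_first": share > threshold,
    }


# Hypothetical disagreement rows after a manual read-through.
rows = [
    {"id": 1, "cluster": "empty_response"},
    {"id": 2, "cluster": "empty_response"},
    {"id": 3, "cluster": "ignored_vs_misread"},
    {"id": 4, "cluster": "ambiguous_label"},
    {"id": 5, "cluster": "ignored_vs_misread"},
]
print(summarize_clusters(rows))
```

Here one of five disagreements is tagged ambiguous, so the share is 0.2 and the sketch recommends re-labeling before touching the rubric.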
Step 3: tighten the rubric
Each disagreement category gets one sentence. Iteration 1 of the seed scenario:
Does the response respect every dietary restriction the user stated earlier
in the conversation? Restrictions include allergies (peanut, tree-nut,
shellfish, gluten, soy, egg, dairy), kosher, halal, vegetarian, vegan,
and keto.
A response FAILS if any ingredient (including garnishes, sauces, sides, or
seasoning) violates a stated restriction. A response also FAILS if it
acknowledges the restriction in prose but suggests an ingredient that
violates it (this counts as ignoring, not partial credit).
A response that refuses to suggest anything because the restriction makes
the request infeasible PASSES if the refusal is correct (e.g., the user
asked for a meal that cannot exist under their restrictions). It FAILS if
the refusal is unwarranted.
A response that says "I do not know" or is empty FAILS.
Re-run with iteration 1:
python calibration.py --dataset labeled.csv --rubric rubric.txt --iteration 1
Typical second-iteration shape:
Iteration 1: rubric chars=864, few_shots=4, dev_n=80
dev TPR=0.955 TNR=0.882 agreement=0.925 kappa=0.806
TNR jumped from 0.41 to 0.88, kappa from 0.47 to 0.81. The intervention was prose, not a model swap.
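To compare two iterations without eyeballing the logs, the per-metric change can be computed from the two summaries. This is a small sketch using the sample numbers printed above; `metric_deltas` is a hypothetical helper, and in practice the dictionaries would come from each run's `summary.json`.

```python
def metric_deltas(before: dict, after: dict,
                  keys: tuple = ("tpr", "tnr", "agreement", "kappa")) -> dict:
    """Per-metric change between two iterations, rounded to three decimals."""
    return {k: round(after[k] - before[k], 3) for k in keys}


iter_0 = {"tpr": 0.974, "tnr": 0.412, "agreement": 0.838, "kappa": 0.469}
iter_1 = {"tpr": 0.955, "tnr": 0.882, "agreement": 0.925, "kappa": 0.806}
print(metric_deltas(iter_0, iter_1))
```

In this sample TPR slips by about two points while TNR gains 47. A tightened rubric can trade a little of one rate for the other, so it is worth watching both.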
Step 4: when prose stops moving the numbers
You will hit a plateau around iteration two or three. The remaining disagreements are usually edge cases the rubric language cannot describe; this is where few-shot examples help. Add two PASS examples that look like FAIL on a careless read, and two FAIL examples that look like PASS on a careless read. Re-run.
If that does not move TNR another five points, the model is the ceiling. Try a stronger judge model (Sonnet 4.5 → Opus, or GPT-4o → o3) before changing the rubric again. The bigger-than-generator rule applies: the judge has to be at least as capable as the model it grades [3].
Step 5: read test once
Once dev TPR and TNR both clear 0.90 and kappa is above 0.6, run on the test set:
python calibration.py --dataset test_set.csv --rubric rubric.txt --iteration final
If the test numbers are within five points of dev, the judge is calibrated. If they drift by more than five points, the dev set was not representative and you need more labeled data [4].
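The five-point check is simple enough to write down as code. The sketch below is one possible reading of it, comparing TPR and TNR only; `drift_report` is a hypothetical helper and the numbers in the example are made up for illustration.

```python
def drift_report(dev: dict, test: dict, tolerance: float = 0.05,
                 keys: tuple = ("tpr", "tnr")) -> dict:
    """Absolute dev-to-test gap per metric, and whether every gap is within tolerance."""
    gaps = {k: round(abs(dev[k] - test[k]), 3) for k in keys}
    return {"gaps": gaps, "calibrated": all(g <= tolerance for g in gaps.values())}


# Made-up numbers: TPR holds up on test, TNR drifts by seven points.
dev_summary = {"tpr": 0.955, "tnr": 0.91}
test_summary = {"tpr": 0.94, "tnr": 0.84}
print(drift_report(dev_summary, test_summary))
```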
Why agreement is not the headline
A judge that always says PASS on a 90:10 dataset gets 0.90 agreement. That is the trap. TPR and TNR are conditional rates: each is computed within one class, so neither collapses when the classes are imbalanced. Cohen's kappa adjusts for chance agreement; on imbalanced data, kappa is what tells you whether agreement is meaningfully above the always-say-PASS baseline [5].
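The always-PASS trap is easy to reproduce by hand. The sketch below computes the four metrics in plain Python for a 90:10 dataset and a judge that never says FAIL; `binary_metrics` is a hypothetical helper written for this example, with kappa computed from the standard formula (observed agreement minus chance agreement, divided by one minus chance agreement).

```python
import math


def binary_metrics(labels: list[str], preds: list[str], positive: str = "PASS") -> dict:
    """Agreement, TPR, TNR, and Cohen's kappa for a two-label judge."""
    pairs = list(zip(labels, preds))
    if not pairs:
        raise ValueError("no examples to score")
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    tn = sum(1 for y, p in pairs if y != positive and p != positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    n = len(pairs)
    agreement = (tp + tn) / n
    # Chance agreement: both say PASS by chance, plus both say FAIL by chance.
    p_true = (tp + fn) / n
    p_pred = (tp + fp) / n
    p_chance = p_true * p_pred + (1 - p_true) * (1 - p_pred)
    kappa = (agreement - p_chance) / (1 - p_chance) if p_chance < 1 else math.nan
    return {
        "agreement": agreement,
        "tpr": tp / (tp + fn) if tp + fn else math.nan,
        "tnr": tn / (tn + fp) if tn + fp else math.nan,
        "kappa": kappa,
    }


labels = ["PASS"] * 90 + ["FAIL"] * 10  # 90:10 class balance
preds = ["PASS"] * 100                  # judge that always says PASS
print(binary_metrics(labels, preds))
```

Agreement comes out at 0.90 while TNR and kappa are both 0.0, which is why agreement alone cannot be the headline.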
| Metric | What it tells you | When it fails |
|---|---|---|
| Agreement | Overall fraction correct | Trivial on imbalanced data |
| TPR | Recall on PASS class | High if judge over-PASSes |
| TNR | Recall on FAIL class | High if judge over-FAILs |
| Cohen's kappa | Agreement adjusted for chance | Misleading on near-zero base rates |
| Pearson r | Useful for ordinal scales | Not for binary |
For binary judges, the headline is the pair (TPR, TNR). For pairwise judges, the headline is preference agreement with humans, computed across position-swapped pairs [5].
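For the pairwise case, one way to compute agreement across position-swapped pairs is sketched below. The record layout (`human`, `judge_ab`, `judge_ba`) and the function name are hypothetical, and treating a verdict that flips with position as a miss is an assumption of this sketch, not a rule taken from the cited sources.

```python
def swapped_pair_agreement(items: list[dict]) -> dict:
    """Agreement with the human preference, counting only position-consistent verdicts.

    Each item holds the human pick and the judge's pick for both presentation
    orders, already mapped back to candidate ids ("A" or "B").
    """
    if not items:
        raise ValueError("no pairs to score")
    consistent = [it for it in items if it["judge_ab"] == it["judge_ba"]]
    agree = sum(1 for it in consistent if it["judge_ab"] == it["human"])
    return {
        "consistency": len(consistent) / len(items),
        # Assumption: a verdict that flips with position counts as a miss.
        "agreement": agree / len(items),
    }


pairs = [
    {"human": "A", "judge_ab": "A", "judge_ba": "A"},  # consistent, agrees
    {"human": "B", "judge_ab": "B", "judge_ba": "B"},  # consistent, agrees
    {"human": "A", "judge_ab": "A", "judge_ba": "B"},  # flips with position
    {"human": "B", "judge_ab": "A", "judge_ba": "A"},  # consistent, disagrees
]
print(swapped_pair_agreement(pairs))
```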
Sample size and error bars
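At 100 dev examples, the 95% Wilson confidence interval on TNR 0.90 is roughly [0.82, 0.95]; a five-point change between iterations is inside that interval [6]. At 200 examples the interval narrows to roughly [0.85, 0.94]. These widths assume that many FAIL-labeled examples, because TNR's denominator is the FAIL class alone; on an imbalanced dev set the interval is wider. The pragmatic floor is 100 labels; the honest floor is 200 [7].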
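The interval itself takes a few lines to compute. The sketch below implements the standard Wilson score interval with only the standard library; `wilson_interval` is a hypothetical helper name, and `z=1.96` corresponds to a 95% interval.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion of successes out of n trials."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# TNR of 0.90 measured on 100 and on 200 FAIL-labeled examples.
for n in (100, 200):
    low, high = wilson_interval(round(0.90 * n), n)
    print(f"n={n}: [{low:.3f}, {high:.3f}]")
```

For TNR, pass the number of correctly judged FAIL rows as `successes` and the total number of FAIL-labeled rows as `n`; for TPR, use the PASS class instead.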
For the test set specifically, do not read it more than once per project. Every re-read is a peek at the held-out distribution and inflates your reported numbers. Treat the test number the way you would treat a release-gate decision: one read, one number, written down.
Productionize the loop
Once calibrated, the judge prompt and rubric go into version control. The test number gets pinned to the rubric version. Future PRs that touch the prompt have to re-run the calibration script and beat the pinned test number.
Wire it into CI by reading the labeled CSV from a fixture path and asserting TPR and TNR thresholds. The production CI integration chapter covers the tiering pattern (smoke vs full vs nightly) for keeping judge calls affordable on every PR.
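A minimal sketch of the CI gate follows. Rather than calling the judge inside the test, it assumes an earlier CI step has already run the calibration script against the fixture CSV and written a `summary.json`; the path, the floor values, and both function names are hypothetical.

```python
import json
from pathlib import Path

FLOORS = {"tpr": 0.90, "tnr": 0.90}  # thresholds used in this recipe
SUMMARY_PATH = Path("calibration_runs/iter_final/summary.json")  # hypothetical path


def threshold_failures(summary: dict, floors: dict = FLOORS) -> list[str]:
    """Return one message per metric that falls below its floor."""
    return [
        f"{name}={summary[name]:.3f} is below {floor:.2f}"
        for name, floor in floors.items()
        if not summary[name] >= floor  # written this way so NaN also fails
    ]


def assert_judge_calibrated(summary_path: Path = SUMMARY_PATH) -> None:
    """Load a run summary and fail loudly if any metric misses its floor."""
    summary = json.loads(summary_path.read_text())
    failures = threshold_failures(summary)
    assert not failures, "; ".join(failures)
```

Calling `assert_judge_calibrated()` from a test function in your CI suite turns a regression in either rate into a failed build.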
What to do next
Add error bars once your labeled set is bigger than 200. If your domain is RAG-shaped, the RAG eval recipe shows how the two-layer split (retrieval vs generation) interacts with judge calibration. The biases page covers what to do about position, verbosity, and self-preference bias once you start running pairwise comparisons.