LLM-as-Judge

Judge prompt structure, calibration against humans, biases and mitigations.

A judge is a second LLM call that scores the first one. It is the cheapest way to scale a rubric past what one human reviewer can read. It is also the technique most teams get wrong, because they treat it as a vibes machine instead of a classifier with measurable error rates against a human gold set.

The seven chapters here are opinionated. We use binary pass-fail with a written critique, not 1-to-5 Likert. We pick a judge model larger than the generator. We measure true positive rate (TPR) and true negative rate (TNR), not agreement percentage. We swap answer positions on pairwise calls. And we do not let a judge near hard reasoning, math, or code correctness without an executable verifier behind it. For condensed references on the techniques side, see LLM-as-Judge, rubric design, and pairwise comparison and Elo.
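To see why TPR and TNR say more than agreement percentage, here is a minimal sketch in Python. The function name `judge_rates` and the 90/10 gold set are hypothetical, chosen only to illustrate the point: a judge that passes everything can still post a high agreement number on an imbalanced set.

```python
from typing import Sequence


def judge_rates(human: Sequence[bool], judge: Sequence[bool]) -> dict:
    """Compare judge verdicts with human gold labels (True = pass)."""
    if len(human) != len(judge):
        raise ValueError("label lists must be the same length")
    # True positives: human says pass and judge says pass.
    tp = sum(h and j for h, j in zip(human, judge))
    # True negatives: human says fail and judge says fail.
    tn = sum((not h) and (not j) for h, j in zip(human, judge))
    pos = sum(human)
    neg = len(human) - pos
    return {
        "agreement": (tp + tn) / len(human),
        "tpr": tp / pos if pos else float("nan"),
        "tnr": tn / neg if neg else float("nan"),
    }


# Hypothetical gold set: 90 passes, 10 fails. The judge passes everything.
human = [True] * 90 + [False] * 10
judge = [True] * 100
rates = judge_rates(human, judge)
print(rates)  # agreement 0.9, tpr 1.0, tnr 0.0
```

Agreement looks healthy at 90 percent, yet the judge never catches a single failure. Reporting the two rates separately makes that visible.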

Chapters:

Introduction. What an LLM judge is, when it works, when it does not, and the three biases everyone hits.
The four-part judge prompt. A template you can copy: role and definitions, criteria, few-shot examples from your train set, output format.
Calibrating your judge against humans. The train, dev, and test split, TPR and TNR over agreement percent, the 100-example minimum, and the iteration loop.
Biases you must defuse. Position, verbosity, self-preference, format sensitivity, sycophancy. Each with the empirical evidence and the mitigation.
Pairwise vs direct scoring. When relative comparison beats absolute scores, and the decision rule for choosing between them.
Distilled judges. Small specialized judge models, when they pay for themselves, and the cost math against frontier judges.
Where LLM judges fail. The JudgeBench result that GPT-4o is roughly random on hard correctness judgments, and what that means for your pipeline.
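
The position swap on pairwise calls can be sketched as below. This is an illustration under stated assumptions, not the guide's own implementation: the `Judge` signature, the "first"/"second" return values, and the choice to report disagreement between the two orderings as "inconsistent" are all hypothetical. In practice the judge callable would wrap an LLM call; two stub judges stand in for it here.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer)
# -> "first" or "second". A real judge would wrap an LLM call.
Judge = Callable[[str, str, str], str]


def pairwise_with_swap(judge: Judge, question: str, a: str, b: str) -> str:
    """Return "A", "B", or "inconsistent" after judging both orderings."""
    first_pass = judge(question, a, b)   # answer A shown first
    second_pass = judge(question, b, a)  # answer B shown first
    winner_1 = "A" if first_pass == "first" else "B"
    winner_2 = "B" if second_pass == "first" else "A"
    # Only trust a verdict that survives the swap.
    return winner_1 if winner_1 == winner_2 else "inconsistent"


def always_first(question: str, x: str, y: str) -> str:
    """Stub judge with pure position bias: always picks the first slot."""
    return "first"


def prefers_longer(question: str, x: str, y: str) -> str:
    """Stub judge that is position-independent (it picks the longer answer)."""
    return "first" if len(x) >= len(y) else "second"


biased = pairwise_with_swap(always_first, "q", "short", "a longer answer")
stable = pairwise_with_swap(prefers_longer, "q", "short", "a longer answer")
print(biased, stable)  # inconsistent B
```

The swap doubles the cost of each comparison. What it buys is that a judge driven only by position can no longer produce a winner, so the rate of "inconsistent" results doubles as a rough measure of position bias.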

The honest take page is the one you should not skip. Strong judges agree with humans well above 80 percent of the time on MT-Bench-style preference data. They are roughly random on JudgeBench-style hard reasoning. The gap matters: you cannot generalize from one to the other.