LLM-as-Judge
A judge is a second LLM call that scores the first one. It is the cheapest way to scale a rubric past what one human reviewer can read. It is also the technique most teams get wrong, because they treat it as a vibes machine instead of a classifier with measurable error rates against a human gold set.
The five chapters here are opinionated. We use binary pass-fail with a written critique, not 1-to-5 Likert. We pick a judge model larger than the generator. We measure TPR and TNR, not agreement percentage. We swap positions on pairwise calls. And we do not let a judge near hard reasoning, math, or code correctness without an executable verifier behind it.
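To see why TPR and TNR tell you more than agreement percentage, here is a minimal sketch. The function name `judge_rates` and the labels are made up for illustration; they are not from a real gold set. It assumes binary labels where `True` means pass, and a gold set that is mostly passes, scored by a judge that passes everything.

```python
from typing import Sequence


def judge_rates(human: Sequence[bool], judge: Sequence[bool]) -> dict[str, float]:
    """Compare judge verdicts with human gold labels (True = pass)."""
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")

    # True positives: human says pass and judge says pass.
    tp = sum(h and j for h, j in zip(human, judge))
    # True negatives: human says fail and judge says fail.
    tn = sum((not h) and (not j) for h, j in zip(human, judge))

    positives = sum(human)
    negatives = len(human) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("gold set needs at least one pass and one fail")

    return {
        "tpr": tp / positives,            # share of human passes the judge passes
        "tnr": tn / negatives,            # share of human fails the judge fails
        "agreement": (tp + tn) / len(human),
    }


# Hypothetical gold set: 90 passes, 10 fails. The judge passes everything.
human_labels = [True] * 90 + [False] * 10
judge_labels = [True] * 100

rates = judge_rates(human_labels, judge_labels)
print(rates)
```

On this made-up data the judge agrees with humans 90 percent of the time while catching none of the failures (TNR of 0). Agreement alone would hide that.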
Chapters:
- Introduction. What an LLM judge is, when it works, when it does not, and the three biases everyone hits.
- The four-part judge prompt. A template you can copy: role and definitions, criteria, few-shot examples from your train set, output format.
- Calibrating your judge against humans. Train, dev, test split, TPR and TNR over agreement percent, the 100-example minimum, iteration loop.
- Biases you must defuse. Position, verbosity, self-preference, format sensitivity, sycophancy. Each with the empirical evidence and the mitigation.
- Where LLM judges fail. The JudgeBench result that GPT-4o is roughly random on hard correctness judgments, and what that means for your pipeline.
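As a rough preview of the four-part structure named above (role and definitions, criteria, few-shot examples, output format), the prompt can be assembled as plain text. This is a sketch under assumptions, not the template from the chapter. The function `build_judge_prompt`, the section labels, and the example content are all hypothetical. The output format follows the binary pass-fail-with-critique choice stated earlier.

```python
from typing import Sequence, Tuple


def build_judge_prompt(
    role_and_definitions: str,
    criteria: Sequence[str],
    examples: Sequence[Tuple[str, str, str]],  # (output, critique, verdict)
    candidate_output: str,
) -> str:
    """Assemble a four-part judge prompt, then append the output to evaluate."""
    criteria_block = "\n".join(f"- {c}" for c in criteria)

    # Few-shot examples show the critique before the verdict, matching
    # the order the judge is asked to produce.
    example_block = "\n\n".join(
        f"Output: {out}\nCritique: {crit}\nVerdict: {verdict}"
        for out, crit, verdict in examples
    )

    # Assumed wording for the binary pass-fail format; adjust to taste.
    output_format = (
        "Write a short critique first, then a final line that is exactly "
        "'Verdict: PASS' or 'Verdict: FAIL'."
    )

    return "\n\n".join([
        role_and_definitions,
        "Criteria:\n" + criteria_block,
        "Examples:\n" + example_block,
        "Output format:\n" + output_format,
        "Output to evaluate:\n" + candidate_output,
    ])


# Hypothetical usage: a judge for support-ticket replies.
prompt = build_judge_prompt(
    role_and_definitions=(
        "You are grading support replies. A reply is 'grounded' if every "
        "claim it makes appears in the provided ticket."
    ),
    criteria=["The reply is grounded.", "The reply answers the question asked."],
    examples=[
        (
            "Your refund was issued on Monday.",
            "The ticket never mentions a refund date.",
            "FAIL",
        ),
    ],
    candidate_output="We have reset your password as requested.",
)
print(prompt)
```

The few-shot examples should come from the train split only, so that the dev and test splits stay clean for measuring the judge.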
The honest-take chapter, Where LLM judges fail, is the one you should not skip. Strong judges agree with humans well above 80 percent of the time on MT-Bench-style preference data. They are roughly random on JudgeBench-style hard reasoning. The gap matters: you cannot generalize from one to the other.