A judge is a second LLM call that grades the first one. You give it the input, the candidate output, a rubric, and (often) a reference. It returns a label and a critique. That is the whole technique 1. Everything that follows is discipline around that one call: how to write the prompt, how to calibrate the score against humans, which biases will derail you, and which classes of task you should never put a judge anywhere near.
When the technique works
Use a judge when the criterion is real but expensive to write code for. The cases that consistently calibrate above 0.90 TPR and 0.90 TNR against a principal expert are open-ended generation with no clean reference, RAG faithfulness against a retrieved context window, tone and style adherence, and dimension-by-dimension summarization quality 2 3. Strong frontier judges agree with human raters more than eighty percent of the time on MT-Bench style preference data, which is about the rate at which human raters agree with each other, so treat it as the practical ceiling 4.
Judges break on hard reasoning, math correctness, and code correctness when the task is non-trivial. JudgeBench, the Tan et al. 2024 benchmark explicitly built to test judges on hard correctness, finds that GPT-4o performs roughly at chance on reasoning, coding, and math judgments where ground truth exists but is non-trivial 5. The intuition is mechanical: if the judge model cannot solve the underlying problem, it cannot reliably tell you whether the candidate solved it. Use executable verification, not a judge, whenever the task admits a programmatic checker. Where a checker exists, the judge is at most a second signal alongside it.
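A minimal sketch of the verifier-first rule, assuming a hypothetical `grade` helper and a toy numeric checker; neither name comes from a real library, and the judge is passed in as a plain callable standing in for the LLM call.

```python
from typing import Callable, Optional


def numeric_checker(expected: float, tol: float = 1e-9) -> Callable[[str], bool]:
    """Build a programmatic checker for tasks whose answer is a single number."""
    def check(candidate: str) -> bool:
        try:
            return abs(float(candidate.strip()) - expected) <= tol
        except ValueError:
            # Anything that does not parse as a number is a failure.
            return False
    return check


def grade(
    candidate: str,
    checker: Optional[Callable[[str], bool]],
    judge: Callable[[str], str],
) -> dict:
    """Prefer the executable checker; fall back to the judge only without one."""
    if checker is not None:
        label = "PASS" if checker(candidate) else "FAIL"
        return {"label": label, "source": "verifier"}
    return {"label": judge(candidate), "source": "judge"}


if __name__ == "__main__":
    always_pass_judge = lambda text: "PASS"  # stand-in for an LLM judge call
    print(grade("42", numeric_checker(42.0), always_pass_judge))
    print(grade("a friendly reply", None, always_pass_judge))
```

Recording the `source` field keeps verifier results and judge results separable when you later compute calibration numbers.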
The four-part prompt
The structure that consistently calibrates uses four parts. Role and definitions to kill ambiguity, binary criteria to force a falsifiable cutoff, few-shot examples lifted from the train split to carry the boundary cases prose cannot articulate, and a strict JSON schema so the parser fails loudly when the judge does. The pattern is treated in depth on the judge prompt structure page; the headline rule is that all four parts are non-negotiable 1.
# Part 1. Role and definitions.
You are an expert {DOMAIN_ROLE} evaluating whether responses
adhere to {DOMAIN_CRITERION}.
DEFINITIONS: ...
# Part 2. Binary criteria.
PASS: response adheres including method and content.
FAIL: response violates {DOMAIN_CRITERION} in any element.
When in doubt, FAIL and explain why.
# Part 3. Three few-shot examples from the TRAIN split.
...
# Part 4. Strict output schema.
Return JSON: {"label": "PASS" or "FAIL", "critique": "..."}
Temperature zero. One judge model per rubric. Strict parse on the way out. Few-shot examples come from the train split, never the dev or test split, or your calibration numbers stop meaning anything.
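The "strict parse on the way out" step can be sketched as follows. This is an illustration under assumptions: `parse_judge_output` and `JudgeParseError` are hypothetical names, and the only schema enforced is the two-field JSON object from Part 4 of the prompt.

```python
import json

ALLOWED_LABELS = {"PASS", "FAIL"}


class JudgeParseError(ValueError):
    """Raised when the judge's raw output does not match the schema."""


def parse_judge_output(raw: str) -> dict:
    """Parse the judge's reply, failing loudly on any deviation from the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise JudgeParseError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise JudgeParseError("top-level value must be a JSON object")
    if set(data) != {"label", "critique"}:
        raise JudgeParseError(f"unexpected keys: {sorted(data)}")
    label = data["label"]
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        raise JudgeParseError(f"invalid label: {label!r}")
    critique = data["critique"]
    if not isinstance(critique, str) or not critique.strip():
        raise JudgeParseError("critique must be a non-empty string")
    return data


if __name__ == "__main__":
    print(parse_judge_output('{"label": "FAIL", "critique": "Cites a source not in context."}'))
```

The parser deliberately rejects near misses such as a lowercase `"pass"` or an extra key, so a judge that drifts from the schema shows up as an error count instead of a silently mislabeled trace.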
Calibration against humans
The judge is a classifier you wrote in English; you measure it like any classifier. Build a hundred-trace human-labeled set with the principal domain expert, split fifteen percent train, forty percent dev, forty-five percent test, stratified on the PASS or FAIL ratio so every split has both classes 6. The train split exists to pick few-shot examples. The dev split is your iteration target; you measure TPR and TNR against it and rewrite the prompt until both clear 0.90. The test split is the unbiased number you cite to your team, and you look at it exactly once before deploy.
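A sketch of the stratified 15/40/45 split, assuming each trace is a dict with a `label` key holding the human label; `stratified_split` is a hypothetical helper written for this example, not a library function.

```python
import random
from collections import defaultdict


def stratified_split(traces, fractions=(0.15, 0.40, 0.45), seed=0):
    """Split traces into train/dev/test, preserving the PASS/FAIL ratio in each.

    Only the train and dev fractions are used directly; test takes the remainder.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_label = defaultdict(list)
    for trace in traces:
        by_label[trace["label"]].append(trace)

    train, dev, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        if len(group) < 3:
            raise ValueError(f"need at least 3 traces labeled {label!r} to fill every split")
        rng.shuffle(group)
        n_train = max(1, round(len(group) * fractions[0]))
        n_dev = max(1, round(len(group) * fractions[1]))
        train += group[:n_train]
        dev += group[n_train:n_train + n_dev]
        test += group[n_train + n_dev:]
    return train, dev, test


if __name__ == "__main__":
    # Illustrative data: 100 traces, 20 of them human-labeled FAIL.
    traces = [{"id": i, "label": "FAIL" if i < 20 else "PASS"} for i in range(100)]
    train, dev, test = stratified_split(traces)
    print(len(train), len(dev), len(test))  # 15 40 45
```

Splitting per label and then concatenating is what guarantees every split contains both classes; a plain shuffle of all one hundred traces does not.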
Agreement percent is a trap on imbalanced data. A judge that says PASS for every trace agrees with the human reviewer ninety percent of the time when the base rate is ninety percent PASS, and it has captured zero failure signal. TPR (the rate at which the judge correctly flags failures the human flagged) and TNR (the rate at which the judge correctly clears passes the human cleared) are the metrics that survive imbalance.
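The always-PASS example above can be worked through in a few lines. This sketch follows the convention used in this section, where FAIL is the positive class; `tpr_tnr` is a hypothetical helper.

```python
def tpr_tnr(human, judge, positive="FAIL"):
    """Return (TPR, TNR, raw agreement) with `positive` as the flagged class."""
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must be the same length")
    pairs = list(zip(human, judge))
    tp = sum(h == positive and j == positive for h, j in pairs)
    fn = sum(h == positive and j != positive for h, j in pairs)
    tn = sum(h != positive and j != positive for h, j in pairs)
    fp = sum(h != positive and j == positive for h, j in pairs)
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    agreement = (tp + tn) / len(pairs)
    return tpr, tnr, agreement


if __name__ == "__main__":
    human = ["PASS"] * 90 + ["FAIL"] * 10  # ninety percent base rate of PASS
    judge = ["PASS"] * 100                 # a judge that never flags anything
    print(tpr_tnr(human, judge))           # (0.0, 1.0, 0.9)
```

Agreement reads 0.9 while TPR is 0.0, which is the whole argument for reporting the two rates separately.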
The recalibration cadence is monthly at minimum: rerun fifty fresh traces through the judge, have the same expert label them, compute TPR and TNR against those labels, and alert if either rate drifts below 0.85. A judge that calibrated cleanly six months ago has often drifted with the underlying model API.
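A sketch of the monthly check, assuming the fresh traces have already been labeled by both the expert and the judge. `recalibration_alerts` is a hypothetical helper; the 0.85 floor is the threshold named above, and FAIL is treated as the positive class.

```python
def recalibration_alerts(human, judge, floor=0.85, positive="FAIL"):
    """Return human-readable alerts if TPR or TNR on fresh traces is below the floor."""
    pairs = list(zip(human, judge))
    flagged = [j for h, j in pairs if h == positive]   # judge calls on human FAILs
    cleared = [j for h, j in pairs if h != positive]   # judge calls on human PASSes

    alerts = []
    if not flagged or not cleared:
        # With one class missing, one of the two rates cannot be measured at all.
        alerts.append("sample lacks one class; draw more traces before trusting the rates")
    if flagged:
        tpr = sum(j == positive for j in flagged) / len(flagged)
        if tpr < floor:
            alerts.append(f"TPR {tpr:.2f} below floor {floor:.2f}")
    if cleared:
        tnr = sum(j != positive for j in cleared) / len(cleared)
        if tnr < floor:
            alerts.append(f"TNR {tnr:.2f} below floor {floor:.2f}")
    return alerts


if __name__ == "__main__":
    human = ["PASS"] * 40 + ["FAIL"] * 10
    judge = ["PASS"] * 40 + ["FAIL"] * 8 + ["PASS"] * 2  # judge misses 2 of 10 failures
    print(recalibration_alerts(human, judge))
```

With only fifty traces the failure class is usually small, so a single missed failure moves TPR by several points; treat one alert as a prompt to label more traces, not as proof of drift.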
The three biases
Every team's first judge hits these in order. The mechanics are covered in detail on the biases page; the operational headlines are below.
Position bias. On pairwise judgments, the judge favors whichever response sits first. Zheng et al. report GPT-3.5 picks the first response roughly half the time but Claude v1 picks the first response around seventy percent of the time on identical content 4. The mitigation is to run every comparison in both orders and count a win only when both orderings pick the same response; anything else is a tie.
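The position-swap mitigation can be sketched as below. The `judge_pair` callable is an assumption standing in for the LLM call: it takes a query and two responses in display order and returns `"first"` or `"second"`.

```python
from typing import Callable


def swap_consistent_preference(
    judge_pair: Callable[[str, str, str], str],
    query: str,
    resp_a: str,
    resp_b: str,
) -> str:
    """Judge both orderings; return 'A' or 'B' only if they agree, else 'tie'."""
    forward = judge_pair(query, resp_a, resp_b)   # A shown first
    backward = judge_pair(query, resp_b, resp_a)  # B shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"  # the verdict flipped with position, so it carries no signal


if __name__ == "__main__":
    # Toy judges for illustration only.
    position_biased = lambda q, x, y: "first"
    prefers_longer = lambda q, x, y: "first" if len(x) >= len(y) else "second"

    print(swap_consistent_preference(position_biased, "q", "short", "a longer answer"))  # tie
    print(swap_consistent_preference(prefers_longer, "q", "short", "a longer answer"))   # B
```

A judge that always picks the first slot produces only ties under this scheme, which is the behavior you want: its preferences never reach the win counts.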
Verbosity bias. The judge prefers longer responses even when concision is closer to the rubric. Under the repetitive-list attack in Zheng et al., which pads an answer with rephrased items that add no substance, GPT-3.5 and Claude v1 prefer the padded answer in over ninety percent of cases, while GPT-4 resists it far more often 4. The mitigation is rubric language that scores conciseness explicitly, or a length-normalized pairwise setup.
Self-preference bias. A judge favors outputs from its own model family. Liu et al. 2023 flag this in the G-Eval results, where the LLM evaluator tends to score LLM-generated text above human-written text 3. The mitigation is to pick a judge from a different family than the generator, and to pick a judge that is bigger than the generator so the capability ceiling does not bound the score 1.
Format sensitivity is a separate problem worth flagging. Sclar et al. demonstrate up to seventy-six accuracy-point swings on LLaMA-2-13B from meaning-preserving format changes 7. For judges, the relevance is that schema or option-label changes can move the judge's calls by enough to ruin a calibration. Pin the schema. Pin the option labels. Re-run calibration if either changes.
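One way to make "pin the schema, pin the option labels" enforceable is to fingerprint them and compare against the fingerprint recorded at calibration time. This is a sketch under assumptions: `judge_fingerprint` and `needs_recalibration` are hypothetical helpers, and where you store the calibrated fingerprint is up to you.

```python
import hashlib
import json
from typing import Sequence


def judge_fingerprint(prompt_template: str, schema: dict, option_labels: Sequence[str]) -> str:
    """Hash everything whose change should invalidate a calibration."""
    payload = json.dumps(
        {
            "prompt": prompt_template,
            "schema": schema,
            # Label order is kept as-is: reordering options is itself a format change.
            "labels": list(option_labels),
        },
        sort_keys=True,       # dict key order must not change the hash
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_recalibration(current: str, calibrated: str) -> bool:
    """True when the live judge config differs from the calibrated one."""
    return current != calibrated


if __name__ == "__main__":
    schema = {"label": "PASS or FAIL", "critique": "string"}
    calibrated = judge_fingerprint("You are an expert ...", schema, ["PASS", "FAIL"])
    live = judge_fingerprint("You are an expert ...", schema, ["FAIL", "PASS"])
    print(needs_recalibration(live, calibrated))  # True: option order changed
```

Storing the fingerprint next to the TPR and TNR numbers ties each calibration result to the exact prompt, schema, and labels that produced it.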
Direct versus pairwise
Two operational modes; pick by task. Direct scoring asks the judge to label one response against a rubric (PASS or FAIL). Pairwise asks the judge to pick between two responses on the same query. Yan's rule: pairwise for subjective preference where humans cannot calibrate a scalar, direct for objective binary criteria where the rubric has a clean cutoff 2. Direct binary is cheaper (one judge call per trace, no position swap), gives a confusion matrix you can report, and forces the rubric author to write a falsifiable criterion. Pairwise scales better when ranking many systems, because new systems can be compared without re-labeling each one against the rubric, but it costs two judge calls per pair under the position-swap discipline 4. See pairwise and Elo for the Bradley-Terry mechanics and Chatbot Arena precedent.
Distillation in production
A frontier judge at one call per trace gets expensive at production volume. Distilled judges fine-tuned for the rubric are the practical answer. Prometheus and Prometheus 2 train an open 7B-13B model with reference-based critique data and reach Pearson 0.897 against humans on customized rubrics 8 9. JudgeLM ships a 7B model that judges 5K samples in three minutes on 8 A100s, with swap augmentation to defuse position bias and reference support and reference drop to handle knowledge and format biases 10. Auto-J trains a 13B critique model on a generative-then-pairwise objective 11. The pattern is the same in every case: a small judge for hot-path inference, a frontier judge for periodic calibration, a human gold set as the truth signal. Vendor distilled judges (Galileo Luna, Patronus Lynx, Vectara HHEM) ship the same architecture.
Where judges still fail
Three failure modes are worth tracking even when calibration looks clean.
JudgeBench reasoning gap. GPT-4o is roughly random on hard reasoning judgments; multi-step proofs, non-trivial code correctness, and multi-step math correctness are still cases where the judge does not have the capability to grade 5. Use a verifier.
Judge-of-judges issues. Thakur et al. show that when LLM judges are used at the meta level (judging the judges) the same biases compound and self-preference becomes severe 12. Avoid recursive judging without a human gold set in the chain.
Criteria drift. Shankar et al. demonstrate that the human's notion of pass shifts as they look at more outputs, which means the rubric you wrote on day one does not match the rubric you would write on day thirty 13. The fix is a strict schema (so parser failures are visible) plus rubric versioning (so old labels stay legible).
What good looks like
A production judge program has all five of the following, or it is not production. A frozen rubric versioned with the code that uses it. A held-out human-labeled dev and test set of at least one hundred traces, with Cohen's kappa or Krippendorff's alpha of at least 0.80 between two raters. TPR and TNR both above 0.90 on the test set, measured before deploy. Position-swap discipline on every pairwise comparison. A monthly recalibration cadence with an alert if judge-versus-human drift exceeds five points.
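For the inter-rater requirement, Cohen's kappa for two raters can be computed directly from the label lists. This is a minimal sketch with a hypothetical `cohens_kappa` helper and made-up labels; Krippendorff's alpha is not covered here.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of traces")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Illustrative labels: the raters disagree on 4 of 100 traces.
    rater_a = ["PASS"] * 80 + ["FAIL"] * 20
    rater_b = ["PASS"] * 78 + ["FAIL"] * 2 + ["PASS"] * 2 + ["FAIL"] * 18
    print(round(cohens_kappa(rater_a, rater_b), 3))  # 0.875
```

In this example raw agreement is 0.96 but kappa is 0.875, because with an 80/20 base rate two raters would agree 68 percent of the time by chance alone.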
The cheapest first artifact is a fifty-trace labeled set, a one-page rubric, and a half-day of expert labeling. From there you extend the labeled set to one hundred traces, split it into train, dev, and test, write the judge prompt, calibrate against dev, then go look at the test number exactly once.