LLM-as-Judge: the practical introduction

What an LLM judge is, when to use one, when not to, and the three pitfalls that show up on every team's first attempt.

A judge is a second LLM call that grades the first one. You give it the input, the candidate output, a rubric, and (sometimes) a reference. It returns a label and a critique. That is the whole technique.

It works because a strong frontier model on a well-scoped rubric agrees with human raters somewhere above eighty percent of the time on open-ended generation ¹. It breaks down on hard reasoning judgments, where the same frontier model is roughly at chance ². Most of the failure modes between those two poles trace back to the same three biases.

What an LLM judge actually is

A judge is a classifier you wrote in English instead of training in scikit-learn. The training signal is a small human-labeled set; the inference cost is one extra API call per trace; the output is a binary label plus a free-form critique you can read ³.

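# Minimum viable judge. Assumes the openai>=1.0 Python SDK.
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
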
def judge(query: str, response: str, rubric: str) -> dict:
    prompt = f"""You evaluate whether a response satisfies the rubric.

RUBRIC:
{rubric}

QUERY:
{query}

RESPONSE:
{response}

Return JSON: {{"label": "PASS" or "FAIL", "critique": "..."}}"""
    raw = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(raw.choices[0].message.content)

You measure the judge against a held-out human-labeled set the same way you would measure any classifier: true positive rate, true negative rate, confusion matrix. Agreement percent is a trap on imbalanced data; we cover that on the calibration page.
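To make the "agreement is a trap" point concrete, here is a minimal sketch that computes TPR, TNR, and raw agreement from two label lists. The function name `judge_metrics` and the 90/10 label split are illustrative assumptions, not part of any library or of the calibration page.

```python
def judge_metrics(human_labels, judge_labels):
    """Compare judge labels to human labels, treating PASS as the positive class."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be the same length")
    tp = fn = fp = tn = 0
    for human, judged in zip(human_labels, judge_labels):
        if human == "PASS" and judged == "PASS":
            tp += 1
        elif human == "PASS":
            fn += 1
        elif judged == "PASS":
            fp += 1
        else:
            tn += 1
    total = tp + fn + fp + tn
    return {
        "tpr": tp / (tp + fn) if (tp + fn) else None,
        "tnr": tn / (tn + fp) if (tn + fp) else None,
        "agreement": (tp + tn) / total if total else None,
        "confusion": {"tp": tp, "fn": fn, "fp": fp, "tn": tn},
    }


# Hypothetical imbalanced set: 90 PASS, 10 FAIL. The judge says PASS for everything.
human = ["PASS"] * 90 + ["FAIL"] * 10
lazy_judge = ["PASS"] * 100
metrics = judge_metrics(human, lazy_judge)
print(metrics)
```

In this made-up case the judge agrees with humans on ninety percent of traces while catching none of the failures, which is why TPR and TNR are reported separately.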

When to use a judge

Use a judge when the criterion is real but expensive to write code for. The four cases that almost always work:

Open-ended generation where a reference is impossible (creative writing, brainstorming, suggested replies) ⁴.
RAG faithfulness, where the question is "does the response stick to what is in the retrieved context."
Tone and style adherence (refund-policy compliance, professional register, brand voice).
Summarization quality on dimensions humans agree on (coverage, conciseness, faithfulness to source) ⁵.

In all four cases, a calibrated binary judge beats a Likert rubric, beats ROUGE-style reference metrics, and beats the vendor's pre-baked "quality score" you cannot inspect.

When not to use a judge

The hard cases. JudgeBench, the Tan et al. 2024 benchmark explicitly built to test judges on hard correctness, finds that GPT-4o performs roughly at chance on reasoning, coding, and math judgments where ground truth exists but is non-trivial ². The reason is straightforward: if the judge model cannot solve the underlying problem, it cannot reliably tell you whether the candidate solved it.

flowchart TD
    A[Task] --> B{Verifiable<br/>by code?}
    B -->|yes| C[Executable verification<br/>unit tests, regex, SQL exec, schema]
    B -->|no| D{Hard reasoning,<br/>math, or code?}
    D -->|yes| E[LLM judge as<br/>secondary signal only<br/>cite JudgeBench]
    D -->|no| F{Reference<br/>available?}
    F -->|yes| G[Hybrid: ref metric<br/>plus calibrated judge]
    F -->|no| H[Calibrated LLM judge<br/>binary + critique]

The rule: never let an LLM judge be the sole verifier on a task where ground truth is hard to compute but a programmatic check exists. Use executable verification first; use the judge as a second signal ². If you can run a unit test, run the unit test.
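One way to encode that rule is a wrapper in which the programmatic check, when one exists, decides the label and the judge only contributes a critique. This is a sketch under assumptions: `verify`, `is_iso_date`, and `lenient_judge` are hypothetical names, and the ISO-date check stands in for whatever unit test, regex, or schema validation your task allows.

```python
import re
from typing import Callable, Optional

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def is_iso_date(response: str) -> bool:
    """Example programmatic check: the response must be a YYYY-MM-DD string."""
    return ISO_DATE.match(response.strip()) is not None


def verify(
    response: str,
    programmatic_check: Optional[Callable[[str], bool]],
    judge_fn: Callable[[str], dict],
) -> dict:
    """Let code decide the label when a check exists; the judge is a second signal."""
    verdict = judge_fn(response)
    if programmatic_check is None:
        return {"label": verdict["label"], "decided_by": "judge",
                "critique": verdict["critique"]}
    passed = programmatic_check(response)
    return {"label": "PASS" if passed else "FAIL", "decided_by": "code",
            "critique": verdict["critique"]}


def lenient_judge(response: str) -> dict:
    """Stand-in for a real judge call; it passes everything."""
    return {"label": "PASS", "critique": "Looks like a date."}


result = verify("next Tuesday", is_iso_date, lenient_judge)
print(result)
```

Here the stub judge would have passed "next Tuesday", but the executable check overrules it; the critique is still kept for a human to read.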

The three biases

Every team's first judge hits these in order. The mechanics are covered in detail on the biases page; the headline numbers are below.

Position bias. When you ask the judge to pick between A and B, it favors whichever response sits first. Zheng et al. report GPT-3.5 picks the first response roughly half the time, while Claude v1 picks the first response around seventy percent of the time on identical content ¹. The mitigation is to run the comparison in both orderings and count a win only when the judge picks the same response both times.
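The swap discipline can be sketched as follows. `pick_fn` is an assumed interface (it returns `"first"` or `"second"`), and the two stub judges exist only to show the behavior; a real `pick_fn` would wrap a pairwise judge prompt.

```python
from typing import Callable


def pairwise_with_swap(
    query: str,
    resp_a: str,
    resp_b: str,
    pick_fn: Callable[[str, str, str], str],
) -> str:
    """Run the comparison in both orders; declare a winner only if both agree."""
    forward = pick_fn(query, resp_a, resp_b)   # A shown first
    backward = pick_fn(query, resp_b, resp_a)  # B shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    return "TIE"  # the verdict flipped with the ordering, so it is not counted


def always_first(query: str, first: str, second: str) -> str:
    """Stub judge with pure position bias."""
    return "first"


def prefers_shorter(query: str, first: str, second: str) -> str:
    """Stub judge whose choice depends on content, not position."""
    return "first" if len(first) <= len(second) else "second"


biased = pairwise_with_swap("q", "short", "a much longer answer", always_first)
stable = pairwise_with_swap("q", "short", "a much longer answer", prefers_shorter)
print(biased, stable)
```

The position-biased stub ends in a tie because its verdict flips with the ordering, while the content-driven stub returns the same winner both ways.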

Verbosity bias. The judge prefers longer responses even when concision is closer to the rubric. Zheng et al. pad answers with repetitive, substance-free content and find that GPT-3.5 and Claude v1 prefer the padded answer in over ninety percent of cases, while GPT-4 is fooled far less often ¹. The mitigation is rubric language that scores conciseness explicitly, or a length-normalized pairwise setup.
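A quick diagnostic, offered as a sketch rather than an established metric, is to measure how often the longer response wins across your own pairwise verdicts. The record layout (`a`, `b`, `winner`) and the function name `longer_win_rate` are assumptions made for this example.

```python
def longer_win_rate(records):
    """Fraction of decided comparisons in which the longer response won.

    Ties and equal-length pairs are skipped because they carry no signal
    about length preference.
    """
    decided = 0
    longer_won = 0
    for rec in records:
        len_a, len_b = len(rec["a"]), len(rec["b"])
        if len_a == len_b or rec["winner"] not in ("A", "B"):
            continue
        decided += 1
        longer_is_a = len_a > len_b
        if (rec["winner"] == "A") == longer_is_a:
            longer_won += 1
    return longer_won / decided if decided else None


# Hypothetical verdicts from a pairwise judge.
records = [
    {"a": "Yes.", "b": "Yes, and here is a long restatement of the question.", "winner": "B"},
    {"a": "A detailed but padded reply that repeats itself.", "b": "No.", "winner": "A"},
    {"a": "Refunds take 5 days.", "b": "Refunds take five business days to arrive.", "winner": "A"},
    {"a": "same", "b": "size", "winner": "A"},  # equal length, skipped
]
rate = longer_win_rate(records)
print(rate)
```

The number is only meaningful next to the same rate computed on human preferences for the same pairs; a judge that sits well above the human rate is a candidate for the conciseness language described above.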

Self-preference bias. A judge favors outputs from its own model family. Liu et al. flag this in the G-Eval results, noting that LLM-based evaluators can favor LLM-generated text ⁵. The mitigation is to pick a judge from a different family than the generator, and to prefer a judge that is bigger than the generator so the capability ceiling does not bound the score.

The bigger-than-generator rule

If your generator is GPT-4o, judge with GPT-4o or stronger. If your generator is Llama 3.1 70B, do not judge with Llama 3.1 8B. The judge has to be at least as capable as the model it grades or it will systematically miscall the hard cases ³. This is a hard rule, not a guideline; teams that violate it almost always blame "judge instability" when the real problem is that the judge cannot solve the task.

The exception is a distilled judge fine-tuned for the specific rubric. Prometheus 13B can match GPT-4 on custom rubrics it was trained on ⁶. We cover the distillation play on the distilled-judges page in the techniques cluster.

Format sensitivity is a different problem

Sclar et al. demonstrate up to seventy-six accuracy-point swings on LLaMA-2-13B from meaning-preserving format changes ⁷. The relevance to judging is that "the rubric stays the same; only the output schema or option labels change" can move the judge's calls by enough to ruin a calibration. Pin the schema. Pin the option labels. Re-run calibration if either changes.
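One way to enforce "pin and re-run" is to fingerprint everything the calibration depends on and compare it against the fingerprint stored with the last calibration run. This is a sketch; `judge_fingerprint` and the example schema are hypothetical, and only standard-library calls are used.

```python
import hashlib
import json


def judge_fingerprint(rubric: str, output_schema: dict, option_labels) -> str:
    """Hash the rubric, output schema, and option labels into one stable ID.

    Label order is kept as given, since reordering the options is itself a
    format change.
    """
    payload = json.dumps(
        {"rubric": rubric, "schema": output_schema, "labels": list(option_labels)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


rubric = "The response must only state facts found in the retrieved context."
schema = {"label": "string", "critique": "string"}

calibrated = judge_fingerprint(rubric, schema, ["PASS", "FAIL"])
unchanged = judge_fingerprint(rubric, schema, ["PASS", "FAIL"])
relabeled = judge_fingerprint(rubric, schema, ["A", "B"])

needs_recalibration = relabeled != calibrated
print(needs_recalibration)
```

Storing the fingerprint next to the calibration numbers lets a CI check refuse to report TPR and TNR for a judge whose schema or labels have changed since they were measured.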

CITATION

Zheng et al. 2023, "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," NeurIPS 2023 Datasets and Benchmarks. The canonical paper; introduces position, verbosity, and self-preference bias formally, reports GPT-4 reaching above eighty percent agreement with human raters on MT-Bench. arxiv.org/abs/2306.05685

Direct vs pairwise

Two operational modes; pick by task. Direct scoring asks the judge to label one response against a rubric (PASS or FAIL). Pairwise asks the judge to pick between two responses on the same query. Yan's rule is the one we use: pairwise for subjective preference where humans cannot calibrate a scalar, direct for objective binary criteria where the rubric has a clean cutoff ⁴.

Direct binary is cheaper (one judge call per trace, no position swap needed), gives you a confusion matrix you can report directly, and forces the rubric author to write a falsifiable criterion. Pairwise scales when you have many systems to rank without re-labeling each, but it costs two judge calls (position swap) and produces a ranking, not an absolute score ¹. The Chatbot Arena methodology uses pairwise at crowdsource scale; that is the right call when "which is better" is the only question that matters.

For a typical product team starting from zero, the default is direct binary on a single dimension your business actually cares about (faithfulness, refund-policy compliance, tone). Pairwise comes later when you are comparing two prompt versions head-to-head.

What good looks like

A production judge program has all five of the following, or it is not production:

A frozen rubric versioned with the code that uses it.
A held-out human-labeled dev and test set of at least 100 traces, kappa or alpha at least 0.80 between two raters ⁸.
TPR and TNR both above 0.90 on the test set, measured before deploy.
Position-swap discipline on every pairwise comparison.
A monthly recalibration cadence and an alert if the judge-vs-human agreement on a fresh slice drifts.
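The numeric items in that checklist can be checked mechanically. The sketch below computes Cohen's kappa for two raters and applies the thresholds stated above; `cohens_kappa` and `ready_for_production` are illustrative names, and the rater labels are made up.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("raters must label the same non-empty set of traces")
    labels = set(rater_a) | set(rater_b)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n) for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def ready_for_production(n_traces, kappa, tpr, tnr):
    """Apply the checklist thresholds: N >= 100, kappa >= 0.80, TPR and TNR > 0.90."""
    return n_traces >= 100 and kappa >= 0.80 and tpr > 0.90 and tnr > 0.90


# Hypothetical labels: the raters disagree on 5 of 100 traces.
rater_a = ["PASS"] * 50 + ["FAIL"] * 50
rater_b = ["PASS"] * 45 + ["FAIL"] * 5 + ["FAIL"] * 50
kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 3), ready_for_production(100, kappa, tpr=0.93, tnr=0.91))
```

The rubric versioning, position-swap discipline, and recalibration cadence are process items and are not covered by this check.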

The cheapest first artifact is a fifty-trace labeled set, a one-page rubric, and a half-day of human labeling. From there you walk the rubric to a hundred labels, split train and dev, write the judge prompt, calibrate, then go look at the test number once.

A note on Likert

The literature uses 1-to-5 and 1-to-10 Likert scales because academic eval traditions inherited them from BLEU-era summarization work. Practitioners should not. Binary plus critique wins on three dimensions Hamel articulated clearly: lower required N for the same statistical power, higher inter-rater agreement, and a forcing function that makes the rubric author specify exactly what failure looks like ³. A judge that returns "4 out of 5" is also a judge that ducks the question; you cannot ship a release-gate decision on the difference between a 3 and a 4.

The exception is multi-dimensional grading where you genuinely need to attribute failures across orthogonal axes (faithfulness, conciseness, helpfulness, tone). Even there, the right shape is multiple independent binaries, not a multi-axis Likert.
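The "multiple independent binaries" shape can be sketched as one judge call per dimension, each with its own rubric and its own PASS or FAIL. `judge_fn` is assumed to have the same signature as the `judge` function earlier on this page; `stub_judge` and the rubric texts are placeholders so the example runs without an API call.

```python
from typing import Callable, Dict


def grade_dimensions(
    query: str,
    response: str,
    rubrics: Dict[str, str],
    judge_fn: Callable[[str, str, str], dict],
) -> Dict[str, dict]:
    """Run one independent binary judgment per dimension."""
    return {name: judge_fn(query, response, rubric) for name, rubric in rubrics.items()}


def failed_dimensions(results: Dict[str, dict]) -> list:
    """Names of the dimensions that failed, for attribution."""
    return sorted(name for name, verdict in results.items() if verdict["label"] == "FAIL")


def stub_judge(query: str, response: str, rubric: str) -> dict:
    """Placeholder judge: fails only the conciseness rubric on long responses."""
    if "concise" in rubric.lower() and len(response.split()) > 10:
        return {"label": "FAIL", "critique": "Response exceeds the length the rubric allows."}
    return {"label": "PASS", "critique": "No violation found."}


RUBRICS = {
    "faithfulness": "Every claim must be supported by the retrieved context.",
    "conciseness": "The response must be concise: ten words or fewer.",
    "tone": "The response must use a professional register.",
}

results = grade_dimensions(
    "When will my refund arrive?",
    "Your refund will arrive within five business days, and we are very sorry "
    "for the delay you experienced.",
    RUBRICS,
    stub_judge,
)
failed = failed_dimensions(results)
print(failed)
```

Each dimension keeps its own label and critique, so each can be calibrated against human labels separately and a failure is attributed to a named axis rather than folded into one score.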

The next page shows how to write the prompt that gets you there. After that, the calibration page shows how to measure it.

¹ Zheng et al., MT-Bench / Chatbot Arena, NeurIPS 2023 D&B.
² Tan et al., JudgeBench, ICLR 2025.
³ Hamel Husain, "Creating an LLM-as-a-Judge That Drives Business Results."
⁴ Eugene Yan, "LLM Evaluators."
⁵ Liu et al., G-Eval, EMNLP 2023.
⁶ Kim et al., Prometheus, ICLR 2024.
⁷ Sclar et al., Quantifying LM Sensitivity, ICLR 2024.
⁸ Hamel Husain & Shreya Shankar, "LLM Evals FAQ."