The four-part judge prompt

A copyable template: role and definitions, criteria, few-shot from your train set, output format. Plus the runnable invocation.

A judge prompt is a classifier specification written in English. The structure that consistently calibrates to human raters above 0.90 TPR and TNR has four parts: a role with explicit definitions, the binary criteria, two or three few-shot examples lifted from your train split, and a strict output schema ¹. Skip any of the four and you spend the next week debugging confusion.

Why these four parts

Each part fixes one failure mode that shows up if you skip it.

Role and definitions kill ambiguity. "Vegan" means different things across product domains; you write the operative definition once so the judge stops guessing.
Criteria force binary thinking. "PASS if X; FAIL if Y." No 1-to-5 scale, no "rate the helpfulness from 1 to 10." Binary lowers the required sample size, raises inter-rater agreement, and forces you to be specific about what "wrong" actually means ².
Few-shot examples carry the boundary cases your prose cannot articulate. Lift them from your training split, never the dev or test split. The train split exists for exactly this purpose.
Output format locks the schema you parse downstream. Use JSON with fixed keys in a fixed order so a strict parse, not a fuzzy string search, validates the response.

The fourth part is non-negotiable. Shankar et al. describe "criteria drift": the human's notion of pass shifts as they look at more outputs ³. Free-form judge output makes that drift harder to catch, because a lenient parsing layer silently misclassifies edge cases. A strict schema makes the parser fail loudly whenever the judge's output is noisy.
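To make the fail-loud idea concrete, here is a minimal sketch of a strict verdict parser. The function name `parse_verdict` and the exact checks are our assumptions for illustration; the key names and their order are read off the output schema used on this page.

```python
import json

ALLOWED_LABELS = {"PASS", "FAIL"}
EXPECTED_KEYS = ["label", "critique"]  # assumed key order, matching the schema on this page


def parse_verdict(raw: str) -> dict:
    """Parse a judge reply; raise ValueError on any deviation from the schema."""
    try:
        parsed = json.loads(raw.strip())
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge returned non-JSON: {raw[:80]!r}") from exc
    # json.loads can also return a list, string, or number.
    if not isinstance(parsed, dict):
        raise ValueError(f"expected a JSON object, got {type(parsed).__name__}")
    # Dicts preserve insertion order, so this checks both the keys and their order.
    if list(parsed.keys()) != EXPECTED_KEYS:
        raise ValueError(f"unexpected keys or key order: {list(parsed.keys())}")
    label = parsed["label"]
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        raise ValueError(f"invalid label: {label!r}")
    critique = parsed["critique"]
    if not isinstance(critique, str) or not critique.strip():
        raise ValueError("critique must be a non-empty string")
    return parsed
```

A bare "PASS", a lowercase "pass", or swapped keys all raise here instead of being quietly coerced into a label.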

The template

Below is the canonical four-part structure, lifted from the dietary-restriction example and generalized. Drop your domain definitions, criteria, and three train-split examples into the placeholders.

# Part 1. Role and domain definitions.
You are an expert {DOMAIN_ROLE} evaluating whether responses
properly adhere to {DOMAIN_CRITERION}.

DEFINITIONS:
- {TERM_1}: {EXACT_OPERATIVE_DEFINITION_1}
- {TERM_2}: {EXACT_OPERATIVE_DEFINITION_2}
- {TERM_3}: {EXACT_OPERATIVE_DEFINITION_3}
[... all terms the rubric depends on ...]

# Part 2. Evaluation criteria, binary.
EVALUATION CRITERIA:
- PASS: The response clearly adheres to {DOMAIN_CRITERION},
  including both explicit content AND implicit method.
- FAIL: The response contains any element that violates
  {DOMAIN_CRITERION}, even if the rest of the response is correct.
- When in doubt, FAIL and explain why.

# Part 3. Few-shot examples from the train split.
Example 1 (PASS):
Query: "{TRAIN_EXAMPLE_1_QUERY}"
Response: {TRAIN_EXAMPLE_1_RESPONSE}
Explanation: {WHY_THIS_PASSES}
Label: PASS

Example 2 (FAIL):
Query: "{TRAIN_EXAMPLE_2_QUERY}"
Response: {TRAIN_EXAMPLE_2_RESPONSE}
Explanation: {WHY_THIS_FAILS}
Label: FAIL

Example 3 (FAIL, edge case):
Query: "{TRAIN_EXAMPLE_3_QUERY}"
Response: {TRAIN_EXAMPLE_3_RESPONSE}
Explanation: {WHY_THIS_LOOKS_LIKE_PASS_BUT_FAILS}
Label: FAIL

# Part 4. Output schema, strict.
Now evaluate the following:
Query: {{query}}
{DOMAIN_CRITERION}: {{criterion}}
Response: {{response}}

Return JSON, no preamble, no postamble:
{{
  "label": "PASS" or "FAIL",
  "critique": "specific element(s) that drove the decision"
}}

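The placeholders in single braces are template variables you replace at authoring time. The placeholders in double braces are runtime substitutions you do per trace; rewrite them in whatever syntax your templating layer expects (the invocation below uses Python's string.Template, so {{query}} becomes $query in the template file). The doubled braces around the JSON object are escaped literal braces, not placeholders: the judge should see a single { and }.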
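The two substitution stages can be sketched as follows. This is an illustration under assumptions, not the pipeline's actual tooling: the helper names `fill_authoring` and `to_runtime_template` are hypothetical, and the brace conventions are inferred from the template above.

```python
import re
from string import Template

# Assumed conventions, read off the template above:
#   {UPPER_CASE}    authoring-time placeholder
#   {{lower_case}}  runtime placeholder
#   {{ ... }}       escaped literal braces around the JSON schema
AUTHORING = re.compile(r"(?<!\{)\{([A-Z0-9_]+)\}(?!\})")
RUNTIME = re.compile(r"\{\{([a-z_]+)\}\}")


def fill_authoring(template_text: str, values: dict[str, str]) -> str:
    """Fill {UPPER_CASE} placeholders once; a missing name raises KeyError."""
    return AUTHORING.sub(lambda m: values[m.group(1)], template_text)


def to_runtime_template(authored: str) -> Template:
    """Convert {{name}} to ${name} and unescape the doubled JSON braces."""
    text = authored.replace("$", "$$")  # keep literal dollar signs literal
    text = RUNTIME.sub(lambda m: "${" + m.group(1) + "}", text)
    text = text.replace("{{", "{").replace("}}", "}")
    return Template(text)


demo = (
    "You are an expert {DOMAIN_ROLE}.\n"
    "Query: {{query}}\n"
    "Return JSON:\n"
    "{{\n"
    '  "label": "PASS" or "FAIL"\n'
    "}}"
)
authored = fill_authoring(demo, {"DOMAIN_ROLE": "nutritionist"})
prompt = to_runtime_template(authored).substitute(query="Quick keto breakfast")
print(prompt)
```

Authoring happens once per rubric; only the `substitute` call runs per trace. Because `substitute` raises on a missing variable, a trace with no query fails before it ever reaches the judge.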

A worked example: dietary restrictions

The dietary-restriction judge from the seed study guide is a clean example because the domain has explicit, codifiable definitions and a real downstream consequence (someone with celiac disease cannot eat the gluten you missed).

You are an expert nutritionist and dietary specialist evaluating
whether recipe responses properly adhere to specified dietary
restrictions.

DIETARY RESTRICTION DEFINITIONS:
- Vegan: No animal products (meat, dairy, eggs, honey).
- Vegetarian: No meat or fish; dairy and eggs allowed.
- Gluten-free: No wheat, barley, rye, or other gluten-containing grains.
- Dairy-free: No milk, cheese, butter, yogurt.
- Keto: Very low carb (<20g net carbs), high fat, moderate protein.
- Paleo: No grains, legumes, dairy, refined sugar, processed foods.
- Raw vegan: No animal products AND no foods heated above 118 F (48 C).

EVALUATION CRITERIA:
- PASS: Recipe clearly adheres to the dietary restriction in both
  ingredients AND preparation method.
- FAIL: Recipe contains ingredients or methods that violate the
  dietary restriction. When ambiguous, FAIL and explain why.

Example 1 (PASS):
Query: "Gluten-free pizza dough that actually tastes good"
Response: [Recipe using a gluten-free flour blend, baking powder,
  olive oil, honey, apple cider vinegar.]
Explanation: All ingredients are gluten-free. Method introduces
  no gluten contamination.
Label: PASS

Example 2 (FAIL):
Query: "Raw vegan Mediterranean quinoa salad"
Response: [Recipe with cooked quinoa, vegetables, olive oil, lemon.]
Explanation: Raw vegan forbids foods heated above 118 F. Boiling
  quinoa exceeds that limit.
Label: FAIL

Example 3 (FAIL, edge case):
Query: "Quick keto breakfast"
Response: [Recipe using almond flour pancakes with maple syrup.]
Explanation: Maple syrup adds roughly 13g net carbs per tablespoon,
  so two tablespoons alone exceed the <20g keto budget.
Label: FAIL

Now evaluate the following:
Query: {{query}}
Dietary Restriction: {{restriction}}
Recipe Response: {{response}}

Return JSON, no preamble:
{{
  "label": "PASS" or "FAIL",
  "critique": "specific ingredient(s) or method that drove the call"
}}

The structure is the lesson; the domain is interchangeable. Customer support gets a refund-policy version. Code review gets a style-guide version. RAG faithfulness gets a "is this claim supported by the retrieved context" version.

Invocation

The judge is one API call per trace. Pin the model, pin temperature to zero, require JSON in the prompt itself (the call below sets no separate response-format parameter), and parse strictly. The boilerplate below is the production shape.

import json
import os
from pathlib import Path
from string import Template

from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# The template file uses string.Template syntax: $query, $restriction,
# $response, with single literal braces around the JSON schema.
JUDGE_TEMPLATE = Template(Path("dietary_judge.txt").read_text(encoding="utf-8"))

def run_judge(query: str, restriction: str, response: str) -> dict:
    prompt = JUDGE_TEMPLATE.substitute(
        query=query, restriction=restriction, response=response
    )
    result = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=512,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    raw = result.content[0].text.strip()
    # Strict parse. Bail loudly on schema drift.
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"judge returned non-JSON: {raw[:200]}") from e
    # json.loads can return a list or a string; only an object has .get().
    label = parsed.get("label") if isinstance(parsed, dict) else None
    if not isinstance(label, str) or label not in {"PASS", "FAIL"}:
        raise ValueError(f"judge label invalid: {parsed!r}")
    return parsed


if __name__ == "__main__":
    out = run_judge(
        query="Gluten-free pasta carbonara",
        restriction="gluten-free",
        response="Use store-bought brown rice spaghetti, pancetta, eggs, "
                 "pecorino, black pepper. Cook pasta per package directions.",
    )
    print(out)
    # Expected shape (critique wording will vary):
    # {'label': 'PASS', 'critique': '...'}

Three implementation notes that save real time:

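Temperature zero, always. A judge is a classifier; you do not want sampled variation. If you need a confidence signal and your provider exposes token logprobs, read them for the label token rather than re-rolling the dice. The Anthropic Messages API used above does not return logprobs at the time of writing, so with that client you only get the label itself.
Strict parser. A judge that returns plain text "PASS" because it forgot the JSON wrapper has to fail loudly, not silently. The parser is your contract.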
One judge model per rubric. Do not mix Claude and GPT calls on the same rubric without re-calibrating; cross-family agreement is not free.

Few-shot from the train split, never dev or test

The one mistake that ruins a calibration: pulling few-shot examples from the dev or test set. If the judge has seen the dev examples in its prompt, the dev metrics are not a fair signal anymore. The standard split for judge work is roughly fifteen percent train, forty percent dev, forty-five percent test, stratified on the PASS/FAIL ratio so every split has both classes ⁴.

The train split exists for one reason: pick few-shot examples. The dev split is the iteration target; the test split is the unbiased number you cite to your team. The calibration page walks the split discipline end to end.
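The split discipline above can be sketched with the standard library alone. This is a minimal illustration under assumptions: traces are dicts with hypothetical `id` and `label` keys, and the function names are ours, not taken from the calibration page.

```python
import random
from collections import defaultdict


def stratified_split(traces, label_key="label", fractions=(0.15, 0.40, 0.45), seed=0):
    """Split labeled traces into train/dev/test, stratified on the label."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_label = defaultdict(list)
    for trace in traces:
        by_label[trace[label_key]].append(trace)

    splits = {"train": [], "dev": [], "test": []}
    for label in sorted(by_label):
        group = list(by_label[label])
        if len(group) < 3:
            raise ValueError(f"need at least 3 traces labeled {label!r}")
        rng.shuffle(group)
        # At least one example of each class in train and dev; test gets the rest.
        n_train = max(1, round(len(group) * fractions[0]))
        n_dev = max(1, round(len(group) * fractions[1]))
        n_train = min(n_train, len(group) - 2)
        n_dev = min(n_dev, len(group) - n_train - 1)
        splits["train"] += group[:n_train]
        splits["dev"] += group[n_train:n_train + n_dev]
        splits["test"] += group[n_train + n_dev:]
    return splits


def check_few_shot_source(few_shot_ids, splits, id_key="id"):
    """Raise if any few-shot example was not drawn from the train split."""
    train_ids = {trace[id_key] for trace in splits["train"]}
    leaked = sorted(set(few_shot_ids) - train_ids)
    if leaked:
        raise ValueError(f"few-shot examples not in train split: {leaked}")
```

Running `check_few_shot_source` as part of building the prompt turns the "never dev or test" rule into a hard failure rather than a convention.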

CITATION

Hamel Husain, "Creating an LLM-as-a-Judge That Drives Business Results." The dietary-restriction case study, the four-part structure, the train/dev/test discipline, and the binary-not-Likert rule. The canonical practitioner reference. hamel.dev/blog/posts/llm-judge

What you do not need

You do not need a chain-of-thought block before the label. Liu et al. report CoT helps on G-Eval-style scoring ⁵, but the cost in tokens and latency rarely pays back on a well-scoped binary rubric. If a rubric is complex enough to need CoT, the rubric is doing too much; split it.

You do not need a custom rubric per persona. Build one rubric per business criterion, not per user segment.

You do not need a Likert scale. We covered this elsewhere; binary plus critique wins every comparison we have run on the seed and on Anthropic's eval cookbook examples ⁶.

What is next

Once the prompt is written, the judge is not deployed; it is a draft. The calibration page shows the train/dev/test loop and the sklearn snippet for TPR and TNR. The biases page shows the position-swap and self-preference fixes you bake into the harness.
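For orientation before you get there, TPR and TNR reduce to two ratios over paired human and judge labels. The sketch below is dependency-free and assumes PASS is the positive class; that choice, and the function name `tpr_tnr`, are our assumptions rather than something the calibration page fixes.

```python
def tpr_tnr(human, judge, positive="PASS"):
    """Return (TPR, TNR) of judge labels measured against human labels.

    TPR: share of human-positive traces the judge also marked positive.
    TNR: share of human-negative traces the judge also marked negative.
    """
    if len(human) != len(judge):
        raise ValueError("label lists must be the same length")
    pairs = list(zip(human, judge))
    n_pos = sum(h == positive for h, _ in pairs)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one trace of each class")
    true_pos = sum(h == positive and j == positive for h, j in pairs)
    true_neg = sum(h != positive and j != positive for h, j in pairs)
    return true_pos / n_pos, true_neg / n_neg


human_labels = ["PASS", "PASS", "PASS", "FAIL", "FAIL"]
judge_labels = ["PASS", "PASS", "FAIL", "FAIL", "PASS"]
tpr, tnr = tpr_tnr(human_labels, judge_labels)
print(f"TPR={tpr:.2f} TNR={tnr:.2f}")
```

Compute both on the dev split while you iterate on the prompt, and on the test split only once, for the number you report.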

¹ Hamel Husain, "Creating an LLM-as-a-Judge That Drives Business Results."
² Hamel Husain, "Your AI Product Needs Evals."
³ Shankar et al., "Who Validates the Validators?" UIST 2024.
⁴ Hamel Husain & Shreya Shankar, "LLM Evals FAQ."
⁵ Liu et al., G-Eval, EMNLP 2023.
⁶ Anthropic, eval cookbook.