Order matters. Ship the human gold set first (one hundred labeled traces minimum). Then a frontier judge that clears 0.9 TPR and TNR against it . Only then the distilled judge, calibrated against the same gold set, with the frontier judge demoted to the sampling layer. Teams that start with the cheap judge skip the...

Distilled judges

Q: What distillation buys?

Cost. A judge call is mostly input tokens (trace, rubric, few-shot examples) plus a short verdict. At frontier prices that is fractions of a cent per trace, which compounds into thousands of dollars per month at modest volume. The arithmetic is below; the ratio lands around 40x.

What a small fine-tuned judge buys you, what it silently costs, the seven models to know, and the arithmetic that justifies the swap.

A frontier judge on every production trace is fast to design and slow to operate. Tokens add up, p95 latency creeps, and the eval bill at one hundred thousand requests per day starts rivaling the generation bill. The standard production answer is distillation: train or adopt a small, task-specific judge, run it inline on every trace, and reserve the frontier judge for a sampled calibration batch anchored to a human gold set ¹.

What distillation buys

Cost. A judge call is mostly input tokens (trace, rubric, few-shot examples) plus a short verdict. At frontier prices that is fractions of a cent per trace, which compounds into thousands of dollars per month at modest volume. The arithmetic is below; the ratio lands around 40x.

Latency. An inline guardrail-style judge has to return inside the request budget. A frontier round-trip is seconds; a small judge on owned hardware returns in tens to hundreds of milliseconds, and an encoder-sized one in single-digit milliseconds on CPU.

Consistency and control. A distilled judge is a frozen artifact you version with your code. No provider deprecation breaks your calibration mid-quarter, no rate limit caps your backfill, and traces never leave your VPC. Provider quotas alone push many teams here ².

What distillation costs

A capability ceiling. Distillation transfers the teacher's labels, not capability the teacher never had. JudgeBench shows frontier judges performing barely above random on hard reasoning, coding, and math judgments ³; a 7B student inherits those errors and adds its own. On verifiable tasks, use executable verification and skip judges entirely.

Drift from the teacher. The student matches the teacher on the training distribution at distillation time. Your product, your users, and the attack surface keep moving; the student's agreement with the teacher decays silently. This is why the architecture below never runs a distilled judge without a frontier calibration layer on a sampled slice.

A fixed taxonomy. Most shipped judges are trained on a closed set of failure modes (groundedness, refusal, instruction-following, toxicity, PII, off-topic). Anything outside that set regresses toward a default score. A distilled judge tells you about the failures it was taught; it is structurally silent about novel ones.

The open-weights judges

Prometheus 2 is the strongest open generalist. Kim et al. trained 7B and 8x7B evaluators (Mistral-7B and Mixtral-8x7B bases) that handle both direct assessment and pairwise ranking under user-defined criteria in one set of weights, built by weight-merging a direct-assessment model and a pairwise model rather than joint training ⁴. Across four direct and four pairwise benchmarks it scores the highest human and proprietary-judge correlation among open evaluators.

The first Prometheus paper is still worth knowing for its headline result: a 13B model trained on the Feedback Collection (1K fine-grained rubrics, 20K instructions, 100K GPT-4-generated responses and critiques) reached 0.897 Pearson correlation with human raters on 45 custom rubrics, with GPT-4 itself at 0.882 ⁵. Reference answers and explicit rubrics in the prompt were the largest quality factors; that finding transfers to every judge you build.

JudgeLM is the throughput play. Zhu et al. fine-tuned judges at 7B, 13B, and 33B with bias-targeted augmentations (swap augmentation for position bias, reference support and reference drop for knowledge and format bias) and report agreement above 90 percent with their GPT-4 teacher, higher than human-to-human agreement on the same data ⁶.

The operational number a production team cares about: JudgeLM-7B judges 5,000 samples in 3 minutes on 8 A100s ⁶. That throughput makes full-corpus backfills and nightly regression sweeps practical in a way per-call frontier pricing never is.

Auto-J optimizes for inspectability. Li et al.'s 13B generative judge was trained on real user queries and responses across 58 scenarios, supports both pairwise comparison and single-response grading, and outputs structured natural-language critiques rather than bare verdicts ⁷.

The critique discipline matters more than it looks: a verdict without a critique cannot be audited without rerunning the trace, and critique text is what your error-analysis loop reads. Auto-J outperformed both open and closed competitors on its evaluation testbed by a wide margin.

PandaLM is the earliest of the line and the most narrowly scoped: a 7B judge built to pick the better of two instruction-tuned models, targeting subjective dimensions (conciseness, clarity, instruction adherence, comprehensiveness, formality) beyond raw correctness ⁸.

Its result is the proof of concept the others built on: a 7B student reaching 93.75 percent of GPT-3.5's evaluation ability and 88.28 percent of GPT-4's, measured by F1 on its human-annotated test set ⁸. Treat it as historical context and a baseline, not a current pick.

The shipped production judges

Galileo Luna / Luna-2 comes from the observation that judging is usually classification, so a 440M DeBERTa-large encoder fine-tuned for RAG hallucination detection beat GPT-3.5 on that task with 97 percent lower cost and 91 percent lower latency ⁹. Encoders cannot write critiques, but for a fixed binary taxonomy they are the cheapest credible verdict available.

Luna-2, the productized successor, moves to small decoder models with lightweight metric heads so one model serves many metrics. Galileo's sheet (vendor numbers, fetched 2026-06-10) lists $0.12 per million tokens against$ 5.00 for a frontier judge, with average verdict latency of 152 ms against 3.2 s ¹⁰. Vendor benchmarks deserve your skepticism, but the price and latency gap is the structural point, and it matches the open-literature numbers above.

Patronus Lynx is a hallucination judge fine-tuned from Llama-3-8B and 70B Instruct. On HaluBench, the 15K-sample faithfulness benchmark released with it, Lynx 70B scores 87.4 percent accuracy against GPT-4o's 86.5 ¹¹.

The interesting slice is domain transfer: on PubMedQA medical answers, Lynx 70B beats GPT-4o by 8.3 points at catching inaccurate responses ¹¹. A focused student can beat a general teacher inside a narrow domain; that is the strongest version of the distillation argument.

Vectara HHEM-2.1-Open is the minimalist extreme: a Flan-T5-base classifier that scores whether a hypothesis is supported by a premise, running in under 600 MB of RAM at about 1.5 s for a 2K-token input on an x86 CPU ¹². No GPU in the serving path at all.

On its model card benchmarks it beats GPT-3.5-Turbo by 4.4 to 18.1 balanced-accuracy points across AggreFact and RAGTruth slices and edges GPT-4 on the same data ¹². For a single fixed question ("is this grounded in that") asked millions of times, this is the right shape of model.

The cost math that justifies the swap

Stated assumptions, so you can re-run this with your own numbers: 1,200 input tokens per judgment (trace plus rubric plus few-shot examples), 150 output tokens of verdict and critique, 100,000 judged traces per day.

Frontier judge at a blended $5 per million tokens (the rate Galileo's own comparison sheet uses [^galileo-luna2]): 1,350 tokens x$ 5 / 1M = $0.00675 per judgment. That is$ 6.75 per 1,000 judgments, $675 per day, roughly$ 20,000 per month.
Hosted distilled judge at $0.12 per million tokens:$ 0.16 per 1,000 judgments, about $16 per day, under$ 500 per month. A 42x ratio.
Self-hosted cross-check from the JudgeLM throughput claim: 5,000 judgments per 3 minutes on 8 A100s is 0.08 A100-hours per 1,000 judgments ⁶. At an assumed $2 per A100-hour, that is also$ 0.16 per 1,000. Two independent routes land on the same order of magnitude.

If your frontier judge is half that price the ratio is still above 20x, and it widens again once you add the position swap on pairwise calls or judge multiple dimensions per trace. The eval bill is the lever; the gold set (below) is what keeps the cheap number honest.

The architecture: inline, calibration, gold

The pattern that holds up has three layers running at once ¹.

flowchart TD
    A[Every production trace] --> B[Inline distilled judge<br/>verdict + confidence, ms latency]
    B -->|high confidence| C[Log verdict]
    B -->|low confidence| D[Frontier judge]
    A -->|1-5% sample| D
    D --> E{Disagrees with<br/>inline judge?}
    E -->|yes| F[Human review queue<br/>update gold set]
    E -->|no| C
    F --> G[Retrain distilled judge]

The inline layer is the distilled judge on every trace, returning a verdict and a confidence score. The calibration layer is the frontier judge on a 1 to 5 percent sample plus every low-confidence flag; inline-vs-frontier disagreement is the drift signal that triggers a retrain. The gold set is the human-labeled baseline both judges are measured against. Without it you are comparing the student to the teacher and assuming the teacher is right; the gold set breaks that circularity ¹.

def judge_trace(trace, inline, frontier, threshold=0.7):
    """Inline distilled judge with frontier fallback on low confidence."""
    verdict, confidence = inline.score(trace)
    if confidence >= threshold:
        return {"verdict": verdict, "source": "inline", "confidence": confidence}
    # Cost is bounded by the fraction of traces below the threshold.
    frontier_verdict = frontier.score(trace)
    return {
        "verdict": frontier_verdict,
        "source": "frontier",
        "inline_disagreed": frontier_verdict != verdict,
        "confidence": confidence,
    }

The threshold is the cost lever. Tune it against the gold set: pick the value where the inline judge's error rate above the threshold drops below what the team accepts, then watch the disagreement rate per release.

When not to distill

Skip the distilled judge when failure modes sit outside any fixed taxonomy and novelty is the point (early-stage products still discovering their error modes through error analysis). Skip it on reasoning, math, and code correctness, where the teacher itself is unreliable ³ and executable checks are available. And skip it if you have no gold set: an inline judge calibrated only against its teacher drifts with it, and nobody notices until a bad release ships ².

What to ship first

Order matters. Ship the human gold set first (one hundred labeled traces minimum). Then a frontier judge that clears 0.9 TPR and TNR against it ¹. Only then the distilled judge, calibrated against the same gold set, with the frontier judge demoted to the sampling layer. Teams that start with the cheap judge skip the only step that makes its number mean anything.

Hamel Husain, "Creating an LLM-as-a-Judge That Drives Business Results." ↩ ↩² ↩³ ↩⁴
Hamel Husain & Shreya Shankar, "LLM Evals FAQ." ↩ ↩²
Tan et al., JudgeBench, ICLR 2025. ↩ ↩²
Kim et al., Prometheus 2, EMNLP 2024. ↩
Kim et al., Prometheus, ICLR 2024. ↩
Zhu et al., JudgeLM, ICLR 2025. ↩ ↩² ↩³
Li et al., Auto-J (Generative Judge for Evaluating Alignment), ICLR 2024. ↩
Wang et al., PandaLM, ICLR 2024. ↩ ↩²
Belyi et al., "Luna: An Evaluation Foundation Model" (2024). https://arxiv.org/abs/2406.00975 ↩
Galileo, Luna-2 evaluation SLMs (vendor page). https://galileo.ai/luna-2 ↩
Ravi et al., "Lynx: An Open Source Hallucination Evaluation Model" (2024). https://arxiv.org/abs/2407.08488 ↩ ↩²
Vectara, HHEM-2.1-Open model card. https://huggingface.co/vectara/hallucination_evaluation_model ↩ ↩²