The headline the community repeats is "GPT-4 reaches above eighty percent agreement with humans on MT-Bench" 1. The headline it rarely repeats comes from JudgeBench: GPT-4o performs roughly at chance on hard correctness judgments 2. Both are true. Generalizing from one to the other will cost you a release.
This page is the honest take. Strong judges work where they work, and the place they work is preference data over open-ended generation. The place they fail is hard correctness: reasoning, math, code, science. If you cannot solve the problem, you cannot reliably grade the solution.
What JudgeBench measures
JudgeBench (Tan et al. 2024) is a benchmark specifically constructed to test judges on hard correctness judgments. The construction takes existing benchmarks where ground truth is known (knowledge, reasoning, math, code) and turns each item into a pairwise judgment task: given a query and two candidate responses, pick the correct one 2.
The result reported in the paper: across model families, many strong judges score "just slightly better than random" on the reasoning, coding, and math splits (on a two-way choice, random is fifty percent). The headline number commonly quoted is GPT-4o around chance on these categories. The categories where you would most want a verifier are the categories where the verifier is least reliable.
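"Around chance" is something you can check on your own pairwise data rather than eyeball. The sketch below is not JudgeBench's scoring code; it assumes you already have the judge's pick and the known-correct side for each pair, and it uses an exact one-sided binomial tail against a fair coin. The function names and the sample verdicts are made up for illustration.

```python
from math import comb


def count_correct(picks, gold):
    """Number of pairs where the judge picked the side known to be correct."""
    if len(picks) != len(gold):
        raise ValueError("picks and gold must have the same length")
    return sum(p == g for p, g in zip(picks, gold))


def p_value_vs_chance(n_correct, n_total):
    """One-sided exact binomial tail: P(X >= n_correct) if the judge flips a fair coin."""
    if n_total == 0 or not 0 <= n_correct <= n_total:
        raise ValueError("need n_total > 0 and 0 <= n_correct <= n_total")
    tail = sum(comb(n_total, k) for k in range(n_correct, n_total + 1))
    return tail / 2 ** n_total


# Hypothetical verdicts on 20 pairs; "A"/"B" marks which response was picked.
gold = list("ABABABABABABABABABAB")
picks = list("ABBBABAAABABBBABAAAB")
n = count_correct(picks, gold)
p = p_value_vs_chance(n, len(gold))
print(f"{n}/{len(gold)} correct, p-value vs. coin flip = {p:.4f}")
```

A large p-value on a slice means you cannot distinguish the judge from a coin on that slice with the sample you have. With small samples that is the common outcome, so treat this as a sanity check, not a certificate.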
Thakur et al.'s "Judging the Judges" sharpens the same finding from a different angle: lenient judges can match a stronger judge's overall agreement number while being systematically wrong 3. The metric "agreement with humans" can mask "agreement with the same wrong answer humans pick." The strongest judges in the survey reach approximately sixty percent on the hard subset.
Why this happens
The mechanism is straightforward. To grade whether a candidate solution is correct, the judge has to internally check the solution. If the underlying problem is beyond the judge's capabilities, the judge cannot do the check. It falls back on heuristics: which response sounds more confident, which is longer, which uses more authoritative vocabulary. Those heuristics are exactly the biases we cover on the biases page, and they are why the position-swap and verbosity controls do not save you here. The bias is downstream of an upstream failure: the judge does not know which answer is right.
This connects directly to a related result. Lightman et al. ("Let's Verify Step by Step") show that on hard math reasoning, a process-reward model trained on per-step labels crushes outcome-only supervision; the verifier needs to check the steps, not just the answer 4. A general-purpose LLM judge is attempting the harder version of that job: it grades the outcome with no step labels and no domain training. It loses badly to a small specialized verifier.
flowchart TD
A[Task category] --> B{Judge model<br/>can solve it?}
B -->|no| C[Judge cannot verify.<br/>Use executable check,<br/>process-reward model,<br/>or human grader.]
B -->|yes, easily| D{Reference<br/>solution available?}
B -->|barely| E[Risky. Use LLM judge as<br/>secondary signal only.<br/>Required: human spot-check.]
D -->|yes| F[Reference-anchored<br/>LLM judge; cheap, reliable.]
D -->|no| G[Direct rubric judge;<br/>calibrate TPR/TNR.]
What this means for your pipeline
Three concrete rules, applied at the task level:
Hard reasoning, math, code correctness: never use an LLM judge alone. If you can run the candidate's code against unit tests, do that. If you can execute the SQL against a known schema, do that. If you can run the proof through a checker, do that. The LLM judge is a secondary signal at best, useful for catching style or process violations that the executable check ignores. The primary signal is the executable check 2.
Open-ended generation without executable ground truth: an LLM judge is fine once it is calibrated. This covers chat replies, summaries, brainstorming, tone, and RAG faithfulness, and it is the regime MT-Bench-style results apply to. Use the four-part prompt, follow the train/dev/test discipline, and reach 0.90 TPR and TNR before deploying.
The mixed case: route by task type. If your product spans both regimes (a coding assistant that also writes a friendly summary), do not run a single rubric. Route the coding output to executable verification and the summary output to the calibrated judge. They are different evaluators because the underlying tasks have different verifiability properties.
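The routing rule can be sketched as a small dispatch table. Everything here is an assumption for illustration: the `Segment` type, the `"code"` and `"summary"` labels, and both evaluator functions, which are placeholders standing in for a real test runner and a real calibrated judge call.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class Segment(NamedTuple):
    task_type: str  # hypothetical labels, e.g. "code" or "summary"
    content: str


Evaluator = Callable[[str], bool]


def evaluate_by_type(segments: List[Segment],
                     evaluators: Dict[str, Evaluator]) -> List[Tuple[str, bool]]:
    """Send each segment to the evaluator registered for its task type."""
    results = []
    for seg in segments:
        if seg.task_type not in evaluators:
            # Fail loudly rather than silently falling back to the LLM judge.
            raise KeyError(f"no evaluator registered for {seg.task_type!r}")
        results.append((seg.task_type, evaluators[seg.task_type](seg.content)))
    return results


def executable_check(code: str) -> bool:
    """Placeholder: only checks that the code parses. A real check runs tests."""
    try:
        compile(code, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def calibrated_judge(summary: str) -> bool:
    """Placeholder for a rubric-based LLM judge call."""
    return bool(summary.strip())


evaluators = {"code": executable_check, "summary": calibrated_judge}
reply = [Segment("code", "total = sum([1, 2, 3])"),
         Segment("summary", "Added a helper that sums the list.")]
print(evaluate_by_type(reply, evaluators))
```

The one design choice worth keeping from this sketch is the `KeyError`: an unrecognized task type should stop the pipeline, not default to the judge.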
A worked failure: ScienceAgentBench
Chen et al.'s ScienceAgentBench is a case study in the failure mode 5. The benchmark covers 102 scientific data-analysis tasks where the ground truth is a runnable script with known outputs. They report that LLM judges systematically over-credit incorrect code: the judge reads the code, sees plausible-looking logic, and rates it favorably even when the code fails to produce the correct output when executed.
The lesson: a judge that reads code and says "looks right" is approximately useless for correctness. The fix is the same as the JudgeBench takeaway: run the code, grade by the result of execution, not by the appearance of the code.
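A minimal sketch of "grade by the result of execution", assuming the candidate is a Python snippet and your ground truth is a set of assert-based tests. The helper name `passes_tests` and the median example are hypothetical. A subprocess is not a security sandbox, so run untrusted model output inside a container or similar isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(candidate_src: str, test_src: str, timeout_s: float = 10.0) -> bool:
    """Run candidate code plus assert-based tests in a fresh interpreter."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "check.py"
        script.write_text(candidate_src + "\n\n" + test_src, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True, timeout=timeout_s, cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False  # a hang counts as a failure
        return proc.returncode == 0  # any failed assert exits non-zero


# Reads fine, but is wrong for even-length input.
plausible_but_wrong = """
def median(xs):
    xs = sorted(xs)
    return xs[len(xs) // 2]
"""

correct = """
def median(xs):
    xs = sorted(xs)
    mid = len(xs) // 2
    return xs[mid] if len(xs) % 2 else (xs[mid - 1] + xs[mid]) / 2
"""

tests = """
assert median([1, 3, 2]) == 2
assert median([1, 2, 3, 4]) == 2.5
"""

print(passes_tests(plausible_but_wrong, tests), passes_tests(correct, tests))
```

The first candidate is the kind of code a reader waves through. The test run does not care how it reads.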
The MT-Bench result is not contradicted
To be precise about what JudgeBench does and does not show: it does not contradict the MT-Bench finding that strong judges reach above eighty percent agreement with humans on preference tasks. The two benchmarks measure different things. MT-Bench asks "which response do humans prefer." JudgeBench asks "which response is correct." Those are not the same question, and the judge's reliability differs sharply between them 1 2.
The community confusion comes from collapsing the two. Practitioners read the MT-Bench headline ("GPT-4 agrees with humans on most pairs") and infer "GPT-4 is a good judge." It is a good judge for preference. It is not a good judge for correctness.
CITATION
Tan et al. 2024, "JudgeBench: A Benchmark for Evaluating LLM-Based Judges," ICLR 2025. The empirical result that GPT-4o performs at roughly chance on hard reasoning, coding, and math judgments. The single most useful corrective to "just use an LLM judge." arxiv.org/abs/2410.12784
What to use instead, by task
When the judge fails, you have other tools.
- Math, formal logic. Lean, Isabelle, sympy. A theorem prover or a CAS verifies the answer.
- Code. Unit tests. The model writes the code; you run it. HumanEval, MBPP, LiveCodeBench all work this way. For trajectory-style scoring on agent code, add tool-call F1 separately.
- SQL. Execute against a sandbox schema. Spider and BIRD both work this way.
- Structured output. JSON schema validation. Regex on extractive fields.
- Math step-by-step. A process-reward model trained on step labels 4. Out of reach for most teams; the next best fallback is "run the candidate's working, check the intermediate values."
- Reasoning over a closed corpus. Atomic-fact decomposition plus per-fact retrieval (FActScore-style). The judge grades atomic claims it can check, not the whole answer.
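As one instance of the list above, here is a hedged sketch of the SQL case using the standard-library `sqlite3` module and an in-memory database. It is in the spirit of execution-based scoring, not the official Spider or BIRD scorer. The helper name, the schema, and the queries are invented for illustration, and rows are compared as an unordered multiset, which is an assumption you should drop when the task requires a specific ordering.

```python
import sqlite3
from collections import Counter


def same_result(schema_sql: str, candidate_sql: str, reference_sql: str) -> bool:
    """Execute both queries against a throwaway database and compare the rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)
        # Run the reference first so a destructive candidate cannot affect it.
        want = conn.execute(reference_sql).fetchall()
        try:
            got = conn.execute(candidate_sql).fetchall()
        except (sqlite3.Error, sqlite3.Warning):
            return False  # invalid SQL is simply a wrong answer
    finally:
        conn.close()
    return Counter(got) == Counter(want)  # order-insensitive comparison


schema = """
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
INSERT INTO orders VALUES (1, 'ada', 10.0), (2, 'ada', 5.0), (3, 'bob', 7.0);
"""
reference = "SELECT customer, SUM(total) FROM orders GROUP BY customer"
good = "SELECT customer, SUM(total) AS s FROM orders GROUP BY customer ORDER BY s DESC"
wrong = "SELECT customer, MAX(total) FROM orders GROUP BY customer"

print(same_result(schema, good, reference), same_result(schema, wrong, reference))
```

The `wrong` query is syntactically clean and looks like a reasonable aggregate. Only executing it shows that it returns a different number for one customer.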
What still requires human review
Some categories no automation reaches. Legal advice, medical decisions, novel research claims, creative work where the criterion is "good." For these, the eval pipeline is: code-based filter for obvious failures, LLM judge for the next layer of cheap signal, human expert review for the final call. The judge buys you scale on the second layer; it does not replace the expert on the third 6.
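The three layers can be sketched as a single triage function. This is one possible reading, with an explicit assumption: the judge's verdict is attached as a signal for ordering the expert queue, and nothing is approved without the expert. The function name, the status strings, and the stand-in filter and judge are all hypothetical.

```python
from typing import Callable, Dict, Optional


def triage(response: str,
           code_filter: Callable[[str], bool],
           judge: Callable[[str], bool]) -> Dict[str, Optional[object]]:
    """Layer 1 rejects obvious failures; layer 2 annotates; layer 3 is a human."""
    if not code_filter(response):
        return {"status": "rejected_by_filter", "judge_pass": None}
    # The judge never approves on its own; it only labels the item for the expert.
    return {"status": "expert_review", "judge_pass": judge(response)}


def basic_filter(text: str) -> bool:
    """Stand-in code-based filter: non-empty and under a length cap."""
    return 0 < len(text.strip()) <= 2000


def stub_judge(text: str) -> bool:
    """Stand-in for a calibrated LLM judge call."""
    return "consult a professional" in text.lower()


print(triage("", basic_filter, stub_judge))
print(triage("Please consult a professional before acting.", basic_filter, stub_judge))
```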
How to read your own results honestly
The JudgeBench finding produces a specific failure mode in practitioner reporting: teams report "judge agreement with humans" as a single number across all task categories in their product, and the number averages to something respectable like seventy-eight percent. The aggregate hides where the judge fails. Suppose the judge agrees with humans ninety percent of the time on the preference and tone slices and forty-five percent of the time on the reasoning slice. The average then tells you nothing about the reasoning slice, which is the one you care about.
The discipline is to report agreement (or TPR and TNR) sliced by task category, never as one number across a product that spans categories. The slice where the judge is at chance is the slice that needs an executable verifier or a human grader, not a recalibration.
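A minimal sketch of the sliced report, assuming each labeled example is a `(category, judge_pass, human_pass)` tuple with the human label treated as ground truth. The record layout, the function name, and the sample data are illustrative, not taken from any particular tool.

```python
from collections import defaultdict


def sliced_metrics(records):
    """Per-category agreement, TPR and TNR from (category, judge_pass, human_pass)."""
    counts = defaultdict(lambda: {"tp": 0, "tn": 0, "fp": 0, "fn": 0})
    for category, judge_pass, human_pass in records:
        if human_pass:
            key = "tp" if judge_pass else "fn"
        else:
            key = "fp" if judge_pass else "tn"
        counts[category][key] += 1

    report = {}
    for category, c in counts.items():
        pos = c["tp"] + c["fn"]  # human said PASS
        neg = c["tn"] + c["fp"]  # human said FAIL
        report[category] = {
            "n": pos + neg,
            "agreement": (c["tp"] + c["tn"]) / (pos + neg),
            "tpr": c["tp"] / pos if pos else None,  # undefined without positives
            "tnr": c["tn"] / neg if neg else None,  # undefined without negatives
        }
    return report


# Made-up labels: a healthy tone slice and a coin-flip reasoning slice.
records = (
    [("tone", True, True)] * 5 + [("tone", False, False)] * 4 + [("tone", True, False)]
    + [("reasoning", True, True)] * 3 + [("reasoning", False, True)] * 2
    + [("reasoning", True, False)] * 3 + [("reasoning", False, False)] * 2
)
for category, row in sliced_metrics(records).items():
    print(category, row)
```

Report `n` next to every slice. A slice with a handful of examples can look fine or terrible by accident.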
A useful field test on your own pipeline: take a sample of fifty traces where the judge said PASS and ground truth said FAIL (false positives). If most of them are in the hard-reasoning category, you have a JudgeBench-style problem; mitigate by routing those traces to a different verifier. If they are spread across categories, you have a calibration problem; revise the prompt.
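The field test can be sketched in a few lines. The trace fields (`judge`, `truth`, `category`), the `hard_reasoning` label, and the reading of "most" as more than half are all assumptions made for this example.

```python
from collections import Counter


def false_positive_breakdown(traces, hard_category="hard_reasoning", threshold=0.5):
    """Count judge-PASS / truth-FAIL traces by category and suggest a next step."""
    fps = [t for t in traces if t["judge"] == "PASS" and t["truth"] == "FAIL"]
    if not fps:
        return {"n": 0, "by_category": {}, "hard_share": None,
                "suggestion": "no false positives in sample"}
    by_category = Counter(t["category"] for t in fps)
    hard_share = by_category[hard_category] / len(fps)
    # "Most" is read as more than half; adjust the threshold to taste.
    suggestion = ("route hard-category traces to a different verifier"
                  if hard_share > threshold else "revise the judge prompt")
    return {"n": len(fps), "by_category": dict(by_category),
            "hard_share": hard_share, "suggestion": suggestion}


# Made-up sample: 50 false positives, 38 of them in the hard-reasoning slice.
sample = (
    [{"judge": "PASS", "truth": "FAIL", "category": "hard_reasoning"}] * 38
    + [{"judge": "PASS", "truth": "FAIL", "category": "summary"}] * 7
    + [{"judge": "PASS", "truth": "FAIL", "category": "tone"}] * 5
    + [{"judge": "PASS", "truth": "PASS", "category": "tone"}] * 20  # ignored
)
print(false_positive_breakdown(sample))
```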
What this means for benchmark consumption
Practitioners consume benchmarks too. When a vendor or a paper reports "our system reaches X percent on LLM-judged benchmark Y," check whether Y is a preference benchmark (MT-Bench, Arena-Hard, AlpacaEval) or a correctness benchmark (anything where ground truth is verifiable by execution or formal check). For preference benchmarks, the number is meaningful with caveats; for correctness benchmarks judged by an LLM, the number is approximately uninformative about whether the system actually got the answer right. The Tan et al. result is the citation you need when pushing back on an over-claim 2.
The takeaway
A strong LLM judge is a useful instrument with a real working range. Inside the range (preference, faithfulness, tone, calibrated binary rubrics on open-ended generation) it scales human review by ten or twenty times for a small fraction of the cost. Outside the range (hard reasoning, math, code, science correctness) it tells you nothing and confidently tells it to you in a critique 2 7.
The discipline is to know which side of the line your task sits on before you ship. JudgeBench gave us the numbers; the rest is reading them honestly.