Five biases show up in every production judge. Four have well-characterized empirical evidence and standard mitigations. The remaining one, format sensitivity, is more recently characterized and the mitigation is mostly discipline. This page is the punch list you walk before shipping.
1. Position bias
The judge prefers whichever response sits first (or last) in a pairwise comparison, regardless of content.
Evidence. Zheng et al. measure this directly on MT-Bench. GPT-3.5 picks the first response roughly half the time on identical content. Claude v1 picks the first response approximately seventy percent of the time, a strong directional skew [1]. GPT-4 is more balanced but not immune.
Mitigation: position swap. Call the judge twice, once with [A, B] and once with [B, A]. Only count a preference when both orderings agree. If they disagree, mark it a tie. This doubles the inference cost; it is non-negotiable on any pairwise harness used to gate a release [2].
flowchart LR
Q[Trace pair] --> AB[Judge call:<br/>A then B]
Q --> BA[Judge call:<br/>B then A]
AB --> P{Both<br/>agree?}
BA --> P
P -->|yes, A wins| A[Score: A]
P -->|yes, B wins| B[Score: B]
P -->|disagree| T[Score: tie]
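A minimal Python sketch of the position-swap rule in the flowchart. The `judge` callable and its `"first"` / `"second"` return values are hypothetical stand-ins for whatever your harness returns; this is an illustration under those assumptions, not a reference implementation.

```python
from typing import Callable

# Hypothetical judge interface: given (shown_first, shown_second),
# return "first" or "second". Anything else is treated as undecided.
Judge = Callable[[str, str], str]


def judge_with_swap(judge: Judge, a: str, b: str) -> str:
    """Judge the pair in both orderings; count a win only if both agree."""
    verdict_ab = judge(a, b)  # A shown first
    verdict_ba = judge(b, a)  # B shown first
    if verdict_ab == "first" and verdict_ba == "second":
        return "A"
    if verdict_ab == "second" and verdict_ba == "first":
        return "B"
    # Orderings disagree (or the judge was undecided): score a tie.
    return "tie"


# Toy judges for illustration only.
def always_first(x: str, y: str) -> str:
    return "first"


def prefers_longer(x: str, y: str) -> str:
    return "first" if len(x) >= len(y) else "second"
```

A judge that always picks the first slot collapses to `"tie"` on every pair under this rule, which is the point: a purely positional preference never gets counted as a win.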
A subtler variant: when you sample data for human labels, randomize the visible position too. Human raters carry their own position bias; if your gold set was always "candidate on the left, reference on the right," your calibration is correlated with the layout.
2. Verbosity bias
The judge prefers longer responses even when concision is closer to the rubric.
Evidence. Zheng et al. probe this on MT-Bench with a "repetitive list" attack, which pads an answer with rephrased items that add no new information. GPT-3.5 and Claude v1 prefer the padded answer in over ninety percent of cases; GPT-4 resists far better but is not immune [1]. The effect shows up across model families, though its strength varies sharply by judge.
Mitigations.
- Score conciseness in the rubric directly. Add "FAIL if the response includes information not required by the query" as a binary clause.
- Length-normalize the pairwise comparison. Truncate the longer candidate to the shorter one's length before showing both to the judge. The truncation is sometimes ugly; the calibration gain is worth it.
- Use a reference-anchored direct rubric instead of pairwise where you can. Prometheus showed reference grounding is the single largest factor in raising agreement with humans on customized rubrics [3].
The cleanest signal that verbosity is biting: your judge says the long response is better, your humans tied them, and the long response was padded with restated context. Sample fifty disagreements and read them.
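The length-normalization mitigation can be sketched in a few lines of Python. Measuring length in whitespace-separated words is an assumption made here for simplicity; the text only says "length", and a token count from your judge's tokenizer may be the better unit.

```python
def length_normalize(a: str, b: str) -> tuple[str, str]:
    """Cut the longer candidate down to the shorter one's word count.

    Assumption: length is measured in whitespace-separated words.
    Note that runs of whitespace are collapsed to single spaces.
    """
    words_a, words_b = a.split(), b.split()
    limit = min(len(words_a), len(words_b))
    return " ".join(words_a[:limit]), " ".join(words_b[:limit])
```

The shorter candidate passes through unchanged apart from whitespace; only the longer one loses its tail, so padding at the end of a response no longer reaches the judge.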
3. Self-preference bias
The judge favors outputs from its own model family.
Evidence. Liu et al. flag self-preference as one of the diagnostic problems with G-Eval style scoring: a model evaluating its own outputs ranks them systematically higher than third-party evaluators do [4]. Zheng et al. document the same effect on MT-Bench across multiple frontier models [1].
Mitigations.
- Cross-family judging. Generator is GPT-4o; judge with Claude. Generator is Claude; judge with GPT. The bias does not disappear, but the correlation with your generator's idiosyncratic style drops.
- Judge bigger than the generator. A judge weaker than the generator cannot reliably detect failure on the hard cases; that interacts badly with self-preference because the judge defers to the generator on anything it cannot solve itself [5].
- Distilled judge with bias-controlled training data. JudgeLM is fine-tuned with reference-support and reference-drop augmentation specifically to reduce self-preference and length bias [6]; Prometheus 2 trains a merged direct-plus-pairwise judge that controls for the same effect [3]. This is more work than swapping the API key; pay the cost only when cross-family judging is not enough.
4. Format sensitivity
The judge's labels move when meaning-preserving format changes happen in the prompt or candidate output. This is the bias most teams discover the hard way: they revise their rubric's option labels from "PASS / FAIL" to "1 / 0" for downstream parsing convenience, and their calibration tanks.
Evidence. Sclar et al. report swings of up to seventy-six accuracy points on LLaMA-2-13B from meaning-preserving format changes [7]. The effect persists at scale and correlates only weakly across models, which means you cannot extrapolate a calibration done on one schema to another. Anthropic's "Challenges in Evaluating AI Systems" reports that changing option labels from (A) to (1) shifts MMLU accuracy by roughly five points at frontier scale [8] (both the JudgeBench paper and the Anthropic write-up reference this regime).
Mitigations.
- Pin the schema. JSON keys, option labels, ordering of few-shot examples, and the order of rubric clauses are all part of the calibration. Changing them means rerunning calibration.
- Version the rubric. Treat the prompt as code; check it into git; include the prompt hash in the eval result so a stale calibration is detectable.
- Test the rubric on a known-good slice. When you do change format, run the new prompt on the calibration set first and confirm the metrics did not move before deploying.
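One way to make a stale calibration detectable, sketched in Python with the standard-library `hashlib`. The record layout and the function names (`prompt_hash`, `make_result`, `is_stale`) are hypothetical; the only idea taken from the text is that the prompt hash travels with every eval result.

```python
import hashlib


def prompt_hash(prompt: str) -> str:
    """Short, stable fingerprint of the exact judge prompt text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]


def make_result(prompt: str, trace_id: str, label: str) -> dict:
    """Attach the prompt fingerprint to each eval record (hypothetical schema)."""
    return {"trace_id": trace_id, "label": label, "prompt_hash": prompt_hash(prompt)}


def is_stale(result: dict, calibrated_hash: str) -> bool:
    """True when the record came from a prompt other than the calibrated one."""
    return result["prompt_hash"] != calibrated_hash


RUBRIC_V1 = "Answer PASS or FAIL. Criterion: the response cites a source."
RUBRIC_V2 = "Answer 1 or 0. Criterion: the response cites a source."
calibrated = prompt_hash(RUBRIC_V1)
```

Because the hash covers the full prompt text, even a whitespace or label change produces a different fingerprint, which matches the advice that every such change means rerunning calibration.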
5. Sycophancy
The judge tends to agree with whatever framing or claim the prompt itself encodes, especially when the prompt expresses a preference.
Evidence. Sharma et al. show sycophancy is baked into preference data and emerges across five frontier assistants on four free-form text-generation tasks [9]. Practical effect on judges: if your judge prompt contains language like "we generally expect the response to be helpful and detailed," the judge tilts toward calling things PASS to align with the stated expectation.
Mitigations.
- Neutral framing in the rubric. "Evaluate whether the response satisfies the criterion" beats "Evaluate whether the response is helpful and on-topic." The second sentence primes a positive bias.
- No anchor on the candidate's confidence. Strip "I am confident that..." or hedging tokens from the candidate before showing it to the judge if you can; the judge picks up on stated confidence and uses it as a signal.
- Adversarial test set. Build a small slice of traces where the candidate sounds confident but is wrong, and check that the judge catches them. Drop the judge if it does not.
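Stripping stated confidence from the candidate can be roughed out with a regular expression. The phrase list below is an illustrative assumption, not an exhaustive or validated set; a real harness would grow it from phrases observed in its own traces.

```python
import re

# Assumed phrase family: "I am / I'm (very|quite|fairly) confident|sure|certain that".
_CONFIDENCE = re.compile(
    r"\b(?:I am|I'm) (?:very |quite |fairly )?(?:confident|sure|certain) that\s+",
    flags=re.IGNORECASE,
)


def strip_confidence(text: str) -> str:
    """Remove stated-confidence lead-ins before the judge sees the candidate."""
    return _CONFIDENCE.sub("", text)
```

The sketch leaves the remaining sentence uncapitalized ("the answer is 42."), which is itself a small format change; if you adopt something like this, apply it to the calibration set as well so the judge is calibrated on the same preprocessing it sees in production.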
A composite mitigation: panels and ensembles
When the cost budget allows, run two or three judges in parallel and require agreement. A cross-family panel (Claude plus GPT-4o plus a distilled judge) catches more failures than any single member, and the disagreement set is a free signal for "interesting traces" you should send to human review. Thakur et al. note that ensembling is not a free win; you still need to calibrate the panel against humans, because two correlated judges can be wrong in the same direction [8].
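A small Python sketch of the require-agreement rule and the disagreement set. The input layout (trace id mapped to per-judge labels) and the judge names are hypothetical; unanimous agreement is used because the text says "require agreement", and a majority vote would be a different design.

```python
from typing import Optional


def panel_verdict(labels: dict[str, str]) -> Optional[str]:
    """Return the shared label if every judge agrees, else None."""
    distinct = set(labels.values())
    return distinct.pop() if len(distinct) == 1 else None


def split_for_review(panel_results: dict[str, dict[str, str]]):
    """Separate unanimous traces from those to send to human review."""
    agreed: dict[str, str] = {}
    needs_review: list[str] = []
    for trace_id, labels in panel_results.items():
        verdict = panel_verdict(labels)
        if verdict is None:
            needs_review.append(trace_id)
        else:
            agreed[trace_id] = verdict
    return agreed, needs_review


# Hypothetical panel output for two traces.
example = {
    "t-1": {"judge_a": "PASS", "judge_b": "PASS", "judge_c": "PASS"},
    "t-2": {"judge_a": "PASS", "judge_b": "FAIL", "judge_c": "PASS"},
}
```

Unanimity says nothing about correctness: the agreed set still has to be calibrated against human labels, since correlated judges can agree on the wrong answer.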
CITATION
Zheng et al. 2023, "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," NeurIPS 2023 D&B. The empirical source for position bias (Claude v1 ~70% first), verbosity bias (GPT-3.5 and Claude v1 prefer the padded answer in >90% of repetitive-list attack cases; GPT-4 resists far better), and self-preference. arxiv.org/abs/2306.05685
How to detect each bias on your own data
The mitigations above are operational; before you apply them you have to confirm the bias is biting. Each of the five has a cheap test.
Position bias check. Take fifty pairwise traces from your calibration set. Run each one twice with swapped order. Count how often the judge picks position one across all runs. The expected rate at no bias is around fifty percent. If the judge picks position one materially more than fifty-five percent of the time, position bias is large enough that the position-swap discipline is mandatory, not optional [1].
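The count described above reduces to one rate. In this sketch the verdict strings `"first"` and `"second"` are an assumed encoding of which slot the judge picked; ties or unparseable outputs are dropped from the denominator, which is also an assumption.

```python
def first_position_rate(verdicts: list[str]) -> float:
    """Share of decided judge calls that picked the first slot.

    `verdicts` holds one entry per judge call, with both orderings of
    every pair included (fifty pairs give one hundred entries).
    """
    decided = [v for v in verdicts if v in ("first", "second")]
    if not decided:
        raise ValueError("no decided verdicts to measure")
    return sum(v == "first" for v in decided) / len(decided)


# Illustrative data: 62 of 100 calls picked the first slot.
sample = ["first"] * 62 + ["second"] * 38
```

On the illustrative sample the rate is 0.62, above the fifty-five percent line, so the position-swap discipline would be mandatory for that judge.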
Verbosity bias check. Pull fifty traces where the two candidates differ by at least 2x in length but agree on substance (you can generate this slice synthetically by truncating one of two semantically equivalent answers). Run the judge. If it picks the longer answer in more than seventy percent of cases, you have a verbosity problem. Mitigate by rubric language or length normalization.
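A sketch of the verbosity check in Python. It assumes length is measured in characters, that every pair passed in already agrees on substance, and that the hypothetical `judge` callable returns `"first"` or `"second"`. The slot of the long answer alternates so position bias does not masquerade as verbosity bias; that alternation is a design choice of this sketch, not something the text prescribes.

```python
from typing import Callable


def longer_pick_rate(
    pairs: list[tuple[str, str]], judge: Callable[[str, str], str]
) -> float:
    """Fraction of 2x-length-gap pairs where the judge picks the longer answer."""
    picked_longer = 0
    total = 0
    for i, (a, b) in enumerate(pairs):
        short, long_ = sorted((a, b), key=len)
        if len(long_) < 2 * len(short):
            continue  # length gap below 2x: not part of this slice
        if i % 2 == 0:
            picked_longer += judge(long_, short) == "first"
        else:
            picked_longer += judge(short, long_) == "second"
        total += 1
    if total == 0:
        raise ValueError("no pairs with at least a 2x length gap")
    return picked_longer / total
```

A returned rate above 0.70 corresponds to the "verbosity problem" threshold stated above.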
Self-preference check. Generate responses to the same hundred queries from two different model families. Have your judge (call it model X) rate both. If the X-generated responses systematically rate higher than the non-X responses on equivalent traces (a rate gap of more than ten points on a calibrated rubric), self-preference is biting. Switch judge family.
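The rate gap can be computed as below. The sketch assumes a binary PASS/FAIL rubric and expresses the gap in percentage points; the label strings and the sample counts are illustrative only.

```python
def pass_rate(labels: list[str]) -> float:
    """Pass rate in percentage points for a list of PASS/FAIL labels."""
    if not labels:
        raise ValueError("empty label list")
    return 100.0 * sum(label == "PASS" for label in labels) / len(labels)


def self_preference_gap(own_labels: list[str], other_labels: list[str]) -> float:
    """Judge X's pass rate on X-generated responses minus its rate on the rest."""
    return pass_rate(own_labels) - pass_rate(other_labels)


# Illustrative labels from judge X over one hundred queries per family.
own = ["PASS"] * 80 + ["FAIL"] * 20
other = ["PASS"] * 65 + ["FAIL"] * 35
```

Here the gap is 15 points, over the ten-point line. The comparison is only meaningful if the two families' responses are of comparable quality on those queries, so it helps to confirm that with human labels on a subsample.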
Format sensitivity check. Take your final rubric. Make a meaning-preserving rewrite: change PASS/FAIL to YES/NO, or change the order of few-shot examples, or move the criteria block above the role block. Re-run on the calibration set. If TPR or TNR moves by more than three percentage points, your rubric is fragile to format; pin the original schema and version it [7].
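A Python sketch of the TPR/TNR comparison. It assumes human gold labels in PASS/FAIL form and that the reformatted prompt's outputs (for example YES/NO) have already been mapped back to PASS/FAIL before comparison; the function names are hypothetical.

```python
def tpr_tnr(gold: list[str], predicted: list[str]) -> tuple[float, float]:
    """True-positive and true-negative rates, treating PASS as positive."""
    positives = sum(g == "PASS" for g in gold)
    negatives = len(gold) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("calibration set needs both PASS and FAIL gold labels")
    tp = sum(g == "PASS" and p == "PASS" for g, p in zip(gold, predicted))
    tn = sum(g == "FAIL" and p == "FAIL" for g, p in zip(gold, predicted))
    return tp / positives, tn / negatives


def is_fragile(
    gold: list[str], before: list[str], after: list[str], threshold: float = 0.03
) -> bool:
    """True if either rate moves by more than `threshold` after the rewrite."""
    tpr_b, tnr_b = tpr_tnr(gold, before)
    tpr_a, tnr_a = tpr_tnr(gold, after)
    return abs(tpr_a - tpr_b) > threshold or abs(tnr_a - tnr_b) > threshold
```

The default `threshold=0.03` mirrors the three-percentage-point line above. On a small calibration set a single flipped label can exceed it, so read the result alongside the set size.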
Sycophancy check. Build a slice of fifty traces where the candidate sounds confident but is wrong (you can write these by hand or generate with adversarial prompting). Run the judge. If the judge calls PASS on more than ten percent of these confident-wrong cases, the rubric is biased toward stated confidence; neutralize the framing.
Checklist before you ship
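Before you deploy a judge to gate releases, confirm you have done all five:
- [ ] Position bias check run; position swap enforced on every pairwise comparison.
- [ ] Verbosity bias check run on a length-confounded slice.
- [ ] Self-preference check run against a second model family.
- [ ] Format sensitivity check run; schema pinned and rubric versioned.
- [ ] Sycophancy check run on a confident-but-wrong slice.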
If any box is empty, the judge metrics you computed on the calibration set are an upper bound on what you will see in production, not a fair estimate.
What is next
There is a sixth category of judge failure that no bias mitigation fixes: the cases where the judge cannot solve the underlying task and so cannot grade it. JudgeBench measures this directly, and the result is sobering. The JudgeBench honest take page covers when not to use a judge at all.