LLM-as-Judge papers

Building, calibrating, and stress-testing model judges. From G-Eval and MT-Bench through Prometheus 2 to JudgeBench.

Eleven papers covering the arc from G-Eval to JudgeBench. G-Eval and MT-Bench established that a strong LLM can rival a human rater. Prometheus, Auto-J, and JudgeLM showed that open models can do the job too. Shankar et al. surfaced criteria drift, where graders revise their criteria as they see more outputs, which is why so many judge pipelines stop working after launch. Judging the Judges and JudgeBench close the loop with the uncomfortable finding that judges fail hardest on reasoning, coding, and math, which is where you most want a verifier.