What "good" looks like
The primary metric is the pairwise preference rate against the last released version (does a human prefer the new model's response to the baseline's?), paired with the per-turn pass rate on a domain rubric. MT-Bench is the closest public reference: 80 multi-turn questions across 8 categories, scored by an LLM judge that reaches above 80% agreement with humans (the same level as inter-human agreement)[1]. Chatbot Arena is the gold standard but expensive to operate at scale; the LMSYS Bradley-Terry analysis is the citation-of-record for the methodology[2].
The framing that matters here, from Hamel: binary pass/fail with a short critique, not a 1-to-5 scale[3]. A 1-to-5 scale "is almost always a sign of a bad eval process" because annotators park on 3 and disagreement is hidden by aggregation. Use pass/fail.
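As a minimal sketch of the two primary metrics, the per-turn pass rate and the pairwise preference rate could be aggregated as below. The TurnVerdict record and both function names are hypothetical, and counting a tie as half a win is an assumption of this sketch, not something the cited sources prescribe.

```python
from dataclasses import dataclass


@dataclass
class TurnVerdict:
    """One judged assistant turn (hypothetical record layout)."""
    conversation_id: str
    turn_index: int
    passed: bool    # binary verdict, no 1-to-5 score
    critique: str   # one short sentence explaining the verdict


def per_turn_pass_rate(verdicts):
    """Fraction of judged turns that passed the rubric."""
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    return sum(v.passed for v in verdicts) / len(verdicts)


def preference_rate(outcomes):
    """Share of pairwise comparisons won by the new model.

    `outcomes` holds "new", "baseline", or "tie" per comparison.
    Assumption: a tie counts as half a win.
    """
    if not outcomes:
        raise ValueError("no comparisons to aggregate")
    score = {"new": 1.0, "tie": 0.5, "baseline": 0.0}
    return sum(score[o] for o in outcomes) / len(outcomes)
```

Keeping the critique next to the boolean means every failed turn carries its own explanation when someone reviews the numbers later.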
Secondary metrics
- Persona-break rate. The assistant breaks character or admits it is an AI when the product persona forbids it.
- Sycophancy. Anthropic's sycophancy work (Sharma et al.) demonstrates that models bias their answers toward what the user appears to want to hear[4].
- Refusal false-positive rate. The assistant refuses a benign request.
- Amnesia within window. The model forgets something it was told three turns ago.
- Repeat-question rate. The model asks a question whose answer is already in earlier turns.
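One way to track the secondary metrics above is to tag each assistant turn with the failures it shows and report a rate per tag. This is a sketch only: the tag names are hypothetical, and the denominator here is all assistant turns. A stricter refusal false-positive rate would divide by benign requests only.

```python
from collections import Counter

# Hypothetical tag names, one per secondary metric.
SECONDARY_TAGS = (
    "persona_break",
    "sycophancy",
    "false_refusal",
    "amnesia",
    "repeat_question",
)


def secondary_rates(turn_tags):
    """Return the share of assistant turns carrying each failure tag.

    `turn_tags` has one collection of tags per assistant turn; an empty
    collection means the turn showed none of the failures.
    """
    if not turn_tags:
        raise ValueError("no turns to aggregate")
    unknown = {tag for tags in turn_tags for tag in tags} - set(SECONDARY_TAGS)
    if unknown:
        raise ValueError(f"unknown tags: {sorted(unknown)}")
    # set(tags) ensures a tag repeated on one turn is counted once.
    counts = Counter(tag for tags in turn_tags for tag in set(tags))
    return {tag: counts[tag] / len(turn_tags) for tag in SECONDARY_TAGS}
```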
Offline eval design
Use MT-Bench for breadth. On top of it, build a domain dialogue set of 30 multi-turn scenarios with explicit per-turn rubrics covering persona adherence, retention of earlier-turn information, and no hallucination on domain facts. Per Hamel's field guide and the judge papers, the right scorer is an LLM-as-a-judge (LaaJ) that returns a binary pass/fail per turn plus a one-sentence critique[3][5]. Avoid 1-to-5 scales. Pair the judge with a calibration loop: a principal domain expert (PDE) labels 100 traces, and the judge prompt is iterated until LaaJ-PDE agreement is at least 80%.
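The measurement step of the calibration loop can be sketched as a comparison of expert and judge labels keyed by trace id. The function names and the dict layout are assumptions for illustration; the 80% threshold is the one stated above.

```python
def judge_agreement(expert_labels, judge_labels):
    """Compare pass/fail labels keyed by trace id.

    Returns (agreement, disagreements): the fraction of shared traces on
    which the judge matches the expert, and the ids where they differ.
    """
    shared = sorted(expert_labels.keys() & judge_labels.keys())
    if not shared:
        raise ValueError("no traces labeled by both expert and judge")
    disagreements = [t for t in shared if expert_labels[t] != judge_labels[t]]
    agreement = (len(shared) - len(disagreements)) / len(shared)
    return agreement, disagreements


def is_calibrated(agreement, threshold=0.80):
    """True once the judge meets the agreement bar."""
    return agreement >= threshold
```

The list of disagreeing trace ids is the useful output during iteration: those are the traces to reread with the expert's critique before editing the judge prompt again.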
For pairwise preference, structure the eval as A vs B with the order randomized; LaaJ has documented position bias and verbosity bias, and both must be controlled.
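A minimal sketch of order randomization follows, assuming a hypothetical `judge(prompt, response_a, response_b)` callable that returns "A" or "B". The position shown to the judge is drawn at random and the verdict is mapped back to "new" or "baseline", so a judge that favors one slot cannot systematically favor one model. Verbosity bias is not handled here.

```python
import random


def compare_randomized(prompt, new_response, baseline_response, judge, rng):
    """Run one pairwise comparison with the presentation order randomized.

    `judge` is a hypothetical callable returning "A" or "B".
    `rng` is a random.Random instance, so runs can be seeded.
    """
    new_is_a = rng.random() < 0.5
    if new_is_a:
        response_a, response_b = new_response, baseline_response
    else:
        response_a, response_b = baseline_response, new_response

    verdict = judge(prompt, response_a, response_b)
    if verdict not in ("A", "B"):
        raise ValueError(f"unexpected verdict: {verdict!r}")

    # Map the positional verdict back to the model that produced the winner.
    picked_a = verdict == "A"
    return "new" if picked_a == new_is_a else "baseline"
```

With this wrapper, a judge that always answers "A" produces a preference rate near 50% instead of a spurious win for whichever model happened to be listed first.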
Online eval design
Sample conversations from production and run a LaaJ on each turn. Track explicit feedback (thumbs up and down) by topic; the topic-level breakdown shows where the model is failing, which the global rate hides. Track the conversation-length distribution; a sudden shift usually means the model is being more verbose, less verbose, or losing context. Run a persona-break classifier on every response and alert on detected breaks, because they are visible to users in real time.
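Two of these online signals can be sketched in a few lines: the thumbs-down rate per topic and a crude conversation-length shift. The event layout, the function names, and the use of a median ratio are assumptions; a production system would likely compare full distributions.

```python
from collections import defaultdict
from statistics import median


def thumbs_down_rate_by_topic(events):
    """events: iterable of (topic, thumbs_up) pairs from explicit feedback."""
    down = defaultdict(int)
    total = defaultdict(int)
    for topic, thumbs_up in events:
        total[topic] += 1
        if not thumbs_up:
            down[topic] += 1
    return {topic: down[topic] / total[topic] for topic in total}


def length_shift(baseline_lengths, current_lengths):
    """Ratio of current to baseline median conversation length, in turns.

    A value far from 1.0 suggests a change in verbosity or context handling.
    """
    return median(current_lengths) / median(baseline_lengths)
```

For example, five feedback events with two thumbs-down give a global rate of 40%, while the per-topic view can show that both came from a single topic.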
Top tools
- Langfuse for live scoring, annotation queues, and per-turn LaaJ; the Langfuse cookbook on multi-turn evals is a useful template[6].
- Braintrust for hosted eval observability with strong pairwise primitives.
- The MT-Bench harness for parity with public scores.
- Inspect AI for sandboxed deterministic runs across releases.
Common pitfalls
- Single-turn proxy benchmarks for multi-turn quality. MMLU and HumanEval predict almost nothing about multi-turn conversational quality.
- LaaJ position and verbosity biases. Left uncontrolled, they make pairwise comparisons unreliable; randomize order and length-normalize where possible.
- 1-to-5 scale evaluation. Hamel calls it "almost always a sign of a bad eval process"[3]. Use pass/fail with critique.
- Skipping the calibration loop. A judge that has not been calibrated against a domain expert on 100 hand-labeled traces is decoration, not a measurement.
Canonical reading
Hamel's "Creating an LLM-as-a-Judge That Drives Results" is the canonical reading because it operationalizes the calibration loop that makes the judge trustworthy[3]. The "critique shadowing" pattern (one principal domain expert per judge prompt) is the most useful single technique in this category. Read it with the MT-Bench paper for the LaaJ-human agreement methodology[1].
Day-1 starter
- 30 multi-turn scenarios with per-turn rubrics. Persona, retention, factuality. The rubric is the eval.
- Pairwise A/B against the last release. One principal domain expert labels 100 traces; iterate the judge prompt until agreement is at least 80%.
- Track the agreement rate between the LaaJ and the PDE in CI. When agreement drops, the judge has drifted from the rubric; treat that as a bug in the judge and fix it first, rather than reading it as a regression in the model.
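Since the rubric is the eval, it helps to store it as data and check it for holes before any judge runs. The sketch below assumes a hypothetical layout in which every turn carries a plain-language pass condition for each of the three criteria (persona, retention, factuality); the class and function names are illustrative.

```python
from dataclasses import dataclass, field

# Assumption: every turn needs a pass condition for all three criteria.
REQUIRED_CRITERIA = ("persona", "retention", "factuality")


@dataclass
class Turn:
    user_message: str
    # criterion name -> pass condition written in plain language
    rubric: dict = field(default_factory=dict)


@dataclass
class Scenario:
    name: str
    turns: list


def rubric_gaps(scenarios):
    """List (scenario name, turn index, criterion) for each missing rubric entry."""
    gaps = []
    for scenario in scenarios:
        for index, turn in enumerate(scenario.turns):
            for criterion in REQUIRED_CRITERIA:
                if not turn.rubric.get(criterion):
                    gaps.append((scenario.name, index, criterion))
    return gaps
```

Running this check in CI next to the agreement tracking keeps the scenario set honest as new scenarios are added.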