Conversational quality

Multi-turn assistant. Stay on persona, remember context, recover from misunderstandings. Pairwise preference plus per-turn rubric, not 1-to-5 scales.

What "good" looks like

A conversational-quality eval asks two questions: does a human prefer the new model's response to the last release's, and does each turn pass a domain rubric? Score both, and score turns as binary pass/fail rather than on a numeric scale.

The primary targets:

Pairwise preference rate against the last released version, judged by whether a human prefers the new response to the baseline.
Per-turn pass rate on a domain rubric, scored pass/fail per turn.
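
Both targets reduce to simple proportions. The sketch below shows one way to compute them; the function names and the "new"/"baseline" labels are illustrative assumptions, not part of any particular harness, and ties are left out for brevity.

```python
def pairwise_preference_rate(preferences: list[str]) -> float:
    """Share of comparisons where the rater preferred the new release.

    Each entry is "new" or "baseline"; ties are not modeled in this sketch.
    """
    if not preferences:
        raise ValueError("no comparisons to score")
    return sum(p == "new" for p in preferences) / len(preferences)


def per_turn_pass_rate(turn_passed: list[bool]) -> float:
    """Share of turns that passed the domain rubric."""
    if not turn_passed:
        raise ValueError("no turns to score")
    return sum(turn_passed) / len(turn_passed)


# Hypothetical results: 4 human comparisons and 5 rubric-scored turns.
preferences = ["new", "baseline", "new", "new"]
turn_passed = [True, True, False, True, True]

preference_rate = pairwise_preference_rate(preferences)
pass_rate = per_turn_pass_rate(turn_passed)
print(f"preference rate: {preference_rate:.2f}")  # preference rate: 0.75
print(f"pass rate: {pass_rate:.2f}")              # pass rate: 0.80
```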

MT-Bench is the closest public reference. It poses 80 multi-turn questions across 8 categories and scores the answers with an LLM judge. The paper reports that a strong judge such as GPT-4 reaches above 80% agreement with humans, the same level as inter-human agreement^[1].

Chatbot Arena is the gold standard but expensive to operate at scale. The LMSYS Bradley-Terry analysis is the citation-of-record for the methodology^[2].
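
To make the Bradley-Terry idea concrete, here is a minimal sketch that fits strengths from pairwise win counts with the textbook minorization-maximization update. It is not the LMSYS implementation; the function name, the win-count layout, and the sample numbers are assumptions for illustration.

```python
def bradley_terry(wins: dict[tuple[str, str], int], iters: int = 200) -> dict[str, float]:
    """Fit Bradley-Terry strengths with the standard MM update.

    wins[(a, b)] is the number of comparisons in which a was preferred over b.
    Returned strengths are normalized to sum to 1.
    """
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            total_wins = sum(w for (winner, _), w in wins.items() if winner == i)
            denom = 0.0
            for j in models:
                if j == i:
                    continue
                # Number of comparisons between i and j, in either direction.
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = total_wins / denom if denom else strength[i]
        norm = sum(updated.values())
        strength = {m: s / norm for m, s in updated.items()}
    return strength


# Hypothetical counts: the new release won 60 of 100 comparisons.
wins = {("new", "baseline"): 60, ("baseline", "new"): 40}
strength = bradley_terry(wins)
p_new = strength["new"] / (strength["new"] + strength["baseline"])
print(round(p_new, 3))  # 0.6
```

With only two models the fitted win probability equals the raw preference rate. The model earns its keep when several releases are compared and not every pair has been judged head to head.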

Use pass/fail, not a 1-to-5 scale. The framing that matters here, from Hamel, is binary pass/fail with a short critique^[3]. A 1-to-5 scale "is almost always a sign of a bad eval process" because annotators park on 3 and disagreement is hidden by aggregation.

flowchart TD
    CONV["Production or test conversation"] --> TURN["Per-turn LaaJ"]
    TURN --> RUBRIC{"Passes domain rubric?"}
    RUBRIC -->|no| FAIL["Log turn failure: persona, retention, or factuality"]
    RUBRIC -->|yes| PAIR["Pairwise A vs B, order randomized"]
    PAIR --> PREF{"Preferred over last release?"}
    CAL["PDE labels 100 traces"] --> AGREE{"LaaJ-PDE agreement at least 80%?"}
    AGREE -->|no| ITER["Iterate the judge prompt"]
    AGREE -->|yes| PREF
    PREF -->|yes| SHIP["Ship"]
    PREF -->|no| FIX["Fix the weaker axis"]

Secondary metrics

Metric	What it catches	How to score
Persona-break rate	Assistant breaks character or admits it is an AI when the persona forbids it	Persona-break classifier on every response
Sycophancy	Model biases answers toward what the user appears to want to hear^[4]	Compare answers under user-agreement pressure vs neutral framing
Refusal false-positive rate	Assistant refuses a benign request	Fraction of benign prompts refused
Amnesia within window	Model forgets something it was told three turns ago	Per-turn retention check against earlier turns
Repeat-question rate	Model asks a question whose answer is in earlier turns	Fraction of turns re-asking answered questions
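
Two of these metrics can be sketched as plain proportions once the per-prompt labels exist. The example assumes the labels were produced upstream (by a classifier or a reviewer) and that answers have been normalized to comparable values; both function names are hypothetical. Note that the sycophancy function counts any change of answer under pressure, which is a simplification of "changed toward the user's view".

```python
def refusal_false_positive_rate(benign_refused: list[bool]) -> float:
    """benign_refused[i] is True when the assistant refused benign prompt i."""
    if not benign_refused:
        raise ValueError("no benign prompts to score")
    return sum(benign_refused) / len(benign_refused)


def sycophancy_flip_rate(answer_pairs: list[tuple[str, str]]) -> float:
    """Share of prompts whose answer changed under user-agreement pressure.

    Each pair is (answer under neutral framing, answer under pressure),
    already normalized to a comparable label such as "yes" or "no".
    """
    if not answer_pairs:
        raise ValueError("no answer pairs to score")
    flips = sum(neutral != pressured for neutral, pressured in answer_pairs)
    return flips / len(answer_pairs)


# Hypothetical labels for four benign prompts and four paired framings.
refusal_rate = refusal_false_positive_rate([False, False, True, False])
flip_rate = sycophancy_flip_rate(
    [("yes", "yes"), ("no", "yes"), ("no", "no"), ("yes", "no")]
)
print(refusal_rate)  # 0.25
print(flip_rate)     # 0.5
```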

Offline eval design

Start with MT-Bench for breadth. On top of it, build a domain dialogue set of 30 multi-turn scenarios with explicit per-turn rubrics covering:

Persona adherence.
Retention of earlier-turn information.
No-hallucination on domain facts.

Score each turn as binary pass/fail plus a one-sentence critique, not on a 1-to-5 scale. Hamel's field guide and the LLM-judge papers both support this scorer^[3]^[5].
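
One possible shape for a per-turn record is sketched below: a binary verdict per rubric axis, a one-sentence critique, and a tally of which axis failed. The `TurnVerdict` class, its field names, and the sample scenario are assumptions for illustration rather than a prescribed schema.

```python
from collections import Counter
from dataclasses import dataclass

AXES = ("persona", "retention", "factuality")


@dataclass
class TurnVerdict:
    scenario_id: str
    turn_index: int
    axis_pass: dict[str, bool]  # one binary verdict per rubric axis
    critique: str               # one-sentence explanation of the verdict

    @property
    def passed(self) -> bool:
        # A turn passes only if every rubric axis passes.
        return all(self.axis_pass[axis] for axis in AXES)


def failure_counts(verdicts: list[TurnVerdict]) -> Counter:
    """Count failed turns per axis, so the weakest axis is visible."""
    counts: Counter = Counter()
    for verdict in verdicts:
        for axis in AXES:
            if not verdict.axis_pass[axis]:
                counts[axis] += 1
    return counts


# Hypothetical three-turn scenario.
verdicts = [
    TurnVerdict("refund-01", 1, {"persona": True, "retention": True, "factuality": True},
                "Answered in persona with the correct policy."),
    TurnVerdict("refund-01", 2, {"persona": True, "retention": False, "factuality": True},
                "Re-asked for the order number given in turn 1."),
    TurnVerdict("refund-01", 3, {"persona": False, "retention": True, "factuality": False},
                "Dropped the persona and invented a refund window."),
]
turn_pass_rate = sum(v.passed for v in verdicts) / len(verdicts)
print(round(turn_pass_rate, 2))  # 0.33
print(dict(failure_counts(verdicts)))
```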

Pair the judge with a calibration loop. A principal domain expert (PDE) labels 100 traces, and the judge prompt is iterated until agreement between the LLM-as-a-judge (LaaJ) and the PDE is at least 80%.
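
The agreement check itself is small. The sketch below assumes both the judge and the expert emit one pass/fail label per trace, in the same order; the function names and the synthetic labels are illustrative. Returning the disagreeing indices gives the expert a concrete list of traces to review when rewriting the judge prompt.

```python
AGREEMENT_BAR = 0.80  # the 80% bar named in the text


def agreement_rate(judge: list[bool], expert: list[bool]) -> float:
    """Share of traces where the judge verdict matches the expert label."""
    if len(judge) != len(expert):
        raise ValueError("judge and expert must label the same traces")
    if not judge:
        raise ValueError("no traces to compare")
    return sum(j == e for j, e in zip(judge, expert)) / len(judge)


def disagreements(judge: list[bool], expert: list[bool]) -> list[int]:
    """Indices of traces to review when iterating the judge prompt."""
    return [i for i, (j, e) in enumerate(zip(judge, expert)) if j != e]


# Synthetic labels for 100 traces: the judge fails 10 traces the expert passed.
expert_labels = [True] * 70 + [False] * 30
judge_labels = [True] * 60 + [False] * 40

rate = agreement_rate(judge_labels, expert_labels)
to_review = disagreements(judge_labels, expert_labels)
print(rate)                   # 0.9
print(rate >= AGREEMENT_BAR)  # True
print(len(to_review))         # 10
```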

For pairwise preference, structure the eval as A vs B with the order randomized. LaaJ has documented position bias and verbosity bias, and both must be controlled.
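
A minimal sketch of the order randomization follows. The `Judge` signature is an assumption: any callable that sees a prompt plus two anonymous responses and returns "A" or "B". The stub judges stand in for a real LLM call. This controls position bias only; verbosity bias needs separate handling.

```python
import random
from typing import Callable

# Assumed judge interface: (prompt, response_a, response_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def judge_pair(prompt: str, new: str, baseline: str,
               judge: Judge, rng: random.Random) -> str:
    """Return "new" or "baseline" without revealing slot order to the judge."""
    new_is_a = rng.random() < 0.5
    a, b = (new, baseline) if new_is_a else (baseline, new)
    verdict = judge(prompt, a, b)
    if verdict not in ("A", "B"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    picked_a = verdict == "A"
    # Map the slot verdict back to the model that occupied that slot.
    return "new" if picked_a == new_is_a else "baseline"


def always_first(prompt: str, a: str, b: str) -> str:
    """Stub judge with pure position bias: always picks slot A."""
    return "A"


def prefers_new_text(prompt: str, a: str, b: str) -> str:
    """Stub judge that picks on content, regardless of slot."""
    return "A" if a == "new answer" else "B"


rng = random.Random(0)
biased = [judge_pair("q", "new answer", "old answer", always_first, rng)
          for _ in range(1000)]
content = [judge_pair("q", "new answer", "old answer", prefers_new_text, rng)
           for _ in range(1000)]
biased_share = biased.count("new") / len(biased)
content_share = content.count("new") / len(content)
print(content_share)  # 1.0
```

With randomized slots, a judge that only ever picks the first position lands near a 50% preference rate instead of inflating one model, while a judge that picks on content is unaffected.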

Online eval design

Run a per-turn LaaJ on sampled production conversations, scoring every turn of each sampled conversation.

Track explicit feedback by topic. Thumbs up and down broken down by topic tell you more about where the model is failing than the global rate does.
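
As an illustration of why the breakdown matters, the sketch below groups feedback events by topic. The event shape (a topic label plus a thumbs-up flag) and the topic names are assumptions; in practice the topic would come from whatever tagging the logging pipeline already applies.

```python
from collections import defaultdict


def thumbs_up_rate_by_topic(events: list[tuple[str, bool]]) -> dict[str, float]:
    """events holds (topic, thumbs_up) pairs; returns the up-rate per topic."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # topic -> [ups, total]
    for topic, thumbs_up in events:
        totals[topic][0] += int(thumbs_up)
        totals[topic][1] += 1
    return {topic: ups / total for topic, (ups, total) in totals.items()}


# Hypothetical feedback: the global rate looks middling, one topic is failing.
events = [
    ("billing", False), ("billing", False), ("billing", True), ("billing", False),
    ("shipping", True), ("shipping", True), ("shipping", False), ("shipping", True),
]
by_topic = thumbs_up_rate_by_topic(events)
global_rate = sum(up for _, up in events) / len(events)
print(global_rate)  # 0.5
print(by_topic)     # {'billing': 0.25, 'shipping': 0.75}
```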

Track the conversation-length distribution. A sudden shift usually means the model is being more verbose, less verbose, or losing context.
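
One way to detect such a shift is a two-sample Kolmogorov-Smirnov statistic between a baseline window and the current window of turns-per-conversation counts. This is one option among several, and the alert threshold below is an arbitrary assumption that would need tuning against real traffic.

```python
from bisect import bisect_right

SHIFT_THRESHOLD = 0.3  # assumed alert threshold, not a recommended value


def ks_statistic(baseline: list[int], current: list[int]) -> float:
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    if not baseline or not current:
        raise ValueError("both windows need at least one conversation")
    a, b = sorted(baseline), sorted(current)
    gap = 0.0
    for value in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical turns-per-conversation counts for two time windows.
last_week = [2, 3, 3, 4, 4, 4, 5, 6]
this_week = [5, 6, 6, 7, 7, 8, 8, 9]

shift = ks_statistic(last_week, this_week)
print(shift)                    # 0.75
print(shift > SHIFT_THRESHOLD)  # True
```

The statistic flags that the distribution moved, not why. Whether the cause is verbosity or lost context still has to come from reading the sampled conversations.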

Run a persona-break classifier on every response. Alert on detected breaks, because they are visible to users in real time.

Top benchmarks

MT-Bench, https://arxiv.org/abs/2306.05685. 80 multi-turn questions, 8 categories; LaaJ-human agreement above 80%.
Chatbot Arena, https://arxiv.org/abs/2403.04132. Bradley-Terry Elo from human pairwise judgments; the public gold standard.
Your domain set. The eval that predicts whether your assistant ships.

Tools especially suited

Langfuse: live scoring, annotation queues, and per-turn LaaJ; the Langfuse cookbook on multi-turn evals is a useful template^[6].
Braintrust: hosted eval observability with strong pairwise primitives.
MT-Bench harness: parity with public scores.
Inspect AI: sandboxed deterministic runs across releases.

Common pitfalls

Pitfall	Why it bites	Fix
Single-turn proxy benchmarks	MMLU and HumanEval predict almost nothing about multi-turn conversational quality	Build a multi-turn domain set instead
LaaJ position and verbosity biases	Untreated, they make pairwise comparisons unreliable	Randomize order and length-normalize where possible
1-to-5 scale evaluation	Hamel calls it "almost always a sign of a bad eval process"	Use pass/fail with critique^[3]
Skipping the calibration loop	A judge not calibrated against a domain expert on 100 hand-labeled traces is decoration, not a measurement	Calibrate on 100 traces before trusting the judge

Canonical reading

Hamel's "Creating an LLM-as-a-Judge That Drives Results" is the canonical reading because it operationalizes the calibration loop that makes the judge trustworthy^[3]. The "critique shadowing" pattern (one principal domain expert per judge prompt) is the most useful single technique in this category. Read it with the MT-Bench paper for the LaaJ-human agreement methodology^[1].

Day-1 starter

30 multi-turn scenarios with per-turn rubrics. Persona, retention, factuality. The rubric is the eval.
Pairwise A/B against the last release. One principal domain expert labels 100 traces; iterate the judge prompt until agreement is at least 80%.
Track agreement percent between LaaJ and the PDE in CI. When agreement drops, the judge has drifted from the rubric. Fix that first: it is a bug in the judge, not a regression in the model.

Cross-cutting links

Calibrating your judge against humans: the PDE-agreement loop the day-1 starter depends on.
Pairwise vs direct scoring: when preference rate beats rubric scoring and when it does not.
Pairwise comparison and Elo: the mechanics behind preference-rate metrics.
Langfuse: live scoring plus annotation queues for sampled conversations.
Multi-turn conversation eval: the day-1 starter, runnable end to end.
LLM-judge calibration loop: iterating the judge prompt to the 80% agreement bar.