Multi-turn evals are where most teams' single-turn judge programs come apart. The rubric that scored a one-shot response cleanly cannot tell you whether the assistant remembered the dietary restriction from turn one when it answered turn five. The recipe here is the smallest useful multi-turn eval that does two jobs at once: a per-turn binary rubric for response quality, and a per-session state check that asks whether the assistant respected the running constraints across the conversation. The structure is adapted from the Langfuse multi-turn cookbook and from MT-Bench's two-turn scoring pattern [1][2].
We log every turn to Langfuse as a child observation under a session-level trace, then score each session at two levels: once per turn (the per-turn rubric, applied by an LLM judge) and once at the end (the per-session state check, which is deterministic code). The Langfuse session is the unit you read when something breaks; the per-turn and per-session scores are the metrics you ship.
The dataset shape
You need three artifacts:
- Scenarios (scenarios.jsonl). One scenario per line. Each scenario is a list of user turns, plus a per-turn rubric and an end-of-session state check.
- Per-turn rubric (turn_rubric.txt). One rubric per turn type, binary.
- Per-session state check (state_check.py). A function that takes the full transcript and returns PASS or FAIL plus a one-line reason.
A scenario, pretty-printed here for readability (in scenarios.jsonl each scenario must sit on a single line, because the runner parses the file line by line):
{
  "id": "s001",
  "domain": "meal-planning",
  "constraints": ["vegan", "peanut allergy"],
  "turns": [
    {"user": "I am vegan and have a peanut allergy. Suggest a dinner.", "rubric": "respects_constraints"},
    {"user": "What about a side dish?", "rubric": "respects_constraints"},
    {"user": "Can you also suggest a dessert?", "rubric": "respects_constraints"}
  ],
  "session_state_check": "no_constraint_violation_anywhere_in_transcript"
}
The per-turn rubric is binary and reusable. The session check is custom code that walks the full transcript.
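A malformed scenario otherwise only surfaces partway through a run, so it can help to validate the file first. The following is a minimal sketch, not part of the original recipe: the helper name `validate_scenario` is hypothetical, and the required keys are assumed from the example scenario above rather than from any fixed schema.

```python
import json

# Assumed from the example scenario above; adjust if your schema differs.
REQUIRED_KEYS = {"id", "domain", "constraints", "turns", "session_state_check"}


def validate_scenario(line: str) -> dict:
    """Parse one JSONL line and check it has the shape the scripts rely on."""
    scenario = json.loads(line)
    missing = REQUIRED_KEYS - scenario.keys()
    if missing:
        raise ValueError(f"scenario missing keys: {sorted(missing)}")
    if not scenario["turns"]:
        raise ValueError(f"scenario {scenario['id']} has no turns")
    for n, turn in enumerate(scenario["turns"], start=1):
        if not {"user", "rubric"} <= turn.keys():
            raise ValueError(
                f"scenario {scenario['id']} turn {n} needs 'user' and 'rubric'"
            )
    return scenario


# One scenario serialized the way scenarios.jsonl expects it: a single line.
example = json.dumps({
    "id": "s001",
    "domain": "meal-planning",
    "constraints": ["vegan", "peanut allergy"],
    "turns": [
        {"user": "I am vegan and have a peanut allergy. Suggest a dinner.",
         "rubric": "respects_constraints"},
    ],
    "session_state_check": "no_constraint_violation_anywhere_in_transcript",
})
print(validate_scenario(example)["id"])
```

Running this over every line of scenarios.jsonl before the assistant run keeps schema mistakes out of the Langfuse traces.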
Step 1: install and set up Langfuse
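# The scripts below use the v2 Python SDK calls (trace, generation, score), hence the version pin.
pip install "langfuse<3" openai pandas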
export LANGFUSE_PUBLIC_KEY=...
export LANGFUSE_SECRET_KEY=...
export LANGFUSE_HOST=https://cloud.langfuse.com
export OPENAI_API_KEY=...
Langfuse Cloud has a free tier sufficient for a 30-scenario dev set; self-hosted is fine too. The point of using it here is that the session, the per-turn observations, and the per-turn and per-session scores all show up in a viewer designed for reading multi-turn traces [1].
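A missing key tends to show up as an authentication error in the middle of a run. As a small pre-flight sketch (the helper name `missing_env` is hypothetical; the variable names are the ones exported above), the scripts could check the environment before making any calls:

```python
import os

# The four variables exported in the setup commands above.
REQUIRED_ENV = [
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_SECRET_KEY",
    "LANGFUSE_HOST",
    "OPENAI_API_KEY",
]


def missing_env(env=None) -> list:
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV if not env.get(name)]


# Demo with an explicit mapping instead of the real environment.
print(missing_env({"LANGFUSE_HOST": "https://cloud.langfuse.com"}))
```

Calling `missing_env()` with no argument inspects the real environment; an empty list means the run can proceed.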
Step 2: run the assistant under trace
"""Run each multi-turn scenario against the assistant, log to Langfuse.

Run:
    pip install "langfuse<3" openai
    export LANGFUSE_PUBLIC_KEY=... LANGFUSE_SECRET_KEY=... LANGFUSE_HOST=...
    export OPENAI_API_KEY=...
    python run_assistant.py
"""
from __future__ import annotations

import json
import uuid
from pathlib import Path

from langfuse import Langfuse
from openai import OpenAI

LF = Langfuse()  # reads the LANGFUSE_* variables from the environment
OAI = OpenAI()   # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4o-2024-08-06"
SYSTEM = (
    "You are a meal-planning assistant. Suggest meals that respect every "
    "dietary restriction stated earlier in the conversation. Keep "
    "responses to two sentences."
)


def run_scenario(scenario: dict) -> dict:
    sid = scenario["id"]
    # A fresh suffix per run keeps repeated runs of one scenario in separate sessions.
    session_id = f"{sid}-{uuid.uuid4().hex[:8]}"
    # One trace per session. The trace id doubles as the session id, so the
    # scoring scripts can attach scores using only the saved session file.
    trace = LF.trace(
        id=session_id,
        name=f"multi-turn-{sid}",
        session_id=session_id,
        metadata={"scenario_id": sid, "constraints": scenario["constraints"]},
    )
    messages = [{"role": "system", "content": SYSTEM}]
    transcript = []
    for i, turn in enumerate(scenario["turns"]):
        messages.append({"role": "user", "content": turn["user"]})
        gen = trace.generation(
            name=f"turn-{i + 1}",
            model=MODEL,
            input=messages.copy(),  # snapshot: `messages` keeps growing
            metadata={"rubric": turn["rubric"]},
        )
        resp = OAI.chat.completions.create(
            model=MODEL, temperature=0.2, messages=messages
        )
        reply = resp.choices[0].message.content or ""
        gen.end(output=reply, usage_details={
            "input": resp.usage.prompt_tokens,
            "output": resp.usage.completion_tokens,
        })
        messages.append({"role": "assistant", "content": reply})
        transcript.append({
            "user": turn["user"],
            "assistant": reply,
            "rubric": turn["rubric"],
            # Saved so the per-turn judge can attach its score to this generation.
            "observation_id": gen.id,
        })
    LF.flush()
    return {"session_id": session_id, "transcript": transcript}


def main() -> None:
    scenarios = [
        json.loads(line)
        for line in Path("scenarios.jsonl").read_text().splitlines()
        if line.strip()
    ]
    Path("sessions").mkdir(exist_ok=True)
    for s in scenarios:
        result = run_scenario(s)
        # Keep a local copy of each session; the judge and state check read these files.
        Path(f"sessions/{result['session_id']}.json").write_text(
            json.dumps({"scenario": s, **result}, indent=2)
        )
        print(f"ran {s['id']} -> {result['session_id']}")


if __name__ == "__main__":
    main()
Each session shows up in Langfuse as a single session containing N generations. You can open the session view and read the full conversation as it happened.
Step 3: per-turn judge
We grade each turn independently. The rubric is binary; the judge is the same model as the generator (deliberate). The cost is one judge call per turn.
"""Per-turn binary judge. Scores attached to Langfuse generations.

Run after run_assistant.py.
"""
from __future__ import annotations

import json
from pathlib import Path

from langfuse import Langfuse
from openai import OpenAI

LF = Langfuse()
OAI = OpenAI()
MODEL = "gpt-4o-2024-08-06"
RUBRICS = {
    "respects_constraints": (
        "Does the assistant's reply violate any dietary restriction the user "
        "stated earlier in this conversation? Reply FAIL if the reply names "
        "or implies any ingredient that violates a stated restriction "
        "(including in garnishes or sauces). Reply PASS otherwise."
    ),
}


def judge_turn(
    transcript_so_far: list[dict], user_msg: str, reply: str, rubric_key: str
) -> dict:
    rubric = RUBRICS[rubric_key]
    # Label the speakers so the judge can tell user constraints from assistant text.
    history = "\n".join(
        f"USER: {t['user']}\nASSISTANT: {t['assistant']}\n" for t in transcript_so_far
    ) or "(none)"
    prompt = (
        f"You evaluate one assistant turn against a binary rubric.\n\n"
        f"RUBRIC:\n{rubric}\n\n"
        f"CONVERSATION SO FAR:\n{history}\n\n"
        # The current user message is evidence too: on turn one it is the
        # only place the constraints appear.
        f"USER MESSAGE FOR THIS TURN:\n{user_msg}\n\n"
        f"REPLY UNDER REVIEW:\n{reply}\n\n"
        f'Return JSON: {{"label": "PASS" or "FAIL", "critique": "..."}}'
    )
    msg = OAI.chat.completions.create(
        model=MODEL,
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(msg.choices[0].message.content)


def main() -> None:
    for path in Path("sessions").glob("*.json"):
        data = json.loads(path.read_text())
        transcript = data["transcript"]
        session_id = data["session_id"]
        history: list[dict] = []
        for i, turn in enumerate(transcript):
            result = judge_turn(history, turn["user"], turn["assistant"], turn["rubric"])
            label = result.get("label", "")
            critique = result.get("critique", "")
            LF.score(
                trace_id=session_id,
                # Attach to the turn's generation when the session file carries
                # its id; with None the score lands on the trace as a whole.
                observation_id=turn.get("observation_id"),
                name=f"per_turn_{turn['rubric']}",
                value=1 if label == "PASS" else 0,
                comment=f"turn {i + 1}: {critique}",
            )
            history.append(turn)
            turn["judge_label"] = label
            turn["judge_critique"] = critique
        # Persist the verdicts next to the transcript for the report step.
        path.write_text(json.dumps(data, indent=2))
        LF.flush()
        print(f"scored {session_id}")


if __name__ == "__main__":
    main()
The Langfuse score lives next to the trace; you can filter the session view by score and read every failed turn in context.
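JSON mode makes the judge return parseable output, but it does not guarantee the label is exactly "PASS" or "FAIL". A minimal sketch of a defensive parser follows; the helper name `parse_verdict` is hypothetical, and the fail-closed policy (anything that is not a clean PASS counts as FAIL) is an assumption you may want to change.

```python
import json


def parse_verdict(raw: str) -> dict:
    """Normalize a judge response into {"label": "PASS"|"FAIL", "critique": str}.

    Anything that is not a clean PASS counts as FAIL, so a malformed
    verdict can lower the pass rate but never raise it.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"label": "FAIL", "critique": f"unparseable judge output: {raw[:80]}"}
    if not isinstance(data, dict):
        return {"label": "FAIL", "critique": "judge output was not a JSON object"}
    label = str(data.get("label", "")).strip().upper()
    critique = str(data.get("critique", ""))
    if label not in {"PASS", "FAIL"}:
        return {"label": "FAIL", "critique": f"unexpected label {label!r}; {critique}"}
    return {"label": label, "critique": critique}


print(parse_verdict('{"label": "pass", "critique": "No banned ingredients."}'))
print(parse_verdict("not json")["label"])
```

Failing closed means a flaky judge shows up as a drop in the per-turn rate, which is easier to notice than a silently inflated one.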
Step 4: per-session state check
This is the layer that catches the failure single-turn judges miss. The assistant might honor the constraint in turn one and turn three but violate it in turn two. A per-session check walks the full transcript and asks one question.
"""Per-session deterministic state check.

For meal-planning: does any assistant reply across the full transcript
mention an ingredient that violates the listed constraints?
"""
from __future__ import annotations

import json
import re
from pathlib import Path

from langfuse import Langfuse

LF = Langfuse()

# Tiny ingredient banlist; extend per constraint as needed.
# Known limitation: plain word matching also flags plant-based substitutes
# such as "almond milk", so read the trace before trusting a FAIL.
BANLIST = {
    "vegan": [
        r"\b(chicken|beef|pork|lamb|fish|salmon|tuna|cheese|milk|yogurt|"
        r"butter|cream|eggs?|honey|gelatin|anchov(?:y|ies))\b"
    ],
    "peanut allergy": [r"\bpeanuts?\b"],
    "dairy": [r"\b(milk|cheese|yogurt|butter|cream|whey|casein)\b"],
    "gluten": [r"\b(wheat|barley|rye|bread|pasta|flour)\b"],
}


def violates(text: str, constraint: str) -> str | None:
    """Return the first banned term found in `text`, or None."""
    # A constraint with no BANLIST entry has no patterns and never fails;
    # add an entry before relying on the check for a new constraint.
    for pat in BANLIST.get(constraint, []):
        m = re.search(pat, text, re.IGNORECASE)
        if m:
            return m.group(0)
    return None


def check_session(scenario: dict, transcript: list[dict]) -> tuple[str, str]:
    # Report the first violation only; one is enough to fail the session.
    for i, turn in enumerate(transcript):
        for c in scenario["constraints"]:
            offending = violates(turn["assistant"], c)
            if offending:
                return "FAIL", f"Turn {i + 1} violates '{c}' via '{offending}'"
    return "PASS", "No constraint violation found in transcript."


def main() -> None:
    for path in Path("sessions").glob("*.json"):
        data = json.loads(path.read_text())
        label, reason = check_session(data["scenario"], data["transcript"])
        LF.score(
            trace_id=data["session_id"],
            name="session_state_check",
            value=1 if label == "PASS" else 0,
            comment=reason,
        )
        data["session_state"] = {"label": label, "reason": reason}
        path.write_text(json.dumps(data, indent=2))
        print(f"{data['session_id']}: {label} - {reason}")
    LF.flush()


if __name__ == "__main__":
    main()
In meal-planning the deterministic check is a regex over an ingredient banlist. In other domains the deterministic check is a SQL query against a known schema, a state-equality check against a fixture, or a domain-specific Python validator. The pattern is the same: a function that takes the full transcript and returns one binary label plus one line of reason.
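A plain banlist will flag vegan substitutes such as "almond milk" or "coconut cream", which a vegan meal planner suggests often. One way to reduce those false positives is to remove known plant-based phrases before applying the banlist. The sketch below is illustrative only: the qualifier list, the shortened banlist, and the helper name `vegan_violation` are all assumptions, not part of the original state check.

```python
import re
from typing import Optional

# Illustrative subset of the vegan banlist.
VEGAN_BAN = re.compile(
    r"\b(cheese|milk|yogurt|butter|cream|eggs?|honey)\b", re.IGNORECASE
)

# Hypothetical allowlist: a plant-based qualifier makes the following word safe.
PLANT_QUALIFIER = re.compile(
    r"\b(vegan|plant-based|dairy-free|almond|oat|soy|coconut|cashew)\s+"
    r"(cheese|milk|yogurt|butter|cream)\b",
    re.IGNORECASE,
)


def vegan_violation(text: str) -> Optional[str]:
    """Return the first banned term left after removing plant-based phrases."""
    cleaned = PLANT_QUALIFIER.sub(" ", text)
    m = VEGAN_BAN.search(cleaned)
    return m.group(0) if m else None


print(vegan_violation("Blend oat milk with coconut cream."))
print(vegan_violation("Finish with almond milk and a knob of butter."))
```

The check stays deterministic and still returns one offending term, so it slots into the same PASS/FAIL-plus-reason contract; the trade-off is that the allowlist now needs the same maintenance as the banlist.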
Step 5: aggregate the report
"""Compute per-turn and per-session rates from sessions/."""
import json
from pathlib import Path

import pandas as pd

rows = []
for p in Path("sessions").glob("*.json"):
    data = json.loads(p.read_text())
    for i, t in enumerate(data["transcript"]):
        rows.append({
            # Key on session_id, not scenario_id: re-running a scenario
            # creates a new session that must be counted separately.
            "session_id": data["session_id"],
            "scenario_id": data["scenario"]["id"],
            "turn": i + 1,
            "rubric": t["rubric"],
            "turn_label": t.get("judge_label", "?"),
            "session_label": data.get("session_state", {}).get("label", "?"),
        })
df = pd.DataFrame(rows)
by_session = df.groupby("session_id")
# The session label is repeated on every turn row, so take the first per session.
session_label = by_session["session_label"].first()

print(f"Sessions: {by_session.ngroups}, turns: {len(df)}")
print(f"Per-turn PASS rate: {(df['turn_label'] == 'PASS').mean():.3f}")
print(f"Per-session PASS rate: {(session_label == 'PASS').mean():.3f}")

# Where do the rates disagree? Every turn passed the judge, yet the session check failed.
all_turns_pass = (df["turn_label"] == "PASS").groupby(df["session_id"]).all()
mismatch = all_turns_pass & (session_label == "FAIL")
print(f"Sessions where per-turn says PASS but per-session FAILS: {int(mismatch.sum())}")
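The pattern to watch for is the gap between the per-turn and per-session rates, read with one caveat: part of any gap is arithmetic, because a session fails as soon as one of its turns fails. With three turns per session and a per-turn PASS rate of 0.92, independent failures alone would put the per-session rate near 0.92^3 ≈ 0.78. If per-session PASS comes in lower than that, say 0.74, or if the mismatch count printed above is nonzero, you have a state leak: the assistant scores well turn by turn, but the session-level check catches violations the per-turn judge let through. That is the single most common multi-turn failure mode and the reason single-turn judges miss real bugs [2].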
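A session fails if any one of its turns fails, so the per-session rate sits below the per-turn rate even when nothing leaks across turns. A minimal sketch of that baseline, assuming turns fail independently (a simplification) and using a hypothetical helper name:

```python
def expected_session_pass(per_turn_pass: float, turns_per_session: int) -> float:
    """Per-session PASS rate if every turn passed or failed independently."""
    return per_turn_pass ** turns_per_session


baseline = expected_session_pass(0.92, 3)
observed = 0.74
print(f"baseline {baseline:.3f}, observed {observed:.3f}, "
      f"unexplained gap {baseline - observed:.3f}")
# prints: baseline 0.779, observed 0.740, unexplained gap 0.039
```

Only the part of the gap below this baseline points at cross-turn failures; on a 30-session set a few points of difference is within noise, so the per-session mismatch count is the more direct signal.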
Step 6: read the failures in Langfuse
Filter the Langfuse session view by session_state_check = 0. Open each failed session. Read the assistant replies turn by turn. The fix is usually one of:
- The system prompt did not explicitly say to re-read the constraints each turn. Fix at the prompt layer.
- The assistant compressed context (because of a context-window strategy or summarization) and dropped the constraint. Fix at the memory layer.
- The constraint surfaced ambiguously in turn one and the assistant interpreted it loosely on later turns. Fix at the user-experience layer (force confirmation).
Reading the trace tells you which layer to fix. Reading the metric does not.
Step 7: wire it to CI
Pin a 30-session smoke set in your repo. On every PR that touches the prompt or the memory implementation, run the run-assistant, per-turn-judge, and state-check scripts. Fail the build if either the per-turn rate or the per-session rate drops more than five percentage points from the main-branch baseline.
For trace inspection in CI logs, write a compact summary to stdout (the report.py snippet above) and a JSON artifact for the run.
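The gate itself can be a few lines of Python. This is a sketch under stated assumptions: the metric keys `per_turn_pass` and `per_session_pass` and the function name `regressions` are hypothetical, and the baseline is assumed to be a small dict of rates stored from the main branch.

```python
MAX_DROP = 0.05  # five percentage points, expressed as a fraction


def regressions(baseline: dict, current: dict, max_drop: float = MAX_DROP) -> list:
    """Return (metric, baseline, current) for each rate that dropped too far."""
    failed = []
    for metric in ("per_turn_pass", "per_session_pass"):
        # Round the difference so float noise cannot push an exact
        # five-point drop over the threshold.
        drop = round(baseline[metric] - current[metric], 6)
        if drop > max_drop:
            failed.append((metric, baseline[metric], current[metric]))
    return failed


main_branch = {"per_turn_pass": 0.92, "per_session_pass": 0.80}
this_pr = {"per_turn_pass": 0.91, "per_session_pass": 0.72}
for metric, before, after in regressions(main_branch, this_pr):
    print(f"REGRESSION {metric}: {before:.2f} -> {after:.2f}")
```

In a CI job, exiting with a non-zero status when the returned list is non-empty is enough to fail the build.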
What this skips
- Pairwise comparison between two assistant versions: use the position-swap pattern from MT-Bench.
- Persona drift detection: a classifier on the assistant's reply style.
- Cost and latency tracking: Langfuse logs both; report them in the same dashboard.
- Long-context state retention: RULER-style multi-needle for a chat session; see long-context eval with RULER.
For trajectory-level evals where the agent calls tools across turns, the agent eval with τ-bench recipe is the right pattern. The conversational-quality task type page covers the broader topic [3]. The Moshkovich observability paper makes the case that per-session deterministic checks are the missing piece for production observability [4].
What to do next
The conversational-quality task type covers the broader rubric design. The Hamel evals essay covers why binary rubrics beat Likert in multi-turn settings [5]. The Langfuse online eval recipe covers the production loop where new traces feed back into the eval set.