
Multi-turn conversation eval

Per-turn binary rubrics plus a per-session state check, logged to Langfuse, for an assistant that has to remember context and follow rules across turns.

Multi-turn evals are where most teams' single-turn judge programs come apart. The rubric that scored a one-shot response cleanly cannot tell you whether the assistant remembered the dietary restriction from turn one when it answered turn five. The recipe here is the smallest useful multi-turn eval that does two jobs at once: a per-turn binary rubric for response quality, and a per-session state check that asks whether the assistant respected the running constraints across the conversation. The structure is adapted from the Langfuse multi-turn cookbook and from MT-Bench's two-turn scoring pattern ¹ ².

We log every turn to Langfuse as a child observation under a session-level trace, then score each session at two levels: once per turn (per-turn rubric) and once at the end (per-session state check). The Langfuse session is the unit you read when something breaks; the per-turn and per-session scores are the metrics you ship.

The dataset shape

You need three artifacts:

Scenarios (scenarios.jsonl). One scenario per line. Each scenario is a list of user turns, plus a per-turn rubric and an end-of-session state check.
Per-turn rubric (turn_rubric.txt). One rubric per turn type, binary.
Per-session state check (state_check.py). A function that takes the full transcript and returns PASS or FAIL plus a one-line reason.

A scenario:

{
  "id": "s001",
  "domain": "meal-planning",
  "constraints": ["vegan", "peanut allergy"],
  "turns": [
    {"user": "I am vegan and have a peanut allergy. Suggest a dinner.", "rubric": "respects_constraints"},
    {"user": "What about a side dish?", "rubric": "respects_constraints"},
    {"user": "Can you also suggest a dessert?", "rubric": "respects_constraints"}
  ],
  "session_state_check": "no_constraint_violation_anywhere_in_transcript"
}

The per-turn rubric is binary and reusable. The session check is custom code that walks the full transcript.
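A malformed scenario tends to fail late, in the middle of a paid run, so it can be worth checking the file up front. The following is a minimal validation sketch that assumes the field names from the example scenario above; `validate_scenario`, `load_scenarios`, and `KNOWN_RUBRICS` are hypothetical names, not part of the recipe's scripts.

```python
import json

# Assumption: the rubric keys the per-turn judge knows how to score.
KNOWN_RUBRICS = {"respects_constraints"}
REQUIRED_KEYS = {"id", "domain", "constraints", "turns", "session_state_check"}


def validate_scenario(scenario: dict) -> list[str]:
    """Return a list of problems; an empty list means the scenario is usable."""
    problems = []
    missing = REQUIRED_KEYS - scenario.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    turns = scenario.get("turns", [])
    if not turns:
        problems.append("scenario has no turns")
    for i, turn in enumerate(turns, start=1):
        if not str(turn.get("user", "")).strip():
            problems.append(f"turn {i}: empty user message")
        if turn.get("rubric") not in KNOWN_RUBRICS:
            problems.append(f"turn {i}: unknown rubric {turn.get('rubric')!r}")
    return problems


def load_scenarios(text: str) -> list[dict]:
    """Parse JSONL text, skip blank lines, and raise on the first bad scenario."""
    scenarios = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        scenario = json.loads(line)
        problems = validate_scenario(scenario)
        if problems:
            raise ValueError(f"line {lineno}: {'; '.join(problems)}")
        scenarios.append(scenario)
    return scenarios
```

Calling `load_scenarios(Path("scenarios.jsonl").read_text())` before the first model call would surface a typo in a rubric key before any tokens are spent.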

Step 1: install and set up Langfuse

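# Assumption: the scripts below use the v2-style Python SDK calls (LF.trace, trace.generation, LF.score), so pin below v3.
pip install "langfuse<3" openai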
export LANGFUSE_PUBLIC_KEY=...
export LANGFUSE_SECRET_KEY=...
export LANGFUSE_HOST=https://cloud.langfuse.com
export OPENAI_API_KEY=...

Langfuse Cloud has a free tier sufficient for a 30-scenario dev set; self-hosted is fine too. The point of using it here is that the session, the per-turn observations, and the per-turn and per-session scores all show up in a viewer designed for reading multi-turn traces ¹.
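A missing key usually shows up as an authentication error several calls into a run. As a small preflight sketch, assuming the four variable names exported above are the only ones the scripts need (`missing_env` is a hypothetical helper):

```python
import os

REQUIRED_ENV = (
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_SECRET_KEY",
    "LANGFUSE_HOST",
    "OPENAI_API_KEY",
)


def missing_env(env=None) -> list[str]:
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV if not env.get(name, "").strip()]
```

A script could call `missing_env()` at startup and exit with the returned names if the list is non-empty.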

Step 2: run the assistant under trace

"""Run each multi-turn scenario against the assistant, log to Langfuse.

Run:
    pip install "langfuse<3" openai
    export LANGFUSE_PUBLIC_KEY=... LANGFUSE_SECRET_KEY=... LANGFUSE_HOST=...
    export OPENAI_API_KEY=...
    python run_assistant.py
"""
from __future__ import annotations

import json
import uuid
from pathlib import Path

from langfuse import Langfuse
from openai import OpenAI

LF = Langfuse()
OAI = OpenAI()
MODEL = "gpt-4o-2024-08-06"
SYSTEM = (
    "You are a meal-planning assistant. Suggest meals that respect every "
    "dietary restriction stated earlier in the conversation. Keep "
    "responses to two sentences."
)


def run_scenario(scenario: dict) -> dict:
    sid = scenario["id"]
    session_id = f"{sid}-{uuid.uuid4().hex[:8]}"
    trace = LF.trace(
        id=session_id,
        name=f"multi-turn-{sid}",
        session_id=session_id,
        metadata={"scenario_id": sid, "constraints": scenario["constraints"]},
    )
    messages = [{"role": "system", "content": SYSTEM}]
    transcript = []

    for i, turn in enumerate(scenario["turns"]):
        messages.append({"role": "user", "content": turn["user"]})
        gen = trace.generation(
            name=f"turn-{i + 1}",
            model=MODEL,
            input=messages.copy(),
            metadata={"rubric": turn["rubric"]},
        )
        resp = OAI.chat.completions.create(
            model=MODEL, temperature=0.2, messages=messages
        )
        reply = resp.choices[0].message.content
        gen.end(output=reply, usage_details={
            "input": resp.usage.prompt_tokens,
            "output": resp.usage.completion_tokens,
        })
        messages.append({"role": "assistant", "content": reply})
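        # Keep the generation id so later scores can attach to this exact turn.
        transcript.append({
            "user": turn["user"],
            "assistant": reply,
            "rubric": turn["rubric"],
            "observation_id": gen.id,
        })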

    LF.flush()
    return {"session_id": session_id, "transcript": transcript}


def main() -> None:
    scenarios = [json.loads(line) for line in Path("scenarios.jsonl").read_text().splitlines() if line.strip()]
    Path("sessions").mkdir(exist_ok=True)
    for s in scenarios:
        result = run_scenario(s)
        Path(f"sessions/{result['session_id']}.json").write_text(
            json.dumps({"scenario": s, **result}, indent=2)
        )
        print(f"ran {s['id']} -> {result['session_id']}")


if __name__ == "__main__":
    main()

Each scenario run shows up in Langfuse as a single session containing one generation per turn. You can open the session view and read the full conversation as it happened.

Step 3: per-turn judge

We grade each turn independently. The rubric is binary; the model is the same family as the generator (deliberate). The token cost is one judge call per turn.

"""Per-turn binary judge. Scores attached to Langfuse generations.

Run after run_assistant.py.
"""
from __future__ import annotations

import json
from pathlib import Path

from langfuse import Langfuse
from openai import OpenAI

LF = Langfuse()
OAI = OpenAI()
MODEL = "gpt-4o-2024-08-06"

RUBRICS = {
    "respects_constraints": (
        "Does the assistant's reply violate any dietary restriction the user "
        "stated earlier in this conversation? Reply FAIL if the reply names "
        "or implies any ingredient that violates a stated restriction "
        "(including in garnishes or sauces). Reply PASS otherwise."
    ),
}


def judge_turn(history: list[dict], user: str, reply: str, rubric_key: str) -> dict:
    rubric = RUBRICS[rubric_key]
    # Label speakers so the judge can tell user constraints from assistant text.
    past = "\n".join(
        f"USER: {t['user']}\nASSISTANT: {t['assistant']}\n" for t in history
    )
    prompt = (
        "You evaluate one assistant turn against a binary rubric.\n\n"
        f"RUBRIC:\n{rubric}\n\n"
        f"CONVERSATION SO FAR:\n{past or '(none)'}\n\n"
        # The current user message matters: on turn one it is the only place
        # the constraints appear.
        f"CURRENT USER MESSAGE:\n{user}\n\n"
        f"REPLY UNDER REVIEW:\n{reply}\n\n"
        'Return JSON: {"label": "PASS" or "FAIL", "critique": "..."}'
    )
    msg = OAI.chat.completions.create(
        model=MODEL,
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(msg.choices[0].message.content)


def main() -> None:
    for path in Path("sessions").glob("*.json"):
        data = json.loads(path.read_text())
        transcript = data["transcript"]
        session_id = data["session_id"]
        history: list[dict] = []
        for turn in transcript:
            result = judge_turn(history, turn["user"], turn["assistant"], turn["rubric"])
            LF.score(
                trace_id=session_id,
                # Attach to the turn's generation when run_assistant.py saved its
                # id; otherwise the score falls back to the trace level.
                observation_id=turn.get("observation_id"),
                name=f"per_turn_{turn['rubric']}",
                value=1 if result["label"] == "PASS" else 0,
                comment=result["critique"],
            )
            history.append(turn)
            turn["judge_label"] = result["label"]
            turn["judge_critique"] = result["critique"]
        path.write_text(json.dumps(data, indent=2))
        LF.flush()
        print(f"scored {session_id}")


if __name__ == "__main__":
    main()

The Langfuse score lives next to the trace; you can filter the session view by score and read every failed turn in context.
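The judge script indexes `result["label"]` directly, so a reply with a lowercase label or a missing key either raises or is silently scored as 0. One way to make that explicit is to normalize the judge output first. This is a sketch, not part of the recipe; `parse_judge_output` and the `ERROR` label are assumptions introduced here.

```python
import json


def parse_judge_output(raw: str) -> dict:
    """Normalize a judge reply to a dict with label PASS, FAIL, or ERROR."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return {"label": "ERROR", "critique": "judge reply was not valid JSON"}
    if not isinstance(data, dict):
        return {"label": "ERROR", "critique": "judge reply was not a JSON object"}
    label = str(data.get("label", "")).strip().upper()
    if label not in {"PASS", "FAIL"}:
        return {"label": "ERROR", "critique": f"unexpected label: {data.get('label')!r}"}
    return {"label": label, "critique": str(data.get("critique", "")).strip()}
```

Keeping `ERROR` separate from `FAIL` means judge malfunctions can be counted and retried instead of being mixed into the failure rate.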

Step 4: per-session state check

This is the layer that catches the failure single-turn judges miss. The assistant might honor the constraint in turn one and turn three but violate it in turn two. A per-session check walks the full transcript and asks one question.

"""Per-session deterministic state check.

For meal-planning: does any assistant reply across the full transcript
mention an ingredient that violates the listed constraints?
"""
from __future__ import annotations

import json
import re
from pathlib import Path

from langfuse import Langfuse

LF = Langfuse()

# Tiny ingredient banlist; extend per constraint as needed. Patterns match
# singular and plural forms. They do not understand negation or substitutes,
# so "peanut-free" or "coconut milk" will still trip them.
BANLIST = {
    "vegan": [
        r"\b(chicken|beef|pork|lamb|fish|salmon|tuna|cheeses?|milk|yogurts?|"
        r"butter|cream|eggs?|honey|gelatin|anchov(?:y|ies))\b"
    ],
    "peanut allergy": [r"\bpeanuts?\b"],
    "dairy": [r"\b(milk|cheeses?|yogurts?|butter|cream|whey|casein)\b"],
    "gluten": [r"\b(wheat|barley|rye|breads?|pastas?|flour)\b"],
}


def violates(text: str, constraint: str) -> str | None:
    for pat in BANLIST.get(constraint, []):
        m = re.search(pat, text, re.IGNORECASE)
        if m:
            return m.group(0)
    return None


def check_session(scenario: dict, transcript: list[dict]) -> tuple[str, str]:
    for i, turn in enumerate(transcript):
        for c in scenario["constraints"]:
            offending = violates(turn["assistant"], c)
            if offending:
                return "FAIL", f"Turn {i + 1} violates '{c}' via '{offending}'"
    return "PASS", "No constraint violation found in transcript."


def main() -> None:
    for path in Path("sessions").glob("*.json"):
        data = json.loads(path.read_text())
        label, reason = check_session(data["scenario"], data["transcript"])
        LF.score(
            trace_id=data["session_id"],
            name="session_state_check",
            value=1 if label == "PASS" else 0,
            comment=reason,
        )
        data["session_state"] = {"label": label, "reason": reason}
        path.write_text(json.dumps(data, indent=2))
        print(f"{data['session_id']}: {label} - {reason}")
    LF.flush()


if __name__ == "__main__":
    main()

In meal-planning the deterministic check is a regex over an ingredient banlist. In other domains the deterministic check is a SQL query against a known schema, a state-equality check against a fixture, or a domain-specific Python validator. The pattern is the same: a function that takes the full transcript and returns one binary label plus one line of reason.
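To illustrate the same contract in a non-regex domain, here is a sketch of a state-equality check against a fixture. It assumes each transcript turn also carries a `state` dict snapshot; that field is hypothetical and is not written by the run script in this recipe, and `check_final_state` is an illustrative name.

```python
def check_final_state(transcript: list[dict], expected: dict) -> tuple[str, str]:
    """Compare the last recorded state snapshot against an expected fixture."""
    # Assumption: turns may carry a "state" dict (hypothetical field).
    snapshots = [t["state"] for t in transcript if isinstance(t.get("state"), dict)]
    if not snapshots:
        return "FAIL", "No state snapshot recorded in transcript."
    final = snapshots[-1]
    # Only keys named in the fixture are compared; extra state is ignored.
    for key in sorted(expected):
        if final.get(key) != expected[key]:
            return "FAIL", f"'{key}' is {final.get(key)!r}, expected {expected[key]!r}"
    return "PASS", "Final state matches fixture."
```

The return shape matches `check_session`, so the same scoring and reporting code could consume either check.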

Step 5: aggregate the report

"""Compute per-turn and per-session rates from sessions/."""
import json
from pathlib import Path

import pandas as pd

rows = []
for p in Path("sessions").glob("*.json"):
    data = json.loads(p.read_text())
    for i, t in enumerate(data["transcript"]):
        rows.append({
            "session_id": data["session_id"],
            "scenario_id": data["scenario"]["id"],
            "turn": i + 1,
            "rubric": t["rubric"],
            "turn_label": t.get("judge_label", "?"),
            "session_label": data.get("session_state", {}).get("label", "?"),
        })

df = pd.DataFrame(rows)
# Group by session_id, not scenario_id: reruns of one scenario are separate sessions.
per_session = df.groupby("session_id").agg(
    all_turns_pass=("turn_label", lambda s: (s == "PASS").all()),
    session_label=("session_label", "first"),
)
print(f"Sessions: {len(per_session)}, turns: {len(df)}")
print(f"Per-turn PASS rate:   {(df['turn_label'] == 'PASS').mean():.3f}")
print(f"Per-session PASS rate: {(per_session['session_label'] == 'PASS').mean():.3f}")

# Where do the rates disagree?
mismatch = per_session["all_turns_pass"].astype(bool) & (per_session["session_label"] == "FAIL")
print(f"Sessions where per-turn says PASS but per-session FAILS: {mismatch.sum()}")

The pattern to watch for is the gap between the per-turn and per-session rates. If per-turn PASS is 0.92 but per-session PASS is 0.74, you likely have a state leak: the assistant scores well turn by turn, but the session-level check catches violations that the per-turn judge let through. The mismatch count printed by the report isolates exactly those sessions. That is the single most common multi-turn failure mode and the reason single-turn judges miss real bugs ².
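Part of any such gap is plain compounding, so it helps to know the baseline before attributing all of it to lost state. A back-of-envelope sketch, assuming every turn passes independently at the same rate (an idealization, since real failures are correlated; `expected_session_pass` is a hypothetical helper):

```python
def expected_session_pass(per_turn_pass: float, turns: int) -> float:
    """Session pass rate if every turn passed independently at the same rate."""
    return per_turn_pass ** turns
```

With three turns at 0.92 per turn, independence alone predicts roughly 0.78 per session, so in the example above only the drop from about 0.78 to 0.74 would point at cross-turn state.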

Step 6: read the failures in Langfuse

Filter the Langfuse session view by session_state_check = 0. Open each failed session. Read the assistant replies turn by turn. The fix is usually one of:

The system prompt did not explicitly say to re-read the constraints each turn. Fix at the prompt layer.
The assistant compressed context (because of a context-window strategy or summarization) and dropped the constraint. Fix at the memory layer.
The constraint surfaced ambiguously in turn one and the assistant interpreted it loosely on later turns. Fix at the user-experience layer (force confirmation).

Reading the trace tells you which layer to fix. Reading the metric does not.

Step 7: wire it to CI

Pin a 30-session smoke set in your repo. On every PR that touches the prompt or the memory implementation, run the run-assistant, per-turn-judge, and state-check scripts. Fail the build if either the per-turn rate or the per-session rate drops more than five points from the main-branch baseline.

For trace inspection in CI logs, write a compact summary to stdout (the report.py snippet above) and a JSON artifact for the run.
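The build gate described above can be sketched as a pure function over two rate dictionaries. The metric keys `per_turn_pass` and `per_session_pass` and the function name `gate` are assumptions for illustration; rates are fractions, so five points is 0.05.

```python
def gate(current: dict, baseline: dict, max_drop: float = 0.05) -> list[str]:
    """Return one message per metric that dropped more than max_drop."""
    failures = []
    for metric in ("per_turn_pass", "per_session_pass"):
        drop = baseline[metric] - current[metric]
        # Small tolerance so a drop of exactly max_drop is not tripped by
        # floating-point rounding.
        if drop > max_drop + 1e-9:
            failures.append(
                f"{metric} dropped {drop:.3f} "
                f"(baseline {baseline[metric]:.3f}, current {current[metric]:.3f})"
            )
    return failures
```

A CI step could print the returned messages and exit non-zero when the list is non-empty, with the baseline loaded from the JSON artifact of the last main-branch run.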

What this skips

Pairwise comparison between two assistant versions (use the position-swap pattern from MT-Bench).
Persona drift detection (a classifier on the assistant's reply style).
Cost and latency tracking (Langfuse logs both; report them in the same dashboard).
Long-context state retention (RULER-style multi-needle for a chat session; see long-context eval with RULER).

For trajectory-level evals where the agent calls tools across turns, the agent eval with τ-bench recipe is the right pattern. The conversational-quality task type page covers the broader topic ³. The Moshkovich observability paper makes the case that per-session deterministic checks are the missing piece for production observability ⁴.

What to do next

The conversational-quality task type covers the broader rubric design. The Hamel evals essay covers why binary rubrics beat Likert in multi-turn settings ⁵. The Langfuse online eval recipe covers the production loop where new traces feed back into the eval set.

¹ Langfuse Cookbook, "Multi-turn eval example."
² Zheng et al., MT-Bench. Two-turn scoring pattern, NeurIPS 2023 D&B.
³ Hamel Husain, "Creating an LLM-as-a-Judge That Drives Business Results."
⁴ Moshkovich et al., "Beyond Black-Box Benchmarking." arXiv:2503.06745.
⁵ Hamel Husain, "Your AI Product Needs Evals."