Online eval with Langfuse

Instrument an app with Langfuse, judge a 5% sample of production traces plus every thumbs-down, and promote confirmed failures into the offline dataset.

Offline evals score a frozen set; online evals score what production actually served. You need both, and they answer different questions: offline gates the release, online tells you what the release did to users ¹. Langfuse is the open-source default for the online half because the four primitives the loop needs (tracing SDK, scores API, annotation queues, datasets) live in one place, and since June 2025 all of them, including the managed LLM-as-a-judge evaluators, are MIT-licensed rather than paywalled ². If your observability is OpenTelemetry-first or you are already on Arize, the same loop in Phoenix is the sibling recipe: online eval with Phoenix. Pick one; running both is how teams end up with two dashboards nobody trusts.

This recipe wires the full loop with three scripts: an instrumented app, a judge worker that scores a deterministic 5% sample plus every user-flagged trace, and a promotion job that turns confirmed failures into offline eval cases. The hour breaks down as: fifteen minutes to instrument and see traces, twenty for the worker, fifteen for the dashboard and dataset wiring, ten to decide your sampling and alerting numbers.

What you build

app.py: a support bot instrumented with the v4 SDK; every call is a trace carrying session, user, and a prod tag, and thumbs-down feedback lands as a score plus a local flag queue.
judge_worker.py: a cron-shaped worker that pulls recent traces from the API, judges the sample, and writes BOOLEAN scores back with the critique in the comment.
promote.py: a weekly job that moves judged failures into a Langfuse dataset, deduplicated by content hash, linked to their source traces.

Step 1: instrument the app

pip install langfuse openai anthropic
export LANGFUSE_PUBLIC_KEY=pk-lf-...
export LANGFUSE_SECRET_KEY=sk-lf-...
export LANGFUSE_BASE_URL=https://cloud.langfuse.com   # or your own host
export OPENAI_API_KEY=... ANTHROPIC_API_KEY=...

The SDK moved to an OpenTelemetry core in v3 and renamed pieces again in v4. The code below is written against v4, and v2 snippets from older blog posts will not run against it (langfuse.trace(...) is gone). The three stable entry points are the @observe decorator, the langfuse.openai drop-in client, and get_client() ³.

"""The instrumented app: every production call becomes a Langfuse trace.

Run:
    python app.py
"""
from __future__ import annotations

import json

from langfuse import get_client, observe, propagate_attributes
from langfuse.openai import openai  # drop-in client; calls are traced

langfuse = get_client()
MODEL = "gpt-4o-2024-08-06"
SYSTEM = (
    "You are the support assistant for a meal-kit company. Answer only from "
    "the provided policy context. If the context does not cover the "
    "question, say you do not know and offer the human handoff."
)

POLICY = """Cancellations: free until the Wednesday before delivery.
Refunds: spoiled items are refunded within 5 business days of a photo report.
Allergens: boxes are packed in a facility that handles nuts and shellfish."""


@observe()
def answer(*, question: str, context: str, session_id: str, user_id: str) -> dict:
    with propagate_attributes(
        session_id=session_id, user_id=user_id, tags=["prod", "support-bot"]
    ):
        resp = openai.chat.completions.create(
            model=MODEL,
            name="support-answer",
            messages=[
                {"role": "system", "content": SYSTEM},
                {"role": "user",
                 "content": f"CONTEXT:\n{context}\n\nQUESTION: {question}"},
            ],
        )
    return {
        "answer": resp.choices[0].message.content,
        "trace_id": langfuse.get_current_trace_id(),
    }


def record_feedback(trace_id: str, thumbs_up: bool) -> None:
    """Wire this to your feedback button. Failures also land in a local
    queue so the judge worker scores them at 100%, not the sample rate."""
    langfuse.create_score(
        trace_id=trace_id, name="user_feedback",
        value=1 if thumbs_up else 0, data_type="BOOLEAN",
    )
    if not thumbs_up:
        with open("flagged_traces.jsonl", "a") as f:
            f.write(json.dumps({"trace_id": trace_id}) + "\n")


if __name__ == "__main__":
    questions = [
        "Can I cancel my Friday box on Thursday?",
        "My salmon arrived warm, what now?",
        "Is the vegan box safe for my peanut allergy?",
        "Do you deliver to Alaska?",
    ]
    for i, q in enumerate(questions):
        out = answer(question=q, context=POLICY,
                     session_id=f"sess-{i:03d}", user_id="user-042")
        print(f"[{out['trace_id'][:8]}] {q}\n  -> {out['answer'][:80]}")
        if i == 2:  # simulate a thumbs-down on the allergy answer
            record_feedback(out["trace_id"], thumbs_up=False)
    langfuse.flush()

Run it and open the Langfuse UI: four traces tagged prod, each with the model call nested inside, one carrying a user_feedback score of 0. The trace id returned to the caller is the load-bearing detail; production feedback can only be attached if your API response carries it. Note what is sampled where: LANGFUSE_SAMPLE_RATE drops traces at ingestion (use it only under real volume pressure, since a dropped trace cannot be debugged or judged later), while the judge worker samples at scoring time. Keep ingestion at 100% for as long as you can afford it ³.

Step 2: the judge worker

Two ways to attach judge scores to production traces. The platform way: Langfuse's managed LLM-as-a-judge evaluators run inside Langfuse itself, configured in the UI with a variable mapping and a sampling percentage of new traces, and they are the right choice when you want zero infrastructure ⁴. The worker way, below, is the external evaluation pipeline pattern: you own the judge prompt, the model choice, the sampling logic, and the version pinning, and Langfuse is the system of record for the resulting scores ⁵. Teams that calibrated a judge offline almost always want the worker, because the exact prompt that cleared calibration is the one that must run online.

"""Scoring worker: judge a 5% sample of production traces plus every
user-flagged trace, write scores back to Langfuse, queue failures.

Run on a schedule (cron every 15 minutes is plenty to start):
    python judge_worker.py
"""
from __future__ import annotations

import hashlib
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

from anthropic import Anthropic
from langfuse import get_client

langfuse = get_client()
anthropic = Anthropic()

JUDGE_MODEL = "claude-sonnet-4-5"
SCORE_NAME = "grounded_v3"  # bump the suffix when the prompt changes
SAMPLE_PCT = 5
LOOKBACK_HOURS = 24
STATE = Path("scored_ids.json")
FLAGGED = Path("flagged_traces.jsonl")
FAILS = Path("fails.jsonl")

JUDGE_PROMPT = """You are grading one support-bot answer against one rubric.

RUBRIC: The answer must be fully supported by the CONTEXT. If the context
does not cover the question, the only passing answer is one that says so
and offers the handoff. Any unsupported claim is a FAIL.

CONTEXT:
{context}

QUESTION:
{question}

ANSWER UNDER REVIEW:
{answer}

Reply with JSON: {{"label": "PASS" or "FAIL", "critique": "one sentence"}}"""


def in_sample(trace_id: str) -> bool:
    """Deterministic 5% sample: same trace, same verdict, on every rerun."""
    return int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 100 < SAMPLE_PCT


def unpack(trace) -> tuple[str, str, str]:
    """@observe stores kwargs as trace input and the return dict as output.
    Adjust this one function to your own trace schema."""
    kwargs = trace.input.get("kwargs", {}) if isinstance(trace.input, dict) else {}
    output = trace.output if isinstance(trace.output, dict) else {}
    return (kwargs.get("question", ""), kwargs.get("context", ""),
            output.get("answer", str(trace.output)))


def judge(question: str, context: str, answer: str) -> dict:
    msg = anthropic.messages.create(
        model=JUDGE_MODEL, max_tokens=200, temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            context=context, question=question, answer=answer)}],
    )
    text = msg.content[0].text.strip().removeprefix("```json").removesuffix("```")
    return json.loads(text)


def fetch_recent_traces() -> list:
    since = datetime.now(timezone.utc) - timedelta(hours=LOOKBACK_HOURS)
    traces, page = [], 1
    while True:
        batch = langfuse.api.trace.list(
            tags="prod", from_timestamp=since, page=page, limit=50
        ).data
        traces.extend(batch)
        if len(batch) < 50:
            return traces
        page += 1


def main() -> None:
    scored: set[str] = set(json.loads(STATE.read_text())) if STATE.exists() else set()
    flagged: set[str] = set()
    if FLAGGED.exists():
        flagged = {json.loads(l)["trace_id"]
                   for l in FLAGGED.read_text().splitlines() if l}

    traces = fetch_recent_traces()
    todo = [t for t in traces
            if t.id not in scored and (t.id in flagged or in_sample(t.id))]
    n_pass = 0
    for t in todo:
        question, context, answer = unpack(t)
        verdict = judge(question, context, answer)
        langfuse.create_score(
            trace_id=t.id, name=SCORE_NAME,
            value=1 if verdict["label"] == "PASS" else 0,
            data_type="BOOLEAN", comment=verdict["critique"],
        )
        scored.add(t.id)
        n_pass += verdict["label"] == "PASS"
        if verdict["label"] == "FAIL":
            with FAILS.open("a") as f:
                f.write(json.dumps({
                    "trace_id": t.id, "question": question,
                    "answer": answer, "critique": verdict["critique"],
                }) + "\n")

    STATE.write_text(json.dumps(sorted(scored)))
    langfuse.flush()
    print(f"window: {len(traces)} traces, {len(flagged)} user-flagged")
    print(f"judged: {len(todo)} ({SAMPLE_PCT}% sample + flagged), "
          f"PASS {n_pass}, FAIL {len(todo) - n_pass}")


if __name__ == "__main__":
    main()

Output shape from one worker run (illustrative numbers; the four-trace demo on its own produces a much smaller window, and yours will differ):

window: 412 traces, 7 user-flagged
judged: 26 (5% sample + flagged), PASS 22, FAIL 4

Three decisions in this worker are deliberate. Sampling hashes the trace id instead of calling random(), so reruns and backfills judge the same subset and your daily pass rate is not jittered by the sampler. The score name carries the judge version (grounded_v3), because a judge prompt change moves the metric and an unversioned score series silently splices two different instruments; never reuse a name after the prompt changes. And every score carries the judge's critique in its comment, so a human reading a failed trace in Langfuse sees why the judge objected without leaving the page.
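To see why hashing beats random() here, the sketch below runs the worker's sampling rule over synthetic ids. The ids are made up for the demo (an assumption; real trace ids come from Langfuse), and the pct parameter is added only so the rule can be tried at two rates.

```python
import hashlib
import uuid

SAMPLE_PCT = 5  # same rate as the worker


def in_sample(trace_id: str, pct: int = SAMPLE_PCT) -> bool:
    """Hash the id into one of 100 buckets; the lowest `pct` buckets are in."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 100
    return bucket < pct


# Synthetic ids stand in for real trace ids (an assumption for the demo).
ids = [uuid.uuid5(uuid.NAMESPACE_DNS, f"trace-{i}").hex for i in range(20_000)]

first_pass = {i for i in ids if in_sample(i)}
second_pass = {i for i in ids if in_sample(i)}  # a rerun or backfill
rate = len(first_pass) / len(ids)

# Raising the rate later only adds traces; nothing already judged drops out.
at_ten_pct = {i for i in ids if in_sample(i, pct=10)}

print(f"sampled {len(first_pass)} of {len(ids)} ({rate:.1%})")
print(f"rerun identical: {first_pass == second_pass}")
print(f"5% sample contained in 10% sample: {first_pass <= at_ten_pct}")
```

The containment property is a side effect of the bucket comparison: moving from 5% to 10% keeps every trace that was already in the sample, so earlier scores stay part of the series.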

Step 3: read the dashboard like an eval, not like a vibe

In Langfuse, chart the daily mean of grounded_v3 next to user_feedback, split by release version. Two reading rules. Alert on the score crossing the offline error bar, not on any movement: if your judge cleared calibration with a dev-set pass rate of 0.91 plus or minus 0.04, a production day at 0.88 is noise and a week at 0.84 is a finding (adding error bars gives you the interval). And treat judge-vs-feedback divergence as its own signal: judge PASS with rising thumbs-downs means the rubric is missing a failure mode users can see; route a week of thumbs-down traces through an annotation-queue review and fix the rubric before trusting the metric again ³.
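The two reading rules can be sketched as plain functions. This is a minimal illustration that uses the 0.91 plus or minus 0.04 figures from the example in the text; the min_rise threshold and the split-the-window-in-half comparison are assumptions made for the sketch, not recommended values.

```python
from statistics import mean

OFFLINE_PASS_RATE = 0.91   # dev-set pass rate from the example in the text
OFFLINE_HALF_WIDTH = 0.04  # its "plus or minus"
LOWER = OFFLINE_PASS_RATE - OFFLINE_HALF_WIDTH
UPPER = OFFLINE_PASS_RATE + OFFLINE_HALF_WIDTH


def classify(rate: float) -> str:
    """'noise' inside the offline interval, 'finding' outside it."""
    return "noise" if LOWER <= rate <= UPPER else "finding"


def judge_feedback_divergence(judge_rates, thumbs_down_rates, min_rise=0.02):
    """True when the judge looks healthy while thumbs-downs climb.

    Compares the second half of the window with the first half, so it
    needs at least two points. min_rise is a hypothetical threshold.
    """
    half = len(thumbs_down_rates) // 2
    rise = mean(thumbs_down_rates[half:]) - mean(thumbs_down_rates[:half])
    return classify(mean(judge_rates)) == "noise" and rise >= min_rise


single_day = classify(0.88)
bad_week = classify(mean([0.85, 0.83, 0.84, 0.86, 0.82, 0.84, 0.84]))
diverging = judge_feedback_divergence(
    judge_rates=[0.90, 0.91, 0.92, 0.90],
    thumbs_down_rates=[0.03, 0.03, 0.06, 0.07],
)
print(single_day, bad_week, diverging)
```

The daily rates in the usage lines are invented to reproduce the 0.88 day and the 0.84 week from the text.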

The judge itself gets audited on a schedule. Once a month, send 50 judge-scored traces to a human, compute agreement, and compare against the TPR/TNR the prompt cleared offline in LLM-judge calibration. An online judge that nobody re-checks decays into a dashboard decoration.
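The monthly audit reduces to a small confusion-matrix calculation. The sketch below assumes PASS is the positive class (swap the roles if your calibration treated FAIL as positive) and uses 50 synthetic label pairs, not real audit data.

```python
def audit(pairs: list[tuple[bool, bool]]) -> dict[str, float]:
    """pairs are (human_pass, judge_pass); PASS is the positive class."""
    tp = sum(h and j for h, j in pairs)
    tn = sum((not h) and (not j) for h, j in pairs)
    fp = sum((not h) and j for h, j in pairs)
    fn = sum(h and (not j) for h, j in pairs)
    return {
        "agreement": (tp + tn) / len(pairs),
        # Of the traces the human passed, how many did the judge pass?
        "tpr": tp / (tp + fn) if tp + fn else float("nan"),
        # Of the traces the human failed, how many did the judge fail?
        "tnr": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Synthetic audit of 50 traces: 38 agreed passes, 8 agreed fails,
# 2 passes the judge rejected, 2 fails the judge let through.
pairs = (
    [(True, True)] * 38
    + [(True, False)] * 2
    + [(False, False)] * 8
    + [(False, True)] * 2
)
report = audit(pairs)
print(report)
```

With failures this rare in a 50-trace audit, the TNR rests on about ten examples, so compare it against the offline number with that small denominator in mind.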

Step 4: close the loop into the offline set

Online eval pays for itself the day a production failure becomes an offline test case. The promotion job below trusts fails.jsonl; in practice you gate it behind the annotation queue, so a human confirms each failure (and writes the expected output) before it becomes a regression case.

"""Promote reviewed failures into the offline eval dataset.

Run weekly, after a human has walked the annotation queue:
    python promote.py
"""
from __future__ import annotations

import hashlib
import json
from pathlib import Path

from langfuse import get_client

langfuse = get_client()
DATASET = "support-bot-regressions"
FAILS = Path("fails.jsonl")
SEEN = Path("promoted_hashes.json")

langfuse.create_dataset(
    name=DATASET,
    description="Production failures promoted from online judge scores",
)

seen: set[str] = set(json.loads(SEEN.read_text())) if SEEN.exists() else set()
added = 0
for line in FAILS.read_text().splitlines() if FAILS.exists() else []:
    fail = json.loads(line)
    content_hash = hashlib.sha256(fail["question"].encode()).hexdigest()[:16]
    if content_hash in seen:
        continue  # the same failure mode does not get to vote twice
    langfuse.create_dataset_item(
        dataset_name=DATASET,
        input={"question": fail["question"]},
        expected_output=None,  # filled in during human review
        metadata={"critique": fail["critique"], "content_hash": content_hash},
        source_trace_id=fail["trace_id"],
    )
    seen.add(content_hash)
    added += 1

SEEN.write_text(json.dumps(sorted(seen)))
langfuse.flush()
print(f"promoted {added} new failures to dataset '{DATASET}'")

source_trace_id is the field that makes this auditable: every dataset item links back to the production trace it came from, so six months later you can still answer "why is this weird question in our eval set" ³. The content hash stops one angry user from contributing the same failure forty times and skewing the offline distribution. Your next offline run executes against this dataset, which is the moment the loop closes: production failures now gate releases.
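The deduplication key is worth a closer look. The promotion job hashes the raw question, so exact repeats collapse but near-duplicates that differ only in case or spacing do not. The sketch below contrasts that key with a normalizing variant; normalized_hash is a hypothetical helper, not part of the recipe.

```python
import hashlib


def content_hash(question: str) -> str:
    """The same 16-hex-character key the promotion job uses."""
    return hashlib.sha256(question.encode()).hexdigest()[:16]


def normalized_hash(question: str) -> str:
    """Hypothetical variant: fold case and whitespace before hashing."""
    canonical = " ".join(question.lower().split())
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


# Forty identical reports plus one that differs only in case and spacing.
reports = ["Do you deliver to Alaska?"] * 40 + ["do you deliver  to alaska?"]

exact_keys = {content_hash(q) for q in reports}
folded_keys = {normalized_hash(q) for q in reports}
print(f"raw hash: {len(exact_keys)} items, normalized: {len(folded_keys)} item")
```

If you switch keys after items have already been promoted, the stored hashes no longer match, so rebuild the seen-set when you change the function.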

What this does not do

It does not cover Phoenix or OTel-native instrumentation; that is the sibling recipe. It does not ensemble judges or distill a cheaper one; start with one calibrated judge and add machinery when the audit numbers say you need it. It does not handle multi-turn session scoring, which changes the unit of analysis from trace to session; the multi-turn conversation eval recipe covers that, and the same worker pattern applies with session_id filters. And the 5% number is a starting point, not advice: it is the same default the managed evaluators suggest, and the right rate is whatever makes judge cost a rounding error against your serving cost ⁴.
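One way to turn "a rounding error against your serving cost" into a number is to compute judge spend per production trace as a fraction of serving spend. The sketch below assumes flagging is independent of the hash sample, and every cost and rate in it is a hypothetical placeholder to be replaced with your own billing data.

```python
def judged_fraction(sample_rate: float, flagged_rate: float) -> float:
    """Share of traces judged: the sample plus flagged traces outside it.

    Assumes flagging is independent of the hash-based sample.
    """
    return sample_rate + flagged_rate * (1 - sample_rate)


def judge_cost_ratio(sample_rate, flagged_rate, judge_cost, serving_cost):
    """Judge spend as a fraction of serving spend, per production trace."""
    return judged_fraction(sample_rate, flagged_rate) * judge_cost / serving_cost


def max_sample_rate(budget_ratio, flagged_rate, judge_cost, serving_cost):
    """Largest sample rate that keeps judge spend within budget_ratio."""
    target = budget_ratio * serving_cost / judge_cost
    rate = (target - flagged_rate) / (1 - flagged_rate)
    return min(1.0, max(0.0, rate))  # clamp to a valid rate


# Hypothetical per-call costs in dollars and a hypothetical flag rate.
JUDGE_COST, SERVING_COST, FLAGGED_RATE = 0.004, 0.002, 0.02

ratio_at_5pct = judge_cost_ratio(0.05, FLAGGED_RATE, JUDGE_COST, SERVING_COST)
affordable = max_sample_rate(0.05, FLAGGED_RATE, JUDGE_COST, SERVING_COST)
print(f"judge/serving cost at 5%: {ratio_at_5pct:.3f}")
print(f"sample rate for a 5% budget: {affordable:.4f}")
```

With these placeholder numbers the judge call costs twice the serving call, which is why a 5% sample is no longer negligible; a cheaper judge or a pricier serving path changes the answer.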

TIP

Run the worker from the same repo and CI system as your offline evals, with the judge prompt imported from one shared module. The moment online and offline judges drift apart in wording, their numbers stop being comparable and every "production regressed" conversation turns into an argument about prompts.
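A shared module can also enforce the never-reuse-a-name rule mechanically. The sketch below is one possible guard, not something Langfuse provides: the registry, fingerprint, and check_version names are hypothetical, and the prompt is a shortened stand-in for the real judge prompt.

```python
import hashlib

# Hypothetical contents of a shared module imported by both the online
# worker and the offline eval suite.
SCORE_NAME = "grounded_v3"
JUDGE_PROMPT = "Grade the ANSWER against the CONTEXT. Reply PASS or FAIL."  # stand-in


def fingerprint(prompt: str) -> str:
    """Short, stable identifier for the exact prompt text."""
    return hashlib.sha256(prompt.encode()).hexdigest()[:12]


def check_version(score_name: str, prompt: str, registry: dict[str, str]) -> None:
    """Register a new score name, or verify a known one is unchanged."""
    fp = fingerprint(prompt)
    known = registry.setdefault(score_name, fp)  # first use registers the name
    if known != fp:
        raise ValueError(
            f"prompt for {score_name!r} changed; bump the version suffix"
        )


registry: dict[str, str] = {}  # in practice, a JSON file committed to the repo
check_version(SCORE_NAME, JUDGE_PROMPT, registry)

edited = JUDGE_PROMPT + " Be strict."
try:
    check_version(SCORE_NAME, edited, registry)  # same name, new wording
    reused_name_rejected = False
except ValueError:
    reused_name_rejected = True

check_version("grounded_v4", edited, registry)  # new name, new series: accepted
print(f"reuse rejected: {reused_name_rejected}, names: {sorted(registry)}")
```

Calling the check at import time in both pipelines means a wording change fails CI until someone bumps the suffix, which keeps the online and offline series tied to the same instrument.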

What to do next

If the judge has not been through calibration, do LLM-judge calibration first; an uncalibrated online judge automates a guess at production scale. For the conceptual map of where online evaluation sits in a platform, Chip Huyen's essay is the reference ¹, and the online vs offline chapter covers the decision in depth.

¹ Chip Huyen, "Building a Generative AI Platform." The online/offline split and where scoring sits in the platform diagram.
² Langfuse, "Open-sourcing the Langfuse product." June 4, 2025. LLM-as-a-judge evaluators, annotation queues, prompt experiments, and the playground moved under the MIT license.
³ Langfuse documentation. SDK v4 instrumentation, scores, sampling, annotation queues, and datasets; API verified against langfuse 4.7.1 as of 2026-06-10.
⁴ Langfuse documentation, "LLM-as-a-Judge evaluators." Managed evaluators with trace filters and sampling percentages, run inside the platform.
⁵ Langfuse cookbook, "Evaluate Langfuse traces with an external evaluation pipeline." The fetch-judge-score-back worker pattern this recipe extends.