Online eval with Phoenix

Trace a live app into self-hosted Phoenix over OpenTelemetry, judge sampled spans with the evals classifiers, and bank the failures as a regression dataset.

Offline evals score a pinned dataset before you ship; online evals score what production actually did after you shipped. Phoenix is the open-source way to run the second loop without sending traces to a vendor: your app emits OpenTelemetry spans through OpenInference instrumentation, Phoenix stores them, and a scheduled job samples recent spans, judges them, and attaches the verdicts back to the traces as annotations ¹. This recipe builds that whole loop against a local Phoenix. The same loop on Langfuse is its own recipe: online eval with Langfuse.

One API note up front. The phoenix.evals package was rewritten; this recipe uses the current 3.x surface (create_classifier, evaluate_dataframe, prebuilt evaluators in phoenix.evals.metrics). The llm_classify and run_evals functions you will find in older tutorials lived under phoenix.evals.legacy in 2.x and were removed in 3.x; pin arize-phoenix-evals<3 if you need them ².
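Because the two API generations are incompatible, it can help to fail fast when the wrong one is installed. A minimal standard-library sketch: the helper names major_version and require_evals_3x are hypothetical, and the distribution name is the one pinned above.

```python
from importlib import metadata


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '3.1.0'."""
    return int(version.split(".")[0])


def require_evals_3x(dist: str = "arize-phoenix-evals") -> None:
    """Raise early if the installed evals package predates the 3.x API."""
    try:
        installed = metadata.version(dist)
    except metadata.PackageNotFoundError as exc:
        raise RuntimeError(f"{dist} is not installed") from exc
    if major_version(installed) < 3:
        raise RuntimeError(f"{dist} {installed} found; this recipe needs >=3")
```

Calling require_evals_3x() at the top of the scoring job would turn a confusing import error into a one-line explanation.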

What you build

Two files plus a server and a cron line:

phoenix serve: the trace store and UI on localhost:6006.
app.py: a small instrumented assistant that generates real traffic.
score_recent.py: the online eval job. Samples the last hour of LLM spans, runs a faithfulness judge and a resolution judge, logs annotations back, and banks failures as a Phoenix dataset.

Step 1: run Phoenix

pip install arize-phoenix
phoenix serve

The UI is at http://localhost:6006 and the same port receives traces ³. One process is enough for this recipe and for a surprising amount of real usage; Docker, Kubernetes, and Phoenix Cloud are the growing-up paths ³.
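Before sending traffic, a scripted check that something is listening on the Phoenix port can save a confusing first run. A minimal sketch using only the standard library; port_is_open is a hypothetical helper, and it only confirms that a TCP listener exists, not that the listener is Phoenix.

```python
import socket


def port_is_open(host: str = "localhost", port: int = 6006, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, name resolution failure, or timeout.
        return False


if __name__ == "__main__":
    print("listening" if port_is_open() else "nothing on localhost:6006")
```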

Step 2: instrument the app

register() wires an OTel tracer provider with Phoenix defaults, and auto_instrument=True activates any installed OpenInference instrumentor, so every OpenAI call below produces an LLM span with input and output messages attached. No tracing code in the app itself.

"""A tiny support assistant that answers from pinned context, fully traced.

Run:
    pip install arize-phoenix-otel openinference-instrumentation-openai openai
    export OPENAI_API_KEY=...
    python app.py
"""
from __future__ import annotations

import os

from openai import OpenAI
from phoenix.otel import register

os.environ.setdefault("PHOENIX_COLLECTOR_ENDPOINT", "http://localhost:6006")
register(project_name="support-rag", auto_instrument=True, batch=True)

client = OpenAI()

CONTEXT = """Lumen CI runs builds in containers. Build minutes reset on the 1st.
The free tier includes 2,000 build minutes and 2 concurrent jobs. Paid tiers
add concurrency, not speed. Deploy keys rotate from Project Settings > Keys.
Incident history is published at status.lumen.dev."""


def answer(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer only from this context:\n{CONTEXT}"},
            {"role": "user", "content": question},
        ],
    )
    # content can be None (for example on a refusal); keep the return type a str
    return resp.choices[0].message.content or ""


if __name__ == "__main__":
    questions = [
        "How many build minutes are on the free tier?",
        "Do paid tiers make individual builds faster?",
        "How do I rotate a deploy key?",
        "What was the root cause of yesterday's incident?",
        "Can I get more concurrency without paying?",
    ]
    for q in questions * 8:  # 40 requests of simulated traffic
        print(q[:40], "->", answer(q)[:60])

Run it and open the support-rag project in the UI: 40 traces, each with the full prompt, response, latency, and token counts. The fourth question is the plant: the context says where incident history lives but not the root cause, so a well-behaved app should decline, and a sloppy one will make something up. Online eval exists to find out which one you shipped.

Step 3: sample recent spans

Do not judge the firehose. Trace storage is cheap; judge tokens are the real cost, so the standard pattern is to score a fixed-size sample of recent traffic on a schedule, which is where online evaluation sits in Huyen's platform architecture ⁴. The Phoenix client pulls spans into a pandas dataframe with the OpenInference attributes as columns ¹.

The scoring job, first half:

"""Sample recent LLM spans, judge them, log annotations, bank failures.

Run:
    pip install 'arize-phoenix-evals>=3' arize-phoenix-client pandas openai
    export OPENAI_API_KEY=...
    python score_recent.py --window-minutes 60 --sample 25
"""
from __future__ import annotations

import argparse
import json
from datetime import datetime, timedelta, timezone

from phoenix.client import Client
from phoenix.evals import create_classifier, evaluate_dataframe
from phoenix.evals.llm import LLM
from phoenix.evals.metrics import FaithfulnessEvaluator
from phoenix.evals.utils import to_annotation_dataframe

PROJECT = "support-rag"


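def first(messages, role):
    """Return the content of the first message with the given role."""
    for m in messages:
        if m["message.role"] == role:
            return m["message.content"]
    # A bare next() would leak StopIteration into pandas; fail with a clear error.
    raise ValueError(f"span has no {role!r} message")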


def main() -> None:
    p = argparse.ArgumentParser()
    p.add_argument("--window-minutes", type=int, default=60)
    p.add_argument("--sample", type=int, default=25)
    p.add_argument("--seed", type=int, default=7)
    args = p.parse_args()

    client = Client()  # reads PHOENIX_COLLECTOR_ENDPOINT, defaults to localhost
    spans = client.spans.get_spans_dataframe(
        project_identifier=PROJECT,
        start_time=datetime.now(timezone.utc) - timedelta(minutes=args.window_minutes),
        limit=1000,
    )
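    llm_spans = spans[spans["span_kind"] == "LLM"]
    if llm_spans.empty:
        # A quiet window is normal on a short cadence; exit without judging.
        print("no LLM spans in window; nothing to score")
        return
    sampled = llm_spans.sample(
        n=min(args.sample, len(llm_spans)), random_state=args.seed
    )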

    eval_df = sampled[["context.span_id"]].copy()
    msgs = sampled["attributes.llm.input_messages"]
    eval_df["input"] = msgs.apply(lambda m: first(m, "user"))
    eval_df["context"] = msgs.apply(lambda m: first(m, "system"))
    eval_df["output"] = sampled["attributes.llm.output_messages"].apply(
        lambda m: m[0]["message.content"]
    )
    eval_df = eval_df.set_index("context.span_id")

The column names are the OpenInference semantic conventions: attributes.llm.input_messages and attributes.llm.output_messages hold the message lists, context.span_id is the join key everything later hangs on ¹.
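To make the extraction concrete, here is the shape the code above assumes for one span, written out by hand. The message contents are illustrative, and the flattened keys are the ones the job reads; real spans may carry additional keys.

```python
# Illustrative shape only: one span's input and output message lists.
input_messages = [
    {
        "message.role": "system",
        "message.content": "Answer only from this context:\n"
        "Deploy keys rotate from Project Settings > Keys.",
    },
    {"message.role": "user", "message.content": "How do I rotate a deploy key?"},
]
output_messages = [
    {"message.role": "assistant", "message.content": "Go to Project Settings > Keys."},
]


def first(messages, role):
    """Standalone copy of the job's helper: first message content for a role."""
    for m in messages:
        if m["message.role"] == role:
            return m["message.content"]
    raise ValueError(f"no {role!r} message")


# One row of eval_df, as a plain dict.
row = {
    "input": first(input_messages, "user"),
    "context": first(input_messages, "system"),
    "output": output_messages[0]["message.content"],
}
print(row["input"])
```

Note that the "context" column here is the whole system prompt, instruction line included, because that is where this app put its context.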

Step 4: judge the sample

Two judges. FaithfulnessEvaluator is prebuilt: it takes input, output, and context columns and returns a faithful or unfaithful label with score and explanation ². The second is a custom binary classifier built with create_classifier; the choices dict maps labels to scores so dashboards can average them. Both run over the dataframe in one call, and column names matching evaluator fields is the whole wiring ¹.

    judge_llm = LLM(provider="openai", model="gpt-4o-mini")

    faithfulness = FaithfulnessEvaluator(llm=judge_llm)
    resolution = create_classifier(
        name="resolution",
        llm=judge_llm,
        prompt_template=(
            "A support assistant answered a user. Decide if the answer RESOLVES "
            "the question or PUNTS (vague, off-topic, or invents facts the "
            "context does not support).\n\n"
            "QUESTION: {input}\n\nANSWER: {output}"
        ),
        choices={"resolved": 1.0, "punted": 0.0},
    )

    results = evaluate_dataframe(
        dataframe=eval_df, evaluators=[faithfulness, resolution]
    )
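The reason the choices dict maps labels to numbers is that a binary label then averages into a rate. A small standalone illustration, using the same mapping as the resolution classifier; the label list is made up.

```python
from statistics import mean

# Same mapping as the resolution classifier's choices dict.
CHOICES = {"resolved": 1.0, "punted": 0.0}


def resolution_rate(labels):
    """Average the mapped scores; for 1.0/0.0 this is the fraction resolved."""
    return mean(CHOICES[label] for label in labels)


labels = ["resolved", "resolved", "punted", "resolved"]
print(resolution_rate(labels))  # 0.75
```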

If your spans are agent traces instead of single completions, the metrics module also ships tool-selection and tool-response evaluators, and the Phoenix cookbook walks the agent-specific setup ⁵.

Step 5: log annotations back and read them in the UI

to_annotation_dataframe reshapes the results (label, score, explanation per evaluator) onto the span ids, and the client writes them as span annotations ¹.

    annotations = to_annotation_dataframe(dataframe=results)
    client.spans.log_span_annotations_dataframe(dataframe=annotations)
    print(f"scored {len(eval_df)} spans, logged {len(annotations)} annotations")

    # Bank failures as a dataset: today's incidents are tomorrow's offline evals.
    def label_of(cell):
        if isinstance(cell, str):
            cell = json.loads(cell)
        return cell.get("label") if isinstance(cell, dict) else None

    fails = results[results["faithfulness_score"].apply(label_of) == "unfaithful"]
    if len(fails):
        name = f"unfaithful-{datetime.now(timezone.utc):%Y%m%d}"
        client.datasets.create_dataset(
            name=name,
            dataframe=fails.reset_index(),
            input_keys=["input"],
            output_keys=["output"],
            metadata_keys=["context"],
            span_id_key="context.span_id",
        )
        print(f"banked {len(fails)} unfaithful spans -> dataset {name}")


if __name__ == "__main__":
    main()
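The label_of helper above is defensive because the score cell may arrive as a dict or as a JSON string. A standalone check of that behavior; the cell contents are hypothetical and only the "label" key is relied on.

```python
import json


def label_of(cell):
    """Extract the label from a score cell that may be a dict or a JSON string."""
    if isinstance(cell, str):
        cell = json.loads(cell)
    return cell.get("label") if isinstance(cell, dict) else None


# Hypothetical cells in the shapes the helper tolerates.
as_dict = {"label": "unfaithful", "score": 0.0, "explanation": "not in context"}
as_json = json.dumps(as_dict)

print(label_of(as_dict), label_of(as_json), label_of(None), label_of(float("nan")))
```

Missing cells come back as None, so the equality filter simply skips rows where a judge call failed.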

Representative run against the step 2 traffic:

scored 25 spans, logged 50 annotations
banked 3 unfaithful spans -> dataset unfaithful-20260610

Three views in the UI now matter, in this order:

The filtered trace table. In the project view, filter with the span DSL: annotations['faithfulness'].label == 'unfaithful'. This is the working queue: every span the judge flagged, with latency and token counts alongside.
The span detail annotations panel. Each flagged span shows the judge's label, score, and explanation next to the actual prompt and response. Read the explanation against the response before believing either; judges are evidence, not verdicts.
The dataset page. The banked unfaithful-* datasets, each example linked back to its source span via the span id.

The point of the UI is reading failures, and the explanations attached to real production inputs are the highest-signal error-analysis feed you have.

Step 6: schedule it and close the loop

Online eval is a cron job, not a service:

*/15 * * * * cd /srv/support-rag && python score_recent.py --window-minutes 20 --sample 25 >> eval.log 2>&1

The 20-minute window on a 15-minute cadence overlaps by five minutes, so spans in the overlap can be sampled and scored twice; if that bothers you, persist a high-water-mark timestamp between runs instead of using a fixed window. Judge spend scales as sample size times runs per day times evaluator count, and --sample is the budget knob: 25 spans and two judges per run, at 96 runs a day, is 4,800 small judge calls a day. That is cheap with a small judge model, and because the cost is linear in each factor, halving any one of them halves the bill.
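Both ideas in that paragraph fit in a few lines of standard-library Python. This is a sketch, not part of the recipe's script: the function names, the state file, and its "scored_until" key are all hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path


def judge_calls_per_day(sample: int, evaluators: int, cadence_minutes: int) -> int:
    """Sample size times evaluator count times runs per day."""
    runs_per_day = (24 * 60) // cadence_minutes
    return sample * evaluators * runs_per_day


def load_start_time(state: Path, fallback_minutes: int = 20) -> datetime:
    """Return where the last run stopped, or a fixed lookback on the first run."""
    if state.exists():
        saved = json.loads(state.read_text())["scored_until"]
        return datetime.fromisoformat(saved)
    return datetime.now(timezone.utc) - timedelta(minutes=fallback_minutes)


def save_end_time(state: Path, end: datetime) -> None:
    """Persist the upper bound of the window that was just scored."""
    state.write_text(json.dumps({"scored_until": end.isoformat()}))


print(judge_calls_per_day(sample=25, evaluators=2, cadence_minutes=15))  # 4800
```

In the job, the value from load_start_time would replace the computed start_time, and save_end_time would be called with a timestamp captured before the query, so spans that arrive mid-run are picked up next time rather than lost.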

Closing the loop is what separates online eval from a dashboard: the banked failure datasets become offline regression cases. Re-run your candidate prompt against unfaithful-* before every deploy, and the bug a user found in production gets found by CI forever after. Append later failures to a rolling set with add_examples_to_dataset once the dated-dataset pile gets silly. For deeper Phoenix eval workflows past this loop, the tutorials index is the catalog ⁶.
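The replay step can be as small as a loop over banked examples. A sketch under stated assumptions: examples are plain dicts with the input and context keys used when banking, and generate and is_faithful are hypothetical callables standing in for the candidate prompt and the judge. The stubs and the sample record are invented so the sketch runs offline.

```python
from typing import Callable, Iterable


def regression_failures(
    examples: Iterable[dict],
    generate: Callable[[str, str], str],
    is_faithful: Callable[[str, str, str], bool],
) -> list:
    """Re-answer each banked example; return the ones still judged unfaithful."""
    still_failing = []
    for ex in examples:
        answer = generate(ex["input"], ex["context"])
        if not is_faithful(ex["input"], answer, ex["context"]):
            still_failing.append({**ex, "new_output": answer})
    return still_failing


# Stand-ins: real versions would call the model and the judge.
def stub_generate(question: str, context: str) -> str:
    return "I don't have that information."


def stub_judge(question: str, answer: str, context: str) -> bool:
    return "root cause was" not in answer.lower()


banked = [
    {
        "input": "What was the root cause of yesterday's incident?",
        "context": "Incident history is published at status.lumen.dev.",
    }
]
failures = regression_failures(banked, stub_generate, stub_judge)
print(len(failures))
```

In CI, a non-empty result would fail the build, for example with sys.exit(1).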

What this skips

Writing custom evaluator classes beyond create_classifier (the evals package supports full custom evaluators with input schemas ²), Phoenix Cloud auth and multi-tenant setup, and inline guardrails. Everything here is post-hoc and asynchronous, which is the right default: a judge in the request path adds latency and a new failure mode. Judge calibration is also out of scope, but it is not optional before you alert on these numbers; that is the LLM judge calibration recipe.

TIP

Pin the judge prompt and model version in the annotation name (for example faithfulness-v2) the first time you change either. Online metrics only trend meaningfully while the judge holds still, and an unversioned judge change reads exactly like a product regression.
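One way to make that versioning automatic is to derive the suffix from the judge's prompt and model, so the name changes whenever either does. This swaps the hand-numbered faithfulness-v2 style for a hash suffix; both helper names are hypothetical, and it assumes the evaluator name is what becomes the annotation name, as it does in this recipe.

```python
import hashlib


def judge_version(prompt_template: str, model: str, length: int = 8) -> str:
    """Short stable fingerprint of the judge's prompt and model."""
    payload = f"{model}\n{prompt_template}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:length]


def annotation_name(base: str, prompt_template: str, model: str) -> str:
    """Build a name like 'resolution-<hash>' that changes when the judge does."""
    return f"{base}-{judge_version(prompt_template, model)}"


print(annotation_name("resolution", "QUESTION: {input}\n\nANSWER: {output}", "gpt-4o-mini"))
```

The trade-off is readability: a hash says that the judge changed but not in what order, and any UI filter or results column that uses the annotation name has to follow it.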

What you built

A self-hosted online eval loop: OpenInference-instrumented traffic flowing into Phoenix, a scheduled job that samples recent LLM spans into a dataframe, two binary judges from the evals 3.x API, verdicts logged back as span annotations you can filter on, and failures banked as datasets that feed the offline suite.

What to do next

Online vs offline evaluation frames where this loop sits; sampling strategies covers smarter-than-uniform span sampling; drift detection is what to build on top of these annotations once they trend. For the same loop on the other open-source stack, see online eval with Langfuse.

1. Phoenix documentation, "Running Evals on Traces." Span dataframe columns, evaluator wiring, to_annotation_dataframe, and log_span_annotations_dataframe.
2. Phoenix evals package (arize-phoenix-evals). The 3.x API: LLM wrapper, create_classifier, evaluate_dataframe, prebuilt metrics. The 2.x llm_classify surface was removed in 3.0.
3. Phoenix documentation, self-hosting from the terminal. pip install arize-phoenix, phoenix serve, UI on localhost:6006.
4. Chip Huyen, "Building a Generative AI Platform." Where sampled online evaluation sits in the request lifecycle.
5. Phoenix cookbook, evaluation section. Agent-specific evaluators: tool selection, tool invocation, goal achievement.
6. Arize Phoenix tutorials index. Eval notebooks beyond the loop built here.