Phoenix is Arize's open-source observability and eval platform. Where Langfuse leans on the application SDK, Phoenix's pitch is OpenTelemetry-native instrumentation: you write OTel spans and Phoenix reads them [1]. The recipe will run Phoenix locally with the OpenInference instrumentor, score sampled traces against three of Phoenix's built-in eval templates (relevance, hallucination, QA correctness), and route failures into a review workflow [2].
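The recipe has not yet fixed how traces get sampled for scoring. One option, sketched here as an assumption and not as anything Phoenix provides, is to sample deterministically by hashing the trace ID, so a given trace is always in or always out of the scored set no matter when the eval job runs. The helper name `should_sample` and the 10% default rate are hypothetical.

```python
import hashlib


def should_sample(trace_id: str, rate: float = 0.10) -> bool:
    """Return True for a stable fraction (roughly `rate`) of trace IDs.

    Hypothetical helper for illustration; not part of Phoenix.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    # Take the first 8 bytes of the SHA-256 digest as an integer in [0, 2**64).
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    # The trace is sampled when its bucket falls in the lowest `rate` share.
    return bucket < rate * 2**64


if __name__ == "__main__":
    trace_ids = [f"trace-{i}" for i in range(10_000)]
    sampled = [t for t in trace_ids if should_sample(t)]
    print(f"sampled {len(sampled)} of {len(trace_ids)} traces")
```

Hashing rather than calling a random number generator keeps the sample reproducible across re-runs, which matters when eval results are later compared between runs.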
Planned outline
- Install and run Phoenix locally. pip install arize-phoenix and phoenix.launch_app() to get a local viewer.
- Instrument with OpenInference. The OTel spans carry input/output/tool calls. Existing OpenAI/Anthropic/Bedrock instrumentations work out of the box.
- Pick eval templates. Phoenix ships QA correctness, relevance, hallucination, and tool-call correctness as built-in LLM-judge prompts. Each is auditable; copy and adjust.
- Run evals against traces. The phoenix.evals module takes a span dataframe and applies templates. Outputs attach back to the spans as eval columns.
- Filter and review. The Phoenix viewer filters by eval result. Open the failed spans, read the inputs, write notes.
- Compare with offline. Pin a fixed dev set in a Phoenix dataset; re-run the same templates against it; compare its failure rates with those from the sampled traces.
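Once every span carries an eval label, the filter-and-review and compare-with-offline steps reduce to simple bookkeeping. The following is a minimal sketch in plain Python, assuming eval results have already been exported as records holding a span ID, an eval name, and a label. The record shape, the failing-label strings, and the helper names are hypothetical stand-ins, not Phoenix's schema; adjust the labels to whatever your templates actually emit.

```python
from collections import defaultdict

# Assumed failing labels per eval; adjust to match the templates in use.
FAIL_LABELS = {
    "hallucination": {"hallucinated"},
    "relevance": {"unrelated"},
    "qa_correctness": {"incorrect"},
}


def is_failure(record: dict) -> bool:
    """True when the record's label counts as a failure for its eval."""
    return record["label"] in FAIL_LABELS.get(record["eval"], set())


def failed_spans(records: list[dict]) -> list[str]:
    """Span IDs with at least one failing eval, in first-seen order."""
    seen: set[str] = set()
    ordered: list[str] = []
    for record in records:
        if is_failure(record) and record["span_id"] not in seen:
            seen.add(record["span_id"])
            ordered.append(record["span_id"])
    return ordered


def failure_rates(records: list[dict]) -> dict[str, float]:
    """Fraction of failing records for each eval name."""
    totals: dict[str, int] = defaultdict(int)
    fails: dict[str, int] = defaultdict(int)
    for record in records:
        totals[record["eval"]] += 1
        if is_failure(record):
            fails[record["eval"]] += 1
    return {name: fails[name] / totals[name] for name in totals}


def rate_gaps(online: dict[str, float], offline: dict[str, float]) -> dict[str, float]:
    """Online rate minus offline rate, for evals present in both runs."""
    return {name: online[name] - offline[name] for name in online.keys() & offline.keys()}


if __name__ == "__main__":
    sampled = [
        {"span_id": "a", "eval": "hallucination", "label": "hallucinated"},
        {"span_id": "b", "eval": "hallucination", "label": "factual"},
    ]
    dev_set = [
        {"span_id": "x", "eval": "hallucination", "label": "factual"},
        {"span_id": "y", "eval": "hallucination", "label": "factual"},
    ]
    print("review queue:", failed_spans(sampled))
    print("gaps:", rate_gaps(failure_rates(sampled), failure_rates(dev_set)))
```

A positive gap means the sampled traces fail more often than the pinned dev set, which is a hint that the dev set no longer reflects what users are sending.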
What this will skip in v1
Custom eval template authoring beyond a tiny example (worth its own recipe). Multi-modal eval. Phoenix Cloud-specific features. Migration patterns from Phoenix → Langfuse or vice versa.
For background on OTel-driven observability for LLMs and the session/trace abstraction, the Phoenix tutorials index is the place to start [1]. The Phoenix + OpenAI Agents Cookbook page has the agent-specific evals (tool-selection, tool-result, goal-achievement) [2]. Chip Huyen's GenAI platform essay frames where Phoenix fits in the request lifecycle [3].