Online evals run the judge against production traces, not a fixed test set. The point is to catch the failure modes that did not appear in your offline dev set, before users complain in Slack. Langfuse ships the primitives for this: a tracing SDK, a scores API for attaching LLM-judge outputs to traces, and an annotation-queue surface for human review of the failures the judge flagged 1. This recipe will walk through the production loop: instrument the app, score sampled traces with a calibrated judge, route failures to a queue, and feed reviewed failures back into the offline eval set.
Planned outline
- Sampling strategy. Score 5% of traces in real time, plus 100% of any trace where a downstream signal (thumbs-down, refund, escalation) fired. Document the rationale.
- Calibrated judge prompt. Reuse the prompt and the rubric from LLM-judge calibration. Pin its version; record the kappa it cleared offline.
- Score schema. One Langfuse score per metric per trace. Numeric, boolean, or categorical. Include the critique in the comment field for in-Langfuse readability.
- Annotation queue. Pipe FAIL-labeled traces into a Langfuse annotation queue assigned to the domain expert. Hand-labels go back into the dataset.
- Alerting. Configure thresholds on rolling 1-day and 7-day windows. Alert when any metric drops by more than the offline error bar.
- Feedback loop. Weekly job exports reviewed FAILs to the offline eval set with a content hash. Pin the eval-set version to the judge prompt version.
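As a rough illustration of the sampling bullet, the decision could be made deterministic by hashing the trace ID, so the same trace gets the same answer across retries and workers. This is a minimal sketch, not Langfuse functionality: `should_score`, `PRIORITY_SIGNALS`, and the signal names are hypothetical, and only the 5% rate and the three example signals come from the outline.

```python
from __future__ import annotations

import hashlib

SAMPLE_RATE = 0.05  # 5% baseline rate from the outline
# Hypothetical signal names; map these to whatever your app actually emits.
PRIORITY_SIGNALS = {"thumbs_down", "refund", "escalation"}


def should_score(trace_id: str, signals: set[str], sample_rate: float = SAMPLE_RATE) -> bool:
    """Decide whether a trace should be sent to the judge."""
    # Any negative downstream signal forces scoring (the 100% branch).
    if signals & PRIORITY_SIGNALS:
        return True
    # Map the trace ID to a stable bucket in [0, 1) and compare it to the rate.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Hashing instead of calling a random number generator keeps the sample reproducible, which helps when explaining later why a given trace was or was not scored.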
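The alerting bullet can be sketched the same way: compute a pass rate per rolling window and flag any window that falls below the offline baseline by more than the offline error bar. This is a sketch under stated assumptions, not a Langfuse feature: the function names, the `(timestamp, passed)` input shape, and the `min_samples` guard are all hypothetical, and the scores are assumed to have been fetched already.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed input shape: one (timestamp, passed) pair per judged trace for a single metric.
Score = tuple[datetime, bool]


def window_pass_rate(scores: list[Score], now: datetime, window: timedelta,
                     min_samples: int = 30) -> Optional[float]:
    """Pass rate over the trailing window, or None if there are too few samples."""
    recent = [passed for ts, passed in scores if now - window <= ts <= now]
    if len(recent) < min_samples:
        return None  # hypothetical guard: do not alert on a handful of traces
    return sum(recent) / len(recent)


def breached_windows(scores: list[Score], now: datetime, baseline: float, error_bar: float,
                     windows: tuple[timedelta, ...] = (timedelta(days=1), timedelta(days=7)),
                     ) -> list[timedelta]:
    """Return the windows whose pass rate dropped more than error_bar below baseline."""
    breached = []
    for window in windows:
        rate = window_pass_rate(scores, now, window)
        if rate is not None and baseline - rate > error_bar:
            breached.append(window)
    return breached
```

Here `baseline` and `error_bar` would be the pass rate and uncertainty measured on the offline eval set for the pinned judge prompt version, so an alert only fires on a drop the offline numbers cannot explain.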
What this will skip in v1
Multi-judge ensembling. Cost optimization via judge distillation (Galileo Luna pattern). Real-time alerting from streaming traces (Langfuse Cloud handles this; self-hosted requires extra infra).
The Langfuse cookbook examples for RAG-with-Ragas and error analysis are the right companion reads 2 3. For the conceptual frame on online vs offline, see Chip Huyen's GenAI platform essay 4 and the online-vs-offline chapter.