Langfuse

Open-source LLM engineering platform: tracing, prompts, datasets, scores, and a hosted product you can genuinely self-host instead.

Positioning

Langfuse sits in the observability and eval slot of the LLM stack with a strong open-source story. The MIT-licensed core covers traces, prompt management, datasets, and scoring; the enterprise modules live in an ee/ folder and require a license. Self-hosting via Docker or Kubernetes is documented and supported, not an afterthought, and the project kept up a very high release cadence through 2025 and 2026 ¹.

The product integrates with more than a hundred LLM providers via LiteLLM, runs LLM-judge evaluators on traces, and exposes a Scores API that lets external evaluators (Ragas, DeepEval, your own) push results back to the same trace.
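
As a rough illustration of what "push results back to the same trace" means on the wire, here is a minimal sketch that builds (but does not send) a score request using only the standard library. The endpoint path /api/public/scores, the camelCase field names, and the Basic-auth scheme (public key as username, secret key as password) are assumptions based on the public API as commonly documented; the host, keys, and the build_score_request helper are hypothetical placeholders. Check the API reference for your Langfuse version before relying on it.

```python
import base64
import json
import urllib.request


def build_score_request(host, public_key, secret_key, trace_id, name, value):
    """Build (but do not send) a POST request that attaches a numeric score to a trace."""
    body = json.dumps({
        "traceId": trace_id,      # assumed field names for the public scores endpoint
        "name": name,
        "value": value,
        "dataType": "NUMERIC",
    }).encode("utf-8")
    # Assumed auth scheme: HTTP Basic with public key as user, secret key as password.
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/public/scores",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


req = build_score_request(
    "https://langfuse.example.internal",   # placeholder self-hosted URL
    "pk-lf-placeholder",
    "sk-lf-placeholder",
    trace_id="trace-123",
    name="faithfulness",
    value=0.92,
)
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```

Because the request is plain HTTP plus JSON, an evaluator written in any language can report to the same trace without importing a Langfuse SDK.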

Strengths

Self-host actually works. Compose files, Helm charts, S3 or MinIO backends. Teams with data-residency requirements can run Langfuse on their own infrastructure without paying for a hosted plan.
High velocity. v3.176.0 shipped in May 2026; the changelog moves weekly. Bug fixes and integrations land quickly.
Open Scores model. The Scores API lets you attach numeric, categorical, or boolean scores to any trace from any source. Ragas, DeepEval, and custom evaluators all push to the same surface ².
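
To make the three score types concrete, here is a small sketch of a helper that infers the data type from a Python value before the score is handed to whatever client you use. The score_payload function and its field names are hypothetical, and encoding booleans as 1 or 0 is an assumption about how boolean scores are stored; treat it as an illustration of the model, not the SDK's own code.

```python
def score_payload(trace_id, name, value, comment=None):
    """Return a score record whose data type is inferred from the Python value."""
    # bool must be checked before int/float because bool is a subclass of int.
    if isinstance(value, bool):
        data_type, encoded = "BOOLEAN", 1 if value else 0
    elif isinstance(value, (int, float)):
        data_type, encoded = "NUMERIC", float(value)
    elif isinstance(value, str):
        data_type, encoded = "CATEGORICAL", value
    else:
        raise TypeError(f"unsupported score value: {value!r}")
    payload = {
        "trace_id": trace_id,
        "name": name,
        "value": encoded,
        "data_type": data_type,
    }
    if comment is not None:
        payload["comment"] = comment
    return payload


scores = [
    score_payload("trace-123", "faithfulness", 0.92),           # e.g. from Ragas
    score_payload("trace-123", "tone", "formal"),                # e.g. from a custom classifier
    score_payload("trace-123", "human_approved", True, "LGTM"),  # e.g. from a reviewer
]
```

All three records point at the same trace, which is the practical meaning of "the same surface": one trace can carry an automated metric, a label, and a human sign-off side by side.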

Limitations

Pro features behind an enterprise license when self-hosted. Prompt experiments, SSO, and some governance features require an EE license even on a self-hosted deployment.
UI is observability-first. Compared to Braintrust or LangSmith, the offline-eval workbench is less polished. Most teams use Langfuse as the trace hub and run evals from notebooks or CI.
Benchmarks are vendor-published. As with every other platform, treat the company's own performance claims as positioning rather than independent measurement.

Best fit

Production observability where data residency matters, hybrid architectures where Langfuse is the trace hub and CI pushes results back via the Scores API, and teams that want to keep platform costs predictable by self-hosting.
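
The CI half of that hybrid setup can be sketched as a small gate: evaluators produce per-trace scores, the job fails if a metric's mean falls below a threshold, and the same records are then pushed to Langfuse. The gate function, metric names, thresholds, and sample values below are all hypothetical and only illustrate the shape of the check.

```python
from statistics import mean


def gate(results, thresholds):
    """Return (passed, report) given per-trace scores and a minimum mean per metric."""
    report = {}
    for metric, minimum in thresholds.items():
        values = [r["value"] for r in results if r["name"] == metric]
        avg = mean(values) if values else 0.0
        report[metric] = {
            "mean": round(avg, 3),
            "min": minimum,
            # A metric with no scores at all counts as a failure.
            "ok": bool(values) and avg >= minimum,
        }
    return all(entry["ok"] for entry in report.values()), report


# Illustrative scores, as an evaluator run in CI might emit them.
results = [
    {"trace_id": "t1", "name": "faithfulness", "value": 0.92},
    {"trace_id": "t2", "name": "faithfulness", "value": 0.78},
    {"trace_id": "t1", "name": "answer_relevancy", "value": 0.61},
]

passed, report = gate(results, {"faithfulness": 0.8, "answer_relevancy": 0.7})
```

In a real pipeline the job would exit non-zero when passed is false and would push each record to its trace either way, so the failing traces can be inspected in the Langfuse UI.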

Getting started

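import uuid

from langfuse import Langfuse
from langfuse.openai import openai  # drop-in wrapper; calls are traced automatically

langfuse = Langfuse()  # reads credentials from the LANGFUSE_* environment variables

# Choose the trace ID up front so an external evaluator can reference it later.
trace_id = str(uuid.uuid4())

# Run your app. The wrapper streams the trace to Langfuse in the background.
response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize this email."}],
    trace_id=trace_id,  # Langfuse-specific kwarg (v2-style SDK), not sent to OpenAI
)

# Push an external score (from Ragas, DeepEval, or a human reviewer) back to the trace.
# These are v2-style SDK calls; newer SDK versions name this method create_score.
langfuse.score(
    trace_id=trace_id,
    name="faithfulness",
    value=0.92,
    data_type="NUMERIC",
)
langfuse.flush()  # send queued events before the process exits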

The Langfuse cookbook also documents a full RAG evaluation loop using Ragas, including dataset construction and dashboarding ³, and an error-analysis recipe that uses Langfuse as the trace store ⁴.

Pricing notes

The public pricing page lists Hobby (free), Core ($29/month), Pro ($199/month), and Enterprise ($2,499/month) tiers. Trace volume, retention, and seats vary by tier. Self-hosting the MIT core is free; the enterprise modules require a paid license. Confirm the current trace caps and overage policy on the pricing page before committing.

Alternatives

Consider LangSmith if your codebase is built on LangChain or LangGraph. Consider Arize Phoenix for an OpenTelemetry-native OSS hub. Consider Braintrust if a hosted, framework-agnostic console with strong dataset tooling is the priority and self-host is not required. For pure CI regression, pair Langfuse with DeepEval or Promptfoo.

Citations and last verified

Verified 2026-05-29 against Langfuse documentation and the Langfuse cookbook.

¹ Langfuse Docs, https://langfuse.com/docs.
² Langfuse Scores overview, https://langfuse.com/docs/scores/overview.
³ Langfuse cookbook, "Evaluation of RAG with Ragas."
⁴ Langfuse cookbook, "Error analysis for LLM applications."