Where to keep your eval datasets

Langfuse, Phoenix, Braintrust, LangSmith, or a JSONL in git. The choice that actually matters.

The question every team asks once their JSONL passes a hundred lines is which vendor to put it in. The right answer for most teams is: keep the JSONL in git, sync it to one vendor for the UI, and resist the temptation to make the vendor the source of truth. The vendors are good at browsing, sharing, and annotation. They are not good at being a versioned, content-hashed contract with your past self, and that is what an eval dataset is.

The source-of-truth principle

A dataset is a contract that lets you compare results across time. The substrate has to be diff-able, hash-able, reproducible from scratch, portable across vendors, and never silently mutated by a UI click. Git plus JSONL satisfies all of those. A SaaS dataset object that lives only in a vendor's database satisfies none ^[1].
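
A minimal sketch of what "hash-able" can mean in practice, assuming one JSON object per line. The function names `record_hash` and `dataset_hash` are illustrative, and sorting the per-record hashes is one possible choice that makes the result independent of row order and key order; if row order matters to you, hash in file order instead.

```python
import hashlib
import json
from pathlib import Path


def record_hash(record: dict) -> str:
    """Hash one example from a canonical encoding (sorted keys, compact separators)."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dataset_hash(path: Path) -> str:
    """Hash the whole JSONL file, ignoring blank lines and row order."""
    hashes = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if line.strip():
            hashes.append(record_hash(json.loads(line)))
    digest = hashlib.sha256()
    for h in sorted(hashes):  # sorting removes dependence on row order
        digest.update(h.encode("ascii"))
    return digest.hexdigest()
```

Two files holding the same examples produce the same hash even if rows or keys were reordered, while any change to an input, expected output, or label produces a different one.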

The vendor UIs are still useful. They are excellent at annotation throughput, at letting non-engineers tag traces into the set, and at running scores in a hosted environment. Use them as the labeling surface and the downstream view; keep the JSONL canonical, and treat the vendor copy as a derived artifact you can rebuild from source.

Four platforms

Langfuse Datasets is the OSS-leaning option: an open-source server you can self-host, Python and TypeScript SDKs, dataset runs that pair each example with a model output and judge score, and a browse/annotate UI. The dataset object lives in Postgres, with a versioned API ^[2].

Arize Phoenix Datasets follows a similar pattern with an open-source product. Datasets pair with experiments (one run of system × judge × dataset version). Phoenix is heavier on the trace-ingestion side, so the natural flow is to capture traces in Phoenix, promote interesting traces into the dataset, and run experiments back against them ^[3].

Braintrust is a hosted, commercial product oriented around the experiment-as-a-row workflow. Datasets are first-class, judges are first-class, and the UI is the most polished labeling surface of the four. Portability is the tradeoff: the data lives in their cloud and exporting needs a script ^[4].

LangSmith Datasets is LangChain's hosted offering. It has the tightest integration with LangChain agents, built-in dataset versioning, and a well-documented evaluation API. Like Braintrust, it is hosted-only at the production tier ^[5] ^[6].

Three axes for the decision

Open-source versus hosted. If you have a self-hosting team and prefer keeping data on your infrastructure, choose Langfuse or Phoenix. If you do not, choose Braintrust or LangSmith.

Labeling persona. If non-engineers (PMs, domain experts, QA) need to label directly, the Braintrust or LangSmith UI pays for itself within a quarter. If labeling is engineer-driven and lives in notebooks, the Phoenix or Langfuse SDK flow is enough.

Lock-in tolerance. If you intend to switch judge platforms, observability vendors, or evaluation frameworks in the next two years (you should plan to), keep the JSONL canonical so you can repoint at a new vendor in an afternoon.
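
One way to keep that repointing cheap is to have the sync code depend on a small interface rather than on any vendor SDK. The sketch below assumes a hypothetical `DatasetSink` protocol and an `InMemorySink` stand-in; neither comes from a real library, and a real adapter would wrap the chosen vendor's client behind the same method.

```python
import json
from pathlib import Path
from typing import Iterator, Protocol


class DatasetSink(Protocol):
    """Hypothetical interface: the only surface the sync code depends on."""

    def upsert_item(
        self, item_id: str, input: object, expected_output: object, metadata: dict
    ) -> None: ...


class InMemorySink:
    """Stand-in for a vendor adapter; useful for testing the sync logic."""

    def __init__(self) -> None:
        self.items: dict[str, dict] = {}

    def upsert_item(self, item_id, input, expected_output, metadata) -> None:
        # Keyed by id, so syncing twice overwrites instead of duplicating.
        self.items[item_id] = {
            "input": input,
            "expected_output": expected_output,
            "metadata": metadata,
        }


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one record per non-blank line."""
    for line in path.read_text(encoding="utf-8").splitlines():
        if line.strip():
            yield json.loads(line)


def sync(path: Path, sink: DatasetSink) -> int:
    """Push every canonical record to the sink and return the number sent."""
    count = 0
    for r in read_jsonl(path):
        sink.upsert_item(
            r["id"], r["input"], r["expected_output"], {"labels": r.get("labels", {})}
        )
        count += 1
    return count
```

Switching vendors then means writing one new class with an `upsert_item` method; the JSONL and the `sync` function do not change.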

A sync script

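# sync_dataset.py
# Push the canonical JSONL from git into Langfuse as a derived copy.
import json
import os
import sys
from pathlib import Path

from langfuse import Langfuse

DATASET, VERSION = "booking-flow-golden", "v3"
SOURCE = Path(f"evals/data/{VERSION}/golden.jsonl")

client = Langfuse(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host=os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com"),
)


def upsert():
    # Create the dataset on first run. A failure here is assumed to mean it
    # already exists, but log it so auth or network errors stay visible.
    try:
        client.create_dataset(name=DATASET, metadata={"version": VERSION})
    except Exception as exc:
        print(f"create_dataset skipped: {exc}", file=sys.stderr)

    for line in SOURCE.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the JSONL
        r = json.loads(line)
        # The stable id from the JSONL is what lets reruns address the same item.
        client.create_dataset_item(
            dataset_name=DATASET,
            id=r["id"],
            input=r["input"],
            expected_output=r["expected_output"],
            metadata={"version": VERSION, "labels": r.get("labels", {})},
        )


if __name__ == "__main__":
    upsert()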

The script runs as part of CI whenever the dataset hash changes. The vendor gets the same examples, addressed by the same stable ids, with the same labels. The UI becomes a window into the canonical store, not an alternative to it.
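
The "whenever the dataset hash changes" condition could be checked with something like the sketch below. It assumes a hypothetical marker file that stores the hash of the last synced JSONL; the names `needs_sync` and `mark_synced` are illustrative, and hashing the raw file bytes is the simplest option rather than the only one.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """SHA-256 of the raw file bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def needs_sync(source: Path, marker: Path) -> bool:
    """True when the dataset differs from the last recorded sync."""
    if not marker.exists():
        return True  # never synced before
    recorded = marker.read_text(encoding="utf-8").strip()
    return recorded != file_sha256(source)


def mark_synced(source: Path, marker: Path) -> None:
    """Record the current hash after a successful sync."""
    marker.write_text(file_sha256(source) + "\n", encoding="utf-8")
```

A CI step might call `upsert()` only when `needs_sync(...)` returns true, then call `mark_synced(...)` so the next run is a no-op until the JSONL changes again.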

What stays out of the vendor

The content hash stays in git because diff-and-blame depends on it. MIGRATION notes between versions stay in git because they capture reasoning a future engineer needs. Result records stay in your warehouse or in the vendor's run store, but each one references the git hash of the dataset it was scored against; the warehouse copy is the audit trail.
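
As an illustration, a result record that points back at the dataset it was scored against might look like the sketch below. The field names are assumptions for the example, not a schema any vendor prescribes.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ResultRecord:
    """One scored example, tied to the exact dataset state it was run against."""

    example_id: str       # stable id from the JSONL row
    dataset_version: str  # e.g. "v3"
    dataset_hash: str     # content hash of the JSONL at run time
    git_commit: str       # commit that contained the dataset
    score: float


def to_jsonl_line(record: ResultRecord) -> str:
    """Serialize with sorted keys so stored results diff cleanly."""
    return json.dumps(asdict(record), sort_keys=True)
```

Because every row carries the hash and commit, two runs can be compared only when those fields match, which is what makes the comparison across time trustworthy.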

A pragmatic recommendation

For a team starting today: keep the JSONL in git, then sync to Langfuse if you self-host and want OSS, or to Braintrust if you want a polished labeling surface and accept hosted-only. Whatever the choice, the test of whether the architecture is right is: can you migrate to a different vendor in two weeks? If yes, the JSONL is genuinely canonical. If no, the vendor has quietly become the source of truth, and the next migration will be a multi-quarter project. Fix that before it hardens.
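
A quick way to notice the vendor drifting toward source of truth is to compare its copy against the JSONL. The sketch below assumes the vendor's items have already been exported and normalized into dicts with the same `id`, `input`, and `expected_output` keys as the canonical rows; `diff_datasets` is an illustrative name, and the export step itself is vendor-specific and not shown.

```python
def diff_datasets(canonical: list[dict], vendor: list[dict]) -> dict[str, list[str]]:
    """Report ids that are missing from, extra in, or changed in the vendor copy."""
    fields = ("input", "expected_output")
    canon = {r["id"]: r for r in canonical}
    vend = {r["id"]: r for r in vendor}

    missing = sorted(canon.keys() - vend.keys())  # in git, not in the vendor
    extra = sorted(vend.keys() - canon.keys())    # added through the UI only
    changed = sorted(
        item_id
        for item_id in canon.keys() & vend.keys()
        if any(canon[item_id].get(f) != vend[item_id].get(f) for f in fields)
    )
    return {"missing": missing, "extra": extra, "changed": changed}
```

Non-empty `extra` or `changed` lists mean edits were made in the UI that never reached git, which is the early sign that the vendor copy is no longer purely derived.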

With the storage architecture settled, the harder question is what the rows have to measure once the system retrieves before it generates: RAG Evals.