Start Here for Senior Data Engineers

Nine steps from trace schema through drift detection. JSONL and Parquet schemas, sampling policies, dataset versioning and lineage.

Data engineers shipping the eval pipeline own the trace schema, the golden-set ingest, the label propagation from production, the sampling policy, the dataset versioning, and the drift alarms. This track applies data-engineering rigor to that work: it ships JSONL and Parquet schemas, sampling policies, and lineage patterns. It assumes data-engineering fluency.

The first three steps build the production-to-eval pipeline. Steps four through six cover the data discipline (sampling, synthetic data, versioning). The last three are about coverage, storage, and the monitoring that catches upstream breakage.

The track

Trace schema for eval and observability (16 min). The columns every trace needs: request id, session id, user bucket, prompt, retrieved context, tool calls, response, latency, cost, model version, prompt version. What not to log (raw PII), how redaction at write time and retention tiers handle the rest, and the Parquet vs JSONL trade-off ¹.
Building a golden set pipeline (16 min). The ingestion path from production traces to the eval store. The labeling SLA. The dimensional sampling pattern ². The starting size: begin with 30 examples, grow to 100, then 500+ when you need to distinguish small effects.
Label propagation: prod to eval store (14 min). The online and offline regimes, the handoffs between them, and how sampled judge labels land back in the eval store. The flywheel pattern that turns production traffic into eval data ³.
Sampling production traces (12 min). Why uniform sampling fails the tail. Stratified sampling, failure oversampling, and reservoir sampling for online aggregation. The policy that decides which traces reach the eval store.
Synthetic data generation patterns (14 min). When synthetic helps (cold start, rare classes, adversarial coverage). When it hurts (the surface-vocabulary cluster). The hybrid pattern: synthetic for diversity, prod traces for realism ⁴.
Dataset versioning and lineage (14 min). Content hashes per dataset version. The lineage DAG: what produced this dataset, what consumed it. The model-delta replay story: re-run a model from six months ago against the dataset version it was evaluated on.
Class balance and segment representation (14 min). Stratification by feature, scenario, persona. Avoiding the minority-class hole. Per-segment fairness metrics; why the headline number can hide a segment collapse.
Where eval datasets live (10 min). Langfuse, Phoenix, Braintrust, LangSmith, or JSONL in git. The source-of-truth principle: keep the canonical copy diff-able and hash-able, use the vendor as the labeling surface, and rebuild the vendor copy from source.
Drift detection on production traffic (12 min). Input and output distributions tracked over versioned reference windows. Detection that works in practice, why naive alarms fail, and what to do when one fires. The alarm that catches upstream schema and traffic changes before the eval scores go stale.
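As a rough illustration of the sampling step, here is a minimal sketch that combines stratification with reservoir sampling (Algorithm R): each stratum keeps its own fixed-size reservoir, so a rare stratum such as failures is not crowded out by high-volume traffic. The class name `StratifiedReservoir`, the `stratum_of` helper, and the `failed` field are assumptions made for the example, not part of the track's schema.

```python
import random
from collections import defaultdict


class StratifiedReservoir:
    """One fixed-size reservoir per stratum (hypothetical helper)."""

    def __init__(self, capacity_per_stratum: int, seed: int = 0):
        self.capacity = capacity_per_stratum
        self.rng = random.Random(seed)
        self.seen = defaultdict(int)      # traces observed per stratum
        self.samples = defaultdict(list)  # traces kept per stratum

    def offer(self, stratum: str, trace: dict) -> None:
        self.seen[stratum] += 1
        bucket = self.samples[stratum]
        if len(bucket) < self.capacity:
            # Reservoir not full yet: keep everything.
            bucket.append(trace)
            return
        # Algorithm R: the n-th trace replaces a kept one with
        # probability capacity / n, keeping the sample uniform.
        j = self.rng.randrange(self.seen[stratum])
        if j < self.capacity:
            bucket[j] = trace


def stratum_of(trace: dict) -> str:
    # Assumption: a boolean "failed" flag marks failure traces.
    # Failures get their own stratum, which oversamples them
    # relative to their share of traffic.
    return "failure" if trace.get("failed") else "ok"


reservoir = StratifiedReservoir(capacity_per_stratum=10, seed=42)
for i in range(1000):
    reservoir.offer("ok", {"request_id": f"ok-{i}", "failed": False})
for i in range(5):
    trace = {"request_id": f"err-{i}", "failed": True}
    reservoir.offer(stratum_of(trace), trace)
```

With uniform sampling at the same overall rate, the five failure traces would most likely be dropped. Here all five survive alongside a uniform sample of ten successful traces.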

What comes after the track

Once the nine pages are read, the natural next moves are to write the trace schema specification document, build the golden-set ingest DAG, pin the first dataset version, and stand up the drift dashboard. The RAG eval entries in the cookbook are the natural follow-up for any team shipping retrieval ⁵. The retrieval metrics and generation metrics catalogs cover the per-metric depth. The versioning chapter is the deep-read on the discipline.
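Pinning a dataset version can be as small as computing a content hash over canonicalized rows. The sketch below is one possible approach, not the track's prescribed format: `row_digest`, `dataset_version`, and `manifest` are hypothetical names, and the manifest fields (`parents`, `producer`) are assumed stand-ins for the edges of a lineage DAG.

```python
import hashlib
import json


def row_digest(row: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so that key order
    # and whitespace do not change the hash.
    canonical = json.dumps(
        row, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dataset_version(rows: list[dict]) -> str:
    # Hash the sorted row digests so that row order does not matter.
    h = hashlib.sha256()
    for digest in sorted(row_digest(r) for r in rows):
        h.update(digest.encode("ascii"))
    return h.hexdigest()


def manifest(rows: list[dict], parents: list[str], producer: str) -> dict:
    # Assumed lineage record: which versions fed this one, and which job built it.
    return {
        "version": dataset_version(rows),
        "row_count": len(rows),
        "parents": sorted(parents),
        "producer": producer,
    }
```

Storing the manifest next to the JSONL file in git keeps the canonical copy both diff-able and hash-able. Changing any label yields a new version string.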

TIP

The trace schema (step 1) is the artifact that makes the rest of the program possible. It is the de facto contract with AI engineering: get the columns, the redaction policy, and the retention tiers agreed before the backfill conversation starts.
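To make that contract concrete, here is a minimal sketch of the step 1 columns as a typed record with redaction applied at write time. The field types, the units in `latency_ms` and `cost_usd`, and the email-only `redact` helper are assumptions for illustration; a real redaction policy would cover far more than email addresses.

```python
import json
import re
from dataclasses import asdict, dataclass, replace
from typing import Any

# Assumption: a deliberately simple email pattern stands in for a real PII policy.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


@dataclass(frozen=True)
class TraceRecord:
    request_id: str
    session_id: str
    user_bucket: str
    prompt: str
    retrieved_context: list[str]
    tool_calls: list[dict[str, Any]]
    response: str
    latency_ms: float
    cost_usd: float
    model_version: str
    prompt_version: str


def redact(text: str) -> str:
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


def to_jsonl_line(record: TraceRecord) -> str:
    # Redact free-text fields before anything is serialized,
    # so raw PII never reaches storage.
    clean = replace(
        record,
        prompt=redact(record.prompt),
        response=redact(record.response),
        retrieved_context=[redact(c) for c in record.retrieved_context],
    )
    return json.dumps(asdict(clean), sort_keys=True, ensure_ascii=False)
```

The same record could be written to Parquet instead. The point of the sketch is that the column list and the redaction step live in one reviewable place.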

¹ Chip Huyen, "Building a Generative AI Platform."
² Hamel Husain, "A Field Guide to Rapidly Improving AI Products."
³ Shreya Shankar, "The AI Engineering Flywheel."
⁴ Shankar et al., "SPADE." arXiv:2401.03038.
⁵ Es et al., "RAGAS." arXiv:2309.15217.