Data engineers shipping the eval pipeline own the trace schema, the golden-set ingest, the label propagation from production, the PII handling, the dataset versioning, and the data contract with AI engineering. This track brings data-engineering rigor to that work: it ships ER diagrams, JSONL and Parquet schemas, Airflow and Dagster DAG sketches, and dbt-style model patterns. It assumes data-engineering fluency.
The first three steps build the production-to-eval pipeline. Steps four through six cover the data discipline (PII, synthetic, versioning). The last three are about scale and the contract with the rest of the team.
The track
- Trace schema for eval and observability (16 min). The columns every trace needs: request id, session id, user id, prompt, retrieved context, tool calls, response, latency, cost, model version, prompt version. The ER diagram and the Parquet vs JSONL trade-off 1.
- Building a golden set pipeline (16 min). The ingestion path from production traces to eval store. The labeling SLA. The dimensional sampling pattern 2. The starting size: begin with 30 examples, grow to 100, then go to 500+ when you need to distinguish small effects.
- Label propagation: prod to eval store (14 min). How online labels (thumbs-down, edits, escalations) become eval labels. The flywheel pattern. Filter for high-signal events; route to human review for low-confidence labels 3.
- PII handling for traces (14 min). Automated redaction (PII classifiers, regex fallback), audit trail, retention policy. The compliance hooks for GDPR and CCPA. What stays in the trace, what gets dropped before storage.
- Synthetic data generation patterns (14 min). When synthetic helps (cold start, rare classes, adversarial coverage). When it hurts (the surface-vocabulary cluster). The hybrid pattern: synthetic for diversity, prod traces for realism 4.
- Dataset versioning and lineage (14 min). Content hashes per dataset version. The lineage DAG: what produced this dataset, what consumed it. The model-delta replay story: re-run a model from six months ago against the dataset version it was evaluated on.
- Class balance and segment representation (14 min). Stratification by feature, scenario, persona. Avoiding the minority-class hole. Per-segment fairness metrics; why the headline number can hide a segment collapse.
- Choosing storage: warehouse vs vector vs object (14 min). Where traces live (warehouse: BigQuery or Snowflake; object store for raw blobs). Where datasets live (Parquet on S3 with a manifest). Where embeddings live (vector DB or warehouse extension). When to use each.
- Data contracts between AI eng and data eng (14 min). The schema contract: what columns are stable, what columns can change. Breaking-change policy. SLA on label turnaround. The handoff document.
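As a rough illustration of the trace columns named in the first step, here is a minimal sketch of one trace row as a Python dataclass, serialized to a single JSONL line. The class name `TraceRecord`, the helper `to_jsonl_line`, the field types, and the units (`latency_ms`, `cost_usd`) are assumptions made for this example; the track's own schema may name or type these differently.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class TraceRecord:
    """One trace row. Field names mirror the column list; types and units are assumed."""

    request_id: str
    session_id: str
    user_id: str
    prompt: str
    response: str
    latency_ms: float      # assumed unit: milliseconds
    cost_usd: float        # assumed unit: US dollars
    model_version: str
    prompt_version: str
    retrieved_context: List[str] = field(default_factory=list)
    tool_calls: List[Dict[str, Any]] = field(default_factory=list)


def to_jsonl_line(record: TraceRecord) -> str:
    """Serialize one record as a single JSON line with a stable key order."""
    return json.dumps(asdict(record), sort_keys=True, ensure_ascii=False)
```

Keeping `model_version` and `prompt_version` as required fields, rather than optional ones, is a design choice in this sketch: a trace that cannot be tied to a model and prompt version is hard to use for later replay.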
What comes after the track
Once the nine pages are read, the natural next moves are to write the trace schema specification document, build the golden-set ingest DAG, publish the PII redaction pipeline, and pin the first dataset version. The RAG eval entries in the cookbook are the natural follow-up for any team shipping retrieval 5. The versioning chapter is the deep read on the discipline.
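Pinning a dataset version comes down to computing a content hash over its rows. The sketch below is one possible way to do that with the standard library, assuming rows are JSON-serializable dictionaries; the function name `dataset_content_hash` and the choice of SHA-256 over canonical JSON are assumptions for illustration, not the method the versioning chapter prescribes.

```python
import hashlib
import json
from typing import Any, Dict, Iterable


def dataset_content_hash(rows: Iterable[Dict[str, Any]]) -> str:
    """Return a hex digest that identifies the dataset by content, not by row order."""
    # Hash each row from its canonical JSON form (sorted keys, no extra whitespace).
    row_digests = sorted(
        hashlib.sha256(
            json.dumps(row, sort_keys=True, separators=(",", ":")).encode("utf-8")
        ).hexdigest()
        for row in rows
    )
    # Combine the sorted row digests so that reordering rows does not change the result.
    combined = hashlib.sha256()
    for digest in row_digests:
        combined.update(digest.encode("ascii"))
    return combined.hexdigest()
```

Because duplicate rows each contribute their own digest, adding or removing a duplicate changes the hash, while shuffling the rows does not.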
TIP
The data contract (step 9) is the artifact that makes the rest of the program possible. Without it, the AI engineering team will keep breaking the schema; with it, the team has a clean breaking-change policy.
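To make the breaking-change idea concrete, here is a hedged sketch of a check that could run before a schema change ships. It assumes the contract is represented as a mapping from stable column name to type name; that representation, the function name `find_breaking_changes`, and the rule that added columns are non-breaking are all assumptions for this example.

```python
from typing import Dict, List


def find_breaking_changes(stable: Dict[str, str], proposed: Dict[str, str]) -> List[str]:
    """List violations of the stable-column contract in a proposed schema.

    `stable` maps each contract-stable column to its type name.
    `proposed` maps every column in the new schema to its type name.
    New columns are treated as non-breaking in this sketch.
    """
    problems: List[str] = []
    for column, expected_type in sorted(stable.items()):
        if column not in proposed:
            problems.append(f"removed stable column: {column}")
        elif proposed[column] != expected_type:
            problems.append(
                f"type change on {column}: {expected_type} -> {proposed[column]}"
            )
    return problems
```

An empty list means the proposed schema keeps every stable column with its agreed type; anything else is a candidate for the breaking-change process.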