Backend engineers own the services the model sits inside: the API that calls it, the queue that buffers it, the database it reads, and the budget it burns. This track treats an eval as one more contract a service has to honor, next to the schema, the SLO, and the bill. It assumes staff-level fluency in Python or TypeScript and the usual comfort with services, queues, and SQL. Two steps are hands-on recipes that take 60 to 75 minutes each; the rest are reading.
The order is deliberate. Steps one and two make the case and give you a runnable artifact. Steps three through five cover the contracts: what your services log, what the model returns, and how it calls your tools. Steps six and seven are the two recipes, one offline and execution-checked, one online and sampled. The last three wire results into the machinery you already trust: CI, the latency budget, and the release gate.
The track
- Why evals are the moat (12 min). The three reasons LLM features rot (non-determinism, subjectivity, drift) and why ship-and-pray fails for AI. Hamel's argument is the canonical one [1].
- Your first eval in 30 minutes (30 min). The runnable artifact: fifty labeled traces, a binary judge, a confusion matrix. Do this once before touching the production pipeline.
- Trace schema: what to log (14 min). The span fields your services emit are the substrate for every later step. Get the schema right before the backfill conversation starts.
- Structured output and JSON mode (12 min). Parse rate and field-level accuracy as the first eval layer. Models are measurably sensitive to format constraints, so the constraint itself needs a score [2].
- Tool-call evaluation (12 min). Schema, arguments, sequence, state. Per-tool scoring against your actual endpoint signatures, with BFCL as the reference methodology [3].
- Text-to-SQL eval with execution (75 min). The execution-checked pattern: run the generated query and compare row sets instead of grading SQL strings. Test-suite accuracy is the standard this recipe implements [4].
- Online eval with Langfuse (60 min). Sampled judge scoring on live traces. The wiring looks like any other consumer on your trace stream [5].
- Wiring evals into CI (14 min). Smoke, full, and nightly tiers with separate datasets and cost envelopes. The eval becomes a merge check, not a quarterly report [6].
- Cost and latency in the scorecard (12 min). p95 latency and per-request cost belong in the same table as quality. A model change that wins on accuracy and blows the latency budget is a regression [7].
- Release gates from eval results (15 min). The decision rule that gates promotion: hard thresholds that block, soft thresholds that warn, and the override path.
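As a rough preview of where the last two steps land, the decision rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the track's implementation: the `Threshold` and `gate` names, the metric names, and every limit below are hypothetical, chosen only to show hard thresholds that block, soft thresholds that warn, and an override that is recorded rather than silent.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str              # hypothetical scorecard key
    limit: float             # hypothetical limit, not a recommended value
    higher_is_better: bool
    hard: bool               # True: a violation blocks. False: it only warns.


def violates(threshold: Threshold, value: float | None) -> bool:
    """A missing metric counts as a violation, so a broken eval run cannot pass."""
    if value is None:
        return True
    if threshold.higher_is_better:
        return value < threshold.limit
    return value > threshold.limit


def gate(
    scorecard: dict[str, float],
    thresholds: list[Threshold],
    override_by: str | None = None,
) -> dict:
    """Return a promotion decision plus the metrics that caused it."""
    blocking = [
        t.metric for t in thresholds
        if t.hard and violates(t, scorecard.get(t.metric))
    ]
    warnings = [
        t.metric for t in thresholds
        if not t.hard and violates(t, scorecard.get(t.metric))
    ]
    # An override only matters when something is actually blocking.
    overridden = bool(blocking) and override_by is not None
    decision = "block" if blocking and not overridden else "promote"
    return {
        "decision": decision,
        "blocking": blocking,
        "warnings": warnings,
        "overridden": overridden,
        "override_by": override_by if overridden else None,
    }


# Quality, latency, and cost sit in one scorecard (all values are made up).
THRESHOLDS = [
    Threshold("judge_pass_rate", 0.90, higher_is_better=True, hard=True),
    Threshold("p95_latency_ms", 1200.0, higher_is_better=False, hard=True),
    Threshold("cost_per_request_usd", 0.02, higher_is_better=False, hard=False),
]

scorecard = {
    "judge_pass_rate": 0.93,
    "p95_latency_ms": 1450.0,
    "cost_per_request_usd": 0.025,
}

blocked = gate(scorecard, THRESHOLDS)                           # decision: "block"
forced = gate(scorecard, THRESHOLDS, override_by="oncall-lead")  # decision: "promote"
```

In this made-up scorecard the candidate wins on quality but misses the latency limit, so the gate blocks it; the cost overrun is soft and only produces a warning. Keeping the override in the returned record is one possible design choice that leaves an audit trail for every forced promotion.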
What comes after the track
The natural next section is datasets: regression sets that auto-grow from production failures, and versioning and lineage so a score is always pinned to a dataset version. From there, drift detection closes the loop on the online side.
TIP
Step 3 is the one to read before your next schema review. Retrofitting trace fields after six months of production traffic is the most expensive version of this work.