Start Here for Backend Engineers

Ten steps from why evals matter through release gates, covering trace logging from services, structured output, tool calls, execution-checked SQL, online eval, CI wiring, and latency and cost budgets. Also available as a free PDF download.

Backend engineers own the services the model sits inside: the API that calls it, the queue that buffers it, the database it reads, and the budget it burns. This track treats an eval as one more contract a service has to honor, next to the schema, the SLO, and the bill. It assumes staff-level fluency in Python or TypeScript and the usual comfort with services, queues, and SQL. Two steps are hands-on recipes that take 60 to 75 minutes each; apart from the 30-minute first eval, the rest are reading.

The order is deliberate. Steps one and two make the case and give you a runnable artifact. Steps three through five cover the contracts: what your services log, what the model returns, and how it calls your tools. Steps six and seven are the two recipes, one offline and execution-checked, one online and sampled. The last three wire results into the machinery you already trust: CI, the latency budget, and the release gate.

The track

1. Why evals are the moat (12 min). The three reasons LLM features rot (non-determinism, subjectivity, drift) and why ship-and-pray fails for AI. Hamel's argument is the canonical one¹.
2. Your first eval in 30 minutes (30 min). The runnable artifact: fifty labeled traces, a binary judge, a confusion matrix. Do this once before touching the production pipeline.
3. Trace schema: what to log (14 min). The span fields your services emit are the substrate for every later step. Get the schema right before the backfill conversation starts.
4. Structured output and JSON mode (12 min). Parse rate and field-level accuracy as the first eval layer. Models are measurably sensitive to format constraints, so the constraint itself needs a score².
5. Tool-call evaluation (12 min). Schema, arguments, sequence, state. Per-tool scoring against your actual endpoint signatures, with BFCL as the reference methodology³.
6. Text-to-SQL eval with execution (75 min). The execution-checked pattern: run the generated query and compare row sets instead of grading SQL strings. Test-suite accuracy is the standard this recipe implements⁴.
7. Online eval with Langfuse (60 min). Sampled judge scoring on live traces. The wiring looks like any other consumer on your trace stream⁵.
8. Wiring evals into CI (14 min). Smoke, full, nightly tiers with separate datasets and cost envelopes. The eval becomes a merge check, not a quarterly report⁶.
9. Cost and latency in the scorecard (12 min). p95 latency and per-request cost belong in the same table as quality. A model change that wins on accuracy and blows the latency budget is a regression⁷.
10. Release gates from eval results (15 min). The decision rule that gates promotion: hard thresholds that block, soft thresholds that warn, and the override path.
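
To make the last two steps concrete, here is a minimal sketch of a release gate that reads quality, p95 latency, and per-request cost from one scorecard. It is an illustration under stated assumptions, not the track's implementation: the metric names, the threshold values, and the `gate` function are all hypothetical, and a missing metric is assumed to fail closed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Tuple


class Verdict(Enum):
    PASS = "pass"
    WARN = "warn"
    BLOCK = "block"


@dataclass(frozen=True)
class Threshold:
    metric: str             # key in the scorecard (hypothetical names below)
    limit: float
    higher_is_better: bool  # True for quality, False for latency and cost
    hard: bool              # hard thresholds block, soft thresholds warn


def violates(t: Threshold, scorecard: Dict[str, float]) -> bool:
    value = scorecard.get(t.metric)
    if value is None:
        return True  # assumption: a metric that was not measured fails closed
    return value < t.limit if t.higher_is_better else value > t.limit


def gate(
    scorecard: Dict[str, float],
    thresholds: List[Threshold],
    override_by: Optional[str] = None,
) -> Tuple[Verdict, List[str]]:
    """Return the verdict and the names of the metrics that missed."""
    hard = [t.metric for t in thresholds if t.hard and violates(t, scorecard)]
    soft = [t.metric for t in thresholds if not t.hard and violates(t, scorecard)]
    if hard and override_by is None:
        return Verdict.BLOCK, hard + soft
    if hard or soft:
        # Either only soft misses, or a named person accepted the hard misses.
        return Verdict.WARN, hard + soft
    return Verdict.PASS, []


# Illustrative values only; real limits come from your own SLO and budget.
THRESHOLDS = [
    Threshold("accuracy", 0.90, higher_is_better=True, hard=True),
    Threshold("p95_latency_ms", 2000.0, higher_is_better=False, hard=True),
    Threshold("cost_per_request_usd", 0.005, higher_is_better=False, hard=False),
]

candidate = {"accuracy": 0.93, "p95_latency_ms": 2400.0, "cost_per_request_usd": 0.004}
verdict, missed = gate(candidate, THRESHOLDS)
```

With these numbers the candidate wins on accuracy but misses the latency limit, so the gate blocks it, which is the regression case step 9 describes. Passing `override_by` downgrades the block to a warning while keeping the missed metric in the result, so the override leaves a record.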

What comes after the track

The natural next section is datasets: regression sets that auto-grow from production failures, and versioning and lineage so a score is always pinned to a dataset version. From there, drift detection closes the loop on the online side.
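
One way to pin a score to a dataset version is to derive the version from the dataset's content. The sketch below assumes examples are JSON-serializable dicts and uses a truncated SHA-256 as the version; the function names and the record layout are hypothetical, and a real setup might rely on a dataset registry or git tags instead.

```python
import hashlib
import json
from typing import Dict, List


def _canonical(example: Dict) -> str:
    # Stable serialization so key order does not change the hash.
    return json.dumps(example, sort_keys=True, separators=(",", ":"))


def dataset_version(examples: List[Dict]) -> str:
    """Content hash of the dataset, insensitive to example order."""
    lines = sorted(_canonical(e) for e in examples)
    digest = hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()
    return digest[:12]


def add_failure(examples: List[Dict], failure: Dict) -> List[Dict]:
    """Grow the regression set with a production failure, skipping duplicates."""
    seen = {_canonical(e) for e in examples}
    if _canonical(failure) in seen:
        return examples
    return examples + [failure]


def record_score(metric: str, value: float, examples: List[Dict]) -> Dict:
    """A score record that always carries the dataset version it was run on."""
    return {
        "metric": metric,
        "value": value,
        "n": len(examples),
        "dataset_version": dataset_version(examples),
    }


# Illustrative data only.
base = [
    {"input": "refund status for order 17", "expected": "pass"},
    {"input": "cancel my subscription", "expected": "fail"},
]
grown = add_failure(base, {"input": "change shipping address", "expected": "fail"})
before = record_score("judge_agreement", 0.88, base)
after = record_score("judge_agreement", 0.84, grown)
```

Because the version changes whenever the regression set grows, the two records above cannot be mistaken for a like-for-like comparison: the drop in the score is attributable to a different dataset rather than silently to the model.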

TIP

Step 3 is the one to read before your next schema review. Retrofitting trace fields after six months of production traffic is the most expensive version of this work.

¹ Hamel Husain, "Your AI Product Needs Evals."
² Sclar et al., "Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design." arXiv:2310.11324.
³ Berkeley Function-Calling Leaderboard.
⁴ Zhong et al., "Semantic Evaluation for Text-to-SQL with Distilled Test Suites." arXiv:2010.02840.
⁵ Langfuse documentation.
⁶ Hamel Husain & Shreya Shankar, "LLM Evals FAQ."
⁷ Yan, Bischof, Frye, Husain, Liu, Shankar, "What We Learned from a Year of Building with LLMs."
