SDET and QA engineers ship the CI integration, the regression sets, the flakiness budgets, and the release gate. This track is pragmatic test engineering: it ships pytest and Vitest patterns, GitHub Actions and Buildkite snippets, mock fixtures, and cost dashboards. It assumes test-engineering fluency and treats LLM evals as one more flaky-by-default test category to discipline.
The first three steps cover the architecture. Steps four through six cover dataset and statistical discipline. The last four are operational (sharding, tool choice, PR wiring, release gates).
The track
- Anatomy of an AI CI pipeline (14 min). The smoke / full / nightly tier pattern. What runs on every PR, what runs on merge, what runs at 2 AM. Token cost and runtime budget for each tier.
- Eval tiering: smoke, full, nightly (12 min). Three tiers, three datasets, three cost envelopes. The default split: smoke ~50 examples, full ~500, nightly ~5,000 1.
- Making LLM judges deterministic enough (14 min). Temperature zero, pinned model versions, schema constraints, position swap on pairwise. The five disciplines that turn a judge from flaky to test-eligible.
- Golden set management (14 min). Content hashing, dataset versioning, the contract between PR author and eval. How to pin a result to a dataset version.
- Variance and flakiness budgets (16 min). Why naive standard errors are too small on clustered data. Wilson intervals on rates. When a two-point drop is inside the noise band 2.
- Cost control for judge calls in CI (14 min). Sampling strategies, distilled judges, caching. Keeping the weekly bill under $50 for a smoke tier with one judge per trace.
- Sharding eval suites (14 min). Parallelism patterns. The right way to slice 5,000 traces across 10 runners. How to merge results without double-counting.
- Tool comparison: OpenAI Evals, RAGAS, Phoenix, LangSmith (18 min). Which tool fits which CI shape. The build-vs-buy decision table by team size, vendor preference, and CI runner 3.
- Wiring eval results into PR reviews (14 min). GitHub Actions snippets, PR comment formatting, the metric-table convention. How to make the eval result visible at the PR review surface, not buried in a build artifact.
- Release gates from eval scores (15 min). The decision rule that gates promotion. Hard thresholds (no merge if TPR drops > 5 points), soft thresholds (warning if mean drops > 2 points), and the override path.
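The release-gate rule in the last step can be sketched in a few lines of Python. This is a minimal illustration, not the track's own implementation: the `EvalSummary` type, the `gate` function, and the choice to downgrade an overridden hard failure to a warning are all assumptions. Only the two thresholds (5 points on TPR, 2 points on the mean) come from the step description.

```python
from dataclasses import dataclass

# Thresholds from the step description, in percentage points.
HARD_TPR_DROP = 5.0
SOFT_MEAN_DROP = 2.0


@dataclass(frozen=True)
class EvalSummary:
    """Hypothetical summary of one eval run; both scores are on a 0-100 scale."""
    tpr: float
    mean_score: float


def gate(baseline: EvalSummary, candidate: EvalSummary, override: bool = False) -> str:
    """Return 'block', 'warn', or 'pass' for a candidate run against a baseline."""
    tpr_drop = baseline.tpr - candidate.tpr
    mean_drop = baseline.mean_score - candidate.mean_score
    if tpr_drop > HARD_TPR_DROP:
        # Assumed override behaviour: the failure stays visible as a warning.
        return "warn" if override else "block"
    if mean_drop > SOFT_MEAN_DROP:
        return "warn"
    return "pass"
```

A CI step could call `gate(...)` on the merged eval results and fail the job only when the return value is `"block"`, leaving `"warn"` to surface in the PR comment.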
What comes after the track
Once you have read the ten pages, the natural next moves are to write the eval CI workflow YAML, pin the smoke and full datasets in your repo, write the PR-comment formatter, and pick a vendor (OpenAI Evals is the default if your stack is already OpenAI-shaped) 3. Inspect AI is the right secondary choice for research-grade trajectory evals that CI runs weekly 4. The Applied LLMs essay is the right read on the discipline beyond the tooling 5.
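One way to pin a dataset in the repo is to commit a content hash next to it and have CI recompute it. The sketch below is an assumption about how that could look, not the method from the golden-set page: the `dataset_hash` name is hypothetical, and making the digest independent of example order and key order is a design choice you may not want if ordering matters to your eval.

```python
import hashlib
import json


def dataset_hash(examples: list[dict]) -> str:
    """Return a SHA-256 hex digest that is stable across key order and example order."""
    # Canonicalise each example as compact JSON with sorted keys, then sort the
    # serialised examples so that shuffling the dataset does not change the hash.
    canonical = sorted(
        json.dumps(ex, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
        for ex in examples
    )
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")  # separator so adjacent examples cannot run together
    return digest.hexdigest()
```

Recording the digest alongside every eval result lets a reviewer confirm that two scores were computed on the same dataset version before comparing them.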
TIP
The flakiness budget conversation (step 5) is the one where you build credibility with the rest of engineering. A team that can answer "is this score change real" with statistics has a different shipping cadence than a team that says "looks worse, retry."
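To make "is this score change real" concrete, here is a minimal sketch of a Wilson score interval on a pass rate. It assumes the examples are independent, which the variance step warns is often not true of clustered data, so treat the interval as a lower bound on the real noise. The function names and the 95% default (`z = 1.96`) are illustrative choices, not something this track prescribes.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial rate; z = 1.96 gives roughly 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def inside_noise_band(baseline_rate: float, successes: int, n: int) -> bool:
    """True when the baseline rate falls inside the candidate run's interval."""
    low, high = wilson_interval(successes, n)
    return low <= baseline_rate <= high
```

On a 50-example smoke tier, 45 passes gives an interval of roughly 0.79 to 0.96, so a baseline of 0.92 is inside the band and a two-point drop on its own is not evidence of a regression.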