Start Here for SDET and QA Engineers

Ten steps, from the anatomy of an AI CI pipeline to release gates. Pragmatic test engineering: config snippets plus pytest and Vitest patterns.

SDET and QA engineers ship the CI integration, the regression sets, the flakiness budgets, and the release gate. This track is pragmatic about test engineering: it ships pytest and Vitest patterns, GitHub Actions and Buildkite snippets, mock fixtures, and cost dashboards. It assumes test-engineering fluency and treats LLM evals as one more flaky-by-default test category to discipline.

The first three steps cover the architecture. Steps four through six cover dataset and statistical discipline. The last four are operational: regression sets, platform choice, suite sizing, and release gates.

The track

1. Anatomy of an AI CI pipeline (12 min). The smoke / full / nightly tier pattern: what runs on every PR, what runs on merge, what runs at 2 AM, with a cost envelope per tier ¹. Sharding, cost caps, and the GitHub Actions wiring that publishes the metric table to the PR.
2. Online vs offline: what CI can catch (12 min). Offline suites on frozen datasets gate merges; online evals sample live traffic after the fact. The handoffs between the two, and why CI alone never sees the traffic distribution.
3. Making LLM judges deterministic enough (14 min). Position swaps on pairwise comparisons, pinned rubric schemas, format-sensitivity checks, and the prompt-hash discipline. The pre-ship checklist that turns a judge from flaky to test-eligible.
4. Golden set management (14 min). Content hashing, dataset versioning, the contract between PR author and eval. How to pin a result to a dataset version.
5. Variance and flakiness budgets (16 min). Why naive standard errors are too small on clustered data. Wilson intervals on rates. When a two-point drop is inside the noise band ².
6. Cost control: distilled judges (14 min). The cost math for swapping a frontier judge for a small fine-tuned one, what the swap silently costs, and the inline / calibration / gold architecture that keeps the cheap judge honest.
7. Regression sets and tiering (8 min). The set whose job is to fail loudly when a fixed bug returns. What goes in, the capture workflow from production failures, and how the set tiers into smoke and full.
8. Choosing the eval platform (10 min). Langfuse, Phoenix, Braintrust, LangSmith, or JSONL in git. The three decision axes, the source-of-truth principle, and what stays out of the vendor regardless of choice.
9. How big does the eval set need to be (12 min). Power analysis for test engineers: pick the delta you care about, compute the N that detects it, and stop arguing about score changes the suite was never sized to see. Includes a power table to steal.
10. Release gates from eval scores (15 min). The decision rule that gates promotion. Hard thresholds (no merge if TPR drops > 5 points), soft thresholds (warning if mean drops > 2 points), and the override path.
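
As a rough illustration of the decision rule in the release-gates step, the sketch below encodes the two thresholds quoted above (block on a TPR drop of more than 5 points, warn on a mean drop of more than 2 points) plus one possible override path. The names `GateResult`, `evaluate_gate`, and `override_reason`, the metric keys `tpr` and `mean`, and the choice to downgrade an overridden block to a warning are assumptions made for illustration, not something the track prescribes.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    verdict: str  # "pass", "warn", or "block"
    reasons: tuple[str, ...]


def evaluate_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    hard_tpr_drop: float = 5.0,    # hard threshold, in points
    soft_mean_drop: float = 2.0,   # soft threshold, in points
    override_reason: str | None = None,  # hypothetical override path
) -> GateResult:
    """Compare candidate scores to baseline scores, both in points (0-100)."""
    reasons: list[str] = []
    verdict = "pass"

    # Soft threshold: a warning only, never a block.
    mean_drop = baseline["mean"] - candidate["mean"]
    if mean_drop > soft_mean_drop:
        verdict = "warn"
        reasons.append(f"mean dropped {mean_drop:.1f} points (soft limit {soft_mean_drop:.1f})")

    # Hard threshold: blocks unless an override reason is recorded.
    tpr_drop = baseline["tpr"] - candidate["tpr"]
    if tpr_drop > hard_tpr_drop:
        if override_reason:
            verdict = "warn"
            reasons.append(f"TPR dropped {tpr_drop:.1f} points, overridden: {override_reason}")
        else:
            verdict = "block"
            reasons.append(f"TPR dropped {tpr_drop:.1f} points (hard limit {hard_tpr_drop:.1f})")

    return GateResult(verdict, tuple(reasons))
```

In a CI job, a `block` verdict would typically map to a non-zero exit code, while `warn` would surface the reasons in the PR comment without failing the check. Keeping the override reason in the result leaves an audit trail for whoever reviews the promotion later.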

What comes after the track

Once you have read the ten pages, the natural next moves are: write the eval CI workflow YAML; pin the smoke and full datasets in your repo; write the PR-comment formatter; and pick a vendor (OpenAI Evals is the default if your stack is already OpenAI-shaped) ³. Inspect AI is the right secondary choice for research-grade trajectory evals that the CI hits weekly ⁴. The Applied LLMs essay is the right read on the discipline beyond the tooling ⁵.
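
Pinning the smoke and full datasets is easier to enforce when every eval result records a content hash of the dataset it ran against. The following is a minimal sketch, assuming the dataset is a list of JSON-serializable records (for example, parsed JSONL lines). The function name `dataset_fingerprint` is hypothetical, and making the hash independent of record order and key order is a design assumption of this sketch, not a requirement stated by the track.

```python
import hashlib
import json


def dataset_fingerprint(records: list) -> str:
    """Return a SHA-256 hex digest that changes whenever any record's content changes.

    Each record is serialized canonically (sorted keys, compact separators),
    so key order does not matter. Per-record hashes are sorted before being
    combined, so record order does not matter either.
    """
    record_hashes = sorted(
        hashlib.sha256(
            json.dumps(
                record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
            ).encode("utf-8")
        ).hexdigest()
        for record in records
    )
    combined = hashlib.sha256()
    for record_hash in record_hashes:
        combined.update(record_hash.encode("ascii"))
    return combined.hexdigest()
```

One way to use this is to commit the expected digest next to the dataset and have a test assert that the two match, so an edited golden set cannot slip through without an explicit version bump in the same PR.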

TIP

The flakiness budget conversation (step 5) is the one where you build credibility with the rest of engineering. A team that can answer "is this score change real" with statistics has a different shipping cadence than a team that says "looks worse, retry."
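
One hedged way to answer "is this score change real" for a pass rate is the Wilson interval mentioned in step 5. The sketch below computes the interval for a baseline run and checks whether the candidate's rate falls inside it. The function names `wilson_interval` and `drop_within_noise` are hypothetical, and the inside-the-baseline-interval check is a deliberately crude heuristic chosen for illustration. It also assumes independent test items; clustered data needs wider intervals than this produces.

```python
import math
from statistics import NormalDist


def wilson_interval(passes: int, n: int, confidence: float = 0.95) -> tuple:
    """Wilson score interval for a pass rate of `passes` out of `n` items."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= passes <= n:
        raise ValueError("passes must be between 0 and n")
    # Two-sided critical value, about 1.96 at 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def drop_within_noise(baseline_passes: int, candidate_passes: int, n: int,
                      confidence: float = 0.95) -> bool:
    """Heuristic: True if the candidate rate lies inside the baseline's interval."""
    low, high = wilson_interval(baseline_passes, n, confidence)
    return low <= candidate_passes / n <= high
```

For example, with 200 items and a baseline of 160 passes, the 95% interval runs from roughly 74% to 85%, so a candidate at 156 passes (a two-point drop) sits inside the band, while a candidate at 140 passes does not.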

¹ Hamel Husain & Shreya Shankar, "LLM Evals FAQ." 100 to 500 labels.
² Evan Miller, "Adding Error Bars to Evals." arXiv:2411.00640.
³ OpenAI Evals GitHub repository.
⁴ Inspect AI documentation, UK AI Safety Institute.
⁵ Yan, Bischof, Frye, Husain, Liu, Shankar, "What We Learned from a Year of Building with LLMs."