A CI pipeline that runs every eval on every PR will be too slow and too expensive, and will get bypassed within a month. One that runs no evals will let regressions through and surface them in production. The working pattern lives in between: tier the suite, run the smoke set on every push, the full set on merge candidates, and the nightly set on a schedule. Cap the cost per PR, publish the result table as a PR comment, and gate the merge on the change in the metric, not its absolute value.
Hamel's argument that evals are a level of testing applies directly: Level 1 is pytest-style assertions, Level 2 is human-plus-judge review on a labelled set, and Level 3 is production A/B testing [1]. CI's job is to make Level 1 cheap to fail and Level 2 fast to inspect. Production A/B lives downstream.
The three tiers
Smoke. Ten to fifty assertions covering the highest-value workflows. Runs on every push, in under two minutes, at single-digit-dollar cost. The point is to catch a malformed prompt change, a regression on a canonical happy path, or broken tool wiring before the PR sits in review. Smoke does not need a judge; deterministic checks (schema validation, regex on outputs, exact match on a tiny golden set) are enough for this layer.
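The three deterministic checks named above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a prescribed harness: the field names in `REQUIRED_FIELDS`, the refund phrasing in `REFUND_PATTERN`, and the function names are all assumptions made for the example.

```python
import json
import re

# Hypothetical contract for a refund reply; the field names are illustrative.
REQUIRED_FIELDS = {"message": str, "amount_cents": int}
# Hypothetical expected opening, e.g. "Refund of $12.50 ...".
REFUND_PATTERN = re.compile(r"^Refund of \$\d+\.\d{2}\b")


def check_schema(raw: str) -> bool:
    """True when raw is a JSON object carrying the required, correctly typed fields."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    return all(isinstance(payload.get(key), typ) for key, typ in REQUIRED_FIELDS.items())


def check_format(message: str) -> bool:
    """True when the reply opens with the expected refund phrasing."""
    return REFUND_PATTERN.match(message) is not None


def check_exact(output: str, golden: str) -> bool:
    """Exact match against a golden answer, ignoring surrounding whitespace."""
    return output.strip() == golden.strip()
```

None of these checks calls a model, so a failure is reproducible and costs nothing beyond the workflow call under test.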
Full. One hundred to five hundred labelled examples covering the full feature surface, run with the calibrated judge. Triggered on PRs labelled eval and on every merge candidate. Budget five to fifteen minutes and tens of dollars. The full tier is the merge gate; the result is a quality, cost, and latency triple with intervals against the baseline.
Nightly. The expanded regression suite plus contamination probes, robustness sweeps (format, order, temperature), and drift comparisons against the previous nightly. Runs on a schedule, posts the result to a dashboard, and pages on regression. The nightly tier catches the slow leaks the merge tier cannot afford to look for.
A working GitHub Actions setup
```yaml
# .github/workflows/evals.yml
name: evals

on:
  pull_request:
    types: [opened, synchronize, reopened, labeled]
  schedule:
    - cron: "0 6 * * *" # 06:00 UTC nightly (the nightly job itself is not shown here)

jobs:
  smoke:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    timeout-minutes: 5
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -e ".[dev]"
      - run: pytest evals/smoke -q --maxfail=3
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          EVAL_BUDGET_USD: "5"

  full:
    if: contains(github.event.pull_request.labels.*.name, 'eval')
    runs-on: ubuntu-latest
    timeout-minutes: 20
    strategy:
      matrix:
        shard: [0, 1, 2, 3]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -e ".[dev]"
      - name: run shard
        run: |
          python -m evals.runner --suite full --shard ${{ matrix.shard }}/4 \
            --judge-version "$(cat configs/judge.version)" \
            --dataset-hash "$(cat datasets/full.hash)" \
            --out shard-${{ matrix.shard }}.json
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          EVAL_BUDGET_USD: "15"
      - uses: actions/upload-artifact@v4
        with:
          name: shard-${{ matrix.shard }}
          path: shard-${{ matrix.shard }}.json

  report:
    needs: full
    if: contains(github.event.pull_request.labels.*.name, 'eval')
    runs-on: ubuntu-latest
    # The default token needs write access to post the PR comment.
    permissions:
      contents: read
      pull-requests: write
    steps:
      # evals.report lives in the repo, so this job needs the code and its dependencies too.
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -e ".[dev]"
      # Without merge-multiple, each artifact lands in its own directory
      # and the shard-*.json glob below matches nothing.
      - uses: actions/download-artifact@v4
        with:
          pattern: shard-*
          merge-multiple: true
      # Block scalar so the backslash continuation reaches the shell intact.
      - run: |
          python -m evals.report --merge shard-*.json --baseline main \
            --pr ${{ github.event.pull_request.number }}
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```
A pytest-style harness for the smoke tier keeps the failure surface familiar. Assertions can be deterministic (assert response.startswith("Refund of")) or judge-backed (assert score_faithfulness(...).label == "pass"); the runner caps cost via the EVAL_BUDGET_USD env and writes the verdict to a JSON report that the comment step reads.
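```python
# evals/smoke/test_refund_flow.py
import json
from pathlib import Path

import pytest

from evals.harness import run_workflow, judge


def load_cases(path):
    """Read one JSON object per non-empty line of a JSONL file."""
    return [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]


# judge_version is expected to be a fixture, for example one defined in conftest.py.
@pytest.mark.parametrize("case", load_cases("evals/cases/refund_smoke.jsonl"))
def test_refund_flow(case, judge_version):
    out = run_workflow("refund", case["input"])
    # Deterministic check first: fail fast on malformed output.
    assert out["schema_valid"], out["raw"]
    # Judge-backed check, pinned to a specific judge version.
    v = judge.faithfulness(out["response"], context=case["context"], version=judge_version)
    assert v.label == "pass", v.rationale
```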
Sharding and cost control
A four-shard full suite cuts wall time by roughly a factor of four, with a small overhead, and keeps any one shard's failure surface small enough to debug. The sharding key should be the dataset row, not the test file; rows shuffle deterministically by hash so reruns are reproducible. Each shard reads the same judge version from configs/judge.version and the same dataset hash from datasets/full.hash. Pin both per release.
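Row-level sharding by hash can be sketched as follows. The function names and the `index/total` argument format (matching the `--shard 2/4` form in the workflow) are assumptions for illustration. The sketch uses `hashlib` rather than the built-in `hash()`, because Python salts string hashes per process, which would break reproducibility across reruns.

```python
import hashlib


def _digest(row_id: str) -> bytes:
    """Stable digest of a row identifier, identical across processes and machines."""
    return hashlib.sha256(row_id.encode("utf-8")).digest()


def shard_of(row_id: str, num_shards: int) -> int:
    """Map a dataset row to a shard index using its stable hash."""
    return int.from_bytes(_digest(row_id)[:8], "big") % num_shards


def rows_for_shard(row_ids, shard: int, num_shards: int):
    """Rows owned by one shard, ordered by hash: shuffled, but the same on every run."""
    mine = [r for r in row_ids if shard_of(r, num_shards) == shard]
    return sorted(mine, key=_digest)


def parse_shard_arg(arg: str):
    """Parse an 'index/total' argument such as '2/4' into (2, 4)."""
    index, total = (int(part) for part in arg.split("/"))
    if not 0 <= index < total:
        raise ValueError(f"shard index {index} out of range for {total} shards")
    return index, total
```

Because assignment depends only on the row identifier and the shard count, adding or reordering rows in the dataset file does not move existing rows between shards.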
Cost control is a hard cap at the runner level, not a soft warning. An EVAL_BUDGET_USD environment variable that the runner enforces, by stopping before the next call when the projected spend exceeds the cap, is enough to keep a runaway loop from turning into a five-figure surprise. Vercel's writeup of evals in CI makes the same recommendation; the TypeScript pattern from Xata's blog adapts cleanly when the stack is Node-first [2, 3].
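A hard cap of this kind might look like the sketch below. The `BudgetGuard` class and its method names are hypothetical, and how the per-call cost is estimated (token counts times a price table, for instance) is left to the runner; only the EVAL_BUDGET_USD variable comes from the workflow above.

```python
import os


class BudgetExceeded(RuntimeError):
    """Raised when the next call would push spend past the cap."""


class BudgetGuard:
    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    @classmethod
    def from_env(cls, default: str = "5"):
        # The fallback value is an assumption; CI is expected to set the variable explicitly.
        return cls(float(os.environ.get("EVAL_BUDGET_USD", default)))

    def check(self, estimated_cost_usd: float) -> None:
        """Call before each model request; raises instead of letting the call proceed."""
        projected = self.spent_usd + estimated_cost_usd
        if projected > self.cap_usd:
            raise BudgetExceeded(
                f"projected ${projected:.2f} exceeds cap ${self.cap_usd:.2f}"
            )

    def record(self, actual_cost_usd: float) -> None:
        """Call after each model request with the cost actually incurred."""
        self.spent_usd += actual_cost_usd
```

The check runs before the request rather than after it, so the cap is never overshot by more than the estimation error on a single call.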
Publishing the result to the PR
The comment is the user interface. A useful default: a four-column table per slice, baseline vs candidate, with the delta and a 95% interval, plus a one-line "merge / hold / block" verdict at the top. Block when any slice's quality delta excludes zero on the negative side; hold when cost or latency regressions exceed the configured thresholds; merge when the deltas are within the bands. Hamel's FAQ argues the comment is where the reviewer actually adjudicates the change; do not bury the result in a job log [4].
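The verdict rules and the table can be sketched together. Everything named here is illustrative: the `SliceResult` fields, the 10% default thresholds for cost and latency, and the exact column layout are assumptions, not values from a particular tool. The output is a Markdown table, which GitHub renders in PR comments.

```python
from dataclasses import dataclass


@dataclass
class SliceResult:
    name: str
    baseline: float   # baseline pass rate, 0..1
    candidate: float  # candidate pass rate, 0..1
    ci_low: float     # lower bound of the 95% interval on (candidate - baseline)
    ci_high: float    # upper bound of the same interval


def verdict(slices, cost_delta_pct, latency_delta_pct,
            max_cost_pct=10.0, max_latency_pct=10.0):
    """Block on a quality regression, hold on cost or latency, otherwise merge."""
    # An interval that lies entirely below zero excludes zero on the negative side.
    if any(s.ci_high < 0 for s in slices):
        return "block"
    if cost_delta_pct > max_cost_pct or latency_delta_pct > max_latency_pct:
        return "hold"
    return "merge"


def render_comment(slices, cost_delta_pct, latency_delta_pct):
    """Markdown body for the PR comment: verdict first, then one row per slice."""
    lines = [
        f"Verdict: {verdict(slices, cost_delta_pct, latency_delta_pct)}",
        "",
        "| Slice | Baseline | Candidate | Delta (95% interval) |",
        "| --- | --- | --- | --- |",
    ]
    for s in slices:
        delta = s.candidate - s.baseline
        lines.append(
            f"| {s.name} | {s.baseline:.1%} | {s.candidate:.1%} "
            f"| {delta:+.1%} [{s.ci_low:+.1%}, {s.ci_high:+.1%}] |"
        )
    return "\n".join(lines)
```

Quality is checked before cost and latency so that a PR which regresses on both is reported as blocked, the stronger of the two verdicts.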
The merge gate
Two opinions. First, gate on delta with intervals, not on absolute pass rate. A PR that takes a 92% baseline to 91% with overlapping intervals is not a regression; one that takes it to 88% with a delta interval excluding zero is. Second, no automatic overrides. A PR that fails the gate can ship after a human override with a comment explaining why; silent green is worse than loud red.
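One way to compute the delta interval is a paired percentile bootstrap over per-example scores, sketched below. This is one reasonable choice among several, not the only valid method; the function names, the 2000 resamples, and the fixed seed are assumptions. It presumes the baseline and candidate were scored on the same examples in the same order.

```python
import random


def bootstrap_delta_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Return (delta, low, high) for mean(candidate) - mean(baseline), paired by example."""
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need equal-length, non-empty paired scores")
    rng = random.Random(seed)  # fixed seed so CI reruns give the same interval
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)
    # Resample the per-example differences with replacement and record each mean.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    k = int((alpha / 2) * n_resamples)
    return sum(diffs) / n, means[k], means[n_resamples - 1 - k]


def is_regression(baseline, candidate, **kwargs):
    """True when the whole interval on the delta sits below zero."""
    _, _, high = bootstrap_delta_ci(baseline, candidate, **kwargs)
    return high < 0
```

Pairing matters here: resampling the per-example differences removes the variance shared by both runs, so the interval is tighter than one built from two independent pass rates.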
Important: A CI eval that never fails is theatre. If your suite has not blocked a PR in the last month, the suite is too lenient, the data is stale, or the team is gaming it. Walk the traces; refresh the dataset; tighten the thresholds [5].
What goes in the nightly
The nightly tier is where the expensive checks live. A useful default:
- The full regression suite at the latest dataset version.
- A robustness sweep: each prompt run across five plausible formats and three random seeds.
- A contamination probe: the canary strings, a rephrased subset, and a private hold-out.
- A drift comparison against the previous nightly: input distribution, output distribution, per-eval score distribution.
Nightly pages on regression; the runbook is the same as the production drift one [6].
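For the drift comparison, one simple statistic is the two-sample Kolmogorov-Smirnov distance between last night's and tonight's values for each metric. The sketch below is an illustration under assumptions: the choice of KS over other distances, the 0.2 threshold, and the dictionary layout keyed by metric name are all hypothetical.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between the two empirical CDFs; 0 means identical, 1 means disjoint."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    # bisect_right(sample, x) / len(sample) is the fraction of values <= x.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


def drift_report(previous, current, threshold=0.2):
    """Compare each metric present in both nightlies and flag those past the threshold."""
    report = {}
    for name in sorted(previous.keys() & current.keys()):
        stat = ks_statistic(previous[name], current[name])
        report[name] = {"ks": stat, "drifted": stat > threshold}
    return report
```

The same function applies to input lengths, output lengths, and per-eval scores, since it only assumes numeric samples.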
What to do this week
- Split your current suite into smoke (under two minutes) and full (five to fifteen minutes). Move the deterministic checks to smoke; keep the judge-backed checks in full.
- Add an EVAL_BUDGET_USD environment variable to the runner and enforce it. Pin the judge version and dataset hash in the workflow.
- Wire the result to the PR comment. Block on delta intervals, not on absolute rates. Override requires a human comment.