Teams shipping TypeScript LLM apps typically run their evals inside Vitest because the test runner, snapshot tools, and reporters are already wired up. The Vercel AI SDK is the cleanest TypeScript surface for this: `generateText` and `streamText` accept a tools map whose calls you can capture and assert against, `generateObject` returns schema-validated output that suits a judge, and the SDK normalizes provider differences so the same eval runs against OpenAI, Anthropic, Mistral, and Bedrock [1]. The recipe will adapt the pattern from the Xata blog post: one Vitest file per scenario, a shared judge helper, snapshot-friendly assertions on tool-call shape, and a custom reporter that summarizes pass rates [2].
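As a rough illustration of a snapshot-friendly tool-call assertion, the sketch below reduces captured calls to a stable shape (tool name plus sorted argument keys). `CapturedToolCall` and `toolCallShape` are hypothetical names, and the `toolName` and `args` fields are assumptions about the SDK's tool-call objects; check them against the SDK version in use before relying on this.

```typescript
// Hypothetical local type covering only the fields this sketch relies on.
// ASSUMPTION: the field names may differ between SDK versions.
interface CapturedToolCall {
  toolName: string;
  args: Record<string, unknown>;
}

interface ToolCallShape {
  toolName: string;
  argKeys: string[];
}

// Reduce captured calls to a stable shape: tool name plus sorted argument keys.
// Argument values are dropped because models may phrase them differently per run.
export function toolCallShape(calls: CapturedToolCall[]): ToolCallShape[] {
  return calls.map((call) => ({
    toolName: call.toolName,
    argKeys: Object.keys(call.args).sort(),
  }));
}

// Hand-written calls standing in for a real model result.
const exampleShape = toolCallShape([
  { toolName: "searchDocs", args: { query: "pricing", limit: 3 } },
  { toolName: "sendEmail", args: { to: "a@example.com", subject: "Hi" } },
]);
console.log(JSON.stringify(exampleShape, null, 2));
```

Inside a Vitest test, the returned array could be passed to `expect(...).toMatchSnapshot()`, so a change in which tools are called, or with which argument keys, shows up as a snapshot diff rather than as noise from reworded values.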
Planned outline
- Project scaffold. Vitest config with a longer test timeout (LLM calls), a custom reporter for eval-style output, and a fixtures dir for scenarios.
- Shared judge helper. A `judge(query, response, rubric)` function that wraps `generateObject` with a Zod schema for the label and critique.
- Per-scenario test files. Each `*.eval.test.ts` file covers one feature: it uses `it.each` to iterate over fixture rows, calling the system under test (SUT) and then the judge.
- Tool-call assertions. The Vercel SDK exposes `toolCalls` and `toolResults`. Snapshot the shape; assert on tool name, argument validation, and execution outcome.
- Snapshot strategy. Snapshot the judge's rubric output and critique, not the prose response: the prose varies between runs, while the label should stay stable.
- CI wiring. Run on every PR; print a markdown table to the PR via a GitHub Action comment.
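The shared judge helper from the outline might be shaped roughly as follows. This is a minimal sketch and not the recipe's final code: the model call is injected as a hypothetical `callModel` function so the helper can be stubbed in tests, and a hand-written check stands in for the Zod schema. In the recipe itself, `callModel` would presumably wrap `generateObject`. The names `makeJudge`, `buildJudgePrompt`, `parseVerdict`, and the `pass`/`fail` label set are all assumptions.

```typescript
type Label = "pass" | "fail";

interface Verdict {
  label: Label;
  critique: string;
}

// Hypothetical seam: any async function that returns unvalidated model output.
type CallModel = (prompt: string) => Promise<unknown>;

export function buildJudgePrompt(query: string, response: string, rubric: string): string {
  return [
    "You are grading an assistant response against a rubric.",
    `Rubric: ${rubric}`,
    `User query: ${query}`,
    `Assistant response: ${response}`,
    'Reply with a label ("pass" or "fail") and a short critique.',
  ].join("\n");
}

// Validate the raw output by hand; a Zod schema would replace this check.
export function parseVerdict(raw: unknown): Verdict {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("judge output is not an object");
  }
  const { label, critique } = raw as { label?: unknown; critique?: unknown };
  if (typeof critique !== "string") {
    throw new Error("critique is missing or not a string");
  }
  if (label === "pass" || label === "fail") {
    return { label, critique };
  }
  throw new Error(`unexpected label: ${String(label)}`);
}

// Build a judge(query, response, rubric) function around the injected model call.
export function makeJudge(callModel: CallModel) {
  return async (query: string, response: string, rubric: string): Promise<Verdict> =>
    parseVerdict(await callModel(buildJudgePrompt(query, response, rubric)));
}
```

Keeping validation separate from the model call means a malformed judge reply fails loudly instead of being recorded as a silent pass, and the per-scenario tests can assert on `label` while snapshotting the rest.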
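For the CI wiring step, the markdown table could be produced by a small pure function like the one below. It is a sketch under assumptions: `EvalResult` and `passRateTable` are hypothetical names, and how the custom reporter collects results and how the workflow posts the comment are left out.

```typescript
interface EvalResult {
  scenario: string;
  label: "pass" | "fail";
}

// Aggregate per-scenario pass counts and render them as a markdown table.
export function passRateTable(results: EvalResult[]): string {
  const tally = new Map<string, { pass: number; total: number }>();
  for (const result of results) {
    const row = tally.get(result.scenario) ?? { pass: 0, total: 0 };
    row.total += 1;
    if (result.label === "pass") row.pass += 1;
    tally.set(result.scenario, row);
  }

  const lines = [
    "| Scenario | Passed | Total | Pass rate |",
    "| --- | --- | --- | --- |",
  ];
  // Map preserves insertion order, so scenarios appear in first-seen order.
  tally.forEach((row, scenario) => {
    const rate = ((row.pass / row.total) * 100).toFixed(1);
    lines.push(`| ${scenario} | ${row.pass} | ${row.total} | ${rate}% |`);
  });
  return lines.join("\n");
}
```

A reporter could write this string to a file at the end of the run, and a later workflow step could post the file's contents as the PR comment.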
What this will skip in v1
- Cost tracking. Write a simple wrapper.
- Streaming evals. Not yet stable.
- Multimodal evals. The SDK supports images, but the test ergonomics deserve a dedicated recipe.
- Vercel KV-backed dataset versioning.
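For readers who want cost tracking before the recipe covers it, a simple wrapper might look like the sketch below. Everything here is an assumption for illustration: `createCostTracker`, the `inputTokens`/`outputTokens` field names on the usage object, and the per-million-token pricing shape. Map the fields to whatever the SDK version in use actually returns, and supply real prices from the provider.

```typescript
// ASSUMPTION: the wrapped call resolves to an object carrying token usage
// under these field names; adapt them to the SDK's actual result shape.
interface Usage {
  inputTokens: number;
  outputTokens: number;
}

// Hypothetical pricing shape: USD per one million tokens.
interface Pricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

export function createCostTracker(pricing: Pricing) {
  let inputTokens = 0;
  let outputTokens = 0;

  return {
    // Run the call, add its token usage to the running totals, return the result.
    async track<T extends { usage: Usage }>(call: () => Promise<T>): Promise<T> {
      const result = await call();
      inputTokens += result.usage.inputTokens;
      outputTokens += result.usage.outputTokens;
      return result;
    },
    // Estimated spend so far, in USD.
    totalUsd(): number {
      return (
        (inputTokens * pricing.inputPerMillion +
          outputTokens * pricing.outputPerMillion) /
        1_000_000
      );
    },
  };
}
```

Each model call in a test would be wrapped as `tracker.track(() => someCall())`, and `tracker.totalUsd()` could be logged once the suite finishes.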
The Xata blog post is the closest published walkthrough and the source of the patterns above [2]. The Vercel-published intro to evals is the right starting context [1].