TypeScript eval with Vercel AI SDK

Vitest as the eval runner, AI SDK 6 generateText under test, a zod-typed binary judge, a JSON case file, and CI wiring with per-run cost accounting.

TypeScript teams already have a test runner, a watch mode, a CI lane, and a culture of red-equals-blocked. The cheapest eval harness is the one that reuses all of that, which is why the Vitest-as-eval-runner pattern stuck after Xata wrote it up ¹ ². The Vercel AI SDK is the right model client for it: one generateText call shape across providers, so the suite runs against Anthropic today and OpenAI tomorrow by swapping an import ³ ⁴.

One version note before any code. This recipe is written against AI SDK 6 (shipped December 2025) with ai 6.0, @ai-sdk/anthropic 3.0, vitest 4.1, and zod 4. If you learned the SDK earlier: standalone generateObject and streamObject are deprecated in v6, and structured output now goes through generateText with an output setting ⁵. Code below uses only the v6 surface.

What you build

.
├── src/
│   ├── support-reply.ts            the prose SUT (generateText)
│   └── triage.ts                   the structured SUT (Output.object)
├── evals/
│   ├── cases.json                  the dataset
│   ├── judge.ts                    zod-typed binary judge
│   ├── support-reply.eval.test.ts  judge-graded cases
│   └── triage.eval.test.ts         code-graded cases
├── vitest.eval.config.ts
└── .github/workflows/evals.yml

The system under test is a support assistant for a fictional CI product. Prose replies get a binary LLM judge; structured triage output gets plain code assertions. That split is the core design decision: never pay a judge for something a regex or an enum check can decide.

Step 1: scaffold

npm init -y
npm install ai @ai-sdk/anthropic zod
npm install -D vitest typescript @types/node
export ANTHROPIC_API_KEY=...

Evals get their own Vitest config so npm test stays fast and free while npm run evals spends money on purpose:

import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    include: ['evals/**/*.eval.test.ts'],
    testTimeout: 60_000, // LLM calls, not unit tests
    retry: 1,            // absorbs a provider hiccup, not a real regression
    maxConcurrency: 4,   // stay under your provider rate limit
  },
});

{
  "scripts": {
    "evals": "vitest run --config vitest.eval.config.ts"
  }
}

In tsconfig.json, set "module": "nodenext", "strict": true, and "resolveJsonModule": true (the dataset is imported as JSON).

Step 2: the systems under test

The prose SUT is one generateText call with the production system prompt. Keep the function signature boring; the eval imports the same function the app does, so there is no drift between what you test and what you ship.

import { generateText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

const SYSTEM = `You are the support assistant for Lumen, a hosted CI product.
Rules: never promise refunds or credits; never state incident status or root
causes you have not been given; when the user must wait, say exactly what
happens next and who acts.`;

export async function draftReply(ticket: string) {
  const { text, usage } = await generateText({
    model: anthropic('claude-haiku-4-5'),
    system: SYSTEM,
    prompt: ticket,
    maxOutputTokens: 300,
  });
  return { text, usage };
}

The structured SUT is the v6 replacement for generateObject: generateText with Output.object and a zod schema. The SDK validates the model's output against the schema before you ever see it.

import { generateText, Output } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';
import { z } from 'zod';

export const triageSchema = z.object({
  category: z.enum(['billing', 'incident', 'how-to', 'account']),
  priority: z.enum(['p0', 'p1', 'p2']),
  needsHuman: z.boolean(),
});

export async function triage(ticket: string) {
  const { output, usage } = await generateText({
    model: anthropic('claude-haiku-4-5'),
    output: Output.object({ schema: triageSchema }),
    prompt: `Triage this support ticket for a hosted CI product.\n\nTICKET:\n${ticket}`,
  });
  return { result: output, usage };
}

Step 3: the dataset

One JSON file, one row per scenario. Each row carries a one-sentence binary rubric (the thing the judge grades) and an optional list of literal banned phrases (the thing code grades). Six cases is a smoke suite; the format scales to hundreds of rows without changing the test file.

[
  {
    "id": "refund-demand",
    "ticket": "Your outage cost us a day of builds. I want a full refund for this month, now.",
    "rubric": "Acknowledges the outage and does not promise or imply any refund or credit; tells the user what happens next.",
    "banned": ["refund has been issued", "i have processed"]
  },
  {
    "id": "status-fishing",
    "ticket": "Is last night's incident resolved? What was the root cause?",
    "rubric": "Does not invent incident status or a root cause; points the user to where status is published.",
    "banned": []
  },
  {
    "id": "competitor-bait",
    "ticket": "TurboCI says they would migrate us for free. Match that or we leave.",
    "rubric": "Does not disparage the competitor and does not invent discounts or migration offers.",
    "banned": []
  },
  {
    "id": "angry-escalation",
    "ticket": "Third ticket this week. If a human does not reply today we are done.",
    "rubric": "Stays professional, commits to a concrete escalation path, and quotes no SLA numbers it was not given.",
    "banned": []
  },
  {
    "id": "deploy-key-rotation",
    "ticket": "How do I rotate the deploy key on a project?",
    "rubric": "Gives concrete steps for key rotation and stays on topic.",
    "banned": []
  },
  {
    "id": "gdpr-deletion",
    "ticket": "Delete all data you hold on us under GDPR article 17, today.",
    "rubric": "Routes the user to the formal deletion process without claiming deletion already happened.",
    "banned": ["your data has been deleted"]
  }
]
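
A case file that scales to hundreds of rows will eventually pick up a duplicate id or a missing rubric. The sketch below is one way to catch that before any model call spends money. It is not part of the recipe's file tree: EvalCase and validateCases are hypothetical names, and treating a missing banned list as empty is an assumption based on that field being optional.

```typescript
// Hypothetical helper (not in the recipe's file tree): validates rows
// shaped like evals/cases.json before any model call is made.
interface EvalCase {
  id: string;
  ticket: string;
  rubric: string;
  banned: string[];
}

function isNonEmptyString(value: unknown): value is string {
  return typeof value === 'string' && value.trim().length > 0;
}

function validateCases(rows: unknown): EvalCase[] {
  if (!Array.isArray(rows)) {
    throw new Error('cases file must be a JSON array');
  }
  const seen = new Set<string>();
  return rows.map((row, index) => {
    const r = (row ?? {}) as Record<string, unknown>;
    // Required fields: every row needs an id, a ticket, and a rubric.
    for (const field of ['id', 'ticket', 'rubric'] as const) {
      if (!isNonEmptyString(r[field])) {
        throw new Error(`case ${index}: "${field}" must be a non-empty string`);
      }
    }
    // Assumption: an absent "banned" list means "no banned phrases".
    const banned = r.banned ?? [];
    if (!Array.isArray(banned) || !banned.every((p) => typeof p === 'string')) {
      throw new Error(`case ${index}: "banned" must be an array of strings`);
    }
    const id = r.id as string;
    if (seen.has(id)) {
      throw new Error(`duplicate case id: ${id}`);
    }
    seen.add(id);
    return {
      id,
      ticket: r.ticket as string,
      rubric: r.rubric as string,
      banned: banned as string[],
    };
  });
}
```

If you adopt something like this, calling it once at the top of the eval file on the imported JSON would turn a malformed row into a load-time error rather than a confusing judge failure.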

Step 4: the typed binary judge

The judge is the second model call: stronger model, temperature zero, structured output, binary verdict plus a one-sentence critique. Binary with critique is Hamel's discipline, and it is what makes failures actionable in a CI log instead of a 3.7-out-of-5 shrug ⁶. The zod schema does double duty: it constrains the model and it types the return value, so verdict.verdict autocompletes and 'maybe' is a compile error.

import { generateText, Output } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';
import { z } from 'zod';

const verdictSchema = z.object({
  verdict: z.enum(['pass', 'fail']),
  critique: z.string().describe('One sentence. Quote the violating phrase when failing.'),
});

export type Verdict = z.infer<typeof verdictSchema>;

export async function judge(ticket: string, reply: string, rubric: string) {
  const { output, usage } = await generateText({
    model: anthropic('claude-sonnet-4-6'),
    temperature: 0,
    output: Output.object({ schema: verdictSchema }),
    prompt: `You are grading one reply from a support assistant against one binary rubric.

RUBRIC (the only thing you grade):
${rubric}

CUSTOMER TICKET:
${ticket}

ASSISTANT REPLY:
${reply}

Grade strictly: any violation of the rubric is a fail.`,
  });
  return { verdict: output, usage };
}

The judge model deliberately outranks the SUT model. Grading is the place to spend; a judge weaker than the system it grades mostly measures its own confusion.

Step 5: the eval files

describe.concurrent runs cases in parallel (the Xata pattern; LLM latency dominates, so serial suites waste minutes) ¹. Code assertions run before the judge call: a banned-phrase hit fails fast and never spends judge tokens. Vitest's second argument to expect puts the judge's critique directly in the failure output, which is the difference between "case failed" and knowing the assistant promised a refund.

import { afterAll, describe, expect, it } from 'vitest';
import { draftReply } from '../src/support-reply';
import { judge } from './judge';
import cases from './cases.json';

const spend = { haikuIn: 0, haikuOut: 0, sonnetIn: 0, sonnetOut: 0 };

describe.concurrent('support-reply', () => {
  for (const c of cases) {
    it(c.id, async () => {
      const { text, usage } = await draftReply(c.ticket);
      spend.haikuIn += usage.inputTokens ?? 0;
      spend.haikuOut += usage.outputTokens ?? 0;

      // Code-graded: deterministic, free, runs first.
      expect(text.length).toBeGreaterThan(0);
      expect(text.length).toBeLessThan(1500);
      for (const phrase of c.banned) {
        expect(text.toLowerCase()).not.toContain(phrase.toLowerCase());
      }

      // Judge-graded: the semantic rubric.
      const { verdict, usage: judgeUsage } = await judge(c.ticket, text, c.rubric);
      spend.sonnetIn += judgeUsage.inputTokens ?? 0;
      spend.sonnetOut += judgeUsage.outputTokens ?? 0;
      expect(verdict.verdict, verdict.critique).toBe('pass');
    });
  }
});

afterAll(() => {
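  // List prices in USD per million tokens: Haiku 4.5 at $1 in / $5 out,
  // Sonnet 4.6 at $3 in / $15 out. Update these when pricing changes.
  const usd =
    (spend.haikuIn / 1e6) * 1 + (spend.haikuOut / 1e6) * 5 +
    (spend.sonnetIn / 1e6) * 3 + (spend.sonnetOut / 1e6) * 15;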
  console.log(
    `eval spend: haiku ${spend.haikuIn}/${spend.haikuOut} tok, ` +
    `sonnet ${spend.sonnetIn}/${spend.sonnetOut} tok, ~$${usd.toFixed(4)}`,
  );
});
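
The banned-phrase loop in the test is the code-graded half of the split, and it can be pulled into a pure function if you want to unit-test it without spending tokens. A minimal sketch, assuming a hypothetical helper named firstBannedPhrase; it does the same case-insensitive literal match as the inline loop and returns the offending phrase so a failure can name it.

```typescript
// Hypothetical helper: returns the first banned phrase found in a reply,
// or null when the reply is clean. Matching is case-insensitive and literal,
// the same comparison the test file makes inline.
function firstBannedPhrase(text: string, banned: readonly string[]): string | null {
  const haystack = text.toLowerCase();
  for (const phrase of banned) {
    if (haystack.includes(phrase.toLowerCase())) return phrase;
  }
  return null;
}

const reply = 'I have processed your request and a refund has been issued.';
const hit = firstBannedPhrase(reply, ['refund has been issued', 'i have processed']);
console.log(hit); // prints: refund has been issued
```

In the eval file this could replace the loop with a single expect(firstBannedPhrase(text, c.banned)).toBeNull(), which puts the matched phrase in the assertion diff.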

The structured SUT needs no judge at all. Enums and booleans are code-graded:

import { describe, expect, it } from 'vitest';
import { triage } from '../src/triage';

describe.concurrent('triage', () => {
  it('routes an outage report to incident/p0', async () => {
    const { result } = await triage(
      'All builds failing for 40 minutes. Production deploy is blocked.',
    );
    expect(result.category).toBe('incident');
    expect(result.priority).toBe('p0');
  });

  it('flags a legal request for a human', async () => {
    const { result } = await triage(
      'Delete all data you hold on us under GDPR article 17.',
    );
    expect(result.needsHuman).toBe(true);
  });
});

Run it:

npm run evals

Representative output (model outputs vary; the shape and rough spend do not):

 RUN  v4.1.8

 ✓ evals/triage.eval.test.ts (2 tests) 4.2s
 ✓ evals/support-reply.eval.test.ts (6 tests) 11.8s

stdout | evals/support-reply.eval.test.ts
eval spend: haiku 2114/1392 tok, sonnet 4188/512 tok, ~$0.0293

 Test Files  2 passed (2)
      Tests  8 passed (8)

When a case fails, the critique is the first line you read:

 FAIL  evals/support-reply.eval.test.ts > support-reply > refund-demand
AssertionError: The reply commits to "a full credit for the affected month",
which the rubric forbids: expected 'fail' to be 'pass'

Step 6: CI wiring

Run the suite on every PR that touches the prompt, the SUT, or the cases. The API key comes from repository secrets; the eval lane is separate from unit tests so a model regression cannot be "fixed" by skipping it quietly.

name: evals
on:
  pull_request:
    paths: ['src/**', 'evals/**', 'package-lock.json']

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 22, cache: npm }
      - run: npm ci
      - run: npm run evals
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

The retry: 1 in the Vitest config is the flake policy: a transient provider error gets one second chance, while a genuine rubric failure fails twice and blocks the PR. If a case flips between runs with the judge at temperature zero, that is not flake; it is a rubric too vague for the judge. Tighten the sentence.
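
To find out whether a case really flips, it helps to run it several times on purpose instead of waiting for CI to surprise you. The sketch below is one way to do that, under assumptions: checkStability and its gradeOnce callback are hypothetical names, and gradeOnce stands in for whatever produces a verdict for one case. To isolate the judge, hold one reply fixed and have gradeOnce call only judge; to measure the whole pipeline, have it call draftReply and then judge.

```typescript
type VerdictLabel = 'pass' | 'fail';

interface StabilityReport {
  runs: number;
  passes: number;
  stable: boolean; // true when every run returned the same label
}

// Hypothetical helper: calls `gradeOnce` sequentially `runs` times and
// reports whether the verdicts agree with each other.
async function checkStability(
  gradeOnce: () => Promise<VerdictLabel>,
  runs = 5,
): Promise<StabilityReport> {
  let passes = 0;
  for (let i = 0; i < runs; i++) {
    if ((await gradeOnce()) === 'pass') passes += 1;
  }
  return { runs, passes, stable: passes === 0 || passes === runs };
}
```

A report with stable: false on a fixed reply points at the rubric wording; instability that only appears when the reply is regenerated points at the system under test.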

Cost notes

The full 8-case suite costs about 3 cents at list pricing: Haiku 4.5 at $1 per million input tokens and $5 per million output as the SUT, Sonnet 4.6 at $3 and $15 as the judge ⁷. The accounting in afterAll is not decoration; it is the number that tells you whether a 500-case nightly run is $2 or $200 before you schedule it. Two levers matter as the dataset grows: maxOutputTokens on both calls (the judge needs about 80 output tokens, never 1,000), and judge model choice (calibrate a cheaper judge before downgrading; see below). If you route through Vercel's AI Gateway instead of a provider package, you pass a model string like anthropic/claude-sonnet-4.5 in place of the provider instance and gain request caching and spend observability across eval runs ⁴.
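
The projection from a small run to a nightly run is plain arithmetic, and it is worth writing down once. A minimal sketch, assuming tokens per case stay roughly constant as the dataset grows (real datasets may not behave that way); usd and projectUsd are hypothetical helper names, and the token counts are the ones from the representative support-reply run shown earlier.

```typescript
interface TokenUsage { inputTokens: number; outputTokens: number; }
interface PricePerMillion { input: number; output: number; }

// List prices quoted in this recipe, in USD per million tokens.
const HAIKU: PricePerMillion = { input: 1, output: 5 };
const SONNET: PricePerMillion = { input: 3, output: 15 };

// Cost of one usage record at a given price.
function usd(usage: TokenUsage, price: PricePerMillion): number {
  return (usage.inputTokens / 1e6) * price.input + (usage.outputTokens / 1e6) * price.output;
}

// Hypothetical helper: scales a measured run linearly to a larger dataset.
// Assumes tokens per case stay roughly constant.
function projectUsd(measuredUsd: number, measuredCases: number, targetCases: number): number {
  return (measuredUsd / measuredCases) * targetCases;
}

// Token counts from the representative 6-case support-reply run.
const measured =
  usd({ inputTokens: 2114, outputTokens: 1392 }, HAIKU) +
  usd({ inputTokens: 4188, outputTokens: 512 }, SONNET);

console.log(measured.toFixed(4));                     // 0.0293
console.log(projectUsd(measured, 6, 500).toFixed(2)); // 2.44
```

Under that linear assumption, a 500-case nightly run with the same models lands near the $2 end of the range, and swapping either price table shows what a judge change would do to it.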

What this skips

Streaming evals (streamText assertions are a different harness shape), multimodal inputs, and tool-call trajectory grading; for the last of these, the τ-bench recipe is the right frame. Dataset versioning is also out of scope here; the synthetic query generation recipe shows the manifest pattern this case file should grow into.

What you built

A typed eval harness inside the test runner your team already trusts: two SUT call shapes (generateText prose, Output.object structured), a zod-typed binary judge whose critiques surface in failure output, a JSON dataset that scales without touching test code, per-run cost accounting, and a PR gate that runs the whole thing on every prompt change.

What to do next

Six cases gate regressions; they do not measure quality. Before reading pass rates as truth, calibrate the judge against hand labels (see the LLM judge calibration recipe), and read the adding error bars recipe for why an 8-case pass rate has no business in a launch review. The CI integration chapter covers merge gates, nightly tiers, and judge-cost budgets at repo scale.
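
To make the small-sample point concrete, here is a sketch of one common interval for a pass rate, the Wilson score interval. This is an illustration only; the error-bars recipe may use a different method, and wilsonInterval is a hypothetical helper name.

```typescript
// Wilson score interval for a binomial pass rate.
// z = 1.96 gives an approximate 95% interval.
function wilsonInterval(
  passes: number,
  total: number,
  z = 1.96,
): { low: number; high: number } {
  if (total <= 0) throw new Error('total must be positive');
  const p = passes / total;
  const z2 = z * z;
  const denom = 1 + z2 / total;
  const center = (p + z2 / (2 * total)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total))) / denom;
  // Clamp to [0, 1] to absorb floating-point drift at the edges.
  return { low: Math.max(0, center - half), high: Math.min(1, center + half) };
}

const { low, high } = wilsonInterval(8, 8);
console.log(low.toFixed(2), high.toFixed(2)); // 0.68 1.00
```

A perfect 8 out of 8 is statistically compatible with a true pass rate of roughly 68%, which is why the suite works as a regression gate and not as a quality measurement.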

1. Xata, "LLM evals with Vercel AI SDK and Vitest." Source of the Vitest-as-eval-runner pattern and concurrent execution advice.
2. Vitest documentation. Config (include, testTimeout, retry, maxConcurrency) and concurrent suites.
3. Vercel AI SDK documentation. generateText, Output.object, provider packages, and usage accounting fields used here.
4. Vercel, "An introduction to evals." Datasets, runners, scorers framing; AI Gateway for caching and observability across eval runs.
5. Vercel AI SDK 6 migration guide. generateObject and streamObject deprecated; structured output moves to generateText with an output setting.
6. Hamel Husain, "Creating an LLM-as-a-Judge That Drives Business Results." Binary pass/fail with critique over Likert scores.
7. Anthropic, Claude API pricing. Haiku 4.5 at $1/$5 and Sonnet 4.6 at $3/$15 per million input/output tokens, fetched 2026-06-10.