Your first eval in 30 minutes

Pick one feature, label fifty traces, write a binary judge, run it, read the confusion matrix. A complete walkthrough using the Anthropic Python SDK.

The point of this recipe is to leave you with a runnable, calibrated binary judge wired to a fifty-trace labeled set in less time than it takes to argue about which platform to buy. It uses the Anthropic Python SDK and plain Python; you can swap the model client for OpenAI or any provider with a messages API without changing the shape ¹. Every step mirrors a stage-1 eval on Hamel's three-level ladder: code-graded assertions plus an LLM-graded binary ².

The thirty minutes break down as: five minutes to scope the feature, ten minutes to label fifty traces in a CSV, ten minutes to write and run the judge prompt, five minutes to read the confusion matrix and the disagreements.

What you build

A single Python module that:

Reads fifty labeled traces from a CSV.
Runs a binary judge (PASS or FAIL with a one-sentence critique) on each.
Reports true positive rate, true negative rate, and the agreement number.
Writes a JSON file of disagreements you can paste back into the labeling step.

Step 1: scope to one feature, one rubric

Pick one user-visible behavior with a clear correctness question. Not "is the bot helpful?" but "does the bot honor the dietary restriction stated three turns ago?" Hamel calls this the binary criterion rule: the question has to be falsifiable in one sentence, or the rubric will drift across raters ³.

For this walkthrough we use a meal-planning assistant. The rubric is one sentence: Does the response respect every dietary restriction the user stated earlier in the conversation?

Step 2: build a fifty-trace labeled set

If you have production traces, sample fifty across the dimensions you care about (input length, presence of restriction, multi-turn). If you do not, seed them by generating a small set with the model itself, then hand-label.
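One way to sample across dimensions is to bucket the traces and draw from each bucket in turn. The sketch below is stdlib-only and assumes each trace is a dict carrying hypothetical `has_restriction` and `multi_turn` flags; those field names are illustrative and not part of this recipe's CSV. Round-robin drawing gives rare buckets equal weight, which is a design choice, not a requirement.

```python
import random
from collections import defaultdict


def stratified_sample(traces, keys, n, seed=0):
    """Draw up to n traces, round-robin across buckets defined by `keys`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    buckets = defaultdict(list)
    for t in traces:
        buckets[tuple(t[k] for k in keys)].append(t)
    for b in buckets.values():
        rng.shuffle(b)
    picked = []
    # Take one trace from each bucket in turn until we have n or run out.
    while len(picked) < n and any(buckets.values()):
        for b in buckets.values():
            if b and len(picked) < n:
                picked.append(b.pop())
    return picked


# Hypothetical trace pool: 200 traces with two boolean dimensions.
pool = [
    {"id": i, "has_restriction": i % 2 == 0, "multi_turn": i % 5 == 0}
    for i in range(200)
]
sample = stratified_sample(pool, ["has_restriction", "multi_turn"], n=50)
print(len(sample))
```

If you would rather mirror production proportions, a plain `random.sample` over the whole pool is the simpler choice.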

Save them as traces.csv with four columns: id, query, response, label. The label column is PASS or FAIL filled in by you (or one domain expert; ideally not the engineer writing the prompt).

id,query,response,label
1,"I am vegan. Suggest a quick dinner.","Stir-fried tofu with broccoli, garlic, ginger over jasmine rice.",PASS
2,"I keep kosher and avoid dairy. Suggest a quick dinner.","Roast salmon with lemon and a side of buttery mashed potatoes.",FAIL
3,"I have a peanut allergy. Suggest a snack.","Apple slices with almond butter.",PASS
...

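Aim for a class balance close to your real population. If your product fails ten percent of the time, label about ten failures and forty passes. If you over-balance the labeled set toward failures, the agreement number will no longer match what you would see in production. TPR and TNR are per-class rates and do not shift with the balance, but each is only as precise as the number of rows in its class.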
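Before spending API calls, it can help to sanity-check the file. This is a minimal sketch using only the standard library; `check_traces` is a hypothetical helper, and the inline string stands in for traces.csv using the three sample rows shown above.

```python
import csv
import io
from collections import Counter

REQUIRED = ["id", "query", "response", "label"]


def check_traces(fileobj):
    """Validate the labeled CSV and return the PASS/FAIL counts."""
    reader = csv.DictReader(fileobj)
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    counts = Counter()
    for row in reader:
        label = row["label"].strip().upper()
        if label not in ("PASS", "FAIL"):
            raise ValueError(f"row {row['id']}: bad label {row['label']!r}")
        counts[label] += 1
    return counts


# Inline stand-in for traces.csv.
sample_csv = io.StringIO('''id,query,response,label
1,"I am vegan. Suggest a quick dinner.","Stir-fried tofu with broccoli, garlic, ginger over jasmine rice.",PASS
2,"I keep kosher and avoid dairy. Suggest a quick dinner.","Roast salmon with lemon and a side of buttery mashed potatoes.",FAIL
3,"I have a peanut allergy. Suggest a snack.","Apple slices with almond butter.",PASS
''')
counts = check_traces(sample_csv)
total = sum(counts.values())
print(dict(counts), f"FAIL share: {counts['FAIL'] / total:.2f}")
```

Against the real file, the call would be `check_traces(open("traces.csv", newline=""))`; compare the FAIL share to the failure rate you expect in production.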

Step 3: write the judge

The shape is the four-part judge prompt: task framing, rubric, response under review, output format. We pin temperature at zero, request JSON output, and ask for a one-sentence critique alongside the label so we can read disagreements later.

"""Binary judge for the dietary-restriction rubric.

Run:
    pip install anthropic pandas scikit-learn
    export ANTHROPIC_API_KEY=...
    python judge.py
"""
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import Literal

import pandas as pd
from anthropic import Anthropic
from sklearn.metrics import confusion_matrix

CLIENT = Anthropic()  # reads ANTHROPIC_API_KEY
MODEL = "claude-sonnet-4-5"

RUBRIC = (
    "Does the response respect every dietary restriction the user stated "
    "earlier in the conversation? Restrictions include allergies, kosher, "
    "halal, vegetarian, vegan, and keto. The response FAILS if it suggests "
    "any ingredient that violates a stated restriction."
)

JUDGE_PROMPT = """You are grading a meal-planning assistant against one rubric.

RUBRIC:
{rubric}

USER QUERY:
{query}

ASSISTANT RESPONSE:
{response}

Reply with a JSON object: {{"label": "PASS" or "FAIL", "critique": "..."}}.
The critique must be one sentence. If the response violates the rubric, the
critique must name the specific ingredient or behavior that violates it.
"""


@dataclass
class Score:
    label: Literal["PASS", "FAIL"]
    critique: str


def judge(query: str, response: str) -> Score:
    msg = CLIENT.messages.create(
        model=MODEL,
        max_tokens=200,
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": JUDGE_PROMPT.format(
                    rubric=RUBRIC, query=query, response=response
                ),
            }
        ],
    )
    text = msg.content[0].text.strip()
    # Strip leading code fences if the model wraps the JSON.
    if text.startswith("```"):
        text = text.split("```", 2)[1]
        if text.startswith("json"):
            text = text[len("json") :].strip()
    data = json.loads(text)
    label = data["label"].upper()
    if label not in ("PASS", "FAIL"):
        raise ValueError(f"unexpected label: {label}")
    return Score(label=label, critique=data["critique"])


def main() -> None:
    df = pd.read_csv("traces.csv")
    df["pred"] = ""
    df["critique"] = ""

    for i, row in df.iterrows():
        s = judge(row["query"], row["response"])
        df.at[i, "pred"] = s.label
        df.at[i, "critique"] = s.critique
        print(f"[{row['id']}] human={row['label']} judge={s.label}")

    # Confusion matrix and rates.
    y_true = df["label"].tolist()
    y_pred = df["pred"].tolist()
    tn, fp, fn, tp = confusion_matrix(
        y_true, y_pred, labels=["FAIL", "PASS"]
    ).ravel()

    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    agreement = (tp + tn) / len(df)

    print(f"\nTPR (PASS recall): {tpr:.3f}")
    print(f"TNR (FAIL recall): {tnr:.3f}")
    print(f"Agreement       : {agreement:.3f}")
    print(f"Confusion: TP={tp} FP={fp} FN={fn} TN={tn}")

    # Save disagreements for the next iteration of the rubric.
    disagree = df[df["label"] != df["pred"]]
    disagree.to_json("disagreements.json", orient="records", indent=2)
    print(f"\nWrote {len(disagree)} disagreements to disagreements.json")


if __name__ == "__main__":
    main()

If you prefer OpenAI, the swap touches two things, the client and the message shape:

from openai import OpenAI

CLIENT = OpenAI()  # reads OPENAI_API_KEY
MODEL = "gpt-4o-2024-08-06"


def judge(query: str, response: str) -> Score:
    msg = CLIENT.chat.completions.create(
        model=MODEL,
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "user",
                "content": JUDGE_PROMPT.format(
                    rubric=RUBRIC, query=query, response=response
                ),
            }
        ],
    )
    data = json.loads(msg.choices[0].message.content)
    ...
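The judge above is the LLM-graded half of the stage-1 pairing mentioned at the top. The code-graded half can be as small as a deny-list assertion. The sketch below is an illustration only: `BANNED` and `banned_hits` are hypothetical names, and the term lists are deliberately tiny.

```python
import re

# Hypothetical, deliberately tiny deny-lists for illustration only.
BANNED = {
    "vegan": ["chicken", "beef", "salmon", "egg", "butter", "cheese", "honey"],
    "dairy-free": ["butter", "cheese", "cream", "milk", "yogurt"],
    "peanut allergy": ["peanut"],
}


def banned_hits(restriction: str, response: str) -> list[str]:
    """Return deny-listed terms that start a word in the response."""
    text = response.lower()
    return [
        term
        for term in BANNED[restriction]
        if re.search(rf"\b{re.escape(term)}", text)
    ]


print(banned_hits("dairy-free", "Roast salmon with a side of buttery mashed potatoes."))
# ['butter']
print(banned_hits("peanut allergy", "Apple slices with almond butter."))
# []
print(banned_hits("vegan", "Apple slices with almond butter."))
# ['butter']  <- false alarm: almond butter is vegan
```

The last call shows why a keyword check alone is not enough here: it is cheap and deterministic, but it cannot read context, which is the gap the LLM judge fills.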

Step 4: read the confusion matrix, not the agreement

When the script finishes, you will see something like:

TPR (PASS recall): 0.950
TNR (FAIL recall): 0.500
Agreement       : 0.860
Confusion: TP=38 FP=5 FN=2 TN=5

The agreement number is 0.86 and tells you almost nothing. The TPR is 0.95 (the judge catches most real passes) but the TNR is 0.50 (the judge misses half of the real failures: 5 of the 10 human-labeled FAIL rows came back PASS). This is the canonical first-run shape Hamel describes in the field guide ³: a rubric that has not yet been tightened against the failure modes the human labeler actually flagged.

The fix is in disagreements.json. Read it. Find the pattern. Push it into the rubric.
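For reference, the three rates are plain ratios of the four counts on the Confusion line, with PASS as the positive class. The short sketch below recomputes them and compares against a hypothetical judge that answers PASS on every row; `rates` is an illustrative helper, not part of judge.py.

```python
def rates(tp, fp, fn, tn):
    """Per-class recall and raw agreement, with PASS as the positive class."""
    tpr = tp / (tp + fn)  # share of human PASS rows the judge also passed
    tnr = tn / (tn + fp)  # share of human FAIL rows the judge also failed
    agreement = (tp + tn) / (tp + fp + fn + tn)
    return tpr, tnr, agreement


tpr, tnr, agreement = rates(tp=38, fp=5, fn=2, tn=5)
print(f"TPR={tpr:.3f} TNR={tnr:.3f} agreement={agreement:.3f}")
# TPR=0.950 TNR=0.500 agreement=0.860

# A judge that says PASS on all 40 passes and all 10 failures.
lazy_tpr, lazy_tnr, lazy_agreement = rates(tp=40, fp=10, fn=0, tn=0)
print(f"TPR={lazy_tpr:.3f} TNR={lazy_tnr:.3f} agreement={lazy_agreement:.3f}")
# TPR=1.000 TNR=0.000 agreement=0.800
```

The always-PASS judge still agrees with the human 80 percent of the time on a 40/10 split, which is why the per-class rates carry the signal and the agreement number does not.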

Step 5: tighten the rubric and re-run

Open disagreements.json. Cluster the rows by what the judge missed. In the seed run above, the FAIL cases that the judge labeled PASS were almost all responses that named a vegan-friendly main dish but included a non-vegan garnish or sauce. The rubric did not mention garnishes.
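A first pass at clustering can be as simple as splitting the file by the direction of the error and reading the critiques side by side. In this sketch the rows are invented for illustration and are not results from the run described here; in practice you would load them with `json.load(open("disagreements.json"))`. The helper name `split_by_direction` is hypothetical.

```python
from collections import defaultdict

# Stand-in for the contents of disagreements.json; rows are invented examples.
rows = [
    {"id": 7, "label": "FAIL", "pred": "PASS",
     "response": "Lentil curry topped with a spoonful of yogurt.",
     "critique": "The curry is plant-based and respects the vegan restriction."},
    {"id": 19, "label": "FAIL", "pred": "PASS",
     "response": "Grilled vegetable skewers with a honey glaze.",
     "critique": "All listed vegetables are vegan."},
    {"id": 31, "label": "PASS", "pred": "FAIL",
     "response": "Oat-milk porridge with berries.",
     "critique": "Milk violates the dairy-free restriction."},
]


def split_by_direction(rows):
    """Group disagreements into judge-too-lenient and judge-too-strict."""
    groups = defaultdict(list)
    for r in rows:
        # Every row is a disagreement, so the human label sets the direction.
        key = "judge_too_lenient" if r["label"] == "FAIL" else "judge_too_strict"
        groups[key].append(r)
    return groups


groups = split_by_direction(rows)
for direction, items in groups.items():
    print(f"\n{direction}: {len(items)}")
    for r in items:
        print(f"  [{r['id']}] {r['response']}")
        print(f"      judge said: {r['critique']}")
```

The too-lenient group is the one that drags TNR down, so read it first.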

One sentence added to the rubric:

RUBRIC = (
    "Does the response respect every dietary restriction the user stated "
    "earlier in the conversation? Restrictions include allergies, kosher, "
    "halal, vegetarian, vegan, and keto. The response FAILS if any "
    "ingredient (including garnishes, sauces, sides, or seasoning) "
    "violates a stated restriction."
)

Re-run. On the labeled set above, TNR moves from its first-run value to roughly 0.9. Total budget so far: one prompt revision, no new examples.

You will iterate three to five times before TPR and TNR both clear 0.90 on this dev set. Each iteration:

Read the disagreements.
Find the pattern the judge missed.
Add one sentence to the rubric.
Re-run.

Stop when both rates clear 0.90. Then score a held-out test set of fifteen examples that played no part in the iterations, and read the numbers once. If the test numbers are within five points of the dev numbers, you are calibrated. If they drift more than five points, your fifty examples are not representative; collect more ⁴.
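The five-point check can be written down so nobody has to eyeball it. A minimal sketch follows; `calibrated` is a hypothetical helper, and the dev and test numbers are made up to show one passing and one failing case.

```python
def calibrated(dev, test, tolerance=0.05):
    """True when each test rate is within `tolerance` of its dev rate."""
    return all(abs(dev[k] - test[k]) <= tolerance for k in ("tpr", "tnr"))


# Illustrative numbers only, not results from this recipe.
dev = {"tpr": 0.95, "tnr": 0.90}
close_test = {"tpr": 0.93, "tnr": 0.87}
drifted_test = {"tpr": 0.93, "tnr": 0.80}

print(calibrated(dev, close_test))    # True
print(calibrated(dev, drifted_test))  # False
```

Comparing TPR and TNR separately matters: a test set can match on agreement while one of the two rates has moved.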

What changes when you have more data

This recipe is the minimum viable artifact. At fifty examples, your error bars are wide and you cannot reliably distinguish a 0.90 from a 0.88. At 200 examples the Wilson interval is roughly plus or minus four points; that is the threshold most teams ship behind. The statistics chapter covers the math.
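As a quick preview of that math, here is a sketch of the Wilson score interval at a 0.90 observed rate for fifty and 200 examples. The function name `wilson` is illustrative, and z=1.96 assumes a roughly 95 percent interval.

```python
import math


def wilson(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


for successes, n in ((45, 50), (180, 200)):
    lo, hi = wilson(successes, n)
    print(f"n={n}: {lo:.3f} to {hi:.3f} (half-width {(hi - lo) / 2:.3f})")
```

At n=200 the half-width comes out near four points, and at n=50 it is roughly twice that. When you apply this to TPR or TNR, pass the per-class counts (for TNR, the number of human-labeled FAIL rows), which are smaller than the full set.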

The same recipe scales to thousands of examples without changing shape. The hand-labeling step gets compressed into a labeling tool (Argilla, Label Studio, or a Streamlit viewer). The judge call gets parallelized. The disagreement file becomes a queue routed to a domain expert. The rubric becomes a versioned artifact stored next to the prompt.
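Parallelizing the judge call does not need much machinery, since each call is an independent network request. The sketch below uses a thread pool from the standard library. `judge_all` and `fake_judge` are hypothetical names; the stub exists only so the example runs offline, and `max_workers=8` is an arbitrary assumption to tune against your provider's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor


def judge_all(rows, judge_fn, max_workers=8):
    """Run judge_fn over rows concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(
            pool.map(lambda r: judge_fn(r["query"], r["response"]), rows)
        )


# Offline stub; swap in the real judge() from judge.py.
def fake_judge(query, response):
    return "FAIL" if "butter" in response.lower() else "PASS"


rows = [
    {"query": "I am vegan. Suggest a quick dinner.",
     "response": "Stir-fried tofu with broccoli over jasmine rice."},
    {"query": "I avoid dairy. Suggest a quick dinner.",
     "response": "Roast salmon with buttery mashed potatoes."},
]
results = judge_all(rows, fake_judge)
print(results)
# ['PASS', 'FAIL']
```

Threads are enough here because the work is I/O-bound; `Executor.map` returns results in input order, so they line up with the DataFrame rows.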

What this does not do

It does not measure pairwise preference (covered in pairwise vs direct). It does not handle multi-turn state (covered in the multi-turn cookbook). It does not check tool-call correctness (covered in agent eval with τ-bench). It does not run in CI; for the CI wiring see the production CI integration page.

What it does is replace "let me click around the demo" with a number you can defend, in less than an afternoon. Most teams find that the first pass surfaces failure modes nobody had named, which is exactly the value of writing the rubric down ⁵.

TIP

Save your traces.csv, the disagreements files, and the final rubric in version control. The rubric is now a contract; the next person to change the prompt has to re-run this script and beat the test numbers.
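One lightweight way to make that contract checkable, offered as a suggestion and not as part of the recipe, is to stamp each results file with a short hash of the rubric text. `rubric_version` is a hypothetical helper, and the two strings are the closing sentences of the two rubric versions from this page.

```python
import hashlib


def rubric_version(rubric: str) -> str:
    """Short, stable fingerprint of the rubric text."""
    return hashlib.sha256(rubric.encode("utf-8")).hexdigest()[:12]


v1 = rubric_version(
    "The response FAILS if it suggests any ingredient that violates "
    "a stated restriction."
)
v2 = rubric_version(
    "The response FAILS if any ingredient (including garnishes, sauces, "
    "sides, or seasoning) violates a stated restriction."
)
print(v1, v2, v1 != v2)
```

Writing the fingerprint next to the TPR and TNR numbers makes it obvious when two runs were scored against different rubrics.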

What to do next

Once this is calibrated, head to LLM-judge calibration for the rate-driven iteration loop in more detail. Then move to adding error bars once your labeled set crosses 200 examples. If your product is RAG-shaped, jump straight to RAG eval with Ragas instead.

¹ Anthropic, "Define success criteria and build evaluations." Six success criterion types with runnable snippets.
² Hamel Husain, "Your AI Product Needs Evals." Three-level eval ladder.
³ Hamel Husain, "Creating an LLM-as-a-Judge That Drives Business Results." The TPR/TNR iteration loop.
⁴ Shankar et al., "Who Validates the Validators?" Criteria drift in LLM-judge calibration.
⁵ Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," NeurIPS 2023 D&B.