Text-to-SQL eval with execution

Score SQL by executing it against a seeded SQLite warehouse and comparing result sets, with a failure taxonomy for syntax, schema, shape, and value errors.

Text-to-SQL is the rare LLM task with a deterministic check: run the generated query and compare result sets. Teams still grade it by comparing SQL text, and text comparison lies in both directions. Spider's exact-match metric, the careful version that decomposes queries into clauses before comparing, was measured at a 2.5% average false-negative rate and 8.1% in the worst case: correct queries marked wrong because they were written differently ¹ ². Naive string match is strictly worse. This is why BIRD made execution accuracy the headline metric, and its launch numbers frame the task honestly: the best model of the day scored 40.08% against human data engineers at 92.96% ³. The public leaderboard top sits near 82% three years later (fetched as of 2026-06-10) ⁴.

This recipe builds the execution-based scorer on a seeded SQLite warehouse, so the demonstration runs offline and your output matches this page exactly. Budget about 75 minutes: ten to generate the warehouse and run the deterministic demo, twenty to wire your model in and score it, and the rest to read failures and harden the gold set.

What you build

Three modules:

make_db.py writes warehouse.db (a seeded four-table commerce schema), gold.json (12 questions with gold SQL and an order-sensitivity flag), and predictions_demo.json (a frozen fake model run with three planted disagreement cases and four planted failures).
score_sql.py executes gold and predicted SQL, canonicalizes result sets, prints execution accuracy next to string-match accuracy, and emits a failure taxonomy.
run_model.py generates SQL for the same questions with the schema in the system prompt, so the scorer has something real to grade.

Step 1: the seeded warehouse and gold set

"""Build a seeded SQLite warehouse, a 12-question gold set, and a demo
prediction file that shows where string match and execution match disagree.

Run:
    python make_db.py
Writes warehouse.db, gold.json, predictions_demo.json. Stdlib only.
"""
import json
import random
import sqlite3

rng = random.Random(11)

COUNTRIES = ["DE", "US", "IN", "BR", "JP"]
CATALOG = [
    ("Espresso machine", "kitchen", 240.0), ("Chef knife", "kitchen", 85.0),
    ("Cast-iron pan", "kitchen", 45.0), ("Standing desk", "office", 420.0),
    ("Task chair", "office", 260.0), ("Monitor arm", "office", 95.0),
    ("Trail backpack", "outdoor", 130.0), ("Headlamp", "outdoor", 30.0),
    ("Tent 2p", "outdoor", 210.0), ("Yoga mat", "fitness", 35.0),
    ("Kettlebell 16kg", "fitness", 60.0), ("Foam roller", "fitness", 25.0),
]

def iso(year: int, month: int, day: int) -> str:
    return f"{year:04d}-{month:02d}-{day:02d}"

def rand_date(y0: int, y1: int) -> str:
    y = rng.randint(y0, y1)
    m = rng.randint(1, 5 if y == 2026 else 12)  # the data stops in May 2026
    return iso(y, m, rng.randint(1, 28))

# Skewed popularity, so rare products exist (and q08 has a real answer).
POPULARITY = [8, 7, 6, 5, 5, 4, 4, 3, 3, 2, 1, 1]

con = sqlite3.connect("warehouse.db")
cur = con.cursor()
cur.executescript("""
DROP TABLE IF EXISTS order_items; DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS products;    DROP TABLE IF EXISTS customers;
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT,
                        country TEXT, signup_date TEXT);
CREATE TABLE products  (product_id INTEGER PRIMARY KEY, name TEXT,
                        category TEXT, unit_price REAL);
CREATE TABLE orders    (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                        order_date TEXT, status TEXT);
CREATE TABLE order_items (order_id INTEGER, product_id INTEGER,
                          quantity INTEGER, unit_price REAL);
""")

for i in range(1, 41):
    cur.execute("INSERT INTO customers VALUES (?,?,?,?)",
                (i, f"customer_{i:02d}", rng.choice(COUNTRIES), rand_date(2023, 2025)))
for i, (name, cat, price) in enumerate(CATALOG, start=1):
    cur.execute("INSERT INTO products VALUES (?,?,?,?)", (i, name, cat, price))

oid = 0
for _ in range(600):
    oid += 1
    status = rng.choices(["completed", "refunded", "pending"], weights=[80, 12, 8])[0]
    cur.execute("INSERT INTO orders VALUES (?,?,?,?)",
                (oid, rng.randint(1, 40), rand_date(2024, 2026), status))
    for _ in range(rng.randint(1, 4)):
        pid = rng.choices(range(1, 13), weights=POPULARITY)[0]
        cur.execute("INSERT INTO order_items VALUES (?,?,?,?)",
                    (oid, pid, rng.randint(1, 3), CATALOG[pid - 1][2]))
con.commit()

GOLD = [
    {"id": "q01", "question": "How many customers signed up in 2025?",
     "sql": "SELECT COUNT(*) FROM customers WHERE signup_date LIKE '2025%'",
     "ordered": False},
    {"id": "q02", "question": "List the distinct product categories.",
     "sql": "SELECT DISTINCT category FROM products", "ordered": False},
    {"id": "q03", "question": "Total revenue from completed orders.",
     "sql": "SELECT SUM(oi.quantity * oi.unit_price) FROM order_items oi "
            "JOIN orders o ON o.order_id = oi.order_id WHERE o.status = 'completed'",
     "ordered": False},
    {"id": "q04", "question": "Revenue by product category, completed orders only, highest first.",
     "sql": "SELECT p.category, SUM(oi.quantity * oi.unit_price) AS rev "
            "FROM order_items oi JOIN orders o ON o.order_id = oi.order_id "
            "JOIN products p ON p.product_id = oi.product_id "
            "WHERE o.status = 'completed' GROUP BY p.category ORDER BY rev DESC",
     "ordered": True},
    {"id": "q05", "question": "Top 3 customers by lifetime completed-order spend.",
     "sql": "SELECT c.name, SUM(oi.quantity * oi.unit_price) AS spend "
            "FROM customers c JOIN orders o ON o.customer_id = c.customer_id "
            "JOIN order_items oi ON oi.order_id = o.order_id "
            "WHERE o.status = 'completed' GROUP BY c.customer_id "
            "ORDER BY spend DESC, c.customer_id LIMIT 3",
     "ordered": True},
    {"id": "q06", "question": "Average item count per completed order.",
     "sql": "SELECT AVG(n) FROM (SELECT COUNT(*) AS n FROM order_items oi "
            "JOIN orders o ON o.order_id = oi.order_id "
            "WHERE o.status = 'completed' GROUP BY oi.order_id)",
     "ordered": False},
    {"id": "q07", "question": "How many orders were refunded in 2025?",
     "sql": "SELECT COUNT(*) FROM orders WHERE status = 'refunded' "
            "AND order_date LIKE '2025%'", "ordered": False},
    {"id": "q08", "question": "Which products had no refunded sales in 2026?",
     "sql": "SELECT p.name FROM products p WHERE NOT EXISTS (SELECT 1 FROM "
            "order_items oi JOIN orders o ON o.order_id = oi.order_id "
            "WHERE oi.product_id = p.product_id AND o.status = 'refunded' "
            "AND o.order_date LIKE '2026%')", "ordered": False},
    {"id": "q09", "question": "Completed revenue per month of 2025, in month order.",
     "sql": "SELECT substr(o.order_date, 6, 2) AS month, "
            "SUM(oi.quantity * oi.unit_price) AS rev FROM order_items oi "
            "JOIN orders o ON o.order_id = oi.order_id "
            "WHERE o.status = 'completed' AND o.order_date LIKE '2025%' "
            "GROUP BY month ORDER BY month", "ordered": True},
    {"id": "q10", "question": "Number of customers per country.",
     "sql": "SELECT country, COUNT(*) FROM customers GROUP BY country",
     "ordered": False},
    {"id": "q11", "question": "Top 5 products by units sold in completed orders.",
     "sql": "SELECT p.name, SUM(oi.quantity) AS units FROM order_items oi "
            "JOIN orders o ON o.order_id = oi.order_id "
            "JOIN products p ON p.product_id = oi.product_id "
            "WHERE o.status = 'completed' GROUP BY p.product_id "
            "ORDER BY units DESC, p.product_id LIMIT 5", "ordered": True},
    {"id": "q12", "question": "What fraction of all orders were refunded?",
     "sql": "SELECT AVG(status = 'refunded') FROM orders", "ordered": False},
]

with open("gold.json", "w") as f:
    json.dump(GOLD, f, indent=2)

# A frozen "model run": eight semantically correct answers (three written
# differently from the gold), four planted failures. This file exists so the
# scorer's output is reproducible before you spend tokens on a real model.
DEMO = {
    "q01": "SELECT COUNT(*) FROM customers WHERE strftime('%Y', signup_date) = '2025'",
    "q02": "SELECT DISTINCT category FROM products",
    "q03": "SELECT SUM(rev) FROM (SELECT oi.quantity * oi.unit_price AS rev "
           "FROM order_items oi, orders o WHERE o.order_id = oi.order_id "
           "AND o.status = 'completed')",
    "q04": "SELECT p.category, SUM(oi.quantity * oi.unit_price) AS rev "
           "FROM order_items oi JOIN orders o ON o.order_id = oi.order_id "
           "JOIN products p ON p.product_id = oi.product_id "
           "GROUP BY p.category ORDER BY rev DESC",
    "q05": "SELECT c.name, SUM(oi.quantity * oi.unit_price AS spend "
           "FROM customers c JOIN orders o ON o.customer_id = c.customer_id "
           "GROUP BY c.customer_id ORDER BY spend DESC LIMIT 3",
    "q06": "SELECT AVG(n) FROM (SELECT COUNT(*) AS n FROM order_items oi "
           "JOIN orders o ON o.order_id = oi.order_id "
           "WHERE o.status = 'completed' GROUP BY oi.order_id)",
    "q07": "SELECT COUNT(*) FROM refunds WHERE refund_date LIKE '2025%'",
    "q08": "SELECT p.name FROM products p LEFT JOIN (SELECT DISTINCT oi.product_id "
           "FROM order_items oi JOIN orders o ON o.order_id = oi.order_id "
           "WHERE o.status = 'refunded' AND o.order_date LIKE '2026%') s "
           "ON s.product_id = p.product_id WHERE s.product_id IS NULL",
    "q09": "SELECT substr(o.order_date, 6, 2) AS month, "
           "SUM(oi.quantity * oi.unit_price) AS rev FROM order_items oi "
           "JOIN orders o ON o.order_id = oi.order_id "
           "WHERE o.status = 'completed' AND o.order_date LIKE '2025%' "
           "GROUP BY month ORDER BY month",
    "q10": "SELECT country, COUNT(*) FROM customers GROUP BY country",
    "q11": "SELECT p.name, SUM(oi.quantity) AS units FROM order_items oi "
           "JOIN orders o ON o.order_id = oi.order_id "
           "JOIN products p ON p.product_id = oi.product_id "
           "WHERE o.status = 'completed' GROUP BY p.product_id "
           "ORDER BY units DESC, p.product_id",
    "q12": "SELECT AVG(status = 'refunded') FROM orders",  # SQLite treats booleans as 0/1; Postgres needs AVG(CASE WHEN ...)
}
with open("predictions_demo.json", "w") as f:
    json.dump(DEMO, f, indent=2)

counts = {t: cur.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
          for t in ("customers", "products", "orders", "order_items")}
con.close()
print("wrote warehouse.db:", ", ".join(f"{t}={n}" for t, n in counts.items()))
print(f"wrote gold.json ({len(GOLD)} questions), predictions_demo.json")

Expected output, exactly:

wrote warehouse.db: customers=40, products=12, orders=600, order_items=1446
wrote gold.json (12 questions), predictions_demo.json

Two gold-set decisions worth copying. Each question carries an ordered flag: only questions that ask for an order ("highest first", "top 3") compare rows as sequences; everything else compares as multisets, because SQL result order without ORDER BY is an implementation accident, not an answer. And the top-k gold queries pin a tie-breaker (ORDER BY spend DESC, c.customer_id); top-k with ties is otherwise genuinely ambiguous, and an eval that punishes the model for the database's coin flip is measuring noise.
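The tie-breaker rule can be checked mechanically before a question enters the gold set. The following is a minimal sketch, not part of the recipe's three modules: boundary_tie is a hypothetical helper, and the spend table is a toy stand-in for the un-limited ranking behind a top-k gold query. It reports whether the k-th and (k+1)-th scores are equal, which is the case where LIMIT k without a tie-breaker has more than one defensible answer.

```python
import sqlite3


def boundary_tie(rows: list, k: int, score_col: int = 1) -> bool:
    """True if rank k and rank k+1 share a score, so LIMIT k alone is ambiguous.

    `rows` is the full ranking (no LIMIT), already sorted by score descending.
    """
    if len(rows) <= k:
        return False  # nothing is cut off, so no boundary to tie across
    return rows[k - 1][score_col] == rows[k][score_col]


# Toy ranking (illustrative data, not the seeded warehouse).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE spend (customer_id INTEGER, total REAL)")
con.executemany("INSERT INTO spend VALUES (?, ?)",
                [(1, 500.0), (2, 300.0), (3, 300.0), (4, 100.0)])
ranked = con.execute(
    "SELECT customer_id, total FROM spend ORDER BY total DESC, customer_id"
).fetchall()
con.close()

tie_at_2 = boundary_tie(ranked, k=2)  # customers 2 and 3 both spent 300
tie_at_3 = boundary_tie(ranked, k=3)  # 300 vs 100: a clean cut
```

When a check like this fires on a gold query, either pin the tie-breaker in the gold SQL (as q05 and q11 do) or reword the question so the tie cannot matter.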

Step 2: the scorer

"""Execution-match scorer for text-to-SQL, with a string-match column to
show what string match gets wrong, and a failure taxonomy.

Run:
    python score_sql.py predictions_demo.json
Stdlib only. Expects warehouse.db and gold.json next to it.
"""
from __future__ import annotations

import json
import re
import sqlite3
import sys

DB = "warehouse.db"


def normalize_sql(sql: str) -> str:
    """The fair version of string match: case, whitespace, semicolon."""
    return re.sub(r"\s+", " ", sql.strip().rstrip(";").lower())


def canon_row(row: tuple) -> tuple:
    """Round floats so 1810.0000000003 == 1810.0; keep everything else."""
    return tuple(round(v, 4) if isinstance(v, float) else v for v in row)


def execute(sql: str) -> list[tuple]:
    con = sqlite3.connect(f"file:{DB}?mode=ro", uri=True)
    try:
        rows = con.execute(sql).fetchall()
    finally:
        con.close()
    return [canon_row(r) for r in rows]


def result_match(gold_rows: list, pred_rows: list, ordered: bool) -> bool:
    if ordered:
        return gold_rows == pred_rows
    key = lambda r: tuple(str(v) for v in r)  # stringify so None and int sort without a TypeError
    return sorted(gold_rows, key=key) == sorted(pred_rows, key=key)


def classify_failure(err: Exception | None, gold_rows: list, pred_rows: list) -> str:
    if err is not None:
        msg = str(err).lower()
        if "no such table" in msg or "no such column" in msg:
            return "schema_error"
        return "syntax_error"
    if len(gold_rows) != len(pred_rows) or (
        gold_rows and pred_rows and len(gold_rows[0]) != len(pred_rows[0])
    ):
        return "shape_mismatch"
    return "value_mismatch"


def main() -> None:
    gold = json.load(open("gold.json"))
    preds = json.load(open(sys.argv[1] if len(sys.argv) > 1 else "predictions_demo.json"))

    # Sanity gate: every gold query must execute and return rows.
    for g in gold:
        rows = execute(g["sql"])
        assert rows and rows[0][0] is not None, f"gold {g['id']} returns nothing"

    exec_ok = string_ok = 0
    failures: list[tuple[str, str]] = []
    for g in gold:
        pred_sql = preds.get(g["id"], "")
        string_hit = normalize_sql(pred_sql) == normalize_sql(g["sql"])
        gold_rows = execute(g["sql"])
        err, pred_rows = None, []
        try:
            pred_rows = execute(pred_sql)
        except sqlite3.Error as e:
            err = e
        exec_hit = err is None and result_match(gold_rows, pred_rows, g["ordered"])

        exec_ok += exec_hit
        string_ok += string_hit
        flag = ""
        if exec_hit and not string_hit:
            flag = "string match says wrong, execution says right"
        if not exec_hit:
            label = classify_failure(err, gold_rows, pred_rows)
            failures.append((g["id"], label))
            flag = label
        line = (f"{g['id']}  exec={'PASS' if exec_hit else 'FAIL'}  "
                f"string={'PASS' if string_hit else 'FAIL'}  {flag}")
        print(line.rstrip())

    n = len(gold)
    print(f"\nexecution accuracy : {exec_ok}/{n} = {exec_ok / n:.3f}")
    print(f"string-match (norm): {string_ok}/{n} = {string_ok / n:.3f}")
    print("\nfailure taxonomy:")
    for qid, label in failures:
        print(f"  {qid}: {label}")


if __name__ == "__main__":
    main()

Expected output, exactly:

q01  exec=PASS  string=FAIL  string match says wrong, execution says right
q02  exec=PASS  string=PASS
q03  exec=PASS  string=FAIL  string match says wrong, execution says right
q04  exec=FAIL  string=FAIL  value_mismatch
q05  exec=FAIL  string=FAIL  syntax_error
q06  exec=PASS  string=PASS
q07  exec=FAIL  string=FAIL  schema_error
q08  exec=PASS  string=FAIL  string match says wrong, execution says right
q09  exec=PASS  string=PASS
q10  exec=PASS  string=PASS
q11  exec=FAIL  string=FAIL  shape_mismatch
q12  exec=PASS  string=PASS

execution accuracy : 8/12 = 0.667
string-match (norm): 5/12 = 0.417

failure taxonomy:
  q04: value_mismatch
  q05: syntax_error
  q07: schema_error
  q11: shape_mismatch

Step 3: read the disagreement rows

String match calls a 66.7% system a 41.7% system. The three rows where the columns disagree are all real-world shapes: q01 filters the year with strftime instead of LIKE, q03 computes revenue through a subquery instead of a join-then-aggregate, q08 writes the anti-join as LEFT JOIN ... IS NULL instead of NOT EXISTS. Every one is correct SQL. Normalization cannot fix this; the space of equivalent SQL is unbounded, which is the formal reason Spider's clause-decomposition metric still false-negatives 2.5% on average ¹. If your team reports string match, your model is better than your dashboard says, and prompt changes get judged by whether they happen to produce the gold's phrasing.

The taxonomy names which layer broke. syntax_error (q05, an unbalanced paren) points to a decoding or prompt-format problem. schema_error (q07 querying a refunds table that does not exist) means the schema prompt is incomplete or ignored; it is the strongest argument for putting DDL, not prose descriptions, in the system prompt. shape_mismatch (q11 returning 12 rows where 5 were asked for) is usually a dropped LIMIT or a wrong grain. value_mismatch (q04 forgetting the status = 'completed' filter) is the dangerous one: the query runs, returns four plausible category rows, and is wrong. Silent wrong answers reach dashboards; syntax and schema errors fail loudly in front of an engineer, and a wrong shape is usually visible at a glance. Track the value_mismatch share separately; it is the number that should gate a launch.
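Tracking the value_mismatch share takes only a few lines on top of the failures list that score_sql.py already collects. This is a sketch under stated assumptions: taxonomy_shares and launch_gate are hypothetical helpers, and the 5% ceiling is an illustrative threshold, not a recommendation from the sources cited here.

```python
from collections import Counter


def taxonomy_shares(failures: list, n_questions: int) -> dict:
    """Map each failure label to its share of all scored questions."""
    counts = Counter(label for _qid, label in failures)
    return {label: c / n_questions for label, c in counts.items()}


def launch_gate(failures: list, n_questions: int, max_silent: float = 0.05) -> bool:
    """Pass only if the value_mismatch share is at or under max_silent.

    max_silent is an assumed threshold; choose your own from product risk.
    """
    shares = taxonomy_shares(failures, n_questions)
    return shares.get("value_mismatch", 0.0) <= max_silent


# The demo run's failure list, in the (id, label) form score_sql.py builds.
demo_failures = [("q04", "value_mismatch"), ("q05", "syntax_error"),
                 ("q07", "schema_error"), ("q11", "shape_mismatch")]
shares = taxonomy_shares(demo_failures, 12)
gate_ok = launch_gate(demo_failures, 12)  # 1/12 is above 5%, so the gate fails
```

Reporting the share against all questions, rather than against failures only, keeps the number comparable between runs with different overall accuracy.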

Execution match has its own failure direction. A wrong query can return the right rows by coincidence on one database state; small databases make this likely. The fix at benchmark scale is running each query against many database variants and comparing across the suite, which is exactly what distilled test suites do ¹. The recipe-scale fix is cheaper: re-seed the generator (change random.Random(11)), rebuild, and re-score. A prediction that passes on two unrelated database states is almost never a coincidence.
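To make the coincidence concrete, here is a small sketch that uses two hand-built database states instead of re-seeded ones, so the outcome is fixed. The states, the run helper, and passes_on_all are all illustrative assumptions. The predicted query drops the year filter, yet it passes on the first state because every refunded order there happens to fall in 2025.

```python
import sqlite3

GOLD = ("SELECT COUNT(*) FROM orders "
        "WHERE status = 'refunded' AND order_date LIKE '2025%'")
PRED = "SELECT COUNT(*) FROM orders WHERE status = 'refunded'"  # no year filter

# Two tiny hand-built states: (order_id, order_date, status).
STATE_A = [(1, "2025-03-01", "refunded"), (2, "2025-07-09", "completed")]
STATE_B = [(1, "2025-03-01", "refunded"), (2, "2024-11-20", "refunded")]


def run(sql: str, rows: list) -> list:
    """Execute sql against a fresh in-memory orders table holding rows."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
                    "order_date TEXT, status TEXT)")
        con.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
        return con.execute(sql).fetchall()
    finally:
        con.close()


def passes_on_all(gold_sql: str, pred_sql: str, states: list) -> bool:
    """A prediction passes only if it matches gold on every state."""
    return all(run(gold_sql, s) == run(pred_sql, s) for s in states)


pass_one = passes_on_all(GOLD, PRED, [STATE_A])            # coincidental pass
pass_both = passes_on_all(GOLD, PRED, [STATE_A, STATE_B])  # second state exposes it
```

Re-seeding make_db.py plays the same role as STATE_B here, with the difference that a random second state is likely, not guaranteed, to break a wrong query.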

Step 4: point a real model at it

"""Generate SQL for the gold questions with the schema in the system prompt.

Run:
    pip install anthropic
    export ANTHROPIC_API_KEY=...
    python run_model.py
Writes predictions_model.json; score it with score_sql.py.
"""
from __future__ import annotations

import json
import sqlite3

from anthropic import Anthropic

CLIENT = Anthropic()
MODEL = "claude-sonnet-4-5"

SYSTEM = """You translate questions into SQLite SQL for this schema:

{schema}

Rules:
- Return one SQL statement and nothing else. No markdown, no commentary.
- Revenue is order_items.quantity * order_items.unit_price.
- Unless the question says otherwise, exclude refunded and pending orders
  from revenue and sales questions.
- Dates are TEXT in YYYY-MM-DD form."""


def schema_dump() -> str:
    con = sqlite3.connect("warehouse.db")
    rows = con.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    con.close()
    return "\n".join(r[0] + ";" for r in rows)


def generate(question: str, system: str) -> str:
    msg = CLIENT.messages.create(
        model=MODEL,
        max_tokens=400,
        temperature=0,
        system=system,
        messages=[{"role": "user", "content": question}],
    )
    sql = msg.content[0].text.strip()
    if sql.startswith("```"):
        sql = sql.strip("`").removeprefix("sql").strip()
    return sql.rstrip(";")


def main() -> None:
    system = SYSTEM.format(schema=schema_dump())
    gold = json.load(open("gold.json"))
    preds = {}
    for g in gold:
        preds[g["id"]] = generate(g["question"], system)
        print(f"{g['id']}: {preds[g['id']][:70]}")
    with open("predictions_model.json", "w") as f:
        json.dump(preds, f, indent=2)


if __name__ == "__main__":
    main()

Then run python score_sql.py predictions_model.json. The system prompt's business rules ("revenue means", "exclude refunded") are not decoration: they are the resolution of the ambiguity that q04-style questions otherwise hide, and writing them down is half the value of building the eval. The scorer connects read-only (mode=ro), which you should keep even on a toy database, because the model will eventually emit a DROP.
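The read-only guarantee is easy to confirm for yourself. The sketch below assumes a throwaway database in a temporary directory (toy.db is a made-up name, not one of the recipe's files), opens it with the same mode=ro URI form the scorer uses, and attempts a DROP.

```python
import os
import sqlite3
import tempfile

# Build a throwaway database file with one table.
path = os.path.join(tempfile.mkdtemp(), "toy.db")
con = sqlite3.connect(path)
con.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY)")
con.commit()
con.close()

# Reopen it the way score_sql.py does: URI form with mode=ro.
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
try:
    ro.execute("DROP TABLE orders")
    blocked = False
except sqlite3.OperationalError:
    blocked = True  # SQLite refuses writes on a read-only connection
finally:
    ro.close()

# Confirm the table survived.
check = sqlite3.connect(path)
still_there = check.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE name = 'orders'"
).fetchone()[0] == 1
check.close()
```

Because sqlite3.OperationalError is a subclass of sqlite3.Error, the scorer's existing except clause catches a blocked write and records it as a failed prediction instead of crashing.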

Step 5: grow it toward your warehouse

Three substitutions, in order. Replace the schema and seed data with a sanitized copy of yours (SQLite if you can export, a read-only replica user if you cannot; never production credentials). Replace the 12 questions with 50 from real user requests, keeping the ordered flag discipline and the tie-breaker rule. Keep predictions_demo.json alive as the scorer's regression test: it pins the scorer's own behavior, so a scoring-logic change that shifts 8/12 tells you the harness moved, not the model.

When the set crosses about 100 questions, anchor against BIRD's mini-dev split for an external reference point; BIRD adds the things a private gold set undersamples (dirty values, external-knowledge joins, efficiency scoring via R-VES) ⁴. The OpenAI cookbook has a comparable execution-based walkthrough if your stack is OpenAI-native ⁵.

What this does not do

One-shot generation only: no self-correction loops, no agentic schema exploration, and no dialect coverage (SQLite here; BIRD spans dialects, and strftime vs LIKE tricks do not port to Postgres unchanged). No query-cost scoring: at production scale a correct query that scans 33 GB is its own failure mode, which is why BIRD reports an efficiency score (R-VES) next to accuracy ⁴. And no LLM judging anywhere: this task has ground truth, so spending judge tokens on it is a category error. The text-to-SQL task type covers the surrounding territory.

TIP

Log every production NL-to-SQL pair with its execution result, and route the ones users corrected into the gold set with the correction as gold SQL. A text-to-SQL eval that does not grow from production corrections goes stale in one quarter.
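One way to route corrections into the gold set is sketched below. Everything about the log format is an assumption: the record keys question, generated_sql, and corrected_sql are hypothetical, as is the corrections_to_gold helper. The ordered flag is only a first guess from the presence of ORDER BY and should be confirmed by a person before the entry is trusted.

```python
def corrections_to_gold(log_records: list, gold: list) -> list:
    """Return gold plus one new entry per user-corrected production query.

    Assumes each log record is a dict with hypothetical keys
    'question', 'generated_sql', and 'corrected_sql' (None if uncorrected).
    """
    seen = {g["question"] for g in gold}
    out = list(gold)
    for rec in log_records:
        fixed = rec.get("corrected_sql")
        if not fixed or rec["question"] in seen:
            continue  # no correction, or the question is already covered
        out.append({
            # Assumes existing ids are contiguous (q01..qNN).
            "id": f"q{len(out) + 1:02d}",
            "question": rec["question"],
            "sql": fixed,
            # First guess only: a reviewer should confirm whether order matters.
            "ordered": "order by" in fixed.lower(),
        })
        seen.add(rec["question"])
    return out


gold = [{"id": "q01", "question": "How many customers signed up in 2025?",
         "sql": "SELECT COUNT(*) FROM customers WHERE signup_date LIKE '2025%'",
         "ordered": False}]
log = [
    {"question": "Refunded orders per country, most first.",
     "generated_sql": "SELECT country FROM customers",
     "corrected_sql": "SELECT c.country, COUNT(*) AS n FROM orders o "
                      "JOIN customers c ON c.customer_id = o.customer_id "
                      "WHERE o.status = 'refunded' GROUP BY c.country "
                      "ORDER BY n DESC, c.country"},
    {"question": "How many products are there?",
     "generated_sql": "SELECT COUNT(*) FROM products", "corrected_sql": None},
]
grown = corrections_to_gold(log, gold)
```

New entries should still pass the scorer's sanity gate (the gold query executes and returns rows) before they count toward accuracy.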

What to do next

At 50+ questions, put confidence intervals on the accuracy (see adding error bars) before comparing two prompts. If the model's failures cluster on ambiguous questions, that is a product problem, not a model problem; the disambiguation patterns in your first eval apply unchanged.
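As one possible way to compute such an interval, the sketch below uses the Wilson score interval for a binomial proportion. The choice of Wilson over other interval methods is an assumption made here for illustration, and wilson_interval is a hypothetical helper, not part of the scorer.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# The demo run: 8 of 12 questions pass on execution match.
lo, hi = wilson_interval(8, 12)
```

On the demo's 8/12 this gives roughly 0.39 to 0.86, which is why a one-question difference between two prompts means little at 12 questions and why the comparison is worth deferring until the set is larger.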

1. Zhong, Yu, and Klein, "Semantic Evaluation for Text-to-SQL with Distilled Test Suites." EMNLP 2020, arXiv 2010.02840. The 2.5% average / 8.1% worst-case false-negative measurement of Spider's exact-match metric, and the multi-database fix for execution false positives.
2. Yu et al., "Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task." EMNLP 2018. The clause-decomposition exact-match metric and its successor, test-suite accuracy.
3. Li et al., "Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs." NeurIPS 2023, arXiv 2305.03111. 12,751 pairs over 95 databases (33.4 GB); 40.08% best-model vs 92.96% human execution accuracy at launch.
4. BIRD-bench leaderboard and dataset site. Execution accuracy and R-VES definitions; leaderboard figures fetched as of 2026-06-10.
5. OpenAI Cookbook, "How to evaluate LLMs for SQL generation." The closest published analog to this recipe.