Synthetic query generation for golden sets

Build a RAG eval set from your own corpus: seeded chunk sampling, multi-hop and unanswerable query synthesis, MinHash dedup, and a pinned manifest you can diff.

The first instinct when building a RAG eval set is to ask the model to write queries against your corpus. The mistake is shipping that raw output. Unfiltered synthetic queries cluster around the surface vocabulary of the source documents, skew toward single-chunk lookups your retriever already nails, and miss the two query types that actually break RAG systems: questions that span documents and questions your corpus cannot answer. The fix is a pipeline, not a prompt: chunk, synthesize three query types, dedup, judge-filter, pin. RAGAS automated the first version of this loop ¹; this recipe builds the framework-free equivalent so you own every stage.

What you build

Five small modules that run in sequence and leave auditable artifacts behind:

chunk_corpus.py turns a directory of markdown into chunks.jsonl with content-hashed ids.
generate_queries.py synthesizes factual, multi-hop, and unanswerable queries into raw_queries.jsonl.
dedup_queries.py drops near-duplicates with MinHash LSH.
filter_queries.py runs a binary judge over every survivor and drops the fakes.
build_manifest.py writes the final eval_set.jsonl plus a manifest.json with the content hash, counts, and generation config.

Step 1: chunk the corpus

Chunk at the granularity your retriever indexes, not at whole-document granularity. The chunk id is a content hash, so re-running on an unchanged corpus yields identical ids and your eval set stays diffable.

"""Chunk a markdown corpus into retrieval-sized pieces.

Run:
    python chunk_corpus.py docs/ --max-words 220
"""
from __future__ import annotations

import argparse
import hashlib
import json
import re
from pathlib import Path


def chunk_doc(text: str, max_words: int) -> list[str]:
    chunks: list[str] = []
    for section in re.split(r"\n(?=#{2,3} )", text):
        words: list[str] = []
        for para in section.split("\n\n"):
            w = para.split()
            if words and len(words) + len(w) > max_words:
                chunks.append(" ".join(words))
                words = []
            words.extend(w)
        if len(words) >= 40:  # drop heading-only crumbs
            chunks.append(" ".join(words))
    return chunks


def main() -> None:
    p = argparse.ArgumentParser()
    p.add_argument("corpus_dir")
    p.add_argument("--max-words", type=int, default=220)
    args = p.parse_args()

    rows = []
    for path in sorted(Path(args.corpus_dir).rglob("*.md")):
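        # Read as UTF-8 explicitly so chunk ids do not depend on the platform's default encoding.
        for text in chunk_doc(path.read_text(encoding="utf-8"), args.max_words):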
            cid = hashlib.sha256(text.encode()).hexdigest()[:12]
            rows.append({"chunk_id": cid, "doc": str(path), "text": text})

    with open("chunks.jsonl", "w") as f:
        for r in rows:
            f.write(json.dumps(r) + "\n")
    print(f"{len(rows)} chunks from {len({r['doc'] for r in rows})} docs -> chunks.jsonl")


if __name__ == "__main__":
    main()

137 chunks from 24 docs -> chunks.jsonl
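Because the ids are content hashes, comparing two corpus snapshots reduces to set arithmetic on ids. The following is a minimal sketch, not part of the pipeline: `diff_chunks` is a hypothetical helper, and the two sample texts are made up for illustration. It reuses the same 12-character SHA-256 prefix as chunk_corpus.py.

```python
"""Diff two chunk snapshots by content-hashed id (illustrative helper)."""
from __future__ import annotations

import hashlib


def chunk_id(text: str) -> str:
    # Same scheme as chunk_corpus.py: first 12 hex chars of the SHA-256 digest.
    return hashlib.sha256(text.encode()).hexdigest()[:12]


def diff_chunks(old_texts: list[str], new_texts: list[str]) -> dict[str, set[str]]:
    """Hypothetical helper: which chunk ids appeared, vanished, or survived."""
    old_ids = {chunk_id(t) for t in old_texts}
    new_ids = {chunk_id(t) for t in new_texts}
    return {
        "added": new_ids - old_ids,
        "removed": old_ids - new_ids,
        "unchanged": old_ids & new_ids,
    }


# Made-up example texts: one chunk edited, one untouched.
old = ["Rotate API keys every 90 days.", "Backups run nightly at 02:00 UTC."]
new = ["Rotate API keys every 60 days.", "Backups run nightly at 02:00 UTC."]
d = diff_chunks(old, new)
print({k: len(v) for k, v in d.items()})
```

Any query whose source_chunk_ids land in the "removed" set points at text that no longer exists, which makes it a natural candidate for regeneration.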

Step 2: synthesize three query types

One prompt per query type, because the failure modes differ. Factual queries are the baseline. Multi-hop queries force retrieval to assemble evidence across chunks; HotpotQA established the pattern with 113k questions that require reasoning over multiple supporting documents ². Unanswerable queries test whether your system abstains instead of hallucinating; SQuAD 2.0 added over 50,000 adversarially written unanswerable questions, and a system that scored 86% F1 on the answerable-only SQuAD 1.1 dropped to 66% on SQuAD 2.0 ³. An eval set with no unanswerables cannot see your worst production failure.

"""Synthesize factual, multi-hop, and unanswerable queries from corpus chunks.

Run:
    pip install anthropic
    export ANTHROPIC_API_KEY=...
    python generate_queries.py --chunks 40 --seed 7
"""
from __future__ import annotations

import argparse
import json
import random

from anthropic import Anthropic

CLIENT = Anthropic()
MODEL = "claude-sonnet-4-6"

FACTUAL = """You write search queries for an internal documentation assistant.

PASSAGE:
{chunk}

Write 2 questions a real user would type whose answers are fully contained in
the passage. Vary surface form: do not reuse the passage's distinctive phrasing
and never mention "the passage" or "the document".

Reply with a JSON array of 2 strings."""

MULTI_HOP = """You write search queries for an internal documentation assistant.

PASSAGE A:
{chunk_a}

PASSAGE B:
{chunk_b}

Write 1 question that can only be answered by combining a fact from passage A
with a fact from passage B. A reader holding only one of the two passages must
be unable to answer. Do not staple two independent questions with "and".

Reply with a JSON array of 1 string."""

UNANSWERABLE = """You write search queries for an internal documentation assistant.

PASSAGE (representative of the corpus):
{chunk}

Write 1 question on the same topic, in the same vocabulary, that this passage
plausibly CANNOT answer: ask for a number, date, comparison, or procedure it
does not state. It must read like a real user query, not a trick question.

Reply with a JSON array of 1 string."""


def ask(prompt: str) -> list[str]:
    msg = CLIENT.messages.create(
        model=MODEL,
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}],
    )
    text = msg.content[0].text
    return json.loads(text[text.find("[") : text.rfind("]") + 1])


def main() -> None:
    p = argparse.ArgumentParser()
    p.add_argument("--chunks", type=int, default=40)
    p.add_argument("--seed", type=int, default=7)
    args = p.parse_args()

    chunks = [json.loads(line) for line in open("chunks.jsonl")]
    rng = random.Random(args.seed)
    pool = rng.sample(chunks, min(args.chunks, len(chunks)))
    half = len(pool) // 2

    rows = []
    for c in pool:  # 2 factual per chunk
        for q in ask(FACTUAL.format(chunk=c["text"])):
            rows.append({"query": q, "type": "factual", "source_chunk_ids": [c["chunk_id"]]})
    for _ in range(half):  # 1 multi-hop per sampled pair
        a, b = rng.sample(pool, 2)
        for q in ask(MULTI_HOP.format(chunk_a=a["text"], chunk_b=b["text"])):
            rows.append({"query": q, "type": "multi_hop",
                         "source_chunk_ids": [a["chunk_id"], b["chunk_id"]]})
    for c in pool[:half]:  # 1 unanswerable per chunk, first half of the pool
        for q in ask(UNANSWERABLE.format(chunk=c["text"])):
            rows.append({"query": q, "type": "unanswerable",
                         "source_chunk_ids": [c["chunk_id"]]})

    with open("raw_queries.jsonl", "w") as f:
        for r in rows:
            f.write(json.dumps(r) + "\n")
    counts = {t: sum(1 for r in rows if r["type"] == t) for t in
              ("factual", "multi_hop", "unanswerable")}
    print(f"{len(rows)} raw queries -> raw_queries.jsonl  {counts}")


if __name__ == "__main__":
    main()

Representative run (generation is not seed-stable; the chunk sample is):

120 raw queries -> raw_queries.jsonl  {'factual': 80, 'multi_hop': 20, 'unanswerable': 20}

Leave the generator at default temperature; you want surface diversity here. The judge in step 4 is where determinism matters.
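The counts in the representative run follow directly from the loop structure in generate_queries.py. As a rough sanity check, the arithmetic can be sketched as below; `expected_counts` and `generation_calls` are hypothetical helpers, and both assume the model returns exactly the number of strings each prompt asks for, which it will not always do.

```python
"""Expected query and call counts for a given chunk sample size (sketch)."""
from __future__ import annotations


def expected_counts(n_chunks: int) -> dict[str, int]:
    # Mirrors the three loops: 2 factual per chunk, then one multi-hop and
    # one unanswerable for each of len(pool) // 2 iterations.
    half = n_chunks // 2
    return {"factual": 2 * n_chunks, "multi_hop": half, "unanswerable": half}


def generation_calls(n_chunks: int) -> int:
    # One API call per chunk (factual), per sampled pair, per unanswerable chunk.
    half = n_chunks // 2
    return n_chunks + half + half


counts = expected_counts(40)
print(counts, "total:", sum(counts.values()), "calls:", generation_calls(40))
```

If a real run lands well below these numbers, the usual suspect is a reply that did not parse as a JSON array of the requested length.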

Step 3: dedup with MinHash

Models asked for queries against similar chunks produce near-identical phrasings, and twenty copies of one question silently overweight that topic in every metric you compute later. MinHash approximates Jaccard similarity on word shingles cheaply ⁴; datasketch's LSH index makes the pairwise check linear instead of quadratic ⁵. Keep the first occurrence, drop the rest.

"""Drop near-duplicate queries with MinHash LSH.

Run:
    pip install datasketch
    python dedup_queries.py --threshold 0.7
"""
from __future__ import annotations

import argparse
import json

from datasketch import MinHash, MinHashLSH


def minhash(text: str, num_perm: int = 128) -> MinHash:
    tokens = text.lower().split()
    grams = [" ".join(g) for g in zip(tokens, tokens[1:])] or tokens
    m = MinHash(num_perm=num_perm)
    for g in grams:
        m.update(g.encode())
    return m


def main() -> None:
    p = argparse.ArgumentParser()
    p.add_argument("--threshold", type=float, default=0.7)
    args = p.parse_args()

    rows = [json.loads(line) for line in open("raw_queries.jsonl")]
    lsh = MinHashLSH(threshold=args.threshold, num_perm=128)
    kept = []
    for i, r in enumerate(rows):
        m = minhash(r["query"])
        if lsh.query(m):  # any neighbor above threshold: drop
            continue
        lsh.insert(str(i), m)
        kept.append(r)

    with open("deduped_queries.jsonl", "w") as f:
        for r in kept:
            f.write(json.dumps(r) + "\n")
    print(f"{len(rows)} -> {len(kept)} queries "
          f"({len(rows) - len(kept)} near-duplicates dropped at jaccard >= {args.threshold})")


if __name__ == "__main__":
    main()

120 -> 109 queries (11 near-duplicates dropped at jaccard >= 0.7)
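To build intuition for what a 0.7 threshold means, it helps to compute exact Jaccard on the same word bigrams that the dedup stage hashes. This is a sketch in plain Python with no datasketch dependency; the three example queries are made up, and because MinHash estimates Jaccard and LSH is approximate, real decisions near the threshold can differ from the exact value.

```python
"""Exact Jaccard similarity on word bigrams (illustrative, no MinHash)."""
from __future__ import annotations


def bigrams(text: str) -> set[str]:
    # Same shingling as dedup_queries.py: lowercase, whitespace tokens,
    # adjacent pairs; fall back to single tokens for one-word queries.
    tokens = text.lower().split()
    return {" ".join(g) for g in zip(tokens, tokens[1:])} or set(tokens)


def jaccard(a: str, b: str) -> float:
    sa, sb = bigrams(a), bigrams(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


q1 = "how do i rotate an api key"
q2 = "how do i rotate an api key safely"  # one word appended
q3 = "how can i rotate an api key"        # one word swapped

print(round(jaccard(q1, q2), 3))  # 6 shared bigrams out of 7
print(round(jaccard(q1, q3), 3))  # 4 shared bigrams out of 8
```

The second pair shows how strict bigram shingles are on short queries: a single swapped word drops the similarity to 0.5, so paraphrases will mostly survive this stage and only close copies get dropped.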

Step 4: judge-filter the synthetics

Synthetic generation has a known defect rate: queries that quote the chunk verbatim, multi-hops answerable from one passage, "unanswerables" the passage actually answers. RAGAS filters its generated questions with an LLM critique pass for exactly this reason ¹, and SPADE showed the same move for pipeline assertions: generate candidates, then validate them before trusting them ⁶. The judge is binary with a one-sentence critique, temperature zero.

"""Binary judge filter over synthetic queries. Drops FAILs.

Run:
    python filter_queries.py
"""
from __future__ import annotations

import json

from anthropic import Anthropic

CLIENT = Anthropic()
MODEL = "claude-sonnet-4-6"

JUDGE_PROMPT = """You are auditing one synthetic query for a RAG eval set.

QUERY:
{query}

QUERY TYPE: {qtype}

SOURCE PASSAGE(S):
{sources}

Rubric. All parts must hold for PASS:
1. REALISTIC: a working user might type this into a docs assistant. No
   quiz-style phrasing, no "according to the passage".
2. SELF-CONTAINED: the query makes sense without seeing the passage.
3. GROUNDING, by type:
   - factual or multi_hop: the passage(s) fully answer the query. For
     multi_hop, answering must require BOTH passages, not either one alone.
   - unanswerable: the passage does NOT answer the query, and the query still
     reads like a fair question about this corpus.

Reply with a JSON object: {{"label": "PASS" or "FAIL", "critique": "one sentence"}}.
"""


def judge(query: str, qtype: str, sources: str) -> dict:
    msg = CLIENT.messages.create(
        model=MODEL,
        max_tokens=150,
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            query=query, qtype=qtype, sources=sources)}],
    )
    text = msg.content[0].text
    return json.loads(text[text.find("{") : text.rfind("}") + 1])


def main() -> None:
    chunks = {c["chunk_id"]: c["text"] for c in
              (json.loads(line) for line in open("chunks.jsonl"))}
    rows = [json.loads(line) for line in open("deduped_queries.jsonl")]

    kept, dropped = [], []
    for r in rows:
        sources = "\n\n---\n\n".join(chunks[cid] for cid in r["source_chunk_ids"])
        verdict = judge(r["query"], r["type"], sources)
        if verdict["label"] == "PASS":
            kept.append(r)
        else:
            dropped.append({**r, "critique": verdict["critique"]})

    with open("filtered_queries.jsonl", "w") as f:
        for r in kept:
            f.write(json.dumps(r) + "\n")
    with open("rejects.jsonl", "w") as f:
        for r in dropped:
            f.write(json.dumps(r) + "\n")
    counts = {t: sum(1 for r in kept if r["type"] == t) for t in
              ("factual", "multi_hop", "unanswerable")}
    # Guard the ratio so an empty input file reports 0% instead of raising.
    rate = len(kept) / len(rows) if rows else 0.0
    print(f"kept {len(kept)}/{len(rows)} ({rate:.0%})  {counts}")
    print("rejects with critiques -> rejects.jsonl")


if __name__ == "__main__":
    main()

kept 92/109 (84%)  {'factual': 59, 'multi_hop': 16, 'unanswerable': 17}
rejects with critiques -> rejects.jsonl

Read rejects.jsonl before trusting the judge. If the critiques look wrong, your filter is deleting good queries or passing bad ones, and every downstream number inherits that. Ten minutes of reading rejects is the cheapest validation you will ever do; "Who Validates the Validators" is the longer argument for why this step is not optional ⁷.
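One way to structure that read is to group rejects by query type and look at a few critiques from each group. The sketch below assumes the row shape written by filter_queries.py (`query`, `type`, `critique`); `summarize_rejects` is a hypothetical helper and the sample rows are invented for illustration.

```python
"""Group judge rejects by query type for a quick manual read (sketch)."""
from __future__ import annotations

from collections import Counter, defaultdict


def summarize_rejects(rejects: list[dict], per_type: int = 3):
    """Return reject counts per type and up to per_type (query, critique) samples."""
    counts = Counter(r["type"] for r in rejects)
    samples: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for r in rejects:
        if len(samples[r["type"]]) < per_type:
            samples[r["type"]].append((r["query"], r.get("critique", "")))
    return counts, dict(samples)


# Invented rows in the rejects.jsonl shape; in practice load them with
# [json.loads(line) for line in open("rejects.jsonl")].
rejects = [
    {"query": "According to the passage, what is the TTL?", "type": "factual",
     "critique": "Quiz-style phrasing that references the passage."},
    {"query": "What is the retry limit?", "type": "unanswerable",
     "critique": "The passage states the retry limit, so it is answerable."},
    {"query": "What does it default to?", "type": "factual",
     "critique": "Not self-contained; 'it' has no referent."},
]
counts, samples = summarize_rejects(rejects)
for qtype, n in counts.most_common():
    print(qtype, n)
    for query, critique in samples[qtype]:
        print("   ", query, "|", critique)
```

A type with a much higher reject share than the others usually points at that type's generation prompt rather than at the judge.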

Step 5: pin the manifest

An eval set you cannot identify by hash is an eval set you cannot compare across runs. Sort, hash, and record how the set was made.

"""Write the final eval set and its manifest.

Run:
    python build_manifest.py
"""
from __future__ import annotations

import datetime
import hashlib
import json

rows = sorted(
    (json.loads(line) for line in open("filtered_queries.jsonl")),
    key=lambda r: (r["type"], r["query"]),
)
for i, r in enumerate(rows):
    r["id"] = f"q{i:04d}"

payload = "".join(json.dumps(r, sort_keys=True) + "\n" for r in rows)
with open("eval_set.jsonl", "w") as f:
    f.write(payload)

manifest = {
    "name": "docs-rag-evalset",
    "version": "0.1.0",
    "sha256": hashlib.sha256(payload.encode()).hexdigest(),
    "created": datetime.date.today().isoformat(),
    "counts": {t: sum(1 for r in rows if r["type"] == t) for t in
               ("factual", "multi_hop", "unanswerable")},
    "generator_model": "claude-sonnet-4-6",
    "judge_model": "claude-sonnet-4-6",
    "chunk_sample_seed": 7,
    "dedup_jaccard_threshold": 0.7,
    "corpus_chunks": 137,
    "human_review": {"sampled": 0, "rejected": 0},  # fill in after step 6
}
with open("manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
print(f"eval_set.jsonl ({len(rows)} queries)  sha256 {manifest['sha256'][:12]}...")

eval_set.jsonl (92 queries)  sha256 3f61c00de2aa...

Commit eval_set.jsonl, manifest.json, and rejects.jsonl together. The manifest is what makes "we scored 78% on the eval set" reviewable: which set, built from which corpus state, filtered how. When the corpus changes, regenerate and bump the version; never edit the JSONL by hand without a version bump.
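A pinned hash is only useful if something checks it. The sketch below shows one way a scoring script could refuse to run against a set that no longer matches its manifest; `verify_eval_set` is a hypothetical helper, and it assumes the file layout and the `sha256` and `counts` fields written by build_manifest.py.

```python
"""Check that eval_set.jsonl still matches manifest.json (sketch)."""
from __future__ import annotations

import hashlib
import json
from pathlib import Path


def verify_eval_set(eval_path: str = "eval_set.jsonl",
                    manifest_path: str = "manifest.json") -> bool:
    # Read in text mode so newlines are normalized the same way the
    # payload string was before build_manifest.py hashed it.
    payload = Path(eval_path).read_text(encoding="utf-8")
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))

    digest = hashlib.sha256(payload.encode()).hexdigest()
    n_rows = sum(1 for line in payload.splitlines() if line.strip())

    hash_ok = digest == manifest["sha256"]
    count_ok = n_rows == sum(manifest["counts"].values())
    return hash_ok and count_ok
```

Calling it at the top of an eval run, for example `assert verify_eval_set()`, turns a silent hand edit of the JSONL into a loud failure.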

Step 6: hand-label an anchor sample

Synthetic queries approximate your users; they are not your users. Before the set gates anything, have a domain expert read 30 to 50 queries, mark each keep-or-kill, and write gold answers for the keepers. Record the sample size and rejection count in the manifest. Hamel and Shreya's FAQ takes the same position: synthetic data is a legitimate way to start, and it never replaces looking at real data; swap in production queries as they accumulate ⁸. If the expert kills more than about 1 in 10, your generation prompts need work before you scale the set up.
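Drawing the review sample can be scripted so it is reproducible and covers all three query types in proportion. This is a sketch under assumptions: `draw_review_sample` and `prompts_need_work` are hypothetical helpers, the default of 40 is just a midpoint of the 30 to 50 range, and the 10% cutoff encodes the "about 1 in 10" rule of thumb above rather than a hard standard.

```python
"""Seeded, type-stratified sample for expert review (sketch)."""
from __future__ import annotations

import random
from collections import defaultdict


def draw_review_sample(rows: list[dict], n: int = 40, seed: int = 7) -> list[dict]:
    """Sample about n rows, allocating to each type in proportion to its size."""
    rng = random.Random(seed)
    by_type: dict[str, list[dict]] = defaultdict(list)
    for r in rows:
        by_type[r["type"]].append(r)

    sample: list[dict] = []
    for qtype in sorted(by_type):  # fixed order keeps the draw reproducible
        group = by_type[qtype]
        k = max(1, round(n * len(group) / len(rows)))
        sample.extend(rng.sample(group, min(k, len(group))))
    return sample  # rounding can leave the total one or two off n


def prompts_need_work(sampled: int, rejected: int, max_rate: float = 0.10) -> bool:
    # True when the expert killed more than max_rate of the sampled queries.
    return sampled > 0 and rejected / sampled > max_rate


# Stand-in rows with the type mix from the run above (59 / 16 / 17).
rows = ([{"query": f"f{i}", "type": "factual"} for i in range(59)]
        + [{"query": f"m{i}", "type": "multi_hop"} for i in range(16)]
        + [{"query": f"u{i}", "type": "unanswerable"} for i in range(17)])
sample = draw_review_sample(rows)
print(len(sample), prompts_need_work(sampled=len(sample), rejected=5))
```

The sampled and rejected totals are the two numbers that go into the manifest's human_review field.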

Cost and time

The 120-query run above makes roughly 190 model calls (80 generation, 109 judge), most under 700 input tokens. At Claude Sonnet 4.6 list pricing ($3 per million input tokens, $15 per million output) the whole pipeline lands under a dollar; a 1,000-query set is low single-digit dollars ⁹. Wall clock is dominated by sequential API calls, about 15 minutes; parallelize the judge loop if you care. The expensive step is the human one, and that is the point of doing the cheap steps well.
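The under-a-dollar figure is easy to reproduce as back-of-envelope arithmetic. In the sketch below, the call count and prices come from the text above, 700 input tokens per call is treated as an upper bound, and the 100 output tokens per call is purely an assumption for illustration; `estimate_cost_usd` is a hypothetical helper.

```python
"""Back-of-envelope API cost for the pipeline (sketch with assumed token counts)."""
from __future__ import annotations


def estimate_cost_usd(calls: int, in_tokens_per_call: int, out_tokens_per_call: int,
                      in_price_per_m: float = 3.0, out_price_per_m: float = 15.0) -> float:
    # Prices are dollars per million tokens.
    input_cost = calls * in_tokens_per_call * in_price_per_m / 1_000_000
    output_cost = calls * out_tokens_per_call * out_price_per_m / 1_000_000
    return input_cost + output_cost


# 80 generation calls + 109 judge calls; output size is an assumed average.
cost = estimate_cost_usd(calls=80 + 109, in_tokens_per_call=700, out_tokens_per_call=100)
print(f"${cost:.2f}")
```

Swap in token counts from your own usage logs before quoting a number; longer chunks move the input side and a chattier judge moves the output side.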

What this set does and does not buy you

It buys you a regression instrument: a pinned, typed, deduplicated query set with known provenance that you can run on every retriever or prompt change. It does not buy you a measurement of production quality, because the query distribution is the model's guess at your users, anchored by one expert pass. Treat absolute scores on it with suspicion and deltas with respect; replace synthetic strata with logged production queries as soon as you have them. If you would rather drive this loop from a framework, RAGAS ships a knowledge-graph testset generator with single-hop and multi-hop synthesizers that automates steps 2 through 4; the price is that you no longer own the prompts ¹⁰.

TIP

Keep rejects.jsonl forever. It is a free labeled dataset of "bad synthetic query" examples, and the day you fine-tune or few-shot a cheaper generator, those negatives are exactly what you need.

What you built

A five-stage eval-set factory: content-hashed corpus chunks, three-type query synthesis (factual, multi-hop, unanswerable), MinHash dedup at 0.7 Jaccard, a binary judge filter with audited rejects, and a versioned manifest that makes the set citable in a review. Re-running it on a changed corpus is one command per stage.

What to do next

Run the set through the RAGAS recipe to score retrieval and faithfulness against it, and put error bars on the result with the adding error bars recipe; at n=92 your intervals will be wide, which is worth knowing before you celebrate a 3-point gain. The synthetic query generation chapter covers the methodology tradeoffs, and the dataset versioning chapter covers where the manifest pattern goes from here.
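To see how wide "wide" is at this sample size, a Wilson score interval on a pass rate is enough. This is a minimal sketch and only one of several reasonable interval methods; `wilson_interval` is a hypothetical helper, and the 72-of-92 example is invented to sit near the 78% figure used earlier.

```python
"""Wilson score interval for a pass rate on a small eval set (sketch)."""
from __future__ import annotations

import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval (z=1.96) for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


lo, hi = wilson_interval(successes=72, n=92)  # invented result, about 78%
print(f"{72 / 92:.1%} pass rate, interval {lo:.3f} to {hi:.3f}")
```

At n=92 the interval spans more than 15 points, so a 3-point difference between two runs sits comfortably inside the noise.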

1. Es et al., "RAGAS: Automated Evaluation of Retrieval Augmented Generation." arXiv 2309.15217. Synthetic question generation with LLM critique filtering for RAG eval.
2. Yang et al., "HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering." arXiv 1809.09600. 113k questions requiring multiple supporting documents.
3. Rajpurkar et al., "Know What You Don't Know: Unanswerable Questions for SQuAD." arXiv 1806.03822. 50,000+ adversarial unanswerables; 86% F1 systems drop to 66%.
4. Wikipedia, "MinHash." Jaccard similarity estimation via min-wise hashing.
5. datasketch documentation. MinHash and MinHashLSH APIs used in the dedup stage.
6. Shankar et al., "SPADE: Synthesizing Data Quality Assertions for Large Language Model Pipelines." arXiv 2401.03038. Generate candidate checks, then validate before trusting.
7. Shankar et al., "Who Validates the Validators?" arXiv 2404.12272. Judges need human alignment before their outputs are trusted.
8. Hamel Husain and Shreya Shankar, "LLM Evals FAQ." Position on synthetic data as a starting point and hand-labeling as the anchor.
9. Anthropic, Claude API pricing. Sonnet 4.6 at $3 per million input tokens and $15 per million output tokens, fetched 2026-06-10.
10. RAGAS documentation, "Testset Generation for RAG." Knowledge-graph based generator with single-hop and multi-hop query synthesizers.