The first instinct when building a RAG eval set is to ask the model to write queries against your corpus. The mistake is shipping that set without filtering. Raw synthetic queries cluster around the surface vocabulary of the source documents, skew toward easy retrievals, and miss the hard query types your users actually send. The fix is a three-step pipeline: generate, filter, anchor. This recipe will walk through it for a small documentation corpus, following the same shape used in the RAGAS paper [1] and the SPADE work on quality assertions [2].
Planned outline
- Generate. For each chunk in the corpus, prompt the model for three queries that the chunk could answer. Cap the set at ~3,000 raw queries.
- Diversity filter. Run MinHash dedup on the query text and drop near-duplicates whose estimated Jaccard similarity exceeds 0.7.
- Difficulty stratification. Embed every query and every chunk; for each query, compute the rank of its source chunk in the cosine-similarity ranking. Easy queries rank their source chunk at position 1; hard queries rank it at position 5 or worse. Keep a stratified mix.
- Persona stratification. Use Hamel Husain's feature × scenario × persona framework to bucket queries by persona (new user, power user, edge case) and ensure every bucket is covered [3].
- Hand-label a sample. Sample 50 queries; have a domain expert vet them for realism and write the gold answer for each. Drop any query the expert marks as unrealistic.
- Pin and version. Content-hash the final JSONL, commit it to the repo, and tag the eval-set version alongside the model version it was built for.
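The filter and pin steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the recipe's implementation: it computes exact Jaccard over word trigrams as a stand-in for MinHash (which estimates the same quantity more cheaply at scale), takes precomputed embedding vectors as plain lists of floats, and hashes canonical JSON lines. The function names, the trigram shingle size, and the "medium" label for ranks 2 to 4 are all assumptions made for illustration.

```python
import hashlib
import json
import math
import re


def shingles(text, k=3):
    """Lowercased word k-grams; queries shorter than k words become one shingle."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dedup(queries, threshold=0.7):
    """Greedy filter: drop a query whose Jaccard with any kept query exceeds threshold.

    Exact Jaccard is O(n^2) here; MinHash with LSH is the scalable replacement.
    """
    kept, kept_sets = [], []
    for q in queries:
        s = shingles(q)
        if all(jaccard(s, t) <= threshold for t in kept_sets):
            kept.append(q)
            kept_sets.append(s)
    return kept


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def source_rank(query_vec, chunk_vecs, source_idx):
    """1-based rank of the source chunk when chunks are sorted by cosine similarity."""
    target = cosine(query_vec, chunk_vecs[source_idx])
    return 1 + sum(cosine(query_vec, c) > target for c in chunk_vecs)


def difficulty(rank):
    """Easy at rank 1, hard at rank 5 or worse; 'medium' in between is an assumption."""
    if rank == 1:
        return "easy"
    return "hard" if rank >= 5 else "medium"


def content_hash(records):
    """SHA-256 over canonical JSON lines, so key order does not change the hash."""
    lines = [json.dumps(r, sort_keys=True, ensure_ascii=False) for r in records]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()
```

In use, `dedup` would run over the raw generated queries, `source_rank` and `difficulty` would label each survivor with a stratum to sample from, and `content_hash` would produce the identifier recorded next to the eval-set version tag.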
What this will skip in v1
- Adversarial query generation. A separate exercise, useful for robustness eval.
- Multi-hop queries. These require chain-of-document reasoning and deserve their own recipe.
- Domain-specific seeding when the corpus is non-English.
For the methodology background, see the synthetic vs. human chapter and the dataset construction chapter. The SPADE paper is the right reading on validator quality once your eval set has both synthetic and human entries [2]. The RAGAS paper covers the original automation of this loop in the RAG context [1].