The bootstrap problem for a RAG eval set is uncomfortable: you need labeled queries to evaluate retrieval, you cannot label queries without knowing which chunks are relevant, and producing per-chunk relevance labels by hand at scale is exactly the kind of work that does not get done. Synthetic query generation breaks the loop by inverting the dependency: the chunk is the ground truth, and the LLM generates a query that the chunk would correctly answer. The query is synthetic; the relevance label is correct by construction.
The inversion
Take a chunk. Ask an LLM to write a question that the chunk answers. Record the (query, chunk_id) pair with the chunk_id as the ground-truth relevance label. Repeat across a sampled subset of chunks. The output is a labeled retrieval set whose labels are mechanically correct because the query was generated against the chunk [1]. The same flow can produce generation labels by asking the LLM to write an answer alongside the question, conditioned strictly on the chunk.
The caution for any synthetic ground truth applies: an LLM-written gold answer is a hypothesis about correctness, not the truth, and a human pass over a fraction of the set is non-optional [2].
A runnable generator
# synth_queries.py
import json, random
from pathlib import Path
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = """Write one question the passage below answers. The question
should be specific enough that the passage is the only plausible source.
Phrase it the way a user would, not the way the document phrases itself.
Do not include the answer in the question itself.

Passage (chunk id {chunk_id}):
\"\"\"{chunk_text}\"\"\"

Return JSON: {{"question": "...", "answer": "...", "rationale": "..."}}"""


def make_pair(chunk_id: str, chunk_text: str) -> dict | None:
    """Generate one labeled (query, chunk_id) row, or None on unusable output."""
    out = client.messages.create(
        model="claude-opus-4-7", max_tokens=512,
        messages=[{"role": "user", "content":
                   PROMPT.format(chunk_id=chunk_id, chunk_text=chunk_text)}],
    )
    try:
        payload = json.loads(out.content[0].text)
        question, answer = payload["question"], payload["answer"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Skip malformed or incomplete model output instead of crashing the batch.
        return None
    return {
        "id": f"synth-{chunk_id}",
        "query": question,
        # The source chunk is the relevance label by construction.
        "relevant_chunk_ids": [chunk_id],
        "reference_answer": answer,
        "source": "synth-llm",
        "needs_human_review": True,
    }
Every row carries needs_human_review: True so the set cannot be used in CI until a human signs off on a subset.
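One way to make that flag bite is to have CI read the set through a loader rather than opening the file directly. The sketch below assumes the rows are stored as JSON Lines and that a reviewer flips needs_human_review to false on sign-off; load_ci_rows is a hypothetical helper name, not part of any library.

```python
import json


def load_ci_rows(jsonl_text: str) -> list[dict]:
    """Return only the rows a human has signed off on."""
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    # A missing flag is treated as unreviewed, so new rows never reach CI by default.
    return [r for r in rows if r.get("needs_human_review", True) is False]
```

Treating a missing flag as unreviewed is a deliberate choice here: rows from any other generator are excluded until someone marks them explicitly.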
Sampling
Uniform random sampling over indexed chunks produces a set whose retrieval difficulty matches the corpus distribution, which is rarely what you want. You want oversampling on chunks that matter (recent, high-traffic, policy) and undersampling on near-duplicates that would produce the same question ten times. A reasonable mix: one-third uniform, one-third stratified by document type, one-third sampled from clusters (one question per cluster) [3].
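The three-way mix can be sketched as follows. This is a minimal illustration, assuming each chunk is a dict with hypothetical id, doc_type, and cluster fields (the cluster label would come from whatever near-duplicate clustering you already run); it does not implement the oversampling of high-value chunks, which would need its own weighting.

```python
import random
from collections import defaultdict


def sample_mixed(chunks: list[dict], n: int, seed: int = 0) -> list[str]:
    """Pick about n chunk ids: 1/3 uniform, 1/3 by doc type, 1/3 per cluster."""
    rng = random.Random(seed)
    per_part = n // 3
    chosen: list[str] = []
    seen: set[str] = set()

    def take(candidates: list[dict], k: int) -> None:
        # Add up to k chunk ids that have not been chosen yet, in order.
        added = 0
        for c in candidates:
            if added >= k:
                break
            if c["id"] not in seen:
                seen.add(c["id"])
                chosen.append(c["id"])
                added += 1

    # Part 1: uniform over all chunks.
    take(rng.sample(chunks, len(chunks)), per_part)

    # Part 2: round-robin across document types so rare types are represented.
    by_type = defaultdict(list)
    for c in chunks:
        by_type[c["doc_type"]].append(c)
    for group in by_type.values():
        rng.shuffle(group)
    longest = max((len(g) for g in by_type.values()), default=0)
    interleaved = [by_type[t][i] for i in range(longest)
                   for t in sorted(by_type) if i < len(by_type[t])]
    take(interleaved, per_part)

    # Part 3: at most one representative per cluster, to avoid repeat questions.
    by_cluster = defaultdict(list)
    for c in chunks:
        by_cluster[c["cluster"]].append(c)
    reps = [rng.choice(by_cluster[key]) for key in sorted(by_cluster)]
    rng.shuffle(reps)
    take(reps, per_part)
    return chosen
```

The fixed seed keeps the sample reproducible, so a regenerated set differs only because the corpus changed.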
Building in difficulty
Default synthetic queries are easy: the model paraphrases the chunk, the retriever finds it. Three patterns force probes of the retriever's real capability.
Multi-hop queries combine information from two or more chunks. Take a pair of chunks that share an entity and ask the LLM to write a question that requires both. Ground truth becomes the set of chunk ids.
Paraphrased and adversarial queries strip lexical overlap with the source. The model writes in a different register, with synonyms, or under an unusual framing. A retriever leaning on lexical match will fail these; a dense retriever should not. The gap between easy and adversarial subsets is the diagnostic.
Negative queries are questions the corpus does not answer. Sample a chunk and ask the LLM to write a question about a topic the chunk does not cover, then confirm that no other chunk in the corpus answers it either. Ground truth is the empty set; the right behavior is for the retriever to return nothing relevant, or for generation to refuse. A measured fraction of negatives makes the eval honest about over-retrieval [4].
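To read the gap between subsets, the set has to be scored per subset rather than in aggregate. The sketch below assumes each row carries a hypothetical subset field ("easy", "adversarial", "multi_hop", "negative") and that retrieve(query, k) returns a list of chunk ids, possibly empty when the retriever applies its own score cutoff. Both are assumptions for illustration, not part of the generator above.

```python
from collections import defaultdict
from typing import Callable


def score_by_subset(
    rows: list[dict],
    retrieve: Callable[[str, int], list[str]],
    k: int = 5,
) -> dict[str, float]:
    """Mean retrieval score per subset.

    Positive rows score recall@k over their gold chunk ids, so a multi-hop
    row only reaches 1.0 when every required chunk is retrieved.
    Negative rows (empty gold set) score 1.0 only if nothing is returned.
    """
    scores = defaultdict(list)
    for row in rows:
        got = retrieve(row["query"], k)[:k]
        gold = set(row["relevant_chunk_ids"])
        if gold:
            score = len(gold & set(got)) / len(gold)
        else:
            score = 1.0 if not got else 0.0
        scores[row["subset"]].append(score)
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}
```

Subtracting the adversarial score from the easy score gives the diagnostic gap described above; the negative score is only meaningful if the retriever can actually return fewer than k results.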
RAGAS testset generation
Ragas ships a testset generator that wraps the chunk-to-query flow with the difficulty mix above. It produces single-hop, multi-hop, and reasoning-required queries in measured proportions, and emits a dataset object that plugs into the Ragas scoring pipeline [5]. Useful for a first set in an hour. Not useful as a final calibrated set: read every example and rewrite a quarter of them before promoting.
The honesty check
Synthetic queries sound like the document. A real user asks "do you have anything for fri night," not "what is the available capacity on the date specified as Friday." If your synthetic queries are all written in the document's register, retrieval scores will be optimistic relative to production. Calibrate in two steps: sample a hundred real production queries and label them by hand, then compare retrieval metrics between the synthetic and real subsets. If synthetic numbers are more than five points higher, the synthesizer prompt is too document-aligned.
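The comparison itself is a few lines of arithmetic. The sketch below assumes the metric is a per-query hit (the gold chunk appeared in the top k) and reads "five points" as five percentage points of hit rate; both are interpretations, and the function names are hypothetical.

```python
def hit_rate(hits: list[bool]) -> float:
    """Share of queries whose gold chunk was retrieved, in percent."""
    return 100.0 * sum(hits) / len(hits)


def calibration_gap(synth_hits: list[bool], real_hits: list[bool]) -> float:
    """Synthetic minus real hit rate, in percentage points."""
    return hit_rate(synth_hits) - hit_rate(real_hits)


def too_document_aligned(
    synth_hits: list[bool],
    real_hits: list[bool],
    max_gap_points: float = 5.0,
) -> bool:
    # A positive gap above the threshold suggests the synthetic queries
    # borrow too much of the document's own wording.
    return calibration_gap(synth_hits, real_hits) > max_gap_points
```

With only a hundred real queries the real-side estimate is noisy, so a gap near the threshold is a prompt to label more queries rather than a verdict.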
CAUTION
Never gate releases on a synthetic-only set. Synthetic queries are excellent for bootstrapping and breadth, and they will silently mislead if they become the only thing you measure. A production split has to be in the gate.
Minimum human review on a thousand-query synthetic set is a uniformly sampled hundred reviewed by the principal expert. Expected first-pass rejection rate is twenty to thirty percent. Once the rate stabilizes, the synthetic set joins the eval mix alongside production-sampled and adversarial splits.
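Drawing the review sample and tracking the rejection rate can be sketched as below. The helper names are hypothetical, and verdicts are assumed to be recorded as one boolean per reviewed row (True for accepted).

```python
import random


def draw_review_sample(rows: list[dict], n: int = 100, seed: int = 0) -> list[dict]:
    """Uniformly sample n rows (without replacement) for expert review."""
    return random.Random(seed).sample(rows, min(n, len(rows)))


def rejection_rate(accepted: list[bool]) -> float:
    """Fraction of reviewed rows the expert rejected."""
    return accepted.count(False) / len(accepted)
```

For example, 25 rejections in a sample of 100 gives a rate of 0.25, inside the twenty to thirty percent range expected on a first pass.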