The argument over synthetic versus human-authored data is mostly miscast. The two are not substitutes; they answer different questions. Human-authored data tells you what your users do and what your team thinks the right answer is. Synthetic data tells you what your system does under inputs you would not otherwise see. A set with only one of them has a blind spot.
The honest accounting
Human-authored examples carry three properties synthetic data cannot fake: they reflect the distribution of inputs your users produce, they carry implicit ground truth from a domain expert who could defend the label, and they anchor the team's mental model of what the product is supposed to do [1]. The cost is real. A principal expert labels ten to thirty examples per hour for a non-trivial product, so a five-hundred-example set takes roughly 17 to 50 hours of expert time, a multi-day commitment.
Synthetic examples flip the accounting. An LLM produces a hundred inputs in two minutes, with systematic coverage of the feature-by-scenario-by-persona grid. What you do not get is ground truth: synthesizing the gold answer with the same family of model you are evaluating is the dominant failure mode in synthetic data.
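The grid coverage described above can be sketched as a Cartesian product over the three axes. This is a minimal illustration, not a prescribed taxonomy: the axis values and the helper names `build_grid` and `cell_to_seed` are hypothetical placeholders.

```python
from itertools import product

# Hypothetical axis values; substitute the dimensions of your own product.
FEATURES = ["refund", "order_status", "account_change"]
SCENARIOS = ["happy_path", "missing_info", "conflicting_info"]
PERSONAS = ["new_user", "power_user", "frustrated_user"]


def build_grid(features, scenarios, personas):
    """Return one cell per (feature, scenario, persona) combination."""
    return [
        {"feature": f, "scenario": s, "persona": p}
        for f, s, p in product(features, scenarios, personas)
    ]


def cell_to_seed(cell):
    """Turn a grid cell into a seed instruction for an input synthesizer."""
    return (
        f"Write a user message about {cell['feature']} "
        f"in a {cell['scenario']} situation, as a {cell['persona']}."
    )


grid = build_grid(FEATURES, SCENARIOS, PERSONAS)
seeds = [cell_to_seed(cell) for cell in grid]
```

Each seed names its cell, so a gap in coverage shows up as a cell with no examples. The seeds produce inputs only; none of them carries a gold answer.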
The rule
Synthesize the input. Hand-label the output. The classic failure is using an LLM to generate both the query and the gold answer, then evaluating a second LLM against the first LLM's idea of correctness. Such an eval measures how similarly the two models think, and that agreement is high even when both are wrong.
# synth_inputs.py
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def paraphrase(seed: str, n: int = 10) -> list[str]:
    """Return the model's paraphrases of `seed`: same intent, varied surface form."""
    prompt = (
        f"Rewrite the message below {n} ways. Vary tone, formality, and "
        f"which details are left out. Keep intent identical. Do not invent "
        f"details the seed does not provide. One per line.\n\nSeed: {seed}"
    )
    out = client.messages.create(
        model="claude-opus-4-7", max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    # One paraphrase per non-empty line of the first text block.
    return [line.strip() for line in out.content[0].text.splitlines() if line.strip()]
The synthetic part buys breadth. The labeling part buys truth.
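One way to keep the two halves separate is to record where each part of an example came from. The sketch below is an assumption about how such a record could look; the `EvalExample` class and its field names are hypothetical, not part of any library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalExample:
    input_text: str
    input_source: str                 # "synthetic" or "production"
    gold: Optional[str] = None        # gold answer, written by a human
    labeled_by: Optional[str] = None  # hypothetical field: id of the labeler


def ready_for_eval(example: EvalExample) -> bool:
    """An example counts only once a named human has supplied the gold answer."""
    return example.gold is not None and bool(example.labeled_by)


def labeling_queue(examples: list[EvalExample]) -> list[EvalExample]:
    """Return the examples still waiting for a human label."""
    return [e for e in examples if not ready_for_eval(e)]
```

Under this scheme the output of `paraphrase` enters the set as `EvalExample(text, "synthetic")` and stays in the labeling queue until a person fills in `gold`.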
When synthetic ground truth is defensible
Programmatic ground truth is the cleanest exception: a calculator, a SQL executor, a code unit test, or a constraint check produces the gold answer mechanically. Anything that compiles, runs, or matches a regex falls in this bucket. Bootstrap labels also work under an explicit "provisional" flag, with a real backfill date that someone is accountable for. Provisional labels left in for two quarters become canonical by accident.
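Both exceptions can be sketched with the standard library. This is an illustration under stated assumptions: the `orders` table, the label dictionary keys (`provisional`, `backfill_by`), and the function names are hypothetical. The first function derives a gold answer by executing a reference query in SQLite; the second lists provisional labels whose backfill date has passed.

```python
import sqlite3
from datetime import date


def sql_gold(setup_sql: str, query: str) -> list[tuple]:
    """Run a reference query on an in-memory database; the rows are the gold answer."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)
        return conn.execute(query).fetchall()
    finally:
        conn.close()


def overdue_provisional(labels: list[dict], today: date) -> list[dict]:
    """Return provisional labels whose backfill date is before `today`."""
    return [
        lab for lab in labels
        if lab["provisional"] and lab["backfill_by"] < today
    ]


# Hypothetical fixture for illustration.
SETUP = """
CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);
INSERT INTO orders (total) VALUES (10.0), (25.5), (4.5);
"""
gold = sql_gold(SETUP, "SELECT COUNT(*), SUM(total) FROM orders")
```

Running `overdue_provisional` on a schedule turns the backfill date into something that fails loudly instead of a note nobody rereads.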
SPADE and the hybrid pattern
The SPADE paper reported that LLM-generated assertions, filtered by human review, decreased false failures by roughly twenty-one percent relative to simpler baselines in tests on nine pipelines, and that the system has been used on more than two thousand deployed pipelines [2]. The takeaway is not "synthetic assertions work"; it is "synthetic-propose, human-dispose works." The LLM produces candidates at scale, the human picks which ones to keep, and picking is cheaper than writing.
CAUTION
The propose-and-dispose loop only works if the human is empowered to reject. If the workflow nudges acceptance (defaults to accept, accepts on enter, hides input details), low-quality candidates slip in and the set quietly degrades.
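A review loop that defaults to rejection might look like the sketch below. The `review` function and its `decide` callback are hypothetical names; the point is only that acceptance requires an explicit, typed-out decision.

```python
def review(candidates, decide):
    """Keep only the candidates the reviewer explicitly accepts.

    `decide` is a callable that receives the full candidate and returns the
    reviewer's raw response. Any response other than the word "accept"
    counts as a rejection, so an empty response (a bare Enter) never accepts.
    """
    kept, rejected = [], []
    for cand in candidates:
        answer = decide(cand).strip().lower()
        if answer == "accept":
            kept.append(cand)
        else:
            rejected.append(cand)
    return kept, rejected
```

In an interactive session, `decide` could be `lambda c: input(f"{c}\naccept/reject> ")`, which shows the whole candidate before asking. Logging the rejected list is also useful: a rejection rate near zero suggests the reviewer is rubber-stamping.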
A target ratio
For a mature dataset the mix that works is roughly half human-sampled from production, a quarter synthetic with hand-labels, and a quarter adversarial cases the team wrote to lock down known bugs and policy boundaries [3]. The exact numbers move with the product, but the principle holds: production anchors the distribution, synthetic fills coverage gaps, and adversarial cases catch regressions [4].
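The ratio translates into per-bucket counts with a few lines of arithmetic. The sketch below assumes the half, quarter, quarter split from the text; the bucket names and the `target_counts` and `shortfall` helpers are hypothetical.

```python
# Shares from the text; bucket names are illustrative.
TARGET_MIX = {"production": 0.50, "synthetic": 0.25, "adversarial": 0.25}


def target_counts(total: int, mix: dict[str, float] = TARGET_MIX) -> dict[str, int]:
    """Split `total` examples across buckets according to `mix`."""
    counts = {name: int(total * share) for name, share in mix.items()}
    # Assign any rounding remainder to the first bucket listed in `mix`.
    first = next(iter(mix))
    counts[first] += total - sum(counts.values())
    return counts


def shortfall(current: dict[str, int], total: int) -> dict[str, int]:
    """Return how many examples each bucket still needs to reach its target."""
    target = target_counts(total)
    return {name: max(0, target[name] - current.get(name, 0)) for name in target}
```

For a five-hundred-example set this gives 250 production, 125 synthetic, and 125 adversarial examples. Treat the shares as a starting point to adjust per product, as the text notes.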
Three diagnostics expose a bad mix. Pass rates that differ by more than ten percentage points between subsets mean the synthetic part is not probing the same surface as the human part. Failure modes that exist only in synthetic examples mean the synthesizer is generating inputs production never sees. Failure modes that exist only in human examples mean synthetic coverage is missing the cells that break. The fix in each case is to pull fresh production traces, label them, and rebalance.
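The three diagnostics can be computed from a flat list of eval results. This is a minimal sketch that assumes each result is a dict with `subset` ("human" or "synthetic"), `passed`, and `failure_mode` keys, and that both subsets are present; those names are hypothetical.

```python
from collections import defaultdict


def pass_rates(results):
    """Return the pass rate per subset."""
    totals, passes = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["subset"]] += 1
        passes[r["subset"]] += int(r["passed"])
    return {subset: passes[subset] / totals[subset] for subset in totals}


def diagnose(results, gap_threshold=0.10):
    """Compare the human and synthetic subsets on pass rate and failure modes."""
    rates = pass_rates(results)
    modes = defaultdict(set)
    for r in results:
        if not r["passed"]:
            modes[r["subset"]].add(r["failure_mode"])
    human, synth = modes["human"], modes["synthetic"]
    gap = abs(rates["human"] - rates["synthetic"])
    return {
        "pass_rate_gap": gap,
        "gap_exceeds_threshold": gap > gap_threshold,
        "synthetic_only_failures": synth - human,
        "human_only_failures": human - synth,
    }
```

A non-empty `synthetic_only_failures` set points at the synthesizer, while a non-empty `human_only_failures` set points at missing grid cells.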