Every eval tutorial assumes you have a labeled dataset. You have a prompt, a demo that mostly works, and zero labels. The cold-start question is the same on every team: where do I start when nothing is labeled, and is "build a gold dataset first" actually step one?
It usually is not. The answer is per-task, not philosophical.
Two terms, one decision
A reference-based eval compares a model output to a known-correct reference and scores the match: exact match, execution match, token F1, or a learned similarity metric. The glossary entry and the reference-based metrics page cover the metric zoo. The defining requirement is one gold answer per input.
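As a rough illustration of the two simplest metrics named above, here is a minimal sketch of exact match and token F1. The `normalize` helper and its whitespace-only tokenization are assumptions made for this example; published implementations usually also strip punctuation and articles.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (an assumed, deliberately simple tokenizer)."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> bool:
    """True when prediction and reference are identical after normalization."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against one gold reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Both functions take exactly one reference per input, which is the defining requirement in practice: without the gold string there is nothing to call them with.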
A reference-free eval scores an output with no per-input gold: a code assertion ("output parses as JSON"), an LLM judge applying a written rubric, or a faithfulness check against retrieved context 1. The judge needs a rubric and calibration. It does not need a reference answer.
These are not two philosophies of evaluation. They are answers to a per-task question: what does one gold example cost for this task, and does a deterministic check exist? Where gold is cheap and checking is mechanical, reference-based wins immediately. Where gold is expensive and ill-defined, reference-free is how you start, and (this is the part most teams miss) it is also how you end up with gold.
Some tasks are born reference-based
For a large family of tasks, the gold answer is cheap to produce and the comparison is deterministic. Classification: the label is one click. Extraction: the correct fields are in the source document. Structured output: the schema is the contract, and a validator is the eval 2. Text-to-SQL: you do not compare query strings, you execute both queries and compare result sets, which forgives harmless syntactic differences and catches wrong answers that look plausible as query text, provided the test database holds data that tells the two queries apart 3. Translation is the partial member: reference-based metrics work, but use a learned metric. At the WMT24 metrics shared task, BLEU ranked 23rd of 26 metrics by correlation with human judgment while fine-tuned neural metrics like XCOMET topped the table 4.
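The execution-match idea for text-to-SQL can be sketched with the standard-library `sqlite3` module. The `orders` table, its rows, and the three example queries are invented for illustration, and ignoring row order is an assumption that suits unordered queries only.

```python
import sqlite3
from collections import Counter


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """True when both queries return the same multiset of rows (row order ignored)."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run cannot match
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(predicted_rows) == Counter(gold_rows)


# Hypothetical fixture database for the example.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER, customer TEXT, total REAL);
    INSERT INTO orders VALUES (1, 'ada', 40.0), (2, 'bob', 15.0), (3, 'ada', 5.0);
    """
)

gold = "SELECT customer, SUM(total) FROM orders GROUP BY customer"
same_answer = "select o.customer, sum(o.total) from orders as o group by o.customer"
wrong_answer = "SELECT customer, MAX(total) FROM orders GROUP BY customer"

print(execution_match(conn, same_answer, gold))   # differs only in syntax
print(execution_match(conn, wrong_answer, gold))  # runs fine, wrong numbers
```

A match on one small database does not prove two queries are equivalent, so the fixture data should be chosen to separate likely mistakes. Queries with `ORDER BY` would need an order-sensitive comparison instead of `Counter`.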
If your task is in this family, do not start with a judge. Label thirty examples this afternoon and wire the deterministic check into CI. Thirty examples beats zero, and code-based checks are the cheapest eval you will ever run 5.
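A deterministic check of the kind described here might look like the following sketch. The field names and intent labels are hypothetical stand-ins for whatever your own schema requires; only `json.loads` and plain set operations are used.

```python
import json

# Hypothetical contract for a classification-style structured output.
REQUIRED_FIELDS = {"intent", "confidence"}
ALLOWED_INTENTS = {"refund", "shipping", "other"}


def check_output(raw: str) -> list[str]:
    """Return the failed assertions; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    failures = []
    missing = REQUIRED_FIELDS - set(data)
    if missing:
        failures.append(f"missing fields: {sorted(missing)}")
    if "intent" in data and data["intent"] not in ALLOWED_INTENTS:
        failures.append(f"unknown intent: {data['intent']!r}")
    return failures
```

Returning a list of failures rather than a single boolean keeps the CI log readable: each failing example reports which assertion broke.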
```mermaid
flowchart TD
    A[New AI feature, no labels] --> B{Deterministic check possible?\njson.loads, execution, exact label}
    B -- yes --> C[Label 30 examples\nwire assertions into CI]
    C --> D[Grow toward a versioned\ngolden dataset]
    B -- no --> E[Log traces from real usage]
    E --> F[Error analysis:\nopen-code 100 traces]
    F --> G[Binary rubric judge\nper failure mode]
    G --> H[Hand-label ~100 traces\ncalibrate judge TPR/TNR >= 0.9]
    H --> I[Calibration labels become\ngold dataset v0.1]
    I --> D
```
Open-ended tasks start reference-free, for a boring reason
Chat, summarization, RAG answers, and agent trajectories mostly start reference-free. Not because reference-free is better. Because gold for these tasks is expensive to author and, worse, ill-defined at the start. Shankar et al. documented the mechanism and named it criteria drift: "users need criteria to grade outputs, but grading outputs helps users define criteria" 6. You cannot write the gold answer for "a good summary of this support thread" before you have read bad ones. The rubric is a downstream artifact of looking at failures, not an upstream input.
So the starting stack for open-ended tasks is fixed, and none of it requires a label you do not have yet:
- Log traces. Real inputs, full tool calls, final outputs. No traces, no program.
- Error analysis. Read 100 traces and open-code what went wrong. Three failure modes typically account for 60 to 80 percent of bug volume, so the reading pays for itself fast 7. The 60-80 rule and the open coding workflow pages cover the discipline.
- One binary judge per failure mode. Not a 1-to-5 helpfulness score; a pass/fail question per discovered failure mode, written as a falsifiable rubric. The rubric design and LLM-as-Judge introduction pages cover the details.
- Calibrate the judge. Hand-label roughly 100 traces with the principal domain expert and iterate the judge prompt until true-positive and true-negative rates both clear 0.90 against those labels 8. The calibration page walks the loop.
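The calibration target in the last step can be computed in a few lines. This sketch assumes labels are the strings "pass" and "fail" and treats "pass" as the positive class; if your team treats "fail" as positive, the two rates swap names.

```python
def judge_agreement(human: list[str], judged: list[str]) -> dict[str, float]:
    """TPR and TNR of judge labels against human labels, with "pass" as positive."""
    if len(human) != len(judged):
        raise ValueError("label lists must be the same length")
    pairs = list(zip(human, judged))
    true_pos = sum(h == "pass" and j == "pass" for h, j in pairs)
    true_neg = sum(h == "fail" and j == "fail" for h, j in pairs)
    positives = sum(h == "pass" for h in human)
    negatives = sum(h == "fail" for h in human)
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one human pass and one human fail")
    return {"tpr": true_pos / positives, "tnr": true_neg / negatives}


def is_calibrated(rates: dict[str, float], threshold: float = 0.90) -> bool:
    """Both rates must clear the threshold; a high average is not enough."""
    return rates["tpr"] >= threshold and rates["tnr"] >= threshold
```

Reporting the two rates separately matters because a judge that says "pass" to everything scores a perfect TPR and a TNR of zero.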
Hamel Husain and Shreya Shankar's FAQ says it plainly: you usually lack reference outputs on production data, so you rely on reference-free evaluators like LLM-as-judge 9. The judge is not a shortcut around labeling. It is a multiplier on a small amount of labeling.
The labels you just made are the gold you were missing
Here is the rule this page exists to state:
NOTE
The ~100 examples you hand-label to calibrate your judge are the embryo of your gold dataset. Reference-free evaluation is how you bootstrap gold while shipping. It is not a permanent alternative to gold.
Look at what calibration forced you to produce: real inputs sampled from production, expert-labeled outcomes, a written rubric that defines what pass means, and a measured agreement rate. That is a gold dataset by any working definition. Small, but versionable, and Shankar's flywheel framing applies from day one: the eval set is a living artifact that production examples flow back into on a cadence 10.
The graduation path from there has three lanes, and they run in parallel:
- Calibration labels grow into a regression suite. Every bug you fix contributes its trace and its corrected label. The set runs in CI and must pass at one hundred percent. The regression sets chapter covers what belongs and what does not.
- Synthetic generation expands coverage. Synthesize inputs against a feature-scenario-persona grid, keep human judgment on the outputs. The synthetic query generation recipe is a runnable version for RAG; synthetic vs human covers the rule that prevents poisoning: synthesize the input, hand-label the output.
- Deterministic subsets migrate to reference-based checks. Once error analysis shows that a slice of the task has a checkable answer (the date extraction inside the chat reply, the SQL the agent writes, the JSON the tool call must emit), move that slice to reference-based metrics and spend judge budget only where judgment is required.
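For the synthetic lane, the feature-scenario-persona grid can be enumerated with `itertools.product`. The dimension values and the prompt wording below are hypothetical; the point of the sketch is that every cell gets an input prompt while the gold label stays empty until a human fills it in.

```python
from itertools import product

# Hypothetical dimension values; replace with ones drawn from your own product.
FEATURES = ["order lookup", "refund request"]
SCENARIOS = ["happy path", "missing order id", "out-of-policy request"]
PERSONAS = ["terse expert", "confused first-time user"]


def grid_prompts() -> list[dict]:
    """One input-generation prompt per (feature, scenario, persona) cell."""
    cells = []
    for feature, scenario, persona in product(FEATURES, SCENARIOS, PERSONAS):
        cells.append({
            "feature": feature,
            "scenario": scenario,
            "persona": persona,
            "prompt": (
                f"Write one user message from a {persona} "
                f"using the {feature} feature in the {scenario} scenario."
            ),
            "gold_label": None,  # synthesize the input, hand-label the output
        })
    return cells
```

With two features, three scenarios, and two personas this yields twelve cells, which makes coverage gaps visible as empty rows rather than as surprises in production.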
The end state is the same place the born-reference-based tasks started: a versioned golden dataset with deterministic checks where possible and calibrated judges where not. The maturity sequence differs; the destination does not. On the broader ladder, this is stage two of the maturity model.
Where reference-free betrays you
Reference-free evaluation has failure modes worth naming before you trust it, because every one of them produces a confident number.
Criteria drift. Your own notion of pass shifts as you read more outputs 6. A judge calibrated against January's rubric scores February's traces against a definition of quality the team no longer holds. The defense is recalibration on a cadence and treating the rubric as a versioned artifact, not a constant.
Judge drift. The judge is an LLM call. Provider model revisions and your own prompt edits move its decision boundary, which is why calibration is a loop, not a ceremony 8. A judge nobody has re-checked against human labels in six months is an unmeasured instrument.
Faithfulness is not correctness. RAG faithfulness checks score whether the answer is grounded in the retrieved context 11. An answer can be perfectly faithful to a wrong or stale document. Faithfulness bounds hallucination; it says nothing about whether retrieval surfaced the truth.
Uncalibrated pairwise preference measures style. LLM judges carry position bias, verbosity bias, and self-enhancement bias 12. Length alone is enough to game a preference eval: controlling for it raised AlpacaEval's correlation with Chatbot Arena from 0.94 to 0.98, which is another way of saying the uncontrolled version was partly ranking verbosity 13. A pairwise win rate with no human anchor is a style poll.
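One common mitigation for position bias specifically is to ask the judge in both orders and keep only verdicts that survive the swap. The sketch below assumes a hypothetical judge callable that takes `(prompt, first, second)` and returns the string "first" or "second"; it does nothing about verbosity or self-enhancement bias.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, first, second) -> "first" or "second".
PairwiseJudge = Callable[[str, str, str], str]


def debiased_preference(judge: PairwiseJudge, prompt: str, a: str, b: str) -> str:
    """Ask in both orders; report a winner only when the two verdicts agree."""
    forward = judge(prompt, a, b)   # a is shown first
    backward = judge(prompt, b, a)  # b is shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    return "inconsistent"  # the verdict flipped with position


def always_first(prompt: str, first: str, second: str) -> str:
    """Stub judge with pure position bias, for demonstration only."""
    return "first"


def longer_wins(prompt: str, first: str, second: str) -> str:
    """Stub judge with pure verbosity bias, for demonstration only."""
    return "first" if len(first) >= len(second) else "second"
```

The `longer_wins` stub passes the swap test while still ranking nothing but length, which is why order swapping complements a human anchor and does not replace one.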
All four defenses reduce to the same artifact: a small human-labeled set the automated signal answers to. Which is the punchline again, from the other direction.
Your first week, no gold dataset
TIP
Day 1. Instrument trace logging with a real trace schema. Run the system over 30 to 50 real or synthesized inputs if you have no traffic yet. The 30-minute first eval is the runnable companion.
Day 2. Write the deterministic assertions first: schema validation, required disclaimers, length caps, tool-argument checks 5. If json.loads catches a third of your failures, you did not need a judge for that third.
Day 3. Open-code 100 traces. Stop when 20 consecutive traces produce no new failure category 9.
Day 4. Cluster the notes into failure modes and keep the top three. Write one binary rubric per failure mode.
Day 5. Draft one judge per rubric. Hand-label 100 traces with your domain expert.
Days 6-7. Calibrate until TPR and TNR clear 0.90. Commit the labeled set to git as gold-v0.1.jsonl. You now have a gold dataset. Next week it starts feeding the regression suite.
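The Day 3 stopping rule is simple enough to state as code. This sketch assumes each trace has been tagged with exactly one category string during open coding, with a tag such as "ok" for traces that show no failure; both the function name and that tagging convention are illustrative.

```python
def traces_until_saturation(categories: list[str], patience: int = 20) -> int:
    """Traces read before `patience` consecutive traces add no new category."""
    seen: set[str] = set()
    quiet_streak = 0
    for count, category in enumerate(categories, start=1):
        if category in seen:
            quiet_streak += 1
        else:
            seen.add(category)
            quiet_streak = 0  # a new category resets the streak
        if quiet_streak >= patience:
            return count
    return len(categories)  # never saturated: keep reading
```

If the function returns the full length of the list, the sample has not saturated yet and more traces are worth reading.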
The whole week in code is smaller than the diagrams suggest:
```python
import json


def judge(trace: dict, rubric: str, llm) -> dict:
    """Reference-free binary judge: rubric in, pass/fail plus critique out."""
    # `llm` is any callable that takes a prompt string and returns the model's text.
    raw = llm(
        f"CRITERION: {rubric}\n\nINPUT: {trace['input']}\n"
        f"OUTPUT: {trace['output']}\n\n"
        'Return JSON: {"label": "pass" or "fail", "critique": "one sentence"}',
        temperature=0,
    )
    # Raises json.JSONDecodeError if the model wraps the JSON in extra text.
    return json.loads(raw)


def save_label(trace: dict, human_label: str, rubric_version: str) -> None:
    """Every calibration label is a gold-dataset row. Append-only, in git."""
    row = {**trace, "gold_label": human_label, "rubric_version": rubric_version}
    with open("gold-v0.1.jsonl", "a") as f:
        f.write(json.dumps(row) + "\n")
```
The function names are the argument. The judge is reference-free; the labels you write while distrusting it are the references. The debate dissolves the moment your calibration file gets a version number.