Shipping an AI feature follows the same loop a research scientist runs in a lab. You watch the artifact behave, you write down a guess about why it misbehaves, you change one thing, you measure whether the change helped, and you repeat. The framework is old. What changes for LLM systems is that the artifact is non-deterministic, the success criteria are a moving target, and the experiment cost is measured in API tokens rather than reagents. The discipline still applies; the rigor has to be tighter, because the feedback signal is noisier [1].
The five steps, mapped to a product team
The traditional cycle of observe, hypothesize, experiment, measure, iterate is not a slogan. Each step has a concrete artifact a small team can produce in days, not quarters.
Observe. Read traces. Not dashboards; raw traces. Hamel calls this "removing all friction from looking at the data," and it is the single highest-return habit a team can adopt [1]. The artifact is a tab in a viewer plus a one-line note per trace.
Hypothesize. Cluster the notes into three or four candidate failure modes. The hypothesis takes the form "if we fix mode X, error rate Y drops by Z." The number will be wrong as first stated, because you do not know Z yet. Stating it anyway forces the team to commit to a falsifiable change.
Experiment. Make exactly one change. A prompt edit, a new tool description, a model swap, a chunking parameter. Multi-variable changes corrupt the measurement; if both the prompt and the model move at once, you cannot attribute the result.
Measure. Run the change against the labeled set you built from the open-coding pass. Compare against the baseline with a confidence interval, not a point estimate. On 50 examples, the 95 percent interval around a pass rate near 70 percent extends roughly 12 points in each direction, so a 4-point bump sits inside the noise; the team that ships off that signal is shipping off noise.
Iterate. Update the dataset with the new failure modes the change surfaced. Update the rubric if the criteria themselves shifted [2]. Re-run open coding on a fresh batch. The loop closes on the next observation pass.
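The Observe and Hypothesize steps can be sketched as a simple tally. This is a minimal illustration, not a prescribed tool: the trace IDs, the failure-mode tags, and the function name rank_failure_modes are all hypothetical, and the sketch assumes each trace note has already been reduced to a single tag during open coding.

```python
from collections import Counter

# Hypothetical open-coding notes: one (trace_id, tag) pair per trace read.
# The tag "ok" marks a trace with no observed failure.
notes = [
    ("t01", "wrong_tool"),
    ("t02", "hallucinated_field"),
    ("t03", "wrong_tool"),
    ("t04", "ok"),
    ("t05", "truncated_answer"),
    ("t06", "wrong_tool"),
    ("t07", "ok"),
    ("t08", "hallucinated_field"),
]


def rank_failure_modes(notes, top_k=4):
    """Count failure tags (ignoring passes) and return the top_k by frequency."""
    counts = Counter(tag for _, tag in notes if tag != "ok")
    return counts.most_common(top_k)


ranked = rank_failure_modes(notes)
for tag, count in ranked:
    print(f"{tag}: {count} of {len(notes)} traces")
```

The most frequent tag is the natural candidate for the first hypothesis, since fixing it has the largest possible effect on the error rate.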
The loop in code
A minimal harness that captures the spine of the scientific cycle is shorter than most teams expect.
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Hypothesis:
    name: str
    change: str  # one-line description of the intervention
    expected_failure_mode: str


@dataclass
class Result:
    pass_rate: float
    ci_low: float
    ci_high: float
    n: int


def run_experiment(
    examples: Iterable[dict],
    system_under_test: Callable[[str], str],
    judge: Callable[[dict, str], bool],
) -> Result:
    """Score the system on a fixed dataset. Return a Wilson 95 percent CI."""
    # One judged outcome (True = pass) per example.
    outcomes = [judge(ex, system_under_test(ex["input"])) for ex in examples]
    n = len(outcomes)
    if n == 0:
        raise ValueError("examples must contain at least one item")
    passes = sum(outcomes)
    p = passes / n
    # Wilson score interval; z = 1.96 is the two-sided 95 percent normal quantile.
    z = 1.96
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * ((p * (1 - p) / n + z * z / (4 * n * n)) ** 0.5) / denom
    return Result(p, centre - spread, centre + spread, n)
The dataset is fixed; the system changes between runs; the judge is the same function on both sides of the comparison. That triad is what makes the measurement attributable to the change rather than to drift in the eval itself.
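To make the baseline-versus-candidate comparison concrete, here is a minimal sketch of the 4-point bump on 50 examples. The pass counts (35 and 37 out of 50) are made-up numbers chosen for illustration, and wilson_interval is a hypothetical standalone helper that uses the same Wilson formula. Checking whether two intervals overlap is a rough, conservative heuristic, not a formal significance test.

```python
def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate of passes/n (95 percent by default)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * ((p * (1 - p) / n + z * z / (4 * n * n)) ** 0.5) / denom
    return centre - spread, centre + spread


# Illustrative numbers only: baseline passes 35/50 (70%), candidate 37/50 (74%).
baseline = wilson_interval(35, 50)
candidate = wilson_interval(37, 50)

# Two intervals overlap when each one starts before the other ends.
intervals_overlap = baseline[0] <= candidate[1] and candidate[0] <= baseline[1]

print(f"baseline : {baseline[0]:.3f} to {baseline[1]:.3f}")
print(f"candidate: {candidate[0]:.3f} to {candidate[1]:.3f}")
print("overlap  :", intervals_overlap)
```

Under these assumed counts the baseline interval runs from about 0.56 to 0.81 and the candidate from about 0.60 to 0.84, so the two are nearly indistinguishable. Because both runs score the same examples, a paired comparison of per-example outcomes would be a sharper check than interval overlap.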
Why this is harder for AI than for classical ML
Three properties of LLM systems make the scientific method feel slipperier in practice than the textbook describes. None of them is fatal, and each is the reason a specific page later in this site exists.
The observation step is harder because the artifact is non-deterministic: the same input can produce different outputs across runs, and two reads of the same trace can produce different judgments unless the judgment is binary and codified. The hypothesis step is harder because criteria themselves drift as the team reads more outputs, so the rubric is a downstream artifact, not an upstream input [2]. The measurement step is harder because the baseline is noisy; a single eval run is one realization of a distribution, and a 5-point movement on 100 examples is rarely distinguishable from noise [3].
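The claim that a single eval run is one draw from a distribution can be illustrated with a small simulation. This is a sketch under a strong simplifying assumption: every example passes independently with the same probability (0.70 here, an arbitrary choice), which real eval sets do not satisfy exactly. The function name simulate_pass_counts is hypothetical.

```python
import random
import statistics


def simulate_pass_counts(true_rate, n_examples, n_runs, seed=0):
    """Simulate repeated eval runs; each example passes with probability true_rate."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < true_rate for _ in range(n_examples))
        for _ in range(n_runs)
    ]


n_examples = 100
counts = simulate_pass_counts(true_rate=0.70, n_examples=n_examples, n_runs=1000)
rates = [k / n_examples for k in counts]

# Run-to-run standard deviation of the measured pass rate.
spread = statistics.stdev(rates)

# Share of runs landing 5 or more points away from the true rate of 70/100.
# Compared on integer pass counts to avoid floating-point edge cases.
far_share = sum(abs(k - 70) >= 5 for k in counts) / len(counts)

print(f"std dev of pass rate: {spread:.3f}")
print(f"runs off by >= 5 points: {far_share:.1%}")
```

With nothing about the system changing, a meaningful fraction of simulated runs still land 5 or more points from the true rate, which is why a single 5-point movement on 100 examples is weak evidence.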
The fix for all three is the same: treat the eval set, the rubric, and the judge as versioned artifacts that evolve through the loop, not as fixed inputs you pin at the start.
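One lightweight way to treat these artifacts as versioned is to fingerprint each one and store the fingerprints with every result. The sketch below is an assumption about how a team might do this, not a prescribed scheme: fingerprint, EvalVersions, and comparable are hypothetical names, and the dataset, rubric, and judge prompt are toy placeholders.

```python
import hashlib
import json
from dataclasses import dataclass


def fingerprint(obj) -> str:
    """Short, stable content hash of any JSON-serializable artifact."""
    blob = json.dumps(obj, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


@dataclass(frozen=True)
class EvalVersions:
    dataset: str  # fingerprint of the labeled examples
    rubric: str   # fingerprint of the pass/fail criteria
    judge: str    # fingerprint of the judge prompt or judge code


def comparable(a: EvalVersions, b: EvalVersions) -> bool:
    """Two results are comparable only if all three artifacts match."""
    return a == b


# Toy placeholder artifacts for illustration.
dataset = [{"input": "refund policy?", "label": True}]
rubric_v1 = "Pass if the answer cites the policy document."
rubric_v2 = "Pass if the answer cites the policy document and its section."
judge_prompt = "Does the answer satisfy the rubric? Reply yes or no."

run_a = EvalVersions(fingerprint(dataset), fingerprint(rubric_v1), fingerprint(judge_prompt))
run_b = EvalVersions(fingerprint(dataset), fingerprint(rubric_v1), fingerprint(judge_prompt))
run_c = EvalVersions(fingerprint(dataset), fingerprint(rubric_v2), fingerprint(judge_prompt))

print(comparable(run_a, run_b))  # same artifacts on both sides
print(comparable(run_a, run_c))  # rubric changed between runs
```

The artifacts are free to evolve between passes; the fingerprints only guard against comparing two pass rates that were measured with different rulers.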
Cadence beats heroics
The teams that build durable AI products run this loop on a weekly cadence with twenty to fifty fresh traces per pass. They do not run one heroic eval program a quarter and declare victory. Shankar's flywheel framing is that the loop tightens as it runs: each pass improves the dataset, which improves the judge, which surfaces the next failure mode [4].
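Pulling the weekly batch can be a few lines. The sketch below assumes traces are addressable by string IDs and that the team keeps a set of IDs it has already reviewed; sample_fresh_traces and the ID format are hypothetical, and the batch size of 30 is just one point inside the twenty-to-fifty range.

```python
import random


def sample_fresh_traces(trace_ids, already_reviewed, k=30, seed=None):
    """Pick up to k trace IDs at random from those not yet reviewed."""
    fresh = [t for t in trace_ids if t not in already_reviewed]
    rng = random.Random(seed)
    return rng.sample(fresh, min(k, len(fresh)))


# Hypothetical trace store: 500 IDs, of which the first 100 were read in earlier passes.
trace_ids = [f"trace-{i:04d}" for i in range(500)]
reviewed = set(trace_ids[:100])

batch = sample_fresh_traces(trace_ids, reviewed, k=30, seed=7)
reviewed.update(batch)  # record this week's batch so next week's sample stays fresh

print(len(batch), "traces queued for this week's pass")
```

Random sampling keeps the weekly read from drifting toward the traces that are easiest to find, at the cost of sometimes missing rare failure modes that a targeted filter would catch.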
If you run the loop once a quarter you are doing science once a quarter. If you run it every week you are doing science every week. The technique is not what differentiates the teams; the cadence is.