You cannot judge every production trace, and you should not try. Judge cost scales linearly with trace volume; once you cross ten thousand traces a day the bill becomes the most visible line item on the eval program. The question is not whether to sample but which traces to sample, and the default answer (uniform random at 1 to 5 percent) silently fails on the tail slices that usually matter most.
The working strategy is stratified plus failure-oversampled plus reweighted. Stratify by the dimensions you care about (intent, persona, language, latency bucket), oversample the slices where the judge is most informative (refusals, errors, tail intents), and reweight back to population when reporting a global mean. The cost overhead is small; the diagnostic gain is the difference between knowing the system is fine and knowing why a 0.5% segment broke last Tuesday.
A uniform 2% sample of one million traces a day yields twenty thousand judged traces a day. If your enterprise compliance flow is 0.5% of traffic, the daily sample contains roughly one hundred compliance traces, and a weekly metric for that slice rests on seven hundred. Compliance failures actually occur at single-digit percent rates; near the low end of that range, say 2%, the weekly failure count is in the teens, and the 95% Wilson interval on the rate (roughly 1.2% to 3.3% for 14 failures in 700) is too wide for a dashboard alarm to be informative.
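To make the width of that interval concrete, here is a minimal sketch of the Wilson score interval using only the standard library. The scenario numbers (700 traces, 14 failures) are an assumed illustration of the weekly compliance slice, not measured data.

```python
import math


def wilson_interval(failures: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = failures / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Assumed scenario: 700 compliance traces in a week, 14 judged as failures (2%).
low, high = wilson_interval(14, 700)
print(f"{low:.1%} to {high:.1%}")
```

With these inputs the interval spans roughly 1.2% to 3.3%, so a real move from 2% to 3% would sit inside the noise and never trip an alarm.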
Three patterns reliably surface this in production. First, a critical slice that makes up less than 1% of volume but produces most of the customer complaints. Second, a refusal rate that swings in a direction you did not expect because the judge sample was dominated by the top intent. Third, an A/B comparison where the global effect is null but every stratum moved in a meaningful direction, with the movements cancelling in aggregate. Applied LLMs flagged uniform sampling as the most common cause of "the dashboard says fine; the support tickets say not fine" 1.
Stratified sampling
Bucket the trace stream by the dimension you care about and sample uniformly within each bucket. The dimensions worth stratifying on are the ones you would slice the dashboard by: intent, persona, language, model version, retrieval recipe, latency bucket. The granularity is rarely subtle; the right answer is usually four to eight strata per dimension.
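A stratum is just a label computed from trace metadata. The sketch below assumes each trace is a dict with hypothetical `intent`, `language`, and `latency_ms` fields; the field names and the latency cut points are illustrative, not taken from any particular tracing tool.

```python
def latency_bucket(latency_ms: float) -> str:
    """Coarse latency buckets; the cut points are illustrative only."""
    if latency_ms < 1_000:
        return "fast"
    if latency_ms < 5_000:
        return "medium"
    return "slow"


def strata_key(trace: dict) -> str:
    """Composite stratum label built from hypothetical trace fields."""
    intent = trace.get("intent", "unknown")
    language = trace.get("language", "unknown")
    bucket = latency_bucket(trace.get("latency_ms", 0.0))
    return f"{intent}|{language}|{bucket}"


print(strata_key({"intent": "billing", "language": "en", "latency_ms": 6200}))
# prints "billing|en|slow"
```

Crossing dimensions multiplies the number of strata, so if the combined key produces too many near-empty buckets it may be simpler to stratify on one dimension at a time.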
The reporting consequence is that the sample is biased: it no longer mirrors the traffic mix, so an unweighted mean across the stratified sample is not the population mean. Two fixes:
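- For dashboards that need population means, reweight: multiply each stratum's metric by its population share and sum. Because the allocation deliberately departs from the traffic mix, this post-stratified estimate of the global mean can have somewhat higher variance than a uniform sample of the same size, but it corrects the bias.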
- For dashboards that report per-stratum metrics, ignore the global mean and report each stratum with its own interval. This is often what the user actually wants.
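A small worked example shows how far the unweighted mean can drift. The numbers are made up for illustration: two strata sampled at equal size even though one carries 95% of traffic.

```python
# Hypothetical numbers: two strata sampled at equal size despite unequal traffic.
population_share = {"general": 0.95, "compliance": 0.05}
sampled_scores = {
    "general": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],     # 90% pass in the sample
    "compliance": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],  # 50% pass in the sample
}

# Unweighted mean over the pooled sample: treats both strata as equally common.
all_scores = [s for scores in sampled_scores.values() for s in scores]
naive_mean = sum(all_scores) / len(all_scores)

# Post-stratified mean: each stratum mean weighted by its population share.
stratum_mean = {k: sum(v) / len(v) for k, v in sampled_scores.items()}
post_stratified = sum(population_share[k] * stratum_mean[k] for k in stratum_mean)

print(f"naive={naive_mean:.3f} post_stratified={post_stratified:.3f}")
# prints "naive=0.700 post_stratified=0.880"
```

The pooled sample reports a 70% pass rate while the traffic-weighted figure is 88%; the gap is entirely an artifact of how the sample was allocated.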
Phoenix and Langfuse both expose enough trace metadata to make stratification a few-line filter, and both ship dataset and sampling primitives for the common cases 2 3. A typical setup pulls one sample per stratum per hour into a queue and lets the judge consume them at whatever rate the budget allows.
Failure oversampling
The most informative traces for the next eval iteration are the ones where the judge or the user expressed something negative: a refusal, an error, a low judge score, negative user feedback, or an escalation to a human. Sample these at a higher rate than positives because the per-trace information value is higher. Record the rate you used, because any failure rate reported from the oversampled set has to be weighted back by it.
A useful rule from the field guide: when error analysis flags a cluster, push the sample rate for that cluster to 100% until you have collected enough traces to build assertions or refresh the dataset 4. Once the cluster is understood, drop the rate back. The shape is a peaked sampler that follows the work of the eval program.
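One way to sketch that peaked sampler is a per-cluster override that samples everything until a quota is met and then falls back to the base rate. The `ClusterRates` class, its quota mechanism, and the cluster name are hypothetical; the source only says to drop the rate once the cluster is understood, which in practice may be a manual decision rather than a counter.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ClusterRates:
    """Per-cluster sampling with a temporary 100% override (hypothetical helper)."""
    base_rate: float = 0.02
    # cluster name -> number of traces still to collect at full rate
    remaining_quota: dict[str, int] = field(default_factory=dict)

    def flag(self, cluster: str, quota: int) -> None:
        """Collect every trace from this cluster until `quota` traces are taken."""
        self.remaining_quota[cluster] = quota

    def should_sample(self, cluster: str, rng: random.Random) -> bool:
        left = self.remaining_quota.get(cluster, 0)
        if left > 0:
            self.remaining_quota[cluster] = left - 1
            return True
        # Quota exhausted or cluster never flagged: fall back to the base rate.
        return rng.random() < self.base_rate


# base_rate=0.0 makes the fallback deterministic for this demonstration.
rates = ClusterRates(base_rate=0.0)
rates.flag("refund_loop", quota=3)
rng = random.Random(0)
decisions = [rates.should_sample("refund_loop", rng) for _ in range(5)]
print(decisions)  # prints [True, True, True, False, False]
```

The quota plays the role of "enough traces to build assertions"; choose it from the size of dataset you need, not from the sampling budget.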
A working policy
"""
Stratified plus failure-oversampled sampling, with reweighting on the
reporting side. No external deps.
"""
from collections import defaultdict
from dataclasses import dataclass
import random
from typing import Callable, Iterable


@dataclass
class SamplePolicy:
    strata_fn: Callable[[dict], str]
    target_per_stratum: dict[str, int]
    failure_signal: Callable[[dict], bool]
    # Desired ratio of failures to positives inside each stratum's sample.
    failure_multiplier: float = 5.0


def sample_window(
    traces: Iterable[dict],
    policy: SamplePolicy,
    rng: random.Random,
) -> list[dict]:
    # Group the window's traces by stratum.
    bucketed: dict[str, list[dict]] = defaultdict(list)
    for t in traces:
        bucketed[policy.strata_fn(t)].append(t)

    out: list[dict] = []
    for stratum, items in bucketed.items():
        target = policy.target_per_stratum.get(stratum, 0)
        if not target or not items:
            continue
        positives = [t for t in items if not policy.failure_signal(t)]
        failures = [t for t in items if policy.failure_signal(t)]
        # Oversample failures: aim for failure_multiplier failures per positive,
        # capped by the number of failures the window actually contains.
        fail_target = min(
            len(failures),
            int(target * policy.failure_multiplier / (1 + policy.failure_multiplier)),
        )
        pos_target = min(len(positives), target - fail_target)
        out.extend(rng.sample(failures, fail_target))
        out.extend(rng.sample(positives, pos_target))
    return out


def reweighted_mean(
    sampled: list[dict],
    metric: Callable[[dict], float],
    population_share: dict[str, float],
    strata_fn: Callable[[dict], str],
) -> float:
    """Post-stratified mean: weight each stratum to its population share.

    Failures are oversampled inside each sampling stratum, so for an unbiased
    estimate strata_fn should return the sampling cell (stratum plus failure
    flag) and population_share should be keyed the same way.
    """
    by_stratum: dict[str, list[float]] = defaultdict(list)
    for t in sampled:
        by_stratum[strata_fn(t)].append(metric(t))
    # Renormalise over the strata that were actually sampled, so an empty
    # stratum does not silently pull the estimate toward zero.
    covered = sum(population_share[s] for s in by_stratum)
    if covered == 0:
        raise ValueError("no sampled trace falls in a stratum with a population share")
    return sum(
        population_share[s] * (sum(v) / len(v)) for s, v in by_stratum.items()
    ) / covered
The target_per_stratum in the policy is the most important knob. Set it so that each stratum gets enough traces for a meaningful Wilson interval, not so that the global sample rate stays under some budget number. If the budget is binding, drop the long-tail strata to a smaller target rather than uniformly halving every stratum.
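A rough way to pick that target is to search for the smallest sample size whose Wilson half-width, at the failure rate you expect, fits the precision you want. This is a sketch under the assumption that the metric is a pass/fail rate; `min_target` and its parameters are illustrative names, not part of the policy code above.

```python
import math


def wilson_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the Wilson score interval at observed rate p and size n."""
    denom = 1 + z * z / n
    return z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom


def min_target(expected_rate: float, max_half_width: float, z: float = 1.96) -> int:
    """Smallest n whose Wilson half-width at expected_rate is <= max_half_width."""
    if not 0 < max_half_width < 0.5:
        raise ValueError("max_half_width must be between 0 and 0.5")
    n = 1
    while wilson_half_width(expected_rate, n, z) > max_half_width:
        n *= 2  # exponential search for a sufficient upper bound
    lo, hi = n // 2 + 1, n
    while lo < hi:  # binary search; the half-width shrinks as n grows
        mid = (lo + hi) // 2
        if wilson_half_width(expected_rate, mid, z) <= max_half_width:
            hi = mid
        else:
            lo = mid + 1
    return lo


# Assumed goal: pin a ~2% failure rate to within about one percentage point.
needed = min_target(expected_rate=0.02, max_half_width=0.01)
print(needed)
```

For a 2% rate and a one-point half-width this lands at roughly 800 traces per reporting window, which is the number to compare against what each stratum currently receives.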
Reservoir sampling for online aggregation
For a continuous stream where you cannot buffer the whole window, reservoir sampling gives you a uniform sample of size k from a stream of unknown length in one pass. The construction is small enough to inline.
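import random


def reservoir(stream, k: int, rng: random.Random) -> list:
    """Uniform sample of up to k items from a stream, in one pass (Algorithm R)."""
    res = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            res.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                res[j] = item
    return res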
Keep one reservoir per stratum and you have stratified reservoir sampling, which gives you a usable sample at any point in a stream without holding the full window in memory.
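A minimal sketch of the per-stratum version follows. It keeps an independent reservoir and an item counter for each stratum; the function name, the returned tuple, and the `intent` field in the usage lines are assumptions made for illustration.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable, Iterable


def stratified_reservoir(
    stream: Iterable[dict],
    strata_fn: Callable[[dict], Hashable],
    k: int,
    rng: random.Random,
) -> tuple[dict[Hashable, list[dict]], dict[Hashable, int]]:
    """Keep an independent size-k reservoir (Algorithm R) for every stratum seen.

    Returns the reservoirs and the per-stratum item counts observed so far.
    """
    reservoirs: dict[Hashable, list[dict]] = defaultdict(list)
    seen: dict[Hashable, int] = defaultdict(int)
    for item in stream:
        s = strata_fn(item)
        i = seen[s]  # zero-based position of this item within its stratum
        seen[s] += 1
        if i < k:
            reservoirs[s].append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoirs[s][j] = item
    return dict(reservoirs), dict(seen)


# Hypothetical stream: 990 "search" traces and 10 "compliance" traces.
stream = [{"intent": "search", "id": n} for n in range(990)]
stream += [{"intent": "compliance", "id": n} for n in range(10)]
samples, counts = stratified_reservoir(
    stream, lambda t: t["intent"], k=5, rng=random.Random(42)
)
print({s: len(v) for s, v in samples.items()}, counts)
```

Both strata end up with five traces even though one is ninety-nine times larger, and the returned counts double as the population totals needed to reweight on the reporting side.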
What to do this week
- Identify the dimensions you would slice the dashboard by. These are your strata. Set a per-stratum target high enough for a meaningful Wilson interval at your refresh cadence.
- Add a failure signal (judge score below threshold, user thumbs-down, escalation flag) and oversample the negatives by a factor of three to five.
- Switch dashboards that report a global mean to a post-stratified estimator. The number will move; the interval will tighten on the strata that matter 5 6.