Cost and latency belong in the scorecard

A model that wins quality and triples p95 latency is not a win. Put quality, cost, and latency on one scorecard and pick on the Pareto frontier.

A quality scorecard with no cost column kills projects six months in when finance discovers the per-request line item. A quality scorecard with no latency column ships a model that wins offline and falls over under p95 traffic. Cost and latency are not infrastructure concerns separate from the eval program. They are eval metrics on the same scorecard, sampled from the same traces, reported in the same table.

The opinion: every release scorecard has three columns at minimum, beyond per-eval quality scores. Median latency, p95 latency, cost per request. Every comparison is reported as a quality-cost-latency triple with intervals on each, and the release decision is "which corner of the Pareto frontier does the product need" rather than "which is best." Huyen's GenAI platform piece is the cleanest statement of why these three live together in production ¹.

Why a single quality number is not enough

Three patterns recur in production releases reliably.

The first is the model that wins quality by spending more tokens. Verbose responses, extra tool calls, longer CoT. The quality column lights up; the cost column doubles; the user complaint queue fills with "too long." Hamel and Shankar's FAQ flags verbosity-driven quality wins as one of the most common eval failures because the verbosity is invisible to single-axis judges ².

The second is the model that wins quality and triples p95. The median moves a little, the tail moves a lot, and the SRE on-call discovers the change Monday morning when the dashboard goes red. Latency at p95 is not a finance concern; it is a UX concern that finance will inherit when retention slips.

The third is the model that hits the quality bar in offline and misses it in online because the offline judge does not penalise the costly path. The fix is not "judge harder"; it is "score the same trace on three axes."

The minimum scorecard

Four columns per release, one row per slice.

Quality (the primary eval metric, with its Wilson interval).
p50 latency in milliseconds (with a bootstrap interval).
p95 latency in milliseconds (with a bootstrap interval).
Cost per request in dollars, broken into model cost and tool cost (with a bootstrap interval).

Add a row per slice (intent, persona, surface). A scorecard that reports only the global mean hides the slice where one column moved badly. The same applies to release comparisons: candidate vs baseline, per slice, all three columns.

Sampling for cost and latency is different from sampling for quality. Latency and cost are per-trace logged on every request, so the dashboard runs over the full population, not the judged sample. The interval on p95 needs a bootstrap or a quantile-specific formula because the normal approximation does not apply at the tail ³.

Computing the columns

"""
Compute the four-column scorecard with intervals. Quality is from the
judge sample; latency and cost are from the full trace stream.
"""
import numpy as np
from statsmodels.stats.proportion import proportion_confint

def wilson(successes: int, n: int, alpha: float = 0.05):
    return proportion_confint(successes, n, alpha=alpha, method="wilson")

def bootstrap_quantile(values, q: float, n_boot: int = 5000, alpha: float = 0.05, rng=None):
    rng = rng or np.random.default_rng(0)
    arr = np.asarray(values)
    boots = np.array([
        np.quantile(rng.choice(arr, size=len(arr), replace=True), q)
        for _ in range(n_boot)
    ])
    return float(np.quantile(boots, alpha / 2)), float(np.quantile(boots, 1 - alpha / 2))

def bootstrap_mean(values, n_boot: int = 5000, alpha: float = 0.05, rng=None):
    rng = rng or np.random.default_rng(0)
    arr = np.asarray(values)
    boots = np.array([
        rng.choice(arr, size=len(arr), replace=True).mean()
        for _ in range(n_boot)
    ])
    return float(np.quantile(boots, alpha / 2)), float(np.quantile(boots, 1 - alpha / 2))

def scorecard_row(name: str, judged: list[dict], traces: list[dict]) -> dict:
    successes = sum(1 for t in judged if t["pass"])
    n = len(judged)
    lat_ms = [t["latency_ms"] for t in traces]
    cost_usd = [t["cost_usd"] for t in traces]
    return {
        "slice": name,
        "quality_rate": successes / n,
        "quality_ci": wilson(successes, n),
        "p50_latency_ms": float(np.median(lat_ms)),
        "p50_ci": bootstrap_quantile(lat_ms, 0.5),
        "p95_latency_ms": float(np.quantile(lat_ms, 0.95)),
        "p95_ci": bootstrap_quantile(lat_ms, 0.95),
        "cost_per_request_usd": float(np.mean(cost_usd)),
        "cost_ci": bootstrap_mean(cost_usd),
    }

Two reporting habits worth adopting. First, report cost in dollars, not tokens; tokens are a number engineers argue about, dollars are the number leadership reads. Second, break cost into model and tool components on agent workflows; a quality win that triples tool-call cost without moving model cost is a different story from one that doubles model cost.

The Pareto framing

Once three axes are on the table, "best" stops being a useful question. The right question is "where on the quality-cost-latency frontier should this product live." Customer support: high quality, latency over cost. Bulk summarisation: cost over latency, quality at the floor. Internal coding assistant: latency over quality and cost in that order.

The release scorecard should make the frontier visible. For two candidates with similar quality, plot p95 latency against cost per request, and pick the one closer to the product's corner of the frontier. The single-number-with-CI comparison is still the unit of decision, but the frontier is the context the decision is made in ⁴.

Alert thresholds

Three thresholds worth committing to as defaults.

Quality regression: 95% interval on the delta excludes zero on the negative side, evaluated per slice ⁵.
Latency regression: p95 delta exceeds 20% of the baseline, sustained over a one-hour window. Spikes shorter than that are usually infrastructure noise.
Cost regression: cost per request delta exceeds 15% of the baseline, sustained over a one-day window. Shorter windows alarm on traffic mix shifts.

A scorecard with alarms on all three axes catches the three failure modes above at the time they ship, not three weeks later when the finance review surfaces them.

TIP

A release that wins quality and loses cost or latency at the configured threshold is a regression candidate. The default is to revert and re-scope, not to ship the quality win and absorb the loss silently ⁶.

What to do this week

Add three columns to your release scorecard: p50, p95, cost per request. Pull them from the same trace stream that already carries the quality scores.
Convert at least one ongoing model comparison into a quality-cost-latency triple. Compute intervals on each.
Set the alert thresholds above. Wire them to the same channel as the quality drift alarms; the runbook is the same.

Chip Huyen, "Building A Generative AI Platform" (2024). ↩
Hamel Husain and Shreya Shankar, "LLM Evals FAQ." ↩
Miller, "Adding Error Bars to Evals" (Anthropic), arXiv:2411.00640. ↩
Applied LLMs, "What We Learned from a Year of Building with LLMs." ↩
Anthropic, "A Statistical Approach to Model Evaluations." ↩
Shreya Shankar, "The AI Engineering Flywheel." ↩