What "good" looks like
Recommendation is the eval category where no public benchmark works generically. The dataset is your interaction logs and the truth is your business metric. The primary offline metric is NDCG@10 or hit-rate@K computed against a temporal split of historical interactions; the primary online metric is uplift on the metric you ship to (revenue, retention, watch time, conversion). Offline and online often disagree; the right framing is to treat offline as a filter and A/B as the ground truth.
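As a rough sketch of the two offline metrics named above, the following computes NDCG@k and hit-rate@k for a single user. The function names and the binary-relevance simplification (an item either appears in the user's held-out interactions or it does not) are illustrative assumptions; in practice the per-user values would be averaged over all users in the test window.

```python
import math

def dcg_at_k(gains, k):
    # Discounted cumulative gain: position 0 is discounted by log2(2) = 1.
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_items, relevant_items, k=10):
    """Binary-relevance NDCG@k for one user (assumed simplification)."""
    gains = [1.0 if item in relevant_items else 0.0 for item in ranked_items]
    # Ideal ranking puts every relevant item first, capped at k positions.
    ideal_dcg = dcg_at_k([1.0] * min(len(relevant_items), k), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def hit_rate_at_k(ranked_items, relevant_items, k=10):
    """1.0 if any held-out item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant_items for item in ranked_items[:k]) else 0.0
```

A relevant item at rank 1 gives an NDCG of 1.0; the same item at rank 2 gives 1/log2(3), which is why NDCG rewards ordering while hit-rate only rewards presence.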
The "good" target is an A/B test that shows statistically significant positive uplift on the ship metric without regressing the guardrails (diversity, fairness, exposure). The user-feedback and data-flywheel material in Eugene Yan's "Patterns for Building LLM-based Systems & Products" is the practical reference for this whole category[1].
Secondary metrics
- Coverage. The proportion of the catalog ever recommended.
- Diversity. Often measured as a Gini coefficient over the recommendation distribution.
- Novelty. How often new items appear in recommendations.
- Position bias. The tendency of users to click higher-ranked items regardless of relevance, which biases offline metrics computed from logs.
- Cold-start metrics. Performance for new users and new items.
- Long-tail uplift. The recommendation rate for non-head items.
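Coverage and the Gini coefficient can both be computed directly from served recommendation lists. This is a minimal sketch: the function names are hypothetical, and it assumes `exposure_counts` holds one entry per catalog item, with zeros for items that were never recommended (dropping the zeros would understate concentration).

```python
def catalog_coverage(recommendation_lists, catalog_size):
    """Fraction of the catalog that appears in at least one recommendation list."""
    recommended = {item for recs in recommendation_lists for item in recs}
    return len(recommended) / catalog_size

def gini(exposure_counts):
    """Gini coefficient of exposure: 0.0 means equal exposure across items,
    values near 1.0 mean exposure is concentrated on a few items."""
    counts = sorted(exposure_counts)
    n, total = len(counts), sum(counts)
    if n == 0 or total == 0:
        return 0.0
    # Standard closed form over the ascending-sorted counts.
    weighted = sum((i + 1) * c for i, c in enumerate(counts))
    return (2 * weighted) / (n * total) - (n + 1) / n
```

With four items, equal exposure yields 0.0 and all exposure on one item yields 0.75, the maximum for a catalog of that size.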
Offline eval design
Compute NDCG@10, hit-rate@K, AUC, and MAP on historical interactions with a temporal split, never a random split. A random split causes data leakage: user preferences and item popularity drift over time, so a randomly held-out test set is interleaved with training data and future events leak into training. Counterfactual evaluation via IPS (inverse propensity scoring) or doubly-robust estimators is the right move when feasible, because it corrects for the selection bias introduced by the current ranker's exposure.
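A temporal split amounts to picking a cutoff and sending everything before it to training and everything at or after it to test. The sketch below assumes interactions are `(user_id, item_id, timestamp)` tuples; the tuple layout and the function name are illustrative, not taken from any particular library.

```python
def temporal_split(interactions, cutoff):
    """Split (user_id, item_id, timestamp) rows at a time cutoff.

    Rows strictly before the cutoff go to train; rows at or after it go to
    test, so no training row is later than any test row.
    """
    train = [row for row in interactions if row[2] < cutoff]
    test = [row for row in interactions if row[2] >= cutoff]
    return train, test
```

The property to verify after splitting is that the latest training timestamp precedes the earliest test timestamp; a random split cannot satisfy it.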
Hard negatives matter here as much as in retrieval. Sampling negatives uniformly from the catalog overstates model performance, because most uniformly drawn items are obscure and trivially easy to rank below the positive; sample by popularity or by current-ranker exposure to make the eval predict real lift.
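One way to sample negatives by popularity is to weight each candidate by its interaction count in the training logs. This is a sketch under assumptions: `item_counts` is a hypothetical mapping from item to training-period interaction count, and sampling is done with replacement for simplicity.

```python
import random
from collections import Counter

def sample_popular_negatives(user_positives, item_counts, n, rng):
    """Draw n negatives for one user, weighted by item popularity.

    Items the user already interacted with are excluded. Sampling is with
    replacement, so the same popular item can be drawn more than once.
    """
    candidates = [item for item in item_counts if item not in user_positives]
    if not candidates:
        return []
    weights = [item_counts[item] for item in candidates]
    return rng.choices(candidates, weights=weights, k=n)

# Usage: pass a seeded generator so the eval set is reproducible.
counts = Counter({"a": 100, "b": 10, "c": 1})
negatives = sample_popular_negatives({"a"}, counts, 5, random.Random(0))
```

Sampling by current-ranker exposure follows the same shape, with impression counts substituted for interaction counts.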
Online eval design
A/B test on the metric you ship to (revenue, retention, watch time, conversion); the right metric is the business metric, not the ML metric. Put guardrails on diversity, fairness, and exposure: the new ranker may improve the ship metric while concentrating exposure on a small set of items, which is a long-term quality regression. Alert on selection bias if the new ranker shifts the candidate pool meaningfully; a candidate-pool shift can invalidate the A/B comparison.
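The ship rule described above (significant uplift, no guardrail regression) can be sketched as follows. This assumes the ship metric is a conversion rate, so a two-proportion z-test applies; other metrics such as revenue or watch time would need a different test. The function names, the `guardrail_deltas` convention (treatment minus control, positive is better), and the thresholds are all hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (absolute uplift of B over A, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, p_value

def ship_decision(uplift, p_value, guardrail_deltas, alpha=0.05, max_regression=0.0):
    """Ship only if the uplift is positive and significant AND no guardrail
    metric drops by more than max_regression."""
    significant_win = uplift > 0 and p_value < alpha
    guardrails_ok = all(d >= -max_regression for d in guardrail_deltas.values())
    return significant_win and guardrails_ok
```

A treatment that converts 1,100 of 10,000 users against a control at 1,000 of 10,000 passes the significance check at alpha 0.05, but is still rejected if any guardrail delta falls below the allowed regression.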
The pattern Shankar's flywheel writeup describes applies directly: scores have to come back to a product KPI you can move, and the right architecture is a closed loop from user behavior to retraining[2].
Top benchmarks
- Your interaction logs. The dataset that predicts production. Public recsys benchmarks (MovieLens, Amazon reviews) are useful for learning methodology but do not transfer to your domain.
- Offline temporal split of your own logs with hard negatives. The minimum useful eval artifact.
- A/B test rig with proper randomization and guardrails. The truth.
For tooling, Cornac and Elliot are open-source recsys evaluation frameworks with strong implementations of the standard metrics. Online testing needs an in-house A/B framework with proper randomization, sample-size calculators, and metric pipelines. IPS and doubly-robust estimators for off-policy evaluation can be implemented in-house or taken from libraries like Open Bandit Pipeline.
Common pitfalls
- Random offline split. Data leakage; future events leak into training. Always temporal.
- Optimizing offline NDCG without online validation. The two often disagree; treat offline as a filter.
- Ignoring exposure bias. Your current ranker decides what users see, which biases the logs; counterfactual estimators or careful negative sampling are required.
- No diversity or fairness guardrails on A/B tests. A ranker can lift the ship metric and concentrate exposure on head items, eroding catalog diversity; this is a long-term regression that A/B alone will not catch.
- Reading too much into a single A/B run. The variance is real, and the right pattern is rotating panels of experiments[3].
Canonical reading
Eugene Yan's "Patterns for Building LLM-based Systems & Products" is the most useful single reading in this category because it covers data flywheels for recommender systems in the LLM era: how to design feedback loops that improve the recommender over time and how to instrument them[1]. Read it alongside Shankar's AI engineering flywheel piece for the operational pattern[2].
Day-1 starter
- Temporal-split historical eval with NDCG@10 and hit-rate. Hard negatives sampled by popularity or current-ranker exposure. The temporal split is non-negotiable.
- A/B harness with guardrails on revenue (or your ship metric) plus diversity plus fairness. Without guardrails, you ship rankers that lift the headline and regress the catalog.
- Counterfactual evaluation (IPS) for offline candidate ranking. IPS corrects for the exposure bias of the ranker that produced the logs, which helps narrow the gap between offline scores and online uplift.
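For the IPS item, a minimal sketch of a clipped inverse-propensity estimator is shown below. It assumes each log row records the propensity with which the logging ranker showed the item, and that `target_prob` (a hypothetical callable) returns the probability the candidate ranker would show the same item in the same context. Clipping the weight is a common variance-control choice that introduces some bias; the default of 10.0 is arbitrary.

```python
def ips_estimate(logs, target_prob, clip=10.0):
    """Clipped IPS estimate of the candidate policy's average reward.

    logs: list of (context, action, reward, logging_propensity) rows, where
    logging_propensity > 0 is the probability the logging policy chose action.
    """
    total = 0.0
    for context, action, reward, propensity in logs:
        # Reweight each logged reward by how much more (or less) likely the
        # candidate policy is to take the logged action.
        weight = min(target_prob(context, action) / propensity, clip)
        total += weight * reward
    return total / len(logs)

# Usage: logging policy picked "a" or "b" uniformly; candidate always picks "a".
logs = [("u", "a", 1, 0.5), ("u", "b", 0, 0.5), ("u", "a", 1, 0.5), ("u", "b", 0, 0.5)]
always_a = lambda context, action: 1.0 if action == "a" else 0.0
estimate = ips_estimate(logs, always_a)
```

In this toy log the item "a" always earns reward 1, so a policy that always shows "a" should score 1.0, and the estimator recovers that from logs in which "a" was shown only half the time.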