Recommendation and personalization

Rank items for a user. Offline NDCG plus hit-rate to filter; online A/B for ground truth; temporal splits, never random.

What "good" looks like

Recommendation is evaluated in two passes: an offline pass that ranks against a temporal split of your own historical interactions, and an online A/B test that measures uplift on the business metric you ship to. No public benchmark works generically. Your interaction logs are the dataset and your business metric is the truth.

The two targets:

Offline: NDCG@10 or hit-rate@K, computed against a temporal split of historical interactions.
Online: statistically significant positive uplift on the ship metric (revenue, retention, watch time, conversion) with no regression on the guardrails (diversity, fairness, exposure).

Offline and online often disagree. The right framing treats offline as a filter and A/B as the ground truth. The "good" target is an A/B test that shows a statistically significant positive uplift on the ship metric without regressing the guardrails.

Eugene Yan's "User Feedback: Building Data Flywheels" is the practical reference for this whole category^[1].

flowchart TD
    LOGS["Interaction logs"] --> SPLIT["Temporal split, never random"]
    SPLIT --> NEG["Hard negatives by popularity or exposure"]
    NEG --> OFF["Offline: NDCG@10, hit-rate@K"]
    OFF --> IPS["IPS / doubly-robust correction"]
    IPS --> FILTER{"Passes offline filter?"}
    FILTER -->|no| REJECT["Reject ranker"]
    FILTER -->|yes| AB["A/B test on ship metric"]
    AB --> GUARD{"Uplift and guardrails hold?"}
    GUARD -->|no| REJECT
    GUARD -->|yes| SHIP["Ship"]

Secondary metrics

Metric	What it catches	How to score
Coverage	Catalog items never surfaced	Proportion of the catalog ever recommended
Diversity	Over-concentration on a few items	Gini coefficient over the recommendation distribution
Novelty	Stale, repetitive slates	How often new items appear in recommendations
Position bias	Clicks that reflect rank, not relevance	Click-through rate by rank position, compared against relevance; users tend to click higher-ranked items regardless of relevance, which biases offline metrics computed from logs
Cold-start	Failure on new users and new items	Separate metrics for new users and new items
Long-tail uplift	Head-item concentration	Recommendation rate for non-head items
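
Two of the rows above reduce to a few lines of arithmetic. The following is a minimal sketch, assuming recommendations are available as lists of item IDs (slates) and that the full catalog is known; the function names and the toy data are illustrative, not taken from any particular library.

```python
from collections import Counter

def catalog_coverage(slates, catalog):
    """Proportion of the catalog that appears in at least one recommended slate."""
    shown = {item for slate in slates for item in slate}
    return len(shown & set(catalog)) / len(catalog)

def gini(exposure_counts):
    """Gini coefficient of exposure: 0 means evenly spread, values near 1 mean
    exposure is concentrated on a few items."""
    values = sorted(exposure_counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending values: sum((2i - n - 1) * x_i) / (n * total), i = 1..n
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(values, start=1))
    return weighted / (n * total)

# Hypothetical toy data: a four-item catalog and two recommended slates.
catalog = ["a", "b", "c", "d"]
slates = [["a", "b"], ["a", "c"]]

exposure = Counter(item for slate in slates for item in slate)
# Include never-shown items as zeros, otherwise concentration is understated.
counts = [exposure.get(item, 0) for item in catalog]

print(catalog_coverage(slates, catalog))  # 0.75
print(gini(counts))                       # 0.375
```

Counting never-surfaced items as zero exposure is a deliberate choice here: dropping them would make a ranker that ignores most of the catalog look more even than it is.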

Offline eval design

Metrics. Use NDCG@10, hit-rate@K, AUC, and MAP on historical interactions.
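
For reference, the two headline metrics can be sketched for a single user as follows. This assumes binary relevance (an item is either in the user's held-out interactions or not); graded relevance would change the gain term. The function names are illustrative.

```python
import math

def hit_rate_at_k(ranked_items, relevant_items, k=10):
    """1.0 if any held-out item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant_items for item in ranked_items[:k]) else 0.0

def ndcg_at_k(ranked_items, relevant_items, k=10):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    # rank is 0-based, so the first position contributes 1 / log2(2) = 1.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant_items
    )
    # The ideal ranking puts every held-out item (up to k of them) at the top.
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical example: the ranker returned four items, two were held out.
ranked = ["a", "b", "c", "d"]
held_out = {"b", "d"}

print(hit_rate_at_k(ranked, held_out, k=10))
print(round(ndcg_at_k(ranked, held_out, k=10), 4))
```

In practice both numbers are averaged over all users in the test window, and it is worth reporting them separately for new and returning users.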

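Split temporally, never randomly. A random split leaks data: interactions that happened after a test event end up in the training set, so the model is scored on events whose future it has already seen. Because user preferences and item popularity drift over time, that leak inflates offline metrics.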
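
A temporal split can be as simple as a single cutoff timestamp. The sketch below assumes each log row carries a user ID, an item ID, and a timestamp; the `Interaction` type and the cutoff value are hypothetical.

```python
from typing import NamedTuple

class Interaction(NamedTuple):
    user_id: str
    item_id: str
    timestamp: int  # assumed to be comparable, e.g. Unix seconds

def temporal_split(interactions, cutoff):
    """Train on everything strictly before `cutoff`, test on everything at or after it."""
    train = [x for x in interactions if x.timestamp < cutoff]
    test = [x for x in interactions if x.timestamp >= cutoff]
    return train, test

# Hypothetical log rows, deliberately out of order.
logs = [
    Interaction("u1", "a", 100),
    Interaction("u2", "b", 400),
    Interaction("u1", "c", 250),
    Interaction("u2", "a", 300),
]

train, test = temporal_split(logs, cutoff=300)

# No training event happens at or after the earliest test event.
assert max(x.timestamp for x in train) < min(x.timestamp for x in test)
```

A single global cutoff is the simplest variant; a per-user "last N interactions" split is a common alternative, but it can still let one user's later behavior inform predictions about another user's earlier test events.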

Correct for exposure with counterfactual estimation. IPS (inverse propensity scoring) or doubly-robust estimators are the right move when feasible, because they correct for the selection bias introduced by the current ranker's exposure.
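
As an illustration, a clipped IPS estimate of a new ranker's value can be sketched like this. It assumes the logs record the probability with which the current ranker showed each item (the logging propensity) and that the new policy's probability for the same action can be computed; both are assumptions about your logging, and the `clip` value is an arbitrary choice for the example.

```python
def ips_estimate(logged, new_policy_prob, clip=10.0):
    """Clipped inverse propensity scoring.

    logged: list of (context, action, reward, logging_propensity) tuples.
    new_policy_prob: function (context, action) -> probability under the new policy.
    """
    total = 0.0
    for context, action, reward, propensity in logged:
        # Reweight each logged reward by how much more (or less) likely
        # the new policy is to take the logged action.
        weight = new_policy_prob(context, action) / propensity
        # Clipping caps the variance from very small propensities, at the cost of some bias.
        total += min(weight, clip) * reward
    return total / len(logged)

# Hypothetical logs: the current ranker showed "a" or "b" uniformly at random.
logged = [
    ("u1", "a", 1.0, 0.5),
    ("u2", "b", 0.0, 0.5),
    ("u3", "a", 0.0, 0.5),
    ("u4", "b", 1.0, 0.5),
]

def always_a(context, action):
    """A new policy that always shows item "a"."""
    return 1.0 if action == "a" else 0.0

print(ips_estimate(logged, always_a))  # 0.5, the mean logged reward of item "a"
```

The estimate is only meaningful where the logging policy gave the action a non-zero propensity; items the current ranker never shows cannot be evaluated this way, which is one reason the correction is not always feasible.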

Sample hard negatives. Hard negatives matter here as much as in retrieval. Sampling negatives uniformly from the catalog overstates model performance, because most randomly drawn items are trivially easy to rank below the positive. Sample by popularity or by current-ranker exposure to make the eval predict real lift.
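
Popularity-weighted sampling can be sketched with the standard library alone. The version below assumes item popularity is an interaction count per item and draws without replacement; the function name and the toy counts are hypothetical.

```python
import random
from collections import Counter

def sample_popular_negatives(user_positives, item_counts, n_negatives, rng):
    """Draw negatives with probability proportional to interaction count,
    excluding items the user already interacted with."""
    candidates = [item for item in item_counts if item not in user_positives]
    # Counts are assumed to be positive; a Counter built from logs satisfies this.
    weights = [item_counts[item] for item in candidates]
    n = min(n_negatives, len(candidates))
    negatives = []
    while len(negatives) < n:
        pick = rng.choices(candidates, weights=weights, k=1)[0]
        # Remove the pick so each negative is drawn at most once.
        idx = candidates.index(pick)
        candidates.pop(idx)
        weights.pop(idx)
        negatives.append(pick)
    return negatives

# Hypothetical popularity counts taken from the training window only.
item_counts = Counter({"a": 50, "b": 30, "c": 15, "d": 4, "e": 1})
rng = random.Random(0)  # seeded so the eval set is reproducible

negatives = sample_popular_negatives({"a"}, item_counts, n_negatives=3, rng=rng)
print(negatives)
```

Sampling by current-ranker exposure follows the same shape: replace the interaction counts with impression counts from the serving logs.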

Online eval design

A/B on the business metric, not the ML metric. Test on the metric you ship to: revenue, retention, watch time, conversion.

Guardrail diversity, fairness, and exposure. The new ranker may improve the ship metric while concentrating exposure on a small set of items, which is a long-term quality regression.
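
The ship decision combines both checks. Here is a minimal sketch, assuming the ship metric is a conversion rate (so a two-proportion z-test applies) and that each guardrail is summarized as "treatment minus control, higher is better". The tolerance, the metric names, and the toy counts are hypothetical, and a production rig would test guardrails for significance as well rather than compare point estimates.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_control, n_control, conv_treatment, n_treatment):
    """Return (absolute uplift, two-sided p-value) for a difference in conversion rates."""
    p_c = conv_control / n_control
    p_t = conv_treatment / n_treatment
    pooled = (conv_control + conv_treatment) / (n_control + n_treatment)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_t - p_c, p_value

def ship_decision(uplift, p_value, guardrail_deltas, alpha=0.05, tolerance=0.01):
    """Ship only if the uplift is positive and significant and no guardrail
    drops by more than `tolerance`."""
    significant_win = uplift > 0 and p_value < alpha
    guardrails_hold = all(delta >= -tolerance for delta in guardrail_deltas.values())
    return significant_win and guardrails_hold

# Hypothetical experiment: 10,000 users per arm.
uplift, p_value = two_proportion_z_test(1000, 10_000, 1100, 10_000)

# Hypothetical guardrail deltas (treatment minus control).
guardrails = {"catalog_coverage": -0.002, "long_tail_share": -0.08}

print(round(uplift, 3), p_value < 0.05)              # 0.01 True
print(ship_decision(uplift, p_value, guardrails))    # False: long-tail share regressed
```

This example is the failure mode described above: the headline metric wins, but exposure shifts away from the long tail, so the ranker is rejected.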

Alert on selection bias. A new ranker that meaningfully shifts the candidate pool can invalidate the A/B comparison, because the two arms are no longer ranking the same set of items.
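
One possible way to operationalize that alert is to compare candidate pools directly. The sketch below assumes both rankers' candidate pools can be computed for the same requests (for example by running the new ranker in shadow mode) and uses mean Jaccard overlap; the measure and the 0.5 threshold are illustrative choices, not something the sources prescribe.

```python
def jaccard(a, b):
    """Overlap between two item sets: 1.0 means identical, 0.0 means disjoint."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def candidate_pool_alert(control_pools, treatment_pools, min_overlap=0.5):
    """Return (alert, mean_overlap) over paired candidate pools for the same requests."""
    overlaps = [jaccard(c, t) for c, t in zip(control_pools, treatment_pools)]
    mean_overlap = sum(overlaps) / len(overlaps)
    return mean_overlap < min_overlap, mean_overlap

# Hypothetical candidate pools for two requests, one list per request.
control_pools = [["a", "b", "c", "d"], ["e", "f", "g", "h"]]
treatment_pools = [["a", "b", "x", "y"], ["e", "p", "q", "r"]]

alert, overlap = candidate_pool_alert(control_pools, treatment_pools)
print(alert, round(overlap, 3))  # True 0.238
```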

The pattern Shankar's flywheel writeup describes applies directly: scores have to come back to a product KPI you can move, and the right architecture is a closed loop from user behavior to retraining^[2].

Top benchmarks

Your interaction logs. The dataset that predicts production. Public recsys benchmarks (MovieLens, Amazon reviews) are useful for learning methodology, but results on them do not transfer to your domain.
Offline temporal split of your own logs with hard negatives. The minimum useful eval artifact.
A/B test rig with proper randomization and guardrails. The truth.

Tools especially suited

Cornac: open-source recsys evaluator with strong implementations of the standard metrics.
Elliot: open-source recsys evaluator with strong implementations of the standard metrics.
In-house A/B framework: proper randomization, sample-size calculators, and metric pipelines.
IPS and doubly-robust estimators: off-policy evaluation in-house or via libraries like Open Bandit Pipeline.
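
As a rough picture of what a sample-size calculator does, the following sketch uses the standard normal-approximation formula for comparing two conversion rates. It assumes a binary ship metric, a two-sided test, and equal-sized arms; the default `alpha` and `power` are conventional choices, not requirements.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(baseline_rate, min_detectable_uplift, alpha=0.05, power=0.8):
    """Users needed per arm to detect an absolute uplift in a conversion rate."""
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_uplift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Hypothetical: 10% baseline conversion, want to detect a 1-point absolute lift.
print(sample_size_per_arm(0.10, 0.01))

# Halving the detectable uplift roughly quadruples the required sample.
print(sample_size_per_arm(0.10, 0.005))
```

Continuous metrics such as revenue or watch time need the same calculation with an empirical variance estimate in place of the binomial term.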

Common pitfalls

Pitfall	Why it bites	Fix
Random offline split	Data leakage; future events leak into training	Always split temporally
Optimizing offline NDCG without online validation	The two often disagree	Treat offline as a filter, A/B as the truth
Ignoring exposure bias	Your current ranker decides what users see, which biases the logs	Counterfactual estimators or careful negative sampling
No diversity or fairness guardrails on A/B tests	A ranker can lift the ship metric and concentrate exposure on head items, a long-term regression A/B alone will not catch	Add diversity and fairness guardrails
Reading too much into a single A/B run	The variance is real	Rotate panels of experiments^[3]
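
To make the last row concrete, a percentile bootstrap gives a quick sense of how wide the uplift interval from a single run is. This is a minimal sketch, assuming one outcome per user in each arm; the resample count, the seed, and the toy data are arbitrary.

```python
import random

def bootstrap_uplift_ci(control, treatment, n_resamples=2000, confidence=0.95, seed=0):
    """Return (observed uplift, lower bound, upper bound) using a percentile bootstrap."""
    rng = random.Random(seed)

    def mean(xs):
        return sum(xs) / len(xs)

    uplifts = []
    for _ in range(n_resamples):
        # Resample each arm with replacement, keeping the original arm sizes.
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        uplifts.append(mean(t) - mean(c))
    uplifts.sort()

    tail = (1 - confidence) / 2
    lower = uplifts[int(tail * n_resamples)]
    upper = uplifts[int((1 - tail) * n_resamples) - 1]
    return mean(treatment) - mean(control), lower, upper

# Hypothetical per-user conversions: 10% in control, 20% in treatment, 100 users each.
control = [1] * 10 + [0] * 90
treatment = [1] * 20 + [0] * 80

observed, lower, upper = bootstrap_uplift_ci(control, treatment)
print(observed, lower, upper)
```

With only 100 users per arm the interval is wide relative to the observed uplift, which is the point of the pitfall: a single run at this size says little on its own.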

Canonical reading

Eugene Yan's "Patterns for Building LLM-based Systems & Products" is the most useful single reading in this category because it covers data flywheels for recommender systems in the LLM era: how to design feedback loops that improve the recommender over time and how to instrument them^[1]. Read with Shankar's AI engineering flywheel piece for the operational pattern^[2].

Day-1 starter

Temporal-split historical eval with NDCG@10 and hit-rate. Hard negatives sampled by popularity or current-ranker exposure. The temporal split is non-negotiable.
A/B harness that measures revenue (or your ship metric) with guardrails on diversity and fairness. Without guardrails, you ship rankers that lift the headline and regress the catalog.
Counterfactual evaluation (IPS) for offline candidate ranking. The IPS estimator narrows the gap between offline scores and online uplift by correcting for what the current ranker chose to show.

Cross-cutting links

Online vs offline: where each fires: the offline-filter, online-truth split in depth.
Drift detection for production AI: catching candidate-pool shift after a ranker change.
Statistical power: how big does the eval set need to be: sizing the A/B before you run it.
Online evaluation patterns: guardrailed experiments as a technique.
Adding error bars to your evals: uplift numbers with intervals instead of point estimates.