What "good" looks like
The primary metric is a composite of three signals: human preference (pairwise A/B against the prior model or a strong baseline), prompt adherence (CLIP score from the CLIP paper)[1], and realism (FID, Fréchet Inception Distance, from Heusel et al.)[2]. No single one of the three is sufficient. CLIP score rewards alignment with the text but not necessarily aesthetic quality; FID rewards realism but does not measure whether the image matches the prompt; human preference is the ground truth but is slow and expensive to collect.
What counts as "good" depends on the use case. Stock-photo replacement needs FID below the thresholds for its domain and a CLIP score that beats your prior model on a fixed prompt set; creative tools need a pairwise human-preference win rate above 50% against the baseline. Always report all three; reporting only FID rewards models that produce realistic but off-prompt images.
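As a minimal sketch of reporting all three signals together and checking them against a use-case target, the snippet below bundles the metrics into one record. The `EvalReport` type, the `meets_target` helper, the use-case labels, and the numbers in the usage example are all hypothetical; the FID threshold in particular has to come from your own domain.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalReport:
    """All three signals for one model (hypothetical container)."""
    clip_score: float  # mean CLIP score on the fixed prompt set, higher is better
    fid: float         # FID against the reference image set, lower is better
    win_rate: float    # fraction of pairwise comparisons won against the baseline


def meets_target(candidate: EvalReport, prior: EvalReport, use_case: str,
                 fid_threshold: Optional[float] = None) -> bool:
    """Apply the use-case rule described in the text above."""
    if use_case == "stock_photo":
        if fid_threshold is None:
            raise ValueError("stock_photo needs a domain-specific fid_threshold")
        # Realistic enough for the domain AND better prompt adherence than before.
        return candidate.fid < fid_threshold and candidate.clip_score > prior.clip_score
    if use_case == "creative_tool":
        # Raters must prefer the candidate more often than the baseline.
        return candidate.win_rate > 0.5
    raise ValueError(f"unknown use case: {use_case}")


# Illustrative numbers only.
prior = EvalReport(clip_score=30.1, fid=14.0, win_rate=0.50)
candidate = EvalReport(clip_score=31.0, fid=12.5, win_rate=0.56)
print(meets_target(candidate, prior, "stock_photo", fid_threshold=13.0))
print(meets_target(candidate, prior, "creative_tool"))
```

Keeping the three numbers in one record makes it harder to publish an FID figure without the other two next to it.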
Secondary metrics
- Compositionality. Counting (three cats), spatial relations (cat on top of dog), and color binding (red ball next to blue cube).
- Text-in-image fidelity. Whether the model can render specified text legibly.
- Safety. NSFW classifier flags and IP-related concerns (real faces, trademarked logos).
- Aesthetic preference. Scored on its own rubric rather than coupled to prompt adherence.
- Prompt leak. The model literally renders prompt text into the image (an extreme failure mode).
Offline eval design
PartiPrompts is the canonical creative-prompt set: 1,600+ prompts across categories from Google Research[3]. HEIM (Holistic Evaluation of Text-to-Image Models) adds cross-axis comparison. On top of those, build an in-house creative-brief set of 50 to 200 prompts that match your product distribution, and collect 3-rater pairwise preference labels against the baseline for each prompt. The three metrics to track are CLIP score per prompt, FID over batches, and pairwise preference.
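One way to turn the 3-rater pairwise labels into a single win rate is a per-prompt majority vote, sketched below. The label strings, the tie handling (no strict majority counts as a tie, and a tie counts as half a win), and the function names are assumptions for illustration, not a standard.

```python
from collections import Counter
from typing import Dict, List


def majority_vote(labels: List[str]) -> str:
    """Collapse one prompt's rater labels ('candidate', 'baseline', 'tie')."""
    top_label, top_count = Counter(labels).most_common(1)[0]
    # Require a strict majority; otherwise treat the prompt as a tie.
    return top_label if top_count > len(labels) / 2 else "tie"


def win_rate(ratings: Dict[str, List[str]]) -> float:
    """Candidate win rate over prompts, counting each tie as half a win."""
    outcomes = [majority_vote(labels) for labels in ratings.values()]
    wins = outcomes.count("candidate")
    ties = outcomes.count("tie")
    return (wins + 0.5 * ties) / len(outcomes)


# Hypothetical labels for four prompts, three raters each.
ratings = {
    "p1": ["candidate", "candidate", "baseline"],
    "p2": ["baseline", "baseline", "tie"],
    "p3": ["candidate", "tie", "baseline"],
    "p4": ["candidate", "candidate", "candidate"],
}
print(win_rate(ratings))  # 0.625
```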
For safety, run an NSFW classifier on every generation; the classifier-flag rate is a leading indicator of distribution drift after model swaps.
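A simple way to decide whether a change in flag rate after a model swap is more than noise is a two-proportion z-test, sketched below. The function names, the counts in the usage example, and the alert threshold of 3.0 are illustrative assumptions; the test is a swapped-in statistical technique, not something the sources above prescribe.

```python
import math


def flag_rate_z(flags_before: int, n_before: int,
                flags_after: int, n_after: int) -> float:
    """Two-proportion z statistic for the change in classifier-flag rate."""
    p_before = flags_before / n_before
    p_after = flags_after / n_after
    pooled = (flags_before + flags_after) / (n_before + n_after)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    if se == 0:
        return 0.0  # no flags at all (or all flagged): nothing to compare
    return (p_after - p_before) / se


def flag_rate_alert(flags_before: int, n_before: int,
                    flags_after: int, n_after: int,
                    z_threshold: float = 3.0) -> bool:
    """Alert only on increases that exceed the (assumed) z threshold."""
    return flag_rate_z(flags_before, n_before, flags_after, n_after) > z_threshold


# Hypothetical counts: 1.0% flagged before the swap, 1.6% after.
print(flag_rate_alert(100, 10_000, 160, 10_000))  # True
print(flag_rate_alert(100, 10_000, 105, 10_000))  # False
```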
Online eval design
Implicit feedback (saves, edits, re-rolls) is the primary online signal. A high re-roll rate on a fixed prompt distribution is the leading indicator of quality regression. Supplement it with explicit thumbs-up and thumbs-down feedback on a per-image basis, and run an NSFW classifier pass on every output. Alert on increases in the classifier-flag rate and on style-drift signatures after model swaps (a swap can subtly shift the color palette or composition style, and users notice).
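As a rough sketch of the re-roll signal, the snippet below computes a re-roll rate per model version from an event log. The event schema (dicts with `model` and `action` keys), the action names, and the definition used here (re-roll events divided by generation events) are assumptions; adapt them to whatever your product actually logs.

```python
from collections import defaultdict
from typing import Dict, Iterable


def reroll_rate_by_model(events: Iterable[dict]) -> Dict[str, float]:
    """Re-rolls per generation for each model version (assumed event schema)."""
    generations = defaultdict(int)
    rerolls = defaultdict(int)
    for event in events:
        if event["action"] == "generate":
            generations[event["model"]] += 1
        elif event["action"] == "reroll":
            rerolls[event["model"]] += 1
        # Other actions (save, edit) are ignored for this metric.
    return {model: rerolls[model] / count for model, count in generations.items()}


# Hypothetical log: v2 gets re-rolled twice as often as v1.
events = (
    [{"model": "v1", "action": "generate"}] * 4
    + [{"model": "v1", "action": "reroll"}] * 1
    + [{"model": "v2", "action": "generate"}] * 4
    + [{"model": "v2", "action": "reroll"}] * 2
    + [{"model": "v2", "action": "save"}]
)
print(reroll_rate_by_model(events))  # {'v1': 0.25, 'v2': 0.5}
```

Comparing the rates only makes sense when both versions see a similar prompt distribution, which is why the text ties the signal to a fixed distribution.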
Top benchmarks
- PartiPrompts, https://github.com/google-research/parti. 1,600+ creative prompts; the standard for creative-prompt breadth[3].
- HEIM. Holistic eval across realism, aesthetics, and prompt adherence.
- Your in-house creative-brief set. 50 to 200 prompts matching your product distribution; the eval that predicts shipping decisions.
For tooling, torchmetrics provides FID and CLIP score implementations. Use the PartiPrompts release for prompt-set parity with other reported results. You will also need a pairwise rating tool; most teams build their own, and the simplest is a web app that shows two images side by side and records the click. LangSmith or Braintrust can help if you need to integrate ratings with the rest of your eval stack.
Common pitfalls
- FID-only. FID correlates poorly with text-image alignment; a model can have great FID and ignore the prompt. The CLIP-plus-FID-plus-preference composite exists because no one of the three is enough.
- Tiny eval set. PartiPrompts has 1,600+ prompts for a reason; a 20-prompt eval has too much variance to detect anything but enormous changes.
- No safety eval. NSFW and IP flags are production-critical; without an automated classifier pass, you cannot detect regressions in this dimension.
Hamel's field-guide advice to look at the outputs applies directly to image generation: failure modes such as extra fingers and prompt drift, which a metric will not surface, are visible by eye within 30 minutes of inspection[4].
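To put a number on the tiny-eval-set pitfall, the sketch below computes the approximate 95% margin of error on a pairwise win rate for different prompt counts. It assumes one independent comparison per prompt and uses the normal approximation to the binomial; the function name is hypothetical.

```python
import math


def margin_of_error(n_prompts: int, p: float = 0.5, z: float = 1.96) -> float:
    """Approximate 95% margin of error on a win rate (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n_prompts)


for n in (20, 200, 1600):
    print(n, round(margin_of_error(n), 3))
# 20 0.219
# 200 0.069
# 1600 0.024
```

Under these assumptions, a 20-prompt eval cannot distinguish a 50% win rate from a 70% one, while 1,600 prompts resolve differences of a few points.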
Canonical reading
The FID paper is the canonical reference for the realism metric because it formalizes the distribution-distance interpretation that underpins most modern image quality work[2]. The CLIP paper is the canonical reference for the alignment metric[1]. Read them together: FID for what a "good" image looks like in distribution, CLIP for whether the image matches what was asked.
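For reference, the distribution distance behind FID is the Fréchet distance between two Gaussians fitted to Inception features, one for real images (mean \mu_r, covariance \Sigma_r) and one for generated images (mean \mu_g, covariance \Sigma_g). The subscript names are a notational choice made here:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower is better, and the value is zero when the two fitted Gaussians coincide. Nothing in the formula refers to the prompt, which is why FID alone cannot measure adherence.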
Day-1 starter
- 50 PartiPrompts plus 50 in-house prompts. CLIP score per prompt, FID over batches, and 3-rater pairwise preference against the prior model.
- A safety classifier pass on every output. Track classifier-flag rate over time; the rate is a regression signal independent of quality.
- Implicit-feedback dashboard (saves, edits, re-rolls) in production. The re-roll rate is the most useful single online signal; it tracks quality regression without needing explicit labels.
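A minimal sketch of assembling the 50-plus-50 starter prompt set with a fixed seed, so that every model is scored on the same prompts. The function name, the source tags, and the placeholder prompt strings are hypothetical; loading the real PartiPrompts file and your own briefs is left out.

```python
import random
from typing import List, Tuple


def build_starter_set(parti_prompts: List[str], inhouse_prompts: List[str],
                      n_each: int = 50, seed: int = 0) -> List[Tuple[str, str]]:
    """Sample n_each prompts from each source, reproducibly."""
    if len(parti_prompts) < n_each or len(inhouse_prompts) < n_each:
        raise ValueError("not enough prompts in one of the sources")
    rng = random.Random(seed)  # fixed seed keeps the eval set stable across runs
    picked = [("parti", p) for p in rng.sample(parti_prompts, n_each)]
    picked += [("inhouse", p) for p in rng.sample(inhouse_prompts, n_each)]
    return picked


# Placeholder prompts standing in for the real sources.
parti = [f"parti prompt {i}" for i in range(1600)]
inhouse = [f"creative brief {i}" for i in range(120)]
starter = build_starter_set(parti, inhouse)
print(len(starter))  # 100
```

Tagging each prompt with its source lets you report the public and in-house halves separately when they disagree.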