Image generation

Text prompt to image. Evaluate with human preference, CLIP score, and FID together; never FID alone, because FID correlates poorly with text-image alignment.

What "good" looks like

The primary metric is a composite of three signals. No single one is sufficient:

Human preference: pairwise A/B against the prior model or a strong baseline. The truth, but slow and expensive.
Prompt adherence: CLIP score, from the CLIP paper^[1]. Rewards text alignment, not aesthetic quality.
Realism: FID (Fréchet Inception Distance), from Heusel et al.^[2]. Lower is better. Rewards realism, not whether the image matches the prompt.

The target depends on the use case. Stock-photo replacement needs FID below a domain-specific threshold and a CLIP score that beats your prior model on a fixed prompt set. Creative tools need a pairwise human-preference win rate above 50% against the baseline.

Always report all three. Reporting only FID rewards realistic but off-prompt images.

flowchart TD
    GEN["Generated image"] --> CLIP["CLIP score beats prior model?"]
    GEN --> FID["FID below domain threshold?"]
    GEN --> PREF["Human preference above 50%?"]
    GEN --> SAFE{"NSFW classifier pass?"}
    SAFE -->|no| BLOCK["Block, regression signal"]
    SAFE -->|yes| GATE{"All three signals clear?"}
    CLIP --> GATE
    FID --> GATE
    PREF --> GATE
    GATE -->|yes| SHIP["Ship"]
    GATE -->|no| FIX["Fix weak axis, never FID alone"]
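The gate in the flowchart can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `EvalResult` fields, the `ship_decision` name, and the idea of passing thresholds in as plain numbers are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    # All field names are hypothetical; adapt them to your own eval output.
    clip_score: float           # mean CLIP score on the fixed prompt set (higher is better)
    prior_clip_score: float     # same metric for the prior model
    fid: float                  # FID against the domain reference set (lower is better)
    fid_threshold: float        # domain-specific ceiling chosen by the team
    preference_win_rate: float  # share of pairwise comparisons won against the baseline
    nsfw_flagged: bool          # True if the safety classifier flagged the output


def ship_decision(r: EvalResult) -> tuple[str, list[str]]:
    """Return ("block" | "ship" | "fix", list of weak axes)."""
    # Safety is checked first: a flagged output is blocked regardless of quality.
    if r.nsfw_flagged:
        return "block", ["safety"]

    weak = []
    if r.clip_score <= r.prior_clip_score:
        weak.append("clip")
    if r.fid >= r.fid_threshold:
        weak.append("fid")
    if r.preference_win_rate <= 0.5:
        weak.append("preference")

    # Ship only when all three signals clear; otherwise report every weak axis.
    return ("ship", []) if not weak else ("fix", weak)
```

Returning the list of weak axes, rather than a bare boolean, keeps the "fix weak axis" branch actionable: a model that clears FID but loses on CLIP score is reported as `("fix", ["clip"])`.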

Secondary metrics

Metric	What it catches	How to score
Compositionality	Counting (three cats), spatial relations (cat on top of dog), color binding (red ball next to blue cube)	Per-property pass/fail vs the prompt
Text-in-image fidelity	Legible specified text rendered	Rendered text matches the requested string
Safety	NSFW flags, IP concerns (real faces, trademarked logos)	Automated classifier plus IP checks
Aesthetic preference	Visual quality, decoupled from prompt match	Own rubric, not coupled to prompt adherence
Prompt-leak	Model renders prompt text into the image (an extreme failure mode)	Flag images drawing the prompt as text
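A rough sketch of how the per-property pass/fail scoring in the table might be aggregated. The tuple layout, the function names, and the choice to compare text case-insensitively are assumptions for illustration; the `rendered` string is assumed to come from an OCR or human transcription step that is not shown.

```python
from collections import defaultdict


def property_pass_rates(judgments):
    """Aggregate pass/fail judgments into a pass rate per property type.

    judgments: iterable of (prompt_id, property_type, passed) tuples, where
    property_type is e.g. "counting", "spatial", or "color_binding".
    """
    totals = defaultdict(lambda: [0, 0])  # property -> [passed, total]
    for _prompt_id, prop, passed in judgments:
        totals[prop][0] += int(passed)
        totals[prop][1] += 1
    return {prop: passed / total for prop, (passed, total) in totals.items()}


def text_matches(requested: str, rendered: str) -> bool:
    """Text-in-image fidelity check.

    Assumption: whitespace and case differences are ignored. Use a strict
    comparison instead if the product requires exact casing.
    """
    def normalize(s: str) -> str:
        return " ".join(s.split()).casefold()

    return normalize(requested) == normalize(rendered)
```

Reporting a rate per property, rather than one pooled number, shows whether a regression is in counting, spatial relations, or color binding.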

Offline eval design

Start on the public sets. PartiPrompts, from Google Research, is the canonical creative-prompt set: 1,600+ prompts across categories^[3]. HEIM (Holistic Evaluation of Image Models) gives cross-axis comparison across realism, aesthetics, and prompt adherence.

Add an in-house set. Build a creative-brief set of 50 to 200 prompts from your product distribution, paired with 3-rater pairwise preference labels against the baseline. Three metrics to track: CLIP score per prompt, FID over batches, and the pairwise preference.
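One way to turn the 3-rater labels into a win rate is a majority vote per prompt followed by a confidence interval. The sketch below assumes each rater's vote is stored as the string `"new"` or `"baseline"`, and uses a Wilson score interval; both choices, and the function names, are illustrative rather than prescribed by the eval design.

```python
import math
from collections import Counter


def majority_vote(votes):
    """Return the label chosen by most raters (no ties with 3 raters, 2 labels)."""
    return Counter(votes).most_common(1)[0][0]


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        raise ValueError("need at least one comparison")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def preference_summary(ratings):
    """ratings: dict mapping prompt_id -> list of three 'new'/'baseline' votes."""
    winners = [majority_vote(votes) for votes in ratings.values()]
    wins = sum(w == "new" for w in winners)
    lo, hi = wilson_interval(wins, len(winners))
    return {
        "win_rate": wins / len(winners),
        "ci": (lo, hi),
        # Treat the new model as preferred only if the whole interval is above 50%.
        "clears_50": lo > 0.5,
    }
```

With only a handful of prompts the interval is wide, so `clears_50` stays false even at a 75% raw win rate; that is the intended behavior of the check.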

Run an NSFW classifier on every generation. The classifier-flag rate is a leading indicator of distribution drift after model swaps.

Online eval design

Implicit feedback is the primary online signal. Saves, edits, and re-rolls carry the most information. A high re-roll rate on a fixed prompt distribution is the leading indicator of quality regression. Layer explicit thumbs-up and thumbs-down per image on top.
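As a minimal sketch, the implicit-feedback rates could be computed from an event log like this. The event schema (a dict with an `action` key taking the values `generate`, `save`, `edit`, or `reroll`) is an assumption for the example, not a real logging format.

```python
from collections import Counter


def implicit_feedback_rates(events):
    """Return save, edit, and re-roll rates per generation.

    events: iterable of dicts with an "action" key (hypothetical schema).
    """
    counts = Counter(e["action"] for e in events)
    generations = counts["generate"]
    if generations == 0:
        return {}
    # Counter returns 0 for actions that never occurred.
    return {action: counts[action] / generations for action in ("save", "edit", "reroll")}
```

For the re-roll rate to be comparable across model versions, compute it on the same prompt distribution each time, as the paragraph above notes.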

Keep the safety pass live. Run the NSFW classifier on every output. Alert on classifier-flag-rate increases and on style-drift signatures after model swaps: a swap can shift the color palette or composition style, and users notice.
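A simple way to alert on a flag-rate increase is a one-sided two-proportion z-test between a baseline window and the current window. This is one possible approach under stated assumptions: the function name, the windowing, and the `z_crit=2.33` cutoff (roughly a 1% one-sided level) are illustrative choices.

```python
import math


def flag_rate_alert(base_flags, base_n, cur_flags, cur_n, z_crit=2.33):
    """Return True if the current flag rate is significantly above baseline.

    base_flags / base_n: flagged and total outputs in the baseline window.
    cur_flags / cur_n: flagged and total outputs in the current window.
    """
    p_base = base_flags / base_n
    p_cur = cur_flags / cur_n
    pooled = (base_flags + cur_flags) / (base_n + cur_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / cur_n))
    if se == 0:
        return False  # no flags (or all flags) in both windows: nothing to compare
    z = (p_cur - p_base) / se
    return z > z_crit  # one-sided: only increases trigger the alert
```

For example, `flag_rate_alert(50, 10_000, 90, 10_000)` compares a 0.5% baseline rate with a 0.9% current rate. Style drift needs a different detector; this sketch covers only the classifier-flag rate.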

Top benchmarks

PartiPrompts, https://github.com/google-research/parti. 1,600+ creative prompts; the standard for creative-prompt breadth^[3].
HEIM. Holistic eval across realism, aesthetics, and prompt adherence.
Your in-house creative-brief set. 50 to 200 prompts matching your product distribution; the eval that predicts shipping decisions.

Tools especially suited

torchmetrics: FID and CLIP score implementations.
PartiPrompts harness: prompt-set parity across runs.
Internal pairwise rating tool: most teams build their own; the simplest shows two images side by side and records the click.
LangSmith or Braintrust: integrate ratings with the rest of your eval stack.

Common pitfalls

Pitfall	Why it bites	Fix
FID-only	FID correlates poorly with text-image alignment; a model can have great FID and ignore the prompt	Use the CLIP-plus-FID-plus-preference composite; no one of the three is enough
Tiny eval set	A 20-prompt eval has too much variance to detect anything but enormous changes	Evaluate at PartiPrompts scale; the set has 1,600+ prompts for a reason
No safety eval	NSFW and IP flags are production-critical; without a classifier pass you cannot detect regressions here	Run a classifier pass on every output
Not looking at outputs	Failure modes (extra fingers, prompt drift) are obvious by eye, but no metric surfaces them	Spend 30 minutes inspecting outputs, as Hamel's field guide argues^[4]
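The "Tiny eval set" row can be made concrete for a pairwise win rate. Assuming each prompt yields one independent win/loss outcome, and using the normal approximation, the 95% margin of error shrinks with the square root of the prompt count. The function name and these simplifying assumptions are mine, for illustration only.

```python
import math


def win_rate_margin(n, p=0.5, z=1.96):
    """Approximate half-width of a 95% interval for a win rate over n prompts.

    p=0.5 is the worst case (widest interval).
    """
    return z * math.sqrt(p * (1 - p) / n)


for n in (20, 200, 1600):
    print(f"{n:>5} prompts: +/- {win_rate_margin(n):.3f}")
```

At 20 prompts the margin is roughly 22 percentage points, so a 55% win rate is indistinguishable from a coin flip; at 1,600 prompts it is roughly 2.5 points.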

Canonical reading

The FID paper is the canonical reference for the realism metric because it formalizes the distribution-distance interpretation that underpins most modern image quality work^[2]. The CLIP paper is the canonical reference for the alignment metric^[1]. Read them together: FID for what a "good" image looks like in distribution, CLIP for whether it matches what was asked.
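For reference, the two metrics can be written compactly. In the notation assumed here, $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of Inception features for real and generated images, and $E_I$, $E_T$ are the CLIP embeddings of the image and the prompt text. The exact scaling applied to the cosine similarity varies by implementation, so the CLIP score is shown only up to that scaling.

```latex
% FID: Frechet distance between two Gaussians fitted to Inception features
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)

% CLIP score: cosine similarity between image and text embeddings,
% clipped at zero and rescaled by an implementation-specific constant
\cos(E_I, E_T) = \frac{E_I \cdot E_T}{\lVert E_I \rVert \, \lVert E_T \rVert}
```

The FID expression contains no term that depends on the prompt, which is why a model can score well on FID while ignoring what was asked.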

Day-1 starter

50 PartiPrompts plus 50 in-house prompts. CLIP score per prompt, FID over batches, and 3-rater pairwise preference against the prior model.
A safety classifier pass on every output. Track classifier-flag rate over time; the rate is a regression signal independent of quality.
Implicit-feedback dashboard (saves, edits, re-rolls) in production. The re-roll rate is the most useful single online signal; it tracks quality regression without needing explicit labels.
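To keep the starter prompt set identical across runs, the 50-plus-50 sample can be drawn once with a fixed seed. A minimal sketch, assuming both prompt pools are already loaded as lists of strings; `build_starter_set` and its parameters are hypothetical names.

```python
import random


def build_starter_set(public_prompts, inhouse_prompts, k=50, seed=0):
    """Sample k prompts from each pool, reproducibly.

    A dedicated Random instance avoids disturbing the global random state.
    """
    rng = random.Random(seed)
    return rng.sample(public_prompts, k) + rng.sample(inhouse_prompts, k)
```

Persisting the sampled list alongside the eval results is a reasonable extra safeguard, since changing the pool's contents or ordering changes the sample even with the same seed.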

Cross-cutting links

Pairwise comparison and Elo: the mechanics behind the 3-rater preference eval.
Inter-rater agreement (Cohen's kappa and Krippendorff's alpha): checking that raters agree before trusting their preferences.
Drift detection for production AI: catching style drift after model swaps.
HELM: the harness family whose HEIM extension covers image-generation eval.
Adding error bars to your evals: deciding whether a preference delta is real.