AILuminate in practice

MLCommons' standardized safety benchmark, the twelve hazard categories, and how to read its grade scale alongside HarmBench.

AILuminate is MLCommons' attempt at a standardized safety benchmark with formal governance: a shared hazard taxonomy, a held-out test set, a public practice set, and a grade scale that lets non-technical buyers compare models without parsing per-category numbers. It is the closest thing the field has to an industry-wide safety benchmark, and it plays a different role in your eval program than HarmBench does ¹.

The argument: HarmBench is the benchmark you run for engineering signal; AILuminate is the benchmark you point at when a procurement team or board asks for a third-party-graded number. Both belong in the standing suite; neither is sufficient alone.

What AILuminate measures

AILuminate organizes prompts into twelve hazard categories: violent crimes, non-violent crimes, sex-related crimes, child sexual exploitation, defamation, specialized advice (medical, legal, financial), privacy violations, intellectual property, indiscriminate weapons (CBRN), hate, suicide and self-harm, and sexual content. The set includes both adversarial prompts written to elicit unsafe responses and benign prompts in the same domains, which is what enables the over-refusal scoring covered in the next chapter ¹.

The benchmark publishes a public practice set and reserves a held-out evaluation set under MLCommons governance. The held-out set is what produces the official grades; the practice set is for engineering use, including iteration during development. The held-out architecture is a real difference from HarmBench, where the entire corpus is open. Private hold-outs are harder to contaminate, though they age the same way.

The grade scale

AILuminate maps numerical safe-response rates to a grade on a five-tier scale (Poor, Fair, Good, Very Good, Excellent), reported per category and overall. The mapping is designed to give buyers a stable comparison even when the underlying numbers shift because the hazard set or scoring methodology evolves. A model's grade for a category is a function of its safe-response rate on adversarial prompts in that category, with the bar set higher for harm categories with greater potential consequence.
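To make the shape of that mapping concrete, here is a minimal sketch of a rate-to-grade lookup with stricter cutoffs for a higher-consequence category. Everything specific in it is an assumption for illustration: the tier labels are placeholders, the cutoff values and the category keys are invented, and the official mapping is defined by MLCommons and is not reproduced here.

```python
from bisect import bisect_right

# Placeholder tier labels, lowest to highest. Substitute the official grade names.
LABELS = ["tier-1", "tier-2", "tier-3", "tier-4", "tier-5"]

# Hypothetical cutoffs: the minimum safe-response rate needed to reach each
# tier above the lowest. These are NOT the official AILuminate thresholds.
DEFAULT_CUTOFFS = [0.80, 0.90, 0.95, 0.99]

# Hypothetical stricter cutoffs for a higher-consequence category.
STRICT_CUTOFFS = {
    "indiscriminate_weapons": [0.90, 0.97, 0.99, 0.999],
}


def grade_for(category: str, safe_rate: float) -> str:
    """Map a safe-response rate on adversarial prompts to a tier label."""
    if not 0.0 <= safe_rate <= 1.0:
        raise ValueError("safe_rate must be in [0, 1]")
    cutoffs = STRICT_CUTOFFS.get(category, DEFAULT_CUTOFFS)
    # bisect_right counts how many cutoffs the rate meets or exceeds.
    return LABELS[bisect_right(cutoffs, safe_rate)]


if __name__ == "__main__":
    for cat in ("hate", "indiscriminate_weapons"):
        print(cat, grade_for(cat, 0.96))
```

The same numerical rate lands in different tiers depending on the category, which is why the grade and the raw rate should be read together.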

The practical implication for your eval program: AILuminate's grade is the artifact you point at on your trust portal and in your model card. Your own per-category numbers from running the practice set should be reported alongside, with the date and the model fingerprint pinned. Auditors will accept the grade; engineers should read both.
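One way to keep the date and model fingerprint attached to your own practice-set numbers is to store them in a single record. The sketch below is an assumption about what such a record could look like: the field names, the `fingerprint` helper, and the choice of inputs to hash are hypothetical and are not part of any AILuminate tooling.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PracticeSetReport:
    model_name: str
    model_fingerprint: str   # hash of whatever identifies the served model
    benchmark_version: str   # practice-set version string as you recorded it
    run_date: str            # ISO 8601 date of the run
    safe_rate_by_category: dict[str, float]


def fingerprint(*parts: str) -> str:
    """Hash identifying strings (checkpoint id, system prompt, decoding config)."""
    h = hashlib.sha256()
    for part in parts:
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    return h.hexdigest()[:16]


def to_json(report: PracticeSetReport) -> str:
    """Serialize deterministically so two runs can be compared as text."""
    return json.dumps(asdict(report), sort_keys=True, indent=2)


if __name__ == "__main__":
    report = PracticeSetReport(
        model_name="example-model",
        model_fingerprint=fingerprint("checkpoint-id", "system-prompt", "temperature=0"),
        benchmark_version="practice-set-version",
        run_date="2026-05-01",
        safe_rate_by_category={"hate": 0.97},  # illustrative value, not a real result
    )
    print(to_json(report))
```

Hashing the system prompt and decoding settings along with the checkpoint matters because a change to either can move the safe-response rate without the weights changing.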

Running the practice set

The practice set is downloadable and runnable with a thin harness around your own model invocation. The same scoring discipline as HarmBench applies: a classifier (in this case the released AILuminate evaluator) labels each output, and aggregate metrics are reported per category and overall.

"""
AILuminate practice-set runner, sketched. The official harness handles
prompt formatting, model wrappers, and grade conversion.
"""
from dataclasses import dataclass
from typing import Callable

@dataclass
class Prompt:
    id: str
    hazard_category: str
    text: str
    intent: str  # "adversarial" or "benign"

@dataclass
class Outcome:
    prompt: Prompt
    output: str
    safe_label: bool   # classifier: True if response is safe

def run_one(
    prompt: Prompt,
    target_model: Callable[[str], str],
    safety_classifier: Callable[[Prompt, str], bool],
) -> Outcome:
    output = target_model(prompt.text)
    safe = safety_classifier(prompt, output)
    return Outcome(prompt, output, safe)

def per_category_safe_rate(outcomes: list[Outcome]) -> dict[str, float]:
    by_cat: dict[str, list[bool]] = {}
    for o in outcomes:
        if o.prompt.intent == "adversarial":
            by_cat.setdefault(o.prompt.hazard_category, []).append(o.safe_label)
    return {
        cat: sum(vals) / len(vals)
        for cat, vals in by_cat.items()
    }

def over_refusal_rate(outcomes: list[Outcome]) -> float:
    """Share of benign prompts whose response the classifier rejected.

    Assumes the classifier is intent-aware: on a benign prompt it returns
    True only for a helpful, non-refusing response, so a False label here
    counts as over-refusal.
    """
    benign = [o for o in outcomes if o.prompt.intent == "benign"]
    if not benign:
        return 0.0
    return sum(1 for o in benign if not o.safe_label) / len(benign)

The harness above is a sketch; the published one handles model-specific prompt formatting and the official grade conversion. Use the upstream harness for any number you report externally ¹.

How to read AILuminate next to HarmBench

The two benchmarks overlap and differ in ways that matter.

HarmBench emphasizes attack methods. It tests not just whether a model refuses a direct harmful request, but how it behaves under eighteen different adversarial transformations of the same request ². The number is dominated by attack-method capability.

AILuminate emphasizes hazard coverage. It standardizes the hazard taxonomy across the industry and grades on a stable scale ¹. The number is dominated by category coverage and the safe-response threshold per category.

In a working program, both go in. AILuminate is what you report externally; HarmBench is what you watch internally for ASR drift under specific attack methods. The Anthropic 2022 red-team corpus provides a useful third leg if you want a human-attack baseline rather than a synthetic one ³.
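Watching for ASR drift under specific attack methods can be sketched as a per-method attack success rate plus a comparison against a stored baseline. The record shape, the method names, and the 0.05 tolerance below are assumptions for illustration; this is not the HarmBench harness or its output format.

```python
from collections import defaultdict
from typing import Iterable


def asr_by_method(records: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """records: (attack_method, attack_succeeded) pairs from one eval run."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for method, succeeded in records:
        totals[method] += 1
        hits[method] += int(succeeded)
    return {method: hits[method] / totals[method] for method in totals}


def asr_drift(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.05,  # hypothetical alert threshold
) -> dict[str, float]:
    """Return methods whose ASR rose by more than `tolerance` since baseline."""
    flagged = {}
    for method, rate in current.items():
        if method in baseline and rate - baseline[method] > tolerance:
            flagged[method] = rate - baseline[method]
    return flagged


if __name__ == "__main__":
    # Hypothetical method names and outcomes.
    before = asr_by_method([("method_a", True), ("method_a", False),
                            ("method_b", False), ("method_b", False)])
    after = asr_by_method([("method_a", True), ("method_a", False),
                           ("method_b", True), ("method_b", False)])
    print(asr_drift(before, after))
```

Keeping the comparison per method, rather than on the pooled ASR, is what surfaces a regression against one attack that an improvement against another would otherwise mask.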

NOTE

Public benchmark grades are dated, and the date matters. A 2026-05 AILuminate grade is the model's score against the corpus version on that date. When the hazard set updates, the grade is not directly comparable; rerun on the new version before claiming a number.
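A small guard can enforce this rule wherever grades are compared. The sketch below assumes you record a corpus version string with every grade; the `GradeRecord` fields and the version strings are hypothetical, and grades are treated as opaque labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradeRecord:
    model_version: str
    corpus_version: str  # hypothetical version string for the hazard set
    graded_on: str       # ISO 8601 date
    grade: str           # opaque grade label as published


def comparable(a: GradeRecord, b: GradeRecord) -> bool:
    """Two grades are comparable only if they were scored on the same corpus version."""
    return a.corpus_version == b.corpus_version


def describe_change(old: GradeRecord, new: GradeRecord) -> str:
    if not comparable(old, new):
        return (
            f"not comparable: corpus {old.corpus_version} vs "
            f"{new.corpus_version}; rerun before claiming a change"
        )
    if old.grade == new.grade:
        return f"grade unchanged at {new.grade} on corpus {new.corpus_version}"
    return f"grade changed from {old.grade} to {new.grade} on corpus {new.corpus_version}"


if __name__ == "__main__":
    a = GradeRecord("model-1", "corpus-A", "2026-05-01", "grade-x")
    b = GradeRecord("model-2", "corpus-B", "2026-09-01", "grade-y")
    print(describe_change(a, b))
```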

What to put in your trust portal

A short table on your trust page that lists each public safety benchmark, the model version evaluated, the date, and the published grade or score is the minimum useful artifact for an enterprise procurement conversation. Pair it with a sentence noting that internal private-corpus testing is also performed, with the date of the most recent run, but without publishing the corpus itself. The customer wants to see the discipline; they do not need to see the cases.
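As an illustration of the table described above, the sketch below renders benchmark rows as aligned plain text. The column names follow the paragraph; the row values are placeholders, not real results, and `render_table` is a hypothetical helper rather than part of any benchmark tooling.

```python
from typing import Sequence


def render_table(headers: Sequence[str], rows: Sequence[Sequence[str]]) -> str:
    """Render rows as a left-aligned plain-text table with a header rule."""
    # Column width is the longest cell in that column, header included.
    widths = [max(len(str(cell)) for cell in col) for col in zip(headers, *rows)]

    def fmt(row: Sequence[str]) -> str:
        return "  ".join(str(c).ljust(w) for c, w in zip(row, widths)).rstrip()

    lines = [fmt(headers), fmt(["-" * w for w in widths])]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


if __name__ == "__main__":
    headers = ["Benchmark", "Model version", "Date", "Grade or score"]
    rows = [
        # Placeholder values; fill in from the published results.
        ["AILuminate", "<model version>", "<YYYY-MM-DD>", "<published grade>"],
        ["HarmBench", "<model version>", "<YYYY-MM-DD>", "<ASR>"],
    ]
    print(render_table(headers, rows))
```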

The next chapter, Over-refusal and the balance question, unpacks why a model that scores well on adversarial prompts can still be a bad product if its refusal rate on benign prompts is high.

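¹ MLCommons, "AILuminate." https://mlcommons.org/benchmarks/ailuminate/
² Mazeika et al., "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal" (2024). https://arxiv.org/abs/2402.04249
³ Ganguli et al., "Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned" (2022). https://arxiv.org/abs/2209.07858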