Safety & Red-Team

OWASP LLM Top 10, HarmBench, AILuminate, jailbreaks and defenses.

Safety evaluation is the part of the program where adversarial inputs meet your release gate. It is also the part where vendor leaderboard numbers are the least informative, because the threat model changes with every product surface and the public attack corpora go stale within a quarter of being published. The six chapters here cover the standards, the public benchmarks, the failure modes that public benchmarks miss, and the program structure that turns one-off audits into a continuous discipline.

The opinionated frame: treat safety as a calibrated classifier problem with two confusion matrices, not as a single refusal-rate number. Over-refusal is a real cost, jailbreak resistance is an empirical baseline you have to keep meeting, and the OWASP Top 10 is the floor, not the ceiling. Public benchmarks (HarmBench, AILuminate, AdvBench) anchor your model card. A private adversarial set built from your own surfaces is what protects shipping decisions. The condensed reference for the perturbation side of this work is robustness testing.
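To make the "not a single refusal-rate number" point concrete, here is a minimal sketch in Python. It assumes each graded prompt carries a ground-truth harmful/benign label and a refused/complied outcome. The `Graded` record, its field names, and the toy data are hypothetical, and the sketch covers only the refusal side of the picture, not the chapters' full method. It tallies the four cells and derives two error rates that an overall refusal rate hides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Graded:
    """One graded prompt. Both fields are hypothetical names for illustration."""
    harmful: bool  # ground-truth label: should the model have refused?
    refused: bool  # observed outcome: did the model refuse?


def refusal_confusion(results):
    """Tally the four cells of the harmful/benign x refused/complied table."""
    counts = {
        "harmful_refused": 0,
        "harmful_complied": 0,
        "benign_refused": 0,
        "benign_complied": 0,
    }
    for r in results:
        label = "harmful" if r.harmful else "benign"
        outcome = "refused" if r.refused else "complied"
        counts[f"{label}_{outcome}"] += 1
    return counts


def rates(counts):
    """Derive the single refusal rate plus the two error rates it conflates."""
    harmful = counts["harmful_refused"] + counts["harmful_complied"]
    benign = counts["benign_refused"] + counts["benign_complied"]
    total = harmful + benign
    refused = counts["harmful_refused"] + counts["benign_refused"]
    return {
        "refusal_rate": refused / total if total else 0.0,
        # Harmful prompts the model complied with (missed refusals).
        "attack_success_rate": counts["harmful_complied"] / harmful if harmful else 0.0,
        # Benign prompts the model refused (unnecessary refusals).
        "over_refusal_rate": counts["benign_refused"] / benign if benign else 0.0,
    }


# Toy data: two models, each refusing 4 of 8 prompts.
model_a = [Graded(True, True)] * 4 + [Graded(False, False)] * 4
model_b = (
    [Graded(True, True)] * 2 + [Graded(True, False)] * 2
    + [Graded(False, True)] * 2 + [Graded(False, False)] * 2
)

rates_a = rates(refusal_confusion(model_a))
rates_b = rates(refusal_confusion(model_b))
print(rates_a)
print(rates_b)
```

Both toy models refuse half of their prompts, yet the first refuses exactly the harmful ones while the second misses half of the harmful prompts and blocks half of the benign ones. A release gate keyed to refusal rate alone could not tell them apart.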

Chapters:

Walking the OWASP Top 10 for LLM apps. The ten failure categories, what each means for an evals practitioner, and the smallest probe you can run for each.
Designing a red-team program. Frequency, staffing, intake, scoring, and the boundary between manual red-team and automated regression.
HarmBench in practice. A code walkthrough of the public corpus, the StrongREJECT classifier, and how to read the leaderboard without overfitting to it.
AILuminate in practice. MLCommons' standardized safety benchmark, its 12 hazard categories, and where it sits relative to HarmBench in a real eval program.
Over-refusal and the balance question. Why refusal rate alone is a misleading metric, the XSTest pattern, and how to grade refusal against necessity.
Jailbreaks and defenses. GCG-style adversarial suffixes, the empirical floor on defense, and what your guardrail stack should actually do.