Designing a red-team program

The operating model for an LLM red team: frequency, staffing, intake, scoring, and where it sits relative to automated regression.

A red team is the part of the program that produces adversarial cases your regression suite has never seen. The Anthropic study by Ganguli et al. is among the largest public records of what a red-team workflow looks like in practice: 38,961 human attack attempts across multiple models, with the attempts themselves released for downstream use ¹. The lessons in that paper, plus the operational notes in Anthropic's own writeup on evaluating AI systems ², form the spine of the program described here.

The argument: a red team that exists only at audit time produces audit-time numbers. A team that runs on a weekly cadence, with intake from production, produces a stream of regression cases that makes it progressively harder to ship a bad change past your suite.

The four pieces of a program

A working red-team program has four pieces. None of them is the headline finding; the headline finding is what they produce together.

Intake. Every adversarial case that ever finds a real failure goes into a labeled corpus. Sources: the red team's own work, production incident reports, customer-reported jailbreaks, public benchmark cases that your model also fails on, and crowdworker submissions if you run a bounty. The corpus is the artifact that compounds over time. Without intake, you are paying a red team to retest the same five categories every quarter.
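
A minimal sketch of what a corpus record and an intake step might look like, assuming Python and an in-memory store. The `Case` and `Corpus` names, the source labels, and the whitespace-and-case normalisation used for deduplication are all illustrative assumptions; a real corpus would likely live in a repo or database and need fuzzier duplicate detection.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import date

# Hypothetical source labels, mirroring the intake sources listed above.
SOURCES = {"red_team", "incident", "customer_report", "public_benchmark", "bounty"}


@dataclass(frozen=True)
class Case:
    prompt: str
    source: str
    category: str = "uncategorized"  # may be assigned after the fact
    added: date = field(default_factory=date.today)

    @property
    def key(self) -> str:
        # Collapse whitespace and case so trivially identical prompts match.
        normalised = " ".join(self.prompt.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


class Corpus:
    def __init__(self) -> None:
        self._cases: dict[str, Case] = {}

    def add(self, case: Case) -> bool:
        """Store a case; return False if an equivalent prompt already exists."""
        if case.source not in SOURCES:
            raise ValueError(f"unknown source: {case.source}")
        if case.key in self._cases:
            return False
        self._cases[case.key] = case
        return True

    def __len__(self) -> int:
        return len(self._cases)
```

Recording the source on every case is what lets you later check whether the corpus is actually being fed from production and customers, or only from the red team itself.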

Cadence. Twice a month for manual, weekly for automated. Manual sessions are where humans produce novel attack categories. Automated runs are where you regress against the standing corpus. Both are scoped: a manual session has a stated target (one product surface, one threat model, six hours) and produces a written summary. Automation runs over the full corpus every Monday morning and reports any movement.

Scoring. Binary per-case judgments, like every other rubric on this site. A case either elicited a harmful output or it did not. The harmfulness threshold is a written rubric, not a vibe call; the rubric is calibrated against humans the same way an LLM-judge is. Aggregate metrics: attack success rate (ASR) per category, refusal rate on a paired benign set, time-to-bypass for new defenses. Report each separately; do not collapse them into a single safety score that hides the trade-off.
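
As a rough illustration of the aggregate metrics, the sketch below computes ASR per category and the refusal rate on a paired benign set from binary judgments, and reports them side by side. The function names and input shapes are assumptions made for the example; time-to-bypass is left out because it depends on how you timestamp defenses.

```python
from collections import defaultdict


def attack_success_rate(results):
    """results: iterable of (category, harmful) pairs, harmful being a bool.

    Returns a dict mapping each category to its attack success rate.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, harmful in results:
        totals[category] += 1
        hits[category] += int(harmful)
    return {category: hits[category] / totals[category] for category in totals}


def refusal_rate(benign_refused):
    """benign_refused: list of bools, True when a benign prompt was refused."""
    if not benign_refused:
        return 0.0
    return sum(benign_refused) / len(benign_refused)


def report(results, benign_refused):
    # Keep the two metrics separate rather than folding them into one score.
    return {
        "asr_by_category": attack_success_rate(results),
        "benign_refusal_rate": refusal_rate(benign_refused),
    }
```

A model that refuses everything would score a perfect ASR here and a terrible benign refusal rate, which is the trade-off a single combined score would hide.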

Roles. A program director (one person, fractional time), a manual red team (three to five people, mix of internal and external), an automation engineer (one person, who owns the regression harness), and a triage path into engineering. The director's job is to keep the intake fed and the cadence honest; the engineer's job is to make sure every failing case becomes a regression test.

Manual vs automated, divided cleanly

The line between manual and automated red teaming is straightforward once you stop pretending the same artifact serves both.

Manual red teaming finds new attack categories. Humans are still better at producing attacks that exploit context, ambiguity, and product-specific knowledge. A six-hour manual session typically produces two to five new attack categories worth adding to the corpus. Most attacks within those categories can then be automated.

Automated red teaming, including the public corpora (HarmBench, AILuminate, AdvBench) and your own accumulated corpus, runs continuously ³. The goal is to catch regressions, not to discover new attacks. Optimization-based automated red teaming (GCG-style suffix attacks) sits in the middle: it finds new specific cases within known categories at a cost that is high but bounded. Promptfoo's red-team module is the cheapest entry point for a working automated harness ⁴.

NOTE

If your automated suite has not added new categories of cases in a quarter, your manual program is dormant or your intake pipeline is broken. The right ratio is roughly fifty cases added per month from the manual side, against ten thousand or so already in the automated suite.
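
The dormancy check in the note can be sketched as a small function over the corpus. This is an illustration only: it assumes each case carries a category and an added date, and it treats "a quarter" as a 90-day window, which is a simplification.

```python
from datetime import date, timedelta


def new_categories_since(cases, today, window_days=90):
    """cases: iterable of (category, added_date) pairs.

    Returns the sorted categories whose earliest case falls inside the window.
    An empty result suggests the manual program or intake has stalled.
    """
    first_seen = {}
    for category, added in cases:
        if category not in first_seen or added < first_seen[category]:
            first_seen[category] = added
    cutoff = today - timedelta(days=window_days)
    return sorted(c for c, seen in first_seen.items() if seen >= cutoff)


# Example with made-up categories and dates.
cases = [
    ("roleplay", date(2024, 11, 1)),
    ("roleplay", date(2025, 6, 1)),    # new case, old category
    ("tool_abuse", date(2025, 5, 20)),  # category first seen in the window
]
fresh = new_categories_since(cases, today=date(2025, 6, 30))
```

Note that a new case in an existing category does not count; the check is about new categories, which is what the manual side is there to produce.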

What to copy from the Ganguli setup

Three patterns from the Anthropic 2022 study are worth lifting directly ¹. First, instructions to attackers were minimal and open-ended; the team did not pre-categorize the attacks, which let novel categories emerge. Second, the team tested models at multiple sizes and with different safety interventions (plain, prompted, rejection-sampled, and RLHF-trained); comparing attack success across these is what made the trends visible. Third, the corpus was released; downstream researchers used it to build defenses, and the model itself was made more robust by training against the corpus.

The operational copy: have your manual red team submit free-form attempts into a single bucket, not pre-categorized. Categorize after the fact, so the taxonomy emerges from the data. Run the same corpus against multiple model checkpoints. Treat the corpus as a shared artifact across teams, not a per-product asset.
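
Running one corpus against several checkpoints might look like the sketch below. The checkpoints and the judge are passed in as plain callables; the stub functions at the bottom are placeholders for illustration and stand in for real model calls and a real harmfulness rubric.

```python
from typing import Callable, Dict, Iterable


def asr_by_checkpoint(
    prompts: Iterable[str],
    checkpoints: Dict[str, Callable[[str], str]],
    is_harmful: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Run the same prompts through every checkpoint and return ASR per checkpoint."""
    prompts = list(prompts)
    rates = {}
    for name, generate in checkpoints.items():
        successes = sum(is_harmful(p, generate(p)) for p in prompts)
        rates[name] = successes / len(prompts) if prompts else 0.0
    return rates


# Stub checkpoints and judge, for illustration only.
def base_model(prompt: str) -> str:
    return "UNSAFE" if "attack" in prompt else "ok"


def tuned_model(prompt: str) -> str:
    return "I can't help with that."


def judge(prompt: str, response: str) -> bool:
    return response == "UNSAFE"


rates = asr_by_checkpoint(
    ["attack one", "attack two", "hello"],
    {"base": base_model, "tuned": tuned_model},
    judge,
)
```

Holding the corpus and the judge fixed while varying only the checkpoint is what makes the resulting numbers comparable.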

Boundary with the eval program

The red team's corpus is one input to your safety eval; the safety eval is one input to your release gate. The boundary matters because the failure modes you want to catch at release time are different from the failure modes a red team is actively producing.

Concretely, the red team produces the standing corpus. The eval program runs that corpus on every release and reports the diff. The release gate uses the diff plus a small set of always-on probes (the OWASP Top 10 floor, see the previous chapter) to make a ship-or-don't decision. The red team's findings, on average, lag the release window by two weeks; your gate cannot wait for them.
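
One way to picture the gate is as a function of the corpus diff and the probe results. The sketch below is an assumption-laden illustration: the `release_gate` name, the input shapes, and the policy of blocking on any newly failing case or failed probe are choices made for the example, not a prescribed rule.

```python
from typing import Dict


def release_gate(
    previous: Dict[str, bool],
    current: Dict[str, bool],
    probes_passed: Dict[str, bool],
) -> dict:
    """previous/current map case id -> True if the attack succeeded on that release.

    probes_passed maps always-on probe name -> True if the probe passed.
    """
    # Cases that succeed now but did not succeed (or did not exist) before.
    new_failures = sorted(
        cid for cid, harmful in current.items()
        if harmful and not previous.get(cid, False)
    )
    # Cases that succeeded before and are defended now.
    fixed = sorted(
        cid for cid, harmful in previous.items()
        if harmful and cid in current and not current[cid]
    )
    failed_probes = sorted(name for name, ok in probes_passed.items() if not ok)
    return {
        "ship": not new_failures and not failed_probes,  # assumed policy
        "new_failures": new_failures,
        "fixed": fixed,
        "failed_probes": failed_probes,
    }


decision = release_gate(
    previous={"c1": True, "c2": False},
    current={"c1": False, "c2": True, "c3": False},
    probes_passed={"prompt_injection": True},
)
```

The gate only sees cases already in the corpus, so a passing result says nothing about attacks the red team has not yet found.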

Anthropic frames the same boundary as "external red teaming as a complement to internal evals, not a substitute" ². The point is that a successful release on the standing corpus does not mean the model is safe; it means the model is no worse than the previous release on the attacks you currently know about.

A 90-day plan

Days	Action
0 to 14	Stand up the corpus repo. Seed with HarmBench and AILuminate cases (public). Add the first 50 internal cases by hand.
15 to 30	Run the corpus on the current production model. Write the first scoring rubric. Calibrate against a human spot-check.
31 to 60	Hire or contract the first three manual red-teamers. Two paid sessions per month.
61 to 90	Wire the automated run into CI as a non-blocking signal. Add the always-on OWASP probes to the release gate.

By day ninety, you have a corpus that grows monthly, a regression number that moves on each release, and a paper trail for the customer trust portal and any regulator who asks.

The next two chapters walk through the public corpora you should seed with: HarmBench in practice and AILuminate in practice.

¹ Ganguli et al., "Red Teaming Language Models to Reduce Harms," Anthropic (2022). https://arxiv.org/abs/2209.07858
² Anthropic, "Challenges in Evaluating AI Systems." https://www.anthropic.com/news/evaluating-ai-systems
³ Mazeika et al., "HarmBench" (2024). https://arxiv.org/abs/2402.04249
⁴ Promptfoo, "Red-team documentation." https://www.promptfoo.dev/docs/red-team/