The NIST AI Risk Management Framework is the most widely adopted voluntary AI governance reference in the United States and a common pointer in vendor agreements outside it. It is not a checklist; it is a structured set of outcomes organized into four functions (Govern, Map, Measure, Manage) with seven cross-cutting characteristics for trustworthy AI 1. The reason this page exists is that the framework's vocabulary is the lingua franca for AI governance conversations in 2026, and the most useful artifact your team can produce is a cross-walk from RMF outcomes to your existing eval activities.
The four functions, with eval cross-walk
The mapping below uses NIST's RMF 1.0 function names. The "Eval activity" column maps each function to the work your team is likely already doing or planning, with the page on this site that covers it where applicable.
| Function | What NIST says | Eval activity |
|---|---|---|
| Govern | Policies, accountability, oversight | Anthropic RSP or OpenAI Preparedness Framework as a reference; written eval ownership; release-gate policy |
| Map | Context, use cases, stakeholders | Persona and use-case inventory; system-level risk identification; risk register entries |
| Measure | Quantitative and qualitative evaluation | Standing eval suite (offline and online), error analysis, LLM-as-Judge calibration, safety benchmarks (see HarmBench, AILuminate) |
| Manage | Risk prioritization, response, communication | Incident response process; post-market monitoring; board readouts; model and system cards |
The function names are deliberately broad. The cross-walk is what makes them actionable.
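One way to keep the cross-walk actionable is to hold it as data rather than as a slide. The following is a minimal sketch, not a prescribed format: the `CrosswalkRow` class and the helper names are hypothetical, and the activity lists are abbreviated from the table above.

```python
from dataclasses import dataclass, field


@dataclass
class CrosswalkRow:
    function: str       # RMF function name
    nist_summary: str   # short paraphrase of the function's scope
    eval_activities: list[str] = field(default_factory=list)


# Abbreviated from the table above; replace with your own activities.
CROSSWALK = [
    CrosswalkRow("Govern", "Policies, accountability, oversight",
                 ["Written eval ownership", "Release-gate policy"]),
    CrosswalkRow("Map", "Context, use cases, stakeholders",
                 ["Persona and use-case inventory", "Risk register entries"]),
    CrosswalkRow("Measure", "Quantitative and qualitative evaluation",
                 ["Standing eval suite", "Error analysis", "Safety benchmarks"]),
    CrosswalkRow("Manage", "Risk prioritization, response, communication",
                 ["Incident response process", "Post-market monitoring"]),
]


def to_markdown(rows):
    """Render the cross-walk as a three-column pipe table."""
    lines = ["| Function | What NIST says | Eval activity |", "|---|---|---|"]
    for r in rows:
        lines.append(
            f"| {r.function} | {r.nist_summary} | {'; '.join(r.eval_activities)} |"
        )
    return "\n".join(lines)


def unmapped(rows):
    """Return the functions that have no eval activity attached yet."""
    return [r.function for r in rows if not r.eval_activities]


print(to_markdown(CROSSWALK))
print("Unmapped functions:", unmapped(CROSSWALK))
```

Keeping the table in version control alongside the eval suite means a change to either shows up in the same review.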
The seven trustworthiness characteristics
NIST also defines seven characteristics that any trustworthy AI system should exhibit. Each maps to specific eval activities; the cross-walk below is the artifact most audits accept as evidence.
| Characteristic | Eval activity that demonstrates it |
|---|---|
| Valid and reliable | Accuracy benchmarks against held-out test set, confidence intervals on metrics, statistical rigor |
| Safe | Adversarial testing, red-team program, public safety benchmark grades (HarmBench, AILuminate) |
| Secure and resilient | Input-filter coverage, output-filter coverage, robustness under prompt perturbation |
| Accountable and transparent | Ownership documented per system, model card published, change log of model and prompt updates |
| Explainable and interpretable | Trace logging, decision rationale capture, judge critiques alongside scores |
| Privacy-enhanced | PII-canary probes in eval set, differential-privacy or redaction in prompts where applicable, data-retention policy documented |
| Fair (with harmful bias managed) | Demographic-slice testing on benchmarks where labels exist; disparity reporting per slice |
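Two entries in the table above, "confidence intervals on metrics" and "disparity reporting per slice", are simple enough to sketch. The example below assumes a binary pass/fail metric and uses the Wilson score interval, which is one reasonable choice among several; the function names, slice names, and counts are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def slice_report(results):
    """results maps slice name -> (num_correct, num_total).

    Returns per-slice accuracy, a 95% interval, and the gap to the best slice.
    """
    acc = {name: c / n for name, (c, n) in results.items()}
    best = max(acc.values())
    report = {}
    for name, (c, n) in results.items():
        lo, hi = wilson_interval(c, n)
        report[name] = {
            "accuracy": acc[name],
            "ci95": (lo, hi),
            "gap_to_best": best - acc[name],
        }
    return report


# Hypothetical counts, for illustration only.
example = {"slice_a": (180, 200), "slice_b": (150, 200)}
for name, row in slice_report(example).items():
    lo, hi = row["ci95"]
    print(f"{name}: acc={row['accuracy']:.3f} "
          f"ci95=({lo:.3f}, {hi:.3f}) gap={row['gap_to_best']:.3f}")
```

Reporting the interval next to the gap matters: a disparity between two small slices can sit entirely inside the noise.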
A working audit binder maps each row to specific eval artifacts with dates and owners. The artifact is one page per row; the binder is fewer than twenty pages for a serious organization. If yours is approaching a hundred, you are producing process documentation, not evidence.
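The "dates and owners" requirement can be checked mechanically. The sketch below assumes one record per artifact; `BinderEntry`, `binder_issues`, and the 180-day staleness window are illustrative choices, not anything the RMF specifies.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

CHARACTERISTICS = [
    "Valid and reliable",
    "Safe",
    "Secure and resilient",
    "Accountable and transparent",
    "Explainable and interpretable",
    "Privacy-enhanced",
    "Fair (with harmful bias managed)",
]


@dataclass
class BinderEntry:
    characteristic: str
    artifact: str
    owner: Optional[str]
    last_updated: date


def binder_issues(entries, today, max_age_days=180):
    """List characteristics with no artifact, and artifacts that lack an owner or are stale.

    max_age_days is an assumed review cadence; pick your own.
    """
    issues = []
    covered = {e.characteristic for e in entries}
    for c in CHARACTERISTICS:
        if c not in covered:
            issues.append(f"{c}: no artifact")
    for e in entries:
        if not e.owner:
            issues.append(f"{e.characteristic}: '{e.artifact}' has no owner")
        if (today - e.last_updated).days > max_age_days:
            issues.append(f"{e.characteristic}: '{e.artifact}' is stale")
    return issues


# Hypothetical entries, for illustration only.
entries = [
    BinderEntry("Safe", "Red-team report", None, date(2025, 1, 15)),
    BinderEntry("Valid and reliable", "Held-out accuracy report", "eval-team", date(2025, 12, 1)),
]
for issue in binder_issues(entries, today=date(2026, 1, 1)):
    print(issue)
```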
What the RMF buys you
Three concrete things, regardless of jurisdiction.
First, a vocabulary. When a customer's procurement team or a regulator asks how you manage AI risk, "we follow NIST AI RMF" is an answer that closes a meaningful percentage of conversations on its own. Microsoft, Google, IBM, and most major model providers publish RMF alignment claims in their trust portals 2. Saying yes is table stakes.
Second, the seed of a checklist. The RMF itself is not one, but the cross-walk above is the smallest useful version. The longer version, as a working document, becomes the index for your audit binder.
Third, a forcing function. Mapping your existing eval activities to RMF outcomes surfaces gaps. The most common gap is on the Govern side: teams have measurement and management but no documented policy for who decides what to ship and why. The second most common is on the Manage side: incidents happen and are fixed but no post-mortem feeds back into the eval suite. Both are cheap to close once named.
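The Manage-side gap described above is easy to detect if each incident record carries a pointer to the eval case it produced. This is a minimal sketch under that assumption; the `Incident` fields and the identifiers in the example are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    incident_id: str
    summary: str
    postmortem_done: bool
    # ID of the eval case added because of this incident, if any.
    regression_case_id: Optional[str] = None


def open_feedback_loops(incidents):
    """Return incidents that lack a post-mortem or never produced an eval case."""
    return [
        i for i in incidents
        if not (i.postmortem_done and i.regression_case_id)
    ]


# Hypothetical incident log, for illustration only.
log = [
    Incident("INC-1", "Refusal on benign medical query", True, "eval-case-0412"),
    Incident("INC-2", "PII echoed from retrieved document", True, None),
    Incident("INC-3", "Prompt-injection bypass of output filter", False, None),
]
for inc in open_feedback_loops(log):
    print(inc.incident_id, "-", inc.summary)
```

Running a check like this on a schedule turns "no post-mortem feeds back into the eval suite" from an audit finding into a routine backlog item.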
What the RMF does not buy you
The RMF is voluntary and outcome-based, not prescriptive. It does not name specific benchmarks. It does not set numerical thresholds. It does not certify your program. Several of these gaps are now being filled in by the EU AI Act (which does point at testing protocols for high-risk systems), by sector-specific regulators (financial services, healthcare), and by the GPAI Code of Practice 3. Adopt the RMF as the structuring document; layer the specifics on top.
Mapping to EU AI Act obligations
If your eval program is RMF-aligned, the EU AI Act mapping is short. The Measure and Manage functions correspond directly to the Act's accuracy testing (Art. 15) and post-market monitoring (Art. 72) obligations 3. The Govern function corresponds to the risk-management system (Art. 9) and the quality management system (Art. 17). The Map function corresponds to the intended-purpose specification and the scoping of the conformity assessment.
The shortest defensible audit answer is a one-page exhibit that lists each Act obligation, the RMF function it maps to, and the specific eval artifact that demonstrates it. That exhibit, with citations to the underlying documents (your risk register row, your model card section, your post-market report), is what auditors look for.
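The exhibit can be generated from a small table of records. The sketch below uses only the four articles cited in this section, pairs Art. 15 with Measure and Art. 72 with Manage following the paragraph above, and fills the artifact column with placeholder names; `ExhibitRow` and the helper functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ExhibitRow:
    obligation: str
    article: str
    rmf_function: str
    artifact: str  # empty string means no evidence identified yet


# Artifact names are placeholders; point them at your own documents.
EXHIBIT = [
    ExhibitRow("Risk-management system", "Art. 9", "Govern", "Risk register"),
    ExhibitRow("Accuracy testing", "Art. 15", "Measure", "Standing eval suite report"),
    ExhibitRow("Quality management system", "Art. 17", "Govern", ""),
    ExhibitRow("Post-market monitoring", "Art. 72", "Manage", "Post-market report"),
]


def render(rows):
    """Render the exhibit as a pipe table, marking rows with no artifact."""
    lines = [
        "| Act obligation | Article | RMF function | Eval artifact |",
        "|---|---|---|---|",
    ]
    for r in rows:
        lines.append(
            f"| {r.obligation} | {r.article} | {r.rmf_function} "
            f"| {r.artifact or 'MISSING'} |"
        )
    return "\n".join(lines)


def missing(rows):
    """Return the articles that have no supporting artifact."""
    return [r.article for r in rows if not r.artifact.strip()]


print(render(EXHIBIT))
print("Obligations without evidence:", missing(EXHIBIT))
```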
A 30-day starter
| Week | Action |
|---|---|
| 1 | Pick one product. List the seven RMF characteristics on a single page. Note one eval activity per characteristic. |
| 2 | Fill the gaps. The most common holes are documented ownership (Govern) and demographic-slice testing (Fair). |
| 3 | Stand up the model card. Anthropic's framing in their "Challenges in Evaluating AI Systems" piece is a useful reference for what to include and what to leave out 4. |
| 4 | Wire the audit binder index. One page per characteristic; one row per Act obligation; the cross-walk above as the master table. |
The cross-walk produced this month is the same document you will hand to a customer's procurement team, an internal compliance team, a regulator, or an acquirer. Build it once.
The next chapters cover the two major vendor scaling policies that practitioners reference when designing internal versions: the Anthropic Responsible Scaling Policy and the OpenAI Preparedness Framework.