The EU AI Act is the first major Western regulation that maps specific obligations onto AI system providers and deployers, and a meaningful share of those obligations are evaluation requirements. The Act entered into force on August 1, 2024, and applies in phases. By the time most readers see this page, the GPAI provisions are live and the high-risk regime is months away from full application. The cheatsheet below is the version you pin above your desk 1.
Application timeline
| Date | What goes live |
|---|---|
| Feb 2, 2025 | Prohibited practices ban (Article 5) and AI literacy obligations |
| Aug 2, 2025 | General-Purpose AI (GPAI) model obligations, including systemic-risk classification, codes of practice, and governance and penalty provisions |
| Aug 2, 2026 | Full application: high-risk system obligations under Annex III, transparency obligations for limited-risk systems, sandbox and testing-in-real-world requirements |
| Aug 2, 2027 | Extended deadline for high-risk systems embedded in regulated products (Annex I), including medical devices and machinery |
Read the dates as the latest a given obligation can come into force. Member states and the AI Office have ratcheted up enforcement readiness ahead of each milestone; treating the dates as soft deadlines has not been a winning bet on prior EU technology regulation 1.
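The timeline can be kept as a small lookup so that release checklists query it rather than rely on memory. The following is a minimal sketch: `MILESTONES` only restates the table above in shorthand, and `obligations_in_force` is a hypothetical helper name, not anything defined by the Act.

```python
from datetime import date

# Milestones restated from the timeline table; labels are shorthand, not legal text.
MILESTONES = [
    (date(2025, 2, 2), "Prohibited practices ban (Art. 5) and AI literacy"),
    (date(2025, 8, 2), "GPAI model obligations, governance, penalties"),
    (date(2026, 8, 2), "High-risk (Annex III) and transparency obligations"),
    (date(2027, 8, 2), "High-risk systems embedded in regulated products (Annex I)"),
]


def obligations_in_force(on: date) -> list[str]:
    """Return the milestone labels whose application date is on or before `on`."""
    return [label for start, label in MILESTONES if start <= on]


if __name__ == "__main__":
    for label in obligations_in_force(date(2026, 1, 15)):
        print(label)
```

Because the dates are latest-possible application dates, a check like this gives a floor for what applies, not a reason to defer preparation.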
The four risk tiers
| Tier | What it covers | Eval implication |
|---|---|---|
| Prohibited (Art. 5) | Social scoring, manipulative systems, real-time biometric ID in public (narrow exceptions) | If you are in this tier, the obligation is not eval; it is exit |
| High-risk (Annex III) | Biometrics, critical infrastructure, education, employment, essential services, law enforcement, migration and border control, administration of justice and democratic processes | Quality management system, conformity assessment, post-market monitoring, automated logging, accuracy and robustness testing, human oversight |
| Limited-risk | Chatbots, generative content, emotion recognition (where used) | Transparency: users informed they are interacting with AI, content marked as AI-generated |
| Minimal-risk | Everything else | No specific obligations |
A separate axis cuts across the tiers for GPAI models (the regime targets foundation-model providers specifically) and for GPAI models with systemic risk (the largest models, currently identified by training-compute thresholds and Commission designation).
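For the compute threshold, Article 51(2) of the Act presumes high-impact capabilities when cumulative training compute exceeds 10^25 floating-point operations. The sketch below shows a rough self-check against that figure. The 6 × parameters × tokens estimate is a common rule of thumb for dense transformers and an assumption here, not a method the Act prescribes, and the function names are hypothetical.

```python
# Presumption threshold from Art. 51(2): cumulative training compute above 1e25 FLOPs.
SYSTEMIC_RISK_FLOP_THRESHOLD = 1e25


def estimate_training_flops(n_params: float, n_tokens: float) -> float:
    """Rough dense-transformer estimate: about 6 FLOPs per parameter per token.

    A rule of thumb used for illustration, not a calculation defined by the Act.
    """
    return 6.0 * n_params * n_tokens


def presumed_systemic_risk(training_flops: float) -> bool:
    """True when the estimated compute exceeds the presumption threshold."""
    return training_flops > SYSTEMIC_RISK_FLOP_THRESHOLD


if __name__ == "__main__":
    # Hypothetical model sizes, chosen only to land on either side of the threshold.
    for params, tokens in [(70e9, 15e12), (400e9, 15e12)]:
        flops = estimate_training_flops(params, tokens)
        print(f"{params:.0e} params, {tokens:.0e} tokens -> "
              f"{flops:.2e} FLOPs, presumed systemic risk: {presumed_systemic_risk(flops)}")
```

A result below the threshold does not settle the question, since Commission designation is a separate route into the systemic-risk class.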
GPAI obligations, in plain text
If you provide a general-purpose AI model placed on the EU market on or after August 2, 2025, three obligations apply regardless of tier:
- Technical documentation. A summary of training content, training methodology, energy consumption, and capability and limitation testing. The Commission publishes a template under the GPAI code of practice.
- Copyright policy. A documented policy for respecting EU copyright law, including a policy on text and data mining opt-outs.
- Summary of training data. A public-facing summary "sufficiently detailed" to let copyright holders assess whether their works were used.
If your model is classified as having systemic risk, four further obligations apply:
- Model evaluation under standardized protocols, including adversarial testing.
- Systemic-risk assessment and mitigation, with documentation maintained.
- Serious-incident reporting to the AI Office and national authorities.
- Cybersecurity protections appropriate to the model's capabilities.
The first of these four, model evaluation (obligation 4 if you count on from the three baseline obligations), is where the evals chapter of your governance program lives. The Act does not name specific benchmarks; the GPAI Code of Practice is the soft-law instrument that points at concrete protocols. Track the code's updates as you would the OWASP top 10: a moving target you re-baseline against 1.
High-risk obligations, in plain text
If you provide a high-risk AI system, you fall under one of the eight Annex III areas: biometrics, critical infrastructure, education, employment, essential public and private services, law enforcement, migration and border control, and administration of justice and democratic processes. The obligations include a quality management system, technical documentation maintained throughout the lifecycle, automated event logging, human oversight provisions, accuracy, robustness, and cybersecurity testing, and conformity assessment before placement on the market.
The eval-specific obligations are accuracy and robustness testing (Art. 15), risk management (Art. 9), and post-market monitoring (Art. 72). Accuracy and robustness testing must be documented and proportionate to intended use. The accepted approach is to point at a recognized framework (the NIST AI RMF maps cleanly, see the next chapter) and demonstrate that your eval activities cover the required dimensions 2.
What to document, regardless of tier
The minimum evidence pack for an Act audit conversation:
| Document | What it contains |
|---|---|
| Eval methodology summary | What benchmarks, what cadence, what release gates |
| Risk register (see AI risk register) | Identified risks, severity, mitigations, residual risk |
| Model and system cards (see Customer trust artifacts) | Capabilities, limitations, eval results with dates |
| Post-market monitoring report | Production incidents, response, model updates triggered |
| Human oversight description | Which humans, when, with what authority to override |
Microsoft and other major vendors publish reference implementations of this evidence pack 3. The shape is becoming standardized; you do not need to invent it from scratch.
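One way to keep the evidence pack honest is to track each document's last update and flag gaps. The sketch below assumes a hypothetical 90-day freshness window and hypothetical document keys that mirror the table rows; neither comes from the Act.

```python
from datetime import date, timedelta

# Keys mirror the rows of the evidence-pack table; the names are illustrative.
REQUIRED_DOCUMENTS = (
    "eval_methodology_summary",
    "risk_register",
    "model_and_system_cards",
    "post_market_monitoring_report",
    "human_oversight_description",
)


def evidence_pack_issues(last_updated: dict[str, date], today: date,
                         max_age_days: int = 90) -> dict[str, str]:
    """Return {document: "missing" | "stale"} for every document with a problem.

    `max_age_days` is an assumed internal policy, not a statutory period.
    """
    issues: dict[str, str] = {}
    for doc in REQUIRED_DOCUMENTS:
        updated = last_updated.get(doc)
        if updated is None:
            issues[doc] = "missing"
        elif today - updated > timedelta(days=max_age_days):
            issues[doc] = "stale"
    return issues


if __name__ == "__main__":
    pack = {
        "risk_register": date(2025, 12, 1),
        "model_and_system_cards": date(2025, 6, 1),
    }
    print(evidence_pack_issues(pack, today=date(2026, 1, 1)))
```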
A common misreading, named
The Act applies to systems "placed on the market or put into service" in the EU. Some teams read this as "we are a US company, this does not apply." It does. The Act has explicit extraterritorial scope when (a) the output of the system is used in the EU, (b) the system's deployer is established in the EU, or (c) the system is made available on the EU market. For most AI SaaS products, at least one of these is true.
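The three scope conditions reduce to a logical OR, which the sketch below makes explicit. The `Deployment` class and its field names are hypothetical, and the check is a first-pass screen rather than a legal determination.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    output_used_in_eu: bool           # (a) output of the system is used in the EU
    deployer_established_in_eu: bool  # (b) the deployer is established in the EU
    available_on_eu_market: bool      # (c) the system is made available on the EU market


def in_territorial_scope(d: Deployment) -> bool:
    """Any single condition is enough, wherever the provider is headquartered."""
    return (d.output_used_in_eu
            or d.deployer_established_in_eu
            or d.available_on_eu_market)


if __name__ == "__main__":
    # A US-based SaaS product with no EU entity whose outputs reach EU users.
    us_saas = Deployment(output_used_in_eu=True,
                         deployer_established_in_eu=False,
                         available_on_eu_market=False)
    print(in_territorial_scope(us_saas))
```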
The second common misreading is treating compliance as a one-time audit. The Act's post-market monitoring and risk-management provisions are continuous; the eval program is what feeds those obligations. A spike in production failure rate that goes uninvestigated and unanswered is a compliance issue in its own right, separate from the failure rate itself.
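Continuous monitoring implies some trigger that turns a production signal into an investigation. A minimal sketch follows, assuming a hypothetical rule that flags any window whose failure rate is at least double the baseline; the Act does not prescribe a threshold, so the ratio is an internal policy choice.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    failures: int
    total: int

    @property
    def failure_rate(self) -> float:
        """Failures divided by total requests, or 0.0 for an empty window."""
        return self.failures / self.total if self.total else 0.0


def needs_investigation(baseline: WindowStats, current: WindowStats,
                        ratio_threshold: float = 2.0) -> bool:
    """Flag the current window when its failure rate is at least
    `ratio_threshold` times the baseline rate (an assumed trigger rule)."""
    if baseline.failure_rate == 0.0:
        # With a clean baseline, any failure at all is worth a look.
        return current.failures > 0
    return current.failure_rate >= ratio_threshold * baseline.failure_rate


if __name__ == "__main__":
    baseline = WindowStats(failures=10, total=1000)
    this_week = WindowStats(failures=30, total=1000)
    print(needs_investigation(baseline, this_week))
```

The flag is only half the obligation: the investigation and response it triggers belong in the post-market monitoring report.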
What to do this quarter
- Identify whether your system is in scope and which tier it falls into (most systems are in scope; check Annex III against your product surfaces to see whether any of them count as high-risk).
- Map your existing eval activities to the four eval-adjacent obligations: accuracy/robustness, risk management, post-market monitoring, and adversarial testing for GPAI systemic-risk.
- Stand up the evidence pack above. The risk register, model card, and post-market monitoring report are the three artifacts that take real work; the rest are summaries of work you are already doing.
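The mapping step in the list above can start as a plain coverage check. In the sketch below, the obligation keys follow the four eval-adjacent obligations named in the list, while the activity inventory is entirely hypothetical and stands in for whatever your team already runs.

```python
# The four eval-adjacent obligations named in the text.
OBLIGATIONS = (
    "accuracy_robustness",
    "risk_management",
    "post_market_monitoring",
    "gpai_adversarial_testing",
)

# Hypothetical inventory: each eval activity lists the obligations it evidences.
EVAL_ACTIVITIES = {
    "regression benchmark suite": ["accuracy_robustness"],
    "quarterly red-team exercise": ["gpai_adversarial_testing", "risk_management"],
    "production failure dashboard": ["post_market_monitoring"],
}


def coverage_gaps(activities: dict[str, list[str]]) -> list[str]:
    """Return the obligations that no listed activity provides evidence for."""
    covered = {ob for obs in activities.values() for ob in obs}
    return [ob for ob in OBLIGATIONS if ob not in covered]


if __name__ == "__main__":
    print(coverage_gaps(EVAL_ACTIVITIES))
    print(coverage_gaps({"regression benchmark suite": ["accuracy_robustness"]}))
```

An empty result shows only that every obligation has some activity attached; whether that activity is sufficient evidence is a separate judgment.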
The next chapter, NIST AI RMF mapped, gives you the cross-walk that makes the eval-to-obligation mapping defensible in writing.