AI Evals
Runnable worked examples for evals you can ship this week.
The most useful, opinionated reference for AI evals.
AI Evals is a curated technical reference for evaluating large language model systems in production. The site covers error analysis, LLM-as-judge calibration, RAG evaluation, agentic evaluation, statistical rigor, and twenty task-specific eval playbooks. Every claim is cited.