AI Evals

https://www.aievals.co
Recent updates from aievals.co
en-us
Fri, 29 May 2026 16:19:28 GMT

Why evals matter
https://www.aievals.co/learn/foundations/why-evals-matter
Fri, 29 May 2026 00:00:00 GMT
Foundations
If you ship AI, evals are the moat. The case in one page.

The eval maturity model
https://www.aievals.co/learn/foundations/eval-maturity-model
Fri, 29 May 2026 00:00:00 GMT
Foundations
Five stages from vibe-checking to a continuous quality program. Where are you?

AI development as scientific method
https://www.aievals.co/learn/foundations/scientific-method-for-ai
Fri, 29 May 2026 00:00:00 GMT
Foundations
The observe, hypothesize, experiment, measure, iterate loop mapped onto an AI product team.

Why LLM evals are hard
https://www.aievals.co/learn/foundations/non-determinism-and-subjectivity
Fri, 29 May 2026 00:00:00 GMT
Foundations
Non-determinism, subjectivity, prompt sensitivity, and the reasons classical ML eval frameworks do not transfer.

Eval vs monitoring vs guardrails
https://www.aievals.co/learn/foundations/eval-vs-monitoring-vs-guardrails
Fri, 29 May 2026 00:00:00 GMT
Foundations
Three concepts that share traces but do different jobs, with the architectural placement of each in a request lifecycle.

The 60-80% rule
https://www.aievals.co/learn/error-analysis/the-60-80-rule
Fri, 29 May 2026 00:00:00 GMT
Error Analysis
Three failure modes usually account for most of your bugs. Find them, fix them, repeat.

Open coding for AI traces
https://www.aievals.co/learn/error-analysis/open-coding-workflow
Fri, 29 May 2026 00:00:00 GMT
Error Analysis
Read traces, write notes, cluster notes into failure modes, prioritize fixes. The three-stage process and the LLM-assisted parts.

Dimensional sampling
https://www.aievals.co/learn/error-analysis/dimensional-sampling
Fri, 29 May 2026 00:00:00 GMT
Error Analysis
Synthetic input construction across features, scenarios, and personas to cover the failure surface, not the demo path.

Failure-mode taxonomies
https://www.aievals.co/learn/error-analysis/failure-mode-taxonomies
Fri, 29 May 2026 00:00:00 GMT
Error Analysis
Bottom-up coding beats top-down templates, and why generic labels like hallucination mislead.

Build your trace viewer in an afternoon
https://www.aievals.co/learn/error-analysis/the-data-viewer
Fri, 29 May 2026 00:00:00 GMT
Error Analysis
The single highest-return tool for an eval program. Streamlit and Next.js variants, with the minimum feature set.

Case study: NurtureBoss
https://www.aievals.co/learn/error-analysis/case-study-nurtureboss
Fri, 29 May 2026 00:00:00 GMT
Error Analysis
One team's path from 66 percent date errors to 5 percent in a single error-analysis cycle.

LLM-as-Judge: the practical introduction
https://www.aievals.co/learn/llm-as-judge/introduction
Fri, 29 May 2026 00:00:00 GMT
LLM-as-Judge
What an LLM judge is, when to use one, when not to, and the three pitfalls that show up on every team's first attempt.

The four-part judge prompt
https://www.aievals.co/learn/llm-as-judge/judge-prompt-structure
Fri, 29 May 2026 00:00:00 GMT
LLM-as-Judge
A copyable template: role and definitions, criteria, few-shot from your train set, output format. Plus the runnable invocation.

Calibrating your judge against humans
https://www.aievals.co/learn/llm-as-judge/calibration-to-humans
Fri, 29 May 2026 00:00:00 GMT
LLM-as-Judge
TPR and TNR over agreement percent. Train, dev, test splits. The 100-example minimum. Iterate the prompt until both rates clear 0.90.

Judge biases you must defuse
https://www.aievals.co/learn/llm-as-judge/biases-and-mitigations
Fri, 29 May 2026 00:00:00 GMT
LLM-as-Judge
Position, verbosity, self-preference, format, sycophancy. Each with the empirical evidence and the mitigation you bake into the harness.

Where LLM judges fail
https://www.aievals.co/learn/llm-as-judge/judgebench-honest-take
Fri, 29 May 2026 00:00:00 GMT
LLM-as-Judge
JudgeBench: GPT-4o is roughly random on hard correctness judgments. The implications for your pipeline are concrete, not theoretical.

Pairwise vs direct scoring
https://www.aievals.co/learn/llm-as-judge/pairwise-vs-direct
Fri, 29 May 2026 00:00:00 GMT
LLM-as-Judge
When to ask the judge for a side-by-side ranking and when to ask for a single binary verdict.

Distilled judges
https://www.aievals.co/learn/llm-as-judge/distilled-judges
Fri, 29 May 2026 00:00:00 GMT
LLM-as-Judge
Why distillation matters in production, which shipped judges to know about, and the inline plus calibration architecture that holds up.

Constructing your eval dataset
https://www.aievals.co/learn/datasets/construction
Fri, 29 May 2026 00:00:00 GMT
Datasets & Golden Sets
How to build the first hundred labeled examples, then grow them past five hundred without losing signal.

Synthetic vs human-authored examples
https://www.aievals.co/learn/datasets/synthetic-vs-human
Fri, 29 May 2026 00:00:00 GMT
Datasets & Golden Sets
Where each shape of data buys signal, where each one leaks, and how to mix them without poisoning the set.

Versioning your eval set
https://www.aievals.co/learn/datasets/versioning-lineage
Fri, 29 May 2026 00:00:00 GMT
Datasets & Golden Sets
A content-hashed eval set pinned to every result is a contract with your past self.

Building regression sets
https://www.aievals.co/learn/datasets/regression-sets
Fri, 29 May 2026 00:00:00 GMT
Datasets & Golden Sets
The set whose job is to fail loudly the next time you reintroduce a bug you already fixed.

Dimensional coverage: features, scenarios, personas
https://www.aievals.co/learn/datasets/dimensional-coverage
Fri, 29 May 2026 00:00:00 GMT
Datasets & Golden Sets
A three-axis grid that tells you which cells of the failure surface your set actually probes.

Where to keep your eval datasets
https://www.aievals.co/learn/datasets/dataset-platforms
Fri, 29 May 2026 00:00:00 GMT
Datasets & Golden Sets
Langfuse, Phoenix, Braintrust, LangSmith, or a JSONL in git. The choice that actually matters.

Eval RAG in two layers
https://www.aievals.co/learn/rag-evals/retrieval-vs-generation
Fri, 29 May 2026 00:00:00 GMT
RAG Evals
Separate retrieval and generation or you cannot fix what breaks.

Retrieval metrics: Recall@K, MRR, NDCG, hit-rate
https://www.aievals.co/learn/rag-evals/retrieval-metrics
Fri, 29 May 2026 00:00:00 GMT
RAG Evals
The IR metrics that distinguish a retriever that found the right chunk from one that did not.

Faithfulness and atomic facts
https://www.aievals.co/learn/rag-evals/generation-faithfulness
Fri, 29 May 2026 00:00:00 GMT
RAG Evals
Decompose the answer into atomic claims, judge each claim binary, and stop measuring hallucination as a single yes/no.

Context precision and recall
https://www.aievals.co/learn/rag-evals/context-precision-recall
Fri, 29 May 2026 00:00:00 GMT
RAG Evals
The Ragas pair that tells you whether the retriever found the right chunks and whether they were the only chunks worth showing the model.

Synthetic query generation
https://www.aievals.co/learn/rag-evals/synthetic-query-generation
Fri, 29 May 2026 00:00:00 GMT
RAG Evals
Bootstrap a labeled RAG eval set from your own documents, without inventing ground truth.

Long-context RAG and RULER
https://www.aievals.co/learn/rag-evals/long-context-rag
Fri, 29 May 2026 00:00:00 GMT
RAG Evals
Why advertised context lengths exceed effective context lengths, and the benchmarks that expose the gap.

Non-English RAG eval
https://www.aievals.co/learn/rag-evals/non-english-rag
Fri, 29 May 2026 00:00:00 GMT
RAG Evals
What to measure when your corpus and your users do not share a language with most of the literature.

Trajectory-level evaluation
https://www.aievals.co/learn/agentic-evals/trajectory-vs-outcome
Fri, 29 May 2026 00:00:00 GMT
Agentic Evals
Outcome-only scoring ranks agents in the wrong order. Score the trajectory and the result.

Tool-call evaluation: AST, schema, execution
https://www.aievals.co/learn/agentic-evals/tool-call-evaluation
Fri, 29 May 2026 00:00:00 GMT
Agentic Evals
Three levels of tool-call evaluation: schema conformance, semantic AST equivalence, and execution-grounded outcomes. Use all three or you measure the wrong thing.

Sub-goal decomposition: plan graphs as eval objects
https://www.aievals.co/learn/agentic-evals/sub-goal-decomposition
Fri, 29 May 2026 00:00:00 GMT
Agentic Evals
Real agent tasks are graphs. Evaluating only the leaf hides 4-of-5 sub-goal progress. Sub-goal completion is the planning metric that surfaces it.

Pass^k: the metric that catches inconsistent agents
https://www.aievals.co/learn/agentic-evals/pass-k-and-consistency
Fri, 29 May 2026 00:00:00 GMT
Agentic Evals
Pass@1 hides catastrophic inconsistency. Report pass^k or you do not know how often the agent will repeatedly succeed.

Agent-as-a-judge: the trajectory-scoring pattern
https://www.aievals.co/learn/agentic-evals/agent-as-judge
Fri, 29 May 2026 00:00:00 GMT
Agentic Evals
Single-LLM judges lose the thread on long trajectories. An agent judge with its own tools can walk the steps and grade them independently.

Reward hacking detection
https://www.aievals.co/learn/agentic-evals/reward-hacking
Fri, 29 May 2026 00:00:00 GMT
Agentic Evals
Once an objective is optimized, it gets gamed. Reward hacking is empirically common; eval for it explicitly.

Environmental drift: robustness to tool and API change
https://www.aievals.co/learn/agentic-evals/environmental-drift
Fri, 29 May 2026 00:00:00 GMT
Agentic Evals
Most agent benchmarks freeze the environment. Production does not. Inject latency, schema shifts, and transient errors. Score the recovery, not the happy path.

Regression suites for agent loops
https://www.aievals.co/learn/agentic-evals/regression-suites
Fri, 29 May 2026 00:00:00 GMT
Agentic Evals
Trajectory regression is harder than answer regression. Replay 50 prod trajectories on every change and diff outcome, step count, and tool-call sequence.

Online vs offline: where each fires
https://www.aievals.co/learn/production/online-vs-offline
Fri, 29 May 2026 00:00:00 GMT
Production & Observability
Offline evals run on a frozen dataset for every change. Online evals sample real traffic continuously. They answer different questions and both belong in production.

Trace schema: what to log
https://www.aievals.co/learn/production/trace-schema
Fri, 29 May 2026 00:00:00 GMT
Production & Observability
The trace is the unit of analysis for every eval program. This is the schema, the fields that matter, and the storage shape that makes downstream queries cheap.

Sampling strategies for production traces
https://www.aievals.co/learn/production/sampling-strategies
Fri, 29 May 2026 00:00:00 GMT
Production & Observability
Uniform sampling is the default and the wrong default for tail slices. Stratify on the dimensions you care about, oversample failures, and reweight.

Drift detection for production AI
https://www.aievals.co/learn/production/drift-detection
Fri, 29 May 2026 00:00:00 GMT
Production & Observability
Track input and output distributions over time. Page on changes that matter; ignore the ones that do not. The hardest part is deciding which is which.

Cost and latency belong in the scorecard
https://www.aievals.co/learn/production/cost-latency-eval
Fri, 29 May 2026 00:00:00 GMT
Production & Observability
A model that wins quality and triples p95 latency is not a win. Put quality, cost, and latency on one scorecard and pick on the Pareto frontier.

Wiring evals into CI
https://www.aievals.co/learn/production/ci-integration
Fri, 29 May 2026 00:00:00 GMT
Production & Observability
Tier the suite into smoke, full, and nightly. Cap cost. Shard. Publish to the PR. Make the merge gate boring and the merge decision data-driven.

Release gates from eval results
https://www.aievals.co/learn/production/release-gates
Fri, 29 May 2026 00:00:00 GMT
Production & Observability
A release gate is a contract between the team and the user. Define the bands, automate the comparison, and hold the line when a regression candidate wants out.

Incident response for AI products
https://www.aievals.co/learn/production/incident-response
Fri, 29 May 2026 00:00:00 GMT
Production & Observability
An AI postmortem looks different from a service postmortem. The root cause is rarely a line of code. The actions are usually eval-set changes.

Confidence intervals: Wilson and Clopper-Pearson
https://www.aievals.co/learn/statistics/confidence-intervals
Fri, 29 May 2026 00:00:00 GMT
Statistics & Rigor
Wald is wrong. Wilson is the practical default. Clopper-Pearson is the conservative exact. With code and the case for why N=10 evals tell you almost nothing.

Bootstrap and paired tests for noisy eval metrics
https://www.aievals.co/learn/statistics/bootstrap-and-paired-tests
Fri, 29 May 2026 00:00:00 GMT
Statistics & Rigor
When the metric is not a clean Bernoulli, bootstrap. When the systems share inputs, pair. Both close the gap between point estimates and actionable decisions.

Statistical power: how big does the eval set need to be
https://www.aievals.co/learn/statistics/statistical-power
Fri, 29 May 2026 00:00:00 GMT
Statistics & Rigor
Power is the probability of detecting the delta you care about. Pick the delta first, derive the N, then commit. Reversing the order produces evals nobody can act on.

Clustered standard errors: when CIs lie
https://www.aievals.co/learn/statistics/clustered-standard-errors
Fri, 29 May 2026 00:00:00 GMT
Statistics & Rigor
When eval questions cluster (multiple questions per topic, multiple turns per session), naive standard errors are three times too small. The cluster-aware variance formula and code, with the...

Inter-rater agreement: Cohen's kappa and Krippendorff's alpha
https://www.aievals.co/learn/statistics/inter-rater-agreement
Fri, 29 May 2026 00:00:00 GMT
Statistics & Rigor
Two qualified humans disagree on your rubric. That is information about the rubric, not about the raters. Measure agreement before you trust labels.

Multiple comparisons: Bonferroni vs BH-FDR
https://www.aievals.co/learn/statistics/multiple-comparisons
Fri, 29 May 2026 00:00:00 GMT
Statistics & Rigor
Run twenty tests at alpha 0.05 and you expect one false positive. Correct for it. Bonferroni for strict family-wise control; Benjamini-Hochberg for false-discovery control.

Effect size: Cohen's d and absolute deltas
https://www.aievals.co/learn/statistics/effect-size
Fri, 29 May 2026 00:00:00 GMT
Statistics & Rigor
A statistically significant 0.3-point win on a 0-100 rubric is not worth shipping. Report the magnitude, not just the p-value.

Walking the OWASP Top 10 for LLM apps
https://www.aievals.co/learn/safety-and-red-team/owasp-llm-top-10
Fri, 29 May 2026 00:00:00 GMT
Safety & Red-Team
The ten failure categories every LLM application owner should be probing, with the smallest useful test for each.

Designing a red-team program
https://www.aievals.co/learn/safety-and-red-team/red-team-program-design
Fri, 29 May 2026 00:00:00 GMT
Safety & Red-Team
The operating model for an LLM red team: frequency, staffing, intake, scoring, and where it sits relative to automated regression.

HarmBench in practice
https://www.aievals.co/learn/safety-and-red-team/harmbench-walkthrough
Fri, 29 May 2026 00:00:00 GMT
Safety & Red-Team
A code walk through the HarmBench corpus, the StrongREJECT scoring classifier, and how to read the leaderboard without overfitting to it.

AILuminate in practice
https://www.aievals.co/learn/safety-and-red-team/ailuminate-walkthrough
Fri, 29 May 2026 00:00:00 GMT
Safety & Red-Team
MLCommons' standardized safety benchmark, the twelve hazard categories, and how to read its grade scale alongside HarmBench.

Over-refusal and the balance question
https://www.aievals.co/learn/safety-and-red-team/over-refusal-and-balance
Fri, 29 May 2026 00:00:00 GMT
Safety & Red-Team
Why refusal rate alone is a misleading metric, the XSTest pattern for paired benign cases, and how to grade refusal against necessity.

Jailbreaks and defenses
https://www.aievals.co/learn/safety-and-red-team/jailbreaks-and-defenses
Fri, 29 May 2026 00:00:00 GMT
Safety & Red-Team
GCG-style adversarial suffixes, the empirical floor on defense success, and what your guardrail stack should actually do.