AI Evals

https://www.aievals.co AI Evals Recent updates from aievals.co 2026-05-29T00:00:00.000Z

https://www.aievals.co/learn/foundations/why-evals-matter Why evals matter 2026-05-29T00:00:00.000Z

If you ship AI, evals are the moat. The case in one page.

https://www.aievals.co/learn/foundations/eval-maturity-model The eval maturity model 2026-05-29T00:00:00.000Z

Five stages from vibe-checking to a continuous quality program. Where are you?

https://www.aievals.co/learn/foundations/scientific-method-for-ai AI development as scientific method 2026-05-29T00:00:00.000Z

The observe, hypothesize, experiment, measure, iterate loop mapped onto an AI product team.

https://www.aievals.co/learn/foundations/non-determinism-and-subjectivity Why LLM evals are hard 2026-05-29T00:00:00.000Z

Non-determinism, subjectivity, prompt sensitivity, and the reasons classical ML eval frameworks do not transfer.

https://www.aievals.co/learn/foundations/eval-vs-monitoring-vs-guardrails Eval vs monitoring vs guardrails 2026-05-29T00:00:00.000Z

Three concepts that share traces but do different jobs, with the architectural placement of each in a request lifecycle.

https://www.aievals.co/learn/error-analysis/the-60-80-rule The 60-80% rule 2026-05-29T00:00:00.000Z

Three failure modes usually account for most of your bugs. Find them, fix them, repeat.

https://www.aievals.co/learn/error-analysis/open-coding-workflow Open coding for AI traces 2026-05-29T00:00:00.000Z

Read traces, write notes, cluster notes into failure modes, prioritize fixes. The three-stage process and the LLM-assisted parts.

https://www.aievals.co/learn/error-analysis/dimensional-sampling Dimensional sampling 2026-05-29T00:00:00.000Z

Synthetic input construction across features, scenarios, and personas to cover the failure surface, not the demo path.

https://www.aievals.co/learn/error-analysis/failure-mode-taxonomies Failure-mode taxonomies 2026-05-29T00:00:00.000Z

Bottom-up coding beats top-down templates, and why generic labels like hallucination mislead.

https://www.aievals.co/learn/error-analysis/the-data-viewer Build your trace viewer in an afternoon 2026-05-29T00:00:00.000Z

The single highest-return tool for an eval program. Streamlit and Next.js variants, with the minimum feature set.

https://www.aievals.co/learn/error-analysis/case-study-nurtureboss Case study: NurtureBoss 2026-05-29T00:00:00.000Z

One team's path from 66 percent date errors to 5 percent in a single error-analysis cycle.

https://www.aievals.co/learn/llm-as-judge/introduction LLM-as-Judge: the practical introduction 2026-05-29T00:00:00.000Z

What an LLM judge is, when to use one, when not to, and the three pitfalls that show up on every team's first attempt.

https://www.aievals.co/learn/llm-as-judge/judge-prompt-structure The four-part judge prompt 2026-05-29T00:00:00.000Z

A copyable template: role and definitions, criteria, few-shot from your train set, output format. Plus the runnable invocation.

https://www.aievals.co/learn/llm-as-judge/calibration-to-humans Calibrating your judge against humans 2026-05-29T00:00:00.000Z

TPR and TNR over agreement percent. Train, dev, test splits. The 100-example minimum. Iterate the prompt until both rates clear 0.90.
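
A minimal sketch of the two rates named in this summary, assuming binary pass/fail labels where True means "pass". The function name `judge_rates` and the sample labels are illustrative, not taken from the article.

```python
def judge_rates(human, judge):
    """Return (TPR, TNR) of binary judge verdicts measured against human labels."""
    pairs = list(zip(human, judge))
    pos = sum(1 for h, _ in pairs if h)          # human said pass
    neg = len(pairs) - pos                       # human said fail
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative human label")
    tp = sum(1 for h, j in pairs if h and j)             # judge agrees on a pass
    tn = sum(1 for h, j in pairs if not h and not j)     # judge agrees on a fail
    return tp / pos, tn / neg


# Hypothetical labels: 4 human passes, 6 human fails.
human = [True, True, True, True, False, False, False, False, False, False]
judge = [True, True, True, False, False, False, False, False, False, True]
tpr, tnr = judge_rates(human, judge)
agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
```

Reporting the two rates separately shows which class the judge gets wrong, which a single agreement number cannot do.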

https://www.aievals.co/learn/llm-as-judge/biases-and-mitigations Judge biases you must defuse 2026-05-29T00:00:00.000Z

Position, verbosity, self-preference, format, sycophancy. Each with the empirical evidence and the mitigation you bake into the harness.

https://www.aievals.co/learn/llm-as-judge/judgebench-honest-take Where LLM judges fail 2026-05-29T00:00:00.000Z

JudgeBench: GPT-4o is roughly random on hard correctness judgments. The implications for your pipeline are concrete, not theoretical.

https://www.aievals.co/learn/llm-as-judge/pairwise-vs-direct Pairwise vs direct scoring 2026-05-29T00:00:00.000Z

When to ask the judge for a side-by-side ranking and when to ask for a single binary verdict.

https://www.aievals.co/learn/llm-as-judge/distilled-judges Distilled judges 2026-05-29T00:00:00.000Z

Why distillation matters in production, which shipped judges to know about, and the inline plus calibration architecture that holds up.

https://www.aievals.co/learn/datasets/construction Constructing your eval dataset 2026-05-29T00:00:00.000Z

How to build the first hundred labeled examples, then grow them past five hundred without losing signal.

https://www.aievals.co/learn/datasets/synthetic-vs-human Synthetic vs human-authored examples 2026-05-29T00:00:00.000Z

Where each shape of data buys signal, where each one leaks, and how to mix them without poisoning the set.

https://www.aievals.co/learn/datasets/versioning-lineage Versioning your eval set 2026-05-29T00:00:00.000Z

A content-hashed eval set pinned to every result is a contract with your past self.
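
One way to compute such a hash, sketched under the assumption that each example is a JSON-serializable dict and that example order should not affect the hash. The function name `eval_set_hash` and the canonicalization rules are illustrative choices, not the article's scheme.

```python
import hashlib
import json


def eval_set_hash(examples):
    """SHA-256 over a canonical serialization of the eval set.

    Assumption: key order and example order are irrelevant, so both are sorted
    before hashing. Any change to an example's content changes the digest.
    """
    lines = sorted(
        json.dumps(ex, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
        for ex in examples
    )
    digest = hashlib.sha256()
    for line in lines:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")  # delimiter so adjacent records cannot run together
    return digest.hexdigest()
```

Storing the returned digest next to every eval result records exactly which version of the set produced the number.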

https://www.aievals.co/learn/datasets/regression-sets Building regression sets 2026-05-29T00:00:00.000Z

The set whose job is to fail loudly the next time you reintroduce a bug you already fixed.

https://www.aievals.co/learn/datasets/dimensional-coverage Dimensional coverage: features, scenarios, personas 2026-05-29T00:00:00.000Z

A three-axis grid that tells you which cells of the failure surface your set actually probes.

https://www.aievals.co/learn/datasets/dataset-platforms Where to keep your eval datasets 2026-05-29T00:00:00.000Z

Langfuse, Phoenix, Braintrust, LangSmith, or a JSONL in git. The choice that actually matters.

https://www.aievals.co/learn/rag-evals/retrieval-vs-generation Eval RAG in two layers 2026-05-29T00:00:00.000Z

Separate retrieval and generation or you cannot fix what breaks.

https://www.aievals.co/learn/rag-evals/retrieval-metrics Retrieval metrics: Recall@K, MRR, NDCG, hit-rate 2026-05-29T00:00:00.000Z

The IR metrics that distinguish a retriever that found the right chunk from one that did not.
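
Two of the metrics in the title can be sketched in a few lines, assuming each query has a ranked list of retrieved chunk ids and a set of relevant chunk ids. The function names and data shapes are illustrative.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```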

https://www.aievals.co/learn/rag-evals/generation-faithfulness Faithfulness and atomic facts 2026-05-29T00:00:00.000Z

Decompose the answer into atomic claims, judge each claim binary, and stop measuring hallucination as a single yes/no.

https://www.aievals.co/learn/rag-evals/context-precision-recall Context precision and recall 2026-05-29T00:00:00.000Z

The Ragas pair that tells you whether the retriever found the right chunks and whether they were the only chunks worth showing the model.

https://www.aievals.co/learn/rag-evals/synthetic-query-generation Synthetic query generation 2026-05-29T00:00:00.000Z

Bootstrap a labeled RAG eval set from your own documents, without inventing ground truth.

https://www.aievals.co/learn/rag-evals/long-context-rag Long-context RAG and RULER 2026-05-29T00:00:00.000Z

Why advertised context lengths exceed effective context lengths, and the benchmarks that expose the gap.

https://www.aievals.co/learn/rag-evals/non-english-rag Non-English RAG eval 2026-05-29T00:00:00.000Z

What to measure when your corpus and your users do not share a language with most of the literature.

https://www.aievals.co/learn/agentic-evals/trajectory-vs-outcome Trajectory-level evaluation 2026-05-29T00:00:00.000Z

Outcome-only scoring ranks agents in the wrong order. Score the trajectory and the result.

https://www.aievals.co/learn/agentic-evals/tool-call-evaluation Tool-call evaluation: AST, schema, execution 2026-05-29T00:00:00.000Z

Three levels of tool-call evaluation: schema conformance, semantic AST equivalence, and execution-grounded outcomes. Use all three or you measure the wrong thing.

https://www.aievals.co/learn/agentic-evals/sub-goal-decomposition Sub-goal decomposition: plan graphs as eval objects 2026-05-29T00:00:00.000Z

Real agent tasks are graphs. Evaluating only the leaf hides 4-of-5 sub-goal progress. Sub-goal completion is the planning metric that surfaces it.

https://www.aievals.co/learn/agentic-evals/pass-k-and-consistency Pass^k: the metric that catches inconsistent agents 2026-05-29T00:00:00.000Z

Pass@1 hides catastrophic inconsistency. Report pass^k or you do not know how often the agent will repeatedly succeed.
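
The contrast can be sketched with the usual combinatorial estimators, assuming n independent trials of one task with c successes. Here pass^k is read as the probability that k sampled trials all succeed; the function names are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate P(at least one of k sampled trials succeeds) from c successes in n trials."""
    if not 1 <= k <= n:
        raise ValueError("require 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n, c, k):
    """Estimate P(all k sampled trials succeed) from c successes in n trials."""
    if not 1 <= k <= n:
        raise ValueError("require 1 <= k <= n")
    return comb(c, k) / comb(n, k)


# Hypothetical agent: 6 successes in 8 trials of the same task.
print(pass_at_k(8, 6, 1))   # 0.75
print(pass_hat_k(8, 6, 4))  # 15/70, about 0.214
```

An agent that looks 75 percent reliable on pass@1 succeeds four times in a row only about 21 percent of the time in this example.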

https://www.aievals.co/learn/agentic-evals/agent-as-judge Agent-as-a-judge: the trajectory-scoring pattern 2026-05-29T00:00:00.000Z

Single-LLM judges lose the thread on long trajectories. An agent judge with its own tools can walk the steps and grade them independently.

https://www.aievals.co/learn/agentic-evals/reward-hacking Reward hacking detection 2026-05-29T00:00:00.000Z

Once an objective is optimized, it gets gamed. Reward hacking is empirically common; eval for it explicitly.

https://www.aievals.co/learn/agentic-evals/environmental-drift Environmental drift: robustness to tool and API change 2026-05-29T00:00:00.000Z

Most agent benchmarks freeze the environment. Production does not. Inject latency, schema shifts, and transient errors. Score the recovery, not the happy path.

https://www.aievals.co/learn/agentic-evals/regression-suites Regression suites for agent loops 2026-05-29T00:00:00.000Z

Trajectory regression is harder than answer regression. Replay 50 prod trajectories on every change and diff outcome, step count, and tool-call sequence.

https://www.aievals.co/learn/production/online-vs-offline Online vs offline: where each fires 2026-05-29T00:00:00.000Z

Offline evals run on a frozen dataset for every change. Online evals sample real traffic continuously. They answer different questions and both belong in production.

https://www.aievals.co/learn/production/trace-schema Trace schema: what to log 2026-05-29T00:00:00.000Z

The trace is the unit of analysis for every eval program. This is the schema, the fields that matter, and the storage shape that makes downstream queries cheap.

https://www.aievals.co/learn/production/sampling-strategies Sampling strategies for production traces 2026-05-29T00:00:00.000Z

Uniform sampling is the default and the wrong default for tail slices. Stratify on the dimensions you care about, oversample failures, and reweight.

https://www.aievals.co/learn/production/drift-detection Drift detection for production AI 2026-05-29T00:00:00.000Z

Track input and output distributions over time. Page on changes that matter; ignore the ones that do not. The hardest part is deciding which is which.

https://www.aievals.co/learn/production/cost-latency-eval Cost and latency belong in the scorecard 2026-05-29T00:00:00.000Z

A model that wins quality and triples p95 latency is not a win. Put quality, cost, and latency on one scorecard and pick on the Pareto frontier.
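
A minimal sketch of the Pareto filter, assuming each candidate is a dict with hypothetical keys `quality` (higher is better), `cost`, and `p95_ms` (both lower is better). The key names and numbers are illustrative.

```python
def pareto_frontier(candidates):
    """Return the candidates that no other candidate dominates."""

    def dominates(a, b):
        # a dominates b if it is no worse on every axis and strictly better on one.
        no_worse = (
            a["quality"] >= b["quality"]
            and a["cost"] <= b["cost"]
            and a["p95_ms"] <= b["p95_ms"]
        )
        strictly_better = (
            a["quality"] > b["quality"]
            or a["cost"] < b["cost"]
            or a["p95_ms"] < b["p95_ms"]
        )
        return no_worse and strictly_better

    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Hypothetical scorecard rows.
candidates = [
    {"name": "A", "quality": 0.90, "cost": 1.0, "p95_ms": 800},
    {"name": "B", "quality": 0.92, "cost": 3.0, "p95_ms": 2400},
    {"name": "C", "quality": 0.88, "cost": 1.2, "p95_ms": 900},
]
frontier = [c["name"] for c in pareto_frontier(candidates)]
```

Here C drops out because A beats it on all three axes, while A and B both survive. Choosing between them is a product decision, not a statistical one.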

https://www.aievals.co/learn/production/ci-integration Wiring evals into CI 2026-05-29T00:00:00.000Z

Tier the suite into smoke, full, and nightly. Cap cost. Shard. Publish to the PR. Make the merge gate boring and the merge decision data-driven.

https://www.aievals.co/learn/production/release-gates Release gates from eval results 2026-05-29T00:00:00.000Z

A release gate is a contract between the team and the user. Define the bands, automate the comparison, and hold the line when a regression candidate wants out.

https://www.aievals.co/learn/production/incident-response Incident response for AI products 2026-05-29T00:00:00.000Z

An AI postmortem looks different from a service postmortem. The root cause is rarely a line of code. The actions are usually eval-set changes.

https://www.aievals.co/learn/statistics/confidence-intervals Confidence intervals: Wilson and Clopper-Pearson 2026-05-29T00:00:00.000Z

Wald is wrong. Wilson is the practical default. Clopper-Pearson is the conservative exact interval. With code and the case for why N=10 evals tell you almost nothing.
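
A sketch of the Wilson score interval for a pass rate, assuming independent binary outcomes and a 95 percent level (z of about 1.96). The function name is illustrative.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (default z gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(8, 10)
print(f"8/10 passes: [{low:.3f}, {high:.3f}]")
```

With 8 passes out of 10 the interval runs from roughly 0.49 to 0.94, wide enough to cover both a coin flip and a strong system.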

https://www.aievals.co/learn/statistics/bootstrap-and-paired-tests Bootstrap and paired tests for noisy eval metrics 2026-05-29T00:00:00.000Z

When the metric is not a clean Bernoulli, bootstrap. When the systems share inputs, pair. Both close the gap between point estimates and actionable decisions.

https://www.aievals.co/learn/statistics/statistical-power Statistical power: how big does the eval set need to be 2026-05-29T00:00:00.000Z

Power is the probability of detecting the delta you care about. Pick the delta first, derive the N, then commit. Reversing the order produces evals nobody can act on.

https://www.aievals.co/learn/statistics/clustered-standard-errors Clustered standard errors: when CIs lie 2026-05-29T00:00:00.000Z

When eval questions cluster (multiple questions per topic, multiple turns per session), naive standard errors are three times too small. The cluster-aware variance formula and code, with the...

https://www.aievals.co/learn/statistics/inter-rater-agreement Inter-rater agreement: Cohen's kappa and Krippendorff's alpha 2026-05-29T00:00:00.000Z

Two qualified humans disagree on your rubric. That is information about the rubric, not about the raters. Measure agreement before you trust labels.
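
A minimal sketch of Cohen's kappa for two raters labeling the same items, assuming categorical labels. The function name and the sample labels are illustrative.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if each rater labeled at random with their own marginals.
    expected = sum(count_a[l] * count_b[l] for l in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (observed - expected) / (1 - expected)


# Hypothetical labels: raw agreement is 0.8, kappa is about 0.58.
a = ["pass"] * 6 + ["fail"] * 4
b = ["pass"] * 5 + ["fail"] * 4 + ["pass"]
kappa = cohens_kappa(a, b)
```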

https://www.aievals.co/learn/statistics/multiple-comparisons Multiple comparisons: Bonferroni vs BH-FDR 2026-05-29T00:00:00.000Z

Run twenty tests at alpha 0.05 and you expect one false positive even when no effect is real. Correct for it. Bonferroni for strict family-wise control; Benjamini-Hochberg for false-discovery control.
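
Both corrections can be sketched in a few lines, assuming a list of p-values from independent tests. The function names and the sample p-values are illustrative.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject where p <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up BH procedure (controls the false-discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            cutoff = rank  # largest rank whose p-value clears its threshold
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected


# Hypothetical p-values from five metric comparisons.
p = [0.2, 0.001, 0.045, 0.012, 0.014]
print(bonferroni(p))          # [False, True, False, False, False]
print(benjamini_hochberg(p))  # [False, True, False, True, True]
```

Bonferroni keeps only the strongest result, while Benjamini-Hochberg keeps three. That difference is the practical trade between the two kinds of control.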

https://www.aievals.co/learn/statistics/effect-size Effect size: Cohen's d and absolute deltas 2026-05-29T00:00:00.000Z

A statistically significant 0.3-point win on a 0-100 rubric is not worth shipping. Report the magnitude, not just the p-value.

https://www.aievals.co/learn/safety-and-red-team/owasp-llm-top-10 Walking the OWASP Top 10 for LLM apps 2026-05-29T00:00:00.000Z

The ten failure categories every LLM application owner should be probing, with the smallest useful test for each.

https://www.aievals.co/learn/safety-and-red-team/red-team-program-design Designing a red-team program 2026-05-29T00:00:00.000Z

The operating model for an LLM red team: frequency, staffing, intake, scoring, and where it sits relative to automated regression.

https://www.aievals.co/learn/safety-and-red-team/harmbench-walkthrough HarmBench in practice 2026-05-29T00:00:00.000Z

A code walkthrough of the HarmBench corpus, the StrongREJECT scoring classifier, and how to read the leaderboard without overfitting to it.

https://www.aievals.co/learn/safety-and-red-team/ailuminate-walkthrough AILuminate in practice 2026-05-29T00:00:00.000Z

MLCommons' standardized safety benchmark, the twelve hazard categories, and how to read its grade scale alongside HarmBench.

https://www.aievals.co/learn/safety-and-red-team/over-refusal-and-balance Over-refusal and the balance question 2026-05-29T00:00:00.000Z

Why refusal rate alone is a misleading metric, the XSTest pattern for paired benign cases, and how to grade refusal against necessity.

https://www.aievals.co/learn/safety-and-red-team/jailbreaks-and-defenses Jailbreaks and defenses 2026-05-29T00:00:00.000Z

GCG-style adversarial suffixes, the empirical floor on defense success, and what your guardrail stack should actually do.