<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <id>https://www.aievals.co</id>
  <title>AI Evals</title>
  <subtitle>Recent updates from aievals.co</subtitle>
  <link href="https://www.aievals.co" />
  <link rel="self" href="https://www.aievals.co/atom.xml" />
  <updated>2026-05-29T00:00:00.000Z</updated>
  <entry>
    <id>https://www.aievals.co/learn/foundations/why-evals-matter</id>
    <title>Why evals matter</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/foundations/why-evals-matter" />
    <category term="Foundations" />
    <summary>If you ship AI, evals are the moat. The case in one page.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/foundations/eval-maturity-model</id>
    <title>The eval maturity model</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/foundations/eval-maturity-model" />
    <category term="Foundations" />
    <summary>Five stages from vibe-checking to a continuous quality program. Where are you?</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/foundations/scientific-method-for-ai</id>
    <title>AI development as scientific method</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/foundations/scientific-method-for-ai" />
    <category term="Foundations" />
    <summary>The observe, hypothesize, experiment, measure, iterate loop mapped onto an AI product team.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/foundations/non-determinism-and-subjectivity</id>
    <title>Why LLM evals are hard</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/foundations/non-determinism-and-subjectivity" />
    <category term="Foundations" />
    <summary>Non-determinism, subjectivity, prompt sensitivity, and the reasons classical ML eval frameworks do not transfer.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/foundations/eval-vs-monitoring-vs-guardrails</id>
    <title>Eval vs monitoring vs guardrails</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/foundations/eval-vs-monitoring-vs-guardrails" />
    <category term="Foundations" />
    <summary>Three concepts that share traces but do different jobs, with the architectural placement of each in a request lifecycle.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/error-analysis/the-60-80-rule</id>
    <title>The 60-80% rule</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/error-analysis/the-60-80-rule" />
    <category term="Error Analysis" />
    <summary>Three failure modes usually account for most of your bugs. Find them, fix them, repeat.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/error-analysis/open-coding-workflow</id>
    <title>Open coding for AI traces</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/error-analysis/open-coding-workflow" />
    <category term="Error Analysis" />
    <summary>Read traces, write notes, cluster notes into failure modes, prioritize fixes. The three-stage process and the LLM-assisted parts.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/error-analysis/dimensional-sampling</id>
    <title>Dimensional sampling</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/error-analysis/dimensional-sampling" />
    <category term="Error Analysis" />
    <summary>Synthetic input construction across features, scenarios, and personas to cover the failure surface, not the demo path.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/error-analysis/failure-mode-taxonomies</id>
    <title>Failure-mode taxonomies</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/error-analysis/failure-mode-taxonomies" />
    <category term="Error Analysis" />
    <summary>Why bottom-up coding beats top-down templates, and why generic labels like hallucination mislead.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/error-analysis/the-data-viewer</id>
    <title>Build your trace viewer in an afternoon</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/error-analysis/the-data-viewer" />
    <category term="Error Analysis" />
    <summary>The single highest-return tool for an eval program. Streamlit and Next.js variants, with the minimum feature set.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/error-analysis/case-study-nurtureboss</id>
    <title>Case study: NurtureBoss</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/error-analysis/case-study-nurtureboss" />
    <category term="Error Analysis" />
    <summary>One team&apos;s path from 66 percent date errors to 5 percent in a single error-analysis cycle.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/llm-as-judge/introduction</id>
    <title>LLM-as-Judge: the practical introduction</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/llm-as-judge/introduction" />
    <category term="LLM-as-Judge" />
    <summary>What an LLM judge is, when to use one, when not to, and the three pitfalls that show up on every team&apos;s first attempt.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/llm-as-judge/judge-prompt-structure</id>
    <title>The four-part judge prompt</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/llm-as-judge/judge-prompt-structure" />
    <category term="LLM-as-Judge" />
    <summary>A copyable template: role and definitions, criteria, few-shot from your train set, output format. Plus the runnable invocation.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/llm-as-judge/calibration-to-humans</id>
    <title>Calibrating your judge against humans</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/llm-as-judge/calibration-to-humans" />
    <category term="LLM-as-Judge" />
    <summary>TPR and TNR over agreement percent. Train, dev, test splits. The 100-example minimum. Iterate the prompt until both rates clear 0.90.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/llm-as-judge/biases-and-mitigations</id>
    <title>Judge biases you must defuse</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/llm-as-judge/biases-and-mitigations" />
    <category term="LLM-as-Judge" />
    <summary>Position, verbosity, self-preference, format, sycophancy. Each with the empirical evidence and the mitigation you bake into the harness.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/llm-as-judge/judgebench-honest-take</id>
    <title>Where LLM judges fail</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/llm-as-judge/judgebench-honest-take" />
    <category term="LLM-as-Judge" />
    <summary>JudgeBench: GPT-4o is roughly random on hard correctness judgments. The implications for your pipeline are concrete, not theoretical.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/llm-as-judge/pairwise-vs-direct</id>
    <title>Pairwise vs direct scoring</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/llm-as-judge/pairwise-vs-direct" />
    <category term="LLM-as-Judge" />
    <summary>When to ask the judge for a side-by-side ranking and when to ask for a single binary verdict.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/llm-as-judge/distilled-judges</id>
    <title>Distilled judges</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/llm-as-judge/distilled-judges" />
    <category term="LLM-as-Judge" />
    <summary>Why distillation matters in production, which shipped judges to know about, and the inline plus calibration architecture that holds up.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/datasets/construction</id>
    <title>Constructing your eval dataset</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/datasets/construction" />
    <category term="Datasets &amp; Golden Sets" />
    <summary>How to build the first hundred labeled examples, then grow them past five hundred without losing signal.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/datasets/synthetic-vs-human</id>
    <title>Synthetic vs human-authored examples</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/datasets/synthetic-vs-human" />
    <category term="Datasets &amp; Golden Sets" />
    <summary>Where each shape of data buys signal, where each one leaks, and how to mix them without poisoning the set.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/datasets/versioning-lineage</id>
    <title>Versioning your eval set</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/datasets/versioning-lineage" />
    <category term="Datasets &amp; Golden Sets" />
    <summary>A content-hashed eval set pinned to every result is a contract with your past self.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/datasets/regression-sets</id>
    <title>Building regression sets</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/datasets/regression-sets" />
    <category term="Datasets &amp; Golden Sets" />
    <summary>The set whose job is to fail loudly the next time you reintroduce a bug you already fixed.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/datasets/dimensional-coverage</id>
    <title>Dimensional coverage: features, scenarios, personas</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/datasets/dimensional-coverage" />
    <category term="Datasets &amp; Golden Sets" />
    <summary>A three-axis grid that tells you which cells of the failure surface your set actually probes.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/datasets/dataset-platforms</id>
    <title>Where to keep your eval datasets</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/datasets/dataset-platforms" />
    <category term="Datasets &amp; Golden Sets" />
    <summary>Langfuse, Phoenix, Braintrust, LangSmith, or a JSONL in git. The choice that actually matters.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/rag-evals/retrieval-vs-generation</id>
    <title>Eval RAG in two layers</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/rag-evals/retrieval-vs-generation" />
    <category term="RAG Evals" />
    <summary>Separate retrieval and generation or you cannot fix what breaks.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/rag-evals/retrieval-metrics</id>
    <title>Retrieval metrics: Recall@K, MRR, NDCG, hit-rate</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/rag-evals/retrieval-metrics" />
    <category term="RAG Evals" />
    <summary>The IR metrics that distinguish a retriever that found the right chunk from one that did not.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/rag-evals/generation-faithfulness</id>
    <title>Faithfulness and atomic facts</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/rag-evals/generation-faithfulness" />
    <category term="RAG Evals" />
    <summary>Decompose the answer into atomic claims, give each claim a binary verdict, and stop measuring hallucination as a single yes/no.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/rag-evals/context-precision-recall</id>
    <title>Context precision and recall</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/rag-evals/context-precision-recall" />
    <category term="RAG Evals" />
    <summary>The Ragas pair that tells you whether the retriever found the right chunks and whether they were the only chunks worth showing the model.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/rag-evals/synthetic-query-generation</id>
    <title>Synthetic query generation</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/rag-evals/synthetic-query-generation" />
    <category term="RAG Evals" />
    <summary>Bootstrap a labeled RAG eval set from your own documents, without inventing ground truth.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/rag-evals/long-context-rag</id>
    <title>Long-context RAG and RULER</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/rag-evals/long-context-rag" />
    <category term="RAG Evals" />
    <summary>Why advertised context lengths exceed effective context lengths, and the benchmarks that expose the gap.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/rag-evals/non-english-rag</id>
    <title>Non-English RAG eval</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/rag-evals/non-english-rag" />
    <category term="RAG Evals" />
    <summary>What to measure when your corpus and your users do not share a language with most of the literature.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/agentic-evals/trajectory-vs-outcome</id>
    <title>Trajectory-level evaluation</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/agentic-evals/trajectory-vs-outcome" />
    <category term="Agentic Evals" />
    <summary>Outcome-only scoring ranks agents in the wrong order. Score the trajectory and the result.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/agentic-evals/tool-call-evaluation</id>
    <title>Tool-call evaluation: AST, schema, execution</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/agentic-evals/tool-call-evaluation" />
    <category term="Agentic Evals" />
    <summary>Three levels of tool-call evaluation: schema conformance, semantic AST equivalence, and execution-grounded outcomes. Use all three or you measure the wrong thing.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/agentic-evals/sub-goal-decomposition</id>
    <title>Sub-goal decomposition: plan graphs as eval objects</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/agentic-evals/sub-goal-decomposition" />
    <category term="Agentic Evals" />
    <summary>Real agent tasks are graphs. Evaluating only the leaf hides 4-of-5 sub-goal progress. Sub-goal completion is the planning metric that surfaces it.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/agentic-evals/pass-k-and-consistency</id>
    <title>Pass^k: the metric that catches inconsistent agents</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/agentic-evals/pass-k-and-consistency" />
    <category term="Agentic Evals" />
    <summary>Pass@1 hides catastrophic inconsistency. Report pass^k or you do not know how often the agent will repeatedly succeed.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/agentic-evals/agent-as-judge</id>
    <title>Agent-as-a-judge: the trajectory-scoring pattern</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/agentic-evals/agent-as-judge" />
    <category term="Agentic Evals" />
    <summary>Single-LLM judges lose the thread on long trajectories. An agent judge with its own tools can walk the steps and grade them independently.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/agentic-evals/reward-hacking</id>
    <title>Reward hacking detection</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/agentic-evals/reward-hacking" />
    <category term="Agentic Evals" />
    <summary>Once an objective is optimized, it gets gamed. Reward hacking is empirically common; eval for it explicitly.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/agentic-evals/environmental-drift</id>
    <title>Environmental drift: robustness to tool and API change</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/agentic-evals/environmental-drift" />
    <category term="Agentic Evals" />
    <summary>Most agent benchmarks freeze the environment. Production does not. Inject latency, schema shifts, and transient errors. Score the recovery, not the happy path.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/agentic-evals/regression-suites</id>
    <title>Regression suites for agent loops</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/agentic-evals/regression-suites" />
    <category term="Agentic Evals" />
    <summary>Trajectory regression is harder than answer regression. Replay 50 prod trajectories on every change and diff outcome, step count, and tool-call sequence.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/production/online-vs-offline</id>
    <title>Online vs offline: where each fires</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/production/online-vs-offline" />
    <category term="Production &amp; Observability" />
    <summary>Offline evals run on a frozen dataset for every change. Online evals sample real traffic continuously. They answer different questions and both belong in production.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/production/trace-schema</id>
    <title>Trace schema: what to log</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/production/trace-schema" />
    <category term="Production &amp; Observability" />
    <summary>The trace is the unit of analysis for every eval program. This is the schema, the fields that matter, and the storage shape that makes downstream queries cheap.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/production/sampling-strategies</id>
    <title>Sampling strategies for production traces</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/production/sampling-strategies" />
    <category term="Production &amp; Observability" />
    <summary>Uniform sampling is the default and the wrong default for tail slices. Stratify on the dimensions you care about, oversample failures, and reweight.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/production/drift-detection</id>
    <title>Drift detection for production AI</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/production/drift-detection" />
    <category term="Production &amp; Observability" />
    <summary>Track input and output distributions over time. Page on changes that matter; ignore the ones that do not. The hardest part is deciding which is which.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/production/cost-latency-eval</id>
    <title>Cost and latency belong in the scorecard</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/production/cost-latency-eval" />
    <category term="Production &amp; Observability" />
    <summary>A model that wins on quality but triples p95 latency is not a win. Put quality, cost, and latency on one scorecard and pick on the Pareto frontier.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/production/ci-integration</id>
    <title>Wiring evals into CI</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/production/ci-integration" />
    <category term="Production &amp; Observability" />
    <summary>Tier the suite into smoke, full, and nightly. Cap cost. Shard. Publish to the PR. Make the merge gate boring and the merge decision data-driven.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/production/release-gates</id>
    <title>Release gates from eval results</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/production/release-gates" />
    <category term="Production &amp; Observability" />
    <summary>A release gate is a contract between the team and the user. Define the bands, automate the comparison, and hold the line when a regression candidate wants out.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/production/incident-response</id>
    <title>Incident response for AI products</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/production/incident-response" />
    <category term="Production &amp; Observability" />
    <summary>An AI postmortem looks different from a service postmortem. The root cause is rarely a line of code. The actions are usually eval-set changes.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/statistics/confidence-intervals</id>
    <title>Confidence intervals: Wilson and Clopper-Pearson</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/statistics/confidence-intervals" />
    <category term="Statistics &amp; Rigor" />
    <summary>Wald is wrong. Wilson is the practical default. Clopper-Pearson is the conservative exact interval. With code and the case for why N=10 evals tell you almost nothing.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/statistics/bootstrap-and-paired-tests</id>
    <title>Bootstrap and paired tests for noisy eval metrics</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/statistics/bootstrap-and-paired-tests" />
    <category term="Statistics &amp; Rigor" />
    <summary>When the metric is not a clean Bernoulli, bootstrap. When the systems share inputs, pair. Both close the gap between point estimates and actionable decisions.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/statistics/statistical-power</id>
    <title>Statistical power: how big does the eval set need to be</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/statistics/statistical-power" />
    <category term="Statistics &amp; Rigor" />
    <summary>Power is the probability of detecting the delta you care about. Pick the delta first, derive the N, then commit. Reversing the order produces evals nobody can act on.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/statistics/clustered-standard-errors</id>
    <title>Clustered standard errors: when CIs lie</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/statistics/clustered-standard-errors" />
    <category term="Statistics &amp; Rigor" />
    <summary>When eval questions cluster (multiple questions per topic, multiple turns per session), naive standard errors can be three times too small. The cluster-aware variance formula and code, with the...</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/statistics/inter-rater-agreement</id>
    <title>Inter-rater agreement: Cohen&apos;s kappa and Krippendorff&apos;s alpha</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/statistics/inter-rater-agreement" />
    <category term="Statistics &amp; Rigor" />
    <summary>Two qualified humans disagree on your rubric. That is information about the rubric, not about the raters. Measure agreement before you trust labels.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/statistics/multiple-comparisons</id>
    <title>Multiple comparisons: Bonferroni vs BH-FDR</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/statistics/multiple-comparisons" />
    <category term="Statistics &amp; Rigor" />
    <summary>Run twenty tests at alpha 0.05 and you expect one false positive. Correct for it. Bonferroni for strict family-wise control; Benjamini-Hochberg for false-discovery control.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/statistics/effect-size</id>
    <title>Effect size: Cohen&apos;s d and absolute deltas</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/statistics/effect-size" />
    <category term="Statistics &amp; Rigor" />
    <summary>A statistically significant 0.3-point win on a 0-100 rubric is not worth shipping. Report the magnitude, not just the p-value.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/safety-and-red-team/owasp-llm-top-10</id>
    <title>Walking the OWASP Top 10 for LLM apps</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/safety-and-red-team/owasp-llm-top-10" />
    <category term="Safety &amp; Red-Team" />
    <summary>The ten failure categories every LLM application owner should be probing, with the smallest useful test for each.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/safety-and-red-team/red-team-program-design</id>
    <title>Designing a red-team program</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/safety-and-red-team/red-team-program-design" />
    <category term="Safety &amp; Red-Team" />
    <summary>The operating model for an LLM red team: frequency, staffing, intake, scoring, and where it sits relative to automated regression.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/safety-and-red-team/harmbench-walkthrough</id>
    <title>HarmBench in practice</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/safety-and-red-team/harmbench-walkthrough" />
    <category term="Safety &amp; Red-Team" />
    <summary>A code walk through the HarmBench corpus, the StrongREJECT scoring classifier, and how to read the leaderboard without overfitting to it.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/safety-and-red-team/ailuminate-walkthrough</id>
    <title>AILuminate in practice</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/safety-and-red-team/ailuminate-walkthrough" />
    <category term="Safety &amp; Red-Team" />
    <summary>MLCommons&apos; standardized safety benchmark, the twelve hazard categories, and how to read its grade scale alongside HarmBench.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/safety-and-red-team/over-refusal-and-balance</id>
    <title>Over-refusal and the balance question</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/safety-and-red-team/over-refusal-and-balance" />
    <category term="Safety &amp; Red-Team" />
    <summary>Why refusal rate alone is a misleading metric, the XSTest pattern for paired benign cases, and how to grade refusal against necessity.</summary>
  </entry>
  <entry>
    <id>https://www.aievals.co/learn/safety-and-red-team/jailbreaks-and-defenses</id>
    <title>Jailbreaks and defenses</title>
    <updated>2026-05-29T00:00:00.000Z</updated>
    <link href="https://www.aievals.co/learn/safety-and-red-team/jailbreaks-and-defenses" />
    <category term="Safety &amp; Red-Team" />
    <summary>GCG-style adversarial suffixes, the empirical floor on defense success, and what your guardrail stack should actually do.</summary>
  </entry>
</feed>
