Cookbook

Runnable worked examples for evals you can ship this week.

Your first eval in 30 minutes
LLM-judge calibration loop
RAG eval with Ragas
Agent eval with τ-bench
Multi-turn conversation eval
SWE-bench Verified walkthrough
Long-context eval with RULER
Text-to-SQL eval with execution
Online eval with Langfuse
Online eval with Phoenix
TypeScript eval with Vercel AI SDK
Adding error bars to your evals
Contamination detection for code-gen and reasoning evals
HarmBench red-team run
Synthetic query generation for golden sets

By Om Bharatiya. View on aievals.co.