Start Here for Frontend Engineers

Nine steps from why evals matter through online monitoring. TypeScript-first tooling, judge basics, a trace viewer you build, multi-turn and voice quality, client-side instrumentation.

Frontend engineers own the surface where model quality is actually felt: the streaming transcript, the retry button, the thumbs-down that nobody clicks. This track is for staff-plus frontend and full-stack engineers who ship conversational and AI-assisted product surfaces in TypeScript. The premise: perceived quality is measurable, the tooling can live in your existing Vitest setup, and the client is an instrument, not just a renderer. Two steps, three and seven, are hands-on recipes that take about an hour each.

The order is deliberate. Steps one and two build the mental model and the first artifact, in Python because that recipe is the shortest path to understanding the loop. Step three makes the loop native to your stack. Steps four and five are the quality-measurement craft: rubric-driven judges and the review UI you are better positioned to build than anyone else on the team. Steps six through eight cover the surfaces themselves: chat, multi-turn state, voice. Step nine moves the whole thing online.

The track

1. Why evals are the moat (12 min). The three reasons LLM features rot (non-determinism, subjectivity, drift) and why ship-and-pray fails for AI. Hamel's argument is the canonical one ¹.
2. Your first eval in 30 minutes (30 min). Fifty labeled traces, a binary judge, a confusion matrix. The concepts transfer to TypeScript unchanged; do it once in Python first.
3. TypeScript eval with Vercel AI SDK (60 min). Vitest as the eval runner, AI SDK calls under test, a zod-typed binary judge, and cost accounting that prints with the test summary ² ³.
4. LLM-as-Judge: the practical introduction (10 min). Perceived quality needs a rubric, and a judge is how the rubric runs at scale. Binary plus critique beats 1-to-5 scales ⁴.
5. Build your trace viewer in an afternoon (12 min). The single highest-return tool in an eval program is a review UI, and that is frontend work. Streamlit and Next.js variants with the minimum feature set ⁵.
6. Conversational quality (12 min). The task-type playbook for chat surfaces: persona consistency, context retention, recovery from misunderstanding, scored per turn ⁶.
7. Multi-turn conversation eval (75 min). The hands-on version: per-turn binary rubrics plus a per-session state check, logged to Langfuse ⁷.
8. Voice: STT, TTS, conversational (12 min). WER for transcription, MOS panels for synthesis, time-to-first-byte and barge-in for full duplex. P.808 is the crowdsourced listening-test standard ⁸.
9. Online vs offline: where each fires (14 min). Offline evals on frozen datasets, online evals on sampled traffic, and the feedback events your client has to emit for either to mean anything ⁹.
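The confusion matrix from step two is small enough to sketch directly in TypeScript. The block below is a minimal illustration, not code from any of the recipes: the `LabeledTrace` shape and the function names are assumptions made for this example, and "positive" is taken to mean "pass". It compares a binary judge's verdicts against human labels and derives the two agreement rates usually read off the matrix.

```typescript
// Hypothetical shape for illustration: one trace with a human label
// (treated as ground truth) and the judge's verdict. true = pass.
type LabeledTrace = { id: string; human: boolean; judge: boolean };

type Confusion = { tp: number; fp: number; tn: number; fn: number };

// Count agreement and disagreement between judge and human labels.
function confusionMatrix(traces: LabeledTrace[]): Confusion {
  const m: Confusion = { tp: 0, fp: 0, tn: 0, fn: 0 };
  for (const t of traces) {
    if (t.judge && t.human) m.tp++; // both say pass
    else if (t.judge && !t.human) m.fp++; // judge passes what the human failed
    else if (!t.judge && !t.human) m.tn++; // both say fail
    else m.fn++; // judge fails what the human passed
  }
  return m;
}

// True-positive rate: share of human "pass" traces the judge also passes.
// True-negative rate: share of human "fail" traces the judge also fails.
// Returns NaN for a rate whose denominator is zero.
function agreementRates(m: Confusion): { tpr: number; tnr: number } {
  const tpr = m.tp + m.fn === 0 ? NaN : m.tp / (m.tp + m.fn);
  const tnr = m.tn + m.fp === 0 ? NaN : m.tn / (m.tn + m.fp);
  return { tpr, tnr };
}

// Tiny made-up sample, only to show the call shape.
const sample: LabeledTrace[] = [
  { id: "a", human: true, judge: true },
  { id: "b", human: true, judge: false },
  { id: "c", human: false, judge: false },
  { id: "d", human: false, judge: true },
  { id: "e", human: true, judge: true },
];

const matrix = confusionMatrix(sample);
const sampleRates = agreementRates(matrix);
console.log(matrix, sampleRates);
```

Reporting the two rates separately, rather than a single accuracy figure, keeps a judge that passes everything from looking good on a dataset where most traces pass.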

What comes after the track

When your assistant starts calling tools, the multi-turn work extends into agentic evals: trajectory vs outcome scoring is the next concept up. On the judge side, biases and mitigations is the page to read before you trust a judge score in a launch review.
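As a rough picture of the trajectory-versus-outcome distinction, the sketch below scores the same agent run two ways. Everything in it is hypothetical and chosen for illustration: the `AgentRun` shape, the function names, and the rule that expected tools must appear in order with extra calls allowed. Real agentic evals typically use richer checks than string equality.

```typescript
// Hypothetical shapes for illustration; not taken from any library.
type ToolCall = { tool: string };
type AgentRun = { calls: ToolCall[]; finalAnswer: string };

// Outcome scoring: only the end result matters.
function outcomePass(run: AgentRun, expectedAnswer: string): boolean {
  const normalize = (s: string) => s.trim().toLowerCase();
  return normalize(run.finalAnswer) === normalize(expectedAnswer);
}

// Trajectory scoring: the expected tools must appear in order.
// Extra calls in between are tolerated (an assumption of this sketch).
function trajectoryPass(run: AgentRun, expectedTools: string[]): boolean {
  let next = 0;
  for (const call of run.calls) {
    if (next < expectedTools.length && call.tool === expectedTools[next]) {
      next++;
    }
  }
  return next === expectedTools.length;
}

// A made-up run that reaches the right answer but skips a lookup step.
const run: AgentRun = {
  calls: [{ tool: "search_orders" }],
  finalAnswer: "Refund issued",
};

const outcomeOk = outcomePass(run, "refund issued");
const trajectoryOk = trajectoryPass(run, ["search_orders", "check_policy"]);
console.log({ outcomeOk, trajectoryOk });
```

A run like this one passes on outcome and fails on trajectory, which is the case the two scores are meant to tell apart.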

TIP

Step 5 pays for the whole track. A team with a fast trace viewer labels ten times more data than a team squinting at JSON logs, and labeled data is the input to every other step here.

¹ Hamel Husain, "Your AI Product Needs Evals."
² Vercel AI SDK documentation.
³ Vitest documentation.
⁴ Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023 Datasets and Benchmarks Track.
⁵ Hamel Husain, "A Field Guide to Rapidly Improving AI Products."
⁶ Hamel Husain, "Creating an LLM-as-a-Judge That Drives Business Results."
⁷ Langfuse cookbook, multi-turn conversation evaluation.
⁸ ITU-T Recommendation P.808, "Subjective evaluation of speech quality with a crowdsourcing approach."
⁹ Chip Huyen, "Building a Generative AI Platform."
