Frontend engineers own the surface where model quality is actually felt: the streaming transcript, the retry button, the thumbs-down that nobody clicks. This track is for staff-plus frontend and full-stack engineers who ship conversational and AI-assisted product surfaces in TypeScript. The premise: perceived quality is measurable, the tooling can live in your existing Vitest setup, and the client is an instrument, not just a renderer. Steps three and seven are hands-on recipes that take an hour or a little more each; step two is a shorter hands-on recipe at 30 minutes.
The order is deliberate. Steps one and two build the mental model and the first artifact, in Python because that recipe is the shortest path to understanding the loop. Step three makes the loop native to your stack. Steps four and five are the quality-measurement craft: rubric-driven judges and the review UI you are better positioned to build than anyone else on the team. Steps six through eight cover the surfaces themselves: chat, multi-turn state, voice. Step nine moves the whole thing online.
The track
- Why evals are the moat (12 min). The three reasons LLM features rot (non-determinism, subjectivity, drift) and why ship-and-pray fails for AI. Hamel's argument is the canonical one [1].
- Your first eval in 30 minutes (30 min). Fifty labeled traces, a binary judge, a confusion matrix. The concepts transfer to TypeScript unchanged; do it once in Python first.
- TypeScript eval with Vercel AI SDK (60 min). Vitest as the eval runner, AI SDK calls under test, a zod-typed binary judge, and cost accounting that prints with the test summary [2][3].
- LLM-as-Judge: the practical introduction (10 min). Perceived quality needs a rubric, and a judge is how the rubric runs at scale. Binary plus critique beats 1-to-5 scales [4].
- Build your trace viewer in an afternoon (12 min). The single highest-return tool in an eval program is a review UI, and that is frontend work. Streamlit and Next.js variants with the minimum feature set [5].
- Conversational quality (12 min). The task-type playbook for chat surfaces: persona consistency, context retention, recovery from misunderstanding, scored per turn [6].
- Multi-turn conversation eval (75 min). The hands-on version: per-turn binary rubrics plus a per-session state check, logged to Langfuse [7].
- Voice: STT, TTS, conversational (12 min). Word error rate (WER) for transcription, mean opinion score (MOS) panels for synthesis, time-to-first-byte and barge-in for full duplex. ITU-T P.808 is the crowdsourced listening-test standard [8].
- Online vs offline: where each fires (14 min). Offline evals on frozen datasets, online evals on sampled traffic, and the feedback events your client has to emit for either to mean anything [9].
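To make the "binary judge, confusion matrix" idea from step two concrete in the track's own language, here is a minimal TypeScript sketch. It is an illustration, not the recipe's actual code: the `LabeledTrace` shape and the `confusionMatrix` and `agreementRates` helpers are hypothetical names, and it assumes each trace already carries a human pass/fail label and a judge pass/fail verdict.

```typescript
// Hypothetical trace shape: one human label and one judge verdict per trace.
interface LabeledTrace {
  id: string;
  human: boolean; // true = the human reviewer marked the output as passing
  judge: boolean; // true = the LLM judge marked the output as passing
}

interface ConfusionMatrix {
  truePositive: number;  // judge pass, human pass
  falsePositive: number; // judge pass, human fail
  falseNegative: number; // judge fail, human pass
  trueNegative: number;  // judge fail, human fail
}

// Count how often the judge agrees or disagrees with the human labels.
function confusionMatrix(traces: LabeledTrace[]): ConfusionMatrix {
  const m: ConfusionMatrix = {
    truePositive: 0,
    falsePositive: 0,
    falseNegative: 0,
    trueNegative: 0,
  };
  for (const t of traces) {
    if (t.judge && t.human) m.truePositive++;
    else if (t.judge && !t.human) m.falsePositive++;
    else if (!t.judge && t.human) m.falseNegative++;
    else m.trueNegative++;
  }
  return m;
}

// Guard against empty denominators so an all-pass dataset yields NaN, not a crash.
function ratio(numerator: number, denominator: number): number {
  return denominator === 0 ? NaN : numerator / denominator;
}

// Summarize the matrix: overall agreement plus per-class agreement.
function agreementRates(m: ConfusionMatrix) {
  const total =
    m.truePositive + m.falsePositive + m.falseNegative + m.trueNegative;
  return {
    accuracy: ratio(m.truePositive + m.trueNegative, total),
    // Share of human-passed traces that the judge also passed.
    truePositiveRate: ratio(m.truePositive, m.truePositive + m.falseNegative),
    // Share of human-failed traces that the judge also failed.
    trueNegativeRate: ratio(m.trueNegative, m.trueNegative + m.falsePositive),
  };
}
```

Reporting the two per-class rates alongside accuracy is a deliberate choice in this sketch: when most traces pass, a judge that always says "pass" can still score high on accuracy while catching none of the failures.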
What comes after the track
When your assistant starts calling tools, the multi-turn work extends into agentic evals, where trajectory versus outcome scoring is the next concept to learn. On the judge side, read the biases and mitigations page before you trust a judge score in a launch review.
TIP
Step 5 pays for the whole track. A team with a fast trace viewer labels ten times more data than a team squinting at JSON logs, and labeled data is the input to every other step here.