Hamel's blunt take is that "the single biggest thing teams can do to speed up their eval program is build a custom data viewer" [1]. The teams that do this iterate ten times faster than the teams that rely on vendor UIs or, worse, raw JSON. The reason is friction. Every second of friction in reading a trace compounds across the hundreds of traces a working program reads each week.
Vendor viewers solve a different problem from yours. They are built to be generic across products. Your viewer is built to be specific to your product, your trace schema, and your annotators. Three custom features can save you twenty minutes a session.
Why a custom viewer beats a vendor UI
The vendor viewers (Langfuse, Phoenix, Braintrust, LangSmith) are good general-purpose tools and the right place to log traces. They lose to a custom viewer for reading traces in your product for three reasons.
First, the schema is generic. The vendor knows about spans and inputs and outputs; it does not know that the second tool call in a recipe-bot trace is the dietary-restriction check, or that you want that highlighted in red when the response includes an ingredient on the user's no-go list. Your viewer can [2].
Second, the layout is not optimized for your read pattern. The vendor view defaults to a single trace at a time, with nested JSON for tool calls. You read twenty traces a minute and need side-by-side rendering, expanded tool I/O, and one-keystroke pass/fail. Building that in the vendor UI is either impossible or an enterprise feature.
Third, annotation lives in the vendor. If you want the labels to flow back into a CSV or a notebook for analysis, every vendor has a different export shape and a different rate limit. A custom viewer that writes to a single Parquet file ends the export problem.
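The product-specific highlight from the first reason is easy to sketch. The function below is a minimal illustration, assuming the user's no-go list is available as plain strings; `find_no_go_hits` and its arguments are hypothetical names, not part of any vendor API.

```python
import re


def find_no_go_hits(response: str, no_go: list[str]) -> list[str]:
    """Return the no-go ingredients that appear in the response text.

    Matching is case-insensitive and on whole words, so "ham" does not
    fire on "Graham". Plurals and synonyms are NOT handled here.
    """
    hits = []
    for ingredient in no_go:
        pattern = r"\b" + re.escape(ingredient) + r"\b"
        if re.search(pattern, response, flags=re.IGNORECASE):
            hits.append(ingredient)
    return hits


# Hypothetical recipe-bot response for a user who avoids pork products.
print(find_no_go_hits("Crisp the pancetta, then add peas.", ["pancetta", "bacon"]))
```

A viewer could render the response in red whenever the returned list is non-empty. Whole-word matching is a deliberately crude starting point; a real check would likely need an ingredient taxonomy.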
The custom viewer is not a replacement for the vendor's observability stack. It is the read layer on top of it. The vendor stores; the custom viewer reads. The two together are how a working eval program ships.
NOTE
If a vendor tool gives you the read layer you want out of the box, use it. Phoenix's annotation queues and Braintrust's trace-to-eval-case workflow are real assets for the right team [3] [4]. The custom viewer is the path of least resistance, not the only path.
The minimum feature set
A useful trace viewer has four properties. Build the four; do not build a fifth until you have used the first four for a month.
```mermaid
flowchart LR
    Store["Langfuse / Phoenix / S3 export"] --> Loader["Loader\nparquet/jsonl"]
    Loader --> View["Viewer\nside-by-side, copy, filter"]
    View --> Note["Annotation\nbinary + critique"]
    Note --> Sink["Sink\nlabels.parquet"]
    Sink --> Eval["Eval pipeline / judge"]
```
Side-by-side rendering. Two traces visible at once, scroll-linked or independent. Side-by-side beats single-trace because comparison is what catches subtle failures (the model behaves correctly on the easier of the two related queries and wrongly on the harder one, and the contrast is what makes it visible) [5].
Copy of the entire trace as one block. A button that copies the full trace (system prompt collapsed, tool calls expanded, response) to the clipboard. The annotators paste it into Slack to discuss with engineering. Without the copy button they screenshot, which makes the conversation lossy.
Filter by anything. Date range, persona, tool used, score from the last eval run, whether the annotator labeled it before, a free-text search against the response. The killer filter is "score=fail AND label=null" because that is the queue: traces the judge flagged that nobody has audited.
Annotate inline. A pass/fail toggle, a single critique text box, and one keyboard shortcut per category. The annotator does not switch tabs. The label is committed to disk on every keystroke; lost annotations are worse than no annotations because they erode trust [6].
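The "score=fail AND label=null" queue described above can be sketched in a few lines of plain Python. This is an illustration only: it assumes traces and labels are lists of dicts with `trace_id` and `judge_score` keys, and `audit_queue` is a hypothetical helper name.

```python
def audit_queue(traces: list[dict], labels: list[dict]) -> list[dict]:
    """Traces the judge marked as fail that no human has labeled yet."""
    labeled_ids = {row["trace_id"] for row in labels}
    return [
        t for t in traces
        if t.get("judge_score") == "fail" and t["trace_id"] not in labeled_ids
    ]


# Made-up sample rows for illustration.
traces = [
    {"trace_id": "t-1", "judge_score": "fail"},
    {"trace_id": "t-2", "judge_score": "pass"},
    {"trace_id": "t-3", "judge_score": "fail"},
]
labels = [{"trace_id": "t-1", "pass_fail": "fail"}]

print([t["trace_id"] for t in audit_queue(traces, labels)])
```

Only `t-3` remains in the queue here: `t-1` was already audited and `t-2` was never flagged.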
The non-features are also important. No charts. No aggregations. No "AI-suggested critique." The viewer is for reading. Aggregation happens in the eval pipeline, downstream.
The Streamlit variant
The fastest path to a working viewer is Streamlit. About a hundred lines, one afternoon, deployable on a laptop or a private subdomain.

```python
"""Trace viewer for error analysis. Side-by-side, copy, filter, annotate.

Loads traces from a Langfuse export. Persists labels to a Parquet file."""
import streamlit as st
import pandas as pd
from pathlib import Path
from datetime import datetime, timezone

TRACES_PATH = Path("traces.parquet")
LABELS_PATH = Path("labels.parquet")
CATEGORIES = [
    "dietary_violated",
    "format_error",
    "missed_handoff",
    "promise_drop",
    "tool_arg_typo",
    "none",
]


@st.cache_data
def load_traces() -> pd.DataFrame:
    return pd.read_parquet(TRACES_PATH)


def load_labels() -> pd.DataFrame:
    # Not cached: labels change on every save and must be re-read.
    if LABELS_PATH.exists():
        return pd.read_parquet(LABELS_PATH)
    return pd.DataFrame(columns=[
        "trace_id", "annotator", "label", "critique", "pass_fail", "ts",
    ])


def save_label(trace_id: str, label: str, critique: str, pass_fail: str):
    df = load_labels()
    row = {
        "trace_id": trace_id,
        "annotator": st.session_state.get("annotator") or "anon",
        "label": label,
        "critique": critique,
        "pass_fail": pass_fail,
        # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated.
        "ts": datetime.now(timezone.utc).isoformat(),
    }
    df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
    df.to_parquet(LABELS_PATH, index=False)


def format_for_clipboard(trace: pd.Series) -> str:
    return (
        f"USER: {trace.user_input}\n\n"
        + "TOOLS:\n"
        + "\n".join(f"- {c['name']}({c['args']}) -> {c['result']}"
                    for c in trace.tool_calls)
        + f"\n\nASSISTANT: {trace.response}"
    )


def render_trace(trace: pd.Series):
    st.markdown(f"**Trace** `{trace.trace_id}`")
    st.text_area("user", trace.user_input, height=80, key=f"u-{trace.trace_id}")
    with st.expander("tool calls", expanded=True):
        for call in trace.tool_calls:
            st.code(f"{call['name']}({call['args']}) -> {call['result']}")
    st.text_area("assistant", trace.response, height=160, key=f"a-{trace.trace_id}")
    # st.code renders a block with a built-in copy icon.
    if st.button("Copy as Markdown", key=f"copy-{trace.trace_id}"):
        st.code(format_for_clipboard(trace))


# Sidebar: filters and annotator name.
st.sidebar.text_input("annotator name", key="annotator")
df = load_traces()
labels = load_labels()
labeled_ids = set(labels.trace_id) if len(labels) > 0 else set()
st.sidebar.markdown("---")
score_filter = st.sidebar.selectbox(
    "judge score", ["any", "pass", "fail"], index=0,
)
only_unlabeled = st.sidebar.checkbox("only unlabeled", value=True)
search = st.sidebar.text_input("text search", "")

view = df.copy()
if score_filter != "any":
    view = view[view.judge_score == score_filter]
if only_unlabeled:
    view = view[~view.trace_id.isin(labeled_ids)]
if search:
    # regex=False treats the search box as literal text, not a pattern.
    view = view[
        view.response.str.contains(search, case=False, na=False, regex=False)
        | view.user_input.str.contains(search, case=False, na=False, regex=False)
    ]
st.sidebar.markdown(f"{len(view)} of {len(df)} traces")

# Nothing to render: stop before indexing into an empty frame.
if view.empty:
    st.info("No traces match the current filters.")
    st.stop()

# Main: side-by-side render. Two columns, current + next.
view = view.reset_index(drop=True)
idx = st.session_state.get("idx", 0)
idx = st.number_input("position", min_value=0,
                      max_value=len(view) - 1,
                      value=min(idx, len(view) - 1))
st.session_state["idx"] = idx
left, right = st.columns(2)
with left:
    render_trace(view.iloc[idx])
with right:
    if idx + 1 < len(view):
        render_trace(view.iloc[idx + 1])

# Annotation row for the left-side trace.
st.markdown("### Annotate the left trace")
col1, col2, col3 = st.columns([1, 2, 4])
with col1:
    pass_fail = st.radio("verdict", ["pass", "fail"], horizontal=True)
with col2:
    label = st.selectbox("category", CATEGORIES)
with col3:
    critique = st.text_input("critique (one line)")
if st.button("Save & next", type="primary"):
    save_label(view.iloc[idx].trace_id, label, critique, pass_fail)
    # With "only unlabeled" on, the saved trace drops out of the view on
    # rerun and the next trace slides into this position, so advancing
    # would skip one. Advance only when labeled traces stay visible.
    if not only_unlabeled:
        st.session_state["idx"] = idx + 1
    st.rerun()
```
That is the full viewer. About a hundred lines, one Parquet sink, no JavaScript. Drop it behind an SSO proxy and the team can use it the same afternoon. The annotation file is the input to the rest of the eval pipeline; you do not need any other export step.
The features missing on purpose: no auto-suggested labels (would bias the human), no aggregate dashboard (lives downstream), no role-based access (handled by the proxy). Each missing feature is one less thing that can break.
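One thing worth hardening early is the write itself. `save_label` rewrites the whole labels file on each save, so a crash mid-write could leave a truncated file. A common mitigation, swapped in here as a suggestion rather than something the viewer above does, is to write to a temporary file and rename it over the target. The sketch below assumes the caller has already serialized the labels to bytes; `atomic_write_bytes` is a hypothetical helper name.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(path: Path, data: bytes) -> None:
    """Write data to path so readers see either the old or the new file."""
    path = Path(path)
    # The temp file must live in the same directory so the rename stays
    # on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name, suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, path)  # replaces the target in one step
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

In recent pandas versions, calling `df.to_parquet()` with no path returns the serialized bytes, which could be passed straight to this helper.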
The Next.js variant
A React viewer is the right move when more than two people use the tool concurrently, the trace volume blows past what a Streamlit server can hold in memory, or the team wants annotation history threaded through other product UI. The shape is the same: loader, viewer, annotation, sink.
```tsx
'use client';
import { useEffect, useState } from 'react';
import type { Trace, Label } from './types';
// FilterSidebar, TraceCard, and AnnotationBar are small presentational
// components assumed to be imported from elsewhere in the project.

const CATEGORIES = [
  'dietary_violated',
  'format_error',
  'missed_handoff',
  'promise_drop',
  'tool_arg_typo',
  'none',
] as const;

type Filters = {
  score: 'any' | 'pass' | 'fail';
  onlyUnlabeled: boolean;
  search: string;
};

export default function ViewerPage() {
  const [traces, setTraces] = useState<Trace[]>([]);
  const [labels, setLabels] = useState<Label[]>([]);
  const [filters, setFilters] = useState<Filters>({
    score: 'any',
    onlyUnlabeled: true,
    search: '',
  });
  const [idx, setIdx] = useState(0);

  // Load traces and existing labels once on mount.
  useEffect(() => {
    fetch('/api/traces').then((r) => r.json()).then(setTraces);
    fetch('/api/labels').then((r) => r.json()).then(setLabels);
  }, []);

  const labeledIds = new Set(labels.map((l) => l.traceId));
  const view = traces.filter((t) => {
    if (filters.score !== 'any' && t.judgeScore !== filters.score) return false;
    if (filters.onlyUnlabeled && labeledIds.has(t.id)) return false;
    if (filters.search && !`${t.userInput} ${t.response}`.toLowerCase()
      .includes(filters.search.toLowerCase())) return false;
    return true;
  });

  async function saveLabel(label: Omit<Label, 'ts'>) {
    const next: Label = { ...label, ts: new Date().toISOString() };
    await fetch('/api/labels', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(next),
    });
    setLabels((prev) => [...prev, next]);
    // With onlyUnlabeled on, the labeled trace leaves the view and the next
    // one slides into place, so advance only when labeled traces stay visible.
    if (!filters.onlyUnlabeled) setIdx((i) => i + 1);
  }

  if (view.length === 0) return <div>No traces match filters.</div>;

  // Clamp the index: the view shrinks when filters change or a trace is labeled.
  const safeIdx = Math.min(idx, view.length - 1);
  const current = view[safeIdx];
  const next = view[safeIdx + 1];

  return (
    <div className="grid grid-cols-[260px_1fr] gap-4 p-4">
      <FilterSidebar
        filters={filters}
        onChange={setFilters}
        viewCount={view.length}
        totalCount={traces.length}
      />
      <main>
        <div className="grid grid-cols-2 gap-4">
          <TraceCard trace={current} />
          {next ? <TraceCard trace={next} /> : <div />}
        </div>
        <AnnotationBar
          traceId={current.id}
          categories={CATEGORIES}
          onSave={saveLabel}
        />
      </main>
    </div>
  );
}
```
The shape is the same as the Streamlit viewer. The differences are operational: the React variant ships a server (/api/traces, /api/labels) and scales to many concurrent annotators. Keep the components small. The viewer is not a product feature; it is the inside-baseball tool for the eval team. Resist the temptation to add filters that no one has asked for.
The annotation sink can be the same Parquet file, written by the API route. Or it can be a Postgres table. Either way, the labels are the source of truth and they belong in a place you can query from a notebook.
Annotation schema
A useful annotation has five fields and not many more.
| field | type | example | why |
|---|---|---|---|
| trace_id | string | t-9f0a2b... | foreign key into trace store |
| annotator | string | om | for inter-rater agreement |
| pass_fail | enum | fail | the primary binary verdict |
| label | enum or null | dietary_violated | failure-mode category (null = none) |
| critique | string (free text) | recommended pancetta to a vegan user | one sentence, why |
Three optional fields show up later. A confidence flag (0 or 1) is useful for the borderline cases the annotator wants to revisit. A "needs SME review" flag is useful when the annotator is not the principal domain expert. A linked-issue field is useful when the trace becomes a GitHub issue.
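The five required fields map naturally onto a small record type. The sketch below is one possible encoding, not the canonical one: the `Annotation` class name and the validation rules are assumptions, and the category names are copied from the viewer code above.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# Failure-mode categories, copied from the viewer's CATEGORIES list.
CATEGORIES = {
    "dietary_violated", "format_error", "missed_handoff",
    "promise_drop", "tool_arg_typo",
}


@dataclass(frozen=True)
class Annotation:
    trace_id: str
    annotator: str
    pass_fail: str          # "pass" or "fail"
    label: Optional[str]    # failure-mode category, None when there is none
    critique: str

    def __post_init__(self) -> None:
        # Reject rows that would quietly break downstream analysis.
        if self.pass_fail not in ("pass", "fail"):
            raise ValueError(f"pass_fail must be 'pass' or 'fail', got {self.pass_fail!r}")
        if self.label is not None and self.label not in CATEGORIES:
            raise ValueError(f"unknown label {self.label!r}")
        if not self.annotator.strip():
            raise ValueError("annotator must be a non-empty, per-person name")


row = Annotation(
    trace_id="t-123",  # made-up id for illustration
    annotator="om",
    pass_fail="fail",
    label="dietary_violated",
    critique="recommended pancetta to a vegan user",
)
print(asdict(row))
```

Validating at construction time keeps malformed rows out of the labels file instead of discovering them during analysis.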
The schema does not include a Likert score. Binary plus critique is the standard for a reason; binary lowers the sample size you need to detect a real change, raises inter-rater agreement, and forces the annotator to commit [6]. Likert hides disagreement under aggregated means and invites the annotator to park on 3.
TIP
If your team insists on a 1-to-5 scale, write the rubric so the breakpoint between 3 and 4 is the binary pass/fail. Then ignore the 1, 2, 4, 5 distinctions until you have proven you can act on them.
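Collapsing the scale is a one-line mapping. A minimal sketch, assuming scores are integers from 1 to 5 and the rubric puts the pass boundary between 3 and 4 as described above; `likert_to_verdict` is a hypothetical name.

```python
def likert_to_verdict(score: int) -> str:
    """Map a 1-to-5 rubric score onto the binary verdict (4 and 5 pass)."""
    if score not in (1, 2, 3, 4, 5):
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    return "pass" if score >= 4 else "fail"


print([likert_to_verdict(s) for s in (1, 2, 3, 4, 5)])
```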
Exporting labels
The viewer is the input to the rest of the program. The labels file goes downstream in two ways.
The first is into the calibration loop. A judge prompt is graded against the human labels: TPR and TNR computed against the binary verdicts, iterated until both clear 0.9. The labels file is the held-out set. See calibration to humans for the discipline.
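The two rates are simple to compute once judge and human verdicts are aligned by trace. The sketch below assumes "pass" is the positive class, which is a convention choice the text does not fix, and that both lists are in the same trace order; `tpr_tnr` is a hypothetical helper.

```python
import math


def tpr_tnr(human: list[str], judge: list[str], positive: str = "pass") -> tuple[float, float]:
    """True-positive and true-negative rates of the judge against human labels."""
    if len(human) != len(judge):
        raise ValueError("human and judge verdicts must be aligned one-to-one")
    tp = sum(h == positive and j == positive for h, j in zip(human, judge))
    fn = sum(h == positive and j != positive for h, j in zip(human, judge))
    tn = sum(h != positive and j != positive for h, j in zip(human, judge))
    fp = sum(h != positive and j == positive for h, j in zip(human, judge))
    # A rate is undefined when the human labels contain none of that class.
    tpr = tp / (tp + fn) if (tp + fn) else math.nan
    tnr = tn / (tn + fp) if (tn + fp) else math.nan
    return tpr, tnr


# Made-up verdicts for five traces.
human = ["pass", "pass", "fail", "fail", "pass"]
judge = ["pass", "fail", "fail", "pass", "pass"]
tpr, tnr = tpr_tnr(human, judge)
print(f"TPR={tpr:.2f} TNR={tnr:.2f}")
```

In this toy sample the judge agrees on two of three human passes and one of two human fails, so neither rate is near the 0.9 bar.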
The second is into the regression set. Traces labeled fail with a critique become frozen examples the system has to pass on every release. The labels file is the source of truth for what "passing" means.
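Extracting the regression set is a filter over the labels. A minimal sketch, assuming label rows are dicts with the schema fields above and that a later label for the same trace supersedes an earlier one (an assumption, not something the schema mandates); `build_regression_set` is a hypothetical name.

```python
def build_regression_set(labels: list[dict]) -> list[dict]:
    """Keep failing traces that carry a critique, one entry per trace_id."""
    latest: dict[str, dict] = {}
    for row in labels:  # rows are assumed to be in write order
        latest[row["trace_id"]] = row
    return [
        {"trace_id": r["trace_id"], "label": r["label"], "critique": r["critique"]}
        for r in latest.values()
        if r["pass_fail"] == "fail" and r["critique"].strip()
    ]


# Made-up label rows for illustration.
labels = [
    {"trace_id": "t-1", "pass_fail": "fail", "label": "dietary_violated",
     "critique": "recommended pancetta to a vegan user"},
    {"trace_id": "t-2", "pass_fail": "pass", "label": None, "critique": ""},
    {"trace_id": "t-3", "pass_fail": "fail", "label": "format_error", "critique": " "},
]
print(build_regression_set(labels))
```

Failing rows without a critique are left out here because a frozen example is hard to act on when nobody wrote down why it failed.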
Both downstream uses depend on a stable schema. Add fields with care; remove fields never. The file is a contract with the future versions of your eval program.
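The "add with care, remove never" rule can be enforced mechanically before a new labels file replaces the old one. The check below is a sketch under the assumption that the required columns are the five schema fields; `check_schema_change` is a hypothetical helper.

```python
REQUIRED_FIELDS = ("trace_id", "annotator", "pass_fail", "label", "critique")


def check_schema_change(old_columns: list[str], new_columns: list[str]) -> list[str]:
    """Return the newly added columns, or raise if any column was dropped."""
    removed = sorted(set(old_columns) - set(new_columns))
    missing = sorted(set(REQUIRED_FIELDS) - set(new_columns))
    if removed or missing:
        raise ValueError(f"schema contract broken: removed={removed}, missing={missing}")
    return sorted(set(new_columns) - set(old_columns))


old = ["trace_id", "annotator", "pass_fail", "label", "critique", "ts"]
new = old + ["needs_sme_review"]
print(check_schema_change(old, new))
```

Running a check like this in CI or in the viewer's save path turns the contract from a convention into a failing test.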
What can go wrong
A few patterns kill the viewer.
The first is feature creep. Someone asks for a chart. Then someone asks for an aggregated dashboard. Then someone asks for a workflow engine. The viewer becomes a half-built BI tool that nobody uses to read traces. The fix is to keep the read tool small and put the BI elsewhere.
The second is shared annotator credentials. If everyone logs in as "team", you cannot compute inter-rater agreement, and the label drift is invisible. The annotator field has to be per-person from day one.
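With per-person labels, agreement between two annotators takes a few lines to compute. The sketch below uses Cohen's kappa, one common agreement statistic and not necessarily the one your team will settle on, over the traces both annotators labeled. It assumes each annotator's verdicts are a dict from `trace_id` to verdict; `cohens_kappa` is a hypothetical helper.

```python
from collections import Counter


def cohens_kappa(a: dict[str, str], b: dict[str, str]) -> float:
    """Cohen's kappa over the traces that both annotators labeled."""
    shared = sorted(set(a) & set(b))
    if not shared:
        raise ValueError("annotators share no labeled traces")
    n = len(shared)
    observed = sum(a[t] == b[t] for t in shared) / n
    counts_a = Counter(a[t] for t in shared)
    counts_b = Counter(b[t] for t in shared)
    # Agreement expected by chance, from each annotator's verdict frequencies.
    expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n)
        for c in set(counts_a) | set(counts_b)
    )
    if expected == 1.0:
        return 1.0  # both used a single identical verdict everywhere
    return (observed - expected) / (1 - expected)


# Made-up verdicts from two annotators on four shared traces.
om = {"t-1": "pass", "t-2": "pass", "t-3": "fail", "t-4": "fail"}
kai = {"t-1": "pass", "t-2": "fail", "t-3": "fail", "t-4": "fail"}
print(round(cohens_kappa(om, kai), 3))
```

The two annotators agree on three of four traces, but half of that agreement is expected by chance, so kappa comes out at 0.5.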
The third is no backup of the labels file. The Parquet file is the most valuable artifact the team owns. It belongs in version control or in versioned object storage with retention. Losing the labels file is worse than losing a sprint of code; the code can be rewritten and the labels cannot [7].
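Even before versioned object storage is set up, a timestamped local copy is cheap insurance. The sketch below is a stopgap under stated assumptions: `snapshot_labels` and the backup directory are hypothetical, and a local copy is not a substitute for off-machine storage.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def snapshot_labels(labels_path: Path, backup_dir: Path) -> Path:
    """Copy the labels file to backup_dir under a UTC-timestamped name."""
    labels_path = Path(labels_path)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    # Microsecond precision keeps back-to-back snapshots from colliding.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    target = backup_dir / f"{labels_path.stem}-{stamp}{labels_path.suffix}"
    shutil.copy2(labels_path, target)  # copy2 also preserves file timestamps
    return target
```

Calling this from a daily job, or right before any bulk edit of the labels, leaves a trail of recoverable copies.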
Build the viewer this week. The savings start the day after. Hamel's framing is empirical: teams with a thoughtfully designed data viewer iterate ten times faster than teams without one [1]. The investment is one afternoon. The payoff is every error-analysis cycle for the life of the product.