The trace is the spine of every eval program. Error analysis reads traces. Online judges score traces. The dataset is a slice of traces. Drift detection compares trace distributions across windows. If the trace schema is wrong, every downstream tool is downstream of the wrong thing.
Most teams start with whatever their LLM SDK logs by default (input string, output string, latency, token counts) and then spend two quarters discovering they cannot answer the questions that matter. They cannot rebuild a session because session IDs were not captured. They cannot compare tool-call accuracy across two prompts because the tool calls were not stored as structured fields. They cannot reproduce a failure because the system prompt was templated at runtime and the rendered version was thrown away. The schema below is the minimum to avoid those quarters.
What a trace has to capture
A trace describes one user request and the response the system produced for it. If the request invoked tools or retrieval, those steps are spans inside the trace. The unit of analysis is the trace, not the model call, because an evaluator cares whether the system as a whole delivered the right behavior to the user. A single trace can contain ten model calls, and the user does not care which one was off [1].
The information falls into five buckets:
- Identity. Trace ID, parent ID (if part of a session), session ID, user ID or anonymous bucket, timestamp.
- Inputs. User message, system prompt as rendered, context windows, retrieved chunks, tool outputs the model saw.
- Configuration. Model name and version, sampling parameters (temperature, top_p, max_tokens), prompt template name and version, feature flags, A/B variant.
- Outputs. Final assistant text, intermediate tool calls and their JSON arguments, tool responses, refusals, stop reason.
- Operational signals. Latency total and per span, time to first token, input and output token counts, cost in cents, error class if any.
A useful test for the schema: can you replay this trace tomorrow against a candidate prompt change, with every other input to the model call held exactly as it was? If anything in the inputs or configuration is missing, the answer is no, and the schema needs that field.
The schema, JSON
{
  "trace_id": "01HXZ8K9TJ4VFEC2N7Q1Y3R7B5",
  "session_id": "sess_4b9c2e",
  "parent_trace_id": null,
  "user_bucket": "anon_h_3f9a",
  "ts": "2026-05-12T14:33:21.482Z",
  "app_version": "rag-router@2026.05.04",
  "ab_variant": "context_pack_v3",
  "feature_flags": ["rerank_v2", "sql_fallback_off"],
  "intent": "schedule_tour",
  "language": "en-US",
  "input": {
    "user_message": "I need a one bedroom with the bathroom not connected",
    "system_prompt_rendered": "You are a helpful property management assistant. ...",
    "system_prompt_template_id": "pm_assistant@v18",
    "messages": [
      {"role": "user", "content": "I need a one bedroom..."}
    ],
    "tools_offered": ["get_availability", "schedule_tour", "send_floorplan"]
  },
  "retrieval": [
    {
      "span_id": "ret_01",
      "query": "one bedroom bathroom not connected",
      "index": "units_2026q2",
      "top_k": 8,
      "results": [
        {"doc_id": "u_a101", "score": 0.81, "ms": 14},
        {"doc_id": "u_b205", "score": 0.77, "ms": 14}
      ],
      "latency_ms": 28
    }
  ],
  "tool_calls": [
    {
      "span_id": "tc_01",
      "name": "get_availability",
      "arguments": {"bedrooms": 1, "bathroom_connected": null},
      "raw_response": "[{\"unit\":\"A101\",...}]",
      "ok": true,
      "latency_ms": 92
    }
  ],
  "model_call": {
    "provider": "anthropic",
    "model": "claude-3-7-sonnet-20250219",
    "params": {"temperature": 0, "max_tokens": 1024, "top_p": 1.0},
    "input_tokens": 1342,
    "output_tokens": 187,
    "ttft_ms": 412,
    "latency_ms": 1583,
    "cost_cents": 2.71,
    "stop_reason": "end_turn"
  },
  "output": {
    "assistant_text": "I found two one-bedroom units: A101 and B205. Both have the bathroom adjacent to the living room rather than connected to the bedroom.",
    "refusal": false,
    "guardrail_actions": []
  },
  "error": null,
  "user_feedback": {"thumbs": null, "edit_distance": null}
}
The schema is wider than most teams start with, on purpose. Every field on this list shows up in a real query a real team writes within the first six months: "show me all traces where the system prompt template changed," "compute cost per intent," "find the traces where the retrieval returned fewer than three documents," "compute pass^k across sessions."
CITATION
Hamel Husain, "A Field Guide to Rapidly Improving AI Products." The piece argues for removing "ALL friction from looking at data" by capturing the trace as the unit of analysis and investing in a viewer that surfaces it cleanly. The trace schema below is the data side of that argument. hamel.dev/blog/posts/field-guide
The schema, Parquet
For warehousing and analytical queries, the JSON shape above flattens into a small set of Parquet tables. Nested fields stay nested where they are queried together; high-cardinality lists move to side tables. A reasonable starting layout is a traces table partitioned by date, with tool_calls, retrieval_spans, and evaluations as side tables keyed on trace_id [1][2].
# pyarrow schema for the core traces table.
import pyarrow as pa

trace_schema = pa.schema([
    ("trace_id", pa.string()),
    ("session_id", pa.string()),
    ("parent_trace_id", pa.string()),
    ("user_bucket", pa.string()),
    ("ts", pa.timestamp("ms", tz="UTC")),
    ("app_version", pa.string()),
    ("ab_variant", pa.string()),
    ("feature_flags", pa.list_(pa.string())),
    ("intent", pa.string()),
    ("language", pa.string()),
    ("input", pa.struct([
        ("user_message", pa.string()),
        ("system_prompt_rendered", pa.string()),
        ("system_prompt_template_id", pa.string()),
        ("tools_offered", pa.list_(pa.string())),
    ])),
    # Sampling params from the JSON "params" object are flattened into this struct.
    ("model_call", pa.struct([
        ("provider", pa.string()),
        ("model", pa.string()),
        ("temperature", pa.float32()),
        ("top_p", pa.float32()),
        ("max_tokens", pa.int32()),
        ("input_tokens", pa.int32()),
        ("output_tokens", pa.int32()),
        ("ttft_ms", pa.int32()),
        ("latency_ms", pa.int32()),
        ("cost_cents", pa.float32()),
        ("stop_reason", pa.string()),
    ])),
    ("output", pa.struct([
        ("assistant_text", pa.string()),
        ("refusal", pa.bool_()),
        ("guardrail_actions", pa.list_(pa.string())),
    ])),
    ("error_class", pa.string()),
    ("date", pa.date32()),  # partition key
])

tool_call_schema = pa.schema([
    ("trace_id", pa.string()),
    ("span_id", pa.string()),
    ("name", pa.string()),
    ("arguments_json", pa.string()),  # stored as JSON string for schema-flexible tools
    ("ok", pa.bool_()),
    ("latency_ms", pa.int32()),
    ("date", pa.date32()),
])

retrieval_span_schema = pa.schema([
    ("trace_id", pa.string()),
    ("span_id", pa.string()),
    ("query", pa.string()),
    ("index", pa.string()),
    ("top_k", pa.int32()),
    ("result_count", pa.int32()),
    ("top_score", pa.float32()),
    ("doc_ids", pa.list_(pa.string())),
    ("latency_ms", pa.int32()),
    ("date", pa.date32()),
])

evaluation_schema = pa.schema([
    ("trace_id", pa.string()),
    ("eval_name", pa.string()),
    ("score", pa.string()),  # PASS/FAIL or numeric encoded
    ("score_numeric", pa.float32()),
    ("judge_version", pa.string()),
    ("dataset_or_window", pa.string()),
    ("rater", pa.string()),  # llm-judge-vN, human-anita, etc.
    ("ts", pa.timestamp("ms", tz="UTC")),
    ("date", pa.date32()),
])
The split into four tables is deliberate. Tool calls and retrieval spans are sparse: most traces have zero or one of each, some traces have many. Keeping them in side tables means the trace row stays a fixed width that a viewer can render quickly, and the side tables support count(*) group by name queries that are awkward against deeply nested JSON. The evaluations table is the metric ledger from the online-vs-offline page; both regimes write into it [3].
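One way to picture the split is a small function that takes a JSON trace and emits rows for each table. The sketch below is illustrative only: it covers a handful of the core columns, the function name flatten_trace is hypothetical, and it assumes the trace follows the JSON example above.

```python
import json
from datetime import datetime


def flatten_trace(trace: dict) -> dict:
    """Split one JSON trace into rows for traces, tool_calls, and retrieval_spans.

    Only a few core columns are shown; a real writer would map every field.
    """
    ts = datetime.fromisoformat(trace["ts"].replace("Z", "+00:00"))
    day = ts.date()  # partition key shared by all tables
    tid = trace["trace_id"]

    trace_row = {
        "trace_id": tid,
        "session_id": trace.get("session_id"),
        "ts": ts,
        "intent": trace.get("intent"),
        "language": trace.get("language"),
        "date": day,
    }
    tool_rows = [
        {
            "trace_id": tid,
            "span_id": tc["span_id"],
            "name": tc["name"],
            # Arguments are serialized so the column type never changes per tool.
            "arguments_json": json.dumps(tc.get("arguments", {}), sort_keys=True),
            "ok": tc["ok"],
            "latency_ms": tc["latency_ms"],
            "date": day,
        }
        for tc in trace.get("tool_calls", [])
    ]
    retrieval_rows = [
        {
            "trace_id": tid,
            "span_id": r["span_id"],
            "query": r["query"],
            "index": r["index"],
            "top_k": r["top_k"],
            "result_count": len(r["results"]),
            "top_score": max((d["score"] for d in r["results"]), default=None),
            "doc_ids": [d["doc_id"] for d in r["results"]],
            "latency_ms": r["latency_ms"],
            "date": day,
        }
        for r in trace.get("retrieval", [])
    ]
    return {
        "traces": [trace_row],
        "tool_calls": tool_rows,
        "retrieval_spans": retrieval_rows,
    }
```

A trace with no tool calls or retrieval spans simply yields empty lists for the side tables, which is why the trace row itself can stay a fixed width.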
Field-by-field rationale
The choices below trip teams up enough that they are worth making explicit.
trace_id as ULID, not UUIDv4. ULIDs sort lexicographically by creation time, which makes range queries cheap and reduces the number of indices you need. UUIDv4s scatter randomly across the keyspace. The difference matters at scale.
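To make the sorting property concrete, here is a minimal ULID generator using only the standard library. It is a sketch of the layout (a 48-bit millisecond timestamp followed by 80 random bits, encoded in Crockford base32), not a replacement for a maintained ULID library; the function name new_ulid is hypothetical, and the sketch makes no ordering guarantee for IDs created within the same millisecond.

```python
import os
import time
from typing import Optional

# Crockford base32 alphabet: no I, L, O, U. It is in ASCII order, so
# string comparison of encoded values matches numeric comparison.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_ulid(ts_ms: Optional[int] = None) -> str:
    """Return a 26-character ULID-style ID for the given millisecond timestamp."""
    if ts_ms is None:
        ts_ms = time.time_ns() // 1_000_000
    # High 48 bits: timestamp. Low 80 bits: randomness.
    value = (ts_ms << 80) | int.from_bytes(os.urandom(10), "big")
    chars = []
    for _ in range(26):  # 26 chars * 5 bits covers the 128-bit value
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


earlier = new_ulid(1_700_000_000_000)
later = new_ulid(1_700_000_000_001)
print(earlier < later)  # IDs from a later millisecond sort after earlier ones
```

Because the timestamp occupies the leading characters, a range scan on trace_id between two ULID prefixes is also a time-range scan.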
user_bucket not raw user ID. PII discipline. Hash the user ID with a stable salt so analytical queries can still aggregate "this user always sees the long response" without storing identifiers. Privacy review goes faster, and with a per-user salt you can honor a deletion request by dropping that user's salt, which unlinks their traces.
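A minimal sketch of the bucketing step, assuming a keyed hash (HMAC-SHA256) with the salt as the key. The function name, the anon_h_ prefix handling, and the 16-character truncation are illustrative choices, not part of the schema above.

```python
import hashlib
import hmac


def user_bucket(user_id: str, salt: bytes, prefix: str = "anon_h_") -> str:
    """Derive a stable pseudonymous bucket from a user ID.

    The same (user_id, salt) pair always maps to the same bucket, so
    per-user aggregation still works without storing the raw ID.
    """
    digest = hmac.new(salt, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return prefix + digest[:16]  # truncation length is an arbitrary choice here


# Hypothetical salt; in practice it would come from a secrets store.
bucket = user_bucket("user-12345", salt=b"example-salt")
print(bucket.startswith("anon_h_"))
```

Once the salt is destroyed, existing buckets can no longer be recomputed from user IDs, which is what makes dropping it meaningful.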
system_prompt_rendered and system_prompt_template_id. Both, always. The rendered string is what the model actually saw and is what you need to reproduce. The template ID is what you join on when asking "did v18 of the prompt regress vs v17." Storing only one of the two means either you cannot reproduce or you cannot aggregate.
ab_variant and feature_flags. Necessary to attribute. A 5% drop in faithfulness that correlates with rerank_v2 is a finding; the same drop attributed to "Friday" is not.
intent and language. Two of the most useful stratification axes. Both are best computed at trace time from a small classifier and stored, rather than recomputed at query time. The classifier can be wrong; just version it (intent_classifier@2026-04-01) and add the version to the schema if you change it.
tool_calls.arguments_json as a string, not a struct. Tool argument schemas vary across tools and across versions of the same tool. Storing them as a struct forces a brittle union type. JSON strings let the warehouse parse with json_extract at query time and let you change tool schemas without a backfill.
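The read side of that choice can be sketched in plain Python: parse arguments_json only for the tool you are asking about, so other tools' argument shapes never matter. The rows and the helper null_argument_rate below are hypothetical; a warehouse would do the same with its JSON extraction function.

```python
import json

# Hypothetical tool_calls rows, shaped like the side table described above.
rows = [
    {"name": "get_availability",
     "arguments_json": json.dumps({"bedrooms": 1, "bathroom_connected": None})},
    {"name": "get_availability",
     "arguments_json": json.dumps({"bedrooms": 2, "bathroom_connected": False})},
    {"name": "schedule_tour",
     "arguments_json": json.dumps({"unit": "A101", "slot": "2026-05-14T10:00"})},
]


def null_argument_rate(rows: list, tool_name: str, arg: str) -> float:
    """Fraction of calls to tool_name where arg is absent or null."""
    calls = [json.loads(r["arguments_json"]) for r in rows if r["name"] == tool_name]
    if not calls:
        return 0.0
    return sum(1 for args in calls if args.get(arg) is None) / len(calls)


print(null_argument_rate(rows, "get_availability", "bathroom_connected"))  # 0.5
```

Adding a new argument to schedule_tour changes nothing in this table's schema; only queries that care about the new argument need to know it exists.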
evaluations.rater. Distinguish llm-judge-faithfulness@v3, human-anita, and human-consensus-3rater cleanly. You will want to filter and join on this constantly: "show me LLM-judge scores where human-consensus disagrees," "compute kappa between Anita and Bob on the calibration slice."
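As an illustration of the kappa query, the sketch below computes Cohen's kappa for two raters over the same traces. It assumes you have already pulled each rater's labels from the evaluations table and aligned them by trace_id; the function name and the label values are illustrative.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two raters' labels on the same items, in the same order."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


anita = ["PASS", "PASS", "FAIL", "FAIL"]
bob = ["PASS", "PASS", "FAIL", "PASS"]
print(cohen_kappa(anita, bob))  # 0.5
```

The same function works for an LLM judge against a human rater, which is why the rater column needs to distinguish them cleanly.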
What not to log
Not every byte the system saw belongs in the trace. The four things to leave out, or to route to a separate store with stricter access controls:
- Raw PII in user messages. Use redaction at trace-write time and store both the redacted text and a hash of the original. A separate, tightly-scoped store can hold the unredacted text if your privacy posture allows.
- Vector embeddings. They are large and rarely useful for the eval workflow. Store the document IDs and recompute embeddings on demand if you need them.
- The full text of every retrieved chunk in the trace row. Store document IDs and content versions in the trace, and join to a documents table at query time. Trace rows that average 100KB make every viewer slow.
- Continuous user keystrokes or session telemetry. Out of scope for the eval program; if your product needs this, it goes in a different pipeline.
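For the first item, redaction at write time might look like the sketch below. The two regular expressions are deliberately simple placeholders for whatever PII detector you actually use, and the field names user_message_sha256 and redact_for_trace are hypothetical.

```python
import hashlib
import re

# Simplistic patterns for illustration; real redaction needs a proper detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_for_trace(text: str) -> dict:
    """Return the redacted message plus a hash of the original text."""
    redacted = EMAIL.sub("[EMAIL]", text)
    redacted = PHONE.sub("[PHONE]", redacted)
    return {
        "user_message": redacted,
        # The hash lets you detect duplicates or match against the restricted
        # store without keeping the raw text in the trace row.
        "user_message_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }


row = redact_for_trace("Email me at jane@example.com or call 415-555-0100 about unit A101")
print(row["user_message"])
# Email me at [EMAIL] or call [PHONE] about unit A101
```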
Reproducing a trace
If the schema is right, this function works:
def replay_trace_with_candidate(trace: dict, candidate_template_id: str) -> dict:
    """Rerun the same model call with a candidate system prompt template.

    Holds every other dimension fixed so the delta is attributable.
    render_template and call_model are assumed to be defined by the
    application (template renderer and provider client).
    """
    # Index directly instead of defaulting to {}: rendering the candidate
    # with missing variables would silently produce a different prompt.
    candidate_prompt = render_template(
        template_id=candidate_template_id,
        variables=trace["input"]["template_variables"],
    )
    response = call_model(
        provider=trace["model_call"]["provider"],
        model=trace["model_call"]["model"],
        system=candidate_prompt,
        messages=trace["input"]["messages"],
        params=trace["model_call"]["params"],
    )
    return {
        "original_trace_id": trace["trace_id"],
        "candidate_template_id": candidate_template_id,
        "candidate_output": response["text"],
        "candidate_cost_cents": response["cost_cents"],
    }
If template_variables are missing, you cannot replay. If params are missing, you cannot replay. If messages are concatenated into a single string at log time, you cannot replay. Every "we cannot reproduce that bug" postmortem in eval-land traces back to a missing field in this list [4].
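Those three failure modes can be checked mechanically before a trace is admitted to a replay set. The helper below is a sketch under the assumption that traces follow the JSON shape on this page plus an input.template_variables field; the function name is hypothetical.

```python
REQUIRED_FOR_REPLAY = [
    ("input", "messages"),
    ("input", "template_variables"),
    ("model_call", "provider"),
    ("model_call", "model"),
    ("model_call", "params"),
]


def missing_replay_fields(trace: dict) -> list:
    """Return dotted names of fields that block a faithful replay."""
    problems = []
    for section, field in REQUIRED_FOR_REPLAY:
        value = (trace.get(section) or {}).get(field)
        if value is None:
            problems.append(f"{section}.{field}")
    # Messages flattened into one string cannot be replayed turn by turn.
    messages = (trace.get("input") or {}).get("messages")
    if messages is not None and not isinstance(messages, list):
        problems.append("input.messages (not a list)")
    return problems


trace = {
    "input": {"messages": [{"role": "user", "content": "I need a one bedroom..."}]},
    "model_call": {"provider": "anthropic", "model": "claude-3-7-sonnet-20250219"},
}
print(missing_replay_fields(trace))
# ['input.template_variables', 'model_call.params']
```

Running a check like this at write time, and alerting on a non-empty result, surfaces schema gaps long before anyone needs the replay.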
Sampling and retention
Two policies sit on top of the schema. Sampling decides which traces persist; retention decides how long they stick around.
The default is to log every trace. Storage is cheap relative to model cost, and the next eval question is rarely answerable from a sample. If volume forces sampling, sample on the way out (write only a fraction) rather than on the way in (drop traces before the model runs), and stratify by intent so rare intents persist at higher rates [5].
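A sketch of write-side stratified sampling, assuming a per-intent keep rate and a deterministic hash of the trace ID so the decision is reproducible. The rates, the intent names other than schedule_tour, and the function name are made up for illustration.

```python
import hashlib

# Hypothetical keep rates: high-volume intents are thinned, and anything not
# listed (rare or new intents) is kept in full.
KEEP_RATES = {"schedule_tour": 0.10, "general_question": 0.25}
DEFAULT_RATE = 1.0


def should_persist(trace_id: str, intent: str,
                   rates: dict = KEEP_RATES,
                   default_rate: float = DEFAULT_RATE) -> bool:
    """Decide at write time whether to store a trace, stratified by intent."""
    rate = rates.get(intent, default_rate)
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a uniform value in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    return u < rate


print(should_persist("01HXZ8K9TJ4VFEC2N7Q1Y3R7B5", "rare_intent"))  # True: unlisted intents use rate 1.0
```

If you sample this way, it is worth storing the applied rate alongside each kept trace so that aggregate metrics can be reweighted later.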
Retention has three tiers worth distinguishing. Hot (30 days, full fidelity) supports the daily error-analysis loop. Warm (180 days, full fidelity) supports drift comparisons and quarterly reviews. Cold (12+ months, summarized) supports year-over-year retrospectives. Privacy regulations (GDPR, CCPA, sector rules) bound these windows; bias toward shorter hot retention with fast access rather than longer hot retention with slow access.
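The tiers reduce to a small age-based rule. In this sketch the 30- and 180-day thresholds come from the text above; treating the boundaries as inclusive, and the function name itself, are assumptions.

```python
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(days=30)
WARM_WINDOW = timedelta(days=180)


def retention_tier(trace_ts: datetime, now: datetime) -> str:
    """Classify a trace as hot, warm, or cold by its age."""
    age = now - trace_ts
    if age <= HOT_WINDOW:
        return "hot"   # full fidelity, fast access
    if age <= WARM_WINDOW:
        return "warm"  # full fidelity, slower access
    return "cold"      # summarized


now = datetime(2026, 5, 12, tzinfo=timezone.utc)
print(retention_tier(datetime(2026, 5, 1, tzinfo=timezone.utc), now))   # hot
print(retention_tier(datetime(2026, 1, 15, tzinfo=timezone.utc), now))  # warm
print(retention_tier(datetime(2025, 5, 1, tzinfo=timezone.utc), now))   # cold
```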
What a working trace store unlocks
With the schema above, every routine question becomes a one-paragraph SQL query. "What did we ship last Friday that bumped p95 latency on French traffic?" Filter by date, language, and ab_variant, plot p95. "Where does the system prompt template pm_assistant@v18 underperform v17 on intent schedule_tour?" Join evaluations to traces, group by template ID and intent. "Which tool has the highest failure rate?" Group tool_calls by name where ok = false. None of these require any more instrumentation; they require the fields to already be there [6].
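The tool failure-rate question, for instance, fits in a few lines. The sketch below runs it against an in-memory SQLite table standing in for the tool_calls Parquet table; the rows are invented, and your warehouse's SQL dialect may differ slightly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tool_calls "
    "(trace_id TEXT, span_id TEXT, name TEXT, ok INTEGER, latency_ms INTEGER)"
)
# Invented rows for illustration only.
conn.executemany(
    "INSERT INTO tool_calls VALUES (?, ?, ?, ?, ?)",
    [
        ("t1", "tc_01", "get_availability", 1, 92),
        ("t2", "tc_01", "get_availability", 0, 120),
        ("t2", "tc_02", "schedule_tour", 1, 80),
        ("t3", "tc_01", "schedule_tour", 1, 75),
    ],
)

failure_rates = conn.execute(
    """
    SELECT name,
           COUNT(*) AS calls,
           AVG(CASE WHEN ok = 0 THEN 1.0 ELSE 0.0 END) AS failure_rate
    FROM tool_calls
    GROUP BY name
    ORDER BY failure_rate DESC, name
    """
).fetchall()

print(failure_rates)
# [('get_availability', 2, 0.5), ('schedule_tour', 2, 0.0)]
```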
This page sits at the top of the production section because every other page assumes these fields exist. The sampling page assumes intent is a column. The drift page assumes app_version and ab_variant are columns. The cost and latency page assumes cost_cents and ttft_ms are columns. Get the schema in place first, then build everything else against it.