Voice. STT, TTS, conversational

Covers STT (audio to text), TTS (text to audio), and full-duplex conversational voice. The headline metrics are WER, MOS, and TTFB; WER on clean audio alone is not enough.

What "good" looks like

Voice is three tasks, each scored differently: speech-to-text (STT) on error rate, text-to-speech (TTS) on listener panels, and conversational voice on turn-taking latency. Match the metric to the sub-task first.

The ship targets:

STT: low single-digit WER closed-domain; low double-digit WER on open-domain accents and noise.
TTS: MOS above 4.0 on a P.808 panel; below 3.5 is usually production-blocking for consumer products.
Conversational: TTFB P95 under 300ms; above 300ms reads as slow, above 600ms as broken.
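
The targets above can be written down as an explicit release gate. The following is a minimal sketch, not a prescribed implementation: the function name `ship_gate` and the per-stratum WER targets are hypothetical (the text gives only "low single-digit" and "low double-digit" ranges), while the MOS 4.0 and 300ms TTFB P95 defaults come from the list above.

```python
def ship_gate(stratum_wer, wer_targets, mos=None, ttfb_p95_ms=None,
              mos_target=4.0, ttfb_target_ms=300.0):
    """Return a list of failed checks; an empty list means every supplied check passed.

    stratum_wer and wer_targets map a stratum name to a WER fraction (0.05 = 5%).
    The per-stratum targets are assumptions you set for your own product.
    """
    failures = []
    # Gate on each stratum separately so a weak stratum cannot hide in an aggregate.
    for stratum, wer in sorted(stratum_wer.items()):
        target = wer_targets[stratum]
        if wer > target:
            failures.append(f"WER[{stratum}]={wer:.3f} exceeds target {target:.3f}")
    # The target is "MOS above 4.0", so a MOS of exactly 4.0 does not pass.
    if mos is not None and mos <= mos_target:
        failures.append(f"MOS={mos:.2f} is not above {mos_target:.2f}")
    # The target is "TTFB P95 under 300ms", so exactly 300ms does not pass.
    if ttfb_p95_ms is not None and ttfb_p95_ms >= ttfb_target_ms:
        failures.append(f"TTFB P95={ttfb_p95_ms:.0f}ms is not under {ttfb_target_ms:.0f}ms")
    return failures
```

Returning the list of failures rather than a boolean keeps the weakest stratum visible, which is the thing to fix first.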

STT. The primary is word error rate (WER) or character error rate (CER), stratified by language, accent, and noise condition. The Whisper paper's per-dataset WER tables are the methodology of record^[1].

Stratification is the paper's central finding, not pedantry. Whisper, trained on 680,000 hours of weakly supervised audio, and a supervised LibriSpeech model score within 0.1% of each other on LibriSpeech test-clean. Yet zero-shot Whisper makes 55.2% fewer errors on average across the other benchmark datasets^[2]. Two systems that match on the clean benchmark are not the same product.
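
To make the stratified view concrete, here is a minimal sketch of per-stratum WER using a word-level edit distance. The function names and the `(stratum, reference, hypothesis)` row layout are assumptions for illustration; the sketch also assumes both strings have already been normalized the same way.

```python
from collections import defaultdict

def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)

def wer_by_stratum(rows):
    """rows: iterable of (stratum, reference, hypothesis). Returns {stratum: WER}."""
    errors, words = defaultdict(int), defaultdict(int)
    for stratum, ref, hyp in rows:
        e, n = word_errors(ref, hyp)
        errors[stratum] += e
        words[stratum] += n
    # Pool errors and words within a stratum; skip strata with no reference words.
    return {s: errors[s] / words[s] for s in words if words[s] > 0}
```

Pooling errors over total reference words within each stratum, rather than averaging per-utterance WER, keeps short utterances from dominating the number.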

TTS. The primary is mean opinion score (MOS) via the ITU-T P.808 crowdsourcing protocol^[3]. MUSHRA panels are the higher-rigor alternative.

Conversational voice. The metrics are time-to-first-byte (TTFB) latency, barge-in handling (does the model stop speaking when the user starts?), and end-of-turn detection accuracy.

flowchart TD
    IN["Voice task"] --> Q{"Which sub-task?"}
    Q -->|STT| STT["WER/CER, stratified by accent and noise"]
    Q -->|TTS| TTS["MOS via P.808 panel"]
    Q -->|Conversational| CV["TTFB, barge-in, end-of-turn"]
    STT --> GS{"WER within target per stratum?"}
    TTS --> GT{"MOS above 4.0?"}
    CV --> GC{"TTFB P95 under 300ms?"}
    GS -->|no| FIX["Fix the weakest stratum, not aggregate"]
    GT -->|no| FIX
    GC -->|no| FIX
    GS -->|yes| SHIP["Ship"]
    GT -->|yes| SHIP
    GC -->|yes| SHIP

Secondary metrics

Metric	What it catches	How to score
Accent stratification	Systems that pass on-distribution and fail on accents	WER per accent, scored on the accented-English subset alone
Noise robustness	Degradation as conditions get worse	WER curve over signal-to-noise ratio (SNR)
Code-switching	Failure on mixed-language utterances	WER on utterances that mix two languages
Named-entity WER	Domain failures, which concentrate on proper nouns	WER scored on proper nouns specifically
TTS naturalness and emotional appropriateness	Robotic or tonally wrong speech	Rubric panels
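
One common way to produce the WER-over-SNR curve from the noise-robustness row is to mix a noise recording into clean speech at controlled SNR levels and score each level separately. The sketch below shows only the mixing step. It assumes equal-length lists of float samples, and the name `mix_at_snr` is hypothetical.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals snr_db, then add it."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must both be non-silent")
    # SNR_dB = 10 * log10(P_speech / P_noise_scaled), solved for the noise power.
    target_noise_power = p_speech / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / p_noise)
    return [s + gain * n for s, n in zip(speech, noise)]
```

Running the same utterances through a fixed ladder of SNR values (for example 20, 10, 5, and 0 dB, chosen here purely as an illustration) gives one WER point per level.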

Offline eval design

Start on the published breadth sets. LibriSpeech, Common Voice 15, and Fleurs cover the major STT benchmarks. The Whisper repo publishes per-language WER and CER breakdowns on Common Voice 15 and Fleurs, and those tables are the reference format for reporting^[1].

Then build a stratified domain set: 500 utterances from your own domain, stratified by accent and noise level. The stratification is what makes the eval predict production: the clean-benchmark gap between systems can be a fraction of a point while the off-distribution gap is a factor of two^[2].
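
A minimal sketch of drawing that set, assuming a pool of candidate utterances already tagged with accent and noise level. The dict keys `accent` and `noise` and the equal-share allocation are assumptions; you may prefer to weight cells by production traffic instead.

```python
import random
from collections import defaultdict

def stratified_sample(pool, total=500, seed=0):
    """Draw an equal share of utterances from every (accent, noise) cell.

    pool: list of dicts with 'accent' and 'noise' keys (hypothetical schema).
    Cells smaller than their share contribute everything they have, so the
    result can be shorter than `total`.
    """
    cells = defaultdict(list)
    for utt in pool:
        cells[(utt["accent"], utt["noise"])].append(utt)
    if not cells:
        return []
    rng = random.Random(seed)  # fixed seed so the eval set is reproducible
    per_cell = total // len(cells)
    sample = []
    for key in sorted(cells):
        items = cells[key]
        sample.extend(rng.sample(items, min(per_cell, len(items))))
    return sample
```

Fixing the seed matters more than the allocation rule: the set has to stay identical across releases for WER deltas to mean anything.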

Normalize before scoring. WER is brutally sensitive to normalization choices (casing, punctuation, number formats). The Whisper paper ships its normalizer precisely because unnormalized comparisons across systems are meaningless^[2].
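
As an illustration of what a fixed normalizer does, here is a deliberately small sketch that lowercases, strips punctuation (keeping apostrophes), and collapses whitespace. It is not the Whisper normalizer and it does not handle number formats; for cross-system comparisons, use the published normalizer instead.

```python
import re
import string

# Every ASCII punctuation character except the apostrophe, so "it's" survives.
_PUNCT = re.compile("[" + re.escape(string.punctuation.replace("'", "")) + "]")

def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = text.lower()
    text = _PUNCT.sub(" ", text)
    return " ".join(text.split())
```

Apply the same function to both the reference and the hypothesis before computing WER; normalizing only one side inflates the error count.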

For TTS, run the panel. Use P.808 panels with 30 listeners per release. It is expensive but it is the truth in this category^[3]. Use MUSHRA when you need higher rigor for a paper or a vendor comparison.
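
Whatever the panel size, report the MOS with an interval rather than a bare mean. The sketch below assumes one rating per listener on the 1 to 5 scale and uses a normal approximation (z = 1.96) for a 95% interval; with about 30 ratings a t-based interval would be slightly wider. The function name is hypothetical.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Return (mean, lower, upper) for a list of opinion scores.

    Uses the sample standard deviation and a normal approximation, which is
    an assumption; it understates the interval slightly for small panels.
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings to estimate variance")
    mean = statistics.fmean(ratings)
    sem = statistics.stdev(ratings) / math.sqrt(n)  # standard error of the mean
    return mean, mean - z * sem, mean + z * sem
```

If the intervals of two releases overlap heavily, the panel has not shown a difference, however the means compare.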

For conversational voice, instrument the full duplex. Log three things:

End-of-turn accuracy: timing of model audio onset relative to user speech end.
Barge-in: the response to user interruption.
TTFB distribution per turn.
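
The three logged quantities can be summarized per session roughly as follows. This is a sketch under assumptions: the turn schema (`ttfb_ms`, `user_end_ms`, `model_onset_ms`, `barge_in_ms`, `model_stop_ms`) is hypothetical, and the percentile uses the nearest-rank method.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize_turns(turns):
    """Summarize TTFB, end-of-turn timing, and barge-in handling for logged turns."""
    ttfb = [t["ttfb_ms"] for t in turns]
    # Negative offset: the model started speaking before the user had finished.
    onset_offsets = [t["model_onset_ms"] - t["user_end_ms"] for t in turns]
    interrupted = [t for t in turns if t.get("barge_in_ms") is not None]
    # Delay between the user interrupting and the model audio actually stopping.
    stop_delays = [t["model_stop_ms"] - t["barge_in_ms"]
                   for t in interrupted if t.get("model_stop_ms") is not None]
    return {
        "ttfb_p50_ms": percentile(ttfb, 50),
        "ttfb_p95_ms": percentile(ttfb, 95),
        "early_onsets": sum(1 for d in onset_offsets if d < 0),
        "barge_in_stop_delays_ms": stop_delays,
        # Interrupted turns where the model never stopped talking.
        "barge_in_failures": len(interrupted) - len(stop_delays),
    }
```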

Online eval design

Sample and re-score weekly. Send sampled production audio, the transcript, and the TTS output to a labeling queue, then compute WER against the human-corrected transcripts each week.

Alert on the metrics that move. Watch TTFB P95 (sub-300ms is the usual product target for natural conversation), barge-in failures (the model talks over the user), and per-accent WER drift. The Whisper paper's stratified-WER methodology generalizes; never report aggregate WER without per-segment breakdowns.
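
A per-accent drift check can be as simple as comparing this week's per-stratum WER with a stored baseline. The sketch below is illustrative only: the function name and the 0.02 absolute-increase threshold are assumptions, not recommended values.

```python
def drifted_strata(baseline, current, max_abs_increase=0.02):
    """Return the strata whose WER rose by more than max_abs_increase.

    baseline and current map a stratum name to a WER fraction (0.05 = 5%).
    Strata missing from the baseline are skipped rather than flagged.
    """
    return sorted(
        stratum
        for stratum, wer in current.items()
        if stratum in baseline and wer - baseline[stratum] > max_abs_increase
    )
```

With a few hundred sampled utterances per week, small per-accent moves are mostly noise, so the threshold should sit above the week-to-week variation you observe.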

Top benchmarks

LibriSpeech: the standard clean and other splits; saturated at the top, still useful for regression.
Common Voice 15 and Fleurs: Common Voice is multilingual and accented; Fleurs covers 102 languages with broad accent coverage.
P.808 panels (https://www.itu.int/rec/T-REC-P.808): the TTS standard for subjective evaluation.

Tools especially suited

jiwer: WER scoring in Python with normalization options.
ESPnet: full STT and TTS pipeline experiments.
P.808 protocol kits: crowdsourced TTS evaluation.
Whisper benchmark suite: including its text normalizer, for replicating the per-dataset WER methodology^[1].

Common pitfalls

Pitfall	Why it bites	Fix
WER on clean audio only	Clean WER is saturated for leading models and conceals the robustness gap; the 0.1%-on-clean versus 55.2%-fewer-errors-elsewhere result is the canonical demonstration	Stratify WER by accent and noise^[2]
MOS without sample-size statistics	A single P.808 panel of 10 listeners is high variance	Report confidence intervals or use the protocol's recommended sample sizes
Ignoring the sub-300ms TTFB target for conversational voice	Above 300ms the user perceives the system as slow; above 600ms as broken	Track and alert on TTFB P95 against the 300ms target
Treating WER as the only STT metric	WER on named entities is often 3 to 10 times higher than aggregate WER; the named-entity number predicts customer escalation better	Score named-entity WER separately^[4]
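
One rough way to score entities separately is to align reference and hypothesis words and count how many reference entity tokens failed to match. The sketch below is an approximation, not a full named-entity WER: it uses `difflib.SequenceMatcher` for the alignment, ignores inserted words, and assumes the entity list is supplied by you and that all text is already normalized.

```python
from difflib import SequenceMatcher

def entity_error_rate(reference, hypothesis, entities):
    """Fraction of reference entity tokens with no aligned match in the hypothesis.

    entities: list of entity strings that occur in the reference (assumed input).
    Returns None when the reference contains no entity tokens.
    """
    ref, hyp = reference.split(), hypothesis.split()
    entity_words = {w for e in entities for w in e.split()}
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    matched = set()
    for block in matcher.get_matching_blocks():
        matched.update(range(block.a, block.a + block.size))
    positions = [i for i, w in enumerate(ref) if w in entity_words]
    if not positions:
        return None
    missed = sum(1 for i in positions if i not in matched)
    return missed / len(positions)
```

Reporting this number next to aggregate WER makes the gap between the two visible per release.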

Canonical reading

The Whisper paper is the canonical reading: it establishes per-dataset stratified WER as the right methodology and shows that in-distribution scores systematically overstate real-world robustness. Its human-comparison framing (compared with humans, the models "approach their accuracy and robustness") is the honest way to position any STT system^[2]. The repo carries the living WER tables on Common Voice 15 and Fleurs^[1]. The ITU-T P.808 recommendation is the canonical reference for TTS subjective evaluation via crowdsourcing; the protocol makes MOS comparable across panels^[3].

Day-1 starter

500 utterances stratified by accent and noise. WER per stratum, not aggregate, scored after a fixed normalizer. Aggregate clean-audio WER hides the failure modes that customers report.
A 30-listener P.808 panel for every TTS release. The panel is the truth; automated naturalness scorers are useful for regression but not for decisions.
A TTFB P50 and P95 dashboard for any conversational product. TTFB above 600ms reads as broken; alert on P95 against the 300ms target.

Cross-cutting links

Cost and latency belong in the scorecard: TTFB percentiles as first-class eval metrics.
Confidence intervals (Wilson and Clopper-Pearson): the sample-size discipline MOS panels need.
Dimensional sampling: stratifying by accent and noise instead of averaging over them.
Statistical methods: the technique reference behind panel statistics.
Langfuse: the labeling queue for sampled audio, transcripts, and TTS output.
Adding error bars to your evals: error bars on WER and MOS deltas.