What "good" looks like
Voice splits into three sub-tasks with different metrics. For speech-to-text (STT), the primary metric is word error rate (WER) or character error rate (CER), stratified by language, accent, and noise condition; the Whisper paper's per-dataset WER tables are the methodology of record[1]. For text-to-speech (TTS), the primary metric is mean opinion score (MOS) collected via the ITU-T P.808 crowdsourcing protocol[2]; MUSHRA panels (ITU-R BS.1534) are the alternative for higher-rigor comparisons. For conversational voice, the metrics are time-to-first-byte (TTFB) latency, barge-in handling (does the model stop speaking when the user starts?), and end-of-turn detection accuracy.
The "good" target on STT depends on the use case. Closed-domain transcription should reach low single-digit WER; open-domain speech with accents and noise should land in the low double digits. For TTS, target a MOS above 4.0 on a P.808 panel; below 3.5 is usually production-blocking for consumer products.
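As a reference point for the WER and CER targets above, the sketch below computes both from a Levenshtein distance over words or characters. It is a minimal illustration rather than a replacement for a scoring library: the names `edit_distance`, `wer`, and `cer` are made up here, and normalization is assumed to be only lowercasing and whitespace collapsing, which is cruder than what a production scorer applies.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    # Assumption: lowercasing + whitespace split is the only normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate over whitespace-collapsed, lowercased text."""
    ref = list(" ".join(reference.lower().split()))
    hyp = list(" ".join(hypothesis.lower().split()))
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion over six words
    print(cer("hello", "hallo"))                                # one substitution over five characters
```

Because insertions count as errors but do not add to the denominator, WER can exceed 1.0 on badly hallucinated output; that is expected behavior, not a bug in the scorer.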
Secondary metrics
Accent stratification: per-accent WER, on accented English alone. Noise robustness: the WER curve over signal-to-noise ratio (SNR). Code-switching: WER on utterances that mix two languages. Named-entity WER: scored on proper nouns specifically, because named entities are where domain failures concentrate. TTS naturalness and emotional appropriateness: scored by rubric panels.
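One way to score named entities separately is to align the reference against the hypothesis and count errors only on reference tokens tagged as entities. The sketch below assumes the tags come from annotators as a parallel list of booleans (`entity_flags`, a hypothetical name), and it ignores insertions because they cannot be attributed to a specific reference token. Both are simplifications, and `align` and `entity_error_rate` are illustrative names.

```python
def align(ref, hyp):
    """Return [(op, ref_index)] from a Levenshtein alignment.

    op is 'ok', 'sub', 'del', or 'ins'; ref_index is None for 'ins'.
    """
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)

    # Backtrace from the bottom-right corner to recover the operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            if d[i][j] == d[i - 1][j - 1] + cost:
                ops.append(("ok" if cost == 0 else "sub", i - 1))
                i, j = i - 1, j - 1
                continue
        if i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("del", i - 1))
            i -= 1
        else:
            ops.append(("ins", None))
            j -= 1
    ops.reverse()
    return ops


def entity_error_rate(ref_tokens, hyp_tokens, entity_flags):
    """Share of reference entity tokens that were substituted or deleted."""
    if len(entity_flags) != len(ref_tokens):
        raise ValueError("entity_flags must have one entry per reference token")
    total = sum(1 for flag in entity_flags if flag)
    if total == 0:
        raise ValueError("no entity tokens in reference")
    ops = align(ref_tokens, hyp_tokens)
    errors = sum(1 for op, idx in ops if op in ("sub", "del") and entity_flags[idx])
    return errors / total


if __name__ == "__main__":
    ref = "call doctor nguyen at mercy hospital".split()
    hyp = "call doctor win at mercy hospital".split()
    flags = [False, False, True, False, True, True]  # hypothetical annotation
    print(entity_error_rate(ref, hyp, flags))
```

In this toy utterance a single substitution gives an overall WER of one in six, while the entity-only rate is one in three, which is the kind of gap the aggregate number hides.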
Offline eval design
Start with LibriSpeech, Common Voice 15, and Fleurs for breadth across the major STT benchmarks. On top of those, build a domain set of 500 utterances stratified by accent and noise level; the stratification is what makes the eval predict production. For TTS, run P.808 panels with 30 listeners per release; the panel is expensive, but it is the ground truth in this category[2]. Use MUSHRA when you need higher rigor for a paper or a vendor comparison.
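A domain set stratified by accent and noise can be drawn from a larger labeled pool roughly as follows. This is a sketch under stated assumptions: each utterance is a dict with hypothetical `accent` and `noise` keys, allocation is equal across strata (proportional allocation is an equally valid choice), and any remainder that does not divide evenly is simply dropped.

```python
import random
from collections import defaultdict


def stratified_sample(pool, total=500, seed=0):
    """Draw up to `total` utterances with equal allocation per (accent, noise) stratum."""
    if not pool:
        return []
    rng = random.Random(seed)  # fixed seed so the eval set is reproducible

    strata = defaultdict(list)
    for utt in pool:
        strata[(utt["accent"], utt["noise"])].append(utt)

    # Equal allocation; with 12 strata and total=500 this yields 41 each (492 total).
    per_stratum = total // len(strata)

    sample = []
    for key in sorted(strata):  # sorted for a deterministic draw order
        items = strata[key]
        k = min(per_stratum, len(items))  # a small stratum contributes all it has
        sample.extend(rng.sample(items, k))
    return sample


if __name__ == "__main__":
    accents = ["accent_a", "accent_b", "accent_c", "accent_d"]  # placeholder labels
    noises = ["clean", "moderate", "heavy"]                     # placeholder labels
    pool = [
        {"id": f"{a}-{n}-{i}", "accent": a, "noise": n}
        for a in accents for n in noises for i in range(100)
    ]
    subset = stratified_sample(pool, total=480, seed=1)
    print(len(subset))
```

If a stratum has fewer utterances than its allocation, the function returns a smaller set rather than oversampling; that shortfall is itself a useful signal about where to collect more data.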
For conversational voice, instrument the full duplex: log the timing of model audio onset relative to user speech end (end-of-turn accuracy), the response to user interruption (barge-in), and the TTFB distribution per turn.
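The timing signals described above can be derived from per-turn timestamps. The sketch below assumes a hypothetical `Turn` record with millisecond timestamps on a shared clock; the field names and the `max_stop_ms` threshold are illustrative placeholders, not values from any standard.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    user_end_ms: float                     # when the user actually stopped speaking
    model_onset_ms: float                  # when model audio started playing
    interrupt_ms: Optional[float] = None   # when the user barged in, if they did
    model_stop_ms: Optional[float] = None  # when model audio stopped after the barge-in


def summarize(turns: List[Turn], max_stop_ms: float = 200.0) -> dict:
    """Summarize end-of-turn timing and barge-in handling for a session."""
    # Gap between user speech end and model audio onset; negative means the
    # model started talking before the user had finished (premature end-of-turn).
    gaps = [t.model_onset_ms - t.user_end_ms for t in turns]
    early = sum(1 for g in gaps if g < 0)

    # A barge-in fails if the model never stopped or stopped too slowly.
    interrupted = [t for t in turns if t.interrupt_ms is not None]
    failures = sum(
        1 for t in interrupted
        if t.model_stop_ms is None or t.model_stop_ms - t.interrupt_ms > max_stop_ms
    )

    return {
        "onset_gaps_ms": gaps,
        "early_onset_rate": early / len(turns) if turns else 0.0,
        "barge_in_failure_rate": failures / len(interrupted) if interrupted else 0.0,
    }


if __name__ == "__main__":
    session = [
        Turn(user_end_ms=1000, model_onset_ms=1250),
        Turn(user_end_ms=5000, model_onset_ms=4900),  # model cut in early
        Turn(user_end_ms=9000, model_onset_ms=9300, interrupt_ms=9800, model_stop_ms=9900),
        Turn(user_end_ms=14000, model_onset_ms=14200, interrupt_ms=15000),  # never stopped
    ]
    print(summarize(session))
```

Keeping the raw gap list rather than only a mean makes it possible to plot the full distribution per turn, which is where long-tail latency shows up.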
Online eval design
Route a sample of production audio, transcripts, and TTS output to a labeling queue. Compute WER against the human-corrected labels weekly. Alert on TTFB P95 (sub-300ms is the usual product target for natural conversation), barge-in failures (the model talks over the user), and per-accent WER drift. The Whisper paper's stratified-WER methodology generalizes; do not report aggregate WER without per-segment breakdowns.
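The weekly checks can be expressed as a small function over the week's measurements. In this sketch the 300ms P95 target comes from the text above, while the drift tolerance `max_wer_increase`, the accent labels, and the function names are assumptions chosen for illustration; the percentile uses the nearest-rank definition.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer between 1 and 100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def weekly_alerts(ttfb_ms, wer_by_accent, baseline_wer_by_accent,
                  ttfb_p95_target_ms=300.0, max_wer_increase=0.02):
    """Return human-readable alerts for TTFB P95 and per-accent WER drift."""
    alerts = []

    p95 = percentile(ttfb_ms, 95)
    if p95 > ttfb_p95_target_ms:
        alerts.append(f"TTFB P95 {p95:.0f}ms exceeds target {ttfb_p95_target_ms:.0f}ms")

    for accent, current in sorted(wer_by_accent.items()):
        baseline = baseline_wer_by_accent.get(accent)
        if baseline is None:
            continue  # no baseline yet for this accent; nothing to compare against
        if current - baseline > max_wer_increase:
            alerts.append(
                f"WER drift on {accent}: {baseline:.3f} -> {current:.3f}"
            )
    return alerts


if __name__ == "__main__":
    ttfb = [200.0] * 18 + [450.0, 500.0]
    current = {"accent_a": 0.14, "accent_b": 0.05}   # placeholder labels and values
    baseline = {"accent_a": 0.10, "accent_b": 0.05}
    for line in weekly_alerts(ttfb, current, baseline):
        print(line)
```

Comparing each accent against its own baseline, rather than against the aggregate, keeps a regression on a small segment from being averaged away.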
Top benchmarks
- LibriSpeech, the standard clean and other splits; saturated at the top, still useful for regression.
- Common Voice 15 and Fleurs. Common Voice is multilingual and accented; Fleurs covers 102 languages with broad accent coverage.
- P.808 panels, https://www.itu.int/rec/T-REC-P.808. The TTS standard for subjective evaluation.
For tooling: jiwer for WER scoring in Python with normalization options; ESPnet for full STT and TTS pipeline experiments; P.808 protocol kits for crowdsourced TTS evaluation; and the Whisper benchmark suite for replicating the per-dataset WER methodology[1].
Common pitfalls
- WER on clean audio only. Clean WER is saturated for leading models and tells you almost nothing about real-world performance; stratify by noise and accent.
- MOS without sample-size statistics. A single P.808 panel of 10 listeners is high variance; report confidence intervals or use the protocol's recommended sample sizes.
- Ignoring the sub-300ms TTFB target for conversational voice. Above 300ms, the user perceives the system as slow; above 600ms, the user perceives it as broken.
- Treating aggregate WER as the only STT metric. WER on named entities is often 3 to 10 times higher than aggregate WER, and the named-entity number predicts customer escalation better[3].
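To make the sample-size point concrete, the sketch below reports a MOS together with an approximate 95% confidence interval. It assumes each rating is an independent 1-to-5 score and uses a normal approximation (z = 1.96); for panels as small as 10 listeners a t critical value would give a somewhat wider interval, so treat this as an optimistic bound. The function name `mos_with_ci` is illustrative.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, lower, upper) for a list of opinion scores.

    Assumption: ratings are independent; the interval uses a normal
    approximation, which understates uncertainty for very small panels.
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings to estimate variance")
    mean = statistics.fmean(ratings)
    sem = statistics.stdev(ratings) / math.sqrt(n)  # standard error of the mean
    return mean, mean - z * sem, mean + z * sem


if __name__ == "__main__":
    small_panel = [4, 4, 5, 3, 4, 5, 4, 3, 4, 4]  # illustrative ratings, not real data
    mean, low, high = mos_with_ci(small_panel)
    print(f"MOS {mean:.2f} (95% CI {low:.2f} to {high:.2f}, n={len(small_panel)})")

    # Same rating distribution with three times the listeners: the interval narrows.
    mean, low, high = mos_with_ci(small_panel * 3)
    print(f"MOS {mean:.2f} (95% CI {low:.2f} to {high:.2f}, n={len(small_panel) * 3})")
```

With these illustrative ratings the 10-listener interval spans roughly 0.4 MOS points on either side of the mean, which is wide enough to straddle the gap between two competing systems.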
Canonical reading
The Whisper paper is the canonical reading because it both establishes per-dataset stratified WER as the right methodology and demonstrates the cross-dataset generalization that defines current STT[1]. The ITU-T P.808 recommendation is the canonical reference for TTS subjective evaluation via crowdsourcing; the protocol is what makes MOS comparable across panels[2].
Day-1 starter
- 500 utterances stratified by accent and noise. WER per stratum, not aggregate. Aggregate WER hides the failure modes that customers report.
- A 30-listener P.808 panel for every TTS release. The panel is the truth; automated naturalness scorers are useful for regression but not for decisions.
- Latency P50 and P95 dashboard for any conversational product. TTFB above 600ms is product-broken; track and alert.
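The first item in the starter list, per-stratum rather than aggregate WER, can be sketched as follows. Errors and reference words are pooled within each stratum before dividing, so long utterances carry proportionally more weight. The record keys (`reference`, `hypothesis`, `accent`, `noise`) and the sample utterances are hypothetical, and normalization is assumed to be lowercasing plus whitespace splitting.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer_by_stratum(records):
    """Return ({(accent, noise): wer}, aggregate_wer) with pooled error counts."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for rec in records:
        ref = rec["reference"].lower().split()
        hyp = rec["hypothesis"].lower().split()
        key = (rec["accent"], rec["noise"])
        errors[key] += edit_distance(ref, hyp)
        words[key] += len(ref)

    total_words = sum(words.values())
    if total_words == 0:
        raise ValueError("no reference words to score")
    per_stratum = {k: errors[k] / words[k] for k in words if words[k] > 0}
    aggregate = sum(errors.values()) / total_words
    return per_stratum, aggregate


if __name__ == "__main__":
    records = [  # illustrative utterances, not real evaluation data
        {"reference": "turn on the kitchen lights", "hypothesis": "turn on the kitchen lights",
         "accent": "accent_a", "noise": "clean"},
        {"reference": "set a timer for ten minutes", "hypothesis": "set a timer for ten minutes",
         "accent": "accent_a", "noise": "clean"},
        {"reference": "play the next song", "hypothesis": "pay the text song",
         "accent": "accent_b", "noise": "noisy"},
    ]
    per_stratum, aggregate = wer_by_stratum(records)
    for key, value in sorted(per_stratum.items()):
        print(key, round(value, 3))
    print("aggregate", round(aggregate, 3))
```

In this toy set the aggregate comes out near 13% while one stratum sits at 50%, which is the pattern the starter list warns about.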