Benchmarking Speech-to-Text Models for Long-Form Medical Dialogue

Physicians lose countless hours each week to manual note-taking. At Omi, we're building an on-device AI-Scribe that transcribes and summarises entire consultations without sending patient data to the cloud. Choosing the right speech-to-text engine is the foundation of that product, so we built the most comprehensive medical STT benchmark we know of, and we keep it current.

What's new in v4

We now rank models by Medical WER (M-WER) — a metric that only counts errors on clinically relevant words (drugs, conditions, symptoms, anatomy, clinical procedures). Standard WER treats "yeah" and "amoxicillin" as equally important. M-WER doesn't — and the ranking looks very different as a result. We also report Drug M-WER separately, because misspelling a drug name is a patient safety issue, and drug names turn out to be 2–5x harder than other medical terms for almost every model.
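
For concreteness, here's a minimal sketch of how a metric like M-WER can be computed from jiwer's word alignments. The tiny lexicon and the error-counting rule are illustrative only; the real implementation lives in the eval repo linked below and may differ in detail.

```python
import jiwer

# Tiny illustrative lexicon -- the real benchmark matches against a far
# larger list of drugs, conditions, symptoms, anatomy, and procedures.
MEDICAL_TERMS = {"amoxicillin", "salbutamol", "asthma", "wheeze"}

def medical_wer(reference: str, hypothesis: str) -> float:
    """M-WER: error rate over clinically relevant reference words only.

    Counts substitutions and deletions whose reference-side word is in
    the lexicon; errors on filler words ("yeah" -> "yes") are ignored.
    """
    out = jiwer.process_words(reference, hypothesis)
    ref_words = out.references[0]
    total = sum(w in MEDICAL_TERMS for w in ref_words)
    errors = 0
    for chunk in out.alignments[0]:
        if chunk.type in ("substitute", "delete"):
            errors += sum(
                ref_words[i] in MEDICAL_TERMS
                for i in range(chunk.ref_start_idx, chunk.ref_end_idx)
            )
    return errors / max(total, 1)

# One medical word, transcribed wrongly -> M-WER = 1.0, even though
# four of the five reference words came through fine.
print(medical_wer("yeah take amoxicillin twice daily",
                  "yes take a moxy cillin twice daily"))
```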

Leaderboard

Ranked by Medical WER (M-WER). Lower is better. Dataset: PriMock57 — 55 simulated GP consultations, ~80,500 words of British English medical dialogue.

| # | Model | WER | M-WER | Drug M-WER | Avg Speed (per file) | Type |
|---|-------|-----|-------|------------|----------------------|------|
| 1 | Google Gemini 3 Pro Preview | 8.35% | 2.65% | 3.1% | 64.5s | API |
| 2 | Google Gemini 2.5 Pro | 8.15% | 2.97% | 4.1% | 56.4s | API |
| 3 | VibeVoice-ASR 9B | 8.34% | 3.16% | 5.6% | 96.7s | H100 |
| 4 | Soniox stt-async-v4 | 9.18% | 3.32% | 7.1% | 46.2s | API |
| 5 | Google Gemini 3 Flash Preview | 11.33% | 3.64% | 5.2% | 51.5s | API |
| 6 | ElevenLabs Scribe v2 | 9.72% | 3.86% | 4.3% | 43.5s | API |
| 7 | AssemblyAI Universal-3 Pro (medical-v1) | 9.55% | 4.02% | 6.5% | 37.3s | API |
| 8 | Qwen3 ASR 1.7B | 9.00% | 4.40% | 8.6% | 6.8s | A10 |
| 9 | Deepgram Nova-3 Medical | 9.05% | 4.53% | 9.7% | 12.9s | API |
| 10 | OpenAI GPT-4o Mini (Dec 2025) | 11.18% | 4.85% | 10.6% | 40.4s | API |
| 11 | Microsoft MAI-Transcribe-1 | 11.52% | 4.85% | 11.2% | 21.8s | API |
| 12 | ElevenLabs Scribe v1 | 10.87% | 4.88% | 7.5% | 36.3s | API |
| 13 | Google Gemini 2.5 Flash | 9.45% | 5.01% | 10.3% | 20.2s | API |
| 14 | Voxtral Mini Transcribe V1 | 11.85% | 5.17% | 11.0% | 22.4s | API |
| 15 | Parakeet TDT 1.1B | 9.03% | 5.20% | 15.5% | 12.3s | T4 |
| 16 | Voxtral Mini Transcribe V2 | 11.64% | 5.36% | 12.1% | 18.4s | API |
| 17 | Voxtral Mini 4B Realtime | 11.89% | 5.39% | 11.8% | 270.9s | A10 |
| 18 | Cohere Transcribe (Mar 2026) | 11.81% | 5.59% | 16.6% | 3.9s | A10 |
| 19 | OpenAI Whisper-1 | 13.20% | 5.62% | 10.3% | 104.3s | API |
| 20 | Groq Whisper Large v3 Turbo | 12.14% | 5.75% | 14.4% | 8.0s | API |
| 21 | NVIDIA Canary 1B Flash | 12.03% | 5.97% | 15.7% | 23.4s | T4 |
| 22 | Groq Whisper Large v3 | 11.93% | 5.97% | 13.6% | 8.6s | API |
| 23 | OpenAI GPT-4o Mini Transcribe | 13.60% | 6.03% | 11.4% | 23.2s | API |
| 24 | MLX Whisper Large v3 Turbo | 11.65% | 6.16% | 14.0% | 12.9s | Apple Silicon |
| 25 | Parakeet TDT 0.6B v2 | 10.75% | 6.19% | 17.2% | 5.4s | Apple Silicon |
| 26 | WhisperKit Large v3 Turbo | 12.28% | 6.35% | 14.4% | 21.4s | Apple Silicon |
| 27 | Qwen3 ASR 0.6B | 9.83% | 6.48% | 15.1% | 5.1s | A10 |
| 28 | Kyutai STT 2.6B | 11.20% | 6.51% | 15.7% | 148.4s | T4 |
| 29 | GLM-ASR-Nano-2512 | 10.84% | 7.05% | 17.5% | 87.7s | T4 |
| 30 | Parakeet TDT 0.6B v3 | 9.35% | 7.25% | 22.0% | 6.3s | Apple Silicon |
| 31 | Nemotron Speech Streaming 0.6B | 11.06% | 8.97% | 22.6% | 11.7s | T4 |
| 32 | OpenAI GPT-4o Transcribe | 14.84% | 9.03% | 14.9% | 27.9s | API |
| 33 | NVIDIA Canary-Qwen 2.5B | 12.94% | 9.80% | 22.8% | 105.4s | T4 |
| 34 | Gemma 4 E4B-it | 15.69% | 9.99% | 15.5% | 185.4s | T4 |
| 35 | NVIDIA Canary 1B v2 | 14.32% | 11.24% | 20.5% | 9.2s | T4 |
| 36 | IBM Granite Speech 3.3-2B | 16.55% | 12.80% | 23.1% | 109.7s | T4 |
| 37 | Apple SpeechAnalyzer | 12.36% | 13.02% | 27.4% | 6.0s | Apple Silicon |
| 38 | Gemma 4 E2B-it | 18.90% | 13.92% | 19.8% | 134.6s | T4 |
| 39 | Azure Foundry Phi-4 | 31.13% | 15.38% | 18.1% | 212.8s | API |
| 40 | Kyutai STT 1B (Multilingual) | 27.28% | 21.23% | 28.9% | 79.5s | T4 |
| 41 | Google MedASR | 64.38% | 49.66% | 58.0% | 1.2s | Apple Silicon |
| 42 | Facebook MMS-1B-all | 38.70% | 54.01% | 72.0% | 28.6s | T4 |

Dataset — PriMock57

PriMock57 is a public set of mock GP consultations recorded between clinicians and actors posing as patients. We evaluate on 55 of the consultations, roughly 80,500 words of British English medical dialogue, scoring each model against the human reference transcripts.

Evaluation framework

Every model runs through the same pipeline: transcribe the audio, log processing time, normalise the output with our custom text normaliser, and compute WER, M-WER, Drug M-WER, and per-file statistics. Models that crash on long audio are fed overlapping chunks instead (typically 30 s chunks with 10 s overlap, merged via longest common subsequence). We run locally on Apple Silicon (M4 Max) and NVIDIA GPUs (T4, L4, A10, H100), and call the cloud APIs directly. Local models are benchmarked at full precision for fairness; no quantisation unless explicitly noted.
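
As a rough illustration of the overlap-merge step, here's a sketch that uses difflib's longest matching block as a stand-in for LCS merging. The window sizes match the defaults above, but `transcribe_fn` and the merge details are placeholders, not the repo's actual code.

```python
from difflib import SequenceMatcher

CHUNK_S, OVERLAP_S = 30.0, 10.0  # window length and overlap, in seconds

def chunk_spans(duration_s: float):
    """Yield (start, end) windows: 30 s long, advancing by 20 s."""
    step = CHUNK_S - OVERLAP_S
    start = 0.0
    while start < duration_s:
        yield start, min(start + CHUNK_S, duration_s)
        start += step

def merge_overlap(left: list[str], right: list[str], window: int = 40) -> list[str]:
    """Stitch two chunk transcripts at their seam.

    Finds the longest common word run near the boundary and joins the
    halves there, so words in the 10 s overlap aren't duplicated.
    """
    a, b = left[-window:], right[:window]
    m = SequenceMatcher(a=a, b=b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b)
    )
    if m.size == 0:              # nothing in common: just concatenate
        return left + right
    cut_left = len(left) - len(a) + m.a + m.size
    return left[:cut_left] + right[m.b + m.size:]

def transcribe_long(path: str, duration_s: float, transcribe_fn) -> str:
    """transcribe_fn(path, start, end) -> str stands in for any model's
    per-chunk transcription call."""
    words: list[str] = []
    for start, end in chunk_spans(duration_s):
        piece = transcribe_fn(path, start, end).split()
        words = merge_overlap(words, piece) if words else piece
    return " ".join(words)
```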

Full evaluation code: github.com/Omi-Health/medical-STT-eval

Choosing a medical STT engine

Best accuracy, cloud API

Google Gemini 3 Pro or 2.5 Pro. Both sit below 3% M-WER and handle long-form conversation natively.
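
For reference, a minimal transcription call through the google-genai Python SDK looks roughly like this. The prompt, file name, and model string are illustrative; this isn't necessarily how our harness drives the API.

```python
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment
audio = client.files.upload(file="consultation_01.wav")  # hypothetical file
resp = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=["Transcribe this GP consultation verbatim.", audio],
)
print(resp.text)
```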

Strong accuracy without Google

Soniox, AssemblyAI Universal-3 Pro (with the medical-v1 domain), or Deepgram Nova-3 Medical. All three are genuinely competitive on M-WER, and Deepgram is the fastest of the three at 12.9s/file.

Best open-source, accuracy-first

Microsoft VibeVoice-ASR 9B. The first open-source model to compete with Gemini on medical audio. It needs ~18GB of VRAM (9B parameters at 16-bit precision is ~18GB of weights alone, so L4/A10 is enough but it won't fit on a T4) and it's slow, at 97s/file even on an H100, because it's a 9B LLM processing audio as tokens rather than a purpose-built speech model.

Best open-source, cost/speed-balanced

Qwen3-ASR 1.7B. ~14x faster than VibeVoice for only ~1.3 points more M-WER. Requires compute capability ≥ 8.0 (A10 or better; not T4).
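
If you're unsure whether a GPU clears that bar, a quick PyTorch check (assuming PyTorch is installed):

```python
import torch

# A T4 reports (7, 5); an A10 reports (8, 6), which clears the bar.
major, minor = torch.cuda.get_device_capability()
assert (major, minor) >= (8, 0), "Qwen3-ASR needs compute capability >= 8.0"
```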

On-device, Apple Silicon

Parakeet TDT models are the fastest option, but be aware their Drug M-WER is weak — 22% for the 0.6B v3 — which may not be acceptable in a clinical context.

What to avoid for medical transcription

Google MedASR (despite being medical-specific, it's built for single-speaker dictation, not conversations: 49.66% M-WER), Facebook MMS-1B-all (a multilingual phonetic vocabulary, 54% M-WER), and current quantised builds of VibeVoice (aggressive quantisation kills drug-name accuracy).

Limitations and what's next