Benchmarking Speech-to-Text Models for Long-Form Medical Dialogue
Last updated: 8 April 2026 · 42 models tested · Evaluation code open-source on GitHub
Physicians lose countless hours each week to manual note-taking. At Omi, we’re building an on-device AI-Scribe that transcribes and summarises entire consultations without sending patient data to the cloud. Choosing the right speech-to-text engine is the foundation of that product, so we built what we believe is the most comprehensive medical STT benchmark available, and we keep it current.
What’s new
We now rank models by Medical WER (M-WER) — a metric that only counts errors on clinically relevant words (drugs, conditions, symptoms, anatomy, clinical procedures). Standard WER treats “yeah” and “amoxicillin” as equally important. M-WER doesn’t — and the ranking looks very different as a result. We also report Drug M-WER separately, because misspelling a drug name is a patient safety issue, and drug names turn out to be 2–5× harder than other medical terms for almost every model.
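To make the metric concrete, here is a minimal sketch of how an M-WER-style score can be computed: align hypothesis to reference and count only errors that land on vocabulary words. The three-term `MEDICAL_TERMS` set is a hypothetical stand-in for the benchmark’s full 179-term list, and the real metric may treat insertions of spurious medical terms differently.

```python
import difflib

# Hypothetical three-term vocabulary; the benchmark's real list has 179 terms.
MEDICAL_TERMS = {"amoxicillin", "asthma", "inhaler"}

def medical_wer(reference: str, hypothesis: str, vocab=MEDICAL_TERMS) -> float:
    """Word error rate computed only over clinically relevant reference words.

    Substitutions and deletions count only when the affected reference word
    is in the medical vocabulary; errors on filler words are free.
    (Insertions of spurious medical terms are ignored in this sketch.)
    """
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    matcher = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
    errors = 0
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op in ("replace", "delete"):
            errors += sum(1 for w in ref[i1:i2] if w in vocab)
    n_medical = sum(1 for w in ref if w in vocab)
    return errors / n_medical if n_medical else 0.0
```

Under this definition a transcript that garbles “amoxicillin” but gets “asthma” right scores 0.5 against a two-term reference, no matter how many filler words it also drops.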
Other highlights from recent updates:
• 42 models benchmarked, across cloud APIs (Google, OpenAI, ElevenLabs, Soniox, AssemblyAI, Deepgram, Mistral, Microsoft Azure Speech, Cohere, Groq) and local/open-source (VibeVoice, Qwen3-ASR, Parakeet, Whisper variants, NVIDIA Canary, Gemma 4, and more).
• Google Gemini 3 Pro leads the leaderboard at 2.65% M-WER.
• Microsoft VibeVoice-ASR 9B is the first open-source model to genuinely compete with Gemini-tier cloud APIs (#3, 3.16% M-WER) — and notably outperforms Microsoft’s own new closed MAI-Transcribe-1 in Azure Speech by a wide margin on drug-name accuracy.
• Qwen3-ASR 1.7B is the best small open-source model — 4.40% M-WER at ~7 seconds per file on an A10 GPU. The strongest accuracy-to-cost ratio for a local deployment.
• Custom text normalizer replacing Whisper’s default, after we discovered two bugs in Whisper’s normalizer that were quietly inflating WER by 2–3% for every model scored with it. Ours is open-source and a drop-in replacement.
• Full evaluation code is open-source on GitHub. Reproduce our results, run your own models, or contribute improvements.
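To show what the normalisation step is doing, here is a minimal sketch of a scoring-time text normaliser: lowercase, strip punctuation, collapse whitespace, and unify spelling variants so WER reflects recognition errors rather than formatting choices. The `SPELLING_MAP` entries are illustrative, not the real table, and the specific Whisper bugs are not reproduced here; the actual normaliser lives in the repo.

```python
import string

# Illustrative British->American mappings; the real normaliser covers far
# more (numbers, contractions, abbreviations, full spelling tables).
SPELLING_MAP = {"anaemia": "anemia", "oedema": "edema", "diarrhoea": "diarrhea"}

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace, unify spellings,
    so that WER measures recognition errors, not formatting choices."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = [SPELLING_MAP.get(w, w) for w in text.split()]
    return " ".join(words)
```

Note one simplification: stripping all punctuation also removes apostrophes (“don’t” becomes “dont”), which a production normaliser handles more carefully.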
Full 42-model leaderboard below, with per-category breakdowns and per-file metrics in the GitHub repo.
Dataset — PriMock57
• 57 simulated GP consultations (5–10 minutes each), recorded by seven Babylon Health doctors with role-play patients. 55 files used after removing two that triggered catastrophic hallucinations on multiple models.
• ~80,500 words of British English medical dialogue.
• Reference transcripts cleaned to plain text for fair WER calculation.
• Public repo: babylonhealth/primock57
Evaluation framework
Every model is run through the same pipeline: transcribe the audio, log processing time, normalise the output with our custom text normalizer, and compute WER, M-WER, Drug M-WER, and per-file statistics. Models that cannot handle long-form audio are run in chunks with overlap (typically 30s chunks with 10s overlap, merged via longest-common-subsequence matching). We run on a mix of Apple Silicon (M4 Max), NVIDIA GPUs (T4, L4, A10, H100), and cloud APIs directly. Local models are benchmarked at full precision; no quantisation unless explicitly noted.
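The overlap-merging step can be sketched roughly as follows: transcribe overlapping chunks independently, then stitch neighbouring transcripts by finding the longest matching word run inside the overlap window. This is an illustrative approximation rather than the pipeline’s actual code; `merge_chunks` and its `overlap_words` parameter are hypothetical names.

```python
import difflib

def merge_chunks(prev_words: list, next_words: list, overlap_words: int = 40) -> list:
    """Stitch two overlapping chunk transcripts into one word sequence.

    Finds the longest matching word run between the tail of the previous
    chunk and the head of the next, and joins the two at that point.
    overlap_words bounds the search window (roughly 10s of speech).
    """
    tail = prev_words[-overlap_words:]
    head = next_words[:overlap_words]
    m = difflib.SequenceMatcher(a=tail, b=head, autojunk=False)
    match = m.find_longest_match(0, len(tail), 0, len(head))
    if match.size == 0:
        # No shared words in the window: fall back to plain concatenation.
        return prev_words + next_words
    # Keep the previous chunk through the end of the match, then take the
    # remainder of the next chunk after the match.
    cut = len(prev_words) - len(tail) + match.a + match.size
    return prev_words[:cut] + next_words[match.b + match.size:]
```

For example, merging “…call it mild asthma” with “mild asthma so I’d suggest…” keeps a single copy of the shared words at the seam.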
Full leaderboard — ranked by Medical WER
| # | Model | WER | M-WER | Drug M-WER | Avg Time/File | Type |
|---|-------|-----|-------|------------|---------------|------|
1 | Google Gemini 3 Pro Preview | 8.35% | 2.65% | 3.1% | 64.5s | API |
2 | Google Gemini 2.5 Pro | 8.15% | 2.97% | 4.1% | 56.4s | API |
3 | VibeVoice-ASR 9B | 8.34% | 3.16% | 5.6% | 96.7s | H100 |
4 | Soniox stt-async-v4 | 9.18% | 3.32% | 7.1% | 46.2s | API |
5 | Google Gemini 3 Flash Preview | 11.33% | 3.64% | 5.2% | 51.5s | API |
6 | ElevenLabs Scribe v2 | 9.72% | 3.86% | 4.3% | 43.5s | API |
7 | AssemblyAI Universal-3 Pro (medical-v1) | 9.55% | 4.02% | 6.5% | 37.3s | API |
8 | Qwen3 ASR 1.7B | 9.00% | 4.40% | 8.6% | 6.8s | A10 |
9 | Deepgram Nova-3 Medical | 9.05% | 4.53% | 9.7% | 12.9s | API |
10 | OpenAI GPT-4o Mini (Dec 2025) | 11.18% | 4.85% | 10.6% | 40.4s | API |
11 | Microsoft MAI-Transcribe-1 | 11.52% | 4.85% | 11.2% | 21.8s | API |
12 | ElevenLabs Scribe v1 | 10.87% | 4.88% | 7.5% | 36.3s | API |
13 | Google Gemini 2.5 Flash | 9.45% | 5.01% | 10.3% | 20.2s | API |
14 | Voxtral Mini Transcribe V1 | 11.85% | 5.17% | 11.0% | 22.4s | API |
15 | Parakeet TDT 1.1B | 9.03% | 5.20% | 15.5% | 12.3s | T4 |
16 | Voxtral Mini Transcribe V2 | 11.64% | 5.36% | 12.1% | 18.4s | API |
17 | Voxtral Mini 4B Realtime | 11.89% | 5.39% | 11.8% | 270.9s | A10 |
18 | Cohere Transcribe (Mar 2026) | 11.81% | 5.59% | 16.6% | 3.9s | A10 |
19 | OpenAI Whisper-1 | 13.20% | 5.62% | 10.3% | 104.3s | API |
20 | Groq Whisper Large v3 Turbo | 12.14% | 5.75% | 14.4% | 8.0s | API |
21 | NVIDIA Canary 1B Flash | 12.03% | 5.97% | 15.7% | 23.4s | T4 |
22 | Groq Whisper Large v3 | 11.93% | 5.97% | 13.6% | 8.6s | API |
23 | OpenAI GPT-4o Mini Transcribe | 13.60% | 6.03% | 11.4% | 23.2s | API |
24 | MLX Whisper Large v3 Turbo | 11.65% | 6.16% | 14.0% | 12.9s | Apple Silicon |
25 | Parakeet TDT 0.6B v2 | 10.75% | 6.19% | 17.2% | 5.4s | Apple Silicon |
26 | WhisperKit Large v3 Turbo | 12.28% | 6.35% | 14.4% | 21.4s | Apple Silicon |
27 | Qwen3 ASR 0.6B | 9.83% | 6.48% | 15.1% | 5.1s | A10 |
28 | Kyutai STT 2.6B | 11.20% | 6.51% | 15.7% | 148.4s | T4 |
29 | GLM-ASR-Nano-2512 | 10.84% | 7.05% | 17.5% | 87.7s | T4 |
30 | Parakeet TDT 0.6B v3 | 9.35% | 7.25% | 22.0% | 6.3s | Apple Silicon |
31 | Nemotron Speech Streaming 0.6B | 11.06% | 8.97% | 22.6% | 11.7s | T4 |
32 | OpenAI GPT-4o Transcribe | 14.84% | 9.03% | 14.9% | 27.9s | API |
33 | NVIDIA Canary-Qwen 2.5B | 12.94% | 9.80% | 22.8% | 105.4s | T4 |
34 | Gemma 4 E4B-it^ | 15.69% | 9.99% | 15.5% | 185.4s | T4 |
35 | NVIDIA Canary 1B v2 | 14.32% | 11.24% | 20.5% | 9.2s | T4 |
36 | IBM Granite Speech 3.3-2B | 16.55% | 12.80% | 23.1% | 109.7s | T4 |
37 | Apple SpeechAnalyzer | 12.36% | 13.02% | 27.4% | 6.0s | Apple Silicon |
38 | Gemma 4 E2B-it^ | 18.90% | 13.92% | 19.8% | 134.6s | T4 |
39 | Azure Foundry Phi-4 | 31.13% | 15.38% | 18.1% | 212.8s | API |
40 | Kyutai STT 1B (Multilingual) | 27.28% | 21.23% | 28.9% | 79.5s | T4 |
41 | Google MedASR | 64.38% | 49.66% | 58.0% | 1.2s | Apple Silicon |
42 | Facebook MMS-1B-all | 38.70% | 54.01% | 72.0% | 28.6s | T4 |
Per-category breakdowns and per-file metrics are on GitHub.
Key takeaways for choosing a medical STT engine
If you want the best accuracy and can use a cloud API: Google Gemini 3 Pro or 2.5 Pro. Both sit below 3% M-WER and handle long-form conversation natively.
If you want strong accuracy but don’t want to rely on Google: Soniox, AssemblyAI Universal-3 Pro (with the medical-v1 domain), or Deepgram Nova-3 Medical. All three land between 3.3% and 4.6% M-WER, close behind Gemini, and Deepgram is the fastest of the three at ~13s/file.
If you want the best open-source model, accuracy-first: Microsoft VibeVoice-ASR 9B. First open-source model to compete with Gemini on medical audio. Requires ~18GB VRAM (L4/A10 is enough; won’t fit on T4) and is slow — 97s/file even on H100 — because it’s a 9B LLM processing audio as tokens rather than a purpose-built speech model.
If you want the best open-source model, cost/speed-balanced: Qwen3-ASR 1.7B. ~14× faster than VibeVoice for only ~1.2 points more M-WER. Requires compute capability ≥ 8.0 (A10 or better; not T4).
If you’re deploying on-device for Apple Silicon: Parakeet TDT models are the fastest option, but be aware their Drug M-WER is weak — 22% for the 0.6B v3 — which may not be acceptable in a clinical context.
What to avoid for medical transcription: Google MedASR (despite being medical-specific, it’s built for single-speaker dictation, not conversations — 49.66% M-WER), Facebook MMS-1B-all (multilingual phonetic vocab, 54% M-WER), and current quantised versions of VibeVoice (aggressive quantisation kills drug-name accuracy).
Limitations & what’s next
• UK English only — the PriMock57 dataset is British English. We’ll extend to multi-language datasets in future updates.
• WER ≠ clinical usefulness — M-WER is a meaningful step toward clinical relevance, but human-graded clinical correctness is the next frontier for this benchmark.
• API cost unmeasured — a future update will include $/hour and latency SLO metrics for the cloud APIs.
• Medical vocabulary is extensible — our 179-term list is a starting point; contributions welcome via the repo.
The full evaluation code, every model’s transcripts, and all per-file metrics are open-source on GitHub — reproduce our results, benchmark your own model, or contribute improvements.


