Evaluations
Benchmarking Speech-to-Text Models for Long-Form Medical Dialogue
Last updated: 8 April 2026 · 43 models tested · Evaluation code open-source on GitHub
Physicians lose countless hours each week to manual note-taking. At Omi, we’re building an on-device AI-Scribe that transcribes and summarises entire consultations without sending patient data to the cloud. Choosing the right speech-to-text engine is the foundation of that product — so we built the most comprehensive medical STT benchmark we could find, and we keep it current.
What’s new
We now rank models by Medical WER (M-WER) — a metric that only counts errors on clinically relevant words (drugs, conditions, symptoms, anatomy, clinical procedures). Standard WER treats “yeah” and “amoxicillin” as equally important. M-WER doesn’t — and the ranking looks very different as a result. We also report Drug M-WER separately, because misspelling a drug name is a patient safety issue, and drug names turn out to be 2–5× harder than other medical terms for almost every model.
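As a rough illustration of the idea (not the benchmark's actual implementation), M-WER can be computed by aligning reference and hypothesis word sequences and counting only the errors that touch words from a medical vocabulary. `MEDICAL_TERMS` below is a tiny illustrative stand-in for the real 179-term list, and insertions are ignored for simplicity:

```python
from difflib import SequenceMatcher

# Illustrative subset only; the benchmark uses a 179-term medical vocabulary.
MEDICAL_TERMS = {"amoxicillin", "asthma", "inhaler"}

def medical_wer(reference: str, hypothesis: str) -> float:
    """Error rate counted only over medical terms in the reference.

    A medical reference word counts as an error if the alignment
    substitutes or deletes it. Insertions are ignored in this sketch.
    """
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    med_total = sum(w in MEDICAL_TERMS for w in ref)
    if med_total == 0:
        return 0.0
    errors = 0
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("replace", "delete"):
            errors += sum(w in MEDICAL_TERMS for w in ref[i1:i2])
    return errors / med_total
```

On this definition, garbling a filler word costs nothing while garbling a drug name costs everything: `medical_wer("yeah take amoxicillin", "yes take amoxicillin")` is 0.0, but `medical_wer("take amoxicillin twice daily", "take a moxicillin twice daily")` is 1.0.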
Other highlights from recent updates:
• 43 models benchmarked, across cloud APIs (Google, OpenAI, ElevenLabs, Soniox, AssemblyAI, Deepgram, Mistral, Microsoft Azure Speech, Cohere, Groq) and local/open-source (VibeVoice, Qwen3-ASR, Parakeet, Whisper variants, NVIDIA Canary, Gemma 4, and more).
• Google Gemini 3 Pro leads the leaderboard at 2.65% M-WER.
• Microsoft VibeVoice-ASR 9B is the first open-source model to genuinely compete with Gemini-tier cloud APIs (#3, 3.16% M-WER) — and notably outperforms Microsoft’s own new closed MAI-Transcribe-1 in Azure Speech by a wide margin on drug-name accuracy.
• Qwen3-ASR 1.7B is the best small open-source model — 4.40% M-WER at ~7 seconds per file on an A10 GPU. The strongest accuracy-to-cost ratio for a local deployment.
• Custom text normalizer: we replaced Whisper’s default after discovering two bugs in it that were quietly inflating WER by 2–3% for every model scored with it (and Whisper’s normalizer is used across the industry). The replacement is open-source and drop-in.
• Full evaluation code is open-source on GitHub. Reproduce our results, run your own models, or contribute improvements.
The full Top 15 is below; the complete 43-model leaderboard, with per-category breakdowns and per-file metrics, lives in the GitHub repo.
Dataset — PriMock57
• 57 simulated GP consultations (5–10 minutes each), recorded by seven Babylon Health doctors with role-play patients. 55 files used after removing two that triggered catastrophic hallucinations on multiple models.
• ~80,500 words of British English medical dialogue.
• Reference transcripts cleaned to plain text for fair WER calculation.
• Public repo: babylonhealth/primock57
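Cleaning the references to plain text matters because raw transcripts carry casing, punctuation, and annotations that would otherwise be scored as errors. A minimal sketch of that kind of cleanup (illustrative only, not the benchmark's exact normalizer):

```python
import re
import unicodedata

def normalise(text: str) -> str:
    """Reduce a transcript to plain lowercase words before WER scoring."""
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    # Drop bracketed annotations such as [coughs] or (inaudible).
    text = re.sub(r"\[.*?\]|\(.*?\)", " ", text)
    # Drop punctuation, keeping letters, digits, and apostrophes.
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```

For example, `normalise("Doctor: Take AMOXICILLIN, twice daily. [coughs]")` yields `"doctor take amoxicillin twice daily"`.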
Evaluation framework
Every model is run through the same pipeline: transcribe the audio, log processing time, normalise the output with our custom text normalizer, and compute WER, M-WER, Drug M-WER, and per-file statistics. Models that cannot handle long audio in one pass are run in overlapping chunks (typically 30s chunks with 10s overlap, merged via longest-common-subsequence alignment). We run on a mix of Apple Silicon (M4 Max), NVIDIA GPUs (T4, L4, A10, H100), and cloud APIs directly. No quantisation on local models unless explicitly noted; we benchmark at full precision for fairness.
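The chunk-merging step can be sketched as follows. This is a simplified stand-in for the overlap merge described above: it splices two word-level chunk transcripts at the longest common contiguous word run inside the overlap region, whereas a production merge must also handle disagreements between chunks and timestamp bookkeeping:

```python
def merge_chunks(left: list[str], right: list[str], overlap: int = 30) -> list[str]:
    """Merge two overlapping transcript chunks (lists of words).

    Finds the longest common contiguous word run between the tail of
    `left` and the head of `right`, then splices the chunks there.
    """
    tail, head = left[-overlap:], right[:overlap]
    best_len, best_i, best_j = 0, len(tail), 0
    for i in range(len(tail)):
        for j in range(len(head)):
            k = 0
            while (i + k < len(tail) and j + k < len(head)
                   and tail[i + k] == head[j + k]):
                k += 1
            if k > best_len:
                best_len, best_i, best_j = k, i, j
    if best_len == 0:
        # No agreement in the overlap: fall back to naive concatenation.
        return left + right
    splice = len(left) - len(tail) + best_i + best_len
    return left[:splice] + right[best_j + best_len:]
```

With `left = "the patient has asthma and uses".split()` and `right = "asthma and uses an inhaler daily".split()`, the merge recovers `"the patient has asthma and uses an inhaler daily"` without duplicating the overlapping words.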
Top 15 — ranked by Medical WER
| # | Model | WER | M-WER | Drug M-WER | Speed (per file) | Type |
|---|-------|-----|-------|------------|------------------|------|
| 1 | Google Gemini 3 Pro Preview | 8.35% | 2.65% | 3.1% | 65s | API |
| 2 | Google Gemini 2.5 Pro | 8.15% | 2.97% | 4.1% | 56s | API |
| 3 | VibeVoice-ASR 9B (Microsoft, open-source) | 8.34% | 3.16% | 5.6% | 97s | Local (H100) |
| 4 | Soniox stt-async-v4 | 9.18% | 3.32% | 7.1% | 46s | API |
| 5 | Google Gemini 3 Flash Preview | 11.33% | 3.64% | 5.2% | 52s | API |
| 6 | ElevenLabs Scribe v2 | 9.72% | 3.86% | 4.3% | 44s | API |
| 7 | AssemblyAI Universal-3 Pro (medical-v1) | 9.55% | 4.02% | 6.5% | 37s | API |
| 8 | Qwen3-ASR 1.7B (open-source) | 9.00% | 4.40% | 8.6% | 7s | Local (A10) |
| 9 | Deepgram Nova-3 Medical | 9.05% | 4.53% | 9.7% | 13s | API |
| 10 | OpenAI GPT-4o Mini (Dec 2025) | 11.18% | 4.85% | 10.6% | 40s | API |
| 11 | Microsoft MAI-Transcribe-1 | 11.52% | 4.85% | 11.2% | 22s | API |
| 12 | ElevenLabs Scribe v1 | 10.87% | 4.88% | 7.5% | 36s | API |
| 13 | Google Gemini 2.5 Flash | 9.45% | 5.01% | 10.3% | 20s | API |
| 14 | Voxtral Mini Transcribe V1 | 11.85% | 5.17% | 11.0% | 22s | API |
| 15 | Parakeet TDT 1.1B | 9.03% | 5.20% | 15.5% | 12s | Local (T4) |
Full 43-model leaderboard, per-category breakdowns, and per-file metrics: GitHub.
Key takeaways for choosing a medical STT engine
If you want the best accuracy and can use a cloud API: Google Gemini 3 Pro or 2.5 Pro. Both sit below 3% M-WER and handle long-form conversation natively.
If you want strong accuracy but don’t want to rely on Google: Soniox, AssemblyAI Universal-3 Pro (with medical-v1 domain), or Deepgram Nova-3 Medical. All three are genuinely competitive with Gemini-tier on M-WER, and Deepgram is the fastest cloud API on the board at 13s/file.
If you want the best open-source model, accuracy-first: Microsoft VibeVoice-ASR 9B. First open-source model to compete with Gemini on medical audio. Requires ~18GB VRAM (L4/A10 is enough; won’t fit on T4) and is slow — 97s/file even on H100 — because it’s a 9B LLM processing audio as tokens rather than a purpose-built speech model.
If you want the best open-source model, cost/speed-balanced: Qwen3-ASR 1.7B. ~14× faster than VibeVoice for only ~1.3 points more M-WER. Requires compute capability ≥ 8.0 (A10 or better; not T4).
If you’re deploying on-device for Apple Silicon: Parakeet TDT models are the fastest option, but be aware their Drug M-WER is weak — 22% for the 0.6B v3 — which may not be acceptable in a clinical context.
What to avoid for medical transcription: Google MedASR (despite being medical-specific, it’s built for single-speaker dictation, not conversations — 49.66% M-WER), Facebook MMS-1B-all (multilingual phonetic vocab, 54% M-WER), and current quantised versions of VibeVoice (aggressive quantisation kills drug-name accuracy).
Limitations & what’s next
• UK English only — the PriMock57 dataset is British English. We’ll extend to multi-language datasets in future updates.
• WER ≠ clinical usefulness — M-WER is a meaningful step toward clinical relevance, but human-graded clinical correctness is the next frontier for this benchmark.
• API cost unmeasured — a future update will include $/hour and latency SLO metrics for the cloud APIs.
• Medical vocabulary is extensible — our 179-term list is a starting point; contributions welcome via the repo.
The full evaluation code, every model’s transcripts, and all per-file metrics are open-source on GitHub — reproduce our results, benchmark your own model, or contribute improvements.


