Evaluations

Benchmarking Speech-to-Text Models for Long-Form Medical Dialogue


Last updated: 8 April 2026 · 43 models tested · Evaluation code open-source on GitHub

Physicians lose countless hours each week to manual note-taking. At Omi, we’re building an on-device AI-Scribe that transcribes and summarises entire consultations without sending patient data to the cloud. Choosing the right speech-to-text engine is the foundation of that product, so we built what we believe is the most comprehensive medical STT benchmark available, and we keep it current.

What’s new

We now rank models by Medical WER (M-WER) — a metric that only counts errors on clinically relevant words (drugs, conditions, symptoms, anatomy, clinical procedures). Standard WER treats “yeah” and “amoxicillin” as equally important. M-WER doesn’t — and the ranking looks very different as a result. We also report Drug M-WER separately, because misspelling a drug name is a patient safety issue, and drug names turn out to be 2–5× harder than other medical terms for almost every model.
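To make the metric concrete, here is a minimal sketch of one way to compute an M-WER: restrict both transcripts to an in-vocabulary word list, then run a standard word-level edit distance. The five-term `MEDICAL_VOCAB` below is a placeholder for the benchmark's real 179-term list, and the filter-then-align approach is an assumed interpretation, not Omi's exact implementation:

```python
# Illustrative M-WER sketch: count errors only on clinically relevant words.
# MEDICAL_VOCAB is a tiny placeholder, not the benchmark's 179-term list.
MEDICAL_VOCAB = {"amoxicillin", "ibuprofen", "asthma", "migraine", "abdomen"}

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (single-row dynamic programming)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution or match
    return d[-1]

def m_wer(reference: str, hypothesis: str, vocab=MEDICAL_VOCAB) -> float:
    """Keep only in-vocabulary words on both sides, then compute WER."""
    ref = [w for w in reference.lower().split() if w in vocab]
    hyp = [w for w in hypothesis.lower().split() if w in vocab]
    if not ref:
        return 0.0
    return edit_distance(ref, hyp) / len(ref)
```

A Drug M-WER works the same way with the vocabulary restricted to drug names only.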

Other highlights from recent updates:

• 43 models benchmarked, across cloud APIs (Google, OpenAI, ElevenLabs, Soniox, AssemblyAI, Deepgram, Mistral, Microsoft Azure Speech, Cohere, Groq) and local/open-source models (VibeVoice, Qwen3-ASR, Parakeet, Whisper variants, NVIDIA Canary, Gemma 4, and more).

• Google Gemini 3 Pro leads the leaderboard at 2.65% M-WER.

• Microsoft VibeVoice-ASR 9B is the first open-source model to genuinely compete with Gemini-tier cloud APIs (#3, 3.16% M-WER), and it notably outperforms Microsoft’s own new closed MAI-Transcribe-1 in Azure Speech by a wide margin on drug-name accuracy.

• Qwen3-ASR 1.7B is the best small open-source model: 4.40% M-WER at ~7 seconds per file on an A10 GPU, the strongest accuracy-to-cost ratio for a local deployment.

• A custom text normalizer now replaces Whisper’s default, after we discovered two bugs in it that were quietly inflating WER by 2–3% across every model in the industry. It is open-source and drop-in.

• The full evaluation code is open-source on GitHub. Reproduce our results, run your own models, or contribute improvements.
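For readers unfamiliar with why normalization matters: WER compares raw word sequences, so casing, punctuation, and symbol spelling must be unified before scoring or every model pays for cosmetic differences. The sketch below shows the general shape of such a normalizer; it is illustrative only, not Omi's actual normalizer, and does not reflect the specific Whisper bugs mentioned above:

```python
# Minimal text normalizer for WER scoring: lowercase, spell out a couple of
# symbols, strip punctuation, collapse whitespace. Illustrative sketch only.
import re

REPLACEMENTS = {
    "%": " percent ",
    "&": " and ",
}

def normalize(text: str) -> str:
    text = text.lower()
    for symbol, word in REPLACEMENTS.items():
        text = text.replace(symbol, word)
    # drop punctuation, keeping intra-word apostrophes and hyphens
    text = re.sub(r"[^\w\s'-]", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

Both the reference and the hypothesis must pass through the same normalizer, otherwise the comparison is biased.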

The full Top 15 is below; the complete 43-model leaderboard (with per-category breakdowns and per-file metrics) is in the GitHub repo.

Dataset — PriMock57

• 57 simulated GP consultations (5–10 minutes each), recorded by seven Babylon Health doctors with role-play patients. 55 files used after removing two that triggered catastrophic hallucinations on multiple models.

• ~80,500 words of British English medical dialogue.

• Reference transcripts cleaned to plain text for fair WER calculation.

• Public repo: babylonhealth/primock57

Evaluation framework

Every model is run through the same pipeline: transcribe the audio, log processing time, normalise the output with our custom text normalizer, and compute WER, M-WER, Drug M-WER, and per-file statistics. Models that crash on long audio get chunking with overlap (typically 30s chunks with 10s overlap and LCS merging). We run on a mix of Apple Silicon (M4 Max), NVIDIA GPUs (T4, L4, A10, H100), and cloud APIs directly. No quantisation on local models unless explicitly noted — we benchmark full precision for fairness.
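The chunk-merging step can be sketched as follows: after transcribing 30 s windows with 10 s of overlap, adjacent transcripts are stitched by locating a shared word run in the overlap region. The function below uses a longest-common-contiguous-run heuristic as a stand-in for the LCS merge mentioned above; the word-window size and the splice logic are assumptions:

```python
# Sketch of overlap stitching for long-audio chunking: merge two word lists
# whose ends overlap by finding the longest common contiguous word run
# within the last/first `window` words. Illustrative, not Omi's exact merge.
def lcs_merge(left: list[str], right: list[str], window: int = 30) -> list[str]:
    tail, head = left[-window:], right[:window]
    best_len, best_i, best_j = 0, 0, 0
    for i in range(len(tail)):
        for j in range(len(head)):
            k = 0
            while (i + k < len(tail) and j + k < len(head)
                   and tail[i + k] == head[j + k]):
                k += 1
            if k > best_len:
                best_len, best_i, best_j = k, i, j
    if best_len == 0:                       # no shared run: just concatenate
        return left + right
    # keep left through the end of the matched run, then right after it
    cut = len(left) - len(tail) + best_i + best_len
    return left[:cut] + right[best_j + best_len:]
```

In the real pipeline the chunk boundaries are in seconds of audio, so the number of overlapping words varies per chunk; the run search absorbs that variation.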

Top 15 — ranked by Medical WER

| # | Model | WER | M-WER | Drug M-WER | Speed (per file) | Type |
|---|-------|-----|-------|------------|------------------|------|
| 1 | Google Gemini 3 Pro Preview | 8.35% | 2.65% | 3.1% | 65s | API |
| 2 | Google Gemini 2.5 Pro | 8.15% | 2.97% | 4.1% | 56s | API |
| 3 | VibeVoice-ASR 9B (Microsoft, open-source) | 8.34% | 3.16% | 5.6% | 97s | Local (H100) |
| 4 | Soniox stt-async-v4 | 9.18% | 3.32% | 7.1% | 46s | API |
| 5 | Google Gemini 3 Flash Preview | 11.33% | 3.64% | 5.2% | 52s | API |
| 6 | ElevenLabs Scribe v2 | 9.72% | 3.86% | 4.3% | 44s | API |
| 7 | AssemblyAI Universal-3 Pro (medical-v1) | 9.55% | 4.02% | 6.5% | 37s | API |
| 8 | Qwen3 ASR 1.7B (open-source) | 9.00% | 4.40% | 8.6% | 7s | Local (A10) |
| 9 | Deepgram Nova-3 Medical | 9.05% | 4.53% | 9.7% | 13s | API |
| 10 | OpenAI GPT-4o Mini (Dec 2025) | 11.18% | 4.85% | 10.6% | 40s | API |
| 11 | Microsoft MAI-Transcribe-1 | 11.52% | 4.85% | 11.2% | 22s | API |
| 12 | ElevenLabs Scribe v1 | 10.87% | 4.88% | 7.5% | 36s | API |
| 13 | Google Gemini 2.5 Flash | 9.45% | 5.01% | 10.3% | 20s | API |
| 14 | Voxtral Mini Transcribe V1 | 11.85% | 5.17% | 11.0% | 22s | API |
| 15 | Parakeet TDT 1.1B | 9.03% | 5.20% | 15.5% | 12s | Local (T4) |

Full 43-model leaderboard, per-category breakdowns, and per-file metrics: GitHub.

Key takeaways for choosing a medical STT engine

If you want the best accuracy and can use a cloud API: Google Gemini 3 Pro or 2.5 Pro. Both sit below 3% M-WER and handle long-form conversation natively.

If you want strong accuracy but don’t want to rely on Google: Soniox, AssemblyAI Universal-3 Pro (with medical-v1 domain), or Deepgram Nova-3 Medical. All three are genuinely competitive with Gemini-tier on M-WER, and Deepgram is the fastest cloud API on the board at 13s/file.

If you want the best open-source model, accuracy-first: Microsoft VibeVoice-ASR 9B. First open-source model to compete with Gemini on medical audio. Requires ~18GB VRAM (L4/A10 is enough; won’t fit on T4) and is slow — 97s/file even on H100 — because it’s a 9B LLM processing audio as tokens rather than a purpose-built speech model.

If you want the best open-source model, cost/speed-balanced: Qwen3-ASR 1.7B. ~14× faster than VibeVoice for only ~1.3 points more M-WER. Requires compute capability ≥ 8.0 (A10 or better; not T4).

If you’re deploying on-device for Apple Silicon: Parakeet TDT models are the fastest option, but be aware their Drug M-WER is weak — 22% for the 0.6B v3 — which may not be acceptable in a clinical context.

What to avoid for medical transcription: Google MedASR (despite being medical-specific, it’s built for single-speaker dictation, not conversations — 49.66% M-WER), Facebook MMS-1B-all (multilingual phonetic vocab, 54% M-WER), and current quantised versions of VibeVoice (aggressive quantisation kills drug-name accuracy).

Limitations & what’s next

• UK English only — the PriMock57 dataset is British English. We’ll extend to multi-language datasets in future updates.

• WER ≠ clinical usefulness — M-WER is a meaningful step toward clinical relevance, but human-graded clinical correctness is the next frontier for this benchmark.

• API cost unmeasured — a future update will include $/hour and latency SLO metrics for the cloud APIs.

• Medical vocabulary is extensible — our 179-term list is a starting point; contributions welcome via the repo.

The full evaluation code, every model’s transcripts, and all per-file metrics are open-source on GitHub — reproduce our results, benchmark your own model, or contribute improvements.


© 2025 - Omi Health B.V.