
Benchmarking Speech-to-Text Models for Long-Form Medical Dialogue

Last updated: 8 April 2026 · 42 models tested · Evaluation code open-source on GitHub

Physicians lose countless hours each week to manual note-taking. At Omi, we’re building an on-device AI-Scribe that transcribes and summarises entire consultations without sending patient data to the cloud. Choosing the right speech-to-text engine is the foundation of that product — so we built the most comprehensive medical STT benchmark we could find, and we keep it current.

What’s new

We now rank models by Medical WER (M-WER) — a metric that only counts errors on clinically relevant words (drugs, conditions, symptoms, anatomy, clinical procedures). Standard WER treats “yeah” and “amoxicillin” as equally important. M-WER doesn’t — and the ranking looks very different as a result. We also report Drug M-WER separately, because misspelling a drug name is a patient safety issue, and drug names turn out to be 2–5× harder than other medical terms for almost every model.
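Conceptually, M-WER is ordinary word error rate restricted to the clinically relevant words in the reference. A minimal sketch of the idea, assuming a toy term list and a simple alignment (this illustrates the metric, not the benchmark's actual implementation, which lives in the repo):

```python
from difflib import SequenceMatcher

# Tiny illustrative subset; the benchmark uses a 179-term medical vocabulary.
MEDICAL_TERMS = {"amoxicillin", "asthma", "ibuprofen"}

def medical_wer(reference: str, hypothesis: str) -> float:
    """WER computed only over clinically relevant reference words:
    a medical word counts as an error unless it is matched verbatim
    in the aligned hypothesis."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    matched = set()  # reference indices aligned to an identical hypothesis word
    for block in SequenceMatcher(None, ref, hyp).get_matching_blocks():
        matched.update(range(block.a, block.a + block.size))
    med_idx = [i for i, w in enumerate(ref) if w in MEDICAL_TERMS]
    if not med_idx:
        return 0.0
    errors = sum(1 for i in med_idx if i not in matched)
    return errors / len(med_idx)
```

On this definition, a transcript that garbles "amoxicillin" but keeps the filler words intact scores badly on M-WER while barely moving plain WER, which is exactly the reordering effect described above.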

Other highlights from recent updates:

42 models benchmarked, across cloud APIs (Google, OpenAI, ElevenLabs, Soniox, AssemblyAI, Deepgram, Mistral, Microsoft Azure Speech, Cohere, Groq) and local/open-source (VibeVoice, Qwen3-ASR, Parakeet, Whisper variants, NVIDIA Canary, Gemma 4, and more).

Google Gemini 3 Pro leads the leaderboard at 2.65% M-WER.

Microsoft VibeVoice-ASR 9B is the first open-source model to genuinely compete with Gemini-tier cloud APIs (#3, 3.16% M-WER) — and notably outperforms Microsoft’s own new closed MAI-Transcribe-1 in Azure Speech by a wide margin on drug-name accuracy.

Qwen3-ASR 1.7B is the best small open-source model — 4.40% M-WER at ~7 seconds per file on an A10 GPU. The strongest accuracy-to-cost ratio for a local deployment.

We replaced Whisper’s default text normalizer with a custom one after discovering two bugs in it that were quietly inflating WER by 2–3% for every model scored with it. The replacement is open-source and drop-in.

Full evaluation code is open-source on GitHub. Reproduce our results, run your own models, or contribute improvements.

Full 42-model leaderboard below, with per-category breakdowns and per-file metrics in the GitHub repo.
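The normalizer mentioned above is applied to both reference and hypothesis before scoring. As a rough sketch of what such a normalizer typically does (Unicode normalisation, contraction expansion, punctuation stripping, whitespace collapse), assuming a toy contraction table — the actual open-source normalizer may differ in details:

```python
import re
import unicodedata

# Illustrative contraction table; a real normalizer carries a much longer list.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}

def normalise(text: str) -> str:
    """Reduce a transcript to a canonical form so WER measures
    recognition errors rather than formatting differences."""
    text = unicodedata.normalize("NFKC", text).lower()
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    text = re.sub(r"[^\w\s']", " ", text)     # strip punctuation, keep apostrophes
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace
```

The point of the highlight above is that subtle bugs at this stage hit every model equally, so they inflate absolute WER numbers industry-wide without necessarily changing rankings.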

Dataset — PriMock57

• 57 simulated GP consultations (5–10 minutes each), recorded by seven Babylon Health doctors with role-play patients. 55 files used after removing two that triggered catastrophic hallucinations on multiple models.

• ~80,500 words of British English medical dialogue.

• Reference transcripts cleaned to plain text for fair WER calculation.

• Public repo: babylonhealth/primock57

Evaluation framework

Every model is run through the same pipeline: transcribe the audio, log processing time, normalise the output with our custom text normalizer, and compute WER, M-WER, Drug M-WER, and per-file statistics. Models that crash on long audio get chunking with overlap (typically 30s chunks with 10s overlap and LCS merging). We run on a mix of Apple Silicon (M4 Max), NVIDIA GPUs (T4, L4, A10, H100), and cloud APIs directly. No quantisation on local models unless explicitly noted — we benchmark full precision for fairness.
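The overlap-merge step for chunked models can be sketched as follows: transcribe overlapping windows, then splice consecutive transcripts at the longest common run of words in the overlap region. The function name, default overlap width, and splice logic here are illustrative assumptions; the repo has the actual implementation:

```python
from difflib import SequenceMatcher

def merge_chunks(prev_words: list[str], next_words: list[str],
                 overlap: int = 30) -> list[str]:
    """Join two chunk transcripts by locating the longest common word run
    in the overlap region and cutting both transcripts there."""
    tail, head = prev_words[-overlap:], next_words[:overlap]
    m = SequenceMatcher(None, tail, head).find_longest_match(
        0, len(tail), 0, len(head))
    if m.size == 0:                 # no shared words: fall back to concatenation
        return prev_words + next_words
    cut_prev = len(prev_words) - len(tail) + m.a + m.size
    cut_next = m.b + m.size
    return prev_words[:cut_prev] + next_words[cut_next:]
```

Anchoring on a common word run rather than a fixed time offset makes the merge robust to the small timing drift and boundary hallucinations that long-audio chunking tends to produce.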

Full leaderboard — ranked by Medical WER

| # | Model | WER | M-WER | Drug M-WER | Avg Speed | Type |
|---|-------|-----|-------|------------|-----------|------|
| 1 | Google Gemini 3 Pro Preview | 8.35% | 2.65% | 3.1% | 64.5s | API |
| 2 | Google Gemini 2.5 Pro | 8.15% | 2.97% | 4.1% | 56.4s | API |
| 3 | VibeVoice-ASR 9B | 8.34% | 3.16% | 5.6% | 96.7s | H100 |
| 4 | Soniox stt-async-v4 | 9.18% | 3.32% | 7.1% | 46.2s | API |
| 5 | Google Gemini 3 Flash Preview | 11.33% | 3.64% | 5.2% | 51.5s | API |
| 6 | ElevenLabs Scribe v2 | 9.72% | 3.86% | 4.3% | 43.5s | API |
| 7 | AssemblyAI Universal-3 Pro (medical-v1) | 9.55% | 4.02% | 6.5% | 37.3s | API |
| 8 | Qwen3 ASR 1.7B | 9.00% | 4.40% | 8.6% | 6.8s | A10 |
| 9 | Deepgram Nova-3 Medical | 9.05% | 4.53% | 9.7% | 12.9s | API |
| 10 | OpenAI GPT-4o Mini (Dec 2025) | 11.18% | 4.85% | 10.6% | 40.4s | API |
| 11 | Microsoft MAI-Transcribe-1 | 11.52% | 4.85% | 11.2% | 21.8s | API |
| 12 | ElevenLabs Scribe v1 | 10.87% | 4.88% | 7.5% | 36.3s | API |
| 13 | Google Gemini 2.5 Flash | 9.45% | 5.01% | 10.3% | 20.2s | API |
| 14 | Voxtral Mini Transcribe V1 | 11.85% | 5.17% | 11.0% | 22.4s | API |
| 15 | Parakeet TDT 1.1B | 9.03% | 5.20% | 15.5% | 12.3s | T4 |
| 16 | Voxtral Mini Transcribe V2 | 11.64% | 5.36% | 12.1% | 18.4s | API |
| 17 | Voxtral Mini 4B Realtime | 11.89% | 5.39% | 11.8% | 270.9s | A10 |
| 18 | Cohere Transcribe (Mar 2026) | 11.81% | 5.59% | 16.6% | 3.9s | A10 |
| 19 | OpenAI Whisper-1 | 13.20% | 5.62% | 10.3% | 104.3s | API |
| 20 | Groq Whisper Large v3 Turbo | 12.14% | 5.75% | 14.4% | 8.0s | API |
| 21 | NVIDIA Canary 1B Flash | 12.03% | 5.97% | 15.7% | 23.4s | T4 |
| 22 | Groq Whisper Large v3 | 11.93% | 5.97% | 13.6% | 8.6s | API |
| 23 | OpenAI GPT-4o Mini Transcribe | 13.60% | 6.03% | 11.4% | 23.2s | API |
| 24 | MLX Whisper Large v3 Turbo | 11.65% | 6.16% | 14.0% | 12.9s | Apple Silicon |
| 25 | Parakeet TDT 0.6B v2 | 10.75% | 6.19% | 17.2% | 5.4s | Apple Silicon |
| 26 | WhisperKit Large v3 Turbo | 12.28% | 6.35% | 14.4% | 21.4s | Apple Silicon |
| 27 | Qwen3 ASR 0.6B | 9.83% | 6.48% | 15.1% | 5.1s | A10 |
| 28 | Kyutai STT 2.6B | 11.20% | 6.51% | 15.7% | 148.4s | T4 |
| 29 | GLM-ASR-Nano-2512 | 10.84% | 7.05% | 17.5% | 87.7s | T4 |
| 30 | Parakeet TDT 0.6B v3 | 9.35% | 7.25% | 22.0% | 6.3s | Apple Silicon |
| 31 | Nemotron Speech Streaming 0.6B | 11.06% | 8.97% | 22.6% | 11.7s | T4 |
| 32 | OpenAI GPT-4o Transcribe | 14.84% | 9.03% | 14.9% | 27.9s | API |
| 33 | NVIDIA Canary-Qwen 2.5B | 12.94% | 9.80% | 22.8% | 105.4s | T4 |
| 34 | Gemma 4 E4B-it^ | 15.69% | 9.99% | 15.5% | 185.4s | T4 |
| 35 | NVIDIA Canary 1B v2 | 14.32% | 11.24% | 20.5% | 9.2s | T4 |
| 36 | IBM Granite Speech 3.3-2B | 16.55% | 12.80% | 23.1% | 109.7s | T4 |
| 37 | Apple SpeechAnalyzer | 12.36% | 13.02% | 27.4% | 6.0s | Apple Silicon |
| 38 | Gemma 4 E2B-it^ | 18.90% | 13.92% | 19.8% | 134.6s | T4 |
| 39 | Azure Foundry Phi-4 | 31.13% | 15.38% | 18.1% | 212.8s | API |
| 40 | Kyutai STT 1B (Multilingual) | 27.28% | 21.23% | 28.9% | 79.5s | T4 |
| 41 | Google MedASR | 64.38% | 49.66% | 58.0% | 1.2s | Apple Silicon |
| 42 | Facebook MMS-1B-all | 38.70% | 54.01% | 72.0% | 28.6s | T4 |

Per-category breakdowns and per-file metrics are on GitHub.

Key takeaways for choosing a medical STT engine

If you want the best accuracy and can use a cloud API: Google Gemini 3 Pro or 2.5 Pro. Both sit below 3% M-WER and handle long-form conversation natively.

If you want strong accuracy but don’t want to rely on Google: Soniox, AssemblyAI Universal-3 Pro (with the medical-v1 domain), or Deepgram Nova-3 Medical. All three are genuinely competitive on M-WER, and Deepgram is the fastest of the three at ~13s/file.

If you want the best open-source model, accuracy-first: Microsoft VibeVoice-ASR 9B. First open-source model to compete with Gemini on medical audio. Requires ~18GB VRAM (L4/A10 is enough; won’t fit on T4) and is slow — 97s/file even on H100 — because it’s a 9B LLM processing audio as tokens rather than a purpose-built speech model.

If you want the best open-source model, cost/speed-balanced: Qwen3-ASR 1.7B. ~14× faster than VibeVoice for only ~1.3 points more M-WER. Requires compute capability ≥ 8.0 (A10 or better; not T4).

If you’re deploying on-device for Apple Silicon: Parakeet TDT models are the fastest option, but be aware their Drug M-WER is weak — 22% for the 0.6B v3 — which may not be acceptable in a clinical context.

What to avoid for medical transcription: Google MedASR (despite being medical-specific, it’s built for single-speaker dictation, not conversations — 49.66% M-WER), Facebook MMS-1B-all (multilingual phonetic vocab, 54% M-WER), and current quantised versions of VibeVoice (aggressive quantisation kills drug-name accuracy).

Limitations & what’s next

UK English only — the PriMock57 dataset is British English. We’ll extend to multi-language datasets in future updates.

WER ≠ clinical usefulness — M-WER is a meaningful step toward clinical relevance, but human-graded clinical correctness is the next frontier for this benchmark.

API cost unmeasured — a future update will include $/hour and latency SLO metrics for the cloud APIs.

Medical vocabulary is extensible — our 179-term list is a starting point; contributions welcome via the repo.

The full evaluation code, every model’s transcripts, and all per-file metrics are open-source on GitHub — reproduce our results, benchmark your own model, or contribute improvements.


© 2025 - Omi Health B.V.