
Benchmarking Speech-to-Text Models for Long-Form Medical Dialogue

Updated Mar 27, 2026

(update — 27 Mar 2026)

What changed in the 27 Mar 2026 refresh

5 new models tested (26 → 31): Microsoft VibeVoice-ASR 9B (new open-source leader at 8.34% WER, but needs ~18GB VRAM and is slow even on H100), ElevenLabs Scribe v2 (9.72% vs 10.87% for v1), NVIDIA Nemotron Speech Streaming 0.6B (11.06% on T4), Voxtral Mini 2602 via Transcription API (11.64%), and Voxtral Mini 4B via vLLM realtime (11.89% on H100). Also evaluated LiquidAI LFM2.5-Audio-1.5B and Meta SeamlessM4T v2 Large—neither suited for long-form transcription.

Replaced Whisper’s text normalizer with a custom one: Found two bugs in Whisper’s EnglishTextNormalizer that inflated WER by ~2–3% across all models: (1) “oh” treated as zero—in medical conversations it’s always an interjection, not a digit; (2) missing word equivalences (ok/okay/k, yeah/yep/yes, mum/mom, alright/all right, kinda/kind of). All v3 scores are recalculated with the custom normalizer. Code in evaluate/text_normalizer.py—drop-in replacement, no whisper dependency.
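For illustration, the two fixes can be sketched as a minimal normalizer (a simplified stand-in with hypothetical names, not the actual evaluate/text_normalizer.py):

```python
import re

# Word equivalences the stock Whisper normalizer misses (assumption:
# applied symmetrically to reference and hypothesis before WER).
EQUIVALENCES = {
    "ok": "okay", "k": "okay",
    "yeah": "yes", "yep": "yes",
    "mum": "mom",
    "alright": "all right",
    "kinda": "kind of",
}

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and map equivalent spellings.

    Unlike Whisper's EnglishTextNormalizer, "oh" is kept as a word
    (an interjection) rather than being rewritten to the digit 0.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return " ".join(EQUIVALENCES.get(w, w) for w in words)
```

Because the same mapping is applied to both reference and hypothesis, a model is no longer penalized for writing "ok" where the transcript says "okay".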

VibeVoice-ASR 9B is the new open-source leader: First open-source model to genuinely compete with Gemini-tier cloud APIs on medical audio (8.34% vs 8.15% for Gemini 2.5 Pro). Needs ~18GB VRAM (L4/A10 sufficient, won’t fit on T4). Even on H100, 97s/file vs 6s for Parakeet.

Parakeet TDT 0.6B v3 remains the edge story: 9.35% WER at 6 seconds per file on Apple Silicon. A 0.6B model within 1% of a 9B model.

ElevenLabs Scribe v2 is a meaningful upgrade: 9.72% vs 10.87% for v1. Best cloud API if you don’t want Google.

LFM Audio and SeamlessM4T didn’t make the cut: LFM2.5-Audio-1.5B isn’t a dedicated ASR model—transcription via prompting yielded sparse keyword extractions (~74 words from 1400-word conversations with 2s chunks) or hallucination loops with longer chunks. SeamlessM4T is a translation model that summarized instead of transcribing (~677 words from ~1400).

———————————————————

(update — 24 Dec 2025)

What changed in the 24 Dec 2025 refresh

Open-sourced the benchmark: Full evaluation code now available on GitHub—run your own models, reproduce our results, or contribute improvements.

github.com/Omi-Health/medical-STT-eval

7 new models tested: Added Gemini 3 Pro/Flash Preview, Parakeet v3, updated GPT-4o Mini, NVIDIA Canary 1B v2, IBM Granite Speech, and Google MedASR.

Parakeet v3 jumps to #3: NVIDIA's latest Parakeet release now beats Gemini 2.5 Flash with 11.9% WER at just 6 seconds per file—best local model for on-device use.

Google MedASR tested: Despite being Google's medical-specific model, it scored worst at 64.9% WER. Key insight: MedASR is optimized for single-speaker dictation, not doctor-patient conversations.

Hallucination patterns documented: We identified repetition loops in autoregressive models (Canary 1B v2, Granite Speech, Kyutai) and developed chunking strategies to mitigate them.
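A cheap heuristic for flagging such loops (an illustrative sketch, not necessarily the exact detector used in this benchmark) is to count how often any short word n-gram repeats; natural dialogue rarely repeats a five-word phrase many times, while a stuck autoregressive decoder does so constantly:

```python
from collections import Counter

def has_repetition_loop(transcript: str, n: int = 5, threshold: int = 4) -> bool:
    """Flag a transcript if any n-word phrase occurs >= threshold times.

    n and threshold are illustrative defaults; tune them on known-good
    transcripts so normal conversational repetition is not flagged.
    """
    words = transcript.split()
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return bool(grams) and max(grams.values()) >= threshold
```

Files that trip the check can then be re-run with the chunked pipeline instead of long-form decoding.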

—————————————

(update — 03 Aug 2025)

What changed in the 03 Aug 2025 refresh

Extended Whisper-style normalisation: We now strip fillers ("um"), expand contractions ("you're → you are"), and standardise punctuation/numbers before calculating WER. Most models gained 0.5-1 pp accuracy.

Extra models & fresh weights: Added Gemini 2.5 Pro / Flash, Voxtral Small, and Kyutai.

—————————————

Physicians lose countless hours each week to manual note-taking, so we're developing an on-device AI-Scribe that can transcribe and summarise entire consultations without sending any data to the cloud. To choose the right speech-to-text engine, we benchmarked 31 open- and closed-source models on PriMock57—a set of 57 simulated GP consultations (5–10 minutes each) recorded by Babylon Health clinicians. We ran the audio files through each model, logged word-error rate (WER), speed, and consistency, and applied chunking only where models crashed on audio longer than 40 seconds. The results surprised us—and they'll help anyone looking to bring reliable, privacy-first transcription into clinical workflows.

Dataset — PriMock57

• 57 simulated GP consultations recorded by seven Babylon Health doctors with role-play patients.

• Reference transcripts were cleaned to plain text for fair Word-Error-Rate (WER) calculation.

• Two recordings (day1_consultation07, day3_consultation03) triggered catastrophic hallucinations on multiple models, so 55 files remain.

• Public repo: https://github.com/babylonhealth/primock57

Evaluation framework

Transcription — a per-model runner saves a .txt and logs processing time.

Normalisation — extended Whisper-style preprocessing strips fillers ("um"), expands contractions ("you're → you are"), and standardises punctuation/numbers before WER calculation.

Metrics — scripts compute WER, best/worst file and standard deviation.
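For reference, WER is word-level edit distance divided by the reference length; a dependency-free sketch of the metric (the repo's scripts may use a library instead):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / ref words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, computed row by row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / len(ref)
```

Per-model best/worst file and standard deviation then follow from applying this over all 55 files.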

Comparison — results merge into CSV and rankings for plotting.

Chunking — only applied to models that break on audio longer than 40 s (30 s chunks with 10 s overlap).
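The 30 s windows with 10 s overlap can be derived as below (a sketch; stitching the overlapped text back together is model-specific and omitted here):

```python
def chunk_spans(duration_s: float, chunk_s: float = 30.0,
                overlap_s: float = 10.0) -> list[tuple[float, float]]:
    """Return (start, end) windows: 30 s chunks advancing by 20 s,
    so consecutive chunks share a 10 s overlap for stitching."""
    step = chunk_s - overlap_s
    spans, start = [], 0.0
    while start < duration_s:
        spans.append((start, min(start + chunk_s, duration_s)))
        if start + chunk_s >= duration_s:
            break
        start += step
    return spans
```

Each span is transcribed independently, and the overlap gives the stitcher shared context to align and deduplicate the boundary text.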

Hardware & run types

Local Mac – Apple M4 Max 64 GB using MLX & WhisperKit

Local GPU – AWS g6.2xl (NVIDIA L4 24 GB) running vLLM and NVIDIA NeMo; T4 and H100 instances for the Mar 2026 additions

Cloud APIs – Groq, ElevenLabs, OpenAI, Mistral, Gemini

Azure – Foundry Phi-4 multimodal endpoint

Results (55 files, 31 models)

| # | Model | Avg WER | Best–Worst | Avg sec/file | Host | Info |
|---|-------|---------|------------|--------------|------|------|
| 1 | Google Gemini 2.5 Pro | 8.15% | 4.3–14.9 | 56s | API (Google) | Long |
| 2 | Microsoft VibeVoice-ASR 9B | 8.34% | 4.6–14.2 | 97s | Local (H100) | Long |
| 3 | Google Gemini 3 Pro Preview | 8.35% | 4.2–16.7 | 65s | API (Google) | Long |
| 4 | Parakeet TDT 0.6B v3 | 9.35% | 5.1–15.8 | 6s | Local (M4) | Long |
| 5 | Google Gemini 2.5 Flash | 9.45% | 5.0–34.5 | 20s | API (Google) | Long |
| 6 | ElevenLabs Scribe v2 | 9.72% | 5.2–64.1 | 44s | API (ElevenLabs) | Long |
| 7 | Parakeet TDT 0.6B v2 | 10.75% | 6.8–17.5 | 5s | Local (M4) | Long |
| 8 | ElevenLabs Scribe v1 | 10.87% | 5.5–64.9 | 36s | API (ElevenLabs) | Long |
| 9 | NVIDIA Nemotron Speech Streaming 0.6B | 11.06% | 6.3–17.8 | 12s | Local (T4) | Long |
| 10 | OpenAI GPT-4o Mini (2025-12-15) | 11.18% | 5.9–22.1 | 40s | API (OpenAI) | Long |
| 11 | Kyutai STT 2.6B (en) | 11.20% | 6.2–18.3 | 148s | Local (L4 GPU) | Long |
| 12 | Google Gemini 3 Flash Preview | 11.33% | 5.9–22.7 | 52s | API (Google) | Long |
| 13 | Voxtral Mini 2602 (Transcription API) | 11.64% | 6.1–44.5 | 18s | API (Mistral) | Long |
| 14 | MLX Whisper-L v3-turbo | 11.65% | 6.0–29.8 | 13s | Local (M4) | Long |
| 15 | Voxtral Mini (API) | 11.85% | 6.3–44.9 | 22s | API (Mistral) | Long |
| 16 | Voxtral Mini 4B (vLLM realtime) | 11.89% | 6.5–45.2 | 693s* | Local (T4/H100) | Long |
| 17 | Groq Whisper-L v3 | 12.05% | 7.1–18.9 | 9s | API (Groq) | Long |
| 18 | Groq Whisper-L v3-turbo | 12.15% | 6.8–19.4 | 8s | API (Groq) | Long |
| 19 | NVIDIA Canary 1B Flash | 12.20% | 6.9–19.6 | 23s | Local (L4 GPU) | Chunk |
| 20 | WhisperKit-L v3-turbo | 12.25% | 6.1–19.8 | 21s | Local (macOS) | Long |
| 21 | Apple SpeechAnalyzer | 12.50% | 7.0–19.1 | 6s | Local (macOS) | Long |
| 22 | NVIDIA Canary-Qwen 2.5B | 13.05% | 6.7–61.8 | 105s | Local (L4 GPU) | Chunk |
| 23 | OpenAI Whisper-1 | 13.15% | 5.8–101.5 | 104s | API (OpenAI) | Long |
| 24 | OpenAI GPT-4o Mini Transcribe | 13.65% | 6.5–40.4 | | API (OpenAI) | Long |
| 25 | NVIDIA Canary 1B v2** | 14.45% | 7.9–42.8 | 9s | Local (L4 GPU) | Long |
| 26 | OpenAI GPT-4o Transcribe | 14.75% | 6.6–63.8 | 28s | API (OpenAI) | Long |
| 27 | IBM Granite Speech 3.3-2B*** | 16.55% | 6.1–32.7 | 110s | Local (L4 GPU) | Chunk |
| 28 | Kyutai STT 1B (en/fr) | 27.10% | 7.8–220.5 | 80s | Local (L4 GPU) | Long |
| 29 | Azure Foundry Phi-4 | 30.75% | 7.3–104.5 | 213s | API (Azure) | Chunk |
| 30 | Google MedASR | 62.50% | 29.5–96.2 | 1s | Local (M4) | Long |
| 31 | LiquidAI LFM2.5-Audio-1.5B† | | | | Local (GPU) | N/A |

*54/55 files (1 blocked by safety filter) **3 files with hallucination loops ***Requires chunking to avoid repetition loops †No usable long-form transcript; excluded from the WER ranking (see the 27 Mar 2026 notes)

Key findings

Google Gemini 2.5 Pro leads accuracy at 8.15% WER, with Microsoft VibeVoice-ASR 9B (8.34%) and Gemini 3 Pro Preview (8.35%) close behind.

Parakeet TDT 0.6B v3 is the fast local champion: 9.35% WER at 6 seconds per file makes it ideal for on-device medical scribes.

OpenAI GPT-4o Mini (2025-12-15) is OpenAI's best at 11.18% WER, well ahead of GPT-4o Mini Transcribe (13.65%) and GPT-4o Transcribe (14.75%).

Google MedASR scored worst (62.5% WER) despite being medical-specific: it is optimized for single-speaker dictation, not doctor-patient conversations.

Autoregressive models hallucinate: Canary 1B v2, Granite Speech, and Kyutai all exhibited repetition loops on certain files. Chunking with overlap mitigates this.

Groq Whisper-v3 (turbo) offers the best price/latency balance in the cloud.

Apple SpeechAnalyzer remains a solid choice for Swift apps at 12.5% WER.

Limitations & next steps

• UK-English only → we'll use multi-language datasets in the future.

• WER ≠ clinical usefulness → needs more thorough evaluation on medical correctness.

• API cost unmeasured → v2 will include $/hour and CO₂ metrics.

• Evaluation code now open-source → github.com/Omi-Health/medical-STT-eval

Get in touch

Want to try the on-device AI-Scribe or plug in your own model?

Email [email protected]


© 2025 - Omi Health B.V.