
Benchmarking Speech-to-Text Models for Long-Form Medical Dialogue

Jul 29, 2025

(update — 24 Dec 2025)

What changed in the 24 Dec 2025 refresh

Open-sourced the benchmark: Full evaluation code now available on GitHub—run your own models, reproduce our results, or contribute improvements.

github.com/Omi-Health/medical-STT-eval

7 new models tested: Added Gemini 3 Pro/Flash Preview, Parakeet v3, updated GPT-4o Mini, NVIDIA Canary 1B v2, IBM Granite Speech, and Google MedASR.

Parakeet v3 jumps to #3: NVIDIA's latest Parakeet release now beats Gemini 2.5 Flash with 11.9% WER at just 6 seconds per file—best local model for on-device use.

Google MedASR tested: Despite being Google's medical-specific model, it scored worst at 64.9% WER. Key insight: MedASR is optimized for single-speaker dictation, not doctor-patient conversations.

Hallucination patterns documented: We identified repetition loops in autoregressive models (Canary 1B v2, Granite Speech, Kyutai) and developed chunking strategies to mitigate them.

—————————————

(update — 03 Aug 2025)

What changed in the 03 Aug 2025 refresh

Extended Whisper-style normalisation: We now strip fillers ("um"), expand contractions ("you're → you are"), and standardise punctuation/numbers before calculating WER. Most models gained 0.5-1 pp accuracy.

Extra models & fresh weights: Added Gemini 2.5 Pro / Flash, Voxtral Small, and Kyutai STT.

—————————————

Physicians lose countless hours each week to manual note-taking, so we're developing an on-device AI-Scribe that can transcribe and summarise entire consultations without sending any data to the cloud. To choose the right speech-to-text engine, we benchmarked 26 open- and closed-source models on PriMock57—a set of 57 simulated GP consultations (5–10 minutes each) recorded by Babylon Health clinicians. We ran the audio files through each model, logged word-error rate (WER), speed, and consistency, and applied chunking only where models crashed on audio longer than 40 seconds. The results surprised us—and they'll help anyone looking to bring reliable, privacy-first transcription into clinical workflows.

Dataset — PriMock57

• 57 simulated GP consultations recorded by seven Babylon Health doctors with role-play patients.

• Reference transcripts were cleaned to plain text for fair Word-Error-Rate (WER) calculation.

• Two recordings (day1_consultation07, day3_consultation03) triggered catastrophic hallucinations on multiple models, so 55 files remain.

• Public repo: https://github.com/babylonhealth/primock57

Evaluation framework

Transcription — a per-model runner saves a .txt transcript and logs processing time.

Normalisation — extended Whisper-style preprocessing strips fillers ("um"), expands contractions ("you're → you are"), and standardises punctuation/numbers before WER calculation (a sketch of this step follows below).

Metrics — scripts compute WER, best/worst file and standard deviation.

Comparison — results merge into CSV and rankings for plotting.

Chunking — only applied to models that break on audio longer than 40 s (30 s chunks with 10 s overlap).
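To make the normalisation and scoring steps concrete, here is a minimal sketch. It is an illustration under our assumptions rather than the exact code in the open-sourced repo: the filler and contraction tables are abbreviated, number standardisation is omitted, and the helper names (normalise, wer, summarise) are ours.

```python
# Minimal sketch of normalisation + WER scoring (illustrative, not the repo's exact code).
import re
import statistics

# Abbreviated tables; the real filler/contraction lists are longer, and number
# standardisation is omitted here.
FILLERS = {"um", "uh", "erm", "mm", "hmm"}
CONTRACTIONS = {"you're": "you are", "i'm": "i am", "it's": "it is",
                "don't": "do not", "can't": "cannot", "we'll": "we will"}

def normalise(text: str) -> list[str]:
    """Lower-case, expand contractions, drop punctuation and fillers."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[^\w\s']", " ", text)   # strip punctuation, keep apostrophes
    return [w for w in text.split() if w not in FILLERS]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance on normalised text."""
    ref, hyp = normalise(reference), normalise(hypothesis)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1] / max(len(ref), 1)

def summarise(pairs: dict) -> dict:
    """pairs maps file id -> (reference, hypothesis); returns the summary metrics."""
    scores = {fid: wer(ref, hyp) for fid, (ref, hyp) in pairs.items()}
    return {
        "avg_wer": statistics.mean(scores.values()),
        "std_dev": statistics.pstdev(scores.values()),
        "best_file": min(scores, key=scores.get),
        "worst_file": max(scores, key=scores.get),
    }
```

The same normalisation is applied to both the reference transcript and the model output before scoring, so a model is not penalised for fillers, contraction choices, or punctuation style.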

Hardware & run types

Local Mac – Apple M4 Max 64 GB using MLX & WhisperKit

Local GPU – AWS g5.2xl (NVIDIA L4 24 GB) running vLLM and NVIDIA NeMo

Cloud APIs – Groq, ElevenLabs, OpenAI, Mistral, Gemini

Azure – Foundry Phi-4 multimodal endpoint

Results (55 files)

| # | Model | Avg WER | Best–Worst WER (%) | Avg sec/file | Host | Mode |
|---|-------|---------|--------------------|--------------|------|------|
| 1 | Google Gemini 2.5 Pro | 10.8% | 6.1–17.0 | 56s | API (Google) | Long |
| 2 | Google Gemini 3 Pro Preview* | 11.0% | 6.1–19.2 | 65s | API (Google) | Long |
| 3 | Parakeet TDT 0.6B v3 | 11.9% | 6.7–18.4 | 6s | Local (M4) | Long |
| 4 | Google Gemini 2.5 Flash | 12.1% | 6.6–37.5 | 20s | API (Google) | Long |
| 5 | OpenAI GPT-4o Mini (2025-12-15) | 12.8% | 7.2–24.5 | 41s | API (OpenAI) | Long |
| 6 | Parakeet TDT 0.6B v2 | 13.3% | 8.5–20.2 | 5s | Local (M4) | Long |
| 7 | ElevenLabs Scribe v1 | 13.5% | 7.0–67.9 | 36s | API (ElevenLabs) | Long |
| 8 | Kyutai STT 2.6B (en) | 13.8% | 7.8–20.7 | 148s | Local (L4 GPU) | Long |
| 9 | Google Gemini 3 Flash Preview | 13.9% | 7.7–25.3 | 52s | API (Google) | Long |
| 10 | MLX Whisper-L v3-turbo | 14.2% | 7.5–32.1 | 13s | Local (M4) | Long |
| 11 | Groq Whisper-L v3 | 14.3% | 8.8–21.2 | 9s | API (Groq) | Long |
| 12 | Voxtral Mini (API) | 14.4% | 7.8–47.5 | 23s | API (Mistral) | Long |
| 13 | Voxtral Mini (Transcription) | 14.4% | 7.8–47.9 | 23s | API (Mistral) | Long |
| 14 | NVIDIA Canary 1B Flash | 14.5% | 8.5–22.0 | 23s | Local (L4 GPU) | Chunk |
| 15 | Groq Whisper-L v3-turbo | 14.5% | 8.5–21.7 | 8s | API (Groq) | Long |
| 16 | WhisperKit-L v3-turbo | 14.5% | 7.7–22.1 | 21s | Local (macOS) | Long |
| 17 | Apple SpeechAnalyzer | 14.8% | 8.7–21.4 | 6s | Local (macOS) | Long |
| 18 | NVIDIA Canary-Qwen 2.5B | 15.4% | 8.3–64.5 | 105s | Local (L4 GPU) | Chunk |
| 19 | OpenAI Whisper-1 | 15.5% | 7.2–104.9 | 104s | API (OpenAI) | Long |
| 20 | OpenAI GPT-4o Mini Transcribe | 16.0% | 8.1–43.0 | n/a | API (OpenAI) | Long |
| 21 | NVIDIA Canary 1B v2** | 16.8% | 9.6–45.4 | 9s | Local (L4 GPU) | Long |
| 22 | OpenAI GPT-4o Transcribe | 17.1% | 8.2–66.5 | 28s | API (OpenAI) | Long |
| 23 | IBM Granite Speech 3.3-2B*** | 18.9% | 7.6–35.4 | 110s | Local (L4 GPU) | Chunk |
| 24 | Kyutai STT 1B (en/fr) | 29.4% | 9.5–223.1 | 80s | Local (L4 GPU) | Long |
| 25 | Azure Foundry Phi-4 | 33.1% | 9.0–107.1 | 213s | API (Azure) | Chunk |
| 26 | Google MedASR | 64.9% | 31.1–98.9 | 1s | Local (M4) | Long |

* 54/55 files (1 blocked by safety filter)
** 3 files with hallucination loops
*** Requires chunking to avoid repetition loops

Key findings

Google Gemini 2.5 Pro leads accuracy at 10.8% WER, with the new Gemini 3 Pro Preview close behind at 11.0%.

Parakeet v3 is the new local champion—11.9% WER at 6 seconds per file makes it ideal for on-device medical scribes.

OpenAI GPT-4o Mini (Dec 2025) improved from 15.9% to 12.8% WER and now ranks #5 overall.

Google MedASR scored worst (64.9% WER) despite being medical-specific—it's designed for dictation, not conversations.

Autoregressive models hallucinate: Canary 1B v2, Granite Speech, and Kyutai all exhibited repetition loops on certain files. Chunking with overlap mitigates this (see the sketch after this list).

Groq Whisper-v3 (turbo) offers the best price/latency balance in the cloud.

Apple SpeechAnalyzer remains a solid choice for Swift apps at 14.8% WER.
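To illustrate the two mitigations mentioned above, here is a minimal sketch under our own naming (chunk_audio and has_repetition_loop are hypothetical helpers, not functions from the released repo): split long recordings into 30 s chunks with 10 s overlap, and flag transcripts that collapse into back-to-back n-gram repeats.

```python
# Minimal sketch: 30 s chunks with 10 s overlap for models that fail on long audio,
# plus a crude detector for the repetition loops seen in some autoregressive models.
import soundfile as sf  # assumed dependency for reading the audio files

def chunk_audio(path: str, chunk_s: float = 30.0, overlap_s: float = 10.0):
    """Return (chunks, sample_rate); chunks are overlapping float32 arrays."""
    audio, sr = sf.read(path, dtype="float32")
    size, step = int(chunk_s * sr), int((chunk_s - overlap_s) * sr)
    chunks = []
    for start in range(0, len(audio), step):
        chunks.append(audio[start:start + size])
        if start + size >= len(audio):
            break  # last chunk reached the end of the file
    return chunks, sr

def has_repetition_loop(transcript: str, ngram: int = 5, max_repeats: int = 4) -> bool:
    """Flag transcripts where one n-gram repeats back-to-back many times."""
    words = transcript.lower().split()
    for i in range(len(words) - ngram):
        gram = words[i:i + ngram]
        repeats, j = 1, i + ngram
        while words[j:j + ngram] == gram:
            repeats += 1
            j += ngram
        if repeats >= max_repeats:
            return True
    return False
```

Overlapping chunks mean the text at each boundary is transcribed twice, so the merge step has to deduplicate it; in exchange, the overlap keeps words at a cut point from being dropped or mangled.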

Limitations & next steps

• UK-English only → we'll use multi-language datasets in the future.

• WER ≠ clinical usefulness → we need a more thorough evaluation of medical correctness.

• API cost unmeasured → v2 will include $/hour and CO₂ metrics.

• Evaluation code now open-source → github.com/Omi-Health/medical-STT-eval

Get in touch

Want to try the on-device AI-Scribe or plug in your own model?

Email [email protected]


© 2025 - Omi Health B.V.