
Benchmarking Speech-to-Text Models for Long-Form Medical Dialogue

Jul 29, 2025

(update — 03 Aug 2025)

What changed in the 03 Aug 2025 refresh

  • Extended Whisper-style normalisation: We now strip fillers (“um”), expand contractions (“you’re → you are”), and standardise punctuation/numbers before calculating WER. Most models gained 0.5–1 pp accuracy.

  • Extra models & fresh weights: Added Gemini 2.5 Pro / Flash, Voxtral Small, and the Kyutai STT models.

—————————————

Physicians lose countless hours each week to manual note-taking, so we’re developing an on-device AI-Scribe that can transcribe and summarise entire consultations without sending any data to the cloud. To choose the right speech-to-text engine, we benchmarked 20 open- and closed-source models on PriMock57—a set of 57 simulated GP consultations (5–10 minutes each) recorded by Babylon Health clinicians. We ran the audio files through each model, logged word-error rate (WER), speed, and consistency, and applied chunking only where models crashed on audio longer than 40 seconds. The results surprised us—and they’ll help anyone looking to bring reliable, privacy-first transcription into clinical workflows.

  1. Dataset — PriMock57

• 57 simulated GP consultations recorded by seven Babylon Health doctors with role-play patients.

• Reference transcripts were cleaned to plain text for fair Word-Error-Rate (WER) calculation.

• Two recordings (day1_consultation07, day3_consultation03) triggered catastrophic hallucinations on multiple models, so 55 files remain.

• Public repo: https://github.com/babylonhealth/primock57

  2. Evaluation framework

  • Transcription — a per-model runner saves a .txt and logs processing time.

  • Normalisation — extended Whisper-style preprocessing strips fillers ("um"), expands contractions ("you're → you are"), and standardises punctuation/numbers before WER calculation (see the first sketch after this list).

  • Metrics — scripts compute per-file WER, best/worst file, and standard deviation (second sketch below).

  • Comparison — results are merged into a CSV and ranked for plotting.

  • Chunking — only applied to Canary-Qwen 2.5 B, Canary-1B-Flash and Phi-4, which break on audio longer than 40 s (30 s chunks with 10 s overlap; third sketch below).
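To make the normalisation step concrete, here is a minimal Python sketch of the idea. The filler and contraction lists are illustrative subsets, not our full tables, and the real pipeline also spells out and standardises numbers:

```python
import re

FILLERS = {"um", "uh", "erm"}                         # illustrative subset
CONTRACTIONS = {"you're": "you are", "i'm": "i am",   # illustrative subset
                "don't": "do not", "it's": "it is"}

def normalise(text: str) -> str:
    """Extended Whisper-style normalisation (simplified sketch)."""
    text = text.lower()
    # Expand contractions before stripping punctuation.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Standardise punctuation: replace it with spaces.
    text = re.sub(r"[^\w\s]", " ", text)
    # Strip filler words and collapse whitespace.
    words = [w for w in text.split() if w not in FILLERS]
    return " ".join(words)

print(normalise("Um, you're feeling dizzy?"))  # -> "you are feeling dizzy"
```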
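The metrics step can be sketched with the jiwer package, which computes WER as (substitutions + deletions + insertions) / reference words. The input structure here is hypothetical; it assumes reference and hypothesis texts have already been normalised as above:

```python
import statistics
import jiwer  # pip install jiwer

def summarise(pairs: dict[str, tuple[str, str]]) -> None:
    """pairs maps filename -> (reference_text, hypothesis_text)."""
    wers = {f: jiwer.wer(ref, hyp) for f, (ref, hyp) in pairs.items()}
    best = min(wers, key=wers.get)
    worst = max(wers, key=wers.get)
    print(f"avg WER : {statistics.mean(wers.values()):.1%}")
    print(f"best    : {best} ({wers[best]:.1%})")
    print(f"worst   : {worst} ({wers[worst]:.1%})")
    print(f"std dev : {statistics.stdev(wers.values()):.1%}")
```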
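And a sketch of the overlapping-window chunking used for the three models that break on long audio, reading with the soundfile package. Merging the overlapping transcripts back into one text is model-specific and omitted here:

```python
import soundfile as sf  # pip install soundfile

CHUNK_S, OVERLAP_S = 30, 10

def chunks(path: str):
    """Yield 30 s windows with 10 s overlap (i.e. a 20 s stride)."""
    audio, sr = sf.read(path)
    window = CHUNK_S * sr
    stride = (CHUNK_S - OVERLAP_S) * sr
    for start in range(0, len(audio), stride):
        yield audio[start:start + window]
        if start + window >= len(audio):
            break
```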

  3. Hardware & run types

Local Mac  – Apple M4 Max, 64 GB, using MLX & WhisperKit

Local GPU  – AWS g5.2xlarge (NVIDIA L4, 24 GB), running vLLM and NVIDIA NeMo

Cloud APIs – Groq, ElevenLabs, OpenAI, Mistral, Gemini

Azure      – Foundry Phi-4 multimodal endpoint
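For the local Mac rows, a per-model runner might look like the following sketch using the open-source mlx-whisper package; the model repo and file paths are illustrative. As described above, each runner saves the transcript as a .txt and logs the processing time:

```python
import pathlib
import time
import mlx_whisper  # pip install mlx-whisper

audio = pathlib.Path("audio/day1_consultation01.wav")  # hypothetical path

t0 = time.perf_counter()
result = mlx_whisper.transcribe(
    str(audio),
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
)
elapsed = time.perf_counter() - t0

# Save transcript next to the other outputs for this model.
out_dir = pathlib.Path("out/mlx-whisper-l-v3-turbo")
out_dir.mkdir(parents=True, exist_ok=True)
(out_dir / audio.with_suffix(".txt").name).write_text(result["text"])
print(f"{audio.name}: {elapsed:.1f} s")
```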

  4. Results (55 files)

 #  Model                            Avg WER  Best–Worst (% WER)  Avg sec/file  Host              Info
 1  google-gemini-2.5-pro            10.8 %   6.1–17.0            56 s          API (Google)      Long
 2  google-gemini-2.5-flash          12.1 %   6.6–37.5            20 s          API (Google)      Long
 3  parakeet-0.6B v2                 13.3 %   8.5–20.2            5 s           Local (M4)        Long
 4  elevenlabs-scribe v1             13.5 %   7.0–67.9            36 s          API (ElevenLabs)  Long
 5  kyutai STT 2.6B (en)             13.8 %   7.8–20.7            148 s         Local (L4 GPU)    Long
 6  mlx Whisper-L v3-turbo           14.2 %   7.5–32.1            13 s          Local (M4)        Long
 7  groq Whisper-L v3                14.3 %   8.8–21.2            9 s           API (Groq)        Long
 8  Voxtral-mini 3B                  14.3 %   7.8–47.9            74 s          Local (L4 GPU)    Long
 9  Voxtral-mini (API)               14.4 %   7.8–47.5            23 s          API (Mistral)     Long
10  Canary-1B Flash                  14.5 %   8.5–22.0            23 s          Local (L4 GPU)    Chunk
11  groq Whisper-L v3-turbo          14.5 %   8.5–21.7            8 s           API (Groq)        Long
12  whisperkit-L v3-turbo            14.5 %   7.7–22.1            21 s          Local (macOS)     Long
13  Apple SpeechAnalyzer             14.8 %   8.7–21.4            6 s           Local (macOS)     Long
14  Voxtral-small (chat)             15.4 %   5.9–97.4            32 s          API (Mistral)     Long
15  NVIDIA Canary-Qwen 2.5B          15.4 %   8.3–64.5            105 s         Local (L4 GPU)    Chunk
16  OpenAI Whisper-1                 15.5 %   7.2–104.9           104 s         API (OpenAI)      Long
17  OpenAI GPT-4o-mini (transcribe)  15.9 %   8.1–43.0            —             API (OpenAI)      Long
18  OpenAI GPT-4o (transcribe)       17.1 %   8.2–66.5            28 s          API (OpenAI)      Long
19  Kyutai STT 1B (en/fr)            29.4 %   9.5–223.1           80 s          Local (L4 GPU)    Long
20  Azure Foundry Phi-4              33.1 %   9.0–107.1           213 s         API (Azure)       Chunk

Best–Worst is the per-file WER range in %. Info: Long = single pass over the full recording; Chunk = 30 s chunks with 10 s overlap.

  5. Key findings

  • Google Gemini 2.5 Pro leads accuracy at 10.8% WER, with Gemini Flash close behind at 12.1%.

  • Parakeet-0.6B on the M4 transcribes a full consultation in about 5 s, making it a great fit for local use.

  • Groq Whisper-v3 (turbo) offers the best price/latency balance in the cloud.

  • Chunking rescues Canary, Canary-Qwen & Phi-4 but doubles runtime.

  • Apple SpeechAnalyzer is a strong fit for Swift apps.

  • ElevenLabs Scribe ranks 4th in accuracy but shows high variability (7.0–67.9% range).

  • Voxtral-mini beats Canary-Qwen 2.5 B for multilingual long-form.

  6. Limitations & next steps

• UK English only → we’ll use multilingual datasets in the future.

• WER ≠ clinical usefulness → we need a more thorough evaluation of medical correctness.

• API cost unmeasured → v2 will include $/hour and CO₂ metrics.


Get in touch

Want to try the on-device AI-Scribe or plug in your own model?

Email [email protected]


© 2025 - Omi Health B.V.