Evaluations
Benchmarking Speech-to-Text Models for Long-Form Medical Dialogue
Jul 29, 2025

Physicians lose countless hours each week to manual note-taking, so we’re developing an on-device AI-Scribe that can transcribe and summarise entire consultations without sending any data to the cloud. To choose the right speech-to-text engine, we benchmarked 15 open- and closed-source models on PriMock57—a set of 57 simulated GP consultations (5–10 minutes each) recorded by Babylon Health clinicians. We ran the audio files through each model, logged word-error rate (WER), speed, and consistency, and applied chunking only where models crashed on audio longer than 40 seconds. The results surprised us—and they’ll help anyone looking to bring reliable, privacy-first transcription into clinical workflows.
Dataset — PriMock57
• 57 simulated GP consultations recorded by seven Babylon Health doctors with role-play patients.
• Reference transcripts were cleaned to plain text for fair Word-Error-Rate (WER) calculation (see the normalisation sketch after this list).
• Two recordings (day1_consultation07, day3_consultation03) triggered catastrophic hallucinations on multiple models, so 55 files remain.
• Public repo: https://github.com/babylonhealth/primock57
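To make WER comparisons fair, the reference transcript and each model's output need the same light-touch cleaning. Here is a minimal sketch of the kind of normalisation involved (lower-casing, punctuation stripping, whitespace collapsing); the exact rules in our pipeline may differ:

```python
import re
import string

def normalise(text: str) -> str:
    """Lower-case, strip punctuation and collapse whitespace so that
    formatting differences don't inflate WER."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(normalise("Good morning, Dr. Smith. How are you feeling today?"))
# -> "good morning dr smith how are you feeling today"
```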
Evaluation framework
• Transcription — a per-model runner saves a .txt and logs processing time.
• Metrics — scripts compute per-file WER, the best/worst file and the standard deviation.
• Comparison — per-model results merge into a CSV and rankings for plotting (the metrics and comparison steps are sketched below).
• Chunking — only applied to Canary-Qwen 2.5 B, Canary-1B-Flash and Phi-4, which break on audio longer than 40 s (30 s chunks with 10 s overlap; sketched below).
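The metrics and comparison steps boil down to scoring every hypothesis against its reference and aggregating per model. A minimal sketch of that idea, assuming the `jiwer` and `pandas` packages, the `normalise` helper from the dataset section, and illustrative `transcripts/<model>/` and `references/` directory names (not the exact benchmark scripts):

```python
import statistics
from pathlib import Path

import jiwer       # pip install jiwer
import pandas as pd

def score_model(model_dir: Path, ref_dir: Path) -> dict:
    """Per-file WER for one model's transcripts, plus aggregate stats."""
    wers = {}
    for hyp_path in sorted(model_dir.glob("*.txt")):
        ref = (ref_dir / hyp_path.name).read_text()
        wers[hyp_path.stem] = jiwer.wer(normalise(ref), normalise(hyp_path.read_text()))
    return {
        "model": model_dir.name,
        "avg_wer": statistics.mean(wers.values()),
        "std_wer": statistics.stdev(wers.values()),
        "best_file": min(wers, key=wers.get),
        "worst_file": max(wers, key=wers.get),
    }

# Merge every model's stats into one ranked CSV for plotting.
rows = [score_model(d, Path("references")) for d in Path("transcripts").iterdir() if d.is_dir()]
pd.DataFrame(rows).sort_values("avg_wer").to_csv("comparison.csv", index=False)
```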
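For the three chunked models, each recording is sliced into 30 s windows that overlap by 10 s, every window is transcribed separately, and the text is stitched back together. A rough sketch of the slicing, assuming the `soundfile` package; merging the overlapping text is deliberately elided, as that is the fiddly part in practice:

```python
import soundfile as sf  # pip install soundfile

CHUNK_S, OVERLAP_S = 30, 10

def chunk_audio(path: str):
    """Yield (start_time_s, samples) windows of 30 s that advance by 20 s."""
    audio, sr = sf.read(path)
    step = (CHUNK_S - OVERLAP_S) * sr
    for start in range(0, len(audio), step):
        yield start / sr, audio[start:start + CHUNK_S * sr]

# e.g. texts = [model.transcribe(chunk) for _, chunk in chunk_audio("consultation.wav")]
# (`model.transcribe` is a placeholder; words in the 10 s overlaps still need de-duplication)
```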
Hardware & run types
Local Mac – Apple M4 Max 64 GB using MLX & WhisperKit (example runner sketched below)
Local GPU – AWS g5.2xl (NVIDIA L4 24 GB) running vLLM and NVIDIA NeMo
Cloud APIs – Groq, ElevenLabs, OpenAI, Mistral
Azure – Foundry Phi-4 multimodal endpoint
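As a concrete example of a per-model runner for the local-Mac runs, here is a minimal sketch using the `mlx-whisper` package with the `mlx-community/whisper-large-v3-turbo` checkpoint (the exact checkpoint path is an assumption): it transcribes one file, saves the .txt and returns the processing time, mirroring the framework described above.

```python
import time
from pathlib import Path

import mlx_whisper  # pip install mlx-whisper

def run_file(audio_path: Path, out_dir: Path) -> float:
    """Transcribe one consultation, save the transcript, return elapsed seconds."""
    start = time.perf_counter()
    result = mlx_whisper.transcribe(
        str(audio_path),
        path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
    )
    elapsed = time.perf_counter() - start
    (out_dir / f"{audio_path.stem}.txt").write_text(result["text"])
    return elapsed
```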
Results (55 files)
Rank | Model | Avg WER | Best–Worst WER (%) | Avg sec/file | Host / Run-type | Mode |
---|---|---|---|---|---|---|
1 | ElevenLabs Scribe v1 | 15.0 % | 8.7–68.5 | 36.3 s | API (ElevenLabs) | Long |
2 | MLX Whisper-L v3-turbo | 17.6 % | 9.4–33.4 | 12.9 s | Local (M4) | Long |
3 | Parakeet-TDT-0.6 B v2 | 17.9 % | 12.4–25.8 | 5.4 s | Local (M4) | Long |
4 | NVIDIA Canary-Qwen 2.5 B | 18.2 % | 10.5–65.5 | 105.4 s | Local (L4 + NVIDIA NeMo) | Chunk |
5 | Apple SpeechAnalyzer | 18.2 % | 11.9–25.8 | 6.0 s | Local (macOS) | Long |
6 | Groq Whisper-L v3 | 18.4 % | 11.8–26.5 | 8.6 s | API (Groq) | Long |
7 | Voxtral-mini 3 B | 18.5 % | 12.0–25.8 | 74.4 s | Local (L4 + vLLM) | Long |
8 | Groq Whisper-L v3-turbo | 18.7 % | 11.8–26.9 | 8.0 s | API (Groq) | Long |
9 | Canary-1B-Flash | 18.8 % | 11.6–25.7 | 23.4 s | Local (L4 + NVIDIA NeMo) | Chunk |
10 | Voxtral-mini (API) | 19.0 % | 11.6–50.3 | 22.9 s | API (Mistral) | Long |
11 | WhisperKit-L v3-turbo | 19.1 % | 11.8–27.5 | 21.4 s | Local (macOS) | Long |
12 | OpenAI Whisper-1 | 19.6 % | 10.5–98.6 | 104.3 s | API (OpenAI) | Long |
13 | OpenAI GPT-4o-mini | 20.6 % | 11.6–45.7 | — | API (OpenAI) | Long |
14 | OpenAI GPT-4o | 21.7 % | 12.0–67.2 | 27.9 s | API (OpenAI) | Long |
15 | Azure Foundry Phi-4 | 36.6 % | 13.0–110.0 | 212.8 s | API (Azure Foundry) | Chunk |
Key findings
• ElevenLabs Scribe leads accuracy but can hallucinate on edge cases.
• Parakeet-0.6 B on an M4 averages ~5 s per consultation (roughly 50–100× real-time on these 5–10-minute recordings), which makes it great for local use.
• Groq Whisper-v3 (turbo) offers the best price/latency balance in the cloud.
• Chunking rescues Canary, Canary-Qwen & Phi-4 but doubles runtime.
• Apple SpeechAnalyzer is a great fit for Swift apps.
• Voxtral-mini beats Canary-Qwen 2.5 B for multilingual long-form.
Limitations & next steps
• UK English only → we'll add multilingual datasets in future runs.
• WER ≠ clinical usefulness → we need a more thorough evaluation of medical correctness.
• API cost unmeasured → v2 will include $/hour and CO₂ metrics.
Get in touch
Want to try the on-device AI-Scribe or plug in your own model?
Email [email protected]