Evaluations
Benchmarking Speech-to-Text Models for Long-Form Medical Dialogue
Jul 29, 2025
(update — 03 Aug 2025)
What changed in the 03 Aug 2025 refresh
Extended Whisper-style normalisation: We now strip fillers (“um”), expand contractions (“you’re → you are”), and standardise punctuation/numbers before calculating WER. Most models gained 0.5–1 pp in accuracy.
Extra models & fresh weights: Added Gemini 2.5 Pro / Flash, Voxtral Small, and the Kyutai STT models (1 B & 2.6 B).
---
Physicians lose countless hours each week to manual note-taking, so we’re developing an on-device AI-Scribe that can transcribe and summarise entire consultations without sending any data to the cloud. To choose the right speech-to-text engine, we benchmarked 20 open- and closed-source model configurations on PriMock57, a set of 57 simulated GP consultations (5–10 minutes each) recorded by Babylon Health clinicians. We ran the audio files through each model, logged word-error rate (WER), speed, and consistency, and applied chunking only where models crashed on audio longer than 40 seconds. The results surprised us, and they should help anyone looking to bring reliable, privacy-first transcription into clinical workflows.
Dataset — PriMock57
• 57 simulated GP consultations recorded by seven Babylon Health doctors with role-play patients.
• Reference transcripts were cleaned to plain text for fair Word-Error-Rate (WER) calculation.
• Two recordings (day1_consultation07, day3_consultation03) triggered catastrophic hallucinations on multiple models, so 55 files remain.
• Public repo: https://github.com/babylonhealth/primock57
Evaluation framework
Transcription — a per-model runner saves a .txt and logs processing time.
Normalisation — extended Whisper-style preprocessing strips fillers ("um"), expands contractions ("you're → you are"), and standardises punctuation/numbers before WER calculation.
Metrics — scripts compute per-file WER, best/worst file and standard deviation (see the normalisation and scoring sketch after this list).
Comparison — results merge into CSV and rankings for plotting.
Chunking — only applied to Canary-Qwen 2.5 B, Canary-1B-Flash and Phi-4, which break on audio longer than 40 s (30 s chunks with 10 s overlap; sketch below).
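For reference, WER is the usual (substitutions + deletions + insertions) ÷ reference-word-count ratio, computed after normalisation. Here is a minimal sketch of the normalisation and scoring steps, assuming the jiwer package; the filler and contraction tables are illustrative subsets of what we actually use, and number standardisation is left out for brevity:

```python
import re
import statistics

from jiwer import wer  # pip install jiwer

# Illustrative subsets; the real filler and contraction tables are longer.
FILLERS = re.compile(r"\b(?:um+|uh+|erm+|mm+)\b", re.IGNORECASE)
CONTRACTIONS = {"you're": "you are", "i'm": "i am", "don't": "do not", "it's": "it is"}

def normalise(text: str) -> str:
    """Extended Whisper-style normalisation (number standardisation omitted here)."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = FILLERS.sub(" ", text)             # strip fillers
    text = text.replace("'", "")              # drop leftover apostrophes
    text = re.sub(r"[^\w\s]", " ", text)      # standardise punctuation away
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

def score(refs: list[str], hyps: list[str]) -> dict:
    """Per-file WER plus the summary stats shown in the results table."""
    per_file = [wer(normalise(r), normalise(h)) for r, h in zip(refs, hyps)]
    return {
        "avg_wer": statistics.mean(per_file),
        "best": min(per_file),
        "worst": max(per_file),
        "stdev": statistics.stdev(per_file),
    }
```

The chunking fallback is equally simple. A sketch assuming mono audio readable by soundfile; stitching the overlapping transcripts back together is model-specific and omitted:

```python
import soundfile as sf  # pip install soundfile

CHUNK_S, OVERLAP_S = 30, 10  # 30 s windows with 10 s overlap

def chunks(path: str):
    """Yield overlapping audio windows for models that break beyond ~40 s."""
    audio, sr = sf.read(path)
    hop = (CHUNK_S - OVERLAP_S) * sr  # window starts every 20 s
    for start in range(0, len(audio), hop):
        yield audio[start : start + CHUNK_S * sr]
        if start + CHUNK_S * sr >= len(audio):
            break  # the last window already covers the tail
```

The comparison step then merges the per-model summaries into a single ranking; paths and column names here are hypothetical:

```python
import glob

import pandas as pd

# Merge per-model summary CSVs and rank by average WER (lower is better).
frames = [pd.read_csv(p) for p in sorted(glob.glob("results/*_summary.csv"))]
pd.concat(frames).sort_values("avg_wer").to_csv("ranking.csv", index=False)
```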
Hardware & run types
Local Mac – Apple M4 Max 64 GB using MLX & WhisperKit (example call after this list)
Local GPU – AWS g5.2xl (NVIDIA L4 24 GB) running vLLM and NVIDIA NeMo
Cloud APIs – Groq, ElevenLabs, OpenAI, Mistral, Gemini
Azure – Foundry Phi-4 multimodal endpoint
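As a concrete example of the local-Mac path, a run through the mlx-whisper package looks roughly like this; treat it as a sketch rather than our exact runner, and note that the file paths and model repo name are illustrative:

```python
import time

import mlx_whisper  # pip install mlx-whisper (Apple silicon)

start = time.perf_counter()
result = mlx_whisper.transcribe(
    "audio/day1_consultation01.wav",
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
)
elapsed = time.perf_counter() - start

# Each per-model runner saves the raw transcript and logs the processing time.
with open("day1_consultation01.txt", "w") as f:
    f.write(result["text"])
print(f"done in {elapsed:.1f} s")
```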
Results (55 files)
# | Model | Avg WER | Best–worst WER (%) | Avg sec/file | Host | Mode |
---|---|---|---|---|---|---|
1 | google-gemini-2.5-pro | 10.8 % | 6.1–17.0 | 56 s | API (Google) | Long |
2 | google-gemini-2.5-flash | 12.1 % | 6.6–37.5 | 20 s | API (Google) | Long |
3 | parakeet-0.6 B v2 | 13.3 % | 8.5–20.2 | 5 s | Local (M4) | Long |
4 | elevenlabs-scribe v1 | 13.5 % | 7.0–67.9 | 36 s | API (ElevenLabs) | Long |
5 | kyutai STT 2.6 B (en) | 13.8 % | 7.8–20.7 | 148 s | Local (L4 GPU) | Long |
6 | mlx Whisper-L v3-turbo | 14.2 % | 7.5–32.1 | 13 s | Local (M4) | Long |
7 | groq Whisper-L v3 | 14.3 % | 8.8–21.2 | 9 s | API (Groq) | Long |
8 | Voxtral-mini 3 B | 14.3 % | 7.8–47.9 | 74 s | Local (L4 GPU) | Long |
9 | Voxtral-mini (API) | 14.4 % | 7.8–47.5 | 23 s | API (Mistral) | Long |
10 | Canary-1B Flash | 14.5 % | 8.5–22.0 | 23 s | Local (L4 GPU) | Chunk |
11 | groq Whisper-L v3-turbo | 14.5 % | 8.5–21.7 | 8 s | API (Groq) | Long |
12 | whisperkit-L v3-turbo | 14.5 % | 7.7–22.1 | 21 s | Local (macOS) | Long |
13 | Apple SpeechAnalyzer | 14.8 % | 8.7–21.4 | 6 s | Local (macOS) | Long |
14 | Voxtral-small (chat) | 15.4 % | 5.9–97.4 | 32 s | API (Mistral) | Long |
15 | NVIDIA Canary-Qwen 2.5 B | 15.4 % | 8.3–64.5 | 105 s | Local (L4 GPU) | Chunk |
16 | OpenAI Whisper-1 | 15.5 % | 7.2–104.9 | 104 s | API (OpenAI) | Long |
17 | OpenAI GPT-4o-mini (transcribe) | 15.9 % | 8.1–43.0 | — | API (OpenAI) | Long |
18 | OpenAI GPT-4o (transcribe) | 17.1 % | 8.2–66.5 | 28 s | API (OpenAI) | Long |
19 | Kyutai STT 1 B (en/fr) | 29.4 % | 9.5–223.1 | 80 s | Local (L4 GPU) | Long |
20 | Azure Foundry Phi-4 | 33.1 % | 9.0–107.1 | 213 s | API (Azure) | Chunk |
Key findings
Google Gemini 2.5 Pro leads accuracy at 10.8% WER, with Gemini Flash close behind at 12.1%.
Parakeet-0.6 B on an M4 transcribes a 5–10 minute consultation in about 5 s (roughly 60–120× faster than real-time), making it great for local use.
Groq Whisper-v3 (turbo) offers the best price/latency balance in the cloud.
Chunking rescues Canary, Canary-Qwen & Phi-4 but doubles runtime.
Apple SpeechAnalyzer is a great fit for Swift apps.
ElevenLabs Scribe ranks 4th in accuracy but shows high variability (7.0–67.9% range).
Voxtral-mini beats Canary-Qwen 2.5 B for multilingual long-form.
Limitations & next steps
• UK English only → we’ll move to multilingual datasets in future rounds.
• WER ≠ clinical usefulness → a more thorough evaluation of medical correctness is needed.
• API cost unmeasured → v2 will include $/hour and CO₂ metrics.
Get in touch
Want to try the on-device AI-Scribe or plug in your own model?
Email [email protected]