Evaluations
Benchmarking Speech-to-Text Models for Long-Form Medical Dialogue
Jul 29, 2025
(update — 03 Aug 2025)
What changed in the 03 Aug 2025 refresh
Extended Whisper-style normalisation: We now strip fillers (“um”), expand contractions (“you’re → you are”), and standardise punctuation/numbers before calculating WER. Most models gained 0.5–1 pp in accuracy.
Extra models & fresh weights: Added Gemini 2.5 Pro / Flash, Voxtral Small, and the Kyutai STT models (1 B & 2.6 B).
---
Physicians lose countless hours each week to manual note-taking, so we’re developing an on-device AI-Scribe that can transcribe and summarise entire consultations without sending any data to the cloud. To choose the right speech-to-text engine, we benchmarked 20 open- and closed-source model configurations on PriMock57, a set of 57 simulated GP consultations (5–10 minutes each) recorded by Babylon Health clinicians. We ran the audio files through each model, logged word-error rate (WER), speed, and consistency, and applied chunking only where models crashed on audio longer than 40 seconds. The results surprised us, and they should help anyone looking to bring reliable, privacy-first transcription into clinical workflows.
Dataset — PriMock57
• 57 simulated GP consultations recorded by seven Babylon Health doctors with role-play patients.
• Reference transcripts were cleaned to plain text for fair Word-Error-Rate (WER) calculation.
• Two recordings (day1_consultation07, day3_consultation03) triggered catastrophic hallucinations on multiple models, so 55 files remain.
• Public repo: https://github.com/babylonhealth/primock57
Evaluation framework
Transcription — a per-model runner saves a .txt and logs processing time.
Normalisation — extended Whisper-style preprocessing strips fillers ("um"), expands contractions ("you're → you are"), and standardises punctuation/numbers before WER calculation.
Metrics — scripts compute per-file WER, best/worst file and standard deviation (see the normalisation and scoring sketch after this list).
Comparison — results merge into CSV and rankings for plotting.
Chunking — only applied to Canary-Qwen 2.5 B, Canary-1B-Flash and Phi-4, which break on audio longer than 40 s (30 s chunks with 10 s overlap; sketch below).
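For reference, WER is the usual (substitutions + deletions + insertions) ÷ reference-word-count ratio, computed after normalisation. Here is a minimal sketch of the normalisation and scoring steps, assuming the jiwer package; the filler and contraction tables are illustrative subsets of what we actually use, and number standardisation is left out for brevity:

```python
import re
import statistics

from jiwer import wer  # pip install jiwer

# Illustrative subsets; the real filler and contraction tables are longer.
FILLERS = re.compile(r"\b(?:um+|uh+|erm+|mm+)\b", re.IGNORECASE)
CONTRACTIONS = {"you're": "you are", "i'm": "i am", "don't": "do not", "it's": "it is"}

def normalise(text: str) -> str:
    """Extended Whisper-style normalisation (number standardisation omitted here)."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = FILLERS.sub(" ", text)             # strip fillers
    text = text.replace("'", "")              # drop leftover apostrophes
    text = re.sub(r"[^\w\s]", " ", text)      # standardise punctuation away
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

def score(refs: list[str], hyps: list[str]) -> dict:
    """Per-file WER plus the summary stats shown in the results table."""
    per_file = [wer(normalise(r), normalise(h)) for r, h in zip(refs, hyps)]
    return {
        "avg_wer": statistics.mean(per_file),
        "best": min(per_file),
        "worst": max(per_file),
        "stdev": statistics.stdev(per_file),
    }
```

The chunking fallback is equally simple. A sketch assuming mono audio readable by soundfile; stitching the overlapping transcripts back together is model-specific and omitted:

```python
import soundfile as sf  # pip install soundfile

CHUNK_S, OVERLAP_S = 30, 10  # 30 s windows with 10 s overlap

def chunks(path: str):
    """Yield overlapping audio windows for models that break beyond ~40 s."""
    audio, sr = sf.read(path)
    hop = (CHUNK_S - OVERLAP_S) * sr  # window starts every 20 s
    for start in range(0, len(audio), hop):
        yield audio[start : start + CHUNK_S * sr]
        if start + CHUNK_S * sr >= len(audio):
            break  # the last window already covers the tail
```

The comparison step then merges the per-model summaries into a single ranking; paths and column names here are hypothetical:

```python
import glob

import pandas as pd

# Merge per-model summary CSVs and rank by average WER (lower is better).
frames = [pd.read_csv(p) for p in sorted(glob.glob("results/*_summary.csv"))]
pd.concat(frames).sort_values("avg_wer").to_csv("ranking.csv", index=False)
```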
Hardware & run types
Local Mac – Apple M4 Max 64 GB using MLX & WhisperKit (example call after this list)
Local GPU – AWS g5.2xl (NVIDIA L4 24 GB) running vLLM and NVIDIA NeMo
Cloud APIs – Groq, ElevenLabs, OpenAI, Mistral, Gemini
Azure – Foundry Phi-4 multimodal endpoint
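As a concrete example of the local-Mac path, a run through the mlx-whisper package looks roughly like this; treat it as a sketch rather than our exact runner, and note that the file paths and model repo name are illustrative:

```python
import time

import mlx_whisper  # pip install mlx-whisper (Apple silicon)

start = time.perf_counter()
result = mlx_whisper.transcribe(
    "audio/day1_consultation01.wav",
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
)
elapsed = time.perf_counter() - start

# Each per-model runner saves the raw transcript and logs the processing time.
with open("day1_consultation01.txt", "w") as f:
    f.write(result["text"])
print(f"done in {elapsed:.1f} s")
```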
Results (55 files)
# | Model | Avg WER | Best–worst WER (%) | Avg sec/file | Host | Mode |
---|---|---|---|---|---|---|
1 | google-gemini-2.5-pro | 10.8 % | 6.1–17.0 | 56 s | API (Google) | Long |
2 | google-gemini-2.5-flash | 12.1 % | 6.6–37.5 | 20 s | API (Google) | Long |
3 | parakeet-0.6 B v2 | 13.3 % | 8.5–20.2 | 5 s | Local (M4) | Long |
4 | elevenlabs-scribe v1 | 13.5 % | 7.0–67.9 | 36 s | API (ElevenLabs) | Long |
5 | kyutai STT 2.6 B (en) | 13.8 % | 7.8–20.7 | 148 s | Local (L4 GPU) | Long |
6 | mlx Whisper-L v3-turbo | 14.2 % | 7.5–32.1 | 13 s | Local (M4) | Long |
7 | groq Whisper-L v3 | 14.3 % | 8.8–21.2 | 9 s | API (Groq) | Long |
8 | Voxtral-mini 3 B | 14.3 % | 7.8–47.9 | 74 s | Local (L4 GPU) | Long |
9 | Voxtral-mini (API) | 14.4 % | 7.8–47.5 | 23 s | API (Mistral) | Long |
10 | Canary-1B Flash | 14.5 % | 8.5–22.0 | 23 s | Local (L4 GPU) | Chunk |
11 | groq Whisper-L v3-turbo | 14.5 % | 8.5–21.7 | 8 s | API (Groq) | Long |
12 | whisperkit-L v3-turbo | 14.5 % | 7.7–22.1 | 21 s | Local (macOS) | Long |
13 | Apple SpeechAnalyzer | 14.8 % | 8.7–21.4 | 6 s | Local (macOS) | Long |
14 | Voxtral-small (chat) | 15.4 % | 5.9–97.4 | 32 s | API (Mistral) | Long |
15 | NVIDIA Canary-Qwen 2.5 B | 15.4 % | 8.3–64.5 | 105 s | Local (L4 GPU) | Chunk |
16 | OpenAI Whisper-1 | 15.5 % | 7.2–104.9 | 104 s | API (OpenAI) | Long |
17 | OpenAI GPT-4o-mini (transcribe) | 15.9 % | 8.1–43.0 | — | API (OpenAI) | Long |
18 | OpenAI GPT-4o (transcribe) | 17.1 % | 8.2–66.5 | 28 s | API (OpenAI) | Long |
19 | Kyutai STT 1 B (en/fr) | 29.4 % | 9.5–223.1 | 80 s | Local (L4 GPU) | Long |
20 | Azure Foundry Phi-4 | 33.1 % | 9.0–107.1 | 213 s | API (Azure) | Chunk |
Key findings
Google Gemini 2.5 Pro leads accuracy at 10.8% WER, with Gemini Flash close behind at 12.1%.
Parakeet-0.6 B on an M4 transcribes a 5–10 minute consultation in about 5 s (roughly 60–120× faster than real-time), making it great for local use.
Groq Whisper-v3 (turbo) offers the best price/latency balance in the cloud.
Chunking rescues Canary, Canary-Qwen & Phi-4 but doubles runtime.
Apple SpeechAnalyzer is a great fit for Swift apps.
ElevenLabs Scribe ranks 4th in accuracy but shows high variability (7.0–67.9% range).
Voxtral-mini beats Canary-Qwen 2.5 B for multilingual long-form.
Limitations & next steps
• UK English only → we’ll move to multilingual datasets in future rounds.
• WER ≠ clinical usefulness → a more thorough evaluation of medical correctness is needed.
• API cost unmeasured → v2 will include $/hour and CO₂ metrics.
Get in touch
Want to try the on-device AI-Scribe or plug in your own model?
Email [email protected]