Evaluations
Benchmarking Speech-to-Text Models for Long-Form Medical Dialogue
Jul 29, 2025
(update — 24 Dec 2025)
What changed in the 24 Dec 2025 refresh
Open-sourced the benchmark: Full evaluation code now available on GitHub—run your own models, reproduce our results, or contribute improvements.
→ github.com/Omi-Health/medical-STT-eval
7 new models tested: Added Gemini 3 Pro/Flash Preview, Parakeet v3, updated GPT-4o Mini, NVIDIA Canary 1B v2, IBM Granite Speech, and Google MedASR.
Parakeet v3 jumps to #3: NVIDIA's latest Parakeet release now beats Gemini 2.5 Flash with 11.9% WER at just 6 seconds per file—best local model for on-device use.
Google MedASR tested: Despite being Google's medical-specific model, it scored worst at 64.9% WER. Key insight: MedASR is optimized for single-speaker dictation, not doctor-patient conversations.
Hallucination patterns documented: We identified repetition loops in autoregressive models (Canary 1B v2, Granite Speech, Kyutai) and developed chunking strategies to mitigate them.
—————————————
(update — 03 Aug 2025)
What changed in the 03 Aug 2025 refresh
Extended Whisper-style normalisation: We now strip fillers ("um"), expand contractions ("you're → you are"), and standardise punctuation/numbers before calculating WER. Most models gained 0.5-1 pp accuracy.
Extra models & fresh weights: Added Gemini 2.5 Pro / Flash, Voxtral Small, and Kyutai STT.
—————————————
Physicians lose countless hours each week to manual note-taking, so we're developing an on-device AI-Scribe that can transcribe and summarise entire consultations without sending any data to the cloud. To choose the right speech-to-text engine, we benchmarked 26 open- and closed-source models on PriMock57—a set of 57 simulated GP consultations (5–10 minutes each) recorded by Babylon Health clinicians. We ran the audio files through each model, logged word-error rate (WER), speed, and consistency, and applied chunking only where models crashed on audio longer than 40 seconds. The results surprised us—and they'll help anyone looking to bring reliable, privacy-first transcription into clinical workflows.
Dataset — PriMock57
• 57 simulated GP consultations recorded by seven Babylon Health doctors with role-play patients.
• Reference transcripts were cleaned to plain text for fair Word-Error-Rate (WER) calculation.
• Two recordings (day1_consultation07, day3_consultation03) triggered catastrophic hallucinations in multiple models, so 55 files remain (see the file-selection sketch below).
• Public repo: https://github.com/babylonhealth/primock57
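To keep runs reproducible, the two problematic recordings are simply filtered out before scoring. Here is a minimal Python sketch of that step, assuming the audio sits in a local `primock57/audio` folder with `dayX_consultationYY`-style `.wav` names; the folder layout and the helper name `eval_files` are illustrative, not part of the published repo.

```python
from pathlib import Path

# The two recordings that triggered catastrophic hallucination loops in
# several models; they are excluded from every score in the results table.
EXCLUDED = {"day1_consultation07", "day3_consultation03"}

def eval_files(audio_dir: str = "primock57/audio") -> list[Path]:
    """Return the recordings used for scoring (55 of 57).

    The directory and .wav naming are assumptions about the local layout;
    adjust the glob to match how you export the PriMock57 audio.
    """
    files = sorted(Path(audio_dir).glob("*.wav"))
    return [f for f in files
            if not any(f.stem.startswith(ex) for ex in EXCLUDED)]
```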
Evaluation framework
Transcription — a per-model runner saves a .txt and logs processing time.
Normalisation — extended Whisper-style preprocessing strips fillers ("um"), expands contractions ("you're → you are"), and standardises punctuation/numbers before WER calculation.
Metrics — scripts compute per-file WER, the best/worst file, and the standard deviation (a normalisation-and-WER sketch follows this list).
Comparison — results are merged into a CSV and per-model rankings for plotting.
Chunking — only applied to models that break on audio longer than 40 s (30 s chunks with 10 s overlap; see the chunking sketch below).
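The sketch below illustrates the normalisation-plus-WER step under simplifying assumptions: the filler and contraction lists are small illustrative subsets, number standardisation is omitted, and the jiwer package stands in for whatever WER implementation the published repo uses (openai-whisper's EnglishTextNormalizer is one option for the full Whisper-style treatment).

```python
import re
import statistics

import jiwer  # pip install jiwer

FILLERS = re.compile(r"\b(?:um+|uh+|erm+|mhm+)\b")
CONTRACTIONS = {            # illustrative subset, not the full list
    "you're": "you are", "i'm": "i am", "it's": "it is",
    "don't": "do not", "can't": "cannot", "we'll": "we will",
}

def normalise(text: str) -> str:
    """Extended Whisper-style normalisation: lowercase, drop fillers,
    expand contractions, strip punctuation, collapse whitespace."""
    text = text.lower()
    text = FILLERS.sub(" ", text)
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation/symbols -> space
    return re.sub(r"\s+", " ", text).strip()

def file_wer(reference: str, hypothesis: str) -> float:
    """WER on the normalised texts, as reported in the results table."""
    return jiwer.wer(normalise(reference), normalise(hypothesis))

def summarise(wers: list[float]) -> dict:
    """Per-model summary: average, best/worst file, standard deviation."""
    return {"avg": statistics.mean(wers), "best": min(wers),
            "worst": max(wers), "stdev": statistics.stdev(wers)}
```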
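And a minimal sketch of the 30 s / 10 s chunker, assuming mono audio readable with soundfile; stitching the per-chunk transcripts back together (de-duplicating the overlap) is left out here.

```python
import soundfile as sf  # pip install soundfile

CHUNK_S, OVERLAP_S = 30, 10     # 30 s windows with 10 s of shared audio

def chunk_audio(path: str):
    """Yield (start_time_s, samples) windows for models that cannot
    handle long-form audio; consecutive windows overlap by OVERLAP_S."""
    audio, sr = sf.read(path)
    size = CHUNK_S * sr
    step = (CHUNK_S - OVERLAP_S) * sr
    for start in range(0, len(audio), step):
        yield start / sr, audio[start:start + size]
        if start + size >= len(audio):
            break
```

Each window is transcribed independently, so a hallucination loop stays confined to one 30 s chunk instead of derailing the rest of the file.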
Hardware & run types
Local Mac – Apple M4 Max 64 GB using MLX & WhisperKit
Local GPU – AWS g5.2xl (NVIDIA L4 24 GB) running vLLM and NVIDIA NeMo
Cloud APIs – Groq, ElevenLabs, OpenAI, Mistral, Gemini
Azure – Foundry Phi-4 multimodal endpoint
Results (55 files)
| # | Model | Avg WER | Best–Worst WER (%) | Avg sec/file | Host | Run type |
|---|---|---|---|---|---|---|
| 1 | Google Gemini 2.5 Pro | 10.8% | 6.1–17.0 | 56s | API (Google) | Long |
| 2 | Google Gemini 3 Pro Preview* | 11.0% | 6.1–19.2 | 65s | API (Google) | Long |
| 3 | Parakeet TDT 0.6B v3 | 11.9% | 6.7–18.4 | 6s | Local (M4) | Long |
| 4 | Google Gemini 2.5 Flash | 12.1% | 6.6–37.5 | 20s | API (Google) | Long |
| 5 | OpenAI GPT-4o Mini (2025-12-15) | 12.8% | 7.2–24.5 | 41s | API (OpenAI) | Long |
| 6 | Parakeet TDT 0.6B v2 | 13.3% | 8.5–20.2 | 5s | Local (M4) | Long |
| 7 | ElevenLabs Scribe v1 | 13.5% | 7.0–67.9 | 36s | API (ElevenLabs) | Long |
| 8 | Kyutai STT 2.6B (en) | 13.8% | 7.8–20.7 | 148s | Local (L4 GPU) | Long |
| 9 | Google Gemini 3 Flash Preview | 13.9% | 7.7–25.3 | 52s | API (Google) | Long |
| 10 | MLX Whisper-L v3-turbo | 14.2% | 7.5–32.1 | 13s | Local (M4) | Long |
| 11 | Groq Whisper-L v3 | 14.3% | 8.8–21.2 | 9s | API (Groq) | Long |
| 12 | Voxtral Mini (API) | 14.4% | 7.8–47.5 | 23s | API (Mistral) | Long |
| 13 | Voxtral Mini (Transcription) | 14.4% | 7.8–47.9 | 23s | API (Mistral) | Long |
| 14 | NVIDIA Canary 1B Flash | 14.5% | 8.5–22.0 | 23s | Local (L4 GPU) | Chunk |
| 15 | Groq Whisper-L v3-turbo | 14.5% | 8.5–21.7 | 8s | API (Groq) | Long |
| 16 | WhisperKit-L v3-turbo | 14.5% | 7.7–22.1 | 21s | Local (macOS) | Long |
| 17 | Apple SpeechAnalyzer | 14.8% | 8.7–21.4 | 6s | Local (macOS) | Long |
| 18 | NVIDIA Canary-Qwen 2.5B | 15.4% | 8.3–64.5 | 105s | Local (L4 GPU) | Chunk |
| 19 | OpenAI Whisper-1 | 15.5% | 7.2–104.9 | 104s | API (OpenAI) | Long |
| 20 | OpenAI GPT-4o Mini Transcribe | 16.0% | 8.1–43.0 | — | API (OpenAI) | Long |
| 21 | NVIDIA Canary 1B v2** | 16.8% | 9.6–45.4 | 9s | Local (L4 GPU) | Long |
| 22 | OpenAI GPT-4o Transcribe | 17.1% | 8.2–66.5 | 28s | API (OpenAI) | Long |
| 23 | IBM Granite Speech 3.3-2B*** | 18.9% | 7.6–35.4 | 110s | Local (L4 GPU) | Chunk |
| 24 | Kyutai STT 1B (en/fr) | 29.4% | 9.5–223.1 | 80s | Local (L4 GPU) | Long |
| 25 | Azure Foundry Phi-4 | 33.1% | 9.0–107.1 | 213s | API (Azure) | Chunk |
| 26 | Google MedASR | 64.9% | 31.1–98.9 | 1s | Local (M4) | Long |
*54/55 files (1 blocked by safety filter) **3 files with hallucination loops ***Requires chunking to avoid repetition loops
Key findings
Google Gemini 2.5 Pro leads accuracy at 10.8% WER, with the new Gemini 3 Pro Preview close behind at 11.0%.
Parakeet v3 is the new local champion—11.9% WER at 6 seconds per file makes it ideal for on-device medical scribes.
OpenAI GPT-4o Mini (Dec 2025) improved from 15.9% to 12.8% WER, now ranking #5 overall.
Google MedASR scored worst (64.9% WER) despite being medical-specific—it's designed for dictation, not conversations.
Autoregressive models hallucinate: Canary 1B v2, Granite Speech, and Kyutai all exhibited repetition loops on certain files. Chunking with overlap mitigates this; a short loop-detection sketch follows this list.
Groq Whisper-v3 (turbo) offers the best price/latency balance in the cloud.
Apple SpeechAnalyzer remains a solid choice for Swift apps at 14.8% WER.
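A repetition loop is easy to flag mechanically: the transcript degenerates into the same short phrase repeated dozens of times. The heuristic below is an illustrative check (the n-gram size and repeat threshold are arbitrary defaults), not the exact filter used in the benchmark.

```python
def has_repetition_loop(text: str, n: int = 5, max_repeats: int = 10) -> bool:
    """Flag transcripts where an n-word phrase repeats back-to-back more
    than max_repeats times, the classic signature of a decoding loop."""
    words = text.lower().split()
    for i in range(max(len(words) - n, 0)):
        gram = words[i:i + n]
        repeats, j = 1, i + n
        while words[j:j + n] == gram:
            repeats += 1
            j += n
        if repeats > max_repeats:
            return True
    return False
```

A check like this is one way to spot outputs like the two excluded recordings automatically and route them to the chunked pipeline.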
Limitations & next steps
• UK-English only → we'll use multi-language datasets in the future.
• WER ≠ clinical usefulness → needs more thorough evaluation on medical correctness.
• API cost unmeasured → v2 will include $/hour and CO₂ metrics.
• Evaluation code now open-source → github.com/Omi-Health/medical-STT-eval
Get in touch
Want to try the on-device AI-Scribe or plug in your own model?
Email [email protected]


