Evaluations
Benchmarking Speech-to-Text Models for Long-Form Medical Dialogue
Updated Mar 27, 2026
What changed in the 27 Mar 2026 refresh
5 new models tested (26 → 31): Microsoft VibeVoice-ASR 9B (new open-source leader at 8.34% WER, but needs ~18GB VRAM and is slow even on H100), ElevenLabs Scribe v2 (9.72% vs 10.87% for v1), NVIDIA Nemotron Speech Streaming 0.6B (11.06% on T4), Voxtral Mini 2602 via Transcription API (11.64%), and Voxtral Mini 4B via vLLM realtime (11.89% on H100). Also evaluated LiquidAI LFM2.5-Audio-1.5B and Meta SeamlessM4T v2 Large—neither suited for long-form transcription.
Replaced Whisper’s text normalizer with a custom one: Found two bugs in Whisper’s EnglishTextNormalizer that inflated WER by ~2–3% across all models: (1) “oh” treated as zero—in medical conversations it’s always an interjection, not a digit; (2) missing word equivalences (ok/okay/k, yeah/yep/yes, mum/mom, alright/all right, kinda/kind of). All v3 scores are recalculated with the custom normalizer. Code in evaluate/text_normalizer.py—drop-in replacement, no whisper dependency.
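For illustration, a condensed sketch of the two normalizer fixes (this is a hypothetical simplification, not the actual code in evaluate/text_normalizer.py): "oh" is left untouched, and the spoken-variant equivalences are collapsed before scoring.

```python
import re

# Hypothetical condensed sketch of the normalizer behaviour described above;
# the maintained drop-in replacement lives in evaluate/text_normalizer.py.
# Fix 1: "oh" is never mapped to the digit 0 (it stays an interjection).
# Fix 2: common spoken variants collapse onto one canonical form.
EQUIVALENCES = {
    "ok": "okay", "k": "okay",
    "yeah": "yes", "yep": "yes",
    "mum": "mom",
    "alright": "all right",
    "kinda": "kind of",
}

def normalize(text: str) -> str:
    text = re.sub(r"[^\w\s']", " ", text.lower())  # drop punctuation, keep apostrophes
    out = []
    for word in text.split():
        out.extend(EQUIVALENCES.get(word, word).split())  # "kinda" -> "kind of"
    return " ".join(out)

print(normalize("Yeah, OK, it's alright."))  # -> "yes okay it's all right"
```

Because both reference and hypothesis pass through the same mapping, none of these variants can register as a substitution error.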
VibeVoice-ASR 9B is the new open-source leader: First open-source model to genuinely compete with Gemini-tier cloud APIs on medical audio (8.34% vs 8.15% for Gemini 2.5 Pro). Needs ~18GB VRAM (L4/A10 sufficient, won’t fit on T4). Even on H100, 97s/file vs 6s for Parakeet.
Parakeet TDT 0.6B v3 remains the edge story: 9.35% WER at 6 seconds per file on Apple Silicon. A 0.6B model within 1% of a 9B model.
ElevenLabs Scribe v2 is a meaningful upgrade: 9.72% vs 10.87% for v1. Best cloud API if you don’t want Google.
LFM Audio and SeamlessM4T didn’t make the cut: LFM2.5-Audio-1.5B isn’t a dedicated ASR model—transcription via prompting yielded sparse keyword extractions (~74 words from 1400-word conversations with 2s chunks) or hallucination loops with longer chunks. SeamlessM4T is a translation model that summarized instead of transcribing (~677 words from ~1400).
———————————————————
What changed in the 24 Dec 2025 refresh
Open-sourced the benchmark: Full evaluation code now available on GitHub—run your own models, reproduce our results, or contribute improvements.
→ github.com/Omi-Health/medical-STT-eval
7 new models tested: Added Gemini 3 Pro/Flash Preview, Parakeet v3, updated GPT-4o Mini, NVIDIA Canary 1B v2, IBM Granite Speech, and Google MedASR.
Parakeet v3 jumps to #3: NVIDIA's latest Parakeet release now beats Gemini 2.5 Flash with 11.9% WER at just 6 seconds per file—best local model for on-device use.
Google MedASR tested: Despite being Google's medical-specific model, it scored worst at 64.9% WER. Key insight: MedASR is optimized for single-speaker dictation, not doctor-patient conversations.
Hallucination patterns documented: We identified repetition loops in autoregressive models (Canary 1B v2, Granite Speech, Kyutai) and developed chunking strategies to mitigate them.
—————————————
What changed in the 03 Aug 2025 refresh
Extended Whisper-style normalisation: We now strip fillers ("um"), expand contractions ("you're → you are"), and standardise punctuation/numbers before calculating WER. Most models gained 0.5-1 pp accuracy.
Extra models & fresh weights: Added Gemini 2.5 Pro / Flash, Voxtral Small, and Kyutai.
—————————————
Physicians lose countless hours each week to manual note-taking, so we're developing an on-device AI-Scribe that can transcribe and summarise entire consultations without sending any data to the cloud. To choose the right speech-to-text engine, we benchmarked 31 open- and closed-source models on PriMock57—a set of 57 simulated GP consultations (5–10 minutes each) recorded by Babylon Health clinicians. We ran the audio files through each model, logged word-error rate (WER), speed, and consistency, and applied chunking only where models crashed on audio longer than 40 seconds. The results surprised us—and they'll help anyone looking to bring reliable, privacy-first transcription into clinical workflows.
Dataset — PriMock57
• 57 simulated GP consultations recorded by seven Babylon Health doctors with role-play patients.
• Reference transcripts were cleaned to plain text for fair Word-Error-Rate (WER) calculation.
• Two recordings (day1_consultation07, day3_consultation03) triggered catastrophic hallucinations in multiple models and were excluded, leaving 55 files.
• Public repo: https://github.com/babylonhealth/primock57
Evaluation framework
Transcription — a per-model runner saves a .txt and logs processing time.
Normalisation — extended Whisper-style preprocessing strips fillers ("um"), expands contractions ("you're → you are"), and standardises punctuation/numbers before WER calculation.
Metrics — scripts compute WER, best/worst file and standard deviation.
Comparison — results merge into CSV and rankings for plotting.
Chunking — only applied to models that break on audio longer than 40 s (30 s chunks with 10 s overlap).
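The chunk layout and the WER metric above can be sketched as follows (an illustrative re-implementation, not the benchmark's actual code; window sizes match the 30 s / 10 s scheme):

```python
def chunk_offsets(duration_s, chunk_s=30.0, overlap_s=10.0):
    """Yield (start, end) windows in seconds; successive chunks
    advance by chunk_s - overlap_s, so neighbours share overlap_s."""
    step = chunk_s - overlap_s
    start = 0.0
    while start < duration_s:
        yield (start, min(start + chunk_s, duration_s))
        start += step

def wer(reference, hypothesis):
    """Word error rate = word-level edit distance (substitutions +
    insertions + deletions) divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            curr[j] = min(prev[j] + 1,              # deletion
                          curr[j - 1] + 1,          # insertion
                          prev[j - 1] + (r != h))   # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)

print(list(chunk_offsets(70)))  # -> [(0.0, 30.0), (20.0, 50.0), (40.0, 70.0), (60.0, 70.0)]
print(wer("chest pain since monday", "chest pains since monday"))  # -> 0.25
```

The 10 s overlap gives each chunk boundary a second chance at words that would otherwise be cut mid-utterance; the duplicated region then has to be deduplicated when chunk transcripts are stitched back together.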
Hardware & run types
Local Mac – Apple M4 Max 64 GB using MLX & WhisperKit
Local GPU – AWS g6.2xl (NVIDIA L4 24 GB) running vLLM and NVIDIA NeMo
Cloud APIs – Groq, ElevenLabs, OpenAI, Mistral, Gemini
Azure – Foundry Phi-4 multimodal endpoint
Results (55 files, 31 models)
# | Model | Avg WER | Best–Worst WER (%) | Avg sec/file | Host | Info |
|---|---|---|---|---|---|---|
1 | Google Gemini 2.5 Pro | 8.15% | 4.3–14.9 | 56s | API (Google) | Long |
2 | Microsoft VibeVoice-ASR 9B | 8.34% | 4.6–14.2 | 97s | Local (H100) | Long |
3 | Google Gemini 3 Pro Preview | 8.35% | 4.2–16.7 | 65s | API (Google) | Long |
4 | Parakeet TDT 0.6B v3 | 9.35% | 5.1–15.8 | 6s | Local (M4) | Long |
5 | Google Gemini 2.5 Flash | 9.45% | 5.0–34.5 | 20s | API (Google) | Long |
6 | ElevenLabs Scribe v2 | 9.72% | 5.2–64.1 | 44s | API (ElevenLabs) | Long |
7 | Parakeet TDT 0.6B v2 | 10.75% | 6.8–17.5 | 5s | Local (M4) | Long |
8 | ElevenLabs Scribe v1 | 10.87% | 5.5–64.9 | 36s | API (ElevenLabs) | Long |
9 | NVIDIA Nemotron Speech Streaming 0.6B | 11.06% | 6.3–17.8 | 12s | Local (T4) | Long |
10 | OpenAI GPT-4o Mini (2025-12-15) | 11.18% | 5.9–22.1 | 40s | API (OpenAI) | Long |
11 | Kyutai STT 2.6B (en) | 11.20% | 6.2–18.3 | 148s | Local (L4 GPU) | Long |
12 | Google Gemini 3 Flash Preview | 11.33% | 5.9–22.7 | 52s | API (Google) | Long |
13 | Voxtral Mini 2602 (Transcription API) | 11.64% | 6.1–44.5 | 18s | API (Mistral) | Long |
14 | MLX Whisper-L v3-turbo | 11.65% | 6.0–29.8 | 13s | Local (M4) | Long |
15 | Voxtral Mini (API) | 11.85% | 6.3–44.9 | 22s | API (Mistral) | Long |
16 | Voxtral Mini 4B (vLLM realtime) | 11.89% | 6.5–45.2 | 693s* | Local (T4/H100) | Long |
17 | Groq Whisper-L v3 | 12.05% | 7.1–18.9 | 9s | API (Groq) | Long |
18 | Groq Whisper-L v3-turbo | 12.15% | 6.8–19.4 | 8s | API (Groq) | Long |
19 | NVIDIA Canary 1B Flash | 12.20% | 6.9–19.6 | 23s | Local (L4 GPU) | Chunk |
20 | WhisperKit-L v3-turbo | 12.25% | 6.1–19.8 | 21s | Local (macOS) | Long |
21 | Apple SpeechAnalyzer | 12.50% | 7.0–19.1 | 6s | Local (macOS) | Long |
22 | NVIDIA Canary-Qwen 2.5B | 13.05% | 6.7–61.8 | 105s | Local (L4 GPU) | Chunk |
23 | OpenAI Whisper-1 | 13.15% | 5.8–101.5 | 104s | API (OpenAI) | Long |
24 | OpenAI GPT-4o Mini Transcribe | 13.65% | 6.5–40.4 | — | API (OpenAI) | Long |
25 | NVIDIA Canary 1B v2** | 14.45% | 7.9–42.8 | 9s | Local (L4 GPU) | Long |
26 | OpenAI GPT-4o Transcribe | 14.75% | 6.6–63.8 | 28s | API (OpenAI) | Long |
27 | IBM Granite Speech 3.3-2B*** | 16.55% | 6.1–32.7 | 110s | Local (L4 GPU) | Chunk |
28 | Kyutai STT 1B (en/fr) | 27.10% | 7.8–220.5 | 80s | Local (L4 GPU) | Long |
29 | Azure Foundry Phi-4 | 30.75% | 7.3–104.5 | 213s | API (Azure) | Chunk |
30 | Google MedASR | 62.50% | 29.5–96.2 | 1s | Local (M4) | Long |
31 | LiquidAI LFM2.5-Audio-1.5B† | — | — | — | Local (GPU) | N/A |
*54/55 files (1 blocked by safety filter). **3 files with hallucination loops. ***Requires chunking to avoid repetition loops. †Not rankable: prompted transcription yielded sparse keyword extractions or hallucination loops (see the 27 Mar 2026 notes).
Key findings
Google Gemini 2.5 Pro leads accuracy at 8.15% WER, with Microsoft VibeVoice-ASR 9B (8.34%) and Gemini 3 Pro Preview (8.35%) essentially tied just behind.
Parakeet TDT 0.6B v3 is the best on-device option: 9.35% WER at 6 seconds per file on Apple Silicon, within about 1 pp of the 9B open-source leader.
OpenAI GPT-4o Mini (2025-12-15) is the strongest OpenAI entry at 11.18% WER, well ahead of the older GPT-4o Transcribe endpoints.
Google MedASR scored worst of the rankable models (62.5% WER) despite being medical-specific: it is built for single-speaker dictation, not doctor-patient conversations.
Autoregressive models hallucinate: Canary 1B v2, Granite Speech, and Kyutai all exhibited repetition loops on certain files; chunking with overlap mitigates this.
Groq Whisper-L v3 (and its turbo variant) offers the best price/latency balance in the cloud.
Apple SpeechAnalyzer remains a solid choice for Swift apps at 12.5% WER.
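Repetition loops like these are easy to flag automatically. One simple heuristic (our suggestion, not necessarily a check the benchmark performs) measures what fraction of a transcript's n-grams are duplicates:

```python
from collections import Counter

def repetition_score(text, n=4):
    """Fraction of n-grams that occur more than once in the transcript.
    Natural conversation stays low; a hallucination loop that replays
    the same phrase over and over drives this toward 1.0."""
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    return sum(1 for g in grams if counts[g] > 1) / len(grams)

print(repetition_score("thank you for coming in today " * 20))  # -> 1.0
print(repetition_score("the patient reports mild chest pain since monday"))  # -> 0.0
```

A per-file threshold on this score could serve as an automatic tripwire for degenerate outputs before they skew a model's average WER.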
Limitations & next steps
• UK-English only → we'll use multi-language datasets in the future.
• WER ≠ clinical usefulness → needs more thorough evaluation on medical correctness.
• API cost unmeasured → v2 will include $/hour and CO₂ metrics.
• Evaluation code now open-source → github.com/Omi-Health/medical-STT-eval
Get in touch
Want to try the on-device AI-Scribe or plug in your own model?
Email [email protected]