Evaluations
Benchmarking Speech-to-Text Models for Long-Form Medical Dialogue
Jul 29, 2025

Physicians lose countless hours each week to manual note-taking, so we’re developing an on-device AI-Scribe that can transcribe and summarise entire consultations without sending any data to the cloud. To choose the right speech-to-text engine, we benchmarked 15 open- and closed-source models on PriMock57—a set of 57 simulated GP consultations (5–10 minutes each) recorded by Babylon Health clinicians. We ran the audio files through each model, logged word-error rate (WER), speed, and consistency, and applied chunking only where models crashed on audio longer than 40 seconds. The results surprised us—and they’ll help anyone looking to bring reliable, privacy-first transcription into clinical workflows.
Dataset — PriMock57
• 57 simulated GP consultations recorded by seven Babylon Health doctors with role-play patients.
• Reference transcripts were cleaned to plain text for fair Word-Error-Rate (WER) calculation (see the normalisation sketch after this list).
• Two recordings (day1_consultation07, day3_consultation03) triggered catastrophic hallucinations on multiple models, so 55 files remain.
• Public repo: https://github.com/babylonhealth/primock57
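To make WER comparisons fair, the reference transcript and each model's output need the same light-touch cleaning. Here is a minimal sketch of the kind of normalisation involved (lower-casing, punctuation stripping, whitespace collapsing); the exact rules in our pipeline may differ:

```python
import re
import string

def normalise(text: str) -> str:
    """Lower-case, strip punctuation and collapse whitespace so that
    formatting differences don't inflate WER."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(normalise("Good morning, Dr. Smith. How are you feeling today?"))
# -> "good morning dr smith how are you feeling today"
```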
Evaluation framework
• Transcription — a per-model runner saves a .txt and logs processing time.
• Metrics — scripts compute per-file WER, the best/worst file and the standard deviation.
• Comparison — per-model results merge into a CSV and rankings for plotting (the metrics and comparison steps are sketched below).
• Chunking — only applied to Canary-Qwen 2.5 B, Canary-1B-Flash and Phi-4, which break on audio longer than 40 s (30 s chunks with 10 s overlap; sketched below).
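The metrics and comparison steps boil down to scoring every hypothesis against its reference and aggregating per model. A minimal sketch of that idea, assuming the `jiwer` and `pandas` packages, the `normalise` helper from the dataset section, and illustrative `transcripts/<model>/` and `references/` directory names (not the exact benchmark scripts):

```python
import statistics
from pathlib import Path

import jiwer       # pip install jiwer
import pandas as pd

def score_model(model_dir: Path, ref_dir: Path) -> dict:
    """Per-file WER for one model's transcripts, plus aggregate stats."""
    wers = {}
    for hyp_path in sorted(model_dir.glob("*.txt")):
        ref = (ref_dir / hyp_path.name).read_text()
        wers[hyp_path.stem] = jiwer.wer(normalise(ref), normalise(hyp_path.read_text()))
    return {
        "model": model_dir.name,
        "avg_wer": statistics.mean(wers.values()),
        "std_wer": statistics.stdev(wers.values()),
        "best_file": min(wers, key=wers.get),
        "worst_file": max(wers, key=wers.get),
    }

# Merge every model's stats into one ranked CSV for plotting.
rows = [score_model(d, Path("references")) for d in Path("transcripts").iterdir() if d.is_dir()]
pd.DataFrame(rows).sort_values("avg_wer").to_csv("comparison.csv", index=False)
```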
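For the three chunked models, each recording is sliced into 30 s windows that overlap by 10 s, every window is transcribed separately, and the text is stitched back together. A rough sketch of the slicing, assuming the `soundfile` package; merging the overlapping text is deliberately elided, as that is the fiddly part in practice:

```python
import soundfile as sf  # pip install soundfile

CHUNK_S, OVERLAP_S = 30, 10

def chunk_audio(path: str):
    """Yield (start_time_s, samples) windows of 30 s that advance by 20 s."""
    audio, sr = sf.read(path)
    step = (CHUNK_S - OVERLAP_S) * sr
    for start in range(0, len(audio), step):
        yield start / sr, audio[start:start + CHUNK_S * sr]

# e.g. texts = [model.transcribe(chunk) for _, chunk in chunk_audio("consultation.wav")]
# (`model.transcribe` is a placeholder; words in the 10 s overlaps still need de-duplication)
```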
Hardware & run types
Local Mac – Apple M4 Max 64 GB using MLX & WhisperKit (example runner sketched below)
Local GPU – AWS g5.2xl (NVIDIA L4 24 GB) running vLLM and NVIDIA NeMo
Cloud APIs – Groq, ElevenLabs, OpenAI, Mistral
Azure – Foundry Phi-4 multimodal endpoint
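As a concrete example of a per-model runner for the local-Mac runs, here is a minimal sketch using the `mlx-whisper` package with the `mlx-community/whisper-large-v3-turbo` checkpoint (the exact checkpoint path is an assumption): it transcribes one file, saves the .txt and returns the processing time, mirroring the framework described above.

```python
import time
from pathlib import Path

import mlx_whisper  # pip install mlx-whisper

def run_file(audio_path: Path, out_dir: Path) -> float:
    """Transcribe one consultation, save the transcript, return elapsed seconds."""
    start = time.perf_counter()
    result = mlx_whisper.transcribe(
        str(audio_path),
        path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
    )
    elapsed = time.perf_counter() - start
    (out_dir / f"{audio_path.stem}.txt").write_text(result["text"])
    return elapsed
```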
Results (55 files)
Rank | Model | Avg WER | Best–Worst WER (%) | Avg sec/file | Host / Run-type | Mode |
---|---|---|---|---|---|---|
1 | ElevenLabs Scribe v1 | 15.0 % | 8.7–68.5 | 36.3 s | API (ElevenLabs) | Long |
2 | MLX Whisper-L v3-turbo | 17.6 % | 9.4–33.4 | 12.9 s | Local (M4) | Long |
3 | Parakeet-TDT-0.6 B v2 | 17.9 % | 12.4–25.8 | 5.4 s | Local (M4) | Long |
4 | NVIDIA Canary-Qwen 2.5 B | 18.2 % | 10.5–65.5 | 105.4 s | Local (L4 + NVIDIA NeMo) | Chunk |
5 | Apple SpeechAnalyzer | 18.2 % | 11.9–25.8 | 6.0 s | Local (macOS) | Long |
6 | Groq Whisper-L v3 | 18.4 % | 11.8–26.5 | 8.6 s | API (Groq) | Long |
7 | Voxtral-mini 3 B | 18.5 % | 12.0–25.8 | 74.4 s | Local (L4 + vLLM) | Long |
8 | Groq Whisper-L v3-turbo | 18.7 % | 11.8–26.9 | 8.0 s | API (Groq) | Long |
9 | Canary-1B-Flash | 18.8 % | 11.6–25.7 | 23.4 s | Local (L4 + NVIDIA NeMo) | Chunk |
10 | Voxtral-mini (API) | 19.0 % | 11.6–50.3 | 22.9 s | API (Mistral) | Long |
11 | WhisperKit-L v3-turbo | 19.1 % | 11.8–27.5 | 21.4 s | Local (macOS) | Long |
12 | OpenAI Whisper-1 | 19.6 % | 10.5–98.6 | 104.3 s | API (OpenAI) | Long |
13 | OpenAI GPT-4o-mini | 20.6 % | 11.6–45.7 | — | API (OpenAI) | Long |
14 | OpenAI GPT-4o | 21.7 % | 12.0–67.2 | 27.9 s | API (OpenAI) | Long |
15 | Azure Foundry Phi-4 | 36.6 % | 13.0–110.0 | 212.8 s | API (Azure Foundry) | Chunk |
Key findings
• ElevenLabs Scribe leads accuracy but can hallucinate on edge cases.
• Parakeet-0.6 B on an M4 averages ~5 s per consultation (roughly 50–100× real-time on these 5–10-minute recordings), which makes it great for local use.
• Groq Whisper-v3 (turbo) offers the best price/latency balance in the cloud.
• Chunking rescues Canary, Canary-Qwen & Phi-4 but doubles runtime.
• Apple SpeechAnalyzer is a great fit for Swift apps.
• Voxtral-mini beats Canary-Qwen 2.5 B for multilingual long-form.
Limitations & next steps
• UK English only → we'll add multilingual datasets in future runs.
• WER ≠ clinical usefulness → we need a more thorough evaluation of medical correctness.
• API cost unmeasured → v2 will include $/hour and CO₂ metrics.
Get in touch
Want to try the on-device AI-Scribe or plug in your own model?
Email [email protected]