Benchmarking Speech-to-Text Models for Long-Form Medical Dialogue

Physicians lose countless hours each week to manual note-taking. At Omi, we're building an on-device AI-Scribe that transcribes and summarises entire consultations without sending patient data to the cloud. Choosing the right speech-to-text engine is the foundation of that product, so we built the most comprehensive medical STT benchmark we know of, and we keep it current.

What's new in v4

We now rank models by Medical WER (M-WER) — a metric that only counts errors on clinically relevant words (drugs, conditions, symptoms, anatomy, clinical procedures). Standard WER treats "yeah" and "amoxicillin" as equally important. M-WER doesn't — and the ranking looks very different as a result. We also report Drug M-WER separately, because misspelling a drug name is a patient safety issue, and drug names turn out to be 2–5x harder than other medical terms for almost every model.
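
For concreteness, here's a minimal sketch of how a metric like M-WER can be computed from jiwer's word alignments. The tiny lexicon and the error-counting rule are illustrative only; the real implementation lives in the eval repo linked below and may differ in detail.

```python
import jiwer

# Tiny illustrative lexicon -- the real benchmark matches against a far
# larger list of drugs, conditions, symptoms, anatomy, and procedures.
MEDICAL_TERMS = {"amoxicillin", "salbutamol", "asthma", "wheeze"}

def medical_wer(reference: str, hypothesis: str) -> float:
    """M-WER: error rate over clinically relevant reference words only.

    Counts substitutions and deletions whose reference-side word is in
    the lexicon; errors on filler words ("yeah" -> "yes") are ignored.
    """
    out = jiwer.process_words(reference, hypothesis)
    ref_words = out.references[0]
    total = sum(w in MEDICAL_TERMS for w in ref_words)
    errors = 0
    for chunk in out.alignments[0]:
        if chunk.type in ("substitute", "delete"):
            errors += sum(
                ref_words[i] in MEDICAL_TERMS
                for i in range(chunk.ref_start_idx, chunk.ref_end_idx)
            )
    return errors / max(total, 1)

# One medical word, transcribed wrongly -> M-WER = 1.0, even though
# four of the five reference words came through fine.
print(medical_wer("yeah take amoxicillin twice daily",
                  "yes take a moxy cillin twice daily"))
```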

Leaderboard

Ranked by Medical WER (M-WER). Lower is better. Dataset: PriMock57 — 55 simulated GP consultations, ~80,500 words of British English medical dialogue.

| # | Model | WER | M-WER | Drug M-WER | Avg Speed (per file) | Type |
|---|-------|-----|-------|------------|----------------------|------|
| 1 | Google Gemini 3 Pro Preview | 8.35% | 2.65% | 3.1% | 64.5s | API |
| 2 | Google Gemini 2.5 Pro | 8.15% | 2.97% | 4.1% | 56.4s | API |
| 3 | VibeVoice-ASR 9B | 8.34% | 3.16% | 5.6% | 96.7s | H100 |
| 4 | Soniox stt-async-v4 | 9.18% | 3.32% | 7.1% | 46.2s | API |
| 5 | Google Gemini 3 Flash Preview | 11.33% | 3.64% | 5.2% | 51.5s | API |
| 6 | ElevenLabs Scribe v2 | 9.72% | 3.86% | 4.3% | 43.5s | API |
| 7 | AssemblyAI Universal-3 Pro (medical-v1) | 9.55% | 4.02% | 6.5% | 37.3s | API |
| 8 | Qwen3 ASR 1.7B | 9.00% | 4.40% | 8.6% | 6.8s | A10 |
| 9 | Deepgram Nova-3 Medical | 9.05% | 4.53% | 9.7% | 12.9s | API |
| 10 | OpenAI GPT-4o Mini (Dec 2025) | 11.18% | 4.85% | 10.6% | 40.4s | API |
| 11 | Microsoft MAI-Transcribe-1 | 11.52% | 4.85% | 11.2% | 21.8s | API |
| 12 | ElevenLabs Scribe v1 | 10.87% | 4.88% | 7.5% | 36.3s | API |
| 13 | Google Gemini 2.5 Flash | 9.45% | 5.01% | 10.3% | 20.2s | API |
| 14 | Voxtral Mini Transcribe V1 | 11.85% | 5.17% | 11.0% | 22.4s | API |
| 15 | Parakeet TDT 1.1B | 9.03% | 5.20% | 15.5% | 12.3s | T4 |
| 16 | Voxtral Mini Transcribe V2 | 11.64% | 5.36% | 12.1% | 18.4s | API |
| 17 | Voxtral Mini 4B Realtime | 11.89% | 5.39% | 11.8% | 270.9s | A10 |
| 18 | Cohere Transcribe (Mar 2026) | 11.81% | 5.59% | 16.6% | 3.9s | A10 |
| 19 | OpenAI Whisper-1 | 13.20% | 5.62% | 10.3% | 104.3s | API |
| 20 | Groq Whisper Large v3 Turbo | 12.14% | 5.75% | 14.4% | 8.0s | API |
| 21 | NVIDIA Canary 1B Flash | 12.03% | 5.97% | 15.7% | 23.4s | T4 |
| 22 | Groq Whisper Large v3 | 11.93% | 5.97% | 13.6% | 8.6s | API |
| 23 | OpenAI GPT-4o Mini Transcribe | 13.60% | 6.03% | 11.4% | 23.2s | API |
| 24 | MLX Whisper Large v3 Turbo | 11.65% | 6.16% | 14.0% | 12.9s | Apple Silicon |
| 25 | Parakeet TDT 0.6B v2 | 10.75% | 6.19% | 17.2% | 5.4s | Apple Silicon |
| 26 | WhisperKit Large v3 Turbo | 12.28% | 6.35% | 14.4% | 21.4s | Apple Silicon |
| 27 | Qwen3 ASR 0.6B | 9.83% | 6.48% | 15.1% | 5.1s | A10 |
| 28 | Kyutai STT 2.6B | 11.20% | 6.51% | 15.7% | 148.4s | T4 |
| 29 | GLM-ASR-Nano-2512 | 10.84% | 7.05% | 17.5% | 87.7s | T4 |
| 30 | Parakeet TDT 0.6B v3 | 9.35% | 7.25% | 22.0% | 6.3s | Apple Silicon |
| 31 | Nemotron Speech Streaming 0.6B | 11.06% | 8.97% | 22.6% | 11.7s | T4 |
| 32 | OpenAI GPT-4o Transcribe | 14.84% | 9.03% | 14.9% | 27.9s | API |
| 33 | NVIDIA Canary-Qwen 2.5B | 12.94% | 9.80% | 22.8% | 105.4s | T4 |
| 34 | Gemma 4 E4B-it | 15.69% | 9.99% | 15.5% | 185.4s | T4 |
| 35 | NVIDIA Canary 1B v2 | 14.32% | 11.24% | 20.5% | 9.2s | T4 |
| 36 | IBM Granite Speech 3.3-2B | 16.55% | 12.80% | 23.1% | 109.7s | T4 |
| 37 | Apple SpeechAnalyzer | 12.36% | 13.02% | 27.4% | 6.0s | Apple Silicon |
| 38 | Gemma 4 E2B-it | 18.90% | 13.92% | 19.8% | 134.6s | T4 |
| 39 | Azure Foundry Phi-4 | 31.13% | 15.38% | 18.1% | 212.8s | API |
| 40 | Kyutai STT 1B (Multilingual) | 27.28% | 21.23% | 28.9% | 79.5s | T4 |
| 41 | Google MedASR | 64.38% | 49.66% | 58.0% | 1.2s | Apple Silicon |
| 42 | Facebook MMS-1B-all | 38.70% | 54.01% | 72.0% | 28.6s | T4 |

Dataset — PriMock57

PriMock57 is a public set of mock GP consultations recorded between clinicians and actors posing as patients. We evaluate on 55 of the consultations, roughly 80,500 words of British English medical dialogue, scoring each model against the human reference transcripts.

Evaluation framework

Every model runs through the same pipeline: transcribe the audio, log processing time, normalise the output with our custom text normaliser, and compute WER, M-WER, Drug M-WER, and per-file statistics. Models that crash on long audio are fed overlapping chunks instead (typically 30 s chunks with 10 s overlap, merged via longest common subsequence). We run locally on Apple Silicon (M4 Max) and NVIDIA GPUs (T4, L4, A10, H100), and call the cloud APIs directly. Local models are benchmarked at full precision for fairness; no quantisation unless explicitly noted.
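
As a rough illustration of the overlap-merge step, here's a sketch that uses difflib's longest matching block as a stand-in for LCS merging. The window sizes match the defaults above, but `transcribe_fn` and the merge details are placeholders, not the repo's actual code.

```python
from difflib import SequenceMatcher

CHUNK_S, OVERLAP_S = 30.0, 10.0  # window length and overlap, in seconds

def chunk_spans(duration_s: float):
    """Yield (start, end) windows: 30 s long, advancing by 20 s."""
    step = CHUNK_S - OVERLAP_S
    start = 0.0
    while start < duration_s:
        yield start, min(start + CHUNK_S, duration_s)
        start += step

def merge_overlap(left: list[str], right: list[str], window: int = 40) -> list[str]:
    """Stitch two chunk transcripts at their seam.

    Finds the longest common word run near the boundary and joins the
    halves there, so words in the 10 s overlap aren't duplicated.
    """
    a, b = left[-window:], right[:window]
    m = SequenceMatcher(a=a, b=b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b)
    )
    if m.size == 0:              # nothing in common: just concatenate
        return left + right
    cut_left = len(left) - len(a) + m.a + m.size
    return left[:cut_left] + right[m.b + m.size:]

def transcribe_long(path: str, duration_s: float, transcribe_fn) -> str:
    """transcribe_fn(path, start, end) -> str stands in for any model's
    per-chunk transcription call."""
    words: list[str] = []
    for start, end in chunk_spans(duration_s):
        piece = transcribe_fn(path, start, end).split()
        words = merge_overlap(words, piece) if words else piece
    return " ".join(words)
```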

Full evaluation code: github.com/Omi-Health/medical-STT-eval

Choosing a medical STT engine

Best accuracy, cloud API

Google Gemini 3 Pro or 2.5 Pro. Both sit below 3% M-WER and handle long-form conversation natively.
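
For reference, a minimal transcription call through the google-genai Python SDK looks roughly like this. The prompt, file name, and model string are illustrative; this isn't necessarily how our harness drives the API.

```python
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment
audio = client.files.upload(file="consultation_01.wav")  # hypothetical file
resp = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=["Transcribe this GP consultation verbatim.", audio],
)
print(resp.text)
```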

Strong accuracy without Google

Soniox, AssemblyAI Universal-3 Pro (with the medical-v1 domain), or Deepgram Nova-3 Medical. All three are genuinely competitive on M-WER, and Deepgram is the fastest of the three at 12.9s/file.

Best open-source, accuracy-first

Microsoft VibeVoice-ASR 9B. The first open-source model to compete with Gemini on medical audio. It needs ~18GB of VRAM (9B parameters at 16-bit precision is ~18GB of weights alone, so L4/A10 is enough but it won't fit on a T4) and it's slow, at 97s/file even on an H100, because it's a 9B LLM processing audio as tokens rather than a purpose-built speech model.

Best open-source, cost/speed-balanced

Qwen3-ASR 1.7B. ~14x faster than VibeVoice for only ~1.3 points more M-WER. Requires compute capability ≥ 8.0 (A10 or better; not T4).
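
If you're unsure whether a GPU clears that bar, a quick PyTorch check (assuming PyTorch is installed):

```python
import torch

# A T4 reports (7, 5); an A10 reports (8, 6), which clears the bar.
major, minor = torch.cuda.get_device_capability()
assert (major, minor) >= (8, 0), "Qwen3-ASR needs compute capability >= 8.0"
```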

On-device, Apple Silicon

Parakeet TDT models are the fastest option, but be aware their Drug M-WER is weak — 22% for the 0.6B v3 — which may not be acceptable in a clinical context.

What to avoid for medical transcription

Google MedASR (despite being medical-specific, it's built for single-speaker dictation, not conversations: 49.66% M-WER), Facebook MMS-1B-all (a multilingual phonetic vocabulary, 54% M-WER), and current quantised builds of VibeVoice (aggressive quantisation kills drug-name accuracy).

Limitations and what's next