
Omi Med STT v1: On-Device Medical Speech-to-Text

Parakeet TDT 0.6B v2 · CC-BY-4.0
Medical-term error rate (M-WER): 3.5× fewer errors
Parakeet TDT 0.6B v2 (base): 8.36%
Omi Med STT v1: 2.37%
Word error rate: 8.30%
Medical-term WER: 2.37%
Mac · CUDA · CPU: runs locally, no cloud
Omi Med STT v1 adapts Parakeet TDT 0.6B v2 for local clinical transcription, reducing medical-term errors while shipping Mac, CUDA, and CPU runtimes.

Medical transcription has forced a trade-off: the accurate options are closed cloud APIs, which means patient audio leaves the building; the private, on-device options haven't been accurate enough for clinical dialogue. Omi Med STT v1 is our attempt to close that gap — a 0.6-billion-parameter adaptation of NVIDIA's open Parakeet TDT 0.6B v2, evaluated against 21 external systems, open and closed, on a locked, held-out clinical benchmark (7.18 hours of audio, identical input and scoring for every system).

The short version: it is the most accurate open or local model we measured by overall word error rate, and it sits inside the band of the specialized medical cloud APIs. It runs entirely on your own hardware, and every shipped artifact (Mac, CUDA, CPU) carries its own published benchmark numbers. Weights are CC-BY-4.0; the runtime is MIT.

This page is the complete evaluation write-up. We publish it the same way we publish our public 42-model STT benchmark: numbers first, including where we lose.

Parameters: 0.6 billion
Base model: NVIDIA Parakeet TDT 0.6B v2
Training data: Curated real + synthetic clinical audio
Licence: CC-BY-4.0 weights · MIT runtime

How it was measured

Every system transcribed the same locked, held-out clinical set — 7.18 hours of English medical audio spanning real GP consultations, long-form clinical dialogue, medication reviews, radiology dictation, procedure/device/test vocabulary, and an everyday-speech control — and every transcript was scored by the same code. The set was locked before final model selection, with zero overlap between our training audio and the test audio.
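For readers new to the metric, word error rate is conventionally the word-level edit distance between the reference transcript and the system output, divided by the number of reference words. The sketch below is a generic illustration and not Omi's scorer: the function names and the lowercase, whitespace-split normalization are assumptions, and the real scorer definitions (including how medical terms are selected for M-WER) are the ones published alongside the runtime.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance over word tokens (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate as a fraction of reference words (assumed normalization)."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref_words, hyp_words) / len(ref_words)


# One substitution ("daily" -> "a") plus one insertion ("day") over 5 reference words.
example = wer("patient takes metformin twice daily",
              "patient takes metformin twice a day")
```

Here `example` evaluates to 0.4, which is why WER can exceed the share of words that "sound wrong": insertions count as errors too.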

Results — versus open-source models

Ranked on medical-term accuracy (M-WER), the metric that matters for clinical use. Only one open model scores better — at 15× the size and ~7× the compute. Omi also has the best overall WER (8.30%) of any open or local model we measured.

Omi vs open-source models — locked clinical benchmark, lower M-WER is better
| Model | M-WER | WER | Drug M-WER | Speed* (time per 1 hour of audio, × realtime) |
| --- | --- | --- | --- | --- |
| VibeVoice-ASR 9B | 1.78% | 11.10% | 1.36% | 5m 20s (11.2×) |
| Omi Med STT v1 (0.6B) | 2.37% | 8.30% | 4.75% | 25s (146.3×) |
| Qwen3 ASR 1.7B | 3.13% | 10.72% | 6.11% | 44s (81.1×) |
| Qwen3 ASR 0.6B | 3.38% | 11.11% | 7.92% | 32s (111.1×) |
| Whisper Large v3 Turbo | 3.93% | 11.98% | 5.88% | 1m 19s (45.8×) |
| Voxtral Mini Transcribe V1 | 4.53% | 13.53% | 6.33% | 46s (77.9×) |
| Cohere Transcribe 03-2026 | 5.05% | 14.88% | 11.09% | 25s (142.9×) |
| Parakeet TDT 0.6B v3 | 8.01% | 15.26% | 9.50% | 23s (157.9×) |
| NVIDIA Canary 1B Flash | 8.04% | 17.26% | 13.12% | 59s (60.6×) |
| Parakeet TDT 0.6B v2 (our base) | 8.36% | 16.45% | 8.60% | 23s (153.8×) |
| Google MedASR | 13.86% | 35.94% | 14.48% | 42s (85.7×) |

* Speed = time to transcribe 1 hour of audio (× realtime in brackets). All local models measured on the same NVIDIA A10 GPU class (VibeVoice-ASR 9B on an H100 due to its size; Cohere as a locally hosted endpoint). Omi figure is the canonical checkpoint; the installed CLI with long-audio chunking measures 37s (96.8×), and an Apple-Silicon Mac 53s (67.4×).
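The two speed figures in each cell are the same measurement in two units. A minimal sketch of the conversion follows; the helper names are illustrative, not part of the runtime. Because the displayed times are rounded to whole seconds, recomputing the multiplier from them gives slightly different values than the table (25 s gives 144.0×, while the listed 146.3× corresponds to roughly 24.6 s).

```python
SECONDS_PER_HOUR = 3600.0


def realtime_factor(processing_seconds, audio_seconds=SECONDS_PER_HOUR):
    """How many seconds of audio are transcribed per second of processing."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def seconds_per_hour_of_audio(factor):
    """Inverse conversion: processing time for one hour of audio."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return SECONDS_PER_HOUR / factor


from_rounded_time = realtime_factor(25)             # 144.0
implied_seconds = seconds_per_hour_of_audio(146.3)  # about 24.6
```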

Two honest notes. VibeVoice-ASR 9B earns its first place on medical terms (1.78% vs 2.37%) — it is also ~15× larger, ~13× slower, and worse on overall WER. And the comparison against our own base model, NVIDIA Parakeet TDT 0.6B v2, is the clearest proof the adaptation works: same architecture, same size, but medical-term error cut ~3.5× (8.36% → 2.37%), WER halved, and spurious drug mentions reduced from 131 to 9.
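The ratios quoted above can be re-derived from the table. The snippet below is only that arithmetic, using figures from this page; the variable names are illustrative, and the 9B parameter count is read off the VibeVoice model name.

```python
def ratio(larger, smaller):
    """Plain quotient, used for both error reductions and size/speed comparisons."""
    return larger / smaller


# Omi Med STT v1 versus its base model, Parakeet TDT 0.6B v2 (percent values).
m_wer_reduction = ratio(8.36, 2.37)     # about 3.5x fewer medical-term errors
wer_reduction = ratio(16.45, 8.30)      # about 2x, i.e. WER roughly halved
spurious_drug_reduction = ratio(131, 9)

# VibeVoice-ASR 9B versus Omi Med STT v1.
size_ratio = ratio(9.0, 0.6)            # parameters, in billions
slowdown = ratio(5 * 60 + 20, 25)       # seconds per hour of audio
```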

Results — versus general-purpose cloud APIs

This group covers the strongest general transcription services, including ElevenLabs, Soniox, and the Gemini and OpenAI families. The leaders out-transcribe Omi on medical terms; Omi places fourth of eight on M-WER while being the only on-device, open-weights entry, the fastest, and the only one where patient audio never leaves the machine.

Omi vs general-purpose cloud APIs — same benchmark, lower M-WER is better
| Model | M-WER | WER | Drug M-WER | Speed* (time per 1 hour of audio, × realtime) |
| --- | --- | --- | --- | --- |
| ElevenLabs Scribe v2 | 1.39% | 6.53% | 0.23% | 7m 42s (7.8×) |
| Gemini 3.1 Pro Preview † | 1.65% | 7.13% | 0.23% | 41m 54s (1.4×) |
| Soniox STT Async v4 | 1.95% | 6.99% | 3.39% | 33m 18s (1.8×) |
| Omi Med STT v1 (on-device) | 2.37% | 8.30% | 4.75% | 25s (146.3×) |
| Gemini 3.5 Flash † | 2.39% | 7.99% | 0.45% | 19m 18s (3.1×) |
| Reson8 Prerecorded | 2.58% | 6.69% | 6.56% | 8m 06s (7.4×) |
| Voxtral Mini Transcribe v2 | 2.79% | 8.12% | 5.66% | 3m 54s (15.4×) |
| OpenAI GPT-4o Mini Transcribe | 3.55% | 10.26% | 3.39% | 4m 54s (12.2×) |

* Speed = time to transcribe 1 hour of audio (× realtime in brackets). Cloud APIs are timed as full per-request round-trip (upload, queue, network and processing), called one request at a time. Omi's figure is local on-device compute on an NVIDIA A10 and excludes any network round-trip — so on latency it has a structural head start; concurrent API calls would raise the cloud figures. † Cleaned Gemini figures — see remark.

† A remark on the Gemini results. Both Gemini models showed a failure mode no other system had: on a stress slice of 420 benign, non-diagnostic clips, they repeatedly ignored the audio and fabricated entire clinical consultations — invented symptoms, histories and management plans (Gemini 3.1 Pro on 33 clips, Gemini 3.5 Flash on 87; every other system, including Omi: zero). The chart shows their cleaned numbers, with the fabrication clips excluded, so they're judged on transcription quality like everyone else; on the full set their headline WERs were 13.63% and 23.97%. For clinical use, we'd argue a transcriber that occasionally invents findings is the bigger risk than one that misses a word — but read the numbers either way.

Results — versus specialized medical APIs

The medical-specialty transcription services. AssemblyAI's medical model leads on medical terms; Omi is next — ahead of Deepgram Nova-3 Medical and Corti — as the only entry with open weights running on your own hardware.

Omi vs specialized medical STT APIs — same benchmark, lower M-WER is better
| Model | M-WER | WER | Drug M-WER | Speed* (time per 1 hour of audio, × realtime) |
| --- | --- | --- | --- | --- |
| AssemblyAI Universal-3 Pro Medical | 1.81% | 6.94% | 1.36% | 28m 06s (2.1×) |
| Omi Med STT v1 (on-device) | 2.37% | 8.30% | 4.75% | 25s (146.3×) |
| Deepgram Nova-3 Medical | 2.44% | 7.33% | 2.26% | 7m 48s (7.7×) |
| Corti Transcripts | 5.12% | 9.60% | 11.31% | 1h 5m (0.9×) |

* Speed = time to transcribe 1 hour of audio (× realtime in brackets). Cloud APIs are timed as full per-request round-trip; Omi's figure is local on-device compute (NVIDIA A10; 53s / 67.4× on an Apple-Silicon Mac) and excludes the network round-trip the cloud services incur — a structural latency advantage of running on-device, not a like-for-like compute race.

The structural difference matters as much as the scores: every other row in this chart requires sending patient audio to a third-party cloud. Omi's transcription happens on the machine in the consultation room, with no audio retention anywhere else — and with no network round-trip in the loop.

Performance by clinical setting

Averages hide where a model actually struggles, so here is Omi Med STT v1 broken down by the kind of audio a clinic produces, grouped so every cell rests on a meaningful sample of scored medical terms. "Consultations & clinical dialogue" is full conversation audio — recorded multi-speaker GP visits with crosstalk and informal drug mentions, plus extended clinical dialogue. "Dictation & medication review" covers radiology-style dictation and spoken medication lists. "Procedures, tests & general speech" is short-phrase audio: intervention and diagnostic vocabulary — think colonoscopy, pacemaker lead, HbA1c, CT angiogram — mixed with ordinary non-medical speech that verifies the medical fine-tune didn't degrade everyday conversation.

Omi Med STT v1 by clinical setting and recording length. Lower is better.
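By clinical setting

| Setting | M-WER | WER | Drug M-WER |
| --- | --- | --- | --- |
| Consultations & clinical dialogue | 2.71% | 10.26% | 7.10% |
| Procedures, tests & general speech | 2.77% | 7.43% | 3.77% |
| Dictation & medication review | 3.51% | 4.06% | 4.00% |

By recording length

| Recording length | M-WER | WER | Drug M-WER |
| --- | --- | --- | --- |
| Short clips, under 30 s | 1.99% | 5.98% | 3.09% |
| Full-length recordings | 2.70% | 10.25% | 7.10% |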

The reading for a clinician: medical-term accuracy is consistent across clinical settings (2.7–3.5% M-WER in all three groups). Overall word accuracy splits by audio type: dictation sits around 4% WER, short-phrase audio around 7%, and full conversations around 10%, driven by crosstalk and informal phrasing rather than medical terms. Within the consultation group, medication names are the dominant medical error (7.10% drug M-WER), which is exactly where the next version is aimed. Across term categories, recall stays high everywhere: symptoms 98.6%, anatomy 98.8%, conditions 97.9%, clinical terms 97.9%, drugs 95.3%.

One install, three runtimes

The model ships with an open runtime, omi-med-stt, that auto-selects the right engine and artifact for the machine it's on:

Terminal
$ pip install omi-med-stt
$ omi-med-stt consultation.wav

# Apple Silicon → MLX q8 (default)
$ pip install "omi-med-stt[mlx]"

# NVIDIA CUDA → NeMo full checkpoint
$ pip install "omi-med-stt[nemo]"

# Linux / Windows CPU → GGUF q8_0 via parakeet.cpp
$ omi-med-stt install-cpp
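
To transcribe a folder of recordings from Python, one option is to shell out to the CLI shown above. This is a sketch under assumptions: only the `omi-med-stt <file>` invocation comes from this page. That the transcript is written to standard output, and the helper names `build_command` and `transcribe_folder`, are assumptions, so check the runtime's own documentation for its actual output behaviour and options.

```python
import subprocess
from pathlib import Path


def build_command(audio_path):
    """Argument list for the documented `omi-med-stt <file>` invocation."""
    return ["omi-med-stt", str(audio_path)]


def transcribe_folder(folder, pattern="*.wav"):
    """Run the CLI once per matching file and collect whatever it prints.

    Assumes (not confirmed by this page) that the transcript goes to stdout.
    Raises subprocess.CalledProcessError if the CLI exits with an error.
    """
    transcripts = {}
    for audio_path in sorted(Path(folder).glob(pattern)):
        completed = subprocess.run(
            build_command(audio_path),
            capture_output=True,
            text=True,
            check=True,
        )
        transcripts[audio_path.name] = completed.stdout.strip()
    return transcripts
```

Keeping the command construction in its own function makes it easy to adjust once the real option set is known, without touching the loop.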

Most releases quantize silently. We benchmarked every artifact we ship on the full locked test set, with the same scorer:

Runtime artifacts on the full locked benchmark (7.18 h). Lower WER is better.
| Artifact | Platform | Size | Speed: time / 1 hour audio (× realtime) | M-WER | WER | Drug M-WER |
| --- | --- | --- | --- | --- | --- | --- |
| Canonical NeMo .nemo | NVIDIA CUDA (reference) | 2.5 GB | 25s (146.3×)* | 2.37% | 8.30% | 4.75% |
| MLX q8 (Mac default) | Apple Silicon | 0.94 GB | 53s (67.4×) | 2.75% | 8.61% | 5.20% |
| MLX full precision | Apple Silicon | 2.5 GB | 56s (64.5×) | 2.65% | 8.59% | 5.20% |
| GGUF q8_0 (CPU default) | Linux / Windows CPU | 0.93 GB | 2m 53s (20.8×)† | 3.20% | 9.12% | 6.33% |

Speed = time to transcribe 1 hour of audio (× realtime in brackets). * Direct full-audio inference on an NVIDIA A10; the installed CLI, which auto-chunks long recordings, measures 37s (96.8×). † 32-thread Linux CPU. The CPU path is a portability fallback, not the quality path — prefer CUDA or Apple Silicon when available. The CLI auto-chunks audio above 240 seconds on the NeMo path at a measured cost of roughly one WER point on those files; below 240 seconds its output is identical to direct NeMo inference.
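
The 240-second threshold can be pictured as a simple planning step. This sketch is illustrative only: the page states that chunking applies above 240 seconds on the NeMo path, but not how the chunks are cut, so the equal-length, non-overlapping split and the `plan_chunks` helper are assumptions rather than the CLI's actual algorithm.

```python
import math

CHUNK_THRESHOLD_S = 240.0  # from this page: recordings above this are chunked


def plan_chunks(duration_s, max_chunk_s=CHUNK_THRESHOLD_S):
    """Return (start, end) windows in seconds covering the whole recording.

    Recordings at or under the threshold stay in one piece; longer ones are
    split into the fewest equal windows that each fit under the threshold.
    The equal split is an assumed strategy, chosen only for illustration.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    if duration_s <= max_chunk_s:
        return [(0.0, float(duration_s))]
    n_chunks = math.ceil(duration_s / max_chunk_s)
    length = duration_s / n_chunks
    return [(i * length, min((i + 1) * length, float(duration_s)))
            for i in range(n_chunks)]


short_plan = plan_chunks(120)  # single window, no chunking
long_plan = plan_chunks(600)   # three 200-second windows
```

A real implementation would likely also need to decide where to cut (for example at silences) and how to stitch text at the boundaries, which is plausibly where the measured cost of roughly one WER point on long files comes from.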

The Mac default is q8 because the evidence supports it: 2.7× smaller and slightly faster than full precision, with drug-name accuracy identical to the full-precision export (5.20%) and overall WER within 0.02 points. We also built a 4-bit artifact — its raw WER looked fine, but drug-name accuracy regressed (5.88% vs 5.20%), so we didn't ship it. For a medical model, safety metrics outrank file size.
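
The shipping rule described above, where drug-name accuracy outranks file size, can be written as a small gate. The numbers are the ones quoted on this page; the `passes_drug_gate` function and its zero-regression tolerance are an assumed formalization, not the team's actual release tooling.

```python
FULL_PRECISION_DRUG_M_WER = 5.20  # MLX full precision, percent

# Drug M-WER (percent) for the quantized Mac candidates mentioned on this page.
candidates = {
    "MLX q8": 5.20,
    "MLX 4-bit": 5.88,
}


def passes_drug_gate(candidate_drug_m_wer,
                     reference=FULL_PRECISION_DRUG_M_WER,
                     tolerance=0.0):
    """True when the candidate is no worse than the reference on drug names."""
    return candidate_drug_m_wer <= reference + tolerance


shippable = [name for name, score in candidates.items() if passes_drug_gate(score)]

# Size and WER comparison for the q8 default versus full precision.
size_ratio = 2.5 / 0.94           # about 2.7x smaller
wer_gap_points = abs(8.61 - 8.59)  # 0.02 percentage points
```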

The runtime matrix is verified on real machines — clean Ubuntu 22.04, a pristine Windows Server 2022 with no developer tooling, Azure A10 CUDA, and macOS — plus a four-job CI smoke suite on every release.

Training data and methodology

Omi Med STT v1 is fine-tuned from NVIDIA Parakeet TDT 0.6B v2 on a curated corpus of English clinical audio. It combines openly-licensed datasets, access-controlled clinical sources, and Omi's own proprietary synthetic medical-speech dataset. Most of the audio is real recordings, with targeted synthetic speech added where real clinical data is scarce.

Training corpus — ~127 hours of distinct audio across ~38,700 clips.
| Audio | Hours | Share |
| --- | --- | --- |
| Real recordings | ~91 h | 71% |
| Targeted synthetic speech | ~36 h | 29% |

The medical material spans the full range of clinical speech — consultation dialogue, dictation, medication review, and procedure, device and test vocabulary — with general speech also represented so the model stays accurate on ordinary conversation.

The held-out benchmark was locked before final model selection, with zero overlap with training data. Scorer definitions, competitor versions and decode settings are published alongside the runtime. One honest caveat about our public 42-model benchmark: Omi Med STT was trained on part of that set's consultation data, so it can be scored fairly only on the files held out from training, not on the full public set. That run is a planned follow-up.

Limitations

Omi Med STT v1 is speech-to-text only. It is not a diagnostic, triage, prescribing, or clinical decision model, and it is not clinically validated. Transcripts must be reviewed before any clinical use.

What's next

Two things are in active development: a streaming model for live, in-consultation transcription rather than file-based processing, and multilingual support — Dutch, German, French and Spanish first, matching the markets Omi Scribe serves. Drug-name accuracy is the headline quality goal for v2.

Where to find it

Omi Med STT v1 is a derivative of nvidia/parakeet-tdt-0.6b-v2. It is not an NVIDIA model. Built with NVIDIA NeMo; the runtime interoperates with parakeet-mlx and parakeet.cpp.

Cite this model

APA — Omi Health. (2026). Omi Med STT v1: On-Device Medical Speech-to-Text. https://omi.health/research/omi-med-stt

@misc{omi_med_stt_v1_2026,
  title   = {Omi Med STT v1: On-Device Medical Speech-to-Text},
  author  = {{Omi Health}},
  year    = {2026},
  url     = {https://omi.health/research/omi-med-stt},
  note    = {0.6B medical ASR, fine-tuned from NVIDIA Parakeet TDT 0.6B v2, CC-BY-4.0}
}
