
omi-medical-edge-1: Open On-Device Medical Speech-to-Text

Published 8 June 2026 as “omi-med-stt-v1”, renamed omi-medical-edge-1 in July 2026 · CC-BY-4.0 weights, MIT runtime · Model on Hugging Face · Runtime on GitHub

[Interactive demo: clinical audio is transcribed on-device by omi-medical-edge-1 (Mac · CUDA · CPU, no cloud). Medical-term error rate (M-WER): Parakeet TDT 0.6B v2 3.80%, omi-medical-edge-1 2.16% (1.8× fewer errors). Word error rate: 6.64%.]

omi-medical-edge-1 adapts Parakeet TDT 0.6B v2 for local clinical transcription, reducing medical-term errors while shipping Mac, CUDA, and CPU runtimes.

Medical transcription has long forced a trade-off: the accurate options are closed cloud APIs, which means patient audio leaves the building, while the private, on-device options haven't been accurate enough for clinical dialogue. omi-medical-edge-1 is our attempt to close that gap: a 0.6-billion-parameter adaptation of NVIDIA's open Parakeet TDT 0.6B v2, evaluated on our sealed clinical benchmark, the same board that now ranks 28 systems, open and closed, on identical audio with one open scorer.

The short version: it is the most accurate open or local model we measured by overall word error rate, and it sits inside the band of the specialized medical cloud APIs. It runs entirely on your own hardware, and every shipped artifact (Mac, CUDA, CPU) carries its own published benchmark numbers. Weights are CC-BY-4.0; the runtime is MIT.

This page is the model write-up: what it is, how it was trained, and how to run it. The complete, current evaluation — leaderboard, dataset card, methodology — lives on our standing medical STT benchmark.

Parameters

0.6 billion

Base model

NVIDIA Parakeet TDT 0.6B v2

Training data

Curated real + synthetic clinical audio

Licence

CC-BY-4.0 weights · MIT runtime

How it was measured

On OmiMedSTT-Bench, our sealed clinical test set: 7.2 hours of English medical audio across five scenario types, scored by the same open pipeline for every system and ranked by Medical WER (M-WER), the error rate on clinically relevant terms. There is zero file-level overlap between our training audio and the test set. Full dataset card, scoring convention and the complete 28-system leaderboard: the benchmark page.
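To make the two metrics concrete, here is a minimal sketch of word error rate and a simplified medical-term error rate. It is not the published scorer: the function names, the tiny lexicon, and the "missed reference term" rule for M-WER are assumptions chosen for illustration, and the real pipeline's text normalization and term matching may differ.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def medical_term_error_rate(reference, hypothesis, lexicon):
    """Simplified stand-in for M-WER (assumption, not the official rule):
    share of lexicon terms in the reference that the hypothesis misses."""
    ref_terms = Counter(w for w in reference.lower().split() if w in lexicon)
    hyp_terms = Counter(w for w in hypothesis.lower().split() if w in lexicon)
    total = sum(ref_terms.values())
    if total == 0:
        raise ValueError("reference contains no lexicon terms")
    missed = sum(max(0, n - hyp_terms[t]) for t, n in ref_terms.items())
    return missed / total


# Hypothetical lexicon and transcripts, for illustration only.
LEXICON = {"amoxicillin", "metformin", "hba1c"}
REF = "start amoxicillin 500 mg three times daily"
HYP = "start amoxicilin 500 mg three times daily"

print(f"WER   = {wer(REF, HYP):.3f}")
print(f"M-WER = {medical_term_error_rate(REF, HYP, LEXICON):.3f}")
```

One misspelled drug name costs a single word in seven on overall WER, yet it is a total miss on the medical terms, which is why the board ranks by the second number.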

Results — July 2026 board

Current numbers on the standing benchmark (they have improved since the June release write-up as the scorer and board matured):

Best edge-deployable open medical model measured — #13 of 28 systems overall, ahead of most general cloud APIs and every other open model that runs on-device.
M-WER 2.16% · WER 6.64% — versus 3.80% / 12.48% for the base model it adapts (NVIDIA Parakeet TDT 0.6B v2): 1.8× fewer medical-term errors, 1.9× fewer word errors.
Zero runaway hallucinations on the full set — no looping or fabricated passages on any file.
The only open models scoring better medically are 9B-class research models that cannot run at edge latency; our flagship omi-medical-1 (#1 on the board) is the API/self-host option.
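The "1.8× fewer" and "1.9× fewer" figures above follow directly from the quoted error rates. A short check of the arithmetic (the function names are illustrative only):

```python
def fold_reduction(base_rate, adapted_rate):
    """How many times fewer errors the adapted model makes than the base."""
    return base_rate / adapted_rate


def relative_reduction_pct(base_rate, adapted_rate):
    """The same comparison expressed as a percentage drop in errors."""
    return 100.0 * (base_rate - adapted_rate) / base_rate


# Error rates in percent, as quoted for the July 2026 board.
BASE_MWER, EDGE_MWER = 3.80, 2.16
BASE_WER, EDGE_WER = 12.48, 6.64

print(round(fold_reduction(BASE_MWER, EDGE_MWER), 1))          # 1.8
print(round(fold_reduction(BASE_WER, EDGE_WER), 1))            # 1.9
print(round(relative_reduction_pct(BASE_MWER, EDGE_MWER), 1))  # 43.2
print(round(relative_reduction_pct(BASE_WER, EDGE_WER), 1))    # 46.8
```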

Full leaderboard and methodology →

One install, three runtimes

The model ships with an open runtime, omi-med-stt, that auto-selects the right engine and artifact for the machine it's on:

Terminal

$ pip install omi-med-stt
$ omi-med-stt consultation.wav

# Apple Silicon → MLX q8 (default)
$ pip install "omi-med-stt[mlx]"

# NVIDIA CUDA → NeMo full checkpoint
$ pip install "omi-med-stt[nemo]"

# Linux / Windows CPU → GGUF q8_0 via parakeet.cpp
$ omi-med-stt install-cpp
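The auto-selection described above could look roughly like the sketch below. This is not the runtime's actual code: `select_backend`, `detect_backend`, the returned labels, and the use of `nvidia-smi` on the PATH as a CUDA hint are all assumptions made for illustration.

```python
import platform
import shutil


def select_backend(system, machine, cuda_available):
    """Map a machine description to one of the three shipped artifacts.

    Labels are hypothetical names for the artifacts listed on this page.
    """
    if system == "Darwin" and machine == "arm64":
        return "mlx-q8"      # Apple Silicon: MLX q8 default
    if cuda_available:
        return "nemo"        # NVIDIA CUDA: full NeMo checkpoint
    return "gguf-q8_0"       # everything else: CPU fallback via parakeet.cpp


def detect_backend():
    """Inspect the current machine using the standard library only.

    Treating `nvidia-smi` on the PATH as evidence of a usable GPU is a
    crude heuristic (assumption); a real runtime would likely probe the
    CUDA driver or the installed framework instead.
    """
    cuda_available = shutil.which("nvidia-smi") is not None
    return select_backend(platform.system(), platform.machine(), cuda_available)


if __name__ == "__main__":
    print(detect_backend())
```

Keeping the decision in a pure function of three inputs makes the matrix easy to test without owning each kind of machine.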

Most releases quantize silently. We benchmarked every artifact we ship on the full locked test set, with the same scorer:

Runtime artifacts on the full locked benchmark (7.18 h). Lower WER is better.
Artifact	Platform	Size	Speed: time / 1 hour audio (× realtime)	M-WER	WER	Drug M-WER
Canonical NeMo `.nemo`	NVIDIA CUDA (reference)	2.5 GB	25s (146.3×)*	2.37%	8.30%	4.75%
MLX q8 — Mac default	Apple Silicon	0.94 GB	53s (67.4×)	2.75%	8.61%	5.20%
MLX full precision	Apple Silicon	2.5 GB	56s (64.5×)	2.65%	8.59%	5.20%
GGUF q8_0 — CPU default	Linux / Windows CPU	0.93 GB	2m 53s (20.8×)†	3.20%	9.12%	6.33%

Numbers in this table are the release-time snapshot (June 2026 scorer), kept for artifact-to-artifact comparison; the canonical artifact's current board numbers are M-WER 2.16% / WER 6.64% (benchmark). Speed = time to transcribe 1 hour of audio (× realtime in brackets).

* Direct full-audio inference on an NVIDIA A10; the installed CLI, which auto-chunks long recordings, measures 37s (96.8×).
† 32-thread Linux CPU. The CPU path is a portability fallback, not the quality path; prefer CUDA or Apple Silicon when available.

The CLI auto-chunks audio above 240 seconds on the NeMo path at a measured cost of roughly one WER point on those files; below 240 seconds its output is identical to direct NeMo inference.
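Two pieces of arithmetic sit behind the notes above: converting between "seconds per hour of audio" and "× realtime", and splitting a long recording at the 240-second threshold. The sketch below shows both. The chunking scheme (fixed 240-second windows with no overlap) is an assumption; the page states only the threshold, not how the CLI cuts or stitches chunks.

```python
SECONDS_PER_HOUR = 3600.0


def realtime_factor(seconds_per_hour_of_audio):
    """Convert processing time for 1 h of audio into a x-realtime factor."""
    return SECONDS_PER_HOUR / seconds_per_hour_of_audio


def seconds_for_one_hour(factor):
    """Inverse conversion: x-realtime factor to processing seconds."""
    return SECONDS_PER_HOUR / factor


def plan_chunks(duration_s, max_chunk_s=240.0):
    """Return (start, end) windows in seconds.

    Recordings at or under the threshold stay whole; longer ones are cut
    into back-to-back windows (assumed scheme, no overlap).
    """
    if duration_s <= max_chunk_s:
        return [(0.0, duration_s)]
    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_chunk_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks


# The table rounds times to whole seconds, so 25 s gives 144x here, while
# the quoted 146.3x corresponds to roughly 24.6 s before rounding.
print(realtime_factor(25))                    # 144.0
print(round(seconds_for_one_hour(146.3), 1))  # 24.6
print(plan_chunks(600.0))
print(plan_chunks(200.0))
```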

The Mac default is q8 because the evidence supports it: 2.7× smaller and slightly faster than full precision, with a Drug M-WER identical to the full-precision export (5.20%) and overall WER within 0.02 points. We also built a 4-bit artifact; its raw WER looked fine, but its Drug M-WER regressed (5.88% vs 5.20%), so we didn't ship it. For a medical model, safety metrics outrank file size.
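The shipping decision described above can be expressed as a simple release gate on the safety metric. This is a sketch of the reasoning, not the team's actual tooling: `ArtifactScore`, `passes_drug_gate`, and the zero default tolerance are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactScore:
    name: str
    drug_mwer: float  # Drug M-WER in percent; lower is better


def passes_drug_gate(candidate, reference, tolerance=0.0):
    """A quantized artifact ships only if its drug-name error rate does not
    exceed the full-precision reference by more than `tolerance` points."""
    return candidate.drug_mwer <= reference.drug_mwer + tolerance


# Drug M-WER values as quoted on this page.
mlx_full = ArtifactScore("MLX full precision", 5.20)
mlx_q8 = ArtifactScore("MLX q8", 5.20)
mlx_q4 = ArtifactScore("MLX 4-bit (not shipped)", 5.88)

for artifact in (mlx_q8, mlx_q4):
    verdict = "ship" if passes_drug_gate(artifact, mlx_full) else "hold"
    print(f"{artifact.name}: {verdict}")
```

Gating on the drug metric rather than overall WER is the point: the 4-bit build would have passed a WER-only check.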

The runtime matrix is verified on real machines — clean Ubuntu 22.04, a pristine Windows Server 2022 with no developer tooling, Azure A10 CUDA, and macOS — plus a four-job CI smoke suite on every release.

Training data and methodology

omi-medical-edge-1 is fine-tuned from NVIDIA Parakeet TDT 0.6B v2 on a curated corpus of English clinical audio. It combines openly-licensed datasets, access-controlled clinical sources, and Omi's own proprietary synthetic medical-speech dataset. Most of the audio is real recordings, with targeted synthetic speech added where real clinical data is scarce.

Training corpus — ~127 hours of distinct audio across ~38,700 clips.
Audio	Hours	Share
Real recordings	~91 h	71%
Targeted synthetic speech	~36 h	29%

The medical material spans the full range of clinical speech — consultation dialogue, dictation, medication review, and procedure, device and test vocabulary — with general speech also represented so the model stays accurate on ordinary conversation.

The held-out benchmark was locked before final model selection, with zero overlap with training data. Scorer definitions, competitor versions and decode settings are published alongside the runtime. One honest caveat about the original public PriMock57 benchmark (v1–v4, archived): our models trained on part of that public dataset, so they cannot be scored fairly on it — which is exactly why the standing benchmark moved to a sealed test set.

Limitations

Drug names are the hardest category (4.75% Drug M-WER, 95.3% recall) and the explicit focus of the next version — the top specialized medical APIs are stronger here today.
Full conversation audio is the hardest setting (10.3% WER on the consultations group, 7.1% drug M-WER); dictation-style audio is considerably stronger (~4% WER).
English only, for now (see below). Audio is converted to 16 kHz mono automatically by the runtime.
Quantized artifacts trade a little accuracy for size — each artifact's own numbers are published in the table above.
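For readers curious what "converted to 16 kHz mono" involves, the sketch below downmixes by averaging channels and resamples by linear interpolation, using plain Python lists. It is an illustration under stated assumptions, not the runtime's method: production resamplers normally apply a band-limited (anti-aliasing) filter, which this example omits, and the function names are hypothetical.

```python
TARGET_RATE = 16000


def to_mono(frames):
    """Average the channels of each frame (one tuple of values per frame)."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples, src_rate, dst_rate=TARGET_RATE):
    """Resample by linear interpolation between neighbouring samples.

    No low-pass filter is applied, so downsampling this way can alias.
    """
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate        # position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


# One second of synthetic 48 kHz stereo becomes 16,000 mono samples.
stereo = [(float(i), float(i)) for i in range(48000)]
mono_16k = resample_linear(to_mono(stereo), src_rate=48000)
print(len(mono_16k))
```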

omi-medical-edge-1 is speech-to-text only. It is not a diagnostic, triage, prescribing, or clinical decision model, and it is not clinically validated. Transcripts must be reviewed before any clinical use.

What's next

Two things are in active development: a streaming model for live, in-consultation transcription rather than file-based processing, and multilingual support, with Dutch, German, French and Spanish first, matching the markets our speech-to-text models serve. Drug-name accuracy is the headline quality goal for v2.

Where to find it

Canonical model (NeMo): huggingface.co/omi-health/omi-med-stt-v1
Apple Silicon MLX q8 (Mac default): omi-med-stt-v1-mlx-q8 · MLX full: omi-med-stt-v1-mlx
CPU GGUF (parakeet.cpp): omi-med-stt-v1-gguf
Runtime CLI (MIT): github.com/Omi-Health/omi-med-stt-runtime · PyPI
Pairs with: Omi-Sum 3B — transcript → structured SOAP note, for a fully local audio-to-note pipeline.
Contact: [email protected]

omi-medical-edge-1 is a derivative of nvidia/parakeet-tdt-0.6b-v2. It is not an NVIDIA model. Built with NVIDIA NeMo; the runtime interoperates with parakeet-mlx and parakeet.cpp.

Cite this model

APA — Omi Health. (2026). omi-medical-edge-1: On-Device Medical Speech-to-Text. https://omi.health/research/omi-med-stt

@misc{omi_med_stt_v1_2026,
  title   = {omi-medical-edge-1: On-Device Medical Speech-to-Text},
  author  = {{Omi Health}},
  year    = {2026},
  url     = {https://omi.health/research/omi-med-stt},
  note    = {0.6B medical ASR, fine-tuned from NVIDIA Parakeet TDT 0.6B v2, CC-BY-4.0}
}

Related research

Medical Speech-to-Text Benchmark — 42 models ranked by Medical WER on public data
Clinical SOAP Note Safety Evaluation — 6 models, 300 dialogues, safety-first scoring
Omi-Sum 3B — open-source clinical model for SOAP note summarization