AI scribes don't just hallucinate. They forget.
We tested 8 frontier AI models on 300 doctor–patient conversations. The biggest safety problem was not hallucination. It was omission: the models missed 520 safety-critical facts, 43× more often than they invented one.
- We tested 8 frontier AI models on 300 doctor–patient conversations, graded by a 4-model cross-family judging panel on prose, safety, and cost.
- Hallucinations were rare; omissions were 43× more common — and a clinician can't review what isn't on the page.
- Omi Guard recovered the dropped facts (520 → 0) and flagged all 12 hallucinations confirmed by the judging panel — adding zero new ones, at ~$0.0007/note.
- The benchmark is synthetic and reproducible; clinical-workflow validation is next.
We're Omi Health, and two principles drive our research: safety and openness. So we published every transcript, note, price, and scoring script behind this benchmark. If you're choosing an AI scribe, or building one, this is meant to help you decide. One finding reframed the whole problem for us, so we'll start there.
The failure everyone misses
Everyone worries about AI scribes hallucinating. In our benchmark, that wasn't the main failure mode.
Across 2,400 model-written notes, we found 12 confirmed high-impact hallucinations — and 520 missed safety facts. The quieter failure was 43× more common.
Figure: Safety failures across base writer notes. Across 2,400 base-writer notes (8 models × 300 dialogues), missed safety facts were 43× more common than confirmed hallucinations.
If you've been worried about AI scribes making things up, you've been watching the rarer failure. The asymmetry matters most at sign-off: hallucinations are visible; omissions are silent. A clinician can delete a bad sentence they can see. They can't correct a medication, allergy, or follow-up plan that never made it onto the page.
So we stopped scoring notes only on how fluent they sound, and built a benchmark to measure what actually matters.
How we ran it
We ran a first version in 2025 (6 models, one safety score). This year we rebuilt it to be harder to game. Recent work — including the June 2026 Nature Medicine comparison of frontier and specialized clinical AI — drew criticism for test leakage, single-vendor judges, and private datasets. We designed v2 around the opposite of each.
The eight writers include the frontier models in that debate, plus the cheaper models a hospital might realistically self-host. The question underneath is simple: do you need purpose-built clinical AI, or does a general model with the right guardrails do the job?
- 8 frontier AI models: GPT-5.5 · GPT-5.4-mini · DeepSeek-V4-Pro · Kimi K2.6 · Sonnet 4.6 · Opus 4.8 · Gemini 3.1 Pro · Gemini 3.5 Flash
- 300 synthetic dialogues: each writer drafts a SOAP note for every conversation, for 2,400 notes in total.
- 4 LLM judges: a cross-family panel (Anthropic · DeepSeek · OpenAI · Google). A result only counts on a majority.
- Prose: how well does it write a note, compared to GPT-5.5?
- Safety: does it invent facts, or drop the ones that matter?
- Cost & speed: what does each note cost, and how fast is it?
Concretely: an open-ended note-writing task, a cross-family judging panel where a result only counts on a majority, and public list prices for cost and speed. We read each note three ways — prose, safety, and cost & speed. Full method and every artifact are in Reproduce it.
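The majority rule can be sketched in a few lines. This is a minimal illustration, assuming "majority" means strictly more than half of the four judges, so a 2-2 split does not count; the function name and the vote encoding are hypothetical and are not taken from the published scripts.

```python
from typing import Mapping

# The four judge families named in the text.
JUDGE_FAMILIES = ("anthropic", "deepseek", "openai", "google")


def panel_confirms(votes: Mapping[str, bool]) -> bool:
    """Return True when a strict majority of judges confirm a finding.

    Assumption: with four judges, at least three must agree; a 2-2 split
    is treated as not confirmed.
    """
    missing = set(JUDGE_FAMILIES) - set(votes)
    if missing:
        raise ValueError(f"missing votes from: {sorted(missing)}")
    yes = sum(bool(votes[judge]) for judge in JUDGE_FAMILIES)
    return yes > len(JUDGE_FAMILIES) / 2


three_of_four = {"anthropic": True, "deepseek": True, "openai": True, "google": False}
split = {"anthropic": True, "deepseek": True, "openai": False, "google": False}
print(panel_confirms(three_of_four), panel_confirms(split))  # True False
```

Requiring a vote from every family keeps a single vendor's judge from deciding a result on its own.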
One caveat before the results: Omi built Guard, and Omi ran this evaluation. We disclose that up front. The reason to trust or challenge the numbers is not our word for it — it is the published corpus and scripts.
First, how the notes read.
Prose — how the note reads
Takeaway: only Sonnet and DeepSeek beat GPT-5.5 on prose.
For each dialogue, judges compared GPT-5.5's note with a challenger's note, blind to identity and with order counterbalanced. That gives every writer a win / tie / loss record against the same anchor.
Figure: Prose quality, each writer vs GPT-5.5. Green wins, grey ties, red losses across 300 paired note comparisons.
What we found. Sonnet and DeepSeek clearly beat the GPT-5.5 anchor on prose. GPT-5.4-mini and both Geminis tie with it. Opus loses despite being the largest Claude model; its verbose style works against the compactness a clinical note needs. Bigger isn't better here.
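The paired comparison behind these records can be sketched as follows. This is a simplified illustration, not the published scoring script: it uses a single judge callable instead of the four-judge panel, and it assumes counterbalancing is done by alternating which note is shown first. All names are hypothetical.

```python
from collections import Counter
from typing import Callable

# A judge sees two anonymous notes and answers "A", "B", or "tie".
Judge = Callable[[str, str], str]


def compare_to_anchor(pairs: list[tuple[str, str]], judge: Judge) -> Counter:
    """Tally challenger wins, ties, and losses against the anchor.

    `pairs` holds (anchor_note, challenger_note) for each dialogue.
    Presentation order alternates by index so neither writer is always
    shown first (assumed counterbalancing scheme).
    """
    record = Counter(win=0, tie=0, loss=0)
    for i, (anchor, challenger) in enumerate(pairs):
        challenger_first = i % 2 == 1
        a, b = (challenger, anchor) if challenger_first else (anchor, challenger)
        verdict = judge(a, b)
        if verdict == "tie":
            record["tie"] += 1
        else:
            challenger_slot = "A" if challenger_first else "B"
            record["win" if verdict == challenger_slot else "loss"] += 1
    return record


# A judge that always prefers whichever note is shown first.
position_biased = lambda a, b: "A"
pairs = [("anchor note", "challenger note")] * 4
print(compare_to_anchor(pairs, position_biased))
```

With alternating order, a judge that always picks the first note produces two wins and two losses, so pure position bias cancels out instead of favouring one writer.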
Safety — what it invents, and what it forgets
Takeaway: forgetting varies far more than hallucination.
Not every miss is equal. We separate severity in both directions:
- Hallucinations. A high-impact hallucination is a clinically serious unsupported claim — an invented medication, or a symptom the patient actually denied. A low-impact one is a wording or formatting slip with no clinical consequence.
- Omissions. A high-impact omission is a dropped fact in a safety-critical field — a medication, allergy, assessment, or plan item. A low-impact omission is narrative or context detail a clinician might want but that isn't safety-critical.
Every figure below is high-impact: hallucinations confirmed by panel majority, and omissions checked against a transcript-grounded evidence ledger. Per writer, the two risks pull apart; the safest writers sit at the top right of the map.
Figure: Safety map, hallucinations vs missed facts. Right means fewer hallucinations; up means fewer missed safety facts.
What we found. Fabrication is rare: the Claudes and Kimi have zero confirmed hallucinations; the rest have one to five. Forgetting is where models separate. Opus omits the fewest facts (35), DeepSeek the most (161). No base writer is clean on both.
Cost & speed — can you run it at scale
Takeaway: the cheap models are fast enough for production.
We priced every note at public list rates and measured typical response time. No internal discounts, no partner credits. Cheap and fast is the top right of the chart.
Figure: Cost and speed per note. Right means cheaper; up means faster response time.
What we found. Cost and speed vary enormously. The priciest model (Kimi K2.6) costs about 9× the cheapest (DeepSeek-V4-Pro); the slowest takes about 10× longer than the fastest. GPT-5.4-mini and Gemini 3.5 Flash own the cheap-and-fast corner. DeepSeek is cheapest but slower. Kimi is expensive and slow.
Before any safety layer: no model wins on everything
Takeaway: every writer is strong on some axes and weak on others.
Pulling the three readings together, no writer wins on every axis. The cheapest models forget more. The most fact-complete models are pricey or slow.
| Writer | High-impact hallucinations | Missed safety facts | $/note | Speed |
|---|---|---|---|---|
| Claude Opus 4.8 | 0 | 35 | $0.0305 | 9.8s |
| GPT-5.5 | 3 | 46 | $0.0287 | 8.3s |
| GPT-5.4-mini | 2 | 47 | $0.0042 | 4.8s |
| Gemini 3.5 Flash | 1 | 54 | $0.0089 | 4.6s |
| Claude Sonnet 4.6 | 0 | 58 | $0.0180 | 15.5s |
| Kimi K2.6 | 0 | 58 | $0.0330 | 46.4s |
| Gemini 3.1 Pro | 1 | 61 | $0.0316 | 20.7s |
| DeepSeek-V4-Pro | 5 | 161 | $0.0038 | 17.3s |
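The headline figures can be re-derived from the table above. The short check below uses only the table's numbers; the variable names are illustrative.

```python
# (high-impact hallucinations, missed safety facts, $/note, seconds per note),
# copied from the table above.
BASE_WRITERS = {
    "Claude Opus 4.8":   (0, 35, 0.0305, 9.8),
    "GPT-5.5":           (3, 46, 0.0287, 8.3),
    "GPT-5.4-mini":      (2, 47, 0.0042, 4.8),
    "Gemini 3.5 Flash":  (1, 54, 0.0089, 4.6),
    "Claude Sonnet 4.6": (0, 58, 0.0180, 15.5),
    "Kimi K2.6":         (0, 58, 0.0330, 46.4),
    "Gemini 3.1 Pro":    (1, 61, 0.0316, 20.7),
    "DeepSeek-V4-Pro":   (5, 161, 0.0038, 17.3),
}

rows = list(BASE_WRITERS.values())
total_hallucinations = sum(r[0] for r in rows)
total_missed = sum(r[1] for r in rows)
ratio = total_missed / total_hallucinations

costs = [r[2] for r in rows]
speeds = [r[3] for r in rows]
cost_spread = max(costs) / min(costs)
speed_spread = max(speeds) / min(speeds)

print(total_hallucinations, total_missed, round(ratio, 1))  # 12 520 43.3
print(round(cost_spread, 1), round(speed_spread, 1))        # 8.7 10.1
```

The per-writer counts sum to the 12 hallucinations and 520 missed facts quoted earlier, and their ratio gives the 43× figure.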
So we tested a different architecture: keep the writer, but wrap it in a layer that checks every line against the conversation, restores missing facts, and flags claims it cannot verify. That's Omi Guard.
The Omi Guard move: don't ask the model to be perfect — wrap it
Omi Guard sits between any writer and the final note. It works in both directions: recovering facts the writer dropped and flagging claims the conversation does not support. Each action is tied to transcript evidence:
- Recover a dropped fact. Restore an omitted medication, allergy, or plan item that the conversation supports.
- Catch a hallucination. Surface a suspected hallucination, or any claim it can't verify, for the clinician, with the contradicting snippet. The writer's text is never silently changed.
- Remove uncited filler. Strip boilerplate the conversation never supported. Claims backed by evidence are kept.
Guard is conservative by design: when it can't verify a claim, it abstains rather than guess. Its internals are proprietary; the scored outputs are published and reproducible.
What changed, before and after Guard
Takeaway: Guard recovers the misses and flags the confirmed writer hallucinations.
Running the same 2,400 notes through Guard closed both measured failure modes, and Guard introduced no new confirmed hallucinations.
Before Guard, those 520 facts were absent from the notes and the 12 confirmed hallucinations sat unmarked. After Guard, dropped facts are restored from the transcript and unsupported claims carry a review flag with evidence.
And the ranking? The cheap models lead
Takeaway: once omissions are recovered, small cheap writers become deployable.
The Safe Note Efficiency Index weights safety most heavily: 70% safety, 20% cost, 10% speed. Prose is excluded because a fluent unsafe note is still unsafe.
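The weighting can be illustrated with a short sketch. Only the 70/20/10 weights come from the text. How each sub-score is normalised is not described here, so the min-max helper below is an assumption, and this sketch is not expected to reproduce the published index values; the repo's scoring scripts are the reference.

```python
# Weights stated in the text: 70% safety, 20% cost, 10% speed.
WEIGHTS = {"safety": 0.70, "cost": 0.20, "speed": 0.10}


def min_max_lower_is_better(value: float, best: float, worst: float) -> float:
    """Map a raw cost or latency onto [0, 1], where 1 is the best observed value.

    Assumed normalisation, chosen for illustration only.
    """
    if best == worst:
        return 1.0
    return (worst - value) / (worst - best)


def safe_note_index(safety: float, cost: float, speed: float) -> float:
    """Combine three sub-scores, each already in [0, 1], into a 0-100 index."""
    scores = {"safety": safety, "cost": cost, "speed": speed}
    for name, score in scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1], got {score}")
    return 100 * sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

Because safety carries 70% of the weight, a writer with a perfect safety score but the worst cost and speed scores still reaches about 70, while a writer that is cheapest and fastest but scores zero on safety cannot exceed about 30.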
The chart maps every writer with cheaper to the right and safer toward the top. With Guard on, the movement lines show missed facts being recovered, lifting each writer into the safe top half.
Figure: Safe Note Efficiency, before and after Omi Guard. Right means cheaper; up means safer after missed facts are recovered.
With Guard, the combined index is led by cheaper writers because the safety gap has been closed in the measured notes:
| # | Stack | Index | $/note | Speed | Notes |
|---|---|---|---|---|---|
| 1 | GPT-5.4-mini + Guard | 98.70 | $0.0055 | 9.5s | best overall |
| 2 | DeepSeek-V4-Pro + Guard | 96.33 | $0.0050 | 23.2s | cheapest writer |
| 3 | Gemini 3.5 Flash + Guard | 92.59 | $0.0096 | 11.9s | cheap + fast value pick |
| 4 | Claude Sonnet 4.6 + Guard | 82.61 | $0.0193 | 22.5s | strongest prose |
| 5 | GPT-5.5 + Guard | 80.23 | $0.0305 | 13.4s | anchor |
| 6 | Claude Opus 4.8 + Guard | 79.03 | $0.0316 | 16.0s | verbose |
| 7 | Gemini 3.1 Pro + Guard | 76.01 | $0.0323 | 28.1s | cheap but slow |
| 8 | Kimi K2.6 + Guard | 69.92 | $0.0343 | 51.7s | reasoning bloat |
Cost is writer token usage at public list prices plus Guard's marginal ~$0.0007/note; speed is the typical (median) time per note end to end. Gemini "thinks" heavily by default, which inflates its cost and latency without changing the note; we run it at its minimum thinking budget for a like-for-like comparison. Full methodology, token counts, and prices are in the repo.
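The cost definition above reduces to simple arithmetic. The sketch below is a minimal illustration: only the ~$0.0007 Guard figure comes from the text, while the function name, the per-million-token pricing convention, and the token counts and prices in the example are hypothetical placeholders. The real token counts and prices are in the repo.

```python
# Approximate marginal Guard cost per note, as stated in the text.
GUARD_COST_PER_NOTE = 0.0007


def note_cost(
    input_tokens: int,
    output_tokens: int,
    price_in_per_mtok: float,
    price_out_per_mtok: float,
    with_guard: bool = False,
) -> float:
    """Dollar cost of one note: writer tokens at list price, plus Guard if used."""
    writer = (
        input_tokens * price_in_per_mtok + output_tokens * price_out_per_mtok
    ) / 1_000_000
    return writer + (GUARD_COST_PER_NOTE if with_guard else 0.0)


# Placeholder values for illustration only, not measured data.
base = note_cost(2_000, 500, price_in_per_mtok=1.0, price_out_per_mtok=4.0)
guarded = note_cost(2_000, 500, 1.0, 4.0, with_guard=True)
print(f"{base:.4f} {guarded:.4f}")  # 0.0040 0.0047
```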
Why it matters for hospitals
This is the architecture Omi Scribe is built around. It is self-hosted: hospitals run it inside their own private cloud and bring their own model from Azure AI Foundry, Google Vertex AI, or Amazon Bedrock. Omi Guard is the safety layer around it.
- Bring your own writer. Foundry, Vertex AI, or Bedrock, running in your own tenancy. Swap models as the frontier moves.
- Deterministic safety layer. Recovers facts the writer dropped, flags the writer's hallucinations for review, and abstains when unsure.
- A note with lineage. Every line traceable to a moment in the conversation, with a full history of every change.
Runs entirely within the hospital's private cloud — no patient data leaves the tenancy.
Because every Guard action is logged, the note arrives with lineage. Each recovered fact, removed filler line, and review flag carries the transcript line that justified it. That's the difference between "the AI rewrote my note" and "the AI showed me its work."
Added to Medications: Lisinopril 20 mg once daily — from the patient mentioning their morning dose (lines 12–13).
In Plan: "No chest discomfort" — but the conversation says "actually, yes — sometimes after walking up stairs." Please review before signing.
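One way to picture a lineage entry is as a small record per Guard action. The schema below is a hypothetical sketch built from the two examples above; Guard's internals and its actual log format are not published, so every field name here is an assumption.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass(frozen=True)
class GuardAction:
    """One logged action with the transcript evidence that justified it."""

    kind: Literal["recover", "flag", "remove_filler"]
    section: str    # SOAP section the action touches
    note_text: str  # text that was added, flagged, or removed
    evidence: str   # transcript wording or summary behind the action
    transcript_lines: Optional[tuple[int, int]] = None  # inclusive range, if known


def pending_review(actions: list[GuardAction]) -> list[GuardAction]:
    """Return the flags, which wait for the clinician instead of being auto-applied."""
    return [action for action in actions if action.kind == "flag"]


log = [
    GuardAction("recover", "Medications", "Lisinopril 20 mg once daily",
                "patient mentions their morning dose", (12, 13)),
    GuardAction("flag", "Plan", "No chest discomfort",
                "sometimes after walking up stairs"),
]
for action in pending_review(log):
    print(f"Review {action.section}: {action.note_text!r} vs {action.evidence!r}")
```

Keeping the evidence on the record itself means a reviewer can check each change without reopening the whole transcript.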
Why flag, and not delete?
Guard surfaces suspected hallucinations; it never silently removes the writer's text. A wrong deletion is invisible. A flag keeps the clinician in control and preserves the signing workflow. Once real-world use proves which flags are reliably correct, the highest-confidence ones can be auto-applied — but flag-first earns that trust the right way round.
What this benchmark does not prove
Stated plainly:
- Clinical outcomes. This is a transcript-grounded benchmark, not a clinical trial.
- Flag precision. We measure recall against the 12 confirmed writer hallucinations. Whether every flag is clinically useful is a clinician question we answer next, with partners.
- Generalization. All dialogues are synthetic, English, ambulatory primary-care.
- Independence. Omi built Guard and ran the evaluation. That is why the corpus and scripts are public.
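The distinction between recall and flag precision in the list above can be made concrete. Recall uses the figures reported in this post (12 of 12 confirmed hallucinations flagged). Precision would also need the total number of flags and how many a clinician judged correct, which this benchmark does not report, so the second function is shown without example values. Function names are illustrative.

```python
def recall(caught: int, confirmed: int) -> float:
    """Share of confirmed hallucinations that received a flag."""
    if confirmed <= 0:
        raise ValueError("confirmed must be positive")
    return caught / confirmed


def precision(correct_flags: int, total_flags: int) -> float:
    """Share of raised flags that a reviewer judged to be correct."""
    if total_flags <= 0:
        raise ValueError("total_flags must be positive")
    return correct_flags / total_flags


print(recall(caught=12, confirmed=12))  # 1.0
```

A layer can reach perfect recall while still raising many unhelpful flags, which is why precision is listed as an open question.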
We're publishing this because it's reproducible, not because it's the final word. Clinical-workflow validation with partners is the next step.
Reproduce it
The corpus, all 4,800 before/after notes, token and price data, and scoring scripts are open-source. Guard already ran; these scripts score its published output, so you can verify every number without Guard internals:
- GitHub: github.com/Omi-Health/medical-note-eval · MIT licence
- Re-run the cross-family verifier panel: `python scripts/score_panel.py --writer mini --arm guarded`
- Check Guard-attributable unsupported additions: `python scripts/symmetric_reconcile.py --writer mini`
- Check the recovered-omissions count: `python scripts/recall_check.py --writer mini`
- Reproduce every per-note cost: `python scripts/cost_report.py`
Disagree with a number? Open an issue. We'd rather be corrected than wrong in public.
We're enrolling founding partners for Omi Guard
We're enrolling founding partners to test Omi Guard on real clinical conversations, inside their own cloud and with their own models. The goal is simple: measure whether recovered facts, review flags, and transcript lineage improve trust at sign-off.
Talk to us: [email protected]
Cite this benchmark
APA — Omi Health. (2026). Trustworthy AI Notes: Evaluating 8 Frontier Writers With and Without the Omi Guard Safety Layer. https://omi.health/research/note-eval-v2
@misc{omi_note_eval_v2_2026,
title = {Trustworthy AI Notes: 8 Frontier Writers + Omi Guard},
author = {{Omi Health}},
year = {2026},
url = {https://omi.health/research/note-eval-v2},
note = {8 writers, 300 dialogues, before/after a deterministic safety layer}
}
Related research
- SOAP Note Safety Benchmark v1 — the original 6-model evaluation this builds on
- Medical Speech-to-Text Benchmark — 42 models ranked by Medical WER
- Omi Med STT v1 — our on-device 0.6B medical speech-to-text model
- Omi-Sum 3B — open-source clinical model for SOAP note summarization