Research · Omi Guard

AI scribes don't just hallucinate. They forget.

We tested 8 frontier AI models on 300 doctor–patient conversations. The biggest safety problem was not hallucination. It was omission: the models missed 520 safety-critical facts, 43× more often than they invented one.

Published 17 June 2026 · 8 writers · 300 dialogues · 4,800 notes · Code & data on GitHub

8 note-writing models

300 clinical dialogues

2,400 base notes scored

43× more omissions than hallucinations

#1 GPT-5.4-mini + Guard on efficiency

TL;DR

We tested 8 frontier AI models on 300 doctor–patient conversations, graded by a 4-model cross-family judging panel on prose, safety, and cost.
Hallucinations were rare; omissions were 43× more common — and a clinician can't review what isn't on the page.
Omi Guard recovered the dropped facts (520 → 0) and flagged all 12 hallucinations confirmed by the judging panel — adding zero new ones, at ~$0.0007/note.
The benchmark is synthetic and reproducible; clinical-workflow validation is next.

We're Omi Health, and two principles drive our research: safety and openness. So we published every transcript, note, price, and scoring script behind this benchmark. If you're choosing an AI scribe, or building one, this is meant to help you decide. One finding reframed the whole problem for us, so we'll start there.

The failure everyone misses

Everyone worries about AI scribes hallucinating. In our benchmark, that wasn't the main failure mode.

Across 2,400 model-written notes, we found 12 confirmed high-impact hallucinations — and 520 missed safety facts. The quieter failure was 43× more common.

Safety failures across base writer notes. Missed safety facts were 43× more common than confirmed hallucinations.

Confirmed high-impact hallucinations (what it invents): 12

Missed safety facts (meds, allergies, assessment, plan): 520

Across 2,400 base-writer notes (8 models × 300 dialogues). Forgetting was 43× more common than fabricating.

If you've been worried about AI scribes making things up, you've been watching the rarer failure. The asymmetry matters most at sign-off: hallucinations are visible; omissions are silent. A clinician can delete a bad sentence they can see. They can't correct a medication, allergy, or follow-up plan that never made it onto the page.

So we stopped scoring notes only on how fluent they sound, and built a benchmark to measure what actually matters.

How we ran it

We ran a first version in 2025 (6 models, one safety score). This year we rebuilt it to be harder to game. Recent work — including the June 2026 Nature Medicine comparison of frontier and specialized clinical AI — drew criticism for test leakage, single-vendor judges, and private datasets. We designed v2 around the opposite of each.

The eight writers include the frontier models in that debate, plus the cheaper models a hospital might realistically self-host. The question underneath is simple: do you need purpose-built clinical AI, or does a general model with the right guardrails do the job?

The setup

8 frontier AI models

GPT-5.5 · GPT-5.4-mini · DeepSeek-V4-Pro · Kimi K2.6 · Sonnet 4.6 · Opus 4.8 · Gemini 3.1 Pro · Gemini 3.5 Flash

300 synthetic dialogues

Each writer drafts a SOAP note for every conversation — 2,400 notes.

4 LLM judges

Cross-family panel: Anthropic · DeepSeek · OpenAI · Google. A result counts only when a majority of the judges agree.

Prose

How well does it write a note, compared to GPT-5.5?

Safety

Does it invent facts — or drop the ones that matter?

Cost & speed

What does each note cost, and how fast is it?

We then re-ran every writer with Omi Guard on top and scored those too — 4,800 notes in all, base and guarded — so every number below has a with-Guard counterpart.

Concretely: an open-ended note-writing task, a cross-family judging panel where a result only counts on a majority, and public list prices for cost and speed. We read each note three ways — prose, safety, and cost & speed. Full method and every artifact are in Reproduce it.
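As a rough illustration of the majority rule, the sketch below treats each judge as casting a yes/no vote on whether a finding is real. The function name `panel_confirms` and the strict-majority threshold (more than half, so 3 of 4) are assumptions; the article says only that a result counts on a majority, and the repo's scoring scripts are the authority on how a 2-2 split is handled.

```python
from typing import Mapping


def panel_confirms(votes: Mapping[str, bool]) -> bool:
    """Return True when a strict majority of judges confirm a finding.

    Assumption: a 2-2 split on a four-judge panel does NOT count.
    """
    if not votes:
        raise ValueError("at least one judge vote is required")
    yes = sum(1 for vote in votes.values() if vote)
    return yes * 2 > len(votes)


# Hypothetical votes on one suspected hallucination, keyed by judge family.
votes = {"anthropic": True, "deepseek": True, "openai": True, "google": False}
print(panel_confirms(votes))  # True: 3 of 4 judges agree
```

Requiring a strict majority keeps any single vendor's judge from confirming a finding on its own.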

One caveat before the results: Omi built Guard, and Omi ran this evaluation. We disclose that up front. The reason to trust or challenge the numbers is not our word for it — it is the published corpus and scripts.

First, how the notes read.

Prose — how the note reads

Takeaway: only Sonnet and DeepSeek beat GPT-5.5 on prose.

For each dialogue, judges compared GPT-5.5's note with a challenger's note, blind to identity and with order counterbalanced. That gives every writer a win / tie / loss record against the same anchor.

Prose quality: each writer vs GPT-5.5. Green wins, grey ties, red losses across 300 paired note comparisons.

Green = wins vs GPT-5.5 · grey = ties · red = losses, out of 300 head-to-head dialogues. GPT-5.5 is the anchor and isn't compared to itself.
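The win / tie / loss record can be sketched as follows. This is a minimal illustration, not the repo's scoring code: the function names are hypothetical, and alternating the anchor's slot by dialogue index is just one simple way to counterbalance order. The judge sees only slots "A" and "B" and answers "A", "B", or "tie".

```python
from collections import Counter
from typing import Dict, Sequence


def anchor_slot_for(dialogue_index: int) -> str:
    """Counterbalance order: the anchor sits in slot A on even dialogues, B on odd ones."""
    return "A" if dialogue_index % 2 == 0 else "B"


def challenger_outcome(preferred_slot: str, anchor_slot: str) -> str:
    """Map a judge's blind preference back to a result for the challenger."""
    if preferred_slot not in ("A", "B", "tie"):
        raise ValueError(f"unexpected preference: {preferred_slot!r}")
    if preferred_slot == "tie":
        return "tie"
    return "loss" if preferred_slot == anchor_slot else "win"


def tally(preferences: Sequence[str]) -> Dict[str, int]:
    """Build the challenger's record from one preference per dialogue, in dialogue order."""
    record = Counter({"win": 0, "tie": 0, "loss": 0})
    for index, preferred in enumerate(preferences):
        record[challenger_outcome(preferred, anchor_slot_for(index))] += 1
    return dict(record)


# Hypothetical preferences for four dialogues.
print(tally(["B", "B", "tie", "A"]))  # {'win': 2, 'tie': 1, 'loss': 1}
```

Because the anchor's slot alternates, a judge that always favours the first note it reads cannot systematically help or hurt the challenger.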

What we found. Sonnet and DeepSeek clearly beat the GPT-5.5 anchor on prose. GPT-5.4-mini and both Geminis tie with it. Opus loses despite being the largest Claude model; its verbose style works against the compactness a clinical note needs. Bigger isn't better here.

Safety — what it invents, and what it forgets

Takeaway: forgetting varies far more than hallucination.

Not every miss is equal. We separate severity in both directions:

Hallucinations. A high-impact hallucination is a clinically serious unsupported claim — an invented medication, or a symptom the patient actually denied. A low-impact one is a wording or formatting slip with no clinical consequence.
Omissions. A high-impact omission is a dropped fact in a safety-critical field — a medication, allergy, assessment, or plan item. A low-impact omission is narrative or context detail a clinician might want but that isn't safety-critical.

Every figure below is high-impact: hallucinations confirmed by panel majority, and omissions checked against a transcript-grounded evidence ledger. Per writer, the two risks pull apart; safest is top-right:

Safety map: hallucinations vs missed facts. Right means fewer hallucinations; up means fewer missed safety facts.

Each writer's two safety failure modes, per 300 notes. Toward the right: fewer hallucinations. Toward the top: fewer missed facts. The top-right (shaded) is safest.

What we found. Fabrication is rare: the Claudes and Kimi have zero confirmed hallucinations; the rest have one to five. Forgetting is where models separate. Opus omits the fewest facts (35), DeepSeek the most (161). No base writer is clean on both.
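One way to picture the omission count is as a ledger check: list the facts the transcript supports, then see which safety-critical ones never reach the note. The sketch below is an assumption-laden illustration. The `LedgerFact` structure is hypothetical, and case-insensitive substring matching is a crude stand-in for whatever matching the published scripts actually perform.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Fields the article treats as safety-critical (high-impact if dropped).
SAFETY_CRITICAL_FIELDS = {"medication", "allergy", "assessment", "plan"}


@dataclass(frozen=True)
class LedgerFact:
    field: str            # e.g. "medication"; anything else is narrative context
    text: str             # the fact as stated in the transcript
    transcript_line: int  # where the transcript supports it


def severity(fact: LedgerFact) -> str:
    """High-impact when the fact sits in a safety-critical field, else low-impact."""
    return "high" if fact.field in SAFETY_CRITICAL_FIELDS else "low"


def missed_high_impact(ledger: Sequence[LedgerFact], note: str) -> List[LedgerFact]:
    """Return safety-critical ledger facts that do not appear in the note."""
    note_lower = note.lower()
    return [
        fact for fact in ledger
        if severity(fact) == "high" and fact.text.lower() not in note_lower
    ]


# Hypothetical ledger and note.
ledger = [
    LedgerFact("medication", "lisinopril 20 mg", 12),
    LedgerFact("allergy", "penicillin", 30),
    LedgerFact("context", "recently moved house", 4),
]
note = "Medications: Lisinopril 20 mg once daily. Plan: recheck blood pressure in 4 weeks."
print([fact.text for fact in missed_high_impact(ledger, note)])  # ['penicillin']
```

The dropped context detail is not counted, which mirrors the article's split between high-impact and low-impact omissions.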

Cost & speed — can you run it at scale

Takeaway: the cheap models are fast enough for production.

We priced every note at public list rates and measured typical response time. No internal discounts, no partner credits. Cheap and fast is the top-right:

Cost and speed per note. Right means cheaper; up means faster response time.

$/note at public list price (cheaper toward the right) vs response speed (faster toward the top). The top-right (shaded) is best: cheap and fast. Gemini shown at its minimum thinking budget; see note below.

What we found. Cost and speed vary enormously. The priciest model costs about 9× as much per note as the cheapest ($0.0330 vs $0.0038); the slowest takes about 10× longer than the fastest (46.4s vs 4.6s). GPT-5.4-mini and Gemini 3.5 Flash own the cheap-and-fast corner. DeepSeek is cheapest but slower. Kimi is expensive and slow.

Before any safety layer: no model wins on everything

Takeaway: every writer is strong on some axes and weak on others.

Pulling the three readings together, no writer wins on every axis. The cheapest models forget more. The most fact-complete models are pricey or slow.

Writers on their own, per 300 notes — sorted by missed facts (fewer is safer). This is the "before" picture the safety layer has to improve on.
Writer	High-impact hallucinations	Missed safety facts	$/note	Speed
Claude Opus 4.8	0	35	$0.0305	9.8s
GPT-5.5	3	46	$0.0287	8.3s
GPT-5.4-mini	2	47	$0.0042	4.8s
Gemini 3.5 Flash	1	54	$0.0089	4.6s
Claude Sonnet 4.6	0	58	$0.0180	15.5s
Kimi K2.6	0	58	$0.0330	46.4s
Gemini 3.1 Pro	1	61	$0.0316	20.7s
DeepSeek-V4-Pro	5	161	$0.0038	17.3s
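The headline totals can be re-derived from the table above. The snippet simply copies the table's rows and sums them; nothing in it comes from outside the article.

```python
# (writer, high-impact hallucinations, missed safety facts, $/note, seconds/note)
BASE_WRITERS = [
    ("Claude Opus 4.8", 0, 35, 0.0305, 9.8),
    ("GPT-5.5", 3, 46, 0.0287, 8.3),
    ("GPT-5.4-mini", 2, 47, 0.0042, 4.8),
    ("Gemini 3.5 Flash", 1, 54, 0.0089, 4.6),
    ("Claude Sonnet 4.6", 0, 58, 0.0180, 15.5),
    ("Kimi K2.6", 0, 58, 0.0330, 46.4),
    ("Gemini 3.1 Pro", 1, 61, 0.0316, 20.7),
    ("DeepSeek-V4-Pro", 5, 161, 0.0038, 17.3),
]

hallucinations = sum(row[1] for row in BASE_WRITERS)
omissions = sum(row[2] for row in BASE_WRITERS)
costs = [row[3] for row in BASE_WRITERS]
speeds = [row[4] for row in BASE_WRITERS]

print(hallucinations, omissions)                # 12 520
print(round(omissions / hallucinations))        # 43
print(round(max(costs) / min(costs), 1))        # 8.7  (priciest vs cheapest)
print(round(max(speeds) / min(speeds), 1))      # 10.1 (slowest vs fastest)
```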

So we tested a different architecture: keep the writer, but wrap it in a layer that checks every line against the conversation, restores missing facts, and flags claims it cannot verify. That's Omi Guard.

The Omi Guard move: don't ask the model to be perfect — wrap it

Omi Guard sits between any writer and the final note. It works in both directions: recovering facts the writer dropped and flagging claims the conversation does not support. Each action is tied to transcript evidence:

Green · insert

Recover a dropped fact

Restore an omitted medication, allergy, or plan item that the conversation supports.

Amber · flag

Catch a hallucination

Surface a suspected hallucination — or any claim it can't verify — for the clinician, with the contradicting snippet. Never silently changed.

Red · drop

Remove uncited filler

Strip boilerplate the conversation never supported. Claims backed by evidence are kept.

Guard is conservative by design: when it can't verify a claim, it abstains rather than guess. Its internals are proprietary; the scored outputs are published and reproducible.

What changed, before and after Guard

Takeaway: Guard recovers the misses and flags the confirmed writer hallucinations.

Running the same 2,400 notes through Guard closed both measured failure modes — and Guard introduced no new confirmed hallucinations:

Missed safety facts recovered (meds, allergies, assessment, plan)

520→0

Panel-confirmed writer hallucinations flagged for review

0→12 of 12

New hallucinations Guard introduced

0

Cost Guard adds

~$0.0007 / note

Before Guard, those 520 facts were absent from the notes and the 12 confirmed hallucinations sat unmarked. After Guard, dropped facts are restored from the transcript and unsupported claims carry a review flag with evidence.

And the ranking? The cheap models lead

Takeaway: once omissions are recovered, small cheap writers become deployable.

The Safe Note Efficiency Index weights safety most heavily: 70% safety, 20% cost, 10% speed. Prose is excluded because a fluent unsafe note is still unsafe.
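The weighting can be sketched as a simple weighted sum. Only the 70/20/10 weights come from the article. The assumption that each component is first normalised to a score between 0 and 1 (higher is better) is ours, as is the function name, so this sketch is not expected to reproduce the published index values; the repo's scripts define the real normalisation.

```python
WEIGHTS = {"safety": 0.70, "cost": 0.20, "speed": 0.10}


def efficiency_index(safety: float, cost: float, speed: float) -> float:
    """Weighted composite on a 0-100 scale.

    Assumption: each input is already a score in [0, 1] where higher is better
    (so a cheaper or faster stack gets a cost or speed score closer to 1).
    """
    scores = {"safety": safety, "cost": cost, "speed": speed}
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1], got {value}")
    return 100.0 * sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical component scores: perfect safety, middling cost and speed.
print(round(efficiency_index(safety=1.0, cost=0.5, speed=0.5), 2))  # 85.0
```

With safety at 70% of the weight, a stack that is perfect on cost and speed but scores zero on safety tops out at 30, which is the point of excluding prose and weighting safety first.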

The chart maps every writer with cheaper to the right and safer toward the top. With Guard on, the movement lines show missed facts being recovered, lifting each writer into the safe top half:

Safe Note Efficiency: before and after Omi Guard. Right means cheaper; up means safer after missed facts are recovered. Legend: writers before Guard, writers after Guard.

With Guard, the combined index is led by cheaper writers because the safety gap has been closed in the measured notes:

Safe Note Efficiency Index — every writer with Omi Guard. Higher is better.
#	Stack	Index	$/note	speed	Notes
1	GPT-5.4-mini + Guard	98.70	$0.0055	9.5s	best overall
2	DeepSeek-V4-Pro + Guard	96.33	$0.0050	23.2s	cheapest writer
3	Gemini 3.5 Flash + Guard	92.59	$0.0096	11.9s	cheap + fast value pick
4	Claude Sonnet 4.6 + Guard	82.61	$0.0193	22.5s	strongest prose
5	GPT-5.5 + Guard	80.23	$0.0305	13.4s	anchor
6	Claude Opus 4.8 + Guard	79.03	$0.0316	16.0s	verbose
7	Gemini 3.1 Pro + Guard	76.01	$0.0323	28.1s	pricey and slow
8	Kimi K2.6 + Guard	69.92	$0.0343	51.7s	reasoning bloat

GPT-5.4-mini + Guard ranked #1 on Safe Note Efficiency, at ~19% of unguarded GPT-5.5's per-note cost ($0.0055 vs $0.0287). The takeaway is not that the smallest model is always best. It is that a cheap writer becomes viable when a separate layer recovers omissions and flags unsupported claims.

Cost is writer token usage at public list prices plus Guard's marginal ~$0.0007/note; speed is the typical (median) time per note end to end. Gemini "thinks" heavily by default, which inflates its cost and latency without changing the note; we run it at its minimum thinking budget for a like-for-like comparison. Full methodology, token counts, and prices are in the repo.
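A per-note cost of this kind can be sketched as token counts multiplied by list prices, plus Guard's marginal cost. The token counts, the prices, and the per-million-token pricing convention in the example are hypothetical placeholders; the real figures are in the repo. Only the ~$0.0007 Guard figure and the use of the median come from the article.

```python
from statistics import median

GUARD_MARGINAL_USD = 0.0007  # the article's approximate added cost per note


def note_cost_usd(
    input_tokens: int,
    output_tokens: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
    with_guard: bool = False,
) -> float:
    """Writer token usage at list price, plus Guard's marginal cost if enabled."""
    writer_cost = (
        input_tokens * usd_per_million_input
        + output_tokens * usd_per_million_output
    ) / 1_000_000
    return writer_cost + (GUARD_MARGINAL_USD if with_guard else 0.0)


# Hypothetical note: 2,000 input tokens, 500 output tokens,
# priced at $1.00 and $4.00 per million tokens.
print(round(note_cost_usd(2000, 500, 1.00, 4.00), 4))                   # 0.004
print(round(note_cost_usd(2000, 500, 1.00, 4.00, with_guard=True), 4))  # 0.0047

# "Typical" speed is the median of per-note latencies (hypothetical values, seconds).
print(median([4.2, 4.8, 5.1, 4.6, 9.9]))  # 4.8
```

Using the median rather than the mean keeps one slow outlier, such as the 9.9s note above, from distorting the reported speed.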

Why it matters for hospitals

This is the architecture Omi Scribe is built around. It is self-hosted: hospitals run it inside their own private cloud and bring their own model from Azure AI Foundry, Google Vertex AI, or Amazon Bedrock. Omi Guard is the safety layer around it.

Your model

Bring your own writer

Foundry, Vertex AI, or Bedrock — running in your own tenancy. Swap models as the frontier moves.

→

Omi Guard

Deterministic safety layer

Recovers facts the writer dropped, flags the writer's hallucinations for review, and abstains when unsure.

→

Output

A note with lineage

Every line traceable to a moment in the conversation, with a full history of every change.

Runs entirely within the hospital's private cloud — no patient data leaves the tenancy.

Because every Guard action is logged, the note arrives with lineage. Each recovered fact, removed filler line, and review flag carries the transcript line that justified it. That's the difference between the AI rewrote my note and the AI showed me its work.

Guard reviewed this note · 1 fix applied · 1 item flagged for review

Auto-applied

Added to Medications: Lisinopril 20 mg once daily — from the patient mentioning their morning dose (lines 12–13).

Needs your review

In Plan: "No chest discomfort" — but the conversation says "actually, yes — sometimes after walking up stairs." Please review before signing.
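To make the idea of lineage concrete, here is a minimal sketch of what a logged action and the review banner above might look like as data. Guard's internals are proprietary, so every name here (`ActionKind`, `GuardAction`, `review_summary`) is hypothetical, as are the choice to count both inserts and drops as applied fixes and the transcript line cited for the flag.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence, Tuple


class ActionKind(Enum):
    INSERT = "insert"  # green: recover a dropped fact
    FLAG = "flag"      # amber: surface an unverifiable claim for review
    DROP = "drop"      # red: remove uncited filler


@dataclass(frozen=True)
class GuardAction:
    kind: ActionKind
    section: str
    text: str
    transcript_lines: Tuple[int, ...]  # the evidence that justified the action

    def __post_init__(self) -> None:
        if not self.transcript_lines:
            raise ValueError("every action must cite transcript evidence")


def _count(n: int, singular: str, plural: str) -> str:
    return f"{n} {singular if n == 1 else plural}"


def review_summary(actions: Sequence[GuardAction]) -> str:
    """Render the banner a clinician sees before signing."""
    applied = sum(a.kind in (ActionKind.INSERT, ActionKind.DROP) for a in actions)
    flagged = sum(a.kind is ActionKind.FLAG for a in actions)
    return (
        "Guard reviewed this note · "
        f"{_count(applied, 'fix', 'fixes')} applied · "
        f"{_count(flagged, 'item', 'items')} flagged for review"
    )


actions = [
    GuardAction(ActionKind.INSERT, "Medications", "Lisinopril 20 mg once daily", (12, 13)),
    # Line 41 is a made-up location for the contradicting snippet.
    GuardAction(ActionKind.FLAG, "Plan", "No chest discomfort", (41,)),
]
print(review_summary(actions))
# Guard reviewed this note · 1 fix applied · 1 item flagged for review
```

Rejecting any action without transcript lines is what turns a change log into lineage: nothing reaches the note unless it can point back to the conversation.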

Why flag, and not delete?

Guard surfaces suspected hallucinations; it never silently removes the writer's text. A wrong deletion is invisible. A flag keeps the clinician in control and preserves the signing workflow. Once real-world use proves which flags are reliably correct, the highest-confidence ones can be auto-applied — but flag-first earns that trust the right way round.

What this benchmark does not prove

Stated plainly:

Clinical outcomes. This is a transcript-grounded benchmark, not a clinical trial.
Flag precision. We measure recall against the 12 confirmed writer hallucinations. Whether every flag is clinically useful is a clinician question we answer next, with partners.
Generalization. All dialogues are synthetic, English, ambulatory primary-care.
Independence. Omi built Guard and ran the evaluation. That is why the corpus and scripts are public.
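The gap between recall and precision is easy to state in code. Recall uses only numbers the article reports (12 confirmed hallucinations, 12 flagged). Precision also needs the number of flags raised on claims that were not hallucinations, which the benchmark does not report, so the value used below is purely hypothetical.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Share of real hallucinations that were flagged."""
    return true_positives / (true_positives + false_negatives)


def precision(true_positives: int, false_positives: int) -> float:
    """Share of flags that pointed at a real hallucination."""
    return true_positives / (true_positives + false_positives)


# From the article: all 12 panel-confirmed hallucinations were flagged.
print(recall(true_positives=12, false_negatives=0))  # 1.0

# Hypothetical: if Guard had also raised 8 flags on claims that were fine.
print(precision(true_positives=12, false_positives=8))  # 0.6
```

Perfect recall says nothing about how many unnecessary flags a clinician has to dismiss, which is why flag precision is listed as an open question.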

We're publishing this because it's reproducible, not because it's the final word. Clinical-workflow validation with partners is the next step.

Reproduce it

The corpus, all 4,800 before/after notes, token and price data, and scoring scripts are open-source. Guard already ran; these scripts score its published output, so you can verify every number without Guard internals:

GitHub: github.com/Omi-Health/medical-note-eval · MIT licence
Re-run the cross-family verifier panel: python scripts/score_panel.py --writer mini --arm guarded
Check Guard-attributable unsupported additions: python scripts/symmetric_reconcile.py --writer mini
Check the recovered-omissions count: python scripts/recall_check.py --writer mini
Reproduce every per-note cost: python scripts/cost_report.py

Disagree with a number? Open an issue. We'd rather be corrected than wrong in public.

We're enrolling founding partners for Omi Guard

We're enrolling founding partners to test Omi Guard on real clinical conversations, inside their own cloud and with their own models. The goal is simple: measure whether recovered facts, review flags, and transcript lineage improve trust at sign-off.

Talk to us [email protected]

Cite this benchmark

APA — Omi Health. (2026). Trustworthy AI Notes: Evaluating 8 Frontier Writers With and Without the Omi Guard Safety Layer. https://omi.health/research/note-eval-v2

@misc{omi_note_eval_v2_2026,
  title   = {Trustworthy AI Notes: 8 Frontier Writers + Omi Guard},
  author  = {{Omi Health}},
  year    = {2026},
  url     = {https://omi.health/research/note-eval-v2},
  note    = {8 writers, 300 dialogues, before/after a deterministic safety layer}
}

Related research

SOAP Note Safety Benchmark v1 — the original 6-model evaluation this builds on
Medical Speech-to-Text Benchmark — 42 models ranked by Medical WER
Omi Med STT v1 — our on-device 0.6B medical speech-to-text model
Omi-Sum 3B — open-source clinical model for SOAP note summarization