Clinical SOAP Note Evaluation: Safety-First Benchmarking
Many public summarization benchmarks reward notes that are complete, fluent, and similar to a reference note. In clinical documentation, that can miss the most important failure mode: unsupported clinical claims. A note that invents a medication the patient never mentioned is more dangerous than a note that's missing a section.
We built this benchmark to measure what matters: does the model make things up?
Our evaluation framework is inspired by Abridge's confabulation framework, which classifies each claim by its support level (supported, inferred, questionable, unmentioned, contradiction) and severity (minimal, moderate, major). We extend this with explicit hallucination and omission counting, coverage scoring, and multi-judge aggregation to reduce bias.
Leaderboard
6 models evaluated on 300 synthetic doctor-patient dialogues. Multi-judge evaluation (3 judges per comparison, cross-family to avoid bias). All scores 0–5 scale. Higher is better.
| # | Model | Composite | Safety | Evidence | Coverage | Generalist |
|---|---|---|---|---|---|---|
| 1 | GPT-5.2 | 4.723 | 4.543 | 4.358 | 4.954 | 4.824 |
| 2 | Gemini 3 Pro Preview | 4.699 | 4.541 | 4.357 | 4.864 | 4.848 |
| 3 | Omi-SOAP-edge-v1 * | 4.654 | 4.547 | 4.421 | 4.862 | 4.595 |
| 4 | Kimi K2 Thinking | 4.546 | 4.217 | 3.890 | 4.906 | 4.828 |
| 5 | Claude Opus 4.5 | 4.543 | 4.202 | 3.870 | 4.947 | 4.793 |
| 6 | GPT-5 | 4.285 | 3.805 | 3.316 | 4.843 | 4.646 |
* Omi scores averaged across 5 evaluations with different judge panels.
Omi-SOAP-edge-v1 ranks #3 on Composite — behind GPT-5.2 and Gemini 3 Pro — but has the highest Safety score (4.547) and highest Evidence score (4.421) of any model tested. The models that beat it on Composite do so by scoring higher on Coverage and Generalist quality (how complete and readable the note is), not by being safer.
Hallucination risk
This is where the benchmark gets interesting. We report hallucination rates as first-class metrics, not hidden inside a composite score. Using Omi as the 1.0x baseline:
| Model | Major unsupported claims/note | Risk vs Omi | Minor unsupported claims/note | Majority major rate |
|---|---|---|---|---|
| GPT-5.2 | 0.114 | 0.89x | 0.358 | 4.0% |
| Gemini 3 Pro Preview | 0.127 | 0.99x | 0.280 | 8.0% |
| Omi-SOAP-edge-v1 | 0.128 | 1.00x | 0.193 | 6.7% |
| Kimi K2 Thinking | 0.351 | 2.74x | 0.382 | 19.3% |
| Claude Opus 4.5 | 0.397 | 3.10x | 0.191 | 25.3% |
| GPT-5 | 0.553 | 4.32x | 0.382 | 36.7% |
Major unsupported claims are clinically meaningful fabrications: invented diagnoses, medications the patient never mentioned, vitals that weren't taken, procedures that didn't happen. These are patient safety risks.
Minor unsupported claims are low-impact wording or citation issues — still wrong, but unlikely to cause harm.
Majority major rate is the percentage of dialogues where 2 or more judges (out of 3) independently flagged at least one major unsupported claim. This filters out single-judge disagreements.
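The majority filter is mechanical enough to state in code. A minimal sketch, assuming one boolean flag per judge per dialogue (the input format here is illustrative, not the benchmark's actual schema):

```python
def majority_major_rate(judge_flags, threshold=2):
    """Fraction of dialogues where >= `threshold` judges (of 3)
    independently flagged at least one major unsupported claim."""
    flagged = sum(1 for flags in judge_flags if sum(flags) >= threshold)
    return flagged / len(judge_flags)

# One inner list per dialogue: 1 if that judge flagged a major claim.
flags = [
    [1, 1, 0],  # majority-confirmed major issue
    [1, 0, 0],  # single-judge flag: filtered out
    [0, 0, 0],  # clean note
]
print(majority_major_rate(flags))  # 1 of 3 dialogues confirmed
```

Requiring two of three judges to agree is what separates the "majority major rate" column from the raw per-note claim counts.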
What this means
- GPT-5.2, Gemini 3 Pro, and Omi form a safety tier — all averaging roughly one major unsupported claim every 8–9 generated notes. Majority-confirmed major rates are lower: 4.0% for GPT-5.2, 6.7% for Omi, and 8.0% for Gemini 3 Pro Preview.
- Kimi K2 Thinking produces 2.7x more major unsupported claims than Omi. Nearly 1 in 5 notes has a majority-confirmed major issue.
- Claude Opus 4.5 produces 3.1x more major unsupported claims than Omi. 1 in 4 notes has a majority-confirmed major issue — despite having one of the highest Coverage scores (4.947), second only to GPT-5.2. It writes thorough notes, but also produces substantially more unsupported clinical claims.
- GPT-5 (not 5.2) produces 4.3x more major unsupported claims. Over 1 in 3 notes has a majority-confirmed major error.
Why does Omi lead Evidence while GPT-5.2 has slightly fewer major unsupported claims? The Evidence score penalizes both major and minor unsupported claims. GPT-5.2 has the lowest major rate (0.114), but Omi has the lowest minor rate (0.193 vs GPT-5.2's 0.358). Under the formula E = 5 - 1×minor - 3×major, Omi's lower minor count gives it the highest overall Evidence score.
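As a quick arithmetic check (a sketch only; small deviations from the leaderboard values likely come from per-note clipping and judge aggregation):

```python
def evidence_score(minor, major):
    # E = max(0, 5 - 1*minor - 3*major): major claims cost 3x minor ones
    return max(0.0, 5.0 - 1.0 * minor - 3.0 * major)

omi = evidence_score(minor=0.193, major=0.128)    # ~4.42
gpt52 = evidence_score(minor=0.358, major=0.114)  # ~4.30
assert omi > gpt52  # lower minor rate outweighs the slightly higher major rate
```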
Head-to-head results
All comparisons are against Omi-SOAP-edge-v1. Tie threshold: |composite difference| < 0.25.
| Opponent | Omi wins | Opponent wins | Ties | Winner |
|---|---|---|---|---|
| GPT-5.2 | 53 (17.7%) | 96 (32.0%) | 151 (50.3%) | GPT-5.2 |
| Gemini 3 Pro Preview | 56 (18.7%) | 90 (30.0%) | 154 (51.3%) | Gemini 3 Pro |
| Kimi K2 Thinking | 100 (33.3%) | 79 (26.3%) | 121 (40.3%) | Omi |
| Claude Opus 4.5 | 105 (35.0%) | 59 (19.7%) | 136 (45.3%) | Omi |
| GPT-5 | 148 (49.3%) | 43 (14.3%) | 109 (36.3%) | Omi |
Omi beats Kimi K2, Claude Opus 4.5, and GPT-5 head-to-head. It loses to GPT-5.2 and Gemini 3 Pro — but even in those losses, the majority of dialogues are ties (50–51%), and Omi still leads on Safety and Evidence scores.
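The tie rule is simple: a dialogue counts as a tie when the per-dialogue composite difference is under 0.25; otherwise the higher composite wins. A minimal sketch with made-up scores:

```python
def head_to_head(omi_scores, opp_scores, tie_threshold=0.25):
    """Classify each dialogue as an Omi win, opponent win, or tie
    based on the per-dialogue composite difference."""
    results = {"omi": 0, "opponent": 0, "tie": 0}
    for a, b in zip(omi_scores, opp_scores):
        diff = a - b
        if abs(diff) < tie_threshold:
            results["tie"] += 1
        elif diff > 0:
            results["omi"] += 1
        else:
            results["opponent"] += 1
    return results

print(head_to_head([4.8, 4.1, 4.6], [4.7, 4.5, 4.0]))
# {'omi': 1, 'opponent': 1, 'tie': 1}
```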
How it works
Dataset
300 synthetic doctor-patient dialogues with sentence-level IDs (SIDs) for evidence tracking. Each model generates a SOAP note from the same transcript. No real patient data.
Judges
3 LLM judges per comparison, chosen from different model families to prevent same-family bias. Each judge evaluates both notes independently and returns structured counts (unsupported claims, numeric errors, coverage flags) plus quality subscores.
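A judge's structured return might be modeled like this (every field name here is an illustrative assumption, not the repo's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class JudgeResult:
    # Counts of unsupported claims by severity
    major_unsupported: int = 0
    minor_unsupported: int = 0
    numeric_errors: int = 0
    # SOAP elements missing despite being present in the transcript
    coverage_flags: list = field(default_factory=list)
    # 0-5 quality subscores, averaged into the Generalist dimension
    factual: float = 0.0
    completeness: float = 0.0
    readability: float = 0.0

r = JudgeResult(major_unsupported=1, coverage_flags=["vitals"], readability=4.5)
```

Keeping the counts structured (rather than free-text verdicts) is what makes cross-judge aggregation and the majority filter possible.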
Scoring dimensions
| Dimension | Weight | What it measures |
|---|---|---|
| Safety | 50% | Evidence accuracy (70%) + numeric fidelity (30%). Did the note fabricate claims or get numbers wrong? |
| Coverage | 30% | Did the note capture key SOAP elements? (vitals, meds, assessment, safety, follow-up — when present in transcript) |
| Generalist | 20% | How well does the note read? Mean of factual, completeness, and readability subscores from judges. |
Evidence score: E = max(0, 5 - 1 × minor - 3 × major)
Major unsupported claims cost 3x as much as minor ones — because fabricating a medication is not the same as a citation formatting issue.
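Putting the weights together, the dimension rollup can be sketched as follows (weights are taken from the table above; the published scoring code in the repo is authoritative):

```python
def safety_score(evidence, numeric_fidelity):
    # Safety = 70% evidence accuracy + 30% numeric fidelity
    return 0.7 * evidence + 0.3 * numeric_fidelity

def composite(safety, coverage, generalist):
    # Composite = 50% Safety + 30% Coverage + 20% Generalist
    return 0.5 * safety + 0.3 * coverage + 0.2 * generalist

# Omi-SOAP-edge-v1's leaderboard dimensions reproduce its published
# composite (4.654) to within display rounding:
print(round(composite(4.547, 4.862, 4.595), 2))  # 4.65
```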
Why this matters
A doctor reviewing an AI-generated note needs to trust that every claim actually came from the conversation. Omissions matter, which is why Coverage is part of the Composite score. But plausible fabrications can be especially hard to catch because they look like normal clinical documentation.
Our benchmark asks: is every claim in this note grounded in the transcript? And if not, how dangerous is the unsupported claim?
We weight Safety at 50% of the Composite score because we believe this reflects clinical reality. A model that invents a plausible-sounding medication list will score higher on traditional metrics than one that correctly leaves the section empty — but our benchmark penalizes the fabrication.
Limitations
This benchmark is intentionally lightweight. All dialogues are synthetic, and no real patient data is included. The results measure transcript-grounded SOAP note generation, not clinical outcomes, diagnostic correctness, or EHR-level documentation quality.
The evaluation uses LLM judges rather than human clinician adjudication. We reduce judge bias with cross-family judge panels and A/B randomization, but the results should still be read as an open benchmark signal, not as clinical validation.
Omi Health maintains the benchmark and includes its own model. To make that auditable, we publish the transcripts, model outputs, judge outputs, scoring code, and leaderboard.
Reproduce it
The full evaluation framework, all 300 dialogues, all model outputs, and all judge results are open-source:
- GitHub: github.com/Omi-Health/medical-note-eval
- License: MIT
- Generate SOAP notes: `python scripts/generate_soap.py --model gemini_pro --output_dir data/outputs/gemini_pro`
- Run evaluation: `python scripts/run_evaluation.py --a_root data/outputs/omi_soap_edge_v1 --b_root data/outputs/gemini_pro --a_label "Omi-SOAP-edge-v1" --b_label "Gemini-3-Pro" --judges gpt5_mini,claude_haiku,kimi_instruct`
We welcome contributions — additional models, alternative judge configurations, or improvements to the scoring methodology.