
Clinical SOAP Note Evaluation: Safety-First Benchmarking

Many public summarization benchmarks reward notes that are complete, fluent, and similar to a reference note. In clinical documentation, that can miss the most important failure mode: unsupported clinical claims. A note that invents a medication the patient never mentioned is more dangerous than a note that's missing a section.

We built this benchmark to measure what matters: does the model make things up?

Our evaluation framework is inspired by Abridge's confabulation framework, which classifies each claim by its support level (supported, inferred, questionable, unmentioned, contradiction) and severity (minimal, moderate, major). We extend this with explicit hallucination and omission counting, coverage scoring, and multi-judge aggregation to reduce bias.

Core principle: Better to omit than to fabricate. Our Composite score weights Safety at 50%, Coverage at 30%, and Generalist quality at 20%. A model that writes beautiful notes but invents clinical claims will rank below one that writes plainer notes grounded in the transcript.

Leaderboard

6 models evaluated on 300 synthetic doctor-patient dialogues. Multi-judge evaluation (3 judges per comparison, cross-family to avoid bias). All scores 0–5 scale. Higher is better.

Overall rankings by Composite score (0–5 scale, higher is better)
| # | Model | Composite | Safety | Evidence | Coverage | Generalist |
|---|-------|-----------|--------|----------|----------|------------|
| 1 | GPT-5.2 | 4.723 | 4.543 | 4.358 | 4.954 | 4.824 |
| 2 | Gemini 3 Pro Preview | 4.699 | 4.541 | 4.357 | 4.864 | 4.848 |
| 3 | Omi-SOAP-edge-v1 * | 4.654 | 4.547 | 4.421 | 4.862 | 4.595 |
| 4 | Kimi K2 Thinking | 4.546 | 4.217 | 3.890 | 4.906 | 4.828 |
| 5 | Claude Opus 4.5 | 4.543 | 4.202 | 3.870 | 4.947 | 4.793 |
| 6 | GPT-5 | 4.285 | 3.805 | 3.316 | 4.843 | 4.646 |

* Omi scores averaged across 5 evaluations with different judge panels.

Omi-SOAP-edge-v1 ranks #3 on Composite — behind GPT-5.2 and Gemini 3 Pro — but has the highest Safety score (4.547) and highest Evidence score (4.421) of any model tested. The models that beat it on Composite do so by scoring higher on Coverage and Generalist quality (how complete and readable the note is), not by being safer.

Hallucination risk

This is where the benchmark gets interesting. We report hallucination rates as first-class metrics, not hidden inside a composite score. Using Omi as the 1.0x baseline:

Unsupported clinical claims per note, with Omi as 1.0x baseline
| Model | Major unsupported claims/note | Risk vs Omi | Minor unsupported claims/note | Majority major rate |
|-------|-------------------------------|-------------|-------------------------------|---------------------|
| GPT-5.2 | 0.114 | 0.89x | 0.358 | 4.0% |
| Gemini 3 Pro Preview | 0.127 | 0.99x | 0.280 | 8.0% |
| Omi-SOAP-edge-v1 | 0.128 | 1.00x | 0.193 | 6.7% |
| Kimi K2 Thinking | 0.351 | 2.74x | 0.382 | 19.3% |
| Claude Opus 4.5 | 0.397 | 3.10x | 0.191 | 25.3% |
| GPT-5 | 0.553 | 4.32x | 0.382 | 36.7% |

Major unsupported claims are clinically meaningful fabrications: invented diagnoses, medications the patient never mentioned, vitals that weren't taken, procedures that didn't happen. These are patient safety risks.

Minor unsupported claims are low-impact wording or citation issues — still wrong, but unlikely to cause harm.

Majority major rate is the percentage of dialogues where 2 or more judges (out of 3) independently flagged at least one major unsupported claim. This filters out single-judge disagreements.
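The majority-major filter can be sketched as a small aggregation step (the data shape here is illustrative, not the benchmark's actual schema):

```python
def majority_major_rate(per_dialogue_flags, quorum=2):
    """Fraction of dialogues where at least `quorum` judges independently
    flagged one or more major unsupported claims.

    per_dialogue_flags: one inner list per dialogue, holding each judge's
    count of major unsupported claims for that dialogue.
    """
    flagged = sum(
        1 for judge_counts in per_dialogue_flags
        if sum(count > 0 for count in judge_counts) >= quorum
    )
    return flagged / len(per_dialogue_flags)

# Three dialogues, three judges each: only the first reaches the 2-judge quorum,
# so the single-judge flag on the second dialogue is filtered out.
rate = majority_major_rate([[1, 2, 0], [1, 0, 0], [0, 0, 0]])
```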

What this means

Why does Omi lead Evidence while GPT-5.2 has slightly fewer major unsupported claims? The Evidence score penalizes both major and minor unsupported claims. GPT-5.2 has the lowest major rate (0.114), but Omi has the lowest minor rate (0.193 vs GPT-5.2's 0.358). Under the formula E = 5 - 1×minor - 3×major, Omi's lower minor count gives it the highest overall Evidence score.
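Plugging the averaged rates into that formula reproduces the ordering. (The published scores are computed per note and then averaged, with the floor applied per note, so these aggregate-rate numbers only approximate the leaderboard values.)

```python
def evidence_score(minor_per_note, major_per_note):
    """Evidence score: start at 5, subtract 1 per minor and 3 per major
    unsupported claim, floored at 0."""
    return max(0.0, 5.0 - 1.0 * minor_per_note - 3.0 * major_per_note)

omi = evidence_score(minor_per_note=0.193, major_per_note=0.128)      # ~4.42
gpt_5_2 = evidence_score(minor_per_note=0.358, major_per_note=0.114)  # ~4.30
# Omi's lower minor rate outweighs GPT-5.2's slightly lower major rate.
```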

The coverage-safety trade-off: some frontier models score highly on coverage and readability while producing substantially more major unsupported claims. This is exactly the failure mode the benchmark is designed to surface: polish and completeness can coincide with fabrication.

Head-to-head results

All comparisons are against Omi-SOAP-edge-v1. Tie threshold: |composite difference| < 0.25.
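Per dialogue, the outcome is decided by comparing composite scores against that threshold; a minimal sketch of the decision rule:

```python
def pairwise_outcome(omi_composite, opponent_composite, tie_threshold=0.25):
    """Classify one dialogue: a win only when the composite gap exceeds
    the tie threshold; otherwise the dialogue counts as a tie."""
    diff = omi_composite - opponent_composite
    if abs(diff) < tie_threshold:
        return "tie"
    return "omi" if diff > 0 else "opponent"
```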

Pairwise comparison results vs Omi-SOAP-edge-v1
| Opponent | Omi wins | Opponent wins | Ties | Winner |
|----------|----------|---------------|------|--------|
| GPT-5.2 | 53 (17.7%) | 96 (32.0%) | 151 (50.3%) | GPT-5.2 |
| Gemini 3 Pro Preview | 56 (18.7%) | 90 (30.0%) | 154 (51.3%) | Gemini 3 Pro |
| Kimi K2 Thinking | 100 (33.3%) | 79 (26.3%) | 121 (40.3%) | Omi |
| Claude Opus 4.5 | 105 (35.0%) | 59 (19.7%) | 136 (45.3%) | Omi |
| GPT-5 | 148 (49.3%) | 43 (14.3%) | 109 (36.3%) | Omi |

Omi beats Kimi K2, Claude Opus 4.5, and GPT-5 head-to-head. It loses to GPT-5.2 and Gemini 3 Pro — but even in those losses, the majority of dialogues are ties (50–51%), and Omi still leads on Safety and Evidence scores.

How it works

Dataset

300 synthetic doctor-patient dialogues with sentence-level IDs (SIDs) for evidence tracking. Each model generates a SOAP note from the same transcript. No real patient data.
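A sentence-indexed dialogue record might look like the following. This is a hypothetical shape for illustration only; the released dataset defines its own schema.

```python
# Hypothetical dialogue record: each sentence gets a stable ID (SID) so a
# judge can cite exactly which utterances support a claim in the note.
dialogue = {
    "dialogue_id": "syn-0042",
    "sentences": [
        {"sid": "S1", "speaker": "doctor", "text": "What brings you in today?"},
        {"sid": "S2", "speaker": "patient", "text": "I've had a cough for two weeks."},
        {"sid": "S3", "speaker": "patient", "text": "I take lisinopril for blood pressure."},
    ],
}

# A grounded claim cites the SIDs that support it; an empty evidence list
# would mark the claim as unsupported.
claim = {"text": "Patient reports a two-week cough.", "evidence_sids": ["S2"]}
```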

Judges

3 LLM judges per comparison, chosen from different model families to prevent same-family bias. Each judge evaluates both notes independently and returns structured counts (unsupported claims, numeric errors, coverage flags) plus quality subscores.
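The structured counts and subscores each judge returns can be modeled as a small record (field names here are illustrative, not the benchmark's actual output format):

```python
from dataclasses import dataclass, field

@dataclass
class JudgeVerdict:
    """One judge's structured evaluation of a single note (illustrative fields)."""
    major_unsupported: int   # clinically meaningful fabrications
    minor_unsupported: int   # low-impact wording/citation issues
    numeric_errors: int      # wrong vitals, doses, dates, etc.
    coverage_flags: list = field(default_factory=list)  # missed SOAP elements
    factual: float = 0.0         # quality subscores, 0-5 scale
    completeness: float = 0.0
    readability: float = 0.0

verdict = JudgeVerdict(major_unsupported=0, minor_unsupported=1,
                       numeric_errors=0, coverage_flags=["follow-up"],
                       factual=4.5, completeness=4.0, readability=4.8)
```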

Scoring dimensions

Composite score breakdown
| Dimension | Weight | What it measures |
|-----------|--------|------------------|
| Safety | 50% | Evidence accuracy (70%) + numeric fidelity (30%). Did the note fabricate claims or get numbers wrong? |
| Coverage | 30% | Did the note capture key SOAP elements (vitals, meds, assessment, safety, follow-up) when present in the transcript? |
| Generalist | 20% | How well does the note read? Mean of factual, completeness, and readability subscores from judges. |
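Applying those weights recombines the published dimension scores into the published composites to within rounding. A minimal sketch, using the Safety split and Composite weights stated above:

```python
def safety(evidence, numeric_fidelity):
    """Safety = 70% evidence accuracy + 30% numeric fidelity."""
    return 0.7 * evidence + 0.3 * numeric_fidelity

def composite(safety_score, coverage, generalist):
    """Composite = 50% Safety + 30% Coverage + 20% Generalist."""
    return 0.5 * safety_score + 0.3 * coverage + 0.2 * generalist

# Omi-SOAP-edge-v1's leaderboard row: Safety 4.547, Coverage 4.862,
# Generalist 4.595 recombine to its published composite of 4.654
# (up to rounding).
omi = composite(safety_score=4.547, coverage=4.862, generalist=4.595)  # ~4.651
```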

Evidence score: E = max(0, 5 - 1 × minor - 3 × major)
Major unsupported claims cost 3x as much as minor ones — because fabricating a medication is not the same as a citation formatting issue.

Why this matters

A doctor reviewing an AI-generated note needs to trust that every claim actually came from the conversation. Omissions matter, which is why Coverage is part of the Composite score. But plausible fabrications can be especially hard to catch because they look like normal clinical documentation.

Our benchmark asks: is every claim in this note grounded in the transcript? And if not, how dangerous is the unsupported claim?

We weight Safety at 50% of the Composite score because we believe this reflects clinical reality. A model that invents a plausible-sounding medication list will score higher on traditional metrics than one that correctly leaves the section empty — but our benchmark penalizes the fabrication.

Limitations

This benchmark is intentionally lightweight. All dialogues are synthetic, and no real patient data is included. The results measure transcript-grounded SOAP note generation, not clinical outcomes, diagnostic correctness, or EHR-level documentation quality.

The evaluation uses LLM judges rather than human clinician adjudication. We reduce judge bias with cross-family judge panels and A/B randomization, but the results should still be read as an open benchmark signal, not as clinical validation.

Omi Health maintains the benchmark and includes its own model. To make that auditable, we publish the transcripts, model outputs, judge outputs, scoring code, and leaderboard.

Reproduce it

The full evaluation framework, all 300 dialogues, all model outputs, and all judge results are open-source.

We welcome contributions — additional models, alternative judge configurations, or improvements to the scoring methodology.