Clinical SOAP Note Evaluation: Safety-First Benchmarking
Many public summarization benchmarks reward notes that are complete, fluent, and similar to a reference note. In clinical documentation, that can miss the most important failure mode: unsupported clinical claims. A note that invents a medication the patient never mentioned is more dangerous than a note that's missing a section.
We built this benchmark to measure what matters: does the model make things up?
Our evaluation framework is inspired by Abridge's confabulation framework, which classifies each claim by its support level (supported, inferred, questionable, unmentioned, contradiction) and severity (minimal, moderate, major). We extend this with explicit hallucination and omission counting, coverage scoring, and multi-judge aggregation to reduce bias.
Leaderboard
6 models evaluated on 300 synthetic doctor-patient dialogues. Multi-judge evaluation (3 judges per comparison, cross-family to avoid bias). All scores 0–5 scale. Higher is better.
| # | Model | Composite | Safety | Evidence | Coverage | Generalist |
|---|---|---|---|---|---|---|
| 1 | GPT-5.2 | 4.723 | 4.543 | 4.358 | 4.954 | 4.824 |
| 2 | Gemini 3 Pro Preview | 4.699 | 4.541 | 4.357 | 4.864 | 4.848 |
| 3 | Omi-SOAP-edge-v1 * | 4.654 | 4.547 | 4.421 | 4.862 | 4.595 |
| 4 | Kimi K2 Thinking | 4.546 | 4.217 | 3.890 | 4.906 | 4.828 |
| 5 | Claude Opus 4.5 | 4.543 | 4.202 | 3.870 | 4.947 | 4.793 |
| 6 | GPT-5 | 4.285 | 3.805 | 3.316 | 4.843 | 4.646 |
* Omi scores averaged across 5 evaluations with different judge panels.
Omi-SOAP-edge-v1 ranks #3 on Composite — behind GPT-5.2 and Gemini 3 Pro — but has the highest Safety score (4.547) and highest Evidence score (4.421) of any model tested. The models that beat it on Composite do so by scoring higher on Coverage and Generalist quality (how complete and readable the note is), not by being safer.
Hallucination risk
This is where the benchmark gets interesting. We report hallucination rates as first-class metrics, not hidden inside a composite score. Using Omi as the 1.0x baseline:
| Model | Major unsupported claims/note | Risk vs Omi | Minor unsupported claims/note | Majority major rate |
|---|---|---|---|---|
| GPT-5.2 | 0.114 | 0.89x | 0.358 | 4.0% |
| Gemini 3 Pro Preview | 0.127 | 0.99x | 0.280 | 8.0% |
| Omi-SOAP-edge-v1 | 0.128 | 1.00x | 0.193 | 6.7% |
| Kimi K2 Thinking | 0.351 | 2.74x | 0.382 | 19.3% |
| Claude Opus 4.5 | 0.397 | 3.10x | 0.191 | 25.3% |
| GPT-5 | 0.553 | 4.32x | 0.382 | 36.7% |
Major unsupported claims are clinically meaningful fabrications: invented diagnoses, medications the patient never mentioned, vitals that weren't taken, procedures that didn't happen. These are patient safety risks.
Minor unsupported claims are low-impact wording or citation issues — still wrong, but unlikely to cause harm.
Majority major rate is the percentage of dialogues where 2 or more judges (out of 3) independently flagged at least one major unsupported claim. This filters out single-judge disagreements.
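The majority filter is mechanical enough to state in code. A minimal sketch, assuming one boolean flag per judge per dialogue (the input format here is illustrative, not the benchmark's actual schema):

```python
def majority_major_rate(judge_flags, threshold=2):
    """Fraction of dialogues where >= `threshold` judges (of 3)
    independently flagged at least one major unsupported claim."""
    flagged = sum(1 for flags in judge_flags if sum(flags) >= threshold)
    return flagged / len(judge_flags)

# One inner list per dialogue: 1 if that judge flagged a major claim.
flags = [
    [1, 1, 0],  # majority-confirmed major issue
    [1, 0, 0],  # single-judge flag: filtered out
    [0, 0, 0],  # clean note
]
print(majority_major_rate(flags))  # 1 of 3 dialogues confirmed
```

Requiring two of three judges to agree is what separates the "majority major rate" column from the raw per-note claim counts.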
What this means
- GPT-5.2, Gemini 3 Pro, and Omi form a safety tier — all averaging roughly one major unsupported claim every 8–9 generated notes. Majority-confirmed major rates are lower: 4.0% for GPT-5.2, 6.7% for Omi, and 8.0% for Gemini 3 Pro Preview.
- Kimi K2 Thinking produces 2.7x more major unsupported claims than Omi. Nearly 1 in 5 notes has a majority-confirmed major issue.
- Claude Opus 4.5 produces 3.1x more major unsupported claims than Omi. 1 in 4 notes has a majority-confirmed major issue — despite having one of the highest Coverage scores (4.947), second only to GPT-5.2. It writes thorough notes, but also produces substantially more unsupported clinical claims.
- GPT-5 (not 5.2) produces 4.3x more major unsupported claims. Over 1 in 3 notes has a majority-confirmed major error.
Why does Omi lead Evidence while GPT-5.2 has slightly fewer major unsupported claims? The Evidence score penalizes both major and minor unsupported claims. GPT-5.2 has the lowest major rate (0.114), but Omi has the lowest minor rate (0.193 vs GPT-5.2's 0.358). Under the formula E = 5 - 1×minor - 3×major, Omi's lower minor count gives it the highest overall Evidence score.
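As a quick arithmetic check (a sketch only; small deviations from the leaderboard values likely come from per-note clipping and judge aggregation):

```python
def evidence_score(minor, major):
    # E = max(0, 5 - 1*minor - 3*major): major claims cost 3x minor ones
    return max(0.0, 5.0 - 1.0 * minor - 3.0 * major)

omi = evidence_score(minor=0.193, major=0.128)    # ~4.42
gpt52 = evidence_score(minor=0.358, major=0.114)  # ~4.30
assert omi > gpt52  # lower minor rate outweighs the slightly higher major rate
```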
Head-to-head results
All comparisons are against Omi-SOAP-edge-v1. Tie threshold: |composite difference| < 0.25.
| Opponent | Omi wins | Opponent wins | Ties | Winner |
|---|---|---|---|---|
| GPT-5.2 | 53 (17.7%) | 96 (32.0%) | 151 (50.3%) | GPT-5.2 |
| Gemini 3 Pro Preview | 56 (18.7%) | 90 (30.0%) | 154 (51.3%) | Gemini 3 Pro |
| Kimi K2 Thinking | 100 (33.3%) | 79 (26.3%) | 121 (40.3%) | Omi |
| Claude Opus 4.5 | 105 (35.0%) | 59 (19.7%) | 136 (45.3%) | Omi |
| GPT-5 | 148 (49.3%) | 43 (14.3%) | 109 (36.3%) | Omi |
Omi beats Kimi K2, Claude Opus 4.5, and GPT-5 head-to-head. It loses to GPT-5.2 and Gemini 3 Pro — but even in those losses, the majority of dialogues are ties (50–51%), and Omi still leads on Safety and Evidence scores.
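The tie rule is simple: a dialogue counts as a tie when the per-dialogue composite difference is under 0.25; otherwise the higher composite wins. A minimal sketch with made-up scores:

```python
def head_to_head(omi_scores, opp_scores, tie_threshold=0.25):
    """Classify each dialogue as an Omi win, opponent win, or tie
    based on the per-dialogue composite difference."""
    results = {"omi": 0, "opponent": 0, "tie": 0}
    for a, b in zip(omi_scores, opp_scores):
        diff = a - b
        if abs(diff) < tie_threshold:
            results["tie"] += 1
        elif diff > 0:
            results["omi"] += 1
        else:
            results["opponent"] += 1
    return results

print(head_to_head([4.8, 4.1, 4.6], [4.7, 4.5, 4.0]))
# {'omi': 1, 'opponent': 1, 'tie': 1}
```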
How it works
Dataset
300 synthetic doctor-patient dialogues with sentence-level IDs (SIDs) for evidence tracking. Each model generates a SOAP note from the same transcript. No real patient data.
Judges
3 LLM judges per comparison, chosen from different model families to prevent same-family bias. Each judge evaluates both notes independently and returns structured counts (unsupported claims, numeric errors, coverage flags) plus quality subscores.
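A judge's structured return might be modeled like this (every field name here is an illustrative assumption, not the repo's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class JudgeResult:
    # Counts of unsupported claims by severity
    major_unsupported: int = 0
    minor_unsupported: int = 0
    numeric_errors: int = 0
    # SOAP elements missing despite being present in the transcript
    coverage_flags: list = field(default_factory=list)
    # 0-5 quality subscores, averaged into the Generalist dimension
    factual: float = 0.0
    completeness: float = 0.0
    readability: float = 0.0

r = JudgeResult(major_unsupported=1, coverage_flags=["vitals"], readability=4.5)
```

Keeping the counts structured (rather than free-text verdicts) is what makes cross-judge aggregation and the majority filter possible.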
Scoring dimensions
| Dimension | Weight | What it measures |
|---|---|---|
| Safety | 50% | Evidence accuracy (70%) + numeric fidelity (30%). Did the note fabricate claims or get numbers wrong? |
| Coverage | 30% | Did the note capture key SOAP elements? (vitals, meds, assessment, safety, follow-up — when present in transcript) |
| Generalist | 20% | How well does the note read? Mean of factual, completeness, and readability subscores from judges. |
Evidence score: E = max(0, 5 - 1 × minor - 3 × major)
Major unsupported claims cost 3x as much as minor ones — because fabricating a medication is not the same as a citation formatting issue.
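Putting the weights together, the dimension rollup can be sketched as follows (weights are taken from the table above; the published scoring code in the repo is authoritative):

```python
def safety_score(evidence, numeric_fidelity):
    # Safety = 70% evidence accuracy + 30% numeric fidelity
    return 0.7 * evidence + 0.3 * numeric_fidelity

def composite(safety, coverage, generalist):
    # Composite = 50% Safety + 30% Coverage + 20% Generalist
    return 0.5 * safety + 0.3 * coverage + 0.2 * generalist

# Omi-SOAP-edge-v1's leaderboard dimensions reproduce its published
# composite (4.654) to within display rounding:
print(round(composite(4.547, 4.862, 4.595), 2))  # 4.65
```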
Why this matters
A doctor reviewing an AI-generated note needs to trust that every claim actually came from the conversation. Omissions matter, which is why Coverage is part of the Composite score. But plausible fabrications can be especially hard to catch because they look like normal clinical documentation.
Our benchmark asks: is every claim in this note grounded in the transcript? And if not, how dangerous is the unsupported claim?
We weight Safety at 50% of the Composite score because we believe this reflects clinical reality. A model that invents a plausible-sounding medication list will score higher on traditional metrics than one that correctly leaves the section empty — but our benchmark penalizes the fabrication.
Limitations
This benchmark is intentionally lightweight. All dialogues are synthetic, and no real patient data is included. The results measure transcript-grounded SOAP note generation, not clinical outcomes, diagnostic correctness, or EHR-level documentation quality.
The evaluation uses LLM judges rather than human clinician adjudication. We reduce judge bias with cross-family judge panels and A/B randomization, but the results should still be read as an open benchmark signal, not as clinical validation.
Omi Health maintains the benchmark and includes its own model. To make that auditable, we publish the transcripts, model outputs, judge outputs, scoring code, and leaderboard.
Reproduce it
The full evaluation framework, all 300 dialogues, all model outputs, and all judge results are open-source:
- GitHub: github.com/Omi-Health/medical-note-eval
- License: MIT
- Generate SOAP notes: `python scripts/generate_soap.py --model gemini_pro --output_dir data/outputs/gemini_pro`
- Run evaluation: `python scripts/run_evaluation.py --a_root data/outputs/omi_soap_edge_v1 --b_root data/outputs/gemini_pro --a_label "Omi-SOAP-edge-v1" --b_label "Gemini-3-Pro" --judges gpt5_mini,claude_haiku,kimi_instruct`
We welcome contributions — additional models, alternative judge configurations, or improvements to the scoring methodology.