🔍 Ambient Scribe Note & Coding Audit
Purpose
Audit an ambient-AI-drafted clinician note against an auditable source (structured encounter data, clinician-corrected final note, or — where policy allows — the transcript/audio log) to flag note inflation, phantom diagnoses, over-stated HPI or ROS elements, E/M-level drift, HCC upcoding, and assessment/plan content that was not discussed in the encounter. Output is designed for compliance, coding, and clinical-documentation-integrity (CDI) teams reviewing ambient-scribe output under a 2026 policy environment where both payers and provider organizations are responding to measurable coding-intensity drift from ambient tools.
When to Use
Use this skill whenever a human reviewer needs a structured second-pass check on an ambient-AI-drafted note — the "auto-accept" moment is the single largest source of drift. Common scenarios include:
- Random compliance audits (e.g., 5–10% sample of ambient-authored notes per provider per quarter)
- Pre-bill coding review on high-dollar E/M levels (99214, 99215, 99204, 99205, high-complexity inpatient, or any 2-tier level jump from the provider's baseline)
- CDI review when HCC/risk-adjustment diagnoses appear on the problem list that were not the documented reason for the encounter
- Provider-level pattern review when an ambient tool is new to a practice (30/60/90 day cadence)
- Targeted review after a payer downcoding event or a payer audit request
- Post-implementation monitoring when a health system first deploys an ambient scribe at enterprise scale
- RAC, MAC, OIG, or commercial-payer audit preparation and rebuttal packet assembly
- Residency, fellowship, or teaching-service review where attending attestations may conflict with ambient-drafted assessments
- Internal coding training: surfacing drift patterns for feedback to individual clinicians
- Payer contract renegotiation where coding-intensity shifts need to be explained with evidence
This skill is a reviewer's tool. It does not modify the note, does not re-code the claim, does not create a corrective amendment, and does not replace the clinician's clinical judgment. It surfaces risk-ranked findings; a human makes the final call.
Required Input
Provide, at minimum, inputs 1 and 2. Inputs 3–5 dramatically improve audit quality.
- The ambient-AI-drafted note (or the signed final note if that is what is being audited) — Full text with clear section headers (CC, HPI, ROS, PMH, FH, SH, Exam, Assessment & Plan, Patient Instructions). Indicate whether this is the raw draft, the clinician-edited draft, or the signed note. De-identify before testing with external models.
- Ground-truth encounter signal — At least one of: (a) the provider's pre-visit agenda / chief complaint; (b) the structured visit data (vitals, orders placed, labs resulted in the encounter, meds reconciled, problem-list changes); (c) the transcript or summary log from the ambient tool if your policy permits reviewer access; (d) the patient's own pre-visit intake questionnaire. If none of these are available, the audit output is clearly marked "limited-input audit" and risk scores are reduced accordingly.
- Billing submission — Submitted CPT (including E/M level), ICD-10 codes on the claim, any modifiers, HCC-relevant diagnoses captured, and whether a time-based or MDM-based E/M was billed. If a risk-adjustment model (CMS-HCC V24/V28, HHS-HCC) is in play, note which.
- Clinician baseline and policy — The individual provider's historical E/M distribution (for drift comparison), the specialty, the practice's ambient-scribe policy (auto-accept allowed? required review field? dual-signature?), and any payer-specific guardrails (e.g., Medicare Advantage plan's HCC attestation policy, local MAC LCDs).
- Scope and purpose of the audit — Random sample, targeted high-risk, payer-requested, pre-bill, post-bill, post-denial, or teaching-case. Include whether this is a standalone audit or part of a trend analysis across N notes.
If the ambient note was already hand-edited by the clinician without a diff available, the skill notes that the audit is against the final signed document only and cannot distinguish AI-authored content from clinician edits.
Instructions
Produce a structured audit report in the following six sections. Keep risk ratings calibrated — not every discrepancy is a compliance event. Use the rating rubric below exactly.
Risk rating rubric (use verbatim):
🟢 LOW— Documentation style preference or minor discrepancy; no coding or compliance impact.🟡 MEDIUM— Documentation element not supported by encounter signal; would not change the billed code but should be corrected via amendment or addendum.🟠 HIGH— Documentation element affects an E/M element, a billable diagnosis, or HCC capture; requires clinician review and either correction or explicit attestation before the claim is finalized.🔴 CRITICAL— Fabricated exam, fabricated diagnosis, fabricated procedure, fabricated history element, or documentation of a service not provided. Stop the claim; engage compliance. Also includes any assessment content that contradicts the patient's stated concerns or the structured chart.
1. Audit Header
- Audit date, auditor (or "automated pre-review"), note encounter date, clinician, specialty, encounter type (new/established/inpatient/telehealth), ambient tool name if known, audit scope, and which inputs were provided. Flag
LIMITED-INPUT AUDITif inputs 3–5 are missing.
2. Overall Drift Summary
Two to four sentences. Describe at a high level whether the ambient-drafted note appears aligned with the encounter signal, partially aligned, or materially drifted. End with a single-line verdict:
VERDICT: [ALIGNED / MINOR DRIFT / MATERIAL DRIFT / FABRICATION RISK]
If the verdict is MATERIAL DRIFT or FABRICATION RISK, the next sections focus on the specific findings; if it is ALIGNED, the findings section is still completed but short.
3. Element-by-Element Findings
For each major note element, compare drafted content against ground-truth signal. Use this structure:
[Element] — HPI / ROS / PMH-FH-SH / Exam / A&P / Orders / Patient Instructions
Drafted: [one-line excerpt or paraphrase — never reproduce > 15 words verbatim from the note]
Signal: [what the structured data / transcript / agenda supports]
Finding: [supported / partially supported / unsupported / contradicted]
Risk: [🟢 / 🟡 / 🟠 / 🔴]
Reason: [specific — e.g., "ROS lists 10 systems negative; encounter agenda is focused MSK visit for knee pain; 10-system ROS is one of the elements supporting the billed 99214."]
Suggested action: [clinician amendment / addendum / down-level / remove diagnosis / attestation / compliance escalation]
Run this for every element where there is a non-trivial discrepancy. Do not fill space with green-light lines for every sentence.
Specific patterns to watch for (each maps to a known 2026 ambient-scribe drift risk):
- ROS inflation — 10-system negative ROS appended to a focused encounter. Common ambient behavior; HIGH risk if it supports an E/M level.
- Phantom exam elements — Documentation of an exam component (e.g., fundoscopic, neuro, full skin, genitourinary) that the visit's agenda, duration, and orders make implausible. CRITICAL if it supports a higher E/M.
- Phantom diagnoses on the A&P — Diagnoses listed in the assessment that appear nowhere in the structured problem list, patient's stated concerns, orders, or labs. CRITICAL if billed; HIGH if captured but not billed (still a chart-integrity issue).
- HCC drift — A risk-adjustment-relevant diagnosis (diabetes with complications, CHF, CKD stage 3+, major depression, vascular disease, amputation status) added to the A&P without an MEAT element (Monitored, Evaluated, Assessed, Treated). HIGH risk; stop-the-claim if the HCC changes the risk score and no MEAT is documented.
- E/M level drift — MDM-based level that the encounter complexity does not support, or time-based level where the documented time is implausible given the encounter. HIGH to CRITICAL depending on magnitude.
- Time documentation — A stated time that exceeds plausible encounter duration, or a time that conflicts with adjacent encounters on the same day.
- Procedure documentation — A billable procedure (e.g., joint injection, skin biopsy, ECG interpretation, destruction of lesion) documented without corresponding order, consent, or product-use record. CRITICAL unless an order or supply charge corroborates.
- "Copy-forward" drift from prior notes — HPI or A&P language that matches a prior encounter verbatim and contradicts today's chief complaint.
- Assessment contradicting the patient's concern — The patient's stated problem does not appear in the A&P, or the A&P introduces a problem the patient did not raise.
- Patient education / AVS fabrication — "Discussed ____, patient verbalized understanding" for topics not on the encounter agenda.
- Counseling/Coordination of Care inflation — "50% of visit spent in counseling about ____" language that the encounter signal does not support (historical pre-2021 E/M language that still surfaces in drafts).
4. Coding & Billing Impact
Map the findings to the billed CPT, E/M level, ICD-10/HCC capture, and any modifiers. Produce:
- Billed E/M level and whether it is supported by documentation that is itself supported by ground-truth signal
- Recommended E/M level range if drift is present (lowest defensible, highest defensible), with rationale
- ICD-10 codes billed that are supported vs. unsupported; flag HCC-relevant items with 🟠 or 🔴 as appropriate
- Modifier checks (e.g., modifier 25 where the additional E/M is not substantiated by separate A&P content)
- Time-based vs. MDM-based billing — if mixed signals, recommend which is defensible
- Estimated financial exposure range if the highest-risk findings were subject to a payer takeback, expressed as a per-encounter range, clearly flagged as a compliance estimate, not a legal opinion
5. Recommended Remediation
Produce up to five concrete actions, in priority order:
- Pre-bill correction items (what the clinician should amend or addend before the claim drops)
- Dual-signature or attestation requests (who signs what)
- Whether to hold the claim pending review
- Whether the finding rises to the level of routine-feedback vs. compliance-escalation (e.g., OIG-level) per the organization's policy
- Whether this single note is part of a trend that warrants a broader sample (e.g., "pull next 10 99215s from this provider for pattern check")
End with two to four lines of provider-friendly feedback language the CDI team can paste into a feedback message — written as coaching, not blame. Never write language that impugns the provider's intent; ambient-scribe drift is a tool-and-workflow problem first.
6. Audit Trail & Export
Produce a short audit-trail block the team can paste into the EHR audit log or an external QA tracker:
Note ID: [xxxx] Encounter date: [yyyy-mm-dd] Provider: [NPI or internal ID]
Ambient tool: [name] Input completeness: [full / partial / limited]
Verdict: [ALIGNED / MINOR DRIFT / MATERIAL DRIFT / FABRICATION RISK]
Highest finding risk: [🟢 / 🟡 / 🟠 / 🔴]
Recommended claim action: [release / hold / amend-then-release / escalate to compliance]
Reviewer: [name] Review date: [yyyy-mm-dd]
Follow-up trigger: [none / 30-day recheck / broader sample / compliance review]
Safety, Policy, and Fairness Guardrails
- Never reproduce > 15 words verbatim from the note; paraphrase for excerpts. Full-text quotations belong in a compliance-protected audit tool, not in the report body.
- The skill does not guess a provider's intent. All findings are framed as documentation-vs-signal gaps, not accusations.
- If the ambient transcript was used, the skill never reproduces patient voice verbatim beyond short de-identifiable snippets and honors any local policy on transcript use.
- The skill flags its own limits: when inputs 3–5 are missing, risk ratings are conservative (MEDIUM at most for coding impact) and the verdict ceiling is
MINOR DRIFTunless there is direct evidence of fabrication. - Pediatric, obstetric, behavioral-health, and oncology encounters often have legitimate high-complexity MDM that looks like "drift" to a naive audit. The skill applies specialty-adjusted thresholds and will state the adjustment in the audit header.
- The skill does not judge any individual provider's pattern from one note. Provider-level trend claims require N ≥ 10 notes and are explicitly opted in.
- The financial-exposure estimate is a range, not a liability determination. Legal exposure is out of scope.
- Output is a compliance work-product. It should be routed through the organization's attorney-client-privileged QA channel where applicable.
Example Output (abridged)
=== Ambient Scribe Note Audit ===
Audit date: 2026-04-15 Auditor: CDI – J. Lopez
Encounter date: 2026-04-11 Clinician: Dr. R. Patel, Family Medicine
Ambient tool: [vendor] Note type: Established patient, level 4 (99214)
Input completeness: full (signed note + structured chart + agenda + transcript excerpt)
Audit scope: Pre-bill random sample, 2026-Q2, 10% sample
Overall drift summary:
The ambient-drafted note is partially aligned with the encounter signal. The chief complaint (back pain) and orders (lumbar MRI, cyclobenzaprine) are supported. The ROS and exam content are not. The billed 99214 is defensible at 99213 on MDM, and 99214 only if the ROS and exam inflation are corrected to match the encounter.
VERDICT: MINOR DRIFT
--- Element-by-element findings ---
Element: ROS
Drafted: "Ten-system ROS, all negative except MSK."
Signal: Agenda and transcript focus on back pain x 6 weeks; no other systems discussed.
Finding: Unsupported.
Risk: 🟠 HIGH (contributes to billed E/M level)
Reason: The 10-system ROS is one of the data elements supporting 99214 complexity; the encounter signal does not support a 10-system review.
Suggested action: Amend ROS to pertinent-positive MSK + relevant pertinent negatives only; reassess E/M level accordingly.
Element: Exam
Drafted: "Full neurologic exam including cranial nerves II–XII, sensory, motor, reflexes, gait, Romberg."
Signal: Transcript captures strength and straight-leg raise; no evidence of cranial-nerve or Romberg assessment.
Finding: Partially supported (strength/SLR yes; CN exam no; Romberg no).
Risk: 🟠 HIGH
Reason: Documentation of exam components not performed is a fabrication risk and may affect MDM complexity.
Suggested action: Amend exam to only what was performed: focused MSK + distal neurovascular exam of the lower extremities.
Element: Assessment & Plan
Drafted: "1) Lumbar radiculopathy, likely L5; 2) Obesity, class I; 3) Pre-diabetes."
Signal: Lumbar radiculopathy is supported. Obesity class I is plausible (BMI in chart). Pre-diabetes: HbA1c not drawn today, no counseling documented in transcript beyond a passing reference by the patient.
Finding: Items 1 and 2 supported; item 3 is an HCC-relevant diagnosis added without documented MEAT.
Risk: 🟠 HIGH
Reason: HCC/risk-adjustment capture without MEAT is a known 2026 drift pattern and a payer audit target. If the claim is submitted with this code and no supporting element, the organization carries avoidable exposure.
Suggested action: Either (a) remove pre-diabetes from today's A&P, or (b) amend with a 1-sentence MEAT statement reflecting today's assessment and plan if clinically accurate.
--- Coding & billing impact ---
Billed: 99214 ICD-10: M54.16, E66.01, R73.03 (pre-diabetes)
Documentation supports: 99213 on MDM after ROS/exam corrections; 99214 only if corrected documentation still supports moderate-complexity MDM.
Recommended range: 99213 (lowest defensible) – 99214 (highest defensible after amendment).
HCC exposure: E66.01 is non-HCC; R73.03 is non-HCC. Low risk-adjustment exposure on this note.
Modifier 25: not used. OK.
Estimated financial exposure if uncorrected: ~$30–$45 per encounter between 99213 and 99214 at the practice's 2026 Medicare rates (compliance estimate, not legal opinion).
--- Recommended remediation ---
1. Amend ROS to pertinent-positive/pertinent-negative only; drop the 10-system line.
2. Amend exam to the components actually performed.
3. Either remove R73.03 from the A&P or attach a MEAT statement from the encounter.
4. Re-evaluate whether 99214 remains defensible on MDM after corrections; if not, submit at 99213.
5. Do not release claim until items 1–4 are resolved.
Provider-friendly feedback (coaching):
"Today's back-pain encounter was appropriately coded on complexity, but the ambient-drafted ROS and exam included elements that weren't actually performed. A quick amendment to reflect only what you did will keep the note compliant and protect the E/M level. Also — the ambient tool added pre-diabetes to the A&P; because it's a risk-adjustment–adjacent code, please either remove it or add a one-line MEAT statement if you're addressing it today."
--- Audit trail ---
Note ID: 2026-0411-xxxx Encounter: 2026-04-11 Provider: [NPI]
Ambient tool: [vendor] Input completeness: full
Verdict: MINOR DRIFT Highest finding: 🟠 HIGH
Recommended claim action: hold; amend; release at 99213 or 99214 per post-amendment MDM
Reviewer: J. Lopez, CDI Review date: 2026-04-15
Follow-up trigger: pattern check if 3+ similar ROS/exam findings in next sample
Nursing-Specific Bias & HITL Audit Extension (v1.1)
Added April 24, 2026. Use this extension when the ambient-drafted note under audit is a nursing note (RN, LPN, APRN), a nurse-led handoff, or a nurse-authored patient-education / discharge-education artifact produced with an ambient tool. The extension applies the February 25, 2026 American Academy of Nursing (AAN) AI Position Statement's Human-in-the-Loop (HITL) and bias-evaluation expectations to the audit — nurses signing AI-assisted documentation retain full legal and ethical responsibility for each entry, so the audit surfaces drift patterns that are specifically nursing-side and specifically bias-shaped.
Trigger this extension whenever any of the following is true:
- The author of record is a nurse (RN, LPN, APRN) or a nursing-led team (charge nurse handoff, care-manager documentation, school nurse, home-health visit note)
- The note is the nurse-handoff (I-SBARR) artifact already covered by
nurse-shift-handoff-isbarr.mdand an audit is being run against the ambient output - The encounter involves a patient from a population historically harmed by healthcare-AI performance gaps (Black, Indigenous, Hispanic/Latine, LEP, deaf/hard-of-hearing, older adult, pediatric, disabled)
- The ambient tool has been reported (internally or in published literature) to transcribe, summarize, or prioritize differently by dialect, accent, or phrasing — the "Invisible Scribes" pattern
When this extension is triggered, add Section 7 (Nursing-specific findings) and Section 8 (Bias & HITL attestation) to the standard six-section report.
Section 7: Nursing-specific findings
For each of the following, score independently of the E/M coding focus in Sections 3–4:
- Nursing scope & standardized language — Drafted content uses nursing-specific language (NANDA-I, NIC, NOC, nursing process) where the structured chart expects it; ambient note does not substitute physician-style A&P language for nursing-process documentation. Risk 🟡–🟠 if a scope-of-practice mismatch could affect a licensing board review.
- Handoff integrity (I-SBARR) — If the note is a handoff, the Introduction–Situation–Background–Assessment–Recommendation–Readback elements are each represented and supported by the encounter signal. Readback / confirmation element is captured. Risk 🟠 if a safety-critical element is missing or fabricated.
- Pain-assessment, behavioral, and patient-voice fidelity — The ambient note's paraphrase of patient-reported pain, distress, or symptom language does not systematically under-score or re-frame a patient's own words. Red flag: patient states "worst pain of my life" and ambient draft records "moderate discomfort." Risk 🟠–🔴.
- Skin / wound / fall-risk / pressure-injury documentation — Findings match bedside-assessment signal (photo log, pressure-injury scale, fall-risk tool). Ambient tool did not default to "skin intact" when the bedside assessment captured otherwise. Risk 🟠 if a CMS-reportable hospital-acquired condition could be mis-documented.
- Medication administration & titration — Times, doses, and routes match MAR / smart-pump log; ambient note did not collapse multiple administrations into one line. Risk 🔴 if a controlled-substance or high-alert med event is mis-documented.
- Education & teach-back — "Patient verbalized understanding" statements are only present where the encounter signal actually captured a teach-back turn. Risk 🟠 if fabricated teach-back supports a discharge-readiness decision.
Section 8: Bias & HITL attestation
A short block (6–12 lines) that reports:
- Demographic context — Patient's self-reported race, ethnicity, preferred language, and any accommodation needs, pulled from the structured chart (do not infer from voice or free text).
- Transcript-fidelity signal — Whether the ambient tool's transcript shows signs of the known bias-shaped failure modes ("Invisible Scribes" patterns): dropped turns in non-English or accented English, summarization that omits the patient's own phrasing, prioritization that promotes clinician voice over patient voice. If no transcript is available, state that explicitly — absence of testability is itself a finding.
- HITL attestation — Confirms that the signing nurse reviewed the ambient draft, verified it against bedside assessment, and accepts full professional responsibility per AAN 2026. The audit does not accept "auto-accept" as HITL. If the note was auto-signed without a documented review turn, risk is 🟠 by default.
- Governance trigger — Whether the finding should be added to the deployer's AI Tool Registry performance log and whether a trend escalation to the nursing-led AI oversight committee is warranted (yes if ≥ 2 bias-shaped findings in a rolling 10-note sample).
Output of Sections 7–8 feeds back into the Section 2 verdict: if any nursing-specific finding is 🔴, the overall verdict is FABRICATION RISK regardless of coding impact; if ≥ 2 bias-shaped findings surface in Section 8, the overall verdict is at minimum MATERIAL DRIFT and the audit is flagged for governance review.
Notes on Fit and Limits
- Intended for CDI, coding, compliance, revenue-cycle, and (v1.1) nursing-led AI oversight teams. Not a clinical decision-support tool.
- Operates on documents, not audio. Where transcript access is permitted by policy and vendor agreement, audit quality improves substantially; otherwise the skill operates on structured encounter data plus the signed note only.
- Complements, rather than replaces, the existing Coding Review Assistant skill (which focuses on ICD-10 / HCC / DRG code selection). This skill focuses on the documentation-to-coding integrity path specifically in an ambient-authored note.
- Complements, rather than replaces, payer-specific denial rebuttal work. When a payer downcodes a claim based on ambient-drafted content, use this skill's output as an input to the Denial Appeal Letter Writer skill.
- v1.1 extension complements, rather than replaces, the
nurse-shift-handoff-isbarr.mdskill — handoff drafting is a different workflow than handoff auditing. Use I-SBARR handoff skill to draft; use this skill (Sections 7–8) to audit. - Anti-plagiarism: output is original paraphrasing and structured risk rating. Never reproduces vendor templates, payer policy language, clinical note content, or AAN position-statement text verbatim.
Evidence Base & Tool-Calibration Note (v1.2 addition — May 2026)
RCT evidence on tool-specific documentation impact. A 2025 pragmatic randomized clinical trial (238 outpatient physicians across 14 specialties; three-arm parallel design: Microsoft Dragon Ambient eXperience DAX Copilot vs Nabla vs usual-care control) found meaningful tool-to-tool variation in the primary outcome of time spent writing each note. Nabla users achieved an approximately 9.5% greater reduction in time-in-note compared with the control arm; DAX users showed no statistically significant difference from control on the same measure. Both scribe arms showed improvements in secondary burnout and cognitive-task-load outcomes (Mini-Z scores). The finding that two major market-leading tools produce divergent documentation-time outcomes under controlled conditions has a direct audit implication: baseline productivity assumptions embedded in an organization's ambient-scribe policy — and therefore in the audit sampling rationale — should be calibrated to the specific deployed tool rather than to the ambient-scribe product category as a whole. If an organization switched tools mid-year, auditors should re-establish a provider-specific baseline before interpreting drift as provider-level rather than tool-level.
International governance benchmark — NHS England ambient scribing guidance (March 31, 2026). NHS England published formal guidance covering the deployment of AI-enabled ambient scribing products across health and care settings. Four provisions are directly relevant as an international reference for U.S. deployers: (1) Patient notification — patients must be informed at the start of every session that an ambient scribe is in use; this is an active, encounter-level disclosure requirement, not a one-time consent form. (2) Data minimization — audio recordings and transcripts should ordinarily be deleted once a verified summary has been produced; retention beyond that point requires documented justification. (3) DPIA obligation — organizations must conduct a Data Protection Impact Assessment before deployment and maintain clear documentation of data controller and data processor roles when using a vendor-hosted tool. (4) Clinician accountability — healthcare professionals retain full responsibility for the accuracy of every clinical record produced with an ambient scribe; the guidance explicitly flags that additional human validation is expected in complex scenarios such as cross-language translation. Together, these provisions establish a governance framework that is directionally consistent with the AAN 2026 HITL position statement and with HIPAA's minimum-necessary and Business Associate Agreement requirements for U.S. deployers. Organizations using this skill for pre-deployment governance review should assess their ambient-scribe deployment against all four NHS provisions as an international benchmark, independent of direct jurisdictional obligation.
Procurement-Phase Accuracy Baseline & Vendor Selection Audit (v1.3 addition — May 2026)
Why this section exists. Across 2025–early 2026 the ambient-scribe audit conversation focused almost entirely on the post-deployment drift problem — what to do once a clinician auto-accepts a draft and the note has already been entered into the chart. A May 13, 2026 Ontario Auditor General report on AI use across the provincial public service shifted the audit conversation upstream, into the procurement phase, by publishing the first public-sector evaluation of a controlled multi-vendor accuracy bake-off across 20 government-approved AI scribe vendors. The findings are directly relevant to U.S. CDI, compliance, and informatics teams that participate in ambient-scribe vendor selection because they document the error-mode distribution that should be assumed at the start of a deployment, before any provider-specific drift baseline is established.
Key findings (procurement-phase testing across 20 approved vendors). Every vendor produced at least one accuracy defect in simulated-encounter testing. The error-mode distribution clusters in three patterns that map directly to clinical-risk categories:
- Fabricated treatment recommendations — 9 of 20 systems generated suggestions in the draft note that were not made by the clinician on the simulated recording. Reported examples include unsolicited therapy referrals and unsolicited lab orders. This is the same fabrication-risk class the audit's standard Section 3 already covers under "Phantom diagnoses on the A&P," "Procedure documentation," and "Patient education / AVS fabrication," and it is the highest-severity category at procurement because a fabricated referral or order written into the chart can drive a downstream clinical action without the clinician's intent.
- Medication-name capture errors — 12 of 20 systems captured a different drug than what was prescribed by the clinician on the simulated recording. This is a near-miss adverse-event class with implications across the medication-reconciliation, e-prescribing, controlled-substance, and pharmacy workflows. For audits, drug-name mismatch is a 🔴 CRITICAL finding when the mismatched entry was signed or transmitted to a pharmacy and is at minimum 🟠 HIGH when the mismatch was caught in a clinician edit pass.
- Mental-health-detail omission — 17 of 20 systems failed to capture important mental-health context that was present in the simulated recordings. The omission pattern is distinct from the fabrication pattern; it is a silent failure mode that the audit cannot catch from the note alone, only from a comparison against an auditable source (structured chart entries, the patient's pre-visit screen, the transcript). For nursing-led notes and behavioral-health encounters, omission of mental-health context is a 🟠 HIGH finding by default; if a screening result was captured by structured data but did not surface in the note, treat as 🟠 HIGH regardless of the auto-accept state.
Procurement-weighting concern as audit context. The Ontario report further documented that accuracy of generated medical notes was weighted at only 4% of the vendor selection score in the public-sector procurement, while domestic presence in Ontario was weighted at 30%, data privacy and legal controls at 23%, and system security at 11%. For U.S. health-system audit teams the implication is operational: when vendor selection criteria assign small weights to accuracy at the time of contract award, the downstream audit baseline must assume more vendor-to-vendor variation than vendor marketing materials imply, and the audit's sampling cadence in the first 90 days of any new deployment should be set to surface the dominant error modes early. The audit team should request the organization's own procurement-phase accuracy test results before establishing the post-deployment sampling design; in their absence the audit baseline defaults to the three Ontario-documented error-mode categories above.
Vendor selection audit checklist (use before contract award or at renewal).
- Has the vendor provided a controlled-encounter accuracy bake-off result against an organization-defined test set, not vendor-supplied transcripts? If no, flag as
LIMITED VENDOR-EVIDENCEand require an internal bake-off as a contract precondition. - Has the vendor provided a third-party audit report (SOC 2 Type II, HITRUST, or ISO 27001) and a current threat-risk assessment? In the Ontario evaluation, 11 of 20 approved vendors did not submit any third-party reports or ISO 27001 certification and 5 did not submit threat-risk assessments or privacy-impact assessments yet were still approved. Treat absence of either as 🟠 HIGH at procurement.
- Does the vendor disclose the model family, model version, training-data provenance, and update cadence used in the deployed product? Undisclosed model changes mid-contract are a known governance gap.
- Does the vendor's accuracy methodology include explicit testing for the three Ontario-documented error-mode categories: fabricated treatment recommendations, medication-name capture, and mental-health-detail capture? Vendors that test only against transcription-WER (word error rate) benchmarks do not address clinical-impact error modes and should be flagged at procurement.
- Does the contract include a performance-monitoring SLA with named error-rate thresholds, an evidence-based renegotiation trigger if thresholds are exceeded, and a clean-exit data-portability clause? Without these the post-deployment audit team cannot escalate sustained drift into a contract remedy.
Audit calibration update — new top-3 watch patterns. In addition to the standard Section 3 watch list, audits of newly deployed ambient scribes (first 90 days, or any vendor without an internal accuracy bake-off result) should add the following three patterns to the element-by-element findings table:
- Drug-name mismatch — Sampled against MAR / e-prescribing log / pharmacy submission record. 🔴 CRITICAL if signed-and-transmitted; 🟠 HIGH if caught in edit pass; 🟡 MEDIUM if caught pre-sign with no transmission. Sample at minimum 10% of new-prescription encounters in first 30 days.
- Mental-health context omission — Sampled against PHQ-2/PHQ-9, GAD-7, CSSRS, AUDIT-C, or other structured mental-health screen results in the chart. 🟠 HIGH if a positive screen exists in structured data but no narrative reflection in the note. Sample at minimum 10% of encounters with a positive structured screen.
- Fabricated referral or order — Sampled against orders placed, referrals queued, and patient-instruction printout. 🔴 CRITICAL if a referral or order appears in the note that does not exist in the orders queue or order-history. Sample 100% of first-30-day notes where a referral or non-routine order appears.
Cross-skill linkage. Drug-name capture findings should be routed to the medication-reconciliation-assistant.md workflow for closed-loop verification at the next encounter. Mental-health-omission findings in nursing-authored notes should trigger the v1.1 Nursing-Specific Bias & HITL extension (Sections 7 and 8 of this skill). Fabricated-referral findings during the first 90 days of a deployment should be escalated to the organization's AI governance committee per the standard governance trigger in Section 8.
Long-Horizon Agent Reliability & Patient Consent Calibration (v1.4 addition — May 2026)
Why this section exists. Two distinct mid-May 2026 signals expand the auditor's frame of reference upstream of the note itself. First, the May 2026 release of a long-horizon, policy-rich healthcare workflow benchmark provides the most rigorous external evidence to date that frontier general-purpose AI agents — the same model families that power most ambient scribes' summarization and reasoning layers — degrade sharply on multi-step, multi-role, multi-turn tasks under realistic operational pressure. Second, an April 2026 federal class action against two California health systems over ambient-scribe deployment without explicit patient notification has moved patient consent from a documentation footnote into a deployer-level legal-risk surface. Both signals reshape what auditors must include in scope when they review an ambient scribe deployment, even if neither directly changes how a single drafted note is graded.
Long-horizon agent reliability evidence (CHI-Bench, May 2026). A 20+ institution coalition spanning health systems and university research groups released a benchmark of 75 healthcare workflows across three operational domains — provider prior authorization, payer utilization management, and care management — exercising AI agents inside a high-fidelity simulator of approximately 20 healthcare applications exposed via 87 model-context-protocol tools, with a 1,290+ document managed-care operations handbook supplied as a skill. Across 30 agent-and-model configurations from major frontier vendors, the highest-scoring agent resolved roughly 28% of tasks under a strict-pass criterion, no agent cleared 20% under three-run reproducibility testing, and an endurance configuration that ran 25 cases in a single session collapsed performance to under 4%. By domain, utilization review reached approximately 41%, care management approximately 32%, and prior-authorization paperwork approximately 29%.
The audit implication is not that ambient scribes are also general agents — they are typically narrower, more constrained, and tuned to a specific encounter-summarization task — but that the underlying reasoning and tool-use layer shared across many ambient scribe vendors degrades under exactly the conditions that arise in real practice: long sessions, dense policy context, multi-role handoffs, and repeated draft cycles within a single shift. Auditors should accordingly:
- Increase the audit weight on late-session notes within a high-volume shift. When a deployment supports providers who draft 20+ notes per session, sample disproportionately from the back half of the shift; degradation is most likely there.
- Treat any agentic-orchestration extension to an ambient scribe (auto-order placement, auto-referral routing, auto-coding draft submission) as a distinct higher-risk audit class. The benchmark evidence shows the largest performance gaps appear when agents must navigate multiple tools and multi-turn dialogs, not when they produce a single note. Apply 🔴 CRITICAL by default to any auto-action taken by such an extension without a documented HITL turn.
- Refuse to extrapolate vendor pilot results onto enterprise scale without re-baselining. Pilot deployments typically run shorter sessions, fewer cases per session, and tighter clinician oversight than full enterprise rollout. The benchmark's reproducibility finding (no agent clearing 20% on strict triple-run pass) is the calibration anchor: variance between identical inputs is the rule, not the exception. Audit sampling cadence at enterprise launch should assume materially higher variance than the pilot baseline implied.
- Record the underlying model family and version in the audit header. When the ambient vendor swaps its backing model (a common mid-contract event noted in the v1.3 vendor-selection checklist above), variance characteristics can shift even when the vendor reports no functional change. The audit header should capture the underlying model identity at the time of the audited note so a sustained drift trend can be correlated against a model swap event after the fact.
Patient consent and recording disclosure (Sutter Health / MemorialCare class action, April 2026). A federal class action filed April 8, 2026 in the Northern District of California against two health systems alleges that an ambient-AI scribe was deployed during patient encounters without explicit notification or consent, and seeks to certify a nationwide class. The complaint cites the California Confidentiality of Medical Information Act, the California Invasion of Privacy Act, and the Federal Wiretap Act, and frames the legal harm as occurring at the moment of interception — when live audio is captured and transmitted to a vendor server — rather than at any subsequent storage or use event. A parallel action remains open against a separate California health system using the same product. Neither suit has been certified or decided as of this writing; the operational significance is that patient-consent procedure has now been pleaded as a stand-alone harm independent of any documentation defect.
For audit purposes, this expands the consent-and-disclosure scope of the procurement-phase audit (v1.3, above) and adds a per-encounter check that did not previously exist:
- Procurement-phase addition. Treat the vendor's patient-notification posture as a contract-grade requirement at procurement and renewal. The audit team should obtain, in writing, (1) the vendor's recording-and-transmission boundary diagram including out-of-region transmission, (2) the practice's standard patient-notification script and consent record-keeping pattern, (3) the practice's documented procedure for handling a patient who declines, including a documented non-recording fallback, and (4) the BAA's explicit treatment of audio captured during the encounter as PHI from the moment of capture. Absence of any of these four is 🟠 HIGH at procurement.
- Per-encounter addition (post-deployment audits in California, and as best-practice elsewhere). Add a Section 1 audit-header field that captures whether the encounter had a documented patient-notification turn (verbal at start of session, supported by signed acknowledgement, EHR-flagged opt-out tracked across encounters) and whether any opt-out was honored without record-keeping of the encounter audio. Missing notification documentation when state law (CMIA, CIPA, or the analogous statute in the practice's jurisdiction) requires it is 🟠 HIGH; capture of audio after a documented opt-out is 🔴 CRITICAL and should halt the claim and trigger a privacy-incident review under the BAA's breach-notification clause.
- Cross-skill linkage. The patient-notification artifact required by this addition is the operational complement to the
patient-ai-use-disclosure-notice.mdskill, which produces the underlying disclosure language. Use that skill to draft the notice; use this skill (Section 1 audit header) to verify that the notice was actually delivered at each encounter being audited. Where a state law mismatch surfaces (e.g., a one-party-consent state vendor deploying into a two-party-consent state without per-encounter notification), escalate to thepolicy-and-compliance-qanda.mdskill for a written legal-position memo before the next encounter cycle.
Interaction with prior sections. The CHI-Bench evidence sharpens — but does not contradict — the v1.2 DAX-vs-Nabla RCT finding (tool-to-tool variation) and the v1.3 Ontario AG finding (vendor-to-vendor accuracy variation at procurement). All three point in the same direction: ambient-scribe behavior is not a category property, it is a tool-and-deployment property, and the audit baseline must be set on the specific deployed configuration. The class-action signal adds a legal dimension to the v1.2 NHS England active-notification provision: NHS guidance establishes notification as a clinical-governance norm; the Sutter / MemorialCare complaints assert it as a statutory consent requirement. U.S. deployers should treat the two as aligned and operationalize active per-encounter notification regardless of which jurisdiction frames the obligation first.
Interaction with the RCT and NHS findings (already in v1.2). The Ontario procurement-phase findings are complementary, not contradictory, to the DAX-vs-Nabla RCT and the NHS England governance guidance. The RCT addresses time-savings divergence between two tools under controlled deployment conditions; the NHS guidance addresses deployment-phase governance (notification, data minimization, DPIA, clinician accountability); the Ontario procurement findings address procurement-phase accuracy and the implication that vendor selection weighting drives downstream audit sampling cadence. Together the three references constitute the 2026 evidence baseline for an ambient-scribe audit program covering vendor selection (Ontario), deployment governance (NHS), and post-deployment time-and-quality calibration (RCT).