Background 🔗
Recent advances in AI have made virtual scribes far more appealing: speech-to-text became dramatically easier with models like OpenAI’s Whisper, and downstream applications exploded with the arrival of models like ChatGPT.
While speech-to-text (STT) solutions have existed for a long time and have long been used for radiology and surgical note dictation, there was previously no compelling use case for them in patient-facing settings.
As a result, new applications of these technologies, such as virtual scribes, have taken off at a scale not seen before.
Widespread use brings the likelihood of increased utility, but it also brings failure modes never seen before, such as hallucinations.
This article explores our thoughts on the subject. The FDA has not yet issued formal guidance on the topic, so this is our attempt to extrapolate from the current techno-regulatory landscape into the new one.
Terms 🔗
Because this is a new industry, the vernacular has not had a chance to stabilize. Therefore, I feel it is important to establish some common language:
- Generative AI: AI that can generate new content not seen in its training or inference data, which is what allows it to hallucinate
- Hallucination: an output of generative AI that is not true and is often difficult to detect due to its authoritative tone
- Foundation model: an AI model that was trained on a massive amount of data—often internet scale. Foundation models are said to develop a general world view that allows them to generalize to unseen problems
- Prompt: an input to a foundation model
- Zero shot: prompting a foundation model without any examples or fine tuning. This heavily relies on the internal memory of the model.
- Few shot: prompting a foundation model with a few examples and/or additional context.
- Prompt engineering: a newly created discipline of crafting prompts to achieve a desired result. It is a set of design patterns, best practices, and risk mitigations
- Ambient scribe: a passive system that listens to a physician-patient interaction, then transcribes, diarizes, and stores the resulting text in another system. It does not interpret.
- Transcription: the act of converting speech to text
- Diarization: the act of determining who is speaking (e.g., physician or patient)
- LLM: large language model. A type of foundation model that processes language (text)
- Multimodal model: A type of foundation model that processes text, image, video, sound, other modalities, or a combination of modalities.
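To make transcription and diarization concrete, here is a minimal sketch of an ambient-scribe-style pipeline. It assumes the open-source `openai-whisper` and `pyannote.audio` packages; the model names and audio file path are illustrative, not a statement about how any particular product works.

```python
# A minimal sketch of an ambient-scribe pipeline: transcribe, diarize, merge.
# Assumes the open-source `openai-whisper` and `pyannote.audio` packages;
# the model names and file path below are illustrative.
import whisper
from pyannote.audio import Pipeline

AUDIO = "encounter.wav"  # hypothetical recording of a physician-patient visit

# 1. Transcription: convert speech to timestamped text segments.
stt_model = whisper.load_model("base")
transcript = stt_model.transcribe(AUDIO)

# 2. Diarization: determine who is speaking and when.
#    (pyannote's pretrained pipelines require a Hugging Face access token.)
diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
turns = diarizer(AUDIO)

def speaker_at(t: float) -> str:
    """Return the diarized speaker label active at time t, if any."""
    for turn, _, speaker in turns.itertracks(yield_label=True):
        if turn.start <= t <= turn.end:
            return speaker
    return "UNKNOWN"

# 3. Merge: label each transcribed segment with a speaker, then hand the
#    result to the storage system (here, just print it).
for seg in transcript["segments"]:
    midpoint = (seg["start"] + seg["end"]) / 2
    print(f'[{speaker_at(midpoint)}] {seg["text"].strip()}')
```

Note that the pipeline only transcribes and attributes speech; consistent with the definition above, it performs no interpretation.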
Risk Management 🔗
The first principle of medical ethics is non-maleficence (do no harm). This cascades to medical device development: much of premarket approval revolves around risk management.
Just because a new technology pops onto the scene does not mean we need to rewrite all of our existing frameworks for risk management, like ISO 14971. The general principles still hold, and I won’t belabor the prior art. Rather than present an exhaustive analysis of risk management with respect to AI scribes and foundation models in general, I will present scenarios that fall in the following categories:
- Risks that were not present before the foundation models boom
- Events that are key discussion points because they are common links in many sequences of events or are a common source of risk
- Risk mitigations that are likely to be used, or attempted, to mitigate identified risks, along with my prediction of how effective they will be
- Failure modes of foundation models and special considerations
Key Sequence of Events 🔗
Speech to Text Errors 🔗
Speech-to-text errors and their downstream consequences warrant special consideration for AI virtual scribes. These errors occur when the speech-to-text engine fails to accurately transcribe audio or incorrectly identifies speakers.
The challenge with such errors lies in their detectability. Clinicians rarely verify the audio source against the transcription, making it unlikely for errors to be caught. Consequently, these mistakes can propagate through the clinical record like a virus—a problem exacerbated by the common practice of copying previous encounter notes without source verification.
It's crucial to note that not all errors carry equal weight, necessitating specialized evaluation methods. For instance, a single-character error in medication dosing (e.g., omitting a "0") could prove fatal.
The downstream consequence of an error influences its risk. A useful framework for dividing the types of risk is the SOAP note: errors that affect only the “subjective” patient reports are probably less likely to result in severe harm than errors that affect the assessment and plan.
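As a rough illustration of this SOAP-based framing, the sketch below weights errors by the note section they fall in. The multiplier values are assumptions for illustration, not validated figures.

```python
# Illustrative multipliers; real values would come from a documented risk
# analysis, and errors would be localized to sections by the note structure.
SOAP_RISK_MULTIPLIER = {
    "subjective": 1.0,  # patient-reported narrative
    "objective": 2.0,   # exam findings, vitals, labs
    "assessment": 4.0,  # diagnoses
    "plan": 4.0,        # orders, medications, follow-up
}

def section_weighted_error_score(error_counts: dict) -> float:
    """Combine per-section error counts into one risk-weighted score."""
    return sum(SOAP_RISK_MULTIPLIER[s] * n for s, n in error_counts.items())

# Three errors confined to the subjective section score lower than two
# subjective errors plus a single error in the plan.
print(section_weighted_error_score({"subjective": 3}))             # 3.0
print(section_weighted_error_score({"subjective": 2, "plan": 1}))  # 6.0
```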
Example: DNR / DNI Status Error 🔗
Sequence of events:
- During a history and physical examination, ambient scribe software misinterprets the patient's DNR or DNI status.
- The physician fails to crosscheck the audio recording with the transcription.
- The electronic record is updated with incorrect DNR/DNI status.
- The patient arrives in an emergency environment where a decision to resuscitate must be made quickly.
- An incorrect decision is made—the patient is either resuscitated against their wishes or not resuscitated contrary to their wishes.
- The patient suffers severe harm as a result.
Example: Allergy Status Error 🔗
Sequence of events:
- During a history and physical examination, ambient scribe software misinterprets the patient's allergy status.
- The physician fails to verify the transcription against the audio recording.
- The electronic health record is updated with incorrect allergy information.
- The patient is prescribed a medication to which they are allergic.
- As a result, the patient experiences severe adverse effects.
Other Considerations 🔗
- Physicians are extremely unlikely to detect errors in audio transcripts because they are unlikely to go back and listen to a recording, even if the transcript references the exact time or position in the original recording.
- Therefore, protective measures like referencing the original recording are unlikely to be effective.
- Transcription errors can occur in critical areas like medication doses, problem lists, and similar-sounding medications (hydralazine vs. hydroxyzine, clonidine vs. Klonopin); a simple confusable-name check is sketched after this list.
- Transcription errors can arise from a lack of medical terminology in the underlying foundation model’s training data.
- LLMs may generate plausible but incorrect or fabricated medical information (hallucinations), which physicians may not easily detect due to the authoritative tone.
- Transcription errors can occur with medical abbreviations or jargon, especially when LLMs misinterpret terms with multiple meanings (e.g., “MS” could mean “multiple sclerosis” or “mitral stenosis”).
- LLMs may misinterpret negations or nuances in patient statements, leading to incorrect clinical assessments (e.g., misunderstanding “no chest pain” as “chest pain”).
- Transcription errors can arise from mishearing or misprocessing regional accents, dialects, or speech impediments, leading to inaccurate patient records.
- Physicians might over-rely on AI outputs without sufficient scrutiny, potentially overlooking critical clinical signs or symptoms not captured by the AI.
- Lack of transparency in LLM decision-making makes it difficult for clinicians to understand how conclusions were reached, hindering error detection.
- Transcription errors in vital information like allergy documentation (e.g., confusing “penicillin” with “penicillamine”) can result in serious adverse drug reactions.
- LLMs trained on general data may lack specialized medical knowledge, increasing the likelihood of errors in specific clinical contexts.
- Transcription errors due to misinterpretation of homonyms in medical terminology (e.g., “ileum” vs. “ilium”) can lead to incorrect clinical documentation.
- Physicians may not have sufficient training to critically evaluate AI outputs, increasing the chance of accepting erroneous information as accurate.
- LLMs may fail to capture the emotional or psychological context of patient interactions, potentially missing important cues for mental health conditions.
- Regular monitoring and updating of LLMs are necessary to ensure they remain current with the latest medical guidelines and research findings.
- There is a need for standardized testing and certification processes for medical LLMs to verify their safety and efficacy before they are deployed in clinical settings.
- LLMs may not recognize culturally specific expressions or idioms used by patients, leading to misunderstandings in patient assessments.
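Here is a minimal sketch of the confusable-medication check referenced above, using only Python’s standard library. The drug list is a tiny illustrative sample and the similarity cutoff is an assumption; a real implementation would draw on a curated look-alike/sound-alike list (e.g., ISMP’s confused drug names list) and a tuned threshold.

```python
import difflib

# Hypothetical, non-exhaustive list of look-alike/sound-alike drug names.
# A production system would use a curated list such as ISMP's confused
# drug name list; these entries are illustrative only.
CONFUSABLE_DRUGS = [
    "hydralazine", "hydroxyzine",
    "clonidine", "klonopin",
    "penicillin", "penicillamine",
]

def flag_confusable_medications(tokens, cutoff=0.7):
    """Flag tokens that resemble more than one known confusable drug name."""
    flags = []
    for token in tokens:
        matches = difflib.get_close_matches(token.lower(), CONFUSABLE_DRUGS,
                                            n=2, cutoff=cutoff)
        # A token near multiple confusable names is ambiguous and should
        # be surfaced for human review against the original audio.
        if len(matches) > 1:
            flags.append((token, matches))
    return flags

tokens = "patient was started on hydroxyzine 25 mg nightly".split()
for token, candidates in flag_confusable_medications(tokens):
    print(f"Review needed: {token!r} resembles {candidates}")
```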
Premarket Performance Evaluation Thoughts 🔗
To evaluate the accuracy of virtual AI scribe transcription in medical encounters, the objective function should be grounded in metrics that reflect the clinical risk of errors. Higher-risk errors should be weighted more heavily, in a manner that depends on the intended use. It is not possible to establish a predefined set of evaluation criteria for all devices; however, here are some considerations:
Prioritize Clinical Risk in Error Weighting (Weighted Word Error Rate)
- Error Classification Based on Clinical Impact:
- Critical Errors: Misinterpretations that could lead to patient harm (e.g., incorrect medication names, dosages, allergies, or critical diagnoses).
- Major Errors: Errors that might affect clinical decisions but are less likely to cause immediate harm (e.g., misrecorded symptoms, lab results).
- Minor Errors: Errors unlikely to affect patient care (e.g., grammatical mistakes, filler words, or transcribed pleasantries).
- Assign Weights According to Risk Level:
- Critical Errors: Highest weight
- Major Errors: Moderate weight
- Minor Errors: Lowest weight
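To make the weighting above concrete, below is a minimal sketch of a risk-weighted word error rate. The weight values and keyword lists are illustrative assumptions, and a real evaluation would derive them from the device’s risk analysis and use a proper token alignment rather than `difflib`.

```python
import difflib

# Illustrative weights and keyword lists; real values would come from the
# device's risk analysis and intended use, not from this sketch.
WEIGHTS = {"critical": 10.0, "major": 3.0, "minor": 1.0}
CRITICAL_TERMS = {"allergy", "allergic", "penicillin", "dnr", "dni", "mg", "mcg"}
MAJOR_TERMS = {"pain", "fever", "nausea", "glucose"}

def risk_class(token: str) -> str:
    t = token.lower().strip(".,")
    if t in CRITICAL_TERMS:
        return "critical"
    if t in MAJOR_TERMS:
        return "major"
    return "minor"

def weighted_wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    penalty = 0.0
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=ref, b=hyp).get_opcodes():
        if op == "equal":
            continue
        # Substitutions and deletions are charged against the reference
        # tokens; insertions are charged against the inserted tokens.
        errored = ref[i1:i2] if op in ("replace", "delete") else hyp[j1:j2]
        for tok in errored:
            penalty += WEIGHTS[risk_class(tok)]
    # Normalize by the worst case: every reference token wrong.
    max_penalty = sum(WEIGHTS[risk_class(t)] for t in ref)
    return penalty / max_penalty

print(weighted_wer("patient is allergic to penicillin",
                   "patient is allergic to penicillamine"))  # ~0.43
```

A plain word error rate would score this single-word substitution as 1 error in 5 words; the weighted version scores it much higher because the errored token is critical.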
Customize Metrics Based on Intended Use
- Use Case Specificity:
- Diagnostic Support: Transcriptions used for diagnosing require higher accuracy in medical terms and symptoms.
- Billing and Coding: Emphasis on correctly capturing procedural terminology and codes.
- Patient Communication: Focus on clarity and understanding for non-professional readers.
- Adjust Weights Accordingly:
- Tailor the objective function to prioritize errors that are most impactful in the specific context.
Leverage Standardized Medical Terminologies
- SNOMED CT Integration:
- Concept Recognition: Assess the AI's ability to correctly identify and transcribe medical concepts that map to SNOMED CT codes.
- Semantic Precision: Ensure that transcribed terms accurately reflect the intended medical concepts without ambiguity.
- ICD-10, CPT, and LOINC Codes:
- Accurate Coding: Verify that diagnoses, procedures, and lab results are transcribed accurately to correspond with the correct ICD-10, CPT, or LOINC codes.
- Billing and Reporting Compliance: Ensure transcriptions meet the necessary requirements for billing and regulatory reporting.
- RxNorm Integration:
- Medication Identification:
- Accurate Transcription of Drug Names: Evaluate the AI's capability to correctly transcribe medication names, including brand and generic names, and map them to the appropriate RxNorm codes.
- Dosage and Administration Accuracy:
- Precise Dosage Capture: Ensure that dosages, strengths, units, and frequency are accurately transcribed to correspond with the correct RxNorm concepts.
- Route of Administration: Verify that the AI correctly identifies and transcribes the route by which medications are to be administered (e.g., oral, intravenous).
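As a sketch of terminology-grounded checks, the snippet below looks up transcribed concepts and medication strengths against small hard-coded tables. The tables are illustrative stand-ins (the two SNOMED CT codes are commonly cited examples, and the strength lists are hypothetical); a real system would resolve terms against full SNOMED CT and RxNorm releases or a terminology server such as NLM’s RxNav.

```python
import re

# Illustrative stand-ins for terminology lookups. The SNOMED CT codes shown
# are commonly cited examples; the strength lists are hypothetical samples.
SNOMED_CONCEPTS = {
    "myocardial infarction": "22298006",
    "hypertension": "38341003",
}
KNOWN_STRENGTHS_MG = {
    "lisinopril": {2.5, 5, 10, 20, 30, 40},
    "metformin": {500, 850, 1000},
}
DOSE_PATTERN = re.compile(r"\b([a-z]+)\s+(\d+(?:\.\d+)?)\s*mg\b", re.IGNORECASE)

def recognized_concepts(note_text: str) -> dict:
    """Map concepts found verbatim in the note to their SNOMED CT codes."""
    text = note_text.lower()
    return {term: code for term, code in SNOMED_CONCEPTS.items() if term in text}

def implausible_doses(note_text: str) -> list:
    """Flag drug/dose pairs whose strength is not a known entry."""
    findings = []
    for name, dose in DOSE_PATTERN.findall(note_text):
        strengths = KNOWN_STRENGTHS_MG.get(name.lower())
        if strengths is not None and float(dose) not in strengths:
            findings.append(f"{name} {dose} mg is not a recognized strength")
    return findings

note = "Assessment: hypertension. Plan: increase lisinopril 100 mg daily."
print(recognized_concepts(note))  # -> {'hypertension': '38341003'}
print(implausible_doses(note))    # -> ['lisinopril 100 mg is not a recognized strength']
```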
Postmarket Performance Monitoring 🔗
Postmarket performance monitoring is a critical component in ensuring the ongoing accuracy and reliability of virtual AI scribe transcription systems. One effective strategy involves utilizing stored audio recordings of medical encounters to detect potential errors or significant deviations in transcriptions. By reprocessing these recordings with updated AI models, manufacturers can identify discrepancies between previous and new transcriptions, highlighting areas where the AI's performance can be improved.
This process can be designed to maintain stringent data security protocols. Reprocessing and analysis can occur entirely on the original recording device, ensuring that sensitive audio data does not leave the secure environment. This approach aligns with privacy regulations like HIPAA, safeguarding patient information while still enabling valuable performance assessments.
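As a minimal sketch of this reprocess-and-compare loop, the snippet below re-transcribes a stored recording with an updated model and flags encounters whose transcripts diverge. It assumes the `openai-whisper` package; the file path, baseline transcript, and drift threshold are illustrative.

```python
import difflib
import whisper

# Illustrative inputs; in practice the recording and its originally shipped
# transcript would come from the device's secure on-device storage.
RECORDING = "encounter_2024_10_31.wav"
ORIGINAL_TRANSCRIPT = "the patient has no known drug allergies"
DRIFT_THRESHOLD = 0.10  # assumed: flag transcripts that diverge by >10%

# Reprocess the stored audio with an updated model...
updated_model = whisper.load_model("medium")
new_transcript = updated_model.transcribe(RECORDING)["text"]

# ...and measure disagreement between the old and new transcriptions.
similarity = difflib.SequenceMatcher(
    a=ORIGINAL_TRANSCRIPT.lower().split(),
    b=new_transcript.lower().split(),
).ratio()

if 1 - similarity > DRIFT_THRESHOLD:
    # Surface the encounter for human review; no audio leaves the device.
    print(f"Flag for review: transcripts diverge by {1 - similarity:.0%}")
```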
Despite the benefits, many AI scribe manufacturers opt to delete audio recordings under the guise of enhancing data security. In reality, this practice is often driven by a desire to limit legal liability. Retaining audio recordings could expose manufacturers to litigation if transcription errors are found to have adversely affected patient care. Consequently, companies may avoid keeping these recordings, hampering their ability to conduct comprehensive postmarket monitoring. There may be a case for legal protections that would allow manufacturers to retain and use audio recordings solely for performance improvement without increasing their liability.
Revision History 🔗
| Date | Changes |
| --- | --- |
| 2024-10-31 | Initial Version |