Yujan's Comments at FDA Digital Health Advisory Committee on GenAI Enabled Medical Devices

November 23, 2024

Key Takeaways 🔗

  1. Transforming Post-Market Surveillance into a Benefit: Proposing methods to make post-market surveillance appealing to manufacturers by turning it from a regulatory burden ("stick") into a beneficial process ("carrot") that can reduce pre-market burden and time to market.
  2. Leveraging Synthetic Data and Frontier Models: Using synthetic data generation and foundation models (termed "frontier models") for site-specific data augmentation and automated validation, which can help detect data drift and model drift in AI medical devices.
  3. Predetermined Change Control Plan (PCCP) as a Tool: Utilizing the FDA's PCCP program to pre-certify test plans, allowing manufacturers to reduce pre-market testing burdens and facilitate faster market entry while ensuring ongoing post-market validation.
  4. Balancing Site-Specific and Collective Intelligence: Addressing the challenges of site-specific validation potentially limiting broader clinical insights, and proposing a combined approach that includes both site-specific validation and broader collective intelligence from multiple sites.
  5. Data Availability Influences AI Development Focus: Highlighting that AI development in medical devices is concentrated in areas like radiology and cardiology due to standardized and readily available data, suggesting that lack of standardized data in other fields limits AI innovation in those areas.
  6. Continuous Validation for Data and Model Drift: Implementing continuous validation mechanisms that operate regularly (e.g., nightly) to detect and adjust for data drift and model drift, ensuring AI models remain accurate over time.
  7. Statistical Considerations for Site-Specific Models: Emphasizing the importance of consulting statisticians to address confounding variables and ensure statistical validity when using site-specific models in clinical trials and validations.
  8. Potential of Fully Automated Validation: Exploring the future potential of fully automated data synthesis and validation to reduce costs and increase efficiency in ongoing device verification.
  9. Equity and Access in AI Medical Devices: Recognizing that overrepresentation of certain specialties in AI medical devices may exacerbate health disparities, and suggesting that expanding AI applications to underrepresented fields could improve equity.
  10. Regulatory Pathways Influence Innovation: Noting that current regulatory processes may not be conducive to AI innovations in certain medical specialties, implying that adjustments to regulatory pathways could encourage broader AI development.

Resources 🔗

Transcript 🔗

Moderator: We'll move on to the next speaker. We'll do questions at the end. Dr. Shrestha from Innolitics, you have five minutes.

Yujan: Hello, good morning. Yes, my name is Yujan Shrestha. It's very nice to be here. So I'm affiliated with my company Innolitics. They have paid for my hotel and flight, and I also owe my wife precisely 15 diaper changes as well, so I want to also disclose that conflict of interest there.

At my company, Innolitics, we are a software development agency, and we help our clients build their software medical devices and AI medical devices and get them cleared through FDA. Today, I would like to discuss a potential example of how I think we can turn post-market surveillance from a stick into a carrot, where a stick is something that manufacturers don't want to do and a carrot is something that they would want to do. This is just one example, but I also want to explore how this can be done more generally.

Transforming Post-Market Surveillance 🔗

Yujan: My role as a consultant is to help my clients allocate their resources. They have a certain set of key marketing claims that they want to achieve, and they have a finite budget and timeline. If you know someone that doesn't have a finite budget and timeline, please send them my way; I'll be sure to help them out.

Everyone wants the same thing—patients, FDA, industry—we all want safe and effective medical devices quickly and at a low price point. However, in the real world, trade-offs must be made because of budget and timeline constraints. These trade-offs, unfortunately, are mostly made in the pre-market, with pre-market clearance as the main focus. That's what most earlier-stage startups are just trying to get to—pre-market clearance—and you can understand that they don't really care much about the post-market at that time.

Unfortunately, in the post-market, this becomes a type of regulatory debt, and if you don't mind me using a software analogy, it's kind of like technical debt where new features always take precedence over fixing old debt. I believe the same thing happens to post-market analysis there.

So I believe that is one of the fundamental reasons why we only have 4 percent uptake of radiology AI, as was discussed yesterday. And also, we have the issues of data drift and lack of data generalization as well because of lack of post-market feedback.

Proposing a Win-Win Solution 🔗

Yujan: Let's change the narrative and turn it into a win-win situation for everybody. Why not turn post-market evaluation into something that can be pitched to decrease pre-market burden—and, in doing so, also decrease cost and time to market, reduce regional biases, and mitigate both data drift and model drift in the multi-layer model application design?

Here I'd like to present a hypothetical device, and I promise I'm not trying to get free advice from you guys—this is purely hypothetical for now. This is an example of a device that has an open-ended input and a closed-ended output. It uses LLMs of unknown provenance, and what it does is take DICOM headers and tag them with certain key Boolean classifiers for downstream uses like hanging protocols and other standardization.
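To make the hypothetical device's contract concrete, here is a minimal sketch in Python. The tag names, header fields, and keyword matching are purely illustrative stand-ins for the LLM-backed classifier described above, not an actual implementation.

```python
# Illustrative sketch only: a rule-based stand-in for the hypothetical
# LLM-backed classifier (open-ended header text in, Boolean tags out).
from dataclasses import dataclass

@dataclass
class HeaderTags:
    is_contrast_enhanced: bool
    is_prior_study: bool
    is_lateral_view: bool

def classify_header(header: dict) -> HeaderTags:
    """Take parsed DICOM header fields and emit closed-ended Boolean tags,
    e.g., for driving hanging protocols downstream."""
    description = header.get("SeriesDescription", "").lower()
    return HeaderTags(
        is_contrast_enhanced="contrast" in description or "c+" in description,
        is_prior_study="prior" in description,
        is_lateral_view="lat" in description,
    )

if __name__ == "__main__":
    print(classify_header({"SeriesDescription": "CT CHEST C+ LATERAL"}))
```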

Using Synthetic Data and Frontier Models 🔗

Yujan: Here's an example of a synthetic data generation pipeline that uses site-specific data and what I'm calling a frontier model. I'm defining a frontier model as a foundation model that is on the cutting edge. That frontier model uses a specific test data generation prompt that creates synthetic data, and then, through some testing mechanisms that people smarter than me discussed in more depth yesterday, we can verify this very focused intended use of the model.

So it's kind of like software of unknown provenance (SOUP) validation, where the SOUP could do lots of things, but you only care about the things that you need that SOUP to do for your device. It's a similar concept here. I also want to describe the concept of validating the frontier model for one more intended use, and that is to serve as a proxy ground-truther for this very narrow task.
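Here is a minimal sketch of both halves of that idea, assuming a hypothetical `FrontierModel` interface rather than any real vendor API: site-specific samples seed a synthetic-data prompt, and the same model is checked on a small labeled set for the single narrow task before being trusted as a proxy ground-truther. The prompts, the contrast-series example, and the accuracy threshold are all illustrative.

```python
# Sketch under stated assumptions; FrontierModel is a hypothetical interface.
import json
import random
from typing import Protocol

class FrontierModel(Protocol):
    def complete(self, prompt: str) -> str: ...

GENERATION_PROMPT = (
    "Here are de-identified series descriptions from our site:\n{examples}\n"
    "Generate five new, realistic series descriptions in the same style, "
    "returned as a JSON list of strings."
)

def generate_synthetic_headers(model: FrontierModel, site_samples: list) -> list:
    """Seed the frontier model with site-specific examples to produce synthetic test data."""
    seed = "\n".join(random.sample(site_samples, k=min(3, len(site_samples))))
    return json.loads(model.complete(GENERATION_PROMPT.format(examples=seed)))

def validate_as_ground_truther(model: FrontierModel, labeled_set: list,
                               min_accuracy: float = 0.95) -> bool:
    """SOUP-style check: validate only the one narrow task the device relies on
    (here, 'is this a contrast-enhanced series?') before using the model
    as a proxy ground-truther."""
    correct = 0
    for description, truth in labeled_set:
        answer = model.complete(
            f"Is this a contrast-enhanced series? Answer YES or NO.\n{description}"
        ).strip().upper().startswith("YES")
        correct += int(answer == truth)
    return correct / len(labeled_set) >= min_accuracy
```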

Combining Automated Processes for Validation 🔗

Yujan: If we combine the two together, we have site-specific data, fully automated synthetic data generation, and we can even have a fully automated data ground-truthing, and also a nightly comparison to the device outputs. You can imagine here, this operates nightly. Because it operates nightly, it can detect data drift as it occurs. Also, the foundation models could be swapped out, so that could capture data drift on a more global level as well.
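As a sketch of what that nightly job could compute—assuming the device and the validated proxy each emit one Boolean tag per case, and an illustrative agreement threshold that would in practice come from the pre-specified test plan:

```python
# Nightly comparison sketch: agreement between device outputs and the
# frontier-model proxy over the cases both processed overnight.
from datetime import date

def nightly_comparison(device_outputs: dict, proxy_labels: dict,
                       alert_threshold: float = 0.90) -> dict:
    shared = set(device_outputs) & set(proxy_labels)
    if not shared:
        return {"date": date.today().isoformat(), "n_cases": 0,
                "agreement": None, "drift_alert": False}
    agreement = sum(device_outputs[c] == proxy_labels[c] for c in shared) / len(shared)
    return {
        "date": date.today().isoformat(),
        "n_cases": len(shared),
        "agreement": round(agreement, 3),
        "drift_alert": agreement < alert_threshold,  # possible data or model drift
    }
```

Because the proxy model itself can be swapped for a newer frontier model, a drop in agreement can reflect local data drift, device model drift, or both.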

Just to conclude here—sorry about going a little bit over—I just want to say that I believe approaches like this would get the 90/10 trade-off balance just right, where we are giving something to the manufacturers that they can do in the pre-market that would also help them in the post-market as well. Thank you for your time.

Discussion on Site-Specific Validation 🔗

Moderator: My question is actually—I want to invite the other presenters, like the presenter from Deloitte, and Dr. Shrestha to the podium, and I also want to make some comments and, you know, just have a quick discussion and a question for you.

So I think the site validation question—I think we should really think this through carefully and hear more perspectives, especially from the FDA, because eventually, if you have a site-based validation, we are kind of committing to a post-hoc situation where everything is done after the model has been deployed. So that's one thing I want to point out there.

And in that case, we did have a pre-cert program at the Digital Health Center of Excellence. Can somebody comment on whether we still have the pre-cert program and how relevant it would be for getting sites approved or, you know, in line with what we're looking for? That was my first question. Anybody can answer it, and Troy, you can.

Troy: Yeah. So also, for the record, the "task as 80 percent rule" is not an FDA official position. So I just wanted to get that on the record. We currently do not have a pre-cert program, and pre-cert was not necessarily used officially for post-market. It was mostly around the pre-market side.

Moderator: Okay, so that's great to know. And the thing that I want to kind of also point out, which I was hearing, is that we are talking about the communication around it. We're talking about a collective intelligence problem versus a narrow intelligence problem for validating these algorithms. So a collective intelligence problem is that we have multiple clinicians validating a set of algorithms, and the collective intelligence is going into clinical decision-making and validating the algorithm.

If we switch to a site-specific thing, then the collective intelligence distribution of clinicians becomes narrower and kind of limited to that specific site, right? That's a very important thing to point out, because eventually we are looking at a situation where very narrow sets of clinical decision-making are going to happen at every site, and we will not know what the broader clinical community would think about it.

So the collective intelligence value of site-specific validation is going to be reduced. So, in that framework, my suggestion would be to do both—have a site-specific program, like pre-cert something, where people can do it, and then have a broader collective intelligence program where multiple sites can also put input into the same algorithm and kind of help the clinical decision-making move forward, so that we don't lose on the collective intelligence of the medical community for the decision-making.

And the third is that clinicians—and I'm also a professor of computer science and electrical engineering—so clinicians often operate in a POMDP format, a partially observable Markov decision process, where they really don't know a lot of the decisions they're going to make, but they have a notion of what's going to come up in the future. So I think that's something no AI is going to be able to capture, because a lot of it comes from intuition and clinical signatures.
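For readers unfamiliar with the acronym, the standard textbook definition of a POMDP (not from the talk itself) is the tuple and belief update below.

```latex
% POMDP: S states, A actions, T transition model, R reward,
% \Omega observations, O observation model, \gamma discount factor.
\langle S, A, T, R, \Omega, O, \gamma \rangle, \qquad
b'(s') \propto O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s)
```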

And I think that's why we need to have a broader community to put input into these things if we go with the site-specific validation model. And I really like Dr. Kafka's pragmatic trial idea, and that goes to Dr. Purkayastha's initial point about real-world validation. Thank you.

Like a lot of platform adaptive trials, you can randomize sites into different arms and have different arms running clinical trials together without having to run one massive site. Obviously, for that, you have to prevent confounding, because you cannot randomize sites to two different arms if they have different devices, different scanners, different clinicians. But you can still study the impact of this site-specific thing we are talking about, which is very critical.

So if we decide to go with the site-specific model, statistically there are going to be a lot of questions about who goes to which site, how we randomize patients, whether we randomize devices, and how we statistically explain this model, right? I would really encourage consulting a good statistician if you go down the site-specific road to prevent confounding.
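As one illustration of the kind of design a statistician might formalize, here is a minimal sketch of site-level randomization stratified on a likely confounder (scanner vendor). The site list, strata, and arm names are hypothetical.

```python
# Illustrative cluster randomization: balance arms within each scanner-vendor stratum.
import random
from collections import defaultdict

def randomize_sites(sites: dict, arms: list, seed: int = 42) -> dict:
    """sites maps site_id -> scanner vendor (the stratification variable);
    returns site_id -> assigned arm, balanced within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for site_id, vendor in sites.items():
        strata[vendor].append(site_id)
    assignment = {}
    for site_ids in strata.values():
        rng.shuffle(site_ids)
        for i, site_id in enumerate(site_ids):
            assignment[site_id] = arms[i % len(arms)]  # round-robin within stratum
    return assignment

if __name__ == "__main__":
    sites = {"A": "GE", "B": "GE", "C": "Siemens", "D": "Siemens", "E": "Philips"}
    print(randomize_sites(sites, ["site-specific-validation", "standard-of-care"]))
```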

Response to Site-Specific Validation Concerns 🔗

Moderator: Thank you so much. As you're answering that, Dr. Shrestha, I see you walking up to the mic. I just want to add a little question to get concrete while we have you, which is, you talked about continuous validation. And if we're taking a single device that is generative AI-enabled and putting it in multiple sites, that is something that could come with the device, if you will, that would help us.

Because we're not going to have the right data scientists, clinical informaticists, like at all of the institutions, the way Dr. Rari was mentioning. So, could you say something a little concrete about how that process could work as you're answering this question of continuous validation at the sites?

Yujan: Sure thing. Thank you. So, I'll try to address these, and apologies in advance if I get anything mixed up.

The first point, on the pre-cert program—I believe the current tool that we have from FDA is the excellent PCCP program. The way that I envision it, kind of from a software engineer mindset, is that it's like test-driven development, but you're getting your tests pre-approved by FDA in your pre-market submission.

And I'm always going back to the pre-market submission because I think that's the chance—that's the moment where we can convince industry to do the things there. Otherwise, the market will find the efficient pathway to market, and usually post-market analysis after that first FDA clearance is likely to fall through the cracks there.

So I think something like a PCCP—even that can be phrased as a carrot rather than a stick. We say, okay, this is something where you could go to FDA, perhaps get a clearance sooner, and this also ties in with the partial soft deployment mechanism there. Perhaps your current algorithm isn't tested enough for nationwide deployment, but with a PCCP, you can get clearance now, you can start generating some revenue now, and you can get that next round of funding, which everyone cares about.

But you have a pre-certified test-driven development type plan where you can say once we meet these criteria, then we can roll out to the rest of the nation. And this is key again, without having to go back to FDA. So I think that's how we can leverage the PCCP to solve some of these issues that I've been hearing.
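A minimal sketch of that "test-driven development" reading of a PCCP, with hypothetical criteria names and thresholds: the acceptance criteria are fixed in the pre-market submission, and wider rollout is gated on meeting them with post-market data, without a new submission.

```python
# Hypothetical pre-approved acceptance criteria gating nationwide rollout.
PRE_APPROVED_CRITERIA = {
    "sensitivity": 0.92,   # minimum
    "specificity": 0.90,   # minimum
    "n_cases": 500,        # minimum post-market sample size
}

def rollout_gate(postmarket_metrics: dict) -> bool:
    """True only when every pre-approved criterion is met."""
    return all(postmarket_metrics.get(name, 0.0) >= floor
               for name, floor in PRE_APPROVED_CRITERIA.items())

# e.g. rollout_gate({"sensitivity": 0.94, "specificity": 0.91, "n_cases": 812}) -> True
```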

As for the narrow site problem, you know, I think that could be something like a stochastic sampling, where the burden of proof doesn't necessarily need to be put on the customer but could be transferred onto the manufacturer. And again, in a way where, in the pre-market, perhaps this means we need fewer samples in the standalone performance test because we have assurance that, hey, we're going to get some validation in the post-market. So that could be a way to pitch it, to get more compliance on that.

So, like a stochastic sampling where images are fully de-identified, sent back to the manufacturer, and then in that case we can have a bit more control over what's there. That can even be phrased as a kind of ongoing verification test, which is, I think, another way in the pre-market where FDA can have some assurance that this verification test is going to be run in the post-market. And that tends to capture, I think, what this is—it is a verification test, it's running in the post-market, and we're using the same terminology that everyone's used to already.
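A sketch of that sampling step, under these assumptions: cases are represented as plain dictionaries of header fields, and the PHI field list shown is only a small subset of what a real de-identification profile (e.g., DICOM PS3.15) would cover.

```python
# Illustrative stochastic sampling plus naive de-identification before
# returning a small fraction of cases to the manufacturer.
import random

PHI_FIELDS = {"PatientName", "PatientID", "PatientBirthDate", "AccessionNumber"}

def sample_for_verification(cases: list, fraction: float = 0.02, seed: int = 0) -> list:
    rng = random.Random(seed)
    sampled = [case for case in cases if rng.random() < fraction]
    return [{k: v for k, v in case.items() if k not in PHI_FIELDS} for case in sampled]
```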

As for the fully automated validation and how that could potentially address healthcare disparities—I don't think the technology for fully automated data synthesis and validation is quite there yet. This is a little bit forward-thinking, but I think eventually it will be. And that's a case where, if those models are properly validated for that specific narrow use case, it could significantly decrease the cost of these types of studies.

I am a terrible salesman, but I can sell free. And if I say, hey, this thing can be done very cheaply and for free, that's something I can sell.

Moderator: Just a quick clarifying comment on that. I think for narrow use cases like radiology reports generated by large language models, it's totally fine, because we are underwriting a very low risk as the provider. But if you're talking about clinical decision-making that recognizes tumors or something more significant, that's going to be a much bigger question, because the radiology example that we've been hearing all morning is a very defined problem for a very defined LLM generating some reports, and that's fine. Of course, it has nuances and complexities, but there might be other, bigger issues if we validate a site-specific algorithm or site-specific paradigm: people might use the site-specific framework for things that are more complex than the radiology use case we have used. I just want to point that out. The radiology use case is great, but there might be more complexities if we go down that road.

Yujan: I like that you bring that up because I think we do have to recognize—we also talked yesterday about multimodal inputs. And that's something that's very different than having a select image and a select definition. So I like that not only for your comment about, you know, tumor, etc., but just thinking about multimodal in general.

Synthetic Data Versus Synthetic Control Trials 🔗

Moderator: Dr. Shrestha, our committee has one more question for you. Could you clarify the distinction between generating synthetic data and the pragmatic comparative effectiveness approach of synthetic controlled trials?

Yujan: Okay. Great. I'm actually not familiar with the pragmatic synthetic control trials.

Dr. Berry: Just because I think there is often a conflation between synthetic data that's been created and the approach where you're synthetically deriving a control arm. And so, I was curious whether or not you can differentiate between your proposal of creating synthetic data to help generate some of the evidence for accuracy and safety versus the comparative effectiveness method of comparing the outcomes?

Yujan: Sure, sure. Yeah. I think the way that I use synthetic data is more akin to data augmentation. I'm not sure if that helps clarify, but my vision is that we can use foundation models that have been trained at a much more global scale, right? And as those foundation models advance, they presumably would be trained on the most up-to-date data. So data drift can be captured in the weights of those foundation models at a global level.

And then if we use that foundation model to augment the data that is collected on site—on a nightly or weekly basis, whatever, but some kind of frequent cadence—then we can capture the global data drift along with the local data drift all in one setting, and again in a fully automated manner. I think there will be a lot more post-market compliance with something like that versus something more burdensome where, you know, you'd have to send the data back and get it manually annotated, which is less likely to be done.
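One simple way such a pipeline could flag local drift, assuming the device emits Boolean tags and a baseline distribution was fixed during pre-market testing; the metric and threshold are illustrative rather than a recommendation.

```python
# Illustrative drift signal: compare current tag prevalence at the site
# against a fixed baseline distribution.
def prevalence(outputs: list) -> float:
    return sum(outputs) / len(outputs) if outputs else 0.0

def drift_detected(baseline: list, current: list, max_shift: float = 0.10) -> bool:
    """Flag when the tag prevalence moves more than max_shift (absolute)."""
    return abs(prevalence(baseline) - prevalence(current)) > max_shift

# e.g. drift_detected([True]*80 + [False]*20, [True]*55 + [False]*45) -> True
```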

Dr. Berry: So do you imagine there are specific sites that have the unique capability of doing this? Or does that sit with manufacturers themselves?

Yujan: I was thinking that this would be the manufacturer. Yeah, pretty much my whole presentation, I was thinking that could be something the manufacturer would do as part of their ongoing verification of the product. And it's really not that much different from, like, the surveys that you have on your app—you know, "How are we doing?"—and it's common practice for websites to send back telemetry data about performance metrics and other things like that. So, you know, I think it just makes business sense, too.

The Challenge of Expanding AI Applications 🔗

Moderator: Dr. Kupchandani, you had a quick question, then Dr. Radman, Maddox, and then we'll close.

Dr. Kupchandani: Yeah, I just have a broader question about, maybe for the industry representatives. Many of these devices are predominantly in cardiology, radiology, neurology, and we don't see many devices, AI or non-AI, for urology, gynecology in an aging nation. I wonder if there's any focus on prostate health or such AI devices, and why this affinity just for radiology and cardiology? No offense to Dr. Bhat, but...

Yujan: I'll defer to more of a clinician here—that's not necessarily industry-related expertise.

Moderator: Sorry, what was the question? Why are cardiology and also radiology significantly overrepresented in these devices?

Dr. Kupchandani: And with the aging nation, few innovations in urology or gynecology, which is an equity issue as well.

Yujan: Gotcha. I think one thing that really comes to mind is just data availability. With cardiology and radiology, there's a lot of imaging data available, which I think historically has been the main focus of AI—of non-generative AI. You could also ask why dentistry is not in there. I think it's because the standardization around DICOM and related things hasn't happened yet inside of dentistry, so that's another barrier to entry there. Yeah, that's the main thing I can think of off the top of my head.

Moderator: It's a great point. Troy, absolutely.

Troy: Yeah, so we do studies of that, and it's a great question, but it always comes down to supply-and-demand issues, right? And to Dr. Shrestha's point, data availability becomes a big driver of where people invest. So when you think about data in the context of hospital systems, 50 percent of the data is pathology data, and yet there were no standards like the DICOM standard we've seen in radiology, so a lot of investment didn't go there. Radiology then represents probably somewhere around 25 to 30 percent—this is by data size, right? And then you had a smattering of other things like omics data, and EHR data is a very tiny little sample of that. So because the data had been standardized through DICOM, it was a little bit easier to apply AI to it, right, from a training perspective and then of course for testing and validating.

Moderator: Thank you so much.

Dr. Elkin: Can I just add a point about why radiology?

Moderator: Yes. And then we will go to Dr. Elkin and Dr. Maddox. Oh, sorry, Dr. Radman first.

Dr. Radman: I'll just go back to the slide that I used yesterday, where about 80 percent are coming in as CADt, and I'll just emphasize the fact that that's triage. It's not a really functional AI tool, but it's a low bar to get through FDA that manufacturers have used.

So uptake is just not taking place. But I would say that, to Troy's point, you have to have data to be able to train as well as use these algorithms, and medical imaging data is pretty robust and consistent. That said, it doesn't have to be limited to radiology; it should work in other disciplines like urology and gastroenterology. It's just that the process of getting a device through the regulatory pathway today is not amenable to that, and I don't see that changing with Gen AI if we keep the same process.

So I say risk is good. The process that's used to determine risk is good. Controls are good. But the existing ones are not good, and they're not going to create devices that resolve these equity problems and give access to other specialties.

Moderator: Thank you, Dr. Radman.
