Clinical Performance Assessments - Presentation by Ethan Ulrich

 December 23, 2024

AI/ML, Regulatory

Participants 🔗

  • Ethan Ulrich - Software Engineer, AI/ML
  • Ilya Spivakov - Software Engineer
  • David Giese - Partner
  • Sam’an Herman-Griffiths - Junior Software Engineer
  • Nicholas Chavez - Software Engineer
  • Mary Vater - Director of Regulatory Affairs
  • Pablo Hernandez - Software Engineer, AI/ML
  • Joshua Tzucker - Senior Software Engineer
  • Kris Huang - Senior Software Engineer

Key Takeaways 🔗

  1. Clinical Performance Assessments (CPAs): CPAs are critical for demonstrating the effectiveness of medical devices, particularly those involving machine learning or computer-aided detection. These assessments are necessary for regulatory submissions, such as FDA approvals.
  2. Study Design Complexity: Designing CPAs requires meticulous planning, including protocols for data collection, ground truth establishment, and evaluation methods like multiple-reader, multiple-case (MRMC) studies. These designs aim to balance cost and regulatory rigor.
  3. Handling Variability: Managing inter-reader and intra-reader variability is crucial. Techniques like adjudication and using multiple annotators help establish reliable ground truths and ensure study robustness.
  4. Importance of Statistical Significance: Demonstrating a statistically significant improvement in device performance is a primary goal. Techniques like Free Response ROC (FROC) curves evaluate sensitivity and false positives to quantify benefits.
  5. Regulatory Collaboration: Engaging with regulators (e.g., FDA) during pre-submissions helps refine study protocols and addresses specific requirements for device evaluations. This collaboration can mitigate risks of study rejection.

Transcript 🔗

Introduction to Clinical Performance Assessments 🔗

Ethan Ulrich: Well, thanks, everyone, for coming to the talk. I've wanted to do this talk about clinical performance assessments for a while. It's pretty relevant for a lot of devices that we handle, and especially for machine learning enabled devices that do some sort of detection. That's definitely relevant for those things. So I'm going to do…I did prepare a few slides. Let me see if I can share that easily. One sec…that was not working.

J. David Giese: Hey everyone! How's it going?

Ethan Ulrich: That big screen showing up?

Meri Martinez: Yes.

Ethan Ulrich: Okay, good. It's working. Okay. I've wanted to talk about clinical performance assessments for a while. And this is in the context of computer-assisted detection devices. So yeah, we'll start with what clinical performance assessments are, and then when they are required by the FDA, and then I want to highlight some of the details of clinical performance assessments.

What Are Clinical Performance Assessments? 🔗

Ethan Ulrich: Mostly just you know get a little bit more information about kind of the nuances of these studies, things that I've seen before. And maybe that'll help future submissions for anyone that's involved in a study like this.

What are clinical performance assessments? This is not like an official definition, but this is kind of a combination of multiple definitions that I found. A clinical performance assessment is an evaluation process designed to demonstrate the effectiveness of a device as it is used by the intended users on its intended patient population. And specifically for CADe devices, this type of assessment should demonstrate that when the readers use the device, they improve in some way. Typically something like their sensitivity at finding something versus when they're not using the device.

When Are Clinical Performance Assessments Needed? 🔗

Ethan Ulrich: That's kind of the main goal of this type of evaluation: showing that your device has a positive effect, both on its intended patient population and on the users that are using it. When are they needed? So just getting kind of in the weeds on the regulatory stuff. The FDA does say that for 21 CFR 892.2050 it is sometimes required, and then more often for 892.2070 it is required. And they do list out specific product codes, where these are CADe devices, for which they are required. But there are additional ones, like LLZ potentially, that I guess could require a performance assessment. When your device has the potential to influence the interpretation of an image, that is likely when you will need some sort of clinical performance assessment. It may not be designed exactly like they are for a CADe device, but if you're influencing a reader in some way, there likely will need to be this type of assessment. You'll probably need it if your chosen predicate has this type of assessment, since it would be really hard to get away from that. And you definitely need it when the FDA tells you to. They'll let you know if it's required during a pre-sub, and there may even be other reasons not covered here. But those are kind of the main ones to look out for with a new device looking for clearance, something that might need this type of evaluation. I'll get into some of the details of these studies.

Designing Clinical Performance Studies 🔗

Ethan Ulrich: I hope everyone did get a chance to go through the template that I created. That one was based on a previous submission for a client where we designed an MRMC study, and a lot of this is kind of based off of that. There are going to be lots of different ways to do these types of evaluations; there's no one-size-fits-all. I hope the template is general enough that, you know, it could be tailored to any future projects. But yeah, I encourage you to comment on those pages so we can maybe come up with something that would be really useful.

One of the steps for this type of evaluation is to go out and collect the data that the device will be assessed on, data that's representative of the intended patient population for your device. Typically, especially with CADe devices, it would contain disease cases and non-disease cases, and we'll get into the reason for that: you want to know sensitivity and specificity, so you'll need both types of patients in your data set. And when you're collecting data, it's good, early on as part of your protocol, to establish inclusion and exclusion criteria to explain what types of patients will be included and what situations would exclude them from the study.

And then the big thing for CADe devices is you must separate this data from any of the data that was part of your development or training of the device. Definitely for something like a machine learning based device, it cannot be associated with the training data. It's got to be completely separate. Once you have your data, you need to establish the ground truth of what's inside that data. And this all depends on what your device does, what types of things it's detecting or, in this case, segmenting. For this example here, you may have a device where one of its functions is to segment normal anatomy like an organ.

Typically, for normal anatomy and large anatomy, it's a little bit easier of a task, so it's not something that has a ton of variation associated with it. So for this type of ground truth establishment you have three annotators independently doing their segmentations of the liver, and then a computer program, actually an algorithm called STAPLE, can be used to combine those three into a single segmentation. And then you can use that as your ground truth. That is one example of establishing a ground truth for something like a liver.
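For intuition, here is a minimal sketch of that combination step using a plain per-voxel majority vote. The real STAPLE algorithm goes further, estimating each annotator's sensitivity and specificity with an expectation-maximization loop and weighting the raters accordingly (libraries such as SimpleITK ship an implementation). The mask names here are hypothetical.

```python
import numpy as np

def majority_vote(masks):
    """Combine binary segmentation masks with a per-voxel majority vote.

    Simplified stand-in for STAPLE, which additionally estimates each
    annotator's sensitivity/specificity via EM and weights raters accordingly.
    """
    stacked = np.stack([np.asarray(m) > 0 for m in masks], axis=0)
    votes = stacked.sum(axis=0)
    return (votes > len(masks) / 2).astype(np.uint8)

# Hypothetical example: three annotators' liver masks on a tiny volume.
rng = np.random.default_rng(0)
annotator_masks = [rng.integers(0, 2, size=(4, 64, 64)) for _ in range(3)]
consensus = majority_vote(annotator_masks)
```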

For more difficult segmentation tasks like lesions, something that's not supposed to be there in the body, typically you'll need a better way to establish that this is the ground truth. I hope this shows up okay for everyone; it is also on that template page if you want a zoomed-in version. Lesions can have a lot of variability when you do the segmentations. It depends on which reader is looking at it, what type of evaluation software they're using; lots of factors can affect what the final segmentation of a lesion will look like.

It's also the opinion of the reader: annotator one sees one lesion, but annotator two actually sees two lesions, and then annotator three might see one big lesion. What do you do about that? There's this variation in what they believe is the true segmentation. So there's no surefire way to establish the ground truth. The FDA does accept, I think, multiple methods to do it. This is just one way to do it, maybe a little bit on the more difficult side, but what you end up with is a very strong ground truth by doing it this way. The method here is that there would be an initial annotation. This would be annotators that are not, you know, top experts in their field, I would say; they could even be annotators from outside the US. They would produce the initial segmentation. That segmentation is given to a US radiologist that has a bit more experience, who would approve the segmentation. And if there's any error in the segmentation, they would actually change it. That's what I'm showing with this path right here: the reviewer changes it to two lesions. And then there would be a consolidation step where you would identify lesions that were annotated and also identify where there might be a discrepancy. That's what the yellow is showing: reviewer one did not find this second lesion, so something's going on there. There needs to be an additional step to establish this ground truth, a tiebreaker of sorts. One way to do that would be a majority vote: reviewers two and three found it and reviewer one didn't, so maybe reviewer one is wrong, and you just accept reviewers two and three.
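To make the consolidation step a little more concrete, here is a rough sketch of how lesion-level discrepancies might be flagged: each reviewer's mask is split into connected components, a lesion counts as confirmed by another reviewer if their masks overlap it sufficiently, and anything not confirmed by everyone gets queued for the tiebreaking step. The overlap threshold and function names are illustrative assumptions, not the protocol from the actual submission.

```python
import numpy as np
from scipy import ndimage

def flag_discrepancies(reviewer_masks, min_overlap=0.1):
    """Flag lesions that are not confirmed by every reviewer.

    reviewer_masks: one binary lesion mask (same shape) per reviewer.
    A lesion is "confirmed" by another reviewer if at least `min_overlap`
    of its voxels fall inside that reviewer's mask (an assumed rule).
    Returns (reviewer_index, lesion_label, n_confirmations) tuples that
    would be sent on for a tiebreaking decision.
    """
    labeled = [ndimage.label(np.asarray(m) > 0)[0] for m in reviewer_masks]
    flagged = []
    for i, lab_i in enumerate(labeled):
        for lesion_id in range(1, int(lab_i.max()) + 1):
            lesion = lab_i == lesion_id
            confirmations = 1  # the reviewer who drew it
            for j, lab_j in enumerate(labeled):
                if j == i:
                    continue
                overlap = np.logical_and(lesion, lab_j > 0).sum() / lesion.sum()
                if overlap >= min_overlap:
                    confirmations += 1
            if confirmations < len(reviewer_masks):
                flagged.append((i, lesion_id, confirmations))
    return flagged
```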

Handling Discrepancies in Annotations 🔗

Ethan Ulrich: Well, what I've often seen is that the FDA likes the use of an adjudicator, which is an additional expert that really just serves as the tiebreaker. They're presented with the lesions that have these discrepancies, and then they make the final decision. So that's what this is showing here. There's this yellow lesion that is kind of the suspect lesion, and the adjudicator makes the final decision: yep, this is a lesion and this is the boundary for it. It's kind of a really tedious part of the study, being able to go through all of your data; in some cases you have hundreds of cases that need to be ground truthed. But in order to give evidence to the FDA that this is a strong ground truth that isn't suspect or subject to a lot of variability, this might be a method that you'll have to use to establish that. I know that was a lot in this path, so I do want to pause if there's any questions about what we're talking about on this slide. I'm open to any questions or if something's not clear.

Sam’an Herman-Griffiths: You said the adjudicator is given an MRI, well, an image that has a discrepancy. Are they also given the opinions of the previous reviewers, to take into account what others have said, or just?

Ethan Ulrich: Yeah, I think actually either way would work. It just has to be part of your protocol. Like, one way would be to present them with the opinions of all three and say, which one is correct? Or give them this kind of combined version and say, just edit this annotation so that you believe it's correct.

Incorporating FDA Guidelines in Study Design 🔗

Ethan Ulrich: That would be another way to do it. Again, there's not like a cut-and-dried way to do these adjudication steps or this entire protocol for ground truthing. This is just one way, and it's definitely recommended to propose these protocols in a pre-sub with the FDA so that they can look at it and say, yep, that's a good way to do it. And sometimes there's an easier way to do things; I don't know if the FDA would actually recommend that to you, but it's definitely good to propose your protocol in a pre-sub, to have it looked at by the FDA before you go out and do it.

J. David Giese: This isn't really a question, but maybe just a few comments that might be interesting. So there's almost always this tension between the cost of doing this, because of course you have these expert radiologists who you're paying a lot of money, especially if you have a lot of cases. So there's the cost on the one hand, and then there's what the FDA will let you get through your submission on the other hand. A lot of what we're doing is helping companies try to balance those two needs so that they don't go out of business because it costs so much, and they also don't go out of business because they can't get past the FDA. And there's so many places in this chart here where things have gone wrong. Like, we've done annotating with a separate AI model for the first step, and sometimes it's okay, sometimes it's not. But there's also questions around the reviewers' expertise, like how many years of experience do they have? We've had FDA say, hey, these guys aren't experienced enough. We've even had a case recently where they got a non-substantial equivalence, where the argument was: even though we agree your readers are all good, they're so variable from each other that we just don't understand how to interpret the data. Which was a really surprising one, and we almost appealed that. There's also been a lot of questions about how you combine the three together; the STAPLE method is one way, but there's other ways to do it. Sometimes the adjudicator is actually the third person, so you'll have two people do it and then a third will only look at it. So there's a lot of different ways you can do it, and it definitely depends a lot on the specific clinical situation and a lot of factors. But anyways, sorry to jump in, Ethan.

Ethan Ulrich: That's all right. That's a really good point, because, like, why even have this first step where there's these annotators whose work is then passed on to another set of three people? One of the reasons is sometimes this first step can take multiple hours. And like I said at the start, these could be radiologists maybe outside the US or from an annotation service, that are maybe not necessarily experts, but they're cheaper than these reviewers that have ten years of experience. So they put in the grunt work of the two hours of annotation for each case at a cheaper cost, and then hand it off to the reviewer, who maybe spends 15-30 minutes of time to just review the case. And so it's sometimes a cost savings to do it this way. That's another reason for that: wanting to balance having a really strong ground truth versus not having to pay so much for achieving that ground truth. Sam’an, did you have a question?

Sam’an Herman-Griffiths: Yeah, I want to ask David, how do they determine whether or not they're too variable? Are you saying, like, variable in their opinions on what the ground truth is, or in their, like, experience, or?..

J. David Giese: Too variable in their opinions of what the ground truth is, exactly. And another factor: for nonstandard things, if the radiologists aren't doing it in their day-to-day practice, and sometimes even if they do, it's like, where exactly is the boundary of this lesion? It's not always clear, and you could do it multiple different ways. And so having a protocol that's really detailed, that all the radiologists actually read and actually follow, is a big part of this too. And what can happen is you can have variability between the readers just because they define their boundaries a little differently. And we've had studies where that happens and then the results aren't good. And then it's like, well, okay, can we go back and edit it, and there's all these kinds of questions that come up. And another really common issue is if the training data was done using kind of a looser protocol by less skilled people, and then the radiologists have their own opinions about how the annotations are done, then your performance may not be very good, because the model is trained to do something a little bit different than what the reviewers do, and that's also problematic.

Ethan Ulrich: Yeah, all good points.

George Hattub: I have a question on the reviewers. What has been your experience with the FDA? Do they all have to be US board certified?

Ethan Ulrich: I mean, I would say yes, that's what they really, really want.

George Hattub: That's what they want.

Ethan Ulrich: I've not been in a situation where they've accepted anyone outside the US for that role, like the review step here, and definitely not for the adjudicator.

George Hattub: Okay. Thank you

Ilya Spivakov: So I also have a question. From the figure it seems like the data from annotator one all goes through reviewer one. Would you ever want to mix up some data between the different reviewers? Because if, let's say, the annotator and the reviewer have a similar bias, it might amplify it.

Ethan Ulrich: Yeah, that's actually something I thought about. It definitely complicates the diagram. But essentially annotator one, for a certain number of cases, would be reviewed by reviewer two, is that what you're saying, Ilya?

Ilya Spivakov: Yeah. Basically, you know, mix and match: separate into three piles and then give each reviewer annotations from all three annotators. I guess you don't want to duplicate the whole data set, though. You want a subset of all the other annotations.

Ethan Ulrich: Well, yeah. In this case there's three instances of an annotation for one patient. But yeah, I definitely know what you mean. It would probably be best if there's just this initial step where the three annotators make their annotations, and then the reviewers review a sample from each annotator, so that they're not just reviewing one annotator. I do believe that would be a better method. Maybe not necessary, but yeah, probably a better method to avoid certain biases or, you know, like you said, they might have a similar bias, and then that bias gets propagated throughout the annotation protocol.

Ilya Spivakov: So would it be too expensive to go back and redo that if you find too much variability or now that you're in the process you can't change it?

Ethan Ulrich: For that situation, it might be time that's the issue. For that scenario where you would be criss-crossing the annotators with the reviewers, you probably need to have them go through the entire data set before you hand it off to the reviewers, because then you have to provide the sample to the reviewers. With this method, as soon as annotator one is done with a case, they hand it off to reviewer one. It's just a little bit more efficient that way.

Ilya Spivakov: Okay. Thank you.

Sam’an Herman-Griffiths: Sorry. One more question. I was wondering because you had said there's really no set way to do this process. Are there any guidelines or is it just kind of that they want experts to review it?

Ethan Ulrich: I mean, there is the CADe guidance…this one right here, sorry, it's not on the screen: Computer-Assisted Detection Devices Applied to Radiology Images and Radiology Device Data. That one does have a little bit of guidance on establishing ground truth, but it's still pretty high level. It all depends on the domain, depends on the target, well, what you're actually trying to detect. Certain types of lesions maybe are easy to segment and there's not that much variability across readers, so maybe a less strict protocol would be needed. Yeah, it's really variable. It's hard to know for sure. That's why doing a pre-sub and talking with the FDA about it before committing to a protocol is probably the best way.

Sam’an Herman-Griffiths: So it is case specific.

Ethan Ulrich: Yeah.

Nicholas Chavez: I have a question. Is there a way to have, like, a rationale? So if you're trying to get a bunch of reviews done and you have a situation where the content is very niche, or maybe doesn't have that many experts, is it a hard-and-fast rule that you need X amount of reviewers, or is there a way you can give a rationale? Like, this is a really niche symptom, or disease for that matter, and you just don't have the resources available to actually have that many reviewers. You might have a lot of annotators, but your hands are tied for the number of reviewers you can get.

Ethan Ulrich: Yeah, that's definitely not a good place to be in. For that situation, again, a pre-sub with the FDA to explain it to them is definitely needed. Like for the situation with *Censored*, it's brains of pediatric patients. Not a lot of radiologists are qualified to do segmentations on those brains; you typically want a neuroradiologist, someone that's specific to the brain, to do that. So we kind of ran into that. They definitely wanted neuroradiologists to be the reviewers. So yeah, we kind of had to find those, and that wasn't very easy. But yeah, I'm not sure that fully answers it.

J. David Giese: I can say there's another client we'd worked with where we were able to make the case that only one reader was needed and not three. So I wouldn't say never, but I would say most of the time you do need three. It kind of depends on the intended use of the device and how important it is to assess the inter-reader variability. And even in that case, the way it was done was doing a study to show that the inter-reader variability was low, because they were very standard and easy things to measure. So you had to justify why it was appropriate, or whether it was necessary, to do all three readers in that case.

Ethan Ulrich: Thanks, David. Yeah, let's get into the actual study design. So you went out and collected all your data, you've done the ground truthing part, and the next step is to actually evaluate the effect your device has on the readers that use it. The way that it is evaluated is a multiple-reader, multiple-case study where each reader will interpret each case twice under different conditions: once using the device, once not using the device.

Evaluating Device Impact on Reader Performance 🔗

Ethan Ulrich: And because it's done both ways, you can measure the effect that it has on those readers. For most situations, a washout period and a specific case reading order are used to remove, or at least reduce, the potential for biases. One of those is memorization, which you can see when you go to the next slide. So this is an example of the reading scenarios for memorization. Each reader has to read each case twice. Group A reads case one in reading session one without the assistance of the device, then there's a 30-day break where you don't do anything, and then in session two you read that same case again, but with the assistance of the CAD device, and all of these will be in a random order. The reason this is done is that if you don't wait 30 days, there's a potential the reader can go, oh, I remember this case. I don't know how often that actually happens, but this 30-day washout period is supposed to mitigate that effect. And then we've had a lot of feedback from the FDA about handling these reading scenarios. I have done a study where every single reader started without the device, then there was a 30-day washout period, and then they used the device. But the FDA has been pushing back recently to have these types of scenarios where, on the day that you read the cases, half of them will be with the CAD device and half of them will be without. Doing it that way also mitigates certain effects of the order in which you look at cases with and without the device. So yeah, that's kind of an interesting thing. And it is quite a bit of bookkeeping to set up a study like this, making sure that every reader is reading every case, that they're reading in different orders, and that they're reading under different reading scenarios. So it is pretty complicated. I hope this diagram, kudos to Meri M for putting it together, is helpful. The main idea is that you set up your study design so that readers go through the cases in a different order, under different scenarios, and they will eventually go through every case once; then there's a break of 30 days, and then they'll go through the cases again, and every case's reading scenario will be flipped on the second day.
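To give a feel for the bookkeeping involved, here is a small sketch that generates the kind of fully crossed schedule described above: every reader reads every case in both sessions, each session is split half with and half without the CAD device, the condition flips after the washout, and case order is randomized per reader and session. The counterbalancing in a real protocol would come from the statistical analysis plan; the reader and case names here are placeholders.

```python
import random

def build_reading_schedule(readers, cases, seed=0):
    """Toy fully crossed reading schedule with a flipped condition in session 2.

    Session 1: every reader reads every case in a random order; half of the
    cases are read with CAD assistance and half without.
    Session 2 (after a washout of roughly 30 days): the same cases in a new
    random order, with the opposite condition for each case.
    """
    rng = random.Random(seed)
    schedule = []
    for reader in readers:
        # Which cases this reader sees with CAD during session 1.
        with_cad_first = set(rng.sample(cases, k=len(cases) // 2))
        for session in (1, 2):
            order = cases[:]
            rng.shuffle(order)
            for case in order:
                aided_in_s1 = case in with_cad_first
                aided = aided_in_s1 if session == 1 else not aided_in_s1
                schedule.append({
                    "reader": reader,
                    "session": session,
                    "case": case,
                    "condition": "with CAD" if aided else "unaided",
                })
    return schedule

rows = build_reading_schedule(["R1", "R2", "R3"], [f"case{i:03d}" for i in range(1, 11)])
```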

Mary Vater: Hey Ethan.

Ethan Ulrich: Go ahead Mary.

Mary Vater: Go back down to the other slide. So I had it come up one time where the client really didn't want to do the first session half with CAD and half without CAD, because they felt that just through use of the device the readers would inherently get better without the device too, that by using the device with the AI they would figure out how to do it better and become more accurate just through using the tool. So they tried to push back with the FDA and argue that, you know, it would falsely inflate the performance without the device in the second session, because the readers would have effectively been trained during the first session. But the FDA didn't buy that argument, and they still wanted them to separate the sessions the way that you've drawn it out here. I thought that was kind of interesting, that they were pretty confident their device was going to make people better even when they weren't using the tool. So it was interesting.

Ethan Ulrich: And I agree with that for the most part, like there is a psychological effect. What Mary's referring to is this first session for group B, where they actually start with the assistance of the device, and for something like detecting lesions, the device will help you find those lesions. And then, you know, suddenly you're on the lookout for all these lesions. Once you hit the second half of this first day, session one, you're kind of on the hunt for these lesions, and you're really sensitive to them. And yeah, that could change, or, you know, affect the estimate of that reader's performance. It's hard to know; I guess there maybe is some post analysis you could do to show whether group A is different than group B. But yeah, I know the study you're talking about, and the FDA did really push back on making sure that this was done. They called it the variability within the day. So yeah, in session one, maybe reader one didn't have a lot of coffee, but they did when they had session two, or something. It's supposed to help mitigate the effects of the variability for the specific day the reading was done.

Kris Huang: So this idea of the FDA not caring about these sorts of effects is sort of related to another idea that pops up a lot in, like, clinical trials. It's a concept called intention to treat, which is where, if you have a treatment that you perform and the patient, for whatever reason, has to go to the other arm, or if they didn't have the treatment at first and then they got the treatment later, you actually still have to keep them and analyze them as if they never got the treatment. And so it's just a way of being very conservative, so that if you do register an improvement, it really is real, because you're sort of tying one of your hands behind your back and still showing a positive effect. So it kind of brings that to mind for me.

Ethan Ulrich: I want to make sure I get through some of these. I really only planned like half an hour for my slides, but we'll see if I can make it through them. One thing you have to establish is how a reader marks their detected lesions as part of the experiment. So this is past the annotations for the ground truth; this is where we're doing the experiment. You give a reader a case and they have to identify where the lesions are. One way to do that is just with a point annotation. You might need some software to facilitate that. So the reader will look at the case, provide a point annotation, and then for every point annotation they associate a confidence: how confident they are that this mark represents a lesion. I know it looks like X to the 25th power, but that's X marks the spot and the confidence is 25. That's what this represents here. So yeah, the experiment will allow the readers to go through their cases and provide these annotations. These confidence scores are really important for the evaluation; we'll go over it a little bit in another slide, but it's more than just sensitivity and specificity, it's can you detect the lesions, and then how confident are you at detecting these lesions. And then once the experiment is done and they've made all their detections, the next step is to associate those detections with the ground truth that's already been established. I hope that shows up okay. For the red lesion, that's the ground truth: reader one found it, and reader n found it, and then reader one and reader two also generated false positives.

Typically some software is written to capture those detections and classify them as true positives or false positives, and those will be captured in something like a spreadsheet that will be the input for the analysis software. That's what this might look like for the situation here. Long story short, you're just associating what the readers marked with the ground truth, determining if each mark is a true positive or false positive detection, and making sure that each detection has a rating associated with it. I do want to talk about FROC curves. First, FROC stands for Free-Response ROC. I imagine quite a few of you are familiar with ROC curves; FROC curves are similar. The top one here is an ROC curve. I know it's not square, but the diagonal is not where you want to be; that's the performance of random guessing. You want your curve to go up towards this top left corner. Same thing with a FROC curve. The only real difference is we still have sensitivity as the y-axis, but the x-axes are different: for ROC it's one minus the specificity, and for FROC it's actually the false positives per image. I'll go over that in a little bit. The nice thing about FROC is that it does a better job of accounting for cases that may have multiple lesions; it captures those situations a little bit better, whereas ROC kind of squashes the data in terms of multiple lesions, it's not as equipped to evaluate those.
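The kind of bookkeeping software mentioned above, which matches reader marks against the ground truth and assigns true positive or false positive labels along with the confidence rating, might look roughly like this. The point-within-a-radius matching rule and the field names are assumptions for illustration; real protocols often define a hit as a mark falling inside the lesion boundary instead.

```python
import math

def score_detections(detections, ground_truth_lesions, match_radius_mm=10.0):
    """Classify reader point detections as true or false positives.

    detections: dicts like {"x":..., "y":..., "z":..., "confidence": 0-100}.
    ground_truth_lesions: lesion centroids (x, y, z) in the same space.
    A detection is a true positive if it falls within `match_radius_mm` of an
    unmatched ground-truth lesion (an assumed rule). Returns analysis rows of
    the kind that would feed the MRMC analysis software.
    """
    rows = []
    matched = set()
    for det in detections:
        best, best_dist = None, None
        for idx, centroid in enumerate(ground_truth_lesions):
            if idx in matched:
                continue
            d = math.dist((det["x"], det["y"], det["z"]), centroid)
            if d <= match_radius_mm and (best_dist is None or d < best_dist):
                best, best_dist = idx, d
        if best is not None:
            matched.add(best)
            rows.append({"lesion": best, "rating": det["confidence"], "label": "TP"})
        else:
            rows.append({"lesion": None, "rating": det["confidence"], "label": "FP"})
    # Ground-truth lesions with no mark are false negatives (unmarked lesions).
    for idx in range(len(ground_truth_lesions)):
        if idx not in matched:
            rows.append({"lesion": idx, "rating": None, "label": "FN"})
    return rows
```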

I don't know if you've seen the illustration of the dartboard. For the situation on the left, you could throw a lot of darts at a dartboard and you're probably going to hit your bullseye. You're going to have good sensitivity hitting that bullseye, but you're also going to generate a lot of false positives, whereas the situation on the right is much better: you're hitting your bullseye and maybe there's one false positive that's been generated. That's kind of what we're dealing with here. Same as with ROC analysis, with FROC analysis you want to be able to find your lesions without generating additional false positives. This is an example from one of our MRMC studies, and this is kind of what you're looking for at the end of your study: you have your curve with the device and your curve without the device, and there's a statistical analysis that can show that there's a significant difference between them. You want this line to be up closer to the top left corner. So that's a very good result. Yeah, I want to go over some of the situations we went through with one of our clients to describe what constitutes a benefit of the device.
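The FROC operating points just described can be computed from rows like the ones in the previous sketch; a minimal version, assuming 0-100 confidence ratings, is below. Plotting these points with and without the device gives the two curves being compared.

```python
def froc_points(rows, n_images, thresholds=range(0, 101, 5)):
    """Compute FROC operating points from scored detections.

    rows: dicts with a "label" of "TP"/"FP"/"FN" and a confidence "rating"
    (None for unmarked ground-truth lesions, i.e. FN rows).
    For each threshold, lesion-level sensitivity is the fraction of
    ground-truth lesions detected at or above that confidence, and the
    x-value is the average number of false positives per image.
    """
    n_lesions = sum(1 for r in rows if r["label"] in ("TP", "FN"))
    points = []
    for t in thresholds:
        tp = sum(1 for r in rows if r["label"] == "TP" and r["rating"] >= t)
        fp = sum(1 for r in rows if r["label"] == "FP" and r["rating"] >= t)
        points.append((fp / n_images, tp / n_lesions))
    return points  # (false positives per image, sensitivity) pairs
```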

The whole point of this study is to show that the readers are benefiting from using your device. So there are a bunch of different scenarios where your device will have an effect on the reader. For these next images, the image on the left is the reader's decision when they are not influenced by the device, and the one on the right is what they decided while using the device. So in this scenario, when they didn't use the device, they missed a lesion, and when they used the device, they found the lesion, a true positive, and overall that's a great thing: you improved the reader's sensitivity. That's kind of the standard way you improve a reader, by helping them find lesions that they would have normally missed. Then the opposite, which would be detrimental to your device's performance, is that they are able to find a lesion without using the device, and then, for whatever reason, when they use your device they get worse and don't find the lesion. That would be a bad thing, so you want to avoid those. But yeah, that's just how, when you're evaluating the device, this scenario will be counted as a detriment to your device.

Another situation that's maybe not as obvious is improving the confidence of a true positive detection. In the first situation, when they weren't using the device, they gave it a confidence of 85. However, when they used the device, they boosted their confidence, so they're much more certain that this is a lesion after using the device. That is also a good thing for the evaluation; it shows that there is a positive effect of your device. So maybe not as obvious, but that's another thing that can happen. It's not just about finding the lesions, it's also about how confident you are at finding them. Another situation: false negatives in either scenario, using the device or not using the device, they didn't find the lesion. That's actually kind of a neutral effect of your device. It's not good to miss lesions, obviously, but if your device isn't the reason they missed it, that's kind of not a big deal; it just is what it is. And another thing that would be bad, and that can happen a lot with devices, is that the device generates more false positives, or causes a reader to generate more false positives. In the ground truth there is no lesion; when they didn't use the device, they didn't find anything, but when they used the device, for whatever reason, they thought they saw a lesion. That would be a bad thing for your device's performance, and you hopefully can avoid those.

But then the opposite is obviously a good thing: they generated a false positive when they weren't using the device, but when using the device, they didn't put a detection there, and that would also be a positive mark for your device. I think the last one might be reducing the confidence of false positives. That would also be a good thing; you don't want to be generating false positives, and if the device can, I don't know, cast doubt on those false positives, that could also be considered a good thing as part of the evaluation. So yeah, I think that's all I have there. Sometimes it's helpful to go through that, because it's not very obvious why we're doing all these detections and giving them scores. That's why I wanted to go through it; it just paints a little bit better picture of what's going on and why we do this evaluation this way. And just keep in mind this is just one reader reading one case. For the study we have multiple readers and multiple cases, so this is happening many, many times over. So there's lots of data there. And then yeah, at the end of the study you do your evaluation and hopefully get a nice positive result, a statistically significant difference using your device versus not using the device. So yeah, sorry that explanation took a while. I think sometimes it's good to go into the weeds of this type of study, but I'll open it up for questions and discussion. Did anyone have burning questions after going through the guidance and the protocol?

Sam’an Herman-Griffiths: I have one I don't know if it's burning, but I was wondering how statistical significance is determined.

Ethan Ulrich: For FROC it's…let me go back to that. It's usually related to this curve. There are two curves that are generated, and you can calculate the area under those curves. And then there is a statistical method. I'll admit I do not know all the ins and outs of that statistical method, but there is a pretty well defined statistical method that will evaluate your data to determine if this difference is significant. It accounts for the variability caused by the readers, both inter-reader variability and intra-reader variability, and also the variation associated with the cases that were selected for the study. Yeah, it accounts for multiple sources of variation while determining this significance.
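Just for intuition, here is a toy sign-flip test on per-reader figure-of-merit differences. It only captures reader variability; the accepted MRMC analyses (for example the Dorfman-Berbaum-Metz and Obuchowski-Rockette approaches, implemented in tools like iMRMC or RJafroc) also model case variability and reader-by-case interaction, which is the multi-source variability described above. The numbers below are made up.

```python
import numpy as np

def paired_sign_flip_test(auc_with, auc_without, n_perm=10000, seed=0):
    """Toy significance test on per-reader figure-of-merit differences.

    Randomly flips the sign of each reader's (with - without) difference to
    build a null distribution for the mean difference. This reflects reader
    variability only; full MRMC methods also account for case variability.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(auc_with) - np.asarray(auc_without)
    observed = diffs.mean()
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    null = (signs * diffs).mean(axis=1)
    p_value = np.mean(np.abs(null) >= abs(observed))
    return observed, p_value

# Hypothetical per-reader AUCs from a five-reader study.
effect, p = paired_sign_flip_test([0.86, 0.84, 0.88, 0.83, 0.87],
                                  [0.74, 0.76, 0.72, 0.75, 0.73])
```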

Sam’an Herman-Griffiths: I guess the main part of my question is, is it case by case or, like, FDA defined, or does it depend on what your device does, you know, like the area of the body being scanned and all that?

Ethan Ulrich: I mean it's pretty device specific, I guess. I mean this evaluation can be used for lots of different devices and lots of different studies. I'm not sure if that's what you're asking.

Sam’an Herman-Griffiths: No, more so like the statistical significance. Like, say, for brain scans would they expect more significance, and for scans of, like, a femur would they expect less? Is that a thing, or?

Ethan Ulrich: At least for this number that I'm showing right here, we did have to have a pre-sub with the FDA and say, you know, we are expecting an improvement of 0.05, and there was a sample size estimation that helped us determine how many cases we needed to collect and how many readers we should recruit. But that expected difference of 0.05 was really just, you know, there's not necessarily a reason for choosing that number; it's often chosen. Really what that means is the difference between the two curves: in this situation, the difference was 0.13, so it's much higher than 0.05, and that's great. That's why the p value is so low. But I think that 0.05 is kind of like the standard chosen value for a useful difference, I guess. It's hard to know for sure. It is nice when you do your study and it's much higher than that number, then you don't have to worry about it. But yeah, it seems like 0.05 is kind of this standard bump in the curve that you're looking for. For more critical things, I guess I can think of a situation that would be more critical, you may want to have a larger boost in performance, and you have to establish that as part of your sample size estimation.

Pablo Hernandez: Ethan, one question. Here you showed, like, one kind of lesion, right, or one segmentation of the liver, just one segmentation. What is the sample size, like how many samples or how many images do you or the FDA recommend for that? The second question would be, if the device, instead of just doing one thing, is doing multiple things, I don't know, like a class A, a class B, a class C, is there an increase? I guess it would be an increase in the number of images that you need, but would it be linear? So for three types do you need three times the number of images, or would it be more, or would you need, whatever, like half of that because these are the…

Ethan Ulrich: The first question I will answer: no, the FDA probably will not give you a recommendation for how many you should use. They really rely on the sponsor to justify how many cases they're going to use. Yes, for the situation where you have multiple things that you're detecting, I would definitely reach out to a biostatistician to determine how many cases you might need to do that. It just adds another level of complexity.

J. David Giese: I've got a few things to add to that. First, I would say there's definitely a difference between having enough samples to prove statistical significance of some effect size; that's one question. And usually the number that you need for that is a lot lower than the number you actually need to convince the FDA that your model generalizes. The question as to whether the model will generalize is going to depend on the types of variability you might expect for that sort of device. So for example, if you have a device whose functionality might depend on body mass, then you'll definitely need to make sure you have enough samples across different body masses. Or if there are different types of lesions, like different disease states, you might need to cover that, or ethnicity is another one. That's really common in imaging studies; different MRI scanner manufacturers is another. And so there can be kind of a long list of sources of variability, and I would say in a lot of cases it seems like 150-250 cases is a pretty common kind of number you need, but it really can vary a lot. And I would say we're hoping to bring someone on the team next year from FDA who will have a lot of experience with this, because they've done kind of research from the FDA side, actually. They'll probably be able to give you a very good answer on that.

Key Considerations for Generalizability and Sample Size 🔗

Pablo Hernandez: That’s awesome. Yeah. Thank you.

Ethan Ulrich: And yeah, a lot of times we do get asked that question: how many cases do we need? How many readers do we need? And you know, it's hard to just pull a number out of your head and say this is exactly what you need. If you want to be very, very thorough and you have unlimited time and resources, you would do a pilot study that actually uses your device on a small set of cases, maybe a few readers, and then you would estimate your variance based on that small study, and from that you can predict how many you would need for a bigger study. It's very difficult to do something like that. And then, going back to what David said, that may not necessarily cover the generalizability question, like how does your device perform on these certain subgroups of data, like the imaging manufacturer, the different races of the patients, all that stuff? It really is just about finding a statistically significant difference. So yeah, it's hard.
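As a back-of-the-envelope illustration of that pilot-based approach, here is a normal-approximation sample-size calculation for a paired difference in a figure of merit, using a standard deviation of differences estimated from a pilot. Dedicated MRMC sizing methods (for example Hillis and Obuchowski's) split the variance into reader and case components and would be used in practice; the numbers below are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def paired_sample_size(expected_diff, sd_of_diff, alpha=0.05, power=0.8):
    """Approximate n for detecting a mean paired difference.

    expected_diff: the improvement you expect to show (e.g., 0.05 in AUC).
    sd_of_diff: standard deviation of the paired differences, e.g. estimated
    from a small pilot study.
    Uses the usual normal-approximation formula
        n = ((z_{1-alpha/2} + z_{power}) * sd / diff)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = ((z_alpha + z_beta) * sd_of_diff / expected_diff) ** 2
    return ceil(n)

# Hypothetical pilot: expected gain of 0.05 with an SD of differences of 0.12.
n_cases = paired_sample_size(0.05, 0.12)  # roughly 46 under these assumptions
```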

Pablo Hernandez: Yeah. Good good.

J. David Giese: One other thing I'd just add in is there's definitely a difference between computer-aided detection and computer-aided diagnosis, and the numbers you need for diagnosis are much higher. And I think the numbers you need for, like, a screening application would be even higher. On the flip side, the numbers you need for a triage application, CADt, are lower in general.

Ilya Spivakov: So I noticed that all these numbers include the device and the person. What about the device evaluated on its own? Let's say in some edge case it produces wildly incorrect results, but the reader knows better and ignores whatever the device tells them and still reads, you know, as well as they did without the device. So that's neutral, but the device might be completely wrong in some cases.

Ethan Ulrich: Yeah, that's a really good question. The clinical performance assessment is one side of the evaluation. For every submission that I've been a part of, there's also a standalone performance assessment, which is the performance of your device minus the human element. It's actually easier to perform: you just take the output of your device, compare it to your ground truth, like a Dice score for the lesion segmentations, and report that. That would detect, you know, is my device way off and my readers are just really good. So yeah, it establishes what the baseline performance of your device is, how good it is at detecting these lesions, and it removes that human element. So that is also included in a submission. I did have a small section related to it in that protocol that I created, but it does need a little bit of work to finish that out. Yeah, that's a good question. The clinical performance assessment is really just one side of the coin; there should also be an evaluation of the device itself.
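The Dice score mentioned here for the standalone assessment is straightforward to compute from a predicted mask and the ground-truth mask; a minimal sketch (the mask-handling conventions are an assumption):

```python
import numpy as np

def dice_score(pred, truth):
    """Dice similarity coefficient between two binary masks.

    Dice = 2 * |A ∩ B| / (|A| + |B|); 1.0 is perfect overlap, 0.0 is none.
    """
    pred = np.asarray(pred).astype(bool)
    truth = np.asarray(truth).astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement (a convention)
    return 2.0 * np.logical_and(pred, truth).sum() / denom
```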

Pablo Hernandez: And what I've had in my mind is, instead of having three reviewers, three kind of expensive radiologists, reading every case, you have, like, a pool of radiologists spreading the work across them, you know, one radiologist per case or so. That kind of reduces some kind of bias, because you're putting more people there, but you're paying less for radiology work. So is three a magic number for the FDA, or?

Ethan Ulrich: Three is a good minimum, usually. I mean, it can be used for tiebreakers or, you know, to establish a majority vote. It does depend on what the task is. For something like segmenting the liver, there may not be a lot of variability in outlining the liver. I mean, there's still some, but if you had three people do it versus 30, you probably would end up with a very similar ground truth. When it comes to lesions, that may not be the case. Maybe three is definitely not enough to account for all that variability; maybe you do need 30 to come up with your ground truth. It's hard to know. There are certain methods that I hope become accepted in the future, where they have shown that a large group of essentially novices is as good as an expert if they're just given a little bit of training. You'd have, like, 20 people doing a segmentation, and if it's combined somehow, then that is sometimes as good as radiologists with ten years of experience. I think there needs to be a little bit more evidence of that before it's accepted. But that would perhaps be another way that you could do your ground truthing: have a large pool of novices doing your segmentations and then combine them into, you know, a single ground truth.

Joshua Tzucker: I have a question about how much you're allowed to rework things as you go, once you get things going. It just seems like there are so many variables. I mean, even just, oh, was the traffic bad on the way in to perform the reads, like that could affect people's mindset and stuff. Let's say you have a site and, I don't know, there's some bad stuff going on in the world and people are just negative and it affects the read results. Can you just throw out an entire site? Can you start over? What are you allowed to do once things get moving?

Ethan Ulrich: I'm going to defer to maybe Mary. Are you muted?

Mary Vater: I was muted and I'm still trying to figure things out. But anyway, yeah. I think it kind of depends on the situation, of course. And I would find it a little bit hard, because in real life people are also going to be encountering those things in real use; there are going to be other things impacting their performance as well. I guess if it's specifically relating to their performance when they're using the device, if there's an event you can justify, maybe. But I think in general the FDA doesn't like you to trim your results based off of performance, so I think you'd have to have a really, really good argument for that. And even then, I don't know if I've ever heard of the FDA allowing you to just cut out a particular site or data set for those reasons. But yeah, I mean, it's an interesting thought, because I could see that being a factor, but I'd be nervous if that were the case.

Joshua Tzucker: That makes sense.

J. David Giese: This isn't exactly related to what you're asking, but it is something I've seen a few times, and I don't think we've ever done this, or at least not that I'm aware of: there will be a pilot run of a number of practice cases, and you'll actually have the reviewers do it, and then you'll have them cross-check each other and kind of debate it out, like, oh, here's why I did it this way and I think it's better, and then you'll update the protocol for the annotations based on that. And then once it's agreed on a small number of cases, then you'll actually run the study; the purpose of that is to get agreement on the protocol. I've read about people doing that a number of times in other 510(k)s and studies, but I don't know if we've ever done it. The second thing I would say is, if there's a sort of site-specific variability or a situation where you think the results would be unfavorable, you would definitely hope you documented that in your protocol, and usually in the pre-sub you would explain, like, here's the criteria we would use to say this site is inappropriate. I think there was an IBD project, where we weren't running the study or doing the strategy on it at all, but I do recall there being something like that, where there was a site that they kind of disqualified. But I think it was, like, pre-written or something; I don't remember exactly. I just remember the client kind of complaining about the site and talking about it. So I might be misremembering, because it's been a while.

Mary Vater: Yeah, if you can show that there was a protocol deviation or that there was something about the study that wasn't followed properly, then that's a legitimate reason why you could eliminate that. But with the FDA, and I can't remember exactly where it is, there is some guidance or rule around withholding information that's pertinent to the review. And if you encountered a situation like that, I think it would probably be pertinent to the review to discuss it and make sure that it was an isolated case. But yeah, it definitely would be case by case, and concerning if that were to happen. Yeah.

Ethan Ulrich: Well, thanks, everybody. We are one minute over time. But yeah, I hope this was beneficial, and I hope everyone was able to get a little bit better understanding of these types of studies, why they're so important, and, you know, what a big deal they are to do. Yeah. And I hope you have a good rest of your day.

Mary Vater: Great. Thanks.

Pablo Hernandez: Thank you everyone.

Reece Stevens: Thanks, Ethan. This is awesome. Yeah.

Ethan Ulrich: Bye everyone.
