by Yujan Shrestha on September 28, 2022
This four-part series is for you if:
In this article series, we outline the process of developing an ML/AI algorithm from scratch and getting it FDA cleared. This article covers the first phase in a four-phase process:
The typical AI/ML Software as a Medical Device (SaMD) project takes between 6 and 24 months and costs between $125k and $500k to complete. The main factors affecting the time and cost are the complexity of the algorithm, how much you have already developed, and the quality of your existing documentation. Although the FDA’s target review goal for traditional 510(k)s is 90 days, expect the process to take 180 days once you factor in responding to FDA’s questions.
If you follow the process in this article, you can have an FDA cleared product on the market in as little as nine months. Contact us if you need help with any of the above.
Note that these estimates exclude costs related to data acquisition, data annotation, or any necessary clinical performance study.
If a man knows not to which port he sails, no wind is favorable. (Seneca)
A common mistake is to jump straight to implementation without first developing a regulatory strategy. Taking a step back and reaching out to a regulatory consultant can save months of effort and significant money. For example, it may not be obvious, but the easiest route to market could be to submit the initial 510(k) for a version with a simpler, lower-risk feature set and then submit a Special 510(k) for the more complex, higher-risk features. The benefit of this approach is that it is often easier to add advanced features to an already-cleared device than to obtain an initial 510(k) clearance for a device that has them. I often see clients fall into the trap of approaching regulatory strategy as they would approach writing an SBIR grant. Although good SBIR grant applications are innovative, exciting, and plausible, good FDA submissions are conservative, straightforward, and methodical. The two are often at odds with each other.
I would encourage you to search for similar 510(k) submissions using our tool or the FDA database. Previous 510(k) submissions will give you an idea of what is possible with the current state of the art and reveal key insights into clinical performance targets and study design. While a publicly available 510(k) summary will not go into proprietary details, you can read between the lines to infer why the competitor made the decisions they did. This could save you from repeating their design mistakes and give you a hint about how many samples you should expect to annotate.
The deliverables of this activity should be the following:
An AI/ML algorithm needs a significant amount of expensive clinical data, but you do not need all of it to test the feasibility of the algorithm. Publicly available datasets such as The Cancer Imaging Archive can be a good place to find unlabeled or preliminarily annotated data to augment your dataset. Public datasets are useful for proving the concept but may not be viable for the 510(k) submission, because the data provenance may not be specified or the licensing scheme may be unfavorable for a commercial product. I would also recommend staying away from competition data, because the license usually prohibits commercial use. Additionally, commercial data brokers exist that can procure data for a fee.
How much data do you need to start out? There are several considerations, and there are some tricks to make sure you are using your training data as efficiently as possible. In our experience, a segmentation problem needs less annotated data than a classification problem to achieve a comparable level of performance, so it may be advantageous to convert a classification problem into one with a segmentation intermediate. Suppose you are training an algorithm to detect contrast in a CT scan. One approach is to train a binary classifier on the entire CT volume. Another is to segment the aorta and kidneys and measure the average HU value within the segmentation; if the average HU value is greater than a threshold, there is likely contrast in the image. I estimate that the former approach will take 20x more images than the latter to achieve the same level of performance. Granted, each segmentation takes longer to annotate than a yes/no checkbox per image, but you will have to annotate fewer images, and the result is a more transparent algorithm (FDA does not like a black box), so I think the tradeoff is worth it.
In my experience, 200 annotated images is usually enough to get the process going and you start getting diminishing returns at 2000 images.
Don’t get hung up on sample size calculations and estimates just yet! It is more important to get the project off the ground than to get everything perfect. Perfect is the enemy of the good.
Once you have your preliminary data, now is the time to start developing a proof of concept. This has several advantages:
The output of this step is the preliminary algorithm development plan, which can form the content of the first, optional FDA pre-submission meeting.
The meeting must be scheduled and typically occurs 60 to 75 days after the request is submitted. The meeting can be in person (rare) or virtual (typical). You may also elect to have FDA provide a detailed written response to your questions in lieu of a meeting.
At this phase you should have the following:
Once you have procured a dataset about 10 times the size of the preliminary training dataset, you must annotate it. There are several ways to achieve this. One option is to continue annotating manually. An alternative is to run the preliminary algorithm on the unlabeled data and manually QA the results, then retrain the algorithm with the larger training set. Repeat the cycle a few times until all the data is labeled.
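The annotate–QA–retrain cycle above can be sketched with a deliberately trivial stand-in model. Here a one-dimensional threshold classifier plays the role of the real algorithm, and the manual QA step is simulated by accepting only predictions far from the decision boundary; all data values and the 0.1 confidence margin are illustrative assumptions.

```python
# Toy sketch of the model-assisted annotation loop. A threshold midway
# between the class means stands in for the real model; "QA" keeps only
# confident predictions, mimicking a human reviewer accepting results.

labeled = [(0.1, 0), (0.2, 0), (0.9, 1), (1.0, 1)]   # hand-annotated seed set
unlabeled = [0.15, 0.3, 0.45, 0.75, 0.85, 0.95]


def train(data):
    """Fit a decision threshold midway between the two class means."""
    mean0 = sum(x for x, y in data if y == 0) / sum(1 for _, y in data if y == 0)
    mean1 = sum(x for x, y in data if y == 1) / sum(1 for _, y in data if y == 1)
    return (mean0 + mean1) / 2


while unlabeled:
    threshold = train(labeled)
    # Accept only predictions far from the boundary (the simulated QA pass).
    confident = [x for x in unlabeled if abs(x - threshold) > 0.1]
    if not confident:
        # The hard, ambiguous cases still need true manual annotation.
        labeled += [(x, int(x > threshold)) for x in unlabeled]
        break
    labeled += [(x, int(x > threshold)) for x in confident]
    unlabeled = [x for x in unlabeled if x not in confident]

print(len(labeled))  # all 10 samples end up labeled
```

Note that the loop degrades gracefully: samples the model is unsure about fall back to manual annotation, which is exactly how the human-in-the-loop cycle should behave.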
Once all the training data has been annotated, it is a good idea to QA it for annotation errors. I recommend having a clinician review the annotations. Additionally, it does not hurt to run inference on the training dataset and flag any annotations that disagree significantly with the model's output. It may seem poor practice to run inference on training data, and indeed it is if the results are used to make performance claims. Here, however, we are not making any performance claims; we are just detecting outliers in the training data that could indicate manual annotation errors.
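For a segmentation task, the outlier check above might compare each annotation against the model's prediction with a Dice coefficient and flag low-overlap cases for clinician review. A minimal sketch follows; the 0.5 Dice cutoff and the hard-coded masks are illustrative assumptions, and in practice the predictions would come from the trained model.

```python
import numpy as np

# Assumed review cutoff; tune to flag a manageable number of cases.
DICE_REVIEW_CUTOFF = 0.5


def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice coefficient between two boolean masks."""
    intersection = np.logical_and(a, b).sum()
    return 2.0 * intersection / (a.sum() + b.sum())


def flag_suspect_annotations(annotations, predictions):
    """Return indices of training cases whose annotation disagrees with inference."""
    return [i for i, (ann, pred) in enumerate(zip(annotations, predictions))
            if dice(ann, pred) < DICE_REVIEW_CUTOFF]


# Synthetic masks: case 1's annotation barely overlaps the model's prediction.
good = np.zeros((8, 8), dtype=bool); good[2:6, 2:6] = True
bad = np.zeros((8, 8), dtype=bool);  bad[0:2, 0:2] = True
print(flag_suspect_annotations([good, bad], [good, good]))  # [1]
```

A flagged case is not necessarily an annotation error; it is simply a prioritized queue for the clinician's second look.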