AI/ML from Idea to FDA

by Yujan Shrestha on September 28, 2022

Is this article for me?

This article outlines the process to develop an ML/AI algorithm from scratch and get it FDA cleared. It covers the four phases of the process: Explore, Develop, Validate, and Document. It also discusses the costs, time, and data requirements involved in the process. Additionally, it provides advice on regulatory strategy, data annotation, and algorithm prototyping.

This article is for you if:

In this article series, we outline the process to develop an ML/AI algorithm from scratch and get it FDA cleared. This article covers the first phase in a four-phase process:

  1. Explore. Figure out what is possible and build a prototype. Meet with the FDA for the first time.
  2. Develop. Set performance targets, implement, and iterate. Meet with the FDA again.
  3. Validate. Wrap up the design and test your product in a clinical setting.
  4. Document. Prepare and submit your 510(k).

The typical AI/ML Software as a Medical Device (SaMD) project takes between 6 and 24 months and costs between $125k and $500k to complete. The main factors affecting the time and cost are the complexity of the algorithm, how much you have already developed, and the quality of your existing documentation. Although the FDA’s target review goal for traditional 510(k)s is 90 days, expect the process to take 180 days once you factor in the time spent responding to the FDA’s questions.

Note that these estimates exclude costs related to data acquisition, data annotation, or any necessary clinical performance study.

If you follow the process in this article, you can have an FDA cleared product on the market in as little as nine months—even sooner with a little guidance from us.

Overall Process

A flowchart of the end to end ML development process. Activities (rounded rectangles) generate outputs (documents and datasets) and lead to other activities.

Step 1: Preliminary Regulatory Planning

If a man knows not which port he sails, no wind is favorable. (Seneca)

A common mistake is to rush into implementation without first developing a regulatory strategy. Consulting a regulatory expert can save time and money. For example, it may not be obvious, but submitting an initial 510(k) with a simpler set of features, followed by a 'Special' 510(k) that adds the more complex features, can be an easier path to market than obtaining an initial 510(k) clearance for a device with advanced features. You may be tempted to approach regulatory strategy the same way you would write an SBIR grant. However, good FDA submissions require a more conservative, straightforward, and methodical approach, which can be at odds with writing an innovative, exciting, and plausible SBIR grant.

I would encourage you to search for similar 510(k) submissions using our tool or the FDA database. Previous 510(k) submissions will give you an idea about what is possible with the current state of the art and reveal key insights on the clinical performance targets and study design. While the information that is publicly available on the 510(k) summary will not go into proprietary details, you can read between the lines to infer why the competitor made the decisions they did. This could save you from making the same design mistakes and give you a hint about the number of samples you should expect to annotate.

The deliverables of this activity should be the following:

Step 2: Preliminary Annotating

An AI/ML algorithm needs a significant amount of expensive clinical data, but you do not need all of it to test the feasibility of the algorithm. Publicly available datasets such as The Cancer Imaging Archive could be a good place to find unlabeled or preliminarily annotated data to augment the dataset. Public datasets are useful to prove the concept but may not be viable for the 510(k) submission because the data provenance may not be specified or the licensing scheme may be unfavorable for a commercial product. I would also recommend staying away from competition data because the license usually prohibits commercial usage. Alternatively, commercial data brokers can procure the data for a fee.

How much data do you need to start out? There are several considerations, and a few tricks can help you use training data efficiently. In my experience, a segmentation problem needs less annotated data than a classification problem to achieve comparable performance. It may be beneficial to convert a classification problem into one that has a segmentation intermediate. For example, if you are training an algorithm to detect contrast in a CT scan, one approach is to train a binary classifier on the entire CT volume. Alternatively, you can segment the aorta and kidneys and measure the average HU value. If the average HU value is higher than a threshold, there is likely contrast in the image. I estimate that the former approach will take 20x more images than the latter to achieve the same performance. Although it takes longer to annotate each image with a segmentation instead of just a yes/no checkbox per image, you will need to annotate fewer images. The segmentation-based approach also results in a more transparent algorithm, which the FDA prefers. Therefore, I think the tradeoff is worth it.
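As a minimal sketch of the segmentation-based contrast check described above, the snippet below averages the HU values inside a segmentation mask and compares the result to a threshold. The function names and the threshold of 100 HU are illustrative assumptions, not values from this article; a real threshold would need to be tuned and validated on annotated data. Plain Python lists keep the sketch dependency-free, whereas real voxel data would normally live in an array library.

```python
from statistics import mean

# Hypothetical threshold in Hounsfield units (an assumption for illustration);
# a real value would need to be tuned and validated on annotated data.
CONTRAST_HU_THRESHOLD = 100.0


def mean_hu_in_mask(hu_values, mask):
    """Average the HU values of the voxels where the segmentation mask is set."""
    selected = [hu for hu, inside in zip(hu_values, mask) if inside]
    if not selected:
        raise ValueError("segmentation mask is empty")
    return mean(selected)


def has_contrast(hu_values, mask, threshold=CONTRAST_HU_THRESHOLD):
    """Return True when the mean HU inside the mask exceeds the threshold."""
    return mean_hu_in_mask(hu_values, mask) > threshold


# Toy flattened "volume": six voxels, three of them inside the aorta mask.
hu_values = [30.0, 250.0, 310.0, 40.0, 280.0, -5.0]
aorta_mask = [False, True, True, False, True, False]
print(has_contrast(hu_values, aorta_mask))
```

Because the decision reduces to one measured number and one threshold, a reviewer can inspect both the segmentation and the measurement when the output looks wrong.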

Don't get too caught up in sample size calculations and estimates just yet! It's more important to get the project started than to get everything perfect right away. Perfection can be the enemy of progress.

Step 3: Algorithm Prototyping

Once you have your preliminary data, now is the time to start developing a proof of concept. This has several advantages:

The output of this step is the preliminary algorithm development plan, which can form the content of the first optional FDA presubmission meeting.

Step 4: Data Multiplication and Cleaning

A process diagram visualizing two methods of increasing the size of the development dataset.

At this phase you should have the following:

  1. An expectation that the final version of the algorithm, with more training data, will produce clinically useful results. If the results are unsatisfactory, consider changing the algorithm or adjusting the intended use. For example, if the Dice scores are poor but the centroid distance metrics are acceptable, switch the intended use to display only the center of the tumor instead of the outline. It is better to show less information than noisy information. Reevaluate whether the chosen predicate device is still viable.
  2. A prototype you can show investors or other stakeholders in case you need additional funding to procure and annotate additional training data.
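To make the Dice versus centroid distance tradeoff concrete, here is a small sketch that computes both metrics on a toy 2D example. Masks are represented as sets of pixel coordinates, which is an assumption made for brevity; the function names are illustrative.

```python
import math


def dice_score(pred, truth):
    """Dice coefficient between two sets of pixel (or voxel) coordinates."""
    if not pred and not truth:
        return 1.0
    return 2 * len(pred & truth) / (len(pred) + len(truth))


def centroid(coords):
    """Mean coordinate of a non-empty set of coordinates."""
    n = len(coords)
    return tuple(sum(axis) / n for axis in zip(*coords))


def centroid_distance(pred, truth):
    """Euclidean distance between the two centroids, in pixel units."""
    return math.dist(centroid(pred), centroid(truth))


# The prediction is a small blob sitting in the middle of a larger true region.
truth = {(r, c) for r in range(4) for c in range(4)}        # 16 pixels
pred = {(r, c) for r in range(1, 3) for c in range(1, 3)}   # 4 pixels at the center

print(dice_score(pred, truth))         # 0.4: the outline is poor
print(centroid_distance(pred, truth))  # 0.0: the center is correct
```

In this toy case the outline would be misleading to display, but a marker at the centroid would be accurate.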

Once you have procured about 10 times as much data as the preliminary training dataset, you must annotate it. There are several ways you can achieve this. One option is to continue to annotate manually. An alternative is to run the preliminary algorithm on the unlabelled dataset and manually QA the results. Then retrain the algorithm with the larger training set. Repeat the cycle a couple of times until all the data is labeled.
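The retrain, pre-label, and QA cycle can be sketched as a simple loop. The `train`, `predict`, and `review` callables below are hypothetical placeholders for your own training code, inference code, and human QA step; the toy versions exist only so the sketch runs end to end.

```python
def assisted_labeling(labeled, unlabeled, train, predict, review, batch_size=50):
    """Grow the labeled set in rounds: retrain, pre-label a batch, QA it manually.

    `train`, `predict`, and `review` are hypothetical callables standing in for
    model training, inference, and the human QA step respectively.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    labeled = dict(labeled)
    pending = list(unlabeled)
    rounds = 0
    while pending:
        model = train(labeled)  # retrain on everything labeled so far
        batch, pending = pending[:batch_size], pending[batch_size:]
        for case_id in batch:
            draft = predict(model, case_id)
            labeled[case_id] = review(case_id, draft)  # human accepts or corrects
        rounds += 1
    return labeled, rounds


# Toy stand-ins so the sketch is runnable.
def toy_train(labeled):
    return {"trained_on": len(labeled)}


def toy_predict(model, case_id):
    return f"draft-for-{case_id}"


def toy_review(case_id, draft):
    return draft.replace("draft", "approved")


labels, rounds = assisted_labeling(
    {"case-001": "approved-for-case-001"},
    ["case-002", "case-003", "case-004"],
    toy_train, toy_predict, toy_review, batch_size=2,
)
print(rounds, len(labels))
```

Smaller batches mean more retraining rounds but better drafts for the reviewer in later rounds; the right batch size depends on how expensive training is relative to annotation.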

Once all the training data has been annotated, it is a good idea to QA the data to detect errors. I recommend having a clinician review the annotations. Additionally, running inference on the training dataset and flagging any annotations that show significant errors is beneficial. It may seem like poor practice to run inference on training data, and it is if we are using the results to make performance claims. However, this run is only to detect outliers in the training data that could indicate manual annotation errors—we are not making any performance claims.
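One way to sketch this outlier check: compute an agreement score (for example, Dice) between each manual annotation and the model's output on the same case, then send the lowest-scoring cases back for review. The 0.5 threshold and the case IDs below are illustrative assumptions.

```python
def flag_suspect_annotations(agreement_by_case, threshold=0.5):
    """Return case IDs whose model-vs-annotation agreement is below threshold.

    The result is ordered worst first so reviewers see the likeliest
    annotation errors at the top of the list.
    """
    suspects = [
        (score, case_id)
        for case_id, score in agreement_by_case.items()
        if score < threshold
    ]
    return [case_id for score, case_id in sorted(suspects)]


# Hypothetical per-case Dice between the manual annotation and the model output.
agreement = {"case-001": 0.91, "case-002": 0.12, "case-003": 0.47}
print(flag_suspect_annotations(agreement))  # ['case-002', 'case-003']
```

The flagged list is a review queue, not a performance result; a low score may reflect a model failure rather than an annotation error, so a person still makes the final call.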

Step 5: Algorithm Development

At this step you should have the following:

In this step, explore frameworks and neural network architectures, tweak parameters, and get the most performance out of your algorithm that your budget allows. There is no upper limit for the budget, but there is a point of diminishing returns. Keep the end goal in mind: create something clinically useful and marketable. Otherwise, it's easy to continue iterating for academic reasons, a trap I've fallen into. We discuss these potential traps in a prior article. This is the most enjoyable part of the ML development lifecycle, but it also has the most ambiguous endpoint. Here are some pointers:

  1. Stop once you have reached your clinical performance target. Anything beyond this may not be the best use of resources. This is an appropriate time to define the ‘minimally viable product’ (MVP). Defining the MVP includes drafting the marketing claims that are seen as ‘must haves’ in order to sell the product and compete for market share.
  2. Is there a problematic population of cases, such as very large lung nodules, that would let you meet your clinical performance target if it were removed? If so, would the clinical utility of the device significantly diminish if this problematic population is excluded from the device's intended use? Consider how much it may cost to increase performance for this population to evaluate whether the cost-benefit ratio justifies continued iteration. Unfortunately, this is a research problem and it is often impossible to accurately estimate the unknown. However, a literature search and deeper predicate device search can give you some clues.
  3. Can your clinical performance targets be adjusted to meet the threshold of clinical utility and marketability? For example, you may be detecting the presence of lung nodules but not getting the outline correct. In this case, your Dice score will be low, but your detection sensitivity may be quite good. This would mean you lose the ability to draw an outline in the UI, but you could draw an arrow instead. This could still be clinically useful enough to take the compromise and move on. It is important for engineers to get a rough idea about clinical utility, so they can consider these tradeoffs without needing to involve a clinical expert.
  4. Rapid iteration with clinical stakeholders is essential. To facilitate this, use tools that both clinicians and engineers can use, such as Excel spreadsheets or Notion databases. Every round of feedback presents an opportunity for knowledge diffusion between engineering and clinical teams, increasing the likelihood of "building the right thing" rather than simply "building it right".
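The lung nodule example in pointer 3 relies on measuring detection separately from outline quality. Below is a minimal sketch of one possible nodule-level sensitivity metric, under the assumption that a nodule counts as detected when at least one predicted point (such as the tip of an arrow in the UI) falls inside its ground-truth region. This hit criterion is an illustrative choice, not one prescribed by the article or the FDA.

```python
def detection_sensitivity(truth_nodules, predicted_points):
    """Fraction of ground-truth nodules that contain at least one predicted point.

    Each nodule is a set of pixel coordinates; each prediction is a single
    coordinate, such as the tip of an arrow drawn in the UI.
    """
    if not truth_nodules:
        raise ValueError("no ground-truth nodules to evaluate")
    detected = sum(
        1
        for nodule in truth_nodules
        if any(point in nodule for point in predicted_points)
    )
    return detected / len(truth_nodules)


# Toy example: three true nodules, three predicted points, one of them a miss.
nodules = [{(0, 0), (0, 1)}, {(5, 5)}, {(9, 9), (9, 8)}]
points = [(0, 1), (9, 9), (3, 3)]
print(detection_sensitivity(nodules, points))
```

Note that this metric ignores false positives, so it would need to be paired with a false positive rate before it could support a performance target.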

Also note, the FDA will likely get back to you with a presubmission meeting date during this step.

Step 6: FDA Presubmission Meeting I

FDA sends responses to your questions before the meeting, so you already know “what” they think going in. It is important to carefully read and understand all of the information before the meeting. In my opinion, the best use of the one-hour meeting is often to ask FDA ‘why’. For example, if FDA disagrees with your choice of a predicate device, it is extremely helpful to understand why. Perhaps they feel that the technology is so dissimilar that it would not be possible to effectively compare the devices. It may be possible to justify your choice and gain their agreement, or they may provide examples of suitable devices.

Asking “why” also teaches you how to think like the FDA when they are not immediately available for inquiry. Make sure your team is taking notes because you are not allowed to record the call and are required to submit meeting minutes to the FDA for approval.

Step 7: Final Regulatory Planning

At this point of the process, you should be able to:


Step 8: Ground Truth Annotation

Step 9: Standalone Performance Testing

Step 10: Presubmission Meeting II

By now you should have the results of standalone performance testing. You should let the FDA know if you had to change the standalone performance testing targets. The main goal of the second presubmission is to make sure the FDA:

  1. Agrees on the validity of your clinical performance study design. It is helpful to know that FDA has recently released a final guidance related to their expectations for clinical performance assessments for 510(k) submissions.
  2. Agrees the study proves your device performs to its intended use, meets the proposed indications for use, and supports the key marketing claims.

The clinical performance study can be very expensive. You will likely need to pay many highly trained individuals to sit in front of a computer for hours on end and get them all together in a room to debate tough cases. Pizza helps but it is usually insufficient on its own.

Note that clinical trials are different from clinical performance studies. A clinical performance study has a lower burden of proof than a trial. Whereas a trial needs to prove the benefits of the device outweigh the risks, a performance study just needs to prove the medical device is performing as intended in a real-world clinical setting.

FDA may have some concerns such as:

  1. Concerns about biases in ground truth data collection. For example, your ground truth data collection strategy may be missing any attempt to include more than one institution, CT machine manufacturer, patient population, or other confounders.
  2. Concerns about the performance study protocol. For example, you may have not included a washout period to account for recall bias.
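A quick way to spot the first kind of gap before the FDA does is to count how many distinct values each potential confounder takes in your ground truth dataset. The sketch below assumes each case is a dictionary of metadata; the field names (`institution`, `manufacturer`) and the minimum of two groups are hypothetical choices for illustration, not FDA requirements.

```python
from collections import Counter


def distinct_groups(cases, field):
    """Count how many cases fall under each value of a metadata field."""
    return Counter(case[field] for case in cases)


def under_covered_fields(cases, fields, min_groups=2):
    """Return the fields represented by fewer than `min_groups` distinct values."""
    return [f for f in fields if len(distinct_groups(cases, f)) < min_groups]


# Toy dataset: every case comes from the same site.
cases = [
    {"institution": "site-A", "manufacturer": "vendor-1"},
    {"institution": "site-A", "manufacturer": "vendor-2"},
    {"institution": "site-A", "manufacturer": "vendor-1"},
]
print(under_covered_fields(cases, ["institution", "manufacturer"]))  # ['institution']
```

A check like this only shows that a group is present; whether each group has enough cases to support a subgroup analysis is a separate statistical question.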

Again, ask “why” so you can think like the FDA.

Step 11: Software Packaging

The software release package should include a list of all the software components and their versions, as well as a list of all the external libraries and their versions. This is important for maintaining compliance with the FDA's off-the-shelf software requirements. The deployment package should include detailed instructions for setting up the software and deploying it to production. The validation package should include a list of tests that should be run on the software, along with their expected results. The documentation package should include documentation for all the components of the software and their interactions.
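For a Python-based product, the list of external libraries and versions can be generated rather than maintained by hand. The sketch below uses the standard library's `importlib.metadata` to record the interpreter version and every installed distribution. The manifest layout is an assumption; it covers only Python packages, so system libraries, the base container image, and model weights would need to be recorded separately.

```python
import json
import platform
from importlib import metadata


def build_manifest():
    """List the Python runtime and every installed distribution with its version."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with missing or broken metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "packages": dict(sorted(packages.items(), key=lambda kv: kv[0].lower())),
    }


manifest = build_manifest()
print(json.dumps(manifest, indent=2))
```

Running this inside the release container as part of a nightly build produces a manifest that matches what actually ships, rather than what a requirements file says should ship.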

It is best practice to package the software for release on a continuous basis, ideally nightly. Docker is the most common choice for this. Clinical performance studies should be conducted on a version that is as close to complete as possible. If software changes are made after the study is complete, it is necessary to justify why the changes do not affect the result. We strongly recommend that no software changes be made to the AI/ML components after the study is finished.

Step 12: Clinical Performance Study

Perform your clinical performance study. Make sure that the study design is in compliance with the FDA's final guidance related to their expectations for clinical performance assessments for 510(k) submissions. The study should prove that the device is performing to its intended use, meeting the proposed indications for use, and supporting the key marketing claims. It is important to consider any potential biases in the ground truth data collection, as well as any risks for recall bias when creating the study protocol. Additionally, software changes should not be made to the AI/ML components after the study is complete, and the software should be packaged for release on a continuous basis, ideally nightly.

Step 13: Documentation

We left out some key documentation exercises done earlier in the process to keep the deluge of information at a manageable level. Risk analysis and requirements capture should be done during the development process, but it is not uncommon for them to be done retrospectively, which can be a daunting task if you have not done it before. We can write your FDA documentation for you if you do not have the time to do so yourself.

We will cover the documentation aspects for creating 510(k)s for software devices in a future article “Content of a 510(k) for software devices”. In the meantime, check out FDA’s guidance document on the topic.

Step 14: FDA Submission

The submission step is the culmination of your development, testing and documentation efforts. We recommend that you have someone help with the creation of the submission as a successful submission is much more than a compilation of design and testing documents. A 510(k) is a ‘story’ which is presented to FDA for review. The story must be consistent throughout and must ‘flow’ in a manner which clearly shows that your device is as safe and effective as the predicate device.

Step 15: QMS Preparation and Marketing

Step 16: FDA Clearance and Beyond

Congratulations, you made it! Once your device is successfully on the market in the United States, you will need to follow the regulations related to complaint handling, adverse event reporting, and tracking where your devices are being used. You will also need to maintain your QMS as long as your device is on the market in the United States. FDA will inspect your business every few years to make sure that all regulations are being followed.

You must assess whether any modifications to the cleared device will require a new 510(k). Generally, you can retrain your AI on new data without needing to resubmit a 510(k). However, you must document the changes as if you were submitting one.

