Innolitics introduction 🔗
This FDA guidance applies to AI/ML-enabled radiology software that assists in the detection of disease. Many of the principles and approaches, however, apply beyond this limited scope. Furthermore, we’ve seen FDA reference this guidance for non-CADe devices.
This guidance standardizes some of FDA’s expectations around clinical performance evaluation of radiology CADe. Even so, as of early 2024, we still are seeing significant variability in FDA’s expectations around clinical studies for these devices. For this reason, and the high costs of these studies, we recommend that most of our clients complete a pre-sub to verify their study design with FDA before executing it.
Innolitics has deep experience with AI/ML-enabled radiology devices, including clinical performance assessment of radiology CADe. We’ve assisted on both the engineering and regulatory side of many AI/ML devices. See our AI Metrics case study for an example of an end-to-end AI/ML device we developed, and our portfolio for various other case studies.
If you’re looking for help developing a regulatory strategy, designing a clinical study, or even completing end-to-end development of a radiology AI/ML-enable device, we’d love to talk.
Document issued on September 28, 2022.
Document originally issued on July 3, 2012.
For questions regarding this document, please contact RadHealth@fda.hhs.gov.
Contains non-binding guidance.
I. Introduction 🔗
This guidance document provides recommendations to industry, systems and service providers, consultants, FDA staff, and others regarding clinical performance assessment of computer-assisted detection (CADe1) devices applied to radiology images and radiology device data (often referred to as “radiological data” in this document). CADe devices are computerized systems that incorporate pattern recognition and data analysis capabilities (i.e., combine values, measurements, or features extracted from the patient radiological data) intended to identify, mark, highlight, or in any other manner direct attention to portions of an image, or aspects of radiology device data, that may reveal abnormalities during interpretation of patient radiology images or patient radiology device data by the intended user (i.e., a physician or other health care professional), referred to as the “clinician” in this document. We have considered the recommendations on documentation and performance testing for CADe devices made during the public meetings of the Radiology Devices Panel on March 4-5, 2008,2 and November 17-18, 2009.3 We have also considered the public comments received on the draft guidance announced in the Federal Register on October 21, 2009 (74 FR 54053).4
In addition, FDA has issued a final order reclassifying medical image analyzers applied to mammography breast cancer, ultrasound breast lesions, radiograph lung nodules, and radiograph dental caries into Class II (special controls) and subject to premarket notification (510(k)) requirements.5 This guidance provides recommendations that may be useful for compliance with the special controls codified in 21 CFR 892.2070(b)(1) and noted in italic font for clarity in the guidance.
This guidance also is being revised with minor updates to reflect the issuance of the final rule, “Medical Devices; Medical Device Classification Regulations To Conform to Medical Software Provisions in the 21st Century Cures Act” (86 FR 20278).6
In general, FDA’s guidance documents do not establish legally enforceable responsibilities. Instead, guidances describe the Agency’s current thinking on a topic and should be viewed only as recommendations, unless specific regulatory or statutory requirements are cited. The use of the word should in Agency guidances means that something is suggested or recommended, but not required.
II. Scope 🔗
This document provides guidance regarding clinical performance assessment studies for CADe devices applied to radiology images and radiology device data and CADe devices applied to mammography breast cancer, ultrasound breast lesions, radiograph lung nodules, and radiograph dental caries. Radiological data include, for example, images that are produced during patient examination with ultrasound, radiography, magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), and digitized film images. As stated above, CADe devices are computerized systems intended to identify, mark, highlight, or in any other manner direct attention to portions of an image, or aspects of radiology device data, that may reveal abnormalities during interpretation of patient radiology images or patient radiology device data by the clinician. This guidance applies to CADe devices, including when a CADe device is part of a combined system, such as the detection portion of combined computer-aided detection and diagnostic devices.
This guidance document applies to the CADe devices classified under 21 CFR 892.2050 “medical image management and processing system” and to devices classified under 21 CFR 892.2070 “Medical image analyzer” including the following product codes:
- NWE (Colon computed tomography system, computer aided detection);
- OEB (Lung computed tomography system, computer-aided detection);
- OMJ (Chest X-ray computer aided detection); and
- MYN (Medical image analyzer).
The Agency intends to create new product codes to identify and track new types of CADe products as necessary.7
By design, a CADe device can be a unique detection scheme specific to only one type of potential abnormality, or a combination or bundle of multiple parallel detection schemes, each one specifically designed to detect one type of potential abnormality revealed in the patient radiological data. Examples of CADe devices that fall within the scope of this guidance include:
- a CADe algorithm designed to identify and prompt microcalcification clusters and masses on digital mammograms;
- a CADe device designed to identify and prompt colonic polyps on CT colonography studies;
- a CADe designed to identify and prompt filling defects on thoracic CT examination; and
- a CADe designed to identify and prompt brain lesions on head MRI studies.
We may consider a CADe algorithm to be modified, for the purposes of this guidance, if, for example, a change has been made to the image processing components, features, classification algorithms, training methods, training data sets, or algorithm parameters. See FDA’s guidance entitled “Computer-Assisted Detection Devices Applied to Radiology Images and Radiology Device Data - Premarket Notification [510(k)] Submissions”8 for information on CADe 510(k) submissions, including when a clinical performance assessment may be recommended for a CADe 510(k) submission.
This guidance does not apply to (1) clinical performance assessment studies for CADe devices that are intended for use during intra-operative procedures, (2) radiological computer-assisted diagnostic (CADx) devices (classified under 21 CFR 892.2060), (3) computer-assisted detection/diagnosis (CADe/x) devices (classified under 21 CFR 892.2090), (4) computer-assisted acquisition/optimization (CADa/o) devices (classified under 21 CFR 892.2100), or (5) radiological computer-assisted triage and notification (CADt) devices (classified under 21 CFR 892.2080), whether marketed as unique devices or bundled with a CADe device that, by itself, may be subject to this guidance. The guidance also does not apply to the many non-CADe devices that are also classified under 21 CFR 892.2050, including, for example, product codes LLZ (System, Image Processing, Radiological) and NFJ (System, Image Management, Ophthalmic). Therefore, most basic radiological image processing and image acquisition products fall outside the scope of this guidance document.
For any of these types of devices, we recommend that you contact the Agency to seek feedback on questions related to the regulatory pathways, regulatory requirements, and recommendations about nonclinical and clinical data.
III. Rationale 🔗
Devices classified as medical image analyzers under 21 CFR 892.2070, are required to comply with the special controls under 21 CFR 892.2070(b), including the requirement for a detailed description of pre-specified performance testing methods and dataset(s) used to assess whether the device will improve reader performance as intended in their design verification and validation.**9
Medical image analyzers are also required to include results from performance testing that demonstrate that the device improves reader performance in the intended use population when used in accordance with the instructions for use. The performance assessment must be based on appropriate diagnostic accuracy measures. The test dataset must contain a sufficient number of cases from important cohorts (e.g., subsets defined by clinically relevant confounders, effect modifiers, concomitant diseases, and subsets defined by image acquisition characteristics) such that the performance estimates and confidence intervals of the device for these individual subsets can be characterized for the intended use population and imaging equipment.10 The recommendations that follow may be useful in demonstrating compliance with those special controls.
Recommendations: The following sections provide recommendations on the clinical study design, methods, study population, and reporting for evaluating CADe systems. This guidance provides recommendations as to how you should design and conduct your clinical performance assessment studies (i.e., well-controlled clinical investigations) for your CADe device. These studies may be part of your 510(k) submission to FDA. The recommendations in this document are meant to guide you as you develop and test your CADe device; they are not meant to specify the full content or type of premarket submission that may be applicable to your device.11 If you would like the Agency's advice about the classification and the regulatory requirements that may be applicable to your device, you may submit a request under section 513(g) of the Federal Food, Drug, and Cosmetic Act (the FD&C Act).12
We recommend that you request the Agency’s review of your protocols prior to initiating your standalone performance assessment and clinical performance assessment studies for your CADe device. To request the Agency’s review of your protocols, you may submit a Pre-Submission to the Agency.13
IV. Clinical Study Design 🔗
The clinical performance assessment of a CADe device is intended to demonstrate the clinical safety and effectiveness of your device for its intended use, when used by the intended user and in accordance with its proposed labeling and instructions.
As described above in the scope, a CADe device, by design, is intended to identify data that may reveal abnormalities during interpretation of patient radiology images or data by the clinician.
There is a complex relationship between the CADe output and the clinician such that clinical performance may depend on a variety of factors that should be considered in any study design including:
- timing of CADe application in the interpretive process (e.g., concurrent or second read);
- physical characteristics of the CADe mark, i.e., size and shape, type of boundary (e.g., solid, dashed, circle, isocontour), and proximity of the CADe mark to the abnormality;
- user’s knowledge of the type of abnormalities that the CADe is designed to mark; and
- number of CADe marks.
Your clinical performance assessment should use a well-controlled study design to preclude or limit various biases that might impact conclusions on the device’s safety or effectiveness (see Appendix). We especially recommend this when your clinical performance assessment is performed in a laboratory setting (i.e., off site of the clinical arena) since it is difficult to duplicate the clinical environment in the laboratory. Some study designs that may be utilized to assess your CADe device include:
- A field test or prospective reader study (e.g., randomized controlled trial) that evaluates a device in actual clinical conditions. A field test may not be practical in situations, for example, where there is very low disease prevalence that may necessitate enrollment of an excessively large number of patients.
- A retrospective reader study consisting of a retrospective case collection enriched with diseased/abnormal cases is a possible surrogate for a field test.
- A stress test is another option for the clinical performance assessment of some CADe A stress test is a retrospective reader study enriched with patient cases that contain more challenging imaging findings (or other image data) than normally seen in routine clinical practice but that still fall within the device’s intended use population (see Section V. Study Population). Note that the use of sample enrichment will likely alter reader performance in the trial compared with clinical practice because of the differences in disease prevalence (and case difficulty for stress testing) between the trial and clinical practice.
The clinical performance assessment of CADe devices is typically performed by utilizing a multiple reader multiple case (MRMC) study design, where a set of clinical readers (i.e., clinicians evaluating the radiological images or data in the MRMC study) evaluate image data under multiple reading conditions or modalities (e.g., readers unaided versus readers aided by CADe). The MRMC design can be “fully-crossed” whereby all readers independently read all of the cases. This design offers the greatest statistical power for a given number of cases. However, non-fully crossed study designs may be acceptable, for example in prospective studies where interpretations of the same patient data by multiple clinicians may not be feasible.
Whether you decide on a fully-crossed study design or not, we recommend the use of an MRMC evaluation paradigm to assess the clinical performance of a CADe device using one of the study designs described above. A complete clinical study design protocol should be included in your submission. Pre-specification of the statistical analysis is a key factor for obtaining consistent and convincing scientific evidence. We recommend that you provide:
- a description of the study design;
- a description of how the imaging data are to be collected (e.g., make and model of the imaging device and the imaging protocol);
- a copy of the protocol, including the following:
- hypothesis to be tested and study endpoints,
- plans for checking any assumptions required to validate the tests,
- alternative procedures/tests to be used if the required assumptions are not met,
- study success criteria that indicate which hypotheses should be met in order for the clinical study to be considered a success,
- statistical and clinical justification of the selected case sample size,
- statistical and clinical justification of the selected number of readers,
- image interpretation methodology and relationship to clinical practice,
- randomization methods, and
- reader task including rating scale used (see Section D. Rating Scale);
- the reader qualifications and experience;
- a description of the reader training; and
- a statistical analysis plan (i.e., endpoints, statistical methods) with description of:
Valid estimation of clinical performance for CADe devices is dependent upon sound study design. Aspects of sound clinical study design should include the following:
- study populations (both diseased and normal cases) are appropriately representative of or sampled from the intended use population;
- study design avoids confounding of the CADe effect (e.g., reading session effects);
- sample size is sufficient to demonstrate performance claims;
- truth definition is appropriate for assessment of performance, and uncertainty in the reference standard is correctly accounted for in the study analysis, if applicable;
- appropriate data cohorts are represented in the data set;
- readers are selected such that they are representative of the intended population of clinical users; and
- imaging hardware are selected such that they are consistent with current clinical practice.
A. Evaluation Paradigm and Study Endpoints 🔗
You should select study endpoints to demonstrate that your CADe device is effective (i.e., in a significant portion of the target population, the use of the device for its intended uses and conditions of use, when accompanied by adequate directions for use and warnings against unsafe use, will provide clinically significant results).14 Selection of the primary and secondary endpoints will depend on the intended use of your device and should be fixed prior to initiating your evaluation. Performance metrics based on the receiver operating characteristic (ROC) curve or variant of ROC (e.g., free-response receiver operating characteristic (FROC) curve or
location-specific receiver operating characteristic (LROC) curve), in addition to sensitivity (Se) and specificity (Sp) at a clinical action point will be likely candidates as endpoints. An ROC- based endpoint allows evaluation of the device over a range of operating points. An ROC curve is a plot of Se versus 1-Sp and is a summary of diagnostic performance of a device or a clinician. An FROC curve is a plot of Se versus the number of false positive marks per image set. FROC metrics summarize diagnostic performance when disease location and multiple disease sites per patient are accounted for in the analysis.15 Reporting the Se and Sp pair allows for the evaluation of the device at a clinical threshold or cut-point the reader would act upon. Se is defined as the probability that a test is positive for a population of patients with the disease/condition/abnormality while Sp is defined as the probability that the test is negative for a population of normal patients (i.e., patients without the disease/condition/abnormality).16
Se and Sp estimates should be based on an explicit clinical determination by the clinical readers (e.g., recall or no recall), not indirectly from ratings data used to generate the ROC curves. Data collection for ROC, Se, and Sp can be done simultaneously within a single reader study. An example reading scenario could be to first have the readers give a clinically justified binary response for Se and Sp evaluation and then to immediately follow with a rating response consistent with the binary response to be used for ROC evaluation.
You may employ various summary performance metrics to assess the effectiveness of the use of your CADe device by readers (and such metrics may vary based on the specific device and clinical indication). Examples of these include:
- area, partial area, or other measures of the ROC curve;
- area, partial area, or other measures of the FROC curve;
- area, partial area, or other measures of the LROC curve;
- reader Se and Sp (or Se and recall rate17); and
- reader localization accuracy.
We recommend using an ROC summary performance metric as part of your primary analysis, although we recognize that alternate performance metrics may also be appropriate. As mentioned above, we recommend that you include Se and Sp as a secondary endpoint in your analysis when an ROC summary performance metric is used. Reporting Se and Sp (or Se and recall rate) may provide additional information for understanding the expected impact of a device on clinical practice. We also recommend that you contact the Agency when you are considering an alternate performance metric.
For study endpoints based on the area under the ROC/FROC/LROC curve or partial area under the ROC/FROC/LROC curve, we recommend that you provide plots of the actual curves along with summary performance information for both parametric and non-parametric analysis approaches when possible.18 Finally, you should check all methods utilized in your analysis for the adequacy of their fit to the data as appropriate.
The selection of lesion-based, patient-based, or another unit-based measure of performance as a primary or secondary endpoint will depend on the intended use and the expected impact of the device on clinical practice. Powering any additional units-based analyses for statistical significance should not be necessary unless you intend to make specific performance claims.
We recommend that you describe your statistical evaluation methodology and provide results including:
- overall reader performance;
- stratified performance by relevant confounders or effect modifiers (e.g., lesion type, lesion size, lesion location, scanning protocol, imaging hardware, concomitant diseases) (see Section V. Study Population) (powering each cohort for statistical significance is not necessary unless you are making specific subset performance claims); and
- confidence intervals (CIs) that account for reader variability, case variability, and truth variability or other sources of variability when appropriate.
We recommend that you identify and validate your analysis software.19 You should provide a reference to the analysis approach used, clarify the software implementation, and specify a version number if appropriate. Certain validated MRMC analysis approaches, examples of which can be found in the literature or obtained online, may be appropriate for your device evaluation depending on its intended use and conditions of use.20,21
The definitions of a true positive, true negative, false positive, and false negative CADe mark should be consistent with the intended use of the device and the characterization of the reference standard (see Section VI. Reference Standard).
B. Control Arm 🔗
We recommend that you assess the clinical performance of your CADe device relative to a control modality. A study control arm that uses conventional clinical interpretation (i.e., reader interpretation without the CADe device) should generally be the most relevant comparator in CADe performance assessment. For CADe devices intended as second readers, another possible control is double reading by two clinicians. These controls or a direct comparison with the predicate CADe device should generally be appropriate for establishing substantial equivalence. Other control arms may be valid. We recommend that you contact the Agency to discuss your choice of a control arm prior to conducting your clinical study.
The study control arm should utilize the same reading methodology as the device arm and be consistent with clinical practice. The same population of cases, if not the same cases themselves, should be in all study arms to minimize potential bias. For designs that include distinct cases in each study arm, we recommend that you provide a description and flow chart demonstrating how you randomized patients and readers into the different arms.
C. Reading Scenarios and Randomization 🔗
Reading scenarios in the clinical evaluation should be consistent with the intended use of the device. The following are examples of reading scenarios that may be part of a CADe clinical evaluation.22
- Devices for second reader use only (sequential design):
- a conventional reading by the readers without the CADe device (i.e., reader alone);
- a second-read by the readers in which the CADe output is displayed immediately after conducting a conventional interpretation (this reading could occur within the same reading session as the conventional read in what is termed a “sequential” reading scheme).
- Devices for concurrent reading (cross-over design):
- a conventional reading by the readers without the CADe device (i.e., reader alone);
- a concurrent or simultaneous read by the readers in which the CADe output is available at any time during the interpretation process (this reading would be conducted in a separate reading session from the conventional read).
You should randomize readers, cases, and reading scenarios to reduce bias in performance measures. We recommend that you describe your randomization methodology and provide an associated flowchart. One approach to randomization is to make use of the principle of Latin squares as part of a block design to the reader study.23
In case of multiple reading sessions where the same cases are read multiple times, we recommend that you separate each reading session in time by at least four weeks to avoid memory bias. However, longer time gaps may be advisable. For shorter or longer time gaps between reading sessions, we recommend that you provide data supporting your proposed time gaps.
D. Rating Scale 🔗
You should use conventional medical interpretation and reporting for lesion location, extent, and patient management. ROC-based endpoints (see Section IV.A. Evaluation Paradigm and Study Endpoints) may support collecting data with a finer rating scale (e.g., a 7-point or 100-point scale) when readers rate the lesion and/or disease status in a patient. We recommend providing training to the readers on the use of the rating scale (see Section IV.F. Training of Study Participants).
E. Scoring 🔗
We refer to the procedure for determining the correspondence between the reader’s interpretation and the ground truth (e.g., disease status) as the scoring process. The scoring process and the scoring definition are important components in the clinical assessment of a CADe device and you should describe them. We recommend that you describe the process (i.e., rationale, definition, and criteria) for determining whether a reader’s interpretation corresponds to the truth status established during the truthing process (see Section VI. Reference Standard for information on the truthing process).
In this guidance, we describe scoring in terms of the clinical performance assessment. A different type of scoring is used to evaluate device standalone performance, which is described in FDA’s guidance entitled “Computer-Assisted Detection Devices Applied to Radiology Images and Radiology Device Data - Premarket Notification [510(k)] Submissions.”24
The scoring process for the clinical studies should be consistent with the abnormalities marked by the CADe and the intended use of your device. You should describe and fix the scoring process prior to initiating your evaluation. In your description of the scoring process, we recommend that you indicate whether the scoring is based on:
- electronic or non-electronic means;
- physical overlap of the boundary, area, or volume of a reader mark in relation to the boundary, area, or volume of reference standard;
- relationship of the centroid of a reader mark to the boundary or spatial location of reference standard;
- relationship of the centroid of the reference standard to the boundary or spatial location of a reader mark;
- interpretation by reviewing reader(s); or
- other methods.
For scoring that relies on interpretations by reviewing readers, we recommend that you provide the number of readers involved, their qualifications, their levels of experience and expertise, the specific instructions conveyed to them prior to their participation in the scoring process, and any specific criteria used as part of the scoring process. When multiple readers are involved in scoring, you should describe the process by which you combine their interpretations to make an overall scoring determination or how you incorporate their interpretations into the performance evaluation, including how you address any inconsistencies.
F. Training of Clinical Readers 🔗
We recommend that you specify instructions and provide training to clinical readers in the study on the use of the CADe device and the details on how to participate in the clinical study. Training should include a description of the device and instructions for how to use the device. For specialized reading instructions or rules (e.g., rules for changing initial without-CADe interpretation when reviewing the CADe marks), we recommend that you justify their clinical relevance according to reading task, clinical workflow, and medical practice.
We also recommend that you provide training to the readers on the use of the rating scale (see Section IV.D. Rating Scale), especially if such a rating scale is not generally utilized in clinical practice. Such training helps avoid incorrect or un-interpretable results. We recommend that reader training include rating a representative set of normal and abnormal cases according to the study design methodology, and making use of cases that are not part of the testing database.
V. Study Population 🔗
You may collect patient data (i.e., cases) prospectively or retrospectively based on well-defined inclusion and exclusion criteria. We recommend that you provide the protocol for your case collections. Note that cases collected for your clinical trial should be independent of the cases used during your device development. An acceptable approach for acquiring data is the collection of consecutive cases that are within the inclusion and outside of the exclusion criteria from each participating collection site.
Enrichment with diseased/abnormal cases is permissible for an efficient and less burdensome representative case data set. Enrichment may affect reader performance, so the extent of enrichment should be weighed against the introduction of biases into the study design. You may also enrich the study population with patient cases that contain imaging findings (or other image data) that are challenging to clinicians but that still fall within the device’s intended use population. This enrichment is often referred to as stress testing and will introduce biases into the study design. Therefore, not all approaches in developing a stress population may be appropriate for use in your clinical assessment. For example, a study population enriched with cases containing small colonic polyps may be appropriate when assessing a general CADe device designed to assist in detecting a wide range of polyp sizes. However, we do not recommend enrichment with cases based on the performance of the CADe device.
You should sample the study population from the intended use population for your device so that the case set includes an appropriate range of diseased/abnormal and normal cases. Selective sampling from cohorts within the intended use population is appropriate when you utilize stress testing. The sample size of the study should be large enough such that the study has adequate power to detect with statistical significance your proposed performance claims. When appropriate, the study should contain a sufficient number of cases from important cohorts (e.g., subsets defined by clinically relevant confounders, effect modifiers, and concomitant diseases) such that clinical performance estimates and confidence intervals can be obtained for these individual subsets. Powering each cohort for statistical significance should not be necessary unless you are making specific subset performance claims. When you make multiple performance claims, a pre-specified statistical adjustment for the testing of multiple subsets would be necessary.
When describing your study population, we recommend that you provide specific information, where appropriate, including:
- the patient demographic data (e.g., age, ethnicity, race, sex);
- the patient medical history relevant to the CADe application;
- the patient disease state and indications for the radiologic test;
- the conditions of radiologic testing (e.g., technique, including whether the test was performed with/without contrast, contrast type and dose per patient, patient body mass index, radiation exposure, T-weighting for MRI images, and views taken);
- a description of how the imaging data were collected (e.g., make and model of imaging devices and the imaging protocol);
- the collection sites;
- the processing sites, if applicable (e.g., patient data digitization);
- the number of cases:
- the number of diseased cases,
- the number of normal cases,
- methods used to determine disease status, location, and extent (see Section Section VI. Reference Standard);
- the case distributions stratified by relevant confounders or effect modifiers, such as lesion type (e.g., hyperplastic vs. adenomatous colonic polyps), lesion size, lesion location, disease stage, organ characteristics (e.g., breast composition), concomitant diseases, imaging hardware (e.g., makes and models), imaging or scanning protocols, collection sites, and processing sites (if applicable); and
- a comparison of the clinical, imaging, and pathologic characteristics of the patient data compared to the target population.
A. Data Poolability 🔗
Premarket approval applications based solely on foreign clinical data and otherwise meeting the criteria for approval may be approved if, among other requirements, the foreign data are applicable to the United States (U.S.) population and U.S. medical practice and clinical investigators of recognized competence have performed the studies.25 You should justify why non-U.S. data reflect what is expected for a U.S. population with respect to disease occurrence, characteristics, practice of medicine, and clinician competency. In accordance with good clinical study design, you should justify, both statistically and clinically, the poolability of data from multiple sites. We recommend that 510(k) submissions follow similar quality data practices with regard to foreign data and data poolability. You are encouraged to contact the Agency if you intend to rely on foreign clinical data as the basis of your premarket submission.
B. Test Data Reuse 🔗
The Agency recognizes the difficulty in acquiring data for CADe assessment. The Agency also recognizes that the readers and CADe algorithm may implicitly or explicitly become tuned to the test data if the same test set is used multiple times. If no algorithm training has occurred between two tests (e.g., if the sponsor is investigating the effect of a new prompt type), then a new set of readers with no knowledge of the results of the previous study may address concerns about the use of the same test data set. If CADe algorithm training has occurred between tests, or if this is a new CADe algorithm, the Agency encourages evaluation of the CADe system using newly acquired independent test cases (i.e., the Agency discourages repeated use of test data in evaluating device performance). If you are considering data reuse in the evaluation of your CADe device, you should demonstrate that reusing any part of the test data does not introduce unreasonable bias into estimates of CADe performance and that test data integrity is maintained.
In the event that you would like the Agency to consider the reuse of any test data, you should control the access of your staff to performance results for the test subpopulations and individual cases. It may therefore be necessary for you to set up a “firewall” to ensure those outside of the regulatory assessment team (e.g., algorithm developers) are completely insulated from knowledge of the radiology images and radiological data, and anything but the appropriate summary performance results. You should maintain test data integrity throughout the lifecycle of the product. This is analogous to data integrity involving clinical trial data monitoring committees and the use of a “firewall” to insulate personnel responsible for proposing interim protocol changes from knowledge of interim comparative results. For more information on data integrity for clinical trial data monitoring committees, refer to Section 220.127.116.11 of the FDA guidance entitled “Establishment and Operation of Clinical Trial Data Monitoring Committees for Clinical Trial Sponsors.”26
To minimize the risk of tuning to the test data and to maintain data integrity, we recommend that you develop an audit trail and implement the following controls when you contemplate the reuse of any test data:
- you randomly select the data from a larger database that grows over time;
- you retire data acquired with outdated image acquisition protocols or equipment that no longer represents current practice;
- you place a small fixed limit on the number of times a case can be used for assessment;
- you maintain a “firewall” such that data access is tightly controlled to ensure that nobody outside of the regulatory assessment team, especially anyone associated with algorithm development, has access to the data (i.e., only summary performance results are reported outside of the assessment team);
- you maintain a data access log to track each time the data is accessed, including a record of who accessed the data, the test conditions, and the summary performance results; and
- you use new readers in each new clinical reader study.
The purposes of the audit trail include: (1) establishing that you defined the cases in the training and test sets appropriately such that data leakage between training and test sets did not occur; (2) ensuring that you fixed the new CADe algorithm in advance (i.e., before application to the test set); and (3) providing information concerning the extent to which you used the same test set or a subset thereof for testing other CADe algorithms or designs, including results reported to the Agency as well as non-reported results. The controls we are recommending are intended to substantially reduce the chance that you evaluate a new CADe algorithm in a subsequent study using the same test data set that you used for a prior CADe algorithm.
If you reuse test data, you should report the test performance for relevant subsets in addition to the overall performance. These subsets include: (1) the portion of the test set for the new CADe algorithm that overlaps with previously used test sets; and (2) the portion of the test set that you have never been used before. Since these subsets will be smaller in size compared to the overall test set for the new CADe algorithm, confidence intervals for the subsets will be wider.
However, the trends for the mean performances would be helpful to indicate whether you have tuned the CADe system to previously used portions of the test set.
VI. Reference Standard 🔗
For purposes of this guidance, the reference standard (also often called the “gold standard” or “ground truth” in the imaging community) for patient data indicates whether the disease/condition/abnormality is present and may include such attributes as the extent or location of the disease/condition/abnormality. We refer to the characterization of the reference standard for the patient (e.g., disease status) as the truthing process.
We recommend that you provide the rationale for your truthing process and indicate if it is based on:
- the output from another device;
- an established clinical determination (e.g., biopsy, specific laboratory test);
- a follow-up clinical imaging examination;
- a follow-up medical examination other than imaging; or
- an interpretation by a reviewing clinician(s) (i.e., clinical truther(s)).
We also recommend that you describe the methodology utilized to make this reference standard determination (e.g., based on pathology or based on a standard of care determination). For truthing that relies on the interpretation by a reviewing clinician(s), we recommend that you provide:
- the number of clinical truthers involved;
- their qualifications;
- their levels of experience and expertise;
- the instructions conveyed to them prior to participating in the truthing process;
- all available clinical information from the patient utilized by the clinical truthers in the identification of disease/condition/abnormality and in the marking of the location and extent of the disease/condition/abnormality; and
- any specific criteria used as part of the truthing process.
When multiple clinical truthers are involved, you should describe the process by which you combine their interpretations to make an overall reference standard determination and how your process accounts for inconsistencies between clinicians participating in the truthing process (truth variability) (see Section IV.A. Evaluation Paradigm and Study Endpoints). Note that clinicians participating in the truthing process should not be the same as those who participate in the core clinical performance assessment of the CADe device.
VII. Reporting 🔗
Reporting of performance results may be guided by the FDA guidance entitled “Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests.”27 We recommend submitting electronically the data used in any statistical analysis in your study including the following:
- patient information,
- disease or normal status,
- concomitant diseases,
- lesion size,
- lesion type,
- lesion location,
- disease stage,
- organ characteristics,
- imaging hardware,
- imaging or scanning protocol,
- imaging and data characteristics (e.g., characteristics associated with differences in digitization architectures for a CADe using scanned films), and
- statistical analysis.
A. Potential Sources of Bias in a Retrospective Reader Study 🔗
Despite their practical value, the design of retrospective reader studies for evaluating CADe devices may generate estimates of CADe performance that are subject to one or more potential sources of statistical bias. Statistical bias is a tendency for a performance estimate in a study to be misaligned with the true performance in the intended use population. Many sources of statistical bias can often be minimized or at least mitigated through good study design. Some potential sources of bias in reader studies include, but are not limited to:
- Selection Bias: The sample of subjects (or readers) is not representative of the target population.
- Spectrum Bias: The sample of subjects (or readers) studied does not include a complete spectrum of the target population.
- Imperfect Reference (Gold) Standard Bias: The reference procedure is not 100% accurate at classifying subjects by presence or absence of the condition of interest (e.g., breast cancer, polyps in colon).
- Verification Bias: Statistical analysis of diagnostic performance is based only on subjects verified for presence or absence of the condition of interest by the reference standard.
- Reading-Order Bias: When comparing two or more tests, the reader’s interpretation is affected by his or her memory of the results from the competing test.
- Context Bias: When the sample prevalence of the abnormal condition differs greatly from the prevalence in the target population, the reader’s interpretations may be affected, resulting in biased estimates of his/her diagnostic accuracy.
Selection bias is introduced when subjects (cases) selected for study are not representative of the intended use population. Retrospective selection of subjects with available images can introduce a selection bias. Random or consecutive sampling of subjects may eliminate or mitigate selection bias. Selection bias can also occur in terms of the readers selected to participate in the study. Readers should reflect the population of readers that will use the device. A small number of readers having similar training or clinical experiences (e.g., readers all from the same clinic) may not generalize to the full population of readers expected to use a CADe device.
Spectrum bias is a special form of selection bias in which the study subjects represent an incomplete spectrum of characteristics of subjects in the intended use population (i.e., important subgroups are missing from the study).
Enriching a CADe study with subjects that are difficult to diagnose (i.e., stress testing) can bias the CADe effect. For example, if the CADe is most helpful for these difficult cases, then the CADe effect can be enhanced with stress testing relative to the intended use population. Nonetheless, stress testing is encouraged because it provides value in evaluating CADe in important subgroups, and results in study designs with smaller sample sizes when applied appropriately.
Enriching a CADe study with subjects having the abnormal condition produces biased estimates of CADe positive and negative predictive value because these depend on the prevalence of the abnormal condition. “Corrected” estimates of predictive values would have to rely on an estimate of prevalence external to the study. The enhanced prevalence may also indirectly cause biased estimates of area under the ROC curve (AUC), of Se, and of Sp if readers are “tipped off” to the enhanced prevalence in the study and change their reading behavior, accordingly creating context bias.
Retrospective reading of image sets may also introduce bias into CADe performance. The basic concern is that reading behavior may change relative to actual clinical practice because readers know that they are participating in a study in which patient management is not affected by their diagnostic readings.
Ground truth determination for the presence or absence of the condition of interest can be subject to several sources of bias. A reference standard is used to determine ground truth. Imperfect reference standard bias refers to the reference standard misclassifying some subjects as having or not having the condition. For example, ground truth determination is incomplete if subjects diagnosed as negative for the abnormal condition on initial evaluation are not followed up for confirmation that they were indeed free of the condition.
Verification bias refers to the ground truth being missing for some subjects. If in the statistical analysis, only the subjects on whom ground truth has been established are included, estimates of CADe performance can be biased.
The study design of a retrospective reader study can produce potential sources of statistical bias. For example, in a sequential reading design, readers are instructed to read sequentially first unaided by CADe, and then aided by CADe as a second reader. The comparison is between an unaided read and the unaided read combined with the CAD-aided read. The sequential reading design is attractive because the proximity of the readings in the two modalities minimizes intra-reader variability. However, if the reading condition in the indications for use (IFU) of the device are different from that in the reader study, a concern might be that the effect of CADe on the diagnostic performance of readers may be confounded with the effect due to the additional time given readers to read each case. Another concern is that a reader may undercall the unaided reading relative to the reading aided by CADe, producing an enhanced CADe effect. If this concern is real, mitigation may be to randomize a fraction of the cases to be read only in the initial unaided reading mode. Randomization is revealed only after the initial unaided read is made. Randomization of the cases should be done separately for each reader in a way that ensures that all cases are given a CADe-aided reading by some readers.
An alternative to the sequential reading design is the cross-over design. Cases are read unaided and aided by the CADe output in two independent reading sessions separated by a washout period to erase reader memory of the images. Half of the cases are randomly assigned to group A and the other half to group B. In reading session 1, group A cases are read unaided and group B cases are read aided by CADe. In reading session 2, the cases are “crossed over” to the other modality (aided reading for group A, unaided reading for group B). Any effects the particular reading sessions have on the readings cancel in the comparison of the two modalities. However, relative to the sequential reading design, the two reading sessions contribute additional variability to estimate of the CADe effect. The cross-over design may be particularly appropriate for concurrent reading. It can also be generalized for use in evaluating more than two modalities (e.g., if in addition to unaided reading, the CADe has two or more modalities itself).
B. Additional References 🔗
- Gallas, B.D., Chen W., Cole E., Ochs R., Petrick N., Pisano E.D., Sahiner B., Samuelson W., Myers K.J., Impact of prevalence and case distribution in lab-based diagnostic imaging studies. Journal of Medical Imaging, 2019. 6(1): 015501. https://dx.doi.org/10.1117%2F1.JMI.6.1.015501.
- Bossuyt, P.M., Reitsma, J.B., Bruns, D.E., Gatsonis, C.A., Glasziou, P.P., Irwig, L., Lijme,r G., Moher, D., Rennie, D., de Vet, H.C., Kressel, H.Y., Rifai, N., Golub, R.M., Altman, D.G., Hooft, L., Korevaar, D.A., Cohen, J.F., STARD Group, STARD 2015: An Updated List of Essential Items for Reporting Diagnostic Accuracy Studies. Radiology, 2015. 277(3):826-832. https://doi.org/10.1148/radiol.2015151516.
- Obuchowski, N.A., Gallas, B.D., Hillis, S.L., Multi-reader ROC Studies with Split-plot Designs: A Comparison of Statistical Methods. Acad Radiol, 2012. 19(12): 1508-1517. https://dx.doi.org/10.1016%2Fj.acra.2012.09.012.
- Obuchowski, N.A., Meziane, M., Dachman, A.H., Lieber, M.L., Mazzone, P.J., What's the control in studies measuring the effect of computer-aided detection (CAD) on observer performance? Acad Radiol, 17(6):761-767. https://doi.org/10.1016/j.acra.2010.01.018.
- Pepe S., Statistical evaluation of medical tests for classification and prediction. Oxford University Press, 2003. 24(16). https://doi.org/10.1002/sim.2185.
- Petrick, N., Sahiner, B., Armato, S.G. 3, Bert, A., Correale, L., Delsanto, S., Freedman, T., Fryd, D., Gur, D., Hadjiiski, L., Huo, Z., Jiang, Y., Morra, L., Paquerault, S., Raykar, V., Samuelson, F., Summers, R.M., Tourassi, G., Yoshida, H., Zheng, B., Zhou, C., Chan, H.P., Evaluation of computer-aided detection and diagnosis systems. Med Phys, 2013. 40(8), 087001. https://doi.org/10.1118/1.4816310.
- Zhou, X., Obuchowski, N.A., McClish, D.K., Statistical Methods in Diagnostic Medicine, Second Edition. Wiley, 2011. https://doi.org/10.1002/9780470906514.
The use of the acronym CADe for computer-assisted detection may not be a generally recognized acronym in the community at large. It is used here to identify the specific type of devices discussed in this document. ↩
Available at https://wayback.archiveit.org/7993/20170405191951/https:/www.fda.gov/AdvisoryCommittees/CommitteesMeetingMaterials/MedicalDevices/MedicalDevicesAdvisoryCommittee/RadiologicalDevicesPanel/ucm124890.htm. ↩
Available at https://wayback.archiveit.org/7993/20170405191949/https:/www.fda.gov/AdvisoryCommittees/CommitteesMeetingMaterials/MedicalDevices/MedicalDevicesAdvisoryCommittee/RadiologicalDevicesPanel/ucm147063.htm. ↩
See “Radiology Devices; Reclassification of Medical Image Analyzers” (85 FR 3545) at
For more information on product codes, please refer to FDA’s guidance entitled “Medical Device Classification Product Codes“available at https://www.fda.gov/regulatory-information/search-fda-guidance-documents/medical- device-classification-product-codes-guidance-industry-and-food-and-drug-administration-staff. ↩
CFR 892.2070(b)(1)(ii). ↩
CFR 892.2070(b)(1)(iii). ↩
A 510(k) submission is the most common submission type for the CADe devices addressed in this guidance. The guidance “Computer-Assisted Detection Devices Applied to Radiology Images and Radiology Device Data - Premarket Notification [510(k)] Submissions” provides additional information on the device description and standalone performance assessment for 510(k) submissions. ↩
Section 513(g) of the FD&C Act (21 U.S.C. 360c(g)) provides a means for obtaining the Agency's views about the classification and the regulatory requirements that may be applicable to your device. ↩
For more information on Pre-Submissions, see FDA’s guidance entitled “Requests for Feedback and Meetings for Medical Device Submissions: The Q-Submission Program” available at https://www.fda.gov/regulatory- information/search-fda-guidance-documents/requests-feedback-and-meetings-medical-device-submissions-q- submission-program. ↩
See 21 CFR 860.7(e). ↩
For additional details on these assessment paradigms, see Wagner, R.F., Metz, C.E., and Campbell, G., Assessment of medical imaging systems and computer aids: A tutorial review. Acad. Radiol., 2007. 14(6):723–48. https://doi.org/10.1016/j.acra.2007.03.001. ↩
For online access to software that analyzes MRMC data based on validated techniques, see, for example: LABMRMC software and general ROC software, The University of Chicago: http://metz-roc.uchicago.edu/ (for either quasi-continuous or categorical data); University of Iowa MRMC software: https://perception.lab.uiowa.edu/OR-DBM-MRMC-program-manual (for categorical data); OBUMRM software: https://www.lerner.ccf.org/qhs/software/ ↩
For more information on potential limitations of relying on only one type of ROC analyses, see Gur, D., Bandos, A.I., and Rockette, H.E., Comparing Areas under Receiver Operating Characteristic Curves: Potential Impact of the “Last” Experimentally Measured Operating Point. Radiology, 2008. 247(1):12–15. https://doi.org/10.1148/radiol.2471071321. ↩
For more information on MRMC analysis software, see, for example, Obuchowski, N.A., Beiden, S.V., Berbaum, K.S., Hillis, S.L., Ishwaran, H., Song, H.H., and Wagner, R.F., Multi-reader, multi-case ROC analysis: An empirical comparison of five methods. Acad. Radiol., 2004. 11(9); 980–995. https://doi.org/10.1016/j.acra.2004.04.014. ↩
For MRMC literature references, see, for example: Metz, C.E., Fundamental ROC analysis. Handbook of Medical Imaging, Vol. 1. Physics and Psychophysics. SPIE Press, 2000. Chapter 15, 751–769. https://doi.org/10.1117/3.832716.ch15; Wagner, R.F., Metz, C.E., and Campbell, G., Assessment of medical imaging systems and computer aids: A tutorial review. Acad. Radiol., 2007. 14(6):723–48. https://doi.org/10.1016/j.acra.2007.03.001; and Obuchowski, N.A., Beiden, S.V., Berbaum, K.S., Hillis, S.L., Ishwaran, H., Song, H.H., and Wagner, R.F., Multi-reader, multi-case ROC analysis: An empirical comparison of five methods. Acad. Radiol., 2004. 11(9); 980–995. https://doi.org/10.1016/j.acra.2004.04.014. ↩
For online access to software that analyzes MRMC data based on validated techniques, see, for example: LABMRMC software and general ROC software, The University of Chicago: http://metz-roc.uchicago.edu/ (for either quasi-continuous or categorical data); University of Iowa MRMC software: https://perception.lab.uiowa.edu/OR-DBM-MRMC-program-manual (for categorical data); or OBUMRM software: https://www.lerner.ccf.org/qhs/software/. ↩
Obuchowski N.A., Meziane M., Dachman A.H., Lieber M.L., Mazzone P.J., What's the control in studies measuring the effect of computer-aided detection (CAD) on observer performance? Acad Radiol., 2010. 17(6):761- 7. https://doi.org/10.1016/j.acra.2010.01.018. ↩
21 CFR 814.15. ↩