The AI-STREAM study, described in more detail elsewhere23, is a prospective, population-based study designed to compare the diagnostic accuracy of BRs when interpreting screening mammograms with or without AI-CAD for breast cancer (Fig. 4). The trial enrolled women aged ≥40 years from six academic hospitals that participated in the national breast cancer screening program in South Korea. All participants provided informed consent by completing a consent form after reading a participant information sheet. Of all eligible women, those with a history of breast cancer or mammoplasty were excluded because the AI software used had not been validated for these subgroups. During breast cancer screening, mammography was performed by technologists using DM, with two standard views (craniocaudal and mediolateral oblique) of each breast, and interpreted by a single radiologist (i.e., single-reading strategy), which is the standard procedure for interpreting mammograms in South Korea.
Set 1: As part of the standard single screening procedure, mammograms were read by radiologists without AI-CAD, and then read with AI-CAD assistance. A radiologist specializing in breast imaging (i.e., BR) was defined as someone with more than 10 years of experience at a university hospital and expertise in breast imaging. Even when AI-CAD assistance was provided after the radiologist’s initial interpretation without AI-CAD, the final decision to recall a patient for further diagnostic evaluation was based on the radiologist’s comprehensive decision. Set 2: As a secondary and exploratory objective, a separate simulation study was conducted involving general radiologists (i.e., GR), who did not specialize in breast imaging, to compare their performance with and without AI-CAD. Standalone AI: For exploratory purposes, results from standalone AI-CAD were also evaluated. AI results were considered positive when the abnormality score exceeded a predefined cutoff value of 10. The study compared the diagnostic performance of radiologists, with and without AI-CAD assistance, in identifying screening-detected cancers confirmed pathologically within a 1-year follow-up period.
Study design
The study received approval from the Institutional Review Boards (IRBs) of all participating centers (Kyung Hee University Hospital at Gangdong, Soonchunhyang University Seoul Hospital, Konkuk University Medical Center, Noweon Eulji Medical Center, CHA Bundang Medical Center, and Inje University Busan Paik Hospital), and written consent for data publication was obtained from all screening participants. In this study, mammography interpretation was performed by a breast-specialized radiologist at each of the six university hospitals in a real clinical setting (Supplementary Table 1). A radiologist specializing in breast imaging (i.e., BR) was defined as a radiologist with >10 years of experience at a university hospital and specialized expertise in breast imaging. In the standard single-reading strategy, the clinical decision about whether the participant requires further diagnostic workup is made by a single radiologist. If a previous mammogram is available, the radiologist generally performs a comparative reading against it to determine recall. In this study, mammograms were read by radiologists first without and then with AI-CAD. As part of the standard single-reading screening procedure, even though AI-CAD assistance was provided after the radiologist’s reading without AI-CAD, the final recall for further diagnostic workup was based on the radiologist’s comprehensive decision (set 1 from Fig. 4).
For exploratory purposes, results from standalone AI-CAD were also collected and compared with the diagnostic performance of radiologists with and without the use of AI-CAD. Moreover, a separate simulation study was conducted (set 2 from Fig. 4) to compare the cancer detection rate (CDR) and recall rate (RR) of general radiologists (i.e., GRs), who did not specialize in breast imaging, with versus without AI-CAD. Note that the results from GRs did not affect any real-world clinical decision on further workup for study participants, because recall decisions were based on the readings by BRs. The main rationale behind this additional study was that GRs comprise the majority of readers in South Korea’s breast cancer screening program owing to a shortage of BRs. Thus, gaining insight into the potential effect of AI-CAD on GRs’ performance would be highly reflective of, and valuable to, real-world screening practice. All radiologists who participated in the study, including both BRs and GRs, had no prior experience with the AI-CAD program, minimizing bias.
Procedures
For image acquisition and data management, a cloud-based imaging data management platform (IRM’s BEST Image) was used (Supplementary Fig. 1, Supplementary Note 1). Mammograms were processed by a Snupi program, which performed various operations (e.g., search, inquiry, de-identification, separation) and transmission functions on Digital Imaging and Communications in Medicine (DICOM) files. After participant information was de-identified and study identifiers were assigned, mammograms were exported to the platform, where reading results were recorded. If participants had mammograms from within the past four years, these were also exported to the study platform for comparison, replicating the real-world reading procedure of each site.
As part of the routine screening process, BRs interpreted the mammograms without AI-CAD and recorded the findings (Test 1), followed by the automatic presentation of AI-CAD results (abnormality score and marks) for review and recording. The final recorded results were based on a comprehensive assessment that considered the interpretations both with and without AI-CAD (Test 2). Results of Test 1 (without AI-CAD) could not be modified or adjusted once the AI-CAD results had been reviewed. Likewise, the results of Test 2 (with AI-CAD) could not be corrected after reading (Supplementary Fig. 1).
For variables recorded per ‘Test’, the radiologist first recorded the breast density according to the Breast Imaging Reporting and Data System (BI-RADS) 5th edition (A, B, C, D). Second, the radiologist assessed malignancy using a 7-point scale (1, definitely normal; 2, benign; 3, probably benign [0–2%]; 4, low suspicion for malignancy [2–10%]; 5, moderate suspicion for malignancy [10–50%]; 6, high suspicion for malignancy [50–95%]; 7, highly suggestive of malignancy [≥95%]). For cases not recalled, the radiologist could choose a malignancy assessment score of 1 or 2, whereas for recalled cases, the radiologist could choose a score between 3 and 7. Scores of 1 or 2 were considered negative (BI-RADS 1 or 2), while scores of 3 and higher were considered positive (BI-RADS 3 to 5), indicating the need for a recall. When a recall decision was made, the location (left, right, both) of the recall was also recorded. The recall of participants for further diagnostic workup was the BR’s comprehensive decision, informed by the paired reading results with and without AI-CAD.
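The mapping from the 7-point malignancy scale to a recall decision can be sketched as follows. This is a minimal illustration of the rule stated above, not the study platform's code; the names `MALIGNANCY_SCALE` and `is_recall` are hypothetical.

```python
# Hypothetical sketch of the 7-point scale described in the text.
# Labels follow the text; the identifiers are illustrative assumptions.
MALIGNANCY_SCALE = {
    1: "definitely normal",
    2: "benign",
    3: "probably benign",
    4: "low suspicion for malignancy",
    5: "moderate suspicion for malignancy",
    6: "high suspicion for malignancy",
    7: "highly suggestive of malignancy",
}


def is_recall(score: int) -> bool:
    """Return True when a score counts as positive (recall).

    Scores 1-2 are negative; scores 3-7 are positive, as stated in the text.
    """
    if score not in MALIGNANCY_SCALE:
        raise ValueError(f"score must be an integer from 1 to 7, got {score!r}")
    return score >= 3


if __name__ == "__main__":
    for s in sorted(MALIGNANCY_SCALE):
        print(s, MALIGNANCY_SCALE[s], "recall" if is_recall(s) else "no recall")
```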
If a participant was recalled and visited the same hospital where the screening mammography was performed, additional diagnostic workup (e.g., special mammography views, DBT, ultrasonography) was conducted. If needed, a biopsy was performed, and if a pathologist subsequently diagnosed breast cancer (screen-detected), the participant’s information was recorded separately. If surgery was performed, the final pathology was confirmed. Participants diagnosed with breast cancer were reviewed for other breast imaging and pathologic features from electronic medical records and pathology reports (if available).
As a secondary and exploratory objective, a separate reading set (set 2) was designed and conducted as a simulation study (Fig. 4). In set 2, five general radiologists (GRs) who did not specialize in breast imaging interpreted the same participants’ mammograms; the GRs had variable experience as radiologists and in interpreting mammography. The mammograms were interpreted with and without AI-CAD using the same research platform, and the corresponding results were recorded on that platform. All participating radiologists, including both BRs and GRs, had no prior experience with the AI-CAD program.
The study used a commercial AI-CAD system (Lunit INSIGHT MMG, available at https://insight.lunit.io, version 1.1.7.1), which has been validated in several studies11,24 (Supplementary Note 3). In brief, the AI system has been shown to improve radiologists’ performance and to have diagnostic performance equivalent or superior to that of radiologists alone24. It has also shown superior performance compared with two other commercial AI-based software products11. The AI system provides an abnormality score ranging from 0 to 100 for each breast based on the mammograms. These scores can also be presented as a heatmap or grayscale map. AI results were considered positive if the abnormality score was above a predefined cutoff of 10. The highest lesion abnormality score was used for each examination, and examinations with scores of 10 or higher were considered positive.
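The standalone AI decision rule can be sketched as below, under stated assumptions. The text describes the cutoff both as "above 10" and as "10 or higher", so the boundary handling is exposed as an explicit parameter rather than settled here. The function name and its arguments are hypothetical and are not part of the vendor's software.

```python
from typing import Iterable


def ai_positive(
    abnormality_scores: Iterable[float],
    cutoff: float = 10.0,
    inclusive: bool = True,
) -> bool:
    """Classify one examination from its abnormality scores (0 to 100).

    The highest score in the examination is compared with the cutoff.
    `inclusive=True` treats a score equal to the cutoff as positive
    ("10 or higher"); `inclusive=False` requires a strictly greater
    score ("above 10"). Which one applies is an assumption of this sketch.
    """
    scores = list(abnormality_scores)
    if not scores:
        raise ValueError("at least one abnormality score is required")
    if any(s < 0 or s > 100 for s in scores):
        raise ValueError("abnormality scores must lie between 0 and 100")
    top = max(scores)
    return top >= cutoff if inclusive else top > cutoff


if __name__ == "__main__":
    # Illustrative values only: one score per breast.
    print(ai_positive([3.2, 41.7]))
    print(ai_positive([3.2, 10.0], inclusive=False))
```

The two settings differ only for examinations whose highest score is exactly 10.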
Outcomes
The primary outcomes were the CDRs and RRs of BRs with and without AI-CAD in mammography reading for screen-detected breast cancer, including invasive cancer or ductal carcinoma in situ (or both). The secondary outcomes were the CDRs and RRs of mammography reading in the following comparisons: (1) BRs without AI-CAD vs AI standalone, (2) BRs with AI-CAD vs AI standalone, (3) GRs without AI-CAD vs GRs with AI-CAD, (4) GRs without AI-CAD vs AI standalone, and (5) GRs with AI-CAD vs AI standalone. PPV1 was defined as the percentage of all positive screening examinations with a pathologic cancer diagnosis within 1 year, and the PPV1 for screen-detected cancer was calculated for BRs with and without AI-CAD.
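The outcome measures can be illustrated with a short sketch. PPV1 follows the definition given above; expressing CDR per 1000 examinations and RR as a percentage of examinations follows common screening conventions and is an assumption here, since the text does not state the formulas. The `Exam` class and `screening_metrics` function are hypothetical, and the counts in the usage example are invented for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Exam:
    recalled: bool          # positive screening reading (score of 3 or higher)
    cancer_within_1y: bool  # pathologically confirmed cancer within 1 year


def screening_metrics(exams: Sequence[Exam]) -> Dict[str, float]:
    """Compute CDR (per 1000 exams), RR (%), and PPV1 (%) for one reading arm."""
    n = len(exams)
    if n == 0:
        raise ValueError("no examinations supplied")
    positives = sum(e.recalled for e in exams)
    detected = sum(e.recalled and e.cancer_within_1y for e in exams)
    return {
        "cdr_per_1000": 1000 * detected / n,
        "recall_rate_pct": 100 * positives / n,
        # PPV1 is undefined when nothing was recalled.
        "ppv1_pct": 100 * detected / positives if positives else float("nan"),
    }


if __name__ == "__main__":
    # Invented toy counts, not study data.
    toy = (
        [Exam(True, True)] * 5
        + [Exam(True, False)] * 45
        + [Exam(False, False)] * 950
    )
    print(screening_metrics(toy))
```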
Statistical analysis
The sample size was estimated using McNemar’s test to detect differences in CDRs between radiologists with and without AI-CAD, with a two-sided test at a significance level of 0.05 and 80% power. The assumed cancer prevalence was 3.21 per 1000 examinations, determined from data in a previous retrospective study, and the target sample size was chosen based on this expected cancer prevalence25. The target sample size was 32,714 participants, corresponding to approximately 16,000 participants per year. The expected number of participants was, however, adjusted from the initial study design because of the COVID-19 pandemic, but no effect on the primary study endpoint was observed. Assuming the same cancer prevalence of 3.21 per 1000 examinations, it was calculated that recruiting approximately 24,000 people would still maintain 80% power and detect more than 90 cases of cancer (Supplementary Note 4).
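For readers unfamiliar with McNemar's test, the following minimal sketch shows the paired comparison that underlies the sample size estimate: only discordant pairs contribute, that is, cancers detected in one reading condition but not the other. It implements the two-sided exact (binomial) form of the test and does not reproduce the authors' sample size calculation; the function name and the counts in the usage example are hypothetical.

```python
from math import comb


def mcnemar_exact_p(only_with_ai: int, only_without_ai: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis, each discordant pair is equally likely to
    fall in either cell, so the smaller count follows Binomial(n, 0.5).
    """
    if only_with_ai < 0 or only_without_ai < 0:
        raise ValueError("counts must be non-negative")
    n = only_with_ai + only_without_ai
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(only_with_ai, only_without_ai)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Invented counts: 9 cancers detected only with AI-CAD, 1 only without.
    print(mcnemar_exact_p(9, 1))
```

Concordant pairs (cancers detected in both readings or in neither) do not enter the statistic, which is why the required sample size depends on the expected number of cancers rather than on the number of examinations alone.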
All statistical analyses, including those of CDRs and RRs, were based on a 1-year follow-up of all study participants. All statistical analyses were performed using R version 4.3.3 (R Foundation for Statistical Computing, Vienna, Austria). Descriptive statistics were used for continuous and categorical variables, as appropriate. Logistic regression using generalized estimating equations to account for reader variability was used to estimate 95% confidence intervals (CIs) and for comparative analysis. Pairwise comparisons were performed among BRs with AI-CAD, BRs without AI-CAD, AI standalone, GRs with AI-CAD, and GRs without AI-CAD. Corrections for multiple comparisons would be necessary for confirmatory conclusions; however, no corrections were made given the preliminary nature of the study.
Prespecified subgroup analyses were performed to examine results by age group (40–49, 50–59, 60–69, 70+ years), mammographic density (four BI-RADS categories of the American College of Radiology), and malignancy assessment on the 7-point scale (defined above). Among breast cancers, subgroups were analyzed by cancer characteristics, including invasiveness and categories of tumor size (5).
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.