Materials and Methods
Characteristics of the Tested Images and Algorithms
The test data set of the BACH challenge was composed of 2 independent parts (A and B). Test A consisted of 100 H&E-stained breast tissue microscopic photographs (from 38 patients) classified into 4 classes: normal, benign, carcinoma in situ (CIS), and invasive carcinoma (IC). Test B consisted of 152 regions of interest (ROIs) in 10 H&E-stained WSIs of breast tissue (from 8 patients) classified into the same 4 classes. The cases included formalin-fixed, paraffin-embedded needle core biopsies and surgical excision specimens diagnosed between 2013 and 2017, originating from 2 histology laboratories (Ipatimup Diagnostics and Centro Hospitalar Universitário Cova da Beira). The ground truth (GT) was established by 2 pathologists (A.P. and C.E.) who reviewed the glass slides of the cases, with disagreements resolved through common microscopy sessions. IHC analysis was performed for all IC and CIS cases and for some benign lesions (ductal hyperplasia, intraductal papilloma, sclerosing lesions, and fat necrosis). At the end of the study, no observer disagreed with the GT classifications. Each photograph and ROI included only 1 of the 4 classes, except for normal tissue, which could be present alongside any other class. The characterization of the 4 classes in both tests is summarized in Table 1.
Photographs from test A were acquired with a Leica DM 2000 LED microscope and a Leica ICC50 HD camera with a ×20 objective (0.4 numerical aperture), yielding uncompressed RGB TIFF images of 2,048 × 1,536 pixels (0.56 mm2) at a pixel scale of 0.42 μm/pixel, without color normalization. WSIs from test B were acquired with a Leica SCN400 scanner with a ×20 objective (pixel scale of 0.47 μm/pixel). The irregularly shaped (freehand) ROIs varied in size from 0.04 to 171.19 mm2 (median, 0.49 mm2).
The algorithms used for assessing the images of tests A and B were the ones that achieved the best performance on the BACH challenge's independent test set. Namely, for test A, we selected the method with the highest overall accuracy (measured as the ratio between the correct answers and the total number of photographs) and the best accuracy in distinguishing between normal and nonnormal samples (algorithm A). Likewise, for test B, we selected the method with the highest classification performance (algorithm B).[20,21] Both algorithms rely on deep learning and were developed by the same participant using the training images of the BACH challenge. Algorithm A is based on a convolutional neural network that classifies patches of 299 × 299 pixels resized from patches of 1,495 × 1,495 pixels collected from the original image. For each case, overlapping patches are collected at a fixed distance interval, and the final label is produced by averaging the patch-wise predictions. Algorithm B uses the same convolutional neural network to predict patch-wise classifications on the WSIs. Instead of averaging all predictions as in algorithm A, each pixel of the image is labeled as the average of the overlapping patch-wise predictions, creating a pixel-wise abnormality classification map (for additional details, see supplemental data; all supplemental data can be found at American Journal of Clinical Pathology online). For this study, a classification of the ROI was obtained if more than 95% of the classification map pixels shared the same label; if not, the 2 most frequent pixel classifications were used as the favorite and alternative classifications, respectively.
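The ROI labeling rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the representation of the classification map as an array of per-pixel labels, and the return convention are assumptions.

```python
import numpy as np

def roi_label_from_map(class_map, threshold=0.95):
    """Illustrative ROI labeling rule: if one class covers more than
    `threshold` of the pixel-wise classification map, return it as the
    single label; otherwise return the 2 most frequent classes as
    (favorite, alternative). Map format and names are assumptions."""
    classes, counts = np.unique(class_map, return_counts=True)
    order = np.argsort(counts)[::-1]        # most frequent class first
    fractions = counts / class_map.size     # pixel share of each class
    if fractions[order[0]] > threshold:
        return str(classes[order[0]]), None
    return str(classes[order[0]]), str(classes[order[1]])

# Toy map: 97% "invasive" pixels, 3% "normal" -> single classification
toy = np.array(["invasive"] * 97 + ["normal"] * 3)
print(roi_label_from_map(toy))  # ('invasive', None)
```

With a less dominant majority class (eg, 60%/40%), the same function would instead return a favorite and an alternative classification.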
Evaluation Criteria of the Observers' Accuracy
The accuracy of 4 pathologists (P1 to P4) and 3 pathology residents (R1 to R3) was evaluated in different phases. P1 to P3 are generalist pathologists with 6, 6, and 42 years of practice, respectively. P4 is a subspecialist breast pathologist with 4 years of practice. In phase 1, the observers classified photographs from test A and ROIs overlaid on the entire WSI from test B into 4 classes, without exceptions. In cases of doubt between classes, observers could choose their favorite classification and provide the respective alternative.
In phase 2, the observers had the opportunity to reclassify the photographs and ROIs, aware of their initial classification and of the algorithm's classification but not of either's accuracy. Some rules of engagement were established (summarized in Table 2 and Supplemental Figure S1): if the observer's classification matched the algorithm's, it could not be changed; in this case, if there was an alternative classification (by the observer, the algorithm, or both), it was possible to keep or discard that alternative. If the observer's classification did not match the algorithm's, with or without an alternative classification, the observers could reclassify the photograph or ROI.
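The phase 2 rules of engagement amount to a small decision table, which can be encoded as follows. This is a sketch for clarity only; the function name and the returned action labels are assumptions, not part of the study protocol.

```python
def phase2_options(observer, algorithm, observer_alt=None, algorithm_alt=None):
    """Illustrative encoding of the phase 2 rules: a favorite classification
    that matches the algorithm's is locked (only an existing alternative may
    be kept or discarded); a mismatch allows free reclassification.
    Return codes are hypothetical labels for the 3 possible actions."""
    if observer == algorithm:
        if observer_alt is not None or algorithm_alt is not None:
            return "keep_or_discard_alternative"
        return "locked"
    return "may_reclassify"

# Observer and algorithm agree, no alternative -> classification is locked
print(phase2_options("CIS", "CIS"))  # locked
```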
In test A, before phase 3, the confusion matrices of both the observers and the algorithm were revealed, showing their global accuracy and the types of errors between different categories, without specifying the correct answer for each photograph. Then, observers performed the same task as in phase 2. Test B did not have a phase 3. No time constraint was applied during classification, and no washout period existed between phases. All photographs and ROIs were classified before the next phase started. Each phase was performed in less than a week, and all phases were performed in less than a month. Photographs were reviewed with Windows Photo Viewer (Microsoft) and WSIs with Aperio ImageScope v12.3.2 (Leica Biosystems). The classification was recorded manually in a prefilled Excel sheet (Microsoft). None of the observers were involved in establishing the GT or received training from the pathologists responsible for the GT. Moreover, IHC information was not available for the classification of images in either test in any phase; classification was based on morphology alone. In addition, P4 did not participate in test B.
Ethics approval and informed consent were not required for this study, given the anonymized images of the samples.
Statistical analyses were performed using SPSS version 25.0 for Windows (IBM). The Pearson χ2 test (or the Fisher exact test, if appropriate) was used for comparison of qualitative variables, and the Mann–Whitney U test (MW), the Wilcoxon (WC) test, and the Kruskal–Wallis test were used for quantitative variables. The level of significance was set at P < .05. Accuracy was defined as the ratio between the correct answers and the total number of photographs or ROIs. Concordance rates were evaluated with quadratic weighted κ statistics to penalize discordances with higher clinical impact. The Landis and Koch classification was used to interpret the values: no agreement to slight agreement (<0.20), fair agreement (0.21–0.40), moderate agreement (0.41–0.60), substantial agreement (0.61–0.80), and excellent agreement (>0.81).
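The quadratic weighted κ used for the concordance analysis penalizes disagreements by the squared distance between ordinal categories, so that, for example, confusing normal with IC costs more than confusing benign with CIS. A minimal sketch of the computation, assuming the 4 classes are encoded as ordinal integers 0 to 3 (the encoding and function name are assumptions; the study used SPSS):

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, n_classes=4):
    """Quadratic weighted kappa between two sets of ordinal labels
    (0 .. n_classes-1). Weights grow with the squared category distance,
    so clinically larger discordances are penalized more heavily."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    # Observed joint distribution of the two raters' labels
    O = np.zeros((n_classes, n_classes))
    for i, j in zip(a, b):
        O[i, j] += 1
    O /= O.sum()
    # Expected distribution under independence (outer product of marginals)
    E = np.outer(O.sum(axis=1), O.sum(axis=0))
    # Quadratic weights: squared category distance, normalized
    idx = np.arange(n_classes)
    W = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    return 1 - (W * O).sum() / (W * E).sum()

# Perfect agreement yields kappa = 1
print(quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3]))  # 1.0
```

Values are then interpreted on the Landis and Koch scale given above.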
Am J Clin Pathol. 2021;155(4):527-536. © 2021 American Society for Clinical Pathology