Artificial Intelligence Improves the Accuracy in Histologic Classification of Breast Lesions

António Polónia, MD, PhD; Sofia Campelos, MD; Ana Ribeiro, MD; Ierece Aymore, MD; Daniel Pinto, MD; Magdalena Biskup-Fruzynska, MD; Ricardo Santana Veiga, MD; Rita Canas-Marques, MD; Guilherme Aresta, MEng; Teresa Araújo, MEng; Aurélio Campilho, PhD; Scotty Kwok, MSc; Paulo Aguiar, PhD; Catarina Eloy, MD, PhD


Am J Clin Pathol. 2021;155(4):527-536. 


Results

Test A

Accuracy of Algorithm A. Algorithm A had an accuracy of 0.87 (Table 3, Supplemental Table S1, and Figure 1A) and a concordance rate with the GT of 0.88 (Supplemental Figures S2A and S2B). The benign class had a lower accuracy (0.72; 18/25) than the remaining classes (0.96 [24/25], 0.88 [22/25], and 0.92 [23/25] for normal, CIS, and IC, respectively; Fisher exact test, P = .02). Most discordances with GT occurred in distinguishing normal from benign (4%) and benign from IC (4%) (Supplemental Table S2). Fat necrosis was the benign lesion confused with IC (Image 1A and Image 1B). The accuracy of the algorithm was 0.71 (10/14) in photographs correctly classified by less than 50% of the observers, increasing to 0.92 (60/65) in photographs correctly classified by more than 75% of the observers (χ², P = .03) (Figure 1C).
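The accuracies quoted above follow directly from the per-class counts given in the text. The short Python sketch below only reproduces that arithmetic; the dictionary and function names are illustrative and are not taken from the study.

```python
# Per-class results for algorithm A as reported in the text: (correct, total).
counts_a = {
    "normal": (24, 25),
    "benign": (18, 25),
    "CIS": (22, 25),
    "IC": (23, 25),
}


def class_accuracy(counts):
    """Return the accuracy of each class as correct / total."""
    return {name: correct / total for name, (correct, total) in counts.items()}


def overall_accuracy(counts):
    """Return pooled accuracy: all correct photographs over all photographs."""
    correct = sum(c for c, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return correct / total


per_class = class_accuracy(counts_a)
overall = overall_accuracy(counts_a)

for name, acc in per_class.items():
    print(f"{name}: {acc:.2f}")
print(f"overall: {overall:.2f}")
```

Pooling the four classes gives 87 correct photographs out of 100, which matches the reported overall accuracy of 0.87.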

Figure 1.

Classification accuracy of test A (A) and test B (B) in all phases. C, Accuracy of algorithm A in photographs correctly classified by <50% (10/14), 50%-75% (17/21), and >75% (60/65) of the observers. D, Accuracy of algorithm B in ROIs correctly classified by <50% (5/9), 50%-75% (13/26), and >75% (57/117) of the observers. E, Accuracy of algorithm B in ROIs <0.15 mm² (14/38), 0.15–0.49 mm² (16/38), 0.49–1.92 mm² (24/38), and >1.92 mm² (21/38). F, Average accuracy of the observers in ROIs <0.15, 0.15–0.49, 0.49–1.92, and >1.92 mm². The cutoffs used correspond to the 25th, 50th, and 75th percentiles of the size of the ROIs. O, average of all observers; P, average of pathologists; P1-P4, pathologists 1–4; R, average of residents; R1-R3, residents 1–3; ROI, region of interest; WSI, whole-slide image.

Image 1.

Benign (fat necrosis) (H&E, ×200), correctly classified in phase 1 by 6 of 7 observers (A) and 3 of 7 observers (B), and as IC by algorithm A. C, DCIS (H&E, ×200), classified in phase 1 as benign by 4 of 7 observers, as DCIS by 3 of 7 observers, and as DCIS by algorithm A. D, DCIS (H&E, ×200), classified in phase 1 as DCIS by 4 of 7 observers, as benign by 3 of 7 observers, and as DCIS by algorithm A. E, Benign (inflammation) (H&E, ×100), correctly classified in phase 1 by 6 of 6 observers and as IC by algorithm B. F, DCIS (H&E, ×100), correctly classified in phase 1 by 4 of 6 observers and as benign by algorithm B. DCIS, ductal carcinoma in situ; IC, invasive carcinoma.

Figure S2.

Concordance rates of tests A and B for the favorite classification.
A=Test A phase 1; B=Test A phase 3; C=Test B phase 1; D=Test B phase 2; P: pathologist; R: resident; Alg A: algorithm A; GT: ground truth; Alg B: algorithm B.

Accuracy of the Observers. Phase 1: The observers had an average accuracy of 0.80; only 1 pathologist (P2, 0.94) had a higher accuracy than algorithm A (Table 3, Supplemental Table S1, and Figure 1A). The mean concordance rate between the observers' classification and the GT was 0.86 (range, 0.80–0.93), with 2 pathologists having concordance rates higher than algorithm A (P1 and P2: 0.93). In this phase, the mean interobserver concordance rate was 0.83 (range, 0.75–0.90) (Supplemental Figure S2A). IC was the class with the highest accuracy (average, 0.95) in comparison with the other classes (average of 0.73, 0.78, and 0.76 for normal, benign, and CIS, respectively; MW, P < .001). Most discordances with GT occurred in distinguishing normal from benign (7.4%) and benign from CIS (7.1%) (Image 1C and Image 1D and Supplemental Table S2).

The observers proposed an alternative classification in 14% of the photographs, with pathologists proposing more alternative classifications than residents (18.3% and 8.3%, respectively; MW, P = .002) (Supplemental Table S3). The most frequent alternative classifications were those between CIS and IC (4.7%), benign and IC (4.7%), and benign and CIS (4.3%).

Phases 2 and 3: In phase 2, the observers increased their average accuracy from 0.80 to 0.85 (WC, P < .001), with 3 pathologists obtaining accuracies equal to or higher than algorithm A (Table 3, Supplemental Table S1, and Figure 1A). In phase 3, the observers had an additional increase in their average accuracy from 0.85 to 0.88 (WC, P = .001), with 3 pathologists and 2 residents having accuracies higher than algorithm A. In this last phase, the mean concordance rate between the observers' classification and the GT increased from 0.86 to 0.91 (range, 0.85–0.99), with only 2 observers having concordance rates lower than algorithm A. In addition, the mean interobserver concordance rate increased from 0.83 to 0.90 (range, 0.83–0.95) (Supplemental Figure S2B). The accuracy increased in all classes (average of 0.86, 0.82, 0.89, and 0.97 for normal, benign, CIS, and IC, respectively). The most frequent discordances with GT decreased in frequency and continued to involve distinguishing normal from benign (5.4% and 4.7%) and benign from CIS (5.7% and 3.4%) in phases 2 and 3, respectively (Supplemental Table S2).

A similar proportion of alternative classifications was proposed by the observers in phase 2 compared with phase 1 (13.1% vs 14%, respectively; WC, P = .96), increasing in phase 3 in comparison with phase 2 (16.6% vs 13.1%, respectively; WC, P = .002) (Supplemental Table S3). The most frequent alternative classifications were those between benign and CIS (5.7%), benign and IC (4.6%), and normal and benign (4.1%).

The favorite classification was modified, on average, in 6.3% of the photographs in phase 2, increasing to 10.9% in phase 3 (WC, P < .001), with only 2 observers with less than 5% modifications in both phases (P3 and R1) (Supplemental Table S4). In addition, pathologists and residents had similar frequencies of modifications on their favorite classification (6.8% and 5.7% [MW, P = .24] for phase 2 and 9.5% and 12.7% [MW, P = .36] for phase 3). The alternative classification of the photographs was modified, on average, in 15.7% in phase 2, increasing to 19.3% in phase 3 (WC, P < .001) (Supplemental Table S4). In addition, pathologists had more frequent modifications than residents on their alternative classification (21.8% and 7.7% [MW, P < .001] for phase 2 and 23.8% and 13.3% [MW, P = .003] for phase 3). The alternative classification had more frequent modifications than the favorite classification in both phases (15.7% and 6.3% in phase 2, and 19.3% and 10.9% in phase 3; WC, P < .001 for both phases).

Test B

Accuracy of Algorithm B. Algorithm B had an accuracy of 0.49 (Table 3, Supplemental Table S5, and Figure 1B) and a concordance rate with the GT of 0.37 (Supplemental Figures S2C and S2D). CIS was the class with the lowest accuracy (0.06; 2/33) in comparison with the remaining classes (0.84 [26/31], 0.58 [38/65], and 0.39 [9/23] for normal, benign, and IC, respectively; χ², P < .001). Most discordances with GT occurred in distinguishing normal from benign (15.1%), benign from IC (14.5%), and benign from CIS (11.8%) (Supplemental Table S6). Inflammation was the benign lesion confused with IC (Image 1E and Image 1F). The accuracy of the algorithm was similar in ROIs correctly classified by more than 75% of the observers (0.49; 57/117) and in ROIs correctly classified by less than 50% of the observers (0.56; 5/9; Fisher exact test, P = .74) (Figure 1D). Moreover, the accuracy of the algorithm was lower in ROIs smaller than 0.49 mm² (0.39; 30/76) than in larger ROIs (0.59; 45/76; χ², P = .02) (Figure 1E).
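As an illustration of the kind of comparison reported above, the following Python sketch implements a two-sided Fisher exact test for a 2×2 table and applies it to the counts for ROIs correctly classified by less than 50% of the observers (5 correct, 4 incorrect) and by more than 75% of the observers (57 correct, 60 incorrect). The text does not describe how the authors built their contingency tables or which software they used, so the table layout, the function name, and the two-sided convention (summing all tables no more probable than the observed one) are assumptions made for this example.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability of x in the top-left cell with all margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # A small relative tolerance keeps tied probabilities in the sum.
    total = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(1.0, total)


# Assumed layout: rows are observer-agreement groups, columns are
# (algorithm B correct, algorithm B incorrect).
p_value = fisher_exact_two_sided(5, 4, 57, 60)
print(f"P = {p_value:.2f}")
```

The paper reports P = .74 for this comparison. The sketch is not guaranteed to reproduce that figure exactly, because the table construction is assumed here.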

Algorithm B proposed an alternative classification in 53.9% of the ROIs. The most frequent alternative classifications were those between normal and benign (37.5%), and between benign and IC (7.9%) (Supplemental Table S7).

Accuracy of the Observers. Phase 1: The observers had an average accuracy of 0.86, and all observers had accuracies higher than that of algorithm B (Table 3, Supplemental Table S5, and Figure 1B). The mean concordance rate between the observers' classification and the GT was 0.91 (range, 0.84–0.95), with all observers having concordance rates higher than algorithm B. In this phase, the mean interobserver concordance rate was 0.87 (range, 0.79–0.96) (Supplemental Figure S2C). IC was the class with the highest accuracy (average, 0.99) in comparison with the other classes (average of 0.84, 0.85, and 0.81 for normal, benign, and CIS, respectively; MW, P < .001). Most discordances with GT occurred in distinguishing normal from benign (7.4%) and benign from CIS (6.3%) (Image 1F and Supplemental Table S6). The average accuracy of the observers increased from 0.79 in ROIs smaller than 0.15 mm² to 0.93 in ROIs larger than 1.92 mm² (Kruskal-Wallis, P = .001) (Figure 1F).

The observers proposed an alternative classification in 5% of the ROIs, with pathologists and residents proposing alternative classifications at similar rates (4.4% and 5.7%, respectively; MW, P = .34) (Supplemental Table S7). The most frequent alternative classifications were those between benign and CIS (2.3%).

Phase 2: The observers had a similar accuracy (average of 0.85) in comparison with phase 1 (WC, P = .96) (Table 3, Supplemental Table S5, and Figure 1B). The mean concordance rate between the observers' classification and the GT remained at 0.91 (range, 0.84–0.98), and the mean interobserver concordance rate was 0.87 (range, 0.79–0.94) (Supplemental Figure S2D). All classes maintained their accuracy (average of 0.86, 0.85, 0.76, and 0.99 for normal, benign, CIS, and IC, respectively). Most discordances with GT occurred in distinguishing normal from benign (7.7%) and benign from CIS (6.5%) (Supplemental Table S6). The favorite classification was modified, on average, in 4.6% of the ROIs, with pathologists and residents showing similar frequencies of modifications (3.7% and 5.5%, respectively; MW, P = .18) (Supplemental Table S8).

The observers increased the alternative classification from 5% to 15.6% of the ROIs (phase 1 vs 2; WC, P < .001), with residents proposing more alternative classifications than pathologists (22.2% and 9.0%, respectively; MW, P < .001) (Supplemental Table S7). The most frequent alternative classifications were those between benign and CIS (8.8%). The alternative classification of the ROIs was modified, on average, in 13.7%, with pathologists showing a lower frequency of modifications than residents (8.7% and 18.8%, respectively; MW, P < .001) (Supplemental Table S8). The alternative classification had more frequent modifications than the favorite classification (13.7% and 4.6%, respectively; WC, P < .001).
