Artificial Intelligence Improves the Accuracy in Histologic Classification of Breast Lesions

António Polónia, MD, PhD; Sofia Campelos, MD; Ana Ribeiro, MD; Ierece Aymore, MD; Daniel Pinto, MD; Magdalena Biskup-Fruzynska, MD; Ricardo Santana Veiga, MD; Rita Canas-Marques, MD; Guilherme Aresta, MEng; Teresa Araújo, MEng; Aurélio Campilho, PhD; Scotty Kwok, MSc; Paulo Aguiar, PhD; Catarina Eloy, MD, PhD


Am J Clin Pathol. 2021;155(4):527-536. 

Image fidelity on the computer display has been a major concern in digital diagnosis, an issue previously addressed by digital radiologists.[23,24] Systematic reviews evaluating the concordance of pathologic diagnoses made on WSIs with those made with the traditional light microscope (LM) have revealed mean diagnostic concordance higher than 90%. This result demonstrates that DP can be used for primary diagnosis, provided that current best practice recommendations are followed.[25,26] In this work, the images used in both tests had resolutions near 0.5 μm/pixel, comparable to an LM, and excellent mean interobserver concordance rates were achieved in both tests.[27] We recognize that microscopic photographs are not the standard medium for pathologic diagnosis, as shown by the higher classification accuracies achieved after observation of ROIs in WSIs. Nevertheless, the similar accuracy achieved by the pathologists in both tests reveals that the photographs contain enough information to simulate clinical practice. In future studies, we would like to measure the role of AI algorithm outputs in the classification of WSIs without the use of ROIs.

The accuracy of algorithm A (photographs) was higher than the average accuracy of the observers, including the average accuracy of the pathologists, with excellent agreement with the GT. This result indicates that it is possible to develop an algorithm able to perform a complex task, such as medical image interpretation or diagnosis, at an expert level. However, the algorithm had lower accuracy on the photographs that observers classified correctly less frequently, indicating a limited ability to assist pathologists with difficult cases. In the future, accuracy on difficult cases may be increased if the training sets of these types of algorithms are enriched with such cases. In real life, extraordinary cases without an established GT will almost always need the intervention of an expert pathologist. This reinforces the idea that CAD tools will not replace pathologists but will probably give rise to a trend of superspecialization to solve those difficult cases.
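Agreement grades such as "excellent" or "fair" are typically derived from a chance-corrected statistic such as Cohen's kappa. The following is a minimal illustrative sketch; the label sequences and the 4-class coding are toy data for illustration only, not the study's results:

```python
from collections import Counter

def cohens_kappa(gt, pred):
    """Chance-corrected agreement between two label sequences."""
    n = len(gt)
    # Observed agreement: fraction of exactly matching labels.
    po = sum(g == p for g, p in zip(gt, pred)) / n
    # Expected agreement by chance, from the marginal label frequencies.
    gc, pc = Counter(gt), Counter(pred)
    pe = sum(gc[c] * pc[c] for c in set(gt) | set(pred)) / n**2
    return (po - pe) / (1 - pe)

# Toy labels over 4 hypothetical classes (0 = normal, 1 = benign,
# 2 = CIS, 3 = IC); one disagreement out of 8 cases.
gt   = [0, 0, 1, 1, 2, 2, 3, 3]
pred = [0, 0, 1, 2, 2, 2, 3, 3]
print(round(cohens_kappa(gt, pred), 3))  # → 0.833
```

Because kappa discounts agreement expected by chance, it is a stricter measure than raw accuracy: here 7/8 raw agreement corresponds to a kappa of about 0.83.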

In contrast, the accuracy of algorithm B (WSIs) was lower than that of all observers for all classes, with only fair agreement with the GT. In addition, the algorithm had a large performance drop for ROIs smaller than 0.49 mm2, given that it was trained to predict patches of approximately 0.50 mm2 with consideration of the classification of the neighboring patch. When predicting a patch smaller than the training size, the nonrelevant classification of neighboring patches will have a greater effect on the classification of the patch, lowering the performance of the algorithm. A possible approach to improve the sensitivity of the algorithm would be to change the decision rule for the overlapping patches (eg, from average to local maximum) to increase the importance of these small regions in the final WSI labeling. Smaller lesions will probably continue to be a challenge for both pathologists and image analysis algorithms. The use of ROIs allowed direct comparison of the accuracy of the observers and the algorithm in precise regions, even small ones, given that the observers were forced to classify all ROIs, without exception.
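The proposed change of decision rule for overlapping patches can be sketched as follows; the class probabilities and the class ordering (normal, benign, CIS, IC) are hypothetical values for illustration, not the study's actual patch outputs:

```python
import numpy as np

# Hypothetical class probabilities for overlapping patches covering one
# small ROI; rows = patches, columns = (normal, benign, CIS, IC).
patch_probs = np.array([
    [0.70, 0.20, 0.08, 0.02],  # neighboring, nonrelevant patch
    [0.65, 0.25, 0.08, 0.02],  # neighboring, nonrelevant patch
    [0.05, 0.10, 0.80, 0.05],  # the small lesion patch (CIS)
])

# Averaging dilutes the small lesion's signal across its neighbors,
# so the ROI is labeled with the dominant background class.
avg_label = int(np.argmax(patch_probs.mean(axis=0)))   # 0 (normal)

# Taking the local maximum per class keeps the most confident patch,
# increasing sensitivity to small regions.
max_label = int(np.argmax(patch_probs.max(axis=0)))    # 2 (CIS)
```

Under this toy input, averaging misses the small CIS region while the local-maximum rule recovers it, mirroring the sensitivity argument above; the trade-off is that a maximum rule is more prone to false positives from a single overconfident patch.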

Both algorithms had problems in the classification of benign lesions, usually showing difficulties in distinguishing benign from CIS (a known pitfall in LM diagnosis) and benign from IC, demonstrated by the recurrent misclassification of fat necrosis and inflammation as IC. Benign lesions have higher morphologic variability, making discriminant features more difficult to learn and lowering accuracy. These algorithms are probably learning that the inflammation associated with some ICs is a typical characteristic of IC; this learning could give rise to false-positive diagnoses, suggesting that these tools must be human supervised. We also recognize that a limitation of this study was the use of only 4 classes for the classification task: these classes do not cover all categories of breast lesions, and the low number of patients does not represent the wide morphologic pattern variation observed in real practice. However, we wanted to establish a proof of concept that artificial intelligence could be useful in DP diagnosis using the most common classes in breast pathology.

In our study, the observers had higher average accuracy in WSIs than in the photographs for all classes. This could be explained by the larger size of the ROIs, offering more morphologic features with which to reach the correct classification, and by the presence of adjacent context outside the ROIs in the WSIs. As expected, IC was the class most often correctly classified by the observers in both tests, which reflects their training in and ability at detecting this clinically relevant lesion. The absence of a washout period between the evaluation of the different phases had the objective of removing intraobserver variability in the subsequent phases and measuring only the impact of the algorithms on the observers' changes in classification. The rules of engagement, which prevented changes when the classification of the observers matched that of the algorithms, had the purpose of simulating the future situation of a pathologist having access to the output of the algorithm and confirming or excluding their own classifications. Although the impact of revisiting the cases without AI assistance was not measured in this work, we estimate it to be low, given that the observers in test B did not improve their accuracy in phase 2.

The assistance provided by algorithm A significantly increased the average accuracy of the observers (in all classes) and the mean interobserver concordance rate, suggesting that CAD tools may be used to increase classification accuracy and homogeneity in pathology, even in important differential diagnostic problems, such as those between benign and CIS.

In this work, we show that recognition of CIS by the observers was suboptimal in both tests, even when it was shown directly to the observers, either in microscopic photographs or as ROIs in WSIs. CIS has been shown to be underdiagnosed, despite being clinically relevant and identifying patients who usually need surgical treatment and close follow-up because of an increased risk of developing IC.[8–10] Importantly, CIS classification accuracy was substantially increased with the support of algorithm A, showing that CAD tools can close the gap of false-negative results and ultimately improve patient care.

In test B, the excellent mean concordance rate between the observers and the GT was maintained throughout the phases, meaning that an algorithm with lower accuracy than that of the observers did not jeopardize their accuracy. In this case, the algorithm was not providing a credible alternative classification but rather keeping the observers faithful to their initial classification. However, in the last phase of test B, more alternative classifications were proposed, especially by residents, with less experience than the pathologists, allowing algorithm B to act as a confusion generator. These results suggest that only CAD tools with high accuracy should be implemented for clinical use.

Awareness of the accuracy and types of errors of algorithm A in phase 3 allowed measurement of its effect on the observers. This awareness translated into a higher proportion of changes in the classification of the photographs and into more alternative classifications, particularly from pathologists, who took most advantage of the algorithm, even surpassing its accuracy. This effect supports the concept that better classification accuracy is achieved when algorithm and observers work together rather than alone, producing a synergistic effect. Phase 3 in test B was not performed because the observers would never consider the output of an algorithm with lower accuracy than their own.

We are aware that the use of IHC, as part of daily practice in pathology, could have a positive impact on the observers' accuracy. IHC was not available to the observers, representing a limitation of this work. Nevertheless, one of the goals was to test whether CAD tools could improve the observers' accuracy on H&E alone.

Interestingly, in test A, two observers modified their favorite classification in less than 5% of cases in both phases. These "nonbelievers" were the ones with concordance rates with the GT lower than that of algorithm A in the last phase, indicating that CAD tools may have different impacts on different types of observers. Moreover, pathologists changed the alternative classification more often with algorithm A than with algorithm B, and residents changed it more often with algorithm B than with algorithm A, suggesting that the greater experience of pathologists may play a role in determining how far they let a CAD tool influence their final classification.