Systematic Review With Meta-analysis

Artificial Intelligence in the Diagnosis of Oesophageal Diseases

Pierfrancesco Visaggi; Brigida Barberio; Dario Gregori; Danila Azzolina; Matteo Martinato; Cesare Hassan; Prateek Sharma; Edoardo Savarino; Nicola de Bortoli


Aliment Pharmacol Ther. 2022;55(5):528-540. 

This systematic review with meta-analysis evaluated the performance of AI in the diagnosis of both malignant and benign ODs. According to this study, AI has the potential to accurately diagnose several ODs, both clinically and endoscopically.

In the diagnosis of BN, AI showed pooled AUROC, sensitivity and specificity of 90%, 89% and 86%, respectively. The performance of AI was not significantly different from that of expert endoscopists with WLE. These results satisfy the optical diagnosis performance thresholds required by the Preservation and Incorporation of Valuable Endoscopic Innovations (PIVI) initiative of the American Society for Gastrointestinal Endoscopy. According to the PIVI indications, any proposed screening technique aspiring to be incorporated into clinical practice should at least equal, or improve on, the performance of random sampling in BE (ie Seattle Protocol), demonstrating a per-patient sensitivity of 90% or greater and a specificity of at least 80% for detecting oesophageal adenocarcinoma.[60] Moreover, these results suggest that the use of AI by non-expert endoscopists may increase early detection of BN and, in the long term, improve patient survival.
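As a rough illustration of how the PIVI criteria quoted above could be checked against a 2x2 diagnostic table, the following minimal sketch computes per-patient sensitivity and specificity and compares them with the 90% and 80% thresholds. The names (`ConfusionCounts`, `meets_pivi`) and the counts are hypothetical and are not taken from any study included in this meta-analysis.

```python
from dataclasses import dataclass

# Thresholds as quoted in the text: per-patient sensitivity >= 90%,
# specificity >= 80%.
PIVI_MIN_SENSITIVITY = 0.90
PIVI_MIN_SPECIFICITY = 0.80


@dataclass
class ConfusionCounts:
    """Per-patient counts from a 2x2 diagnostic table (hypothetical helper)."""
    tp: int  # diseased patients flagged by the test
    fn: int  # diseased patients missed by the test
    tn: int  # healthy patients correctly cleared
    fp: int  # healthy patients incorrectly flagged


def sensitivity(c: ConfusionCounts) -> float:
    # Proportion of diseased patients that the test detects.
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    # Proportion of healthy patients that the test correctly clears.
    return c.tn / (c.tn + c.fp)


def meets_pivi(c: ConfusionCounts) -> bool:
    # Both thresholds must be met at the same time.
    return (sensitivity(c) >= PIVI_MIN_SENSITIVITY
            and specificity(c) >= PIVI_MIN_SPECIFICITY)


# Illustrative counts only (assumption, not study data).
example = ConfusionCounts(tp=92, fn=8, tn=85, fp=15)
print(f"sensitivity={sensitivity(example):.2f}, "
      f"specificity={specificity(example):.2f}, "
      f"meets thresholds: {meets_pivi(example)}")
```

The check is deliberately per-patient, since that is the unit the quoted thresholds refer to; per-image counts would need to be aggregated to patient level first.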

Endoscopic recognition of early OSCC is challenging, as lesions often go unrecognised with WLE. Lugol's dye spray chromoendoscopy has been shown to increase the sensitivity of WLE in the diagnosis of early OSCC, and NBI significantly increases the specificity of oesophago-gastroduodenoscopy compared with Lugol's dye.[61–63] However, non-expert endoscopists may not perform as well as experts when operating under NBI,[63] limiting the applicability of the technique. In this meta-analysis, the application of AI in the diagnosis of OSCC achieved a pooled AUROC of 97%, a pooled sensitivity of 95% and a pooled specificity of 92%. These results are in keeping with those of previous meta-analyses,[9,10,64–66] in which AI showed good performance in the diagnosis of OSCC, BE-related or gastric adenocarcinoma and colorectal lesions.

We also provided a pooled estimate of the AUROC, sensitivity and specificity of AI in the detection of abnormal IPCLs, which are microvascular structures on the surface of the oesophagus that appear as brown loops on ME with NBI and show morphological changes that closely correlate with the neoplastic invasion depth of OSCC. Moreover, IPCLs have also been associated with GERD diagnosis, and their detection could therefore also be helpful in the endoscopy-based suspicion of reflux disease.[67] However, the optical classification of IPCLs requires experience and is mastered by experts only. In this study, AI showed a pooled AUROC of up to 98% and sensitivity and specificity of up to 99% and 97%, respectively, in the detection of abnormal IPCLs. This has relevant therapeutic and prognostic implications, as early lesions are amenable to endoscopic treatment,[68] and the estimation of invasion depth allows intra-procedural decisions on endoscopic resection.[69,70]

In this study, CAD tools showed good performance in the diagnosis of benign ODs. Investigations that applied AI to the diagnosis of GERD based exclusively on symptoms were included in the meta-analysis. In this task, AI achieved a pooled AUROC of 99% and sensitivity and specificity of 97%. Because symptoms prompt patients with GERD to seek medical attention, AI-based questionnaires represent the ideal tool for the timely and accurate diagnosis of reflux disease without invasive procedures (ie EGDS and pH-impedance metry) and for avoiding delays in treatment. Moreover, AI excels at handling the non-linearity inherent in the relationship between symptoms and underlying pathology. Therefore, DL algorithms can be used to reduce the number of questionnaire variables needed to achieve a definite diagnosis of GERD,[34] allowing clinicians to administer shorter and more acceptable questionnaires to patients.

Several single studies that used AI for the diagnosis of benign ODs were retrieved from the literature but could not be included in the meta-analysis. This lack of data does not reflect scarce interest in the subject; rather, it attests to the novelty of AI in the field of benign oesophageal diseases. In this setting, AI models autonomously extracted and analysed pH-impedance tracings and also identified a novel pH-impedance metric that segregated responders to GERD treatment from non-responders.[52] The effectiveness of real-time endoscopic GERD diagnosis and of an AI algorithm for the prediction of EoE diagnosis was also shown.[53,55] A CAD tool was shown to recognise stationary manometry motor patterns accurately,[54] but the application of novel CAD tools to high-resolution manometry recordings is yet to be evaluated. Importantly, AI demonstrated utility in the recognition of infrequent forms of oesophagitis (ie, CMV vs HSV), which may be mischaracterised even by expert endoscopists.[57]

Several limitations were identified in the included studies. Almost every study available for this meta-analysis was retrospective. An inherent bias related to the nature of these studies is the convenience sampling of controls (ie, selection bias). In this regard, most studies were based on endoscopic images only, which were often carefully selected from among optimal stored endoscopic images. Far fewer studies tested AI with real-time endoscopic videoclips, which would better reflect the real-life settings in which AI models would help most. On the other hand, in this study, the performance of AI applied to real-time videos was not statistically different from that on still images, and the performance of AI was similar to that of endoscopists. Additionally, a recent meta-analysis reported that AI models whose training and validation data sets included video clips could achieve even higher performance than those trained on images alone.[66]

Furthermore, retrospective studies make it possible to quickly test, for the first time, hypotheses that can then be investigated further in larger, prospective trials.

Most of the studies were based on DL models, and others applied ML with SVM algorithms. Additionally, training, validation and testing techniques varied across investigations, which differed in the AI algorithm used, in the number of training/validation/testing images or videoclips, and in the proportion of images or videoclips allocated to training, validation and testing. On the other hand, a recent meta-analysis concluded that AI could detect and characterise colorectal polyps despite the use of different AI algorithms and imaging techniques.[71] Importantly, only two studies included in this meta-analysis clarified whether a CADe or a CADx was used.[25,31] Accordingly, future studies should distinguish more rigorously between detection and diagnosis/characterisation of lesions to overcome this limitation. Of note, one third of the studies included in the qualitative synthesis could not be included in the meta-analysis because of non-poolable or non-extractable data. As has already been reported,[64] this represents a major limitation of the literature regarding AI in upper GI diseases. Finally, AI itself has inherent limitations. A high volume of training data is needed to refine the performance of the algorithm. In addition, the high computational power of AI carries the risk of overfitting, in which the model is too tightly fitted to the training data and does not generalise to new data.[72] Furthermore, AI has a black-box nature, and its decision process is obscure. Therefore, AI tools should never replace clinical judgement and should be considered a support only.
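The overfitting risk mentioned above can be illustrated with a toy sketch. It assumes a deliberately simple setting that is unrelated to the models in the included studies: synthetic one-dimensional data with 30% label noise, and a 1-nearest-neighbour classifier that memorises its training set. All names and parameters are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable


def make_data(n, noise=0.3):
    """Synthetic points: true label is 1 if x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = random.random()
        label = 1 if x > 0.5 else 0
        if random.random() < noise:
            label = 1 - label  # simulate mislabelled examples
        data.append((x, label))
    return data


def predict_1nn(train, x):
    # Return the label of the closest training point (pure memorisation).
    nearest = min(train, key=lambda p: abs(p[0] - x))
    return nearest[1]


def accuracy(train, data):
    correct = sum(predict_1nn(train, x) == y for x, y in data)
    return correct / len(data)


train = make_data(200)
test = make_data(200)

# Each training point is its own nearest neighbour, so training accuracy is
# perfect; the memorised label noise does not carry over to unseen points.
train_acc = accuracy(train, train)
test_acc = accuracy(train, test)
print(f"training accuracy: {train_acc:.2f}")
print(f"test accuracy:     {test_acc:.2f}")
```

The gap between the two figures is the point of the sketch: performance measured on the data used for fitting says little about performance on new data, which is why held-out or external validation sets matter.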

Although the application of AI in the diagnosis of ODs is relatively recent, our results demonstrated high accuracy of the examined CAD tools for the evaluation of ODs. However, several limitations still hamper the widespread adoption of CAD tools in the diagnosis of upper GI disorders, and further prospective and real-time studies are needed to fully understand what impact AI will have on the practice of gastroenterologists. Our investigation identified several gaps in the studies that investigated AI in the diagnosis of ODs, which will need to be addressed when designing future studies.