Systematic Review With Meta-analysis

Artificial Intelligence in the Diagnosis of Oesophageal Diseases

Pierfrancesco Visaggi; Brigida Barberio; Dario Gregori; Danila Azzolina; Matteo Martinato; Cesare Hassan; Prateek Sharma; Edoardo Savarino; Nicola de Bortoli

Aliment Pharmacol Ther. 2022;55(5):528-540. 

Results

The search strategy generated 2568 citations. From these we identified 67 separate articles that appeared to be relevant to the study question. In total, 42 studies[17–58] reported on the performance of AI in the diagnosis of various ODs and were included in the qualitative synthesis (Supplementary Table S1). Among the included studies, 19[17–35] reported complete data for extraction and were included in the meta-analysis: 9 on BN,[17–25] 5 on OSCC,[26–30] 2 on abnormal IPCLs[31,32] and 3 on GERD[33–35] (Figure 2). Agreement between investigators for the assessment of study eligibility was excellent (kappa statistic = 0.85).
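
The kappa statistic quantifies chance-corrected agreement between the two investigators. As a minimal illustration of the calculation in Python, using hypothetical include/exclude counts chosen to give a kappa near 0.85 rather than the actual screening data:

```python
import numpy as np

# Hypothetical 2x2 agreement table for two raters screening citations:
# rows = rater A (include, exclude), columns = rater B (include, exclude).
table = np.array([[55,   7],
                  [ 6, 132]])

n = table.sum()
observed = np.trace(table) / n                            # observed agreement
expected = (table.sum(0) * table.sum(1)).sum() / n ** 2   # agreement expected by chance
kappa = (observed - expected) / (1 - expected)
print(round(kappa, 2))   # ~0.85
```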

Figure 2.

Flow diagram of assessment of studies identified in the meta-analysis

Studies not Included in Quantitative Synthesis

Among the 42 studies included in the qualitative synthesis, 23[36–58] could not be included in the meta-analysis for various reasons. These included investigations on eosinophilic oesophagitis (n = 1), reflux monitoring (n = 1), optical endoscopic diagnosis of GERD (n = 1), motility assessment (n = 1), diagnosis of cytomegalovirus and herpes simplex virus oesophagitis (n = 1) and varices (n = 1). In particular, nine studies[38,39,41–44,48–50] did not report complete data for extraction, five studies[45–47,51,58] reported non-poolable data, and nine[36,37,40,52–57] were the only retrieved studies of their type. We included these studies in Supplementary Table S1 for completeness, but not in the final quantitative synthesis through meta-analysis.

AI in the Diagnosis of Barrett's Neoplasia

Nine studies[17–25] reported extractable and comparable data regarding AI in the diagnosis of BN (Figure 3). Eight studies were performed in Europe[17–21,23–25] and one in America.[22] All the studies used DL models, except two in which an SVM algorithm was tested.[17,18] Moreover, seven studies reported the performance of AI under WLE,[17–22,24] two under NBI[22,25] and one provided the comprehensive performance of AI with WLE or NBI.[23] Six studies were retrospective,[18,20,22–25] and three were prospective.[17,19,21] Three studies evaluated the performance of AI using real-time videos.[19,21,25] Three studies compared the performance of the AI system with that of endoscopists, and all of these used WLE.[17,18,20] In all the included studies, BE and BN were diagnosed using histology as the ground truth.

Figure 3.

Performance of artificial intelligence in the diagnosis of Barrett's neoplasia

The comprehensive performance of AI in the diagnosis of BN with WLE or NBI, based on all nine studies,[17–25] was: AUROC 0.90 (CI, 0.85–0.94), pooled sensitivity 0.89 (CI, 0.84–0.93), specificity 0.86 (CI, 0.83–0.93), PLR 6.50 (CI, 1.59–2.15), NLR 0.13 (CI, 0.20–0.08) and DOR 50.53 (CI, 24.74–103.22) (Table 1).
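
The likelihood ratios and the DOR reported throughout this section are simple functions of sensitivity and specificity. The sketch below shows these point relationships using the pooled sensitivity and specificity quoted above; the published PLR, NLR and DOR are pooled across studies, so they differ slightly from this back-of-the-envelope calculation.

```python
# Point relationships between sensitivity/specificity and the likelihood
# ratios / diagnostic odds ratio. Values are the pooled estimates for BN
# with WLE or NBI; results approximate, not reproduce, the pooled PLR/NLR/DOR.
sens, spec = 0.89, 0.86

plr = sens / (1 - spec)      # positive likelihood ratio, ~6.4 (reported 6.50)
nlr = (1 - sens) / spec      # negative likelihood ratio, ~0.13 (reported 0.13)
dor = plr / nlr              # diagnostic odds ratio,     ~50   (reported 50.53)

print(round(plr, 2), round(nlr, 2), round(dor, 1))
```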

For the detection of BN under WLE, the pooled AUROC was 0.89 (CI, 0.84–0.94), pooled sensitivity 0.89 (CI, 0.82–0.94), pooled specificity 0.86 (CI, 0.82–0.89), pooled PLR 6.43 (CI, 1.53–2.17), pooled NLR 0.12 (CI, 0.21–0.01) and pooled DOR 52.03 (CI, 21.56–125.58) in seven studies[17–22,24] (Table 1).

For the detection of BN under NBI in two studies,[22,25] the pooled performance was AUROC 0.93 (CI, 0.75–0.99), sensitivity 0.89 (CI, 0.77–0.95), specificity 0.96 (CI, 0.47–1.00), PLR 20.19 (CI, 0.37–6.23), NLR 0.11 (CI, 0.5–0.05) and DOR 177.11 (CI, 2.9–10 821.79). Very wide confidence intervals were observed, especially for the DOR, because only two studies contributed to this evidence synthesis (Table 1).

Regarding the type of AI algorithm, the pooled AUROC, sensitivity, specificity, PLR, NLR and DOR of the studies that used DL as a backbone were 0.91 (CI, 0.86–0.95), 0.89 (CI, 0.83–0.93), 0.87 (CI, 0.83–0.90), 6.80 (CI, 1.60–2.22), 0.12 (CI, 0.21–0.07) and 54.65 (CI, 24.01–124.4), respectively, in seven studies.[19–25] The pooled AUROC, sensitivity, specificity, PLR, NLR and DOR of the studies that used SVM as a backbone[17,18] were as follows: 0.87 (CI, 0.78–0.97), 0.89 (CI, 0.70–0.97), 0.84 (CI, 0.72–0.91), 5.45 (CI, 0.91–2.39), 0.13 (CI, 0.42–0.04) and 42.86 (CI, 5.95–308.51), respectively (Table 1).

For the pooled performance of AI on real-time videos, the AUROC, sensitivity, specificity, PLR, NLR and DOR were 0.82 (CI, 0.80–0.92), 0.81 (CI, 0.73–0.87), 0.84 (CI, 0.79–0.89), 5.20 (CI, 1.25–2.03), 0.22 (CI, 0.94–0.15) and 23.16 (CI, 10.35–51.81), respectively, in three studies.[19,21,25] For non-real-time studies,[17,18,20,22–24] the pooled AUROC, sensitivity, specificity, PLR, NLR and DOR were 0.93 (CI, 0.86–0.96), 0.92 (CI, 0.87–0.95), 0.87 (CI, 0.82–0.91), 7.11 (CI, 1.60–2.32), 0.10 (CI, 0.16–0.06) and 73.32 (CI, 30.61–175.63), respectively (Table 1).

As for retrospective studies,[18,20,22–25] the pooled performance of the AI algorithms was as follows: AUROC 0.93 (CI, 0.87–0.97), sensitivity 0.90 (CI, 0.85–0.94), specificity 0.87 (CI, 0.82–0.90), PLR 6.69 (CI, 1.54–2.25), NLR 0.11 (CI, 0.18–0.07) and DOR 59.54 (CI, 25.57–138.60). By contrast, for prospective studies,[17,19,21] the pooled diagnostic efficacy of the AI algorithms was as follows: AUROC 0.87 (CI, 0.80–0.94), sensitivity 0.84 (CI, 0.70–0.92), specificity 0.86 (CI, 0.79–0.91), PLR 5.87 (CI, 1.19–2.28), NLR 0.19 (CI, 0.39–0.08) and DOR 31.66 (CI, 8.51–117.78) (Table 1).

As for studies performed in Europe,[17–21,23–25] the pooled AUROC, sensitivity, specificity, PLR, NLR and DOR were 0.85 (CI, 0.84–0.93), 0.85 (CI, 0.81–0.89), 0.84 (CI, 0.80–0.88), 5.50 (CI, 1.43–1.98), 0.17 (CI, 0.23–0.13) and 32.07 (CI, 18.04–57.00), respectively. In the single study performed in America,[22] the AUROC, sensitivity, specificity, PLR, NLR and DOR were 0.98 (CI, 0.90–0.99), 0.97 (CI, 0.82–0.99), 0.97 (CI, 0.66–1.00), 28.62 (CI, 0.87–6.06), 0.04 (CI, 0.27–0.01) and 816.6 (CI, 8.73–76 349.10), respectively (Table 1).

The bootstrap AUROC comparison among groups indicated a significant difference across countries (P = 0.04). Moreover, geographical location (Europe vs America) was a significant source of heterogeneity according to the χ2 test (P = 0.01). No significant differences in AUROC were identified across the other subgroups.
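
For illustration, a study-level bootstrap comparison of AUROC between two subgroups can be sketched as below. The study-level AUROC values and group sizes are hypothetical (the American subgroup is padded to two studies to keep the resampling well defined), and the authors' exact resampling scheme is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical study-level AUROCs for two geographical subgroups.
group_a = np.array([0.83, 0.86, 0.88, 0.82, 0.85, 0.87, 0.84, 0.86])
group_b = np.array([0.95, 0.98])

# Bootstrap the difference in mean AUROC by resampling studies with
# replacement within each subgroup.
diffs = np.empty(10_000)
for i in range(diffs.size):
    a = rng.choice(group_a, size=group_a.size, replace=True).mean()
    b = rng.choice(group_b, size=group_b.size, replace=True).mean()
    diffs[i] = b - a

low, high = np.percentile(diffs, [2.5, 97.5])
print(f"95% bootstrap CI for the AUROC difference: {low:.3f} to {high:.3f}")
```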

Regarding endoscopists with WLE, the pooled AUROC, sensitivity, specificity, PLR, NLR and DOR were 0.90 (0.85–0.95), 0.93 (0.66–0.99), 0.85 (0.71–0.93), 6.17 (0.82–2.63), 0.09 (0.48–0.01) and 70.12 (4.70–1045.93), respectively, in three studies[17,18,20] (Table 2).

For the diagnosis of BN under WLE, the performance of AI was comparable with that of endoscopists (P = 0.98). Moreover, the method of diagnosis (endoscopists vs AI) was not a significant source of heterogeneity according to the χ2 test (P = 0.96).

None of the included studies investigated the performance of endoscopists under NBI in the diagnosis of BN.

Artificial Intelligence in the Diagnosis of Oesophageal Squamous Cell Carcinoma

Five studies provided extractable and comparable data for the meta-analysis on the diagnosis of OSCC[26–30] (Figure 4). All the studies were conducted in Asia and used DL techniques. Four studies used WLE,[27–30] and two used NBI.[26,30] Two studies investigated the performance of AI during real-time videos,[26,27] whereas three used stored images.[28–30] Two studies compared the performance of the AI system to that of endoscopists.[28,30] All the studies defined OSCC according to histology as ground truth.

Figure 4.

Performance of artificial intelligence in the diagnosis of oesophageal squamous cell carcinoma

The pooled performance of AI in the diagnosis of OSCC with WLE or NBI was: AUROC 0.97 (0.92–0.98), sensitivity 0.95 (0.91–0.98), specificity 0.92 (0.82–0.97), PLR 12.65 (1.61–3.51), NLR 0.05 (0.11–0.02) and DOR 258.36 (44.18–1510.7) in five studies[26–30] (Table 1).

Under WLE, the pooled performance was as follows: AUROC 0.98 (0.95–0.99), sensitivity 0.95 (0.86–0.98), specificity 0.93 (0.77–0.98), PLR 14.42 (1.31–4.11), NLR 0.05 (0.18–0.02) and DOR 277.2 (19.94–3852.9) in four studies[27–30] (Table 1).

With NBI, the pooled diagnostic efficacy was as follows: AUROC 0.98 (0.94–0.99), sensitivity 0.96 (0.83–0.99), specificity 0.96 (0.94–0.97), PLR 23.49 (2.59–3.62), NLR 0.04 (0.19–0.01) and DOR 537.21 (71.81–4018.64) in two studies[26,30] (Table 1).

For the real-time diagnosis of OSCC by AI, the pooled AUROC, sensitivity, specificity, PLR, NLR and DOR were 0.99 (0.94–0.99), 0.94 (0.79–0.99), 0.98 (0.94–0.99), 39.4 (2.51–4.73), 0.06 (0.23–0.01) and 651.92 (53.83–7895.1), respectively, in two studies,[26,27] whereas, in non-real-time studies,[28–30] the pooled diagnostic efficacy of AI was as follows: AUROC 0.96 (0.89–0.97), sensitivity 0.96 (0.92–0.98), specificity 0.87 (0.71–0.95), PLR 7.29 (1.14–2.93), NLR 0.05 (0.11–0.03) and DOR 143.03 (27.61–741.01). The pooled diagnostic efficacy on videos was comparable with that on images (P = 0.29) (Table 1). There were no significant differences in AUROC across the other subgroups. Moreover, no sources of heterogeneity were found.

Regarding the pooled performance of endoscopists in the diagnosis of OSCC, the AUROC was 0.88 (0.83–0.98), sensitivity 0.75 (0.68–0.80), specificity 0.88 (0.84–0.92), PLR 6.46 (1.46–2.27), NLR 0.29 (0.38–0.22) and DOR 22.45 (11.5–43.84) in two studies[28,30] (Table 2). The method of diagnosis (endoscopists vs AI) was a significant source of heterogeneity according to the χ2 test (P = 0.02), whereas no significant differences in AUROC were identified across the other subgroups (P = 0.11) (Table 2).

Artificial Intelligence in the Detection of Abnormal Intrapapillary Capillary Loops

Two studies reported complete and poolable data on the detection of abnormal IPCLs and were included in the meta-analysis.[31,32] Both studies were performed in Asia, used DL algorithms, used fivefold cross-validation to generate five distinct data sets with different combinations of images, and used magnified endoscopy (ME) with NBI. Both studies classified IPCL patterns according to the Japan Esophageal Society classification, with histology as the ground truth.[59]
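
The fivefold cross-validation described above partitions the image set into five folds, training on four folds and testing on the held-out fold in turn, which yields the five per-fold performance estimates (including the best and worst folds pooled separately below). A minimal sketch, using a generic index array rather than the studies' actual data pipeline:

```python
import numpy as np
from sklearn.model_selection import KFold

image_ids = np.arange(1000)   # stand-in for the endoscopic image set

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kfold.split(image_ids), start=1):
    # Train the IPCL classifier on train_idx and evaluate on test_idx;
    # every image appears in exactly one test fold.
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test images")
```

In practice, folds are usually split at the patient level so that images from the same patient do not appear in both the training and test sets.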

For the detection of abnormal IPCL, the pooled performance of all the included AI algorithms was as follows: AUROC 0.98 (0.86–0.99), sensitivity 0.94 (0.67–0.99), specificity 0.94 (0.84–0.98), PLR 14.75 (1.46–3.70), NLR 0.07 (0.39–0.01) and DOR 225.83 (11.05–4613.93) (Table 1).

The pooled performance of the best fold of each study was AUROC 0.98 (0.97–0.99), sensitivity 0.99 (0.97–1.00), specificity 0.97 (0.96–0.98), PLR 32.18 (3.18–3.76), NLR 0.01 (0.03–0.00) and DOR 2779.61 (804.87–9599.39) (Table 1).

Regarding the pooled performance of the worst fold of each study, the pooled AUROC was 0.87 (0.66–0.96), sensitivity 0.73 (0.55–0.86), specificity 0.87 (0.71–0.95), PLR 5.74 (0.64–2.85), NLR 0.31 (0.64–0.15) and DOR 18.56 (2.97–115.85) (Table 1). The bootstrap AUROC comparison across groups indicated a significant difference between the best and worst performance (P = 0.001). The algorithm type was a significant source of heterogeneity according to the χ2 test (P < 0.001).

Artificial Intelligence in the Diagnosis of Gastroesophageal Reflux Disease

Three studies used symptom questionnaires for the AI-based diagnosis of GERD and were included in the meta-analysis.[33–35] Two studies were performed in Europe and used SVM algorithms,[33,34] whereas one study took place in Asia and used DL.[35] Two studies defined GERD (as erosive [ERD] or non-erosive reflux disease [NERD]) based on symptoms and endoscopy findings,[33,35] and one study[34] used symptoms, endoscopy findings and pH-metry as ground truth for the diagnosis of GERD (ERD or NERD).
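
As a rough illustration of an SVM-based questionnaire classifier of the kind used in two of these studies, the sketch below trains a support vector machine on synthetic symptom-item scores. The feature set, labels and model settings are assumptions for illustration and do not reflect the cited studies' configurations.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.integers(0, 6, size=(300, 12)).astype(float)   # 12 questionnaire items, 0-5 scores
y = (X[:, :4].sum(axis=1) + rng.normal(0, 2, 300) > 10).astype(int)  # synthetic GERD labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)
print(f"hold-out accuracy: {model.score(X_test, y_test):.2f}")
```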

For the diagnosis of GERD based on questionnaires, the pooled performance of AI was as follows: AUROC 0.99 (0.80–0.99), sensitivity 0.97 (0.67–1.00), specificity 0.97 (0.75–1.00), PLR 38.26 (0.98–6.22), NLR 0.03 (0.44–0.00) and DOR 1159.6 (6.12–219 711.69) in three studies[33–35] (Table 1).

For studies performed in Europe and with SVM,[33,34] the pooled AUROC, sensitivity, specificity, PLR, NLR and DOR were 0.98 (0.97–0.99), 0.99 (0.98–1.00), 0.99 (0.95–1.00), 145.88 (3.05–6.95), 0.01 (0.02–0.00) and 16 120.13 (1009.41–257 436.50), respectively (Table 1).

The single study performed in Asia and with DL[35] had sensitivity, specificity, PLR, NLR and DOR of 0.70 (0.59–0.80), 0.78 (0.66–0.87), 3.25 (0.55–1.81), 0.38 (0.62–0.23) and 8.61 (2.8–26.48), respectively (Table 1).

Deeks' Funnel Plot for Publication Bias

The Deeks' funnel plot asymmetry test indicated no evidence of publication bias among the included studies (P = 0.39).
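
Deeks' test regresses each study's log diagnostic odds ratio against the inverse square root of its effective sample size, weighting by the effective sample size; a slope that does not differ significantly from zero, as here, argues against small-study effects. A minimal sketch with hypothetical 2 × 2 counts:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical per-study 2x2 counts (TP, FP, FN, TN), not the paper's data.
studies = np.array([
    [90, 12, 10, 88],
    [45,  8,  5, 72],
    [60, 15,  9, 80],
    [70,  6, 12, 95],
])

tp, fp, fn, tn = (studies[:, i].astype(float) + 0.5 for i in range(4))  # 0.5 continuity correction

ln_dor = np.log((tp * tn) / (fp * fn))             # log diagnostic odds ratio
n_dis, n_nondis = tp + fn, fp + tn
ess = 4 * n_dis * n_nondis / (n_dis + n_nondis)    # effective sample size

# Regress lnDOR on 1/sqrt(ESS), weighted by ESS; the slope's P value is
# the funnel-plot asymmetry test.
X = sm.add_constant(1 / np.sqrt(ess))
fit = sm.WLS(ln_dor, X, weights=ess).fit()
print(f"Deeks' asymmetry P value: {fit.pvalues[1]:.2f}")
```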
