We searched MEDLINE, EMBASE, EMBASE Classic and the Cochrane Library (via Ovid), from inception to 30 March 2021, to identify prospective and retrospective case-control-type or cohort-type accuracy studies reporting the performance of AI systems in the instrumental or clinical diagnosis of malignant and benign ODs. To identify potentially eligible studies published only in abstract form, conference proceedings (Digestive Disease Week, American College of Gastroenterology and United European Gastroenterology Week) from 2000 until 30 March 2021 were also searched. The complete search strategy is provided in Supplementary Methods. There were no language restrictions. We screened the titles and abstracts of all citations identified by our search for potential suitability and retrieved those that appeared relevant for more detailed examination. Foreign-language papers were translated. A recursive search of the literature was performed using the bibliographies of all relevant studies. Where a study appeared potentially eligible but did not report the required data, we planned to contact the authors for supplementary information and thereby maximise the number of available studies.
Study Selection (Inclusion and Exclusion Criteria)
The eligibility assessment was performed independently by two investigators (PV, BB) using pre-designed eligibility forms. We included in the systematic review (a) studies reporting the use of AI in the diagnosis of ODs in adult patients, (b) studies that reported the rates of true positivity, false positivity, false negativity and true negativity compared with the gold-standard diagnosis of the disease as ground truth and (c) studies that reported the numbers of images or videoclips included in the AI analysis. For the meta-analysis, we included studies that (a) separately assessed the performance of AI with each tool when more than one tool was used (ie white light endoscopy [WLE], narrow band imaging [NBI]) and (b) separately assessed the performance of AI in the diagnosis of different histological types of oesophageal cancer. From the qualitative analysis, we excluded review articles, case reports and studies that applied AI solely to radiology or histopathology. From the meta-analysis, we excluded (a) studies providing only composite performance scores of AI across different histological types of oesophageal cancer and (b) studies not reporting extractable data. Any disagreements were resolved by consensus among reviewers, and the degree of agreement was measured with the kappa statistic. Ethical approval was not required because this study retrieved and synthesised data from already published studies.
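The inter-reviewer agreement described above is conventionally quantified with Cohen's kappa, which corrects the observed agreement for agreement expected by chance. A minimal sketch of the calculation follows; the `cohens_kappa` function and the two decision lists are illustrative, not taken from the review:

```python
# Illustrative sketch of Cohen's kappa for two reviewers' eligibility
# decisions; the decision lists below are hypothetical examples.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters judging the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed proportion of agreement.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the two raters decided independently.
    ca, cb = Counter(rater_a), Counter(rater_b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / n**2
    return (observed - expected) / (1 - expected)

a = ["include", "include", "exclude", "exclude", "include", "exclude"]
b = ["include", "exclude", "exclude", "exclude", "include", "exclude"]
print(round(cohens_kappa(a, b), 2))  # → 0.67
```

Values above roughly 0.6 are usually read as substantial agreement, which is why the statistic is reported alongside the consensus procedure.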
Data Extraction and Analysis
Data were extracted independently by two authors (PV, BB) on to a Microsoft Excel spreadsheet (XP professional edition; Microsoft, Redmond, WA, USA). Disagreements were resolved by consensus among the reviewing authors.
The following data were collected for each study: total number of images/cases used in the validation sets; total number of "ground truth" images/cases (ie human-detected and histologically confirmed as neoplastic or non-neoplastic; diagnosis of GERD based on symptoms, endoscopy findings and/or pH-metry); and the numbers of images/cases that were true positive (TP; images/cases showing a neoplastic lesion detected/predicted as neoplastic by AI), true negative (TN; images/cases showing non-neoplastic mucosa without AI detection or lesions predicted as non-neoplastic), false positive (FP; images/cases showing non-neoplastic mucosa or lesions detected/predicted as neoplastic by AI) or false negative (FN; images/cases showing a neoplastic lesion missed by AI or predicted as non-neoplastic). In addition, year of publication, country where the study was conducted, type of study (prospective, retrospective), number of patients, diagnostic tool (endoscopy and type of endoscopic light, questionnaires, pH-impedance monitoring, oesophageal manometry, oesophageal biopsies), and type and design of AI systems (DL, SVM) were also retrieved.
The primary outcomes of interest were the pooled diagnostic sensitivity, specificity, positive likelihood ratio (PLR), negative likelihood ratio (NLR), diagnostic odds ratio (DOR) and the area under the summary receiver operating characteristic curve (AUROC) of the AI models in the diagnosis of malignant and benign ODs.
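Each of these accuracy measures derives from a study's 2x2 table of TP/FP/FN/TN counts. A minimal sketch of the standard formulas, using illustrative counts rather than data from any included study:

```python
# Standard diagnostic accuracy measures from a single 2x2 table.
# The function name and the example counts are illustrative only.
def diagnostic_metrics(tp, fp, fn, tn):
    sens = tp / (tp + fn)            # sensitivity: TP rate among diseased
    spec = tn / (tn + fp)            # specificity: TN rate among non-diseased
    plr = sens / (1 - spec)          # positive likelihood ratio
    nlr = (1 - sens) / spec          # negative likelihood ratio
    dor = plr / nlr                  # diagnostic odds ratio = (tp*tn)/(fp*fn)
    return {"sens": sens, "spec": spec, "PLR": plr, "NLR": nlr, "DOR": dor}

# Hypothetical counts: 90 TP, 10 FP, 10 FN, 90 TN.
m = diagnostic_metrics(tp=90, fp=10, fn=10, tn=90)
print(round(m["sens"], 2), round(m["PLR"], 1), round(m["DOR"], 1))  # → 0.9 9.0 81.0
```

The DOR conveniently collapses sensitivity and specificity into a single number, which is one reason it is a common pooled summary in diagnostic meta-analyses.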
The secondary outcome was the comparison of the performance of AI models versus endoscopists (without the aid of AI) in analysing the same validation data sets.
The degree of bias was assessed using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool. Four domains were evaluated: patient selection, index test, reference standard, and flow and timing. The first three domains were also assessed for concerns regarding applicability. Each domain was classified as having a high, low or unclear risk of bias (Figure 1).
A bivariate, random-effects model was used to calculate the pooled sensitivity, specificity, PLR, NLR, DOR and AUROC of AI-assisted models and of endoscopists in detecting oesophageal lesions. This method takes into account the correlation between sensitivity and specificity. The estimation procedure is based on a restricted maximum likelihood (REML) approach, with a model parameterisation in which sensitivity and specificity, on the logit scale, are distributed as bivariate normal random variables. The 95% confidence interval of the pooled AUROC was estimated by bootstrap resampling of the AUC values, although this estimate may be less reliable when few studies contribute to the computation. The calculation was performed using the extended 95% CI procedure for meta-analysis of diagnostic accuracy studies.
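The core idea of pooling on the logit scale can be illustrated with a deliberately simplified sketch. The full bivariate REML model jointly models logit-sensitivity and logit-specificity with between-study random effects; the toy version below pools only sensitivity, using a 0.5 continuity correction and fixed-effect inverse-variance weights. The function name and study counts are hypothetical:

```python
# Simplified illustration of logit-scale pooling of sensitivities.
# NOT the full bivariate REML model used in the paper; study counts
# are hypothetical (tp, fn) pairs.
import math

def pool_logit_sensitivity(studies):
    """studies: list of (tp, fn) pairs; returns pooled sensitivity."""
    num = den = 0.0
    for tp, fn in studies:
        tp, fn = tp + 0.5, fn + 0.5        # continuity correction
        logit = math.log(tp / fn)          # logit of the study's sensitivity
        var = 1 / tp + 1 / fn              # approximate variance on logit scale
        num += logit / var                 # inverse-variance weighted sum
        den += 1 / var
    pooled_logit = num / den
    return 1 / (1 + math.exp(-pooled_logit))   # back-transform to [0, 1]

studies = [(45, 5), (88, 12), (60, 10)]
print(round(pool_logit_sensitivity(studies), 3))  # → 0.871
```

Working on the logit scale keeps pooled estimates inside [0, 1] and makes the normality assumption more plausible; the bivariate model extends this by additionally modelling the correlation with specificity.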
Heterogeneity was assessed in accordance with the Cochrane guidelines. χ² tests were performed to assess the heterogeneity of sensitivities and specificities. The sources of heterogeneity were explored through subgroup analyses according to (a) specific diagnosis (Barrett's neoplasia [BN], oesophageal squamous cell carcinoma [OSCC], abnormal IPCL, GERD), (b) country, (c) study type, (d) AI algorithm, (e) endoscopy type, (f) real-time evaluation of the performance of AI and (g) best and worst performance of different algorithms on the same image set.
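The χ²-type heterogeneity test is typically Cochran's Q: the weighted sum of squared deviations of each study's estimate from the pooled estimate, referred to a chi-squared distribution with k-1 degrees of freedom. A minimal sketch on the logit-sensitivity scale, with hypothetical counts and an illustrative function name:

```python
# Illustrative Cochran's Q statistic for heterogeneity of sensitivities
# on the logit scale; (tp, fn) counts below are hypothetical.
import math

def cochran_q(studies):
    """studies: list of (tp, fn) pairs; returns (Q, degrees of freedom)."""
    logits, weights = [], []
    for tp, fn in studies:
        tp, fn = tp + 0.5, fn + 0.5           # continuity correction
        logits.append(math.log(tp / fn))      # logit of sensitivity
        weights.append(1 / (1 / tp + 1 / fn)) # inverse-variance weight
    pooled = sum(w * l for w, l in zip(weights, logits)) / sum(weights)
    # Weighted squared deviations from the pooled logit estimate.
    q = sum(w * (l - pooled) ** 2 for w, l in zip(weights, logits))
    return q, len(studies) - 1

q, df = cochran_q([(45, 5), (88, 12), (60, 10)])
print(df)  # compare q against a chi-squared distribution with df degrees of freedom
```

A Q value far above its degrees of freedom signals heterogeneity beyond sampling error, which is what motivates the subgroup analyses listed above.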
Aliment Pharmacol Ther. 2022;55(5):528-540. © 2022 Blackwell Publishing