Systematic Review and Meta-analysis of Algorithms Used to Identify Drug-induced Liver Injury (DILI) in Health Record Databases

Eng Hooi Tan; En Xian Sarah Low; Yock Young Dan; Bee Choo Tai


Liver International. 2018;38(4):742-753. 



We identified 420 citations through our search strategy. We assessed 84 full-text articles for eligibility, of which 55 articles were excluded for the reasons outlined in Figure 1. We included 29 articles in the descriptive review, of which 25 studies were included in the meta-analysis. Three studies[32–34] did not validate the cases detected by the algorithm and one study[35] did not report the number of subjects who had medical records reviewed.

Figure 1.

Flow diagram of study selection and review

Systematic Review of 29 Articles

Study characteristics. The studies were published from 1993 to 2016, with 17 (59%) published after 2010 (Table 1). Most studies (79%) were conducted in the USA or Europe; studies from Asia or Australia were published in more recent years. The majority of studies included adult populations (62%),[6,32–48] two studies (7%) included children aged 18 years and below,[49,50] eight studies (28%) included a mixed population of adults and children,[51–58] while the age limit was not reported in one study.[59] Apart from two prospective studies[38,42] with shorter study periods, the remaining studies were conducted retrospectively. Sixteen studies (55%) were designed with specific drugs in mind, with antimicrobials being the most common.

Medical record review was the reference standard used in 25 studies (86%) to confirm the DILI cases. For causality assessment, 16 studies (55%) relied on expert opinion, 10 studies (34%) named the RUCAM or the World Health Organization Uppsala Monitoring Centre (WHO-UMC) drug causality scale, and one study (3%) used the Japanese diagnostic scale, DDW-J. Two studies[34,49] (7%) did not perform a formal causality assessment but were included in this review because competing diagnoses for DILI, such as viral hepatitis, chronic liver injury, autoimmune hepatitis and biliary tract diseases, were systematically excluded.

Studies conducted between 1993 and 2000 tended to use diagnosis codes as the screening filter to select cases, subsequently confirming case status against laboratory criteria. After EMR systems were adopted from around 2000 onwards, laboratory information, which was also digitized, was used as an additional screening criterion in some studies. Fourteen studies (48%) did not use diagnosis codes for selecting patients with DILI, relying solely on laboratory criteria. Among studies that specified diagnosis codes in their algorithms, the International Classification of Diseases, Ninth Revision (ICD-9) was used in the USA, Canada, Italy and the Netherlands, whereas studies in the UK used Oxford Medical Information System (OXMIS) codes. The number of diagnosis codes used to select potential cases also varied, from 3 to 15 for ICD-9 codes and from 19 to 25 for OXMIS codes. The most commonly referenced laboratory criteria were those of the Council for International Organizations of Medical Sciences (CIOMS),[60] found in 14 studies (48%). A myriad of other criteria were referenced; however, each was restricted to a single study, so no meaningful comparison could be made. Furthermore, five studies (17%) did not provide any citation for their chosen criteria.
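To make the two-stage screening logic described above concrete, the following minimal Python sketch combines a diagnosis-code filter with a laboratory criterion. The ICD-9 codes, the ALT upper limit of normal and the multiplier are hypothetical placeholders for illustration only, not values taken from any of the reviewed studies.

```python
# Hypothetical two-stage DILI screening filter: diagnosis codes first,
# then a laboratory threshold. Codes and thresholds below are illustrative
# placeholders, not the criteria of any study in this review.

SCREENING_CODES = {"573.3", "070.9", "576.8"}  # hypothetical liver-related ICD-9 codes
ALT_ULN = 40          # assumed upper limit of normal for ALT (IU/L)
ALT_MULTIPLIER = 3    # assumed lab criterion: ALT > 3 x ULN

def flag_potential_dili(record):
    """Return True if a patient record passes both screening stages."""
    has_code = bool(SCREENING_CODES & set(record["icd9_codes"]))
    has_lab_signal = record["alt"] > ALT_MULTIPLIER * ALT_ULN
    return has_code and has_lab_signal

patients = [
    {"id": 1, "icd9_codes": ["573.3"], "alt": 250},  # code + high ALT -> flagged
    {"id": 2, "icd9_codes": ["573.3"], "alt": 35},   # code only -> not flagged
    {"id": 3, "icd9_codes": ["401.9"], "alt": 300},  # lab signal only -> not flagged
]
flagged = [p["id"] for p in patients if flag_potential_dili(p)]
print(flagged)  # [1]
```

In the reviewed studies, patients passing such a filter would then undergo medical record review to confirm or refute DILI.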

Study quality. Quality scores ranged from 3 to 11 (median 7) (Data S2). With respect to reporting quality, selection criteria and detection algorithms were generally well reported, with the exception of two studies[54,58] which did not list the specific diagnosis codes. Eleven studies[37,43,45–48,50–53] did not report the expertise of the adjudicators who confirmed the DILI cases. Only seven studies[41,42,48–50,54,58] performed well on generalizability; studies with low scores had more selective populations (eg drug-specific or inpatient cases only). As for risk of bias, studies that performed poorly suffered from selection bias because of missing records. Most studies also did not verify that patients not detected by the algorithm were true negatives.

Assessment of publication bias. The funnel plot appears symmetrical about the random-effects estimate (Figure 2). Eight studies lay beyond the pseudo 95% confidence limits. Four of these, which reported lower PPVs, were among the studies with the largest sample sizes.[40,41,49,57] The other four, which reported higher PPVs, had moderate sample sizes ranging from 130 to 688.[45,46,48,59] The result remained unchanged after applying the trim-and-fill method, suggesting that publication bias is unlikely to be a major concern in this meta-analysis.

Figure 2.

Funnel plot of positive predictive value against standard error with pseudo 95% confidence limits

Meta-analysis of 25 articles. The validation studies assessing the DILI detection algorithms reported PPVs ranging from 1.0% to 40.2%. There was evidence of heterogeneity across studies (I² = 98.6%, P < .001). The PPV of algorithms that combined diagnosis codes with laboratory criteria (n = 13) ranged from 1.0% to 29.1%, whereas the PPV of algorithms that used laboratory criteria alone (n = 12) ranged from 3.1% to 40.2% (Figure 3). The most commonly used liver enzyme test criterion was alanine aminotransferase (ALT), reported in all except one study, which measured bilirubin alone.[41] Studies that specified drugs of interest (n = 13) had PPVs ranging from 1.0% to 40.2%, whereas studies that did not select cases based on drug class (n = 12) had PPVs ranging from 3.7% to 28.2% (Figure 4).
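For context, a study's PPV is simply the number of algorithm-flagged cases confirmed as DILI on medical record review, divided by the number of flagged cases whose records were reviewed. A minimal worked example, with counts invented purely for illustration:

```python
# PPV = confirmed DILI cases / algorithm-flagged cases with records reviewed.
# These counts are invented for illustration, not taken from any reviewed study.
confirmed_dili = 45
records_reviewed = 300
ppv = confirmed_dili / records_reviewed
print(f"PPV = {ppv:.1%}")  # PPV = 15.0%
```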

Figure 3.

Forest plot of positive predictive values of detection algorithms by laboratory and/or diagnosis code criteria (n = 25)*. *The study by Udo et al tested various algorithms with PPVs ranging from 22% to 48%. However, data on the unique number of patients per algorithm were unavailable; hence, the overall number of DILI cases over the number of unique medical records reviewed was reported

Figure 4.

Forest plot of positive predictive values of detection algorithms by specification of study drug (n = 25)

The overall PPV estimate of detection algorithms was low (14.6%, 95% CI: 10.7–18.9). Combining diagnosis codes with laboratory criteria marginally improved the pooled PPV estimate (14.9% vs 14.3%, P = .837). Studies that prespecified drugs of interest generally had a higher PPV than studies that did not (17.7% vs 11.6%, P = .053). There was no significant difference in PPV between studies that validated ALI and those that validated DILI (13.4% vs 15.3%, P = .537) (Figure 5).
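A pooled PPV estimate and the I² heterogeneity statistic of the kind reported above can be obtained by random-effects pooling of the study-level proportions. The sketch below uses the DerSimonian-Laird estimator, a common choice for such pooling, though the exact method used in this meta-analysis is not restated here; the study counts are invented for illustration.

```python
# Hedged sketch: DerSimonian-Laird random-effects pooling of study-level PPVs
# (proportions), with Cochran's Q and the I^2 heterogeneity statistic.
# The (confirmed, reviewed) counts below are invented for illustration.
studies = [(12, 150), (40, 200), (9, 300), (55, 250)]

p = [c / n for c, n in studies]                       # per-study PPV
var = [pi * (1 - pi) / n for pi, (_, n) in zip(p, studies)]
w = [1 / v for v in var]                              # inverse-variance weights

# Fixed-effect estimate and Cochran's Q
p_fixed = sum(wi * pi for wi, pi in zip(w, p)) / sum(w)
q = sum(wi * (pi - p_fixed) ** 2 for wi, pi in zip(w, p))
df = len(studies) - 1

# Between-study variance tau^2 and I^2 (truncated at zero)
c_term = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (q - df) / c_term)
i2 = max(0.0, (q - df) / q) * 100

# Random-effects weights and pooled PPV
w_re = [1 / (v + tau2) for v in var]
p_random = sum(wi * pi for wi, pi in zip(w_re, p)) / sum(w_re)

print(f"pooled PPV = {p_random:.1%}, I2 = {i2:.0f}%")
```

The random-effects weights down-weight large studies relative to a fixed-effect model when between-study variance (tau²) is large, which matters here given the high heterogeneity reported (I² = 98.6%).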

Figure 5.

Forest plot of positive predictive values of detection algorithms by validation of ALI vs DILI (n = 25)

The subgroup that prespecified drugs of interest was further analysed. For better comparability across studies, studies were excluded if their biochemical criteria targeted severe DILI, such as Hy's Law or ALT more than 10 times the ULN. After further restricting to algorithms that specified exclusion diagnoses and used fewer than 20 liver-related diagnosis codes as inclusion criteria, the pooled PPV estimate of the eight remaining studies was 22.6% (95% CI: 14.9–31.3) (Data S3). We performed an additional subgroup analysis by type of database (integrated health plan, primary care and hospital-based databases) but did not find any evidence of a difference in PPV between the three groups (Data S4); the PPVs for health plan, primary care and hospital-based databases were 12.9%, 13.3% and 16.3% respectively. Similarly, stratification by disease dictionary showed that studies which used OXMIS codes had a slightly higher PPV than studies which used ICD codes (16.4% vs 14.2%, P = .408) (Data S5), possibly attributable to the higher granularity of OXMIS codes.