This systematic review revealed several limitations in the current literature. These include (i) validation of only algorithm-positive patients in most studies, (ii) wide variability in DILI definitions and methodologies, which affects the accuracy of the detection algorithms, and (iii) insufficient detail on the causality assessment method.
Studies assessing the DILI detection algorithms reported low PPVs, which could be affected by the low prevalence of DILI. The estimated incidence of DILI in retrospective databases ranged from 0.7% to 22.8%. In hypothetical populations, the PPV is shown to decrease with lower disease prevalence, with a stronger decrease when prevalence is below 50%.[62,63] When Hy's Law was used to identify ALF in DILI patients, the PPV was only 2%, given the low incidence rate of drug-induced ALF at 1.6 events per 1 000 000 person-years. For detection of DILI, the algorithm would be more clinically useful if it were applied selectively to patients who are more likely to have DILI. With the exception of one study, which performed medical record review on criteria-negative patients, there was insufficient information to compute sensitivity, specificity and negative predictive value. Thus, we were unable to assess the magnitude of false negatives, though it is likely to be small given the rare occurrence of DILI.
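The relationship between PPV and prevalence noted above follows directly from Bayes' theorem. As a minimal sketch, assuming illustrative test characteristics (the 90% sensitivity and 95% specificity figures are hypothetical, not taken from any reviewed study):

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value from test characteristics via Bayes' theorem."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# With assumed 90% sensitivity and 95% specificity, PPV collapses as prevalence falls:
for prevalence in (0.5, 0.1, 0.01, 0.001):
    print(f"prevalence {prevalence:.1%} -> PPV {ppv(0.90, 0.95, prevalence):.1%}")
# at 50% prevalence the PPV is ~95%, but at 0.1% prevalence it falls to ~2%
```

At DILI-like prevalences well below 1%, even a highly specific algorithm yields a single-digit PPV, consistent with the low PPVs observed across the reviewed studies.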
Detection of an ADE requires two elements: an adverse outcome and an association with a drug. Using a single criterion to detect an ADE could lead to high false-positive rates. Jinjuvadia et al reported that using ICD codes alone to search for DILI cases was a monumental task, yielding a PPV of merely 0.7%, which they attributed to a high degree of inaccuracy in the diagnosis codes in their administrative database. Indeed, the performance of a detection algorithm relies on the accuracy of the parameters in the healthcare database. A systematic review found that detection of ALI via diagnosis codes performed worse than for other disease conditions, with PPVs <25%. Detection of DILI is even more difficult because an occurrence of ALI may not be associated with drugs. Our study showed no significant difference in PPVs between studies validating ALI and studies validating DILI. This is possibly attributable to our selection criteria, which required ALI validation studies to include drug exposures and exclude alternative causes of liver injury, making the two groups more similar because ALI in these studies is more likely to be associated with drugs. However, Ruigomez et al reported an improved PPV of 63% when automated exclusions based on laboratory results, symptoms and diagnosis codes were implemented in their ALI detection algorithm. Investigators need to define DILI not only by diagnosis codes or laboratory criteria; the sequence of these definitions within the detection algorithm also matters. Udo et al demonstrated that although selecting cases via ICD codes at the start of the algorithm was more efficient, true DILI cases may ultimately be missed. Our study shows that using diagnosis codes as a search criterion for inclusion in the algorithm does not add significant value to the PPV for DILI. This is not surprising: rechallenge is usually omitted, so most clinicians will not make a diagnosis of DILI with confidence at discharge.
Furthermore, many diagnosis codes, such as those for ALF or abnormal liver function tests, are aetiology-agnostic.
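As a concrete illustration of how laboratory criteria, drug exposure and exclusions combine, a toy rule-based screen might look like the following. Everything here is hypothetical: the thresholds loosely echo common DILI laboratory criteria (ALT >= 5x ULN, or ALT >= 3x ULN with bilirubin >= 2x ULN), and the ICD-10 exclusion codes are illustrative, not drawn from any reviewed algorithm:

```python
# Assumed upper limits of normal (ULN); laboratory units are illustrative
ULN_ALT = 40.0    # U/L
ULN_BILI = 1.2    # mg/dL

# Hypothetical ICD-10 exclusion codes (viral hepatitis, cirrhosis, liver metastases)
EXCLUSION_CODES = {"B18.2", "K70.3", "C78.7"}

def flag_potential_dili(alt, bilirubin, diagnosis_codes, on_hepatotoxic_drug):
    """Toy screen: laboratory signal first, then drug exposure,
    then automated exclusion of alternative aetiologies."""
    lab_signal = alt >= 5 * ULN_ALT or (alt >= 3 * ULN_ALT and bilirubin >= 2 * ULN_BILI)
    if not (lab_signal and on_hepatotoxic_drug):
        return False
    return not (set(diagnosis_codes) & EXCLUSION_CODES)

print(flag_potential_dili(250, 1.0, [], True))          # True: ALT alone qualifies
print(flag_potential_dili(250, 1.0, ["C78.7"], True))   # False: liver metastases excluded
```

Requiring the laboratory signal and drug exposure before the exclusion step mirrors the finding above that automated exclusions, rather than inclusion diagnosis codes, drive the PPV.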
The difficulties in making comparisons across studies were compounded by the different laboratory threshold criteria, diagnosis codes and drugs used to select cases. Perez Gutthann et al reported a higher PPV than other studies of similar design and time period. This could be because fewer diagnosis codes were used, in combination with specific drugs, to identify potential cases, and because alternative causes of liver injury were excluded before medical record retrieval, resulting in fewer false positives. Conversely, Maggini et al reported a low PPV even though a small number of diagnosis codes were used and the drug of exposure was limited to amoxicillin-clavulanic acid; compared with the former study, the latter used more diagnosis codes to identify cases but fewer exclusion diagnoses. Nonetheless, more recent studies[48,56] that used a larger number of diagnosis codes to detect DILI achieved better PPVs because exclusion diagnoses were also specified in the algorithm. Of the studies that did not use diagnosis codes, only four specified drugs of interest. Three of these studies[45,46,59] performed better than the others; the exception was a study that assessed Hy's Law in chemotherapy patients, whose population was dissimilar to the other studies because the most common false positive was liver metastases. Using Hy's Law also selected more severe DILI cases, which occur rarely, resulting in a low PPV. Most studies used rule-based algorithms, except for two studies that implemented more computationally advanced algorithms, incorporating natural language processing to discriminate between DILI cases and non-cases.[56,59] In this meta-analysis, the PPV for the algorithm developed by Lin et al was calculated as per the development of the gold standard via medical record review.
However, the machine learning algorithms implemented subsequently had the potential to increase the PPV to 59.0%-81.4%, depending on the algorithm features and settings.
Despite these limitations, several studies were able to detect DILI in a reasonably high proportion of patients. These tended to include specific drugs and had built-in algorithm steps to exclude other aetiologies via the diagnosis database. While not perfect, a PPV of more than 20% would significantly improve the reporting rate, enhancing the value of registry reporting and its utility for guiding clinical care. Ramirez et al reported that 20.6% of patients identified through automatic laboratory signals had a serious ADE, which allowed a 27-fold increase in the detection and reporting rate.
Apart from the ubiquitous CIOMS criteria, new DILI criteria have been proposed by various groups involved in observational studies.[1,19,70] Unfortunately, we were not able to make meaningful comparisons between these criteria because few studies eligible for this review used them.
The lack of a universally accepted gold standard for causality assessment of DILI could cause differences in algorithm performance. The most commonly used causality assessment tool was RUCAM; however, most studies used expert opinion to adjudicate DILI cases. It is unclear whether the investigators in these studies followed a particular framework, which poses a challenge for replication of results. In studies that validated ALI, most reviewers were blinded to drug exposure and hence did not interpret causality in the same manner as reviewers for DILI; this difference, however, was not evident from the similar pooled PPVs among the subgroups. The US DILI Network (DILIN) prospective study attempted to structure the expert opinion process to minimize inter-rater variability and bias. Although complete agreement was higher with the DILIN method than with RUCAM, significant inter-rater variability remained in both methods.
To our knowledge, this is the first systematic review and meta-analysis to determine the performance characteristics of DILI detection algorithms in healthcare databases. Our search was extensive, covering algorithms used in administrative databases as well as EMRs, and a variety of hepatotoxic drug classes. Although there are other systematic reviews assessing the validity of case definitions in administrative databases, these are mostly limited to disease conditions without mention of drug causes, such as diabetes mellitus, myocardial infarction or rheumatoid arthritis. Previous systematic reviews on the detection of ADEs in electronic systems identified substantial variability in definitions and methodologies[11,65] because there were no limits on the types of ADEs studied. Identification of nonspecific ADEs demands knowledge of a multitude of clinical concepts, and unsurprisingly there is no one-size-fits-all approach; optimal solutions should be customized to the particular ADE. Thus, our review, focused on DILI, provides a reference for existing detection algorithms used in electronic systems and the characteristics that lead to a high PPV.
The main limitation of this study is the significant heterogeneity among the studies. Studies differed in their DILI definitions in terms of the number of diagnosis codes, type of laboratory threshold criteria, exclusion diagnoses and drugs of interest, which in turn relate to population selection criteria. We have, however, described the possible correlations between study characteristics and performance. To account for the heterogeneity, we applied a random-effects model to obtain a pooled estimate of the PPV.
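The pooling step can be sketched as a DerSimonian-Laird random-effects model on logit-transformed proportions. This is a minimal sketch with hypothetical study counts, not the actual data of this review; a production analysis would use a dedicated meta-analysis package (e.g. metafor in R):

```python
import math

def pooled_ppv_random_effects(true_positives, flagged_totals):
    """DerSimonian-Laird random-effects pooling of per-study PPVs,
    inverse-variance weighted on the logit scale."""
    logits, variances = [], []
    for x, n in zip(true_positives, flagged_totals):
        p = x / n
        logits.append(math.log(p / (1 - p)))
        variances.append(1 / x + 1 / (n - x))   # approximate variance of a logit
    w = [1 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, logits)) / sum(w)
    # Cochran's Q and the DerSimonian-Laird between-study variance tau^2
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, logits))
    c = sum(w) - sum(wi * wi for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(logits) - 1)) / c)
    w_re = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, logits)) / sum(w_re)
    return 1 / (1 + math.exp(-pooled))          # back-transform to a proportion

# Hypothetical studies: (confirmed DILI cases, algorithm-flagged cases)
print(pooled_ppv_random_effects([15, 40, 8], [100, 120, 60]))
```

The between-study variance tau^2 inflates each study's weight denominator, so heterogeneous studies are down-weighted less aggressively than under a fixed-effect model.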
Our meta-analysis demonstrates a few areas of critical need. To facilitate generalizability and reproducibility of published studies, investigators should adopt a consistent definition of DILI, such as the most recent internationally agreed phenotype suggested by Aithal et al, together with standardized causality assessment methods. We recommend that studies use a more specific detection algorithm to minimize the number of false positives. This could be achieved by specifying known hepatotoxic drugs of interest according to reference databases,[20,75] which improves the PPV by increasing the prevalence of DILI in the screened population, and by excluding differential diagnoses of liver injury in the algorithm. Machine learning methods could also be explored to develop more intuitive algorithms to detect DILI.
Liver International. 2018;38(4):742-753. © 2018 Blackwell Publishing