Towards the Best Kidney Failure Prediction Tool

A Systematic Review and Selection Aid

Chava L. Ramspek; Ype de Jong; Friedo W. Dekker; Merel van Diepen


Nephrol Dial Transplant. 2020;35(9):1527-1538. 

Study Selection

The study selection process is summarized in a flowchart (Figure 1). Overall, 2183 titles were identified, of which 431 abstracts were assessed and 90 full-text publications were evaluated in depth. Of these articles, a final 42 studies met all inclusion criteria and were included in the current review. Most full-text exclusions were due to the predicted outcome not including kidney failure or the lack of a multivariable model. Although prediction research has seen a great surge in nephrology over the last few years, the first included predictive model was published in 1986, for IgA nephropathy patients. Since the beginning of the 2000s, a substantial increase in published models is apparent, as can be seen in Figure 2. Although the number of developed models has increased almost every year, the number of validation studies has remained small. Of the 42 included studies, 7 exclusively externally validated already existing models.[16–22] Besides development, 10 studies also externally validated their own or previously published models. Disconcertingly, no study assessing the impact of using such a prediction tool was found, even though an impact study is ultimately the only way of assessing whether a model improves patient care.

Figure 1.

PRISMA flow diagram of study inclusion.

Figure 2.

Cumulative number of published development and validation studies for models that predict kidney failure in CKD patients (N = 42).

Characteristics of Development Studies

A total of 35 studies were published on the development of novel tools to predict kidney failure in CKD patients. Generally, a distinction can be made between models developed for a general CKD population (n = 16) and models developed for a population with a specified primary renal disease (n = 19), mainly IgA nephropathy or diabetic nephropathy. The characteristics of all included development studies are described in Table 1. Since each study developed between 1 and 12 prediction models, the results presented in Table 1 concern the final model(s) as selected by the authors or, if no final model was suggested, the model with the best performance. The population size differed greatly between studies, ranging from 75 to 28 779 patients. A small sample size was a problem in 17/35 studies, as they had fewer than 10 events per variable (EPV) and thus ran a substantial risk of overfitting their model.[14] External validation is of key importance to assess the extent to which these models are overfit; until their validity has been tested, these models should not be used in practice.
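The EPV rule of thumb referenced above can be made concrete with a minimal sketch. The cohort and predictor counts below are hypothetical and chosen only for illustration; they do not come from any of the reviewed studies.

```python
# Events-per-variable (EPV) check: a common rule of thumb flags models
# fitted with fewer than 10 outcome events per candidate predictor as
# being at substantial risk of overfitting.
# All numbers below are hypothetical, for illustration only.

def events_per_variable(n_events: int, n_candidate_predictors: int) -> float:
    """Return the EPV ratio for a prediction model."""
    return n_events / n_candidate_predictors

# e.g. a cohort with 72 kidney-failure events and 9 candidate predictors
epv = events_per_variable(72, 9)
print(f"EPV = {epv:.1f}")  # 8.0, below the conventional threshold of 10
print("high overfitting risk" if epv < 10 else "EPV rule satisfied")
```

Note that the EPV criterion counts outcome *events*, not total patients, which is why even large cohorts can fall short when kidney failure is rare over the prediction time frame.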

For specific renal diseases, the baseline was almost always the first biopsy (and disease confirmation), providing a clear moment in time at which to use the prognostic model or score. Models developed in general CKD, however, rarely defined the moment in time when their prediction tool should be used, as most of these studies enrolled prevalent CKD patients with a wide range of disease severity. Only two models were developed on incident patients, included at first referral to a nephrologist.[26,34] There was some variation in outcome definitions, but for most studies renal failure was defined as the need for renal replacement therapy (RRT), i.e. dialysis start or kidney transplantation. Five studies used estimated glomerular filtration rate (eGFR) or creatinine as a proxy for kidney failure. Two development studies used RRT start or death as a composite outcome measure. A total of four studies did not report their definition of end-stage renal disease (ESRD). The time frame over which the models predict kidney failure ranged from 6 months to 20 years, and nine studies failed to define a prediction time frame, presumably using the maximum study follow-up. The specific predictors included per development study are presented in Figure 3. There is a large amount of overlap in final predictors, with almost all studies including age, sex, eGFR (or serum creatinine) and proteinuria, plus histological features in the IgA nephropathy tools.

Figure 3.

Predictors included in development studies (N = 35). The inclusion of a predictor is shown as 'X'. The subscript under X (e.g. 'X2') indicates the number of predictors included from that category.

Concerning the reporting of performance measures, discrimination measures were reported far more often than calibration measures. Discrimination in the form of a C-statistic was reported in 28/35 studies. The C-statistic ranged from 0.72 to 0.96 and was generally high, indicating good to excellent discrimination in most studies. Calibration was presented far less frequently, with only 11 studies presenting a calibration plot, bar chart or test.

In order to calculate an individual's risk, the model constant and the hazard ratios (HRs)/regression coefficients per predictor are needed. Many studies presented only HRs per predictor without the constant (intercept or baseline hazard value), and some gave no data on the model equation at all. The full formula for the developed model was presented in only 6/35 studies. Just three studies provided a web calculator for easy use, two of which no longer function. A total of 13 studies provided a simplified scoring system. In total, 25 final models were validated in some form, internally and/or externally. Cross-validation, bootstrapping and random split-sample were the most commonly used forms of internal validation.
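To illustrate why the model constant is indispensable: for a Cox-type model, an individual's absolute risk at time t is 1 − S0(t)^exp(lp), where S0(t) is the baseline survival and lp the linear predictor built from the coefficients. Published HRs alone cannot yield an absolute risk. The coefficients and patient values below are hypothetical, not taken from any reviewed model.

```python
import math

def absolute_risk(baseline_survival, coefficients, values, means):
    """Absolute event risk from a Cox-type model:
    risk(t) = 1 - S0(t) ** exp(lp), with lp the sum of
    coefficient * (value - development-cohort mean) per predictor."""
    lp = sum(coefficients[k] * (values[k] - means[k]) for k in coefficients)
    return 1.0 - baseline_survival ** math.exp(lp)

# hypothetical 5-year baseline survival and a two-predictor model
s0_5y = 0.95
coefs = {"age": -0.02, "log_acr": 0.45}  # log hazard ratios
means = {"age": 65.0, "log_acr": 2.0}    # development-cohort means
patient = {"age": 55.0, "log_acr": 3.5}

risk = absolute_risk(s0_5y, coefs, patient, means)
print(f"Predicted 5-year kidney failure risk: {risk:.1%}")
```

Dropping `s0_5y` (or the centring means) from a publication makes this calculation impossible, which is exactly the reporting gap described above.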

Characteristics of External Validation Studies

A total of 17 studies externally validated one or more of the developed prediction tools. The characteristics of these models and validations can be found in Table 2. Most validation studies were performed by the same group of researchers who developed the models and were often presented in the same publication as the development. Compared with the development performance, the C-statistic was lower in 68% of the validations. Two studies updated the validated model by recalibrating the baseline hazard, and two studies added predictors to the existing model. In total, five risk scores predicting prognosis in IgA nephropathy patients and seven prognostic tools for general CKD patients were externally validated. Only the Absolute Renal Risk (ARR) score, the Goto score and the Kidney Failure Risk Equation (KFRE; three-, four- and eight-variable versions) were validated multiple times. The largest validation study of the KFRE was performed by Tangri et al.[18] and summarized its validation in >30 countries, including more than half a million patients.

Risk of Bias

Risk of bias was assessed in all 42 included studies, using signalling questions from PROBAST specified for detecting methodological flaws in both development and validation prediction studies. Overall, the risk of bias was high, as can be seen in Figure 4A and B. Forty-one of the 42 studies received a high risk of bias in at least one of the five domains; the only study with an overall low risk of bias was by Schroeder et al.[24] The majority of studies had a high risk of bias in the sample size and missing data domain, often due to the use of complete case analysis, which is generally an inappropriate method of handling missing data. A small sample size was a frequent problem limiting model usage, as a small sample often results in an overfit model and thereby biased results. In the statistical analysis domain, 83% of studies had a high risk of bias. The most common reason was incomplete reporting of performance measures, as few studies reported sufficient calibration results. In addition, many studies did not correct their model for overfitting through internal validation. The usability of each model was assessed in a separate domain: a tool was considered usable if the publication provided the full model formula, a calculator or a risk score with an absolute risk table. Less than half the studies (48%) presented enough detail for their prediction tool to be used in practice. The usable models that specified a prediction time frame are presented in Figure 5, categorized by type of patient population and outcome. This figure may be employed as a selection guide when one wants to calculate an individual's prognosis, taking into account that many of the models have significant shortcomings and may not be ready for clinical use.

Figure 4.

(A) Risk of bias and usability of prediction models (N = 42). Assessed using the PROBAST. The five risk of bias domains were evaluated as low risk (+), unclear risk (?) or high risk (−). Usability was evaluated as yes (+) or no (−). (B) PROBAST risk of bias summary for all studies (N = 42).

Figure 5.

Model selection guide for CKD patients. Only models that allow calculation of an individual's prognosis, and are therefore labelled as usable, are included in this graph. This means that each model provides a full formula, a score with an absolute risk table or a (currently working) web calculator for a specified prediction time frame. For categories containing multiple models, the risk of bias combined with evidence of external validity was weighed in determining the model order, starting with the most valid and least biased models. Nevertheless, many of the listed models have significant shortcomings and should be used with caution.