How Robust Are Clinical Trials in Heart Failure?

Kieran F. Docherty; Ross T. Campbell; Pardeep S. Jhund; Mark C. Petrie; John J.V. McMurray


Eur Heart J. 2017;38(5):338-345. 

Abstract and Introduction


Aims Guidelines for the management of chronic heart failure (CHF) cite the results of randomized controlled trials (RCTs) to support treatment recommendations. The significance of an observed treatment effect relies on the use of a boundary P-value, most commonly P < 0.05. There is concern about relying on arbitrary threshold P-values to report results as 'statistically significant'. The 'fragility index' (FI) has been proposed as an additional measure of the robustness of trial findings. FI is the minimum number of patients whose status would need to change from a non-event to an event in order to render a significant result non-significant. We calculated the FI to examine the robustness of statistically significant RCTs in CHF.

Methods and results Two reviewers extracted data from RCTs supporting treatment recommendations in CHF guidelines. Twenty-five eligible trials were identified with a median sample size of 2331 patients (range 129–8399) and a median number of primary endpoints of 688.5 (range 88–2031). For the primary endpoint (analysed for 20 trials), the median FI was 26 (range 0–118). The FI was ≤10 in 7 (35%) of these 20 trials, and in 4 (20%) trials the number of patients lost to follow-up in the treatment group exceeded the FI.

Conclusion The results of some large RCTs in CHF hinge on a small number of events. The FI offers an additional, easy-to-understand metric, which augments the standard reporting of boundary P-values for statistical significance. The FI helps in the interpretation of the robustness of the results of RCTs.


The practice of evidence-based medicine emphasizes the importance of the results of randomized controlled trials (RCTs) in guiding and justifying treatment decisions.[1] It is therefore crucial that such results are robust and that guideline writers and practitioners have a clinically meaningful and readily understandable method of evaluating robustness. Many clinicians, however, focus on relative risk reductions derived from hazard ratios, the 95% confidence intervals around these, and the threshold P-value of <0.05 which is commonly taken to denote statistical significance.[2] Reliance on these metrics alone is of concern.[3] Implicit in the reporting of a relative risk reduction as 'significant' is the assumption that a true treatment effect exists. Sample size, number of events, and number of patients lost to follow-up, along with other factors including whether there is more than one trial, are also important determinants of the robustness of the findings.[4,5]

In order to assist the interpretation of trials an additional statistical metric, the 'fragility index' (FI), has been proposed as a tool to evaluate the robustness of results.[6] FI is the minimum number of events that need to change from a non-event to an event in order to render a significant result non-significant. The smaller the index, the more fragile the result. The principle underlying FI can be illustrated using an example of a trial with 100 patients randomized equally to treatment or placebo. If 10 patients in the treatment group experience an event, compared with 20 patients in the placebo group, the resultant P-value is 0.049 using a two-sided Fisher's exact test. If only one more event is added to the treatment group (n = 11) while maintaining the same event rate in the placebo group, the trial loses 'significance' as the P-value increases to 0.083.
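The worked example above can be reproduced with a short script. The following is a minimal sketch, not the authors' analysis code: the function names `fisher_exact_two_sided` and `fragility_index` are hypothetical, the two-sided P-value is assumed to be the sum of the probabilities of all tables no more likely than the observed one, and events are assumed to be added one at a time to the group with fewer events until the P-value reaches the chosen threshold.

```python
from math import comb


def fisher_exact_two_sided(events_a, n_a, events_b, n_b):
    """Two-sided Fisher's exact P-value for a 2x2 table of events by group.

    Assumption: 'two-sided' means summing the hypergeometric probabilities
    of every table that is no more likely than the observed table.
    """
    total_events = events_a + events_b
    lowest = max(0, total_events - n_b)   # fewest events group A could have
    highest = min(n_a, total_events)      # most events group A could have

    def weight(x):
        # Unnormalised hypergeometric probability, kept as an exact integer.
        return comb(n_a, x) * comb(n_b, total_events - x)

    observed = weight(events_a)
    extreme = sum(weight(x) for x in range(lowest, highest + 1)
                  if weight(x) <= observed)
    return extreme / comb(n_a + n_b, total_events)


def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """Number of non-events converted to events, in the group with fewer
    events, needed to lift the P-value to alpha or above (illustrative)."""
    if events_a > events_b:
        # Always modify the group with fewer events.
        events_a, n_a, events_b, n_b = events_b, n_b, events_a, n_a
    flips = 0
    while (events_a + flips < n_a and
           fisher_exact_two_sided(events_a + flips, n_a, events_b, n_b) < alpha):
        flips += 1
    return flips


if __name__ == "__main__":
    # Example from the text: 50 patients per arm, 10 vs. 20 events.
    print(round(fisher_exact_two_sided(10, 50, 20, 50), 3))  # about 0.049
    print(round(fisher_exact_two_sided(11, 50, 20, 50), 3))  # about 0.083
    print(fragility_index(10, 50, 20, 50))                   # 1
```

Under these assumptions the hypothetical 100-patient trial has an FI of 1. A result that is already non-significant by Fisher's exact test returns an FI of 0.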

To explore the value of FI, we examined its use in assessing the robustness of the results of trials in chronic heart failure (HF) with reduced ejection fraction (HF-REF) as this is one of the most evidence-based areas in the whole of medicine.[7,8] Multiple RCTs involving tens of thousands of patients have evaluated the effects of pharmacological and non-pharmacological therapies over the past 30 years. We analysed the trials providing the basis of guideline-recommended therapy in this condition. We also tested the value of three extensions of the FI concept. Firstly, we examined the FI for the different regulatory P-value thresholds for approval of a treatment based upon two independent trials compared with one single trial. Secondly, we studied the impact of loss to follow-up for vital status on the fragility of results. Finally, we explored the concept of FI applied to the results of a group of neutral trials in HF-REF.