Statistical Essentials in Interpreting Clinical Trials

Stuart J. Pocock, PhD


June 03, 2016


I'm Stuart Pocock, professor of medical statistics at London University. I am going to take you through a brief account of the statistical essentials in the analysis of clinical trials.

The topics that I will cover include significance testing, P values, estimation and confidence intervals, estimates from certain types of data, relative risks and odds ratios, and time-to-event data. I will also be covering hazard ratios and analyses of quantitative outcomes. Finally, I will discuss different ways to analyze data (such as intention-to-treat) and make a few comments on composite outcomes and some of the problems with subgroup analyses.

Significance testing and P values. A randomized clinical trial compares two treatments. If done well, with appropriate randomization and blinding, there should be no bias in the results. We want to determine whether an observed treatment difference is more than what could be attributed to chance. How strong is the evidence that what we find is a real treatment difference?

To answer that question, we would perform a significance test that gives us a P value. The smaller the value of P, the stronger the evidence of a genuine treatment difference.

There are three main types of data in performing tests of significance.

For binary outcomes, such as target lesion failure—yes or no—in a stent trial, you would perform chi-square tests. For another common outcome, the time-to-event (such as time to death), you would typically use a log-rank test. For quantitative outcomes (eg, blood pressure or late loss in millimeters), you would perform a two-sample t-test.

I won't go through how those tests are calculated but instead how to interpret what they mean when you perform them.

Starting with the chi-square test for a binary outcome, I am looking at the SPIRIT IV trial[1] comparing the XIENCE V stent versus the TAXUS stent. Of note, this trial employed a 2:1 randomization, with more patients on the XIENCE V stent. The primary outcome was target lesion failure at 1 year. We can see that the percentage of target lesion failure is 4.2% with the XIENCE V stent and larger (6.8%) on the TAXUS stent.

Is that difference genuine or could it have arisen by chance?

To find out, we propose a null hypothesis. Suppose that both stents are equally effective. If the null hypothesis is true, what is the probability (P) of finding a difference of 4.2% vs 6.8% or bigger? The answer in this case is P = .001, or 1/1000. Assuming that the trial is unbiased, we have strong evidence that the XIENCE V stent is associated with a lower rate of target lesion failure.
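As a rough sketch of how such a P value arises, the following Python computes a chi-square test for a 2 × 2 table. The patient counts are hypothetical, chosen only to match the quoted percentages and the 2:1 randomization; they are not the trial's actual numbers, and the function name is illustrative.

```python
import math

def chi_square_2x2(events_a, n_a, events_b, n_b):
    """Pearson chi-square test (1 degree of freedom, no continuity
    correction) comparing two event proportions."""
    observed = [
        [events_a, n_a - events_a],
        [events_b, n_b - events_b],
    ]
    total = n_a + n_b
    total_events = events_a + events_b
    row_totals = [n_a, n_b]
    col_totals = [total_events, total - total_events]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in this cell if both groups shared one event rate
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, the upper tail of the chi-square
    # distribution equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# HYPOTHETICAL counts: 101/2400 (4.2%) vs 82/1200 (6.8%)
chi2, p_value = chi_square_2x2(101, 2400, 82, 1200)
print(f"chi-square = {chi2:.2f}, P = {p_value:.4f}")
```

With these assumed counts the P value comes out well below .01, in line with the figure quoted above.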

To estimate the magnitude of effect in the SPIRIT IV trial, we have several options. The first is to produce a relative risk, which is the ratio of the two percentages. In this case, the relative risk is 0.62. A relative risk > 1 means that the new treatment or stent was worse. A relative risk < 1 means that it was better. This can also be expressed in terms of relative risk reduction, which is 100 × (1 minus the relative risk), or a 38% reduction in risk.

That is on a relative scale, but we also need to look on an absolute scale, looking at the difference in the percentages. So 6.8% minus 4.2% equals a 2.6% absolute benefit. We need to look at both relative risk reduction and absolute risk reduction when reporting such a trial.

The number needed to treat, another useful indicator, is simply 100 divided by the absolute percent reduction, which in this case is 38.
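The arithmetic behind these measures can be sketched in a few lines of Python, using only the two percentages quoted above. The function and variable names are illustrative.

```python
def effect_measures(risk_new_pct, risk_control_pct):
    """Return relative risk, relative risk reduction (%), absolute risk
    reduction (percentage points), and number needed to treat."""
    relative_risk = risk_new_pct / risk_control_pct
    relative_risk_reduction = 100 * (1 - relative_risk)
    absolute_risk_reduction = risk_control_pct - risk_new_pct
    number_needed_to_treat = 100 / absolute_risk_reduction
    return (relative_risk, relative_risk_reduction,
            absolute_risk_reduction, number_needed_to_treat)

# Target lesion failure at 1 year: 4.2% vs 6.8%
rr, rrr, arr, nnt = effect_measures(4.2, 6.8)
print(f"RR = {rr:.2f}, RRR = {rrr:.0f}%, ARR = {arr:.1f}%, NNT = {nnt:.1f}")
```

Note that the relative risk reduction (about 38%) and the number needed to treat (about 38) happen to coincide numerically here; they are different quantities.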

Another commonly used technique is the odds ratio, which is similar to relative risk, but it's the odds of target lesion failure in one group divided by the odds in the other group. In this case, the odds ratio is 0.6. The odds ratio typically is further from 1.0 than the relative risk. If the percentages are small, however, the odds ratio and the relative risk are rather similar, as shown in this example.
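A short Python sketch makes the relationship between the odds ratio and the relative risk concrete. The first comparison uses the percentages quoted above; the second uses hypothetical percentages ten times larger, purely to show how the two measures drift apart when events are common.

```python
def odds(risk_pct):
    """Convert a percentage risk into odds."""
    p = risk_pct / 100
    return p / (1 - p)

def odds_ratio(risk_new_pct, risk_control_pct):
    return odds(risk_new_pct) / odds(risk_control_pct)

def relative_risk(risk_new_pct, risk_control_pct):
    return risk_new_pct / risk_control_pct

# Quoted percentages: rare events, so OR and RR are close
or_rare = odds_ratio(4.2, 6.8)
rr_rare = relative_risk(4.2, 6.8)

# HYPOTHETICAL common events with the same relative risk
or_common = odds_ratio(42.0, 68.0)
rr_common = relative_risk(42.0, 68.0)

print(f"rare events:   RR = {rr_rare:.2f}, OR = {or_rare:.2f}")
print(f"common events: RR = {rr_common:.2f}, OR = {or_common:.2f}")
```

In both cases the odds ratio is further from 1.0 than the relative risk, but the gap is small only when the event rates are low.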

Those are the options for estimation. Next we want to know: When we produce an estimate, how reliable is it? How can we express the uncertainty in that estimate? The smaller the trial, the greater the uncertainty in any estimate. For that, we use the 95% confidence interval. Remember, in SPIRIT IV the observed relative risk was 0.62.

Now we can calculate the 95% confidence interval for that relative risk, which goes from 0.46 to 0.83. What does this confidence interval mean? It means that we are 95% sure that the true relative risk is in this interval, but there is a 5% chance that the true relative risk lies outside that interval. So the bigger the study, the tighter, the narrower the confidence interval. If you are unhappy with a confidence interval of 0.46 to 0.83 and you would like it to be half that width, you need a trial that is four times larger.
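One common way to obtain such an interval is to work on the log scale. The sketch below assumes that approach and uses hypothetical patient counts chosen to match the quoted percentages, so it will come close to, but need not exactly reproduce, the published interval. It also shows the effect of a trial four times larger: the standard error halves, so the interval is half as wide on the log scale.

```python
import math

def relative_risk_ci(events_a, n_a, events_b, n_b, z=1.96):
    """Relative risk with an approximate 95% confidence interval,
    computed on the log scale."""
    rr = (events_a / n_a) / (events_b / n_b)
    se_log_rr = math.sqrt(1 / events_a - 1 / n_a + 1 / events_b - 1 / n_b)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper, se_log_rr

# HYPOTHETICAL counts: 101/2400 (4.2%) vs 82/1200 (6.8%)
rr, lower, upper, se = relative_risk_ci(101, 2400, 82, 1200)
print(f"RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")

# Same proportions in a HYPOTHETICAL trial four times larger
rr4, lower4, upper4, se4 = relative_risk_ci(404, 9600, 328, 4800)
print(f"4x larger: RR = {rr4:.2f}, 95% CI {lower4:.2f} to {upper4:.2f}")
print(f"standard error ratio = {se4 / se:.2f}")
```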

There is a link between P values and confidence intervals. A P value < .05 means that the 95% confidence interval for the relative risk does not include 1.0. Conversely, a P value > .05 means that the confidence interval will include the null effect (which is 1.0).

Let's turn to the PARTNER trial,[2] which compared transcatheter aortic valve implantation or replacement (TAVI or TAVR) with surgical aortic valve replacement. The rate of death at 1 year seems to be rather similar for TAVI (24.2%) and surgery (26.8%), with a P value of .44, which is big, so there is no evidence of a treatment difference. However, what we wanted to do in this noninferiority trial was to show that TAVI was as good as surgery. To do that, we proposed in advance a predefined margin of +7.9%: the smallest excess in mortality with TAVI that we wanted to rule out to be able to say that the difference isn't as bad as that. The 95% confidence interval for the difference extended up to +3%, which is less than that margin. Therefore, we were able to claim noninferiority in this particular trial.
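The noninferiority logic can be sketched as follows. The group sizes (350 per arm) are an assumption made only for illustration, and the interval is a simple normal-approximation interval for a difference in proportions, so it will not necessarily reproduce the +3% limit reported by the trial. The point is the decision rule: noninferiority is claimed when the upper limit of the interval lies below the predefined margin.

```python
import math

def risk_difference_ci(p_new, n_new, p_control, n_control, z=1.96):
    """Difference in event proportions (new minus control) with an
    approximate 95% confidence interval."""
    diff = p_new - p_control
    se = math.sqrt(p_new * (1 - p_new) / n_new
                   + p_control * (1 - p_control) / n_control)
    return diff, diff - z * se, diff + z * se

def is_noninferior(upper_limit, margin):
    """Noninferior if the worst plausible excess risk is below the margin."""
    return upper_limit < margin

# Death at 1 year: TAVI 24.2% vs surgery 26.8%.
# ASSUMPTION: 350 patients per group (illustrative, not from the talk).
diff, lower, upper = risk_difference_ci(0.242, 350, 0.268, 350)
margin = 0.079  # predefined margin of +7.9 percentage points

print(f"difference = {100 * diff:+.1f}%, "
      f"95% CI {100 * lower:+.1f}% to {100 * upper:+.1f}%")
print("noninferior:", is_noninferior(upper, margin))
```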

The interpretation of P values is often too dogmatic. A P value < .05 is a shorthand way of saying, "The result is statistically significant at the 5% level." That is not the same as saying, "We have proven that there is a treatment difference." A P at .05 is an arbitrary guideline. Equally, a P value > .05 (which is not statistically significant) does not mean that there is no difference. It may be that the study was too small.

To produce convincing evidence of a treatment difference, P values must be substantially smaller than .05 (eg, .001) to come up with proof beyond a reasonable doubt that the treatment difference is genuine.

Time-to-event data involve a rather different set of analysis techniques. The PLATO trial[3] compared 1-year outcomes for ticagrelor versus clopidogrel in patients with acute coronary syndrome. The composite primary endpoint was cardiovascular death, myocardial infarction, or stroke, whichever happened first in any particular patient. The proportion of patients experiencing that primary endpoint by 1 year was 9.8% in the ticagrelor group and 11.7% in the clopidogrel group.

This was a very large trial, so although those percentages don't look very different, there is actually overwhelming evidence of a real treatment difference because the log-rank test of time-to-occurrence of primary endpoint achieved a P value < .001. This is used in time-to-event data to account for variations in the length of follow-up for different patients.
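To give a feel for how the log-rank test handles unequal follow-up, here is a minimal pure-Python sketch. At each time an event occurs, it compares the events observed in one group with the number expected if both groups shared the same risk, counting only patients still being followed at that time. The follow-up data are invented toy values, not trial data, and the function name is illustrative.

```python
import math

def log_rank(group_a, group_b):
    """Two-group log-rank test.

    Each group is a list of (time, event) pairs, where event is 1 if the
    endpoint occurred at `time` and 0 if follow-up ended (censored) then.
    Returns the chi-square statistic (1 degree of freedom) and P value.
    """
    everyone = group_a + group_b
    event_times = sorted({t for t, e in everyone if e == 1})

    observed_a = 0.0
    expected_a = 0.0
    variance = 0.0
    for t in event_times:
        # Patients still under follow-up just before time t
        n_a = sum(1 for time, _ in group_a if time >= t)
        n_b = sum(1 for time, _ in group_b if time >= t)
        # Events occurring exactly at time t
        d_a = sum(1 for time, e in group_a if time == t and e == 1)
        d_b = sum(1 for time, e in group_b if time == t and e == 1)
        n, d = n_a + n_b, d_a + d_b

        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    chi2 = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# TOY DATA (months of follow-up, event flag); purely illustrative
treated = [(2, 1), (5, 0), (7, 1), (9, 0), (12, 0), (12, 0)]
control = [(1, 1), (3, 1), (4, 1), (6, 0), (8, 1), (12, 0)]

chi2, p_value = log_rank(treated, control)
print(f"log-rank chi-square = {chi2:.2f}, P = {p_value:.3f}")
```

A patient censored at 5 months contributes to the "at risk" counts up to month 5 and not afterward, which is how the test uses partial follow-up without discarding it.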

If you want an estimate, you could do a relative risk of the percentages at 1 year, but that again ignores the fact that patients are followed for different periods of time. With this type of data, we instead produce what is called a hazard ratio. This is like the instantaneous relative risk averaged over time, estimated with what is called a Cox proportional hazards model. A hazard ratio < 1 means that ticagrelor was superior to clopidogrel. A hazard ratio > 1 means that ticagrelor was worse. So here we have a hazard ratio of 0.84, with quite a tight confidence interval, from 0.77 to 0.92. That was possible because the trial was very large.

With quantitative data, we face yet another type of analysis. The SPIRIT III trial[4] compared the XIENCE V and TAXUS stents to look at in-segment late loss in millimeters, measured 9 months after randomization. On the second line of results here, the mean late loss was less with the XIENCE V stent than with the TAXUS stent (0.14 mm vs 0.26 mm), suggesting that the XIENCE V stent is more effective in keeping the arteries open.

To express the variability among patients around each of those means, we use the standard deviation (0.39 and 0.46, respectively). But to say how precisely we know each mean, we use the standard error of the mean (the standard deviation divided by the square root of the number of patients studied). That is a rather smaller value.

To compare the two means and determine whether they are significantly different, we calculate the difference in the means (0.12 mm) and then the standard error of the difference given by the formula you see on the slide. The two-sample t-test divides the mean difference by the standard error of the difference—in this case, giving a t value of 2.86. If t is > 2, you have 5% significance, generally speaking. The P value in this case is .004, providing substantial evidence that the XIENCE V stent was superior in terms of late loss compared with the TAXUS stent. We can also calculate a 95% confidence interval for the difference in the means, as shown on the slide.
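The calculation can be sketched in Python as below. Several things here are assumptions: the group sizes (326 and 163) are hypothetical, picked to respect a 2:1 allocation and to land near the quoted t value; the standard error of the difference uses the unpooled form, the square root of the sum of the two squared standard errors, which may differ from the formula on the slide; and the P value uses a normal approximation to the t distribution, which is reasonable only for samples this large.

```python
import math

def compare_means(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Compare two group means using summary statistics only."""
    sem_a = sd_a / math.sqrt(n_a)   # standard error of each mean
    sem_b = sd_b / math.sqrt(n_b)
    diff = mean_b - mean_a
    se_diff = math.sqrt(sem_a ** 2 + sem_b ** 2)  # unpooled form (assumed)
    t = diff / se_diff
    # Two-sided P value from the normal approximation (large samples)
    p_value = math.erfc(abs(t) / math.sqrt(2))
    ci = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)
    return {"sem_a": sem_a, "sem_b": sem_b, "diff": diff,
            "se_diff": se_diff, "t": t, "p": p_value, "ci": ci}

# Late loss: 0.14 mm (SD 0.39) vs 0.26 mm (SD 0.46).
# HYPOTHETICAL group sizes: 326 and 163 patients.
result = compare_means(0.14, 0.39, 326, 0.26, 0.46, 163)
print(f"SEMs: {result['sem_a']:.3f} and {result['sem_b']:.3f}")
print(f"difference = {result['diff']:.2f} mm, t = {result['t']:.2f}, "
      f"P = {result['p']:.3f}")
print(f"95% CI: {result['ci'][0]:.2f} to {result['ci'][1]:.2f} mm")
```

The standard errors of the means come out far smaller than the standard deviations, which is the point made above about precision versus variability.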

Another issue to remember in interpreting analysis of data in clinical trials is which periods of time or which patients were included in the analysis. In a pragmatic trial comparing treatment strategies, all randomized patients are analyzed in their allocated groups, regardless of whether they received the treatment for that group. It is an unbiased comparison of strategies, including any deviations from those strategies that occurred in real-world trials.

The PARTNER trial is an example of that. The analysis was by intention to treat, which is what we expect for the primary analysis of trial results submitted to regulating authorities or for publication.

You might also wish to do a supplementary as-treated analysis, which includes only the patients who received the intended treatment. In the PARTNER trial, four TAVI patients and 38 surgery patients did not receive their intended treatment strategy. You can exclude those patients in a supplementary analysis. In this case, the supplementary analysis was consistent with the analysis by intention to treat, confirming the robustness of the results.

Composite endpoints are often used in cardiovascular trials, when you look at time to the first of a series of events. In the SPIRIT IV trial, target lesion failure was a composite of cardiac death, myocardial infarction, and target lesion revascularization. In addition to the composite endpoint, the trial looked at its individual components. There was essentially no difference in cardiac death. There was a small benefit in myocardial infarction, but most of the treatment difference in the composite endpoint came from a reduced incidence of target lesion revascularization with the XIENCE V stent. Looking at those details is helpful in interpretation.

In the PLATO trial, the composite endpoint was cardiovascular death, myocardial infarction, and stroke, and the difference was highly significant. The difference was driven by a reduction in cardiovascular deaths and myocardial infarctions, but stroke actually went slightly in the opposite direction, although that was not statistically significant. It is important to get a feel for what the composite means by looking at the treatment effects and the different components.

My last topic is subgroup analyses. These need to be interpreted with caution because there are so many different subgroups. A subgroup of importance in the PARTNER trial was the method of delivering the transcatheter valve: the transapical technique or the transfemoral technique. Comparing TAVI with surgery, the hazard ratio in the transapical group was slightly in favor of surgery. But for the transfemoral group, it was slightly in favor of TAVI. To evaluate this properly and cautiously, you look at whether the two hazard ratios are significantly different from each other, and in this case they are not. That is captured in the interaction test. Here, the interaction P value is .43, meaning that there is no evidence that the effect of TAVI relative to surgery differs between the transapical and transfemoral routes.
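One simple way to carry out an interaction test from published summaries is to compare the two hazard ratios on the log scale. The sketch below assumes that approach; the talk does not say how the trial computed its interaction P value. The hazard ratios and confidence intervals are hypothetical numbers chosen for illustration, not the PARTNER results.

```python
import math

def se_from_ci(lower, upper, z=1.96):
    """Recover the standard error of a log hazard ratio from its 95% CI."""
    return (math.log(upper) - math.log(lower)) / (2 * z)

def interaction_test(hr_1, ci_1, hr_2, ci_2):
    """Test whether two subgroup hazard ratios differ from each other."""
    diff = math.log(hr_1) - math.log(hr_2)
    se_diff = math.sqrt(se_from_ci(*ci_1) ** 2 + se_from_ci(*ci_2) ** 2)
    z = diff / se_diff
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return z, p_value

# HYPOTHETICAL subgroup results (hazard ratio, 95% CI)
z, p_interaction = interaction_test(1.20, (0.80, 1.80),   # subgroup 1
                                    0.90, (0.60, 1.35))   # subgroup 2
print(f"z = {z:.2f}, interaction P = {p_interaction:.2f}")
```

Two hazard ratios can sit on opposite sides of 1.0, as here, and still be entirely compatible with a single common treatment effect.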

Similarly, in the SPIRIT IV trial, you could look at patients with and without diabetes. This was interesting because most of the effect of the XIENCE V, or everolimus-eluting stent (EES), was confined to nondiabetics, suggesting that it doesn't work as well in diabetics. The interaction P value here was .02, comparing the two relative risks, indicating that the substantial evidence of superiority of the XIENCE V or EES is largely confined to nondiabetic patients.

We can do some rather perverse subgroup analyses.

For instance, the PLATO trial studied ticagrelor versus clopidogrel. In the analyses, we saw that in the United States the hazard ratio was in the wrong direction, suggesting that ticagrelor was less effective. In the rest of the world, it seemed effective. The interaction P value was .01, perversely suggesting that ticagrelor was superior, unless you happen to be American. It's a bizarre finding that must be interpreted cautiously because region was only one of 32 predefined subgroups that were examined. This could well be due to chance, but some have suggested that it was the result of a higher maintenance dose of aspirin given in US patients.[5] This may be a plausible explanation for this rather bizarre geographic finding.
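A back-of-the-envelope calculation shows why chance is a serious candidate explanation when 32 predefined subgroups are examined. The sketch assumes, for simplicity, that the 32 interaction tests are independent and that there is no true interaction anywhere; real subgroup tests are correlated, so the figures are only indicative.

```python
def chance_of_any_false_positive(n_tests, alpha):
    """Probability that at least one of n independent tests gives
    P < alpha when no true effect exists."""
    return 1 - (1 - alpha) ** n_tests

n_subgroups = 32  # number of predefined subgroups mentioned above

for alpha in (0.05, 0.01):
    prob = chance_of_any_false_positive(n_subgroups, alpha)
    print(f"P < {alpha}: chance of at least one such finding = {prob:.0%}")
```

Under these assumptions, an interaction P value as small as .01 somewhere among 32 subgroups is not a rare event.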

I will cover essentials of trial design another day. I hope this was useful in your statistical education. Thank you.
