Reading Deeper: When Study Design Matters

David Graham, MD


April 05, 2016

Working with fellows and residents helps me remember certain important aspects of published trials. Reviewing trial reports with trainees in the clinic or in journal clubs highlights an important issue. Many trainees will dutifully report the study population, the conduct of the trial, and the results. They may critically review the clinical significance of the results. What they rarely do, however, is assess the assumptions built into the trial's design.

It's easy to think, "Why do this?" because large, multicenter trials are generally assumed to be highly vetted before enrollment. There are situations, however, that merit critically assessing the design and assumptions of a given trial.

Certainly, a trial started 10-20 years ago may not include recently recognized stratifying factors or might use therapies now considered outdated. These are easy to recognize. What may be harder to recognize are statistical assumptions that may directly affect the analysis and reported results.

We have recognized the benefit of androgen deprivation as a therapy in prostate cancer for many years. We have, during that time, also recognized the toxicities of that therapy. Scientists began questioning whether planned breaks in androgen deprivation might offer benefit, both in delaying the time to castration resistance and in preserving quality of life.

Two large randomized trials were reported in 2012 and 2013 comparing continuous androgen deprivation therapy with intermittent therapy using a study protocol designed to show noninferiority. In other words, the studies were not designed to show that one therapy was better than the other, but rather that one was not worse. If the outcomes were similar, a persuasive argument could be made that continuous therapy need not be used.

In 2012, the PR.7 trial was reported in the New England Journal of Medicine.[1] This trial studied continuous or intermittent androgen deprivation in men with a rising prostate-specific antigen (PSA) level after primary radiation therapy. The results were reported after an interim analysis. The statistical analysis was reported as showing noninferiority between the two arms. The results were widely heralded at the time, and the study fairly quickly found its way into clinical use.

Unfortunately, in 2013, the results of the S9346 trial[2] were reported, also in the New England Journal of Medicine. This was a trial of men with newly diagnosed metastatic prostate cancer treated with either continuous or intermittent androgen deprivation. In contrast to the PR.7 trial, the S9346 trial found that the confidence interval exceeded the predetermined limit for noninferiority, meaning that it could not be reported that the outcomes with continuous or intermittent androgen therapy were the same. In fact, the overall survival for the intermittent therapy arm was 7 months shorter than that for the continuous therapy arm.
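The decision rule described above can be made concrete with a short sketch. This is not the trials' actual statistical code; it simply illustrates, under the margins reported in this article, how noninferiority is declared only when the upper bound of the confidence interval for the hazard ratio falls below the prespecified margin. The confidence-interval values used here are hypothetical illustrations.

```python
# Hedged sketch of the noninferiority decision rule described in the text:
# intermittent therapy is declared noninferior only if the upper bound of
# the confidence interval (CI) for the hazard ratio stays below the
# prespecified margin. The CI values below are illustrative, not the
# trials' actual results.

def is_noninferior(hr_ci_upper: float, margin: float) -> bool:
    """Noninferiority holds when the CI upper bound is below the margin."""
    return hr_ci_upper < margin

# Hypothetical example with the S9346-style margin of 1.20 (20% higher risk):
print(is_noninferior(1.18, 1.20))  # True  -> noninferiority could be claimed
print(is_noninferior(1.23, 1.20))  # False -> noninferiority cannot be claimed
```

The same observed data can therefore pass one trial's margin and fail another's, which is exactly the tension between PR.7 and S9346 that the rest of this article explores.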

The differing results of these two trials have sparked numerous discussions over the past 3-4 years. Was the difference in study populations a reasonable explanation? Maybe once the disease metastasizes, it has become a different enough animal that the same therapy would lead to different results than among men with PSA-only recurrence.

However, a recently published analysis of multiple trials suggests that there may be a more fundamental statistical question at play.[3] To put it simply, how different can two things be and still be considered the same?

To be clear, the lead author of that review is Dr Maha Hussain of the University of Michigan. Dr Hussain is also the lead author of the paper reporting the results of the S9346 trial, which could raise questions about a conflict of interest. The idea, however, is valid.

The fundamental question is this: What upper limit on the hazard ratio is acceptable when declaring an outcome noninferior? In the S9346 trial, the prespecified margin was a hazard ratio of 1.20; that is, the risk for death with intermittent therapy could be no more than 20% higher for the result to be called noninferior. As stated above, the upper limit of the confidence interval in the intermittent therapy arm exceeded that threshold. In the PR.7 trial, the margin was a hazard ratio of 1.25. With a reported overall survival of 9.2 years in the continuous therapy arm, that meant a decrease in average survival of roughly 1.8 years would still be considered statistically noninferior.
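The arithmetic behind that 1.8-year figure can be sketched as follows. Under the simplifying assumption of exponential survival (where median survival scales inversely with the hazard), a margin on the hazard ratio implies a tolerable loss of survival time. The margins (1.25 for PR.7, 1.20 for S9346) and the 9.2-year survival figure come from this article; the exponential assumption is an illustrative simplification, not the trials' actual analysis.

```python
# Hedged sketch: translating a noninferiority margin on the hazard ratio
# into an implied loss of survival time, assuming exponential survival
# (survival time scales as 1/hazard). Illustrative only.

def implied_survival_loss(survival_years: float, hr_margin: float) -> float:
    """Years of survival that could be lost while still meeting the
    noninferiority margin, under the exponential-survival assumption."""
    return survival_years - survival_years / hr_margin

# PR.7: 9.2-year survival in the continuous arm, margin HR = 1.25
print(f"{implied_survival_loss(9.2, 1.25):.1f} years")  # 1.8 years

# The tighter S9346-style margin of 1.20 applied to the same 9.2-year figure
print(f"{implied_survival_loss(9.2, 1.20):.1f} years")  # 1.5 years
```

The point is not the exact numbers but the sensitivity: a seemingly modest change in the margin (1.20 vs 1.25) changes how much real survival difference a trial is willing to call "the same."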

Does this invalidate the results of the PR.7 trial? Not in and of itself. But if the results had not been reported after the interim analysis, or if the acceptable hazard ratio had been in the same range as the S9346 criterion, could the same conclusion have been drawn? Possibly not.

Whatever the case, the message for us, and the one to impart to our trainees, remains the same: If data conflict, read deeper into the studies. It is always reasonable to question assumptions inherent in a study's design. Keeping this in mind will make us better interpreters of the medical literature, better teachers, and better practitioners for our patients.
