Clinical Trial (Mis)Interpretation: 4 Pet Peeves

Andrew D. Althouse, PhD; Robert W. Yeh, MD, MSc


October 16, 2019

The conduct, analysis, and interpretation of clinical trials vary widely. We asked Andrew Althouse, PhD, a biostatistician with expertise in randomized trials, and Robert Yeh, MD, MSc, a cardiologist and trialist, to discuss their pet peeves in clinical trial interpretation.

1. Absence of evidence is not evidence of absence.

Andrew D. Althouse, PhD: I think my number-one pet peeve is how often people interpret all "nonsignificant" results (eg, P > .05 for the primary endpoint) as conclusive evidence that there is no effect, when there is often much uncertainty in the data from a single trial. Over 20 years ago, Doug Altman and Martin Bland wrote a famous piece titled "Absence of Evidence Is Not Evidence of Absence,"[1] which included this blistering statement: "To interpret all these 'negative' trials as providing evidence of the ineffectiveness of new treatments is clearly wrong and foolhardy. The term 'negative' should not be used in this context."

Yet, we frequently see trials interpreted in this manner. For example, the CABANA trial results were accompanied by headlines such as "Catheter Ablation for Atrial Fibrillation No Better Than Drug Therapy," based on a primary endpoint with a hazard ratio (HR) of 0.86 and a 95% confidence interval (CI) of 0.65-1.15 for the ablation group versus the medical therapy group.[2] The data are compatible with a range of possible "truths" about the effect of ablation, ranging from "ablation is pretty helpful" (eg, lower limit of 95% CI of 0.65) to "ablation is slightly harmful" (eg, upper limit of 95% CI of 1.15). While the CABANA findings are not strong enough to conclusively prove that catheter ablation has a benefit, neither are they strong enough to conclusively rule out a benefit of catheter ablation, yet the results are often reported with statements like "no better" or "no benefit."

Robert W. Yeh, MD, MSc: I couldn't agree more. One of the most egregious recent examples was the EOLIA trial, which assessed the effect of early extracorporeal membrane oxygenation (ECMO) for patients with severe acute respiratory distress syndrome, in which there was a 24% relative risk reduction in mortality at 60 days in a 250-patient trial with 28% crossover.[3] Because the P value narrowly missed .05, the conclusion was that ECMO "showed no significant benefit." I suspect that this was one of the cases that led the NEJM to change their policy on the interpretation of statistical significance.[4]

Of course there are risks to loosening the stringency with which we declare trials positive or negative. You could imagine that every trial not meeting statistical significance could be spun one way or another: The study was underpowered, the event rates were lower than expected, the population was different from the ones we see in practice. There would almost be a perverse incentive to roll the dice on intentionally underpowered studies, where negative trials could be labeled inconclusive but positive trials would be celebrated.

Althouse: That's a fair concern. To be clear, it's good for the field to retain a high evidentiary bar before concluding that treatments are effective and should be used in routine clinical practice. I just think we need to be better in differentiating trials that truly have ruled out a meaningful treatment effect from trials that don't prove that a meaningful treatment effect exists, but also don't rule out that possibility.

2. Subgroup analyses: 'Fun to look at but don't believe them.'

Yeh: In reading and reviewing papers, I commonly see researchers claim a particular benefit for a therapy in a subgroup—say, men—based on a subgroup P value < .05 for the treatment effect, while claiming no effect in another group—in this case, women—based on P > .05. In a recent paper I reviewed, the subgroup point estimates for treatment effect were identical and it was the sample sizes of the subgroups that drove the different P values. My guess is that most readers of clinical trials don't really understand what a test for interaction is and why that's a better assessment of heterogeneous treatment responses than focusing on subgroup P values. Based on the frequency with which I see the error being made, publishers and authors also may be in need of further education. As a stats editor for Circ Interventions, have you seen the same?
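The scenario Yeh describes can be sketched numerically (all effect sizes and standard errors here are hypothetical): two subgroups with identical point estimates land on opposite sides of P = .05 purely because of their sample sizes, while the interaction test correctly finds no heterogeneity at all.

```python
import math

def p_two_sided(z):
    """Two-sided P value from a standard normal z statistic."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Hypothetical subgroups with the IDENTICAL treatment effect estimate
# (log hazard ratio -0.30, i.e., HR ~ 0.74) but different sample sizes,
# hence different standard errors.
effect_men, se_men = -0.30, 0.12      # larger subgroup
effect_women, se_women = -0.30, 0.20  # smaller subgroup

p_men = p_two_sided(effect_men / se_men)        # ~0.012: "significant"
p_women = p_two_sided(effect_women / se_women)  # ~0.13: "nonsignificant"

# The interaction test asks the right question: do the two effects
# DIFFER from each other? Here they are identical, so P = 1.0.
z_int = (effect_men - effect_women) / math.sqrt(se_men**2 + se_women**2)
p_int = p_two_sided(z_int)

print(f"P(men) = {p_men:.3f}, P(women) = {p_women:.3f}, P(interaction) = {p_int:.2f}")
```

Reporting "benefit in men, no benefit in women" from these numbers would be exactly the error described above: the evidence for a differential effect is literally zero.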

Althouse: Yep, we definitely see this. There are two problems to discuss here: the mass-application of subgroup analyses in general, and the overreliance on "statistical significance" when interpreting these analyses.

The introduction of the forest plot showing the treatment effect by different subgroups (eg, age > 65, men vs women, etc.) had good historical intentions. Given concerns about whether trials included sufficiently representative populations, the forest plot made it easy to see the sample size within select groups, and it serves as a useful alert if the results in one subgroup depart dramatically from the overall trial result. When that happens, we can then consider the plausibility of the finding.

Because many trials are just barely large enough to test the treatment effect in the full study population, they're inherently underpowered to test for treatment effects within subgroups; slicing the data into many subgroups is virtually certain to show that the treatment "doesn't work" in one or more of them (look, the confidence interval crosses 1 or P > .05 in women with red hair!). There's a famous example from the ISIS-2 trial of aspirin in acute myocardial infarction in the late 1980s. The Lancet editors asked the authors to add some subgroup analyses that were not in the original trial report.[5] The authors agreed only on the condition that they also include a subgroup analysis by astrological sign, which showed that the treatment benefit of aspirin was "not significant" for Librans and Geminis, despite a resoundingly positive overall result (P < .000001). Is there any reason to believe that aspirin would have a differential effect by astrological sign? Of course not, but the investigators used this example to illustrate the perils of fishing around in subgroup analyses.
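The arithmetic behind the astrological-sign joke is worth making explicit (the power and alpha values below are illustrative, not from ISIS-2): slicing a trial into 12 subgroups makes at least one misleading subgroup result nearly inevitable, in either direction.

```python
k = 12  # number of subgroups, e.g., astrological signs

# If the treatment truly works but each subgroup alone has only 60% power,
# the chance that EVERY subgroup reaches P < .05 is tiny, so at least one
# "nonsignificant" subgroup is almost guaranteed.
power_each = 0.60
p_some_nonsignificant = 1 - power_each ** k

# Conversely, if the treatment truly does nothing, testing 12 subgroups
# at alpha = .05 still gives a sizable chance of a false positive.
alpha = 0.05
p_false_positive = 1 - (1 - alpha) ** k

print(f"P(at least one 'nonsignificant' subgroup) = {p_some_nonsignificant:.3f}")  # ~0.998
print(f"P(at least one spurious 'significant' subgroup) = {p_false_positive:.2f}")  # ~0.46
```

Either way, a lone subgroup crossing (or failing to cross) the .05 line tells you almost nothing without an interaction test and a plausible mechanism.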

A few years later, one of them, Peter Sleight, published "Subgroup Analyses in Clinical Trials: Fun to Look At - But Don't Believe Them!"[6] Sadly, that lesson seems to be lost on many.

Yeh: If we think about how we identify patients who might benefit more or less from a treatment, clinicians usually integrate more than one factor into that decision. For example, when asking how long a patient should be treated with dual antiplatelet therapy after stents, our question isn't "How long should I treat this patient based on their sex?" It's "How long should I treat this patient based on their age, sex, other comorbidities, and characteristics of their procedure?" We tried to do something like this when devising the DAPT Score. David Kent[7] at Tufts and Rod Hayward at Michigan have long been arguing for more nuanced subgroup analyses that are based on multivariable rather than single-variable stratifications.[8]

I may be biased, but I think this type of analysis can provide a lot more information than the typical subgroup analyses we see, and also will be more likely to identify heterogeneity in treatment effect if it does exist.

3. Yes, you should consider covariate adjustment in RCTs.

Yeh: OK, Andrew, I've seen you ranting about this one on Twitter. I thought the whole point of randomized trials was that we didn't need to adjust for anything. What gives?

Althouse: I see where that idea comes from. The purpose of randomization is to remove systematic bias from treatment assignment, which might otherwise lead to systematic differences in the distribution of outcomes between the treatment groups (eg, healthier patients preferentially assigned to one of the treatments, creating an exaggerated appearance of benefit). Since we know that there are no systematic differences between groups in a randomized trial, it's natural to think that the randomization obviates the need for any covariate adjustment, but it's not quite that simple.

Yeh: Yes, I think most clinicians think that for observational comparisons, sure—it makes sense that we would need some form of statistical adjustment for differences between the groups. But the baseline characteristics table of most large RCTs almost always shows balance. If you can demonstrate balance, why should we adjust for any of those variables?

Althouse: This comes as a surprise to most, but it's not about the "baseline balance" of the treatment groups. The reason to adjust is to account for explainable variations in the outcome. Adjusting for baseline variables that are strongly associated with the outcome gives a little boost to the statistical power of the trial, which is important. If there is a "real" treatment effect, we want to maximize the trial's probability of concluding that a treatment effect is present.

Yeh: Fascinating. I recall some of the statisticians for trials I've worked on suggesting a prespecified strategy of baseline covariate adjustment. Most of us nonstatisticians thought that was to deal with potential imperfections in the randomization process. Shoot—if we knew we would be getting more power, we'd have jumped at the chance. But wait, what about trials that don't use any covariate adjustment? Are they invalid for some reason?

Althouse: It's not necessarily wrong to design or analyze a trial without covariate adjustment, especially if there aren't any obvious predictor variables that will have a strong association with study outcome. But if you can rattle off a few variables that are strongly associated with the outcome, including them as covariates in the final analysis has the added bonus of increased power.

It would be incomplete to chat about this without addressing the myth of "baseline balance" in trials. Darren Dahly has an excellent and accessible post on Medium. A brief summary goes something like this: (1) adjusting for baseline covariates only matters if they are actually predictors of the outcome; (2) it is still useful to adjust for predictors of outcome even if they appear to be "balanced" between the treatment arms; and (3) adjusting for an "imbalanced" variable that is not related to the study outcome doesn't help.
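A back-of-the-envelope calculation shows where the power gain comes from (all numbers are hypothetical, and the formula is the usual normal approximation to two-sample power): adjusting for a baseline covariate that explains a fraction R² of outcome variance shrinks the residual SD by √(1 − R²), which is the entire mechanism.

```python
import math

def power_two_arm(delta, sd, n_per_arm):
    """Approximate power of a two-arm comparison of means
    (normal approximation, two-sided alpha = .05)."""
    se = sd * math.sqrt(2 / n_per_arm)
    z = delta / se - 1.96
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # Phi(z)

delta = 5.0    # true treatment effect on the outcome (hypothetical)
sd_raw = 20.0  # outcome SD with no adjustment (hypothetical)
n = 150        # patients per arm

# Suppose a prespecified baseline covariate (say, baseline severity)
# explains 40% of outcome variance. Adjustment reduces the residual SD
# by sqrt(1 - R^2), and power rises accordingly.
r2 = 0.40
sd_adj = sd_raw * math.sqrt(1 - r2)

print(f"Unadjusted power: {power_two_arm(delta, sd_raw, n):.2f}")  # ~0.58
print(f"Adjusted power:   {power_two_arm(delta, sd_adj, n):.2f}")  # ~0.80
```

Note that points (1)-(3) above fall straight out of this sketch: if R² = 0, adjustment changes nothing, no matter how "imbalanced" the covariate looks; if R² is large, adjustment helps even with perfect balance.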

From what I've seen, much of the resistance to covariate adjustment in RCTs comes from concern that this could be used to game the results; trialists could just add or subtract covariates from the final model until they get the results that they want for the main effect. This is why prespecification is critical: By naming a short list of variables that are known to be strongly associated with the study outcome ahead of time, you get the benefits of covariate adjustment without the risks (or accusations) of "gaming the results."

If readers take one thing away from this section, let it be this: Don't dismiss the results of a trial as "statistical voodoo" just because the researchers used a regression model.

4. An effective treatment can appear ineffective if tested in the wrong population.

Althouse: All right, let's talk about Bobby Yeh's single greatest career achievement: carrying out the first randomized controlled trial of parachutes![9]

Yeh: It's got the highest Altmetric score of any paper I've been affiliated with, by at least an order of magnitude. Pretty sure it'll be on my epitaph, for better or worse.


Althouse: Seriously, I loved the point you made with the PARACHUTE trial (PArticipation in RAndomized trials Compromised by widely Held beliefs aboUt lack of Treatment Equipoise). You concluded that parachute use did not reduce death or major traumatic injury when jumping from aircraft. Of course, the aircraft was grounded when you jumped. Statistician Frank Harrell often points out that trials are designed to estimate relative efficacy, but if you're enrolling patients who have zero risk of the study outcome, or people who (for some other reason) cannot benefit from the treatment, you might make an effective treatment for the "right" patients appear ineffective because it's tested in the wrong people.

Yeh: I think Frank is right that over much of the continuum of risk, relative measures of treatment efficacy are more or less constant. But any doc will tell you that that rule doesn't always hold, particularly at the extremes. There are definitely patients who are too sick to benefit from any treatment, and others who are just too healthy to benefit. Then there are patients who likely have the most to gain. Trying to figure out which category the patient in front of you falls into constitutes much of what it means to be a thoughtful clinician. And clinicians don't just turn that off because they are enrolling patients in trials. If a doc thinks their patient is going to benefit from a therapy, they're going to be reluctant to enroll the patient in a trial where there's an equal chance that the patient won't receive that treatment.

Patients do this too. If a patient has terrible symptoms, they might be reluctant to enroll in a placebo-controlled trial compared with another patient with just mild symptoms. If patients who might benefit get excluded from trials, you could get a trial that shows, for example, that parachutes don't prevent injuries among people jumping out of airplanes.

Althouse: So let me ask the obvious question: Do you think this actually happens? And in what kind of trial is this most problematic?

Yeh: There's no question that some amount of this happens in nearly every trial of an approved therapy, particularly if there are people (whether patients or doctors) who really believe that the therapy works. Now, it may well be that the therapy, in fact, doesn't work and that a well-conducted trial would demonstrate that. The problem is that it's impossible to distinguish whether a negative result is due to the therapy not working or because the trial is stricken by the problem of generalizability due to the "parachute problem."

A real pet peeve of mine is when strong supporters of evidence-based medicine criticize this line of thinking as being anti-RCT. I'm a huge supporter of RCTs. I've helped design and lead them. But pretending that this isn't an issue actually does RCTs, and patients, a disservice.

Althouse: OK, I'll ask another obvious question, albeit one that's much harder to answer: What can/should we do about this?

Yeh: I'm glad you asked. The irony is that the solution to this, in most cases, is actually to conduct more RCTs, and to do them earlier, before non-evidence-based beliefs cement into prevailing standards of care. Once you've unleashed a treatment or a device into routine practice, it's understandable that clinicians are going to do their best (through trial and error) to figure out how best to use it, and eventually norms get established. It's those norms that prevent us from then really testing the hypothesis in a properly conducted randomized trial.

Some might argue that docs may be able to figure out the best way to treat patients without RCT-level evidence. As a physician, I believe that trial and error may lead us down the right path. But as a clinical researcher, I'm also well aware of the many times throughout our history where our intuition and observations have led us astray.

Althouse: So let's talk about the timing issue, since you mentioned that one solution is to conduct more RCTs earlier, before beliefs become entrenched. The catch-22 is that if you insist on randomized trials too early in the lifecycle of a new device or procedure (eg, TAVR), while we're still on the learning curve, we may not see the full potential benefit of the experimental treatment. We may conclude that an effective treatment (once optimized) is actually ineffective because those kinks hadn't been worked out before the trial began.

But if you can't do a trial until all of the kinks are ironed out, you hit the PARACHUTE problem: The treatment is already used in practice, the preliminary data look great, people think it's the new "standard of care" so they are reluctant to enroll patients in a placebo-controlled trial of said treatment.

Yeh: One issue that people outside the field rarely acknowledge and may not understand is that devices and procedures are rarely fully formed at their outset. They are constantly evolving through iteration and improvement. Procedures we generally acknowledge to be life-saving today would have definitely failed had they been tested in randomized trials during their infancy. So how do you give sufficient time for such procedures to mature before testing them, while not allowing norms to set in so strongly that they become impossible to test? It's a huge question that I wish I knew the answer to. Maybe some of our readers can suggest potential solutions.

This was a fun discussion, Andrew. Hopefully we can do it again. If anyone reading has suggestions for future topics, please enter them in the comments section.


