Free Yourself From Study Fragility and P Values

F. Perry Wilson, MD, MSCE


September 25, 2019

Welcome to Impact Factor, your weekly dose of commentary on a new medical study. I'm Dr F. Perry Wilson.

This week, I'm taking a break from our regularly scheduled programming to talk about a newish concept percolating in the evidence-based medicine space, something called the "fragility index." And no, it's not another frailty measure for elderly patients; it's about the stability of results in clinical studies.

A study in Lancet Oncology found, for example, that of 17 recent randomized trials that resulted in a cancer drug receiving FDA approval, nine had a fragility index of 2 or less, meaning that if just two "events" in the study were converted to non-events, the results would no longer be statistically significant.
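To make that definition concrete, here is a minimal Python sketch of a fragility index calculation for a two-arm trial. The trial counts are hypothetical, the function names are illustrative, and the use of Fisher's exact test is an assumption; the Lancet Oncology analysis may have followed a different procedure.

```python
from math import comb

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= observed * (1 + 1e-9))

def fragility_index(events_1, n_1, events_2, n_2, alpha=0.05):
    """Count events converted to non-events before P rises to alpha or above."""
    converted = 0
    while fisher_two_sided(events_1, n_1 - events_1, events_2, n_2 - events_2) < alpha:
        # Convert one event in the arm with the higher event rate.
        if events_1 / n_1 > events_2 / n_2:
            events_1 -= 1
        else:
            events_2 -= 1
        converted += 1
    return converted

# Hypothetical trial: 30 of 100 deaths on control, 15 of 100 on treatment.
print(fragility_index(30, 100, 15, 100))
```

A result that is not significant to begin with gets an index of 0 under this sketch, since the loop never runs.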

Typically, fragility index comes up when people are trying to disparage statistically significant findings in the medical literature. But I want us to think a little bit deeper about this, because frankly, I don't really like this metric. It seems so beholden to our conception of a P value of .05 as this magical thing that defines truth, when a P value is just a continuous metric like any other.

Let me walk through a quick example to show you what I mean.

Imagine I find a coin on the street, a quarter, and I want to know if it is a "fair" coin. Who knows—maybe someone has messed with it, and I don't want no adulterated currency jingling in my dungarees. So I do an experiment: I flip the coin 100 times.

Adding up my results, I find the following: I got heads 60 out of the 100 flips.

Now, if this were a fair coin, I'd expect 50 heads out of 100. So perhaps my dander is up a bit. What is going on at the Denver mint?

But wait, you say. Just because a fair coin would have 50 heads on average doesn't mean it has to come up with 50 heads. There's going to be a range there. In fact, the range looks something like this.

[Figure: distribution of the number of heads in 100 flips of a fair coin]
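That range can be computed directly. The short Python sketch below uses the exact binomial distribution for a fair coin; the variable and function names are illustrative, and the "within two standard deviations" window is a convenient assumption rather than anything special.

```python
from math import comb, sqrt

FLIPS = 100

def prob_heads(k, flips=FLIPS):
    """Exact probability of exactly k heads in `flips` tosses of a fair coin."""
    return comb(flips, k) / 2 ** flips

mean = FLIPS * 0.5               # expected number of heads
sd = sqrt(FLIPS * 0.5 * 0.5)     # standard deviation of the head count

# Window of two standard deviations on either side of the mean.
low, high = int(mean - 2 * sd), int(mean + 2 * sd)
within = sum(prob_heads(k) for k in range(low, high + 1))

print(f"mean={mean}, sd={sd}")
print(f"P({low} <= heads <= {high}) = {within:.3f}")
```

Most experiments with a fair coin land inside that window, but not all of them do.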


So, how weird are the 60 heads that I saw?

Well, assuming that the coin on the street was fair, I'd see a result as weird as the one I got about 4.5% of the time.


Or in P value terms, .045.
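For readers who want to check the arithmetic, the sketch below reproduces that figure in Python. It assumes a two-sided test using the normal approximation to the binomial with no continuity correction; the article does not say which test was used, and an exact binomial test would give a somewhat different number. The function name is illustrative.

```python
import math

def two_sided_p(heads, flips, p_fair=0.5):
    """Two-sided P value for a head count, via the normal approximation."""
    expected = flips * p_fair
    sd = math.sqrt(flips * p_fair * (1 - p_fair))
    z = (heads - expected) / sd
    # erfc(|z| / sqrt(2)) is the two-tailed area of a standard normal beyond |z|.
    return math.erfc(abs(z) / math.sqrt(2))

p_60 = two_sided_p(60, 100)
print(f"P value for 60 heads in 100 flips: {p_60:.4f}")
```

Sixty heads sits exactly two standard deviations above the expected 50, which is why the P value lands just under .05.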


In other words, my results are statistically significant. By our conventional definition, I will be calling my local numismatist and making a complaint.

But wait, you say. These results are fragile. If just one of those 60 heads had actually come up tails, I'd have 59 heads! And my P value would be .07.


That changes everything. This is not statistically significant. Pitchforks down.
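The one-flip swing can be reproduced with a coin-flip analogue of the fragility index. This is only a sketch: the function names are hypothetical, and the P values again assume a two-sided normal approximation without continuity correction.

```python
import math

def two_sided_p(heads, flips):
    """Two-sided P value against a fair coin, via the normal approximation."""
    z = (heads - flips / 2) / math.sqrt(flips / 4)
    return math.erfc(abs(z) / math.sqrt(2))

def coin_fragility(heads, flips, alpha=0.05):
    """Heads that must become tails before the result stops being significant."""
    changed = 0
    while heads - changed > flips / 2 and two_sided_p(heads - changed, flips) < alpha:
        changed += 1
    return changed

p_59 = two_sided_p(59, 100)
print(f"P value for 59 heads: {p_59:.3f}")
print(f"Coin fragility for 60 heads: {coin_fragility(60, 100)}")
```

The index only counts how far the result sits from the .05 line; it says nothing about how different 60 heads and 59 heads really are.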

But what has really changed here? Proponents of the fragility index would say that we have taken a positive study and, with the most minor of changes, made it negative.

How frightening that the scientific literature should be so delicate, so ephemeral.

But we're missing the point. The P value is not magical. It just provides information. How weird was it that I got 60 heads? Pretty weird. A deviation that large happens only 4.5% of the time. How weird was it that I got 59 heads? Pretty weird! A deviation that large only happens 7% of the time.

The problem isn't that medical studies are fragile; it's that we are way too beholden to the P value. We need to be willing to reject studies that have a P value of .045 if the hypothesis being tested is unlikely or the methodology is flawed. We need to be able to accept studies with a P value of .07 if the hypothesis being tested is very likely.

Look at it this way: Assume you're really worried about fragile medical studies; what are the potential solutions?


First, we could lower the P value threshold for statistical significance. There's an ongoing debate about that. Of course, there will be "fragile" studies at any threshold. If the P value threshold were .01, people might complain that studies are fragile because small changes in outcomes will change the P value from .009 to .011.
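The coin experiment shows the same thing. The sketch below finds the smallest head count that counts as significant at each threshold and compares it with the count one flip lower. The helper names are hypothetical, and the P values assume a two-sided normal approximation without continuity correction.

```python
import math

def two_sided_p(heads, flips):
    """Two-sided P value against a fair coin, via the normal approximation."""
    z = (heads - flips / 2) / math.sqrt(flips / 4)
    return math.erfc(abs(z) / math.sqrt(2))

def smallest_significant_heads(flips, alpha):
    """Smallest above-average head count with a P value below alpha."""
    for heads in range(flips // 2, flips + 1):
        if two_sided_p(heads, flips) < alpha:
            return heads
    return None

for alpha in (0.05, 0.01):
    edge = smallest_significant_heads(100, alpha)
    print(
        f"alpha={alpha}: {edge} heads gives P={two_sided_p(edge, 100):.4f}, "
        f"{edge - 1} heads gives P={two_sided_p(edge - 1, 100):.4f}"
    )
```

Whatever the threshold, some result sits one flip away from the other side of it.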

Or maybe you think we should just do bigger studies. But realize here that if the effect size is the same, doing a larger study will just lower the P value. And if you do a really large study and get a barely significant P value, you're probably in the realm of a statistically significant finding that doesn't have much clinical impact.
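Both halves of that argument can be illustrated with the coin. In the sketch below, the first loop holds the observed share of heads at 60% while the number of flips grows; the second shows how small a deviation from 50% is enough to sit exactly two standard deviations out as the study gets larger. The sample sizes and function names are assumptions chosen for illustration, and the P values use a normal approximation.

```python
import math

def two_sided_p(heads, flips):
    """Two-sided P value against a fair coin, via the normal approximation."""
    z = (heads - flips / 2) / math.sqrt(flips / 4)
    return math.erfc(abs(z) / math.sqrt(2))

def barely_significant_share(flips, z=2.0):
    """Share of heads that sits exactly z standard deviations above 50%."""
    return 0.5 + z * 0.5 / math.sqrt(flips)

sizes = (100, 200, 400, 1000)
p_values = [two_sided_p(round(0.6 * n), n) for n in sizes]
for n, p in zip(sizes, p_values):
    print(f"{n} flips at 60% heads: P = {p:.2e}")

for n in (100, 10_000, 1_000_000):
    print(f"{n} flips: {barely_significant_share(n):.3%} heads is two SDs from fair")
```

With a million flips, a coin that comes up heads only a hair more often than half the time clears the same bar that required 60% heads at 100 flips.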

We see so-called "fragile" studies because studies are designed with the P value threshold of .05 in mind, and because studies are expensive and expose people to potential risk. If you're spending $50,000 per patient to run a clinical trial of a new cancer drug, and all you need is a P value of less than .05 to get FDA approval, well, why would you enroll more patients than you need? Calling that trial fragile after the fact is changing the rules after the match is over.

The real solution is to forget about .05 and interpret P values in the context of the underlying hypothesis of the study.

Take my quarter. Are my 60 out of 100 heads really going to convince you that it's a biased coin? That someone actually shaved part of it or weighted it in a weird way? Probably not. I just found the thing on the street, after all. The hypothesis being tested—that it is a biased coin—was very unlikely, so we should see those 60 heads as a weird fluke, not confirmation that the coin is weird. Use the data to update your prior probabilities.
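Updating a prior can be sketched with Bayes' rule. Every number below other than the 60 heads in 100 flips is a hypothetical assumption: the prior of 1 in 1000 that a found quarter is rigged, and the idea that a rigged coin would land heads 60% of the time.

```python
def posterior_biased(heads, flips, prior, p_biased):
    """Posterior probability that the coin is biased, by Bayes' rule."""
    tails = flips - heads
    # Likelihood of this exact sequence under each hypothesis; the binomial
    # coefficient is the same for both, so it cancels and is left out.
    like_biased = p_biased ** heads * (1 - p_biased) ** tails
    like_fair = 0.5 ** flips
    return prior * like_biased / (prior * like_biased + (1 - prior) * like_fair)

skeptic = posterior_biased(60, 100, prior=0.001, p_biased=0.6)
coin_toss = posterior_biased(60, 100, prior=0.5, p_biased=0.6)
print(f"Posterior with a 1-in-1000 prior: {skeptic:.4f}")
print(f"Posterior with a 50-50 prior:     {coin_toss:.4f}")
```

Under these assumptions, the same data that look damning when the prior is 50-50 leave a skeptic still well under 1% convinced that the coin is rigged.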

So if you're reading a study and someone says, "But wait. If just two people who survived had died, or vice versa, this positive study would be negative," you should say, "That's so interesting. I also agree that a P value threshold of .05 is arbitrary and that a study should be interpreted in light of the strength of its underlying hypothesis."

And as your colleague slowly backs away, feel comfortable that the difference between a P value of .049 and .051 isn't much of a difference at all.

