To Err Is Human, to Forgive Is Statistical

Andrew J. Vickers, PhD


March 18, 2010

It is generally held that a study with a flaw is unreliable and should be discarded. Indeed, there exists a list of buzzwords, the mere mention of which is sometimes viewed as sufficient to consign a study to the scientific scrap heap: "unblinded," "retrospective," "selection bias," and "underpowered."

Statisticians tend to avoid "flawed = wrong" because they like to quantify. The most basic type of quantification is direction: I can't, off the top of my head, tell you the height of the Empire State Building or the distance to the floor of the Grand Canyon, but I do know that one goes up and the other goes down. Comparably, errors in studies can go up or down. Every study involves a null hypothesis that nothing interesting is going on (eg, the drug doesn't work, the toxin doesn't cause cancer). So one type of flaw makes it more likely that the null hypothesis will be rejected when it is in fact true (eg, the toxin is said to be harmful although it doesn't cause cancer). Alternatively, a flaw in a study might increase the chance of not rejecting the null hypothesis when it is false (eg, the drug is said to be ineffective even though it does actually work).

Critiquing a study is therefore not merely a case of identifying flaws, but also of working out what effects those flaws might have on the results. For example, I once read a letter to the editor in which a clinical trial was criticized for having a poor method of measuring outcome. Yet the trial was double-blind and the results were positive, and the only effect that poor outcome measurement can have on a blinded trial is to make it less likely to see a difference between groups. Hence this study flaw did not support the letter writers' argument that the trial had erroneously concluded in favor of the study drug. When this was pointed out, the authors responded that poor outcome ascertainment supported the argument that the trial was "unreliable in general." In other words, "flawed = wrong," and don't trouble me with any of your statistical ideas on how to understand statistical results.
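The direction of this particular flaw can be checked with a small simulation. What follows is only a sketch under stated assumptions, not a reconstruction of the trial in question: the function name, sample size, effect size, and noise level are all hypothetical, outcomes are assumed to be normally distributed, and blinding is modeled as measurement error that is added equally to both arms. A simple large-sample z test stands in for whatever analysis a real trial would use.

```python
import random
import statistics


def rejection_rate(true_effect, noise_sd, n_per_arm=50, n_trials=2000, seed=0):
    """Fraction of simulated two-arm trials in which |z| exceeds 1.96.

    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        # The true outcome has SD 1. Measurement error with SD noise_sd is
        # added in the same way to both arms, as blinding would ensure.
        control = [rng.gauss(0.0, 1.0) + rng.gauss(0.0, noise_sd)
                   for _ in range(n_per_arm)]
        treated = [rng.gauss(true_effect, 1.0) + rng.gauss(0.0, noise_sd)
                   for _ in range(n_per_arm)]
        diff = statistics.fmean(treated) - statistics.fmean(control)
        se = ((statistics.variance(treated) + statistics.variance(control))
              / n_per_arm) ** 0.5
        if abs(diff / se) > 1.96:
            rejections += 1
    return rejections / n_trials


# The drug works: how often does each trial design detect it?
power_clean = rejection_rate(true_effect=0.5, noise_sd=0.0)
power_noisy = rejection_rate(true_effect=0.5, noise_sd=2.0)
# The drug does not work: how often does a noisy trial wrongly say it does?
false_positive_noisy = rejection_rate(true_effect=0.0, noise_sd=2.0)

print(f"power, good outcome measure:          {power_clean:.2f}")
print(f"power, poor outcome measure:          {power_noisy:.2f}")
print(f"false positives, poor outcome measure: {false_positive_noisy:.2f}")
```

Under these assumptions, a noisy outcome measure should lower the chance of detecting a drug that works while leaving the false-positive rate close to the nominal 5%. The flaw pushes in one direction only, and it is not the direction the letter writers needed.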

Another example of current interest to me: we are trying to publish a paper showing large differences in outcome between 2 ways of treating cancer. Rejection letter after rejection letter has stated that, because patients were not randomized, "unmeasured confounders might influence the results." Except that no reviewer has given a specific example of a confounder that could explain the massive differences between groups (by the way, if you know of something that we don't commonly measure, but which doubles a patient's risk of dying from cancer, please drop me a line, as I have this feeling it might be important).
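To see why a vague appeal to "unmeasured confounders" is not enough, it helps to put numbers on what such a confounder would have to look like. The sketch below uses the standard bias formula for a single binary confounder, assuming the treatments themselves have no effect and the confounder multiplies risk by the same factor in both groups. The function name and the prevalence figures are hypothetical and are not taken from the paper described above.

```python
def spurious_risk_ratio(confounder_rr, prevalence_a, prevalence_b):
    """Risk ratio between groups A and B produced by confounding alone.

    Assumes (for illustration only) one binary confounder that multiplies
    risk by confounder_rr, and no true difference between the treatments.
    prevalence_a and prevalence_b are the proportions of patients in each
    group who carry the confounder.
    """
    # Baseline risk cancels in the ratio, so it is set to 1 here.
    risk_a = 1.0 + (confounder_rr - 1.0) * prevalence_a
    risk_b = 1.0 + (confounder_rr - 1.0) * prevalence_b
    return risk_a / risk_b


# A risk-doubling factor carried by 60% of one group and 30% of the other.
moderate_imbalance = spurious_risk_ratio(2.0, 0.6, 0.3)
# The same factor carried by everyone in one group and no one in the other.
total_imbalance = spurious_risk_ratio(2.0, 1.0, 0.0)

print(f"moderate imbalance: {moderate_imbalance:.2f}")
print(f"total imbalance:    {total_imbalance:.2f}")
```

Under these assumptions, a factor that doubles risk can produce a doubling of risk between groups only if it is present in every patient in one group and absent from every patient in the other. With a more plausible imbalance the spurious difference is far smaller, which is why a reviewer raising confounding ought to be able to name a candidate.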

To reflect that flaws can work in different directions, statisticians have named specific types of error: rejecting the null hypothesis when it is true is called "Type I" error; failing to reject the null hypothesis when it is false is known as "Type II" error. But on this point, I'd like to introduce 2 additional types of error: "Type III" error is giving the right answer to the wrong question; "Type O" error (pronounced "typo") is giving the wrong answer because of bad data. Which is to say, before you start worrying about the influence of methodology on the direction of your results, make sure you are asking the right question, and that your data are clean.

If you liked this article, you'll love Andrew Vickers' collection of stories on statistics: What is a p-value anyway?

