# Three Statistical Errors That Are Totally Trivial but Which Matter a Great Deal

Andrew J. Vickers, PhD

August 18, 2008

Tommy John, the renowned pitcher, once made 3 errors on a single play: He fumbled a grounder, threw wildly past first base, then bobbled the relay throw from right field and threw past the catcher. I was reminded of that story when peer-reviewing a paper describing a randomized trial. Near the start of the results section, the authors wrote something like, "Although there was no difference in baseline age between groups (P=.458), controls were significantly more likely to be male (P=.000)."

This goes one better than Tommy John, because there are actually 4 errors in this single sentence (or perhaps even 4.5).* The first error has been discussed in a previous article (please see Related Links): You cannot conclude "no difference" between groups on the basis of a high P value because failing to prove a difference is not the same as proving no difference.

Here are the other 3 errors:

1. P values for baseline differences between randomized groups. P values are used to test a hypothesis -- in this case, a null hypothesis that can be informally stated as: "There is no real difference between groups; any differences we see are due to chance alone." But this is a randomized trial, so any differences between groups must be due to chance alone. In short, we are testing a null hypothesis that we know to be true. Nonetheless, reporting P values for baseline differences in randomized trials remains routine: When I recently refused a clinician's request to calculate these P values for baseline differences, he sent me references to several recent papers published in high-profile journals to show that what I thought was wrong was actually quite common. Given that copying others is not necessarily the best path to statistical truth, I politely declined a second time.

2. Inappropriate levels of precision. The first P value in our multierror sentence is reported to 3 significant figures (P=.458). What do the 5 and 8 tell us here? We are already way above statistical significance; a little bit more or less isn't going to change our conclusions, so reporting the P value to a single significant figure (ie, P=.5) is fine. Inappropriate levels of precision are ubiquitous in the scientific literature, perhaps because a very precise number sounds more "scientific." One of my favorite examples is a paper that reported a mean length of pregnancy of 32.833 weeks, suggesting that we want to know the time of conception to the nearest 10 minutes. This would require some rather close questioning of the pregnant couple.

3. Reporting a P value of zero. No experimental result has a zero probability; even if I throw a billion unbiased coins I have a small, but definitely non-zero, chance of getting all heads. I once pointed this out in a peer review, only to have the authors reply that the statistical software had given them P=.000, so the value must be right.
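The argument in item 1 can be checked with a small simulation. The sketch below is illustrative only: the trial size, the age distribution (mean 60, SD 10), and the function names are assumptions made for the example, and a large-sample z test stands in for whatever test a real paper might use. Because every simulated patient comes from the same population and is then allocated at random, the null hypothesis is true by construction, so about 1 trial in 20 should still show a "significant" baseline difference.

```python
import random
from statistics import NormalDist, mean, variance


def baseline_p_value(rng, n_per_arm=100):
    """Two-sided P value for the baseline age difference in one simulated trial."""
    # All patients come from one population (hypothetical: mean 60, SD 10).
    ages = [rng.gauss(60, 10) for _ in range(2 * n_per_arm)]
    rng.shuffle(ages)  # random allocation to the two arms
    treated, control = ages[:n_per_arm], ages[n_per_arm:]
    # Large-sample z test (normal approximation), chosen only to keep the
    # sketch inside the standard library.
    se = (variance(treated) / n_per_arm + variance(control) / n_per_arm) ** 0.5
    z = (mean(treated) - mean(control)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def fraction_significant(n_trials=2000, alpha=0.05, seed=1):
    """Share of simulated randomized trials with baseline P below alpha."""
    rng = random.Random(seed)
    hits = sum(baseline_p_value(rng) < alpha for _ in range(n_trials))
    return hits / n_trials


if __name__ == "__main__":
    share = fraction_significant()
    print(f"Share of trials with baseline P < .05: {share:.3f}")
```

The printed share should sit close to .05 however the parameters are varied, which is the point: in a randomized trial, a small baseline P value tells us only that chance did what chance occasionally does.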
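Items 2 and 3 are, in practice, formatting problems, so a short helper can illustrate one way to avoid them. This is a sketch under stated assumptions rather than a reporting standard: the name `format_p`, the "P<.001" floor, and the rule of 1 significant figure at or above .1 and 2 below it are all choices made for the example. The other two functions simply redo the arithmetic behind the coin and pregnancy examples.

```python
import math

MINUTES_PER_WEEK = 7 * 24 * 60


def format_p(p):
    """Format a P value without false precision and never as zero.

    The cut-offs are an illustrative convention, not an official standard.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("a P value must lie between 0 and 1")
    if p < 0.001:
        # Software that prints .000 means "smaller than it displays", not zero.
        return "P<.001"
    digits = 1 if p >= 0.1 else 2  # significant figures to keep
    text = f"{p:.{digits}g}"
    if text.startswith("0."):
        text = text[1:]  # drop the leading zero, matching "P=.458" style
    return f"P={text}"


def log10_prob_all_heads(n_coins):
    """log10 of the chance that n fair coins all land heads (2 ** -n)."""
    return -n_coins * math.log10(2)


def implied_resolution_minutes(decimal_places):
    """Time resolution, in minutes, implied by reporting weeks to that many decimals."""
    return 10 ** -decimal_places * MINUTES_PER_WEEK


if __name__ == "__main__":
    print(format_p(0.458))   # P=.5
    print(format_p(0.0))     # P<.001
    print(f"log10 P(all heads, 1e9 coins) = {log10_prob_all_heads(10**9):.0f}")
    print(f"3 decimals of a week = {implied_resolution_minutes(3):.2f} minutes")
```

The log is used for the coin example because the probability itself is far too small to hold in a floating-point number, which is exactly how a finite but tiny probability ends up displayed as zero.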

This gets to the heart of why I care about these errors even though they don't make much difference to anything (why don't I just ignore those unnecessary decimal places?). Many people seem to think that we statisticians spend most of our time doing calculations, but that is perhaps the least interesting thing that we do. Far more important is that we spend time looking at numbers and thinking through what they mean. If I see any number in a scientific report that is meaningless -- a P value for baseline differences in a randomized trial, say, or a sixth significant figure -- I know that the authors are not being careful about what they are doing; they are just pulling numbers from a computer printout. And that doesn't sound like science to me.

*Note: About that "half an error": the authors tell us that "baseline" age was no different between groups. This was a trial on pain in which all patients were on study for the same period of time, so unless patients in different treatment groups grew old at different rates, there is no reason to tell us that it is "baseline" age that is being compared.