The Year’s Most Important Study Adds to Uncertainty in Science

John M. Mandrola, MD


November 02, 2018

Use of evidence is what separates doctors from palm readers. Evidence helps prevent us from fooling ourselves. It tamps down our hubris.

Yet some expert clinicians have rightly criticized the overuse of evidence-based practice because it can lead to unthinking algorithmic medicine. That sort of practice is scary because evidence rarely provides easy answers: yes, do this; no, don’t do that.

As medical science progresses, patients increasingly depend on clinicians to help translate evidence. To do that we must ask: Did the investigators ask the right question, did they recruit patients similar to everyday patients, did they choose fair comparators, and did the statistically significant results reach clinical relevance? These are tough enough to sort through.

Now, findings from an elegant study[1] by researchers led by Professor Brian Nosek at the University of Virginia in Charlottesville make the job of translating medical evidence even harder. His team has shown that the choices researchers make in analyzing a data set can substantially affect the results.

For years, when I read a scientific paper, I’ve thought that the data yield the published result. What Nosek and his colleagues have found is that results can be highly contingent on the way the researchers analyze the data. And, get this: There is little agreement on the best way to analyze data.

The Study

Nosek’s group recruited 29 teams comprising 61 researchers to use the same data set to answer one simple question: Are professional soccer referees more likely to give red cards for foul play to dark-skin–toned players than to light-skin–toned players? Red cards result in instant ejection from the game, whereas a yellow card allows players to continue unless they incur another infraction.

This was a multiyear project that included building a data set of sports statistics largely from the 2012-2013 season for four European men’s premier leagues, then recruiting teams of researchers from varying fields and experience to do an initial analysis. In the first phase of the experiment, the teams submitted summaries of their approach to answering the question but worked independently.

In the next phase, Nosek’s team brought the 29 groups together for a round-robin of peer evaluations in which each team provided feedback on the other teams’ analytic methods. An aggregate of these evaluations was provided to each of the teams, which allowed the groups to learn from each other’s approach.

In the next phase the teams, having learned from their peers, could change their approach to the analysis and possibly change their conclusions.

In the sixth phase of the study, investigators discussed and debated the final analyses. This prompted some teams to perform additional testing to assess whether results were driven by a few outliers—they were not. The discussion led to the discovery that variability of the results occurred not just because of analytic methods but also because of the choice of covariates.

The Results

The 29 teams chose 21 unique combinations of covariates and used many different analytic techniques, ranging from simple linear regression to complex multilevel regression and Bayesian approaches.

The point estimate of the odds ratio for effect size ranged from 0.89 (slightly negative) to 2.93 (moderately positive).

Twenty teams (69%) found a statistically significant effect and nine teams (31%) did not. Neither the level of expertise nor the peer ratings nor the prior beliefs of investigators (assessed in surveys before investigators saw the data set) explained the variability in effect size.
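To make the quantity each team was estimating concrete, here is a minimal sketch of a crude odds ratio with a Wald 95% confidence interval from a 2×2 table. The counts are invented for illustration and are not from the study:

```python
import math

# Hypothetical counts (not the study's data): red cards vs. no red card
# for dark- and light-skin-toned players.
dark_red, dark_other = 40, 960
light_red, light_other = 25, 975

odds_ratio = (dark_red / dark_other) / (light_red / light_other)

# Wald 95% confidence interval, computed on the log-odds scale
se = math.sqrt(1/dark_red + 1/dark_other + 1/light_red + 1/light_other)
lo = math.exp(math.log(odds_ratio) - 1.96 * se)
hi = math.exp(math.log(odds_ratio) + 1.96 * se)

print(f"OR = {odds_ratio:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Note that with these toy counts the odds ratio is 1.63 yet the confidence interval still crosses 1, so the effect would be declared nonsignificant: a small shift in counts, or in how covariates adjust them, can flip the verdict.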


This is big because everyone understands that analyzing different data or asking different questions yields varying results. These were the same data and the same question!

When you read a research study, the methods section usually has one or two sentences describing the (singular) analytic method. This paper shows that the same data set can yield variable results—some statistically significant and others not.

What makes this previously undescribed area of heterogeneity so striking is that most of the analytic approaches used in Nosek’s study were defensible and rated as reasonable by the other methodologists.

What These Findings Are Not

These analysis-contingent results are not the same as P-hacking or the garden of forking paths. P-hacking (aka cheating) occurs when researchers actively pursue significance and do numerous analyses of the data, then select and publish the method that produces the significant result. In this study, each research team set out their method before they had the data.

The garden of forking paths problem occurs when researchers refine their analysis plan after patterns in the data have been observed.[2] For instance, if an expected result does not show up as a main effect, the researchers can then look for interactions. Nosek and colleagues explained that because they asked only one basic question—were soccer referees more likely to give red cards to players with darker skin—this limited the problem of forking paths. What’s more, the 29 teams had no incentive to find positive results.

Clinical Relevance

Don’t be lulled into thinking this is merely an issue with social science questions. In an email, Brahmajee Nallamothu, MD, from the University of Michigan in Ann Arbor, pointed me to an excellent clinical example: In 2010, JAMA published a paper using the UK General Practice Research Database showing that bisphosphonates aren’t associated with cancer,[3] but 1 month later, the BMJ published a paper based on the same database showing that bisphosphonates are associated with cancer.[4]

What about the recent analysis of a UK database that reported a link between angiotensin-converting enzyme inhibitor use and lung cancer?[5] The point estimate of the hazard ratio just barely met significance at 1.14, with a 95% confidence interval of 1.01 to 1.29. Would another analytic method have produced nonsignificant results? What about 10 different analytic methods?
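To see just how close that result sits to the significance threshold, one can back-calculate the approximate P value from the published numbers, assuming the reported interval is a symmetric Wald interval on the log scale:

```python
import math

# Reported values from the ACE inhibitor/lung cancer analysis
hr, ci_lo, ci_hi = 1.14, 1.01, 1.29

# Standard error on the log scale, recovered from the CI width
se = (math.log(ci_hi) - math.log(ci_lo)) / (2 * 1.96)
z = math.log(hr) / se

# Two-sided P value via the normal CDF
p = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))
print(f"z = {z:.2f}, p = {p:.3f}")
```

The back-calculated P value lands only a little below 0.05, which is exactly the zone where a different defensible choice of covariates or model could push the result across the line.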

Also Pertinent to RCTs 

The first question I asked Professor Nosek when we spoke on the phone was whether analysis-contingent results could apply to randomized controlled trials (RCTs). His “yes” answer alarmed me. Nosek said that wherever there is flexibility of choices, such as which outcomes to measure, which patients to include, and how to dichotomize variables, you can expect variability.

Harlan Krumholz, MD, from Yale University in New Haven, Connecticut, also saw relevance to the RCT. By email, he wrote, “For any given question, different groups could address it very different ways—even with an RCT…. If you give them the question with freedom to design the experiment—they could conclude different things.”

Nallamothu underscored the reality of variability in RCTs by noting the divergent results from the seemingly similar MitraClip trials, Mitra-FR[6] and COAPT.[7]

You may counter this argument by saying that RCTs and their analytic methods are pre-registered and this prevents researchers from switching methods after seeing the data. While more and more trials are pre-registered, Nosek pointed out that, in reality, lack of specificity in describing protocols can allow researchers flexibility in the final analysis.

In a paper from the Proceedings of the National Academy of Sciences,[8] he and his coauthors list no fewer than nine practical challenges to data analysis even with pre-registration. The short message from this long paper is captured in this quote: “Deviations from data collection and analysis plans are common in the most predictable investigations.”

Another relevant and recent example of flexibility in RCTs concerns the problem of how changing trial endpoints can influence results.[9] This issue has provoked debate on the yet-to-be-completed ISCHEMIA trial of PCI vs medical therapy in patients with stable coronary heart disease.[10,11]

Multiple Analyses: A Path to Truth?

A wide-angle view of Nosek and colleagues’ paper reveals a bit of good news, and perhaps a path toward scientific truth. In Figure 2, the authors show the 29 different odds ratios and confidence intervals in descending order. While roughly two thirds of the point estimates yielded significant positive effects and one third did not, the overall picture shows relatively consistent results. Most of the confidence intervals overlap, and, when they are taken together, one can see a trend toward a positive effect—so, yes, soccer referees likely do give more red cards to players with dark skin tones.  

That got me thinking: Why don’t investigators do multiple analyses more often? Nosek told me that statistical software makes it relatively easy to run different analyses on the data. Krumholz added that the discovery of data-contingent results points to the value of open science and data sharing, since this would allow many designs to come forward.

A team of Belgian and US authors termed such a process a multiverse analysis.[12] They wrote that the thinking behind doing multiple analyses of data “starts from the observation that data are not passively recorded in an experiment or an observational study. Rather, data are to a certain extent actively constructed.”

This group used a multiverse analysis to challenge a provocative analysis[13] suggesting that a woman’s menstrual cycle influences religiosity and political attitude. When they analyzed the same data in other ways, with different, but defensible methods, they discovered that most P values did not indicate significant differences.

To me, the best part of the multiverse approach to a scientific question is that it addresses a limitation of pre-registration. Namely, while pre-commitment to an experimental method is vital, doing so allows for only one—of many—analytic approaches. Perhaps medical science would be more reliable, more trusted, if scientists heeded the advice Nosek and colleagues offered in their concluding remarks: “We encourage scientists to come up with every different defensible analysis possible, run them all, and then compute the likelihood that the number of observed significant results would be seen if there was really no effect.”
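The multiverse idea can be sketched in a few lines of Python. Everything here is invented for illustration: the players are simulated with a modest built-in effect, and the two “researcher choices” (which leagues to keep, whether to drop players with few appearances) stand in for the many defensible analytic decisions the Nosek teams actually faced:

```python
import math
import random
from itertools import product

random.seed(1)

# Simulated player records (hypothetical, not the study's data)
players = [{"dark": random.random() < 0.5,
            "league": random.choice("ABCD"),
            "games": random.randint(1, 40)} for _ in range(4000)]
for p in players:
    # Build in a modest true effect for dark-skin-toned players
    p["red"] = random.random() < (0.035 if p["dark"] else 0.025)

def odds_ratio_ci(rows):
    """Crude odds ratio with a Wald 95% CI from a 2x2 table."""
    a = sum(r["red"] and r["dark"] for r in rows) or 0.5          # continuity fix
    b = sum((not r["red"]) and r["dark"] for r in rows)
    c = sum(r["red"] and (not r["dark"]) for r in rows) or 0.5
    d = sum((not r["red"]) and (not r["dark"]) for r in rows)
    or_ = (a / b) / (c / d)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)
    return (or_,
            math.exp(math.log(or_) - 1.96 * se),
            math.exp(math.log(or_) + 1.96 * se))

# Each combination of defensible choices is one "universe"
universes = []
for leagues, min_games in product([set("ABCD"), set("AB")], [1, 10]):
    rows = [r for r in players
            if r["league"] in leagues and r["games"] >= min_games]
    universes.append(odds_ratio_ci(rows))

significant = sum(ci_lo > 1 for _, ci_lo, _ in universes)
print(f"{significant} of {len(universes)} universes significant")
for or_, ci_lo, ci_hi in universes:
    print(f"OR = {or_:.2f}, 95% CI [{ci_lo:.2f}, {ci_hi:.2f}]")
```

Even in this toy version, the point estimates shift and the significance verdict can differ from universe to universe—the same pattern Nosek’s 29 teams produced by hand.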


What this paper taught me, a user of medical science, is to be even more cautious in drawing conclusions from one or two papers. Would the result of the chosen analysis hold up to other reasonable ways to analyze the data?

The other clear lesson: Embracing the behaviors of open science, such as pre-registration, crowdsourcing, and doing multiple analyses, may lessen the number of “positive” newsworthy papers, but this may actually speed the rate of true medical progress.

Fewer scientific reversals would also likely boost the public’s trust in science.

