Data Torture and Dumb Analyses: Missteps With Big Data

A Discussion With Biostatistician Frank Harrell, PhD

Robert A. Harrington, MD; Frank E. Harrell, PhD


August 06, 2018

Robert A. Harrington, MD: Hello. This is Bob Harrington from Stanford University. We'll be having an interesting podcast today on Medscape Cardiology with a good friend and colleague, Frank Harrell.

There is no question that we're living in an unprecedented time in regard to biomedical research. We have an incredible discovery engine going on right now where we can measure virtually any human biologic process. This includes the various "omics" (genomics, proteomics, metabolomics) and also things that measure continuously a variety of physiologic measurements, like heart rate, temperature, and heart rate variability.

All of this has given us the ability to collect enormous amounts of data on individuals. It's also given us a tremendous ability to analyze those data in ways that perhaps we've not been able to do before, in part because of cloud computing and increasingly advanced computational methods. Many of us are interested in the concept of how to take this continually accruing information and include things like social media, GPS tracking, and zip code to gain insights into human health and disease that go beyond what we've been able to do before.

I'm really privileged to have as a guest my long-time friend and colleague, Dr Frank Harrell. Frank is professor of biostatistics at Vanderbilt University School of Medicine. He's also an expert statistical advisor to the US Food and Drug Administration Center for Drug Evaluation and Research and their biostatistics group. Frank is the perfect person to have a conversation with about how we've arrived at this point in history in biomedical research. We can hear his ideas on the opportunities, challenges, and potential pitfalls he sees, particularly as we talk about some of the new advanced computational methods, including machine learning, neural networks, and so on. Frank, thanks for joining me here on Medscape Cardiology.

Frank E. Harrell, PhD: Absolute pleasure to be here, Bob.

Complexities of Having Vast Amounts of Data

Harrington: Do you want to give some broad comments on how you are thinking about the enormous amount of data that is helping inform the human health experience, and some challenges that it leaves for the community?

Harrell: Yes, it's hard to know where to start because there is a genomics view of things and then there are all of the other fields—including, as you mentioned, modern, continual physiologic monitoring, which I actually believe has more promise than most of the other methods.

Harrington: Yes, let's stay away from genomics right now and talk about the larger context of data.

Harrell: Okay. The sheer vastness of data, along with its ready availability, is a challenge to everyone, but I think a lot of issues are not really well understood by clinician researchers and some biostatisticians. One of the things you learn about over time is that you don't necessarily get smarter as you are in a job for more and more decades, but you do gain perspective.

One of the perspectives that statisticians get good at over time is knowing how much information content is needed to make a certain conclusion about something. Whether you're trying to better diagnose patients, better prognosticate, or compare therapies, a certain amount of information is needed in order to have any hope of answering a question.

There is sort of a separate question about bias, and that is a really big issue in treatment comparisons. Even when you are not doing treatment comparisons, knowing what the limitations of the data are is something that a lot of people are not yet good at. They have this mistaken impression that because of the nature and ready availability of data, the data must have the information buried inside of it somewhere that allows you to answer almost any question.

Data 'Torture'

Harrell: Somebody tweeted the other day, and I was quick to react to it, that they felt that there are new causal inference methods that can tell you in real time as a clinical trial is underway which patients are receiving the most benefit from the treatment. I just pointed out that that is mathematically impossible to do. It's almost impossible to do at the end of the study, but while the study is unfolding, it's really hard to do. There is this kind of crazy analogy between data torture and human torture. We know in human torture—and there is lots of evidence for this—that if you torture a human to obtain information, the human will confess to whatever the torturer wants to hear.

The same will happen with data. If you torture data, the data will confess and tell you what you want to hear. Then the researcher kind of moves on and tries to make use of that, but it's not reliable. It's no more reliable with data than it is with torturing humans. There is this belief that if you use modern methods, all of a sudden there is more information in the data than there ever was.

You are seeing people apply machine learning, especially to more rare diseases like specific types of cancer, where they are trying to find out who is likely to have metastasis, or whatever they are trying to predict. They may have a limited number of patients, but they may have an unlimited number of possible features, like protein expression, gene expression, and SNPs [single nucleotide polymorphisms]. And now we are hearing all of this hype about the microbiome and all sorts of other "omes."

If you have a limited number of subjects and you have tens of thousands of possible predictors, there is no mathematical way for that type of research to actually work. With one exception: if there is a smoking gun which somehow the whole world missed and no one published on before (which is unlikely). If there is a smoking gun, like, "If you have that characteristic, then everyone has a disease; and if you don't have it, then no one has the disease," you can find it, no matter what else is thrown into the data. That is just not the way things happen with research in the modern era.
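To make the point about few subjects and many candidate predictors concrete, here is a minimal simulation sketch. It is an illustration only, not an analysis discussed in the conversation. The function names, the 50-subject cohort, and the feature counts are assumptions chosen for the example. Every feature is pure random noise, so any strong correlation with the outcome is a false discovery by construction.

```python
import math
import random


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_noise_correlation(n_subjects, n_features, seed=0):
    """Largest |r| between a random outcome and n_features random features.

    Hypothetical helper: outcome and features are independent Gaussian
    noise, so the true correlation of every feature with the outcome is 0.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
        best = max(best, abs(pearson_r(feature, outcome)))
    return best


if __name__ == "__main__":
    # Assumed scenario: 50 subjects, a growing pool of noise features.
    for n_features in (1, 100, 2000):
        r = max_noise_correlation(50, n_features, seed=1)
        print(f"{n_features:5d} noise features -> largest |r| = {r:.2f}")
```

With a fixed seed, the best-looking noise feature can only look better as the pool of candidates grows, which is the "data will confess" effect in miniature: screening many features on few subjects produces an impressive-looking winner even when there is nothing to find.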

Sample Size

Harrell: I blogged about this from the standpoint of, how many subjects would you need to [do a good] study on a single patient characteristic and relate that to something? You can think about this logically. The minimum sample size you would need to do something complex, like neural network, is going to be greater than if you had preselected one feature and wanted to see how that relates to patient outcome. At the heart of that is estimating something like a correlation coefficient. How many patients does it take to estimate a single correlation coefficient?

The answer is, it's over 300 patients to estimate only that. That is with a highly focused prespecified single candidate feature for prediction. If that takes more than 300 and you publish a complicated machine learning result with fewer than 100 people that used more than 1000 candidate features, the hope of that actually being sustained is just zero. Maybe you will recall back-to-back papers, maybe 10 years ago, on determining variants that predict breast cancer risk.[1,2]

They used the same sort of cohorts of women, same sort of screening—SNPs and GWAS [genome-wide association studies]—and everything was similar in the setup. In these two papers, the findings had not a single SNP in common. It was a stunning example of the impossibility of learning that much from so little.
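The "over 300 patients" figure for a single correlation coefficient can be sanity-checked with a back-of-the-envelope calculation. The sketch below uses the standard Fisher z approximation, in which atanh(r) has a standard error of about 1/sqrt(n - 3). The precision target (a 95% margin of error of about ±0.1 when the true correlation is near zero) is an assumption made for this example and is not stated in the conversation. The function names are hypothetical.

```python
import math

Z_95 = 1.96  # two-sided 95% normal quantile


def margin_of_error(n, z=Z_95):
    """Approximate 95% margin of error for r when the true r is near 0.

    Uses the Fisher z approximation: atanh(r) has SE ~ 1 / sqrt(n - 3).
    """
    return math.tanh(z / math.sqrt(n - 3))


def n_for_margin(margin, z=Z_95):
    """Smallest n whose margin of error (true r near 0) is <= margin."""
    return math.ceil((z / math.atanh(margin)) ** 2 + 3)


if __name__ == "__main__":
    print("n needed for a margin of 0.1:", n_for_margin(0.1))
    for n in (50, 100, 300, 400):
        print(f"n = {n:3d} -> margin of error ~ {margin_of_error(n):.3f}")
```

Under these assumptions the required sample size comes out at about 385, consistent with "over 300," while a study of 100 subjects pins down one prespecified correlation only to within roughly ±0.2.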

Harrington: What is the road forward, Frank? Certainly, one of the opportunities today is the vastness of the data, and sometimes we are so enamored by the vastness that we can get lost in it.

There are tools that can help us make sense of the data, but in some ways what I'm hearing you say is that basic principles still apply. Not forgetting about your type 1 error is one of the issues that you are getting at here in terms of your false discovery rate. How we even think about visualizing the data might be helpful as we are looking for things. Do you want to talk about the type 1 error and data visualization, two topics that you've spent a lot of time on?

Harrell: Yes, I'd like to talk about things that are almost that.

Harrington: Okay.

False-Negative Rate

Harrell: The false-discovery rate, which is related to type 1 error, is a big deal, but people give far too little attention to the false-negative rate. People are publishing things that are announcing discoveries that are just barely publishable. It might be an odds ratio of 1.3 or something, and not clinically predictive of anything. They are ruling out a whole vast number of features that didn't pass their feature screening, not really realizing that their false-negative rate was off the charts.
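A rough power calculation shows how a feature screen can have a false-negative rate that is "off the charts." The sketch below is illustrative only. It assumes a real feature whose true correlation with the outcome is 0.2, a screen of 1000 candidate features with a Bonferroni-corrected two-sided test, and the Fisher z normal approximation. None of these numbers come from the conversation, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def screening_power(true_r, n, n_features, alpha=0.05):
    """Approximate chance that one real feature passes a Bonferroni screen.

    Fisher z approximation: sqrt(n - 3) * atanh(r_hat) is roughly normal
    with mean sqrt(n - 3) * atanh(true_r) and standard deviation 1.
    """
    std_normal = NormalDist()
    per_test_alpha = alpha / n_features  # Bonferroni correction
    z_crit = std_normal.inv_cdf(1 - per_test_alpha / 2)
    shift = math.atanh(true_r) * math.sqrt(n - 3)
    upper_tail = 1 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail


def false_negative_rate(true_r, n, n_features, alpha=0.05):
    """Chance that the real feature is discarded by the screen."""
    return 1 - screening_power(true_r, n, n_features, alpha)


if __name__ == "__main__":
    for n in (100, 300, 1000):
        fnr = false_negative_rate(0.2, n, n_features=1000)
        print(f"n = {n:4d} -> false-negative rate ~ {fnr:.2f}")
```

In this assumed setup, a study of 100 subjects discards a genuinely informative feature the vast majority of the time, so a feature's failure to pass the screen says very little about whether it carries signal.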

There is a real lack of appreciation of reliability of discoveries and reliability of nondiscoveries—especially the latter. I think that is really holding back research. People are dismissing things that do have information, and part of the reasoning is because they are seeking parsimony. I like to say that parsimony is the enemy of predictive accuracy. Nature has so many pathways and genetic backup systems and everything, and parsimony is not the way nature works. It's the way things work sometimes in physics, but not so much in biology.

The idea that almost all research that you see published in a discovery mode is an attempt to be parsimonious is where it's going seriously wrong. Better methods of analyzing the data will say, "What sort of signal is there if we don't try to understand the signal?" The first step is to measure the signal that is predictive.

Are you trying to diagnose colon cancer? If you have suitable data with enough cases of colon cancer and controls, you can start to analyze it. You may find that there is a signal hidden among these thousands of variables to the tune of R² = 0.4 in predicting a final diagnosis of colon cancer. Then you are content to publish a paper where the R² = 0.04. My conclusion from that would be, there is a 0.36 of signal that you have no idea about because you tried to name names. You tried to be parsimonious and that is where you went wrong. That sort of research is really hard to justify. If you are only recovering one tenth of the signal for what your aim is, whether it's diagnostic or prognostic or what, you are publishing something that gets on your curriculum vitae, it counts in promotion, but it's never found to be of clinical utility, and you quickly abandon where the signal was in a lot of what you call losing features.

You abandoned that and were content to publish something that had almost no signal at all, but it was statistically significant. That is a lack of understanding about how multiple factors work together and what pathways are. I just see that as a rampant problem in imaging research, genetics, proteomics, and probably in microbiome, which I've had less exposure to.
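The arithmetic behind "recovering one tenth of the signal" can be reproduced with a toy variance-accounting model. This is a sketch under strong simplifying assumptions that are not from the conversation: the signal is spread equally over independent, unit-variance features, and the parsimonious model keeps a handful of them with their true coefficients. The function name and the choice of 40 features are hypothetical.

```python
def r2_of_subset(r2_full, n_features, n_kept, noise_var=1.0):
    """R-squared retained when only n_kept of n_features are used.

    Toy model: the outcome is the sum of n_features independent,
    equally weighted, unit-variance features plus noise, so explained
    variance is simply additive across features.
    """
    # Total signal variance implied by the full-model R-squared.
    signal_var = r2_full * noise_var / (1.0 - r2_full)
    per_feature_var = signal_var / n_features
    total_var = signal_var + noise_var
    return n_kept * per_feature_var / total_var


if __name__ == "__main__":
    full = r2_of_subset(0.4, n_features=40, n_kept=40)
    sparse = r2_of_subset(0.4, n_features=40, n_kept=4)
    print(f"all 40 features: R^2 = {full:.2f}")
    print(f"best 4 'named' features: R^2 = {sparse:.2f}")
    print(f"signal left behind: {full - sparse:.2f}")
```

In this toy setup, naming 4 of 40 equally informative features keeps an R² of 0.04 out of an available 0.4 and leaves 0.36 behind, matching the numbers in the example above. Real features are correlated and unequally weighted, so actual losses will differ.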

Focusing on the Right Variables

Harrell: There is a different problem, and I would love your comment on this. There is a lack of understanding by many researchers about what sort of variables are really the ones they need to be concentrating on.

A fantastic meta-analysis[3] showed that the history of genetic research in risk factors for depression is just a series of conflicting results with weak signals. They put it all together and tried to estimate how much of depression can be explained by genetic forces versus capturing the life events of the person. How many tragedies (eg, loss of a spouse, loss of a child) had the person suffered?

They showed that life events just made fun of the genetic factors; there was no comparison. A lot of predictive exercises go forward where people are not really taking this into account. I heard a geneticist from the University of Washington say once, "If I had a choice of measuring cholesterol or knowing that someone was predisposed to hyperlipidemia, I'd measure the cholesterol every time."

Harrington: A paper published during the past year or so[4] was looking at machine learning techniques. They say that the machine was better than the cardiologist at predicting cardiovascular events. Then they list all of the variables that the machine identified as being highly predictive. One of the variables that was most important was "no data available." That points out your issue that you really need an understanding of what the biologic processes or what the clinical imperatives are.

Frank, it takes me back to the days of the Duke databank, with clinicians and statisticians talking about what were they observing that seemed to carry importance in the clinical setting, and then bringing that back and formally testing it. It is an exercise that we don't want to forget. It should not be a black box. We should be thinking about what are the observations—biologically, clinically—that seem to be important.

Harrell: We spent a lot of time breaking things down into logical components that you could understand clinically, and they were highly predictive. What is a good way to score obstructive coronary artery disease? What is a good way to score ischemia, and what are the different manifestations of ischemia? What are the different manifestations of heart failure, and how do you put all of that together? How do you score peripheral vascular disease and so on? And we created indexes to summarize each of these phenomena.

That led to great stability over years and years of analyzing the data, instead of looking for individual features. The clinical interpretation was always there. People need to take into account what it is that is going to make sense, be predictive, and be useful for clinical decision-making. The paper you were referring to may be the same one that I saw. For a lot of medical tests, they had both the result of the test and whether or not the test was ordered, and the thing that was predictive was the physician's test-ordering behavior. The machine learning algorithm at no point found that it needed to use the results of any of the tests. That is really interesting, because when you think about transporting that to another clinical setting where the practice patterns and test ordering are different, but maybe the meaning of the test results is not that different, I think they missed the boat.

Harrington: Yes, I agree with you. Frank, I could keep talking with you all day about machine learning and new ways of thinking about data. The lesson I'm taking out of this is, remember some basic principles of statistics as we think about doing good clinical research. Thank you for joining me here on Medscape Cardiology today.

My guest today has been Dr Frank Harrell, a professor of biostatistics at Vanderbilt University School of Medicine. Frank, thanks for joining us.

Harrell: Thank you for having me. I really enjoyed it, Bob.

