COMMENTARY

A Few Minutes on the Potential Harm of Predictive Models

F. Perry Wilson, MD, MSCE


January 08, 2020

Welcome to Impact Factor, your weekly dose of commentary on a new medical study. I'm Dr F. Perry Wilson.

It's a new year. After a little holiday break, I'm back and, frankly, a bit cranky as I peruse the recently published medical literature. I'm focusing today on a rather small study. It's one that hits a pet peeve of mine, so I'm going to channel my inner Andy Rooney here and gripe for a bit.

Appearing in JAMA Network Open, we have this article with the compelling title "Use of Machine Learning for Predicting Escitalopram Treatment Outcome From Electroencephalography Recordings in Adult Patients With Depression."

I like to know what I'm getting into when I read a title, and this title promises quite a bit. To me, it reads like researchers used electroencephalography (EEG) and some fancy machine-learning stuff to predict which patients with depression would benefit from escitalopram treatment.

That idea, using a machine-learning model to choose the best psychiatric treatment, is holy-grail-level personalized medicine stuff. See, when confronted with major depressive disorder, docs often try medication after medication to see what sticks; anything that lessens that trial-and-error approach would save tons of time, not to mention lives.

But that is not what this study is about. Walk with me through the methods and you'll see what I mean.

Researchers from British Columbia analyzed EEG data from 122 adult patients with major depression who were initiated on escitalopram therapy.

As you know, an EEG outputs a ton of data: multiple electrodes, thousands of measurements. This is actually an ideal place to use machine-learning tools to squeeze all of those data into a single number. The authors do an exemplary job of using a well-established machine-learning algorithm called a support vector machine to take those gobs of data and turn them into a prediction.

But what exactly are they predicting?

They are predicting whether the patient will achieve remission of depression at 8 weeks. They are not predicting whether escitalopram is good for the patient, and that difference is huge.

This study had no control group; all 122 patients were treated with escitalopram. We therefore have no way to know whether the machine-learning model identified individuals who are more likely to achieve remission regardless of therapy (let's remember that depression spontaneously remits in around 20% of cases) or those who truly benefit from escitalopram.

See, every patient with depression has four potential destinies with regard to escitalopram:

Some will have remission with or without the drug. Some will never have remission regardless of treatment. Some will experience remission only if they get the drug. And others, presumably, will experience remission only if they don't get the drug.
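For readers who think in code, these four destinies are what causal-inference folks call potential outcomes: each patient has one outcome with the drug and one without, and we only ever get to observe one of them. Here is a minimal Python sketch of that logic. The category names and functions are my own hypothetical labels, not anything from the paper.

```python
from enum import Enum
from itertools import product


class Destiny(Enum):
    # Hypothetical labels for the four categories described above.
    ALWAYS_REMITS = "remits with or without the drug"
    NEVER_REMITS = "does not remit regardless of treatment"
    HELPED = "remits only if given the drug"
    HARMED = "remits only if NOT given the drug"


def classify(remits_without_drug: bool, remits_with_drug: bool) -> Destiny:
    """Map a patient's two potential outcomes to one of the four destinies."""
    if remits_without_drug and remits_with_drug:
        return Destiny.ALWAYS_REMITS
    if not remits_without_drug and not remits_with_drug:
        return Destiny.NEVER_REMITS
    if remits_with_drug:
        return Destiny.HELPED
    return Destiny.HARMED


def destinies_consistent_with(remits_on_drug: bool) -> set[Destiny]:
    """In a single-arm study everyone gets the drug, so only remits_with_drug
    is ever observed. Return every destiny compatible with that observation."""
    return {
        classify(without_drug, with_drug)
        for without_drug, with_drug in product([False, True], repeat=2)
        if with_drug == remits_on_drug
    }


print(sorted(d.name for d in destinies_consistent_with(True)))
# ['ALWAYS_REMITS', 'HELPED']
print(sorted(d.name for d in destinies_consistent_with(False)))
# ['HARMED', 'NEVER_REMITS']
```

A patient who remits on the drug could be an "always remits" or a "helped"; with everyone treated, the data alone cannot tell those two apart.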

It's really the last two categories we care about when deciding on treatment, but ironically, the first two categories are the easiest to predict. That's because, in the end, the biggest predictor of whether your depression remits is not whether you get a drug but how severe your depression is in the first place.

Predicting who will do well and predicting who will benefit are very different prediction problems, and mixing them up can actually lead to patient harm.

Let me give an example.

Imagine that we built a model predicting who is least likely to have a heart attack among a population receiving simvastatin.

Without a comparator group, we'd find that individuals with lower LDL, more physical activity, and no diabetes would have the best outcomes. If we then argue that these are the types of people who should receive statins, we'd be doing a huge disservice to the people with more severe disease at baseline. Our model doesn't tell us who should get the drug; it only tells us who was better off in the first place.

We need models that can target therapies to the right patients regardless of how sick they are at baseline, or else we'll always choose the least sick people to get treatment. Sure, that will make the success rate of therapies look awesome, but it's not how I want to practice medicine.
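A toy calculation makes the trap concrete. The numbers below are entirely made up for illustration; they are not from the escitalopram paper or any statin trial. I assume that spontaneous remission gets less likely as baseline severity rises and that the drug helps sicker patients more. A model trained only on treated patients can learn the chance of remission on the drug, but not the benefit.

```python
# Hypothetical numbers, chosen only to illustrate the argument.
# Severity runs from 1 (mildest) to 10 (most severe).


def p_remit_without_drug(severity: int) -> float:
    # Assumption: spontaneous remission falls as baseline severity rises.
    return 0.40 - 0.03 * severity


def drug_benefit(severity: int) -> float:
    # Assumption: the drug adds more to the remission probability of sicker patients.
    return 0.02 * severity


def p_remit_with_drug(severity: int) -> float:
    return p_remit_without_drug(severity) + drug_benefit(severity)


severities = range(1, 11)

# All a single-arm (everyone treated) study can estimate: who remits on the drug.
ranked_by_outcome = sorted(severities, key=p_remit_with_drug, reverse=True)

# What a treatment decision actually needs: how much the drug changes the chance of remission.
ranked_by_benefit = sorted(severities, key=drug_benefit, reverse=True)

print("Best-looking 'responders':", ranked_by_outcome[:3])  # [1, 2, 3]
print("Patients who gain most:   ", ranked_by_benefit[:3])  # [10, 9, 8]
```

In this toy world the drug helps everyone, and helps the sickest most, yet ranking by predicted remission puts the mildest patients first. Separating the two requires a comparison group that did not get the drug.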

Okay, back to escitalopram. What this paper shows is that the authors built an EEG-based model that identifies who is likely to have remission of depression. You could argue that the model has nothing to do with escitalopram. The model may predict outcomes equally well among patients on any antidepressant or on no antidepressant at all. In other words, we're no closer to the dream of strapping an EEG on someone's head and knowing what drug to give them than we were before. But studies like this get reported inaccurately all the time, suggesting that we have some new tool in our personalized medicine toolbox.

My biggest fear is that these models get commercialized as some sort of "use this to decide who to treat" black box, which, as we now all understand, is biased against those who are sicker at baseline, even if they would respond well to therapy. The second sentence of the conclusion of this paper reads: "Developed into a proper clinical application, such a pipeline may provide a valuable treatment planning tool."

Not really—not unless you want to reserve treatment for the least sick individuals.

Could the researchers prove that their model is identifying escitalopram response rather than simply less severe depression? Well, they could show how their model correlates with baseline depression scores or other baseline factors. My bet is that we'd mostly find that the model just identifies those with less severe depression at baseline, but those data are not presented.

And let's remember that although it's very cool to get data about how severe your depression is just from an EEG—I mean, that's Star Trek-y and I love it—we have plenty of tools already available to assess depression severity.

So the next time we see a study (using machine learning or otherwise) that claims to "predict response to therapy," the very next question we have to ask is, "How do we know the model isn't simply identifying less severe disease at baseline?"

F. Perry Wilson, MD, MSCE, is an associate professor of medicine and director of Yale's Program of Applied Translational Research. His science communication work can be found in the Huffington Post, on NPR, and here on Medscape. He tweets @methodsmanmd and hosts a repository of his communication work at www.methodsman.com.
