Failing Grade for ProPublica's Surgeon Scorecard

John M Mandrola


July 23, 2015

Anyone who has seen a local 5K run knows runners vary in skill. Some run swiftly and with ease, and some do not. Mastery of doctoring is no different. By teaching the basics, medical training attempts to level the playing field, but it remains uncontroversial to say some doctors are better than others.

From my first day of private practice, I've despised medicine's lack of meritocracy. Although we can usually discover outliers and frauds (a rarity), neither patients nor referring doctors have a reliable way to judge the skills of a specialist or surgeon.

Two reporters from ProPublica, an investigative-journalism group, sought to remedy this injustice. They used exclusive access to a trove of Medicare billing data to build a user-friendly online Surgeon Scorecard. They promoted the work with a sensational video and embraced their conclusion: that your surgeon matters, and that we can tell surgeons apart.

They are wrong. They cannot. The ProPublica Surgeon Scorecard fails to deliver on its promise.

I believe ProPublica should admit they released the scorecard prematurely and consider taking it down until it is improved. It is not ready for prime time. Its risks are greater than its benefits. They should not feel bad. Mistakes are okay. Mistakes teach you a lot. Mistakes eventually make great doctors and surgeons.

Let's begin with two positives.

ProPublica's intent was good. We need more transparency in medicine. Patients and referring doctors deserve to know the skills of their specialists. I've never heard a good doctor stand against transparency.

Another positive is that ProPublica does good work. More than ever, we need investigative journalism. I support their mission, especially the Dollars for Docs conflict-of-interest project. ProPublica has quoted me in the past, and I am proud to stand on the side of transparency.

If only good intent and good people were all that was needed to get the right diagnosis and treatment.

The fatal flaw of the Surgeon Scorecard is that its methods and results do not support its conclusion. Here are the specific shortcomings of the project.

The Flaws

The first problem is that the data are incomplete. ProPublica used administrative Medicare billing data (patients aged 65 years and older) to assess only two surgery outcomes: death or readmission to the hospital in the month after the operation. Coding data are often inaccurate, and they capture none of the nuance of the operation or of the person who had it. To understand a medical encounter, you need to look at more than what a secretary submits as the billing code.

ProPublica did not look at patient-level data; they did not look at specific postoperative complications, and therefore, they could not judge the most important skill of a surgeon—the judgment to operate in the first place. Dr Ben Davies, a practicing urologist, wrote that ProPublica graded prostate-cancer removals without looking at the three most patient-centric outcomes of the operation—cancer removal, continence, and erectile function.

The second problem with the data is that patients are not the same. Some doctors work in rich suburban hospitals, where they see healthier patients than do doctors who work in safety-net hospitals. ProPublica tried to account for this by adjusting their data with a "Health Score." Good on them, except the health score did not show that sicker patients had higher rates of death or readmission. As Dr Jay Schloss, a practicing electrophysiologist, wrote in his blog, this raises questions about the validity of their risk adjustment. Of course it does. Billing data aren't nuanced enough to tease out these crucial details.

The presentation of the data is misleading. Although ProPublica uses confidence intervals, a surgeon's rate of complications falls neatly into a green, yellow, or red color scheme. The rule of small numbers (in this analysis, small numbers of both cases and outcomes) dictates that confidence intervals will be wide. In many surgeons' scorecards, the interval spans multiple color codes. This means a surgeon with a complication rate in the yellow zone and a confidence interval that touches both the red and green zones could be good, average, or poor. In medical journals, when confidence intervals cross such boundaries, the result is declared not statistically significant. That doesn't happen in the ProPublica Surgeon Scorecard.
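To make the small-numbers point concrete, here is a minimal Python sketch. It rests on stated assumptions: it uses the standard Wilson score interval for a proportion, which is not necessarily the interval ProPublica computed, and the color-band cut points (2% and 4%) and the case counts are hypothetical values chosen only for illustration.

```python
import math

# Hypothetical cut points for the color bands (illustrative only; these are
# not the Scorecard's actual thresholds).
GREEN_MAX = 0.02   # rates below 2% count as "green"
YELLOW_MAX = 0.04  # rates from 2% up to 4% count as "yellow"; above is "red"
BANDS = ["green", "yellow", "red"]


def wilson_interval(events, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if n <= 0 or not 0 <= events <= n:
        raise ValueError("need n > 0 and 0 <= events <= n")
    p = events / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def band(rate):
    """Map a complication rate to a hypothetical color band."""
    if rate < GREEN_MAX:
        return "green"
    if rate < YELLOW_MAX:
        return "yellow"
    return "red"


def bands_spanned(events, n):
    """List every color band touched by the interval for events/n."""
    low, high = wilson_interval(events, n)
    first, last = BANDS.index(band(low)), BANDS.index(band(high))
    return BANDS[first:last + 1]


if __name__ == "__main__":
    # Same ballpark rate (about 3%), very different amounts of evidence.
    for events, n in [(1, 30), (90, 3000)]:
        low, high = wilson_interval(events, n)
        print(f"{events}/{n}: {low:.3f} to {high:.3f} -> {bands_spanned(events, n)}")
```

Under these assumed cut points, a surgeon with 1 complication in 30 cases has an interval that touches all three colors, while 90 complications in 3000 cases stays inside yellow. The point estimate is similar in both cases; only the second says much about the surgeon.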

More confusing to the consumer is that a surgeon with a small number of cases and zero complications is classified as "medium" risk. How is that? It's assumed that if she did enough cases, she would have complications. Maybe, but maybe she is special.
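One common way such an assumption enters a statistical model is shrinkage: a surgeon's observed rate is pulled toward the overall average, and pulled harder when there are few cases. The Python sketch below illustrates that general idea with a simple beta-binomial estimate. It is not presented as ProPublica's actual method, and the 3% average rate and the weight of 100 "prior" cases are hypothetical numbers.

```python
def shrunken_rate(events, n, prior_rate=0.03, prior_cases=100):
    """Beta-binomial posterior mean: blend observed data with an assumed average.

    prior_rate and prior_cases are hypothetical; they act like `prior_cases`
    imaginary operations performed at the average complication rate.
    """
    if n < 0 or not 0 <= events <= n:
        raise ValueError("need n >= 0 and 0 <= events <= n")
    prior_events = prior_rate * prior_cases
    return (events + prior_events) / (n + prior_cases)


if __name__ == "__main__":
    # Zero complications in every row, but a growing number of cases.
    for n in (20, 200, 2000):
        print(f"0/{n}: raw rate 0.000, shrunken rate {shrunken_rate(0, n):.4f}")
```

With these assumed numbers, a surgeon with 0 complications in 20 cases is scored at about 2.5%, close to the assumed 3% average, which is how a perfect record can land in the middle band. Only with many more complication-free cases does the estimate move toward zero.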

Finally, there is the issue of data accuracy. At my hospital, the scorecard has a cardiologist doing knee replacements and an orthopedic surgeon doing gall-bladder removals. Inaccurate information creates credibility issues. Nine out of 10 things you say can be accurate and legitimate, but it takes only one error to bust credibility.


Some experts are (rightly) frustrated by the fact that an average person cannot determine the skills of their doctor. They argue that the Surgeon Scorecard is a first step toward transparency and meritocracy. I say it is a step backward. Misleading people is a greater sin. Bad data are worse than no data.

It's much harder to undo bad information than it is to provide good information. In my opinion, uncertainty is better than a false sense of certainty. The scariest doctors are those who don't know what they don't know.

We should also be mindful of the urge to approach terrible problems with the do-something-anything mind-set. I see an analogy here to the patient with advanced cancer. Doctors desperately want to help. Fearful patients desire action; they have their guard down. That precarious scenario allows both parties to sanction the use of ineffective, costly, and toxic treatments. The patient still dies, only now he suffers a bad death. The intent was good, the people were good. But that's not enough for success in healthcare. Success in healthcare requires a clear-eyed assessment of evidence, the bravery to face uncertainty, and the will to resist bad solutions.

The Surgeon Scorecard was a huge undertaking that could have a major impact on health outcomes. If this were a new drug, new treatment, or new app, it would have to be peer-reviewed and independently evaluated. There would be prospective validation, say a pilot project, that proved it was effective and not harmful.

The history of medicine is filled with examples where we let eminence trump evidence. Thousands, maybe millions, of Americans may use an untested, flawed database to evaluate the human being they trust to cut them open.

Are we okay with it being this imperfect? I'm not.


