IBM Watson Oncology: Not Living Up to Expectations

Roxanne Nelson, BSN, RN

August 15, 2018

If it sounds too good to be true, then maybe it is.

Such seems to be the case with IBM Watson, which has been aggressively marketed as a tool to assist oncologists in selecting the optimal treatment regimen for their patients.

However, some of its recommendations for cancer treatment have been questionable, according to a new investigative report by STAT, which suggests that Watson is not yet ready for prime time. Internal IBM documents, to which STAT had access, showed that Watson often gave erroneous cancer treatment advice and that company medical specialists and customers identified "multiple examples of unsafe and incorrect treatment recommendations" at the same time that IBM was promoting its supercomputer to hospitals and physicians across the globe.

The report certainly gives oncologists pause for thought, and maybe second thoughts before enlisting the help of Watson.

With an almost never-ending series of new biomarkers and mutations being detected and a dazzling array of new therapies pouring out of the pipeline, choosing a cancer treatment regimen can be arduous and time-consuming.

In a recent report, the supercomputer was shown to significantly speed up the process of analyzing whole-genome sequencing, which is a "key bottleneck in cancer genomics." It took IBM Watson 10 minutes to come up with conclusions that were similar to those reached by a team of experts after 160 hours of analysis.

However, when implemented in real-world oncology settings, Watson fares less well, according to the STAT report. Part of the problem appears to stem from how the system was trained. IBM's internal documents largely blame the problems on the training of Watson by IBM engineers and oncologists at Memorial Sloan Kettering Cancer Center in New York City, who had been tapped to train Watson in 2012. The resulting software was drilled with a small number of "synthetic" cancer cases, or hypothetical patients, instead of real patient data. Thus, Watson was trained on the basis of the expertise of a few specialists for each cancer type, rather than from guidelines or evidence.

STAT also noted that product information posted on the IBM website implies that Watson is continuing to be trained using real patient data and says that the supercomputer "analyzes patient data against thousands of historical cases and insights gleaned from thousands of Memorial Sloan Kettering MD and analyst hours." However, the number of cases for each of the eight cancers that are covered is only a small fraction of that total, ranging from 635 cases for lung cancer to 106 for ovarian cancer.

Gloom and Doom?

The Wall Street Journal recently issued its own report about IBM Watson, saying that despite the initial promise and hype, "six years and billions of dollars later, the diagnosis for Watson is gloomy."

Echoing some of the concerns brought up in the STAT articles, the Wall Street Journal notes that in many cases, the tools did not add much value, and in other cases, Watson was inaccurate. Among the problems were that Watson could be "tripped up" by a lack of data for rare or recurring cancers and that treatments were evolving faster than Watson's human "trainers" were able to update the supercomputer. The newspaper also pointed out that thus far, there is no published research demonstrating that Watson has improved patient outcomes.

Watson for Genomics has been piloted at a number of US cancer centers, but according to the Wall Street Journal, physicians at several centers have reported that results were not always accurate, and when they were, it was often information that oncologists already knew.

"The discomfort that I have — and that others have had with using it — has been the sense that you never know what you're really going to get...and how much faith you can put in those results," Lukas Wartman, MD, of the McDonnell Genome Institute at the Washington University School of Medicine in St. Louis, Missouri, told the newspaper. He also said that even though he has complimentary access to Watson, he rarely uses it.

IBM Responds

IBM has taken issue with the negative media reports that were published recently. It singled out the Wall Street Journal article specifically.

In an article entitled "Watson Health: Setting the Record Straight" posted on the company website, John E. Kelly III, PhD, IBM senior vice president, Cognitive Solutions and IBM Research, writes that the media reports distort and ignore facts when they suggest that IBM has not made "enough" progress in bringing the benefits of artificial intelligence (AI) to healthcare.

"It is true, as the article reports, that we at IBM have placed a big bet on healthcare," Kelly writes. He says that IBM has done this for two reasons: "AI can make a big difference in solving medical challenges and supporting the work of the healthcare industry," he writes, and also there is an "enormous business opportunity in this area as the adoption of AI increases." IBM has built three distinct cancer tools, Kelly notes: Watson for Oncology, Watson for Clinical Trial Matching, and Watson for Genomics.

"Together, they are now in use at 230 hospitals and health organizations globally and have nearly doubled the number of patients they've reached in the first six months of the year to 84,000," he writes.

Kelly also addresses the allegation that there has been no benefit to patients. "To suggest there has been no patient benefit is to ignore both what we know the Wall Street Journal was told by a number of physicians around the world and these institutions' own public comments — which we believe speak for themselves," he writes.

These were the examples that Kelly listed to support this statement:

  • Mayo Clinic physicians presented a poster at the annual meeting of the American Society of Clinical Oncology in which they reported that Watson for Clinical Trial Matching boosted enrollment in breast cancer trials by 80%, to 6.3 patients/month, up from 3.5 patients/month, during an 18-month period following its implementation.

  • Thaddeus Beck, MD, and colleagues at the Highland Oncology Group in Arkansas reported that Watson for Clinical Trial Matching reduced the time needed for clinical trial matching by 78%.

  • Mark Kris, MD, and oncologists at the Memorial Sloan Kettering Cancer Center have helped to train Watson for Oncology regarding 13 cancers that represent 80% of global cancer incidence and prevalence.

  • S. P. Somashekhar, MBBS, and Manipal Hospital, Bangalore, India, reported a 93% concordance rate in breast cancer for their multidisciplinary tumor board in an article published in the Annals of Oncology earlier this year. They recently stated that their multidisciplinary tumor board uses Watson for Oncology for all of their complex cases and that it is changing their treatment recommendations in 9% to 11% of cases.

  • Michael Kelley, MD, and the Department of Veterans Affairs recently extended their contract for Watson for Genomics. Thus far, nearly 3000 veterans with stage IV cancer have been supported by this tool.

  • William Kim, MD, and the University of North Carolina Lineberger Cancer Center, Chapel Hill, reported that Watson for Genomics found new, actionable mutations in 32% of patients.

Oncologist Dilemma

So what should oncologists do with the extra information provided by the supercomputer?

Approached for comment, Nigam Shah, MBBS, PhD, associate professor of medicine and biomedical data science at Stanford University, California, said that he doesn't think there is much to debate, given that the system is not being trained using real patient data.

"If I was an oncologist, I'd want to see their system validated in the same way we validate other interventions in medicine, such as with a prospective study, either with or without randomization, depending on the nature of the intervention," Shah told Medscape Medical News.

As an example, he suggested that a medical center could run IBM Watson in silent mode. The computer would be asked to assess and opine on every case the medical center sees. Its recommendations would be recorded, but not acted upon. The center would then assess how often Watson's recommendations corresponded with what the physicians did. "If that test establishes safety, then I'd start showing the recommendations and acting on them," Shah said, "then follow patients over time to see if those that got a Watson-generated recommendation did better than those who didn't."
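The first stage of the silent-mode test Shah describes comes down to a concordance calculation. The following is a minimal sketch of that bookkeeping, not a description of any IBM or hospital tooling: the `SilentModeCase` record, its field names, and the sample cases are hypothetical, and treating a match as exact string equality is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SilentModeCase:
    """One case seen by the center (hypothetical record layout)."""
    case_id: str
    system_recommendation: str  # recorded, but not shown to the care team
    physician_decision: str     # what the treating physicians actually did


def concordance_rate(cases):
    """Fraction of cases in which the recorded recommendation matched the decision."""
    if not cases:
        raise ValueError("no cases recorded")
    matches = sum(
        1 for c in cases if c.system_recommendation == c.physician_decision
    )
    return matches / len(cases)


def discordant_cases(cases):
    """Cases that would need expert review before any recommendation is shown."""
    return [c for c in cases if c.system_recommendation != c.physician_decision]


# Illustrative, made-up cases; real data would come from the center's records.
cases = [
    SilentModeCase("case-1", "regimen A", "regimen A"),
    SilentModeCase("case-2", "regimen B", "regimen A"),
    SilentModeCase("case-3", "regimen C", "regimen C"),
    SilentModeCase("case-4", "regimen A", "regimen A"),
]

print(f"concordance: {concordance_rate(cases):.0%}")
print("needs review:", [c.case_id for c in discordant_cases(cases)])
```

Concordance alone does not show which side was right in a disagreement, so a list of discordant cases for expert review is likely to be as useful as the headline rate.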

But for now, he reiterated that if he were a practicing oncologist, "I'd just ignore the noise and chatter and wait for more proof points."

Shah added that IBM Watson should clearly disclose the data upon which the system is trained. "Any AI system is only as good as the data it gets to train on," he explained. "If the system is trained on idealized patient records generated by a small set of doctors, it is improper to say it is trained from real patient data at Sloan Kettering."

Michael Hogarth, MD, a professor in the Division of Biomedical Informatics, Department of Internal Medicine, the University of California, San Diego, said that although he is not a legal expert, he believes that there is a fair amount of case law that indicates that the physician using such a system is fully responsible for its use (ie, the system itself is not culpable).

"This is no different than a physician relying on a reference textbook that has misprinted or wrongly characterized something and using that information and having a bad patient outcome as a result," he explained. "Ultimately, the treating physician is 'always' responsible for the decision making — they can't blame a computer, journal article, or book for making them do the wrong thing."

Hogarth noted that he and many others in his discipline of health informatics were a bit skeptical of IBM when the company claimed Watson could improve oncologist decision making. Essentially, the reason is that specialists such as oncologists are experts in their fields, and some are "super experts" regarding specific conditions within their discipline and treat only patients with those conditions.

Such specialization has become very common in healthcare. It has been shown that clinical decision support systems tend to offer the most assistance to physicians who are not specialists in the condition at hand. "The more 'general' the physician, the more a clinical decision support system focused on specialized content, like IBM Watson Oncology, might surface up information that is not known by that physician," explained Hogarth. "When it comes to specialists, the amount of relevant knowledge is less overwhelming for the specialist, who has less to retain, since they don't have to know everything about every health condition and only need to know about the conditions they see in their specialty."

Hogarth added that the most concerning thing is that there is much talk about "machine learning" tools being used at the bedside for diagnosis and treatment decision making. "That's fine, but they still need to be validated," he emphasized. "Many of them are built using data from EHRs [electronic health records], and a lot of that data is incomplete in terms of knowing everything about a patient, since many patients get care in multiple health systems, so nobody has a complete record on them."

The notion of validating machine learning–based diagnostic aids is just starting to be discussed. "The question is whether it should be regulated as a medical device," he said.

Another physician who was approached for comment cautioned that oncologists need to be aware of where Watson's data are coming from. "One thing that has to be emphasized is that Watson's output suffers a huge bias related to the fact that it's all Memorial Sloan Kettering doctors who are training Watson," David H. Gorski, MD, PhD, professor and chief, Breast Surgery Section, Wayne State University School of Medicine, Detroit, Michigan, told Medscape Medical News.

"That means Watson is basically the MSKCC way, which might or might not be the right way in every case," he explained. "Remember, academic physicians approach the medical literature with their own biases and interpret it in light of them, and that's why we can see differing recommendations coming from different institutions."

He added that this is also why he wishes that there were a much broader base of physicians training Watson and feeding it studies. "In the end, Watson is a tool, nothing more," said Gorski. "It's a recommendation for treatment that suffers all the flaws of the medical literature and the specific doctors who taught it how to derive recommendations from that literature.

"If a doctor keeps that in mind and isn't afraid to overrule Watson when he or she thinks it's wrong," he added, "it certainly can be potentially useful."

Not Ready for Prime Time

STAT published an article in September 2017 about some of the problems that IBM was having with Watson. It reported that Watson was "still struggling with the basic step of learning about different forms of cancer. Only a few dozen hospitals have adopted the system, which is a long way from IBM's goal of establishing dominance in a multibillion-dollar market."

For that article, STAT conducted numerous interviews with various stakeholders, including physicians, IBM executives, and experts in AI. It also assessed Watson's use, marketing, and performance in hospitals across the globe. The interviews suggested that "IBM, in its rush to bolster flagging revenue, unleashed a product without fully assessing the challenges of deploying it in hospitals globally," and as a result, "its flaws are getting exposed on the front lines of care by doctors and researchers who say that the system, while promising in some respects, remains undeveloped."

"Watson for Oncology is in their toddler stage, and we have to wait and actively engage, hopefully to help them grow healthy," said Taewoo Kang, MD, a South Korean cancer specialist who has used the product and was quoted in the STAT article.

Earlier this year, IBM confirmed rumors of company layoffs, but denied that the cuts affected 50% to 70% of the workforce at its Watson Health operation, as had been reported. IBM has not reported the actual number of layoffs.

"IBM is continuing to reposition our team to focus on the high-value segments of the IT market, and we continue to hire aggressively in critical new areas that deliver value for our clients and IBM. This activity affects a small percentage of our Watson Health workforce, as we move to more technology-intensive offerings, simplified processes, and automation to drive speed," IBM told Medscape Medical News in a statement earlier this year.

Hogarth commented in an interview with Medscape Medical News: "I understand that IBM Watson has laid off/moved many technical staff that were working on IBM Watson Health, so I think they discovered it was not as easy as they thought, plus actually improving decision making is not a trivial maneuver.... It involves many factors beyond just supplying summarized information."

Hogarth believes that had IBM set its sights on a more modest goal, it could have achieved it. The bar that IBM set for Watson Health, along with its expectations, was quite high and perhaps misplaced or mistargeted. "Instead of creating a system for oncologists, they probably should have focused on a system that can assist general internists, family physicians, and other midlevel extenders and focused its help on patients with enigmatic diagnoses or patterns of symptoms and signs that are not frequently seen by generalists," he said. "That is where an engine like IBM Watson can potentially provide significant value."

IBM recently reported that it will be modifying its Watson software to better reflect geographic differences in cancer treatment, according to the most recent article from STAT about this topic. To date, Watson Oncology has made the most headway in Asia, and most of the hospitals currently using Watson are based outside of the United States. Some physicians have complained that Watson's "decisions" do not reflect treatment protocols in their country and have expressed dissatisfaction with its American bias, that article reports.

At an internal meeting for Watson Health employees worldwide, IBM announced that it would begin using data from real patients for the first time and that Watson's treatment recommendations should incorporate more localized advice.

