How Genomic Sequencing Data Was Used to Track an Ongoing Salmonella Outbreak

Irena Hwang

November 05, 2021

Last week, ProPublica published an investigation documenting the failures in the U.S. food safety system that allowed the spread of a type of salmonella known as multidrug-resistant infantis. The bacteria has sickened tens of thousands of people, but outdated government policies and pushback from trade groups have left federal agencies with little power to stop infantis from spreading through the poultry industry.

Our reporting relied on public records requests and dozens of interviews, the bread and butter of journalism. But I also made use of a type of data ProPublica has never before tapped into: publicly available genomic sequencing data.

Before I became a data journalist, I was a doctoral student in electrical engineering. Most of my research was in bioinformatics — the analysis and interpretation of genetic data — and for seven years, culturing bacteria, purifying nucleic acids and writing code to analyze sequencing data were my bread and butter.

So when my ProPublica colleagues Michael Grabell and Bernice Yeung approached me with questions about genomic sequencing data, I was all ears. They explained that they were digging into a salmonella outbreak investigation that the Centers for Disease Control and Prevention had closed in 2019, albeit with a warning that "illnesses could continue because this salmonella strain appears to be widespread in the chicken industry."

The NCBI data alone wouldn't tell us. The database is stripped of key details, like the poultry plant where a salmonella sample was taken or when and where a patient got sick, a shortcoming that industry scientists and consumer advocates have complained about. This is where public records proved vital. Michael and Bernice, along with Mollie Simon on our research team, filed dozens of public records requests with the CDC, the U.S. Department of Agriculture and state public health agencies that had worked on the infantis outbreak. Through those requests, we obtained records from the USDA's microbiological sampling program that revealed how often different types of salmonella were being found at which poultry plants. We also obtained epidemiological information about patients who had been part of the outbreak, including the date they'd been tested for salmonella, details about their illness and recent food consumption. The records didn't include patients' names, of course, but we could match both of the datasets we'd obtained to the sequencing data available on the NCBI database.

The USDA sampling data also allowed ProPublica to create an online tool that consumers could use to check the salmonella records of the plants that process their chicken and turkey.

As we pored over the data and public records, we learned about how the CDC has analyzed DNA to connect food poisoning cases. From the 1990s to just around the time of the infantis outbreak, investigators used a technique called pulsed-field gel electrophoresis, or PFGE.

The difference between PFGE and sequencing data was crucial to this outbreak and our investigation.

PFGE Is Dead. Long Live WGS

When a patient shows up at the hospital with symptoms of foodborne illness, a stool or urine sample may be taken. Then, DNA from bacteria found in the sample can be extracted in a lab.

A DNA sequence can be thought of as a huge compound word spelled with only four possible molecular "letters," or nucleotide bases. PFGE uses a special protein to cut up DNA into smaller sections — imagine breaking up a giant compound word into chunks of words. Then, an electric field is applied, and the segments of cut-up DNA will rearrange based on their weight, resulting in a visible barcode-like pattern.
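The cut-and-sort idea can be sketched in a few lines of Python. This is a toy model only: the function name `digest` and the recognition site `TCTAGA` are placeholders chosen for illustration, and a real gel separates fragments with limited resolution rather than reporting exact lengths.

```python
def digest(sequence: str, site: str) -> list[int]:
    """Cut `sequence` at the start of every occurrence of `site`.

    Returns the fragment lengths, longest first, as a stand-in for the
    band pattern that would appear on a gel.
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        fragments.append(sequence[start:pos])  # piece before this cut
        start = pos
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])  # piece after the last cut
    # Drop empty pieces (for example, when the site sits at position 0).
    return sorted((len(f) for f in fragments if f), reverse=True)


# Hypothetical 24-letter "genome" with two cut sites.
toy_genome = "AAAATCTAGACCCCCCTCTAGAGG"
print(digest(toy_genome, "TCTAGA"))  # [12, 8, 4]
```

Two samples whose DNA differs only inside a fragment, without creating or destroying a cut site, would produce the same list of lengths.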

Scientists can compare PFGE patterns to make informed guesses about how closely related pathogens are to one another. The more similar their PFGE patterns, the more similar their underlying DNA must have been. For years, including the time covered by the infantis investigation, PFGE patterns were used to define outbreak strains.
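One way to put a number on "how similar" two band patterns are is the Dice coefficient, a common choice for comparing banding patterns. The sketch below assumes each pattern has already been reduced to a list of band sizes; the function name is hypothetical, and real comparisons allow a tolerance when matching bands rather than requiring exact equality.

```python
def dice_similarity(bands_a, bands_b) -> float:
    """Dice coefficient between two band patterns.

    Computed as 2 * (shared bands) / (bands in A + bands in B).
    Bands are matched by exact size here, which is a simplification.
    """
    a, b = set(bands_a), set(bands_b)
    if not a and not b:
        return 1.0  # two empty patterns are treated as identical
    return 2 * len(a & b) / (len(a) + len(b))


# Made-up patterns: two of three bands are shared.
print(dice_similarity([12, 8, 4], [12, 8, 5]))
```

A score of 1.0 means the patterns are indistinguishable, which is not the same as saying the underlying DNA is identical.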

But in the early 2000s, a new technology called next-generation sequencing made it possible to get a relatively quick readout of the full sequence of nucleotide bases in a DNA sample, a process called whole-genome sequencing, or WGS. Individuals are distinguished by the tiniest differences in the genes we share, but detecting differences that small is beyond the abilities of PFGE. Whole-genome sequencing, though, can reveal the unique "spellings" of our DNA that differentiate you from me, or one strain of a pathogen from another.

Sequencing data is the backbone of the NCBI Pathogen Detection project. NCBI groups genetically similar samples into clusters and then compares each sequence, nucleotide base by nucleotide base, to the other sequences in the cluster.
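A base-by-base comparison can be illustrated with a simple difference count. This is a minimal sketch, not NCBI's pipeline: it assumes the two sequences are already aligned to the same length, and the function name and the rule of skipping unknown bases (written `N`) are assumptions made for the example.

```python
def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count positions where two aligned sequences differ.

    Positions where either sequence has an unknown base ("N") are
    skipped rather than counted as differences.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for x, y in zip(seq_a, seq_b)
        if x != y and "N" not in (x, y)
    )


# Made-up sequences that differ at two positions.
print(snp_distance("ACGTACGT", "ACGTTCGA"))  # 2
```

The smaller the count, the more closely related two samples are presumed to be.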

For each cluster of samples, NCBI also creates a phylogenetic tree, an evolutionary biologist's version of your Aunt Sue's hand-drawn family tree. This models how a group of organisms might be related to possible common ancestors and to one another.

But phylogenetic trees that are drawn based on hypothetical common ancestors, like NCBI's, are interpreted differently than known family trees. Genetic changes occur by evolution, but also by chance. In the case of humans, millions of unrelated strangers might have a particular gene that gives rise to a particular disease, but that's different from knowing that I inherited that gene directly from my parents. It's largely the same for bacteria like salmonella.
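To make the idea of a tree built from hypothetical ancestors concrete, here is a minimal sketch of one textbook method, average-linkage clustering (often called UPGMA). It is a stand-in chosen for simplicity, not a description of how NCBI builds its trees, and the sample names and distances are invented.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a tree by repeatedly joining the two closest clusters.

    names: list of sample names.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns the tree as nested tuples; each tuple is an inferred ancestor.
    """
    clusters = {name: 1 for name in names}  # cluster -> number of samples
    d = dict(dist)
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Distance to the new cluster is the size-weighted average.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


# Invented distances: A and B are close, C and D are close.
toy = {
    frozenset(("A", "B")): 1, frozenset(("C", "D")): 2,
    frozenset(("A", "C")): 6, frozenset(("A", "D")): 6,
    frozenset(("B", "C")): 6, frozenset(("B", "D")): 6,
}
print(upgma(["A", "B", "C", "D"], toy))  # (('A', 'B'), ('C', 'D'))
```

The inner pairs are the hypothetical common ancestors: the method infers them from similarity alone, without knowing whether A and B actually share a source.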

So I wondered: What could the tree for this infantis cluster tell us about how closely related the outbreak samples were to the thousands of more recent food and patient samples in the same cluster?

No Silver Bullet

To find out, I freed up 100 gigabytes on my work laptop and asked my editors for 50 euros. The hard-drive space was for comparing approximately 32 million pairs of samples from the NCBI data, and the euros were for phylogenetic visualization software created by researchers in Germany.
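The pair count grows quickly because every sample is compared with every other one: n samples yield n × (n − 1) / 2 pairs. The sketch below shows the arithmetic; the figure of 8,000 samples is a hypothetical round number that happens to land near 32 million pairs, not a count taken from the NCBI data.

```python
from itertools import combinations
from math import comb


def pair_count(n_samples: int) -> int:
    """Number of unordered pairs among n samples: n * (n - 1) / 2."""
    return comb(n_samples, 2)


def all_pairs(sample_ids):
    """Lazily yield every unordered pair of sample IDs."""
    return combinations(sample_ids, 2)


# Hypothetical sample count, chosen only to show the scale.
print(pair_count(8000))  # 31996000

# Small example with made-up IDs: four samples give six pairs.
for a, b in all_pairs(["s1", "s2", "s3", "s4"]):
    print(a, b)
```

Doubling the number of samples roughly quadruples the number of comparisons, which is why disk space becomes a concern.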

By comparing the bacteria samples found in USDA tests to the outbreak samples, I found that more than twice a day this year, on average, the agency has been finding drug-resistant infantis in chickens destined for supermarkets and restaurants that's genetically similar to the outbreak strain. We also confirmed that the CDC is still receiving reports of infantis infections — as recently as last week.

This finding highlights the power of WGS databases like NCBI's to help investigators draw connections between human illness and foods they may have eaten. Thanks to WGS, public health officials have discovered that certain foods, like raw flour and peaches, were vectors for outbreaks of foodborne illnesses, even though they had rarely been linked to a particular bug. Sequencing data has even helped solve cases that had long gone cold, like a sprinkling of food poisoning cases linked to ice cream that were finally connected after half a decade.

But WGS is no silver bullet. Even a seemingly "perfect" DNA match in NCBI cannot conclusively identify the specific culprit behind a foodborne illness. Bacteria accumulate changes in DNA relatively rapidly and have an annoying habit of swapping genes like Pokémon cards. Bacterial samples might share the same set of genes and mutations because they came from the same source, or they might have acquired them independently under completely different circumstances. So a genetic match between a food and human sample must be corroborated with epidemiological proof to make sure it fits in the outbreak timeline and matches a theorized source of the outbreak.

I'd hoped that visualizing the outbreak samples on a phylogenetic tree would reveal insights about the more recent infantis samples versus the ones collected during the outbreak. Perhaps there would be patterns in the tree showing that newer samples shared more genetic similarities than the outbreak samples did. Or that certain outbreak samples had spawned mini-outbreaks of their own. Instead, the visualization software showed that, evolutionarily speaking, the outbreak samples were all over the place: They couldn't be tracked back to one particular source.

The lack of obvious patterns in the tree that could be tied to geography, time or food product supported the CDC's theory that infantis contamination was likely originating not at particular slaughterhouses or processing plants, but rather upstream in the poultry supply chain, perhaps in feed or breeding flocks. (The two major breeding companies, Aviagen and Cobb-Vantress, the latter a subsidiary of Tyson Foods, declined to comment.) Comparing the DNA sequences yielded no further clues — the number of genetic differences between two samples from during the outbreak was, on average, about the same as that between an outbreak isolate and a more recent multidrug-resistant infantis isolate. To put it simply: The infantis samples before, during and after the outbreak were, in the end, all pretty similar.
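The comparison of averages described above can be sketched as follows. This is an illustration under stated assumptions rather than the analysis itself: the sequences are made up, they are assumed to be aligned to equal length, and the function names are hypothetical.

```python
from itertools import combinations, product
from statistics import mean


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count differing positions between two equal-length sequences."""
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


def mean_within(group) -> float:
    """Average distance over every pair drawn from a single group."""
    return mean(snp_distance(a, b) for a, b in combinations(group, 2))


def mean_between(group_a, group_b) -> float:
    """Average distance over every pair with one member from each group."""
    return mean(snp_distance(a, b) for a, b in product(group_a, group_b))


# Invented toy sequences standing in for outbreak and later isolates.
outbreak = ["ACGTACGT", "ACGTACGA", "ACGAACGT"]
recent = ["ACGTACGT", "ACGTTCGT"]

print(mean_within(outbreak))
print(mean_between(outbreak, recent))
```

If the two averages come out about the same, the later samples are no more distant from the outbreak samples than the outbreak samples are from each other.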

We shared our findings with numerous experts, including former and current CDC researchers and food safety scientists. They agreed that our analysis indicated something very different from the traditional foodborne illness outbreak that can be traced back to a definitive single source. What we've been looking at, they said, is indicative of a bug that's so deeply entrenched in the poultry supply chain that it's hard to figure out where it came from.

A New Landscape of Bacterial Foodborne Illness

The closure of the infantis investigation without any conclusions about the outbreak's origins is, it appears, a harbinger. Similarities in genetic data are linking seemingly unrelated cases of people getting sick in different states and consuming different products. At a USDA meeting on salmonella last year, Robert Tauxe, director of the CDC's Division of Foodborne, Waterborne and Environmental Diseases, described a "new landscape" of foodborne illnesses revealed by WGS: strains that cause recurring outbreaks, that are newly emerging and that persist in a population from year to year.

The very definition of what constitutes an outbreak is in question.

Scientists are beginning to answer that question with sequencing data and are piecing together how bacteria are taking advantage of our interconnected food supply chains. A 2015 study on salmonella in fish products destined for sushi in restaurants and grocery stores identified certain countries in the global tuna supply chain where salmonella contamination is more likely to occur.

"Public health agencies," wrote the authors, "could use this information to determine most effective intervention points to minimize or eliminate outbreak risk."

Up to now, the USDA hasn't fully used the information at its disposal to prevent the most dangerous strains of salmonella from spreading in our food supply.

It's possible that will change, though. Last month, after years of public pressure (and weeks of inquiries from ProPublica), the USDA's top food safety official, Sandra Eskin, said the agency was rethinking its approach to salmonella. The agency will set up pilot projects and hold meetings to develop a new plan, but its announcement was short on specifics.

Michael Grabell and Bernice Yeung contributed reporting.