Use of Unstructured Event-based Reports for Global Infectious Disease Surveillance

Mikaela Keller; Michael Blench; Herman Tolentino; Clark C. Freifeld; Kenneth D. Mandl; Abla Mawudeku; Gunther Eysenbach; John S. Brownstein


Emerging Infectious Diseases. 2009;15(5) 

The HealthMap Project


Operating since September 2006, HealthMap[22,23] is an Internet-based system designed to collect and display information about new outbreaks according to geographic location, time, and infectious agent.[24–26] HealthMap thus provides a structure to information flow that would otherwise be overwhelming to the user or obscure important elements of a disease outbreak. HealthMap receives 1,000–10,000 visits/day from around the world. It is cited as a resource on sites of agencies such as the United Nations, National Institute of Allergy and Infectious Diseases, US Food and Drug Administration, and US Department of Agriculture. It has also been featured in mainstream media publications, such as Wired News and Scientific American, indicating the broad utility of such a system that extends beyond public health practice.[24,26] On the basis of usage tracking of HealthMap's Internet site, we can infer that its most avid users tend to come from government-related domains, including WHO, CDC, European Centre for Disease Prevention and Control, and other national, state, and local bodies worldwide. Although the question of whether this information has been used to initiate action will be part of an in-depth evaluation, we know from informal communications that organizations (ranging from local health departments to such national organizations as the US Department of Health and Human Services and the US Department of Defense) are leveraging the HealthMap data stream for day-to-day surveillance activities. For instance, CDC's BioPHusion Program incorporates information from multiple data sources, including media reports, surveillance data, and informal reports of disease events, and disseminates it to public health leaders to enhance CDC's awareness of domestic and global health events.[27]

Data Acquisition

The system integrates outbreak data from multiple electronic sources, including online news wires (e.g., Google News), Really Simple Syndication (RSS) feeds, expert-curated accounts (e.g., ProMED-mail, a global electronic mailing list that receives and summarizes reports on disease outbreaks),[18] multinational surveillance reports (e.g., Eurosurveillance), and validated official alerts (e.g., from WHO). Through this multistream approach, HealthMap casts a unified and comprehensive view of global infectious disease outbreaks in space and time. Fully automated, the system acquires data every hour and uses text mining to characterize the data to determine the disease category and location of the outbreak. Alerts, defined as information on a previously unidentified outbreak, are geocoded to the country scale with province-, state-, or city-level resolution for select countries. Surveillance is conducted in several languages, including English, Spanish, Russian, Chinese, and French. The system is currently being ported to other languages, such as Portuguese and Arabic.
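The article does not describe the text-mining algorithm itself. As a rough illustration only, the characterization step (assigning a disease category and a location to an incoming report) could be sketched as a dictionary lookup. The dictionaries, function names, and sample report below are hypothetical and are not taken from HealthMap.

```python
import re

# Hypothetical, deliberately tiny dictionaries mapping surface terms to
# categories. A production system would need far larger, multilingual ones.
DISEASE_TERMS = {
    "cholera": "Cholera",
    "avian influenza": "Avian Influenza",
    "bird flu": "Avian Influenza",
    "dengue": "Dengue",
}
LOCATION_TERMS = {
    "indonesia": "Indonesia",
    "jakarta": "Indonesia",
    "peru": "Peru",
    "lima": "Peru",
}


def find_terms(text, dictionary):
    """Return the sorted categories whose terms occur as whole words or phrases."""
    lowered = text.lower()
    found = set()
    for term, category in dictionary.items():
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            found.add(category)
    return sorted(found)


def characterize(text):
    """Tag a report with the disease categories and locations it mentions."""
    return {
        "diseases": find_terms(text, DISEASE_TERMS),
        "locations": find_terms(text, LOCATION_TERMS),
    }


report = "Health officials in Jakarta confirm two new bird flu cases."
print(characterize(report))
```

In this sketch, synonyms ("bird flu") and subnational place names ("Jakarta") resolve to a single category, which mirrors the country-scale geocoding described above without claiming to reproduce it.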

Data Dissemination

After being collected, the data are aggregated by source, disease, and geographic location and then overlaid on an interactive map for user-friendly access to the original report. HealthMap also addresses the computational challenges of integrating multiple sources of unstructured information by generating meta-alerts, color coded on the basis of the data source's reliability and report volume. Although information relating to infectious disease outbreaks is collected, not all information has relevance to every user. The system designers are especially concerned with limiting information overload and providing focused news of immediate interest. Thus, after a first categorization step into locations and diseases, a second round of category tags is applied to the articles to improve filtering. The primary tags include 1) breaking news (e.g., a newly discovered outbreak); 2) warning (initial concerns of disease emergence, e.g., in a natural disaster area); 3) follow-up (reference to a past outbreak); 4) background/context (information on disease context, e.g., preparedness planning); and 5) not disease-related (information not relating to any disease). Articles carrying tags 2–5 are filtered from display. Duplicate reports are also removed by calculating a similarity score based on text and category matching. Finally, in addition to providing mapped content, each alert is linked to a related information window with details on reports of similar content as well as recent reports concerning either the same disease or location and links for further research (e.g., WHO, CDC, and PubMed).
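The article states only that the similarity score is based on text and category matching, so the following is a minimal sketch under assumptions rather than the system's actual method. It assumes Jaccard overlap of word tokens for the text part, exact agreement of disease and location for the category part, and a hypothetical weight and threshold. It also keeps only reports tagged as breaking news for display.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap of two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(r1, r2, text_weight=0.7):
    """Weighted mix of text overlap and category agreement (weight is an assumption)."""
    text_score = jaccard(tokens(r1["text"]), tokens(r2["text"]))
    category_score = (
        (r1["disease"] == r2["disease"]) + (r1["location"] == r2["location"])
    ) / 2
    return text_weight * text_score + (1 - text_weight) * category_score


def deduplicate(reports, threshold=0.8):
    """Keep a report only if it is not too similar to any report already kept."""
    kept = []
    for report in reports:
        if all(similarity(report, k) < threshold for k in kept):
            kept.append(report)
    return kept


def for_display(reports):
    """Only breaking news is shown; the other four tags are filtered out."""
    return [r for r in reports if r["tag"] == "breaking news"]


reports = [
    {"text": "Cholera outbreak kills 12 in northern province",
     "disease": "Cholera", "location": "Peru", "tag": "breaking news"},
    {"text": "Cholera outbreak kills 12 in the northern province",
     "disease": "Cholera", "location": "Peru", "tag": "breaking news"},
    {"text": "Dengue preparedness plan published after floods",
     "disease": "Dengue", "location": "Indonesia", "tag": "background/context"},
]

unique = deduplicate(reports)
shown = for_display(unique)
print(len(unique), len(shown))
```

Here the second cholera report is dropped as a near duplicate of the first, and the dengue report survives deduplication but is hidden because of its background/context tag.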

Project Results

HealthMap processes an average of 133.5 disease alerts/day (95% confidence interval [CI] 124.1–142.8); ≈50% are categorized as breaking news (65.3 reports/day). Looking 30 days back (default display), the system displays >800 breaking news alerts for any given day. From October 2006 through November 20, 2007, HealthMap had processed >35,749 alerts across 171 disease categories and 202 countries or semiautonomous or overseas territories. Most alerts come from news media (92.8%), followed by ProMED (6.5%) and multinational agencies (0.7%).
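As a quick consistency check on the figures above, the breaking-news share and the source breakdown can be recomputed from the reported numbers alone. The variable names are illustrative.

```python
# Figures as reported in the text.
alerts_per_day = 133.5
breaking_news_per_day = 65.3
source_shares = {"news media": 92.8, "ProMED": 6.5, "multinational agencies": 0.7}

# Share of daily alerts categorized as breaking news, in percent.
breaking_share = round(breaking_news_per_day / alerts_per_day * 100, 1)

# The three source percentages should account for all alerts.
total_share = round(sum(source_shares.values()), 1)

print(breaking_share, total_share)
```

The breaking-news share works out to 48.9%, in line with the ≈50% stated, and the three source categories sum to 100.0%.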