Use of Unstructured Event-based Reports for Global Infectious Disease Surveillance

Mikaela Keller; Michael Blench; Herman Tolentino; Clark C. Freifeld; Kenneth D. Mandl; Abla Mawudeku; Gunther Eysenbach; John S. Brownstein


Emerging Infectious Diseases. 2009;15(5) 

The EpiSPIDER Project


The EpiSPIDER project was designed in January 2006 to serve as a visualization supplement to the ProMED-mail reports. Through use of publicly available software, EpiSPIDER was able to display topic intensity of ProMED-mail reports on a map. Additionally, EpiSPIDER automatically converted the topic and location information of the reports into RSS feeds. Initial usage tracking showed that the RSS feeds were more popular than the maps. Transforming reports to a semantic online format (W3C Semantic Web) makes it possible to combine emerging infectious disease content with similarly transformed information from other Internet sites such as the Global Disaster Alert Coordinating System (GDACS) website. The broad effects of disasters often increase illness and death from communicable diseases, particularly where resources for healthcare infrastructure have been lacking.[28,29] By merging these 2 online media sources (ProMED-mail and GDACS), EpiSPIDER demonstrates how distributed, event-based, unstructured media sources can be integrated to complement situational awareness for disease surveillance.
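
As a rough illustration of what "topic intensity" could mean in practice, the sketch below counts reports per topic and location, which is the kind of aggregate a map layer might display. The sample reports and the function name `topic_intensity` are hypothetical; the article does not describe how EpiSPIDER computed intensity.

```python
from collections import Counter

# Hypothetical, already-structured reports as (topic, country) pairs.
# These are illustrative values, not data from ProMED-mail.
reports = [
    ("avian influenza", "Indonesia"),
    ("avian influenza", "Indonesia"),
    ("cholera", "Zimbabwe"),
    ("avian influenza", "Egypt"),
]

def topic_intensity(reports):
    """Count how many reports mention each (topic, location) pair."""
    return Counter(reports)

intensity = topic_intensity(reports)
for (topic, country), n in intensity.most_common():
    print(f"{country}: {topic} ({n} reports)")
```

Each count could then drive the size or color of a map marker for that location.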

Data Acquisition and Dissemination

EpiSPIDER connects to news sites and uses natural language processing to transform free-text content into structured information that can be stored in a relational database. For ProMED reports, the following fields are extracted: date of publication; list of locations (country, province, or city) mentioned in the report; and topic. EpiSPIDER parses location names from these reports and georeferences them using the georeferencing services of Yahoo Maps, Google Maps, and Geonames.
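
The three extracted fields map naturally onto a small relational schema. The following is a minimal sketch, assuming a two-table layout (one row per report, one row per mentioned location) and a local lookup table standing in for an external georeferencing service. The table names, the `geocode` helper, the sample report, and the approximate coordinates are all assumptions made for illustration, not EpiSPIDER's actual design.

```python
import sqlite3

# Hypothetical gazetteer standing in for an external georeferencing service.
# Coordinates are approximate and for illustration only.
GAZETTEER = {
    "Jakarta": (-6.2, 106.8),
    "Indonesia": (-2.5, 118.0),
}

def geocode(name):
    """Return (lat, lon) for a place name, or (None, None) if unknown."""
    return GAZETTEER.get(name, (None, None))

def store_report(conn, published, topic, locations):
    """Insert one report and its georeferenced locations; return the report id."""
    cur = conn.execute(
        "INSERT INTO report (published, topic) VALUES (?, ?)",
        (published, topic),
    )
    report_id = cur.lastrowid
    for name in locations:
        lat, lon = geocode(name)
        conn.execute(
            "INSERT INTO report_location (report_id, name, lat, lon) "
            "VALUES (?, ?, ?, ?)",
            (report_id, name, lat, lon),
        )
    conn.commit()
    return report_id

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE report (
        id INTEGER PRIMARY KEY,
        published TEXT NOT NULL,
        topic TEXT NOT NULL
    );
    CREATE TABLE report_location (
        report_id INTEGER NOT NULL REFERENCES report(id),
        name TEXT NOT NULL,
        lat REAL,
        lon REAL
    );
""")

# A made-up report; "Atlantis" shows how an unresolved place name is kept.
rid = store_report(
    conn, "2008-11-03", "avian influenza", ["Jakarta", "Indonesia", "Atlantis"]
)
rows = conn.execute(
    "SELECT name, lat, lon FROM report_location "
    "WHERE report_id = ? ORDER BY rowid",
    (rid,),
).fetchall()
```

Keeping locations in their own table lets one report reference several places, and storing unresolved names with empty coordinates preserves them for later georeferencing.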

Each news report that has location information can be linked to relevant demographic- and health-specific information (e.g., population, per capita gross domestic product, public health expenditure, and physicians/1,000 population). EpiSPIDER extracts this information from the Central Intelligence Agency (CIA) Factbook and the United Nations Development Programme Human Development Report Internet sites. This feature provides different contexts for viewing emerging infectious disease information. By using askMEDLINE,[30] EpiSPIDER also provides context-sensitive links to recent and relevant scientific literature for each ProMED-mail report topic. After EpiSPIDER extracts the previously described information, it automatically transforms it to other formats, e.g., RSS, keyhole markup language (KML), and JavaScript object notation (JSON, a human-readable format for representing simple data structures). Publishing content in those formats enables the semantic linking of ProMED-mail content to country information and facilitates EpiSPIDER's redistribution of structured data to services that can consume them. Continuing along this transformation chain, the SIMILE Exhibit API, which consumes JSON-formatted data files, enables faceted browsing of information by using scatter plots, Google Maps, and timelines.
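
To make the transformation step concrete, the sketch below serializes one structured report as JSON and as a KML placemark using only the Python standard library. The record's field names and values are hypothetical, and the JSON layout is a generic one rather than the schema expected by the SIMILE Exhibit API or the one EpiSPIDER actually published.

```python
import json
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"

# Hypothetical structured report; field names are illustrative assumptions.
record = {
    "topic": "avian influenza",
    "published": "2008-11-03",
    "location": "Jakarta",
    "lat": -6.2,
    "lon": 106.8,
}

def to_json(record):
    """Serialize the record as a JSON string."""
    return json.dumps(record, indent=2)

def to_kml(record):
    """Serialize the record as a KML document with a single placemark."""
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    placemark = ET.SubElement(doc, f"{{{KML_NS}}}Placemark")
    ET.SubElement(placemark, f"{{{KML_NS}}}name").text = record["topic"]
    ET.SubElement(placemark, f"{{{KML_NS}}}description").text = (
        f"{record['location']}, reported {record['published']}"
    )
    point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
    # KML orders coordinates as longitude,latitude.
    ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = (
        f"{record['lon']},{record['lat']}"
    )
    return ET.tostring(kml, encoding="unicode")

json_text = to_json(record)
kml_text = to_kml(record)
print(json_text)
print(kml_text)
```

Because both outputs derive from the same record, a consumer can choose whichever format suits it (a mapping client for KML, a browser widget for JSON) without the underlying data diverging.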

Recently, EpiSPIDER began outsourcing some of its preprocessing and natural language processing tasks to external service providers such as OpenCalais and the Unified Medical Language System (UMLS) web service for concept annotation. This action has enabled the screening of noncurated news sources as well.

Project Results

Built on open-source software components, EpiSPIDER has been operational since January 2006. In response to feedback from users, additional custom data feeds have been incorporated, both topic oriented (by disease) and format specific (KML, RSS, GeoRSS), as has semantic annotation using UMLS concept codes. For example, the EpiSPIDER KML module was developed to enable the US Directorate for National Intelligence to distribute avian influenza event-based reports in Google Earth KML format to consumers worldwide and also to enable an integrated view of ProMED and World Animal Health Information Database reports.

EpiSPIDER is used by persons in North America, Europe, Australia, and Asia, and it receives 50–90 visits/hour, originating from 150–200 sites and representing 30–50 countries worldwide. EpiSPIDER has recorded daily visits from the US Department of Agriculture, US Department of Homeland Security, US Directorate for National Intelligence, US CDC, UK Health Protection Agency, and several universities and health research organizations. In the latter half of 2008, daily access to graphs and exhibits surpassed access to data feeds. EpiSPIDER's semantically linked data were also used for validating syndromic surveillance information in OpenRODS and populating disease detection portals, including one at the Research Triangle Institute (Research Triangle Park, NC, USA).