Post on 31-May-2020

John S. Brownstein, PhD

Harvard Medical School  Children’s Hospital Informatics Program

Harvard‐MIT Division of Health Sciences and Technology

Global infectious disease surveillance through automated multi‐lingual georeferencing of Internet media reports

Presenter
Presentation Notes
Surveillance sans frontières: Internet-based emerging infectious disease intelligence

Healthcare Surveillance

AEGIS INFLUENZA

AEGIS

Presenter
Presentation Notes
Leverage deep expertise in the area of surveillance. Automated mapping.

Unstructured Web Surveillance

Abundant cheap/free resource

Detailed local information

Near real‐time reporting

Less susceptible to political pressure

Structured Clinical Surveillance

Lack of infrastructure

Low level training

Gaps in coverage

Poor information flow

Presenter
Presentation Notes
Zip through this: (1) don’t need to sell them on the approach; (2) all the things on the left apply on the right

Early reporting of SARS 

Nov 2002 Mar 2003

Progression of outbreak

Electronic Surveillance

Cases of atypical pneumonia, Foshan, Nov 16th

Infected Chinese doctor, Hong Kong hotel, Feb 21st

305 cases of acute resp, Guangdong Province, Feb 11th

Pharma report, Guangdong Province, November 27

Media reports, Guangdong Province, Feb 10

Astute physician on ProMED, Feb 10

Initial WHO Report, Feb 25

Official WHO Report, March 10

Source of outbreak news verified by WHO

Adapted from Heymann 2001

Presenter
Presentation Notes
Initial source of reports of outbreaks 1998–2001. GPHIN picked up 56% of 578 outbreaks subsequently verified by WHO

Limitations of internet‐based surveillance

Abundance of disparate electronic resources but none comprehensive

Information is unstructured: free text

Difficulty in georeferencing of information sources

No synthesized view of the current state of global health

Brownstein et al. Institute of Medicine. 2007.

Presenter
Presentation Notes
Cut?

www.healthmap.org

Presenter
Presentation Notes
Builds on excellent talks on GPHIN and ProMED

Healthmap

Objectives

Enhance surveillance of infectious diseases through integration
Multi‐stream real‐time web‐based surveillance system
Alert aggregator: news wire, web sites, RSS feeds, mailing lists
Automated multi‐lingual searching, categorization, filtering, georeferencing

Achieve unified and comprehensive view of global health
Space and time mapping
Easy information access
Limit information overload: filtering and duplicate removal

Free and open multi‐lingual mapping resource
Open source technologies combined with open API systems
Linux, Apache, MySQL, PHP, Google Maps

Brownstein et al. Institute of Medicine. 2007.

Presenter
Presentation Notes
Linux, Apache, MySQL, PHP

Public Health Resource

Tool for general population

Presenter
Presentation Notes
Very happy to be here. System in infancy. Surprised and humbled by the response.

HealthMap Article Processing

ACQUIRING

>20,000 sites
Every hour; 24/7

FILTERING

>3 million keywords
94% accuracy

CLUSTERING

Text matching
Similarity score

CATEGORIZING

1500 disease patterns
5000 location patterns

Presenter
Presentation Notes
Information extraction, syntactic representation of sentences
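
The CLUSTERING step above names only "text matching" and a "similarity score". As a rough illustration of that idea, and not a description of HealthMap's actual method, the sketch below groups headlines whose word sets overlap, using Jaccard similarity. The function names, the greedy grouping strategy, and the 0.5 threshold are all assumptions made for this example.

```python
import re


def tokens(text):
    """Lowercase a headline and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Similarity score in [0, 1]: shared tokens divided by all tokens."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def cluster(headlines, threshold=0.5):
    """Greedy grouping: a headline joins the first cluster whose first
    member is similar enough, otherwise it starts a new cluster.
    The threshold is an illustrative assumption."""
    clusters = []
    for headline in headlines:
        for group in clusters:
            if jaccard(headline, group[0]) >= threshold:
                group.append(headline)
                break
        else:
            clusters.append([headline])
    return clusters


if __name__ == "__main__":
    reports = [
        "Dengue fever cases rise in New Caledonia",
        "New Caledonia dengue fever cases rise sharply",
        "Bovine anthrax reported in Argentina",
    ]
    for group in cluster(reports):
        print(group)
```

With these three made-up headlines, the two dengue reports fall into one group and the anthrax report into another, which is the kind of duplicate reduction the slide refers to.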

[Figure: token tree. Root node: "UK" maps to 12 and "united" maps to a child node. In that child node, "kingdom" maps to 12, "states" maps to 13, and "arab" maps to a further node, in which "emirates" maps to 14.]

As we process the input token by token, we traverse the tree accordingly.

Each node is a hashtable. Each key maps an input token either to an ID or another node.

UK: 12
united kingdom: 12
united states: 13
united arab emirates: 14
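
The token tree described on this slide can be sketched with nested Python dicts, where each dict plays the role of a hashtable node. This is a minimal illustration under stated assumptions, not the HealthMap code (which the deck says is built on PHP): `tokenize`, `build_tree`, and `find_ids` are hypothetical names, and the sketch assumes no pattern is a prefix of another, since each key maps to either an ID or a node but not both.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def build_tree(patterns):
    """Build the token tree: every node is a dict whose keys map a token
    to either a location ID (int) or a child node (dict)."""
    root = {}
    for phrase, loc_id in patterns.items():
        words = tokenize(phrase)
        node = root
        for word in words[:-1]:
            node = node.setdefault(word, {})
            if not isinstance(node, dict):
                raise ValueError(f"{phrase!r} extends a shorter pattern")
        if isinstance(node.get(words[-1]), dict):
            raise ValueError(f"{phrase!r} is a prefix of a longer pattern")
        node[words[-1]] = loc_id
    return root


def find_ids(text, root):
    """Process the input token by token, walking the tree from each
    start position; collect the ID whenever a full pattern is matched."""
    words = tokenize(text)
    found = []
    i = 0
    while i < len(words):
        node, j = root, i
        while j < len(words) and isinstance(node, dict) and words[j] in node:
            node = node[words[j]]
            j += 1
        if isinstance(node, dict):
            i += 1          # no complete pattern starts here
        else:
            found.append(node)
            i = j           # continue after the matched pattern
    return found


# The four patterns and IDs shown on the slide.
PATTERNS = {
    "UK": 12,
    "united kingdom": 12,
    "united states": 13,
    "united arab emirates": 14,
}
TREE = build_tree(PATTERNS)

if __name__ == "__main__":
    print(find_ids("Outbreak in the United Kingdom, then the United Arab Emirates.", TREE))
```

Because lookup walks one dict per token, matching cost depends on the length of the input rather than on the number of patterns in the dictionary.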

Georeferencing

Early Stats

> 200 alerts per day

60,000 alerts so far

Alerts in 201 countries 

169 pathogens

4 languages

English (60%)

Spanish (20%)

French (11%)

Russian (9%)

Presenter
Presentation Notes
Resource for UN, FDA, AVMA, HHS, WHO, PBS. Many users from government-related domains, including the CDC and WHO as well as other national, state and local bodies.

Geographic Representation

201 Countries with alarms

1‐USA: 4351

2‐UK: 1018

3‐Canada: 880

4‐China: 737

Multi‐lingual Surveillance

Presenter
Presentation Notes
The initial set of data sources will be obtained strictly through English-language resources. However, we plan to adapt our data acquisition to reporting in multiple languages. We will attempt two different approaches to multi-lingual knowledge extraction. The first will be based on direct human translation. We have assembled collaborators from US Naval Medical Research Center Detachment in Peru to help us with Spanish and Portuguese language translation (See letter of support). Our research group also has in-house expertise in French and Chinese (traditional and simplified). In each case, we will directly translate our data dictionary (described in Aim II) of pathogens and geographic locations. Our second approach will be to use automated translation tools, such as BabelFish by AltaVista. We plan to validate these automated approaches against our own attempts at translation. If successful, we plan to expand language capabilities to include Russian, Japanese and Arabic.

Coverage Comparison: Argentina

English News

Bovine Anthrax

Citrus Canker

Coverage Comparison: Argentina

Spanish News

Trichinosis

Bronchiolitis

Rotavirus

Influenza

Georeferencing errors can occur

No known pattern given in input

Location given, wrong match: London, Georgia

Non‐location pattern matched: Antarctica the horse

Correct location, but too general: Russia vs. Stavropol Krai

Presenter
Presentation Notes
Note path dependency here: how frequently I update dictionary. Possibility to compute a composite accuracy score?

[Chart legend: Updates, Inserts, Deletes. Values shown: 47%, 29%, 24%]

Georeferencing accuracy

1774 alerts processed

134 locations edited (7.6%)
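
The percentage on this slide follows directly from the two counts. The short check below reproduces it. Treating alerts whose location was never edited as correctly georeferenced is an assumption of this sketch, not a claim made on the slide, and `edit_rate` is a hypothetical helper name.

```python
def edit_rate(edited, processed):
    """Fraction of processed alerts whose location was manually edited."""
    return edited / processed


ALERTS_PROCESSED = 1774
LOCATIONS_EDITED = 134

rate = edit_rate(LOCATIONS_EDITED, ALERTS_PROCESSED)

if __name__ == "__main__":
    print(f"locations edited: {rate:.1%}")
    # Assumption: unedited locations are counted as correct.
    print(f"locations left unchanged: {1 - rate:.1%}")
```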

Presenter
Presentation Notes
Question is fundamentally qualitative: What is “primary”? Sensitivity vs. information overload. Different ways to get the classification wrong, with differing severity. Bad input occurs: subscription only, headline doesn’t match article, site unavailable. Problems with metrics.

Dictionary Approach Expansion

The dictionary approach provides us with a labeled corpus:

Health authorities in New Caledonia are closely monitoring an upsurge of dengue fever cases

Health/NNP  authorities/NNS  in/IN  New Caledonia/NNP  are/VBP  closely/RB  monitoring/VBG  an/DT  upsurge/NN  of/IN  dengue fever/NN  cases/NNS

Expansion: Learn the syntactic/lexical context in which locations & diseases occur

Predict new locations & diseases:

Health authorities in California are closely monitoring an upsurge of salmonella cases
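
The expansion idea on this slide can be illustrated with a deliberately simplified sketch. The slide speaks of syntactic and lexical context (part-of-speech tags); the code below swaps in a much cruder technique and uses only the single word on each side of a dictionary-labeled phrase. The names `learn_contexts` and `predict`, the label strings, and the `max_len` limit are assumptions for illustration only.

```python
def learn_contexts(sentence, labeled):
    """Record the (previous word, next word) pair around each
    dictionary-labeled phrase, keyed to that phrase's label."""
    words = sentence.split()
    contexts = {}
    for phrase, label in labeled.items():
        span = phrase.split()
        n = len(span)
        # Start at 1 and stop early so both neighbours exist.
        for i in range(1, len(words) - n):
            if words[i:i + n] == span:
                contexts[(words[i - 1], words[i + n])] = label
    return contexts


def predict(sentence, contexts, max_len=3):
    """Propose phrases of up to max_len words that sit between a
    previously seen (previous word, next word) pair."""
    words = sentence.split()
    found = []
    for i in range(1, len(words)):
        for n in range(1, max_len + 1):
            if i + n >= len(words):
                break
            label = contexts.get((words[i - 1], words[i + n]))
            if label:
                found.append((" ".join(words[i:i + n]), label))
    return found


TRAINING = ("Health authorities in New Caledonia are closely monitoring "
            "an upsurge of dengue fever cases")
LABELS = {"New Caledonia": "location", "dengue fever": "disease"}
CONTEXTS = learn_contexts(TRAINING, LABELS)

if __name__ == "__main__":
    new_sentence = ("Health authorities in California are closely monitoring "
                    "an upsurge of salmonella cases")
    print(predict(new_sentence, CONTEXTS))
```

On the slide's own example, the contexts learned from the New Caledonia sentence pick out "California" as a location and "salmonella" as a disease. A realistic system would need far richer features and many labeled sentences to avoid spurious matches.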

Collaborative Georeferencing Networks

ProMED of the International Society for Infectious Diseases (specialty moderators; full 40,000 members)

Emerging Infections Network (EIN) of the Infectious Diseases Society of America (982 ID consultants)

US Naval Medical Research Center Detachment of DOD‐GEIS in Peru (Spanish and Portuguese moderation)

Conclusions

Internet‐based disease mapping offers a promising multi‐use tool

Value in visualization of distributed electronic resources

Georeferencing still presents formidable challenges

Higher resolution mapping

Limiting misclassifications

Multi‐lingual location identification

Presenter
Presentation Notes
Data can also be leveraged for epidemiological studies. Internationalization. Localization. Useful because of a new context for public health issues. Further supports English report bias: greater population, numbers of media outlets, public health resources, and availability of electronic communication infrastructure.

Acknowledgments

Children’s Hospital Informatics Program @ Harvard‐MIT HST

Clark Freifeld

Mikaela Keller, PhD

Ken Mandl, MD MPH

Ben Reis, PhD

Isaac Kohane, MD PhD

Larry Madoff (ProMED)

David Blazes (Peru NMRCD)

Aranka Anema (UBC)

Funding

Google Foundation

National Library of Medicine (NLM)

Centers for Disease Control and Prevention

Canadian Institutes of Health Research (CIHR)

Contact

john_brownstein@harvard.edu

www.healthmap.org

www.chip.org