‘Tracking Epidemics with Natural Language Processing and Crowdsourcing’
Robert Munro, Lucky Gunasekara, Stephanie Nevins, Lalith Polepeddi and Evan Rosen
Stanford University and EpidemicIQ
2012 AAAI Spring Symposium
March 2012
http://www.robertmunro.com/research/munro12epidemics.pdf
Global Viral Forecasting
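Figure: pathogens by degree of human adaptation. Stage labels: not human adapted, weakly human adapted, human adapted, human exclusive; additional label: transmissible. Pathogens shown: Influenza, HIV-1, Yellow Fever, Rabies, SARS/Ebola.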
90% of the world’s ecological diversity
90% of the world’s linguistic diversity
Reported locally before identification
H1N1 (Swine Flu) – months (10% of world infected)
HIV – decades (35 million infected)
H5N1 (Bird Flu) – weeks (>50% fatal)
Diseases eradicated in the last 75 years: smallpox
Increase in air travel in the last 75 years:
No one is tracking all the world’s outbreaks
• NASA is tracking thousands of potentially dangerous near-Earth objects (NASA 2011).
• National security agencies are tracking tens of thousands of suspected terrorists daily (Chertoff 2008).
• A deadly microbe is far more likely to sneak onto a plane undetected.
CDC vs Google Flu Trends?
“I'm Jacqui Jeras with today's cold and flu report ... across the mid-Atlantic states, a little bit of an increase here” Jan 4th
The first signal is plain language
“today's cold and flu report ... across the mid-Atlantic states, a little bit of an increase” CNN, Jan 4, 2008
Google Flu Trends: + 3 weeks
CDC: + 5 weeks
… but buried in plain view
“today's cold and flu report ... across the mid-Atlantic states, a little bit of an increase”
“We're worried about the markets.”
“We're going to take you to Kenya where the U.S. has dispatched some diplomatic help to try to get the country back on political balance.”
“Is individualism an endangered concept in Saudi Arabia?”
“Well, in St. John's County, one man lost his home trying to keep his pig warm.”
“The pig did not make it.”
“He had everything but the cape. A good samaritan in Ohio saved a family from this ferocious house fire.”
“A spunky boy reels in a 550-pound shark.”
… in 1000s of languages
в предстоящий осенне-зимний период в Украине ожидаются две эпидемии гриппа
(2 flu outbreaks predicted for the Ukraine)
مزيد من انفلونزا الطيور في مصر
(more flu in Egypt)
香港现1例H5N1禽流感病例曾游上海南京等地
(Hong Kong had a case of avian influenza that traveled to Shanghai and Nanjing)
Reports (millions)
Machine-learning: Relevant?
Targeted machine-processing
Broad machine-processing
Human-processing
Low-volume processing
High-volume processing
Data input
“there is a new flu-like illness here”
Discovered by crawler
Relevance evaluated by machine learning
Relevance evaluated by microtasker
Information stored from the reports
Relevance evaluated by in-house analyst
Sources monitor-frequency updated
Maximally relevant phrases used to search more data
Direct report from field staff / partner organization
Reports for each outbreak aggregated
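The escalation in the flow above (machine learning, then microtasker, then in-house analyst) can be sketched as a confidence-based triage. This is a minimal sketch under stated assumptions, not the system's implementation: the keyword scorer is a toy stand-in for the machine-learning model, and the thresholds, weights, and function names are all hypothetical.

```python
from typing import Callable

# Hypothetical keyword weights standing in for a trained relevance model.
KEYWORD_WEIGHTS = {"outbreak": 0.5, "epidemic": 0.5, "flu": 0.25, "illness": 0.25}


def score_relevance(text: str) -> float:
    """Toy relevance score in [0, 1]: sum of matched keyword weights, capped at 1."""
    lowered = text.lower()
    score = sum(w for kw, w in KEYWORD_WEIGHTS.items() if kw in lowered)
    return min(score, 1.0)


def route_report(
    text: str,
    scorer: Callable[[str], float] = score_relevance,
    low: float = 0.25,   # assumed cutoff: at or below this, drop the report
    high: float = 0.75,  # assumed cutoff: at or above this, store without review
) -> str:
    """Decide the next step for a crawled report based on the machine score."""
    score = scorer(text)
    if score >= high:
        return "store"
    if score <= low:
        return "discard"
    return "microtasker"  # the machine is unsure, so ask a crowd worker


def route_after_microtask(votes: list[bool]) -> str:
    """Unanimous microtasker votes settle the report; anything else goes to an analyst."""
    if votes and all(votes):
        return "store"
    if votes and not any(votes):
        return "discard"
    return "analyst"


first_step = route_report("there is a new flu-like illness here")
second_step = route_after_microtask([True, False])
```

In this sketch the example sentence from the slide scores in the uncertain band and is sent to a microtasker; split votes then escalate it to the analyst, which keeps the expensive human steps for the low-volume remainder.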
Data structuring
• Disease (if known)
• Case counts / demographics
• Location
• Responding organizations
• Transport used
• Quotes from officials
• Changing conditions (spreading / ending)
• Public reaction
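One way to hold the fields listed above is a single record per report. The sketch below is illustrative only: the class name, field names, and types are assumptions, not the schema used by the system described in the talk.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OutbreakReport:
    """One structured report; all field names are hypothetical."""
    source_text: str
    location: str
    disease: Optional[str] = None        # "if known"
    case_count: Optional[int] = None
    demographics: Optional[str] = None
    responding_organizations: list[str] = field(default_factory=list)
    transport_used: list[str] = field(default_factory=list)
    official_quotes: list[str] = field(default_factory=list)
    trend: Optional[str] = None          # e.g. "spreading" or "ending"
    public_reaction: Optional[str] = None


report = OutbreakReport(
    source_text="E Coli spreads to Spain, sprouts suspected",
    location="Spain",
    disease="E. coli",
    trend="spreading",
)
```

Everything except the source text and location defaults to empty, since most early reports will fill in only a few of these fields.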
Motivations
For 600 new seeds, please answer this question:
Does this sentence refer to a disease outbreak:
“E Coli spreads to Spain, sprouts suspected”
Yes/no: __
What disease: _______
What location: _______
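If several workers answer the same microtask, their answers need to be combined. The sketch below assumes simple majority voting with an agreement ratio; the talk does not say how answers were aggregated, so the method, the function name, and the answer format are all hypothetical.

```python
from collections import Counter


def aggregate_answers(answers: list[dict]) -> dict:
    """Combine several workers' answers to one sentence by simple majority."""
    if not answers:
        raise ValueError("need at least one answer")
    yes_votes = sum(1 for a in answers if a["is_outbreak"])
    is_outbreak = yes_votes * 2 > len(answers)
    # Share of workers who sided with the larger group on the yes/no question.
    agreement = max(yes_votes, len(answers) - yes_votes) / len(answers)
    result = {
        "is_outbreak": is_outbreak,
        "agreement": agreement,
        "disease": None,
        "location": None,
    }
    if is_outbreak:
        for key in ("disease", "location"):
            # Normalize free text (trim, lowercase) so near-identical answers match.
            values = [
                a[key].strip().lower()
                for a in answers
                if a["is_outbreak"] and a.get(key)
            ]
            if values:
                result[key] = Counter(values).most_common(1)[0][0]
    return result


answers = [
    {"is_outbreak": True, "disease": "E. coli", "location": "Spain"},
    {"is_outbreak": True, "disease": "e. coli", "location": "Spain "},
    {"is_outbreak": False, "disease": "", "location": ""},
]
result = aggregate_answers(answers)
```

A low agreement value could be used to escalate the sentence to an in-house analyst rather than trusting the majority.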
Virtual protecting the real
Crowdsourcing applicability
• Success:
– Language coverage
– Outbreak relatedness
– Case-counts
– Location names
– Quotes from officials
• Falling short:
– Estimating citizen unrest
– Growth predictions
Native speaker expertise
Data structuring
Data analysis
Crowdsourcing and machine-learning
• The German problem
• Bias-free seeding
• Evaluation for needle-in-haystack scenarios
– Machine and human
• Language representation
Appendix: Abstract
The first indication of a new outbreak is often in unstructured data (natural language) and reported openly in traditional or social media as a new ‘flu-like’ or ‘malaria-like’ illness weeks or months before the new pathogen is eventually isolated. We present a system for tracking these early signals globally, using natural language processing and crowdsourcing. By comparison, search-log-based approaches, while innovative and inexpensive, are often a trailing signal that follows open reports in plain language. Concentrating on discovering outbreak-related reports in big open data, we show how crowdsourced workers can create near-real-time training data for adaptive active-learning models, addressing the lack of broad-coverage training data for tracking epidemics. This is well-suited to an outbreak information-flow context, where sudden bursts of information about new diseases/locations need to be manually processed quickly at short notice.
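The idea of crowdsourced workers feeding an active-learning model can be illustrated with one round of uncertainty sampling: the model picks the report it is least sure about, a worker labels it, and the label joins the training data. This is a minimal sketch under stated assumptions; the word-count model, the function names, and the `crowd_label` callback (a stand-in for a real crowd worker) are hypothetical and are not the models used in the paper.

```python
from collections import Counter
from typing import Callable

Model = tuple[Counter, Counter]


def train(labeled: list[tuple[str, bool]]) -> Model:
    """Count how many relevant and irrelevant examples contain each word."""
    pos, neg = Counter(), Counter()
    for text, relevant in labeled:
        (pos if relevant else neg).update(set(text.lower().split()))
    return pos, neg


def predict(model: Model, text: str) -> float:
    """Relevance score in (0, 1); 0.5 means the model has no evidence."""
    pos, neg = model
    words = [w for w in set(text.lower().split()) if w in pos or w in neg]
    if not words:
        return 0.5
    # Laplace-smoothed share of relevant examples per known word, averaged.
    return sum((pos[w] + 1) / (pos[w] + neg[w] + 2) for w in words) / len(words)


def active_learning_round(
    labeled: list[tuple[str, bool]],
    unlabeled: list[str],
    crowd_label: Callable[[str], bool],
) -> tuple[list[tuple[str, bool]], list[str]]:
    """Send the single most uncertain report to the crowd and update both sets."""
    model = train(labeled)
    most_uncertain = min(unlabeled, key=lambda t: abs(predict(model, t) - 0.5))
    new_labeled = labeled + [(most_uncertain, crowd_label(most_uncertain))]
    remaining = [t for t in unlabeled if t != most_uncertain]
    return new_labeled, remaining


labeled = [("new flu outbreak reported", True), ("stock markets fall", False)]
unlabeled = ["flu outbreak spreads", "markets fall again", "unknown fever in village"]
# Stand-in for a crowd worker: here, any text mentioning "fever" is called relevant.
new_labeled, remaining = active_learning_round(
    labeled, unlabeled, crowd_label=lambda text: "fever" in text
)
```

In this toy run the model has seen none of the words in the third report, so that report is the one sent for human labeling, which mirrors the case of a sudden burst of reports about a new disease or location.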