
Tracking Epidemics with Natural Language Processing and Crowdsourcing

Date post: 18-Jul-2015
Upload: robert-munro
‘Tracking Epidemics with Natural Language Processing and Crowdsourcing’ Robert Munro, Lucky Gunasekara, Stephanie Nevins, Lalith Polepeddi and Evan Rosen. Stanford University and EpidemicIQ. 2012 AAAI Spring Symposium, March 2012. http://www.robertmunro.com/research/munro12epidemics.pdf
Transcript

‘Tracking Epidemics with Natural Language Processing and Crowdsourcing’

Robert Munro, Lucky Gunasekara, Stephanie Nevins, Lalith Polepeddi and Evan Rosen

Stanford University and EpidemicIQ

2012 AAAI Spring Symposium

March 2012

http://www.robertmunro.com/research/munro12epidemics.pdf

Global Viral Forecasting

Figure: pathogens by degree of human adaptation (not human adapted, weakly human adapted, human adapted, human exclusive; transmissible). Examples shown: Influenza, HIV-1, Yellow Fever, Rabies, SARS/Ebola.

90% of the world’s ecological diversity

90% of the world’s linguistic diversity

Reported locally before identification

H1N1 (Swine Flu) – months (10% of world infected)

HIV – decades (35 million infected)

H5N1 (Bird Flu) – weeks (>50% fatal)

Diseases eradicated in the last 75 years: smallpox

Increase in air travel in the last 75 years:

No one is tracking all the world’s outbreaks

• NASA is tracking thousands of potentially dangerous near-Earth objects (NASA 2011).

• National security agencies are tracking tens of thousands of suspected terrorists daily (Chertoff 2008).

• A deadly microbe is far more likely to sneak onto a plane undetected.

CDC vs Google Flu Trends?

Source: http://www.google.org/flutrends/

“I'm Jacqui Jeras with today's cold and flu report ... across the mid-Atlantic states, a little bit of an increase here” Jan 4th

The first signal is plain language

“today's cold and flu report ... across the mid-Atlantic states, a little bit of an increase” CNN, Jan 4, 2008

Google Flu Trends: + 3 weeks

CDC: + 5 weeks

… but buried in plain view

“today's cold and flu report ... across the mid-Atlantic states, a little bit of an increase”

“We're worried about the markets.”

“We're going to take you to Kenya where the U.S. has dispatched some diplomatic help to try to get the country back on political balance.”

“Is individualism an endangered concept in Saudi Arabia?”

“Well, in St. John's County, one man lost his home trying to keep his pig warm.”

“The pig did not make it.”

“He had everything but the cape. A good samaritan in Ohio saved a family from this ferocious house fire.”

“A spunky boy reels in a 550-pound shark.”

… in 1000s of languages

в предстоящий осенне-зимний период в Украине ожидаются две эпидемии гриппа

(2 flu outbreaks predicted for the Ukraine)

مزيد من انفلونزا الطيور في مصر

(more flu in Egypt)

香港现1例H5N1禽流感病例曾游上海南京等地

(Hong Kong had a case of avian influenza that traveled to Shanghai and Nanjing)

Machine-learning: Relevant?

Reports (millions)

в предстоящий осенне-зимний период в Украине ожидаются две эпидемии гриппа

مزيد من انفلونزا الطيور في مصر

香港现1例H5N1禽流感病例曾游上海南京等地

Targeted machine-processing

Broad machine-processing

Human-processing

Low-volume processing

High-volume processing

Data input

“there is a new flu-like illness here”

Discovered by crawler

Relevance evaluated by machine learning

Relevance evaluated by microtasker

Information stored from the reports

Relevance evaluated by in-house analyst

Monitoring frequency of sources updated

Maximally relevant phrases used to search more data

Direct report from field staff / partner organization

Reports for each outbreak aggregated
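
The data-input flow above can be read as tiered triage: a cheap machine score handles the high-volume stream, and only uncertain reports are escalated to a human. The following is a minimal sketch of that idea, not the system described in the talk. The keyword scorer is a stand-in for the machine-learning model, and `CUE_WORDS`, `machine_score`, `route`, and both thresholds are hypothetical names and values chosen for illustration.

```python
# Hypothetical stand-in for a trained relevance model: a small set of cue words.
CUE_WORDS = {"flu", "illness", "outbreak", "fever", "infected", "epidemic"}


def machine_score(text: str) -> float:
    """Return a crude relevance score in [0, 1] based on cue-word hits."""
    tokens = text.replace("-", " ").split()
    words = {t.strip(".,!?\"'").lower() for t in tokens}
    hits = len(words & CUE_WORDS)
    return min(1.0, hits / 2)  # two or more cue words saturate the score


def route(text: str, low: float = 0.25, high: float = 0.75) -> str:
    """Pick the processing tier for a report (thresholds are assumptions)."""
    score = machine_score(text)
    if score >= high:
        return "store"        # confidently relevant: keep for aggregation
    if score <= low:
        return "discard"      # confidently irrelevant
    return "microtasker"      # uncertain: escalate to a human


reports = [
    "there is a new flu-like illness here",
    "A spunky boy reels in a 550-pound shark.",
    "Officials report fever cases in the district",
]
routes = [route(r) for r in reports]
```

In a fuller version, reports that microtaskers cannot settle would go on to an in-house analyst, and confirmed reports would feed back into which sources are crawled more often.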

Data structuring

• Disease (if known)

• Case counts / demographics

• Location

• Responding organizations

• Transport used

• Quotes from officials

• Changing conditions (spreading / ending)

• Public reaction
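
One way to hold the fields listed above is a single record per report. This is only a sketch: the class name `OutbreakReport` and every field name are illustrative, chosen to mirror the bullets, and the example values come from the "E Coli spreads to Spain" headline used later in the slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OutbreakReport:
    """One structured report; optional fields stay empty until known."""
    location: str
    disease: Optional[str] = None             # "if known"
    case_count: Optional[int] = None
    demographics: Optional[str] = None
    responding_organizations: List[str] = field(default_factory=list)
    transport_used: List[str] = field(default_factory=list)
    official_quotes: List[str] = field(default_factory=list)
    changing_conditions: Optional[str] = None  # e.g. "spreading" or "ending"
    public_reaction: Optional[str] = None


report = OutbreakReport(
    location="Spain",
    disease="E. coli",
    changing_conditions="spreading",
)
```

Making only `location` required is a design assumption here; early reports often describe a "flu-like" illness before any disease is named.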

Motivations

For 600 new seeds, please answer this question:

Does this sentence refer to a disease outbreak:

“E Coli spreads to Spain, sprouts suspected”

Yes/no: __

What disease: _______

What location: _______
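
If several workers answer the same microtask, their answers need to be combined. The slides do not say how this was done; the sketch below assumes a simple per-field majority vote, and the function name `aggregate`, the field keys, and the sample answers are all hypothetical.

```python
from collections import Counter
from typing import Dict, List

FIELDS = ("is_outbreak", "disease", "location")


def aggregate(answers: List[Dict[str, str]]) -> Dict[str, str]:
    """Majority vote per field, ignoring blanks and letter case."""
    consensus = {}
    for key in FIELDS:
        votes = Counter(
            a.get(key, "").strip().lower()
            for a in answers
            if a.get(key, "").strip()
        )
        consensus[key] = votes.most_common(1)[0][0] if votes else ""
    return consensus


# Made-up answers from three workers for the sentence
# "E Coli spreads to Spain, sprouts suspected".
answers = [
    {"is_outbreak": "yes", "disease": "E Coli", "location": "Spain"},
    {"is_outbreak": "yes", "disease": "e coli", "location": "Spain"},
    {"is_outbreak": "no", "disease": "", "location": ""},
]
consensus = aggregate(answers)
```

Lower-casing before voting is a deliberate simplification so that "E Coli" and "e coli" count as the same answer.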

Virtual protecting the real

E Coli, Germany 2011

The AI head-start

Predicting epidemics, 100K training items

Crowdsourcing applicability

• Success:

– Language coverage

– Outbreak relatedness

– Case-counts

– Location names

– Quotes from officials

• Falling short:

– Estimating citizen unrest

– Growth predictions

Native speaker expertise

Data structuring

Data analysis

Crowdsourcing and machine-learning

• The German problem

• Bias-free seeding

• Evaluation for needle-in-haystack scenarios

– Machine and human

• Language representation
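
On the needle-in-haystack bullet: one common reading, assumed here rather than taken from the slides, is that plain accuracy is uninformative when relevant reports are rare, so precision and recall on the relevant class are more useful for both machine and human judgments. The counts below are made up for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Precision, recall and F1 for the positive (relevant) class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up haystack: 1,000 reports, only 10 of them relevant.
gold = [True] * 10 + [False] * 990

# A useless classifier that never flags anything still looks accurate.
always_no = [False] * 1000
accuracy = sum(g == p for g, p in zip(gold, always_no)) / len(gold)
useless = precision_recall_f1(gold, always_no)

# A classifier that finds 5 of the 10 needles, with 5 false alarms.
some_hits = [True] * 5 + [False] * 5 + [True] * 5 + [False] * 985
useful = precision_recall_f1(gold, some_hits)
```

Here the never-flag classifier reaches 99% accuracy with zero recall, while the second classifier scores 0.5 on precision, recall and F1.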

Acknowledgements

Questions?

Appendix: Abstract

The first indication of a new outbreak is often in unstructured data (natural language) and reported openly in traditional or social media as a new ‘flu-like’ or ‘malaria-like’ illness weeks or months before the new pathogen is eventually isolated. We present a system for tracking these early signals globally, using natural language processing and crowdsourcing. By comparison, search-log-based approaches, while innovative and inexpensive, are often a trailing signal that follows open reports in plain language. Concentrating on discovering outbreak-related reports in big open data, we show how crowdsourced workers can create near-real-time training data for adaptive active-learning models, addressing the lack of broad-coverage training data for tracking epidemics. This is well-suited to an outbreak information-flow context, where sudden bursts of information about new diseases or locations need to be manually processed quickly and at short notice.
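
The abstract's loop of crowdsourced labels feeding an active-learning model can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the tiny naive Bayes classifier, the uncertainty-sampling rule (query the item whose score is closest to 0.5), and all example texts are choices made here, and the crowd is simulated by a dictionary of known answers.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().replace("-", " ").split()


def train(labeled):
    """Fit a tiny naive Bayes model from (text, is_relevant) pairs."""
    counts = {True: Counter(), False: Counter()}
    docs = Counter()
    for text, label in labeled:
        docs[label] += 1
        counts[label].update(tokenize(text))
    vocab = set(counts[True]) | set(counts[False])
    return counts, docs, vocab


def prob_relevant(model, text):
    """Estimate P(relevant | text) with add-one smoothing."""
    counts, docs, vocab = model
    total_rel = sum(counts[True].values()) + len(vocab)
    total_irr = sum(counts[False].values()) + len(vocab)
    log_odds = math.log((docs[True] + 1) / (docs[False] + 1))
    for word in tokenize(text):
        if word not in vocab:
            continue  # unseen words carry no evidence either way
        p_rel = (counts[True][word] + 1) / total_rel
        p_irr = (counts[False][word] + 1) / total_irr
        log_odds += math.log(p_rel / p_irr)
    return 1 / (1 + math.exp(-log_odds))


def most_uncertain(model, pool):
    """Uncertainty sampling: the text whose score is closest to 0.5."""
    return min(pool, key=lambda t: abs(prob_relevant(model, t) - 0.5))


labeled = [
    ("new flu outbreak reported in the city", True),
    ("markets fall on weak earnings", False),
]
# Unlabeled text mapped to the label a crowd worker would give (simulated).
pool = {
    "mystery illness spreads in the region": True,
    "flu cases rise across the state": True,
    "boy reels in a giant shark": False,
}
queried = []
for _ in range(2):
    model = train(labeled)
    pick = most_uncertain(model, list(pool))
    labeled.append((pick, pool.pop(pick)))  # "ask the crowd" for a label
    queried.append(pick)
model = train(labeled)
score = prob_relevant(model, "new flu outbreak")
```

Each round retrains on everything labeled so far and spends the next crowd judgment on the report the model is least sure about, which is the property that makes the approach suited to sudden bursts of reports about new diseases or locations.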

