Detecting Influenza Outbreaks by Analyzing Twitter Messages
By Aron Culotta
Jedsada Chartree 02/28/11
Outline
• Introduction
• Motivations
• Data
• Methodology
• Results
• Conclusion
• References
Introduction
• Growing interest in monitoring disease outbreaks using the Internet
• The growth of Twitter
Motivations
• Developing methods that can reliably track influenza-like illness (ILI) rates in real time.
Data
• The U.S. Centers for Disease Control and Prevention (CDC)
• Twitter data
• 36-week period from August 29, 2009 to May 8, 2010.
Data
The ILI rates from the CDC’s weekly tracking statistics (09/05/09 to 05/08/10)
The number of Twitter messages collected per week
Methodology
• Gathering the ILI rates and Twitter messages
• Finding the correlation between the ILI rates and Twitter messages
P = The proportion of the population exhibiting ILI symptoms
W = {w1…wk} = A set of k keywords
D = Document collection
β1, β2 = The coefficients
ε = The error term
Q(W,D) = The fraction of documents in D that match W (|D_W|/|D|)
logit(P) = ln(P/(1-P))
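The definitions above can be tied together in a short sketch. It assumes the simple linear model takes the form logit(P) = β1·logit(Q(W,D)) + β2 + ε, and that a document "matches" W when it contains at least one keyword. The function names and whitespace tokenization are illustrative choices, not taken from the slides.

```python
import math


def logit(p):
    """logit(p) = ln(p / (1 - p)), defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def query_fraction(documents, keywords):
    """Q(W, D) = |D_W| / |D|.

    Assumption: a document matches W if it contains at least one keyword,
    using naive lowercase whitespace tokenization.
    """
    keywords = {w.lower() for w in keywords}
    matches = sum(1 for d in documents if keywords & set(d.lower().split()))
    return matches / len(documents)


def fit_simple_linear(xs, ys):
    """Ordinary least squares for y = b1 * x + b2; returns (b1, b2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b2 = mean_y - b1 * mean_x
    return b1, b2


def predict_ili(q, b1, b2):
    """Map a query fraction q back to an ILI proportion via the inverse logit."""
    z = b1 * logit(q) + b2
    return 1.0 / (1.0 + math.exp(-z))
```

In use, xs would hold logit(Q(W,D)) for each week and ys the logit of that week's CDC ILI rate; the fitted β1 and β2 then turn a new week's query fraction into an ILI estimate.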
Methodology
• Filtering spurious matches (noise)
The number of messages containing the keyword “flu” together with each of several keywords that might lead to spurious correlations.
Methodology
• Filtering spurious matches by supervised learning
  - Training a document classifier using logistic regression
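As a rough illustration of this step, the sketch below trains a bag-of-words logistic regression classifier with plain stochastic gradient descent. The feature set, learning rate, regularization strength, and function names are all assumptions made for the example; the slides do not specify them.

```python
import math
from collections import Counter


def featurize(text):
    """Bag-of-words counts from naive lowercase whitespace tokenization."""
    return Counter(text.lower().split())


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic(texts, labels, epochs=300, lr=0.1, l2=0.01):
    """Fit weights by SGD; label 1 = reports ILI, 0 = spurious match."""
    weights, bias = {}, 0.0
    data = [(featurize(t), y) for t, y in zip(texts, labels)]
    for _ in range(epochs):
        for feats, y in data:
            z = bias + sum(weights.get(w, 0.0) * c for w, c in feats.items())
            err = sigmoid(z) - y  # gradient of the log loss w.r.t. z
            bias -= lr * err
            for w, c in feats.items():
                old = weights.get(w, 0.0)
                weights[w] = old - lr * (err * c + l2 * old)
    return weights, bias


def predict_proba(text, weights, bias):
    """p(y = 1 | text) under the trained model."""
    feats = featurize(text)
    return sigmoid(bias + sum(weights.get(w, 0.0) * c for w, c in feats.items()))
```

Each training message needs a hand-assigned label; the probability returned by predict_proba is what a soft or hard filter would consume.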
Methodology
• Filtering spurious matches by supervised learning
  - Combining filtering with regression
    1. Soft classifier
Methodology
• Filtering spurious matches by supervised learning
  - Combining filtering with regression
    2. Hard classifier
• Applying both classifiers to the simple linear model.
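One way to read the two variants: the soft classifier replaces the raw match count with the expected number of relevant matches, while the hard classifier counts only matches whose predicted probability exceeds a threshold. The sketch below assumes a threshold of 0.5 and takes the classifier probabilities for the matching documents as input; the function names are illustrative.

```python
def soft_fraction(match_probs, total_docs):
    """Soft filter: expected fraction of relevant matches.

    match_probs holds p(y = 1 | d) for each document d in D_W;
    total_docs is |D|.
    """
    return sum(match_probs) / total_docs


def hard_fraction(match_probs, total_docs, threshold=0.5):
    """Hard filter: fraction of documents classified as relevant.

    Assumption: a document counts only if its probability exceeds threshold.
    """
    return sum(1 for p in match_probs if p > threshold) / total_docs
```

Either value can stand in for Q(W,D) when fitting the simple linear model.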
Methodology
• Evaluating false alarms by simulation
  - Sample 1,000 messages deemed to be spurious.
  - Sample with replacement an increasing number of the spurious messages and add them to the original message set.
  - Use the same trained regression models.
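The simulation steps above can be sketched as follows. The estimate argument is a hypothetical callable standing in for whatever already-trained pipeline maps a message set to a predicted ILI rate; it is not re-trained inside the loop. The seed and function name are assumptions for reproducibility and illustration.

```python
import random


def simulate_false_alarms(messages, spurious_pool, counts, estimate, seed=0):
    """Inject growing numbers of spurious messages and re-estimate.

    messages:      the original message set (list of strings)
    spurious_pool: messages deemed spurious (e.g. a sample of 1,000)
    counts:        increasing numbers of spurious messages to add
    estimate:      hypothetical callable, message list -> predicted value
    Returns a list of (number_added, estimate) pairs.
    """
    rng = random.Random(seed)
    results = []
    for n in counts:
        # random.choices samples with replacement
        injected = rng.choices(spurious_pool, k=n)
        results.append((n, estimate(messages + injected)))
    return results
```

Comparing the estimates with and without the classifier filter, as the number of injected messages grows, shows how robust each model is to false alarms.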
Results
Fitted and predicted ILI rates using regression over query fractions of Twitter messages
Results
Correlation results with refinements of the flu query
Results
Number of false messages added
Conclusion
• The proposed method can be used to track influenza rates from Twitter messages.
• The proposed method for evaluating false alarms performs satisfactorily.
References
• Aron Culotta. 2010. Detecting influenza outbreaks by analyzing Twitter messages.
• Jeremy Ginsberg and others. 2009. Detecting influenza epidemics using search engine query data.