Influenza-Like Illness Surveillance on Twitter
through Automated Learning of Naïve Language
Presented by Joey Ruberti
Francesco Gesualdo, Giovanni Stilo, Eleonora Agricola, Michaela V. Gonfiantini, Elisabetta Pandolfi, Paola Velardi, and Alberto E. Tozzi
Overview
1. Problem
2. Previous Approaches
3. Proposed Approach
4. System Details
5. System Evaluation Strategies
6. Results
7. Reliability
8. Effectiveness
9. Limitations
10. Conclusions
Problem: Twitter Mining Potential
• The general public shares personal information on social networks and microblogs like Twitter
• How can this data be utilized?
• This information is a potential source of real-time data directly from individuals that can be used for disease surveillance and public health
• Tweets often accompanied by location indicators
• Syndromic surveillance systems
• What is the best way to aggregate and analyze this data?
288 million monthly active users on Twitter
500 million Tweets sent per day
Previous Approaches: Measuring Specific Keywords
• Measure the occurrence of specific disease-related search
keywords vs disease trends
• Google Flu Trends - a Google service that utilized this technique to estimate and predict influenza activity by aggregating search query volumes
• Suffers from a high level of noise because search peaks are often completely unrelated to the incidence of a disease
Previous Approaches: Measuring Specific Keywords
• These approaches usually look for the name of the clinical condition or its synonyms (e.g., H1N1 or Swine Flu)
• Sometimes the keywords are arbitrarily chosen by the authors but are still related to the clinical syndrome (e.g., flu or vaccine)
Problems with this type of approach:
1. In blogs/forums, people are motivated by a communication need rather than an information need, so naïve language is often used instead of technical language
2. Most users describe a combination of symptoms rather than a diagnosis, so looking only at disease-related keywords can miss a large volume of messages that report signs/symptoms
New Approach: Goals
• Analyze Twitter messages as a source of data for syndromic surveillance but take into account the use of non-medical language by Twitter users
• Use a combination of symptoms rather than a suspected or final diagnosis keyword like previous approaches
• Use Twitter’s geolocation data to narrow down results to locations in the United States
New Approach: Design Overview
1. Develop a minimally supervised algorithm that learns technical term-naïve term pairs based on pattern generalization and complete-linkage clustering
2. Apply the algorithm to a group of technical terms extracted from the European Centre for Disease Prevention and Control (ECDC) case definition for influenza-like illness (ILI)
3. Construct a Boolean query based on the ECDC case definition for ILI, using both technical and related jargon terms identified by the algorithm from step 1
4. Collect 2 sets of Twitter messages matching the query
5. Compare the trends of these messages with traditional surveillance data for influenza in the US
The similarity of 2 clusters is the similarity of their most dissimilar members
Merging by this criterion yields compact clusters whose least-similar members are still similar
http://nlp.stanford.edu/IR-book/html/htmledition/single-link-and-complete-link-clustering-1.html
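The complete-linkage criterion can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the toy similarity function, and the stopping threshold are assumptions, not the authors' implementation.

```python
# Sketch of complete-linkage agglomerative clustering over similarity
# scores. All names and the threshold are illustrative.

def complete_link_sim(c1, c2, sim):
    """Similarity of two clusters = similarity of their MOST dissimilar members."""
    return min(sim(a, b) for a in c1 for b in c2)

def cluster(items, sim, threshold):
    """Repeatedly merge the most similar pair of clusters until no pair
    exceeds the threshold under the complete-link criterion."""
    clusters = [[x] for x in items]
    while len(clusters) > 1:
        # find the pair of clusters with the highest complete-link similarity
        i, j = max(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: complete_link_sim(clusters[ij[0]], clusters[ij[1]], sim),
        )
        if complete_link_sim(clusters[i], clusters[j], sim) < threshold:
            break  # no pair is similar enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

With a toy similarity such as `sim = lambda a, b: 1.0 / (1.0 + abs(a - b))`, nearby numbers end up in the same cluster while distant ones stay apart.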
Algorithm Development: extraction of naïve-medical jargon
• To overcome the clinical-term bias of previous approaches, an algorithm was developed that automatically maps naïve terms onto a specific medical term, using www.freebase.com/view/medicine/disease
• The algorithm starts with a small initial learning set of medical conditions, composed of term pairs (1 technical term and 1 naïve term, e.g., emesis-vomiting), to extract basic patterns from the web; it then generalizes, clusters, and weights these patterns based on another small set of pairs
• Generalized patterns are learned for sentence fragments of naïve terms and for multi-word expressions describing medical conditions (e.g., “inflammation of the nose” -> “inflammation of BODYPART”)
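The two core steps above (extracting a pattern from a sentence containing a seed pair, then generalizing body-part words to a placeholder) can be sketched as follows. The body-part list and example sentence are hypothetical; the real system learns patterns from web-scale text.

```python
import re

# Illustrative sketch of pattern extraction and generalization.
# The word list and sentences are hypothetical examples, not the authors' data.

BODYPARTS = {"nose", "throat", "head", "chest"}

def extract_pattern(sentence, technical, naive):
    """Return the text between the technical term and the naïve term,
    as a candidate pattern; None if the pair does not co-occur."""
    m = re.search(re.escape(technical) + r"(.*?)" + re.escape(naive), sentence)
    return m.group(1).strip() if m else None

def generalize(pattern):
    """Replace body-part words with a BODYPART placeholder."""
    return " ".join("BODYPART" if w in BODYPARTS else w for w in pattern.split())
```

For example, `generalize("inflammation of the nose")` yields `"inflammation of the BODYPART"`, which can then match other conditions described the same way.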
Query Development: Aggregation of Symptoms
• A Boolean query was developed to look for Tweets based on an aggregation of symptoms using the following ECDC case definition for an influenza-like illness:
Sudden onset of symptoms
AND at least one of the following 4 systemic symptoms: fever or feverishness, malaise, headache, myalgia
AND at least one of the following 3 respiratory symptoms: cough, sore throat, shortness of breath
Applying the Algorithm
The algorithm was applied to a set of 8 symptom-related medical conditions expressed as technical terms derived from the case definition
Set of naïve terms obtained by the algorithm:
Generating the Boolean Query
Using the naïve terms discovered by the algorithm together with the original technical terms, the influenza-like illness case definition was transformed into a Boolean query. The technical-term core is shown below; in the full query each term is also ORed with its naïve equivalents
( (fever) OR (feverishness) OR (malaise) OR (headache) OR (myalgia) ) AND ( (cough) OR (pharyngitis) OR (dyspnea) )
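Evaluating this query against a tweet's text reduces to checking that at least one systemic term AND at least one respiratory term are present. A minimal sketch (using only the technical terms from the slide; the real system also matches the naïve equivalents):

```python
# Sketch of evaluating the ILI case-definition query against tweet text.
# Only the technical terms are listed here; the actual query also includes
# the naïve terms learned by the algorithm.

SYSTEMIC = {"fever", "feverishness", "malaise", "headache", "myalgia"}
RESPIRATORY = {"cough", "pharyngitis", "dyspnea"}

def matches_ili(text):
    """True iff the text contains at least one systemic symptom term
    AND at least one respiratory symptom term."""
    words = set(text.lower().split())
    return bool(words & SYSTEMIC) and bool(words & RESPIRATORY)
```

A real implementation would also handle punctuation, word inflection, and multi-word naïve expressions.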
Extracting Twitter Data: The Datasets
Twitter data was analyzed on two different datasets
Dataset 1
From November 11, 2012 to April 27, 2013, the first dataset was derived from a 1% sample of the worldwide Twitter traffic using the Twitter API
Dataset 2
From January 27, 2013 to May 5, 2013, the second dataset was derived from all the Tweets including at least one of the singleton terms composing the influenza-like illness query and 3 additional queries based on other case definitions (Cold, Gastroenteritis, Allergy)
17 technical keywords and 65 jargon keywords
Geolocalization: How to identify Tweets from the US?
• 3 different geo-localization strategies were used to identify tweet trends localized in the US
1. US-GEO - tweets providing US GPS coordinates
2. US-WIDE - tweets satisfying at least 1 of the following:
• US GPS coordinates
• Explicit US place code
• US-related time zone
• US place indicated in user’s profile
3. US-NARROW - same as US-WIDE excluding all tweets reporting a US time zone but a non-US place code
• This approach allows for a larger number of tweets to be identified rather than just using GPS coordinates alone
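The three strategies can be sketched as predicates over a tweet's optional location fields. The field names below are illustrative, not the exact Twitter API schema:

```python
# Sketch of the three geo-localization strategies. Each tweet is a dict
# with optional location fields; the field names are assumptions made for
# illustration, not the Twitter API's actual attribute names.

def us_geo(tweet):
    """US-GEO: tweet provides US GPS coordinates."""
    return tweet.get("gps_country") == "US"

def us_wide(tweet):
    """US-WIDE: any of GPS, place code, time zone, or profile points to the US."""
    return bool(us_geo(tweet)
                or tweet.get("place_country") == "US"
                or tweet.get("us_time_zone")
                or tweet.get("profile_country") == "US")

def us_narrow(tweet):
    """US-NARROW: like US-WIDE, but exclude tweets whose only US signal is
    the time zone while the place code points outside the US."""
    if tweet.get("us_time_zone") and tweet.get("place_country") not in (None, "US"):
        return False
    return us_wide(tweet)
```

US-GEO is the strictest (GPS only), US-WIDE the most inclusive, and US-NARROW trades a little recall for fewer mislocated tweets.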
Query Evaluation: Will the query work?
• 100 tweets matching the influenza-like illness query were extracted from the second dataset
• A random sample of 500 tweets not matching the query, but including at least one symptom, was also extracted
• These Tweets were independently examined by the authors to test the consistency of extracted tweets with the case definition
• The examination yielded a 3% false positive rate, corresponding to a precision of 0.97
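The two figures are equivalent: if 3 of the 100 query-matching tweets were judged not to fit the case definition, precision is 97 / (97 + 3) = 0.97.

```python
# Precision from the manual evaluation counts: of the tweets the query
# matched, what fraction truly fit the case definition?

def precision(true_pos, false_pos):
    return true_pos / (true_pos + false_pos)
```

So a 3% false positive rate on 100 matched tweets gives `precision(97, 3) == 0.97`.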
Source of Influenza-like illness data
US Influenza-like illness trend data
• Obtained from reports by the U.S. Outpatient Influenza-like Illness Surveillance Network (ILINet)
• Their weekly reports were sent to the CDC and contain the number of patient visits for influenza-like illness by age group
• The CDC defines an influenza-like illness as fever (temperature
of 100°F or greater) and a cough and/or a sore throat without a known cause other than influenza
Control series
• Some models built on Twitter series can fit the data even when using keywords not related to ILI
• To measure the correlation of unrelated data, a series of tweets containing ILI non-related keywords was used
• The ILI non-related keywords were:
• "zombie" OR "zed" OR "undead" OR "living dead"
• This data was used to compare the non-ILI trend with the ILINet data
Statistical analysis
Results of Tweet trends are reported as the number of ILI-positive tweets (or ILI-negative tweets for the control series) per unit of time (week)
Results of Tweet trends, ILINet data, and Google Flu Trends data are expressed as z-scores
Pearson correlation coefficients were used to compare US surveillance data with Twitter traffic consistent with the ILI case definition and with Twitter traffic not consistent with the ILI case definition (non-ILI tweets)
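Both statistics are standard; a minimal sketch of standardizing a weekly series to z-scores and computing the Pearson correlation between two series:

```python
from statistics import mean, pstdev

# Standardize a series to z-scores, then compute Pearson's r as the mean
# product of the two z-score series.

def z_scores(series):
    m, s = mean(series), pstdev(series)
    return [(x - m) / s for x in series]

def pearson_r(xs, ys):
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)
```

In practice one would use a statistics package (e.g., `scipy.stats.pearsonr`), which also reports the p-value.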
Statistical analysis
Twitter traffic expressed as:
• Total available Tweet traffic
• US-GEO Tweets
• US-WIDE Tweets
• US-NARROW Tweets
The total available traffic series for the 1% sample dataset and the US-NARROW series for the second dataset were smoothed by a Loess function
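Loess smoothing fits a weighted local regression around each point. The sketch below is a simplified single-pass version with tricube weights (real Loess, as in statistical packages, also iterates with robustness weights); it is an illustration, not the authors' implementation.

```python
# Simplified Loess (locally weighted linear regression) smoother:
# for each x, fit a weighted least-squares line using tricube weights
# over the frac-nearest neighbours, and evaluate it at x.

def loess(xs, ys, frac=0.5):
    n = len(xs)
    k = max(2, int(frac * n))  # number of neighbours in the local window
    smoothed = []
    for x in xs:
        # bandwidth = distance to the k-th nearest neighbour
        h = sorted(abs(x - xj) for xj in xs)[k - 1] or 1.0
        w = [(1 - min(abs(x - xj) / h, 1.0) ** 3) ** 3 for xj in xs]
        # weighted least-squares line, evaluated at x
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, xs))
        swy = sum(wi * yi for wi, yi in zip(w, ys))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, xs))
        swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, xs, ys))
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            smoothed.append(swy / sw)  # degenerate window: weighted mean
        else:
            b = (sw * swxy - swx * swy) / denom
            a = (swy - b * swx) / sw
            smoothed.append(a + b * x)
    return smoothed
```

A perfectly linear series passes through unchanged, while week-to-week noise in a real series is damped.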
Tweets from the 1% sample and US-NARROW tweets consistent with the ILI case definition were also compared with Google Trends data and with trends generated by tweets reporting the words “flu” OR “influenza”
Results: Dataset 1
447,597,718 Tweets were extracted between November 11, 2012 and April 27, 2013 (a 1% sample of the total worldwide Twitter traffic)
From the extracted tweets, 5,508 satisfied the conditions set by the query for influenza-like illness
The sample of ILI tweets responding to the geo-localization criteria was too small, so the total ILI tweet series was used
Twitter and traditional surveillance trends for the US were compared, and the correlation coefficient was high (0.981, p<0.001)
Comparison between weekly ILI tweets, ILINet data, Google Flu Trends, and tweets containing the words “flu” or “influenza”
Z-scores of CDC’s reported ILI *from November 2012 to May 2013
Z-scores of tweets satisfying the ILI query
Z-scores of tweets including the words “flu” or “influenza”
Z-scores of Google Flu Trends data
Tweets satisfying the ILI query do not overestimate the actual flu peak; the Google Flu Trends series and the series of tweets containing “flu” or “influenza” do
Results: Dataset 1 (non-related keywords)
The ILINet data was also compared with the control series of tweets containing ILI non-related keywords; the correlation coefficient was very low (0.292, p=0.159)
The ILI non-related keywords were:
"zombie" OR "zed" OR "undead" OR "living dead"
Results: Dataset 2
232,452,510 tweets were extracted from January 27, 2013 to May 5, 2013 containing at least one of the terms included in the ILI case definition and in the 3 additional Influenzanet case definitions (Cold, Allergy, Gastroenteritis)
3,252,013 (1.3%) Tweets responded to the US-GEO criteria *Tweets with GPS Coordinates
85,381,987 (36%) responded to the US-WIDE criteria *Tweets with GPS Coordinates, US place code, US related time zone, or US place indicated in profile
11,040,587 (4.7%) responded to the US-NARROW criteria *same as US-WIDE excluding all Tweets reporting a US time zone but a non-US place code
262,853 tweets (0.11%) satisfied the conditions set by the query for ILI
Weekly reported ILI (CDC) and Tweets satisfying ILI query
Z-scores of CDC’s reported ILI *from January 2013 to May 2013
Z-scores of tweets satisfying the ILI query
A. All tweets (regardless of location): r=0.769, p=0.001
B. US-GEO (GPS-localized tweets): r=0.974, p=0.001
C. US-WIDE Tweets: r=0.980, p=0.001 (highest correlation coefficient)
D. US-NARROW Tweets: r=0.977, p=0.001
Results: Dataset 2
When smoothed by a Loess function, the comparison of ILINet data with the US-NARROW series yielded the highest correlation coefficient (r=0.997, p<0.001)
Comparing ILINet data with Tweets containing the word “flu” or the word “influenza”
Z-scores of CDC’s reported ILI *from January 2013 to May 2013
Z-scores of tweets including the words “flu” or “influenza” *geolocalized with the extended narrow localization pattern
a lower correlation coefficient than the tweet trend consistent with the ECDC case definition (r=0.944, p<0.001)
Reliability
• The results show a very high correlation between tweet trends and traditional US surveillance data (higher than Google Flu Trends for the same time period)
• This approach did not overestimate the actual flu peak in the 2012-2013 flu season like Google Flu Trends and the series of tweets containing “flu” or “influenza”
• The system has a very low false positive rate (3%), as shown by the manual examination of the sample tweets
How has this approach proved useful?
Demonstrated the importance of
• Accounting for naïve language when performing syndromic surveillance
• Improves the detection of health-related concepts, producing a much larger body of evidence
• E.g., “pharyngitis” accumulated only 26 tweets, while the corresponding naïve terms occurred 234,951 times
• Using a combination of symptoms to analyze words as they appear in specific contexts, instead of relying on final-diagnosis keywords for query development
• Allows for a variety of natural language analyses and sense disambiguation techniques to be performed that could potentially reduce noise and more accurately detect disease indicators
How has this approach proved useful?
• The system can be applied to different country settings and languages
• By introducing other disease ontologies, the system can be applied to other kinds of syndromic surveillance (emerging diseases/allergies)
• Allows for the discovery of associations between symptoms and specific exposures
• System cost is low and the data can be acquired quickly compared to traditional surveillance systems
What were the limitations to this approach?
• Twitter surveillance, like search-related surveillance used in Google Flu Trends, may be influenced by news and media reports
• The second dataset only obtained Tweets from the second phase of the influenza season
• Twitter users are not representative of the entire US population
• Observed trends may therefore reflect a restricted population group
• Restricting the analysis to geo-localized tweets may introduce a selection bias
• Eg. users that allow GPS coordinates or include localization information in their profile may differ from other Twitter users
• System only tested on 1 influenza season
Conclusions
• Twitter mining techniques focused on disease surveillance can be improved by mining Tweets with Boolean queries derived from disease case definitions and by including naïve terms in the queries
• This technique proved less sensitive to media reports than other approaches like Google Flu Trends
• Using Twitter’s geolocation data allows for more precise information to be extracted for syndromic surveillance and disease mapping
References
Gesualdo F, Stilo G, Agricola E, Gonfiantini MV, Pandolfi E, Velardi P, Tozzi AE. Influenza-Like Illness Surveillance on Twitter through Automated Learning of Naïve Language. PLoS ONE. http://www.plosone.org/article/fetchObject.action?uri=info%3Adoi%2F10.1371%2Fjournal.pone.0082489&representation=PDF