RTI International
RTI International is a trade name of Research Triangle Institute. www.rti.org
Methodological Considerations in Analyzing Twitter Data
Annice Kim, Heather Hansen, Joe Murphy
Presentation at AAPOR Annual Conference, May 2012, Orlando, FL.
Purpose
In this session, we use examples from an ongoing study to illustrate methodological issues in analyzing Twitter data.
We will discuss insights on:
1) sampling
2) data cleaning
3) volume + data management
4) metrics
5) time frame and unit of analysis
We will conclude with areas for future research.
Background: Twitter Growth
Twitter began in July 2006.
Source: Twitter (http://blog.twitter.com/2012/03/twitter-turns-six.html)
[Figure: Tweets per day and registered users (millions), Jan-08 to Mar-12; y-axis 0 to 400. Annotated values: 3 million users and 300,000 tweets/day; 140 million users and 340 million+ tweets/day.]
Background: Impact of Twitter
Recent studies highlight the importance of Twitter in helping researchers understand public discourse and public opinion about a wide range of topics, including health.
“Pandemics in the age of Twitter” - Chew (2010)
“Predicting the future with social media” - Asur & Huberman (2010)
“From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series” - O’Connor, Balasubramanyan, Routledge, & Smith (2010)
Data Source
Topics: salvia, ketamine, cocaine, flu
• Relevant tweets from radian6
• Google Insights for Search
• Prevalence rates of drug use - NSDUH
• Confirmed flu cases - CDC, MMWR
More info: bit.ly/twitterNSDUH
Can tweets and Google searches forecast trends in actual health behavior?
1) Sample Frame
Twitter default search only goes back 1 week and cannot handle multiple keyword searches.
Third-party sources: Application Programming Interface (API) vs. firehose access

                   API              Firehose
Data Available     1-10%+ sample    Full sample
Historical Data    No               Yes (availability varies by vendor)
Cost               Free             Varies by vendor/volume ($500+)
2) Noise/Data Cleaning
Other, unrelated conversations may be driving your topic coverage.
For some topics, the noise level is high (e.g., “cocaine”).
Example: “salvia” the drug vs. “salvia” in a “gardening” context.
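To make the noise problem concrete, the sketch below flags tweets that mention a topic keyword alongside off-topic context words and reports the share flagged. It is a minimal illustration, not the cleaning procedure used in the study: the term list `NOISE_TERMS`, the helper names, and the sample tweets are all hypothetical.

```python
import re

# Hypothetical off-topic context terms for the keyword "salvia";
# a real study would derive these from manual review of a tweet sample.
NOISE_TERMS = {"garden", "gardening", "plant", "plants", "flower", "flowers"}


def is_noise(text, noise_terms=NOISE_TERMS):
    """Return True if the tweet text contains any off-topic context term."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return bool(words & noise_terms)


def noise_rate(tweets):
    """Share of tweets flagged as noise (0.0 for an empty list)."""
    if not tweets:
        return 0.0
    return sum(is_noise(t) for t in tweets) / len(tweets)


# Made-up example tweets, for illustration only.
tweets = [
    "Planted purple salvia in the garden today",
    "News report on salvia and teens",
    "My salvia flowers are blooming",
    "State considers salvia ban",
]
relevant = [t for t in tweets if not is_noise(t)]
```

Exact word matching like this is crude (it misses "planted" and any misspellings), so a flagged share is best treated as a rough screen that still needs manual review.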
3) Volume + Data Management
o Limits on the amount of data that can be exported at one time (e.g., radian6 allows only 5,000 cases)
o Tweet files need to be merged for use with text analysis software, which also has limits on the volume of data it can import and analyze.
17 months of healthcare reform Tweets
1.5 million Tweets
300 radian6 exports
26 CSV files
78 STAS files (~20k tweets per run)
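The figures above follow from the export cap: 1.5 million tweets at 5,000 cases per export gives 300 exports, and at roughly 20,000 tweets per run the analysis needs at least 75 runs, close to the 78 reported. The sketch below shows that arithmetic and one way to merge small exports and re-split them into batches under an import limit. The function names and the in-memory row lists are assumptions for illustration; real exports would be read from files.

```python
import math
from itertools import islice


def exports_needed(n_rows, rows_per_export):
    """Minimum number of exports (or runs) given a per-export row cap."""
    return math.ceil(n_rows / rows_per_export)


def rebatch(export_batches, batch_size):
    """Merge an iterable of row lists and re-split it into batches
    of at most batch_size rows, preserving row order."""
    merged = (row for batch in export_batches for row in batch)
    while True:
        chunk = list(islice(merged, batch_size))
        if not chunk:
            return
        yield chunk


# Toy data: three "exports" of 5 rows each, re-split into batches of 4.
exports = [[f"tweet {i}" for i in range(start, start + 5)]
           for start in range(0, 15, 5)]
batches = list(rebatch(exports, 4))
batch_sizes = [len(b) for b in batches]
```

Because `rebatch` is a generator, only one batch is held in memory at a time, which matters when the merged file is larger than the analysis software can import.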
4) Metrics
# of salvia tweets (daily)
[Figure: Salvia Tweets, October 1 - December 31, 2010; y-axis: Tweets (day), 0 to 15,000]
% of tweeters mentioning salvia at least once (weekly)
[Figure: % of Tweeters mentioning "salvia" at least once (week); y-axis 0 to 0.0003]
Salvia tweets as % of all tweets (daily)
[Figure: % Salvia Tweets (day); y-axis 0 to 0.0001]
4) Metrics (cont)
Unadjusted: total # of topic tweets per day
Adjusted: topic tweets as a % of all tweets per day
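The difference between the two metrics can be sketched as follows: the adjusted value divides each day's topic count by that day's total Twitter volume, so growth in overall Twitter use does not masquerade as growth in the topic. The function name and all counts below are hypothetical, chosen only to show raw volume rising while the adjusted share falls.

```python
def adjusted_share(topic_counts, total_counts):
    """Topic tweets as a proportion of all tweets, per day.

    Days with a missing or zero denominator map to None rather than
    raising, since daily totals are not always available.
    """
    shares = {}
    for day, n_topic in topic_counts.items():
        n_total = total_counts.get(day)
        shares[day] = n_topic / n_total if n_total else None
    return shares


# Hypothetical counts: topic tweets double, but all tweets quadruple.
topic = {"day 1": 500, "day 2": 1000}
total = {"day 1": 50_000_000, "day 2": 200_000_000}

shares = adjusted_share(topic, total)
```

Multiply the proportions by 100 to report them as percentages.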
5) Time Frame/Unit of Analysis
[Figure: % Salvia Tweets (day), May 1, 2008 - December 31, 2010; y-axis 0 to 0.0001]
5) Time Frame/Unit of Analysis (cont)
[Figure: % Salvia Tweets (week), 3-Oct to 26-Dec; y-axis 0 to 0.000035]
[Figure: % Salvia Tweets (day), 1-Oct to 31-Dec; y-axis 0 to 0.0001]
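The weekly and daily charts differ in scale because aggregation smooths out single-day spikes. The sketch below rolls daily counts up to weekly shares by summing topic tweets and total tweets within each week before dividing. It assumes weeks start on Sunday, and the function names and counts are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_start(d):
    """Sunday on or before date d (assumes Sunday-start weeks)."""
    return d - timedelta(days=(d.weekday() + 1) % 7)


def weekly_share(daily):
    """daily: {date: (topic_count, total_count)} -> {week_start: share}.

    Sums numerators and denominators within each week before dividing,
    so high-volume days carry proportionally more weight.
    """
    sums = defaultdict(lambda: [0, 0])
    for d, (n_topic, n_total) in daily.items():
        w = week_start(d)
        sums[w][0] += n_topic
        sums[w][1] += n_total
    return {w: t / n for w, (t, n) in sorted(sums.items()) if n}


# Hypothetical week: 10 topic tweets per day, with a one-day spike of 100,
# against 1,000,000 total tweets per day.
daily = {
    date(2010, 10, 3) + timedelta(days=i): (100 if i == 3 else 10, 1_000_000)
    for i in range(7)
}
weekly = weekly_share(daily)
daily_peak = max(t / n for t, n in daily.values())
```

In this toy week the daily peak is 0.0001 while the weekly share is about 0.000023, which is why the unit of analysis needs to match across the data sources being compared.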
5) Time Frame/Unit of Analysis (cont)
[Figure: Ketamine, May 1, 2008 - December 31, 2010; x-axis 5/1/08 to 1/1/11]
Summary: Key Considerations
• Topic suitable for Twitter analysis?
  – Enough conversation?
  – High noise potential?
• Can you use a sample?
• Which metric is most useful?
  – Raw volume? As a proportion of all tweets?
• Are you trying to compare trends?
  – Timeframe of data sources
  – Unit of analysis
• Do you have enough resources?
  – Potential cost of historical data
  – Data export, cleaning and analysis
Future Studies
• Need for standards in sampling
  – Compare sample from API: is it a random sample? Is it biased?
• Need for standards in metrics
  – More frequent data from Twitter, e.g., daily Twitter volume for calculating the denominator and filtering out spam
• Insights into general patterns of Twitter use and demographics of users
More Information
Annice Kim, RTI International - [email protected]
Heather Hansen, RTI International - [email protected]
Joe Murphy, RTI International - [email protected]