Twinder: A Search Engine for Twitter Streams
#ICWE2012 Berlin, Germany July 25th, 2012
Ke Tao, Fabian Abel, Claudia Hauff, Geert-Jan Houben
Web Information Systems, TU Delft
Get information from Twitter
• Twitter is more like a news medium.
• How do people search Twitter?
  • Search on Twitter (Teevan et al.)
How do people use Twitter as a source of information?
                        Web Search    Twitter Search
Query length (chars)    18.80         12.00
Query length (words)    3.08          1.64
Is a celebrity name     3.11%         15.22%
Research Questions
1. Given a topic, can we identify the relevant tweets based on the characteristics of the tweets?
2. Are semantics meaningful for determining the tweets’ relevance for a topic?
3. How can we design an architecture that is scalable?
What are the challenges we are facing?
Search on Twitter
• Twitter search interface
• Ordered by time
• Keyword-based match
What can you do on Twitter today?
Twinder = TWeet + fINDER
Our solution - Architecture
[Figure: Twinder architecture. Components: Search User Interface, Twinder Search Engine (Feature Extraction, Feature Extraction Task Broker, Relevance Estimation), Index, Social Web Streams, Cloud Computing Infrastructure. Feature groups in the index: topic-sensitive and topic-insensitive; Keyword-based Relevance, Semantics-based Relevance, Syntactical Features, Semantic Features, Contextual Features. Data flows: query, results, feedback, messages, feature extraction tasks. Actors: users.]
Core Components of Twinder
Feature Extraction
• Receive Twitter messages from Social Web Streams
• Features of two categories:
• (1) Topic-sensitive features
• (2) Topic-insensitive features
• Different extraction strategies are designed for different features
Core Components of Twinder
Feature Extraction Task Broker
• Twinder makes use of MapReduce and cloud computing infrastructures to allow for high scalability and frequent updates of its multifaceted index.
• The Feature Extraction Task Broker dispatches feature extraction tasks and indexing tasks to the cloud computing infrastructure.
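The map and reduce steps behind such an index can be sketched in plain Python. This is a single-process illustration under assumptions, not Twinder's actual job code: the function names `map_tweet`, `reduce_postings` and `build_index`, the whitespace tokenizer, and the sample tweets are all made up for the example.

```python
from collections import defaultdict

def map_tweet(tweet_id, text):
    """Map step: emit one (term, tweet_id) pair per distinct term."""
    return [(term, tweet_id) for term in set(text.lower().split())]

def reduce_postings(term, tweet_ids):
    """Reduce step: merge all tweet ids of a term into a sorted posting list."""
    return term, sorted(tweet_ids)

def build_index(tweets):
    """Run map, shuffle (group by term) and reduce over a dict of tweets."""
    grouped = defaultdict(list)
    for tweet_id, text in tweets.items():
        for term, tid in map_tweet(tweet_id, text):
            grouped[term].append(tid)
    return dict(reduce_postings(term, ids) for term, ids in grouped.items())

# Hypothetical sample tweets.
tweets = {1: "Hu Jintao visits the US", 2: "US economy news"}
index = build_index(tweets)
```

In a real MapReduce deployment the grouping step is performed by the framework's shuffle phase, and the map and reduce functions run on separate workers.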
Core Components of Twinder
Relevance Estimation
• Accepts search queries from the front-end and passes them to the Feature Extraction component.
• The Relevance Estimation component classifies tweets as relevant or non-relevant and delivers them to the front-end for rendering.
• Twinder can learn the classification model, initially from a training dataset, then from usage data.
Efficiency of Indexing
How well does Twinder make use of cloud computing infrastructure?
Corpus size         Mainstream server    EMR (10 instances)
100k (13 MB)        0.4 min              5 min
1m (122 MB)         5 min                8 min
10m (1.3 GB)        48 min               19 min
32m (3.9 GB)        283 min              47 min
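To read the table as speedups, the server time can be divided by the EMR time for each corpus size. The snippet below is only arithmetic on the numbers in the table; the names `times` and `speedup` are chosen for the example.

```python
# Indexing times in minutes, copied from the table: (server, EMR).
times = {
    "100k": (0.4, 5),
    "1m": (5, 8),
    "10m": (48, 19),
    "32m": (283, 47),
}

def speedup(server_min, emr_min):
    """A ratio above 1 means the EMR cluster finished faster than the server."""
    return server_min / emr_min

speedups = {size: speedup(s, e) for size, (s, e) in times.items()}
for size, ratio in speedups.items():
    print(f"{size}: {ratio:.2f}x")
```

The ratio is below 1 for the two small corpora, where the cluster is slower than the single server, and rises to roughly 2.5 for 10m tweets and roughly 6 for 32m tweets.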
Features of Microposts
Topic-sensitive: keyword-based relevance
Topic-insensitive: ?
We already have keyword-based relevance, and…?
Hypothesis H1: The greater the keyword-based relevance score, the more relevant and interesting the tweet is to the topic.
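A keyword-based score can be illustrated with a very simple term-overlap measure. The slides do not say which retrieval model Twinder uses, so the overlap fraction below is an assumption made purely for illustration, as are the function name and the sample texts.

```python
def keyword_relevance(query, tweet):
    """Fraction of distinct query terms that also occur in the tweet."""
    query_terms = set(query.lower().split())
    tweet_terms = set(tweet.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & tweet_terms) / len(query_terms)

# Hypothetical query and tweet: two of the three query terms match.
score = keyword_relevance("Hu Jintao visit", "Hu Jintao arrives in Washington")
```

A production system would more likely use a ranking function with term weighting, but the score plays the same role as a single topic-sensitive feature.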
Semantic-based relevance
Expand the queries to match more tweets
dbp:Hu_Jintao
dbp:United_States
Reformulated query is expected to get a more accurate retrieval score.
Hypothesis H2: The greater the semantic-based relevance score, the more relevant and interesting the tweet is.
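One way to picture the query expansion is to add the label words of the recognised entities to the query terms. The sketch below assumes a tiny hand-written lookup table (`ENTITY_LOOKUP`, hypothetical) in place of a real entity extraction service, and the expansion strategy itself is an assumption rather than Twinder's documented behaviour.

```python
# Hypothetical lookup from surface forms to DBpedia-style entity ids.
ENTITY_LOOKUP = {
    "hu jintao": "dbp:Hu_Jintao",
    "us": "dbp:United_States",
}

def expand_query(query):
    """Return the query terms plus the label words of every recognised entity."""
    terms = query.lower().split()
    padded = " " + " ".join(terms) + " "
    expanded = list(terms)
    for surface, entity in ENTITY_LOOKUP.items():
        if " " + surface + " " in padded:  # match whole words only
            label = entity.split(":", 1)[1].replace("_", " ").lower()
            for word in label.split():
                if word not in expanded:
                    expanded.append(word)
    return expanded

expanded = expand_query("Hu Jintao visits US")
```

With this lookup the query gains the terms "united" and "states", so tweets that spell out the country name can also match.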
Semantic-based relatedness
Is there a semantic overlap between the query and the tweet?
dbp:Hu_Jintao
dbp:United_States
Hypothesis H3: If a tweet is considered to be semantically related to the query then it is also relevant and interesting for the user.
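Semantic relatedness can be read as a yes/no feature: do the entities found in the query and in the tweet overlap? The sketch below assumes the entities have already been extracted; the function names and the sample entity lists are made up for the example.

```python
def entity_overlap(query_entities, tweet_entities):
    """Return the entities shared by the query and the tweet."""
    return set(query_entities) & set(tweet_entities)

def is_semantically_related(query_entities, tweet_entities):
    """True if the query and the tweet share at least one entity."""
    return bool(entity_overlap(query_entities, tweet_entities))

query_entities = ["dbp:Hu_Jintao", "dbp:United_States"]
related = is_semantically_related(query_entities, ["dbp:United_States", "dbp:France"])
unrelated = is_semantically_related(query_entities, ["dbp:Lyon"])
```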
Overview of features
Topic-sensitive: keyword-based, semantic-based
Topic-insensitive: ?
What do we have now?
Syntactical feature: Hashtag
Is a tweet more relevant if it contains a #hashtag?
Hypothesis 4: tweets that contain hashtags are more likely to be relevant than tweets that do not contain hashtags.
Syntactical feature: hasURL
Is a tweet that contains a URL more relevant?
Hypothesis 5: tweets that contain a URL are more likely to be relevant than tweets that do not contain a URL.
Syntactical feature: isReply
Is a tweet which is a reply to @somebody more relevant?
Hypothesis 6: tweets that are formulated as a reply to another tweet are less likely to be relevant than other tweets.
Syntactical feature: length
Does the length of a tweet influence its relevance for a topic?
Hypothesis 7: the longer a tweet, the more likely it is to be relevant and interesting.
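The four syntactical features can be read directly off the tweet text. The sketch below is one possible extraction, assuming that a reply is a tweet starting with "@" and that length is counted in characters; neither detail is stated on the slides, and the regular expressions are simplifications.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
# A hashtag must start a token, so "#" inside a URL fragment is ignored.
HASHTAG_PATTERN = re.compile(r"(?<!\S)#\w+")

def syntactical_features(text):
    """Extract hasHashtag, hasURL, isReply and length from a tweet text."""
    return {
        "hasHashtag": bool(HASHTAG_PATTERN.search(text)),
        "hasURL": bool(URL_PATTERN.search(text)),
        "isReply": text.startswith("@"),  # assumption: replies start with @user
        "length": len(text),              # assumption: length in characters
    }

# Hypothetical example tweets.
reply = syntactical_features("@bob see http://example.org #www2012")
plain = syntactical_features("Plain text only")
```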
Overview of features
Topic-sensitive: keyword-based, semantic-based
Topic-insensitive: syntactical features, ?
Short summary
Are there further features that allow for estimating the relevance?
Semantic features
Find semantics in a tweet to estimate the relevance
dbp:Tim_Berners-Lee
dbp:World_Wide_Web
dbp:France
dbp:Lyon
dbp:International_World_Wide_Web_Conference
Semantic features: #entity
Is a tweet with more entities more interesting?
• 5 entities extracted.
Hypothesis 8: the more entities a tweet mentions, the more likely it is to be relevant and interesting.
Semantic features: diversity
How many different types of entities does the tweet mention?
• 4 types of entities
Hypothesis 9: the greater the diversity of concepts mentioned in a tweet, the more likely it is to be interesting and relevant.
Semantic features: sentiment
Was the author of the tweet happy or not?
• Sentiment: neutral
Hypothesis 10: the likelihood of a tweet’s relevance is influenced by its sentiment polarity.
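The three semantic features can be derived from an annotated tweet as sketched below. The entity ids come from the example tweet; the type labels attached to them are illustrative guesses chosen to reproduce the "5 entities, 4 types" counts, and the numeric polarity encoding is an assumption.

```python
# Entities of the example tweet; the type labels are illustrative guesses.
entities = [
    ("dbp:Tim_Berners-Lee", "Person"),
    ("dbp:World_Wide_Web", "Technology"),
    ("dbp:France", "Place"),
    ("dbp:Lyon", "Place"),
    ("dbp:International_World_Wide_Web_Conference", "Event"),
]

# Assumed numeric encoding of sentiment polarity.
POLARITY = {"negative": -1, "neutral": 0, "positive": 1}

def semantic_features(entities, sentiment):
    """Count entities, count distinct entity types, and encode the sentiment."""
    return {
        "#entities": len(entities),
        "diversity": len({entity_type for _, entity_type in entities}),
        "sentiment": POLARITY[sentiment],
    }

features = semantic_features(entities, "neutral")
```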
Overview of features
Topic-sensitive: keyword-based, semantic-based
Topic-insensitive: syntactical, semantics
By now, we have 4 types of features.
Can we utilize the contextual information of tweets?
Contextual features
Does the number of followers influence the relatedness?
Hypothesis 11: The higher the number of followers a creator of a message has, the more likely it is that her tweets are relevant.
Contextual features
Or the number of lists that the author appears in?
Hypothesis 12: The higher the number of lists in which the creator of a message appears, the more likely it is that her tweets are relevant.
Contextual features
How long has the author been on Twitter?
Hypothesis 13: The older the Twitter account of a user, the more likely it is that her tweets are relevant.
Signed up: July 2008
Post: June 2012
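The account age in this example can be computed from the two dates. The sketch assumes month granularity, since the slide only gives month and year, and the function name is chosen for the example. The other two contextual features, the number of followers and the number of lists, are plain counts taken from the author's profile.

```python
def account_age_months(signup_year, signup_month, post_year, post_month):
    """Age of the account, in whole months, at the time the tweet was posted."""
    return (post_year - signup_year) * 12 + (post_month - signup_month)

# Signed up July 2008, tweet posted June 2012.
age = account_age_months(2008, 7, 2012, 6)
print(age)  # 47
```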
Summary of Features
Topic-sensitive: keyword-based, semantic-based
Topic-insensitive: syntactical, semantics, contextual
The features
Analysis
• Research Questions:
  1. Which features are more influential in predicting the relatedness of a tweet to a certain topic?
  2. Which types of features are more important? Are semantics meaningful?
  3. What performance can we achieve by utilizing these features?
• Twinder Setup
  • Consider the search problem as a classification task
  • Classification algorithm = Logistic Regression
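Treating search as classification means that each (topic, tweet) pair becomes a feature vector with a relevant or non-relevant label. The sketch below trains a logistic regression with plain batch gradient descent on a tiny made-up dataset with two features. It illustrates the technique only: the data, the hyperparameters and the function names are assumptions, and the slides do not say which implementation Twinder uses.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, x):
    """Probability that x is relevant; weights[0] is the bias term."""
    return sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))

def train(rows, labels, epochs=5000, lr=0.5):
    """Fit the weights by batch gradient descent on the mean log loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            error = predict_proba(weights, x) - y
            grad[0] += error
            for i, xi in enumerate(x):
                grad[i + 1] += error * xi
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights

# Made-up training data: [keyword relevance, hasURL] -> relevant (1) or not (0).
rows = [[0.9, 1], [0.8, 1], [0.7, 0], [0.2, 0], [0.1, 1], [0.3, 0]]
labels = [1, 1, 1, 0, 0, 0]

weights = train(rows, labels)
predictions = [int(predict_proba(weights, x) >= 0.5) for x in rows]
```

The learned weights can be inspected per feature: a positive weight pushes a tweet towards the relevant class, a negative weight away from it.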
Dataset
• Twitter corpus
  • 16 million tweets (Jan. 24th, 2011 – Feb. 8th)
  • 4,766,901 tweets classified as English
  • 6.2 million entity extractions (140k distinct entities)
• Relevance judgments
  • 49 topics
  • 40,855 (topic, tweet) pairs
  • 60.31 relevant tweets per topic (on average)
From TREC 2011 Microblog Track
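The relevance judgments imply a strongly imbalanced classification problem, which the following arithmetic makes explicit. It uses only the three numbers listed above; the variable names are chosen for the example.

```python
topics = 49
judged_pairs = 40855
relevant_per_topic = 60.31  # average reported above

relevant_total = topics * relevant_per_topic
relevant_share = relevant_total / judged_pairs
print(f"{relevant_total:.0f} relevant pairs, {relevant_share:.1%} of all judged pairs")
```

Roughly 2,955 of the judged pairs are relevant, which is about 7% of the total.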
Results
Which type of features matters?
Features              Precision    Recall    F-measure
keyword relevance     0.3036       0.2851    0.2940
semantic relevance    0.3050       0.3294    0.3167
topic-sensitive       0.3135       0.3252    0.3192
topic-insensitive     0.1956       0.0064    0.0123
without semantics     0.3363       0.4618    0.3965
without sentiment     0.3701       0.3923    0.4048
without context       0.3827       0.4714    0.4225
all features          0.3674       0.4736    0.4138
Overall, applying all features yields a precision above 35% and a recall above 45%.
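The F-measure column combines precision and recall. Assuming it is the balanced F1 score, the harmonic mean of the two, which the slide does not state explicitly, the "all features" row can be recomputed as follows.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall of the "all features" row.
f_all = f_measure(0.3674, 0.4736)
print(round(f_all, 4))  # 0.4138
```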
Weights of features
Which feature matters?
[Figure: feature weights, one panel per feature type, y-axis from -1 to 2. Syntactical: hasHashtag, hasURL, isReply, length. Keyword-based: keyword-based relevance. Semantic-based: relevance, relatedness. Semantics: #entities, diversity, sentiment. Contextual: #followers, #lists, age.]
Topics of different categories
The impact on the performance and models
• 49 topics split into two classes along each of three dimensions:
  • Popularity
  • Global vs. local
  • Temporal persistence
• Popularity
  • Higher recall for popular topics
  • Less impact from sentiment features on unpopular topics
• Temporal persistence
  • Higher performance on shorter-term topics
  • Less impact from sentiment features on persistent topics
Conclusions
What are our contributions?
1. Twinder search engine proposed: analyzing various features to determine the relevance and interestingness of Twitter messages for a given topic.
2. Scalability demonstrated for the Twinder search engine.
3. Extensive analysis of 13 features along two dimensions: topic-sensitive features and topic-insensitive features.
Conclusions
The lessons learned
1. The learned models which take advantage of semantics and topic-sensitive features outperform those which do not take the semantics and topic-sensitive features into account.
2. Contextual features that characterize the users who are posting the messages have little impact on the relevance estimation.
3. The importance of a feature differs depending on the topic characteristics; for example, the sentiment-based features are more important for popular than for unpopular topics.
THANK YOU!
July 25th, 2012 [email protected] http://ktao.nl/
QUESTIONS?