TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets
SIGMOD ’11
C. Chen et al.
Pete Bohman, Adam Kunk
Introduction Related Work System Overview Indexing Scheme Ranking Evaluation Conclusion
Outline
Requirements
◦ Contents searchable immediately following creation
◦ Scale to thousands of updates/sec (e.g., OBL death: 5,000 tweets/sec)
◦ Results relevant to query via cost-efficient ranking
Tradeoff:
◦ Scalability and performance vs. ranking
Real-Time Search
Applications
◦ The ability to receive updates as they occur
Applicability
◦ It may not be feasible to provide real-time search results in a system with thousands of new entries per second
Real-Time Search
TI is an indexing and ranking mechanism for real-time search in microblogging systems, such as Twitter.
To return results in real time, TI indexes only some tweets immediately (distinguished tweets); the rest, deemed less important (noisy tweets), are handled periodically.
TI: Tweet Index
The Case for Partial Indexes
◦ Stonebraker, 1989
◦ Index only a portion of a column
◦ User-specified index predicates (e.g., where salary > 500)
◦ Build index as a side effect of query processing
Partial Indexing
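As a rough illustration of a user-specified index predicate, the sketch below builds a partial index with Python's standard `sqlite3` module. The table `emp`, its rows, and the index name `idx_high_salary` are hypothetical; only the predicate `salary > 500` comes from the slide. Building the index as a side effect of query processing is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?)",
    [("ann", 300), ("bob", 700), ("cy", 900)],  # hypothetical rows
)

# Partial index: only rows that satisfy the predicate receive an index entry,
# so the index stays small when most rows fail the predicate.
conn.execute("CREATE INDEX idx_high_salary ON emp(salary) WHERE salary > 500")

# A query that repeats the index predicate is a candidate for the partial index.
high_earners = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM emp WHERE salary > 500 ORDER BY salary"
    )
]
index_names = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
]
```

TI applies the same idea to tweets: only the portion of the stream judged important is indexed right away.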
One line of work on materialized views uses cost models to automatically select which views to materialize.
◦ Materialized views can be thought of as snapshots of a database, in which the results of a query are stored in an object.
The concept of only indexing essential tweets in real-time was borrowed from the idea of view materialization.
View Materialization
Google and Twitter have both released real-time search engines.
◦ Google’s engine adaptively crawls the microblog
◦ Twitter’s engine relies on Apache Lucene (a high-performance, full-featured text search engine library)
But, both the Google and Twitter engines only utilize time in their ranking algorithms.
TI’s ranking algorithm takes much more than just time into account.
Microblog Search
TI clusters similar tweets together and offloads noisy tweets in order to reduce computation costs of real-time search.
Tweets are grouped into topics according to their relationships, using a tree structure.
◦ Tweets replying to the same tweet or belonging to the same thread are organized as a tree.
TI also maintains popular topics in memory.
TI Cost Reduction
TI Architecture
Twitter users have links to other friends
A user graph is used to represent this relationship
$G_u = (U, E)$
◦ $U$ is the set of users in the system
◦ $E$ is the set of friend links between them
User Graph
Nodes represent tweets
Directed edges indicate replies or retweets
Implemented by assigning tweets a tree encoding ID
Tweet Tree Structure
Search is handled via an inverted index for tweets
Given a keyword, the inverted index returns a tweet list, T
◦ T contains the set of tweets sorted by timestamp
TI Design
◦ TID = tweet ID
◦ U-PageRank = used for ranking
◦ TF = term frequency
◦ tree = TID of root node of tweet tree
◦ time = timestamp
TI Inverted Index
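A minimal sketch of such an inverted index follows, assuming whitespace tokenization and newest-first ordering of the tweet list (the slides say only "sorted by timestamp"). The class and method names are illustrative, not taken from the paper.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    """One inverted-index entry, with the fields listed on the slide."""
    tid: int           # tweet ID
    u_pagerank: float  # author's PageRank, used for ranking
    tf: int            # term frequency of the keyword in this tweet
    tree: int          # TID of the root node of the tweet tree
    time: int          # timestamp


class InvertedIndex:
    def __init__(self):
        self._postings = defaultdict(list)  # keyword -> list of Posting

    def add(self, tid, text, u_pagerank, tree, time):
        # Assumption: lowercase whitespace tokenization; one posting per keyword.
        for keyword, tf in Counter(text.lower().split()).items():
            self._postings[keyword].append(Posting(tid, u_pagerank, tf, tree, time))

    def lookup(self, keyword):
        """Return the tweet list T for a keyword, newest first (assumed order)."""
        postings = self._postings.get(keyword.lower(), [])
        return sorted(postings, key=lambda p: p.time, reverse=True)


index = InvertedIndex()
index.add(tid=1, text="sigmod paper on tweets", u_pagerank=0.4, tree=1, time=100)
index.add(tid=2, text="tweets tweets everywhere", u_pagerank=0.1, tree=1, time=105)
tweet_list = index.lookup("tweets")
```

Each tweet contributes one posting per distinct keyword, so insertion cost grows with the number of keywords in the tweet.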
In order to help ranking, TI keeps a table of metadata for each tweet
◦ TID = tweet ID
◦ RID = ID of replied tweet (to find parent)
◦ tree = TID of root node of tweet tree
◦ time = timestamp
◦ count = number of tweets replying to this tweet
Ranking Support
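The sketch below models this metadata table as a Python dictionary and shows how the RID field lets a tweet's root be found by walking parent links. The record type `TweetMeta` and the helper `find_root` are hypothetical names; the field names follow the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TweetMeta:
    tid: int            # tweet ID
    rid: Optional[int]  # ID of the replied-to tweet, None for a root tweet
    tree: int           # TID of the root node of the tweet tree
    time: int           # timestamp
    count: int = 0      # number of tweets replying to this tweet


def find_root(table, tid):
    """Follow RID links upward until a tweet with no parent is reached."""
    current = table[tid]
    while current.rid is not None:
        current = table[current.rid]
    return current.tid


# Hypothetical thread: tweets 2 and 3 reply to 1, tweet 4 replies to 2.
table = {
    1: TweetMeta(tid=1, rid=None, tree=1, time=100, count=2),
    2: TweetMeta(tid=2, rid=1, tree=1, time=101, count=1),
    3: TweetMeta(tid=3, rid=1, tree=1, time=102),
    4: TweetMeta(tid=4, rid=2, tree=1, time=103),
}
root_of_4 = find_root(table, 4)
```

Storing the root TID in the `tree` field means ranking can read the root directly instead of walking the chain on every lookup.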
Certain structures are kept in memory to support indexing and ranking
◦ Keyword threshold – records statistics of recent popular queries
◦ Candidate topic list – information about recent topics
◦ Popular topic list – information about highly discussed topics
In-memory structures
TI categorizes tweets as either distinguished or noisy
◦ Distinguished: real-time indexing scheme
◦ Noisy: background batch indexing scheme
As a new tweet is entered, its content is analyzed in order to categorize the tweet as one of the above two types.
TI Indexing Overview
TI Inverted Index
New tweets categorized as being distinguished (index these immediately)
1. If tweet belongs to an existing tweet tree, retrieve its parent tweet to get the root ID and generate the encoding. Update the count in the parent.
2. Tweet is inserted into tweet data table.
3. Tweet is inserted into inverted index.
Main cost is updating the inverted index (due to each keyword in the tweet).
Real-Time Indexing
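The three steps can be sketched as follows, assuming tweets are plain dictionaries and that the root TID alone stands in for the tree encoding, which is not specified here. The function name `index_realtime` and the simplified posting lists (tweet IDs only) are illustrative.

```python
from collections import defaultdict


def index_realtime(tweet, data_table, inverted_index):
    """Index a distinguished tweet immediately.

    `tweet` is a dict with keys tid, rid (None if not a reply), text, time.
    """
    # Step 1: if the tweet joins an existing tree, read the parent to get the
    # root ID and bump the parent's reply count; otherwise it starts a new tree.
    parent = data_table.get(tweet["rid"]) if tweet["rid"] is not None else None
    if parent is not None:
        tree = parent["tree"]
        parent["count"] += 1
    else:
        tree = tweet["tid"]

    # Step 2: insert the tweet into the tweet data table.
    record = {**tweet, "tree": tree, "count": 0}
    data_table[tweet["tid"]] = record

    # Step 3: insert the tweet into the inverted index, one entry per keyword.
    for keyword in set(tweet["text"].lower().split()):
        inverted_index[keyword].append(tweet["tid"])
    return record


data_table = {}
inverted_index = defaultdict(list)
index_realtime({"tid": 1, "rid": None, "text": "TI indexes tweets", "time": 100},
               data_table, inverted_index)
index_realtime({"tid": 2, "rid": 1, "text": "tweets in real time", "time": 101},
               data_table, inverted_index)
```

Step 3 loops over every keyword, which is why updating the inverted index dominates the cost.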
New tweets categorized as being noisy (index these at a later time)
Instead of indexing in inverted index, append tweet to log file.
Batch indexing process periodically scans the log file and indexes the tweets there.
Batch Indexing
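A minimal sketch of the batch path, assuming an in-memory list stands in for the log file and that a caller triggers the periodic scan. `BatchIndexer` and its methods are hypothetical names.

```python
from collections import defaultdict


class BatchIndexer:
    def __init__(self):
        self.log = []  # stands in for the on-disk log file (assumption)
        self.inverted_index = defaultdict(list)

    def append(self, tid, text):
        """Noisy tweet: record it in the log and skip the inverted index for now."""
        self.log.append((tid, text))

    def flush(self):
        """Scan the log, index every pending tweet, and return how many were indexed."""
        pending, self.log = self.log, []
        for tid, text in pending:
            for keyword in set(text.lower().split()):
                self.inverted_index[keyword].append(tid)
        return len(pending)


indexer = BatchIndexer()
indexer.append(10, "having lunch")
searchable_before = "lunch" in indexer.inverted_index
indexed = indexer.flush()
searchable_after = "lunch" in indexer.inverted_index
```

Appending to the log is cheap compared with touching one posting list per keyword, so noisy tweets cost little at arrival time and become searchable only after the next flush.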
“The ranking function must consider both the timestamp of the data and the similarity between the data and the query.”
◦ “The ranking function is composed of two independent factors, time and similarity.”
“The ranking function should be cost-efficient.”
Ranking Desiderata
Ranking functions are completely separate from the indexing mechanism
◦ New ranking functions could be used
TI’s proposed ranking function is based on:
◦ User’s PageRank
◦ Popularity of the topic
◦ Timestamp (self-explanatory)
◦ Similarity between tweet and the query
Ranking Overview
Twitter has two types of links between users
◦ $f(u)$: the set of users who follow user $u$
◦ $f^{-1}(u)$: the set of users whom user $u$ follows
A matrix, $M_f[i][j]$, is used to record the following links between users
A weight factor is given for each user
◦ $V = (w_1, w_2, \ldots, w_n)$
User’s PageRank
PageRank formula is given as:
$P_u = V M_f^{x}$
So, the user’s PageRank is a combination of their user weight and how many followers they have
◦ The more popular the user, the higher the PageRank
User’s PageRank Formula
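A small sketch of this computation, under several assumptions not stated on the slide: the exponent is read as a number of iterations, the matrix entry for (i, j) is 1 divided by the number of users that user i follows when i follows j, and no damping factor is used. The function name `user_pagerank` is illustrative.

```python
def user_pagerank(weights, follow_matrix, iterations):
    """Compute V * M^x by repeated row-vector-times-matrix multiplication."""
    scores = list(weights)
    n = len(scores)
    for _ in range(iterations):
        # New score of user j = weight flowing in from every user i who follows j.
        scores = [
            sum(scores[i] * follow_matrix[i][j] for i in range(n))
            for j in range(n)
        ]
    return scores


# Hypothetical graph: users 0 and 1 follow user 2; user 2 follows user 0.
follow_matrix = [
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]
weights = [1 / 3, 1 / 3, 1 / 3]
scores = user_pagerank(weights, follow_matrix, iterations=1)
```

After one step, user 2 (two followers) holds the largest score and user 1 (no followers) holds none, matching the intuition that popular users rank higher.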
Users can retweet or reply to tweets.
Popularity can be determined by looking at the largest tweet trees.
The popularity of a tree equals the sum of the U-PageRank values of all tweets in the tree.
Popularity of Topics
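The summation can be sketched in a few lines, assuming each tweet is available as a pair of its tree's root TID and its author's U-PageRank. The function name and the sample values are hypothetical.

```python
from collections import defaultdict


def tree_popularity(tweets):
    """Sum U-PageRank per tree; `tweets` yields (tree_root_tid, u_pagerank) pairs."""
    popularity = defaultdict(float)
    for tree, u_pagerank in tweets:
        popularity[tree] += u_pagerank
    return dict(popularity)


# Hypothetical data: three tweets in tree 1, one tweet in tree 7.
tweets = [(1, 0.5), (1, 0.25), (7, 0.125), (1, 0.125)]
popularity = tree_popularity(tweets)
```

A tree gains popularity both by growing larger and by attracting tweets from highly ranked users.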
The similarity of a query and the tweet t can be computed as follows:
$\mathrm{sim}(q, t) = \dfrac{q \cdot t}{|q|\,|t|}$
Similarity between query and tweet
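This is the cosine similarity between the two term vectors. A minimal sketch, assuming raw term counts over whitespace-separated lowercase tokens as the vector representation (the slides do not say how the vectors are built):

```python
import math
from collections import Counter


def cosine_similarity(query, tweet):
    """Cosine of the angle between the term-count vectors of two texts."""
    q = Counter(query.lower().split())
    t = Counter(tweet.lower().split())
    dot = sum(q[term] * t[term] for term in q)  # missing terms count as 0
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in t.values())
    )
    return dot / norm if norm else 0.0


same = cosine_similarity("real time search", "real time search")
disjoint = cosine_similarity("real time", "tweet index")
```

Identical texts score close to 1, and texts that share no terms score 0.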
q.timestamp = query submittal time
tree.timestamp = timestamp of the tree that t belongs to (timestamp of root node)
$w_1$, $w_2$, $w_3$ are weight factors for each component (all set to 1)
Ranking Function
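The formula itself is not reproduced in the text of this slide. The sketch below therefore assumes one plausible form: a weighted sum of U-PageRank, similarity, and tree popularity, divided by the age of the tree at query time. Treat the combination and the function name `rank_score` as assumptions, not as the authors' exact definition.

```python
def rank_score(u_pagerank, similarity, popularity, query_time, tree_time,
               w1=1.0, w2=1.0, w3=1.0):
    """Assumed form: (w1*U-PageRank + w2*sim + w3*popularity) / tree age."""
    age = query_time - tree_time  # q.timestamp - tree.timestamp
    if age <= 0:
        raise ValueError("query must be submitted after the tree's root tweet")
    return (w1 * u_pagerank + w2 * similarity + w3 * popularity) / age


# Same tweet scored by a query 10 time units and 5 time units after the root tweet.
older = rank_score(0.5, 0.25, 0.25, query_time=110, tree_time=100)
fresher = rank_score(0.5, 0.25, 0.25, query_time=105, tree_time=100)
```

Under this assumed form, a tweet in a more recent tree outranks an otherwise identical tweet in an older one, which is how the timestamp enters the ranking.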
Evaluation
Conclusion