
TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

Date post: 24-Feb-2016
Page 1: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets

SIGMOD ’11

C. Chen et al.

Pete Bohman, Adam Kunk

Page 2: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

Introduction Related Work System Overview Indexing Scheme Ranking Evaluation Conclusion

Outline

Page 3: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

Requirements
◦ Contents searchable immediately following creation
◦ Scale to thousands of updates/sec (e.g., Osama bin Laden's death: 5,000 tweets/sec)
◦ Results relevant to query via cost-efficient ranking

Tradeoff
◦ Scalability and performance vs. ranking

Real-Time Search

Page 4: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

Applications
◦ The ability to receive updates as they occur

Applicability
◦ It may not be feasible to provide real-time search results in a system with thousands of new entries per second

Real-Time Search

Page 5: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

TI is an indexing and ranking mechanism for real-time search in microblogging systems, such as Twitter.

To return results in real time, TI indexes only some tweets immediately (distinguished tweets); the rest, deemed less important (noisy tweets), are indexed periodically.

TI: Tweet Index

Page 6: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

Introduction Related Work System Overview Indexing Scheme Ranking Evaluation Conclusion

Outline

Page 7: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

The Case for Partial Indexes
◦ Stonebraker, 1989
◦ Index only a portion of a column

User-specified index predicates (where salary > 500)
Build index as a side-effect of query processing

Partial Indexing
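
A minimal sketch of a user-specified index predicate, using SQLite's partial-index syntax through Python's sqlite3 module. The table name, column names, and rows are made up for illustration; only the predicate (salary > 500) comes from the slide. Building an index as a side effect of query processing is not shown.

```python
import sqlite3

def build_partial_index_demo():
    """Create a table, index only rows with salary > 500, and query them."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp (name TEXT, salary INTEGER)")  # hypothetical schema
    conn.executemany(
        "INSERT INTO emp VALUES (?, ?)",
        [("ann", 300), ("bob", 700), ("cy", 900)],
    )
    # Partial index: only rows satisfying the WHERE predicate get index entries.
    conn.execute(
        "CREATE INDEX emp_high_salary ON emp (salary) WHERE salary > 500"
    )
    rows = conn.execute(
        "SELECT name FROM emp WHERE salary > 500 ORDER BY salary"
    ).fetchall()
    conn.close()
    return [name for (name,) in rows]
```

Rows that fail the predicate stay in the table but cost nothing in the index, which is the same trade TI makes by indexing only some tweets immediately.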

Page 8: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

An application of materialized views is to use cost models to automatically select which views to materialize.
◦ Materialized views can be thought of as snapshots of a database, in which the results of a query are stored in an object.

The concept of only indexing essential tweets in real-time was borrowed from the idea of view materialization.

View Materialization

Page 9: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

Google and Twitter have both released real-time search engines.
◦ Google’s engine adaptively crawls the microblog
◦ Twitter’s engine relies on Apache’s Lucene (high-performance, full-featured text search engine library)

But, both the Google and Twitter engines only utilize time in their ranking algorithms.

TI’s ranking algorithm takes much more than just time into account.

Microblog Search

Page 10: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

TI clusters similar tweets together and offloads noisy tweets in order to reduce computation costs of real-time search.

Tweets are grouped into topics according to their relationships, using a tree structure.
◦ Tweets replying to the same tweet or belonging to the same thread are organized as a tree.

TI also maintains popular topics in memory.

TI Cost Reduction

Page 11: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

Introduction Related Work System Overview Indexing Scheme Ranking Evaluation Conclusion

Outline

Page 12: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

TI Architecture

Page 13: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

Twitter users have links to other friends

A User Graph is utilized to demonstrate this relationship

Gu = (U, E)
◦ U is the set of users in the system
◦ E is the friend links between them

User Graph

Page 14: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

Nodes represent tweets

Directed edges indicate replies or retweets

Implemented by assigning tweets a tree encoding ID

Tweet Tree Structure

Page 15: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

Search is handled via an inverted index for tweets

Given a keyword, the inverted index returns a tweet list, T
◦ T contains a set of tweets sorted by timestamp

TI Design

Page 16: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

TID = Tweet ID
U-PageRank = Used for ranking
TF = Term Frequency
tree = TID of root node of tweet tree
time = timestamp

TI Inverted Index
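
A minimal sketch of an inverted index whose postings carry the five fields listed above. The class and method names are hypothetical, and keeping each list in ascending timestamp order is an assumption; the slides only say the list is sorted by timestamp.

```python
import bisect
from collections import defaultdict, namedtuple

# One posting per (keyword, tweet). "time" comes first so that tuple
# comparison orders postings by timestamp, with TID as the tie-breaker.
Posting = namedtuple("Posting", ["time", "tid", "u_pagerank", "tf", "tree"])

class InvertedIndex:
    def __init__(self):
        self._lists = defaultdict(list)  # keyword -> time-sorted postings

    def add(self, tid, words, u_pagerank, tree, time):
        """Insert one posting for every distinct keyword of the tweet."""
        for word in set(words):
            tf = words.count(word)  # term frequency within this tweet
            posting = Posting(time, tid, u_pagerank, tf, tree)
            bisect.insort(self._lists[word], posting)

    def lookup(self, keyword):
        """Return the tweet list T for a keyword, oldest first."""
        return list(self._lists.get(keyword, []))
```

For example, after adding a tweet with timestamp 20 and then one with timestamp 10 that share a keyword, lookup returns the timestamp-10 posting first.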

Page 17: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

In order to help ranking, TI keeps a table of metadata for each tweet
◦ TID = tweet ID
◦ RID = ID of replied tweet (to find parent)
◦ tree = TID of root node of tweet tree
◦ time = timestamp
◦ count = number of tweets replying to this tweet

Ranking Support
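
A minimal sketch of the metadata table as an in-memory dictionary keyed by TID. The field names follow the slide; the TweetRow class and the ancestors helper are hypothetical and only illustrate how RID links a tweet to its parent.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class TweetRow:
    tid: int             # tweet ID
    rid: Optional[int]   # ID of the replied-to tweet, None for a root tweet
    tree: int            # TID of the root node of the tweet tree
    time: int            # timestamp
    count: int = 0       # number of tweets replying to this tweet

def ancestors(table: Dict[int, TweetRow], tid: int) -> List[int]:
    """Follow RID links upward and return the TIDs from parent to root."""
    path = []
    rid = table[tid].rid
    while rid is not None:
        path.append(rid)
        rid = table[rid].rid
    return path
```

Because every row also stores the root TID in tree, a ranking step can reach the topic directly without walking the parent chain.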

Page 18: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

Certain structures are kept in-memory to support indexing and ranking
◦ Keyword threshold – records statistics of recent popular queries
◦ Candidate topic list – information about recent topics
◦ Popular topic list – information about highly discussed topics

In-memory structures

Page 19: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

Introduction Related Work System Overview Indexing Scheme Ranking Evaluation Conclusion

Outline

Page 20: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

TI categorizes tweets as either being distinguished or noisy
◦ Distinguished: real-time indexing scheme
◦ Noisy: background batch indexing scheme

As a new tweet is entered, its content is analyzed in order to categorize the tweet as one of the above two types.

TI Indexing Overview

Page 21: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

TI Inverted Index

Page 22: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

New tweets categorized as being distinguished (index these immediately)
1. If tweet belongs to existing tweet tree, retrieve its parent tweet to get root ID and generate encoding. Update count number in parent.
2. Tweet is inserted into tweet data table.
3. Tweet is inserted into inverted index.

Main cost is updating the inverted index (due to each keyword in the tweet).

Real-Time Indexing
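
The three steps can be sketched as follows, assuming plain dictionaries stand in for the tweet data table and the inverted index. The function name and dictionary layout are hypothetical, and the tree-encoding step is left out because the slides do not describe the encoding.

```python
def index_distinguished(tweet_table, inverted_index, tid, rid, time, words):
    """Index one distinguished tweet immediately; return its root TID."""
    # Step 1: if the tweet replies to a known tweet, inherit the parent's
    # root ID and bump the parent's reply count. Otherwise it starts a tree.
    if rid is not None and rid in tweet_table:
        parent = tweet_table[rid]
        root = parent["tree"]
        parent["count"] += 1
    else:
        root = tid

    # Step 2: insert the tweet into the tweet data table.
    tweet_table[tid] = {"rid": rid, "tree": root, "time": time, "count": 0}

    # Step 3: insert the tweet into the inverted index, one posting-list
    # update per distinct keyword. This loop is the dominant cost.
    for word in set(words):
        inverted_index.setdefault(word, []).append(tid)
    return root
```

A tweet with many distinct keywords touches many posting lists, which is why deferring noisy tweets reduces the real-time load.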

Page 23: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

New tweets categorized as being noisy (index these at a later time)

Instead of indexing in inverted index, append tweet to log file.

Batch indexing process periodically scans the log file and indexes the tweets there.

Batch Indexing
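
A minimal sketch of the log-then-batch idea, assuming a JSON-lines log file and a dictionary as the inverted index. The file format and function names are illustrative choices, not details from the slides.

```python
import json
import os

def append_to_log(log_path, tid, words):
    """Defer a noisy tweet: append it to the log instead of indexing it."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps({"tid": tid, "words": words}) + "\n")

def run_batch_indexing(log_path, inverted_index):
    """Scan the log, index every logged tweet, and clear the log.

    Returns the number of tweets indexed in this batch.
    """
    if not os.path.exists(log_path):
        return 0
    with open(log_path, "r+", encoding="utf-8") as log:
        entries = [json.loads(line) for line in log if line.strip()]
        log.seek(0)
        log.truncate()  # the log is empty again for the next period
    for entry in entries:
        for word in set(entry["words"]):
            inverted_index.setdefault(word, []).append(entry["tid"])
    return len(entries)
```

Appending to a log is a single sequential write, so the expensive posting-list updates are paid later, when run_batch_indexing is called by a periodic job.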

Page 24: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

Introduction Related Work System Overview Indexing Scheme Ranking Evaluation Conclusion

Outline

Page 25: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

“The ranking function must consider both the timestamp of the data and the similarity between the data and the query.”

◦ “The ranking function is composed of two independent factors, time and similarity.”

“The ranking function should be cost-efficient.”

Ranking Desiderata

Page 26: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

Ranking functions are completely separate from the indexing mechanism
◦ New ranking functions could be used

TI’s proposed ranking function is based on:
◦ User’s PageRank
◦ Popularity of the topic
◦ Timestamp (self-explanatory)
◦ Similarity between tweet and the query

Ranking Overview

Page 27: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

Twitter has two types of links between users
◦ f(u): the set of users who follow user u
◦ f^-1(u): the set of users whom user u follows

A matrix, Mf[i][j], is used to record the following links between users

A weight factor is given for each user
◦ V = (w1, w2, ..., wn)

User’s PageRank

Page 28: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

PageRank formula is given as:

Pu = V × Mf^x

So, the user’s PageRank is a combination of their user weight and how many followers they have
◦ The more popular the user, the higher the PageRank

User’s PageRank Formula
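
A minimal sketch of the formula above, reading it as the weight vector V multiplied x times by the link matrix Mf. That reading, the row normalisation of Mf, and the absence of a damping factor are all assumptions; the slides do not give these details. Users are numbered 0..n-1 and the function name is hypothetical.

```python
def user_pagerank(follows, weights, iterations):
    """Compute V * Mf^x by repeated vector-matrix multiplication.

    follows:    dict mapping user i to the list of users that i follows
    weights:    initial weight vector V, one entry per user
    iterations: the exponent x
    """
    n = len(weights)
    # Assumed normalisation: each user splits one unit of weight evenly
    # among the users they follow, so Mf[i][j] = 1 / (number followed by i).
    m = [[0.0] * n for _ in range(n)]
    for i, followed in follows.items():
        for j in followed:
            m[i][j] = 1.0 / len(followed)

    p = list(weights)
    for _ in range(iterations):
        # Row vector times matrix: user j collects weight from followers i.
        p = [sum(p[i] * m[i][j] for i in range(n)) for j in range(n)]
    return p
```

With users 0 and 1 both following user 2, and user 2 following user 0, one iteration from equal weights gives user 2 the highest score, matching the point that more followers means a higher PageRank.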

Page 29: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

Users can retweet or reply to tweets.

Popularity can be determined by looking at the largest tweet trees.

Popularity of tree is equal to the sum of the U-PageRank values of all tweets in the tree

Popularity of Topics
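
A minimal sketch of the popularity sum, assuming each tweet record names its author and its tree root, and that "the U-PageRank of a tweet" means the PageRank of its author. The record layout and function name are hypothetical.

```python
from collections import defaultdict

def tree_popularity(tweets, u_pagerank):
    """Sum the authors' U-PageRank values over every tweet in each tree.

    tweets:     iterable of dicts with keys "user" and "tree" (root TID)
    u_pagerank: dict mapping user -> PageRank score
    """
    popularity = defaultdict(float)
    for tweet in tweets:
        popularity[tweet["tree"]] += u_pagerank[tweet["user"]]
    return dict(popularity)
```

A tree therefore gains popularity both by having many tweets and by attracting highly ranked users.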

Page 30: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

The similarity of a query and the tweet t can be computed as follows:

sim(q, t) = (q · t) / (|q| |t|)

Similarity between query and tweet
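
The formula above is the cosine similarity of two vectors. A minimal sketch, assuming q and t are represented as term-frequency vectors over their words (the slides do not say how the vectors are built):

```python
import math
from collections import Counter

def cosine_similarity(query_words, tweet_words):
    """Return (q . t) / (|q| |t|) for term-frequency vectors q and t."""
    q, t = Counter(query_words), Counter(tweet_words)
    dot = sum(q[w] * t[w] for w in q)  # Counter returns 0 for missing words
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in t.values())
    )
    return dot / norm if norm else 0.0  # empty input: define similarity as 0
```

Identical word lists score 1, and lists with no word in common score 0.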

Page 31: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

q.timestamp = query submittal time
tree.timestamp = timestamp of the tree that t belongs to (timestamp of root node)
w1, w2, w3 are weight factors for each component (all set to 1)

Ranking Function
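
The ranking formula itself did not survive in this transcript. The sketch below is one possible combination that uses only the symbols listed above: a weighted sum of the three components divided by the age of the tree. The exact form, the function name, and the guard against a zero age are assumptions, not the slide's formula.

```python
def rank_score(u_pagerank, similarity, popularity,
               query_time, tree_time, w1=1.0, w2=1.0, w3=1.0):
    """Hypothetical ranking score: weighted components over tree age.

    query_time is q.timestamp and tree_time is tree.timestamp. The weights
    default to 1, as on the slide.
    """
    # Assumed guard: treat a tree no older than the query as age 1 so the
    # division is always defined.
    age = max(query_time - tree_time, 1)
    return (w1 * u_pagerank + w2 * similarity + w3 * popularity) / age
```

Under this assumed form, two tweets with equal component scores are separated by recency: the one in the newer tree ranks higher.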

Page 32: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

Introduction Related Work System Overview Indexing Scheme Ranking Evaluation Conclusion

Outline

Page 33: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

Evaluation

Page 34: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

Introduction Related Work System Overview Indexing Scheme Ranking Evaluation Conclusion

Outline

Page 35: TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al

Conclusion

