Posted on 29-Dec-2015
Panagiotis Antonopoulos, Microsoft Corp
panant@microsoft.com
Ioannis Konstantinou, National Technical University of Athens
ikons@ece.ntua.gr
Dimitrios Tsoumakos, Ionian University
dtsouma@ionio.gr
Nectarios Koziris, National Technical University of Athens
nkoziris@ece.ntua.gr
Efficient Index Updates over the Cloud
Requirements in the Web
• Huge volume of data: > 1.8 zettabytes, growing by 80% each year
• Huge number of users: > 2 billion users searching and updating web content
• Explosion of User-Generated Content:
  – Facebook: 90 updates/user/month, 30 billion/day
  – Wikipedia: 30 updates/article/month, 8K new articles/day
• Users demand fresh results

2/26
Our contribution

A distributed system that allows fast and frequent updates on web-scale Inverted Indexes:
• Incremental processing of updates
• Distributed processing: MapReduce
• Distributed index storage and serving: NoSQL

3/26
Goals

• Update time independent of existing index size
  – Fast and frequent updates on large indexes
• Index consistency after an update
  – System stability and performance unaffected by updates
• Scalability
  – Exploit large commodity clusters

4/26
Inverted Index

• Maps each term included in a collection of documents to the documents that contain the term: (term, list(doc_ref))
• Popular for fast content search; used in search engines
• Index Record: (term, doc_ref)
• Example:

Term         List of documents
distributed  Doc2, Doc3, Doc7, Doc10
update       Doc2, Doc5, Doc12
Hadoop       Doc1, Doc2, Doc8

5/26
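The mapping above can be sketched in a few lines of Python. This is an illustrative in-memory model (function and variable names are my own), not the system's actual implementation:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}
```

Each resulting entry is exactly one (term, list(doc_ref)) record as defined above.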
Related Work

• Google: distributed index creation
  – Google Caffeine: fast and continuous index updates
• Apache Solr: distributed search through index replication
• Katta: distributed index creation and serving
• CSLAB: distributed index creation and serving
• LucidWorks: distributed index creation and updates on top of Solr (not open source)

6/26
Basic Update Procedure

• Input: collection of new/modified documents

• For each new document:
  – Simply add each term to the corresponding list

• For each modified document:
  – Delete all index records that refer to the old version
  – Add each term of the new version to the corresponding list

7/26
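A minimal in-memory sketch of this procedure (hypothetical names; the index is modeled as a dict mapping term to a set of docIDs). Note that the deletion step scans every posting list, which is exactly the inefficiency the next slides address:

```python
def basic_update(index, doc_id, new_terms):
    """Naive update for a new or modified document: scan every
    posting list to delete old records, then add the new terms."""
    for postings in index.values():              # cost grows with index size
        postings.discard(doc_id)
    for term in new_terms:
        index.setdefault(term, set()).add(doc_id)
```

For a brand-new document the deletion loop simply finds nothing to remove.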
Basic Update Procedure

For modified documents we need to:
• Obtain the indexed terms of the old version
• Locate and delete the corresponding index records
  – Complexity depends on the schema of the index

Update time critically depends on these operations!
How can we do it efficiently?

8/26
Proposed Schema

• HBase:
  – Stores and indexes millions of columns per row
  – Stores a varying number of columns for each row

• Proposed schema:
  – One row for every indexed term
  – One column for each document contained in the corresponding term's list
  – Use the document ID as the column name

9/26
Proposed Schema

Each cell (row, column) corresponds to an index record (term, docID).

• Advantages:
  – Fast record discovery and deletion, almost independent of the list size

• Disadvantages:
  – Required storage space (overhead per column)

10/26
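A toy in-memory model of this schema (not HBase itself; class and method names are my own) shows why deletion no longer depends on the list size: the (term, docID) cell is addressed directly rather than found by scanning a posting list:

```python
class WideRowIndex:
    """Toy model of the proposed schema: one row per term,
    one column per docID, so each cell is one index record."""
    def __init__(self):
        self.rows = {}                 # term -> {docID: cell value}

    def put(self, term, doc_id):
        self.rows.setdefault(term, {})[doc_id] = b""

    def delete(self, term, doc_id):
        # direct cell lookup: no scan over the term's document list
        self.rows.get(term, {}).pop(doc_id, None)
```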
Forward Index

• Forward Index: the list of terms of each document
• Example:

Document ID  Words
Doc1         data, management, in, the, cloud
Doc2         inverted, index, updates

• Advantages:
  – Immediate access to the terms of the old version
  – Retrieving the Forward Index is faster (smaller size)

• Disadvantages:
  – Required storage space
  – Small overhead to the indexing process

11/26
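With a Forward Index at hand, the deletion step from slide 7 no longer needs to scan the inverted index. A sketch under the same in-memory model as before (illustrative names):

```python
def update_with_forward_index(index, forward, doc_id, new_terms):
    """Fetch the old version's terms from the Forward Index,
    delete those records, then insert the new version's terms."""
    for term in forward.get(doc_id, []):         # old version, if any
        index.get(term, set()).discard(doc_id)
    for term in new_terms:
        index.setdefault(term, set()).add(doc_id)
    forward[doc_id] = sorted(set(new_terms))     # keep the FI up to date
```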
Minimizing Index Changes

General idea:
• Modifications to a document's content are usually limited
• Update the index based only on those content modifications

Procedure:
• Compare the two different versions of each document
• Delete the terms contained in the old version but not in the new
• Add the terms contained in the new version but not in the old

12/26
Minimizing Index Changes

No changes are required for the common terms.

Advantages:
• Minimizes the changes required to the index
  ‒ Fewer costly insertions and deletions in HBase
  ‒ Smaller volume of intermediate K/V pairs (in the distributed setting)

Disadvantages:
• Increased complexity of the indexing process

13/26
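The comparison step above amounts to two set differences; a minimal sketch with hypothetical names:

```python
def diff_terms(old_terms, new_terms):
    """Return (additions, deletions): the minimal index changes
    for one modified document. Common terms are left untouched."""
    old, new = set(old_terms), set(new_terms)
    return new - old, old - new
```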
Distributed Index Updates

• Better, but still centralized!

• Perfectly suited to the MapReduce logic:
  – Each document can be processed independently
  – The updates have to be merged before they are applied to the index

• Utilizing the MR model:
  – Easily distribute the processing
  – Exploit the resources of large commodity clusters

14/26
Distributed Index Updates

Mappers:
• Scan each modified document
• Retrieve the old Forward Index
• Compare the two versions
• Emit K/V pairs for additions: (term, docID)
• Emit K/V pairs for deletions: (term, docID)
• Emit K/V pairs for the Forward Index and Content

Combiners:
• Merge the K/V pairs into a list of values per key (only for additions and deletions)
• Emit a K/V pair for additions: (term, list(docID))
• Emit a K/V pair for deletions: (term, list(docID))

Reducers:
• For additions: create an index record for each (term, docID) pair and write the records to HFiles
• For deletions: delete the corresponding cells using the HBase Client API
• Bulk-load the output HFiles to HBase

Output tables:
• Content Table: the raw documents
• Forward Index Table: the Forward Index
• Inverted Index Table: the Inverted Index, using the schema described in the previous slides

15/26
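A single-process sketch of the map and shuffle/reduce phases (illustrative names only; the real pipeline runs as a Hadoop MapReduce job and writes HFiles, which an in-memory example cannot reproduce):

```python
from collections import defaultdict

def mapper(doc_id, old_terms, new_terms):
    """Emit ((op, term), docID) K/V pairs for one modified document."""
    old, new = set(old_terms), set(new_terms)
    for term in new - old:
        yield ("add", term), doc_id
    for term in old - new:
        yield ("del", term), doc_id

def run_job(modified_docs):
    """Simulated shuffle + reduce: collect list(docID) per (op, term)."""
    grouped = defaultdict(list)
    for doc_id, old, new in modified_docs:
        for key, doc in mapper(doc_id, old, new):
            grouped[key].append(doc)
    return dict(grouped)
```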
Even Load Distribution

Two different types of keys:

• Document ID:
  – One K/V pair for the Content and one for the Forward Index of each document
  – Divide the keys into equally sized partitions using a hash function

• Term:
  – Skewed (Zipfian) distribution in natural languages
  – The number of values per key-term varies significantly

16/26
Even Load Distribution

Solution: sampling the input

Mappers:
• Process a sample using the same algorithm
• Emit a K/V pair (term, 1) for each addition or deletion

Reducers (one for additions, one for deletions):
• Count the occurrences to determine the splitting points

Indexer:
• Loads the splitting points and chooses the reducer for each key

17/26
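The effect of the sampling step can be sketched as computing quantile boundaries over the sampled terms and routing each key-term by binary search (helper names are hypothetical, and the real counts come from the sampling reducers):

```python
import bisect

def splitting_points(sampled_terms, num_reducers):
    """Quantile boundaries so each partition covers roughly
    the same number of sampled pairs, despite the skew."""
    ordered = sorted(sampled_terms)
    step = len(ordered) / num_reducers
    return [ordered[int(i * step)] for i in range(1, num_reducers)]

def choose_reducer(term, splits):
    """Route a key-term to its partition via binary search."""
    return bisect.bisect_right(splits, term)
```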
Experimental Setup

Cluster:
• 2-12 worker nodes (default: 8)
• 8 cores @ 2 GHz, 8 GB RAM
• Hadoop v0.20.2-CDH3 (Cloudera)
• HBase v0.90.3-CDH3 (Cloudera)
• 6 mappers and 6 reducers per node

Datasets:
• Wikipedia snapshots of April 5, 2011 and May 26, 2011
• Default initial dataset: 64.2 GB, 23.7 million documents
• Default update dataset: 15.4 GB, 2.2 million documents

18/26
Experimental Results

Evaluating our design choices:
• Comparison: depends on the number of indexed terms
• Forward Index: important in both cases
• Bulk Loading: depends on the number of indexed terms
• Sampling: not important, small number of intermediate K/V pairs

[Chart: Index Update Completion time (min) for the Full-Text and Title-only indexes under the No Comparison, No Forw. Index, No Bulk, No Sampling, and Best configurations]

19/26
Experimental Results

Update time vs. update dataset size

For a fixed initial dataset size: 64.2 GB (≈24 million documents)

Update time is linear in the update dataset size.

[Chart: Index Update Completion time (min) vs. Update Size (GB), for the Full-Text and Title-only indexes]

20/26
Experimental Results

Update time vs. initial dataset size

For a fixed new/modified-documents dataset: 5.1 GB (≈400 thousand docs)

A 4X larger initial dataset increases update time by less than 6%: update time is roughly independent of the initial index size.

[Chart: Index Update Completion time (min) vs. Initial Indexed Document Size (GB), for the Full-Text and Title-only indexes]

21/26
Experimental Results

Update time vs. available resources (# of mappers/reducers)

For fixed initial/update dataset sizes: 64.2 GB / 15.4 GB

• 5X faster indexing from 2 to 12 nodes
• Bulk loading to HBase does NOT scale as expected
• 3.3X better performance in total

[Charts: Index Update Completion time (min) vs. total # of mappers/reducers used, for the Full-Text and Title-only indexes, showing both Total Time and Indexing time]

22/26
Conclusion

Incremental processing:
• Process updates, minimizing the required changes
• Update time:
  – Almost independent of the initial index size
  – Linear in the update dataset size

Distributed processing:
• Reduced update time
• Scalability

23/26
Conclusion

Fast and frequent updates on web-scale indexes:
• Wikipedia: >6X faster than an index rebuild

Disadvantages:
• Slower index creation (done only once)
• Increase in required storage space (low cost)

24/26
The End
Thank you!
25/26
Questions…