Posted on 29-Dec-2015
Panagiotis Antonopoulos, Microsoft Corp
panant@microsoft.com
Ioannis Konstantinou, National Technical University of Athens
ikons@ece.ntua.gr
Dimitrios Tsoumakos, Ionian University
dtsouma@ionio.gr
Nectarios Koziris, National Technical University of Athens
nkoziris@ece.ntua.gr
Efficient Index Updates over the Cloud
Requirements in the Web
• Huge volume of data: > 1.8 zettabytes, growing by 80% each year
• Huge number of users: > 2 billion users searching and updating web content
• Explosion of User-Generated Content:
  – Facebook: 90 updates/user/month, 30 billion/day
  – Wikipedia: 30 updates/article/month, 8K new articles/day
• Users demand fresh results

2/26
Our contribution

A distributed system that allows fast and frequent updates on web-scale Inverted Indexes:
• Incremental processing of updates
• Distributed processing: MapReduce
• Distributed index storage and serving: NoSQL

3/26
Goals

• Update time independent of existing index size
  – Fast and frequent updates on large indexes
• Index consistency after an update
  – System stability and performance unaffected by updates
• Scalability
  – Exploit large commodity clusters

4/26
Inverted Index

• Maps each term included in a collection of documents to the documents that contain the term: (term, list(doc_ref))
• Popular for fast content search; used in search engines
• Index Record: (term, doc_ref)
• Example:

Term         List of documents
distributed  Doc2, Doc3, Doc7, Doc10
update       Doc2, Doc5, Doc12
Hadoop       Doc1, Doc2, Doc8

5/26
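The mapping above can be sketched in a few lines of Python. This is an illustrative in-memory model (function and variable names are my own), not the system's actual implementation:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}
```

Each resulting entry is exactly one (term, list(doc_ref)) record as defined above.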
Related Work

• Google: distributed index creation
  – Google Caffeine: fast and continuous index updates
• Apache Solr: distributed search through index replication
• Katta: distributed index creation and serving
• CSLAB: distributed index creation and serving
• LucidWorks: distributed index creation and updates on top of Solr (not open source)

6/26
Basic Update Procedure

• Input: collection of new/modified documents

• For each new document:
  – Simply add each term to the corresponding list

• For each modified document:
  – Delete all index records that refer to the old version
  – Add each term of the new version to the corresponding list

7/26
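A minimal in-memory sketch of this procedure (hypothetical names; the index is modeled as a dict mapping term to a set of docIDs). Note that the deletion step scans every posting list, which is exactly the inefficiency the next slides address:

```python
def basic_update(index, doc_id, new_terms):
    """Naive update for a new or modified document: scan every
    posting list to delete old records, then add the new terms."""
    for postings in index.values():              # cost grows with index size
        postings.discard(doc_id)
    for term in new_terms:
        index.setdefault(term, set()).add(doc_id)
```

For a brand-new document the deletion loop simply finds nothing to remove.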
Basic Update Procedure

For modified documents we need to:
• Obtain the indexed terms of the old version
• Locate and delete the corresponding index records
  – Complexity depends on the schema of the index

Update time critically depends on these operations!
How can we do it efficiently?

8/26
Proposed Schema

• HBase:
  – Stores and indexes millions of columns per row
  – Stores a varying number of columns for each row

• Proposed schema:
  – One row for every indexed term
  – One column for each document contained in the corresponding term's list
  – Use the document ID as the column name

9/26
Proposed Schema

Each cell (row, column) corresponds to an index record (term, docID).

• Advantages:
  – Fast record discovery and deletion, almost independent of the list size

• Disadvantages:
  – Required storage space (overhead per column)

10/26
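A toy in-memory model of this schema (not HBase itself; class and method names are my own) shows why deletion no longer depends on the list size: the (term, docID) cell is addressed directly rather than found by scanning a posting list:

```python
class WideRowIndex:
    """Toy model of the proposed schema: one row per term,
    one column per docID, so each cell is one index record."""
    def __init__(self):
        self.rows = {}                 # term -> {docID: cell value}

    def put(self, term, doc_id):
        self.rows.setdefault(term, {})[doc_id] = b""

    def delete(self, term, doc_id):
        # direct cell lookup: no scan over the term's document list
        self.rows.get(term, {}).pop(doc_id, None)
```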
Forward Index

• Forward Index: the list of terms of each document
• Example:

Document ID  Words
Doc1         data, management, in, the, cloud
Doc2         inverted, index, updates

• Advantages:
  – Immediate access to the terms of the old version
  – Retrieving the Forward Index is faster (smaller size)

• Disadvantages:
  – Required storage space
  – Small overhead to the indexing process

11/26
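With a Forward Index at hand, the deletion step from slide 7 no longer needs to scan the inverted index. A sketch under the same in-memory model as before (illustrative names):

```python
def update_with_forward_index(index, forward, doc_id, new_terms):
    """Fetch the old version's terms from the Forward Index,
    delete those records, then insert the new version's terms."""
    for term in forward.get(doc_id, []):         # old version, if any
        index.get(term, set()).discard(doc_id)
    for term in new_terms:
        index.setdefault(term, set()).add(doc_id)
    forward[doc_id] = sorted(set(new_terms))     # keep the FI up to date
```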
Minimizing Index Changes

General idea:
• Modifications to a document's content are usually limited
• Update the index based only on those content modifications

Procedure:
• Compare the two different versions of each document
• Delete the terms contained in the old version but not in the new
• Add the terms contained in the new version but not in the old

12/26
Minimizing Index Changes

No changes are required for the common terms.

Advantages:
• Minimizes the changes required to the index
  ‒ Fewer costly insertions and deletions in HBase
  ‒ Smaller volume of intermediate K/V pairs (in the distributed setting)

Disadvantages:
• Increased complexity of the indexing process

13/26
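The comparison step above amounts to two set differences; a minimal sketch with hypothetical names:

```python
def diff_terms(old_terms, new_terms):
    """Return (additions, deletions): the minimal index changes
    for one modified document. Common terms are left untouched."""
    old, new = set(old_terms), set(new_terms)
    return new - old, old - new
```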
Distributed Index Updates

• Better, but still centralized!

• Perfectly suited to the MapReduce logic:
  – Each document can be processed independently
  – The updates have to be merged before they are applied to the index

• Utilizing the MR model:
  – Easily distribute the processing
  – Exploit the resources of large commodity clusters

14/26
Distributed Index Updates

Mappers:
• Scan each modified document
• Retrieve the old Forward Index
• Compare the two versions
• Emit K/V pairs for additions: (term, docID)
• Emit K/V pairs for deletions: (term, docID)
• Emit K/V pairs for the Forward Index and Content

Combiners:
• Merge the K/V pairs into a list of values per key (only for additions and deletions)
• Emit a K/V pair for additions: (term, list(docID))
• Emit a K/V pair for deletions: (term, list(docID))

Reducers:
• For additions: create an index record for each (term, docID) pair and write the records to HFiles
• For deletions: delete the corresponding cells using the HBase Client API
• Bulk-load the output HFiles to HBase

Output tables:
• Content Table: the raw documents
• Forward Index Table: the Forward Index
• Inverted Index Table: the Inverted Index, using the schema described in the previous slides

15/26
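A single-process sketch of the map and shuffle/reduce phases (illustrative names only; the real pipeline runs as a Hadoop MapReduce job and writes HFiles, which an in-memory example cannot reproduce):

```python
from collections import defaultdict

def mapper(doc_id, old_terms, new_terms):
    """Emit ((op, term), docID) K/V pairs for one modified document."""
    old, new = set(old_terms), set(new_terms)
    for term in new - old:
        yield ("add", term), doc_id
    for term in old - new:
        yield ("del", term), doc_id

def run_job(modified_docs):
    """Simulated shuffle + reduce: collect list(docID) per (op, term)."""
    grouped = defaultdict(list)
    for doc_id, old, new in modified_docs:
        for key, doc in mapper(doc_id, old, new):
            grouped[key].append(doc)
    return dict(grouped)
```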
Even Load Distribution

Two different types of keys:

• Document ID:
  – One K/V pair for the Content and one for the Forward Index of each document
  – Divide the keys into equally sized partitions using a hash function

• Term:
  – Skewed (Zipfian) distribution in natural languages
  – The number of values per key-term varies significantly

16/26
Even Load Distribution

Solution: sampling the input

Mappers:
• Process a sample using the same algorithm
• Emit a K/V pair (term, 1) for each addition or deletion

Reducers (one for additions, one for deletions):
• Count the occurrences to determine the splitting points

Indexer:
• Loads the splitting points and chooses the reducer for each key

17/26
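The effect of the sampling step can be sketched as computing quantile boundaries over the sampled terms and routing each key-term by binary search (helper names are hypothetical, and the real counts come from the sampling reducers):

```python
import bisect

def splitting_points(sampled_terms, num_reducers):
    """Quantile boundaries so each partition covers roughly
    the same number of sampled pairs, despite the skew."""
    ordered = sorted(sampled_terms)
    step = len(ordered) / num_reducers
    return [ordered[int(i * step)] for i in range(1, num_reducers)]

def choose_reducer(term, splits):
    """Route a key-term to its partition via binary search."""
    return bisect.bisect_right(splits, term)
```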
Experimental Setup

Cluster:
• 2-12 worker nodes (default: 8)
• 8 cores @ 2 GHz, 8 GB RAM
• Hadoop v0.20.2-CDH3 (Cloudera)
• HBase v0.90.3-CDH3 (Cloudera)
• 6 mappers and 6 reducers per node

Datasets:
• Wikipedia snapshots of April 5, 2011 and May 26, 2011
• Default initial dataset: 64.2 GB, 23.7 million documents
• Default update dataset: 15.4 GB, 2.2 million documents

18/26
Experimental Results

Evaluating our design choices:
• Comparison: depends on the number of indexed terms
• Forward Index: important in both cases
• Bulk Loading: depends on the number of indexed terms
• Sampling: not important, small number of intermediate K/V pairs

[Chart: Index Update Completion time (min) for the Full-Text and Title-only indexes under the No Comparison, No Forw. Index, No Bulk, No Sampling, and Best configurations]

19/26
Experimental Results

Update time vs. update dataset size

For a fixed initial dataset size: 64.2 GB (≈24 million documents)

Update time is linear in the update dataset size.

[Chart: Index Update Completion time (min) vs. Update Size (GB), for the Full-Text and Title-only indexes]

20/26
Experimental Results

Update time vs. initial dataset size

For a fixed new/modified-documents dataset: 5.1 GB (≈400 thousand docs)

A 4X larger initial dataset increases update time by less than 6%: update time is roughly independent of the initial index size.

[Chart: Index Update Completion time (min) vs. Initial Indexed Document Size (GB), for the Full-Text and Title-only indexes]

21/26
Experimental Results

Update time vs. available resources (# of mappers/reducers)

For fixed initial/update dataset sizes: 64.2 GB / 15.4 GB

• 5X faster indexing from 2 to 12 nodes
• Bulk loading to HBase does NOT scale as expected
• 3.3X better performance in total

[Charts: Index Update Completion time (min) vs. total # of mappers/reducers used, for the Full-Text and Title-only indexes, showing both Total Time and Indexing time]

22/26
Conclusion

Incremental processing:
• Process updates, minimizing the required changes
• Update time:
  – Almost independent of the initial index size
  – Linear in the update dataset size

Distributed processing:
• Reduced update time
• Scalability

23/26
Conclusion

Fast and frequent updates on web-scale indexes:
• Wikipedia: >6X faster than an index rebuild

Disadvantages:
• Slower index creation (done only once)
• Increase in required storage space (low cost)

24/26
The End
Thank you!
25/26
Questions…