
Tutorial 3 (b tree min heap)

Date post: 17-Jun-2015
Upload: kira
Description:
Part of the Search Engine course given in the Technion (2011)
Transcript
Page 1: Tutorial 3 (b tree min heap)

B-Tree Lexicon, Min-Heaps

Kira Radinsky

Min-Heap slides are courtesy of Aya Soffer and David Carmel,

IBM Haifa Research Lab

Page 2: Tutorial 3 (b tree min heap)

2 November 2010 236621 Search Engine Technology 2

The Lexicon as a B-Tree

• B-Tree: a balanced tree that is optimized for disk I/O, holding key/value pairs

• Branching is defined by a min-degree parameter t, t > 1

– t is chosen according to the size of a disk block

• Any internal node other than the root has at least t and at most 2t children; the root has either no children, or at least two and at most 2t children

• Any internal node with k children also stores k-1 keys which serve as separator values: separator j is larger than the keys of subtree j and smaller than the keys of subtree j+1

• Leaf nodes, like all nodes, store at most 2t-1 key/value pairs

– When not the root, they store at least t-1 key/value pairs

• Lookup, insertion and deletion operations on a B-Tree access a number of nodes that is linear in its height, and the height is logarithmic (base t) in the number of keys
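
The height claim can be made concrete with the standard textbook bound, which the slide itself does not spell out. The sketch below assumes n ≥ 1 keys, min-degree t ≥ 2, and a height h counted with the root at depth 0; the symbols n and h are introduced here for illustration.

```latex
% Fewest keys a B-Tree of height h can hold: the root has at least 1 key,
% and there are at least 2 t^{i-1} nodes at depth i, each with at least t-1 keys.
\begin{align*}
n &\ge 1 + (t-1)\sum_{i=1}^{h} 2\,t^{\,i-1}
   = 1 + 2(t-1)\,\frac{t^{h}-1}{t-1}
   = 2t^{h} - 1 \\
\Rightarrow\quad h &\le \log_t \frac{n+1}{2}
\end{align*}
```

For example, with t = 100 and n = 10^6 keys the bound gives h ≤ log_100(500000.5) ≈ 2.85, so the height is at most 2.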

Page 3: Tutorial 3 (b tree min heap)


B-Tree Lexicon - Example

• t=2

• Each key is associated with a value that contains a DF and a pointer to the postings list (dashed line)

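[Figure: a B-Tree with t=2. Root node keys: gets, more. Leaf nodes, left to right: (and, as, bad), (good, is, it), (the, ugly). The number drawn under each key, presumably its DF: gets 1, more 2; and 3, as 1, bad 2; good 2, is 1, it 2; the 1, ugly 2.]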

Page 4: Tutorial 3 (b tree min heap)


B-Tree Lookup

Looking up the value associated with key x:

1. current_node ← root

2. Let k1<k2<…<km be the keys of current_node

3. if x ∈ {k1,k2,…,km}, we’re done: return the associated value

4. else, if current_node is a leaf node, return null

5. else, let j be the smallest index s.t. x<kj (j = m+1 if x>km);

– current_node ← j’th subtree of current_node, and goto 2
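
The lookup steps above can be sketched in Python. This is a minimal in-memory illustration, not the course's implementation: the `BTreeNode` class and `btree_lookup` function are hypothetical names, and the example tree reuses the keys of the t=2 example with the numbers from its figure taken to be DF values (an assumption; postings pointers are omitted).

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class BTreeNode:
    """Hypothetical in-memory node; a real lexicon would read nodes from disk blocks."""
    keys: List[str]                     # sorted keys k1 < k2 < ... < km
    values: List[Any]                   # values[i] is the value stored with keys[i]
    children: List["BTreeNode"] = field(default_factory=list)  # empty for a leaf


def btree_lookup(root: BTreeNode, x: str) -> Optional[Any]:
    """Return the value associated with key x, or None if x is absent."""
    node = root
    while True:
        # 0-based position of the first key that is >= x
        j = bisect_left(node.keys, x)
        if j < len(node.keys) and node.keys[j] == x:
            return node.values[j]       # step 3: x is one of the node's keys
        if not node.children:
            return None                 # step 4: reached a leaf without finding x
        node = node.children[j]         # step 5: descend into the j'th subtree


# Example tree (t=2); values are the numbers from the figure, assumed to be DFs.
leaves = [
    BTreeNode(["and", "as", "bad"], [3, 1, 2]),
    BTreeNode(["good", "is", "it"], [2, 1, 2]),
    BTreeNode(["the", "ugly"], [1, 2]),
]
root = BTreeNode(["gets", "more"], [1, 2], leaves)
```

Using binary search inside a node is a design choice of this sketch; the slide only requires finding the smallest j with x<kj, which a linear scan would also do.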

Page 5: Tutorial 3 (b tree min heap)


Top-r Document Selection

Problem definition: Given a set A of scored documents, select the r documents with the highest scores in A and return them in decreasing relevance order

• Naïve method: sort the set A by score

– If |A|=M, time complexity is O(M log M)

• Better approach: since typically r<<M, selecting the r top scores can be done in O(M+r log M) time using a heap:

1. Heapify the set of M scores (about 2M comparisons) so that the top score is at the root

2. Repeatedly extract the heap’s root (r times), each time fixing the heap in O(logM)
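
The two steps above can be sketched with Python's standard `heapq` module. Since `heapq` implements a min-heap, this sketch negates the scores to get max-heap behavior; the function name `top_r_by_heapify` and the `(score, doc_id)` tuple layout are assumptions made for illustration.

```python
import heapq
from typing import List, Tuple


def top_r_by_heapify(scored: List[Tuple[float, str]], r: int) -> List[Tuple[float, str]]:
    """Return the r highest-scoring (score, doc_id) pairs in decreasing score order."""
    # Negate scores so that the largest score sits at the root of heapq's min-heap.
    heap = [(-score, doc_id) for score, doc_id in scored]
    heapq.heapify(heap)                          # step 1: O(M)
    result = []
    for _ in range(min(r, len(heap))):
        neg_score, doc_id = heapq.heappop(heap)  # step 2: O(log M) per extraction
        result.append((-neg_score, doc_id))
    return result
```

For instance, `top_r_by_heapify([(0.2, "d1"), (0.9, "d2"), (0.5, "d3"), (0.7, "d4")], 2)` returns `[(0.9, "d2"), (0.7, "d4")]`.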

Page 6: Tutorial 3 (b tree min heap)


The Heap Data Structure - Reminder

• A binary heap is a (mostly full) binary tree with values stored at all leaves and internal nodes, and an ordering rule that requires values to be non-decreasing (alternatively, non-increasing) along each path from a leaf to the root

– Largest/smallest value is at the root

• Heap implemented in an Array:

– Root at index 1

– For node at index i, left child is at index 2i and right child at index 2i+1

– Thus the parent of the node at index i is at index ⌊i/2⌋ (integer division)
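
The index arithmetic can be checked with a few lines of Python. This sketch keeps the 1-based convention by leaving slot 0 of the list unused; the helper names are hypothetical, and the sample array is the one from the "Binary Heap Stored in an Array" slide.

```python
from typing import List, Optional


def parent(i: int) -> int:
    return i // 2          # floor(i / 2)


def left(i: int) -> int:
    return 2 * i


def right(i: int) -> int:
    return 2 * i + 1


def is_max_heap(a: List[Optional[int]]) -> bool:
    """Check the max-heap ordering rule on a 1-based array (a[0] is unused)."""
    n = len(a) - 1
    # Every non-root node must be no larger than its parent.
    return all(a[parent(i)] >= a[i] for i in range(2, n + 1))


heap = [None, 23, 17, 15, 17, 8, 2, 13, 4, 14, 5]   # indices 1..10
```

Here `is_max_heap(heap)` holds, and the node at index 4 (value 17) has its children at indices 8 and 9 (values 4 and 14).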

Page 7: Tutorial 3 (b tree min heap)


Binary Heap Stored in an Array

[Figure: the binary max-heap drawn as a tree, with 23 at the root; its array layout follows.]

Value: 23 17 15 17  8  2 13  4 14  5

Index:  1  2  3  4  5  6  7  8  9 10

Page 8: Tutorial 3 (b tree min heap)


Extracting the Top Element

• Remove the largest item r times

• Each time:

– Remove the largest item (the root of the heap)

– Replace it with the last element of the heap

– Sift the new root down until order is restored

• Example

– Remove item 23 from the root

– The last item in the array, 5 (at location 10), replaces it

– Reinstate heap order: in the worst case 5 is sifted all the way back down the tree; the number of sifts is bounded by log(size of heap)
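
A minimal Python sketch of this extraction, assuming a 1-based max-heap stored in a list whose slot 0 is unused. The names `sift_down` and `extract_max` are illustrative, and the sample data is the array from the "Binary Heap Stored in an Array" slide.

```python
from typing import List


def sift_down(a: List[int], i: int, n: int) -> None:
    """Restore max-heap order below index i in the 1-based heap a[1..n]."""
    while 2 * i <= n:                        # while node i has at least one child
        c = 2 * i                            # left child
        if c + 1 <= n and a[c + 1] > a[c]:
            c += 1                           # right child is the larger one
        if a[i] >= a[c]:
            break                            # order holds; stop sifting
        a[i], a[c] = a[c], a[i]              # swap with the larger child
        i = c


def extract_max(a: List[int]) -> int:
    """Remove and return the root of the 1-based max-heap a (a[0] is unused)."""
    if len(a) < 2:
        raise IndexError("extract_max from an empty heap")
    top = a[1]
    last = a.pop()                           # remove the last element of the heap
    if len(a) > 1:
        a[1] = last                          # it replaces the root
        sift_down(a, 1, len(a) - 1)
    return top


heap = [None, 23, 17, 15, 17, 8, 2, 13, 4, 14, 5]
```

Calling `extract_max(heap)` returns 23; 5 is then swapped with 17 (index 2), 17 (index 4) and 14 (index 9), leaving the values 17 17 15 14 8 2 13 4 5 at indices 1 to 9.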

Page 9: Tutorial 3 (b tree min heap)


Heap Example (cont.)

To restore order at the top level of the tree, 5 must be swapped with 17, the larger of the root's two children.

This limits the order violation to the left sub-tree.

[Figure: the heap drawn as a tree, with 5 at the root in place of 23.]

The process is repeated until heap order is restored

Page 10: Tutorial 3 (b tree min heap)


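Heap Example (cont.)

[Figure: four snapshots of the heap as 5 is sifted down: with 5 at the root; after 5 is swapped with 17 at index 2; after 5 is swapped with 17 at index 4; and after 5 is swapped with 14 at index 9, at which point heap order is restored.]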

Page 11: Tutorial 3 (b tree min heap)


Top-r Selection Using a Min-Heap

• The selection problem can be solved by a heap that stores the smallest item at the root: min-heap

• A min-heap of r items is held instead of a max-heap of M items, so memory use drops from O(M) to O(r)

• Process the M scores, storing in the min-heap the r largest values seen so far

– First r values are heapified in O(r) comparisons

– Replace the smallest value in the min-heap (the rth largest seen so far) whenever a larger value is found

• Sort the r highest values in descending order and return the corresponding documents: O(r log r)
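
The procedure above can be sketched with Python's standard `heapq` module, which is already a min-heap. The function name `top_r_min_heap` and the `(score, doc_id)` tuple layout are assumptions made for illustration.

```python
import heapq
from itertools import islice
from typing import Iterable, List, Tuple


def top_r_min_heap(scored: Iterable[Tuple[float, str]], r: int) -> List[Tuple[float, str]]:
    """Return the r highest-scoring (score, doc_id) pairs in decreasing score order."""
    if r <= 0:
        return []
    it = iter(scored)
    heap = list(islice(it, r))            # the first r values
    heapq.heapify(heap)                   # O(r); smallest kept score is at heap[0]
    for item in it:
        if item[0] > heap[0][0]:          # larger than the rth largest seen so far
            heapq.heapreplace(heap, item) # drop the root, insert item: O(log r)
    return sorted(heap, reverse=True)     # O(r log r)
```

Because the input is consumed as an iterator, only r items are held in memory at any time. The standard library's `heapq.nlargest` offers the same kind of selection ready-made.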

Page 12: Tutorial 3 (b tree min heap)


Min-Heap Processing - Illustration

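[Figure: the sequence of scores split into a processed part and an unprocessed part; a min-heap holds the r largest items among the processed ones, and the smallest value is discarded.]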

Page 13: Tutorial 3 (b tree min heap)


Top-r Selection Using a Min-Heap: Complexity Analysis

• Worst case: the scores are already in increasing order

– Each of the M-r last values is inserted into the heap

– Furthermore, it percolates to the bottom of the heap

– Complexity is O( (M-r)*log(r) )

• Average case: the scores arrive in a permutation of size M chosen uniformly at random

– The expected number of times one of the M-r last values is inserted into the heap is ~ r*ln(M/r)

– Each insertion costs O(log(r))

– Complexity of the heap updates is O( r*log(r)*log(M/r) ), in addition to the M-r comparisons against the heap's root

• Proof on the board
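
The average-case insertion count can be derived as follows. This is a sketch of one standard argument, not necessarily the proof given on the board; it assumes all scores are distinct, and the indicator variables X_i and harmonic numbers H_k are notation introduced here.

```latex
% X_i = 1 if the i-th value (i > r) is inserted into the heap, i.e. if it is
% among the r largest of the first i values. In a uniformly random permutation
% this happens with probability r / i.
\begin{align*}
\mathbb{E}\Big[\sum_{i=r+1}^{M} X_i\Big]
  &= \sum_{i=r+1}^{M} \Pr[X_i = 1]
   = \sum_{i=r+1}^{M} \frac{r}{i} \\
  &= r\,(H_M - H_r)
   \approx r \ln\frac{M}{r}
\end{align*}
```

Multiplying by the O(log r) cost per insertion gives the O(r log(r) log(M/r)) bound stated above.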

