

CSE 326: Data Structures Lecture #11

B-Trees

Alon Halevy

Spring Quarter 2001

B-Tree Properties

• Properties
  – maximum branching factor of M
  – the root has between 2 and M children
  – other internal nodes have between ⌈M/2⌉ and M children
  – internal nodes contain only search keys (no data)
  – smallest datum between search keys x and y equals x
  – each (non-root) leaf contains between ⌈L/2⌉ and L keys
  – all leaves are at the same depth

• Result
  – tree is log_{M/2}( n / (L/2) ) ± 1 deep, which is O(log n)
  – all operations run in time proportional to depth
  – operations pull in at least M/2 or L/2 items at a time

When Big-O is Not Enough

A B-Tree is about log_{M/2}( n / (L/2) ) deep

  = log_{M/2}( n ) − log_{M/2}( L/2 )

  = O( log_{M/2} n )

  = O( log n ) steps per operation (same as a BST!)

Where’s the beef?!

log_2( 10,000,000 ) ≈ 24 disk accesses

log_{200/2}( 10,000,000 ) = log_{100}( 10,000,000 ) < 4 disk accesses
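To see where those numbers come from, here is a quick back-of-the-envelope check (a sketch in C++; the 10,000,000 items and M = 200 come from the slide, everything else is illustrative):

#include <cmath>
#include <cstdio>

int main() {
    double n = 10000000.0;                        // 10 million items
    // a balanced binary tree needs about log2(n) node visits
    std::printf("log2(n)   = %.1f\n", std::log2(n));                    // ~23.3, i.e., about 24
    // a B-Tree with M = 200 branches at least M/2 = 100 ways per node
    std::printf("log100(n) = %.1f\n", std::log(n) / std::log(100.0));   // ~3.5, i.e., under 4
    return 0;
}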


B-Tree Nodes

• Internal node
  – i search keys; i + 1 subtrees; M − i − 1 inactive entries

• Leaf
  – j data keys; L − j inactive entries

[Figure: an internal node stores search keys k1 … ki in slots 1 … M − 1 (the remaining slots inactive); a leaf stores data keys k1 … kj in slots 1 … L (the remaining slots inactive).]
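The two node layouts can be sketched as follows (a minimal sketch assuming integer keys and data stored only at the leaves; the names BTreeInternal and BTreeLeaf are illustrative, not from the course code):

const int M = 4;   // maximum branching factor
const int L = 4;   // maximum number of keys in a leaf

struct BTreeNode {
    bool isLeaf;
    int  numKeys;                 // i for internal nodes, j for leaves
};

struct BTreeInternal : BTreeNode {
    int        keys[M - 1];       // search keys k1 .. ki (no data stored here)
    BTreeNode* children[M];       // i + 1 active subtrees
};

struct BTreeLeaf : BTreeNode {
    int keys[L];                  // data keys k1 .. kj
    // a real implementation would also store the associated data records here
};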

Example

A B-Tree with M = 4 and L = 4

[Figure: the root holds search keys 10 and 40; its three children hold keys (3), (15 20 30), and (50); the leaves, left to right, are (1 2), (3 5 6 9), (10 11 12), (15 17), (20 25 26), (30 32 33 36), (40 42), and (50 60 70).]

Making a B-Tree

The empty B-Tree

M = 3, L = 2

Insert(3): the tree is a single leaf holding 3.

Insert(14): that leaf now holds 3 and 14.

Now, Insert(1)?

Splitting the Root

Insert(1): the leaf becomes (1 3 14) — too many keys in a leaf!

So, split the leaf into (1 3) and (14), and create a new root with search key 14.

[Figure: before, the single leaf (1 3 14); after, a root with key 14 over the leaves (1 3) and (14).]
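Just the leaf-split step can be sketched like this (assuming each leaf keeps its keys in a sorted std::vector of int; the helper name splitLeaf is mine, not the lecture's):

#include <utility>
#include <vector>

// Split an overfull leaf (L + 1 keys) into two and return the new right leaf
// plus the search key the parent should gain: the smallest key of that new leaf.
std::pair<std::vector<int>, int> splitLeaf(std::vector<int>& leaf) {
    std::size_t keep = (leaf.size() + 1) / 2;              // original keeps ceil((L+1)/2) keys
    std::vector<int> newLeaf(leaf.begin() + keep, leaf.end());
    leaf.resize(keep);                                     // the new leaf takes the rest
    return { newLeaf, newLeaf.front() };                   // key to hand to the parent
}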

Insertions and Split Ends

Insert(59): 59 joins the right leaf, giving a root with key 14 over the leaves (1 3) and (14 59).

Insert(26): that leaf becomes (14 26 59) — too many keys in a leaf!

So, split the leaf into (14 26) and (59), and add the new child to the parent.

[Figure: root with keys 14 and 59 over the leaves (1 3), (14 26), and (59).]

Propagating Splits

Insert(5): the leftmost leaf becomes (1 3 5) — too many keys in a leaf!

So, split the leaf into (1 3) and (5) and add the new child to the parent. But the parent now holds keys 5, 14, 59 with four subtrees — too many keys in an internal node!

So, split the node: one node keeps key 5 over the leaves (1 3) and (5), the other keeps key 59 over the leaves (14 26) and (59), and the middle key 14 moves up. Create a new root with key 14 over the two internal nodes.

[Figure: root with key 14; left child with key 5 over leaves (1 3) and (5); right child with key 59 over leaves (14 26) and (59).]

Insertion in Boring Text

• Insert the key in its leaf

• If the leaf ends up with L + 1 items, overflow!
  – Split the leaf into two nodes:
    • original with ⌈(L+1)/2⌉ items
    • new one with ⌊(L+1)/2⌋ items
  – Add the new child to the parent
  – If the parent ends up with M + 1 items, overflow!

• If an internal node ends up with M + 1 items, overflow!
  – Split the node into two nodes:
    • original with ⌈(M+1)/2⌉ items
    • new one with ⌊(M+1)/2⌋ items
  – Add the new child to the parent
  – If the parent ends up with M + 1 items, overflow!

• Split an overflowed root in two and hang the new nodes under a new root

This makes the tree deeper!
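Putting the whole procedure together, here is a compact sketch (assuming integer keys, data only in the leaves, and vector-based nodes; the names Node, SplitResult, insertRec, and insert are illustrative, not the course code):

#include <algorithm>
#include <iterator>
#include <memory>
#include <optional>
#include <vector>

constexpr int M = 3;   // max children per internal node
constexpr int L = 2;   // max keys per leaf

struct Node {
    bool isLeaf = true;
    std::vector<int> keys;                          // data keys (leaf) or search keys (internal)
    std::vector<std::unique_ptr<Node>> children;    // empty for leaves
};

struct SplitResult { int upKey; std::unique_ptr<Node> right; };  // key + new sibling for the parent

static std::optional<SplitResult> insertRec(Node* n, int key) {
    if (n->isLeaf) {
        n->keys.insert(std::upper_bound(n->keys.begin(), n->keys.end(), key), key);
        if ((int)n->keys.size() <= L) return std::nullopt;
        auto right = std::make_unique<Node>();                       // overflow: split the leaf
        std::size_t keep = (n->keys.size() + 1) / 2;                 // original keeps ceil((L+1)/2)
        right->keys.assign(n->keys.begin() + keep, n->keys.end());
        n->keys.resize(keep);
        return SplitResult{ right->keys.front(), std::move(right) }; // new leaf's smallest key goes up
    }
    // internal node: keys >= keys[i] live in children[i + 1]
    std::size_t i = std::upper_bound(n->keys.begin(), n->keys.end(), key) - n->keys.begin();
    auto split = insertRec(n->children[i].get(), key);
    if (!split) return std::nullopt;
    n->keys.insert(n->keys.begin() + i, split->upKey);               // add the new child to this node
    n->children.insert(n->children.begin() + i + 1, std::move(split->right));
    if ((int)n->children.size() <= M) return std::nullopt;
    auto right = std::make_unique<Node>();                           // overflow: split the internal node
    right->isLeaf = false;
    std::size_t keepC = (n->children.size() + 1) / 2;                // original keeps ceil((M+1)/2) children
    int upKey = n->keys[keepC - 1];                                  // middle key moves up to the parent
    right->keys.assign(n->keys.begin() + keepC, n->keys.end());
    right->children.assign(std::make_move_iterator(n->children.begin() + keepC),
                           std::make_move_iterator(n->children.end()));
    n->keys.resize(keepC - 1);
    n->children.resize(keepC);
    return SplitResult{ upKey, std::move(right) };
}

// Top-level insert: if the root splits, hang both halves under a new root (this is
// the only way the tree gets deeper).
void insert(std::unique_ptr<Node>& root, int key) {
    if (!root) root = std::make_unique<Node>();
    auto split = insertRec(root.get(), key);
    if (!split) return;
    auto newRoot = std::make_unique<Node>();
    newRoot->isLeaf = false;
    newRoot->keys.push_back(split->upKey);
    newRoot->children.push_back(std::move(root));
    newRoot->children.push_back(std::move(split->right));
    root = std::move(newRoot);
}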

Deletion in B-trees

• Come to section tomorrow. Slides follow.

After More Routine Inserts

Insert(89), then Insert(79): 89 joins the leaf (59); 79 then overflows that leaf, which splits into (59 79) and (89), and the parent gains search key 89.

[Figure: root with key 14; left child with key 5 over leaves (1 3) and (5); right child with keys 59 and 89 over leaves (14 26), (59 79), and (89).]

Deletion

Delete(59): remove 59 from its leaf, leaving (79); the leaf still has enough keys, and the parent's search key becomes 79.

[Figure: root with key 14; left child with key 5 over leaves (1 3) and (5); right child with keys 79 and 89 over leaves (14 26), (79), and (89).]

Deletion and Adoption

Delete(5)? The leaf (5) becomes empty — a leaf has too few keys!

So, borrow from a neighbor: adopt 3 from the sibling (1 3). The leaves become (1) and (3), and the parent's search key becomes 3.

[Figure: root with key 14; left child with key 3 over leaves (1) and (3); right child with keys 79 and 89 over leaves (14 26), (79), and (89).]

Deletion with Propagation

Delete(3)? The leaf (3) becomes empty — a leaf has too few keys! And no neighbor with surplus!

So, delete the leaf. But now a node has too few subtrees! Adopt from a neighbor.

[Figure: after deleting the leaf, root with key 14; left child has only the leaf (1) and no key; right child has keys 79 and 89 over leaves (14 26), (79), and (89).]

Finishing the Propagation (More Adoption)

The under-full node adopts a subtree from its neighbor: it takes the leaf (14 26), gaining search key 14, and the root's search key becomes 79.

[Figure: root with key 79; left child with key 14 over leaves (1) and (14 26); right child with key 89 over leaves (79) and (89).]

Next: Delete(1) (adopt from a neighbor).

A Bit More Adoption

Delete(1): the empty leaf adopts 14 from its sibling (14 26); the leaves become (14) and (26), and the parent's search key becomes 26. Then: Delete(26)?

[Figure: root with key 79; left child with key 26 over leaves (14) and (26); right child with key 89 over leaves (79) and (89).]

Pulling out the Root

Delete(26) empties its leaf — a leaf has too few keys! And no neighbor with surplus!

So, delete the leaf. But now a node has too few subtrees and no neighbor with surplus!

Delete the node, handing its remaining leaf (14) and the separating key to its neighbor. But now the root has just one subtree!

[Figure: a root with a single child; that child has keys 79 and 89 over the leaves (14), (79), and (89).]

Pulling out the Root (continued)

The root has just one subtree! But that's silly!

Just make the one child the new root!

[Figure: the final tree is a root with keys 79 and 89 over the leaves (14), (79), and (89).]

Deletion in Two Boring Slides of Text

• Remove the key from its leaf

• If the leaf ends up with fewer than ⌈L/2⌉ items, underflow!
  – Adopt data from a neighbor; update the parent
  – If borrowing won't work, delete the leaf and divide its keys between the neighbors
  – If the parent ends up with fewer than ⌈M/2⌉ items, underflow!

Why will dumping keys always work if borrowing doesn’t?

Deletion Slide Two

• If a node ends up with fewer than ⌈M/2⌉ items, underflow!
  – Adopt subtrees from a neighbor; update the parent
  – If borrowing won't work, delete the node and divide its subtrees between the neighbors
  – If the parent ends up with fewer than ⌈M/2⌉ items, underflow!

• If the root ends up with only one child, make the child the new root of the tree

This reduces the height of the tree!
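The leaf-level case can be sketched as follows (repeating the vector-based Node from the insertion sketch so it stands alone; handleLeafUnderflow and L_MIN are my names, and a full implementation would go on to propagate internal-node underflow the same way):

#include <memory>
#include <vector>

constexpr int L_MIN = (2 + 1) / 2;   // ceil(L/2) with L = 2

struct Node {
    bool isLeaf = true;
    std::vector<int> keys;
    std::vector<std::unique_ptr<Node>> children;
};

// Called after a key was removed from parent->children[i] and that leaf underflowed.
void handleLeafUnderflow(Node* parent, std::size_t i) {
    Node* leaf = parent->children[i].get();
    // 1) Adopt from the left or right sibling if it has keys to spare, then fix the parent's key.
    if (i > 0 && (int)parent->children[i - 1]->keys.size() > L_MIN) {
        Node* left = parent->children[i - 1].get();
        leaf->keys.insert(leaf->keys.begin(), left->keys.back());
        left->keys.pop_back();
        parent->keys[i - 1] = leaf->keys.front();          // this leaf's new smallest key
        return;
    }
    if (i + 1 < parent->children.size() && (int)parent->children[i + 1]->keys.size() > L_MIN) {
        Node* right = parent->children[i + 1].get();
        leaf->keys.push_back(right->keys.front());
        right->keys.erase(right->keys.begin());
        parent->keys[i] = right->keys.front();             // right sibling's new smallest key
        if (i > 0) parent->keys[i - 1] = leaf->keys.front();
        return;
    }
    // 2) Borrowing won't work: dump the remaining keys into a sibling and delete this leaf.
    std::size_t sib = (i > 0) ? i - 1 : i + 1;
    Node* sibling = parent->children[sib].get();
    if (sib < i) sibling->keys.insert(sibling->keys.end(), leaf->keys.begin(), leaf->keys.end());
    else         sibling->keys.insert(sibling->keys.begin(), leaf->keys.begin(), leaf->keys.end());
    parent->children.erase(parent->children.begin() + i);
    parent->keys.erase(parent->keys.begin() + (i > 0 ? i - 1 : 0));
    // the parent may now have too few subtrees; propagate the same adopt-or-merge logic upward
}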

Thinking about B-Trees

• B-Tree insertion can cause (expensive) splitting and propagation

• B-Tree deletion can cause (cheap) borrowing or (expensive) deletion and propagation

• Propagation is rare if M and L are large (Why?)

• Repeated insertions and deletions can cause thrashing

• If M = L = 128, then a B-Tree of height 4 will store at least 30,000,000 items

– height 5: 2,000,000,000!
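A quick check of those capacity figures, using the minimum-occupancy bounds from the properties slide (at least 2 root children, at least M/2 = 64 children per internal node, at least L/2 = 64 keys per leaf); the code is just my arithmetic sketch:

#include <cstdio>

int main() {
    const long long minChildren = 128 / 2;      // M/2 with M = 128
    const long long minLeafKeys = 128 / 2;      // L/2 with L = 128
    for (int h = 4; h <= 5; ++h) {
        // leaves sit at depth h: the root has >= 2 children, then 64-way branching below it
        long long leaves = 2;
        for (int d = 2; d <= h; ++d) leaves *= minChildren;   // 2 * 64^(h-1) leaves
        std::printf("height %d: at least %lld items\n", h, leaves * minLeafKeys);
        // height 4: 33,554,432 (~30 million); height 5: 2,147,483,648 (~2 billion)
    }
    return 0;
}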

Tree Summary

• BST: fast finds, inserts, and deletes O(log n) on average (if data is random!)

• AVL trees: guaranteed O(log n) operations

• B-Trees: also guaranteed O(log n), but their shallower depth makes them better for disk-based databases

• What would be even better?
  – How about O(1) finds and inserts?

Hash Table Approach

But… is there a problem in this pipe-dream?

[Figure: the keys Zasha, Steve, Nic, Brad, and Ed are fed through a hash function f(x) into slots of a table.]

Hash Table Dictionary Data Structure

• Hash function: maps keys to integers
  – result: can quickly find the right spot for a given entry

• Unordered and sparse table
  – result: cannot efficiently list all entries
  – cannot find min and max efficiently
  – cannot find all items within a specified range efficiently


Hash Table Terminology

[Figure: the same picture, labeled: the names (Zasha, Steve, Nic, Brad, Ed) are the keys, f(x) is the hash function, and two keys landing in the same slot is a collision.]

load factor = (# of entries in table) / tableSize

Hash Table Code: First Pass

Value & find(Key & key) {
    int index = hash(key) % tableSize;
    return Table[index];
}

What should the hash function be?

What should the table size be?

How should we resolve collisions?

A Good Hash Function…

…is easy (fast) to compute (O(1) and practically fast).
…distributes the data evenly (hash(a) ≠ hash(b)).
…uses the whole hash table (for all 0 ≤ k < size, there's an i such that hash(i) % size = k).

Good Hash Function for Integers

• Choose
  – tableSize is prime
  – hash(n) = n % tableSize

• Example:
  – tableSize = 7
  – insert(4), insert(17), find(12), insert(9), delete(17)

[Figure: a 7-slot table with indices 0 through 6.]
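Working through those operations with hash(n) = n % 7 (my own trace of the slide's example):

#include <cstdio>

int main() {
    const int tableSize = 7;                   // prime, as chosen above
    const int keys[] = { 4, 17, 12, 9 };       // insert(4), insert(17), find(12), insert(9)
    for (int k : keys)
        std::printf("%2d %% %d = %d\n", k, tableSize, k % tableSize);
    // 4 -> slot 4, 17 -> slot 3, 12 -> slot 5 (not found), 9 -> slot 2;
    // delete(17) then clears slot 3 again
    return 0;
}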

Good Hash Function for Strings?

• Ideas?

Good Hash Function for Strings?

• Sum the ASCII values of the characters.

• Consider only the first 3 characters.
  – Uses only 2,871 out of 17,576 entries in the table on English words.

• Let s = s1 s2 s3 s4 … sn: choose
  – hash(s) = s1 + s2·128 + s3·128^2 + s4·128^3 + … + sn·128^(n−1)

• Problems:
  – hash("really, really big") = well… something really, really big
  – hash("one thing") % 128 = hash("other thing") % 128

Think of the string as a base 128 number.

Making the String Hash Easy to Compute

• Use Horner’s Rule

int hash(String s) {
    int h = 0;
    for (int i = s.length() - 1; i >= 0; i--) {
        h = (s[i] + 128*h) % tableSize;   // evaluates s1 + s2·128 + … + sn·128^(n−1) with no explicit powers
    }
    return h;
}

Universal Hashing

• For any fixed hash function, there will be some pathological sets of inputs
  – everything hashes to the same cell!

• Solution: Universal Hashing
  – Start with a large (parameterized) class of hash functions
    • No sequence of inputs is bad for all of them!
  – When your program starts up, pick one of the hash functions to use at random (for the entire time)
  – Now: no bad inputs, only unlucky choices!
    • If the universal class is large, the odds of making a bad choice are very low
    • If you do find you are in trouble, just pick a different hash function and re-hash the previous inputs

Universal Hash Function: “Random” Vector Approach

• Parameterized by a prime size and a vector: a = <a0 a1 … ar> where 0 ≤ ai < size

• Represent each key as r + 1 integers ki with ki < size
  – size = 11, key = 39752 ==> <3,9,7,5,2>
  – size = 29, key = "hello world" ==> <8,5,12,12,15,23,15,18,12,4>

ha(k) = (a0·k0 + a1·k1 + … + ar·kr) mod size

— a dot product with a "random" vector!
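A minimal sketch of this scheme (assuming the key has already been broken into digits k0 … kr with each ki < size, as in the examples above; the class name UniversalHash and the use of std::mt19937 are mine):

#include <random>
#include <vector>

struct UniversalHash {
    unsigned long size;              // prime table size
    std::vector<unsigned long> a;    // the "random" vector, picked once at startup

    UniversalHash(unsigned long primeSize, std::size_t r)
        : size(primeSize), a(r + 1) {
        std::mt19937 gen(std::random_device{}());
        std::uniform_int_distribution<unsigned long> dist(0, primeSize - 1);
        for (auto& ai : a) ai = dist(gen);                 // 0 <= ai < size
    }

    unsigned long operator()(const std::vector<unsigned long>& k) const {
        unsigned long h = 0;
        for (std::size_t i = 0; i < a.size() && i < k.size(); ++i)
            h = (h + a[i] * k[i]) % size;                  // dot product, reduced mod size
        return h;
    }
};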

Universal Hash Function

• Strengths:
  – works on any type as long as you can form the ki's
  – if we're building a static table, we can try many a's
  – a random a has guaranteed good properties no matter what we're hashing

• Weaknesses:
  – must choose a prime table size larger than any ki

Hash Function Summary

• Goals of a hash function
  – reproducible mapping from key to table entry

– evenly distribute keys across the table

– separate commonly occurring keys (neighboring keys?)

– complete quickly

• Hash functions
  – h(n) = n % size

– h(n) = string as base 128 number % size

– Universal hash function #1: dot product with random vector

How to Design a Hash Function

• Know what your keys are
• Study how your keys are distributed
• Try to include all important information in a key in the construction of its hash
• Try to make "neighboring" keys hash to very different places
• Prune the features used to create the hash until it runs "fast enough" (very application dependent)

Collisions

• Pigeonhole principle says we can't avoid all collisions
  – try to hash m keys into n slots with m > n without collision
  – try to put 6 pigeons into 5 holes

• What do we do when two keys hash to the same entry?
  – open hashing: put little dictionaries in each entry (see the sketch below)
  – closed hashing: pick a next entry to try
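To make the open-hashing idea concrete, a minimal separate-chaining sketch assuming non-negative integer keys and the h(n) = n % size hash from earlier (the class name ChainedHashSet is mine):

#include <algorithm>
#include <list>
#include <vector>

// Open hashing: each table entry holds a "little dictionary" (here, a linked list of keys).
struct ChainedHashSet {
    std::vector<std::list<int>> table;

    explicit ChainedHashSet(std::size_t tableSize) : table(tableSize) {}

    std::size_t hash(int key) const { return static_cast<std::size_t>(key) % table.size(); }

    void insert(int key) {
        auto& bucket = table[hash(key)];
        if (std::find(bucket.begin(), bucket.end(), key) == bucket.end())
            bucket.push_back(key);                 // colliding keys simply share a bucket
    }

    bool find(int key) const {
        const auto& bucket = table[hash(key)];
        return std::find(bucket.begin(), bucket.end(), key) != bucket.end();
    }
};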