O logn Indexingdjmoon/chris/chris-notes/db... · 2017. 1. 22. · Indexing: Intro No matter what a...

Indexing: Intro

• No matter what a file’s organization, sequential search can always be used to locate an individual record (O(n), where n is the number of blocks)

• Ordered files have the advantage that binary search can be used (O(log n))

• Indexing is a general term that refers to the use of a data structure to enhance processing of individual records even more

• An index uses extra storage in order to make processing efficient

• It sometimes imposes a storage scheme on the file

• Indexing schemes include

– Indices (Indexed sequential organization, keyed indexing)

– B-trees

– B+-trees

– Hashing


Indexing: Indexed Sequential Organization (ISO)

• An index is a table of < key, pointer > pairs

– Each key represents a value associated with a record

– The corresponding pointer points to the block in the data file containing that record

∗ Note: An entry may also include the record’s offset in the block

• Pairs in an index block are ordered sequentially by key value

• The index is stored as a separate file

• Index files are ordered

• 2 types of ISO

1. Dense

– Suitable for ordered or unordered data files

– Index has one entry for each data entry

– Example

∗ Data block can hold 4 records (only keys are shown)

∗ Index block can hold 10 key-pointer pairs

Index File
block   key    pointer
1       2000   11
        2100   11
        2200   11
        2275   11
        2847   12
        3059   12
        3125   12
        3450   12
        3495   13
        3500   13
2       3570   13
        3600   13
        3700   14
        3760   14
        ...    ...

Data File
block   keys
11      2000  2100  2200  2275
12      2847  3059  3125  3450
13      3495  3500  3570  3600
14      3700  3760  ...   ...


Indexing: Indexed Sequential Organization (2)

– Savings result because blocking factor of index >> blocking factor of data file

∗ Blocking factor f = ⌊bytesPerBlock / bytesPerRecord⌋

∗ Thus, index will have fewer blocks to be searched

2. Operations:

(a) Search

i. Use binary search for key in index

ii. Use sequential search for record in associated block

(b) Deletion

i. Delete data record

ii. Delete corresponding index entry

– Since index is an ordered file, can handle deletion of index entry in any of the ways discussed

iii. Further steps depend on the way in which deletion of data file records is handled

– If pack data file, will need to adjust pointers to records that move between blocks

– If use delete bit, nothing further required

(c) Insertion

– Analogous to deletion

3. Sparse

– Uses one index entry per block of data records

∗ Thus, each index entry corresponds to f records (blocking factor)

– Only applicable for ordered data files

– Key value can be either largest or smallest value in data block


Indexing: Indexed Sequential Organization (3)

– Example

Index File
block   key    pointer
1       2000   11
        2847   12
        3495   13
        3700   14
        4040   15
        4250   16
        4350   17
        4455   18
        4587   19
        4890   20
2       5112   21
        5193   22
        5197   23
        5399   24

Data File
block   keys
11      2000  2100  2200  2275
12      2847  3059  3125  3450
13      3495  3500  3570  3600
14      3700  3760  3780  3790
15      4040  4060  4110  4190
16      4250  4270  4290  4325
17      4350  4375  4400  4435
18      4455  4478  4495  4550
19      4587  4613  4633  4671
20      4890  4944  4967  4991
21      5112  5134  5169  5192
22      5193  5194  5195  5196
23      5197  5330  5376  5380
24      5399  5511  5532  5533

4. Operations:

(a) Search

i. Use binary search for key in index

ii. Search will halt when I_j ≤ K < I_{j+1}

iii. Use sequential search for record in block associated with entry I_j

(b) Deletion

– Similar to dense case

∗ If maintain packed data file, will need to adjust entries in index

∗ If use delete bit, no adjustments necessary

(c) Insertion

– Analogous to deletion for packed data files or ones using delete bits

– If use overflow blocks, no adjustments to index necessary
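The sparse-index search steps above can be sketched in Python. This is a minimal illustration rather than a definitive implementation: the function name `sparse_lookup`, the in-memory lists standing in for index and data blocks, and the choice of the smallest key per block as the index key are assumptions made for the example. The data is a subset of the sparse example.

```python
from bisect import bisect_right

def sparse_lookup(index, data_blocks, key):
    """Return the id of the data block holding key, or None if absent.

    index       -- sorted list of (smallest key in block, block id) pairs
    data_blocks -- dict mapping block id to the sorted keys stored in it
    (Both structures are hypothetical stand-ins for disk blocks.)
    """
    index_keys = [k for k, _ in index]
    # Binary search: last entry j with I_j <= key, so I_j <= key < I_{j+1}.
    j = bisect_right(index_keys, key) - 1
    if j < 0:
        return None  # key is smaller than every indexed key
    block_id = index[j][1]
    # Sequential search inside the single candidate block.
    for stored in data_blocks[block_id]:
        if stored == key:
            return block_id
    return None

index = [(2000, 11), (2847, 12), (3495, 13), (3700, 14)]
data_blocks = {
    11: [2000, 2100, 2200, 2275],
    12: [2847, 3059, 3125, 3450],
    13: [3495, 3500, 3570, 3600],
    14: [3700, 3760, 3780, 3790],
}
print(sparse_lookup(index, data_blocks, 3125))
```

Only one data block is read per lookup; the binary search touches index entries only.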


Indexing: ISO - Clustering

• If index based on non-key field, indexing value will generally refer to multiple records

• Such an index is called a clustering index

• Clustering index is sparse

• Frequently dedicate block to each value of index to facilitate insertions and deletions

– Link blocks if more than one required for records with the same indexing value

• Example:

Index File
block   key       pointer
1       Arnold    11
        Black     12
        Charles   13
        Doe       14
        Donovan   15
        Evans     16
        Jones     17
        Murphy    18
        Smith     19
        Stone     20
2       Stoner    21
        Swartz    22
        Thomas    23
        White     24

Data File
block   records                        link
11      Fred Joe                       -1
12      Sue Dave Al Mary               30
13      Carol Joe Mark Glen            -1
14      Barry Bill Bart Bud            35
15      Sharon Cicely Corey            -1
16      Florence                       -1
17      Harry Zach Morgan              -1
18      Riley Jenny Caine Derrick      55
19      John MaryAnn Anne              -1
20      Jay Margaret Sally             -1
21      Angus Ozzie                    -1
22      Malcolm Bonn Oscar             -1
23      Petunia Porky Minnie Mickey    61
24      Tom                            -1
...
30      Karen                          -1
...
35      Bruce Becky                    -1
...
55      Jan Dave Connor Kelly          58
...
58      Colleen Dave                   -1
...
61      Donald Daisy                   -1


Indexing: Dynamic ISO

• Because ISO requires sequential files, insertions and deletions are problematic

• Dynamic ISO is designed to alleviate these problems

• It allows data blocks to be partially (half) full

• Insertion Algorithm

1. Search index for block in which to insert record

2. If block is not full

(a) Insert record into block

3. If block is full

(a) Allocate new block

(b) Move records in upper half of original to new block

(c) Insert an entry into index for new block

(d) Insert record into original or new block as appropriate

• Deletion Algorithm

1. Search index for block in which to find record

2. Delete record from block

3. If block is empty

(a) Deallocate block

(b) Delete the index entry for the block from the index block
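The insertion and deletion algorithms above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not a definitive implementation: the class name `DynamicISO`, the in-memory dict standing in for data blocks, and the block numbering starting at 11 are hypothetical. One behaviour is inferred from the Dynamic ISO example rather than from the algorithm text: when the smallest key of a block changes, the block's index key is updated to match.

```python
from bisect import bisect_right, insort

class DynamicISO:
    """Sparse index over data blocks that may be partially full (sketch)."""

    def __init__(self, capacity=4, first_block_id=11):
        self.capacity = capacity          # records per data block
        self.next_id = first_block_id     # hypothetical block numbering
        self.index = []                   # sorted [smallest key, block id] pairs
        self.blocks = {}                  # block id -> sorted list of keys

    def _allocate(self):
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = []
        return block_id

    def _find(self, key):
        """Index position of the block that should hold key."""
        keys = [k for k, _ in self.index]
        return max(bisect_right(keys, key) - 1, 0)

    def insert(self, key):
        if not self.index:                            # very first record
            self.index.append([key, self._allocate()])
        pos = self._find(key)
        block = self.blocks[self.index[pos][1]]
        if len(block) >= self.capacity:               # block is full: split it
            new_id = self._allocate()
            half = len(block) // 2
            self.blocks[new_id] = block[half:]        # upper half moves out
            del block[half:]
            self.index.insert(pos + 1, [self.blocks[new_id][0], new_id])
            if key >= self.blocks[new_id][0]:         # record belongs in new block
                pos, block = pos + 1, self.blocks[new_id]
        insort(block, key)
        self.index[pos][0] = block[0]                 # keep index key = smallest key

    def delete(self, key):
        if not self.index:
            return False
        pos = self._find(key)
        block_id = self.index[pos][1]
        block = self.blocks[block_id]
        if key not in block:
            return False
        block.remove(key)
        if block:
            self.index[pos][0] = block[0]
        else:                                         # empty: deallocate block
            del self.blocks[block_id]
            del self.index[pos]
        return True

iso = DynamicISO()
for k in (2000, 2500, 3000, 2200, 2700, 2400, 2300, 2100):
    iso.insert(k)
print(iso.index)
print(iso.blocks)
```

The insertion sequence is the one used in the Dynamic ISO example, so the printed index and blocks can be compared against it.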


Indexing: Dynamic ISO (2)

• Example:

Operation           Index File (block 1)                Data File
                    key → pointer                       block: keys

(initial)           (empty)                             11: (empty)

Insert 2000         2000 → 11                           11: 2000

Insert 2500         2000 → 11                           11: 2000 2500

Insert 3000         2000 → 11                           11: 2000 2500 3000

Insert 2200         2000 → 11                           11: 2000 2200 2500 3000

Insert 2700         2000 → 11, 2500 → 12                11: 2000 2200
                                                        12: 2500 2700 3000

Insert 2400, 2300   2000 → 11, 2500 → 12                11: 2000 2200 2300 2400
                                                        12: 2500 2700 3000

Insert 2100         2000 → 11, 2300 → 13, 2500 → 12     11: 2000 2100 2200
                                                        12: 2500 2700 3000
                                                        13: 2300 2400


Indexing: Dynamic ISO (3)

Operation           Index File (block 1)                Data File
                    key → pointer                       block: keys

Delete 3000         2000 → 11, 2300 → 13, 2500 → 12     11: 2000 2100 2200
                                                        12: 2500 2700
                                                        13: 2300 2400

Delete 2300         2000 → 11, 2400 → 13, 2500 → 12     11: 2000 2100 2200
                                                        12: 2500 2700
                                                        13: 2400

Delete 2400         2000 → 11, 2500 → 12                11: 2000 2100 2200
                                                        12: 2500 2700


Indexing: Multilevel Indexing

• For a single-level index of b blocks, binary search requires ⌈log_2 b⌉ + 1 disk accesses in the worst case

• Can improve by using multilevel index

– I.e., index with its own index

• Index into the data file is first-level index

– Next level is second-level, etc.

• Each level will have one entry per block of the lower level index

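• If f is the blocking factor of the index and the first-level index has r entries, the index has ⌈log_f r⌉ levels, so a search requires ⌈log_f r⌉ + 1 disk accesses in the worst case (one block per index level plus the data block)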

• Example:

– Data block can hold 4 records (only keys are shown)

– Index block can hold 4 key-pointer pairs

Level 2 Index
block   key    pointer
1       2000   11
        4040   12
        4587   13
        5197   14

Level 1 Index
block   key    pointer
11      2000   21
        2847   22
        3495   23
        3700   24
12      4040   25
        4250   26
        4350   27
        4455   28
13      4587   29
        4890   30
        5112   31
        5193   32
14      5197   33
        5399   34

Data File
block   keys
21      2000  2100  2200  2275
22      2847  3059  3125  3450
23      3495  3500  3570  3600
24      3700  3760  3780  3790
25      4040  4060  4110  4190
26      4250  4270  4290  4325
27      4350  4375  4400  4435
28      4455  4478  4495  4550
29      4587  4613  4633  4671
30      4890  4944  4967  4991
31      5112  5134  5169  5192
32      5193  5194  5195  5196
33      5197  5330  5376  5380
34      5399  5511  5532  5533
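A multilevel lookup over the example above can be sketched in Python. This is a rough sketch under assumptions: `build_levels` and `lookup` are hypothetical names, index blocks are modelled as Python lists, and pointers are implicit (entry i of a level refers to block i of the level below) instead of stored block numbers. It counts one disk access per index level plus one for the data block.

```python
from bisect import bisect_right

def build_levels(first_keys, fanout):
    """Build index levels bottom-up and return them top level first.

    first_keys -- smallest key of each data block, in file order
    fanout     -- key-pointer pairs per index block (blocking factor f)
    """
    if not first_keys:
        raise ValueError("need at least one data block")
    levels = []
    entries = list(first_keys)
    while True:
        blocks = [entries[i:i + fanout] for i in range(0, len(entries), fanout)]
        levels.append(blocks)
        if len(blocks) == 1:            # a single block: this is the top level
            break
        entries = [block[0] for block in blocks]   # one entry per lower block
    return levels[::-1]

def lookup(levels, fanout, key):
    """Return (position of the data block to read, total disk accesses)."""
    position = 0
    accesses = 0
    for level in levels:
        block = level[position]
        accesses += 1                                  # read one index block
        slot = max(bisect_right(block, key) - 1, 0)    # last entry <= key
        position = position * fanout + slot
    return position, accesses + 1                      # +1 for the data block

first_keys = [2000, 2847, 3495, 3700, 4040, 4250, 4350,
              4455, 4587, 4890, 5112, 5193, 5197, 5399]
levels = build_levels(first_keys, fanout=4)
position, accesses = lookup(levels, 4, 4400)
print(21 + position, accesses)   # data blocks are numbered from 21 in the example
```

With 14 first-level entries and a fanout of 4, the sketch builds two index levels, matching the Level 1 and Level 2 indexes shown in the example.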


Indexing: ISAM

• ISAM stands for Indexed Sequential Access Method

• Created for IBM 360/370 mainframes

• Uses multilevel indexing based on levels of physical disk storage

– Master index

∗ One master index per file

∗ Index to cylinders

– Cylinder index

∗ One cylinder index per cylinder

∗ Index to surfaces

– Track index

∗ One per surface

∗ Index to blocks

• Each index can itself be hierarchical

• Example

Master Index
block   key    pointer
1       0000   cylinder 0
        1000   cylinder 1
        2000   cylinder 2
        3000   cylinder 3
        ...

Cylinder 5 Index
block   key    pointer
n       5000   surface 0
        5100   surface 1
        5200   surface 2
        5300   surface 3
        ...

Cylinder 5, Surface 3 Index
block   key    pointer
r       5300   block 0
        5310   block 1
        5320   block 2
        5330   block 3
        ...


Indexing: Secondary Indices

• Often want to be able to access records based on field other than primary key

• Can create indices for these also

• Note:

– Many secondary indices can exist

– Only one primary index can exist (for a given file)

• 2 general cases:

1. Secondary index based on unique field

– Index will have one entry per record (dense index)

2. Secondary index based on non-unique field

(a) Link records with same key value together

– Search index, then sequentially search linked list

– Requires an additional pointer field for data records

(b) Use variable-length records for index

– Variable part is repeating field that holds pointers to records

(c) Use hierarchical index

– Lowest level is dense index

– Each block in lowest level is dedicated to a specific value

– Second level is sparse index with one entry per value

– Secondary entries point to blocks in first level


Indexing: Secondary Indices (2)

– Example:

Level 2 Index
block   key      pointer
1       Black    11
        Doe      12
        Smith    13
        Stone    14

Level 1 Index
block   key    pointer
11      ...    ...
        4110   25
        5399   34
12      2000   21
        4550   28
        4350   27
        4455   28
13      ...    ...

Data File
block   keys
21      2000  2100  2200  2275
22      2847  3059  3125  3450
23      3495  3500  3570  3600
24      3700  3760  3780  3790
25      4040  4060  4110  4190
26      4250  4270  4290  4325
27      4350  4375  4400  4435
28      4455  4478  4495  4550
29      4587  4613  4633  4671
30      4890  4944  4967  4991
31      5112  5134  5169  5192
32      5193  5194  5195  5196
33      5197  5330  5376  5380
34      5399  5511  5532  5533


Indexing: B-Trees - Intro

• Problem with ISO is that performance degrades as DB evolves

– Data file must remain ordered

– Delete bits, overflow blocks, linked lists all impose problems

• B-tree is alternative

– Dynamic data structure

– Always balanced

∗ All leaves at same depth

∗ Guarantees limit on search depth

∗ Uses minimal storage overhead

• A B-Tree of order n is a tree such that

1. Every node has at most n children

2. Every node, except for the root and the leaves, has at least ⌈n/2⌉ children

3. If the tree contains more than a single node, then the root has at least two children

4. All leaves appear on the same level

5. A non-leaf node with j children has j − 1 keys


Indexing: B-Trees - Intro (2)

• Node structure

– A node consists of triples consisting of a

∗ Node-pointer

∗ Key

∗ Record-pointer

< P_i, K_i, R_i >.

– Key value K will appear exactly once in the B-tree

– A node is represented as

< P_1, K_1, R_1 >< P_2, K_2, R_2 > . . . < P_{n−1}, K_{n−1}, R_{n−1} >< P_n >

where

∗ K_1 < K_2 < . . . < K_{n−1}

∗ R_i points to the record whose key value is K_i

∗ K_j in the subtree of P_i are such that K_{i−1} < K_j < K_i

– Leaf nodes have the same structure as interior nodes except that all Pi are null

• All insertions and deletions occur at leaf nodes

• n chosen to be blocking factor of triples

• Example:


Indexing: B-Trees - Search

node <- root
found <- false
done <- false
repeat {
    find smallest Ki in node such that x <= Ki
    (if x is greater than every key in node, let Pi be the rightmost pointer)
    if (such a Ki exists and x = Ki) {
        found <- true
        done <- true
    }
    else if (node is leaf)
        done <- true
    else
        node <- Pi
} until done
if (found)
    return Ri
else
    return NULL
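The search procedure above translates fairly directly into Python. The sketch below assumes an in-memory node class (`BTreeNode` is a hypothetical name) whose `keys`, `records` and `children` lists play the roles of the K_i, R_i and P_i fields; a leaf is modelled as a node with no children. It also handles the case where the search key is greater than every key in a node by following the rightmost pointer.

```python
from bisect import bisect_left

class BTreeNode:
    """In-memory stand-in for a B-tree block (hypothetical layout)."""

    def __init__(self, keys, records, children=None):
        self.keys = keys                  # K_1 < K_2 < ...
        self.records = records            # R_i paired with K_i
        self.children = children or []    # P_1 .. P_{j}; empty for a leaf

def btree_search(root, x):
    """Return the record pointer stored with key x, or None if x is absent."""
    node = root
    while node is not None:
        # Smallest i with x <= keys[i]; equals len(keys) if x exceeds all keys.
        i = bisect_left(node.keys, x)
        if i < len(node.keys) and node.keys[i] == x:
            return node.records[i]
        if not node.children:             # leaf reached without a match
            return None
        node = node.children[i]           # descend through P_i
    return None

left = BTreeNode([10, 20], ["r10", "r20"])
middle = BTreeNode([40, 50], ["r40", "r50"])
right = BTreeNode([70, 80], ["r70", "r80"])
root = BTreeNode([30, 60], ["r30", "r60"], [left, middle, right])
print(btree_search(root, 50), btree_search(root, 65))
```

A key stored in an interior node (30 or 60 here) is found without descending to a leaf, which is the best-case advantage a B-tree has over a B+-tree.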


Indexing: B-Trees - Insertion

1. Find the leaf node in which the key should be located

2. If the leaf node is not full, store the key and a pointer to the associated record at the appropriate position in this leaf node

3. If the leaf node is already full:

(a) Let

< P_1, K_1, R_1 >< P_2, K_2, R_2 > . . . < P_n, K_n, R_n >< P_{n+1} >

represent the ordered set of triples in the node, including the triple to be inserted

(b) Split the node into two parts:

i. Create a new node (to the "left" of the original)

ii. Shift triples up to but not including the middle triple into the new node

iii. Pack the remaining triples to the right of the middle triple into the lower half of the node

(c) Promote the middle triple into the parent node

(d) Assign the pointer to the new node to the node pointer field of the promoted triple

4. Alternative to split step above:

(a) Create a new node (to the "right" of the original)

(b) Shift triples to the right of the middle triple into the new node

(c) Replace pointer to original node in parent with pointer to new node

(d) Replace pointer in middle triple with pointer to original node


Indexing: B-Trees - Insertion (2)

5. If promotion causes the parent node to become overfull, apply this procedure recursively.

6. If the root divides

(a) Split as described above

(b) Create an additional new node

• This node becomes the new root

• Tree height increases by one

(c) Promote the middle triple into this new node

(d) Pointers in the new root point to the original and newly created nodes
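The insertion procedure can be sketched compactly in Python. The sketch follows the "alternative" split (the new node is created to the right of the original) and makes several simplifying assumptions that are not part of the slides: nodes are in-memory objects (`Node` is a hypothetical class), record pointers are omitted so each triple is reduced to its key, duplicate keys are ignored, and for an even number of keys the middle is taken as index `len(keys) // 2`.

```python
from bisect import bisect_left

class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []     # empty list for a leaf

def _insert(node, key, order):
    """Insert key below node; return (middle key, new right node) on a split."""
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return None                        # keys are unique; ignore duplicates
    if not node.children:
        node.keys.insert(i, key)           # all insertions happen at a leaf
    else:
        promoted = _insert(node.children[i], key, order)
        if promoted is None:
            return None
        mid_key, right = promoted
        node.keys.insert(i, mid_key)       # promoted key joins this node
        node.children.insert(i + 1, right)
    if len(node.keys) < order:             # at most order - 1 keys: no split
        return None
    mid = len(node.keys) // 2
    mid_key = node.keys[mid]
    right = Node(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys = node.keys[:mid]
    node.children = node.children[:mid + 1]
    return mid_key, right

def insert(root, key, order):
    """Insert key and return the (possibly new) root."""
    promoted = _insert(root, key, order)
    if promoted is None:
        return root
    mid_key, right = promoted
    return Node([mid_key], [root, right])  # root split: height grows by one

def inorder(node):
    if not node.children:
        return list(node.keys)
    out = []
    for i, k in enumerate(node.keys):
        out += inorder(node.children[i]) + [k]
    return out + inorder(node.children[-1])

def leaf_depths(node, depth=0):
    if not node.children:
        return {depth}
    return set().union(*(leaf_depths(c, depth + 1) for c in node.children))

keys = [50, 20, 80, 10, 30, 70, 90, 60, 40, 100, 25, 35]
root = Node()
for k in keys:
    root = insert(root, k, order=3)
print(inorder(root), leaf_depths(root))
```

Because splits propagate upward and only a root split adds a level, every leaf stays at the same depth, which `leaf_depths` makes easy to check.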


Indexing: B-Trees - Insertion (3)


Indexing: B-Trees - Deletion at Leaf Node

1. Remove the corresponding triple from the node

2. Shift triples as necessary

3. If the leaf node contains ⌈n/2⌉ − 2 keys (node is underfull)

(a) Case 1: Rotate

• Sibling node to immediate left or right of underfull - the donator - contains more than ⌈n/2⌉ − 1 keys

• Call the extreme < node pointer, key, record pointer > triple in donator < P_d, K_d, R_d >

– If donator node is right neighbor, < P_d, K_d, R_d > = < P_1, K_1, R_1 >

– If donator node is left neighbor, < P_d, K_d, R_d > = < P_n, K_n, R_n >

• Let < K_p, R_p > be the key-record pointer pair that separates the pointers to underfull and donator in the parent node

• Algorithm:

i. Move < K_p, R_p > into underfull

– < K_p, R_p > becomes part of the leftmost/rightmost triple in underfull

ii. Move < K_d, R_d > into the spot previously occupied by < K_p, R_p > in the parent node, shifting keys, etc., in donator as necessary


Indexing: B-Trees - Deletion at Leaf Node (2)

• Case 2: Merge/Coalesce

– Nodes to left and right (neighbors) contain exactly ⌈n/2⌉ − 1 keys

– Together, underfull and neighbor have 2⌈n/2⌉ − 3 keys, which is at most n − 2 keys

– Can collapse underfull and neighbor into one node that also includes < K_p, R_p > (the < key, record pointer > pair that separates the pointers to underfull and neighbor in the parent)

– Algorithm:

i. Arbitrarily select one of the neighbors

ii. Move < K_p, R_p > into the leftmost of neighbor and underfull, retaining any pointer in the triple being moved into

iii. Move the contents of leftmost node into rightmost

∗ Rightmost node now contains ordered set of triples from underfull, neighbor, and the one demoted from the parent node

iv. Free the emptied node

v. Pack the parent node (dropping the < P_p, K_p, R_p > triple from parent)

vi. If parent node becomes underfull (and is not the root),

∗ Apply algorithm recursively

∗ When borrowing from a left interior node, the rightmost pointer from donator (P_⌈n/2⌉) becomes the leftmost pointer in underfull

∗ When borrowing from a right interior node, the leftmost pointer from donator (P_1) becomes the rightmost pointer in underfull

vii. If root becomes empty

∗ Single child of the root becomes the new root

∗ Release old root node

– Alternative to step iii. above:

∗ If keep triples in "left" node instead of transferring to right node, set pointer P_{p+1} to left node

– Examples

i. Simple merging at leaf

ii. Merging with right internal rotation

iii. Merging with left internal rotation

• Examples:

i. Merging with height reduction

Indexing: B-Trees - Deletion at Interior Node

• Let < Pd, Kd, Rd > be triple associated with key to be deleted

• Algorithm

1. Overwrite < K_d, R_d > with values associated with the largest key in the subtree of P_d (K_m)

– P_d remains unchanged

2. Delete K_m from the leaf node using leaf deletion algorithm


Indexing: B-Trees - Deletion at Interior Node (2)

• Examples:

1. Deletion from internal node



2. Deletion from root


Indexing: B-Trees - Analysis

• Consider a B-tree of order n with k keys

• With k keys, there are k + 1 failure nodes at level l + 1

– Failure node is hypothetical node reached when searching for a key that is not in the tree

– Failure nodes lie one level below leaf nodes (which are at level l)

– They correspond to null pointers in leaf nodes

– Let the key values contained in the tree be K_1, K_2, ..., K_k

– K_i < K_{i+1} for 1 ≤ i < k

– Failure nodes exist for every X_j where K_i < X_j < K_{i+1} for 0 ≤ i ≤ k, where K_0 represents −∞ and K_{k+1} represents ∞

• Root has at least 2 children, thus at least 2 nodes at level 2

• Each node must have at least ⌈n/2⌉ children, thus at least $2\lceil n/2 \rceil$ nodes at level 3, $2\lceil n/2 \rceil^{2}$ nodes at level 4, etc.

• At level l have at least $2\lceil n/2 \rceil^{l-2}$ nodes, all of which are non-failure nodes (for tree with l > 1 levels)

• Since there are k + 1 failure nodes at level l + 1, and level l + 1 holds at least $2\lceil n/2 \rceil^{l-1}$ nodes,

– $k + 1 \ge 2\lceil n/2 \rceil^{l-1}$

– $\frac{k + 1}{2} \ge \lceil n/2 \rceil^{l-1}$

– $\log_{\lceil n/2 \rceil}\left(\frac{k + 1}{2}\right) \ge l - 1$

– $\log_{\lceil n/2 \rceil}\left(\frac{k + 1}{2}\right) + 1 \ge l$

– This bounds the number of levels containing non-failure nodes, and thus the maximum number of disk accesses
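The bound can be evaluated numerically. The sketch below (the function name `max_levels` is an assumption) finds the largest l satisfying k + 1 ≥ 2⌈n/2⌉^(l−1) with integer arithmetic, which avoids floating-point rounding in the logarithm.

```python
def max_levels(order, num_keys):
    """Largest number of levels l with 2 * ceil(order/2)**(l-1) <= num_keys + 1."""
    if order < 3 or num_keys < 1:
        raise ValueError("expects order >= 3 and at least one key")
    half = (order + 1) // 2          # ceil(order / 2) for positive integers
    levels = 1
    # Adding a level is possible while the minimum number of failure
    # nodes below it, 2 * half**levels, does not exceed num_keys + 1.
    while 2 * half ** levels <= num_keys + 1:
        levels += 1
    return levels

print(max_levels(4, 7))
print(max_levels(200, 1_000_000))
```

For order 4 and 7 keys the sparsest legal tree has one key per node on three levels (1 + 2 + 4 keys), so three disk accesses is the worst case; for order 200, a million keys still fit within three levels.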


Indexing: B+-Trees - Intro

• B+-tree is variation of B-tree that has greater blocking factor

• Record pointers appear only in the leaves

– All keys appear in the leaves

– Some keys will be duplicated in internal nodes

– Search always ends up at leaf node

• Since internal nodes do not include data pointers,

– Can have more keys per index block

– B+-tree will be shallower than equivalent B-tree

• A B+-Tree of order n is a tree such that

1. Every node has at most n children

2. Every node, except for the root and the leaves, has at least ⌈n/2⌉ children

3. If the tree contains more than a single node, then the root has at least two children

4. All leaves appear on the same level

5. A non-leaf node with j children has j − 1 keys


Indexing: B+-Trees - Intro (2)

• Node structure

– A node consists of pairs consisting of a

∗ Pointer

∗ Key

< P_i, K_i >

– An interior node is represented as

< P_1, K_1 >< P_2, K_2 > . . . < P_{n−1}, K_{n−1} >< P_n >

where

∗ K_1 < K_2 < . . . < K_{n−1}

∗ P_i points to a child node

∗ K_j in the subtree of P_i are such that K_{i−1} < K_j ≤ K_i

– Leaf nodes have the same structure as interior nodes except that

∗ P_i points to the block containing the record with key K_i, 1 ≤ i ≤ n − 1

∗ P_n points to the next leaf node.

• All keys appear in leaf nodes

• Some keys will be duplicated at interior nodes

• All insertions and deletions occur at leaf nodes

• Example:


Indexing: B+-Trees - Insertion into Leaf Nodes

1. Find the leaf node in which the key should be located

2. If the leaf node is not full, store the key and a pointer to the associated record at the appropriate position in this leaf node

3. If the leaf node is already full

• Split the node into two parts:

(a) Create a new node (to left)

(b) Shift pairs up to and including the middle pair into the new node

(c) Pack the remaining pairs into the lower half of the original node

(d) The rightmost pointer of the new node must be set to point to the original

(e) The rightmost pointer of the original node’s left sibling must be set to point to the new node

4. Insert a new pointer-key pair into the parent node, consisting of

(a) A copy of the middle key from the split node

(b) A pointer to the new node

5. If insertion into the parent causes the parent node to become overfull, proceed as described below

6. Alternative to steps 3 and 4:

(a) New node created to right of original

(b) Shift pairs to the right of the middle pair into the new node

(c) The rightmost pointer of the new node must be set to the value of the original’s rightmost pointer

(d) The rightmost pointer of the original node must be set to point to the new node

(e) Pointer to original in parent must be set to new node

(f) Copy of middle key-pointer pair promoted, with pointer set to original node


Indexing: B+-Trees - Insertion into Leaf Nodes (2)


Indexing: B+-Trees - Insertion into Interior Nodes

1. Proceed as for leaf nodes with the following differences:

(a) If an interior node must be split, only the pointer-key pairs to the left of the middle pair are shifted into the new node

(b) The original node retains only the pointer-key pairs to the right of the central pair

(c) The middle key is promoted into the parent

(d) The pointer paired with the promoted key points to the node just created

2. If the internal node to be split happens to be the root

• Create a new node

– This node becomes the new root

– Tree height increases by one

• Promote the middle key into this node

• The pointer associated with this key points to the left node created by the split

• The rightmost pointer points to the right node created by the split
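The difference between splitting a leaf (the middle key is copied up and stays in the leaf) and splitting an interior node (the middle key is moved up) can be sketched in Python. The function names, the list-based node layout, and the choice of middle position for an even number of entries are assumptions made for illustration only.

```python
def split_leaf(pairs):
    """Split an overfull leaf given as a sorted list of (key, pointer) pairs.

    Pairs up to and including the middle go to the left node; a COPY of the
    middle key is returned for insertion into the parent.
    """
    mid = (len(pairs) - 1) // 2
    left, right = pairs[:mid + 1], pairs[mid + 1:]
    return left, right, pairs[mid][0]

def split_interior(keys, children):
    """Split an overfull interior node with len(children) == len(keys) + 1.

    Keys left of the middle go to the left node, keys right of it stay in
    the right node, and the middle key itself is promoted (not duplicated).
    """
    mid = len(keys) // 2
    left = (keys[:mid], children[:mid + 1])
    right = (keys[mid + 1:], children[mid + 1:])
    return left, right, keys[mid]

leaf_left, leaf_right, copied = split_leaf(
    [(10, "a"), (20, "b"), (30, "c"), (40, "d")])
int_left, int_right, promoted = split_interior(
    [10, 20, 30], ["c0", "c1", "c2", "c3"])
print(copied, promoted)
```

After the leaf split the key 20 appears both in the left leaf and in the parent, while after the interior split the key 20 appears only in the parent.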


Indexing: B+-Trees - Insertion into Interior Nodes (2)

3. Examples:

(a) Multilevel promotion



(b) With new root


Indexing: B+-Trees - Simplistic Deletion at Leaf Nodes

• Algorithm

1. Remove the corresponding pointer-key pair from the leaf node

2. Shift pairs as necessary

• Copies of deleted keys that occur at interior nodes remain

– This will not affect the search algorithm

– The key simply won’t be found at the leaf level

• This approach trades ease of implementation for loss of search efficiency, as it allows underfull and empty nodes in the tree


Indexing: B+-Trees - Full Deletion at Leaf Node

1. Remove the pointer-key pair from the associated leaf node

2. Shift pairs as necessary

3. If the deleted key was the rightmost key, copy the new rightmost key into the pair that points to this node in the parent

4. If the leaf node contains ⌈n/2⌉ − 2 keys (node is underfull)

(a) Case 1

• Sibling node to immediate left or right of underfull - the donator - contains more than ⌈n/2⌉ − 1 keys

• Algorithm:

i. Move the rightmost/leftmost pointer-key pair from donator into the left/right end of underfull (disregarding P_n)

ii. Copy a key value into the parent (from either underfull or donator, as appropriate)

• Example:


Indexing: B+-Trees - Full Deletion at Leaf Node (2)

(b) Case 2

• Nodes to left and right (neighbors) contain exactly ⌈n/2⌉ − 1 keys

• Together, underfull and neighbor have 2⌈n/2⌉ − 3 keys, which is at most n − 2 keys

• Can collapse underfull and neighbor into one node

• Result contains at most n − 2 keys

• Algorithm:

i. Arbitrarily select one of the neighbors

ii. Move the contents of leftmost node into rightmost

iii. Free the emptied node

iv. In the parent, remove the pointer-key pair that pointed to the freed node

v. Pack the parent node

vi. if parent node becomes underfull, apply interior compaction algorithm

• Example:


Indexing: B+-Trees - Full Deletion at Leaf Node (3)


Indexing: B+-Trees - Compaction of Interior Nodes

1. Case 1

• Sibling node to immediate left or right of underfull - the donator - contains more than ⌈n/2⌉ − 1 keys

• Extreme < pointer, key > pair (< P_d, K_d >) from donator will be rotated into underfull

• If donator node is right neighbor, pair is < P_1, K_1 >

• If donator node is left neighbor, pair is < P_{n+1}, K_n >

• Let K_p be the key that separates the pointers to underfull and donator in the parent node

• Algorithm:

(a) Move K_p into underfull, shifting keys, etc., in donator as necessary

– K_p becomes part of the leftmost/rightmost pair in underfull

(b) Move P_d into underfull

– It becomes the leftmost/rightmost pointer of underfull

(c) Move K_d into the spot previously occupied by K_p in the parent node


Indexing: B+-Trees - Compaction of Interior Nodes (2)

2. Case 2

• Nodes to left and right (neighbors) contain exactly ⌈n/2⌉ − 1 keys

• Together, underfull and neighbor have 2⌈n/2⌉ − 3 keys, which is at most n − 2 keys

• Can collapse underfull and neighbor into one node

• Result includes the key (K_p) from the parent that separates the pointers to underfull and neighbor

• NOTE: This is now following the B-tree algorithm

• Algorithm:

(a) Arbitrarily select one of the neighbors

(b) Move K_p into the leftmost of neighbor and underfull

(c) Associate rightmost node pointer of node with K_p

(d) Move the contents of leftmost node into rightmost

– Rightmost node now contains ordered set of pairs from underfull, neighbor, and the one containing K_p

(e) Free the emptied node

(f) Pack the parent node

(g) if parent node becomes underfull (and is not the root),

– Apply algorithm recursively

(h) If root becomes empty

– Single child of the root becomes the new root

– Release old root node



• Example:


Indexing: B+-Trees - Comparison With B-trees

• B-tree has better best-case search, since some data pointers are in interior nodes

– B+-tree has better worst case since tree is shallower

• B+-tree wastes space in that key values duplicated in interior nodes

– B-tree wastes space because every node includes data pointers

• B-tree has nodes of unequal size (if leaf nodes do not include node pointers)

– B+-tree nodes uniform

• B-tree deletion more complex due to interior nodes

– All deletion occurs at leaves of B+-tree

• B+-tree allows easier sequential processing of records

– Requires in-order traversal in B-tree
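The sequential-processing advantage can be illustrated with a small Python sketch of a range scan. It assumes in-memory `Leaf` and `Interior` classes (hypothetical names) in which each leaf's `next` field plays the role of P_n, and it routes a key through interior nodes using the rule K_{i−1} < K_j ≤ K_i. Once the first leaf is located, the scan never returns to the interior nodes.

```python
from bisect import bisect_left

class Leaf:
    def __init__(self, keys, next_leaf=None):
        self.keys = keys          # sorted keys stored in this leaf
        self.next = next_leaf     # P_n: pointer to the next leaf

class Interior:
    def __init__(self, keys, children):
        self.keys = keys
        self.children = children  # len(children) == len(keys) + 1

def find_leaf(node, key):
    """Descend to the leaf whose key range covers key."""
    while isinstance(node, Interior):
        # Child i holds keys with K_{i-1} < key <= K_i.
        node = node.children[bisect_left(node.keys, key)]
    return node

def range_scan(root, low, high):
    """Return all keys k with low <= k <= high, in order."""
    leaf = find_leaf(root, low)
    result = []
    while leaf is not None:
        for k in leaf.keys:
            if k > high:
                return result
            if k >= low:
                result.append(k)
        leaf = leaf.next          # follow the leaf chain, no tree traversal
    return result

leaf3 = Leaf([50, 60])
leaf2 = Leaf([30, 40], leaf3)
leaf1 = Leaf([10, 20], leaf2)
root = Interior([20, 40], [leaf1, leaf2, leaf3])
print(range_scan(root, 15, 50))
```

In a B-tree the same scan would need an in-order traversal that moves up and down between levels, since keys are spread across interior nodes and leaves.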


Indexing: Hashing - Intro

• Fastest access to a record is achieved if every key is associated with a unique address

– Problem is that usually range of key values >> actual number contained in the file

– This would result in much wasted storage space

• Hashing consists of

– Address space

∗ Called the hash table

∗ Set of storage locations that contain pointers to blocks of records

∗ Each block called a bucket

– Hash (mapping) function (h(K))

∗ Produces a many:1 mapping from keys to address space

∗ Characteristics of good hash function:

· Uniform: Same number of keys in key range map to each address

· Random: A given set of keys should be evenly distributed across the address space

· Worst case scenario: all keys map to same address

• Main issue is dealing with collisions

– Collision is situation where more than one key maps to the same address


Indexing: Hashing - Static

• In static hashing, address space is fixed size

• 2 approaches:

– Closed addressing (chaining)

– Open addressing (closed hashing)

• Closed addressing

– Insertion algorithm

1. Apply h(K) to key

2. Access block returned by h(K)

3. If block not full, insert record in block

4. If block full:

(a) Allocate a new block (overflow block)

∗ May dedicate overflow block to given bucket, OR

∗ May use a communal overflow block

· If use this approach, records with same hash value should be linked together

· Reasonable approach if expect little overflow

(b) Link the full block to the overflow block

(c) Insert record into overflow block

– Search algorithm

1. Apply h(K) to key

2. Access block returned by h(K)

3. Linear search for key in block

∗ May need to follow links to overflow blocks

– Note that can accommodate an unlimited number of keys, at cost of approaching linear search


Indexing: Hashing - Static (2)

– Example:

∗ h(K) = K value mod 5

∗ Bucket can hold 2 records

Key values
name    value
Bob     14
Jane    5
Mary    10
Bill    8
Joe     6
Sue     13
Ann     15
Mike    12
Ellen   3

Hash Table (the same after every insertion)
h(K)   Ptr
0      10
1      11
2      12
3      13
4      14

Buckets after each insertion (each cell lists the records in the block followed by its link; -1 means no overflow block)

Operation      10              11       12        13             14       15       16
(initial)      -1              -1       -1        -1             -1
Insert Bob     -1              -1       -1        -1             Bob -1
Insert Jane    Jane -1         -1       -1        -1             Bob -1
Insert Mary    Jane Mary -1    -1       -1        -1             Bob -1
Insert Bill    Jane Mary -1    -1       -1        Bill -1        Bob -1
Insert Joe     Jane Mary -1    Joe -1   -1        Bill -1        Bob -1
Insert Sue     Jane Mary -1    Joe -1   -1        Bill Sue -1    Bob -1
Insert Ann     Jane Mary 15    Joe -1   -1        Bill Sue -1    Bob -1   Ann -1
Insert Mike    Jane Mary 15    Joe -1   Mike -1   Bill Sue -1    Bob -1   Ann -1
Insert Ellen   Jane Mary 15    Joe -1   Mike -1   Bill Sue 16    Bob -1   Ann -1   Ellen -1
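Closed addressing with dedicated overflow blocks can be sketched in Python as follows. The class name `ChainedHashFile`, the dict-based blocks, and the block numbering starting at 10 are assumptions chosen to mirror the example; the sketch is an illustration of the technique, not the slides' implementation.

```python
class ChainedHashFile:
    """Static hashing with chained overflow blocks (illustrative sketch)."""

    def __init__(self, table_size=5, bucket_capacity=2, first_block_id=10):
        self.table_size = table_size
        self.capacity = bucket_capacity
        # Hash table: address h(K) -> id of the primary block of that bucket.
        self.table = [first_block_id + i for i in range(table_size)]
        self.blocks = {b: {"records": [], "link": -1} for b in self.table}
        self.next_id = first_block_id + table_size

    def insert(self, name, key):
        block_id = self.table[key % self.table_size]
        while True:
            block = self.blocks[block_id]
            if len(block["records"]) < self.capacity:
                block["records"].append((name, key))
                return block_id
            if block["link"] == -1:            # full, no overflow block yet
                block["link"] = self.next_id
                self.blocks[self.next_id] = {"records": [], "link": -1}
                self.next_id += 1
            block_id = block["link"]           # continue in the overflow chain

    def search(self, key):
        block_id = self.table[key % self.table_size]
        while block_id != -1:
            block = self.blocks[block_id]
            for name, stored in block["records"]:
                if stored == key:
                    return name
            block_id = block["link"]           # follow link to overflow block
        return None

people = [("Bob", 14), ("Jane", 5), ("Mary", 10), ("Bill", 8), ("Joe", 6),
          ("Sue", 13), ("Ann", 15), ("Mike", 12), ("Ellen", 3)]
chained = ChainedHashFile()
for name, key in people:
    chained.insert(name, key)
print(chained.blocks[10], chained.blocks[15])
```

The file never fills up, but a long overflow chain turns each lookup into a linear scan of that chain.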



• Open addressing

– Insertion algorithm

1. Apply h(K) to key

2. Access block returned by h(K)

3. If block not full, insert key-pointer pair

4. If block full:

(a) Perform linear search of table, looking for first bucket that is not full

(b) Insert key-pointer pair into this bucket

– Search algorithm

1. Apply h(K) to key

2. Access block returned by h(K)

3. Linear search for key in block

∗ If not found and bucket is full:

(a) Go to next bucket of table

(b) Repeat this procedure recursively, ending with a bucket that is not full or if have traversed the entire table

4. Note that accommodates a fixed number of keys

5. Deletions particularly problematic



– Example:

∗ h(K) = K value mod 5

∗ Bucket can hold 2 records

Key values
name    value
Bob     14
Jane    5
Mary    10
Bill    8
Joe     6
Sue     13
Ann     15
Mike    12
Ellen   3

Hash Table (the same after every insertion)
h(K)   Ptr
0      10
1      11
2      12
3      13
4      14

Buckets after each insertion

Operation      10           11        12     13         14
(initial)
Insert Bob                                              Bob
Insert Jane    Jane                                     Bob
Insert Mary    Jane Mary                                Bob
Insert Bill    Jane Mary                     Bill       Bob
Insert Joe     Jane Mary    Joe              Bill       Bob
Insert Sue     Jane Mary    Joe              Bill Sue   Bob
Insert Ann     Jane Mary    Joe Ann          Bill Sue   Bob
Insert Mike    Jane Mary    Joe Ann   Mike   Bill Sue   Bob
Insert Ellen   Jane Mary    Joe Ann   Mike   Bill Sue   Bob Ellen
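Open addressing with linear probing over buckets can be sketched in Python. The class name `OpenAddressingFile` and the list-of-lists bucket layout are assumptions, as is the wrap-around from the last bucket to the first, which the slides do not state explicitly. Deletion is deliberately left out: removing a record can break a probe sequence, which is why the notes call deletions problematic.

```python
class OpenAddressingFile:
    """Static hashing with linear probing over fixed buckets (sketch)."""

    def __init__(self, table_size=5, bucket_capacity=2):
        self.size = table_size
        self.capacity = bucket_capacity
        self.buckets = [[] for _ in range(table_size)]

    def insert(self, name, key):
        home = key % self.size
        for step in range(self.size):
            slot = (home + step) % self.size     # assumed wrap-around probing
            if len(self.buckets[slot]) < self.capacity:
                self.buckets[slot].append((name, key))
                return slot
        raise OverflowError("every bucket is full")

    def search(self, key):
        home = key % self.size
        for step in range(self.size):
            bucket = self.buckets[(home + step) % self.size]
            for name, stored in bucket:
                if stored == key:
                    return name
            if len(bucket) < self.capacity:
                return None      # a bucket that is not full ends the probe
        return None              # traversed the entire table

people = [("Bob", 14), ("Jane", 5), ("Mary", 10), ("Bill", 8), ("Joe", 6),
          ("Sue", 13), ("Ann", 15), ("Mike", 12), ("Ellen", 3)]
open_file = OpenAddressingFile()
for name, key in people:
    open_file.insert(name, key)
print(open_file.buckets)
```

With 5 buckets of 2 records the file holds at most 10 records, which is the fixed capacity the notes refer to.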



• Evaluation

– Table size a problem

∗ Size must be chosen a priori

∗ If table size chosen too small, will have lots of collisions

∗ If chosen too large, have lots of wasted space

∗ Size determination especially difficult problem for dynamic situations

∗ Selection of table size most critical for open addressing

– Fast access of individual records (1 disk access per record ideally)

– Not good for sequential access of records


Hashing - Extendible

• Designed to accommodate dynamic databases

• Basic concepts

1. The hash table contains a maximum MAX number of entries

– Normally, not all table entries are in use

– Each table entry that is in use points to the bucket that contains the records that “hashed” to that entry in the table

2. h(K) generates a value for key K, but only a portion of this hash value is used to discriminate among records

– This portion is called the hash prefix

3. LargestHashPrefixLength is the number of bits in the longest hash prefix currently associated with an entry in the table

– The current table size = 2^LargestHashPrefixLength

– Note that MAX ≥ 2^LargestHashPrefixLength

4. Each table entry has a HashPrefixLength

– It indicates the number of bits in the current hash prefix for that table entry

– HashPrefixLength ≤ LargestHashPrefixLength

5. Initially,

– LargestHashPrefixLength = 0

– Only one table entry is in use

– HashPrefixLength = 0 for this entry

– All records are initially stored in the block associated with this entry

• Retrieval

1. Compute h(K)

2. H = LargestHashPrefixLength bits of h(K)

3. Search bucket H for K


Hashing - Extendible (2)

• Insertion

1. Compute h(K)


3. If bucket H not full, add record

4. Otherwise:

(a) If HashPrefixLength < LargestHashPrefixLength

– H discriminates among records using fewer than LargestHashPrefixLengthbits

∗ Therefore, several hash table entries will point to this bucket

∗ Want to split the bucket in 2 two (call them BLOCK0 and BLOCK1)and redistribute the records based on a longer prefix

– Let H∗ be the value of the rightmost HashPrefixLength bits of H

– Let Diff = LargestHashPrefixLength−HashPrefixLength

– 2Diff rows in the hash table currently point to bucket H

∗ The row addresses of these table entries are H∗ + a ∗ 2HashPrefixLength

∗ 0 ≤ a ≤ 2Diff − 1

– Table entries with addresses H∗+2b∗2HashPrefixLength, 0 ≤ b ≤ 2Diff−1−1,are redistributed to BLOCK0

– Table entries with addresses H∗ + (2b + 1) ∗ 2HashPrefixLength,0 ≤ b ≤ 2Diff−1 − 1, are redistributed to BLOCK1

– For each of these entries, HashPrefixLength = HashPrefixLength + 1

– To redistribute records between BLOCK0 and BLOCK1, use the rightmost bit that was NOT used in forming H∗

∗ If bit = 0, record → BLOCK0

∗ If bit = 1, record → BLOCK1

(b) If HashPrefixLength = LargestHashPrefixLength and LargestHashPrefixLength ≤ MAX

– LargestHashPrefixLength = LargestHashPrefixLength + 1

– Double the table size

∗ Copy the first 2^(LargestHashPrefixLength−1) rows of the hash table into the latter 2^(LargestHashPrefixLength−1) rows of the table

∗ Insert record as above
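
The insertion steps above can be sketched as a small Python class. This is an illustration under several assumptions rather than a definitive implementation: rows are selected by the rightmost bits of h(K), HashPrefixLength is stored on each bucket instead of on each table entry, `max_prefix_len` stands in for the limit expressed by MAX, and the class and method names are hypothetical.

```python
class Bucket:
    def __init__(self, prefix_len):
        self.prefix_len = prefix_len  # HashPrefixLength (kept per bucket here)
        self.records = {}             # key -> hash value h(K)


class ExtendibleTable:
    def __init__(self, capacity=2, max_prefix_len=8):
        self.capacity = capacity              # records per bucket
        self.max_prefix_len = max_prefix_len  # assumed cap on table growth
        self.largest = 0                      # LargestHashPrefixLength
        self.rows = [Bucket(0)]               # initially one entry in use

    def insert(self, key, h):
        while True:
            row = h & ((1 << self.largest) - 1)      # H: rightmost bits of h(K)
            bucket = self.rows[row]
            if len(bucket.records) < self.capacity:  # bucket H not full
                bucket.records[key] = h
                return
            if bucket.prefix_len == self.largest:    # case (b): double the table
                if self.largest >= self.max_prefix_len:
                    raise OverflowError("table cannot grow further")
                self.rows = self.rows + self.rows    # copy first half into latter half
                self.largest += 1
            self._split(bucket, row)                 # case (a): split, then retry

    def _split(self, bucket, row):
        p = bucket.prefix_len
        h_star = row & ((1 << p) - 1)                # rightmost p bits of H
        block0, block1 = Bucket(p + 1), Bucket(p + 1)
        for a in range(1 << (self.largest - p)):     # the 2^Diff rows sharing the bucket
            # even a -> BLOCK0, odd a -> BLOCK1
            self.rows[h_star + a * (1 << p)] = block1 if a % 2 else block0
        for key, h in bucket.records.items():
            # redistribute on the first bit that was not used in forming H*
            target = block1 if (h >> p) & 1 else block0
            target.records[key] = h


names = [("Bob", 0b1110), ("Jane", 0b0101), ("Mary", 0b1010),
         ("Bill", 0b1000), ("Joe", 0b0110), ("Sue", 0b1101),
         ("Ann", 0b1111), ("Mike", 0b1100), ("Ellen", 0b0011)]
table = ExtendibleTable(capacity=2)
for name, h in names:
    table.insert(name, h)
```

Under these assumptions, inserting the nine names used in the example with a bucket capacity of 2 ends with LargestHashPrefixLength = 3 and the buckets {Bill, Mike}, {Jane, Sue}, {Mary}, {Ann, Ellen} and {Bob, Joe}, which is the table the deletion example starts from. The insertion example itself appears to select rows by the leftmost bits, so its intermediate tables differ from what this sketch produces.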

64


5. Example:

– Bucket can hold 2 records

– LPL stands for largest prefix length

– PL stands for prefix length

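hash values

Bob    1110
Jane   0101
Mary   1010
Bill   1000
Joe    0110
Sue    1101
Ann    1111
Mike   1100
Ellen  0011

Hash Table (LPL, R, PL, B) | Blocks (block number, records)

LPL | R   | PL | B | Block | Records   | Action
0   | 0   | 0  | 0 | 0     |           | insert Bob

0   | 0   | 0  | 0 | 0     | Bob       | insert Jane

0   | 0   | 0  | 0 | 0     | Bob Jane  | insert Mary

1   | 0   | 0  | 0 | 0     | Bob Jane  |
    | 1   | 0  | 0 |       |           |

1   | 0   | 1  | 0 | 0     | Jane      |
    | 1   | 1  | 1 | 1     | Bob       |

1   | 0   | 1  | 0 | 0     | Jane      |
    | 1   | 1  | 1 | 1     | Bob Mary  |

2   | 00  | 1  | 0 | 0     | Jane      | insert Bill
    | 01  | 1  | 0 | 1     | Bob Mary  |
    | 10  | 1  | 1 |       |           |
    | 11  | 1  | 1 |       |           |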

65


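LPL | R   | PL | B | Block | Records   | Action
2   | 00  | 1  | 0 | 0     | Jane      | insert Joe, Sue
    | 01  | 1  | 0 | 1     | Mary Bill |
    | 10  | 2  | 1 | 2     | Bob       |
    | 11  | 2  | 2 |       |           |

2   | 00  | 1  | 0 | 0     | Jane Joe  | insert Ann
    | 01  | 1  | 0 | 1     | Mary Bill |
    | 10  | 2  | 1 | 2     | Bob Sue   |
    | 11  | 2  | 2 |       |           |

3   | 000 | 1  | 0 | 0     | Jane Joe  |
    | 001 | 1  | 0 | 1     | Mary Bill |
    | 010 | 1  | 0 | 2     | Bob Sue   |
    | 011 | 1  | 0 |       |           |
    | 100 | 2  | 1 |       |           |
    | 101 | 2  | 1 |       |           |
    | 110 | 2  | 2 |       |           |
    | 111 | 2  | 2 |       |           |

3   | 000 | 1  | 0 | 0     | Jane Joe  | insert Mike
    | 001 | 1  | 0 | 1     | Mary Bill |
    | 010 | 1  | 0 | 2     | Sue       |
    | 011 | 1  | 0 | 3     | Bob Ann   |
    | 100 | 2  | 1 |       |           |
    | 101 | 2  | 1 |       |           |
    | 110 | 3  | 2 |       |           |
    | 111 | 3  | 3 |       |           |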

66


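LPL | R   | PL | B | Block | Records   | Action
3   | 000 | 1  | 0 | 0     | Jane Joe  | insert Ellen
    | 001 | 1  | 0 | 1     | Mary Bill |
    | 010 | 1  | 0 | 2     | Sue Mike  |
    | 011 | 1  | 0 | 3     | Bob Ann   |
    | 100 | 2  | 1 |       |           |
    | 101 | 2  | 1 |       |           |
    | 110 | 3  | 2 |       |           |
    | 111 | 3  | 3 |       |           |

3   | 000 | 2  | 0 | 0     | Ellen     |
    | 001 | 2  | 0 | 1     | Mary Bill |
    | 010 | 2  | 4 | 2     | Sue Mike  |
    | 011 | 2  | 4 | 3     | Bob Ann   |
    | 100 | 2  | 1 | 4     | Jane Joe  |
    | 101 | 2  | 1 |       |           |
    | 110 | 3  | 2 |       |           |
    | 111 | 3  | 3 |       |           |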

67


• Deletion

1. Compute h(K)

2. H = LargestHashPrefixLength bits of h(K)


3. Delete record from bucket H

4. Let H∗ be the value of the rightmost HashPrefixLength bits of h(K)

5. The buddy row for row H∗ is computed by

(a) H∗ + 2^(HashPrefixLength−1) when the leftmost of the HashPrefixLength bits of H∗ = 0

(b) H∗ − 2^(HashPrefixLength−1) when the leftmost of the HashPrefixLength bits of H∗ = 1

(c) Note: The leftmost bit of H∗ will be 0 if H∗ < 2^(HashPrefixLength−1)

6. Merge row H∗ with its buddy row if

(a) The buddy row has the same value of HashPrefixLength as row H∗

– This indicates that the buddy row is not currently split even further

(b) The total number of records in bucket H∗ and its buddy can fit in a single bucket

7. Determine whether other table entries should point to the bucket resulting from the merger

– These entries are those whose row address uses the same rightmost HashPrefixLength bits as H∗

– Let Diff = LargestHashPrefixLength − HashPrefixLength

– There are 2^(Diff+1) such table entries

– Let H′ be the value of the rightmost HashPrefixLength − 1 bits of H

– The addresses of these entries are H′ + a ∗ 2^(HashPrefixLength−1), where 0 ≤ a ≤ 2^(Diff+1) − 1

– Set each of these entries to point to the result of the above merge

– Decrement the value of HashPrefixLength for each of these entries

8. Since H∗’s buddy depends on the value of the HashPrefixLength for row H∗,

– Row H∗ now has a new buddy

– Repeat the above to determine whether H∗ can be merged with its new buddy

9. When merging is completed, set LargestHashPrefixLength to the maximum HashPrefixLength in the table

– Reduce the table size if warranted by this new value
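
The buddy-row rule (step 5) and the list of table entries touched by a merge (step 7) are plain bit arithmetic, sketched below in Python. The function names are hypothetical, and the sketch assumes rows are addressed by the rightmost bits of h(K), as in the steps above.

```python
def buddy_row(h_star, prefix_len):
    """Step 5: flip the leftmost of the `prefix_len` bits of H*."""
    if prefix_len < 1:
        raise ValueError("a row with HashPrefixLength 0 has no buddy")
    half = 1 << (prefix_len - 1)  # 2^(HashPrefixLength - 1)
    # Leftmost bit is 0 exactly when H* < half (note (c) in step 5).
    return h_star + half if h_star < half else h_star - half


def rows_for_merged_bucket(h, prefix_len, largest_prefix_len):
    """Step 7: addresses H' + a * 2^(HashPrefixLength - 1), 0 <= a <= 2^(Diff+1) - 1."""
    diff = largest_prefix_len - prefix_len
    step = 1 << (prefix_len - 1)
    h_prime = h & (step - 1)      # rightmost HashPrefixLength - 1 bits of H
    return [h_prime + a * step for a in range(1 << (diff + 1))]


# Deleting a key with h(K) = 0011 from a row whose HashPrefixLength is 2,
# in a table with LargestHashPrefixLength = 3.
h = 0b0011
h_star = h & 0b11                             # rightmost 2 bits
buddy = buddy_row(h_star, 2)                  # row 01
merged_rows = rows_for_merged_bucket(h, 2, 3) # rows 001, 011, 101, 111
```

The merge itself still depends on the two conditions in step 6 (equal HashPrefixLength values and a combined record count that fits in one bucket); the functions above only compute which rows are involved.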

68


10. Example:

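Hash Table (LPL, R, PL, B) | Blocks (block number, records)

LPL | R   | PL | B | Block | Records   | Action
3   | 000 | 2  | 0 | 0     | Bill Mike | delete Jane
    | 001 | 2  | 1 | 1     | Jane Sue  |
    | 010 | 3  | 2 | 2     | Mary      |
    | 011 | 2  | 4 | 3     | Bob Joe   |
    | 100 | 2  | 0 | 4     | Ann Ellen |
    | 101 | 2  | 1 |       |           |
    | 110 | 3  | 3 |       |           |
    | 111 | 2  | 4 |       |           |

3   | 000 | 2  | 0 | 0     | Bill Mike | delete Mike
    | 001 | 2  | 1 | 1     | Sue       |
    | 010 | 3  | 2 | 2     | Mary      |
    | 011 | 2  | 4 | 3     | Bob Joe   |
    | 100 | 2  | 0 | 4     | Ann Ellen |
    | 101 | 2  | 1 |       |           |
    | 110 | 3  | 3 |       |           |
    | 111 | 2  | 4 |       |           |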

69


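LPL | R   | PL | B | Block | Records   | Action
3   | 000 | 2  | 0 | 0     | Bill      | delete Ellen
    | 001 | 2  | 1 | 1     | Sue       |
    | 010 | 3  | 2 | 2     | Mary      |
    | 011 | 2  | 4 | 3     | Bob Joe   |
    | 100 | 2  | 0 | 4     | Ann Ellen |
    | 101 | 2  | 1 |       |           |
    | 110 | 3  | 3 |       |           |
    | 111 | 2  | 4 |       |           |

3   | 000 | 2  | 0 | 0     | Bill      |
    | 001 | 1  | 1 | 1     | Sue       |
    | 010 | 3  | 2 | 2     | Mary      |
    | 011 | 1  | 4 | 3     | Bob Joe   |
    | 100 | 2  | 0 | 4     | Ann       |
    | 101 | 1  | 1 |       |           |
    | 110 | 3  | 3 |       |           |
    | 111 | 1  | 4 |       |           |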

70


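LPL | R   | PL | B | Block | Records   | Action
3   | 000 | 2  | 0 | 0     | Bill      | delete Mary
    | 001 | 1  | 1 | 1     | Sue Ann   |
    | 010 | 3  | 2 | 2     | Mary      |
    | 011 | 1  | 1 | 3     | Bob Joe   |
    | 100 | 2  | 0 |       |           |
    | 101 | 1  | 1 |       |           |
    | 110 | 3  | 3 |       |           |
    | 111 | 1  | 1 |       |           |

3   | 000 | 2  | 0 | 0     | Bill      |
    | 001 | 1  | 1 | 1     | Sue Ann   |
    | 010 | 2  | 2 | 2     |           |
    | 011 | 1  | 1 | 3     | Bob Joe   |
    | 100 | 2  | 0 |       |           |
    | 101 | 1  | 1 |       |           |
    | 110 | 2  | 3 |       |           |
    | 111 | 1  | 1 |       |           |

2   | 00  | 2  | 0 | 0     | Bill      |
    | 01  | 1  | 1 | 1     | Sue Ann   |
    | 10  | 2  | 2 | 2     | Bob Joe   |
    | 11  | 1  | 1 |       |           |
    | 00  | 2  | 0 |       |           |
    | 01  | 1  | 1 |       |           |
    | 10  | 2  | 2 |       |           |
    | 11  | 1  | 1 |       |           |

2   | 00  | 2  | 0 | 0     | Bill      |
    | 01  | 1  | 1 | 1     | Sue Ann   |
    | 10  | 2  | 2 | 2     | Bob Joe   |
    | 11  | 1  | 1 |       |           |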

71

Hashing - Linear

• Linear hashing involves the following:

1. A hash table whose size is dynamic

– Collisions trigger table growth via splitting buckets into 2

– M : the base size of the table

∗ This will always be a power of 2

– N : An integer that indicates

∗ The next bucket to be split

∗ The number of buckets that have already been split in the base table of size M

· N < M

– At any time, the number of buckets = M + N

– For a given base table of size M

∗ N is initialized to 0

2. A hash function h(K)

– h(K) = K mod M

– If K mod M < N, rehash using K mod 2 ∗ M

∗ This signifies that the bucket h(K) had been split at some point in the past

∗ Its contents were redistributed using the modified hash function (see below)
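
The two-stage hash function can be written as a short Python function. This is a sketch: the function and parameter names are chosen for illustration, and keys are assumed to be non-negative integers.

```python
def bucket_address(key, m, n):
    """Return the bucket for `key`, given base size M and split pointer N."""
    b = key % m
    if b < n:
        # Bucket b has already been split, so use the doubled modulus.
        b = key % (2 * m)
    return b


# With M = 2 and N = 1, bucket 0 has been split into buckets 0 and 2.
addr_joe = bucket_address(6, 2, 1)   # 6 mod 2 = 0 < N, so 6 mod 4 = 2
addr_jane = bucket_address(5, 2, 1)  # 5 mod 2 = 1, not yet split
```

Because N < M, the rehashed value K mod 2M is either the original bucket or that bucket plus M, so the result always lies in the range 0 to M + N − 1.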

• Insertion

– Collision is interpreted as any time a record is inserted into a full bucket

– Bucket consists of a block of records

– Algorithm:

1. Let B = h(K)

2. If bucket B not full

(a) Insert record into bucket B

3. If full,

(a) Allocate an overflow bucket

(b) Link overflow bucket to bucket B

(c) Insert record into overflow bucket

(d) Redistribute records in bucket N into bucket N and bucket M + N

i. Apply K mod 2M to every record in bucket N

ii. N ← N + 1

72

iii. If N = M

A. N ← 0

B. M = 2 ∗ M

• Search

– Apply h(K) as described above

– Linearly search bucket for record

hash values

Bob    14
Jane   5
Mary   10
Bill   8
Joe    6
Sue    13
Ann    15
Mike   12
Ellen  3

Table (M, N, HV) | Blocks

M | N | HV | Blocks
1 | 0 | 0  | →

Insert Bob

1 | 0 | 0  | → Bob

Insert Jane

1 | 0 | 0  | → Bob Jane

Insert Mary

1 | 0 | 0  | → Bob Jane → Mary

1 | 1 | 0  | → Bob Mary →
  |   | 1  | → Jane

2 | 0 | 0  | → Bob Mary →
  |   | 1  | → Jane

Insert Bill

73

2 | 0 | 0  | → Bob Mary → Bill
  |   | 1  | → Jane

2 | 1 | 0  | → Bill →
  |   | 1  | → Jane
  |   | 2  | → Bob Mary

Insert Joe

2 | 1 | 0  | → Bill →
  |   | 1  | → Jane
  |   | 2  | → Bob Mary → Joe

2 | 2 | 0  | → Bill →
  |   | 1  | → Jane
  |   | 2  | → Bob Mary → Joe
  |   | 3  | →

4 | 0 | 0  | → Bill →
  |   | 1  | → Jane
  |   | 2  | → Bob Mary → Joe
  |   | 3  | →

Insert Sue

4 | 0 | 0  | → Bill →
  |   | 1  | → Jane Sue
  |   | 2  | → Bob Mary → Joe
  |   | 3  | →

Insert Ann

4 | 0 | 0  | → Bill →
  |   | 1  | → Jane Sue
  |   | 2  | → Bob Mary → Joe
  |   | 3  | → Ann

Insert Mike

4 | 0 | 0  | → Bill Mike →
  |   | 1  | → Jane Sue
  |   | 2  | → Bob Mary → Joe
  |   | 3  | → Ann

74

Insert Ellen

4 | 0 | 0  | → Bill Mike →
  |   | 1  | → Jane Sue
  |   | 2  | → Bob Mary → Joe
  |   | 3  | → Ann Ellen
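
The example above can be replayed with a short Python sketch of the insertion and search algorithms. It rests on a few assumptions that are not in the notes: a block holds 2 records, each bucket and its overflow chain are kept together in one Python list, an insert counts as a collision whenever the target list already holds at least one full block, and the class and method names are hypothetical.

```python
class LinearHashTable:
    def __init__(self, capacity=2):
        self.capacity = capacity  # records per block (assumed 2, as in the example)
        self.m = 1                # M: base size of the table
        self.n = 0                # N: next bucket to be split
        self.buckets = [[]]       # each list is a bucket plus its overflow chain

    def address(self, key):
        b = key % self.m
        return key % (2 * self.m) if b < self.n else b

    def insert(self, name, key):
        bucket = self.buckets[self.address(key)]
        collision = len(bucket) >= self.capacity  # inserting into a full bucket
        bucket.append((name, key))                # overflow records join the chain
        if collision:
            self._split_next()

    def _split_next(self):
        old = self.buckets[self.n]
        modulus = 2 * self.m
        self.buckets.append([])                   # new bucket M + N
        self.buckets[self.n] = [r for r in old if r[1] % modulus == self.n]
        self.buckets[self.m + self.n] = [r for r in old if r[1] % modulus != self.n]
        self.n += 1
        if self.n == self.m:                      # every base bucket has been split
            self.n = 0
            self.m *= 2

    def search(self, name, key):
        # Linear search of the bucket (and its overflow) chosen by h(K).
        return any(n == name for n, _ in self.buckets[self.address(key)])


table = LinearHashTable()
for name, key in [("Bob", 14), ("Jane", 5), ("Mary", 10), ("Bill", 8), ("Joe", 6),
                  ("Sue", 13), ("Ann", 15), ("Mike", 12), ("Ellen", 3)]:
    table.insert(name, key)
contents = [[name for name, _ in bucket] for bucket in table.buckets]
```

With these assumptions the run ends with M = 4, N = 0 and the four buckets shown in the last table. The sketch does not model the empty overflow block that the tables keep attached to bucket 0.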

75
