Indexing
Page 1: Indexing

Indexing Structures

Professor Navneet Goyal
Department of Computer Science & Information Systems
BITS, Pilani

Page 2: Indexing

© Prof. Navneet Goyal, BITS, Pilani

Topics
• Basic Concepts
• Classification of Indices
• Tree-based Indexing
• Hash-based Indexing
• Comparison

Page 3: Indexing


Basic Concepts
• Indexing mechanisms are used to speed up access to desired data.
  E.g., the index at the end of a book; the author catalog in a library
• Search key: the attribute(s) used to look up records in a file
• A single file may have multiple indexes
• An index file consists of records (called index entries) of the form
  <search-key, pointer>

Page 4: Indexing


Basic Concepts
• Index files are typically much smaller than the original file
• Kinds of indices:
  • Ordered indices: search keys are stored in sorted order (single-level)
  • Tree indices: search keys are arranged in a tree (multi-level)
  • Hash indices: search keys are distributed uniformly across “buckets” using a “hash function”

Page 5: Indexing


Classification
• Single-level vs. Multi-level
• Dense vs. Sparse
• Static vs. Dynamic

Page 6: Indexing


Choosing an Index
• No single indexing structure is suitable for all database applications
• An index can be chosen based on the following factors:
  • Access types supported efficiently, e.g.,
    • records with a specified value in an attribute, or
    • records with an attribute value falling in a specified range of values
  • Access time
  • Insertion time
  • Deletion time
  • Space overhead

Page 7: Indexing


Primary Index
• An example of an ordered index: index entries are stored sorted on the search-key value (e.g., topics in a book index)
• Requires the relation to be sorted on the search key
• The search key should be a KEY of the relation
• If it is not, the index is called a clustering index

Page 8: Indexing


Primary Index

[Figure: a data file of 5 blocks, two records per block (10-20, 30-40, 50-60, 70-80, 90-100), with a sparse primary index holding one entry per block anchor: 10, 30, 50, 70, 90]

Page 9: Indexing


Primary Index
• A primary index requires that the ordering field of the data file have a distinct value for each record
• A primary index is sparse: it contains as many entries as there are blocks in the data file (there are 5 blocks in this example, and each block can hold only 2 records)
• The first record in each block of the data file is called the anchor record of the block, or simply the block anchor
• There can be only one primary index on a table
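The lookup a sparse primary index supports can be sketched in Python (an illustration using the 5-block file of this example; the helper names are ours, not from the slides):

```python
import bisect

# The example data file: 5 blocks, 2 records per block, ordered on the key.
blocks = [[10, 20], [30, 40], [50, 60], [70, 80], [90, 100]]
anchors = [10, 30, 50, 70, 90]   # sparse index: one block anchor per block

def primary_index_lookup(key):
    """Binary-search the sparse index for the last anchor <= key,
    then scan only that one data block."""
    i = bisect.bisect_right(anchors, key) - 1
    if i < 0:
        return None                      # key below the first anchor
    return key if key in blocks[i] else None

print(primary_index_lookup(60))   # 60 (found in the third block)
print(primary_index_lookup(65))   # None
```

Only one index probe and one data-block scan are needed per lookup, which is the point of block anchoring.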

Page 10: Indexing


Clustering Index

[Figure (Option 1): a data file ordered on a non-key field, with one index entry per distinct field value (1-5), each pointing to the first block containing records with that value]

Page 11: Indexing


Clustering Index

Figure taken from Elmasri, 4e

Page 12: Indexing


Clustering Index

Figure taken from Elmasri, 4e

Page 13: Indexing


Clustering Index
• The data file is sorted on a non-key field
• Retrieves the cluster of records for a given search key
• A clustering index is always sparse

Page 14: Indexing


Secondary Index (key)

[Figure: a dense secondary index with sorted entries 1-8, each pointing to a record in a data file that is not ordered on the indexing field (file order: 4, 3, 8, 6, 5, 2, 7, 1)]

Page 15: Indexing


Secondary Index (key)

Figure taken from Elmasri, 4e

Page 16: Indexing


Secondary Index (Non-key)
• Option 1 is to include several index entries with the same index field value, one for each record. This would be a dense index.
• Option 2 is to have variable-length records for the index entries, with a repeating field for the pointer: one pointer to each block that contains a record with a matching indexing field value. This would be a non-dense index.

Page 17: Indexing


Secondary Index (Non-key)

[Figure: an EMPLOYEE file (Emp#, SSN, Name, Dept#, DOB, Salary) not ordered on Dept#. Option 1: a dense index on Dept# with one entry per record. Option 2: one entry per distinct Dept# value with a repeating block-pointer field:
  1 -> B1(1)
  2 -> B2(1)
  3 -> B3(1), B3(2), B3(3), B3(4)
  4 -> B4(1)
  5 -> B5(1)]

Page 18: Indexing


Secondary Index (Non-key)
• Option 3 is the most commonly used: keep record pointers in separate blocks, reached through one level of indirection, so that index entries are of fixed length and have unique field values

Figure taken from Elmasri, 4e

Page 19: Indexing


Types of Single-level Indexes

                 Ordering Field      Nonordering Field
Key Field        Primary Index       Secondary Index (key)
Nonkey Field     Clustering Index    Secondary Index (nonkey)

Page 20: Indexing


Example 1: Primary Index

EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ...)
Suppose that:
  record size R = 100 bytes
  block size B = 1024 bytes
  r = 30000 records
Then we get:
  blocking factor Bfr = B div R = 1024 div 100 = 10 records/block
  number of file blocks b = (r / Bfr) = (30000 / 10) = 3000 blocks
For a primary index on the SSN field, assume the field size V_SSN = 9 bytes and the block pointer size P = 6 bytes. Then:
  index entry size Ri = (V_SSN + P) = (9 + 6) = 15 bytes
  index blocking factor Bfri = B div Ri = 1024 div 15 = 68 entries/block
  number of index blocks bi = (b / Bfri) = (3000 / 68) = 45 blocks
  binary search of the index needs ceil(log2 bi) = ceil(log2 45) = 6 block accesses (+ 1 for the data block)
This compares to a binary search of the data file costing ceil(log2 b) = ceil(log2 3000) = 12 block accesses.
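The arithmetic above can be checked with a short Python sketch (the sizes are the slide's assumptions):

```python
from math import ceil, log2

B, R, r = 1024, 100, 30000        # block size, record size, record count
bfr = B // R                      # blocking factor: 10 records/block
b = ceil(r / bfr)                 # 3000 data blocks

V_ssn, P = 9, 6                   # field size, block pointer size
Ri = V_ssn + P                    # 15-byte index entry
bfri = B // Ri                    # 68 entries/block
bi = ceil(b / bfri)               # sparse index: one entry per data block

index_cost = ceil(log2(bi)) + 1   # binary search of index + 1 data block
file_cost = ceil(log2(b))         # binary search of the data file itself
print(bfr, b, bfri, bi, index_cost, file_cost)   # 10 3000 68 45 7 12
```

The total cost through the index (7 accesses including the data block) beats binary search on the file (12 accesses).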

Page 21: Indexing


Example 2: Secondary Index on a Non-key Field

EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ...)
Suppose that:
  record size R = 100 bytes
  block size B = 1024 bytes
  r = 30000 records
Then we get:
  blocking factor Bfr = B div R = 1024 div 100 = 10 records/block
  number of file blocks b = (r / Bfr) = (30000 / 10) = 3000 blocks
For a dense secondary index on the JOB field, assume the field size V_JOB = 9 bytes and the block pointer size P = 6 bytes. Then:
  index entry size Ri = (V_JOB + P) = (9 + 6) = 15 bytes
  index blocking factor Bfri = B div Ri = 1024 div 15 = 68 entries/block
  number of index blocks bi = (r / Bfri) = (30000 / 68) = 442 blocks
  binary search of the index needs ceil(log2 bi) = ceil(log2 442) = 9 block accesses (+ 1 for the data block)
This compares to a linear search cost of b/2 = 3000/2 = 1500 block accesses.

Page 22: Indexing


Properties of Single-level Indexes

Type of Index        Number of Index Entries                 Dense or Sparse   Block Anchoring
Primary              No. of blocks in data file              Sparse            Yes
Clustering           No. of distinct index field values      Sparse            Yes/No*
Secondary (key)      No. of records in data file             Dense             No
Secondary (nonkey)   No. of records** /                      Dense /           No
                     No. of distinct index field values***   Sparse

* Yes if every distinct value of the ordering field starts from a new block; no otherwise
** For Option 1
*** For Options 2 & 3

Page 23: Indexing


Multilevel Indexes
• In all single-level indexes, the index file is always sorted on the search key
• For an index with bi blocks, a binary search requires approximately ceil(log2 bi) block accesses
• The idea behind multilevel indexes is to reduce the part of the index file that we continue to search by a factor of bfri (the index blocking factor)

Page 24: Indexing


Multilevel Indexes
• Blocking factor = block size in bytes / record size in bytes
• bfri, the blocking factor for the index, is always greater than 2
• The search space is therefore reduced much faster
• bfri is called the fan-out (fo) of the multilevel index
• Searching a multilevel index requires approximately ceil(logfo bi) block accesses, which is smaller than the cost of binary search if fo > 2

Page 25: Indexing


Multilevel Indexes
• A multilevel index (MLI) treats the index file (the first, or base, level of the MLI) as an ordered file with distinct values
• We can therefore create a primary index for the first level
• This index to the first level is called the second level of the MLI
• Since the second level is a primary index, block anchors can be used
• The second level has one entry for each block of the first level

Page 26: Indexing


Multilevel Indexes
• The blocking factor for the second level, and for all subsequent levels, is the same as that of the first-level index
• If the first level has r1 entries and the blocking factor is bfri = fo, then the first level needs ceil(r1/fo) blocks, so r2 = ceil(r1/fo)
• The same process can be repeated for the second level, and we get r3 = ceil(r2/fo)

Page 27: Indexing


Multilevel Indexes
• Note that we require the second level only if the first level needs more than one block of disk space
• Similarly, we require the third level only if the second level needs more than one block of disk space
• The process is repeated until all the entries of some index level t fit in a single block

Page 28: Indexing


Multilevel Indexes
• The top level t satisfies r1/(fo)^t <= 1
• An MLI with r1 first-level entries therefore has approximately t = ceil(logfo r1) levels
• An MLI can be used for any type of index (primary, clustering, or secondary) as long as the first-level index has distinct search-key values and fixed-length entries

Page 29: Indexing


Multilevel Indexes

Figure taken from Elmasri, 4e

Page 30: Indexing


Example: MLI

EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ...)
Suppose that:
  record size R = 100 bytes
  block size B = 1024 bytes
  r = 30000 records
The dense secondary index of Example 2 is converted into an MLI:
  index blocking factor (fan-out) Bfri = B div Ri = 1024 div 15 = 68 entries/block
  number of 1st-level index blocks bi = (r / Bfri) = (30000 / 68) = 442
  number of 2nd-level index blocks = (442 / 68) = 7
  number of 3rd-level index blocks = (7 / 68) = 1, so t = 3
  number of block accesses = t + 1 = 3 + 1 = 4
This compares to 10 block accesses using the dense secondary index alone.
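The level-by-level computation generalizes; a minimal Python sketch (using this example's numbers) builds levels until one block suffices:

```python
from math import ceil

fo = 68          # fan-out = index blocking factor
entries = 30000  # dense 1st level: one entry per data record

levels = []      # number of blocks at each index level
while True:
    blocks = ceil(entries / fo)
    levels.append(blocks)
    if blocks == 1:
        break
    entries = blocks          # the next level indexes these blocks

t = len(levels)
print(levels, t, t + 1)       # [442, 7, 1] 3 4
```

One access per level plus one for the data block gives t + 1 = 4 accesses, versus 10 for binary search of the dense index.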

Page 31: Indexing


Multilevel Indexes
• Such a multilevel index is a form of search tree; however, insertion and deletion of index entries is a severe problem, because every level of the index is an ordered file.

Page 32: Indexing


Multiple-key Access
• So far we have implicitly assumed that an index is created on only one attribute
• Many retrieval & update requests involve multiple attributes
• Option 1: use multiple single-attribute indexes on a relation to answer such queries
• Option 2: use a composite search key

Page 33: Indexing


Multiple-key Access
• Example: list all employees with DNO = 4 and AGE = 59
  • Assume DNO has an index, but AGE does not
  • Assume AGE has an index, but DNO does not
  • If both DNO and AGE have indexes, each index yields a set of records or a set of pointers (to blocks or records). The intersection of these sets gives the records that satisfy both conditions, or the blocks in which such records are located.
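The pointer-intersection idea can be sketched with sets of record ids (the ids and index contents below are hypothetical, for illustration only):

```python
# Hypothetical secondary indexes mapping attribute value -> set of record ids
dno_index = {4: {2, 5, 7, 9}, 6: {1, 3}}
age_index = {59: {5, 9, 11}, 40: {2, 4}}

# Records with DNO = 4 AND AGE = 59: intersect the two pointer sets
result = dno_index[4] & age_index[59]
print(sorted(result))   # [5, 9]
```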

Page 34: Indexing


Multiple-key Access
• All of the above alternatives give the correct result
• If the sets of records that satisfy each condition (DNO = 4 or AGE = 59) individually are large, yet only a few records satisfy the combined condition, then none of the above techniques is efficient
• In that case, try a composite search key: <DNO, AGE> or <AGE, DNO>

Page 35: Indexing


Index Update
• Insert
• Delete
• Update (first delete, then insert)
• Compare single-level & multilevel indexes: DO IT YOURSELF!!!

Page 36: Indexing


Indexed Sequential File
• A common file organization used in data processing
• An ordered file with a multilevel primary index on its ordering key field is called an indexed sequential file
• Used in a large number of early IBM systems
• Insertions are handled by some form of overflow file that is merged periodically with the data file
• The index is recreated during file reorganization

Page 37: Indexing


IBM’s ISAM
• Indexed Sequential Access Method
• A two-level index
• Closely related to the organization of the disk

Page 38: Indexing


Tree-based Indexing
• ISAM and B-trees/B+-trees
• Based on tree data structures
• Provide:
  • Efficient support for range queries
  • Efficient support for insertion & deletion
  • Support for equality queries (not as efficient as hash-based indexes)
• ISAM is static, whereas B-trees and B+-trees are dynamic: they adjust gracefully under inserts and deletes

Page 39: Indexing


Search Tree
• A search tree is a special type of tree used to guide the search for a record, given its search key
• An MLI is a variation of the search tree
• A node in a search tree holds search-key values and pointers to the subtrees below it

Page 40: Indexing


Search Tree

A search tree of order p = 3

Page 41: Indexing


Search Tree
• Each key value in the tree is associated with a pointer to the record in the data file having that value
• The pointer could instead be to the disk block containing the record
• The search tree itself can be stored on disk by assigning each tree node to a disk block

Page 42: Indexing


Search Tree
Constraints (for a node with q pointers P1, ..., Pq and q-1 keys K1, ..., K(q-1)):
• The search keys within a node are ordered, increasing from left to right
• For all values X in the subtree pointed to by Pi, we have:
  K(i-1) < X < Ki   for 1 < i < q
  X < Ki            for i = 1
  K(i-1) < X        for i = q

Page 43: Indexing


Search Tree
• Algorithms for inserts and deletes do not guarantee that a search tree stays balanced
• Keeping a search tree balanced HELPS!!
• A balanced search tree yields a uniform search speed regardless of the value of the search key
• Deletions may leave nearly empty nodes, wasting space and increasing the number of levels

Page 44: Indexing


B-Tree
• A B-tree has additional constraints that ensure the tree is always balanced and that the space wasted by deletion is never excessive
• Algorithms for inserts and deletes are more complex, in order to maintain these additional constraints
• They are mostly simple, and become complicated only when inserts and deletes lead to splitting and merging of nodes, respectively

Page 45: Indexing


B-Tree
• One or two levels of index are often very helpful in speeding up queries
• A more general structure is used in commercial systems
• This family of data structures is called B-trees, and the particular variant that is most often used is known as the B+-tree

Page 46: Indexing


B-Tree: Characteristics
• Automatically maintains as many levels of index as is appropriate for the size of the file being indexed
• Manages the space on the blocks it uses so that every block is between half full and completely full
• Each node corresponds to a disk block

Page 47: Indexing


Structure of B-Trees
• A balanced tree: all paths from the root to a leaf have the same length
• Three layers in a B-tree:
  • Root
  • Intermediate layer
  • Leaves
• A parameter n is associated with each B-tree
• Each node holds up to n search keys and n+1 pointers
• Pick n to be as large as will allow n+1 pointers and n keys to fit in one block

Page 48: Indexing


Example
• Block size = 4096 bytes
• Search key: 4-byte integer
• Pointer: 8 bytes
• Assume no header information is kept in the block
• We choose the largest n such that 4n + 8(n+1) <= 4096, giving n = 340
• A block can hold 340 keys and 341 pointers
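The choice of n can be confirmed in a couple of lines (a sketch using the slide's sizes):

```python
# Largest n with n keys (4 bytes each) and n+1 pointers (8 bytes each)
# fitting in one 4096-byte block: 4n + 8(n+1) <= 4096.
BLOCK, KEY, PTR = 4096, 4, 8

n = (BLOCK - PTR) // (KEY + PTR)               # solve 12n + 8 <= 4096
assert KEY * n + PTR * (n + 1) <= BLOCK        # fits
assert KEY * (n + 1) + PTR * (n + 2) > BLOCK   # and n is maximal
print(n)   # 340
```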

Page 49: Indexing


B-Trees & B+-Trees
• An insertion into a node that is not full is quite efficient; if the node is full, the insertion causes a split into two nodes
• Splitting may propagate to other tree levels
• A deletion is quite efficient if the node does not become less than half full
• If a deletion causes a node to become less than half full, it must be merged with neighboring nodes

Page 50: Indexing


Difference between B-tree and B+-tree
• In a B-tree, pointers to data records exist at all levels of the tree
• In a B+-tree, all pointers to data records exist at the leaf-level nodes
• A B+-tree can therefore have fewer levels (or a higher capacity of search values) than the corresponding B-tree

Page 51: Indexing


Rules for B-Trees
• At the root, there are at least two used pointers. All pointers point to B-tree blocks at the next lower level.
• At a leaf, the last pointer points to the next leaf block to the right, i.e., to the block with the next higher keys
• Among the other n pointers in a leaf, at least (n+1)/2 are used to point to data records; unused pointers can be thought of as null and do not point anywhere
• The ith pointer, if it is used, points to a record with the ith key

Page 52: Indexing


Rules for B-Trees
• At an interior node, all n+1 pointers can be used to point to B-tree blocks at the next lower level
• At least (n+1)/2 of them are actually used
• If j pointers are used, then there are j-1 keys, k1, k2, ..., k(j-1)
• The 1st pointer points to the part of the B-tree where some of the records with keys less than k1 will be found
• The 2nd pointer goes to the part of the tree where all the records with keys that are at least k1, but less than k2, will be found, and so on
• Finally, the jth pointer gets us to the part of the B-tree where some of the records with keys greater than or equal to k(j-1) are found

Page 53: Indexing


Rules for B-Trees
• Note that some of the records with keys far below k1 or far above k(j-1) may not be reachable from this block at all, but will be reached via another block at the same level
• The nodes at any level, read left to right, contain keys in non-decreasing order

Page 54: Indexing


Hash-based Indexing
• Intuition behind hash-based indexes:
  • Good for equality searches
  • Useless for range searches
• Static hashing
• Dynamic hashing:
  • Extendible hashing
  • Linear hashing

Page 55: Indexing


Static Hashing
• A bucket is a unit of storage containing one or more records (a bucket is typically a disk block)
• In a hash file organization we obtain the bucket of a record directly from its search-key value using a hash function
• A hash function h maps the set of all search-key values K to the set of all bucket addresses B
• The hash function is used to locate records for access, insertion, and deletion
• Records with different search-key values may be mapped to the same bucket; thus the entire bucket has to be searched sequentially to locate a record

Page 56: Indexing


Static Hashing
Hash file organization of an account file, using branch_name as key:
• There are 10 buckets
• The binary representation of the ith character is assumed to be the integer i
• The hash function returns the sum of the binary representations of the characters, modulo 10
• E.g., h(Perryridge) = 5, h(Round Hill) = 3, h(Brighton) = 3
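The toy hash function can be sketched as follows (an illustration assuming, as the slide does, that the ith letter of the alphabet has value i; this is not the textbook's exact code):

```python
def h(branch_name, n_buckets=10):
    """Toy hash: sum the alphabet positions of the letters, mod n_buckets.
    Assumes a = 1, ..., z = 26; non-letter characters contribute 0."""
    total = sum(ord(c) - ord('a') + 1
                for c in branch_name.lower() if c.isalpha())
    return total % n_buckets

print(h("Perryridge"), h("Round Hill"), h("Brighton"))   # 5 3 3
```

Note that "Round Hill" and "Brighton" collide on bucket 3, which is why a bucket must still be scanned sequentially.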

Page 57: Indexing


Static Hashing

Page 58: Indexing


Hash Functions
• The worst hash function maps all search-key values to the same bucket; this makes access time proportional to the number of search-key values in the file
• An ideal hash function is uniform: each bucket is assigned the same number of search-key values from the set of all possible values
• An ideal hash function is also random: each bucket has the same number of records assigned to it irrespective of the actual distribution of search-key values in the file
• Typical hash functions perform computation on the internal binary representation of the search key

Page 59: Indexing


Bucket Overflow
• Bucket overflow can occur because of:
  • An insufficient number of buckets
  • Skew in the distribution of records, which can occur for two reasons:
    • multiple records have the same search-key value
    • the chosen hash function produces a non-uniform distribution of key values
• Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is handled by using overflow buckets

Page 60: Indexing


Bucket Overflows
• Overflow chaining: the overflow buckets of a given bucket are chained together in a linked list
• This scheme is called closed hashing

Page 61: Indexing


Bucket Overflows

Page 62: Indexing


Hash Indexes
• Hashing can be used not only for file organization, but also for index-structure creation
• A hash index organizes the search keys, with their associated record pointers, into a hash file structure
• Strictly speaking, hash indices are always secondary indices: if the file itself is organized using hashing, a separate primary hash index on it using the same search key is unnecessary
• However, we use the term hash index to refer to both secondary index structures and hash-organized files

Page 63: Indexing


Example of Hash Index

Page 64: Indexing


Deficiencies of Static Hashing
• Databases grow with time. If the initial number of buckets is too small, performance will degrade due to too many overflows.
• If the file size at some point in the future is anticipated and the number of buckets allocated accordingly, a significant amount of space will be wasted initially
• If the database shrinks, again space will be wasted
• One option is periodic reorganization of the file with a new hash function, but this is very expensive
• These problems can be avoided by using techniques that allow the number of buckets to be modified dynamically

Page 65: Indexing


Dynamic Hashing
• Long overflow chains can develop and degrade performance
• Extendible and linear hashing are dynamic techniques to fix this problem

Page 66: Indexing


Extendible Hashing
• Inserting a new data entry into a full bucket: either add an overflow page, or reorganize the file using double the number of buckets and redistribute the entries
• Drawback of reorganizing: the entire file has to be read, and twice as many pages have to be written

Page 67: Indexing


Extendible Hashing
• Idea: use a directory of pointers to buckets; double the number of buckets by doubling the directory, splitting just the bucket that overflowed!
• The directory is much smaller than the file, so doubling it is much cheaper. Only one page of data entries is split. No overflow page!
• The trick lies in how the hash function is adjusted!

Page 68: Indexing


Extendible Hashing

[Figure: a directory of size 4 (GLOBAL DEPTH 2, entries 00, 01, 10, 11) pointing to four buckets, each of LOCAL DEPTH 2: Bucket A = {4*, 12*, 32*, 16*}, Bucket B = {1*, 5*, 21*, 13*}, Bucket C = {10*}, Bucket D = {15*, 7*, 19*}]

• The directory is an array of size 4. To find the bucket for a record r, take the last `global depth' # bits of h(r); we denote r by h(r).
• If h(r) = 5 = binary 101, it is in the bucket pointed to by directory entry 01.

Page 69: Indexing


Extendible Hashing
• Insert: if the bucket is full, split it (allocate a new page and redistribute the entries)
• If necessary, double the directory. (As we will see, splitting a bucket does not always require doubling; we can tell by comparing the global depth with the local depth of the split bucket.)
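The insert/split logic can be sketched compactly (an illustration only; the bucket capacity of 4 and the dict-based bucket representation are our choices, not the slides'):

```python
class ExtendibleHash:
    """Minimal extendible-hashing sketch. The directory has 2**global_depth
    slots; a key lands in the slot given by its last global_depth bits.
    Bucket capacity is 4 entries, as in the slides' figures."""

    CAP = 4

    def __init__(self):
        self.global_depth = 2
        self.dir = [{"depth": 2, "keys": []} for _ in range(4)]

    def _slot(self, key):
        return key & ((1 << self.global_depth) - 1)  # last global_depth bits

    def lookup(self, key):
        return key in self.dir[self._slot(key)]["keys"]

    def insert(self, key):
        while True:
            bucket = self.dir[self._slot(key)]
            if len(bucket["keys"]) < self.CAP:
                bucket["keys"].append(key)
                return
            self._split(bucket)                      # then retry the insert

    def _split(self, bucket):
        if bucket["depth"] == self.global_depth:     # no spare bit: double
            self.dir = self.dir + self.dir           # LSB trick: double = copy
            self.global_depth += 1
        bucket["depth"] += 1
        bit = 1 << (bucket["depth"] - 1)             # newly examined bit
        image = {"depth": bucket["depth"],
                 "keys": [k for k in bucket["keys"] if k & bit]}
        bucket["keys"] = [k for k in bucket["keys"] if not k & bit]
        for i in range(len(self.dir)):               # repoint half the entries
            if self.dir[i] is bucket and i & bit:
                self.dir[i] = image

eh = ExtendibleHash()
for k in [32, 16, 4, 12, 1, 5, 21, 13, 10, 15, 7, 19]:
    eh.insert(k)
eh.insert(20)                        # full bucket 00 -> doubling, as on the slide
print(eh.global_depth)               # 3
print(sorted(eh.dir[0]["keys"]))     # [16, 32]
print(sorted(eh.dir[4]["keys"]))     # [4, 12, 20]
```

Running the slide's example: inserting 20 splits the full bucket for pattern 00 into {32, 16} (pattern 000) and its split image {4, 12, 20} (pattern 100), doubling the directory to global depth 3.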

Page 70: Indexing


Insert h(r) = 20 (Causes Doubling)

[Figure: before the insert, GLOBAL DEPTH is 2 and Bucket A = {32*, 16*, 4*, 12*} (LOCAL DEPTH 2) is full. Inserting 20* splits A on the third bit into Bucket A = {32*, 16*} and its `split image' Bucket A2 = {4*, 12*, 20*}, both with LOCAL DEPTH 3, and the directory doubles to GLOBAL DEPTH 3 (entries 000-111). Buckets B = {1*, 5*, 21*, 13*}, C = {10*}, and D = {15*, 7*, 19*} keep LOCAL DEPTH 2.]

Page 71: Indexing


Points to Note
• 20 = binary 10100. The last 2 bits (00) tell us r belongs in A or A2; the last 3 bits are needed to tell which.
• Global depth of the directory: the max # of bits needed to tell which bucket an entry belongs to
• Local depth of a bucket: the # of bits used to determine whether an entry belongs to this bucket
• When does a bucket split cause directory doubling? Before the insert, local depth of the bucket = global depth. The insert causes the local depth to become > global depth; the directory is doubled by copying it over and `fixing' the pointer to the split-image page. (Use of least significant bits enables efficient doubling via copying of the directory!)

Page 72: Indexing


Points to Note
• Does splitting a bucket always necessitate a directory doubling?
• Try inserting 9*: it belongs to bucket B, which is already full
• Split bucket B, and use directory elements 001 & 101 to point to the bucket and its split image (no doubling needed)

Page 73: Indexing


Points to Ponder
• Why use the LSBs, and not the MSBs?
• What if a bucket becomes empty?

Page 74: Indexing


Directory Doubling

Why use the least significant bits in the directory? It allows for doubling via copying!

[Figure: doubling a depth-2 directory (00, 01, 10, 11) to depth 3. Using least significant bits, the new directory (000-111) is just two copies of the old one, so an entry such as 6* (6 = 110) stays reachable through its low-order bits. Using most significant bits instead interleaves the entries (000, 100, 010, 110, ...), so the doubled directory cannot be produced by a simple copy.]

Page 75: Indexing


Comments on Extendible Hashing
• If the directory fits in memory, an equality search is answered with one disk access; else two
• A 100 MB file with 100-byte records and 4 KB pages contains 1,000,000 records (as data entries) and 25,000 directory elements; chances are high that the directory will fit in memory
• The directory grows in spurts, and, if the distribution of hash values is skewed, the directory can grow large
• Multiple entries with the same hash value cause problems!
• Delete: if the removal of a data entry makes a bucket empty, the bucket can be merged with its `split image'. If each directory element points to the same bucket as its split image, the directory can be halved.

Page 76: Indexing


Extendible Hashing
• Benefits of extendible hashing:
  • Hash performance does not degrade with growth of the file
  • Minimal space overhead
• Disadvantages of extendible hashing:
  • An extra level of indirection to find the desired record
  • The bucket address table (directory) may itself become very big (larger than memory)
    • A tree structure may then be needed to locate the desired bucket!
  • Changing the size of the bucket address table is an expensive operation

Page 77: Indexing


Linear Hashing
• Linear hashing is an alternative mechanism that avoids these disadvantages, at the possible cost of more bucket overflows
• It is another dynamic hashing scheme, an alternative to extendible hashing
• Motivation: extendible hashing uses a directory that grows by doubling… Can we do better? (smoother growth)
• LH: split buckets from left to right, regardless of which one overflowed (simple, but it works!!)

Page 78: Indexing


Linear Hashing
• Does not require a directory
• LH provides a way to keep chains from growing too large on average
• It accomplishes this by expanding the address space gracefully, one chain at a time
• This is achieved using chain splitting

Page 79: Indexing


Linear Hashing: Example
Suppose M = 3, with three chains [0], [1], and [2], and [1] = {106, 217, 151, 418, 379}.
Three issues with chain splitting:
• How can a chain be split?
• Which chain should be split?
• When should a chain be split?

Page 80: Indexing


Linear Hashing: Example
• How can a chain be split?
  • Split a chain [m] evenly into two chains using a mod function
  • Since we want to expand the address space, the modulus of the hash function need not stay M
  • Use mod 2M to rehash the records in [m]
  • On average, mod 2M will hash half of the records to chain [m], and the other half to chain [M+m]
• [1] = {106, 217, 151, 418, 379}; rehash using mod 2M (= 6):
  [1] = {217, 151, 379}
  [4] = {106, 418}
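The split above can be reproduced with a small sketch (the numbers are the slide's; M is the original number of chains):

```python
M = 3
m = 1
chain = [106, 217, 151, 418, 379]   # chain [1]: every key = 1 (mod 3)

# Rehash chain [m] with mod 2M: each key stays in [m] or moves to [M+m]
kept  = [k for k in chain if k % (2 * M) == m]        # stays in [1]
moved = [k for k in chain if k % (2 * M) == M + m]    # goes to new chain [4]
print(kept, moved)   # [217, 151, 379] [106, 418]
```

Every key in chain [m] satisfies k mod M = m, so k mod 2M is either m or M+m; no other chain is touched by the split.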

Page 81: Indexing


Linear Hashing: Example
• Which chain should be split? The following possibilities:
  • Split chain [0]: this creates chain [3]
  • Split chain [1]: this creates chain [4]
  • Split chain [2]: this creates chain [5]
• Linear hashing gets its name from the fact that chains are designated linearly for splitting
• In the example, we first split chain [0], then [1], and then [2]
• Note that this is independent of where the insertions are taking place

Page 82: Indexing


Linear Hashing: Example
• Which chain should be split?

[Figure: with M = 3, chains 0, 1, and 2 are initially addressed by mod 3. After chain 0 is split, new chain 3 is added; chains 0 and 3 are addressed by mod 6, while chains 1 and 2 are still addressed by mod 3.]

Page 83: Indexing


Linear Hashing: Example
Initially h(x) = x mod N (N = 4 here); assume 3 records per bucket.
Insert 17: 17 mod 4 = 1
  Bucket 0: 4, 8
  Bucket 1: 5, 9, 13
  Bucket 2: 6
  Bucket 3: 7, 11

Page 84: Indexing


Linear Hashing: Example
Insert 17: 17 mod 4 = 1, but Bucket 1 ({5, 9, 13}) is already full
Overflow for Bucket 1
Split Bucket 0, anyway!!

Page 85: Indexing


Linear Hashing: Example
To split Bucket 0, use another hash function h1(x):
  h0(x) = x mod N,  h1(x) = x mod (2*N)

[Figure: 17 is placed in an overflow page of Bucket 1. Bucket 0 is rehashed with h1: 8 stays in Bucket 0, 4 moves to new Bucket 4. The split pointer advances past Bucket 0.]

Page 86: Indexing

Q & A

Page 87: Indexing

Thank You

