Hashing

Lecture #5 of Algorithms, Data structures and Complexity

Joost-Pieter Katoen, Ed Brinksma

Formal Methods and Tools Group

E-mail: [email protected]

September 24, 2002

© JPK

#5: Hashing ADC (214020)

Overview

• Introduction

• Direct addressing

• Hashing

– Collision resolution using chaining
– Complexity analysis of chaining

• Open addressing

– Probing strategies
– Complexity analysis of open addressing

• Hash functions

Introduction

• A dictionary ADT stores information that can be retrieved at any time

– the set of items stored is dynamic
– items have a key and information associated with that key
– example: symbol table for a compiler where keys are strings (i.e., identifiers)

• A dictionary $D$ supports the following operations:

– search($k$) looks up the information stored under key $k$ in $D$
– insert($x$) stores information object $x$ into $D$
– delete($x$) deletes information object $x$ from $D$; requires $x$ to be in $D$

• Which data structure is appropriate to implement a dictionary?

– a heap: insertion and deletion are efficient, but how about search?
– ordered array/list: insertion is linear in worst case
– red-black tree: all operations are logarithmic in worst case

Under reasonable assumptions a hash table takes $\Theta(1)$ on average for all operations


Direct addressing

• Allocate an array that has a position for each possible key

• Each array element contains a pointer to the stored information

– for simplicity we omit the information associated with keys in this lecture
⇒ the techniques and analysis results remain valid

• For universe $U = \{0, 1, \ldots, N-1\}$ of keys we have:

– a direct-address table $T[0 \ldots N-1]$ with $T[k]$ corresponding to key $k$
– search($k$): return $T[k]$
– insert($x$): boils down to $T[\mathit{key}[x]] := x$
– delete($x$): simply means $T[\mathit{key}[x]] := \text{nil}$

• Runtime for each of the operations is $O(1)$ in worst case
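The three direct-addressing operations can be sketched in a few lines of Python. This is only an illustration: the class name `DirectAddressTable`, the use of `None` for nil, and storing the information object directly in the slot are assumptions made for the example, not part of the lecture.

```python
class DirectAddressTable:
    """Direct-address table for integer keys 0 .. size-1 (illustrative names)."""

    def __init__(self, size):
        # One slot per possible key; None plays the role of nil.
        self.slots = [None] * size

    def search(self, key):
        # Return the information stored under key, or None if absent.
        return self.slots[key]

    def insert(self, key, info):
        # Store the information object in the slot indexed by its key.
        self.slots[key] = info

    def delete(self, key):
        # Reset the slot indexed by the key to nil.
        self.slots[key] = None


table = DirectAddressTable(10)
table.insert(3, "three")
```

Each operation is a single array access, which is why the worst-case cost is constant; the price is one slot for every possible key.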

Direct addressing

[Figure: the universe of keys (0 to 9) and the subset of actual keys; each actual key indexes its own slot of the direct address table (slots 0 to 9)]

Check for duplicates in linear time

assume all elements are positive integers of at most $k$

bool checkDuplicates(int E[], int n, int k) {
    int Count[k + 1];                           // direct-address table for E[i]
    for (int i = 0; i <= k; i++) Count[i] = 0;  // initialize Count
    for (int i = 0; i < n; i++) {
        if (Count[E[i]] == 1) return true;      // duplicate found
        else Count[E[i]] = 1;                   // count occurrence of E[i]
    }
    return false;                               // no duplicate found
}
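A runnable counterpart of the duplicate check, as a minimal Python sketch. The function name `check_duplicates` and the boolean list `seen` (standing in for the Count table) are choices made for this example.

```python
def check_duplicates(elements, k):
    """Return True if some value occurs twice; all values must lie in 0..k."""
    seen = [False] * (k + 1)      # direct-address table indexed by value
    for e in elements:
        if seen[e]:
            return True           # duplicate found
        seen[e] = True            # record the occurrence of e
    return False                  # no duplicate found
```

The loop touches every element once and each table access is constant time, so the running time is linear in the number of elements plus the table size.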

Counting sort

assume all elements are positive integers of at most $k$

void countSort(int E[], int n, int k) {
    int Count[k + 1]; int i, j, p;
    for (i = 0; i <= k; i++) Count[i] = 0;      // initialize Count
    for (i = 0; i < n; i++) Count[E[i]]++;      // count occurrences of each value
    p = 0;                                      // next position to fill in E
    for (i = 0; i <= k; i++) {
        for (j = 0; j < Count[i]; j++) { E[p] = i; p++; }  // write value i Count[i] times
    }
}
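The same idea as a small Python sketch that sorts a list in place. The name `count_sort` and the sample input are illustrative assumptions.

```python
def count_sort(elements, k):
    """Sort a list of integers in the range 0..k in place by counting occurrences."""
    count = [0] * (k + 1)
    for e in elements:
        count[e] += 1                 # count occurrences of each value
    p = 0                             # next position to overwrite
    for value in range(k + 1):
        for _ in range(count[value]):
            elements[p] = value       # write value as often as it occurred
            p += 1


data = [7, 1, 4, 6, 5, 1, 5]
count_sort(data, 9)
```

No two elements are ever compared with each other; the order comes entirely from walking the count table from 0 to k.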

Counting sort: example

[Figure: contents of the input array and of Count at the start and after 1, 2, 3, 4 and 5 iterations]

Counting sort

• Note that we now sort with worst-case complexity $\Theta(n + k)$, i.e., $\Theta(n)$ for $k \in O(n)$

– compare this to the lower bound of $\Omega(n \log n)$ that we obtained earlier
– but this algorithm is incomparable to quicksort, heapsort and the like
⇒ it is not based on element-wise comparisons, but counts occurrences

• Why does this trick work: it exploits direct addressing

• Insertion, deletion and searching take $O(1)$ in worst case

• Main complication: excessive space consumption (size of array = $|U|$)

– e.g., if keys are strings of 20 symbols over an alphabet $\Sigma$, we need about $|\Sigma|^{20}$ array entries

Can we avoid this huge memory consumption while remaining efficient?

Yes! By using hashing


Hashing

• In practice only a small fraction of keys is used, i.e., $|K| \ll |U|$ for the set $K$ of actual keys

⇒ with direct addressing most of the direct address table $T$ is wasted

• The aim of hashing is:

– map an extremely large key space onto a reasonably small range (of integers)
– such that it is unlikely that two keys are mapped onto the same integer

• A hash function maps a key onto an index in the hash table $T$:

$h : U \to \{0, 1, \ldots, m-1\}$ where $m$ is the table size and $m \ll |U|$

• Hash collisions, i.e., $h(k) = h(k')$ for $k \neq k'$, raise the issues:

– how to obtain a hash function that is cheap to evaluate and minimizes collisions?
– how to treat hash collisions when they occur?

Hashing

[Figure: a hash function maps the actual keys, a subset of the universe of keys, onto slots of the hash table; two keys mapped onto the same slot form a hash collision]

Hash collisions: the birthday paradox

No matter how good our hash function is, we had better be prepared for collisions

• This is due to the birthday paradox:

– the probability that your neighbor has the same birthday is $\frac{1}{365} \approx 0.0027$
– if you ask 23 people, this probability rises to about $\frac{23}{365} \approx 0.063$
– but, if there are 23 people in a room, two of them have the same birthday with probability:

$1 - \left(\frac{365}{365} \cdot \frac{364}{365} \cdot \frac{363}{365} \cdots \frac{343}{365}\right) \approx 0.507$

• Applying this to hashing yields:

– the probability of no collisions after $k$ insertions into an $m$-element table:

$\frac{m}{m} \cdot \frac{m-1}{m} \cdots \frac{m-k+1}{m} = \prod_{i=0}^{k-1} \frac{m-i}{m}$

– as $k$ grows, this probability goes to 0 well before $k$ reaches $m$
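The no-collision probability is easy to evaluate numerically. The sketch below assumes uniformly distributed hash values; the function name `prob_no_collision` is chosen for this example.

```python
def prob_no_collision(k, m):
    """Probability that k uniformly hashed keys land in k distinct slots of an m-slot table."""
    p = 1.0
    for i in range(k):
        p *= (m - i) / m      # the (i+1)-st key must avoid the i slots already taken
    return p
```

For example, `1 - prob_no_collision(23, 365)` reproduces the birthday figure of roughly 0.507, and the product becomes exactly 0 once `k` exceeds `m`.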

Hash collisions: the birthday paradox

[Figure: probability of no collision (vertical axis, 0 to 1.0) against the number of insertions (horizontal axis, 0 to 100)]

Collision resolution by chaining

concept: put all keys that hash to the same integer in a linked list [Luhn 1953]

[Figure: hash table in which each occupied slot points to a linked list of the keys that hash to that slot]

Collision resolution by chaining

• Dictionary operations when using chaining:

– search($k$): search for an element with key $k$ in the list $T[h(k)]$
– insert($x$): put element $x$ at the front of list $T[h(\mathit{key}[x])]$
– delete($x$): delete element $x$ from list $T[h(\mathit{key}[x])]$

• Worst-case complexity of these operations:

– assuming computing $h(k)$ is rather efficient, say $\Theta(1)$
– searching: proportional to the length of the list $T[h(k)]$
– insertion: in constant time (note: no check whether element $x$ is already present)
– deletion: proportional to the length of the list $T[h(\mathit{key}[x])]$

• In worst case all keys are hashed onto the same slot

– searching and deletion have the same complexity as for lists: $\Theta(n)$

The average case complexity of hashing with chaining is efficient, though
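A minimal sketch of a chained hash table in Python. Several things here are assumptions made for brevity: the class name `ChainedHashTable`, the use of Python's built-in `hash` reduced modulo the table size as hash function, and Python lists standing in for linked lists (so front insertion is not truly constant time in this sketch).

```python
class ChainedHashTable:
    """Hash table with chaining; Python lists stand in for the linked lists."""

    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]    # one (initially empty) chain per slot

    def _h(self, key):
        return hash(key) % self.m              # assumed hash function

    def search(self, key):
        for k, info in self.slots[self._h(key)]:
            if k == key:
                return info
        return None                            # key not present

    def insert(self, key, info):
        # Put the element at the front of its chain; no duplicate check.
        self.slots[self._h(key)].insert(0, (key, info))

    def delete(self, key):
        chain = self.slots[self._h(key)]
        for idx, (k, _) in enumerate(chain):
            if k == key:
                del chain[idx]
                return
```

Search and delete walk a single chain, so their cost is proportional to that chain's length, exactly as stated in the bullets above.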

Average case analysis of chaining (I)

• Assumptions:

– we have $n$ possible keys and $m$ hash-table entries
– uniform hashing: each key is equally likely to be hashed to any integer
– the hash value $h(k)$ can be computed in constant time

• The filling degree of hash table $T$ is $\alpha(n, m) = \frac{n}{m}$

– note that the average length of list $T[i]$ is also $\alpha$

• What is the expected number of elements examined in $T[h(k)]$ to search key $k$?

– distinguish between unsuccessful and successful search (like in lecture #1)

• Technical point:

– extend the definition of $O$, $\Omega$ and $\Theta$ to functions with two parameters (like $\alpha$)
– e.g., $f \in O(g)$ if $\exists\, c, n_0, m_0 > 0$ such that

$\forall n \geq n_0, m \geq m_0 : 0 \leq f(n, m) \leq c \cdot g(n, m)$

Average case analysis of chaining (II)

• An unsuccessful search takes $\Theta(1 + \alpha)$ time on average

– expected time to search for key $k$ = expected time to search list $T[h(k)]$
– this list has expected length $\alpha$
– the computation of $h(k)$ takes a single time unit
⇒ together this yields $1 + \alpha$ time units on average

• A successful search also takes $\Theta(1 + \alpha)$ time on average

– let $k_i$ be the $i$-th inserted key and $A(k_i)$ be the expected time to search $k_i$:

$A(k_i) = 1 +$ average number of keys inserted in $T[h(k_i)]$ after $k_i$ was inserted

– using the uniform hashing assumption this reduces to:

$A(k_i) = 1 + \sum_{j=i+1}^{n} \frac{1}{m}$

– take the average over all $n$ insertions into the hash table:

$\frac{1}{n} \sum_{i=1}^{n} A(k_i)$

Average case analysis of chaining (III)

The expected number of elements examined in a successful search is

$\frac{1}{n} \sum_{i=1}^{n} \left(1 + \sum_{j=i+1}^{n} \frac{1}{m}\right)$

= (* calculus *)

$1 + \frac{1}{nm} \sum_{i=1}^{n} (n - i)$

= (* calculus *)

$1 + \frac{1}{nm} \left(\sum_{i=1}^{n} n - \sum_{i=1}^{n} i\right)$

= (* calculus *)

$1 + \frac{1}{nm} \left(n^2 - \frac{n(n+1)}{2}\right)$

= (* calculus *)

$1 + \frac{\alpha}{2} - \frac{\alpha}{2n}$

and thus in $\Theta(1 + \alpha)$
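The closed form of this derivation can be checked with exact rational arithmetic. The sketch below assumes the standard reading of the sum, in which the key inserted $i$-th is followed by $n - i$ later insertions, each landing in the same list with probability $1/m$; the function names are illustrative.

```python
from fractions import Fraction


def avg_successful(n, m):
    """Average over the n inserted keys of 1 + (n - i)/m, computed exactly."""
    total = sum(1 + Fraction(n - i, m) for i in range(1, n + 1))
    return total / n


def closed_form(n, m):
    """1 + alpha/2 - alpha/(2n) with alpha = n/m."""
    alpha = Fraction(n, m)
    return 1 + alpha / 2 - alpha / (2 * n)
```

For any positive `n` and `m` the two functions return the same fraction, e.g. `avg_successful(10, 7) == closed_form(10, 7)`.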

Complexity of dictionary operations using chaining

• Assume the number $m$ of entries is (at least) proportional to $n$

• Then filling degree $\alpha(n, m) = \frac{n}{m} \in \frac{O(m)}{m} = O(1)$

• Then all dictionary operations take $O(1)$ time on average

• This includes searching, so we can sort in $\Theta(n)$ on average!


Collision resolution by open addressing

• Unlike chaining, all elements are stored in the hash table itself

⇒ at most $m$ keys can be stored, i.e., $\alpha(n, m) = \frac{n}{m} \leq 1$ [Amdahl 1954]

• Since no memory is used for pointers, more data can be stored

⇒ this helps to reduce the number of hash collisions

• Insertion of a key $k$:

– probe the entries of the hash table until an empty slot is found
– the sequence of slots probed depends on the key $k$ to be inserted
– the hash function depends on the key $k$ and the probe number:

$h : U \times \{0, 1, \ldots, m-1\} \to \{0, 1, \ldots, m-1\}$

– hash function $h$ should eventually consider every entry in the hash table

Insertion using open addressing

void hashInsert(int T[], key k) {
    int i = 0, j;                     // i is probe number
    repeat
        j = h(k, i);                  // compute (i+1)-st probe
        if (T[j] == nil) {            // free entry found
            T[j] = k; return;         // store key k and stop
        }
        else i = i + 1;
    until (i == length(T));           // check entire table
    return "hash table overflow";     // no free entry left
}

Searching using open addressing

int hashSearch(int T[], key k) {
    int i = 0, j;                     // i is probe number
    repeat
        j = h(k, i);                  // compute (i+1)-st probe
        if (T[j] == k) return j;      // key k found
        else i = i + 1;
    until (i == length(T) || T[j] == nil);  // check entire table or find an empty slot
    return nil;                       // key k has not been found
}

Deletion using open addressing

• Deleting key $k$ from slot $i$ by $T[i] := \text{nil}$ is inappropriate

⇒ if at insertion of some key $k'$ slot $i$ was occupied, we cannot retrieve $k'$ anymore

• Solution: mark $T[i]$ with the special value DELETED (or "obsolete")

• hashInsert needs to be adapted to treat such slots as empty

• hashSearch remains unchanged, as DELETED slots are skipped like occupied slots holding a different key

• Search times no longer depend only on the filling degree $\alpha$ (DELETED slots lengthen probe sequences too)

• If keys are to be deleted, chaining is more commonly used
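Insertion, search and deletion with a DELETED marker fit together as in the following Python sketch. It assumes integer keys, linear probing, and the auxiliary hash function `key % m`; the class name `OpenAddressTable` and the sentinel objects `EMPTY` and `DELETED` are illustrative choices.

```python
EMPTY, DELETED = object(), object()    # sentinels for nil and DELETED


class OpenAddressTable:
    """Open addressing with linear probing (assumed) over integer keys."""

    def __init__(self, m):
        self.m = m
        self.slots = [EMPTY] * m

    def _probe(self, key, i):
        return (key % self.m + i) % self.m     # slot of probe number i

    def insert(self, key):
        for i in range(self.m):
            j = self._probe(key, i)
            if self.slots[j] is EMPTY or self.slots[j] is DELETED:
                self.slots[j] = key            # DELETED slots are reused
                return j
        raise OverflowError("hash table overflow")

    def search(self, key):
        for i in range(self.m):
            j = self._probe(key, i)
            if self.slots[j] is EMPTY:
                return None                    # a truly empty slot ends the search
            if self.slots[j] == key:
                return j                       # DELETED never equals a key, so it is skipped
        return None

    def delete(self, key):
        j = self.search(key)
        if j is not None:
            self.slots[j] = DELETED            # mark, do not reset to EMPTY
```

Resetting the slot to `EMPTY` in `delete` would cut the probe sequence of every key that was inserted past that slot, which is the problem the DELETED marker avoids.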

How to select the next probe?

• How to generate the probing sequence for a given key $k$:

$\langle h(k, 0), h(k, 1), \ldots, h(k, m-1) \rangle$

– which is a permutation of $\langle 0, 1, \ldots, m-1 \rangle$ for each key $k$
⇒ this guarantees that all slots are eventually considered

• Ideally we have uniform hashing

– i.e., each of the $m!$ permutations is equally likely as probing sequence
– only used for analysis; in practice too expensive and approximated

• Different policies exist to select the next probe

– we consider linear probing, quadratic probing and double hashing
– quality is indicated by the number of distinct probing sequences generated

Linear probing

• Uses the hash function $h(k, i) = (h'(k) + i) \bmod m$ (for $0 \leq i < m$)

– where $h'$ is an auxiliary hash function

• Subsequent probed slots are offset by a linear dependence on $i$

• Initial probe determines the entire probe sequence

⇒ $m$ distinct probe sequences can be generated

• Suffers from clustering, i.e., long sequences of occupied slots

– an empty slot preceded by $i$ full slots gets filled next with probability $\frac{i+1}{m}$
⇒ long sequences of occupied slots tend to get longer

Linear probing: example

[Figure: linear probing in a hash table with slots 0 to 10, showing the 1st and 2nd probe for ins(17) and the 1st, 2nd and 3rd probe for ins(59), with $h'(k) = k \bmod m$ and $h(k, i) = (h'(k) + i) \bmod m$ for $m = 11$]

Quadratic probing

• Uses the hash function $h(k, i) = (h'(k) + c \cdot i + d \cdot i^2) \bmod m$ (for $0 \leq i < m$)

– where $h'$ is an auxiliary hash function and $c$, $d$ are non-zero constants

• Subsequent probed slots are offset by a quadratic dependence on $i$

• Initial probe determines the entire probe sequence

⇒ $m$ distinct probe sequences can be generated (like for linear probing)

– all slots are probed only provided the values of $m$ and the constants $c$ and $d$ are appropriately chosen

• Suffers from secondary clustering

– $h(k, 0) = h(k', 0)$ implies $h(k, i) = h(k', i)$ for all $i$
– but avoids the clustering appearing with linear probing

Quadratic probing: example

[Figure: quadratic probing in a hash table with slots 0 to 10, showing the 1st, 2nd, 3rd and 4th probe for ins(17), with $h'(k) = k \bmod 11$ and $h(k, i) = (h'(k) + i + 3i^2) \bmod 11$]

Double hashing

• Uses the hash function $h(k, i) = (h_1(k) + i \cdot h_2(k)) \bmod m$ (for $0 \leq i < m$)

– where $h_1$ and $h_2$ are auxiliary hash functions

• Subsequent probed slots are offset by the amount $h_2(k)$

⇒ the initial probe does not determine the probe sequence

⇒ this yields a better distribution of keys in the hash table

⇒ approximates the uniform hashing strategy

• If $h_2(k)$ and $m$ are relatively prime, the entire hash table is searched

– e.g., choose $m$ to be a power of 2 and $h_2$ such that it produces an odd number

• Each possible pair $h_1(k)$ and $h_2(k)$ yields a distinct probe sequence

⇒ double hashing generates $m^2$ distinct permutations
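The three probing policies can be compared side by side in a short Python sketch. The concrete choices are assumptions for illustration only: auxiliary hash $k \bmod m$, quadratic constants 1 and 3, and second hash function $1 + (k \bmod (m - 1))$, which is never 0 and hence relatively prime to a prime table size.

```python
def linear_probe(k, i, m):
    return (k % m + i) % m


def quadratic_probe(k, i, m, c=1, d=3):
    # c and d are the two non-zero constants (values assumed here).
    return (k % m + c * i + d * i * i) % m


def double_hash_probe(k, i, m):
    h1 = k % m
    h2 = 1 + (k % (m - 1))        # assumed second hash function, never 0
    return (h1 + i * h2) % m


def probe_sequence(probe, k, m):
    """Slots visited for key k, in probe order."""
    return [probe(k, i, m) for i in range(m)]
```

With `m = 11`, `probe_sequence(linear_probe, 17, 11)` walks the slots 6, 7, 8, ... in order, while the double-hashing sequence for the same key jumps by 8 slots per probe and still visits every slot exactly once.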

Double hashing: example

[Figure: double hashing in a hash table with slots 0 to 10, showing the 1st, 2nd, 3rd and 4th probe for ins(17) and the 1st probe for ins(59)]

Practical efficiency of double hashing

• Hash table experiment (final filling 99.95%)

• Mean number of collisions per insertion into hash table:

[Figure: mean number of collisions per insertion (vertical axis, 0 to 6) against usage of the hash table in % (horizontal axis, 0 to 100)]

Efficiency of open addressing

Under the assumption of uniform hashing we have:

• An unsuccessful search takes $O\left(\frac{1}{1 - \alpha}\right)$ time on average

– if the hash table is half full, 2 probes are necessary on average
– if the hash table is 90% full, 10 probes are necessary on average

• A successful search takes $O\left(\frac{1}{\alpha} \cdot \ln \frac{1}{1 - \alpha}\right)$ time on average

– if the hash table is half full, about 1.39 probes are necessary on average
– if the hash table is 90% full, about 2.56 probes are necessary on average

• Recall that for chaining this was $\Theta(1 + \alpha)$ for both cases
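The probe counts quoted above follow directly from the two bounds. A small Python sketch (function names are illustrative) evaluates them for a given filling degree:

```python
import math


def unsuccessful_probes(alpha):
    """Upper bound 1 / (1 - alpha) on the expected probes of an unsuccessful search."""
    return 1 / (1 - alpha)


def successful_probes(alpha):
    """Upper bound (1 / alpha) * ln(1 / (1 - alpha)) for a successful search."""
    return (1 / alpha) * math.log(1 / (1 - alpha))
```

Evaluating both functions at 0.5 and 0.9 gives the values 2 and 10 for unsuccessful search and about 1.39 and 2.56 for successful search.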

Analyzing unsuccessful search (I)

$\Pr\{\#\text{probes} \geq i\}$

= (* $A_j$ is the event that there is a $j$-th probe and it is to an occupied slot *)

$\Pr\{A_1 \cap A_2 \cap \ldots \cap A_{i-1}\}$

= (* probability theory *)

$\Pr\{A_1\} \cdot \Pr\{A_2 \mid A_1\} \cdot \Pr\{A_3 \mid A_1 \cap A_2\} \cdots \Pr\{A_{i-1} \mid A_1 \cap \ldots \cap A_{i-2}\}$

= (* there are $n$ elements and $m$ slots *)

$\frac{n}{m} \cdot \frac{n-1}{m-1} \cdots \frac{n-i+2}{m-i+2}$

≤ (* bound from above *)

$\left(\frac{n}{m}\right)^{i-1}$

= (* definition of $\alpha$ *)

$\alpha^{i-1}$

Analyzing unsuccessful search (II)

the expected number of probes

= (* property of expectation *)

$\sum_{i=1}^{\infty} \Pr\{\#\text{probes} \geq i\}$

≤ (* use previous derivation on $\Pr\{\#\text{probes} \geq i\}$ *)

$\sum_{i=1}^{\infty} \alpha^{i-1}$

= (* rewrite slightly *)

$\sum_{i=0}^{\infty} \alpha^{i}$

= (* geometric series *)

$\frac{1}{1 - \alpha}$

Analyzing successful search (I)

average number of probes in a successful search

= (* definition of average *)

$\frac{1}{n} \sum_{i=0}^{n-1}$ average number of probes for $(i+1)$-st inserted key

≤ (* average number of probes for $(i+1)$-st inserted key is at most $\frac{1}{1 - i/m}$ *)

$\frac{1}{n} \sum_{i=0}^{n-1} \frac{1}{1 - i/m}$

= (* calculus *)

$\frac{1}{n} \sum_{i=0}^{n-1} \frac{m}{m - i}$

Analyzing successful search (II)

$\frac{1}{n} \sum_{i=0}^{n-1} \frac{m}{m - i}$

= (* calculus *)

$\frac{m}{n} \sum_{j=m-n+1}^{m} \frac{1}{j}$

≤ (* approximate summation by integral (cf. Example 1.7) *)

$\frac{m}{n} \int_{m-n}^{m} \frac{1}{x} \, dx$

= (* integral calculus *)

$\frac{m}{n} \cdot \ln \frac{m}{m - n}$

= (* definition of $\alpha$ *)

$\frac{1}{\alpha} \cdot \ln \frac{1}{1 - \alpha}$


Hash functions

• A hash function maps a key onto an integer (i.e., an index)

– the hash function $h(k)$ should be cheap to evaluate
– it should be surjective on the range $\{0, \ldots, m-1\}$
– it should tend to use all indexes with uniform frequency
– it should tend to put similar keys in different parts of the hash table

• Three major techniques to obtain a "good" hash function:

– the division method
– the multiplication method
– universal hashing

Division method

• Uses the hash scheme $h(k) = k \bmod m$ (for $k \in \mathbb{N}$)

• Using this method, the value of $m$ should be chosen with care

– if $m = 2^p$, then $k \bmod m$ amounts to selecting the $p$ least significant bits of $k$

• Practical good choice: $m$ is prime and not too close to a power of 2

– example: consider 2,000 character strings
– allow on average about 3 probes for an unsuccessful search
– choose $m = 701 \approx 2000/3$
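A short Python sketch shows why a power of 2 is a poor table size for the division method: the hash value then depends only on the low-order bits of the key. The function name and the sample keys are illustrative assumptions.

```python
def division_hash(k, m):
    """Division method: the remainder of the key modulo the table size."""
    return k % m


# Keys that differ only in their high-order bits.
keys = [0b0001_0101, 0b0110_0101, 0b1111_0101]

# With m = 2^4 only the 4 least significant bits survive, so all keys collide.
low_bits = [division_hash(k, 16) for k in keys]

# With a prime table size the high-order bits influence the result as well.
prime_slots = [division_hash(k, 13) for k in keys]
```

Here `low_bits` contains the same slot three times, whereas `prime_slots` contains three different slots.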

Multiplication method

• Uses the hash scheme $h(k) = \lfloor m \cdot (k \cdot A \bmod 1) \rfloor$ (for $k \in \mathbb{N}$)

– with constant $0 < A < 1$ (Knuth suggests $A \approx (\sqrt{5} - 1)/2 \approx 0.618$)
– note that $k \cdot A \bmod 1$ is the fractional part of $k \cdot A$

⇒ the value of $m$ is not critical here

• Usual scheme: take $m = 2^p$ and $A = s/2^w$ where $0 < s < 2^w$ and then:

– first compute $k \cdot s$ ($= k \cdot A \cdot 2^w$)
– divide by $2^w$, use only the fractional part
– multiply by $2^p$ and use only the integer part

[Figure: the key $k$ ($w$ bits) is multiplied by $s = A \cdot 2^w$; $p$ bits are extracted from the product]
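Both the real-valued definition and the bit-level scheme can be sketched in Python. The word size of 32 bits and the constant 2654435769 (the integer part of $A \cdot 2^{32}$ for the golden-ratio value of $A$) are assumptions chosen for this example, as are the function names.

```python
import math

A = (math.sqrt(5) - 1) / 2            # the constant suggested by Knuth


def multiplication_hash(k, m, a=A):
    """floor(m * frac(k * a)), computed with floating point."""
    return math.floor(m * ((k * a) % 1))


def multiplication_hash_bits(k, p, w=32, s=2654435769):
    """Integer-only variant for m = 2^p and a = s / 2^w (w and s assumed)."""
    low_word = (k * s) % (1 << w)     # fractional part of k*s / 2^w, scaled by 2^w
    return low_word >> (w - p)        # keep the p most significant bits of that word
```

For the key 123456 and a table of size $2^{14} = 16384$ both variants yield slot 67; the integer variant needs only one multiplication, one mask and one shift.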

Universal hashing

• Greatest problem with hashing:

– there is always an adversarial sequence of keys all mapped onto the same slot

• Choose randomly a hash function from a given small set $H$

– that is independent of the keys which are going to be used

• For $k \neq k'$ the number of functions in $H$ such that $k$ and $k'$ collide is at most $\frac{|H|}{m}$

– the probability that $k$ and $k'$ collide is at most $\frac{|H|/m}{|H|} = \frac{1}{m}$

• Example: define the elements of the class of hash functions by:

$h_{a,b}(k) = ((a \cdot k + b) \bmod p) \bmod m$

– where $p$ is a prime number such that $p > m$ and $p >$ largest key
– integers $a$ ($1 \leq a < p$) and $b$ ($0 \leq b < p$) are chosen at execution time
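Drawing a function from this class at execution time might look as follows in Python. The helper name `make_universal_hash` and the optional `rng` parameter (any object offering `randrange`, such as the `random` module or a `random.Random` instance) are assumptions of this sketch.

```python
import random


def make_universal_hash(p, m, rng=random):
    """Pick a and b at random and return k -> ((a*k + b) mod p) mod m.

    p must be a prime larger than m and larger than every key.
    """
    a = rng.randrange(1, p)       # 1 <= a < p
    b = rng.randrange(0, p)       # 0 <= b < p

    def h(k):
        return ((a * k + b) % p) % m

    return h


h = make_universal_hash(p=101, m=10, rng=random.Random(0))
```

Because `a` and `b` are chosen independently of the keys, no fixed key sequence is bad for every run; an unlucky draw can be replaced by simply drawing a new function.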

