
CSE 30331 Lecture 16 – Hashing & Tables

Binary Search Tree vs. Hash Table
Hash Function (quick intro)
Collision
Coping with Collisions
    Open addressing & linear probing
    Chaining with separate lists
Hash Functions
    What works
    C++ Function Objects
Hash Iterators
Efficiency of Hash Methods

BST vs Hash Table

Both used to implement Sets & Maps

Binary Search Tree – ordered associative container

    O(log N) access (average case; also worst case when the tree is kept balanced, otherwise the worst case is O(N))

Hash Table – unordered associative container

    O(1) access (average case)

Hash Function

A hash function converts a key into a numeric (unsigned int) table index

Ideal hash functions uniformly distribute keys to all available indices

When two keys hash to the same index a collision occurs

Keys are not in any particular order (numeric, alphabetical, ...) within the table

Example Hash Function

hf(n) = n, the identity function
index = hf(n) % m, where m is table size

[Figure: a table with m = 7 entries, indices 0 to 6]

    hf(22) = 22, 22 % 7 = 1, so 22 is stored in tableEntry[1]
    hf(4) = 4, 4 % 7 = 4, so 4 is stored in tableEntry[4]
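The two-step mapping (hash the key, then reduce modulo the table size) can be sketched as follows. This is a minimal illustration, not code from the course library; the names IdentityHash and tableIndex are made up for the example.

```cpp
#include <cassert>
#include <cstddef>

// Identity hash function object for integer keys: hf(n) = n.
// (IdentityHash is a hypothetical name used only in this sketch.)
struct IdentityHash {
    unsigned int operator()(int key) const {
        return static_cast<unsigned int>(key);
    }
};

// Map a key to a table index: index = hf(key) % m, where m is the table size.
template <typename HashFunc>
std::size_t tableIndex(int key, std::size_t m, const HashFunc& hf) {
    assert(m > 0);  // a table must have at least one entry
    return hf(key) % m;
}
```

For example, tableIndex(22, 7, IdentityHash()) yields 1 and tableIndex(4, 7, IdentityHash()) yields 4, matching the slide.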

Collision

Given keys p and q, and table size m, hf(p) % m and hf(q) % m produce the same index

[Figure: the same 7-entry table, with 22 in tableEntry[1] and 4 in tableEntry[4]]

    hf(36) = 36, 36 % 7 = 1, so 36 collides with 22 at tableEntry[1]
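The collision condition can be written as a one-line test. The sketch below assumes the identity hash hf(n) = n and non-negative keys; the function name collides is hypothetical.

```cpp
#include <cassert>

// Two keys collide when they map to the same index in a table of size m.
// Assumes the identity hash hf(n) = n, so the index is simply key % m.
bool collides(unsigned int p, unsigned int q, unsigned int m) {
    assert(m > 0);  // table size must be positive
    return (p % m) == (q % m);
}
```

With m = 7, keys 22 and 36 collide (both map to index 1), while 22 and 4 do not.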

Coping with Collisions

Three primary methods exist for coping with collisions

    Rehashing: use same key but different hash function
    Linear Probing: examine successive locations (index, index+1, index+2, ...)
    Chaining: implement table with separate list at each table[index] location

Note: Except for the last case, the table is a fixed size.

Hash Table Using Linear Probing – Open Addressing

[Figure: a table with 11 entries, indices 0 to 10, with index = key % 11. The number of probes needed to place each key is shown in parentheses.]

    (a) Insert 54, 77, 94, 89, 14: 77 at [0], 89 at [1], 14 at [3], 94 at [6], 54 at [10] (1 probe each)
    (b) Insert 45: 45 % 11 = 1, taken by 89, so 45 goes to [2] (2 probes)
    (c) Insert 35: 35 % 11 = 2, [2] and [3] are taken, so 35 goes to [4] (3 probes)
    (d) Insert 76: 76 % 11 = 10, [10], [0], [1], [2], [3] and [4] are taken, so 76 goes to [5] (7 probes)

Linear Probing PseudoCode

    // insert item into table of size n using hashFunc() to
    // calculate index. this assumes no duplicate keys, and some
    // method of indicating that a hash table location is empty
    int index = hashFunc(item) % n;
    int origIndex = index;

    do
    {
        if table[index] is empty
            insert item as table[index] and return
        else if table[index] matches item
            return
        index = (index+1) % n;   // this is next location to probe
    } while (index != origIndex);

    throw overflowError;   // if we get here, table is full &
                           // does not contain item
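One way to turn the pseudocode into compilable C++ is sketched below. It makes several assumptions that are not in the slides: keys are non-negative ints, the hash is the identity function, an empty location is marked with a parallel vector of flags, and the class name ProbingTable is made up. The insert function also returns the number of probes it used, so the counts can be compared with a hand trace.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Open-addressing table of ints with linear probing (illustrative sketch).
class ProbingTable {
public:
    explicit ProbingTable(std::size_t n) : keys(n, 0), used(n, false) {
        assert(n > 0);
    }

    // Inserts key if it is absent and returns the number of probes used.
    // Throws std::overflow_error if the table is full and lacks the key.
    int insert(int key) {
        const std::size_t n = keys.size();
        std::size_t index = static_cast<unsigned int>(key) % n;  // identity hash
        for (std::size_t probes = 1; probes <= n; ++probes) {
            if (!used[index]) {              // empty location: store the key
                keys[index] = key;
                used[index] = true;
                return static_cast<int>(probes);
            }
            if (keys[index] == key)          // key is already in the table
                return static_cast<int>(probes);
            index = (index + 1) % n;         // next location to probe
        }
        throw std::overflow_error("hash table is full");
    }

    // Returns the index holding key, or -1 if key is not in the table.
    int find(int key) const {
        const std::size_t n = keys.size();
        std::size_t index = static_cast<unsigned int>(key) % n;
        for (std::size_t probes = 0; probes < n; ++probes) {
            if (!used[index]) return -1;     // reached an empty location
            if (keys[index] == key) return static_cast<int>(index);
            index = (index + 1) % n;
        }
        return -1;                           // scanned the whole table
    }

private:
    std::vector<int> keys;   // stored keys
    std::vector<bool> used;  // used[i] is true when keys[i] holds a key
};
```

With a table of size 11, inserting 54, 77, 94, 89, 14, 45, 35, 76 in that order takes 1, 1, 1, 1, 1, 2, 3 and 7 probes respectively, and 76 ends up at index 5.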

Problems with Linear Probing

Clustering of items occurs as number of items approaches size of table

    Colliding items fill in gaps between other entries
    This forms runs or clusters within the table
    Items in the cluster are a mix of items that hash to different indices

Degraded performance results

    Long sequences of repeated probes are required to find what is sought

Chaining – Uses Lists or Buckets

Implement the hash table as a vector of lists

    Each list (bucket, chain, ...) contains all items that hash to the associated table location
    Buckets are not mixed like clusters in linear probing
    Table size can grow easily by expanding individual buckets as necessary
    The number of buckets stays constant
    Within a bucket, items are unordered and must be searched linearly

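Chaining with Separate Lists Example

[Figure: a table of 11 buckets with index = key % 11. The number in parentheses is the number of probes needed to reach each key.]

    bucket[0]:  77 (1)
    bucket[1]:  89 (1), 45 (2)
    bucket[2]:  35 (1)
    bucket[3]:  14 (1)
    bucket[6]:  94 (1)
    bucket[10]: 54 (1), 76 (2)
    All other buckets are empty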
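The bucket layout in this example can be reproduced with a vector of lists. The sketch below is illustrative only: it assumes int keys, the identity hash, and no duplicate checking, and the class name ChainedTable is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Hash table stored as a vector of lists (buckets); index = key % m.
class ChainedTable {
public:
    explicit ChainedTable(std::size_t m) : buckets(m) { assert(m > 0); }

    // Appends key to the back of its bucket and returns its 1-based
    // position in the chain, which is the number of comparisons a later
    // successful search for that key needs. Duplicates are not checked.
    int insert(int key) {
        std::list<int>& b = buckets[index(key)];
        b.push_back(key);
        return static_cast<int>(b.size());
    }

    // Number of items currently stored in bucket i.
    std::size_t bucketSize(std::size_t i) const { return buckets[i].size(); }

private:
    std::size_t index(int key) const {
        return static_cast<unsigned int>(key) % buckets.size();
    }
    std::vector<std::list<int> > buckets;
};
```

Inserting 54, 77, 94, 89, 14, 45, 35, 76 into 11 buckets gives position 1 for every key except 45 and 76, which land second in buckets 1 and 10. Compared with linear probing on the same keys, no key needs more than 2 comparisons.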

C++ Function Objects

A function object is an instance of a class that overloads the function call operator, operator(); in the simplest case that operator is the only member

Function objects are easily passed as parameters to other functions

Commonly used to implement hash functions and comparison operations

    template <typename T>
    class greaterThan
    {
    public:
        bool operator() (const T& x, const T& y) const
        { return x > y; }
    };

Using a function object

Here is a template function that swaps two parameters only IF the comparison is true

    template<typename T, typename Compare>
    void swap(T& a, T& b, Compare comp)
    {
        if (comp(a,b))
        {
            T temp = a;
            a = b;
            b = temp;
        }
    }

Here is a sample call (for int x and y); note that an object of the class, not the class name, is passed

    swap(x, y, greaterThan<int>());
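A self-contained version of the conditional swap is sketched below. The function is renamed swapIf here (a hypothetical name) so it cannot be confused with std::swap, and the helper orderAscending is likewise made up for the example. The comparison is passed as a temporary object, greaterThan<int>().

```cpp
#include <cassert>

// Function object: g(x, y) returns true when x > y.
template <typename T>
class greaterThan {
public:
    bool operator()(const T& x, const T& y) const { return x > y; }
};

// Swap a and b only if comp(a, b) is true.
template <typename T, typename Compare>
void swapIf(T& a, T& b, Compare comp) {
    if (comp(a, b)) {
        T temp = a;
        a = b;
        b = temp;
    }
}

// Put two ints in ascending order using the function object.
void orderAscending(int& x, int& y) {
    swapIf(x, y, greaterThan<int>());  // construct and pass an object
    assert(x <= y);                    // postcondition of the conditional swap
}
```

Calling orderAscending on x = 5, y = 3 leaves x = 3 and y = 5; a pair that is already ordered is left untouched.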

Reasonable Hash Functions

Integer key: Identity function

    Good distribution if key or a portion of it is random

    class hfIntKey
    {
    public:
        // returns unsigned int; a bool return type would reduce every key to 0 or 1
        unsigned int operator() (int key) const
        { return key; }
    };

Reasonable Hash Functions

Integer key: Midsquare technique

    Extracts middle two bytes of 4 byte square of key
    Works well with random and non-random keys

    class hfMidSq
    {
    public:
        unsigned int operator() (int key) const
        {
            unsigned int n = key;
            return ((n*n)/256) % 65536;   // 0 .. 2^16-1
        }
    };

Reasonable Hash Functions

String key: string-to-number

    Simple function uses ASCII codes for the string characters to build n-digit unsigned integers out of n-character strings

    class hfString
    {
    public:
        unsigned int operator() (const string& key) const
        {
            unsigned int prime = 2049982463;
            unsigned int n = 0;   // unsigned, so overflow wraps around safely
            for (size_t i = 0; i < key.length(); i++)
                n = n*8 + (unsigned char) key[i];

            return n % prime;
        }
    };

Reasonable Hash Functions

String key: folding

    Uses substrings as numbers and combines them by addition or multiplication or …

    Example: Sum of the 3 character substrings of a SSN
    Assuming no dashes in SSN … “987654321”
    987+654+321 = 1962

    class hfSSN
    {
    public:
        unsigned int operator() (const string& ssn) const
        {
            return ( atoi(ssn.substr(0,3).c_str()) +
                     atoi(ssn.substr(3,3).c_str()) +
                     atoi(ssn.substr(6,3).c_str()) );
        }
    };
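The midsquare and folding techniques can be checked with small worked values. The sketch below uses made-up names (MidSquareHash, FoldingHash) and std::stoi in place of atoi; it assumes a 9-character, digits-only SSN string.

```cpp
#include <cassert>
#include <string>

// Midsquare: take the middle two bytes of the 4-byte square of the key.
struct MidSquareHash {
    unsigned int operator()(int key) const {
        unsigned int n = static_cast<unsigned int>(key);
        return ((n * n) / 256) % 65536;  // result is in 0 .. 2^16 - 1
    }
};

// Folding: split a 9-digit string into three 3-digit numbers and add them.
struct FoldingHash {
    unsigned int operator()(const std::string& ssn) const {
        assert(ssn.size() == 9);  // assumes digits only, no dashes
        return static_cast<unsigned int>(std::stoi(ssn.substr(0, 3)) +
                                         std::stoi(ssn.substr(3, 3)) +
                                         std::stoi(ssn.substr(6, 3)));
    }
};
```

For key 1000 the square is 1,000,000; dividing by 256 drops the low byte and gives 3906, which is already below 65536. For "987654321" folding gives 987 + 654 + 321 = 1962. Either result would still be reduced modulo the table size before use.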

Hash Class – not in STL

See headers in Ford & Topp include folder

    d_hash.h – for the hash table using buckets
    d_hashf.h – for hash function object
    d_uset.h – for unordered set based on hash class
    d_hiter.h – for hash class iterator and const_iterator

Hash Class

    template <typename T, typename HashFunc>
    class hash
    {
    public:
        hash (int nbuckets, const HashFunc& hfunc = HashFunc());
        hash (T *first, T *last, int nbuckets,
              const HashFunc& hfunc = HashFunc());

        bool empty() const;
        int size() const;

        iterator find(const T& item);
        pair<iterator,bool> insert(const T& item);
        int erase(const T& item);
        void erase(iterator pos);
        void erase(iterator first, iterator last);

        iterator begin();
        const_iterator begin() const;
        iterator end();
        const_iterator end() const;

    private:
        int numBuckets;             // number of buckets
        vector<list<T> > bucket;    // table is vector of lists
        HashFunc hf;                // hash function
        int hashtableSize;          // number of elements
    };

Hash::find(item)

    // typename is required in front of the dependent type names
    template <typename T, typename HashFunc>
    typename hash<T,HashFunc>::iterator
    hash<T,HashFunc>::find(const T& item)
    {
        int hashIndex = int(hf(item) % numBuckets);
        list<T>& myBucket = bucket[hashIndex];
        typename list<T>::iterator bucketIter;

        // traverse list and look for a match with item
        bucketIter = myBucket.begin();
        while (bucketIter != myBucket.end())
        {
            if (*bucketIter == item)
                // return iterator to found item
                return iterator(this, hashIndex, bucketIter);

            bucketIter++;
        }

        // did not find item, so return iterator to table end
        return end();
    }

Hash::insert(item)

    template <typename T, typename HashFunc>
    pair<typename hash<T,HashFunc>::iterator, bool>
    hash<T,HashFunc>::insert(const T& item)
    {
        int hashIndex = int(hf(item) % numBuckets);
        list<T>& myBucket = bucket[hashIndex];
        typename list<T>::iterator bucketIter;
        bool success;

        // search the bucket for item
        bucketIter = myBucket.begin();
        while (bucketIter != myBucket.end())
            if (*bucketIter == item)
                break;              // found the item already in bucket
            else
                bucketIter++;

        if (bucketIter == myBucket.end())
        {
            // item not found, so insert it at the end of the bucket
            bucketIter = myBucket.insert(bucketIter, item);
            success = true;
            hashtableSize++;
        }
        else
            success = false;        // item already in table

        return pair<iterator,bool>
            (iterator(this, hashIndex, bucketIter), success);
    }
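The find and insert logic can be exercised without the full iterator machinery. The sketch below is a simplified stand-in, not the Ford & Topp class: MiniHash and IntHash are made-up names, and insert and contains return bool instead of iterators.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Identity hash for ints, used as the HashFunc template argument.
struct IntHash {
    unsigned int operator()(int key) const {
        return static_cast<unsigned int>(key);
    }
};

// Simplified chained hash table: same bucket layout as the hash class,
// but with bool results in place of iterator results.
template <typename T, typename HashFunc>
class MiniHash {
public:
    explicit MiniHash(std::size_t nbuckets, const HashFunc& hfunc = HashFunc())
        : bucket(nbuckets), hf(hfunc), count(0) {
        assert(nbuckets > 0);
    }

    // Linear search of the one bucket that item hashes to.
    bool contains(const T& item) const {
        const std::list<T>& b = bucket[hf(item) % bucket.size()];
        for (typename std::list<T>::const_iterator it = b.begin();
             it != b.end(); ++it)
            if (*it == item) return true;
        return false;
    }

    // Returns true if item was inserted, false if it was already present.
    bool insert(const T& item) {
        if (contains(item)) return false;
        bucket[hf(item) % bucket.size()].push_back(item);
        ++count;
        return true;
    }

    std::size_t size() const { return count; }

private:
    std::vector<std::list<T> > bucket;  // table is a vector of lists
    HashFunc hf;                        // hash function object
    std::size_t count;                  // number of elements
};
```

A table declared as MiniHash<int, IntHash> h(5) accepts 22 and 2 (both go to bucket 2) and rejects a second insert of 22, leaving size() at 2.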

Hash Iterator hIter Referencing Element 22 in Table ht

    hash<int, hFintID> ht;
    hash<int, hFintID>::iterator hIter;

[Figure: table ht with five buckets, buckets[0] to buckets[4], using hf(x) = x; two of the buckets are empty. The iterator hIter stores hashTable = &ht, currentBucket = 2, and currentLoc, which refers to element 22 in buckets[2], so *hIter = 22.]

Determining Performance

The Load Factor (λ) measures the table density

    λ = n / m

Where (m = size of table, n = items in table)

    Open addressing with linear probing (m = size of vector, the maximum number of items)
    Chaining (m = number of buckets)

Worst case (all items hash to same table location or bucket)

    Linear search is O(n)
    Making table size prime helps prevent nonuniform distribution causing this worst case

Average Case - Chaining

Finding bucket is O(1) – using hash function

Uniform hashing implies each bucket has n/m items

Assuming uniform hash distribution

    The ith item was inserted at the end of its bucket when the previous (i-1) items were spread evenly over the m buckets
    To find this item takes 1+(i-1)/m comparisons since there are (on average) (i-1)/m items ahead of it in its bucket

Average performance of search for an arbitrary item is the average of the number of comparisons required to find each item in the list

    \[ \frac{1}{n} \sum_{i=1}^{n} \left( 1 + \frac{i-1}{m} \right) = 1 + \frac{\lambda}{2} - \frac{1}{2m} \]
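The average over all n items can be checked numerically against its closed form, 1 + λ/2 - 1/(2m) with λ = n/m. The sketch below evaluates both sides; the function names are made up for the example.

```cpp
#include <cassert>

// Average comparisons for a successful search, summed item by item:
// (1/n) * sum over i = 1..n of (1 + (i - 1) / m).
double averageBySum(int n, int m) {
    assert(n > 0 && m > 0);
    double total = 0.0;
    for (int i = 1; i <= n; ++i)
        total += 1.0 + static_cast<double>(i - 1) / m;
    return total / n;
}

// Closed form of the same sum: 1 + lambda/2 - 1/(2m), with lambda = n/m.
double averageClosedForm(int n, int m) {
    assert(n > 0 && m > 0);
    double lambda = static_cast<double>(n) / m;
    return 1.0 + lambda / 2.0 - 1.0 / (2.0 * m);
}
```

For n = 8 and m = 4 (λ = 2), the sum is 8 + (0 + 1 + ... + 7)/4 = 15, so the average is 15/8 = 1.875, and the closed form gives 1 + 1 - 1/8 = 1.875 as well.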

Efficiency of Hash Methods

Hash table size = m, Number of elements in hash table = n, Load factor λ = n/m

                  Average Probes for        Average Probes for
                  Successful Search         Unsuccessful Search

    Open Probe    1/2 + 1/(2(1-λ))          1/2 + 1/(2(1-λ)^2)

    Chaining      1 + λ/2 - 1/(2m)          λ
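To get a feel for how the load factor drives these costs, the standard expected-probe formulas can be evaluated directly: 1/2 + 1/(2(1-λ)) and 1/2 + 1/(2(1-λ)^2) for successful and unsuccessful searches with linear probing, and 1 + λ/2 - 1/(2m) for a successful search with chaining. The function names below are made up for this sketch.

```cpp
#include <cassert>

// Linear probing, successful search: 1/2 + 1/(2(1 - lambda)).
double probeSuccessful(double lambda) {
    assert(lambda >= 0.0 && lambda < 1.0);  // table must not be full
    return 0.5 + 1.0 / (2.0 * (1.0 - lambda));
}

// Linear probing, unsuccessful search: 1/2 + 1/(2(1 - lambda)^2).
double probeUnsuccessful(double lambda) {
    assert(lambda >= 0.0 && lambda < 1.0);
    return 0.5 + 1.0 / (2.0 * (1.0 - lambda) * (1.0 - lambda));
}

// Chaining, successful search with m buckets: 1 + lambda/2 - 1/(2m).
double chainSuccessful(double lambda, int m) {
    assert(lambda >= 0.0 && m > 0);
    return 1.0 + lambda / 2.0 - 1.0 / (2.0 * m);
}
```

At λ = 0.5 linear probing averages 1.5 probes for a hit and 2.5 for a miss. As λ approaches 1 both values grow without bound, whereas the chaining cost grows only linearly in λ.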

Final Variations

Universal Hashing

    Choose hf(n) randomly before execution from set of hash functions
    Prevents same clustering of collisions each time given set of data is used in hash table
    Efficiency is more likely to be Θ(1), even for data that would be a worst case for one fixed hash function

Perfect Hashing

    Two tier approach (requires static set of keys)
    Uses two hash functions from universal hf(n) set
    Like chaining, but with secondary hash tables instead of chains
    Size of secondary hash tables is square of number of items hashing to that table using first hash function
    Second hash function is chosen so no collisions occur in each secondary table
    Efficiency is guaranteed to be Θ(1)
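The slides do not say which set of hash functions to draw from. One commonly used family, shown here purely as an assumption, is h(k) = ((a*k + b) mod p) mod m with p prime and a, b chosen at random. The class and function names below are made up for this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// One member of the family h(k) = ((a*k + b) mod P) mod m, where P is
// prime, 1 <= a < P and 0 <= b < P. Keys are assumed to be below P.
class UniversalHash {
public:
    static constexpr std::uint64_t P = 2147483647ULL;  // 2^31 - 1, a prime

    UniversalHash(std::uint64_t a, std::uint64_t b, std::uint64_t m)
        : a_(a), b_(b), m_(m) {
        assert(a >= 1 && a < P && b < P && m > 0);
    }

    // 64-bit arithmetic keeps a*key + b from overflowing for 32-bit keys.
    std::uint64_t operator()(std::uint32_t key) const {
        return ((a_ * key + b_) % P) % m_;
    }

private:
    std::uint64_t a_, b_, m_;
};

// Pick a and b at random once, before the table is used.
UniversalHash randomUniversalHash(std::uint64_t m, std::mt19937& rng) {
    std::uniform_int_distribution<std::uint64_t> pickA(1, UniversalHash::P - 1);
    std::uniform_int_distribution<std::uint64_t> pickB(0, UniversalHash::P - 1);
    std::uint64_t a = pickA(rng);
    std::uint64_t b = pickB(rng);
    return UniversalHash(a, b, m);
}
```

With a = 3, b = 7 and m = 11, key 5 hashes to (3*5 + 7) % 11 = 0. Once a function has been drawn it must stay fixed for the lifetime of the table, since every lookup has to reproduce the index used at insertion.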

Summary

Hash Table

    Simulates the fastest searching technique: knowing the index of the required value in a vector or array and using that index to access the value directly. A hash function makes this possible by converting the data to an integer.

    To obtain a table index, divide the value from the hash function by the table size and take the remainder, then access the table at that index. Normally, the number of elements in the table is much smaller than the number of distinct data values, so collisions occur.

    To handle collisions, we must place a value that collides with an existing table element into the table in such a way that we can efficiently access it later.

    Average running time for a search of a hash table is Θ(1)
    Worst case is Θ(n)

Summary

Collision Resolution

Linear open probe addressing

    The table is a vector or array of static size
    After using the hash function to compute a table index, look up the entry in the table.
    If the values match, perform an update if necessary.
    If the table entry is empty, insert the value in the table.
    Otherwise, probe the next table location (wrapping around at the end of the table) and repeat.

Summary

Collision Resolution (Cont…)

Chaining with separate lists

    The hash table is a vector of list objects
    Each list is a sequence of colliding items.
    After applying the hash function to compute the table index, search the list for the data value.
    If it is found, update its value; otherwise, insert the value at the back of the list.
    You search only items that collided at the same table location
    There is no limitation on the number of values in the table, and deleting an item from the table involves only erasing it from its corresponding list