1
CSE 30331 Lecture 16 – Hashing & Tables
Binary Search Tree vs. Hash Table
Hash Function (quick intro)
Collision
Coping with Collisions
    Open addressing & linear probing
    Chaining with separate lists
Hash Functions
    What works
    C++ Function Objects
Hash Iterators
Efficiency of Hash Methods
2
BST vs Hash Table
Both used to implement Sets & Maps
Binary Search Tree – ordered associative container
    O(log N) access (average case, and worst case when the tree is kept balanced)
Hash Table – unordered associative container
    O(1) access (average case)
3
Hash Function
A hash function converts a key into a numeric (unsigned int) table index
Ideal hash functions distribute keys uniformly over all available indices
When two keys hash to the same index, a collision occurs
Keys are not in any particular order (numeric, alphabetical, ...) within the table
4
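Example Hash Function
hf(n) = n, the identity function
index = hf(n) % m, where m is the table size (m = 7 in this example)
    hf(22) = 22, 22 % 7 = 1, so 22 is stored in tableEntry[1]
    hf(4) = 4, 4 % 7 = 4, so 4 is stored in tableEntry[4]
[Figure: table with entries 0 through 6; 22 in tableEntry[1], 4 in tableEntry[4]]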
5
Collision
Given keys p and q and table size m, a collision occurs when hf(p) % m and hf(q) % m produce the same index
    hf(22) = 22, 22 % 7 = 1
    hf(4) = 4, 4 % 7 = 4
    hf(36) = 36, 36 % 7 = 1, so 36 collides with 22 at tableEntry[1]
[Figure: table with entries 0 through 6; 22 in tableEntry[1], 4 in tableEntry[4], and 36 hashing to the occupied tableEntry[1]]
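The index computation and the collision test can be sketched in a few lines of C++. This is only an illustration of the slide's arithmetic; the names tableIndex and collides are hypothetical and do not come from the course code.

```cpp
#include <cstddef>

// Identity hash function from the slides: hf(n) = n.
inline unsigned int hf(unsigned int n) { return n; }

// Table index for a key in a table of m entries: hf(key) % m.
// Assumes m > 0.
inline std::size_t tableIndex(unsigned int key, std::size_t m)
{
    return hf(key) % m;
}

// Two keys collide when they map to the same table index.
inline bool collides(unsigned int p, unsigned int q, std::size_t m)
{
    return tableIndex(p, m) == tableIndex(q, m);
}
```

With m = 7, tableIndex(22, 7) and tableIndex(36, 7) both give 1, so collides(22, 36, 7) is true, while 4 lands alone at index 4.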
6
Coping with Collisions
Three primary methods exist for coping with collisions
    Rehashing: use the same key but a different hash function
    Linear Probing: examine successive locations (index, index+1, index+2, ...)
    Chaining: implement the table with a separate list at each table[index] location
Note: Except for the last case, the table is a fixed size.
7
Hash Table Using Linear Probing – Open Addressing
[Figure: table of 11 entries (indices 0 through 10) with index = key % 11; the number of probes each insertion needs is shown beside the item]
(a) Insert 54, 77, 94, 89, 14: 1 probe each (54 at index 10, 77 at 0, 94 at 6, 89 at 1, 14 at 3)
(b) Insert 45: 2 probes (index 1 is taken by 89, so 45 goes to index 2)
(c) Insert 35: 3 probes (indices 2 and 3 are taken, so 35 goes to index 4)
(d) Insert 76: 7 probes (indices 10, 0, 1, 2, 3 and 4 are taken, so 76 goes to index 5)
8
Linear Probing PseudoCode
// insert item into table of size n using hashFunc() to
// calculate index. this assumes no duplicate keys, and some
// method of indicating that a hash table location is empty
int index = hashFunc(item) % n;
int origIndex = index;
do
{
    if table[index] is empty
        insert item as table[index] and return
    else if table[index] matches item
        return
    index = (index+1) % n;   // this is next location to probe
} while (index != origIndex);
throw overflowError;   // if we get here, table is full &
                       // does not contain item
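One way to turn the pseudocode into compilable C++ is sketched below. It assumes non-negative int keys and the identity hash function, and it marks empty slots with a parallel vector of flags. The class name ProbeTable and its members are hypothetical, and insert returns the probe count purely so the figures on the previous slide can be reproduced.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Fixed-size open-addressing table of non-negative ints.
class ProbeTable
{
public:
    explicit ProbeTable(std::size_t n) : slots(n, 0), used(n, false) {}

    // Inserts item and returns the number of locations probed.
    // Returns without inserting if item is already in the table.
    // Throws std::overflow_error if the table is full.
    std::size_t insert(int item)
    {
        const std::size_t n = slots.size();
        if (n == 0)
            throw std::overflow_error("hash table is full");
        std::size_t index = static_cast<std::size_t>(item) % n;  // identity hash
        const std::size_t origIndex = index;
        std::size_t probes = 0;
        do
        {
            ++probes;
            if (!used[index])            // empty slot: store the item here
            {
                slots[index] = item;
                used[index] = true;
                return probes;
            }
            if (slots[index] == item)    // duplicate: nothing to do
                return probes;
            index = (index + 1) % n;     // next location to probe
        } while (index != origIndex);
        throw std::overflow_error("hash table is full");
    }

    bool occupied(std::size_t i) const { return used[i]; }
    int valueAt(std::size_t i) const { return slots[i]; }

private:
    std::vector<int> slots;    // stored keys
    std::vector<bool> used;    // used[i] is true when slots[i] holds a key
};
```

With a ProbeTable of size 11, inserting 54, 77, 94, 89 and 14 takes one probe each, after which 45 takes 2 probes, 35 takes 3 and 76 takes 7.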
9
Problems with Linear Probing
Clustering of items occurs as the number of items approaches the size of the table
    Colliding items fill in gaps between other entries
    This forms runs or clusters within the table
    Items in a cluster are a mix of items that hash to different indices
Degraded performance results
    Long sequences of repeated probes are required to find what is sought
10
Chaining – Uses Lists or Buckets
Implement the hash table as a vector of lists
Each list (bucket, chain, ...) contains all items that hash to the associated table location
Buckets are not mixed like clusters in linear probing
Table size can grow easily by expanding individual buckets as necessary
The number of buckets stays constant
Within a bucket, items are unordered and must be searched linearly
11
Chaining with Separate Lists Example
[Figure: 11 buckets; the number in parentheses is the item's position within its bucket]
<Bucket 0>   77(1)
<Bucket 1>   89(1)  45(2)
<Bucket 2>   35(1)
<Bucket 3>   14(1)
<Bucket 4>
<Bucket 5>
<Bucket 6>   94(1)
<Bucket 7>
<Bucket 8>
<Bucket 9>
<Bucket 10>  54(1)  76(2)
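A vector of lists is enough to reproduce this example. The sketch below assumes non-negative int keys and the identity hash function; ChainTable and its member names are hypothetical, and it is far simpler than the Ford & Topp hash class shown later.

```cpp
#include <algorithm>
#include <cstddef>
#include <list>
#include <vector>

// Chained hash table of non-negative ints with a fixed number of buckets.
// Assumes nbuckets > 0.
class ChainTable
{
public:
    explicit ChainTable(std::size_t nbuckets) : buckets(nbuckets) {}

    // Appends item to the back of its bucket unless it is already there.
    // Returns true if the item was inserted.
    bool insert(int item)
    {
        std::list<int>& b = buckets[indexOf(item)];
        if (std::find(b.begin(), b.end(), item) != b.end())
            return false;
        b.push_back(item);
        return true;
    }

    // Linear search of the only bucket the item can be in.
    bool contains(int item) const
    {
        const std::list<int>& b = buckets[indexOf(item)];
        return std::find(b.begin(), b.end(), item) != b.end();
    }

    const std::list<int>& bucket(std::size_t i) const { return buckets[i]; }

private:
    std::size_t indexOf(int item) const
    {
        return static_cast<std::size_t>(item) % buckets.size();  // identity hash
    }

    std::vector<std::list<int> > buckets;
};
```

Inserting 54, 77, 94, 89, 14, 45, 35 and 76 into a ChainTable with 11 buckets leaves 89 and 45 in bucket 1 and 54 and 76 in bucket 10, matching the figure.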
12
C++ Function Objects
A function object is an instance of a class that defines operator() (often as its only member function)
Function objects are easily passed as parameters to other functions
Commonly used to implement hash functions and comparison operations

template <typename T>
class greaterThan
{
public:
    bool operator() (const T& x, const T& y) const
    { return x > y; }
};
13
Using a function object
Here is a template function that swaps two parameters only IF the comparison is true

template<typename T, typename Compare>
void swap(T& a, T& b, Compare comp)
{
    if (comp(a,b))
    {
        T temp = a;
        a = b;
        b = temp;
    }
}

Here is a sample call (assuming x and y are ints; greaterThan<int>() constructs the function object)

swap(x, y, greaterThan<int>());
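Put together as one compilable unit, the function object and the conditional swap might look like the sketch below. The template is renamed swapIf here (a hypothetical name) so that it cannot be confused with std::swap, and orderPair is a hypothetical helper that shows the call.

```cpp
// Function object: true when x is greater than y.
template <typename T>
class greaterThan
{
public:
    bool operator() (const T& x, const T& y) const { return x > y; }
};

// Swaps a and b only if comp(a, b) is true.
template <typename T, typename Compare>
void swapIf(T& a, T& b, Compare comp)
{
    if (comp(a, b))
    {
        T temp = a;
        a = b;
        b = temp;
    }
}

// Puts the smaller of two ints first.
inline void orderPair(int& x, int& y)
{
    // greaterThan<int>() builds a temporary function object;
    // both the template argument and the parentheses are needed.
    swapIf(x, y, greaterThan<int>());
}
```

Passing the comparison as a template parameter lets the compiler inline the call to operator(), which is one reason function objects are preferred to function pointers in generic code.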
14
Reasonable Hash Functions
Integer key: Identity function
Good distribution if key or a portion of it is random

class hfIntKey
{
public:
    // returns the key itself as the hash value (not a bool)
    unsigned int operator() (int key) const
    { return key; }
};
15
Reasonable Hash Functions
Integer key: Midsquare technique
Extracts the middle two bytes of the 4-byte square of the key
Works well with random and non-random keys

class hfMidSq
{
public:
    unsigned int operator() (int key) const
    {
        unsigned int n = key;
        return ((n*n)/256) % 65536;   // 0 .. 2^16-1
    }
};
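A worked value may make the byte extraction concrete. The free function below is a sketch of the same computation using an explicit 32-bit type; the name midSquare is hypothetical. For key 1000 the square is 1,000,000 = 0x000F4240, and the middle two bytes are 0x0F42 = 3906.

```cpp
#include <cstdint>

// Midsquare hash: square the key in 32-bit unsigned arithmetic (the square
// wraps modulo 2^32), drop the low byte, and keep the next 16 bits.
inline std::uint32_t midSquare(std::uint32_t key)
{
    std::uint32_t sq = key * key;     // low 32 bits of the square
    return (sq / 256) % 65536;        // bits 8..23, a value in 0 .. 2^16-1
}
```

Dividing by 256 shifts out the low byte and the remainder modulo 65536 discards the high byte, so only the middle 16 bits of the square survive.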
16
Reasonable Hash Functions
String key: string-to-number
Simple function uses the ASCII codes of the string's characters to build an unsigned integer from the string

class hfString
{
public:
    unsigned int operator() (const string& key) const
    {
        unsigned int prime = 2049982463;
        unsigned int n = 0;   // unsigned, so overflow wraps instead of being undefined
        for (string::size_type i = 0; i < key.length(); i++)
            n = n*8 + (unsigned char) key[i];
        return n % prime;
    }
};
17
Reasonable Hash Functions
String key: folding
Uses substrings as numbers and combines them by addition or multiplication or …
Example: Sum of the 3-character substrings of an SSN
Assuming no dashes in the SSN … "987654321" gives 987+654+321 = 1962

class hfSSN
{
public:
    unsigned int operator() (const string& ssn) const
    {
        return ( atoi(ssn.substr(0,3).c_str()) +
                 atoi(ssn.substr(3,3).c_str()) +
                 atoi(ssn.substr(6,3).c_str()) );
    }
};
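Folding generalizes beyond nine-digit keys. The sketch below sums fixed-width chunks of any digit string; foldDigits is a hypothetical name, and the function assumes the string holds only decimal digits and that width is greater than zero.

```cpp
#include <cstddef>
#include <string>

// Folding: split a string of decimal digits into chunks of `width`
// characters and add the chunks as numbers. The last chunk may be shorter.
inline unsigned int foldDigits(const std::string& digits, std::size_t width)
{
    unsigned int sum = 0;
    for (std::size_t pos = 0; pos < digits.length(); pos += width)
        sum += static_cast<unsigned int>(std::stoul(digits.substr(pos, width)));
    return sum;
}
```

For the SSN example, foldDigits("987654321", 3) adds 987, 654 and 321 to give 1962.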
18
Hash Class – not in STL
See headers in Ford & Topp include folder
    d_hash.h – for the hash table using buckets
    d_hashf.h – for hash function object
    d_uset.h – for unordered set based on hash class
    d_hiter.h – for hash class iterator and const_iterator
19
Hash Class
template <typename T, typename HashFunc>
class hash
{
public:
    hash (int nbuckets, const HashFunc& hfunc = HashFunc());
    hash (T *first, T *last, int nbuckets,
          const HashFunc& hfunc = HashFunc());
    bool empty() const;
    int size() const;
    iterator find(const T& item);
    pair<iterator,bool> insert(const T& item);
    int erase(const T& item);
    void erase(iterator pos);
    void erase(iterator first, iterator last);
    iterator begin();
    const_iterator begin() const;
    iterator end();
    const_iterator end() const;

private:
    int numBuckets;              // number of buckets
    vector<list<T> > bucket;     // table is vector of lists
    HashFunc hf;                 // hash function
    int hashtableSize;           // number of elements
};
20
Hash::find(item)
template <typename T, typename HashFunc>
typename hash<T,HashFunc>::iterator
hash<T,HashFunc>::find(const T& item)
{
    int hashIndex = int(hf(item) % numBuckets);
    list<T>& myBucket = bucket[hashIndex];
    typename list<T>::iterator bucketIter;   // typename: dependent type

    // traverse list and look for a match with item
    bucketIter = myBucket.begin();
    while (bucketIter != myBucket.end())
    {
        if (*bucketIter == item)
            // return iterator to found item
            return iterator(this, hashIndex, bucketIter);
        bucketIter++;
    }

    // did not find item, so return iterator to table end
    return end();
}
21
Hash::insert(item)
template <typename T, typename HashFunc>
pair<typename hash<T,HashFunc>::iterator, bool>
hash<T,HashFunc>::insert(const T& item)
{
    int hashIndex = int(hf(item) % numBuckets);
    list<T>& myBucket = bucket[hashIndex];
    typename list<T>::iterator bucketIter;   // typename: dependent type
    bool success;

    // look for the item in its bucket
    bucketIter = myBucket.begin();
    while (bucketIter != myBucket.end())
        if (*bucketIter == item)
            break;          // found the item already in bucket
        else
            bucketIter++;

    if (bucketIter == myBucket.end())
    {
        // not found, so add it at the end of the bucket
        bucketIter = myBucket.insert(bucketIter, item);
        success = true;
        hashtableSize++;
    }
    else
        success = false;    // item already in table

    return pair<iterator,bool>
        (iterator(this,hashIndex,bucketIter), success);
}
22
Hash Iterator hIter Referencing Element 22 in Table ht
hash<int, hFintID> ht;
hash<int, hFintID>::iterator hIter;
[Figure: table ht with five buckets and hf(x) = x; buckets[0] holds 10, buckets[2] holds 22 and 2, buckets[4] holds 29, and buckets[1] and buckets[3] are empty]
hIter stores hashTable = &ht, currentBucket = 2, and currentLoc, which refers to the element 22 within buckets[2]
*hIter = 22
23
Determining Performance
The Load Factor (λ) measures the table density: λ = n/m, where m = size of table and n = items in table
    Linear addressing: m = size of vector (maximum number of items)
    Chaining: m = number of buckets
Worst case (all items hash to the same table location or bucket)
    Linear search is O(n)
Making the table size prime helps prevent a nonuniform distribution from causing this worst case
24
Average Case - Chaining
Finding the bucket is O(1) – using the hash function
Uniform hashing implies each bucket has n/m items
Assuming uniform hash distribution
    The ith item was inserted at the end of its bucket when the previous (i-1) items were spread evenly over the m buckets
    To find this item takes 1+(i-1)/m comparisons, since there are (on average) (i-1)/m items ahead of it in its bucket
Average performance of a search for an arbitrary item is the average of the number of comparisons required to find each item in the list

$$\frac{1}{n}\sum_{i=1}^{n}\left(1+\frac{i-1}{m}\right) = 1+\frac{\lambda}{2}-\frac{1}{2m}$$
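The closed form follows because the terms (i-1) sum to n(n-1)/2, so the average is 1 + (n-1)/(2m) = 1 + λ/2 - 1/(2m). The sketch below checks this numerically by comparing the item-by-item sum with the closed form; the function names avgBySum and avgClosedForm are hypothetical, and both assume n > 0 and m > 0.

```cpp
#include <cstddef>

// Average comparisons for a successful search, summed item by item:
// (1/n) * sum over i = 1..n of (1 + (i-1)/m).
inline double avgBySum(std::size_t n, std::size_t m)
{
    double total = 0.0;
    for (std::size_t i = 1; i <= n; ++i)
        total += 1.0 + static_cast<double>(i - 1) / m;
    return total / n;
}

// Closed form: 1 + lambda/2 - 1/(2m), with lambda = n/m.
inline double avgClosedForm(std::size_t n, std::size_t m)
{
    double lambda = static_cast<double>(n) / m;
    return 1.0 + lambda / 2.0 - 1.0 / (2.0 * m);
}
```

For n = 8 items in m = 11 buckets, both functions give 1 + 7/22, roughly 1.32 comparisons per successful search.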
25
Efficiency of Hash Methods
Hash table size = m, Number of elements in hash table = n, Load factor λ = n/m

Method        Average Probes for Successful Search    Average Probes for Unsuccessful Search
Open Probe    $\frac{1}{2}+\frac{1}{2(1-\lambda)}$    $\frac{1}{2}+\frac{1}{2(1-\lambda)^2}$
Chaining      $1+\frac{\lambda}{2}-\frac{1}{2m}$      $\lambda$
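To get a feel for how quickly open probing degrades as the table fills, the expected probe counts can be evaluated directly. The sketch below assumes the usual textbook formulas for this comparison, stated in the comments; the function names are hypothetical, and the open-probe formulas require λ < 1.

```cpp
// Expected probe counts as functions of the load factor lambda = n/m.

// Open probe, successful search: 1/2 + 1/(2(1 - lambda)).
inline double openProbeHit(double lambda)
{
    return 0.5 + 1.0 / (2.0 * (1.0 - lambda));
}

// Open probe, unsuccessful search: 1/2 + 1/(2(1 - lambda)^2).
inline double openProbeMiss(double lambda)
{
    return 0.5 + 1.0 / (2.0 * (1.0 - lambda) * (1.0 - lambda));
}

// Chaining, successful search: 1 + lambda/2 - 1/(2m).
inline double chainingHit(double lambda, double m)
{
    return 1.0 + lambda / 2.0 - 1.0 / (2.0 * m);
}

// Chaining, unsuccessful search: the whole bucket is scanned, lambda items on average.
inline double chainingMiss(double lambda)
{
    return lambda;
}
```

At λ = 0.5 open probing needs about 1.5 probes for a hit and 2.5 for a miss; at λ = 0.9 a miss costs about 50.5 probes, while chaining still scans fewer than one item on average.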
26
Final Variations
Universal Hashing
    Choose hf(n) randomly, before execution, from a set of hash functions
    Prevents the same clustering of collisions each time a given set of data is used in the hash table
    Efficiency is more likely to be Θ(1), even in the worst case
Perfect Hashing
    Two-tier approach (requires a static set of keys)
    Uses two hash functions from the universal hf(n) set
    Like chaining, but with secondary hash tables instead of chains
    The size of each secondary hash table is the square of the number of items hashing to that table using the first hash function
    The second hash function is chosen so no collisions occur in each secondary table
    Efficiency is guaranteed to be Θ(1)
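The slide does not name a particular set of hash functions. As an illustration only, the sketch below uses a common textbook family, h(k) = ((a·k + b) mod p) mod m with p prime, and picks a and b at random once before the table is used. UniversalHash and pickUniversalHash are hypothetical names, and keys are assumed to be smaller than p.

```cpp
#include <cstdint>
#include <random>

// One member of the family h(k) = ((a*k + b) mod p) mod m,
// where p is a prime larger than any key, 1 <= a < p and 0 <= b < p.
struct UniversalHash
{
    std::uint64_t a, b, p, m;

    std::uint64_t operator() (std::uint64_t key) const
    {
        return ((a * key + b) % p) % m;
    }
};

// Picks a and b at random for a table of m entries (m > 0).
// p = 2^31 - 1 is prime; with keys below p, a*key + b fits in 64 bits.
inline UniversalHash pickUniversalHash(std::uint64_t m, std::mt19937_64& rng)
{
    const std::uint64_t p = 2147483647ULL;
    std::uniform_int_distribution<std::uint64_t> pickA(1, p - 1);
    std::uniform_int_distribution<std::uint64_t> pickB(0, p - 1);
    UniversalHash h;
    h.a = pickA(rng);
    h.b = pickB(rng);
    h.p = p;
    h.m = m;
    return h;
}
```

Because a and b change from run to run (given a differently seeded generator), a data set that collides badly under one choice is unlikely to collide badly under the next.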
27
Summary
Hash Table
    The fastest searching technique is to know the index of the required value in a vector or array and use that index to access the value directly. A hash table simulates this by applying a hash function that converts the data to an integer.
    The index is obtained by dividing the value from the hash function by the table size and taking the remainder; that index is then used to access the table. Normally, the number of elements in the table is much smaller than the number of distinct data values, so collisions occur.
    To handle collisions, we must place a value that collides with an existing table element into the table in such a way that we can efficiently access it later.
    Average running time for a search of a hash table is Θ(1)
    Worst case is Θ(n)
28
Summary
Collision Resolution
Linear open probe addressing
    The table is a vector or array of static size
    After using the hash function to compute a table index, look up the entry in the table.
    If the values match, perform an update if necessary.
    If the table entry is empty, insert the value in the table.
    Otherwise, probe the following table locations in turn, wrapping around at the end of the table, until a match or an empty entry is found.
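The lookup side of linear probing can be sketched as follows. This assumes non-negative int keys, the identity hash function, a sentinel value EMPTY for unused slots, and a table from which nothing has been deleted (a deleted slot would otherwise cut a probe sequence short). The names EMPTY and probeFind are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Marks an unused slot; keys are assumed to be non-negative.
const int EMPTY = -1;

// Returns the index holding key, or table.size() if key is absent.
inline std::size_t probeFind(const std::vector<int>& table, int key)
{
    const std::size_t n = table.size();
    if (n == 0)
        return n;
    std::size_t index = static_cast<std::size_t>(key) % n;   // identity hash
    for (std::size_t probes = 0; probes < n; ++probes)
    {
        if (table[index] == EMPTY)
            return n;                 // an empty slot ends the search
        if (table[index] == key)
            return index;             // found
        index = (index + 1) % n;      // next location, wrapping around
    }
    return n;   // probed every slot without finding key
}
```

In the 11-entry table built earlier in the lecture (77, 89, 45, 14, 35, 76, 94 in slots 0 to 6 and 54 in slot 10), probeFind locates 76 at index 5 after walking the cluster that starts at index 10.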
29
Summary
Collision Resolution (Cont…)
Chaining with separate lists
    The hash table is a vector of list objects
    Each list is a sequence of colliding items.
    After applying the hash function to compute the table index, search the list for the data value.
    If it is found, update its value; otherwise, insert the value at the back of the list.
    You search only items that collided at the same table location
    There is no limitation on the number of values in the table, and deleting an item from the table involves only erasing it from its corresponding list