Date post: | 03-Jan-2016 |
Category: |
Documents |
Upload: | louisa-anderson |
View: | 218 times |
Download: | 0 times |
1
CS 163Data Structures
Chapter 4Searching
Herbert G. Mayer, PSUHerbert G. Mayer, PSUStatus 5/28/2015Status 5/28/2015
2
Syllabus
Goal of SearchGoal of Search Cost of SearchCost of Search Linear SearchLinear Search Binary Search in ListBinary Search in List Search in TreeSearch in Tree Search in GraphSearch in Graph Search via HashingSearch via Hashing Test Primary vs. SecondaryTest Primary vs. Secondary Brent Heuristic OutlineBrent Heuristic Outline
3
Goal of Search
Searching is the process of inspecting a Searching is the process of inspecting a data data structurestructure for a specified for a specified datumdatum
Goal of a search may be to determine, whether the Goal of a search may be to determine, whether the datum is included at all the data being searched datum is included at all the data being searched Upon finding the datum, the search may end successfullyUpon finding the datum, the search may end successfully Frequently the goal of a search is to know the specific Frequently the goal of a search is to know the specific
location location or or indexindex, where the datum was found, where the datum was found If not found, logically the complete data structure had to be If not found, logically the complete data structure had to be
searchedsearched Yet in case of sorted data structures this may not be as Yet in case of sorted data structures this may not be as
costly as it sounds!costly as it sounds!
Another goal of a search may be to determine the Another goal of a search may be to determine the number of occurrences of a datum in the data number of occurrences of a datum in the data structure being inspectedstructure being inspected
4
Cost of Search In linear search, the cost of find a datum is In linear search, the cost of find a datum is nn, with , with n n being being
the size of the data structurethe size of the data structure A common sense way of stating this: you may have to look at A common sense way of stating this: you may have to look at
all all nn elements in linear search to locate an item elements in linear search to locate an item In Big-O notation we say, the cost function for 1 lookup in a In Big-O notation we say, the cost function for 1 lookup in a
linear search is linear search is O(n)O(n) It doesn’t matter that sometimes the first try may be It doesn’t matter that sometimes the first try may be
successful, the cost function is successful, the cost function is O(n)O(n)
For the binary search, the cost is For the binary search, the cost is O(logO(log22(n)),(n)), since the worst since the worst case, to find an item, or prove its absence, is to inspect case, to find an item, or prove its absence, is to inspect loglog22(n)(n) elements elements
For a balanced tree the cost is For a balanced tree the cost is O(logO(log22(n))(n)), but for an , but for an unbalanced it can be as bad asunbalanced it can be as bad as O(n) O(n)
It is redundant to say It is redundant to say loglog22 since in Big-O notation the ratio of since in Big-O notation the ratio of loglog22 by by loglog1010 is but a constant, i.e. “noise” in Big-O terms is but a constant, i.e. “noise” in Big-O terms
5
Linear Search
Linear searchLinear search makes no assumption on the order, in makes no assumption on the order, in which data are stored in a data structure, the data which data are stored in a data structure, the data don’t need to be sorted!don’t need to be sorted!
The data structure is often an array; but could be a The data structure is often an array; but could be a linked list or any equivalent data structurelinked list or any equivalent data structure
If it is known that data stored are unique (occur at If it is known that data stored are unique (occur at most once) then search may terminate at first most once) then search may terminate at first successful findsuccessful find
To prove that a datum is not included, the complete To prove that a datum is not included, the complete data structure must be inspected in data structure must be inspected in linear searchlinear search
Hence the cost function for a linear search is O(n), Hence the cost function for a linear search is O(n), with with nn being size of data structure being size of data structure
If datum may be included more than once, then even If datum may be included more than once, then even after successful find the complete data structure must after successful find the complete data structure must be inspected; result is be inspected; result is count ofcount of occurrences occurrences
6
Linear Search
#include <stdio.h>#include <stdio.h>
#include <iostream.h>#include <iostream.h>
#define MAX 20#define MAX 20
// slot at index 0 is NOT used to store data// slot at index 0 is NOT used to store data
// instead, index 0 used to indicate: "not found"// instead, index 0 used to indicate: "not found"
int unsorted[ MAX ] =int unsorted[ MAX ] =
{{
0, 9, -99, 99, 999, -999, 7, -7, 77, 88,0, 9, -99, 99, 999, -999, 7, -7, 77, 88,
1, 2, 9, 8, 3, 4, 7, 16, 77, 221, 2, 9, 8, 3, 4, 7, 16, 77, 22
};};
7
Linear Search
// assumes global array unsorted[] with MAX elements// assumes global array unsorted[] with MAX elements
// Note that element unsorted[ 0 ] is NOT used// Note that element unsorted[ 0 ] is NOT used
// return 0 if element n is NOT found in unsorted[]// return 0 if element n is NOT found in unsorted[]
// searching n, return index of first occurrence// searching n, return index of first occurrence
int linear( int n )int linear( int n )
{ // linear{ // linear
// start linear search at index 1; index 0 unused// start linear search at index 1; index 0 unused
for ( int i = 1; i < MAX; i++ ) {for ( int i = 1; i < MAX; i++ ) {
if ( unsorted[ i ] == n ) {if ( unsorted[ i ] == n ) {
// found it// found it
return i;return i;
} //end if} //end if
} //end for} //end for
return 0;return 0; // 0 used for “not found!”// 0 used for “not found!”
} //end linear} //end linear
8
Binary Search in List, Play Game
Practice the Practice the Guessing GameGuessing Game in class: in class:
Students think of an integer number Students think of an integer number between 1 and 1,000; keep secretbetween 1 and 1,000; keep secret
I bet I can guess your secret number in I bet I can guess your secret number in 10 or less tries10 or less tries
Exactly that secret number!Exactly that secret number!
No cheating No cheating and no changing of mind! and no changing of mind!
Why does that work?Why does that work?
Binary search cost is Binary search cost is O( logO( log22( n ) )( n ) )
9
Binary Search in List
Binary search assumes that the data Binary search assumes that the data structure is structure is sortedsorted, generally in some in , generally in some in array[] or equivalent data structurearray[] or equivalent data structure
If data are unique –occurring at most once– If data are unique –occurring at most once– then we say the sorted array holds data in then we say the sorted array holds data in ascendingascending –or descending – order –or descending – order
Else we say: “non-ascending” and “non-Else we say: “non-ascending” and “non-descending”; is a bit awkwarddescending”; is a bit awkward
The method is:The method is:
10
Binary Search in List: Method
Use: Use: left left and and rightright indices; indices; array[]array[] sorted in sorted in ascendingascending order, wanted element is: order, wanted element is: nn
Initially Initially left = 0left = 0 in C++, in C++, rightright is index of last is index of last array element: array element: right = size - 1right = size - 1
Loop: Repeatedly do the following “guess” Loop: Repeatedly do the following “guess” and verify:and verify:
Guess, that Guess, that nn be in the be in the midmid, right between , right between leftleft and and rightright, i.e. , i.e. mid = ( left + right ) / 2mid = ( left + right ) / 2
If If array[ mid ] = narray[ mid ] = n the algorithm stops the algorithm stops
11
Binary Search in List: Method
Else adjust the indices Else adjust the indices leftleft or or rightright: :
If wanted element is greater than the found If wanted element is greater than the found one, search continues, with one, search continues, with left = mid + 1left = mid + 1
Else continue with Else continue with right = mid - 1right = mid - 1
When When leftleft is beyond is beyond rightright, , nn is not found is not found and algorithm stopsand algorithm stops
Else continue at loopsElse continue at loops
12
Binary Search 1
#include <stdio.h>#include <stdio.h>
#include <iostream.h>#include <iostream.h>
#define MAX 20#define MAX 20
// also slot at // also slot at index 0 not usedindex 0 not used in this binary search! in this binary search!
// simply a convention used here!// simply a convention used here!
int sorted[ MAX ] =int sorted[ MAX ] =
{{
00, -88, -77, -12, -4, -2, -1, 0, 1, 4,, -88, -77, -12, -4, -2, -1, 0, 1, 4,
5, 14, 17, 66, 77, 99, 100, 1001, 2015, 99995, 14, 17, 66, 77, 99, 100, 1001, 2015, 9999
};};
13
Binary Search 1
// binary search for element n in array sorted[]// binary search for element n in array sorted[]
// sorted[] is global, sized MAX is known globally// sorted[] is global, sized MAX is known globally
// element sorted[ // element sorted[ 00 ] is NOT used ] is NOT used
// if n not found, return // if n not found, return 00; else return index of n; else return index of n
int binary( int n )int binary( int n )
{ // binary{ // binary
int left = 1; // don't use index 0int left = 1; // don't use index 0
int right = MAX-1; // left, right approachint right = MAX-1; // left, right approach
int mid = ( left + right ) / 2;int mid = ( left + right ) / 2;
14
Binary Search 1
int binary( int n ) // search for n, return 0 if not found{ // binary
int left = 1; // don't use index 0int right = MAX-1; // left, right approachint mid = ( left + right ) / 2;while( ( sorted[ mid ] != n ) && ( left < right ) ) {while( ( sorted[ mid ] != n ) && ( left < right ) ) {
if ( sorted[ mid ] > n ) {if ( sorted[ mid ] > n ) { right = mid – 1;right = mid – 1; // search in "lower" part// search in "lower" part }else if ( sorted[ mid ] < n ) {}else if ( sorted[ mid ] < n ) { left = mid + 1;left = mid + 1; // search in "upper" part// search in "upper" part }else{}else{ break;break; // found n// found n } //end if} //end if mid = ( left + right ) / 2;mid = ( left + right ) / 2;
} //end while} //end whilereturn sorted[ mid ] == n ? mid : 0;return sorted[ mid ] == n ? mid : 0;
} //end binary} //end binary
15
Binary Search 1
// assume array sorted[ MAX ]// assume array sorted[ MAX ]// index range 0 .. MAX-1// index range 0 .. MAX-1// binary() return 0 means: n not in sorted[ 1..MAX-1 ]// binary() return 0 means: n not in sorted[ 1..MAX-1 ]if ( i = binary( num ) ) {if ( i = binary( num ) ) {
// yes, did find n at sorted[i]// yes, did find n at sorted[i]cout << ”Found ” << num << ”in slot ” << i << endl;cout << ”Found ” << num << ”in slot ” << i << endl;
}else{}else{// i is // i is 00, meaning: element num was not found, meaning: element num was not found
cout << ”Element ” << num << ”Not in sorted[]” << cout << ”Element ” << num << ”Not in sorted[]” << endl;endl;
} //end if} //end if
16
Better Binary Search 2
int sorted[ MAX ] = { -12, -6, -2, -1, 0, 1, 2, 16, 44, 120 };int sorted[ MAX ] = { -12, -6, -2, -1, 0, 1, 2, 16, 44, 120 };// here element at index // here element at index 00 is used is usedint binary_search( int first, int last, int key )int binary_search( int first, int last, int key ){ // binary_search{ // binary_search int mid;int mid; if ( first > last ) {if ( first > last ) { return NIL;return NIL; // not in range 0..MAX-1// not in range 0..MAX-1 } //end if} //end if mid = ( first + last ) / 2;mid = ( first + last ) / 2;
if ( key == sorted[ mid ] ) {if ( key == sorted[ mid ] ) { return mid;return mid; }else if( key < sorted[ mid ] ) {}else if( key < sorted[ mid ] ) { return binary_search( first, mid-1, key );return binary_search( first, mid-1, key ); }else{}else{
// key < sorted[ mid ], so: first = mid+1// key < sorted[ mid ], so: first = mid+1 return binary_search( mid+1, last, key );return binary_search( mid+1, last, key ); } // end if} // end if} //end binary_searchs} //end binary_searchs
17
Binary Search in Trees
Searching for a datum Searching for a datum nn in a binary tree is in a binary tree is similar:similar:
Initially a node pointer Initially a node pointer pp is set to the root is set to the root If wanted element If wanted element nn is at p->data, the algorithm is at p->data, the algorithm
ends successfully, returning ends successfully, returning pp Else, if n > p->data, search the right subtree, i.e. Else, if n > p->data, search the right subtree, i.e.
p = p->rightp = p->right Else search the left subtree, Else search the left subtree, p = p->leftp = p->left If If pp is ever null, is ever null, nn is not found; not in the tree! is not found; not in the tree! If tree is not If tree is not balancedbalanced, search in tree can be as , search in tree can be as
slow as linear search slow as linear search O(n)O(n)
18
Search in Graph G
A graph G is a data structure of nodes and A graph G is a data structure of nodes and connecting edgesconnecting edges
Edges may be undirected or directedEdges may be undirected or directed A simplification is: to view undirected edges A simplification is: to view undirected edges
a bi-directionala bi-directional Any node in G may have any number Any node in G may have any number
incident edgesincident edges Unlike a tree, whose nodes have exactly 1 Unlike a tree, whose nodes have exactly 1
incident edge; always!incident edge; always! But a graph G may be unconnected!But a graph G may be unconnected!
19
Search in Graph G From “G has any number incident edges” follows:From “G has any number incident edges” follows:
. . . that G may be unconnected, when there are no incident . . . that G may be unconnected, when there are no incident edges to some nodesedges to some nodes
Hence a Hence a binary searchbinary search for a general G is not possible for a general G is not possible
Unless an ancillary data structure is provided, even the Unless an ancillary data structure is provided, even the linear linear searchsearch would not be possible would not be possible
Work-around: add a field to each node of G, this field is NOT Work-around: add a field to each node of G, this field is NOT logically a part of Glogically a part of G
But this field, named But this field, named fingerfinger, allows an algorithm to weave , allows an algorithm to weave through all nodes of G, in linear fashion, through all nodes of G, in linear fashion, even if unconnectedeven if unconnected
Don’t forget to add a boolean Don’t forget to add a boolean visitedvisited field to avoid repeat visits! field to avoid repeat visits!
21
Hashing For time critical applications that involve searches, or for For time critical applications that involve searches, or for
programs that manipulate extremely large data sets, programs that manipulate extremely large data sets, conventional conventional searching methodssearching methods are inadequate are inadequate
Even binary search on balanced trees or sorted lists may not be Even binary search on balanced trees or sorted lists may not be good enough, i.e. the cost function good enough, i.e. the cost function log(n)log(n) in a table of in a table of nn elements (AKA keys) may be too highelements (AKA keys) may be too high
Wouldn't it be much nicer to determine in a Wouldn't it be much nicer to determine in a single stepsingle step with with certainty most of the times, whether a key is in a table?certainty most of the times, whether a key is in a table?
For large tables we would even be content to complete a search For large tables we would even be content to complete a search after a after a constant numberconstant number of steps > 1, if such a constant is small of steps > 1, if such a constant is small
Hashing Hashing almost accomplishes this ambitious goalalmost accomplishes this ambitious goal
We say We say almostalmost, because the worst possible case for hashing is , because the worst possible case for hashing is unthinkably bad, but occurs rarely while the average statistical unthinkably bad, but occurs rarely while the average statistical performance is excellentperformance is excellent
22
Hashing Intuitively speaking, hashing is an Intuitively speaking, hashing is an oracleoracle. When . When
asked about a key: the oracle provides a swift answerasked about a key: the oracle provides a swift answer
Astonishing about the oracle is that it relinquishes Astonishing about the oracle is that it relinquishes the answer the answer without actually lookingwithout actually looking in the table in the table
It doesn't need to know what is in the table. All the It doesn't need to know what is in the table. All the oracle knows is the size of the table; after all, it must oracle knows is the size of the table; after all, it must return a legal table indexreturn a legal table index
Just like all oracles, it is correct only sometimesJust like all oracles, it is correct only sometimes
Other times it just has no relation to reality. So we Other times it just has no relation to reality. So we must always check, whether the key really resides at must always check, whether the key really resides at the reported index, or perhaps close by?the reported index, or perhaps close by?
23
Hashing Formally Hashing is a method of Hashing is a method of searching for keyssearching for keys in a data structure, in a data structure,
AKA the hash table, by returningAKA the hash table, by returning hash( key ) hash( key )
If the key is found at the predicted index, its location is If the key is found at the predicted index, its location is confirmed and that confirmed and that index = hash( key ) index = hash( key ) is returnedis returned
Else, if the key is not found, and an empty slot is located at Else, if the key is not found, and an empty slot is located at same index, then the same index, then the key is enteredkey is entered, and that index is returned, and that index is returned
Data structure is typically an array of keys; the table is known Data structure is typically an array of keys; the table is known as the hash tableas the hash table
The table size is a The table size is a primeprime number!! Very important number!! Very important
Unique to hashingUnique to hashing: It is a many-to-one mapping, i.e. many keys : It is a many-to-one mapping, i.e. many keys can generate the same index, yet only 1 single key can reside at can generate the same index, yet only 1 single key can reside at that index; we call this a that index; we call this a bucket sizebucket size of 1 of 1
Other bucket sizes are also possibleOther bucket sizes are also possible
24
Hashing Formally The index predicted by, and returned by, the hash The index predicted by, and returned by, the hash
function, is within legal subscript range function, is within legal subscript range 0 .. SIZE-10 .. SIZE-1, , SIZESIZE being that being that primeprime number number
If the location at the predicted index is busy with If the location at the predicted index is busy with another key, we have a another key, we have a collisioncollision
In case of a collision the hashing mechanism must In case of a collision the hashing mechanism must find an alternate entry; known as find an alternate entry; known as collision handlingcollision handling
Collisions can be handled by just adding a fixed delta Collisions can be handled by just adding a fixed delta to the predicted value, but to the predicted value, but modulo table sizemodulo table size
Or to handle a collision, we invoke a Or to handle a collision, we invoke a secondary secondary hash hash function, different from the function, different from the primary primary hash functionhash function
25
Hashing Formally Important to design a clever method for collision Important to design a clever method for collision
handling, to avoid handling, to avoid clustering clustering of entriesof entries
The secondary hash function must return a The secondary hash function must return a delta != 0delta != 0, , as adding a 0 would keep the old index, causing as adding a 0 would keep the old index, causing infinite loopinginfinite looping
The primary hash function must The primary hash function must include 0include 0, as that is a , as that is a legal hash table indexlegal hash table index
Values provided by the secondary hash function Values provided by the secondary hash function must be distinct from the original busy index, but if must be distinct from the original busy index, but if that slot is ever reached again, this means the table is that slot is ever reached again, this means the table is full, resulting in full, resulting in abort abort of the searchof the search
26
Hashing Fill Factor Hashing works well for tables that are only partly Hashing works well for tables that are only partly
filled; performance is unthinkably bad for filled; performance is unthinkably bad for 100% full100% full hash tableshash tables
Depending on the quality of the secondary hash Depending on the quality of the secondary hash function, function, fill factorsfill factors of >75% can be critical of >75% can be critical
For very good secondary functions, a fill factor of > For very good secondary functions, a fill factor of > 90% can still be acceptable90% can still be acceptable
Acceptable for a hash search means, the cost is Acceptable for a hash search means, the cost is O(1)O(1) in Big-O notation, for the average lookup; above all: in Big-O notation, for the average lookup; above all: NOT NOT O(n)O(n)
27
Primary Hash Function --- The The primaryprimary hash function needs to be fast, to keep hash function needs to be fast, to keep
the overall search fastthe overall search fast
Should consider Should consider all parts of its given keyall parts of its given key, for even , for even distribution; ok to make shortcuts, e.g. limit length distribution; ok to make shortcuts, e.g. limit length of an identifier being hashedof an identifier being hashed
E.g. when hashing identifiers of a source program, E.g. when hashing identifiers of a source program, it is good to consider all -or at least many- of the it is good to consider all -or at least many- of the individual characters individual characters
Must randomly distribute index range 0 .. SIZE-1, to Must randomly distribute index range 0 .. SIZE-1, to reduced clusteringreduced clustering
Must cover all indices 0 .. SIZE-1, not skip any Must cover all indices 0 .. SIZE-1, not skip any indexindex
28
Secondary Hash Function The The secondary secondary hash function must also be fast; hash function must also be fast;
may be as simple as a constant valuemay be as simple as a constant value
If use a constant, a prime number is a good If use a constant, a prime number is a good candidatecandidate
Must be different from he primary hash functionMust be different from he primary hash function
May May never be 0never be 0, since it will be added as a delta to , since it will be added as a delta to the current index; adding 0 will keep the previous the current index; adding 0 will keep the previous index!!index!!
After adding a delta to the primary, this new index After adding a delta to the primary, this new index must cover all indices must cover all indices 0 .. SIZE-10 .. SIZE-1, not skip any , not skip any index; same requirement as primary hash functionindex; same requirement as primary hash function
29
Primary Hash Function// hash on string “// hash on string “namename”, null-terminated”, null-terminated// use global constant “// use global constant “PrimePrime”, i.e a small prime #”, i.e a small prime #// use global constant “// use global constant “MaxSymTabMaxSymTab”, a large prime #”, a large prime #unsigned PrimaryHash( char * name )unsigned PrimaryHash( char * name ){ // PrimaryHash{ // PrimaryHash
int index = 0;int index = 0; // name[ index ]// name[ index ]int hash = 0;int hash = 0; // result// result// compute hash, based on key: name// compute hash, based on key: namewhile ( name[ index ] ) {while ( name[ index ] ) {
hash = ( hash+name[ index++ ] * hash = ( hash+name[ index++ ] * PrimePrime ) ) % % MaxSymTabMaxSymTab;;
} //end while} //end whilereturn hash;return hash;
} // PrimaryHash} // PrimaryHash
30
Secondary Hash Function
// hash on string “// hash on string “namename”, null-terminated”, null-terminated// use global constant “// use global constant “PrimePrime”, i.e a small prime #”, i.e a small prime #// use global constant “// use global constant “MaxSymTabMaxSymTab”, a large prime #”, a large prime #unsigned SecondaryHash( char * name )unsigned SecondaryHash( char * name ){// SecondaryHash{// SecondaryHash int nameindex = 0;int nameindex = 0; int hash = 0;int hash = 0; // based on key: “name”// based on key: “name” while ( name[ nameindex ] ) {while ( name[ nameindex ] ) { hash = ( hash+name[ nameindex++ ] * hash = ( hash+name[ nameindex++ ] * Prime Prime ) %) %
( ( MaxSymTabMaxSymTab - 2- 2 ) ) + 1+ 1;;// could use // could use twin-primetwin-prime for MaxSymTab-2 for MaxSymTab-2
} //end while} //end while return hash;return hash;}// SecondaryHash}// SecondaryHash
31
Enter Name in Symbol Table void search( char * name )void search( char * name ) // single hash// single hash { // search { // search unsigned HashIndex = unsigned HashIndex = PrimaryHash( name );PrimaryHash( name ); // 0 <= hash < MaxSymTab // 0 <= hash < MaxSymTab unsigned OldIndex = Hashindex; unsigned OldIndex = Hashindex; // WrapAround check// WrapAround check bool Found = false; bool Found = false; // Found = name in symbol table // Found = name in symbol table bool WrapAround;bool WrapAround; // back to original slot? // back to original slot? unsigned chain = 0;unsigned chain = 0; // Maximum chaining of probes // Maximum chaining of probes
// so far no Collision for this search // so far no Collision for this search while ( (!Found) && !WrapAround ) {while ( (!Found) && !WrapAround ) {
Found = True; Found = True; // // GuessGuess, correct later , correct later if ( !(SymTab[ HashIndex ][ 0 ]) ) {// empty slot; use it if ( !(SymTab[ HashIndex ][ 0 ]) ) {// empty slot; use it strcpy( SymTab[ HashIndex ], name );strcpy( SymTab[ HashIndex ], name ); Available--;Available--; // Found stays true// Found stays true }else if ( strcmp( SymTab[ HashIndex ], name ) ) { // slot busy}else if ( strcmp( SymTab[ HashIndex ], name ) ) { // slot busy Found = False;Found = False; Collisions++;Collisions++; // global counter// global counter chain++;chain++; // just for statistics// just for statistics Maxchain = chain > Maxchain : chain ? Maxchain;Maxchain = chain > Maxchain : chain ? Maxchain;
HashIndex = ( HashIndex + HashIndex = ( HashIndex + PrimePrime ) % MaxSymTab; // Secondary Hash? ) % MaxSymTab; // Secondary Hash? WrapAround = HashIndex == OldIndex;WrapAround = HashIndex == OldIndex; } //end if; else same, Found == True } //end if; else same, Found == True } //end while} //end while CHECK( WrapAround, " <><> Hash table full <><>\n” );CHECK( WrapAround, " <><> Hash table full <><>\n” ); } // search unsigned} // search unsigned
32
Enter Name in Symbol Table void search( char * name )void search( char * name ) // double hash// double hash { // search { // search unsigned HashIndex = unsigned HashIndex = PrimaryHash( namePrimaryHash( name );// 0 <= hash < MaxSymTab );// 0 <= hash < MaxSymTab unsigned OldIndex = Hashindex; unsigned OldIndex = Hashindex; // WrapAround check// WrapAround check bool Found = false; bool Found = false; // Found = name in symbol table // Found = name in symbol table bool WrapAround;bool WrapAround; // back to original slot? // back to original slot? unsigned chain = 0;unsigned chain = 0; // Maximum chaining of probes // Maximum chaining of probes
// so far no Collision for this search // so far no Collision for this search while ( (!Found) && !WrapAround ) {while ( (!Found) && !WrapAround ) {
Found = True; Found = True; // Guess, correct later // Guess, correct later if ( !(SymTab[ HashIndex ][ 0 ]) ) {// empty slot; use it if ( !(SymTab[ HashIndex ][ 0 ]) ) {// empty slot; use it strcpy( SymTab[ HashIndex ], name );strcpy( SymTab[ HashIndex ], name ); Available--; }Available--; } }else if ( strcmp( SymTab[ HashIndex ], name ) ) { // slot busy}else if ( strcmp( SymTab[ HashIndex ], name ) ) { // slot busy Found = False;Found = False; Collisions++;Collisions++; // global counter// global counter chain++;chain++; // just for statistics// just for statistics Maxchain = chain > Maxchain : chain ? Maxchain;Maxchain = chain > Maxchain : chain ? Maxchain;
HashIndex = ( HashIndex + HashIndex = ( HashIndex + SecondaryHash( name )SecondaryHash( name ) ) % MaxSymTab; ) % MaxSymTab; WrapAround = HashIndex == OldIndex;WrapAround = HashIndex == OldIndex; } //end if; else same, Found == True } //end if; else same, Found == True } //end while} //end while CHECK( WrapAround, " <><> Hash table full <><>\n” );CHECK( WrapAround, " <><> Hash table full <><>\n” ); } // search unsigned} // search unsigned
33
Single or Double Hash?
For demonstration purposes: make single- versus For demonstration purposes: make single- versus double-hashing programmabledouble-hashing programmable
If command line argument “d” is provided, then use If command line argument “d” is provided, then use double-hashing, e.g. command:double-hashing, e.g. command:
a.out d < infile > resulta.out d < infile > result
Indicates that double hashing is desiredIndicates that double hashing is desired
See actual source code:See actual source code:
34
Single or Double Hash? if ( 1 == argc ) {if ( 1 == argc ) {
// no added arg given// no added arg given
single = true;single = true;
} else if ( argc >= 2 ) {} else if ( argc >= 2 ) {
if ( !strcmp( argv[ 1 ], "d" ) ) {if ( !strcmp( argv[ 1 ], "d" ) ) {
single = false;single = false;
}else{}else{
single = true;single = true;
} //end if} //end if
} //end if} //end if
// now we know: single hashing or double hashing // now we know: single hashing or double hashing function?function?
35
Single or Double Hash in Search()?void search( char * name )void search( char * name ){ // search{ // search
. . . . . . while ( ( !found ) && ( !WrapAround ) ) {while ( ( !found ) && ( !WrapAround ) ) {
found = true; // educated guessfound = true; // educated guess if ( ! sym_tab[ HashIndex ][ 0 ] ) { // found empty slotif ( ! sym_tab[ HashIndex ][ 0 ] ) { // found empty slot strcpy( sym_tab[ HashIndex ], name );strcpy( sym_tab[ HashIndex ], name ); Available--; // one less symbmol table slot availableAvailable--; // one less symbmol table slot available used++; // one more symbol table slot used upused++; // one more symbol table slot used up }else if ( strcmp( sym_tab[ HashIndex ], name ) ) {}else if ( strcmp( sym_tab[ HashIndex ], name ) ) { // another ID sits there// another ID sits there found = false;found = false; Collisions++; // global counterCollisions++; // global counter chain++;chain++; if ( single ) {if ( single ) { HashIndex = ( HashIndex + HashIndex = ( HashIndex + PRIMEPRIME ) % Max_Sym_Tab; ) % Max_Sym_Tab; }else{}else{ HashIndex = ( HashIndex + HashIndex = ( HashIndex + SecondaryHash( name )SecondaryHash( name ) ) % Max_Sym_Tab; ) % Max_Sym_Tab; } //end if} //end if WrapAround = HashIndex == Old_Inx;WrapAround = HashIndex == Old_Inx; MaxChain = chain > MaxChain ? chain : MaxChain;MaxChain = chain > MaxChain ? chain : MaxChain; } //end if} //end if
} //end while} //end while. . .. . .
} //end search} //end search
36
Scan Complete Inputvoid enter_names( void )void enter_names( void ){ // enter_names{ // enter_names char c = ' ';char c = ' '; // anything except EOF// anything except EOF char name[ MAX_L + 2 ];char name[ MAX_L + 2 ]; // next string, add 2 slack// next string, add 2 slack unsigned length;unsigned length; // track chars in ident// track chars in ident while ( c != EOF ) { // the whole filewhile ( c != EOF ) { // the whole file while ( ( c != EOF ) && ( ! isalpha( c ) ) ) {while ( ( c != EOF ) && ( ! isalpha( c ) ) ) { c = getchar();c = getchar(); // skip non-alpha// skip non-alpha } //end while} //end while length = 0;length = 0; // reset for each ID// reset for each ID while ( ( c != EOF ) && ( isalpha( c ) ) ) {while ( ( c != EOF ) && ( isalpha( c ) ) ) { if ( length < MAX_L ) {if ( length < MAX_L ) { name[ length++ ] = c;name[ length++ ] = c; } //end if // skip others} //end if // skip others c = getchar();c = getchar(); } //end while} //end while name[ length ] = '\0';name[ length ] = '\0'; TRACK( name );TRACK( name ); // also for statistics// also for statistics search( name );search( name ); // enter, only if not found!// enter, only if not found! } //end while} //end while} //end enter_names} //end enter_names
37
Initialize Hash
void init( int argc, char * argv[] )void init( int argc, char * argv[] ){ //init{ //init if ( 1 == argc ) {if ( 1 == argc ) { // no added arg given// no added arg given single = true;single = true; } else if ( argc >= 2 ) {} else if ( argc >= 2 ) { if ( !strcmp( argv[ 1 ], "d" ) ) {if ( !strcmp( argv[ 1 ], "d" ) ) { single = false;single = false; }else{}else{ single = true;single = true; } //end if} //end if } //end if} //end if printf( "want %s\n", single ? "single hash" : "double hash" );printf( "want %s\n", single ? "single hash" : "double hash" ); for ( int i = 0; i < Max_Sym_Tab; i++ ) {for ( int i = 0; i < Max_Sym_Tab; i++ ) { sym_tab[ i ][ 0 ] = '\0';sym_tab[ i ][ 0 ] = '\0'; } //end for} //end for MaxChain = 0;MaxChain = 0; Collisions = 0;Collisions = 0; printf( "Reading input:\n" );printf( "Reading input:\n" );} //end init} //end init
38
Terminate Hash
void terminate( void )void terminate( void ){ // terminate{ // terminate printf( " Maximum chaining = %d\n", MaxChain );printf( " Maximum chaining = %d\n", MaxChain ); printf( " Number of Collis. = %d\n", Collisions );printf( " Number of Collis. = %d\n", Collisions ); printf( " Number slots Avail = %d\0", Available );printf( " Number slots Avail = %d\0", Available );## ifdef DEBUG ifdef DEBUG for ( int i = 0; i < Max_Sym_Tab; i++ ) {for ( int i = 0; i < Max_Sym_Tab; i++ ) { if ( sym_tab[ i ][ 0 ] ) {if ( sym_tab[ i ][ 0 ] ) { printf( " %4d %s \n", i, sym_tab[ i ] );printf( " %4d %s \n", i, sym_tab[ i ] ); } //end if} //end if } //end for} //end for# endif // DEBUG# endif // DEBUG} //end terminate} //end terminate
39
Test Primary vs. SecondaryTest Primary vs. Secondary Running 2 hashing searchesRunning 2 hashing searches
The first one using a The first one using a primary hash functionprimary hash function only: only: collisions are handled by finding another entry with collisions are handled by finding another entry with fixed (fixed (PrimePrime) offset; see p. 31) offset; see p. 31
The second one using also a secondary hash The second one using also a secondary hash function: function: SecondaryHash( name );SecondaryHash( name ); collisions are collisions are handled by adding the new hash value, guaranteed handled by adding the new hash value, guaranteed to be != 0; yielding another entry; see p. 32to be != 0; yielding another entry; see p. 32
Both exercise the identical, small input, a file of Both exercise the identical, small input, a file of 21807 characters, 446 distinct identifiers21807 characters, 446 distinct identifiers
Results are shown next. Despite small sampling Results are shown next. Despite small sampling space, there is a noticeable difference in space, there is a noticeable difference in performance quality!performance quality!
40
Test Secondary HashTest Secondary Hash Entries for Entries for single hashsingle hash= 883= 883 Twin prime, 2 smaller = 881Twin prime, 2 smaller = 881 Symbols used = 436Symbols used = 436 Fill factor = 49 percentFill factor = 49 percent Maximum chaining = 73Maximum chaining = 73 <- pretty bad chaining<- pretty bad chaining Number of Collisions = Number of Collisions = 68606860 Number slots Available = 447Number slots Available = 447 number of chars read = 21808number of chars read = 21808 number of ids scanned = 2320number of ids scanned = 2320 number of noise chars = 13087number of noise chars = 13087
Entries for Entries for double hashdouble hash= 883= 883 Twin prime, 2 smaller = 881Twin prime, 2 smaller = 881 Symbols used = 436Symbols used = 436 Fill factor = 49 percentFill factor = 49 percent Maximum chaining = 10Maximum chaining = 10 Number of Collisions = Number of Collisions = 12511251 <- ~20% of single hash<- ~20% of single hash Number slots Available = 447Number slots Available = 447 number of chars read = 21808number of chars read = 21808 number of ids scanned = 2320number of ids scanned = 2320 number of noise chars = 13087number of noise chars = 13087
41
Brent Heuristic Outline Brent's idea requires Brent's idea requires more work per insertionmore work per insertion than than
single and double hashing. But if a key is used single and double hashing. But if a key is used frequently as is the case in typical programs, the frequently as is the case in typical programs, the complex insertion effort may turn out to be a good complex insertion effort may turn out to be a good investment after allinvestment after all
To assess its relocation cost, we need to fetch the To assess its relocation cost, we need to fetch the old key and use the secondary hash function to old key and use the secondary hash function to find a relocation slot for the old keyfind a relocation slot for the old key
This would increase the cost for retrieving the old This would increase the cost for retrieving the old key, but, if this addition costs less than finding an key, but, if this addition costs less than finding an alternate entry for the new key, the net result is a alternate entry for the new key, the net result is a savingsaving
42
Hash Conclusion Symbol tables in a compiler are generally large, to Symbol tables in a compiler are generally large, to
handle large source programhandle large source program
Regular searches, such as Regular searches, such as binary binary or or insertioninsertion sorts sorts are not sufficient, when fast compilation speeds are are not sufficient, when fast compilation speeds are neededneeded
Hashing will be neededHashing will be needed to look up symbols to look up symbols
The fill factor of the hash table generally is way The fill factor of the hash table generally is way under 90%under 90%
Some techniques exist even to dynamically Some techniques exist even to dynamically increase table sizes, typically requiring relocation increase table sizes, typically requiring relocation of keys entered so farof keys entered so far
Dynamic array resizing is not discussed in CS 163Dynamic array resizing is not discussed in CS 163