1 CS 163 Data Structures Chapter 4 Searching Herbert G. Mayer, PSU Status 5/28/2015.

1

CS 163Data Structures

Chapter 4Searching

Herbert G. Mayer, PSUHerbert G. Mayer, PSUStatus 5/28/2015Status 5/28/2015

2

Syllabus

Goal of SearchGoal of Search Cost of SearchCost of Search Linear SearchLinear Search Binary Search in ListBinary Search in List Search in TreeSearch in Tree Search in GraphSearch in Graph Search via HashingSearch via Hashing Test Primary vs. SecondaryTest Primary vs. Secondary Brent Heuristic OutlineBrent Heuristic Outline

3

Goal of Search

Searching is the process of inspecting a Searching is the process of inspecting a data data structurestructure for a specified for a specified datumdatum

Goal of a search may be to determine, whether the Goal of a search may be to determine, whether the datum is included at all the data being searched datum is included at all the data being searched Upon finding the datum, the search may end successfullyUpon finding the datum, the search may end successfully Frequently the goal of a search is to know the specific Frequently the goal of a search is to know the specific

location location or or indexindex, where the datum was found, where the datum was found If not found, logically the complete data structure had to be If not found, logically the complete data structure had to be

searchedsearched Yet in case of sorted data structures this may not be as Yet in case of sorted data structures this may not be as

costly as it sounds!costly as it sounds!

Another goal of a search may be to determine the Another goal of a search may be to determine the number of occurrences of a datum in the data number of occurrences of a datum in the data structure being inspectedstructure being inspected

4

Cost of Search In linear search, the cost of find a datum is In linear search, the cost of find a datum is nn, with , with n n being being

the size of the data structurethe size of the data structure A common sense way of stating this: you may have to look at A common sense way of stating this: you may have to look at

all all nn elements in linear search to locate an item elements in linear search to locate an item In Big-O notation we say, the cost function for 1 lookup in a In Big-O notation we say, the cost function for 1 lookup in a

linear search is linear search is O(n)O(n) It doesn’t matter that sometimes the first try may be It doesn’t matter that sometimes the first try may be

successful, the cost function is successful, the cost function is O(n)O(n)

For the binary search, the cost is For the binary search, the cost is O(logO(log22(n)),(n)), since the worst since the worst case, to find an item, or prove its absence, is to inspect case, to find an item, or prove its absence, is to inspect loglog22(n)(n) elements elements

For a balanced tree the cost is For a balanced tree the cost is O(logO(log22(n))(n)), but for an , but for an unbalanced it can be as bad asunbalanced it can be as bad as O(n) O(n)

It is redundant to say It is redundant to say loglog22 since in Big-O notation the ratio of since in Big-O notation the ratio of loglog22 by by loglog1010 is but a constant, i.e. “noise” in Big-O terms is but a constant, i.e. “noise” in Big-O terms

5

Linear Search

Linear searchLinear search makes no assumption on the order, in makes no assumption on the order, in which data are stored in a data structure, the data which data are stored in a data structure, the data don’t need to be sorted!don’t need to be sorted!

The data structure is often an array; but could be a The data structure is often an array; but could be a linked list or any equivalent data structurelinked list or any equivalent data structure

If it is known that data stored are unique (occur at If it is known that data stored are unique (occur at most once) then search may terminate at first most once) then search may terminate at first successful findsuccessful find

To prove that a datum is not included, the complete To prove that a datum is not included, the complete data structure must be inspected in data structure must be inspected in linear searchlinear search

Hence the cost function for a linear search is O(n), Hence the cost function for a linear search is O(n), with with nn being size of data structure being size of data structure

If datum may be included more than once, then even If datum may be included more than once, then even after successful find the complete data structure must after successful find the complete data structure must be inspected; result is be inspected; result is count ofcount of occurrences occurrences

6

Linear Search

#include <stdio.h>#include <stdio.h>

#include <iostream.h>#include <iostream.h>

#define MAX 20#define MAX 20

// slot at index 0 is NOT used to store data// slot at index 0 is NOT used to store data

// instead, index 0 used to indicate: "not found"// instead, index 0 used to indicate: "not found"

int unsorted[ MAX ] =int unsorted[ MAX ] =

{{

0, 9, -99, 99, 999, -999, 7, -7, 77, 88,0, 9, -99, 99, 999, -999, 7, -7, 77, 88,

1, 2, 9, 8, 3, 4, 7, 16, 77, 221, 2, 9, 8, 3, 4, 7, 16, 77, 22

};};

7

Linear Search

// assumes global array unsorted[] with MAX elements// assumes global array unsorted[] with MAX elements

// Note that element unsorted[ 0 ] is NOT used// Note that element unsorted[ 0 ] is NOT used

// return 0 if element n is NOT found in unsorted[]// return 0 if element n is NOT found in unsorted[]

// searching n, return index of first occurrence// searching n, return index of first occurrence

int linear( int n )int linear( int n )

{ // linear{ // linear

// start linear search at index 1; index 0 unused// start linear search at index 1; index 0 unused

for ( int i = 1; i < MAX; i++ ) {for ( int i = 1; i < MAX; i++ ) {

if ( unsorted[ i ] == n ) {if ( unsorted[ i ] == n ) {

// found it// found it

return i;return i;

} //end if} //end if

} //end for} //end for

return 0;return 0; // 0 used for “not found!”// 0 used for “not found!”

} //end linear} //end linear

8

Binary Search in List, Play Game

Practice the Practice the Guessing GameGuessing Game in class: in class:

Students think of an integer number Students think of an integer number between 1 and 1,000; keep secretbetween 1 and 1,000; keep secret

I bet I can guess your secret number in I bet I can guess your secret number in 10 or less tries10 or less tries

Exactly that secret number!Exactly that secret number!

No cheating No cheating and no changing of mind! and no changing of mind!

Why does that work?Why does that work?

Binary search cost is Binary search cost is O( logO( log22( n ) )( n ) )

9

Binary Search in List

Binary search assumes that the data Binary search assumes that the data structure is structure is sortedsorted, generally in some in , generally in some in array[] or equivalent data structurearray[] or equivalent data structure

If data are unique –occurring at most once– If data are unique –occurring at most once– then we say the sorted array holds data in then we say the sorted array holds data in ascendingascending –or descending – order –or descending – order

Else we say: “non-ascending” and “non-Else we say: “non-ascending” and “non-descending”; is a bit awkwarddescending”; is a bit awkward

The method is:The method is:

10

Binary Search in List: Method

Use: Use: left left and and rightright indices; indices; array[]array[] sorted in sorted in ascendingascending order, wanted element is: order, wanted element is: nn

Initially Initially left = 0left = 0 in C++, in C++, rightright is index of last is index of last array element: array element: right = size - 1right = size - 1

Loop: Repeatedly do the following “guess” Loop: Repeatedly do the following “guess” and verify:and verify:

Guess, that Guess, that nn be in the be in the midmid, right between , right between leftleft and and rightright, i.e. , i.e. mid = ( left + right ) / 2mid = ( left + right ) / 2

If If array[ mid ] = narray[ mid ] = n the algorithm stops the algorithm stops

11

Binary Search in List: Method

Else adjust the indices Else adjust the indices leftleft or or rightright: :

If wanted element is greater than the found If wanted element is greater than the found one, search continues, with one, search continues, with left = mid + 1left = mid + 1

Else continue with Else continue with right = mid - 1right = mid - 1

When When leftleft is beyond is beyond rightright, , nn is not found is not found and algorithm stopsand algorithm stops

Else continue at loopsElse continue at loops

12

Binary Search 1

#include <stdio.h>#include <stdio.h>

#include <iostream.h>#include <iostream.h>

#define MAX 20#define MAX 20

// also slot at // also slot at index 0 not usedindex 0 not used in this binary search! in this binary search!

// simply a convention used here!// simply a convention used here!

int sorted[ MAX ] =int sorted[ MAX ] =

{{

00, -88, -77, -12, -4, -2, -1, 0, 1, 4,, -88, -77, -12, -4, -2, -1, 0, 1, 4,

5, 14, 17, 66, 77, 99, 100, 1001, 2015, 99995, 14, 17, 66, 77, 99, 100, 1001, 2015, 9999

};};

13

Binary Search 1

// binary search for element n in array sorted[]// binary search for element n in array sorted[]

// sorted[] is global, sized MAX is known globally// sorted[] is global, sized MAX is known globally

// element sorted[ // element sorted[ 00 ] is NOT used ] is NOT used

// if n not found, return // if n not found, return 00; else return index of n; else return index of n

int binary( int n )int binary( int n )

{ // binary{ // binary

int left = 1; // don't use index 0int left = 1; // don't use index 0

int right = MAX-1; // left, right approachint right = MAX-1; // left, right approach

int mid = ( left + right ) / 2;int mid = ( left + right ) / 2;

14

Binary Search 1

int binary( int n ) // search for n, return 0 if not found{ // binary

int left = 1; // don't use index 0int right = MAX-1; // left, right approachint mid = ( left + right ) / 2;while( ( sorted[ mid ] != n ) && ( left < right ) ) {while( ( sorted[ mid ] != n ) && ( left < right ) ) {

if ( sorted[ mid ] > n ) {if ( sorted[ mid ] > n ) { right = mid – 1;right = mid – 1; // search in "lower" part// search in "lower" part }else if ( sorted[ mid ] < n ) {}else if ( sorted[ mid ] < n ) { left = mid + 1;left = mid + 1; // search in "upper" part// search in "upper" part }else{}else{ break;break; // found n// found n } //end if} //end if mid = ( left + right ) / 2;mid = ( left + right ) / 2;

} //end while} //end whilereturn sorted[ mid ] == n ? mid : 0;return sorted[ mid ] == n ? mid : 0;

} //end binary} //end binary

15

Binary Search 1

// assume array sorted[ MAX ]// assume array sorted[ MAX ]// index range 0 .. MAX-1// index range 0 .. MAX-1// binary() return 0 means: n not in sorted[ 1..MAX-1 ]// binary() return 0 means: n not in sorted[ 1..MAX-1 ]if ( i = binary( num ) ) {if ( i = binary( num ) ) {

// yes, did find n at sorted[i]// yes, did find n at sorted[i]cout << ”Found ” << num << ”in slot ” << i << endl;cout << ”Found ” << num << ”in slot ” << i << endl;

}else{}else{// i is // i is 00, meaning: element num was not found, meaning: element num was not found

cout << ”Element ” << num << ”Not in sorted[]” << cout << ”Element ” << num << ”Not in sorted[]” << endl;endl;


16

Better Binary Search 2

int sorted[ MAX ] = { -12, -6, -2, -1, 0, 1, 2, 16, 44, 120 };int sorted[ MAX ] = { -12, -6, -2, -1, 0, 1, 2, 16, 44, 120 };// here element at index // here element at index 00 is used is usedint binary_search( int first, int last, int key )int binary_search( int first, int last, int key ){ // binary_search{ // binary_search int mid;int mid; if ( first > last ) {if ( first > last ) { return NIL;return NIL; // not in range 0..MAX-1// not in range 0..MAX-1 } //end if} //end if mid = ( first + last ) / 2;mid = ( first + last ) / 2;

if ( key == sorted[ mid ] ) {if ( key == sorted[ mid ] ) { return mid;return mid; }else if( key < sorted[ mid ] ) {}else if( key < sorted[ mid ] ) { return binary_search( first, mid-1, key );return binary_search( first, mid-1, key ); }else{}else{

// key < sorted[ mid ], so: first = mid+1// key < sorted[ mid ], so: first = mid+1 return binary_search( mid+1, last, key );return binary_search( mid+1, last, key ); } // end if} // end if} //end binary_searchs} //end binary_searchs

17

Binary Search in Trees

Searching for a datum Searching for a datum nn in a binary tree is in a binary tree is similar:similar:

Initially a node pointer Initially a node pointer pp is set to the root is set to the root If wanted element If wanted element nn is at p->data, the algorithm is at p->data, the algorithm

ends successfully, returning ends successfully, returning pp Else, if n > p->data, search the right subtree, i.e. Else, if n > p->data, search the right subtree, i.e.

p = p->rightp = p->right Else search the left subtree, Else search the left subtree, p = p->leftp = p->left If If pp is ever null, is ever null, nn is not found; not in the tree! is not found; not in the tree! If tree is not If tree is not balancedbalanced, search in tree can be as , search in tree can be as

slow as linear search slow as linear search O(n)O(n)

18

Search in Graph G

A graph G is a data structure of nodes and A graph G is a data structure of nodes and connecting edgesconnecting edges

Edges may be undirected or directedEdges may be undirected or directed A simplification is: to view undirected edges A simplification is: to view undirected edges

a bi-directionala bi-directional Any node in G may have any number Any node in G may have any number

incident edgesincident edges Unlike a tree, whose nodes have exactly 1 Unlike a tree, whose nodes have exactly 1

incident edge; always!incident edge; always! But a graph G may be unconnected!But a graph G may be unconnected!

19

Search in Graph G From “G has any number incident edges” follows:From “G has any number incident edges” follows:

. . . that G may be unconnected, when there are no incident . . . that G may be unconnected, when there are no incident edges to some nodesedges to some nodes

Hence a Hence a binary searchbinary search for a general G is not possible for a general G is not possible

Unless an ancillary data structure is provided, even the Unless an ancillary data structure is provided, even the linear linear searchsearch would not be possible would not be possible

Work-around: add a field to each node of G, this field is NOT Work-around: add a field to each node of G, this field is NOT logically a part of Glogically a part of G

But this field, named But this field, named fingerfinger, allows an algorithm to weave , allows an algorithm to weave through all nodes of G, in linear fashion, through all nodes of G, in linear fashion, even if unconnectedeven if unconnected

Don’t forget to add a boolean Don’t forget to add a boolean visitedvisited field to avoid repeat visits! field to avoid repeat visits!

20

Hashing,Hashing,

Single and Double HashingSingle and Double Hashing

21

Hashing For time critical applications that involve searches, or for For time critical applications that involve searches, or for

programs that manipulate extremely large data sets, programs that manipulate extremely large data sets, conventional conventional searching methodssearching methods are inadequate are inadequate

Even binary search on balanced trees or sorted lists may not be Even binary search on balanced trees or sorted lists may not be good enough, i.e. the cost function good enough, i.e. the cost function log(n)log(n) in a table of in a table of nn elements (AKA keys) may be too highelements (AKA keys) may be too high

Wouldn't it be much nicer to determine in a Wouldn't it be much nicer to determine in a single stepsingle step with with certainty most of the times, whether a key is in a table?certainty most of the times, whether a key is in a table?

For large tables we would even be content to complete a search For large tables we would even be content to complete a search after a after a constant numberconstant number of steps > 1, if such a constant is small of steps > 1, if such a constant is small

Hashing Hashing almost accomplishes this ambitious goalalmost accomplishes this ambitious goal

We say We say almostalmost, because the worst possible case for hashing is , because the worst possible case for hashing is unthinkably bad, but occurs rarely while the average statistical unthinkably bad, but occurs rarely while the average statistical performance is excellentperformance is excellent

22

Hashing Intuitively speaking, hashing is an Intuitively speaking, hashing is an oracleoracle. When . When

asked about a key: the oracle provides a swift answerasked about a key: the oracle provides a swift answer

Astonishing about the oracle is that it relinquishes Astonishing about the oracle is that it relinquishes the answer the answer without actually lookingwithout actually looking in the table in the table

It doesn't need to know what is in the table. All the It doesn't need to know what is in the table. All the oracle knows is the size of the table; after all, it must oracle knows is the size of the table; after all, it must return a legal table indexreturn a legal table index

Just like all oracles, it is correct only sometimesJust like all oracles, it is correct only sometimes

Other times it just has no relation to reality. So we Other times it just has no relation to reality. So we must always check, whether the key really resides at must always check, whether the key really resides at the reported index, or perhaps close by?the reported index, or perhaps close by?

23

Hashing Formally Hashing is a method of Hashing is a method of searching for keyssearching for keys in a data structure, in a data structure,

AKA the hash table, by returningAKA the hash table, by returning hash( key ) hash( key )

If the key is found at the predicted index, its location is If the key is found at the predicted index, its location is confirmed and that confirmed and that index = hash( key ) index = hash( key ) is returnedis returned

Else, if the key is not found, and an empty slot is located at Else, if the key is not found, and an empty slot is located at same index, then the same index, then the key is enteredkey is entered, and that index is returned, and that index is returned

Data structure is typically an array of keys; the table is known Data structure is typically an array of keys; the table is known as the hash tableas the hash table

The table size is a The table size is a primeprime number!! Very important number!! Very important

Unique to hashingUnique to hashing: It is a many-to-one mapping, i.e. many keys : It is a many-to-one mapping, i.e. many keys can generate the same index, yet only 1 single key can reside at can generate the same index, yet only 1 single key can reside at that index; we call this a that index; we call this a bucket sizebucket size of 1 of 1

Other bucket sizes are also possibleOther bucket sizes are also possible

24

Hashing Formally The index predicted by, and returned by, the hash The index predicted by, and returned by, the hash

function, is within legal subscript range function, is within legal subscript range 0 .. SIZE-10 .. SIZE-1, , SIZESIZE being that being that primeprime number number

If the location at the predicted index is busy with If the location at the predicted index is busy with another key, we have a another key, we have a collisioncollision

In case of a collision the hashing mechanism must In case of a collision the hashing mechanism must find an alternate entry; known as find an alternate entry; known as collision handlingcollision handling

Collisions can be handled by just adding a fixed delta Collisions can be handled by just adding a fixed delta to the predicted value, but to the predicted value, but modulo table sizemodulo table size

Or to handle a collision, we invoke a Or to handle a collision, we invoke a secondary secondary hash hash function, different from the function, different from the primary primary hash functionhash function

25

Hashing Formally Important to design a clever method for collision Important to design a clever method for collision

handling, to avoid handling, to avoid clustering clustering of entriesof entries

The secondary hash function must return a The secondary hash function must return a delta != 0delta != 0, , as adding a 0 would keep the old index, causing as adding a 0 would keep the old index, causing infinite loopinginfinite looping

The primary hash function must The primary hash function must include 0include 0, as that is a , as that is a legal hash table indexlegal hash table index

Values provided by the secondary hash function Values provided by the secondary hash function must be distinct from the original busy index, but if must be distinct from the original busy index, but if that slot is ever reached again, this means the table is that slot is ever reached again, this means the table is full, resulting in full, resulting in abort abort of the searchof the search

26

Hashing Fill Factor Hashing works well for tables that are only partly Hashing works well for tables that are only partly

filled; performance is unthinkably bad for filled; performance is unthinkably bad for 100% full100% full hash tableshash tables

Depending on the quality of the secondary hash Depending on the quality of the secondary hash function, function, fill factorsfill factors of >75% can be critical of >75% can be critical

For very good secondary functions, a fill factor of > For very good secondary functions, a fill factor of > 90% can still be acceptable90% can still be acceptable

Acceptable for a hash search means, the cost is Acceptable for a hash search means, the cost is O(1)O(1) in Big-O notation, for the average lookup; above all: in Big-O notation, for the average lookup; above all: NOT NOT O(n)O(n)

27

Primary Hash Function --- The The primaryprimary hash function needs to be fast, to keep hash function needs to be fast, to keep

the overall search fastthe overall search fast

Should consider Should consider all parts of its given keyall parts of its given key, for even , for even distribution; ok to make shortcuts, e.g. limit length distribution; ok to make shortcuts, e.g. limit length of an identifier being hashedof an identifier being hashed

E.g. when hashing identifiers of a source program, E.g. when hashing identifiers of a source program, it is good to consider all -or at least many- of the it is good to consider all -or at least many- of the individual characters individual characters

Must randomly distribute index range 0 .. SIZE-1, to Must randomly distribute index range 0 .. SIZE-1, to reduced clusteringreduced clustering

Must cover all indices 0 .. SIZE-1, not skip any Must cover all indices 0 .. SIZE-1, not skip any indexindex

28

Secondary Hash Function The The secondary secondary hash function must also be fast; hash function must also be fast;

may be as simple as a constant valuemay be as simple as a constant value

If use a constant, a prime number is a good If use a constant, a prime number is a good candidatecandidate

Must be different from he primary hash functionMust be different from he primary hash function

May May never be 0never be 0, since it will be added as a delta to , since it will be added as a delta to the current index; adding 0 will keep the previous the current index; adding 0 will keep the previous index!!index!!

After adding a delta to the primary, this new index After adding a delta to the primary, this new index must cover all indices must cover all indices 0 .. SIZE-10 .. SIZE-1, not skip any , not skip any index; same requirement as primary hash functionindex; same requirement as primary hash function

29

Primary Hash Function// hash on string “// hash on string “namename”, null-terminated”, null-terminated// use global constant “// use global constant “PrimePrime”, i.e a small prime #”, i.e a small prime #// use global constant “// use global constant “MaxSymTabMaxSymTab”, a large prime #”, a large prime #unsigned PrimaryHash( char * name )unsigned PrimaryHash( char * name ){ // PrimaryHash{ // PrimaryHash

int index = 0;int index = 0; // name[ index ]// name[ index ]int hash = 0;int hash = 0; // result// result// compute hash, based on key: name// compute hash, based on key: namewhile ( name[ index ] ) {while ( name[ index ] ) {

hash = ( hash+name[ index++ ] * hash = ( hash+name[ index++ ] * PrimePrime ) ) % % MaxSymTabMaxSymTab;;

} //end while} //end whilereturn hash;return hash;

} // PrimaryHash} // PrimaryHash

30

Secondary Hash Function

// hash on string “// hash on string “namename”, null-terminated”, null-terminated// use global constant “// use global constant “PrimePrime”, i.e a small prime #”, i.e a small prime #// use global constant “// use global constant “MaxSymTabMaxSymTab”, a large prime #”, a large prime #unsigned SecondaryHash( char * name )unsigned SecondaryHash( char * name ){// SecondaryHash{// SecondaryHash int nameindex = 0;int nameindex = 0; int hash = 0;int hash = 0; // based on key: “name”// based on key: “name” while ( name[ nameindex ] ) {while ( name[ nameindex ] ) { hash = ( hash+name[ nameindex++ ] * hash = ( hash+name[ nameindex++ ] * Prime Prime ) %) %

( ( MaxSymTabMaxSymTab - 2- 2 ) ) + 1+ 1;;// could use // could use twin-primetwin-prime for MaxSymTab-2 for MaxSymTab-2

} //end while} //end while return hash;return hash;}// SecondaryHash}// SecondaryHash

31

Enter Name in Symbol Table void search( char * name )void search( char * name ) // single hash// single hash { // search { // search unsigned HashIndex = unsigned HashIndex = PrimaryHash( name );PrimaryHash( name ); // 0 <= hash < MaxSymTab // 0 <= hash < MaxSymTab unsigned OldIndex = Hashindex; unsigned OldIndex = Hashindex; // WrapAround check// WrapAround check bool Found = false; bool Found = false; // Found = name in symbol table // Found = name in symbol table bool WrapAround;bool WrapAround; // back to original slot? // back to original slot? unsigned chain = 0;unsigned chain = 0; // Maximum chaining of probes // Maximum chaining of probes

// so far no Collision for this search // so far no Collision for this search while ( (!Found) && !WrapAround ) {while ( (!Found) && !WrapAround ) {

Found = True; Found = True; // // GuessGuess, correct later , correct later if ( !(SymTab[ HashIndex ][ 0 ]) ) {// empty slot; use it if ( !(SymTab[ HashIndex ][ 0 ]) ) {// empty slot; use it strcpy( SymTab[ HashIndex ], name );strcpy( SymTab[ HashIndex ], name ); Available--;Available--; // Found stays true// Found stays true }else if ( strcmp( SymTab[ HashIndex ], name ) ) { // slot busy}else if ( strcmp( SymTab[ HashIndex ], name ) ) { // slot busy Found = False;Found = False; Collisions++;Collisions++; // global counter// global counter chain++;chain++; // just for statistics// just for statistics Maxchain = chain > Maxchain : chain ? Maxchain;Maxchain = chain > Maxchain : chain ? Maxchain;

HashIndex = ( HashIndex + HashIndex = ( HashIndex + PrimePrime ) % MaxSymTab; // Secondary Hash? ) % MaxSymTab; // Secondary Hash? WrapAround = HashIndex == OldIndex;WrapAround = HashIndex == OldIndex; } //end if; else same, Found == True } //end if; else same, Found == True } //end while} //end while CHECK( WrapAround, " <><> Hash table full <><>\n” );CHECK( WrapAround, " <><> Hash table full <><>\n” ); } // search unsigned} // search unsigned

32

Enter Name in Symbol Table void search( char * name )void search( char * name ) // double hash// double hash { // search { // search unsigned HashIndex = unsigned HashIndex = PrimaryHash( namePrimaryHash( name );// 0 <= hash < MaxSymTab );// 0 <= hash < MaxSymTab unsigned OldIndex = Hashindex; unsigned OldIndex = Hashindex; // WrapAround check// WrapAround check bool Found = false; bool Found = false; // Found = name in symbol table // Found = name in symbol table bool WrapAround;bool WrapAround; // back to original slot? // back to original slot? unsigned chain = 0;unsigned chain = 0; // Maximum chaining of probes // Maximum chaining of probes

// so far no Collision for this search // so far no Collision for this search while ( (!Found) && !WrapAround ) {while ( (!Found) && !WrapAround ) {

Found = True; Found = True; // Guess, correct later // Guess, correct later if ( !(SymTab[ HashIndex ][ 0 ]) ) {// empty slot; use it if ( !(SymTab[ HashIndex ][ 0 ]) ) {// empty slot; use it strcpy( SymTab[ HashIndex ], name );strcpy( SymTab[ HashIndex ], name ); Available--; }Available--; } }else if ( strcmp( SymTab[ HashIndex ], name ) ) { // slot busy}else if ( strcmp( SymTab[ HashIndex ], name ) ) { // slot busy Found = False;Found = False; Collisions++;Collisions++; // global counter// global counter chain++;chain++; // just for statistics// just for statistics Maxchain = chain > Maxchain : chain ? Maxchain;Maxchain = chain > Maxchain : chain ? Maxchain;

HashIndex = ( HashIndex + HashIndex = ( HashIndex + SecondaryHash( name )SecondaryHash( name ) ) % MaxSymTab; ) % MaxSymTab; WrapAround = HashIndex == OldIndex;WrapAround = HashIndex == OldIndex; } //end if; else same, Found == True } //end if; else same, Found == True } //end while} //end while CHECK( WrapAround, " <><> Hash table full <><>\n” );CHECK( WrapAround, " <><> Hash table full <><>\n” ); } // search unsigned} // search unsigned

33

Single or Double Hash?

For demonstration purposes: make single- versus For demonstration purposes: make single- versus double-hashing programmabledouble-hashing programmable

If command line argument “d” is provided, then use If command line argument “d” is provided, then use double-hashing, e.g. command:double-hashing, e.g. command:

a.out d < infile > resulta.out d < infile > result

Indicates that double hashing is desiredIndicates that double hashing is desired

See actual source code:See actual source code:

34

Single or Double Hash? if ( 1 == argc ) {if ( 1 == argc ) {

// no added arg given// no added arg given

single = true;single = true;

} else if ( argc >= 2 ) {} else if ( argc >= 2 ) {

if ( !strcmp( argv[ 1 ], "d" ) ) {if ( !strcmp( argv[ 1 ], "d" ) ) {

single = false;single = false;

}else{}else{

single = true;single = true;



// now we know: single hashing or double hashing // now we know: single hashing or double hashing function?function?

35

Single or Double Hash in Search()?void search( char * name )void search( char * name ){ // search{ // search

. . . . . . while ( ( !found ) && ( !WrapAround ) ) {while ( ( !found ) && ( !WrapAround ) ) {

found = true; // educated guessfound = true; // educated guess if ( ! sym_tab[ HashIndex ][ 0 ] ) { // found empty slotif ( ! sym_tab[ HashIndex ][ 0 ] ) { // found empty slot strcpy( sym_tab[ HashIndex ], name );strcpy( sym_tab[ HashIndex ], name ); Available--; // one less symbmol table slot availableAvailable--; // one less symbmol table slot available used++; // one more symbol table slot used upused++; // one more symbol table slot used up }else if ( strcmp( sym_tab[ HashIndex ], name ) ) {}else if ( strcmp( sym_tab[ HashIndex ], name ) ) { // another ID sits there// another ID sits there found = false;found = false; Collisions++; // global counterCollisions++; // global counter chain++;chain++; if ( single ) {if ( single ) { HashIndex = ( HashIndex + HashIndex = ( HashIndex + PRIMEPRIME ) % Max_Sym_Tab; ) % Max_Sym_Tab; }else{}else{ HashIndex = ( HashIndex + HashIndex = ( HashIndex + SecondaryHash( name )SecondaryHash( name ) ) % Max_Sym_Tab; ) % Max_Sym_Tab; } //end if} //end if WrapAround = HashIndex == Old_Inx;WrapAround = HashIndex == Old_Inx; MaxChain = chain > MaxChain ? chain : MaxChain;MaxChain = chain > MaxChain ? chain : MaxChain; } //end if} //end if

} //end while} //end while. . .. . .

} //end search} //end search

36

Scan Complete Inputvoid enter_names( void )void enter_names( void ){ // enter_names{ // enter_names char c = ' ';char c = ' '; // anything except EOF// anything except EOF char name[ MAX_L + 2 ];char name[ MAX_L + 2 ]; // next string, add 2 slack// next string, add 2 slack unsigned length;unsigned length; // track chars in ident// track chars in ident while ( c != EOF ) { // the whole filewhile ( c != EOF ) { // the whole file while ( ( c != EOF ) && ( ! isalpha( c ) ) ) {while ( ( c != EOF ) && ( ! isalpha( c ) ) ) { c = getchar();c = getchar(); // skip non-alpha// skip non-alpha } //end while} //end while length = 0;length = 0; // reset for each ID// reset for each ID while ( ( c != EOF ) && ( isalpha( c ) ) ) {while ( ( c != EOF ) && ( isalpha( c ) ) ) { if ( length < MAX_L ) {if ( length < MAX_L ) { name[ length++ ] = c;name[ length++ ] = c; } //end if // skip others} //end if // skip others c = getchar();c = getchar(); } //end while} //end while name[ length ] = '\0';name[ length ] = '\0'; TRACK( name );TRACK( name ); // also for statistics// also for statistics search( name );search( name ); // enter, only if not found!// enter, only if not found! } //end while} //end while} //end enter_names} //end enter_names

37

Initialize Hash

void init( int argc, char * argv[] )void init( int argc, char * argv[] ){ //init{ //init if ( 1 == argc ) {if ( 1 == argc ) { // no added arg given// no added arg given single = true;single = true; } else if ( argc >= 2 ) {} else if ( argc >= 2 ) { if ( !strcmp( argv[ 1 ], "d" ) ) {if ( !strcmp( argv[ 1 ], "d" ) ) { single = false;single = false; }else{}else{ single = true;single = true; } //end if} //end if } //end if} //end if printf( "want %s\n", single ? "single hash" : "double hash" );printf( "want %s\n", single ? "single hash" : "double hash" ); for ( int i = 0; i < Max_Sym_Tab; i++ ) {for ( int i = 0; i < Max_Sym_Tab; i++ ) { sym_tab[ i ][ 0 ] = '\0';sym_tab[ i ][ 0 ] = '\0'; } //end for} //end for MaxChain = 0;MaxChain = 0; Collisions = 0;Collisions = 0; printf( "Reading input:\n" );printf( "Reading input:\n" );} //end init} //end init

38

Terminate Hash

void terminate( void )void terminate( void ){ // terminate{ // terminate printf( " Maximum chaining = %d\n", MaxChain );printf( " Maximum chaining = %d\n", MaxChain ); printf( " Number of Collis. = %d\n", Collisions );printf( " Number of Collis. = %d\n", Collisions ); printf( " Number slots Avail = %d\0", Available );printf( " Number slots Avail = %d\0", Available );## ifdef DEBUG ifdef DEBUG for ( int i = 0; i < Max_Sym_Tab; i++ ) {for ( int i = 0; i < Max_Sym_Tab; i++ ) { if ( sym_tab[ i ][ 0 ] ) {if ( sym_tab[ i ][ 0 ] ) { printf( " %4d %s \n", i, sym_tab[ i ] );printf( " %4d %s \n", i, sym_tab[ i ] ); } //end if} //end if } //end for} //end for# endif // DEBUG# endif // DEBUG} //end terminate} //end terminate

39

Test Primary vs. SecondaryTest Primary vs. Secondary Running 2 hashing searchesRunning 2 hashing searches

The first one using a The first one using a primary hash functionprimary hash function only: only: collisions are handled by finding another entry with collisions are handled by finding another entry with fixed (fixed (PrimePrime) offset; see p. 31) offset; see p. 31

The second one using also a secondary hash The second one using also a secondary hash function: function: SecondaryHash( name );SecondaryHash( name ); collisions are collisions are handled by adding the new hash value, guaranteed handled by adding the new hash value, guaranteed to be != 0; yielding another entry; see p. 32to be != 0; yielding another entry; see p. 32

Both exercise the identical, small input, a file of Both exercise the identical, small input, a file of 21807 characters, 446 distinct identifiers21807 characters, 446 distinct identifiers

Results are shown next. Despite small sampling Results are shown next. Despite small sampling space, there is a noticeable difference in space, there is a noticeable difference in performance quality!performance quality!

40

Test Secondary HashTest Secondary Hash Entries for Entries for single hashsingle hash= 883= 883 Twin prime, 2 smaller = 881Twin prime, 2 smaller = 881 Symbols used = 436Symbols used = 436 Fill factor = 49 percentFill factor = 49 percent Maximum chaining = 73Maximum chaining = 73 <- pretty bad chaining<- pretty bad chaining Number of Collisions = Number of Collisions = 68606860 Number slots Available = 447Number slots Available = 447 number of chars read = 21808number of chars read = 21808 number of ids scanned = 2320number of ids scanned = 2320 number of noise chars = 13087number of noise chars = 13087

Entries for Entries for double hashdouble hash= 883= 883 Twin prime, 2 smaller = 881Twin prime, 2 smaller = 881 Symbols used = 436Symbols used = 436 Fill factor = 49 percentFill factor = 49 percent Maximum chaining = 10Maximum chaining = 10 Number of Collisions = Number of Collisions = 12511251 <- ~20% of single hash<- ~20% of single hash Number slots Available = 447Number slots Available = 447 number of chars read = 21808number of chars read = 21808 number of ids scanned = 2320number of ids scanned = 2320 number of noise chars = 13087number of noise chars = 13087

41

Brent Heuristic Outline Brent's idea requires Brent's idea requires more work per insertionmore work per insertion than than

single and double hashing. But if a key is used single and double hashing. But if a key is used frequently as is the case in typical programs, the frequently as is the case in typical programs, the complex insertion effort may turn out to be a good complex insertion effort may turn out to be a good investment after allinvestment after all

To assess its relocation cost, we need to fetch the To assess its relocation cost, we need to fetch the old key and use the secondary hash function to old key and use the secondary hash function to find a relocation slot for the old keyfind a relocation slot for the old key

This would increase the cost for retrieving the old This would increase the cost for retrieving the old key, but, if this addition costs less than finding an key, but, if this addition costs less than finding an alternate entry for the new key, the net result is a alternate entry for the new key, the net result is a savingsaving

42

Hash Conclusion Symbol tables in a compiler are generally large, to Symbol tables in a compiler are generally large, to

handle large source programhandle large source program

Regular searches, such as Regular searches, such as binary binary or or insertioninsertion sorts sorts are not sufficient, when fast compilation speeds are are not sufficient, when fast compilation speeds are neededneeded

Hashing will be neededHashing will be needed to look up symbols to look up symbols

The fill factor of the hash table generally is way The fill factor of the hash table generally is way under 90%under 90%

Some techniques exist even to dynamically Some techniques exist even to dynamically increase table sizes, typically requiring relocation increase table sizes, typically requiring relocation of keys entered so farof keys entered so far

Dynamic array resizing is not discussed in CS 163Dynamic array resizing is not discussed in CS 163

Date post:	03-Jan-2016
Category:	Documents
Upload:	louisa-anderson
View:	218 times
Download:	0 times

1 CS 163 Data Structures Chapter 4 Searching Herbert G. Mayer, PSU Status 5/28/2015.

Documents