
Copyright © 2009-2011 by Curt Hill

Searching and Sorting

A Summary on Searching


The lesson from Neural Networks

• Neural networks are only used when there are no algorithms that always work
• We only use them on hard problems
• In NN there are never any absolute answers
• Instead each project is different and we experiment with our options until we are happy
• We are never sure that this is the best answer; we only hope that it is acceptable


Apply the lesson

• This is the same problem as constructing data structures for programs
• It is extremely rare for us to know in advance all the things that would make the decision easy:
  – The frequency or number of insertions, deletions, and lookups in an average run
  – The frequency distribution of the key
  – The density of the key
  – What the optimal container class will be
  – How the next revision will change all of this


Therefore

• We make fuzzy choices based on incomplete information

• We then become good at spotting trends that favor one structure over another

• With that in mind let us come back and re-examine searching and sorting


Why consider both at once?

• Our containers will fall into one of three categories:
  – Unordered
  – Ordered by key
  – Ordered or partially ordered by something other than key
• In the first case there is no notion of sorting
• In the rest there is
  – In some cases we must sort before we get started searching
  – In both cases an insert must do some type of partial sort to clean it up
  – A delete may also affect the sorted order but is often easier to correct


Searching Arrays or Vectors

• Three areas to review here
  – Linear
  – Binary
  – Self-organizing lists


Linear searching

• This is the best and worst
• Advantages
  – It is the easiest to code
    • Not much more than a for loop
  – Does the best for small tables, typically fewer than 10 entries
  – Applicable to lists as well as tables
• Disadvantages
  – Finding an item that is present in a uniformly distributed array needs ½N probes on average
  – Finding that an item is not present requires looking at each of the N items
  – Clearly an O(N) algorithm, which is the worst for a search
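
The "not much more than a for loop" point can be sketched in C as follows. This is a minimal illustration that assumes a table of int; the function name linear_search is hypothetical and not part of the original slides.

```c
#include <stddef.h>

/* Return the index of the first entry equal to key, or -1 if it is absent.
   The function name and the int element type are assumptions for this sketch. */
long linear_search(const int *table, size_t n, int key)
{
    for (size_t i = 0; i < n; i++) {
        if (table[i] == key) {
            return (long)i;   /* found after i + 1 probes */
        }
    }
    return -1;                /* all n entries were examined */
}
```

A hit costs between 1 and N probes, while a miss always costs N, which matches the disadvantages listed above.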


Why use?

• When the advantages outweigh the disadvantages
• For small tables it is the preferred choice
• Often chosen early in a project
  – If and when performance becomes a problem then upgrade the search based on what you now know about the project
  – It may be that the vector is large but searched infrequently so it is not a problem


Sequential Search in C

• There is a sequential search function in stdlib.h
  – On POSIX systems lfind is declared in search.h instead
• It will search an array of items using a user defined comparison
• The header is:
  void* lfind(const void * key, const void * base, size_t * num, size_t width,
              int (_USERENTRY * fcmp)(const void *, const void *));


Notes

• It uses void * to represent any pointer

• The key is a pointer to what is being searched for

• The base is an array, not necessarily of the same type as the key
  – It may contain the key and other stuff

• The array has num entries and each entry is width bytes long


The passed function

• fcmp is a user defined routine to compare the key with a base item

• Key is the first parameter
• An array entry is the second
• Returns zero for equal and anything else for not equal
• If the item is found then lfind returns the pointer to it and NULL otherwise


Commentary

• Actually figuring out how to use this thing is probably harder than coding it from scratch

• However, a library routine is often written in tuned machine language

• Thus it may do better than a hand-written C style loop

• The C libraries also contain:
  – A binary search we will see later
  – A quick sort routine


Example


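int fcmpe(const void * a, const void * b){
  // lfind only needs equal (0) or not equal (nonzero)
  if(*(const int *)a == *(const int *)b){
    return 0;
  }
  return +1;
}
...
size_t s = tablesize;
int key;
int * unsorted; // dynamic array
...
// pass the address of the key; width is the size of one entry
int * found = (int *)lfind(&key, unsorted, &s, sizeof(int), fcmpe);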


Commentary

• The lfind is classic C
• It is not a template function, but it can be used much like a template function
• To do this it:
  – Uses void * pointers
  – Makes the user specify the length
  – Requires a user-defined function for comparison
• Then it will work on any array


STL considerations

• The STL has a search which is customarily interesting
• It may search for an item or a range of items
  – In any container
• The header looks like this:
  FI search(First1, Last1, First2, Last2)


STL Notes

• The result and all parameters are forward iterators
  – The result has the same iterator type as First1 and Last1
• First1 through Last1 are in the container class to be searched
• First2 through Last2 may be in another container
• If Last2 is one past First2 then the search is for just one item


STL Results

• If search finds it the result is the beginning of the sequence

• Otherwise it returns Last1
• In order to use it the stored types must be suitable for the equality operator
• You may also provide your own predicate


Binary search

• The binary search requires a sorted table

• The sort order may be either ascending or descending
  – For this presentation ascending is assumed


Basic algorithm

• Set low to 0, high to the last used index
• While low <= high
  – Set mid to be halfway between low and high
  – Compare the mid item with key
  – If the mid item is equal you are done
  – If the mid item is less than the key
    • Remove the lower half of the table
    • Set low to mid + 1
  – If the mid item is greater than the key
    • Remove the upper half of the table
    • Set high to mid - 1
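
A minimal sketch of this loop in C, assuming an ascending table of int and inclusive low and high bounds. The function name binary_search_int is hypothetical. The sketch loops while low <= high and moves the bounds to mid + 1 and mid - 1, so the range shrinks on every pass and a one-entry range is still examined.

```c
/* Binary search of an ascending int table with n entries.
   Returns the index of key, or -1 if key is not present.
   The name and the int element type are assumptions for this sketch. */
long binary_search_int(const int *table, long n, int key)
{
    long low = 0;
    long high = n - 1;                       /* last used index */
    while (low <= high) {
        long mid = low + (high - low) / 2;   /* halfway, without overflow */
        if (table[mid] == key) {
            return mid;                      /* found */
        } else if (table[mid] < key) {
            low = mid + 1;                   /* discard the lower half */
        } else {
            high = mid - 1;                  /* discard the upper half */
        }
    }
    return -1;                               /* bounds crossed: not present */
}
```

Moving a bound exactly onto mid, rather than past it, is one of the common ways to get this loop wrong, because a two-entry range may then never shrink.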


Commentary

• The loop terminates when we find the item or the high and low bounds collapse
• We determine which after the loop
• The advantages
  – The search is O(log₂N) because at each iteration we eliminate half of what is left
• The disadvantages
  – The loop is much more complicated
    • Most people do not get it right the first time
  – The array must be sorted before we get started


Sorting

• Since sorting is either O(N²) or O(N log₂N), this is a very serious ramification
  – You have to do quite a few searches to pay for that sort
• If the table will allow insertions it complicates that as well
  – The search to find the position is log₂N, however the insertion itself is still linear in an array, since we have to slide all the following items down one
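
The cost split described above can be sketched as follows: a log₂N search for the position, then a linear slide. This is only an illustration; the name sorted_insert is hypothetical, and it assumes the caller has already made room for at least one more entry.

```c
#include <stddef.h>

/* Insert key into the ascending array table[0..n-1] and return the new count.
   Assumes (for this sketch) that the array has capacity for n + 1 entries. */
size_t sorted_insert(int *table, size_t n, int key)
{
    /* Binary search for the first position whose entry is greater than key. */
    size_t low = 0, high = n;
    while (low < high) {
        size_t mid = low + (high - low) / 2;
        if (table[mid] <= key) {
            low = mid + 1;
        } else {
            high = mid;
        }
    }
    /* Slide every following entry down one slot: this is the linear part. */
    for (size_t i = n; i > low; i--) {
        table[i] = table[i - 1];
    }
    table[low] = key;
    return n + 1;
}
```

Inserting near the front moves almost the whole table, so the fast search does not make the insert itself fast.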


C Function

• There is a binary search function in stdlib.h
• It will search a sorted array of items using a user defined comparison
  void* bsearch(const void * key, const void * base, size_t num, size_t width,
                int (_USERENTRY * fcmp)(const void *, const void *));


Commentary

• It uses void * to represent any pointer

• The key is a pointer to what is being searched for

• The base is an array, not necessarily of the same type as the key
  – It may contain the key and other stuff
  – The array has num entries and each entry is width bytes long


User Defined Function

• fcmp is a user defined routine to compare the key with a base item
• It returns a negative if the first parameter is less than the second
• It returns a zero if the first parameter is equal to the second
• It returns a positive if the first parameter is greater than the second
• If the item is found then bsearch returns the pointer to it and NULL otherwise


Example


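int fcmp(const void * a, const void * b){
  if(*(const int *)a < *(const int *)b)
    return -1; // less
  if(*(const int *)a == *(const int *)b)
    return 0;  // equal
  return +1;   // greater
}
...
int key;
int * table; // sorted ascending
...
// pass the address of the key and the size of one entry
int * found = (int *)bsearch(&key, table, tablesize, sizeof(int), fcmp);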
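
The note that the array "may contain the key and other stuff" can be illustrated with a table of records searched by an int key. This is a sketch in standard C; the Record type and the names cmp_key_record and find_record are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical record type: the key plus other data. */
typedef struct {
    int id;
    const char *name;
} Record;

/* bsearch passes the key first and an array entry second,
   so the two parameters have different types here. */
static int cmp_key_record(const void *key, const void *entry)
{
    int k = *(const int *)key;
    int id = ((const Record *)entry)->id;
    return (k > id) - (k < id);   /* negative, zero, or positive */
}

/* Look up a record by id in a table sorted ascending by id.
   Returns a pointer to the record, or NULL if it is not present. */
const Record *find_record(const Record *table, size_t n, int id)
{
    return bsearch(&id, table, n, sizeof table[0], cmp_key_record);
}
```

Because the comparison function is the only code that knows the layout of an entry, the same bsearch call works for any record type.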


STL considerations

• There is also a binary search in the STL
• The header is:
  bool binary_search(first, last, const T& value)
  bool binary_search(first, last, const T& value, comp)
• first and last are ForwardIterators in the container
• value is the item looked for
• comp is an optional comparison object to allow you to specify the comparison
• Of course, the container must be ordered


Segmented Search

• Intermediate between binary and linear search
  – Easier to code than binary search
  – Faster than linear
• Requires sorted array
• Depending on size of table may come in two to four stages


Two Stage

• Divide the array into segments
  – Segment size is close to square root of size
• First find the segment that contains desired item
  – Use a linear search but with segment size increment
• Once segment is found find the desired item
  – Again with linear search


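Example code

• Assume that size of table is 64 and it is sorted:
  int first = 0, last = 64;
  // Stage 1: probe the last entry of each segment (7, 15, ..., 63)
  for(int i = 7; i < 64; i += 8){
    if(key <= arr[i]){ last = i + 1; break; }
    first = i + 1;
  }
  // Stage 2: linear search inside the segment [first, last)
  int j;
  for(j = first; j < last; j++)
    if(key == arr[j]) break;
  // the key was found at j if j < last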
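
The 64-entry example can be generalized to any table size. The sketch below is one possible version, not necessarily the author's; the name segmented_search is hypothetical, and the caller is assumed to pass a segment size seg close to the square root of n (8 for a table of 64).

```c
#include <stddef.h>

/* Two-stage segmented search of an ascending int array with n entries.
   seg is the segment size chosen by the caller (assumed near sqrt(n)).
   Returns the index of key, or -1 if it is not present. */
long segmented_search(const int *arr, size_t n, size_t seg, int key)
{
    if (n == 0 || seg == 0) {
        return -1;
    }
    /* Stage 1: step by seg, probing the last entry of each full segment. */
    size_t first = 0;
    while (first + seg <= n && arr[first + seg - 1] < key) {
        first += seg;
    }
    size_t last = first + seg;      /* one past the end of the segment */
    if (last > n) {
        last = n;                   /* final segment may be short or empty */
    }
    /* Stage 2: ordinary linear search inside the chosen segment. */
    for (size_t j = first; j < last; j++) {
        if (arr[j] == key) {
            return (long)j;
        }
    }
    return -1;
}
```

With n = 64 and seg = 8, stage 1 makes at most 8 probes and stage 2 at most 8 more.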


Commentary

• The search should be O(2√N)
• On the above table of 64 a linear search that finds the item would take an average of 32 probes
• The segmented search will take no more than 16
  – Average is 8
• A binary search would average 5.? and have a maximum of 6


Other searches

• If the frequency of lookup is not uniform you may do some other things

• For example, store the most commonly accessed items at the beginning of the list

• The most developed form of this is the self organizing list


Hashing

• Often the best vector search technique

• Should be O(C), that is constant time, if done well
• No restrictions on the key
• The problems are well known and discourage many from using it


Problems with hashing

• Insertions and deletions
• Hash function does not generalize well
• No such thing as a general hash function
• A good hash function is most often constructed with knowledge of the data
• Performance degrades when full
• Processing the data in a sorted order requires an extra sort
• Making the hash as robust as the tree is quite difficult
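
To make these problems concrete, here is a minimal hash table sketch using open addressing with linear probing. Everything in it is an assumption for illustration: the fixed size of 16, the modulo hash function (a placeholder, since a good one depends on the data), the restriction to non-negative int keys, and all of the names.

```c
#include <stddef.h>

#define HASH_SLOTS 16      /* hypothetical fixed table size */
#define HASH_EMPTY (-1)    /* marks an unused slot; assumes keys are non-negative */

typedef struct {
    int slot[HASH_SLOTS];
} HashTable;

void hash_init(HashTable *h)
{
    for (size_t i = 0; i < HASH_SLOTS; i++) {
        h->slot[i] = HASH_EMPTY;
    }
}

/* Placeholder hash function; a real one is built with knowledge of the keys. */
static size_t hash_index(int key)
{
    return (size_t)key % HASH_SLOTS;
}

/* Returns 1 if key is stored (or was already present), 0 if the table is full. */
int hash_insert(HashTable *h, int key)
{
    size_t start = hash_index(key);
    for (size_t probe = 0; probe < HASH_SLOTS; probe++) {
        size_t i = (start + probe) % HASH_SLOTS;
        if (h->slot[i] == HASH_EMPTY || h->slot[i] == key) {
            h->slot[i] = key;
            return 1;
        }
    }
    return 0;   /* every slot was occupied by some other key */
}

/* Returns 1 if key is present, 0 otherwise. */
int hash_contains(const HashTable *h, int key)
{
    size_t start = hash_index(key);
    for (size_t probe = 0; probe < HASH_SLOTS; probe++) {
        size_t i = (start + probe) % HASH_SLOTS;
        if (h->slot[i] == key) {
            return 1;
        }
        if (h->slot[i] == HASH_EMPTY) {
            return 0;   /* an empty slot ends the probe sequence */
        }
    }
    return 0;
}
```

Deletion is left out on purpose: simply emptying a slot would break the probe sequence for keys stored after it. As the table fills, the probe loops get longer, which is the "performance degrades when full" problem.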


Sermon

• Programmers usually avoid the hash because of these problems

• Very often this is the best of the search techniques

• The only question is: Is making the hash the search technique of choice worth the work?
  – Depends on the application


Other containers

• Pointer based
  – Lists
  – Trees


Lists

• Most of our techniques translate into lists rather easily
• Insertions and deletions are much easier
• Not needing to know the size in advance is also helpful
  – Dynamic arrays, including the STL vector, are as convenient
  – There is a substantial run-time penalty when the array has to be recopied to another larger array
• The main exception is the binary search
  – The binary search cannot be done since a list is not a random access container
  – Most sorts do not work on a list either
  – Quick sort should work on a doubly linked list


Self organizing lists

• Only list that is recommended for searching
  – Only with very narrow criteria
• Types
  – Move to top
    • Delete the item and push onto front
  – Transpose
    • Remember the prior pointer and exchange the two contents
  – Sort by frequency is the hardest of the SOLs because you can move up a variable amount
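
The move to top variant can be sketched on a singly linked list as below. The Node type and the name mtf_find are hypothetical; the list is assumed to hold int keys.

```c
#include <stddef.h>

/* Hypothetical singly linked list node. */
typedef struct Node {
    int key;
    struct Node *next;
} Node;

/* Linear search with move to top: on a hit, the node is unlinked
   and pushed onto the front. Returns the node, or NULL on a miss. */
Node *mtf_find(Node **head, int key)
{
    Node *prev = NULL;
    for (Node *cur = *head; cur != NULL; prev = cur, cur = cur->next) {
        if (cur->key == key) {
            if (prev != NULL) {           /* not already at the front */
                prev->next = cur->next;   /* delete from current position */
                cur->next = *head;        /* push onto the front */
                *head = cur;
            }
            return cur;
        }
    }
    return NULL;
}
```

A repeated lookup of the same key then costs a single probe, which is where the benefit comes from when a few keys dominate.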


Lists

• A self organizing list will provide good results only if:
  – Few items dominate the sought items
  – The list is relatively short
• Other than this lists are not a good search container unless
  – Search, insertion and deletion are very infrequent
  – None are coded by the programmer


Trees

• Trees are inherently sorted
  – There is no tree counterpart to an unsorted list
• Flavors to consider
  – Unbalanced
  – Balanced
  – Optimal Search
  – B-tree
  – Trie


Unbalanced tree

• Normal searches perform slightly worse than binary searches
  – Rarely balanced
• Advantage of log₂N insertion time on average
• When the search fails, you are at the location that you want to insert at with no additional work
• The worst case tree deletion is better than the average table deletion and the average case is log₂N
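
The point that a failed search leaves you at the insertion location can be sketched with a search that returns the link it stopped on. The TreeNode type and the names bst_locate and bst_insert are hypothetical, and no balancing is attempted.

```c
#include <stdlib.h>

/* Hypothetical binary search tree node. */
typedef struct TreeNode {
    int key;
    struct TreeNode *left;
    struct TreeNode *right;
} TreeNode;

/* Walk down from the root. Returns the link that points at the node
   holding key, or the NULL link where key would have to be attached. */
TreeNode **bst_locate(TreeNode **root, int key)
{
    TreeNode **link = root;
    while (*link != NULL && (*link)->key != key) {
        link = (key < (*link)->key) ? &(*link)->left : &(*link)->right;
    }
    return link;
}

/* Insert key if it is absent. Returns the node holding key,
   or NULL if memory could not be allocated. */
TreeNode *bst_insert(TreeNode **root, int key)
{
    TreeNode **link = bst_locate(root, key);   /* the failed search ... */
    if (*link == NULL) {
        TreeNode *node = malloc(sizeof *node);
        if (node == NULL) {
            return NULL;
        }
        node->key = key;
        node->left = NULL;
        node->right = NULL;
        *link = node;                          /* ... is the insert position */
    }
    return *link;
}
```

Nothing is moved on insertion, in contrast to the sliding needed in a sorted array. If keys arrive already sorted, though, this tree degenerates into a list.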


Balanced trees

• Search comparable to binary search
• Insertions and deletions are generally less painful
• A rebalance can be quite extensive and expensive
  – Generally a rebalance is less painful than an insertion or deletion in a table because the sliding affects all the table to the end
  – Recopying table is hidden cost


Tries

• Most of the advantages of the tree but it has two requirements to be useful:
• Dense key
• The key should have a small alphabet and short length
  – This is not much of a consideration if the key is truly dense
• A binary tree has an O(log₂N) search time
  – While a trie has search time linear in the length of the key rather than the number of entries
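
A small trie sketch shows why the lookup cost follows the key length: the loop runs once per character, however many keys are stored. The 26-letter lowercase alphabet, the contiguous 'a' to 'z' character codes (as in ASCII), and all of the names are assumptions for this illustration.

```c
#include <stdlib.h>

#define ALPHABET 26   /* assumed small alphabet: lowercase 'a'..'z' */

typedef struct TrieNode {
    struct TrieNode *child[ALPHABET];
    int is_key;       /* nonzero if a stored key ends at this node */
} TrieNode;

/* Allocate an empty node; returns NULL if allocation fails. */
TrieNode *trie_new(void)
{
    TrieNode *node = malloc(sizeof *node);
    if (node != NULL) {
        for (int i = 0; i < ALPHABET; i++) {
            node->child[i] = NULL;
        }
        node->is_key = 0;
    }
    return node;
}

/* Returns 1 on success, 0 on a character outside the alphabet
   or an allocation failure. */
int trie_insert(TrieNode *root, const char *key)
{
    TrieNode *node = root;
    for (; *key != '\0'; key++) {
        if (*key < 'a' || *key > 'z') {
            return 0;
        }
        int c = *key - 'a';
        if (node->child[c] == NULL) {
            node->child[c] = trie_new();
            if (node->child[c] == NULL) {
                return 0;
            }
        }
        node = node->child[c];
    }
    node->is_key = 1;
    return 1;
}

/* One step per character of key, independent of the number of entries. */
int trie_contains(const TrieNode *root, const char *key)
{
    const TrieNode *node = root;
    for (; *key != '\0'; key++) {
        if (*key < 'a' || *key > 'z') {
            return 0;
        }
        node = node->child[*key - 'a'];
        if (node == NULL) {
            return 0;
        }
    }
    return node->is_key;
}
```

Each node carries one pointer per letter of the alphabet, which is why a small alphabet and a dense key matter: a sparse key set leaves most of those pointers empty.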


Optimal search trees

• Somewhat similar to a list with optimal static order but faster
  – Requires knowledge of the frequencies
• Like a binary search it generally cuts the items to be searched in half on each pass
  – The halves are based on frequencies not on keys
• Like a self organizing list it tends to find high frequency items quite quickly


Optimal search tree

• A standard unbalanced tree may be used
  – Need a prior program that orders the keys based on frequency
• Generally not used if insertions and deletions are possible
• May be used for a set of keys that changes from day to day
• Keep the counts in every node and then write out tomorrow's tree based on the frequencies
• Could be quite effective but complicated to implement


B-Trees

• Offer no advantages in memory
  – Searching the node offsets the shallowness of the tree
• Preferred for disks
• No DBMS should be without them
