Sets and Maps (and Hashing)

Chapter 9

Chapter 9: Sets and Maps

Chapter Objectives

• To understand the Java Map and Set interfaces and how to use them

• To learn about hash codes and how they are used to facilitate efficient search and retrieval

• To study two forms of hash tables—open addressing and chaining—and to understand their relative benefits and performance tradeoffs


• To learn how to implement both hash table forms

• To be introduced to the implementation of Maps and Sets

• To see how two earlier applications can be more easily implemented using Map objects for data storage


Review of Sets

• A set is unordered and has no duplicate elements
• Suppose A = {1,3,5,7,9,11} and B = {2,3,5,7,11,13}
• Then
  • A ∪ B = {1,2,3,5,7,9,11,13}
  • A ∩ B = {3,5,7,11}
  • A − B = {1,9}
  • B − A = {2,13}
  • If C = {3,5,9}, then C ⊆ A
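The set operations above can be tried with the standard java.util collections. The following is a minimal sketch: the class and helper names (SetOperationsDemo, union, intersection, difference, isSubset) are made up for illustration, and TreeSet is used only so the elements print in sorted order.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

class SetOperationsDemo {
    // Each helper copies its first argument so the caller's sets are not modified.
    static Set<Integer> union(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);      // keep everything in a or b
        return result;
    }

    static Set<Integer> intersection(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);   // keep only elements also in b
        return result;
    }

    static Set<Integer> difference(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(b);   // drop elements that are in b
        return result;
    }

    static boolean isSubset(Set<Integer> c, Set<Integer> a) {
        return a.containsAll(c);
    }

    public static void main(String[] args) {
        Set<Integer> a = new TreeSet<>(Arrays.asList(1, 3, 5, 7, 9, 11));
        Set<Integer> b = new TreeSet<>(Arrays.asList(2, 3, 5, 7, 11, 13));
        Set<Integer> c = new TreeSet<>(Arrays.asList(3, 5, 9));
        System.out.println(union(a, b));        // [1, 2, 3, 5, 7, 9, 11, 13]
        System.out.println(intersection(a, b)); // [3, 5, 7, 11]
        System.out.println(difference(a, b));   // [1, 9]
        System.out.println(difference(b, a));   // [2, 13]
        System.out.println(isSubset(c, a));     // true
    }
}
```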


Sets and the Set Interface

• The part of the Collection hierarchy that relates to sets
• Includes three interfaces, two abstract classes, and two actual classes


The Set Abstraction

• A set is a collection that contains no duplicate elements
  • And at most one null element
• In a set, the index of an element is meaningless
• If s is a set,
  • s.contains("apple") returns true or false
  • s.indexOf("apple") makes no sense
  • s.get(i) is also nonsensical


• Operations on sets include:
  • Testing for membership
  • Adding (inserting) elements
  • Removing elements
  • Union
  • Intersection
  • Difference
  • Subset


The Set Interface and Methods

• Has required methods for …
  • Testing set membership
  • Testing for an empty set
  • Determining set size
  • Creating an iterator over the set
• Two optional methods for …
  • Adding an element
  • Removing an element
• Constructors enforce the no-duplicates rule, and the add method does not insert a duplicate item



Comparison of Lists and Sets

• Duplicate elements
  • OK in a list
  • Not allowed in a set: Set.add returns false if you try to insert a duplicate element
• Get method
  • A list has a get method
  • A set has no get method (an index is meaningless)
• Iterators
  • Lists have iterators
  • You can also iterate through the elements of a set
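A short sketch of the list/set differences, using ArrayList and HashSet from java.util. The class name ListVersusSetDemo and the helper addTwice are hypothetical names introduced for this illustration only.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;

class ListVersusSetDemo {
    // Adds the same item twice and reports what each add call returned.
    static boolean[] addTwice(Collection<String> c, String item) {
        return new boolean[] { c.add(item), c.add(item) };
    }

    public static void main(String[] args) {
        ArrayList<String> list = new ArrayList<>();
        HashSet<String> set = new HashSet<>();

        boolean[] listResults = addTwice(list, "apple");
        boolean[] setResults = addTwice(set, "apple");

        // The list accepts the duplicate; the set rejects it.
        System.out.println(listResults[1] + ", size " + list.size()); // true, size 2
        System.out.println(setResults[1] + ", size " + set.size());   // false, size 1

        System.out.println(list.get(0));           // apple (lists are indexed)
        System.out.println(set.contains("apple")); // true (sets only answer membership)

        // Both can be iterated; a set has no get(i).
        for (String s : set) {
            System.out.println(s);                 // apple
        }
    }
}
```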


Maps

• A map relates one set to another set
• A map is a set of ordered pairs (x, y)
  • Where x is the key and y is the value (element)
• For example
  • This map is: {(J,Jane), (B,Bill), (B2,Bill), (S,Sam), (B1,Bob)}


• Keys must be unique
  • But values need not be unique (two keys may map to the same value, so the mapping need not be 1-to-1)
• Each key “maps” to a particular value (element)
  • Or, you might say, it “corresponds” to that value
• Maps are used for very efficient storage and retrieval of information in tables
• A key is used like an index into a list
  • But the key does not need to be an integer


• Suppose we have the map: {(J,Jane), (B,Bill), (B2,Bill), (S,Sam), (B1,Bob)}
• And it is stored in aMap
• Then
  • What does aMap.get("B2") return?
    • "Bill"
  • What does aMap.get("Bill") return?
    • null, since nothing in aMap has the key "Bill"
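The two lookups can be checked with a HashMap. This is a small sketch; the class name MapLookupDemo and the helper buildMap are illustrative assumptions, while put, get, and containsValue are standard java.util.Map methods.

```java
import java.util.HashMap;
import java.util.Map;

class MapLookupDemo {
    // Builds the example map {(J,Jane), (B,Bill), (B2,Bill), (S,Sam), (B1,Bob)}.
    static Map<String, String> buildMap() {
        Map<String, String> aMap = new HashMap<>();
        aMap.put("J", "Jane");
        aMap.put("B", "Bill");
        aMap.put("B2", "Bill"); // same value under a different key is allowed
        aMap.put("S", "Sam");
        aMap.put("B1", "Bob");
        return aMap;
    }

    public static void main(String[] args) {
        Map<String, String> aMap = buildMap();
        System.out.println(aMap.get("B2"));             // Bill
        System.out.println(aMap.get("Bill"));           // null: "Bill" is a value, not a key
        System.out.println(aMap.containsValue("Bill")); // true, but this search is by value
    }
}
```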


Map Interface


Hash Tables

• For maps, we want to access an entry by its key, not its value
• A hash table is used for such access
• For efficiency, we want to access an element directly by its key
  • As opposed to searching for the key value in an array
• Using a hash table, we can retrieve an item in constant time on average, and linear time in the worst case
  • That is, O(1) is expected, but O(n) is the worst case


Hash Codes and Index Calculation

• Hashing idea
  • Transform an item’s key value into an integer
  • Then use this integer as a numeric index into a table


Hash Code Index Example

• Suppose we want to store the number of occurrences of each Unicode character in a file
  • A 16-bit Java char can take 65,536 different values
• What to do?
  • Could create an array of size 65,536 and store the count of character i in array element i
  • This will work, but…
  • …it is very inefficient for a small file
  • Suppose the file has only 100 characters!
  • Is there a better way?


Hash Code Index Calculation

• Same problem: store the number of occurrences of each character (65,536 possible char values) for a file of 100 characters
• Use a hash code for each character
• But how to compute the hash code? Could do the following:
  • Create an array of size 200 and compute the index as index = uniChar % 200
• Good, since it uses less space
• Bad if there are collisions
  • That is, 2 or more characters in the file “hash” to the same index
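To make the collision problem concrete, here is a minimal sketch of the index = uniChar % 200 idea. The names CharIndexDemo, indexFor, and count are hypothetical, and the sketch deliberately does nothing about collisions, so two colliding characters end up sharing one counter.

```java
class CharIndexDemo {
    static final int TABLE_SIZE = 200;

    // Maps a character to a slot in the 200-element table.
    static int indexFor(char uniChar) {
        return uniChar % TABLE_SIZE;
    }

    // Counts characters by slot; colliding characters are (wrongly) merged.
    static int[] count(String text) {
        int[] counts = new int[TABLE_SIZE];
        for (char c : text.toCharArray()) {
            counts[indexFor(c)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(indexFor('A'));      // 65, since 'A' is 65
        System.out.println(indexFor('\u0109')); // 65, since U+0109 is 265 and 265 % 200 = 65
        int[] counts = count("A\u0109A");
        System.out.println(counts[65]);         // 3: two different characters share slot 65
    }
}
```

The last line shows why a real table must also store the key in each slot and resolve collisions, rather than only keeping a count per index.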


Methods for Generating Hash Codes

• Usually, keys consist of strings of letters and/or digits
• The number of possible key values is much larger than the table size
• Generating a good hash code is something of an art
  • Some experimentation and trial and error may be required
• Desirable properties of a “hash function”?
  • A “random” (uniform) distribution of values
  • Relatively simple function
  • Efficient to compute
• Collisions can always occur; what to do?


Java hashCode Method

• For strings, could simply sum the int values of all characters
  • But that returns the same hash code for “sign” and “sing”
• The Java API algorithm accounts for the position of the characters
• String.hashCode() returns the integer calculated by the formula
  s_0 × 31^(n-1) + s_1 × 31^(n-2) + … + s_(n-1)
  where s_i is the ith character of the string and n is the length of the string
• “Cat” will have a hash code of ‘C’ × 31^2 + ‘a’ × 31 + ‘t’ = 67 × 961 + 97 × 31 + 116 = 67510
• Since 31 is a prime number, this tends to produce fewer collisions
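The formula can be checked against String.hashCode() directly. In this sketch the class and method names (StringHashDemo, sumOfChars, positionalHash) are illustrative; positionalHash evaluates the same polynomial with Horner's rule, and int overflow wraps around just as it does in the library method.

```java
class StringHashDemo {
    // Naive hash: ignores character positions, so anagrams collide.
    static int sumOfChars(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i);
        }
        return sum;
    }

    // s_0*31^(n-1) + s_1*31^(n-2) + ... + s_(n-1), computed by Horner's rule.
    static int positionalHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(sumOfChars("sign") == sumOfChars("sing")); // true: collision
        System.out.println("sign".hashCode() == "sing".hashCode());   // false
        System.out.println(positionalHash("Cat"));                    // 67510
        System.out.println("Cat".hashCode());                         // 67510
    }
}
```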


Open Addressing

• We consider two ways to organize hash tables
  • Open addressing
  • Chaining
• For open addressing, linear probing can be used to deal with collisions
  • Use the key’s hash code to compute an index and look at that table element
  • If that element contains an item with a different key, increment the index by one
  • Keep incrementing until you find the key or a null entry
  • A null entry indicates that the key is not in the table


Open Addressing Algorithm


Table Wraparound and Search Termination

• As the index increases, it must wrap around to 0 (circular array)
• This leads to the potential for an infinite loop
• How do you know when to stop searching if the table is full and you have not found the correct value?
  • Stop when the index for the next probe is the same as the starting index computed from the object’s hash code, or…
  • Ensure that the table is never full by increasing its size after an insertion if its occupancy rate exceeds a specified threshold (a sparser table also has fewer collisions)
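One way to put linear probing, wraparound, and the "never full" rule together is sketched below. This is an illustrative sketch, not the textbook's class: the name LinearProbingTable, the 0.75 threshold, and the growth rule 2 × size + 1 are all assumptions. It also uses Math.floorMod rather than % so that a negative hash code cannot produce a negative index.

```java
class LinearProbingTable {
    private static final double MAX_LOAD = 0.75; // assumed occupancy threshold
    private String[] table;
    private int count;

    LinearProbingTable(int capacity) {           // capacity is assumed to be >= 1
        table = new String[capacity];
    }

    // Returns the index holding key, or the index of the null slot that ends the probe.
    // Terminates because add() keeps at least one slot empty.
    private int find(String key) {
        int index = Math.floorMod(key.hashCode(), table.length);
        while (table[index] != null && !table[index].equals(key)) {
            index = (index + 1) % table.length;  // wrap around to 0 at the end
        }
        return index;
    }

    boolean contains(String key) {
        return table[find(key)] != null;
    }

    boolean add(String key) {
        int index = find(key);
        if (table[index] != null) {
            return false;                        // already present
        }
        table[index] = key;
        count++;
        if ((double) count / table.length > MAX_LOAD) {
            rehash();                            // grow so the table is never full
        }
        return true;
    }

    private void rehash() {
        String[] old = table;
        table = new String[2 * old.length + 1];
        count = 0;
        for (String key : old) {
            if (key != null) {
                table[find(key)] = key;          // reinsert using the new table size
                count++;
            }
        }
    }

    int size() { return count; }

    int capacity() { return table.length; }
}
```

The alternative stopping rule from the list above (stop when the probe returns to its starting index) is needed only when the table is allowed to fill completely.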


Open Addressing Example

• Suppose we have the following values and hash codes

Name      hashCode    hashCode % 5   hashCode % 11
“Tom”     84274       4              3
“Dick”    2129869     4              5
“Harry”   69496448    3              10
“Sam”     82879       4              5
“Pete”    2484038     3              7



• Suppose we use hashCode % 5 to create the hash table
• Using open addressing (linear probing), insert the names in the order shown

Name      hashCode % 5
“Tom”     4
“Dick”    4
“Harry”   3
“Sam”     4
“Pete”    3

Table contents after each insertion:

index   empty   + “Tom”   + “Dick”   + “Harry”   + “Sam”   + “Pete”
0       null    null      “Dick”     “Dick”      “Dick”    “Dick”
1       null    null      null       null        “Sam”     “Sam”
2       null    null      null       null        null      “Pete”
3       null    null      null       “Harry”     “Harry”   “Harry”
4       null    “Tom”     “Tom”      “Tom”       “Tom”     “Tom”

• “Dick” collides with “Tom” at index 4 and wraps around to index 0
• “Sam” probes 4, 0, then 1; “Pete” probes 3, 4, 0, 1, then 2




• Now suppose we use hashCode % 11 instead, again with open addressing

Name      hashCode % 11
“Tom”     3
“Dick”    5
“Harry”   10
“Sam”     5
“Pete”    7

Table contents after each insertion:

index   empty   + “Tom”   + “Dick”   + “Harry”   + “Sam”   + “Pete”
0       null    null      null       null        null      null
1       null    null      null       null        null      null
2       null    null      null       null        null      null
3       null    “Tom”     “Tom”      “Tom”       “Tom”     “Tom”
4       null    null      null       null        null      null
5       null    null      “Dick”     “Dick”      “Dick”    “Dick”
6       null    null      null       null        “Sam”     “Sam”
7       null    null      null       null        null      “Pete”
8       null    null      null       null        null      null
9       null    null      null       null        null      null
10      null    null      null       “Harry”     “Harry”   “Harry”

• Only “Sam” collides (with “Dick” at index 5) and moves to index 6
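The two final tables can be reproduced with a few lines of Java, since the hash codes in the example are the values String.hashCode() returns for these names. The class name ProbingExample and the method build are hypothetical; the sketch assumes there are no more names than table slots, so every probe eventually finds an empty slot.

```java
import java.util.Arrays;

class ProbingExample {
    // Inserts the names in order into a table of the given size using linear probing.
    static String[] build(String[] names, int size) {
        String[] table = new String[size];
        for (String name : names) {
            int index = Math.floorMod(name.hashCode(), size);
            while (table[index] != null) {
                index = (index + 1) % size; // collision: try the next slot, wrapping around
            }
            table[index] = name;
        }
        return table;
    }

    public static void main(String[] args) {
        String[] names = { "Tom", "Dick", "Harry", "Sam", "Pete" };
        System.out.println(Arrays.toString(build(names, 5)));
        // [Dick, Sam, Pete, Harry, Tom]
        System.out.println(Arrays.toString(build(names, 11)));
        // [null, null, null, Tom, null, Dick, Sam, Pete, null, null, Harry]
    }
}
```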


Hash Table Operations

• Iterating through a hash table gives the entries in “arbitrary” order
• Deleting from a hash table
  • Cannot just replace the item with null. Why not?
  • Null is used as the stopping/not-found condition, so items stored later in the same probe sequence would no longer be found
  • Can store a “dummy value” instead
  • So, removing does not improve search time
• Reducing collisions
  • Expand the size of the hash table, and rehash the elements
  • Tradeoff between table size and search efficiency
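The deletion problem can be seen with “Tom” and “Dick”, which both hash to index 4 in a table of size 5. The sketch below marks a removed slot with a dummy object instead of null. The class name TombstoneTable and the DELETED marker are assumptions made for illustration; reusing a marked slot on insertion is one possible design choice, not something the slides prescribe.

```java
class TombstoneTable {
    // Dummy value for removed slots; compared by identity, never by equals.
    private static final String DELETED = new String("<deleted>");
    private final String[] table;

    TombstoneTable(int capacity) {
        table = new String[capacity];
    }

    // Returns the index of key, or -1 if a null slot or a full cycle is reached first.
    private int indexOf(String key) {
        int start = Math.floorMod(key.hashCode(), table.length);
        int index = start;
        do {
            String slot = table[index];
            if (slot == null) {
                return -1;                      // null still means "not in the table"
            }
            if (slot != DELETED && slot.equals(key)) {
                return index;
            }
            index = (index + 1) % table.length; // skip over other keys and dummy values
        } while (index != start);
        return -1;
    }

    boolean contains(String key) {
        return indexOf(key) >= 0;
    }

    boolean add(String key) {
        if (indexOf(key) >= 0) {
            return false;
        }
        int start = Math.floorMod(key.hashCode(), table.length);
        int index = start;
        do {
            if (table[index] == null || table[index] == DELETED) {
                table[index] = key;             // empty or previously deleted slot
                return true;
            }
            index = (index + 1) % table.length;
        } while (index != start);
        throw new IllegalStateException("table is full");
    }

    boolean remove(String key) {
        int index = indexOf(key);
        if (index < 0) {
            return false;
        }
        table[index] = DELETED;                 // mark, do not null out
        return true;
    }
}
```

After add("Tom"), add("Dick"), and remove("Tom"), a search for "Dick" still starts at index 4, steps over the marker, and finds "Dick" at index 0. Had slot 4 been set to null, the search would have stopped there.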


Reducing Collisions by Quadratic Probing

• Linear probing tends to form clusters of keys in the table, causing longer search chains

• Quadratic probing can reduce the effect of clustering
  • Increments form a quadratic series (1, 4, 9, …)
• Disadvantages?
  • More work to calculate the next index (multiplication, addition, and modular division)
  • Not all table elements are examined when looking for an insertion index
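The second disadvantage is easy to observe by listing the slots a quadratic probe sequence actually visits. This sketch assumes the common form in which probe k examines index (start + k²) % tableSize; the names QuadraticProbeDemo and probeSequence are illustrative.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class QuadraticProbeDemo {
    // Distinct slots visited by the first tableSize probes, in order of first visit.
    static Set<Integer> probeSequence(int start, int tableSize) {
        Set<Integer> visited = new LinkedHashSet<>();
        for (int k = 0; k < tableSize; k++) {
            visited.add((start + k * k) % tableSize);
        }
        return visited;
    }

    public static void main(String[] args) {
        Set<Integer> visited = probeSequence(0, 11);
        System.out.println(visited);        // [0, 1, 4, 9, 5, 3]
        System.out.println(visited.size()); // 6 of the 11 slots are ever examined
    }
}
```

With a table of size 11, slots 2, 6, 7, 8, and 10 are never probed from a start index of 0, so an insertion can fail even though the table still has empty elements.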


Chaining

• Chaining is an alternative to open addressing
• Each table element references a linked list that contains all of the items that hash to the same table index
  • The linked list is often called a bucket
  • The approach is sometimes called bucket hashing
• Only items that hash to the same table index will be examined when looking for an object


• Recall hashCode % 5
• Chaining creates a linked list for each table index that has items
• In this example
  • Linked list for Tom, Dick, Sam (index 4)
  • Another linked list for Harry and Pete (index 3)

Name      hashCode % 5
“Tom”     4
“Dick”    4
“Harry”   3
“Sam”     4
“Pete”    3
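The same five names can be placed into buckets with a short sketch. ChainingDemo and build are hypothetical names, and the sketch starts every index with an empty LinkedList (a table could equally leave unused elements null).

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ChainingDemo {
    // Builds a table of buckets; each name is appended to the bucket at hashCode % size.
    static List<LinkedList<String>> build(String[] names, int size) {
        List<LinkedList<String>> table = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            table.add(new LinkedList<>());
        }
        for (String name : names) {
            int index = Math.floorMod(name.hashCode(), size);
            table.get(index).add(name); // a collision just lengthens the list
        }
        return table;
    }

    public static void main(String[] args) {
        String[] names = { "Tom", "Dick", "Harry", "Sam", "Pete" };
        List<LinkedList<String>> table = build(names, 5);
        System.out.println(table.get(3)); // [Harry, Pete]
        System.out.println(table.get(4)); // [Tom, Dick, Sam]
        System.out.println(table.get(0)); // []
    }
}
```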


• Pluses?
  • Conceptually simple
  • Minimizes table size
  • Good search efficiency
• Minuses?
  • Overhead of linked lists (more storage)
  • More complex (perhaps)


Performance of Hash Tables

• Load factor is the number of filled cells divided by the table size
• Load factor has the greatest effect on performance
  • The lower the load factor, the better the performance
  • Why?
  • Less chance of collision in a sparsely populated table
  • But the smaller the load factor, the more wasted space…
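To get a feel for the numbers, the sketch below evaluates two commonly quoted approximations (attributed to Knuth) for the expected number of comparisons in a successful search: ½(1 + 1/(1 − L)) for open addressing with linear probing and 1 + L/2 for chaining, where L is the load factor. These formulas do not appear in the text above and are included here as an assumption; the class and method names are illustrative.

```java
class LoadFactorDemo {
    static double loadFactor(int filledCells, int tableSize) {
        return (double) filledCells / tableSize;
    }

    // Approximate comparisons for a successful search with linear probing (requires L < 1).
    static double linearProbingComparisons(double load) {
        return 0.5 * (1.0 + 1.0 / (1.0 - load));
    }

    // Approximate comparisons for a successful search with chaining.
    static double chainingComparisons(double load) {
        return 1.0 + load / 2.0;
    }

    public static void main(String[] args) {
        System.out.println(loadFactor(5, 11)); // about 0.45 for the 11-element example
        for (double load : new double[] { 0.25, 0.5, 0.75, 0.9 }) {
            System.out.printf("L=%.2f  probing=%.2f  chaining=%.2f%n",
                    load, linearProbingComparisons(load), chainingComparisons(load));
        }
    }
}
```

Under these approximations, linear probing needs about 1.5 comparisons at L = 0.5 but about 5.5 at L = 0.9, which is why tables are usually resized well before they fill.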



Maps and Hashing

• Maps can be implemented with hash tables (as in HashMap)!
• Hashing converts the key into an index
  • The index is the place where the corresponding value is stored
• Makes it possible to search efficiently
  • Recall, O(1) on average
  • Without having an (explicit) index
  • Of course, there is some additional overhead


Implementing a Hash Table



Implementation of Maps and Sets

• Class Object implements methods hashCode and equals, so every class inherits these methods unless it overrides them

• Object.equals compares two objects based on their identity (whether they are the same object), not their contents

• Object.hashCode typically calculates an object’s hash code based on its identity, not its contents

• If you override the equals method, then you should also override the hashCode method, so that objects that are equal have the same hash code
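A small sketch of overriding both methods together. The Person class, its fields, and the demo class are hypothetical; Objects.hash is a standard java.util helper that combines field hash codes.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class Person {
    private final String name;
    private final int id;

    Person(String name, int id) {
        this.name = name;
        this.id = id;
    }

    // Two Person objects are equal when their contents match.
    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof Person)) {
            return false;
        }
        Person p = (Person) other;
        return id == p.id && name.equals(p.name);
    }

    // Built from the same fields as equals, so equal objects get equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(name, id);
    }
}

class EqualsHashCodeDemo {
    public static void main(String[] args) {
        Set<Person> people = new HashSet<>();
        people.add(new Person("Jane", 1));
        System.out.println(people.contains(new Person("Jane", 1))); // true
        System.out.println(people.add(new Person("Jane", 1)));      // false: duplicate
    }
}
```

If hashCode were left as inherited from Object, the second Person would most likely land in a different bucket and the set would treat it as a new element, even though equals reports the two as equal.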


Implementing HashSetOpen


Implementing Java Map and Set Interfaces

• The Java API uses a hash table to implement both the Map and Set interfaces (classes HashMap and HashSet)

• The task of implementing the two interfaces is simplified by the inclusion of abstract classes AbstractMap and AbstractSet in the Collection hierarchy


Nested Interface Map.Entry

• One requirement on the key-value pairs for a Map object is that they implement the interface Map.Entry<K, V>, which is a nested interface of interface Map
• An implementer of the Map interface must contain an inner class that provides code for the methods in the table below
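From the user's side, Map.Entry objects are what entrySet() hands back. The sketch below walks the entries of a map and reads each pair with getKey and getValue, which are standard Map.Entry methods. The class name EntryIterationDemo and the method describe are illustrative, and TreeMap is used only so the iteration order is predictable.

```java
import java.util.Map;
import java.util.TreeMap;

class EntryIterationDemo {
    // Formats every key-value pair as "key=value;" in iteration order.
    static String describe(Map<String, String> map) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> entry : map.entrySet()) {
            sb.append(entry.getKey()).append('=').append(entry.getValue()).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> aMap = new TreeMap<>();
        aMap.put("J", "Jane");
        aMap.put("B", "Bill");
        aMap.put("S", "Sam");
        System.out.println(describe(aMap)); // B=Bill;J=Jane;S=Sam;
    }
}
```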


Additional Applications of Maps

• Can implement the phone directory using a map


• Huffman Coding Problem
  • Use a map for creating an array of elements and replacing each input character by its bit string code in the output file
• Frequency table
  • The key will be the input character
  • The value is the character code string
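The character-to-code lookup can be sketched with a Map<Character, String>. The code table used here (a → 0, b → 10, c → 11) is a made-up prefix-free example, not the output of an actual Huffman tree, and the names CodeMapDemo and encode are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class CodeMapDemo {
    // Replaces each input character by its bit-string code from the map.
    static String encode(String input, Map<Character, String> codes) {
        StringBuilder out = new StringBuilder();
        for (char c : input.toCharArray()) {
            String code = codes.get(c);          // key: input character, value: code string
            if (code == null) {
                throw new IllegalArgumentException("no code for character: " + c);
            }
            out.append(code);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Character, String> codes = new HashMap<>();
        codes.put('a', "0");   // hypothetical codes for illustration
        codes.put('b', "10");
        codes.put('c', "11");
        System.out.println(encode("abca", codes)); // 010110
    }
}
```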


Chapter Review

• The Set interface describes an abstract data type that supports the same operations as a mathematical set

• The Map interface describes an abstract data type that enables a user to access information corresponding to a specified key

• A hash table uses hashing to transform an item’s key into a table index so that insertions, retrievals, and deletions can be performed in expected O(1) time

• A collision occurs when two keys map to the same table index

• In open addressing, linear probing is often used to resolve collisions


• The best way to avoid collisions is to keep the table load factor relatively low by rehashing when the load factor reaches a value such as 0.75

• In open addressing, you can’t simply remove an element from the table when you delete it; instead, you must mark it as deleted

• A set view of a hash table can be obtained through method entrySet

• Two Java API implementations of the Map (Set) interface are HashMap (HashSet) and TreeMap (TreeSet)

Date post:	21-Dec-2015