Hashing Indirect Address Translation

Date post: 14-Jan-2016
Upload: akamu
Transcript
Page 1: Hashing Indirect Address Translation

File Processing - Indirect Address Translation MVNC 1

Hashing

Indirect Address Translation

Chapter 11

Page 2: Hashing Indirect Address Translation

Indirect Address Translation

Direct translation
» When the Primary Key (PK) and the relative record position (RRP) are the same, we say there is a direct translation.
» Simple direct access file systems use this technique.

Page 3: Hashing Indirect Address Translation

Indirect Address Translation

Direct translation - problems
» The PKs may not be numeric.
– Names
– Alphanumeric IDs

Page 4: Hashing Indirect Address Translation

Indirect Address Translation

Direct translation - problems
» Only a small percentage of the possible range of PKs may actually have records assigned to them:
– Consider an employee file whose key field is a 9-digit ID number (e.g. Social Security Number).
– The company has 200 employees.
– Since the IDs may take any of the $10^{9}$ values, the file will have to be huge ($10^{9}$ records!). Thus the file will have a packing density of:

$$\frac{200\ \text{records used}}{10^{9}\ \text{records allocated}} = \frac{2}{10^{7}} = 0.00002\%$$

Page 5: Hashing Indirect Address Translation

Indirect Address Translation

Hashing
» A common technique of indirect translation is hashing.
» A solution in which the broad range of PK values is transformed into the smaller range of RRP values.
» Hashing uses a hashing function to translate the key values into the smaller range of RRP values.

[Figure: the broad range of PK values (000000000 to 999999999) is mapped by coercion, or indirect translation, onto the restricted range of RRP values (000 to 250).]

Page 6: Hashing Indirect Address Translation

Indirect Address Translation

Hashing Algorithms
» Development of a hashing function requires careful attention.
– The algorithm should distribute the keys as evenly as possible across the range of addresses.
– Some different keys MUST necessarily map to the same address.

Page 7: Hashing Indirect Address Translation

Key Transformation Algorithms

3 general steps to convert a key to an RRP address:
1) If the key is not numeric, convert it into a numeric form without losing information.
2) Operate on the numeric key using an algorithm which converts the keys into a spread of numbers of the order of magnitude of the address numbers required.
3) Multiply the resulting numbers by a constant which compresses them into the precise range of addresses.

Page 8: Hashing Indirect Address Translation

Key Transformation Algorithms

Example:
» Key is a 9-digit number.
» Destination file has 7000 records.
» Step 1 - Not needed (already a number).
» Step 2 - Divide the key by 10000 and keep the remainder, which lies between 0 and 9999.
» Step 3 - Multiply the value from step 2 by .7 and truncate, to put the number within the range 0000 to 6999.
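The example can be sketched in Python as follows. This is a minimal sketch, not the course's own code: the function names `three_step_hash` and `compress_only` are made up for illustration, and the multiplication by .7 is done in integer arithmetic (times 7000, integer-divided by 10000) to avoid floating-point rounding.

```python
def three_step_hash(key: int, num_addresses: int = 7000, spread: int = 10_000) -> int:
    """Hash a numeric key as in the example (step 1 is not needed)."""
    # Step 2: keep the remainder after dividing by 10000 (0..9999).
    remainder = key % spread
    # Step 3: compress by 7000/10000 = .7 and truncate (0..6999).
    return remainder * num_addresses // spread


def compress_only(key: int, num_addresses: int = 7000, key_range: int = 10**9) -> int:
    """Hypothetical variant that skips step 2 and scales the 9-digit key directly."""
    return key * num_addresses // key_range


print(three_step_hash(123456789))  # remainder 6789, compressed to 4752

# 100 contiguous keys: with step 2 they spread out, without it they pile up.
cluster = range(123456000, 123456100)
print(len({three_step_hash(k) for k in cluster}))
print(len({compress_only(k) for k in cluster}))
```

With contiguous keys, skipping step 2 sends the whole cluster to a single address, because only the high-order digits survive the compression.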

Page 9: Hashing Indirect Address Translation

Key Transformation Algorithms

Example:
» What would happen if we skipped step 2 and simply compressed the number from step 1?
» What about clustered insertions? (Keys with contiguous values.)

Page 10: Hashing Indirect Address Translation

Key Transformation Algorithms - Division

The key is divided by a number approximately equal to the number of available addresses, and the remainder is taken as the RRP.

A prime number or number with no small factors is used.

Page 11: Hashing Indirect Address Translation

Key Transformation Algorithms - Division

Example:
» Records have a 6-digit key; 5000 RRPs desired.
» Divide by 4997 and use the remainder.
» Consider key: 142536
» 142536 / 4997 = 28 remainder 2620.
» Use 2620 as RRP.

How do you suppose this method would work with clustered insertions?
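A minimal Python sketch of the division method, using the numbers from the example. The function name `division_hash` is an assumption made for illustration.

```python
def division_hash(key: int, divisor: int = 4997) -> int:
    """Return the remainder of key / divisor as the RRP (0..divisor-1)."""
    return key % divisor


quotient, remainder = divmod(142536, 4997)
print(quotient, remainder)         # 28 2620
print(division_hash(142536))       # 2620
```

Note that the remainder can never exceed 4996, so with a divisor of 4997 the last three of the 5000 addresses are never used as home addresses.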

Page 12: Hashing Indirect Address Translation

Key Transformation Algorithms - Extraction

Select digits from different parts of the key.

Example:
» Records with a 10-digit key; 5000 RRPs desired.
» Choose the 3rd, 5th, 8th and 9th digits, counting from the right:
» Consider key = 3865324567, which yields 8625.
» Compress into RRP range: INT(8625 * .5) = 4312. Use 4312 as RRP.
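A minimal Python sketch of extraction. It assumes digit positions are counted from the right-hand end of the key, which is the reading under which key 3865324567 yields 8625; the function name `extraction_hash` and its parameters are illustrative only.

```python
def extraction_hash(key: int, positions_from_right=(9, 8, 5, 3),
                    num_addresses: int = 5000) -> int:
    """Pick the given digits (1 = rightmost) and compress them into the RRP range."""
    digits = str(key)
    # Read the chosen digits in left-to-right order of the key.
    picked = "".join(digits[-p] for p in positions_from_right)
    extracted = int(picked)
    # Compress: 4 extracted digits span 0..9999, so scale by 5000/10000 = .5.
    return extracted * num_addresses // 10 ** len(positions_from_right)


print(extraction_hash(3865324567))  # digits 8, 6, 2, 5 -> 8625 -> 4312
```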

Page 13: Hashing Indirect Address Translation

Key Transformation Algorithms - Folding

Digits in the key are folded inward like folding paper. Then the digits are added.

Folding tends to be more appropriate for large keys.

[Figure: folding the key 142537.]

Page 14: Hashing Indirect Address Translation

Key Transformation Algorithms - Folding

Example
» Let key be 142537.
» Fold left at 4th digit, right at 3rd digit.
» Results in 4137 and 735.
» Add the two resulting values: 4137 + 735 = 4872
» Compress into RRP range: 4872 x .5 = 2436. Use 2436 as RRP.

[Figure: folding the key 142537.]
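The exact fold used in the example depends on a figure that did not survive in this transcript, so the sketch below does not try to reproduce the values 4137 and 735. Instead it shows two commonly described folding variants, as an assumption about what folding code might look like: shift folding (split the key into chunks and add them) and boundary folding (the same, but with every other chunk reversed, as if the paper were creased at the chunk boundaries). All names are illustrative.

```python
def split_chunks(key: int, chunk_size: int) -> list[str]:
    """Split the decimal digits of key into chunks of chunk_size digits."""
    digits = str(key)
    return [digits[i:i + chunk_size] for i in range(0, len(digits), chunk_size)]


def shift_fold(key: int, chunk_size: int) -> int:
    """Add the chunks as they are."""
    return sum(int(chunk) for chunk in split_chunks(key, chunk_size))


def boundary_fold(key: int, chunk_size: int) -> int:
    """Add the chunks, reversing the even-indexed ones (an assumed convention)."""
    chunks = split_chunks(key, chunk_size)
    return sum(int(chunk[::-1]) if i % 2 == 0 else int(chunk)
               for i, chunk in enumerate(chunks))


print(shift_fold(142537, 2))     # 14 + 25 + 37 = 76
print(boundary_fold(142537, 2))  # 41 + 25 + 73 = 139
```

Either result would still need to be compressed or expanded into the RRP range, as in step 3 of the general procedure.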

Page 15: Hashing Indirect Address Translation

Key Transformation Algorithms - Mid-square method

Square the key, and use the central digits of the result.

Example:
» Let records have a 6-digit key, and 5000 RRPs desired.
» Key value of 142536.
» $142536^{2}$ --> 020316511296
» 1651 - central digits
» Compress into RRP range: INT(1651 x .5) = 825. Use 825 as RRP.
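A minimal Python sketch of the mid-square method with the numbers above. The square is padded to twice the key length (12 digits) before the central four digits are taken; the function names and parameters are assumptions made for illustration.

```python
def central_digits(key: int, key_digits: int = 6, central: int = 4) -> int:
    """Square the key and return the central digits of the padded square."""
    square = str(key * key).zfill(2 * key_digits)   # e.g. "020316511296"
    start = (len(square) - central) // 2
    return int(square[start:start + central])


def mid_square_hash(key: int, num_addresses: int = 5000) -> int:
    """Compress the four central digits (0..9999) into 0..num_addresses-1."""
    return central_digits(key) * num_addresses // 10_000


print(central_digits(142536))   # 1651
print(mid_square_hash(142536))  # 825
```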

Page 16: Hashing Indirect Address Translation

Key Transformation Algorithms - Selection

The best way to choose a transform is to take the key set for the file and simulate using different transforms.

Choose the one which distributes the records most evenly.

The division method seems to be the best general transform.
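Such a simulation can be sketched in a few lines of Python. The key set below (4000 contiguous 6-digit keys) and the three candidate transforms are assumptions chosen for illustration; a real comparison would run over the actual key set of the file.

```python
def division(key: int) -> int:
    return key % 4997


def compress_only(key: int) -> int:
    # Scale the 6-digit key straight into 0..4999 with no spreading step.
    return key * 5000 // 10**6


def mid_square(key: int) -> int:
    square = str(key * key).zfill(12)
    return int(square[4:8]) * 5000 // 10_000


def count_synonyms(keys, transform) -> int:
    """Number of records that hash to an address some earlier record already uses."""
    addresses = [transform(k) for k in keys]
    return len(addresses) - len(set(addresses))


clustered_keys = range(142000, 146000)   # clustered insertions: contiguous keys
for transform in (division, compress_only, mid_square):
    print(transform.__name__, count_synonyms(clustered_keys, transform))
```

On this clustered key set the division method produces no synonyms at all, since fewer than 4997 consecutive keys always have distinct remainders, while plain compression collapses the keys onto 20 addresses.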

Page 17: Hashing Indirect Address Translation

Important hashing considerations

When designing a practical hashing scheme, several important issues must be addressed:

record distribution
» A hashing function needs to be picked which will evenly distribute the records throughout the RRP range.
» Different key sets will have different distribution patterns.
» Thus the hashing function chosen will depend on the patterns of keys in the data set.

Page 18: Hashing Indirect Address Translation

Important hashing considerations

synonyms
» Two or more PKs which transform to the same RRP address.
» The goal is to devise a hashing function for a given set of keys which will minimize synonyms.
» It is, however, statistically beyond reason to totally avoid synonyms.
» Not only would all keys need to be known in advance, but only one algorithm in $10^{12000}$ will work!

Page 19: Hashing Indirect Address Translation

Important hashing considerations

collisions
» When a new record hashes to an address already in use by another record.
» The new record and the existing record are called synonyms.
» The result is called an overflow.
» A scheme must be devised to handle overflows efficiently.

Page 20: Hashing Indirect Address Translation

Important hashing considerations

packing density
» Ratio of records stored in a file to addresses available in the file.
» Typically the best packing density is 80-90%.
» The larger the file (that is, the lower the packing density), the lower the probability of an overflow.
» There is thus a trade-off between space and efficiency.

[Figure: space versus efficiency.]
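The trade-off can be observed with a small simulation. The sketch below is an illustration under stated assumptions, not a measurement: it draws random 9-digit keys, hashes them by division into 4997 single-record addresses, and reports the fraction of records that find their home address already taken. The function name, the seed, and the file size are all arbitrary choices.

```python
import random


def overflow_fraction(num_addresses: int, packing_density: float, seed: int = 0) -> float:
    """Fraction of records whose home address is already occupied."""
    rng = random.Random(seed)
    num_records = int(num_addresses * packing_density)
    keys = rng.sample(range(10**9), num_records)   # distinct random keys
    used = set()
    overflows = 0
    for key in keys:
        address = key % num_addresses              # division method
        if address in used:
            overflows += 1
        else:
            used.add(address)
    return overflows / num_records


for density in (0.5, 0.7, 0.9):
    print(density, round(overflow_fraction(4997, density), 3))
```

Lower packing densities waste more space but leave fewer records to be handled as overflows.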

Page 21: Hashing Indirect Address Translation

Techniques for handling collisions

Strategies for collision resolution:
1. Create the file so that each address (physical record) can hold several logical records (usually synonyms). These are called composite records or buckets.
2. Develop algorithms for relocating records which collide.

Page 22: Hashing Indirect Address Translation

Composite Records or buckets

Reduce the number of RRPs, but increase the size of each to hold several records.

Each RRP (called a bucket) now holds several logical records.

[Figure: a file of 12 single-record addresses (1 to 12) beside a file of 4 buckets (1 to 4).]

Page 23: Hashing Indirect Address Translation

Composite Records or buckets

Buckets are arrays of logical records.
Bucket size - number of records per bucket.
Now there is room for several synonyms in each bucket.
Probability of overflow is reduced.
Overflow now only occurs when a bucket is full.
Overall file size need not increase: if the bucket size is 5, then reduce the number of physical records by a factor of 5.

Page 24: Hashing Indirect Address Translation

Composite Records or buckets

May be implemented by having each file record be an array of logical records.

Example: Consider two half-full files

[Figure: a file of 12 single-record addresses holding 6 records, beside a file of 4 buckets holding 6 records.]

Probability of overflow?
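One way to approach the question is to compute, for each layout, the probability that loading the 6 records causes at least one overflow. The sketch below assumes that every record is equally likely to hash to any address, independently of the others, and that each of the 4 buckets holds 3 records (so both files have 12 slots). The count of overflow-free placements is taken from the coefficient of x^n in (1 + x + ... + x^c/c!)^B; the function name is illustrative.

```python
from fractions import Fraction
from math import factorial


def overflow_probability(num_buckets: int, bucket_size: int, num_records: int) -> Fraction:
    """P(at least one bucket receives more records than it can hold)."""
    # Coefficients of 1 + x + x^2/2! + ... + x^c/c! for one bucket.
    base = [Fraction(1, factorial(k)) for k in range(bucket_size + 1)]
    poly = [Fraction(1)]
    for _ in range(num_buckets):
        product = [Fraction(0)] * (num_records + 1)
        for i, a in enumerate(poly):
            for j, b in enumerate(base):
                if i + j <= num_records:
                    product[i + j] += a * b
        poly = product
    # n! * [x^n] counts placements in which no bucket overflows.
    no_overflow = poly[num_records] * factorial(num_records)
    return 1 - no_overflow / Fraction(num_buckets) ** num_records


print(float(overflow_probability(12, 1, 6)))  # 12 addresses, 1 record each
print(float(overflow_probability(4, 3, 6)))   # 4 buckets, 3 records each
```

Under these assumptions the single-record file overflows in roughly 78% of loads and the bucketed file in roughly 15%, even though both are half full.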

Page 25: Hashing Indirect Address Translation

Composite Records or buckets

Trade-offs
» As bucket size increases, the probability of an overflow is greatly reduced.
» As bucket size increases, the time to read in and scan a bucket increases.
» Typical bucket sizes range from 5 to 30.
» Ideal bucket size is often a multiple of the disk sector or track size.
» What is the extreme case of having the longest possible bucket?

Page 26: Hashing Indirect Address Translation

Handling overflows

Increasing bucket size will reduce, but not eliminate, overflows. They must be dealt with.

Many algorithms exist for handling overflows, including:
1. Progressive overflow
2. Separate overflow area
3. Chained progressive overflow

Page 27: Hashing Indirect Address Translation

Progressive overflow

Adding new record
» If the home address is full, try the next address.
» If the next address is full, try the next, and so on.
» If at end of file, wrap around to record 0.
» If the search continues until the home address is reached again, the file is full.

Page 28: Hashing Indirect Address Translation

Progressive overflow

Finding a record
» If in home bucket, success!
» Else if home bucket not full, search fails.
» Else if home bucket full, go search next bucket.
» Keep searching successive buckets until either the record is found or a non-full bucket is searched.

Page 29: Hashing Indirect Address Translation

Progressive overflow

Finding a record
» Note that as the file fills, search length will increase.
» What are some enhancements?
– Each bucket has a flag indicating whether the bucket has really overflowed.

Page 30: Hashing Indirect Address Translation

Progressive overflow

Delete record
» Can't simply remove the record, or find may not work correctly.
» Must mark each record as used, unused, or deleted.
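The add, find and delete rules for progressive overflow can be sketched together in Python. This is a minimal in-memory sketch under simplifying assumptions: each address holds a single record (bucket size 1), keys are integers hashed by division, and duplicate keys are not checked. The class and sentinel names are made up for illustration.

```python
UNUSED = object()    # address has never held a record
DELETED = object()   # address held a record that was later deleted


class ProgressiveOverflowFile:
    def __init__(self, num_addresses: int):
        self.slots = [UNUSED] * num_addresses

    def _probe(self, key: int):
        """Yield the home address, then successive addresses, wrapping around."""
        home = key % len(self.slots)
        for offset in range(len(self.slots)):
            yield (home + offset) % len(self.slots)

    def add(self, key: int, record) -> int:
        for address in self._probe(key):
            if self.slots[address] is UNUSED or self.slots[address] is DELETED:
                self.slots[address] = (key, record)
                return address
        raise OverflowError("file full")   # probed back around to the home address

    def find(self, key: int):
        for address in self._probe(key):
            slot = self.slots[address]
            if slot is UNUSED:
                return None                # a never-used address ends the search
            if slot is not DELETED and slot[0] == key:
                return slot[1]
        return None

    def delete(self, key: int) -> bool:
        for address in self._probe(key):
            slot = self.slots[address]
            if slot is UNUSED:
                return False
            if slot is not DELETED and slot[0] == key:
                self.slots[address] = DELETED   # keep the probe chain intact
                return True
        return False


f = ProgressiveOverflowFile(10)
print(f.add(3, "a"), f.add(13, "b"), f.add(23, "c"))  # 3 4 5
f.delete(13)
print(f.find(23))                                     # c
```

Marking address 4 as deleted rather than unused is what lets the search for key 23 continue past it to address 5.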

Page 31: Hashing Indirect Address Translation

Progressive overflow

Evaluation
» Simple
» Robust
» Searches may get very long
» Clustering

Page 32: Hashing Indirect Address Translation

Progressive overflow

Alternate version - skip x records each time, where x is relatively prime to the number of records.

Reduces the problem of record clustering.
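The reason x must be relatively prime to the number of records is that only then does the probe sequence visit every address before returning home. A small Python sketch, with the function name `probe_sequence` assumed for illustration:

```python
from math import gcd


def probe_sequence(home: int, step: int, num_addresses: int) -> list[int]:
    """Addresses visited when skipping `step` records each time, starting at home."""
    if gcd(step, num_addresses) != 1:
        raise ValueError("step must be relatively prime to the number of addresses")
    return [(home + i * step) % num_addresses for i in range(num_addresses)]


print(probe_sequence(3, 3, 10))  # [3, 6, 9, 2, 5, 8, 1, 4, 7, 0]
```

With a step of 2 in a 10-record file, by contrast, a search starting at address 3 would only ever see the five odd addresses.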

Page 33: Hashing Indirect Address Translation

Separate overflow area

Buckets contain pointers which may point to a record in a special overflow area.

Records (or buckets) are linked together in the overflow area as a linked list.

What happens if there are a lot of synonyms for a few home addresses?
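A minimal Python sketch of a separate overflow area. It assumes single-record home addresses, integer keys hashed by division, and list indexes standing in for file pointers; new synonyms are linked at the head of the chain. The class name and record layout are illustrative.

```python
class SeparateOverflowFile:
    def __init__(self, num_addresses: int):
        self.main = [None] * num_addresses   # each entry: [key, record, overflow_ptr]
        self.overflow = []                   # each entry: [key, record, next_ptr]

    def add(self, key: int, record) -> None:
        home = key % len(self.main)
        if self.main[home] is None:
            self.main[home] = [key, record, None]
            return
        # Home address taken: store in the overflow area and link into the chain.
        self.overflow.append([key, record, self.main[home][2]])
        self.main[home][2] = len(self.overflow) - 1

    def find(self, key: int):
        entry = self.main[key % len(self.main)]
        if entry is None:
            return None
        if entry[0] == key:
            return entry[1]
        pointer = entry[2]
        while pointer is not None:           # walk the linked list of synonyms
            overflow_key, record, pointer = self.overflow[pointer]
            if overflow_key == key:
                return record
        return None


f = SeparateOverflowFile(10)
for key, record in [(3, "a"), (13, "b"), (23, "c")]:
    f.add(key, record)
print(f.find(13), f.find(23), len(f.overflow))  # b c 2
```

If a few home addresses attract many synonyms, their chains grow long and each lookup degenerates into a linear walk through the overflow area.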

Page 34: Hashing Indirect Address Translation

Separate overflow area

[Figure: main buckets, each with an overflow pointer into the overflow area.]

Page 35: Hashing Indirect Address Translation

Chained Progressive overflow

Similar to progressive overflow, but pointers link synonyms together for quicker searches.

