CS 245 Notes 5 1
CS 245: Database System Principles
Hector Garcia-Molina
Notes 5: Hashing and More
Hashing
key → h(key) → bucket containing <key>
Buckets (typically 1 disk block)
Two alternatives
(1) key → h(key) → bucket holding the records
(2) key → h(key) → bucket holding index entries (key, pointer) → record
• Alt (2) for “secondary” search key
Example hash function
• Key = ‘x1 x2 … xn’, an n-byte character string
• Have b buckets
• h: add x1 + x2 + … + xn
  - compute sum modulo b
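A minimal sketch of this example hash function in Python. The function name `h` follows the slide; encoding the key as UTF-8 bytes is an assumption made for illustration.

```python
def h(key: str, b: int) -> int:
    """Add up the bytes x1 + x2 + ... + xn of the key, then take the sum modulo b."""
    return sum(key.encode("utf-8")) % b


# 'abc' has byte values 97 + 98 + 99 = 294, and 294 mod 4 = 2.
print(h("abc", 4))

# Any reordering of the same bytes lands in the same bucket,
# because addition ignores the order of the bytes.
print(h("abc", 4) == h("cba", 4))
```

The second print hints at why a plain byte sum is a weak choice: keys that are permutations of each other always collide.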
This may not be the best function … Read Knuth Vol. 3 if you really need to select a good function.
Good hash function: the expected number of keys/bucket is the same for all buckets.
Within a bucket:
• Do we keep keys sorted?
• Yes, if CPU time critical & Inserts/Deletes not too frequent
Next: example to illustrate inserts, overflows, deletes
EXAMPLE: 2 records/bucket
INSERT: h(a) = 1, h(b) = 2, h(c) = 1, h(d) = 0
  Bucket 0: d
  Bucket 1: a, c
  Bucket 2: b
  Bucket 3: (empty)
Then insert e with h(e) = 1: bucket 1 is full, so e goes into an overflow block chained to bucket 1.
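The insert example can be replayed with a small sketch. The `Block` class and the helper names are hypothetical; the hash values are the ones given on the slide, and each bucket is modeled as a chain of fixed-capacity blocks.

```python
BUCKET_CAPACITY = 2  # records per block, as in the example


class Block:
    def __init__(self):
        self.records = []
        self.overflow = None  # next block in the overflow chain, if any


def insert(buckets, bucket_no, record):
    """Put the record in the first block of the chain that has room."""
    block = buckets[bucket_no]
    while len(block.records) >= BUCKET_CAPACITY:
        if block.overflow is None:
            block.overflow = Block()  # allocate a new overflow block
        block = block.overflow
    block.records.append(record)


def chain(buckets, bucket_no):
    """Return the contents of a bucket as a list of blocks."""
    out, block = [], buckets[bucket_no]
    while block is not None:
        out.append(list(block.records))
        block = block.overflow
    return out


buckets = [Block() for _ in range(4)]
hash_values = {"a": 1, "b": 2, "c": 1, "d": 0, "e": 1}  # from the slide
for key in "abcde":
    insert(buckets, hash_values[key], key)

print(chain(buckets, 1))  # [['a', 'c'], ['e']]
```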
EXAMPLE: deletion
[Figure: buckets 0-3 holding records a, b, c, d, e, f, g]
Delete: e, f
• After deleting f, maybe move “g” up
Rule of thumb:
• Try to keep space utilization between 50% and 80%
  Utilization = (# keys used) / (total # keys that fit)
• If < 50%, wasting space
• If > 80%, overflows significant
  (depends on how good the hash function is and on # keys/bucket)
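As a quick worked example of the rule of thumb, here is a sketch that computes utilization and classifies it. The function names are hypothetical, and counting only the primary buckets in "total # keys that fit" is an assumption.

```python
def utilization(keys_used: int, num_buckets: int, keys_per_bucket: int) -> float:
    """Utilization = (# keys used) / (total # keys that fit)."""
    return keys_used / (num_buckets * keys_per_bucket)


def advice(u: float, low: float = 0.5, high: float = 0.8) -> str:
    """Apply the 50%-80% rule of thumb."""
    if u < low:
        return "wasting space"
    if u > high:
        return "overflows significant"
    return "ok"


# 5 keys in 4 buckets of 2 keys each: 5 / 8 = 0.625
u = utilization(5, 4, 2)
print(u, advice(u))
```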
How do we cope with growth?
• Overflows and reorganizations
• Dynamic hashing
  - Extensible
  - Linear
Extensible hashing: two ideas
(a) Use i of the b bits output by the hash function
    h(K) = 00110101 (b bits); use the first i bits, where i grows over time
(b) Use a directory
    h(K)[i] → directory entry → bucket
Example: h(k) is 4 bits; 2 keys/bucket
Start: i = 1 (the number in parentheses is how many bits that bucket uses)
  0 → bucket (1): 0001
  1 → bucket (1): 1001, 1100
Insert 1010: bucket 1 is full, so it splits on the second bit and the directory doubles.
New directory: i = 2
  00, 01 → bucket (1): 0001
  10 → bucket (2): 1001, 1010
  11 → bucket (2): 1100
Example continued
Insert: 0111, 0000 (i = 2)
• 0111 fits in the bucket shared by 00 and 01: 0001, 0111
• 0000 overflows that bucket, so it splits; the directory already has i = 2 and does not double
  00 → bucket (2): 0000, 0001
  01 → bucket (2): 0111
  10 → bucket (2): 1001, 1010
  11 → bucket (2): 1100
Example continued
Insert: 1001 (i = 2)
• Bucket 10 (1001, 1010) is full and already uses 2 bits, so it splits and the directory doubles to i = 3
  000, 001 → bucket (2): 0000, 0001
  010, 011 → bucket (2): 0111
  100 → bucket (3): 1001, 1001
  101 → bucket (3): 1010
  110, 111 → bucket (2): 1100
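The extensible hashing example can be replayed with a short sketch. This is an illustration under stated assumptions, not the course's implementation: the class and method names are hypothetical, keys are given directly as 4-bit hash values, the directory is indexed by the leading i bits, and overflow chains are left out (the sketch raises an error where one would be needed).

```python
B_BITS = 4    # hash values are 4 bits, as in the example
CAPACITY = 2  # keys per bucket


class Bucket:
    def __init__(self, depth):
        self.depth = depth  # number of leading bits this bucket uses
        self.keys = []


class ExtensibleHash:
    def __init__(self):
        self.i = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _slot(self, hv):
        return hv >> (B_BITS - self.i)  # leading i bits of the hash value

    def insert(self, hv):
        bucket = self.directory[self._slot(hv)]
        if len(bucket.keys) < CAPACITY:
            bucket.keys.append(hv)
            return
        if bucket.depth == B_BITS:
            raise OverflowError("no bits left; an overflow chain would be needed")
        if bucket.depth == self.i:
            # Double the directory: each old entry becomes two adjacent entries.
            self.directory = [b for b in self.directory for _ in (0, 1)]
            self.i += 1
        # Split the full bucket on its next leading bit.
        bucket.depth += 1
        new = Bucket(bucket.depth)
        old_keys, bucket.keys = bucket.keys, []
        for slot, b in enumerate(self.directory):
            if b is bucket and (slot >> (self.i - bucket.depth)) & 1:
                self.directory[slot] = new
        for k in old_keys:
            self.directory[self._slot(k)].keys.append(k)
        self.insert(hv)  # retry; may trigger another split

    def dump(self):
        return {format(s, f"0{self.i}b"): [format(k, "04b") for k in b.keys]
                for s, b in enumerate(self.directory)}


t = ExtensibleHash()
for hv in (0b0001, 0b1001, 0b1100, 0b1010, 0b0111, 0b0000, 0b1001):
    t.insert(hv)
print(t.i)
print(t.dump())
```

After the seven inserts the directory has i = 3, entry 100 points to a bucket holding 1001 twice, and entry 101 points to a bucket holding 1010.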
Extensible hashing: deletion
• No merging of blocks
• Merge blocks and cut directory if possible (reverse insert procedure)
Deletion example:
• Run through insert example in reverse!
Note: Still need overflow chains
• Example: many records with duplicate keys
  bucket (1): 1101, 1100
  insert 1100
  if we split: [Figure: two buckets using 2 bits each, one of them holding 1100, 1100]
Solution: overflow chains
  insert 1100: add overflow block
  bucket (1): 1101, 1100 → overflow block: 1100
Extensible hashing: Summary
+ Can handle growing files
  - with less wasted space
  - with no full reorganizations
- Indirection (not bad if directory in memory)
- Directory doubles in size (now it fits, now it does not)
Linear hashing
• Another dynamic hashing scheme
Two ideas:
(a) Use i low order bits of hash
    h(K) = 01110101 (b bits); use the last i bits, where i grows over time
(b) File grows linearly
Example: b = 4 bits, i = 2, 2 keys/bucket
  bucket 00: 0000, 1010
  bucket 01: 0101, 1111
  buckets 10, 11: future growth buckets
  m = 01 (max used block)
Rule:
  If h(k)[i] ≤ m, then look at bucket h(k)[i]
  else, look at bucket h(k)[i] - 2^(i-1)
• insert 0101: bucket 01 is full, so 0101 goes on an overflow chain
• can have overflow chains!
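The lookup rule can be written as a small function. The name `bucket_for` is hypothetical; `hv` stands for the hash value h(k) as an integer.

```python
def bucket_for(hv: int, i: int, m: int) -> int:
    """Linear hashing address rule with i low-order bits and max used bucket m."""
    low = hv & ((1 << i) - 1)   # h(k)[i]: the i low-order bits
    if low <= m:
        return low              # that bucket exists already
    return low - (1 << (i - 1))  # otherwise drop the leading one of the i bits


# The four keys of the example, with i = 2 and m = 01:
for hv in (0b0000, 0b1010, 0b0101, 0b1111):
    print(format(hv, "04b"), "->", format(bucket_for(hv, 2, 0b01), "02b"))
```

With i = 2 and m = 01, keys 1010 and 1111 end in 10 and 11, which are not yet in use, so they are sent to buckets 00 and 01.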
Note
• In textbook, n is used instead of m
• n = m + 1
  (in the example above, m = 01, so n = 10)
Example continued: b = 4 bits, i = 2, 2 keys/bucket; after inserting 0101 the file grows
• m = 10: add bucket 10; 1010 moves from bucket 00 to bucket 10
• m = 11: add bucket 11; 1111 moves from bucket 01 to bucket 11, and the overflow record 0101 now fits in bucket 01
  bucket 00: 0000
  bucket 01: 0101, 0101
  bucket 10: 1010
  bucket 11: 1111
Example continued: how to grow beyond this?
  bucket 00: 0000; bucket 01: 0101, 0101; bucket 10: 1010; bucket 11: 1111
  m = 11 (max used block), i = 2
• Set i = 3: buckets 00 … 11 become 000, 001, 010, 011, and 100, 101, 110, 111 are the new future growth buckets
• m = 100: add bucket 100
• m = 101: add bucket 101; 0101, 0101 move from bucket 001 to bucket 101
When do we expand file?
• Keep track of: U = (# used slots) / (total # of slots)
• If U > threshold, then increase m (and maybe i)
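A sketch of linear hashing with threshold-driven growth, replaying the example from these notes. Several things here are assumptions made for illustration: the class and method names, the threshold of 0.8, modeling a bucket together with its overflow chain as one Python list, and counting only primary bucket slots in the denominator of U.

```python
CAPACITY = 2  # keys per bucket


class LinearHash:
    def __init__(self, i=1, m=1, threshold=0.8):
        self.i, self.m, self.threshold = i, m, threshold
        self.buckets = [[] for _ in range(m + 1)]  # list = bucket + overflow chain

    def addr(self, hv):
        low = hv & ((1 << self.i) - 1)  # i low-order bits
        return low if low <= self.m else low - (1 << (self.i - 1))

    def utilization(self):
        used = sum(len(b) for b in self.buckets)
        return used / (CAPACITY * len(self.buckets))

    def insert(self, hv, expand=True):
        self.buckets[self.addr(hv)].append(hv)
        while expand and self.utilization() > self.threshold:
            self.grow()

    def grow(self):
        """Increase m by one (and i when needed) and split one bucket."""
        self.m += 1
        if self.m >= (1 << self.i):
            self.i += 1
        self.buckets.append([])
        src = self.m - (1 << (self.i - 1))  # bucket whose keys may move to bucket m
        old, self.buckets[src] = self.buckets[src], []
        for k in old:
            self.buckets[self.addr(k)].append(k)


t = LinearHash(i=2, m=1)
for k in (0b0000, 0b1010, 0b0101, 0b1111):
    t.insert(k, expand=False)  # load the starting state of the example
t.insert(0b0101)               # U rises above the threshold, so the file grows
print(t.i, t.m, t.buckets)     # 2 3 [[0], [5, 5], [10], [15]]

t.grow()
t.grow()
print(t.i, t.m, t.buckets[5])  # 3 5 [5, 5]
```

Note that growth always splits the next bucket in sequence, not necessarily the bucket that overflowed; that is what makes the file grow linearly.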
Linear Hashing: Summary
+ Can handle growing files
  - with less wasted space
  - with no full reorganizations
+ No indirection like extensible hashing
- Can still have overflow chains
Example: BAD CASE
• Some buckets very full, others very empty
• Need to move m to the very full buckets … would waste space
Summary
Hashing
- How it works
- Dynamic hashing
  - Extensible
  - Linear
Next:
• Indexing vs Hashing
• Index definition in SQL
• Multiple key access
Indexing vs Hashing
• Hashing good for probes given key
  e.g., SELECT … FROM R WHERE R.A = 5
• INDEXING (including B-trees) good for range searches
  e.g., SELECT … FROM R WHERE R.A > 5
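The contrast can be illustrated in a few lines. This is only an analogy: a Python dict stands in for a hash structure and a sorted list searched with `bisect` stands in for an ordered index such as a B-tree. The rows are made-up sample data.

```python
import bisect

rows = [(3, "r1"), (5, "r2"), (7, "r3"), (9, "r4"), (5, "r5")]  # (A, record id)

# Hash-style access: one probe finds all records with A = 5.
hash_index = {}
for a, rid in rows:
    hash_index.setdefault(a, []).append(rid)
equal_5 = hash_index.get(5, [])

# Ordered access: locate the first key greater than 5, then scan forward.
ordered = sorted(rows)
keys = [a for a, _ in ordered]
start = bisect.bisect_right(keys, 5)
greater_5 = [rid for _, rid in ordered[start:]]

print(equal_5)    # ['r2', 'r5']
print(greater_5)  # ['r3', 'r4']
```

The dict gives no help with A > 5, since hashing scatters neighboring key values; answering the range query from it would mean examining every key.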
Index definition in SQL
• CREATE INDEX name ON rel (attr)
• CREATE UNIQUE INDEX name ON rel (attr)
  - defines candidate key
• DROP INDEX name
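These statements can be tried out with Python's built-in `sqlite3` module. The table R, its columns, and the index names `idx_a` and `idx_b` are made up for this sketch, and SQLite is only one system that accepts this syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A INTEGER, B TEXT)")

conn.execute("CREATE INDEX idx_b ON R (B)")         # ordinary index
conn.execute("CREATE UNIQUE INDEX idx_a ON R (A)")  # A becomes a candidate key

conn.execute("INSERT INTO R VALUES (5, 'x')")
try:
    conn.execute("INSERT INTO R VALUES (5, 'y')")   # second row with A = 5
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

conn.execute("DROP INDEX idx_b")
remaining = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'"
).fetchall()

print(duplicate_rejected)  # True
print(remaining)           # [('idx_a',)]
```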
Note
CANNOT SPECIFY TYPE OF INDEX (e.g., B-tree, Hashing, …)
OR PARAMETERS (e.g., Load Factor, Size of Hash, ...)
... at least in SQL ...
Note
ATTRIBUTE LIST ⇒ MULTIKEY INDEX (next)
e.g., CREATE INDEX foo ON R(A,B,C)
Multi-key Index
Motivation: Find records where DEPT = “Toy” AND SAL > 50k
Strategy I:
• Use one index, say Dept.
• Get all Dept = “Toy” records and check their salary
Strategy II:
• Use 2 indexes; manipulate pointers
  (pointers from the Dept index for “Toy”, pointers from the Sal index for Sal > 50k)
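One way to read "manipulate pointers" is to collect the record pointers from each index and intersect them before fetching any record. The sketch below assumes that reading; the records, record ids, and index layouts are made up for illustration.

```python
import bisect

# record id -> (dept, salary in thousands); made-up sample data
records = {1: ("Toy", 60), 2: ("Toy", 40), 3: ("Sales", 70), 4: ("Toy", 55)}

# Index on Dept: dept -> set of record ids (the "pointers")
dept_index = {}
for rid, (dept, _) in records.items():
    dept_index.setdefault(dept, set()).add(rid)

# Index on Sal: sorted (salary, record id) pairs
sal_index = sorted((sal, rid) for rid, (_, sal) in records.items())
sal_keys = [sal for sal, _ in sal_index]

toy = dept_index.get("Toy", set())
first_above_50 = bisect.bisect_right(sal_keys, 50)
high = {rid for _, rid in sal_index[first_above_50:]}

# Intersect the two pointer sets; only these records need to be fetched.
result = sorted(toy & high)
print(result)  # [1, 4]
```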
Strategy III:
• Multiple key index
One idea: a first-level index I1 whose entries point to second-level indexes I2, I3
Example
• Dept index: Art, Sales, Toy
• Each Dept entry points to a Salary index, e.g., 10k, 15k, 17k, 21k and 12k, 15k, 15k, 19k
• Example record: Name = Joe, DEPT = Sales, SAL = 15k
For which queries is this index good?
• Find RECs Dept = “Sales” AND SAL = 20k
• Find RECs Dept = “Sales” AND SAL > 20k
• Find RECs Dept = “Sales”
• Find RECs SAL = 20k
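To experiment with these queries, here is a sketch of a two-level index: a Dept index whose entries lead to per-department salary lists kept in sorted order. Everything except the record Joe (Sales, 15k) is made-up sample data, including which salary list belongs to which department. The salary list is rebuilt on every call, which a real index would not do.

```python
import bisect

# dept -> sorted list of (salary in thousands, name)
index = {
    "Art":   [(10, "Ann"), (15, "Al"), (17, "Abe"), (21, "Amy")],
    "Sales": [(12, "Sue"), (15, "Joe"), (15, "Sam"), (19, "Sid")],
    "Toy":   [(55, "Tom")],
}


def dept_sal_eq(dept, sal):
    """Dept = dept AND SAL = sal: one Dept lookup, then one salary search."""
    entries = index.get(dept, [])
    sals = [s for s, _ in entries]
    lo, hi = bisect.bisect_left(sals, sal), bisect.bisect_right(sals, sal)
    return [name for _, name in entries[lo:hi]]


def dept_sal_gt(dept, sal):
    """Dept = dept AND SAL > sal: one Dept lookup, then a scan of the tail."""
    entries = index.get(dept, [])
    sals = [s for s, _ in entries]
    return [name for _, name in entries[bisect.bisect_right(sals, sal):]]


def sal_eq_any_dept(sal):
    """SAL = sal with no Dept value: every second-level index is visited."""
    return [name for dept in index for name in dept_sal_eq(dept, sal)]


print(dept_sal_eq("Sales", 15))  # ['Joe', 'Sam']
print(dept_sal_gt("Sales", 15))  # ['Sid']
print(sal_eq_any_dept(15))       # ['Al', 'Joe', 'Sam']
```

In this layout, queries that supply a Dept value start from a single first-level entry, while a query on SAL alone has to walk through the salary index of every department.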