1
CSCE 520 Test 2 Info
Indexing
Modified from slides of Hector Garcia-Molina and Jeff Ullman
2
Physical Storage Media
• Speed of data access
• Cost per unit of data
• Reliability
– Data loss (power failure or system crash)
– Physical failure (storage device)
• Storage types
– Volatile storage
– Non-volatile storage
3
Memory Hierarchy
[Figure: memory hierarchy. Levels: Cache, Main Memory, Disk, Tertiary Storage. Other labels: Programs and Main Memory DBMS; DBMS; Virtual Memory; File System]
4
Disk Access Characteristics
• Move data to main memory:
– Position head on cylinder
– Find and access sector
• Steps of reading a block:
– Processor and disk controller process the request
– Seek time: position the head
– Rotational latency: rotate the sector under the head
– Transfer time: sector/block read by the head
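The read steps above add up to a simple cost estimate. The sketch below is only an illustration: the function name and the drive parameters (9 ms seek, 7200 rpm, 100 MB/s, 16 KB blocks) are assumed values, not figures from the slides.

```python
def block_read_time_ms(seek_ms, rpm, transfer_mb_per_s, block_kb, controller_ms=0.0):
    """Estimate the time to read one block, in milliseconds."""
    # Average rotational latency: half of one full revolution.
    rotation_ms = 0.5 * 60_000.0 / rpm
    # Transfer time: block size divided by the sustained transfer rate.
    transfer_ms = (block_kb / 1024.0) / transfer_mb_per_s * 1000.0
    return controller_ms + seek_ms + rotation_ms + transfer_ms


# Hypothetical drive parameters, chosen only for illustration.
estimate = block_read_time_ms(seek_ms=9.0, rpm=7200, transfer_mb_per_s=100.0, block_kb=16)
print(f"{estimate:.2f} ms per block")
```

With numbers like these, seek time and rotational latency dominate; the transfer itself is a small fraction of the total.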
5
Disk Access Characteristics
• Steps of writing a block:
– Read the block into main memory
– Change main memory copy of block
– Write new content back on disk
– Verify correctness of write
6
How to find records efficiently?
• Primary key – sequential organization
• Search key?
• High I/O cost
INDEXING
Cost of Indexing
• Where is the time spent when answering a query?
– Fast: processing in memory
– Slow: fetching from secondary storage
• Cost of indexing:
– Index on several attributes: fast retrieval but slow writes (maintain index structure)
7
8
Topics
• Conventional indexes
• B-trees
• Hashing schemes (read only)
9
Sequential File
File blocks (two records each): [10, 20] [30, 40] [50, 60] [70, 80] [90, 100]
10
Sequential File
File blocks: [10, 20] [30, 40] [50, 60] [70, 80] [90, 100]
Dense Index
Index blocks (one entry per record): [10, 20, 30, 40] [50, 60, 70, 80] [90, 100, 110, 120]
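A dense index lookup can be sketched as a binary search over the index entries. This is a minimal in-memory illustration, assuming the five data blocks drawn on the slide; the names `dense_index` and `dense_lookup` are made up for the example.

```python
from bisect import bisect_left

# Data file from the slide: five blocks, two records each.
blocks = [[10, 20], [30, 40], [50, 60], [70, 80], [90, 100]]

# Dense index: one (key, block number, slot) entry per record, sorted by key.
dense_index = [(key, b, s) for b, block in enumerate(blocks) for s, key in enumerate(block)]
index_keys = [entry[0] for entry in dense_index]


def dense_lookup(key):
    """Return (block, slot) for key, or None, using only the index."""
    i = bisect_left(index_keys, key)
    if i < len(index_keys) and index_keys[i] == key:
        return dense_index[i][1:]
    return None


print(dense_lookup(40), dense_lookup(45))
```

Because every key appears in the index, a missing key is detected without reading any data block.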
11
Sequential File
File blocks: [10, 20] [30, 40] [50, 60] [70, 80] [90, 100]
Sparse Index
Index blocks (one entry per data block): [10, 30, 50, 70] [90, 110, 130, 150] [170, 190, 210, 230]
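With a sparse index, the lookup finds the last index entry whose key is not larger than the search key and then scans that one data block. The sketch below assumes the five data blocks from the slide; `sparse_lookup` is an illustrative name.

```python
from bisect import bisect_right

# Data file from the slide: five blocks, two records each.
blocks = [[10, 20], [30, 40], [50, 60], [70, 80], [90, 100]]

# Sparse index: the first key of each data block.
sparse_index = [block[0] for block in blocks]


def sparse_lookup(key):
    """Return (block, slot) for key, or None if the key is absent."""
    # Last index entry with a key <= the search key.
    i = bisect_right(sparse_index, key) - 1
    if i < 0:
        return None  # smaller than every key in the file
    block = blocks[i]  # this step stands for one data-block read
    if key in block:
        return (i, block.index(key))
    return None


print(sparse_lookup(40), sparse_lookup(45))
```

Unlike the dense case, deciding that key 45 is absent requires reading the data block that would hold it.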
12
Sequential File
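File blocks: [10, 20] [30, 40] [50, 60] [70, 80] [90, 100]
First-level sparse index blocks: [10, 30, 50, 70] [90, 110, 130, 150] [170, 190, 210, 230]
Sparse 2nd level
Second-level index blocks (one entry per first-level block): [10, 90, 170, 250] [330, 410, 490, 570]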
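A two-level lookup repeats the sparse search once per level. The sketch below uses the three first-level index blocks from the slide and builds the second level from them; the function name is an assumption for illustration.

```python
from bisect import bisect_right

# First-level sparse index blocks from the slide (four keys per block).
level1 = [[10, 30, 50, 70], [90, 110, 130, 150], [170, 190, 210, 230]]

# Second level: the first key of each first-level index block.
level2 = [block[0] for block in level1]


def first_key_of_data_block(key):
    """Return the first key of the data block that may hold key, or None."""
    i = bisect_right(level2, key) - 1
    if i < 0:
        return None  # smaller than every indexed key
    index_block = level1[i]  # one index-block read
    j = bisect_right(index_block, key) - 1
    return index_block[j]


print(first_key_of_data_block(40), first_key_of_data_block(100))
```

Only one first-level block is read per lookup, so the second level pays off when the first level is too large to keep in memory.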
13
Sparse vs. Dense Tradeoff
• Sparse: Less index space per record, so more of the index can be kept in memory
• Dense: Can tell whether a record exists without accessing the file
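The space side of this tradeoff is easy to quantify. The sketch below counts index blocks for both schemes; the file size and block capacities are hypothetical numbers picked for the example.

```python
from math import ceil


def index_blocks(num_records, records_per_block, entries_per_index_block, dense):
    """Number of index blocks needed for a one-level dense or sparse index."""
    # Dense: one entry per record. Sparse: one entry per data block.
    entries = num_records if dense else ceil(num_records / records_per_block)
    return ceil(entries / entries_per_index_block)


# Hypothetical file: 1,000,000 records, 10 per data block, 100 entries per index block.
dense_blocks = index_blocks(1_000_000, 10, 100, dense=True)
sparse_blocks = index_blocks(1_000_000, 10, 100, dense=False)
print(dense_blocks, sparse_blocks)
```

Under these assumptions the sparse index is ten times smaller, which is the factor of records per data block.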
14
Terms
• Index sequential file
• Search key (≠ primary key)
• Primary index (on Sequencing field)
• Secondary index
• Dense index (all Search Key values in)
• Sparse index
• Multi-level index
15
Next:
• Duplicate keys
• Deletion/Insertion
• Secondary indexes
16
Duplicate keys
File blocks: [10, 10] [10, 20] [20, 30] [30, 30] [40, 45]
17
Duplicate keys
Dense index, one way to implement?
File blocks: [10, 10] [10, 20] [20, 30] [30, 30] [40, 45]
Dense index blocks (one entry per record): [10, 10, 10, 20] [20, 30, 30, 30]
18
Duplicate keys
Dense index, better way?
File blocks: [10, 10] [10, 20] [20, 30] [30, 30] [40, 45]
Dense index block (one entry per distinct key): [10, 20, 30, 40]
19
Duplicate keys
Sparse index, one way?
File blocks: [10, 10] [10, 20] [20, 30] [30, 30] [40, 45]
Sparse index block (first key of each block): [10, 10, 20, 30]
Careful if looking for 20 or 30!
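The warning about 20 and 30 arises because the first copy of a key can sit at the end of the block before the one its index entry points to. A minimal sketch of a safe search follows, assuming the five file blocks from the slide; the fifth index entry (40) and the name `find_all` are assumptions for the example.

```python
from bisect import bisect_left

# File from the slide, with duplicate keys spread across block boundaries.
blocks = [[10, 10], [10, 20], [20, 30], [30, 30], [40, 45]]

# Sparse index holding the first key of every block (entry 40 is assumed).
sparse_index = [block[0] for block in blocks]


def find_all(key):
    """Return every (block, slot) position that holds key."""
    # Start one block before the first index entry >= key:
    # that block may end with the first copies of key.
    start = max(bisect_left(sparse_index, key) - 1, 0)
    hits = []
    for b in range(start, len(blocks)):
        if blocks[b][0] > key:
            break  # every later record is larger than key
        hits.extend((b, s) for s, k in enumerate(blocks[b]) if k == key)
    return hits


print(find_all(20))
print(find_all(30))
```

Following only the index entry for 20 would land in the third block and miss the copy of 20 at the end of the second block.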
20
Duplicate keys
Sparse index, another way?
– place first new key from block
File blocks: [10, 10] [10, 20] [20, 30] [30, 30] [40, 45]
Sparse index block: [10, 20, 30, 30]
Should this be 40?
21
Summary: Duplicate values, primary index
• Index may point to first instance of each value only
[Figure: Index entries pointing into a File that holds duplicate values a, a, a, b]
22
Deletion from sparse index
File blocks: [10, 20] [30, 40] [50, 60] [70, 80]
Sparse index blocks: [10, 30, 50, 70] [90, 110, 130, 150]
23
Deletion from sparse index
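– delete record 40
File blocks: [10, 20] [30, 40] [50, 60] [70, 80]
Sparse index blocks: [10, 30, 50, 70] [90, 110, 130, 150]
After: block [30, 40] becomes [30]; the index is unchanged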
24
Deletion from sparse index
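– delete record 30
File blocks: [10, 20] [30, 40] [50, 60] [70, 80]
Sparse index blocks: [10, 30, 50, 70] [90, 110, 130, 150]
After: block [30, 40] becomes [40]; index entry 30 becomes 40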
25
Deletion from sparse index
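– delete records 30 & 40
File blocks: [10, 20] [30, 40] [50, 60] [70, 80]
Sparse index blocks: [10, 30, 50, 70] [90, 110, 130, 150]
After: block [30, 40] is empty; index entry 30 is removed and entries 50, 70 move up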
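The three deletion cases above can be sketched in one routine. This is a simplified in-memory illustration: `delete_sparse` is an invented name, and an emptied block is dropped from the list along with its index entry so that positions in `blocks` and `index` stay aligned.

```python
from bisect import bisect_right


def delete_sparse(blocks, index, key):
    """Delete key in place; index holds the first key of each block."""
    i = bisect_right(index, key) - 1
    if i < 0 or key not in blocks[i]:
        return False  # key not present
    blocks[i].remove(key)
    if not blocks[i]:
        # Block is now empty: drop it and its index entry.
        del blocks[i]
        del index[i]
    else:
        # The first key may have changed: refresh the index entry.
        index[i] = blocks[i][0]
    return True


blocks = [[10, 20], [30, 40], [50, 60], [70, 80]]
index = [10, 30, 50, 70]
delete_sparse(blocks, index, 30)
print(index)
delete_sparse(blocks, index, 40)
print(index)
```

Deleting a key that is not first in its block, such as 40 from the original file, leaves the index untouched.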
26
Deletion from dense index
File blocks: [10, 20] [30, 40] [50, 60] [70, 80]
Dense index blocks: [10, 20, 30, 40] [50, 60, 70, 80]
27
Deletion from dense index
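– delete record 30
File blocks: [10, 20] [30, 40] [50, 60] [70, 80]
Dense index blocks: [10, 20, 30, 40] [50, 60, 70, 80]
After: block [30, 40] becomes [40]; index entry 30 is removed and entry 40 moves up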
28
Insertion, sparse index case
File blocks (two slots each): [10, 20] [30, free] [40, 50] [60, free]
Sparse index block: [10, 30, 40, 60]
29
Insertion, sparse index case
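– insert record 34
File blocks (two slots each): [10, 20] [30, free] [40, 50] [60, free]
Sparse index block: [10, 30, 40, 60]
After: 34 goes into the free slot, giving [30, 34]
• Our lucky day! We have free space where we need it!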
30
Insertion, sparse index case
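– insert record 15
File blocks (two slots each): [10, 20] [30, free] [40, 50] [60, free]
Sparse index block: [10, 30, 40, 60]
After: blocks become [10, 15] [20, 30]; index entry 30 becomes 20
• Illustrated: Immediate reorganization
• Variation:
– insert new block (chained file)
– update index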
31
Insertion, sparse index case
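– insert record 25
File blocks (two slots each): [10, 20] [30, free] [40, 50] [60, free]
Sparse index block: [10, 30, 40, 60]
After: 25 goes into an overflow block (reorganize later...)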
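The free-slot and overflow cases can be sketched together. This is a simplified illustration: the overflow chains are modeled as a dictionary keyed by block number, `insert_with_overflow` is an invented name, and a key smaller than every index entry is sent to the first block.

```python
from bisect import bisect_right, insort

CAPACITY = 2  # records per block, as drawn on the slides


def insert_with_overflow(blocks, overflow, index, key):
    """Insert key into its block, or into that block's overflow chain."""
    # Block chosen by the sparse index (first block if key is smallest).
    i = max(bisect_right(index, key) - 1, 0)
    if len(blocks[i]) < CAPACITY:
        insort(blocks[i], key)
        index[i] = blocks[i][0]  # first key may have changed
        return "in block"
    # Block is full: chain the record in an overflow block, reorganize later.
    overflow.setdefault(i, []).append(key)
    return "overflow"


blocks = [[10, 20], [30], [40, 50], [60]]
index = [10, 30, 40, 60]
overflow = {}
print(insert_with_overflow(blocks, overflow, index, 34))
print(insert_with_overflow(blocks, overflow, index, 25))
```

A lookup then has to check the overflow chain of a block as well, which is why chains are cleaned up by a later reorganization.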
32
Insertion, dense index case
• Similar
• Often more expensive . . .
33
Summary so far
• Conventional index
– Basic Ideas: sparse, dense, multi-level…
– Duplicate Keys
– Deletion/Insertion
– Secondary indexes
34
Conventional indexes
Advantage:
- Simple
- Index is sequential file, good for scans
Disadvantage:
- Inserts expensive, and/or
- Lose sequentiality & balance
35
• NEXT: Another type of index
– Give up on sequentiality of index
– Try to get “balance”
36
B+Tree Example n=3
Root: [100]
Non-leaf nodes: [30] [120, 150, 180]
Leaves: [3, 5, 11] [30, 35] [100, 101, 110] [120, 130] [150, 156, 179] [180, 200]
37
Sample non-leaf
Keys: 57, 81, 95
Pointers: to keys k < 57 | to keys 57 ≤ k < 81 | to keys 81 ≤ k < 95 | to keys k ≥ 95
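Choosing which pointer to follow in a non-leaf node amounts to counting the keys that are less than or equal to the search key. A minimal sketch, using the keys of the sample node; `child_slot` is an illustrative name.

```python
from bisect import bisect_right

# Keys of the sample non-leaf node.
keys = [57, 81, 95]


def child_slot(keys, k):
    """Return the position (0-based) of the pointer to follow for key k."""
    # Slot 0 covers k < keys[0]; slot i covers keys[i-1] <= k < keys[i].
    return bisect_right(keys, k)


for k in (56, 57, 80, 81, 95, 200):
    print(k, child_slot(keys, k))
```

A node with n keys therefore needs n+1 pointers, one per key range.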
38
Sample leaf node:
From non-leaf node
Keys: 57, 81, 95
Pointers: to record with key 57 | to record with key 81 | to record with key 95 | to next leaf in sequence
39
Size of nodes: n+1 pointers, n keys (fixed)
40
Don’t want nodes to be too empty
• Use at least
Non-leaf: ⌈(n+1)/2⌉ pointers
Leaf: ⌊(n+1)/2⌋ pointers to data
41
n=3
            Full node          Min. node
Non-leaf    [120, 150, 180]    [30]
Leaf        [3, 5, 11]         [30, 35]
(sequence pointer counts even if null)
42
B+tree rules (tree of order n)
(1) All leaves at same lowest level (balanced tree)
(2) Pointers in leaves point to records except for “sequence pointer”
43
(3) Number of pointers/keys for B+tree

                      Max ptrs   Max keys   Min ptrs→data   Min keys
Non-leaf (non-root)   n+1        n          ⌈(n+1)/2⌉       ⌈(n+1)/2⌉ - 1
Leaf (non-root)       n+1        n          ⌊(n+1)/2⌋       ⌊(n+1)/2⌋
Root                  n+1        n          1               1
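The occupancy limits can be checked numerically. The sketch below assumes the usual rounding (ceiling for non-leaf nodes, floor for leaves) and evaluates the limits for a given order n; the function and dictionary keys are names chosen for the example.

```python
def occupancy_limits(n):
    """Return max/min pointer and key counts for a B+tree of order n."""
    non_leaf_min_ptrs = (n + 2) // 2  # ceiling of (n+1)/2
    leaf_min_ptrs = (n + 1) // 2      # floor of (n+1)/2
    return {
        "non-leaf": {"max_ptrs": n + 1, "max_keys": n,
                     "min_ptrs": non_leaf_min_ptrs, "min_keys": non_leaf_min_ptrs - 1},
        "leaf": {"max_ptrs": n + 1, "max_keys": n,
                 "min_ptrs": leaf_min_ptrs, "min_keys": leaf_min_ptrs},
        "root": {"max_ptrs": n + 1, "max_keys": n, "min_ptrs": 1, "min_keys": 1},
    }


for order in (3, 4):
    limits = occupancy_limits(order)
    print(order, limits["non-leaf"]["min_ptrs"], limits["leaf"]["min_ptrs"])
```

For n=3 this gives a minimum non-leaf node with 2 pointers and 1 key and a minimum leaf with 2 keys, matching the n=3 example nodes [30] and [30, 35].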
44
Insert into B+tree (read only)
(a) simple case
– space available in leaf
(b) leaf overflow
(c) non-leaf overflow
(d) new root
45
(a) Insert key = 32, n=3
Before: leaves [3, 5, 11] [30, 31]; keys above: 30, 100
After: 32 goes into the free slot, giving [30, 31, 32]
46
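(b) Insert key = 7, n=3
Before: leaves [3, 5, 11] [30, 31]; keys above: 30, 100
After: leaf [3, 5, 11] overflows and splits into [3, 5] and [7, 11]; key 7 is added to the parent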
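The simple case and the leaf-overflow case can be sketched for a single leaf. This is an illustration only: it works on plain key lists, ignores record and sequence pointers, and `insert_into_leaf` is an invented name.

```python
from bisect import insort


def insert_into_leaf(leaf, key, n):
    """Insert key into a leaf holding at most n keys.

    Returns (left, right, separator); right and separator are None
    when the key fits without a split.
    """
    keys = list(leaf)
    insort(keys, key)
    if len(keys) <= n:
        return keys, None, None  # simple case: space available
    # Leaf overflow: split the n+1 keys, left half gets the extra key if odd.
    mid = (len(keys) + 1) // 2
    left, right = keys[:mid], keys[mid:]
    # The first key of the new right leaf is inserted into the parent.
    return left, right, right[0]


print(insert_into_leaf([30, 31], 32, n=3))
print(insert_into_leaf([3, 5, 11], 7, n=3))
```

If the parent has no room for the separator, the same split is applied one level up, which is the non-leaf overflow case.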
47
Deletion from B+tree
(a) Simple case - no example
(b) Coalesce with neighbor (sibling)
(c) Re-distribute keys
(d) Cases (b) or (c) at non-leaf
48
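(b) Coalesce with sibling
– Delete 50
n=4
Before: parent keys 10, 40, 100; leaves [10, 20, 30] [40, 50]
After: 40 moves into the sibling, giving [10, 20, 30, 40]; key 40 is removed from the parent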
49
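(c) Redistribute keys
– Delete 50
n=4
Before: parent keys 10, 40, 100; leaves [10, 20, 30, 35] [40, 50]
After: 35 moves to the right leaf, giving [10, 20, 30] [35, 40]; parent key 40 becomes 35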
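The two deletion examples differ only in whether the remaining keys fit into one leaf. The sketch below covers a leaf and its left sibling; it assumes the policy suggested by the examples (coalesce when everything fits in one node, otherwise borrow one key), and `delete_from_leaf` is an invented name.

```python
def delete_from_leaf(left, leaf, key, n):
    """Delete key from leaf; left is its left sibling; n = max keys per leaf.

    Returns (action, new_left, new_leaf, new_separator). The separator is
    None when the parent key is unchanged (simple) or removed (coalesce).
    """
    min_keys = (n + 1) // 2  # floor of (n+1)/2
    left = list(left)
    leaf = [k for k in leaf if k != key]
    if len(leaf) >= min_keys:
        return "simple", left, leaf, None
    if len(left) + len(leaf) <= n:
        # Everything fits in one node: merge into the left sibling.
        return "coalesce", left + leaf, [], None
    # Otherwise borrow the largest key from the left sibling.
    leaf.insert(0, left.pop())
    return "redistribute", left, leaf, leaf[0]


print(delete_from_leaf([10, 20, 30], [40, 50], 50, n=4))
print(delete_from_leaf([10, 20, 30, 35], [40, 50], 50, n=4))
```

After a coalesce the parent loses a key and pointer, so the same check may have to be repeated at the non-leaf level.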
50
B+tree deletions in practice
– Often, coalescing is not implemented
– Too hard and not worth it!