+ All Categories
Home > Technology > Index management in depth

Index management in depth

Date post: 15-Jul-2015
Category:
Upload: andrea-giuliano
View: 64 times
Download: 1 times
Share this document with a friend
Popular Tags:
58
INDEX MANAGEMENT IN DEPTH Andrea Giuliano @bit_shark
Transcript

I N D E X M A N A G E M E N TI N D E P T H

A n d r e a G i u l i a n o@ b i t _ s h a r k

D ATA B A S E M A N A G E M E N T S Y S T E M S

T H E A R C H I T E C T U R ED A TA B A S E M A N A G E M E N T S Y S T E M S

web forms Application Front Ends SQL Interface

SQL COMMANDS

Plan Executor Parser

OptimizerOperator Evaluator

Access File Manager

SQL Engine

Buffer Manager

Disk Manager

Transaction Manager

Lock Manager

Recovery Manager

Concurrency Control

DBMS

S Q L E N G I N EA R C H I T E C T U R E O F A D B M S

web forms Application Front Ends SQL Interface

SQL COMMANDS

Plan Executor Parser

OptimizerOperator Evaluator

Access File Manager

SQL Engine

Buffer Manager

Disk Manager

Transaction Manager

Lock Manager

Recovery Manager

Concurrency Control

DBMS

A C C E S S F I L E M A N A G E RA R C H I T E C T U R E O F A D B M S

web forms Application Front Ends SQL Interface

SQL COMMANDS

Plan Executor Parser

OptimizerOperator Evaluator

Access File Manager

SQL Engine

Buffer Manager

Disk Manager

Transaction Manager

Lock Manager

Recovery Manager

Concurrency Control

DBMS

B U F F E R M A N A G E RA R C H I T E C T U R E O F A D B M S

web forms Application Front Ends SQL Interface

SQL COMMANDS

Plan Executor Parser

OptimizerOperator Evaluator

Access File Manager

SQL Engine

Buffer Manager

Disk Manager

Transaction Manager

Lock Manager

Recovery Manager

Concurrency Control

DBMS

D I S K M A N A G E RA R C H I T E C T U R E O F A D B M S

web forms Application Front Ends SQL Interface

SQL COMMANDS

Plan Executor Parser

OptimizerOperator Evaluator

Access File Manager

SQL Engine

Buffer Manager

Disk Manager

Transaction Manager

Lock Manager

Recovery Manager

Concurrency Control

DBMS

U N I T O F I N F O R M AT I O N

Page size ranges from 2Kb to 64Kb I/O of pages dominates the cost of the operations

F I L E O R G A N I Z AT I O N

method of arranging the records in a file

–Sars, Roebuck, and Co, Consumers’ Guide, 1897

“If you don’t find it in the index, look very carefully through the entire catalog.”

I N D E X E S

organize data records on disk to optimize retrieval operations

I N D E X E S

data entry record stored in an index file

data record record stored in a database file

P R O P E R T I E S O F A N I N D E X

1. Structure of data entries

2. Clustered/unclustered

3. Primary/secondary

4. Dense/sparse

5. Organization of the index

S T R U C T U R E O F D ATA E N T R I E S

k*: data entry whose search key value is k

1. k* is a data record - extreme case

2. k* is a pair (k, rid) - the index file is independent from the data file

3. k* is a pair <k, rid-list> - the index file is independent from the data file - better use of space but variable-length data entries

P R O P E R T I E S O F A N I N D E X

C L U S T E R E D / U N C L U S T E R E D I N D E Xunclustered clustered

data record file

index file

data records

data entries

data entries

index entries

data records

P R O P E R T I E S O F A N I N D E X

P E R F O R M A N C E

data record file

index file

data records

data entries

Clustered only few pages have to be retrieved

Unclustered as many data page I/Os as the number of data entries

C L U S T E R E D / U N C L U S T E R E D I N D E X

P R I M A R Y A N D S E C O N D A R Y I N D E XP R O P E R T I E S O F A N I N D E X

10

20

30

40

10

20

30

40

page pointers

page in index file

page in data file

P R O P E R T I E S O F A N I N D E X

D E N S E I N D E X

10

30

50

70

10

20

30

40

50

60

70

80

page pointers

page in index file

page in data file

P R O P E R T I E S O F A N I N D E X

S PA R S E I N D E X

P R O P E R T I E S O F A N I N D E X

O R G A N I Z AT I O N O F T H E I N D E X

sorted index the index is a sorted file

tree-based index the index is a tree

hash-based index the index is a function from search key values to record addresses

I S A M A N D B +- T R E ET R E E - B A S E D I N D E X

ISAM used when the relation is static: no insertion and deletion on the tree

b+-tree effective in dynamic situations with insertion and deletion

Data Pages

Index Pages

Overflow Pages

Page Allocation in ISAM

Non-leaf pages

Leaf pages

Primary pages

Overflow page

I S A MT R E E - B A S E D I N D E X

I S A M : O P E R AT I O N ST R E E - B A S E D I N D E X

search identical to b+-tree (more on this soon).

insertion find the right position on the tree and write the key (possible overflow pages).

deletion remove the entry and the empty overflow page if needed. Leave the empty primary page as it is.

B +- T R E ET R E E - B A S E D I N D E X

Index entries

Data entries

Index file

12 78

3 9 19 56 86 94

33 44

Daniel, 22, 6003

Ashby, 25, 3000

Basu, 33, 4003 Rossi, 44, 3000

Bianchi, 50, 5004

B +- T R E ET R E E - B A S E D I N D E X

… … … … … … ……

12 78

3 9 19 56 86 94

33 44

Daniel, 22, 6003

Ashby, 25, 3000

Basu, 33, 4003 Rossi, 44, 3000

Bianchi, 50, 5004

… … … … … … ……

B +- T R E E : S E A R C HT R E E - B A S E D I N D E X

Start search

B +- T R E E : S E A R C HT R E E - B A S E D I N D E X

12 78

3 9 19 56 86 94

33 44

Daniel, 22, 6003

Ashby, 25, 3000

Basu, 33, 4003 Rossi, 44, 3000

Bianchi, 50, 5004

… … … … … … ……

Start search

A

B

L1 L2 L3

find all data entries with 24 < key < 50

12 78

3 9 19 56

33 44

Daniel, 22, 6003

Ashby, 25, 3000

Basu, 33, 4003 Rossi, 44, 3000

Bianchi, 50, 5004

… … … …

Start search

A

B

L1 L2 L3

S E A R C H : C O S TT R E E - B A S E D I N D E X

f: fanout h: height m: leaves (f h)

Cost of a search [logF m]

T R E E - B A S E D I N D E X

12 78

3 9 19 56

33 44

Daniel, 22, 6003

Ashby, 25, 3000

Basu, 33, 4003 Rossi, 44, 3000

Bianchi, 50, 5004

… … … …

Start search

A

B

L1 L2 L3

S E A R C H : C O S T

f = 3 h = 3 m = 27

I/Os [log3 27] = 3

B +- T R E E : I N S E R TT R E E - B A S E D I N D E X

13 17 24 30

2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*

Insertion of a data record with search key value 8

2* 3* 5* 7* 8*

B +- T R E E : I N S E R TT R E E - B A S E D I N D E X

5

data entry is copied up

B +- T R E E : I N S E R TT R E E - B A S E D I N D E X

17

data entry is pushed up

5 13 24 30

B +- T R E E : I N S E R TT R E E - B A S E D I N D E X

17

5* 7* 8* 14*

16*

19*

20*

22*

24*

27*

29*

33*

34*

38*

39*

5 13 24 30

2* 3*

The resulting tree after the insertion of a data record with search key value 8.

B +- T R E E : D E L E T ET R E E - B A S E D I N D E X

17

5* 7* 8* 14*

16*

22 24 27 29 33*

34*

38*

39*

5 13 27 30

2* 3*

The resulting tree after deleting entries 19* and 20*

- 2.000.000 records - 200.000 pages - 10 data record in a page - 200 records with the same value of the attribute cost

(on average) - dense unclustered b+-tree index with search key cost - alternative 2 (k*, rid)

Bookcode

authorcost

publisher

E X A M P L E

D E N S E U N C L U S T E R E D I N D E X

SELECT code, author, publisher FROM Book WHERE cost = %cost%

- 2.000.000 records - 200.000 pages - 10 data record in a page - 200 records with the same value of the attribute cost

(on average) - dense unclustered b+-tree index with search key cost - alternative 2 (k*, rid)

Bookcode

authorcost

publisher

E X A M P L E

D E N S E U N C L U S T E R E D I N D E X

SELECT code, author, publisher FROM Book WHERE cost = %cost%

- 2.000.000 records - 200.000 pages - 10 data record in a page - 200 records with the same value of the attribute cost

(on average) - dense unclustered b+-tree index with search key cost - alternative 2 (k*, rid)

Bookcode

authorcost

publisher

how many page accesses do we need to answer to the query?

E X A M P L E

D E N S E U N C L U S T E R E D I N D E X

E X A M P L E

D E N S E U N C L U S T E R E D I N D E X

- 2.000.000 records - 200.000 pages - 10 data record in a page - 200 records with the same value of the attribute cost

(on average) - dense unclustered b+-tree index with search key cost - alternative 2 (k*, rid)

Bookcode

authorcost

publisher

Let’s build the index structure each record has 4 field so in each page there are 40 fields 20 data entries fit in one leaf page of the index the tree has a fan-out of 20

E X A M P L E

D E N S E U N C L U S T E R E D I N D E X

fanout: 20

occupancy factor of 67% leads to 13 data entries in the leaves

How many leaves are there in the tree?

2.000.000/13 = 153.846

log20 (153.846) = 4 I/O page accesses

E X A M P L E

D E N S E U N C L U S T E R E D I N D E X

How many I/Os are needed to go to the leaves?

leaves: 153.846 fanout: 20

E X A M P L E

D E N S E U N C L U S T E R E D I N D E X

- 200 records with the same value of the attribute cost (on average)

- 13 data entries in the leaves - dense unclustered b+-tree index with search key cost

Bookcode

authorcost

publisherSELECT code, author, publisher FROM Book WHERE cost = %cost%

- 15 pages (200/13) to visit for reaching data records with the same cost value (on average)

~ 3 sec

E X A M P L E

D E N S E U N C L U S T E R E D I N D E X

Costs path to the leaves: 4 I/Os leaves access: 15 I/Os data records: 200 I/Os

Total cost 4 + 15 + 200 = 219 I/Os

- 2.000.000 records - 200.000 pages - 10 data record in a page - 200 records with the same value of the attribute cost

(on average) - sparse clustered b+-tree index with search key cost - alternative 2 (k*, rid)

Bookcode

authorcost

publisher

E X A M P L E

S PA R S E C L U S T E R E D I N D E X

- 2.000.000 records - 200.000 pages - 10 data record in a page - 200 records with the same value of the attribute cost

(on average) - sparse clustered b+-tree index with search key cost - alternative 2 (k*, rid)

Bookcode

authorcost

publisher

E X A M P L E

S PA R S E C L U S T E R E D I N D E X

SELECT code, author, publisher FROM Book WHERE cost = %cost%

- 2.000.000 records - 200.000 pages - 10 data record in a page - 200 records with the same value of the attribute cost

(on average) - sparse clustered b+-tree index with search key cost - alternative 2 (k*, rid)

Bookcode

authorcost

publisher

E X A M P L E

S PA R S E C L U S T E R E D I N D E X

SELECT code, author, publisher FROM Book WHERE cost = %cost%

how many page accesses do we need to answer to the query?

- 2.000.000 records - 200.000 pages - 10 data record in a page - 200 records with the same value of the attribute cost

(on average) - sparse clustered b+-tree index with search key cost - alternative 2 (k*, rid)

Bookcode

authorcost

publisher

E X A M P L E

S PA R S E C L U S T E R E D I N D E X

Let’s build the index structure each record has 4 field so in each page there are 40 fields 20 data entries fit in one leaf page of the index the tree has a fan-out of 20

E X A M P L E

S PA R S E C L U S T E R E D I N D E X

fanout: 20

How many pages store 2.000.000 data records?

2.000.000/10 = 200.000

each data entry points to a data record page

E X A M P L E

S PA R S E C L U S T E R E D I N D E X

fanout: 20

occupancy factor of 67% leads to 13 data entries in the leaves

How many leaves are there in the tree?

200.000/13 = 15.384

log20 (15.384) = 3 I/O page accesses

E X A M P L E

S PA R S E C L U S T E R E D I N D E X

How many I/Os are needed to go to the leaves?

leaves: 15.384 fanout: 20

fanout: 20

E X A M P L E

S PA R S E C L U S T E R E D I N D E X- 200 records with the same value of the attribute cost

(on average) - 10 data records in a page - sparse clustered b+-tree index with search key cost

Bookcode

authorcost

publisherSELECT code, author, publisher FROM Book WHERE cost = %cost%

- 20 pages (200/10) of data records to visit (on average)

~ 0.3 sec

E X A M P L E

S PA R S E C L U S T E R E D I N D E X

Costs path to the leaves: 3 I/Os data records: 20 I/Os

Total cost 3 + 20 = 23 I/Os

E X A M P L E

A VA R I A N T

What if the attributes were part of the search key?

SELECT code, author, publisher FROM Book WHERE cost = %cost%

E X A M P L E

W I T H O U T I N D E X

In the worst case we have to visit 2.000.000 records

SELECT code, author, publisher FROM Book WHERE cost = %cost%

~ 50 min

$ W H O A M I

Andrea Giuliano@bit_sharkwww.andreagiuliano.it

joind.in/13333Please rate the talk!

Ramakrishan, Gehrke “Database Management Systems” Assets:

https://farm4.staticflickr.com/3577/3492185538_a39dbb4511_b_d.jpg https://farm3.staticflickr.com/2852/10740309163_12f6a671cc_k_d.jpg https://farm3.staticflickr.com/2456/3835365695_5e515a3492_b_d.jpg https://farm5.staticflickr.com/4048/4332381194_3cfbed7f8e_b_d.jpg https://farm5.staticflickr.com/4117/4800819674_3cf963deaa_b_d.jpg https://farm5.staticflickr.com/4141/4772464179_0672159bbd_b.jpg https://farm9.staticflickr.com/8530/8574154090_bd14f9ccbf_o_d.jpg https://farm8.staticflickr.com/7370/10847923014_2b3fc30cea_k_d.jpg

R E F E R E N C E S


Recommended