Date post: 28-Dec-2015
DATABASE

PHYSICAL DESIGN

Chandra S. Amaravadi

1

INTRODUCTION

2

PHYSICAL DATABASE DESIGN

Physical database design is concerned with issues revolving around database implementation:

Implementation design

Database storage, access & location

File organization & constraints

3

THE THREE FORMS OF DATA

[Figure: the same records (100, 200, 300, ...) shown at three levels: External, Conceptual/Base table, and Internal/Hardware level]

These three levels provide logical and physical data independence

4

THE THREE TYPES OF MODELS

Level       Models              Facilities
External    Views               Create view, Drop view
Conceptual  Schemas             Create table, Alter table
Internal    File Organizations  Create index, Drop index

Example base table:

Cust#  Name    Address         Balance
100    Gordon  110 Oak Street  $400
200    Prasad  22 Birch place  $2500
300    ...     ...             ...

5

DATABASE

PHYSICAL DESIGN

Inputs?

6

COMPONENTS OF PHYSICAL DESIGN

1. Implementation design

2. Storage, access & distribution strategies

3. File organizations

4. Specifications for integrity constraints (later)

7

IMPLEMENTATION DESIGN

Decide on tables (de-normalization)

Decide on primary and cross reference keys (not discussed further)

Decide on attribute data types (not discussed further)

E.g. fixed vs variable length fields

integer vs double integer

Design reports and forms (not discussed further)

Concerned with taking the results of normalization and designing tables, attributes, data types for implementation.

8

DECIDING ON TABLES

Field Name  Data type  Description             Length  Decimals
Prod#       Numeric    Unique prod code        6       0
Descr       Text       Short prod description  25      0
Price       Currency   Product price           6       2

Denormalization Example (for 1:1)

Before:

Parts(Part#, PartName)

Container(ContainerID, #fin, #needed, Part#)

After:

Parts(Part#, PartName, ContainerID, #fin, #needed)
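As a rough illustration of the 1:1 case, the Python sketch below folds each Container row into the Parts row with the same Part#. The sample rows and the helper name `denormalize_one_to_one` are assumptions made for illustration; only the column names come from the slide.

```python
# Hypothetical sample rows; only the column names come from the slide.
parts = [
    {"Part#": 1, "PartName": "Gear"},
    {"Part#": 2, "PartName": "Rotor"},
]
containers = [
    {"ContainerID": "C10", "#fin": 40, "#needed": 60, "Part#": 1},
    {"ContainerID": "C11", "#fin": 5, "#needed": 20, "Part#": 2},
]


def denormalize_one_to_one(parts, containers):
    """Merge each Container row into the Parts row with the same Part#."""
    by_part = {c["Part#"]: c for c in containers}
    merged = []
    for p in parts:
        c = by_part.get(p["Part#"], {})  # a part without a container gets None values
        merged.append({
            "Part#": p["Part#"],
            "PartName": p["PartName"],
            "ContainerID": c.get("ContainerID"),
            "#fin": c.get("#fin"),
            "#needed": c.get("#needed"),
        })
    return merged


parts_denormalized = denormalize_one_to_one(parts, containers)
```

Because the relationship is 1:1, the merged table has exactly one row per part, so no values are repeated; the gain is one less table to join at query time.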

9

Denormalization is going back in the normal forms to reduce schema overhead

DECIDING ON TABLES..

Denormalization Example (for M:N)

[ER diagram: ORDERS (Ord#, Ord_dt) "Are for" PRODUCTS (Prod#, Descr.), with Qty as an attribute of the relationship]

What tables does normalization result in?

10

Orders(ord#, ord_dt, ..)

Product(prod.#, descr, ..)

Orders for prod (prod.#, ord#, qty)

DENORMALIZATION

Orders(ord#, ord_dt, ..)

Product(prod.#, ord#, descr., qty..)
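The M:N case can be sketched the same way: the intersection table "Orders for prod" is folded into Product, giving one Product row per (prod#, ord#) pair. The sample rows and the function name are hypothetical; the column names follow the slide.

```python
# Hypothetical rows for the three normalized tables.
orders = [
    {"ord#": 1, "ord_dt": "2015-01-05"},
    {"ord#": 2, "ord_dt": "2015-01-06"},
]
products = [
    {"prod#": 10, "descr": "Gears"},
    {"prod#": 20, "descr": "Rotors"},
]
orders_for_prod = [
    {"prod#": 10, "ord#": 1, "qty": 5},
    {"prod#": 20, "ord#": 1, "qty": 2},
    {"prod#": 10, "ord#": 2, "qty": 7},
]


def denormalize_many_to_many(products, orders_for_prod):
    """Fold the intersection table into Product: one row per (prod#, ord#)."""
    descr_by_prod = {p["prod#"]: p["descr"] for p in products}
    return [
        {
            "prod#": line["prod#"],
            "ord#": line["ord#"],
            "descr": descr_by_prod[line["prod#"]],  # repeated for every order
            "qty": line["qty"],
        }
        for line in orders_for_prod
    ]


product_denormalized = denormalize_many_to_many(products, orders_for_prod)
```

The Orders table is left unchanged. Note the trade-off: the product description is now stored once per order line, so one join is saved at the cost of redundant data.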

11

COMPONENTS OF PHYSICAL DESIGN..

1. Implementation design

2. Storage and access strategies

3. Distribution strategies

4. File organizations

5. Specifications for integrity constraints (later)

12

STORAGE & ACCESS STRATEGIES

Estimate storage requirements (Volume analysis)

Determine media to be used (not discussed)

Study how data is being accessed (Usage analysis)

Use these to develop file organization (later)

OBJECTIVES

13

ALSO CALLED VOLUME & USAGE ANALYSIS

Volume and Usage analysis is carried out with a composite usage map.

COMPOSITE USAGE MAP

Used for volume & usage analysis -> file org.

Superimposed on ER Chart

Attributes are not shown

Shows estimated number of records (volume)

Shows type of access (dotted lines)

A composite usage map is simply an ER chart (without attributes) that shows the number of records and the frequency/pattern with which they are accessed.

14

VOLUME & USAGE ANALYSIS

15

Equipment, Parts and PE tables

Equipment: 100; Parts: 12,000; PE: 10,000

20 inquiries per hour to Equipment

300 inquiries per hour on Parts table

70% of these inquiries also need to know Equipment info.

Draw a composite usage map, estimate storage requirements and develop a suitable file organization

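COMPOSITE USAGE MAP

[Figure: EQUIPMENT (100) and PARTS (12,000) connected through ARE FOR / PE (10,000). One access count, 20, is shown; the other access counts are left as "????" and "???" for the reader to fill in]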

16

FOR DISCUSSION

How can one estimate the size of a database?

17

ESTIMATING STORAGE REQMTS. FOR PARTS AND EQUIPMENT

Field widths in bytes:

EQUIPMENT (Model#: 7, Descr: 10, Mfr.: 12, Price: 2, HP: 1, WT: 1)

PARTS (Part#: 1, Descr: 10, Mfr: 12, Price: 2)

PE (Model#: 7, Part#: 1, Qty: 1)

18

Equipment table: 7+10+12+2+1+1 = 33 bytes/record

Parts table: ??

PE table: ??

Total storage requirements = ??
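One way to carry out this estimate is to add up the field widths for each table and multiply by the record count. The sketch below assumes the field widths are as read from the slide (Model# 7, Part# 1, Descr 10, Mfr 12, Price 2, HP 1, WT 1, Qty 1) and uses the record counts from the Equipment/Parts/PE exercise. It ignores block headers, pointers and index space, so treat the result as a lower bound.

```python
# Field widths in bytes, as read from the slide (an assumption where the
# extracted layout is ambiguous).
FIELD_WIDTHS = {
    "EQUIPMENT": {"Model#": 7, "Descr": 10, "Mfr": 12, "Price": 2, "HP": 1, "WT": 1},
    "PARTS": {"Part#": 1, "Descr": 10, "Mfr": 12, "Price": 2},
    "PE": {"Model#": 7, "Part#": 1, "Qty": 1},
}

# Estimated number of records per table (from the volume analysis).
RECORD_COUNTS = {"EQUIPMENT": 100, "PARTS": 12_000, "PE": 10_000}


def record_size(table):
    """Bytes per record = sum of the field widths."""
    return sum(FIELD_WIDTHS[table].values())


def file_size(table):
    """Bytes per file = record size * number of records."""
    return record_size(table) * RECORD_COUNTS[table]


total_bytes = sum(file_size(table) for table in FIELD_WIDTHS)

for table in FIELD_WIDTHS:
    print(table, record_size(table), "bytes/record,", file_size(table), "bytes")
print("Total:", total_bytes, "bytes")
```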

A MORE ELABORATE EXAMPLE

Parts are manufactured parts and purchased parts (40% manufactured, 70% purchased)

Parts: 1,000; Suppliers: 50; Quotations: 2,500

Total of 200 parts inquiries

60 direct inquiries to purchased parts

Of the purchased parts inquiries, 80 are also to quotation

Of these 80, 70 are to supplier as well.

75 direct queries to supplier

Of these 40 are for quotation

All of these are also for parts

19

ANOTHER EXAMPLE..

A COMPOSITE USAGE MAP

[Figure: PART (1000) with Is-a subtypes MANUFACTURED (400, 40%) and PURCHASED (700, 70%); QUOTATION (2500); SUPPLIER (50). Numbers extracted from the access paths on the map: 200, 140, 60, 75, 40, 80, 70, 40, 20, 80]

Note: # of records are in red; the # of accesses are in blue

STORAGE REQUIREMENTS

PART TABLE:

PART_NO (5)

DESCRIPTION (15)

LOCATION (10)

QUANTITY (1)

RECORD SIZE: 31

FILE SIZE: 31 * 1100 = 34,100 Bytes

QUOTATION TABLE:

Estimated record size 150

Estimated file size 150 * 2500 = 375,000 Bytes

Note: This is done similarly for other tables.

21

COMPONENTS OF PHYSICAL DESIGN..

1. Implementation design

2. Storage & access strategies

3. Distribution strategies

4. File organizations

5. Specifications for integrity constraints (later)

22

DISTRIBUTION STRATEGIES

Distribution strategies are concerned with where the files are physically located.

1. Centralized

2. Distributed
   Replicated (not discussed)
   Partitioned

23

DISTRIBUTION STRATEGIES

Centralized -- All the data is stored in one physical location.

Distributed -- The data is stored in multiple physical locations.

Replicated -- The database is duplicated in multiple locations.

Partitioned -- The database is divided into "fragments" and each fragment is stored in a different location.

24

CENTRALIZED VS DISTRIBUTED

Which is a bottleneck?

Which causes security problems?

Which method may be required for business reasons?

In which setup is data more accessible?

Which provides better performance?

25

CENTRALIZED STRATEGY

General Principle: Maximize local access, minimize remote access

[Figure: three sites S1, S2 and S3, annotated with the values 100, 500 and 600]

WHERE SHOULD WE LOCATE THE DATABASE? S1, S2 or S3

26


DISTRIBUTED DATABASE

EID   Name       City
2356  Armstrong  LA
3286  Nickerson  SF
3356  Forrester  MPLS

[Figure: partitioning. The table is split into fragments stored at LA, SF and MPLS]
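A minimal sketch of the partitioned strategy, assuming each employee row is stored at the site named in its City column (the slide shows the three sites but does not spell out the rule). The function name `partition_by` is hypothetical.

```python
from collections import defaultdict

# Rows from the slide's example table.
employees = [
    {"EID": 2356, "Name": "Armstrong", "City": "LA"},
    {"EID": 3286, "Name": "Nickerson", "City": "SF"},
    {"EID": 3356, "Name": "Forrester", "City": "MPLS"},
]


def partition_by(rows, column):
    """Split rows into fragments, one fragment per distinct value of `column`."""
    fragments = defaultdict(list)
    for row in rows:
        fragments[row[column]].append(row)
    return dict(fragments)


fragments = partition_by(employees, "City")
# fragments["LA"] would be stored at the LA site, and so on.
```

Each row lands in exactly one fragment, so local queries (for example, LA staff asking about LA employees) touch only the local site, which is the "maximize local access" principle applied to a distributed setup.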

COMPONENTS OF PHYSICAL DESIGN..

1. Implementation design

2. Storage & access strategies

3. Distribution strategies

4. File organizations

5. Specifications for integrity constraints (later)

29

FILE ORGANIZATION

Tracks

Sectors

File 1

Rec. 1,2..

How records are arranged on secondary storage or mapping between ____ and ______?

30

DATA ACCESS (FYI)

[Figure: data access path. Labels: DBMS, Requests, O/S, Consults, Directory tables (FAT/NTFS), Generates instructions to IOP, IOP, Hard drive, Partition, RAM]

31

Database storage

FILE ORGANIZATION

Selection Criteria

Retrieval time (disk access)

Access type (direct, sequential)

Storage space

Maintenance effort

32

OVERVIEW OF FILE ORGANIZATIONS

Sequential

Hashed

Indexed
   ISAM
   VSAM

33

OVERVIEW OF FILE ORGANIZATIONS..

Sequential -- Records are stored one after another in pkey sequence.

Hashed -- Record address is determined by subjecting pkey to a hashing algorithm.

Indexed -- Same as sequential except that keys are also placed in a separate index file for ease of searching.

34

THE SEQUENTIAL ORGANIZATION

Records in Pkey sequence

Access only sequential

Insertions/Deletions in sequential order

Simple organization

good for batch updates

Part#  Descr.
100    Aux. motors
120    Scrapers
124    Rotors
.....  ............

35

THE HASHING ORGANIZATION

A type of file organization where record addresses are generated by subjecting primary keys to a hashing routine, usually by dividing by a prime#

[Figure: Pkey -> Hashing Algorithm -> Hash Address]

Hash address = REM [(Pkey)/(Prime#)]

Record address = Hash address + Address of Starting Block

36

HASHING CONCEPTS

Hashing algorithm

Hash address

Buckets & Bucket size

Slots

Collisions/overflows

Load factor

Search length

[Figure: file space drawn as slots numbered 1, 2, 3, ..., n, beginning at physical address 3432]

Record address = hash address + physical addr

37

HASHING CONCEPTS..

[Figure: Filespace starting at address 3432, with key 43 placed in it]

Pkey = 43

Hash address = (43 remainder 7) = 1

Record address = 3432 + 1 = 3433

Following are important concepts in hashing:

Hashing algorithm – the formula used to calculate a record address

Hash address – an address (within block) where a hashed record is stored

Buckets – storage area for a group of records; bucket size refers to # of slots.

Slots – storage area for an individual record

Collision – when two records hash to the same address

Load factor – the ratio of # of records to the total space allocated

Average search length – the time it takes to retrieve a record on average (usually expressed in terms of disk accesses)

Disk access – each time the disk is accessed to get a record (if the record is stored in its hardware address, one access; otherwise it depends on the record's location)
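The address calculation for Pkey = 43 can be sketched in a few lines of Python. The function names are hypothetical; the prime (7) and the starting block address (3432) are the values used in the slide's example.

```python
def hash_address(pkey: int, prime: int) -> int:
    """Hash address = remainder of pkey divided by the prime."""
    return pkey % prime


def record_address(pkey: int, prime: int, start_block: int) -> int:
    """Record address = address of starting block + hash address."""
    return start_block + hash_address(pkey, prime)


addr = record_address(43, prime=7, start_block=3432)
print(hash_address(43, 7), addr)  # 1 3433
```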

38

HASHING ALGORITHM

Choose load factor

Identify # of buckets to be allocated

Select a prime# close to this number

Divide each pkey by prime#

Remainder = record address

Sequentially number the buckets

Place each record to its address

If there are overflows, use Open

39

HASHING CONCEPTS..

OVERFLOWS

Collision: When two keys hash to the same address

Open overflow (store in unallocated slots)

Chained overflow (a separate area)

[Figure: file space with slots numbered 1 to n]

40

HASHING EXAMPLE

Given Part#s:

Part#  Descr.          Hash address
100    Gears           100 Mod 7 = 2
120    Scrapers        120 Mod 7 = 1
130    Aux motors      130 Mod 7 = 4
140    Crankshafts     140 Mod 7 = 0
145    Cylinder heads  145 Mod 7 = 5
150    Pistons         150 Mod 7 = 3

assume 8 buckets (0..7)

assume 1 slot per bucket

assume disk access time of 20 ms

41

HASHING EXAMPLE..

FILE LOADINGS

Bucket  Record
0       140 Crankshaft
1       120 Scrapers
2       100 Gears
3       150 Pistons
4       130 Aux. motor
5       145 Cylinders
6
7

Insert: 135 Shovel?

135 Mod 7 = 2

Average search length?

6 records -> 1 access

1 record -> 2 accesses

Load factor: ?

Bucket size = ?
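The loading and the search-length arithmetic for this example can be sketched as follows. Two things here are assumptions rather than part of the slide: open overflow is implemented as a forward scan (with wrap-around) for the next bucket with a free slot, and every overflow record is charged a flat 2 disk accesses, which is the slide's own accounting. A strict bucket-by-bucket probe would charge more for a record that lands several buckets away.

```python
def load_hashed_file(keys, modulus, n_buckets, slots_per_bucket=1,
                     overflow_accesses=2):
    """Place keys at key % modulus; overflows go to the next bucket with room.

    Returns (buckets, accesses), where accesses[key] is the assumed number
    of disk accesses needed to retrieve that key.
    """
    buckets = [[] for _ in range(n_buckets)]
    accesses = {}
    for key in keys:
        home = key % modulus
        if len(buckets[home]) < slots_per_bucket:
            buckets[home].append(key)
            accesses[key] = 1  # found at its home address
            continue
        # Open overflow: scan forward (wrapping around) for an unallocated slot.
        for step in range(1, n_buckets):
            candidate = (home + step) % n_buckets
            if len(buckets[candidate]) < slots_per_bucket:
                buckets[candidate].append(key)
                accesses[key] = overflow_accesses
                break
        else:
            raise RuntimeError("file is full")
    return buckets, accesses


N_BUCKETS, SLOTS = 8, 1
keys = [100, 120, 130, 140, 145, 150, 135]  # 135 (Shovel) is inserted last
buckets, accesses = load_hashed_file(keys, modulus=7, n_buckets=N_BUCKETS,
                                     slots_per_bucket=SLOTS)

average_search_length = sum(accesses.values()) / len(keys)
load_factor = len(keys) / (N_BUCKETS * SLOTS)  # records / slots allocated

print(buckets)
print(round(average_search_length, 2), "accesses on average")
print(load_factor)
```

With a 20 ms disk access time, the average retrieval time is simply `average_search_length * 20` ms.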

42

THE HASHING ORGANIZATION

H(pkey) --> record address

Records in hash sequence

Need to allocate extra space

Load factor between 60-80%

Good for low activity (FAR) files

Real-time and OO applns.

EVALUATION

43

DISCUSSION

A parts file with Part# as the pkey includes records with the following part# values: 23, 37, 46, 48, 56, 18, 10, 71, 16, 24, 39, 47 and 69.

The file uses 8 buckets numbered 0 to 7. Each bucket holds two records.

Load these records into the file in the given order using the hash function h(K) = K mod 8. Calculate the average search length in terms of # of disk accesses.

44

INDEXED ORGANIZATION

A method of file organization where a subset of key values are stored in an index. Types are:

Primary key

Secondary key

Clustered

45

THE INDEXED ORGANIZATION (ISAM)

Records are in pkey sequence (master file)

But are organized into groups

Grouping information is stored in index file

Records can be inserted at random

Records can be accessed in sequence or at random

46

THE INDEXED ORGANIZATION

[Figure: Index file (index set) pointing into the Master file (sequence set), keyed on Emp ID]

47

THE INDEXED ORGANIZATION

[Figure: disk layout showing TRACKS within CYLINDER1]

48

THE ISAM ORGANIZATION

[Figure: ISAM layout across CYLINDER1, CYLINDER2, .. CYLINDER N.
Index Set: a Cylinder index (87 189 300) above one Track index per cylinder (43 69 87 for CYLINDER1; the other track index values extracted are 136 150 and 250 300).
Sequence Set: data tracks in pkey order (24 32 43; 45 62 69; 74 77 87; 122 136; 141 150 172; 175 181 189; 278 281 300), with Overflow tracks on each cylinder]

Note: Assume that the corresponding HW addresses are stored along with the pkeys

49

INSERTIONS IN ISAM

Identify track where record needs to be inserted

If the track is full, insert in overflow area

If the track has room insert pkey in sequence

Update track index and cylinder index if necessary
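The insertion steps above can be sketched for a single cylinder. This is a simplified model, not a full ISAM implementation: the class name `Cylinder`, the track capacity of 3 records, and the starting contents (CYLINDER1's keys with 87 not yet inserted) are assumptions made for illustration.

```python
import bisect

TRACK_CAPACITY = 3  # assumed records per track (not stated on the slide)


class Cylinder:
    """One cylinder: data tracks in pkey order, a track index, an overflow area."""

    def __init__(self, tracks):
        self.tracks = [sorted(t) for t in tracks]
        self.track_index = [t[-1] for t in self.tracks]  # highest pkey per track
        self.overflow = []

    def find_track(self, pkey):
        # First track whose highest key is >= pkey; keys beyond the last
        # index entry are directed to the last track.
        i = bisect.bisect_left(self.track_index, pkey)
        return min(i, len(self.tracks) - 1)

    def insert(self, pkey):
        i = self.find_track(pkey)
        track = self.tracks[i]
        if len(track) >= TRACK_CAPACITY:
            self.overflow.append(pkey)       # track full: use the overflow area
            return "overflow"
        bisect.insort(track, pkey)           # room: insert in pkey sequence
        self.track_index[i] = track[-1]      # update the track index if needed
        return f"track {i}"

    @property
    def cylinder_index_entry(self):
        """Highest pkey on the cylinder, as it would appear in the cylinder index."""
        return self.track_index[-1]


cyl = Cylinder([[24, 32, 43], [45, 62, 69], [74, 77]])
first = cyl.insert(87)   # last track has room; track index entry becomes 87
second = cyl.insert(50)  # belongs on the middle track, which is full
print(first, second, cyl.track_index, cyl.overflow)
```

Retrieval time stays uniform only while the overflow area is small, which is why ISAM files are periodically rewritten sequentially.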

50

ISAM: ADVANTAGES AND DISADVANTAGES

Access is direct or sequential?

Access time dependent on?

Rewrite sequentially

Retrieval time uniform

Suitable for volatile files?

Workhorse organization used in most apps.

51

SECONDARY KEY INDEX

EMPLOYEE

REC#  E_SSN        E_NAME     E_TITLE     E_SALARY
1.    456-34-8895  Smith      Programmer  $35,000
2.    459-66-6785  Johnson    Analyst     $27,000
3.    467-89-8898  Weintraub  Programmer  $60,000
4.    478-64-8005  Dickson    Manager     $64,000
5.    489-12-5575  Holland    Analyst     $47,000
6.    492-93-4438  Rao        Analyst     $71,000
7.    537-89-8898  McDonald   Manager     $85,000

E_TITLE     REC#
Analyst     2,5,6
Manager     4,7
Programmer  1,3
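Building such an index amounts to grouping record numbers by the secondary key. A minimal sketch, using the EMPLOYEE rows above as tuples; the function name `build_secondary_index` and the tuple layout are assumptions.

```python
from collections import defaultdict

# (REC#, E_SSN, E_NAME, E_TITLE, E_SALARY), in primary key (E_SSN) order.
employees = [
    (1, "456-34-8895", "Smith", "Programmer", 35000),
    (2, "459-66-6785", "Johnson", "Analyst", 27000),
    (3, "467-89-8898", "Weintraub", "Programmer", 60000),
    (4, "478-64-8005", "Dickson", "Manager", 64000),
    (5, "489-12-5575", "Holland", "Analyst", 47000),
    (6, "492-93-4438", "Rao", "Analyst", 71000),
    (7, "537-89-8898", "McDonald", "Manager", 85000),
]


def build_secondary_index(rows, key_position):
    """Map each secondary key value to the list of REC#s that carry it."""
    index = defaultdict(list)
    for row in rows:
        index[row[key_position]].append(row[0])
    return dict(sorted(index.items()))  # keep index entries in key order


title_index = build_secondary_index(employees, key_position=3)
print(title_index)
```

The data file itself is not reordered; each index entry needs a list of record numbers because records with the same title are scattered through the file.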

52

CLUSTERED INDEX

EMPLOYEE

REC#  E_SSN        E_NAME     E_TITLE     E_SALARY
1.    459-66-6785  Johnson    Analyst     $27,000
2.    489-12-5575  Holland    Analyst     $47,000
3.    492-93-4438  Rao        Analyst     $71,000
4.    478-64-8005  Dickson    Manager     $64,000
5.    537-89-8898  McDonald   Manager     $85,000
6.    467-89-8898  Weintraub  Programmer  $60,000
7.    456-34-8895  Smith      Programmer  $35,000

E_TITLE     REC#
Analyst     1
Manager     4
Programmer  6

Also known as Inverted file organization
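For contrast with the secondary key index, a clustered index can be sketched by first reordering the file on the index key and then storing only the first record number of each group. The function and variable names are hypothetical; the rows are the same seven employees.

```python
# (E_SSN, E_NAME, E_TITLE, E_SALARY), initially in E_SSN order.
employees = [
    ("456-34-8895", "Smith", "Programmer", 35000),
    ("459-66-6785", "Johnson", "Analyst", 27000),
    ("467-89-8898", "Weintraub", "Programmer", 60000),
    ("478-64-8005", "Dickson", "Manager", 64000),
    ("489-12-5575", "Holland", "Analyst", 47000),
    ("492-93-4438", "Rao", "Analyst", 71000),
    ("537-89-8898", "McDonald", "Manager", 85000),
]


def title_of(row):
    return row[2]


def build_clustered_index(rows, key):
    """Physically order rows by `key`, then index the first REC# of each group."""
    ordered = sorted(rows, key=key)  # records with equal keys become adjacent
    index = {}
    for rec_no, row in enumerate(ordered, start=1):
        index.setdefault(key(row), rec_no)  # keep only the first occurrence
    return ordered, index


clustered_file, title_index = build_clustered_index(employees, title_of)
print(title_index)
```

One index entry per title is enough because all records with that title are stored together. The order of records within a group depends on the sort used (here a stable sort on title only), so it may differ from the slide's listing without changing the index.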

53

INDEXING STRATEGIES

Index if you must

Index on pkey

Index on foreign keys

Index on secondary key

(depending on query frequency)

54

DISCUSSION

What activities are part of identifying storage strategies?

How is denormalization carried out for M:N relationships?

How many indexes can you have per table?

How many clustered indexes?

Can we sequentially update all records in a) hashing organization? b) in indexing?

Is indexing suitable for volatile files?

If an index consists of 3 levels of indexes with the main index in RAM, and a disk access time of 20 ms, how long on the average does it take to retrieve a record?

What problems do overflow records cause in hashing?

55

THE END!

56

