
normalization

Date post: 31-Oct-2014
Upload: prince-harsha
Transcript
Page 1: normalization

Chapter 1 Functional Dependencies

The topic of functional dependencies is very important for databases because it makes it possible to analyse how the attributes in a given set of relations relate to each other. Analysing these relationships is essential if one wants to avoid problems when the information is actually stored in the database. Examples of these problems will be seen in more detail in the next chapter, when we speak about normalisation.

1.1 A simplified view of the topic

In its simplest form, a functional dependency (FD) in databases, as the name says, is a dependency of the value of one attribute of a relation on the value of another attribute (not necessarily distinct). To understand what a functional dependency is, it is useful to recall the concept of a function in mathematics. I will make a quick recap here, stressing only the properties which are important for us.

A function is simply an association of elements of two sets X and Y. The set X is called the domain of the function and the set Y is called its codomain (often loosely called its image). This association is such that "every element in X is associated with exactly one element in Y". Note that different elements in X can be associated with the same element in Y. What is not allowed is that the same element in X is mapped to different elements in Y. Figure 1.1 shows examples of associations and whether they qualify as functions or not. The association on the far right end is not a function because x2 is associated with both y2 and y3.

Let us now bring the concept back to the field of databases. The elements in X and Y are

Page 2: normalization

simply the values in the attributes of the relation. The simplest case in databases occurs when X and Y are just one attribute each, but we will see that, in general, X and Y might be a composition of attributes. Consider the relation "Albums" below.

Page 3: normalization

Notes on Logical Database Design & Normalization

Database design primarily consists of two parts: logical design and physical design. Logical database design requires following a set of rules called the rules of normalization.

Normalization Conditions: Any database designer must address two fundamental issues:
• Storing data on disk in the way that is most efficient in disk space usage, resulting in low cost.
• Fetching and saving data with the fastest response time, resulting in high performance.

For better understanding, I give live examples from a project, with a clear explanation of the 3 important rules of normalization.

Scope of my project: to display the information of all the doctors who give lectures on my company's products.

Rule I: The first rule of normalization requires removing repeating data values and specifies that no two rows can be identical in a database. This means that each entity must have a primary key (one or more columns of the table) which uniquely identifies a row in the table.

For example, the table Professional below stores all the doctors' information. Each doctor is uniquely identified in this table because I have primary key constraints on the following columns.

PROFESSIONAL

Columns: PROF_ID, DB2_SPKR_NUM, LAST_NAME, FIRST_NAME, MIDDLE_INITIAL, SUFFIX, SALUTATION, SSN, EMAIL_ADDRESS, ME_NUM, CID, REPSTRY_POST_DTM, SPEC_ID
Primary keys: SSN, CID

SSN and CID are two unique numbers for a Professional (doctor). So at no point in time can a professional appear twice in this table.

Page 4: normalization

Rule II: An entity is in the second normal form if it conforms to the first normal form and all non-key attributes of the table are fully dependent on the entire primary key. If the primary key consists of multiple columns, then non-key columns should depend on the entire key and not just a subset of the key.

Non-key columns: LAST_NAME, FIRST_NAME, MIDDLE_INITIAL, SUFFIX, SALUTATION

These columns depend on the whole primary key (SSN, CID) and not on just part of it.

Rule III: An entity is in the third normal form if it already conforms to the first two normal forms and none of the non-key attributes is dependent on any other non-key attribute. All such attributes should be removed from the table.

PROFESSIONAL

Columns: PROF_ID, DB2_SPKR_NUM, LAST_NAME, FIRST_NAME, MIDDLE_INITIAL, SUFFIX, SALUTATION, SSN, EMAIL_ADDRESS, ME_NUM, CID, REPSTRY_POST_DTM, SPEC_ID, FULL_NAME

If FULL_NAME = LAST_NAME + FIRST_NAME, the existence of the FULL_NAME column in the table violates the third normal form because a non-key attribute (FULL_NAME) is dependent on two other non-key attributes (LAST_NAME and FIRST_NAME). Therefore, to conform to the third rule of normalization, you must remove the FULL_NAME column from the Professional table.

Advantages of Normalization:
• Because information is logically kept together, normalization provides a better understanding of the system.
• With less redundant data, it is easier to maintain referential integrity for the system.
• Because tables are smaller with normalization, index creation and data sorts are much faster.

Page 5: normalization

Disadvantages of Normalization: The main goal of normalization is to reduce redundancy in the system. As a result of normalization, data is stored in different tables. To retrieve or modify data, you usually have to join several tables, which can have an adverse impact on the performance of the system.

Page 6: normalization

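CS3 Tutorial on Normalisation
Lucy Hederman

1. The determinacy diagram for table X(A, B, C) is:

[Determinacy diagram over A, B and C; the arrows are not recoverable from this transcript.]

An occurrence of the table is (values listed per column in the order they appear in this transcript; the row alignment of the original figure is not certain):

A: a1, a2, a3, a4
B: b2, b1, b2, b2
C: c1, c2, c1, c2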

Suppose a fifth row, starting with a2 as the value of A, were to be added. What must be the value of attribute B? What must be the value of attribute C? Why would the fifth row be illegal? Can attribute A contain duplicate values? Is attribute A a candidate key?

2. Shoes are sold in a variety of styles and sizes. A style is identified by a style#. Each style has a single description (e.g. men's slippers) and the same description may apply to several styles. The attribute weekly-sales represents the number of shoes of a particular size and style sold in the previous week (e.g. 25 pairs, style# 17 size 8). The attribute monthly-style-value represents the total sales value in the previous month for each style. Draw a determinacy diagram for the attributes style#, style-description, size, weekly-sales, monthly-style-value.

Identify the candidate key for the diagram. Derive a set of well-normalised tables for the determinacy diagram, i.e. remove to a new table any determinant that is not a candidate key.

3. Every room in a building is identified by a room# and has precisely one telephone. Each telephone has its own distinct extension#. There are two types of telephone, internal dialing only (type I), and external/internal dialing (type E). Information on rooms and telephones will be held in the table:

Office (room#, number-of-occupants, telephone-extension#, telephone-type).

Making any further plausible assumptions necessary:
(a) draw a determinacy diagram for Office;
(b) write down the candidate key(s);
(c) is the corresponding table well-normalised?
(d) if not, derive a set of well-normalised tables.

4. Repeat question 3 but with the addition of the attributes employee# and employee-name. Values of employee# identify individual employees. Each employee has only one name and occupies only one room.

5. Repeat question 4 but allowing several telephones per room. All employees in a room share all the telephones in that room.

Page 7: normalization

6. Tuples in the table EMP(ENAME, PNAME, DNAME) represent the fact that the employee named ENAME works on project PNAME and has dependent DNAME. Plausible enterprise rules will result in multi-valued dependencies in this relation. State these rules and indicate the multi-valued dependencies.

Give a sample relation instance which displays redundancy. Suggest fully normalised relations for the same information and show the resulting relation instances for your sample.

7. The figure below shows an invoice from a company called Bilbo and Baggins.

(a) Draw a determinacy diagram for the attributes in the invoice. Omit both "derivable" attributes, such as amount-due, and fixed attributes, such as VAT-reg-no, which are likely to be the same for all invoices. Assume plausible enterprise rules and state your assumptions.

(b) Derive a set of well-normalised tables from your determinacy diagram.

Page 8: normalization

Normalisation

The theory of Relational Database Design

Page 9: normalization

CS3/3ICT2 Normalisation 2

Introduction

• Normalisation is a theory for designing relational schema that “make sense” and work well.

• Well-normalised tables avoid redundancy and thereby reduce inconsistencies.

• Redundancy is unnecessary duplication.

• In well-normalised DBs semantic dependencies are maintained by primary key uniqueness.

Page 10: normalization

CS3/3ICT2 Normalisation 3

Goals of Normalisation

• Eliminate certain kinds of redundancy
• Avoid certain update anomalies
• Good representation of the real world
• Simplify enforcement of DB integrity

Page 11: normalization

CS3/3ICT2 Normalisation 4

Update anomalies

• Undesirable side-effects that occur when performing insertion, modification or deletion operations on badly designed relational DBs.

SSN   Name        Dept   DeptMgr   Dept Name
987   J Smith     1      321       ...
654   M Burke     2      467       ...
333   A Dolan     1      321       ...
321   K Doyle     1      321       ...
678   O O’Neill   3      678       ...
467   R McKay     2      467       ...

Representing Department info in the Employee table causes problems.

Page 12: normalization

CS3/3ICT2 Normalisation 5

Sample anomalies

• Modification
– When the manager of a dept changes we have to change many values.
– If we are not careful the DB will contain inconsistencies.
– There is no easy way to get the DB to ensure that a department has only one manager and only one name.

Page 13: normalization

CS3/3ICT2 Normalisation 6

Anomalies continued

• Deletion
– If O O’Neill leaves we delete his tuple and lose
  • the fact that there is a department 3
  • the name of dept 3
  • who is the manager of dept 3
• Insertion
– How would we create a new department before any employees are assigned to it?
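The anomalies listed on these two slides can be reproduced on the sample employee table. The following is a minimal Python sketch, not part of the original slides: the helper names `known_departments` and `managers_of` are hypothetical, and the department names are left out because the slide elides them.

```python
# Employee table from the "Update anomalies" slide. Department facts are
# repeated on every employee row. Dept names are elided on the slide, so
# they are omitted here.
employees = [
    {"SSN": 987, "Name": "J Smith",   "Dept": 1, "DeptMgr": 321},
    {"SSN": 654, "Name": "M Burke",   "Dept": 2, "DeptMgr": 467},
    {"SSN": 333, "Name": "A Dolan",   "Dept": 1, "DeptMgr": 321},
    {"SSN": 321, "Name": "K Doyle",   "Dept": 1, "DeptMgr": 321},
    {"SSN": 678, "Name": "O O'Neill", "Dept": 3, "DeptMgr": 678},
    {"SSN": 467, "Name": "R McKay",   "Dept": 2, "DeptMgr": 467},
]


def known_departments(rows):
    """Departments the database knows about (hypothetical helper)."""
    return {row["Dept"] for row in rows}


def managers_of(rows, dept):
    """All manager values recorded for one department (hypothetical helper)."""
    return {row["DeptMgr"] for row in rows if row["Dept"] == dept}


# Deletion anomaly: removing the only employee of dept 3 also removes
# every fact about dept 3.
after_delete = [r for r in employees if r["Name"] != "O O'Neill"]

# Modification anomaly: changing the manager of dept 1 on one row only
# leaves the table with two different managers for the same department.
partially_updated = [dict(r) for r in employees]
partially_updated[0]["DeptMgr"] = 333

print(known_departments(employees))       # departments before the deletion
print(known_departments(after_delete))    # dept 3 has disappeared
print(managers_of(partially_updated, 1))  # two managers recorded for dept 1
```

Nothing in the single-table design stops the second situation from arising, which is the point of the "no easy way to get the DB to ensure" bullet.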

Page 14: normalization

CS3/3ICT2 Normalisation 7

Better design

• Separate entities are represented in separate tables.

Department table:

Dept   DeptMgr   Dept Name
1      321       ...
2      467       ...
3      678       ...

Employee table:

SSN   Name        Dept
987   J Smith     1
654   M Burke     2
333   A Dolan     1
321   K Doyle     1
678   O O’Neill   3
467   R McKay     2

Note that mapping from an ER model following the steps given will give a well-normalised DB.

Page 15: normalization

CS3/3ICT2 Normalisation 8

Boyce-Codd Normal Form

• After a lot of other approaches, Boyce and Codd noticed a simple rule for ensuring tables are well-normalised. Tables which obey the rule are in BCNF (Boyce-Codd Normal Form).

• BCNF rule: Every determinant in a table must be a candidate key for that table.

Page 16: normalization

CS3/3ICT2 Normalisation 9

Determinants

• A is a determinant of B if each value of A has precisely one (possibly null) associated value of B.

Said another way:
• A is a determinant of B if and only if whenever two tuples agree on their A value they agree on their B value.

A → B
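The "whenever two tuples agree on A they agree on B" test can be run against a table occurrence. The sketch below is an illustration only: the function name `determines` and the sample rows are assumptions (the rows reuse the employee values from the earlier slides). A single occurrence can refute a determinacy but cannot prove it, since determinacy depends on the meaning of the data.

```python
def determines(rows, a, b):
    """Return True if, in this occurrence, any two rows that agree on the
    attributes in `a` also agree on the attributes in `b`."""
    seen = {}
    for row in rows:
        a_value = tuple(row[col] for col in a)
        b_value = tuple(row[col] for col in b)
        if a_value in seen and seen[a_value] != b_value:
            return False  # two tuples agree on a but differ on b
        seen[a_value] = b_value
    return True


# Sample occurrence (assumed for illustration).
rows = [
    {"SSN": 987, "Name": "J Smith", "Dept": 1, "DeptMgr": 321},
    {"SSN": 654, "Name": "M Burke", "Dept": 2, "DeptMgr": 467},
    {"SSN": 333, "Name": "A Dolan", "Dept": 1, "DeptMgr": 321},
    {"SSN": 321, "Name": "K Doyle", "Dept": 1, "DeptMgr": 321},
]

print(determines(rows, ["SSN"], ["Name"]))      # no two rows share an SSN
print(determines(rows, ["Dept"], ["DeptMgr"]))  # every dept 1 row has manager 321
print(determines(rows, ["Dept"], ["Name"]))     # dept 1 has several names
```

Passing lists of column names means the same function also handles composite determinants such as (SSN, Project#).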

Page 17: normalization

CS3/3ICT2 Normalisation 10

Determinants

• Note that determinacy depends on the semantics of the data
– it cannot be decided from individual table occurrences.
• Alternative terminology
– if A (functionally) determines B then B is (functionally) dependent on A

Page 18: normalization

CS3/3ICT2 Normalisation 11

Example determinants

• SSN determines employee name
• SSN determines employee department
• Dept. No. determines Dept. Name
• Dept. Name determines Dept. No.
– assuming Dept. names are also unique
• Emp. Name does not determine Emp. Dept
– two John Smiths could be in different depts.
• Emp. Name does not determine SSN.

Page 19: normalization

CS3/3ICT2 Normalisation 12

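Determinacy Diagram

[Determinacy diagram over SSN, Name, Department, Dept. Name and Dept. Mgr. Following the example determinants given earlier: SSN → Name, SSN → Department, Department → Dept. Name, Department → Dept. Mgr.]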

In general key attributes of an entity determine all the single-valued attributes of the entity.

Page 20: normalization

CS3/3ICT2 Normalisation 13

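Composite Determinants

• (SSN, Project#) together determine the hours that the employee works on the project.

• Suppose packsize of a part depends on the supplier.

[Determinacy diagram 1, over SSN, Project#, hours, PName and Name: (SSN, Project#) → hours.]

[Determinacy diagram 2, over S#, P#, packsize and PName: (S#, P#) → packsize.]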

Page 21: normalization

CS3/3ICT2 Normalisation 14

Superfluous Attributes

• Superfluous attributes
– If SSN determines name, so do (SSN, Dept) and (SSN, Dept, salary), etc.
– Always remove superfluous attributes from determinants.

Page 22: normalization

CS3/3ICT2 Normalisation 15

Transitive Dependencies

• SSN actually determines DeptMgr
• but only because
– SSN determines DeptNo and
– DeptNo determines DeptMgr.
• Be careful to remove transitive dependencies.
– They mess up normalisation.

[Determinacy diagram: SSN → DeptNo → Dept. Mgr]
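Transitive chains of this kind can be spotted mechanically once the dependencies are written down as pairs. The following is a small sketch under the assumption that every determinant is a single attribute; the function name `transitive_dependencies` is hypothetical.

```python
def transitive_dependencies(fds):
    """fds is a set of (determinant, dependent) pairs over single attributes.
    Return the triples (x, y, z) where x -> y and y -> z, so that
    x -> z follows through the intermediate attribute y."""
    found = set()
    for x, y in fds:
        for y2, z in fds:
            if y == y2 and x != y and z != x and z != y:
                found.add((x, y, z))
    return found


# Dependencies named on the slides.
fds = {
    ("SSN", "Name"),
    ("SSN", "DeptNo"),
    ("DeptNo", "DeptMgr"),
}

chains = transitive_dependencies(fds)
print(chains)  # the single chain SSN -> DeptNo -> DeptMgr
```

Each reported triple is a hint that the middle attribute is a determinant which deserves a table of its own.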

Page 23: normalization

CS3/3ICT2 Normalisation 16

Candidate keys

• Candidate key = any attribute or set of attributes which will be unique for a table (set of attributes).
– As well as the primary key there may be other candidate keys.
– E.g. DNUMBER and DNAME are both candidate keys for the Department table.
• Key = row identifier
• Candidate key = candidate identifier

Page 24: normalization

CS3/3ICT2 Normalisation 17

Finding candidate keys

• Every key is by definition a determinant of all other attributes in a relation.
– So in a diagram, any attribute (or composite) from which all other attributes are reachable is a candidate key.

[Determinacy diagram over SSN, Project#, hours, PName and Name]

(SSN, Project#) is a (composite) candidate key for a table containing these five attributes.
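The "reachable from" test is the attribute closure, and candidate keys can be found by trying attribute sets from smallest to largest. The sketch below is one way to do it, not the slides' own procedure. The names `closure` and `candidate_keys` are hypothetical, and the dependencies SSN → Name and Project# → PName are an assumed reading of the diagram (only (SSN, Project#) → hours is stated in the text).

```python
from itertools import combinations


def closure(attrs, fds):
    """All attributes reachable from attrs by following the dependencies."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(attributes, fds):
    """Attribute sets that reach every attribute and contain no smaller key."""
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(sorted(attributes), size):
            if any(set(key) <= set(combo) for key in keys):
                continue  # contains a key already, so it has superfluous attributes
            if closure(combo, fds) == set(attributes):
                keys.append(combo)
    return keys


attributes = {"SSN", "Project#", "hours", "PName", "Name"}
fds = [
    (("SSN", "Project#"), ("hours",)),  # stated on the slide
    (("SSN",), ("Name",)),              # assumed reading of the diagram
    (("Project#",), ("PName",)),        # assumed reading of the diagram
]

keys = candidate_keys(attributes, fds)
print(keys)  # the only key found is the pair Project#, SSN
```

Trying every subset is exponential in the number of attributes, which is acceptable for tutorial-sized diagrams only.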

Page 25: normalization

CS3/3ICT2 Normalisation 18

What are the candidate keys ?

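[Figure: several practice determinacy diagrams. The first is over student, subject and teacher; the second over V, W, X, Y, Z, A, B and C; the remaining diagrams use attributes labelled B through U. The arrows are not recoverable from this transcript.]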

Page 26: normalization

CS3/3ICT2 Normalisation 19

Problems occur when ...

• Redundancy and anomalies occur when there are determinants which are not candidate keys.

[Determinacy diagram over SSN, Name, DeptNo, Dept. Name and Dept. Mgr]

• SSN is the only key for a table containing these attributes
– all attributes are reachable from SSN.

• SSN, DeptNo and DeptName are determinants
– they have arrows coming out of them.

Page 27: normalization

CS3/3ICT2 Normalisation 20

BCNF rule

• In well-normalised relations (Boyce-Codd normal form) every determinant is a candidate key.

[Determinacy diagrams for the two decomposed tables: (SSN, Name, DeptNo) and (DeptNo, Dept. Name, Dept. Mgr)]

The employee/dept table decomposed to BCNF.

Note that both DeptNo and DeptName are candidate keys of the second table.
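The BCNF rule can be checked by asking, for each determinant inside a table, whether every attribute of that table is reachable from it. The following sketch rests on stated assumptions: the function names are hypothetical, the dependency list is read off the slides (with DeptName → DeptNo relying on the slides' assumption that department names are unique), and only the listed dependencies are examined rather than every dependency they imply.

```python
def closure(attrs, fds):
    """All attributes reachable from attrs by following the dependencies."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(table_attrs, fds):
    """Determinants acting inside the table that are not keys of the table."""
    table = set(table_attrs)
    violations = set()
    for lhs, rhs in fds:
        inside = set(lhs) <= table and (set(rhs) & table) - set(lhs)
        if inside and not table <= closure(lhs, fds):
            violations.add(lhs)
    return sorted(violations)


fds = [
    (("SSN",), ("Name", "DeptNo")),
    (("DeptNo",), ("DeptName", "DeptMgr")),
    (("DeptName",), ("DeptNo",)),  # assumes department names are unique
]

original = ["SSN", "Name", "DeptNo", "DeptName", "DeptMgr"]
employee = ["SSN", "Name", "DeptNo"]
department = ["DeptNo", "DeptName", "DeptMgr"]

print(bcnf_violations(original, fds))    # DeptName and DeptNo are non-key determinants
print(bcnf_violations(employee, fds))    # no violations
print(bcnf_violations(department, fds))  # no violations
```

In the decomposed design both DeptNo and DeptName reach every attribute of the department table, which matches the note that both are candidate keys there.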

Page 28: normalization

CS3/3ICT2 Normalisation 21

Transformation to BCNF

• Create new tables such that each non-key determinant is a candidate key in a new table.

• The new table contains the attributes which are directly determined by the new candidate key.

[Determinacy diagram over V, W, X, Y, Z, A, B and C, followed by the diagrams of its decomposition; the arrows are not recoverable from this transcript.]

BCNF tables: (V, X), (A, B, C), (V, W, Z, A), (V, W, Y)

Page 29: normalization

CS3/3ICT2 Normalisation 22

Other Normal Forms

• First NF - no multi-valued attributes
– all relational DBs are 1NF

• 2NF - every non-key attribute is fully dependent on the primary key

• 3NF - eliminate functional dependencies between non-key attributes
– all dependencies can then be enforced by uniqueness of keys.

[Determinacy diagram over G, H and J: table is in 2NF but not 3NF]

Page 30: normalization

CS3/3ICT2 Normalisation 23

BCNF vs. 3NF

• BCNF goes further than 3NF, some say too far.
• A 3NF table that has no overlapping composite keys is in BCNF.

[Diagram: one table over student, subject and teacher]
3NF, not BCNF
keys: (student, subject), (student, teacher)
teacher is a determinant

[Diagram: two tables, (student, teacher) and (subject, teacher)]
BCNF, but tables are not independent

A teacher teaches only one subject. For a given subject a given student has only one teacher.

Page 31: normalization

CS3/3ICT2 Normalisation 24

4NF : Multi-valued dependencies

• If a course can have multiple teachers and multiple texts, blind mapping to 1NF will give

Subject   Teacher   Text
Physics   Green     Basic Mechanics
Physics   Brown     Basic Mechanics
Physics   Green     Principles of Optics
Physics   Brown     Principles of Optics
Maths     Green     Basic Mechanics
Maths     Green     Vector Analysis
Maths     Green     Trigonometry

which clearly has redundancy.
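The redundancy can be made concrete by splitting the table into its two independent facts and joining them back. The slide itself does not show this step. The sketch below uses the usual remedy for a multi-valued dependency, with table names `subject_teacher` and `subject_text` chosen for illustration.

```python
# The (Subject, Teacher, Text) occurrence from the slide.
rows = {
    ("Physics", "Green", "Basic Mechanics"),
    ("Physics", "Brown", "Basic Mechanics"),
    ("Physics", "Green", "Principles of Optics"),
    ("Physics", "Brown", "Principles of Optics"),
    ("Maths", "Green", "Basic Mechanics"),
    ("Maths", "Green", "Vector Analysis"),
    ("Maths", "Green", "Trigonometry"),
}

# Project onto the two independent facts about a subject.
subject_teacher = {(subject, teacher) for subject, teacher, _ in rows}
subject_text = {(subject, text) for subject, _, text in rows}

# Natural join of the two projections on Subject.
rejoined = {
    (subject, teacher, text)
    for subject, teacher in subject_teacher
    for subject2, text in subject_text
    if subject == subject2
}

print(len(rows), len(subject_teacher), len(subject_text))
print(rejoined == rows)  # the join gives back exactly the original rows
```

Because the join reproduces the original occurrence, the two small tables carry the same information, and each teacher and each text is recorded once per subject.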

Page 32: normalization

CS3/3ICT2 Normalisation 25

Fully-normalised

• BCNF relations are well-normalised
• Fully-normalised relations are those with no multi-valued dependencies (4NF) and no join dependencies (5NF).

Page 33: normalization

3ICT2 Additional Normalisation Tutorial (Sample Exam Questions)

1. A database for a multi-branch bank is to record the following attributes:

{account#, customer-name, customer-address, branch#, branch-address, credit-code, credit-limit}.

An account# is unique within a branch. A customer may have many accounts at one or many branches of the bank. Joint accounts are not supported. Assume that customers can be identified by name.

Each account is assigned a credit code. The code determines the credit limit. For example, accounts with code B have a credit limit of £500.

(a) Draw a determinacy diagram for the seven attributes. Note any assumptions that you make.

(b) Derive a set of well-normalised (BCNF) relations from your determinacy diagram. Underline the primary key for each relation.

(c) Write down a sequence of relational algebra operations to get the names and addresses of all customers whose credit limit is less than 100.

2. An employee database is to hold information about employees, the department they are in and the skills which they hold. The attributes to be stored are emp-id, emp-name, emp-phone, dept-name, dept-phone, dept-mgrid, skill-id, skill-name, skill-date, skill-level.

An employee may have many skills, such as word-processing, typing, librarian, filing, ... The date on which the skill was last tested and the level displayed at that test are recorded for the purposes of assigning work and determining salary. An employee is attached to one department and each department has a unique manager.

(a) Draw a dependency diagram for the above database, stating clearly any assumptions that you make.

(b) Derive a set of well-normalised (BCNF) relations, indicating the primary key of each relation.

(c) Write down a sequence of relational algebra operations to get the names and phone numbers of employees who can both file and type to a level of 6 or more.

Page 34: normalization

More Relational Algebra and SQL exercises

1. Consider the following relational database schema. It is intended to represent who will eat what kinds of sandwiches and the places which serve the various kinds of sandwiches. A sample database instance is also given.

TASTES (Name, Filling)
SANDWICHES (Location, Bread, Filling, Price)
LOCATIONS (LName, Phone, Address)

LOCATIONS
LNAME       PHONE      ADDRESS
Lincoln     683 4523   Lincoln Place
O'Neill's   674 2134   Pearse St
Old Nag     767 8132   Dame St
Buttery     702 3421   College St

TASTES
NAME    FILLING
Brown   Turkey
Brown   Beef
Brown   Ham
Jones   Cheese
Green   Beef
Green   Turkey
Green   Cheese

SANDWICHES
LOCATION    BREAD   FILLING   PRICE
Lincoln     Rye     Ham       1.25
O'Neill's   White   Cheese    1.20
O'Neill's   Whole   Ham       1.25
Old Nag     Rye     Beef      1.35
Buttery     White   Cheese    1.00
O'Neill's   White   Turkey    1.35
Buttery     White   Ham       1.10
Lincoln     Rye     Beef      1.35
Lincoln     White   Ham       1.30
Old Nag     Rye     Ham       1.40

(a) Give a series of relational algebra operations (select, project, join, difference, division, ...) to produce the following four relations:

(i) Cheap_Places is the set of locations that do not have any sandwiches costing more than 1.30.
(ii) Can_Eat is a set of (name, location) tuples indicating who can eat where (i.e. in what locations are there fillings which that person likes).
(iii) Jones_Places is the set of locations where Jones can eat, along with their phone numbers.
(iv) All_Eat is the set of locations, if any, where everyone mentioned in the tastes relation can eat together. (Use the division operator.)

(b) Write SQL statements to retrieve the following information:

(i) places where Jones can eat (using a nested subquery).
(ii) places where Jones can eat (without using a nested subquery).
(iii) for each location the number of people who can eat there.

Page 35: normalization

2. Consider the following relational database schema. It is intended to represent the holdings of a multi-branch library. A sample database instance is also given.

Branch (BCode, Librarian, Address)
Titles (Title, Author, Publisher)
Holdings (Branch, Title, #copies)

Branch
BCode   Librarian       Address
B1      John Smith      2 Anglesea Rd
B2      Mary Jones      34 Pearse St
B3      Francis Owens   Grange X

Titles
Title                 Author          Publisher
Susannah              Ann Brown       Macmillan
How to Fish           Amy Fly         Stop Press
A History of Dublin   David Little    Wiley
Computers             Blaise Pascal   Applewoods
The Wife              Ann Brown       Macmillan

Holdings
Branch   Title       #copies
B1       Susannah    3
B1       How to ..   2
B1       A hist ..   1
B2       How to ..   4
B2       Computers   2
B2       The Wife    3
B3       A hist ..   1
B3       Computers   4
B3       Susannah    3
B3       The Wife    1

(a) Give a series of relational algebra operations (select, project, join, difference, division, ...) to produce the following four relations:

(i) Librarians is a list of the names of the branch librarians.
(ii) Brown_Branches is the set of branches which have holdings of books by Ann Brown.
(iii) No_Browns is the set of branches which have no holdings of any books by Ann Brown. (Use the Brown_Branches relation from (ii).)
(iv) In_All_Branches is the set of book titles, if any, along with their authors, which are each held at all branches. (Use the division operator.)

(b) Write SQL statements to retrieve the following information:

(i) the names of all library books published by Macmillan.
(ii) branches that hold any books by Ann Brown (using a nested subquery).
(iii) branches that hold any books by Ann Brown (without using a nested subquery).
(iv) the total number of books held at each branch.

Page 36: normalization

3. A Library Database is to contain information about published papers in a number of selected subject areas; each paper is classified under any number of different subject areas. The following data is to be stored:

subject classification number   name of subject
S1                              relational databases
S2                              theory
S3                              data models
S4                              security
S5                              distributed databases

paper number: P1
subject areas: S1, S2, S3
title: A relational model for large shared data banks
author: Codd, EF
journal: CACM
volume: 13
number: 6

paper number: P2
subject areas: S4
title: Data Security
author: Denning, D
journal: ACM Computing Surveys
volume: 11
number: 3

paper number: P3
subject areas: S1, S2, S3
title: Formal aspects of the relational model
author: Furtado, A
journal: Information Systems
volume: 3
number: 2

paper number: P4
subject areas: S5
title: Distributed deadlock detection algorithm
author: Obermarck, R
journal: TODS
volume: 7
number: 2

(a) Show how this data could be represented in a relational database.

(b) Draw a database schema for the relational library database, indicating primary and foreign keys.

(c) Give a series of relational algebra operations (select, project, join, difference, union, intersection, division, ...) on your relational database schema to produce the following three relations:

(i) CACM13 is a list of titles and authors of papers from volume 13 of the CACM journal.
(ii) SECURITY is a list of the title, author, and paper identification number of papers whose subject is "security". (NB You must not use the fact that you know that the code for "security" is S4.)
(iii) THE&SEC is a list of paper identification numbers for papers which are about both theory and security.

(d) Provide SQL statements on your relational database schema for the following queries:

(i) the titles and authors of all papers from volume 13 of the CACM journal.
(ii) the title, author, and paper identification number of papers whose subject is "security". (NB You must not use the fact that you know that the code for "security" is S4.)
(iii) for each subject area, the subject name and the total number of papers in the database which are on that subject.

Page 37: normalization

http://www.dcs.kcl.ac.uk/teaching/units/1999/cs02db/pdf/lecture-notes-01.pdf http://www.dcs.kcl.ac.uk/teaching/units/1999/cs02db/ http://www.dcs.kcl.ac.uk/teaching/units/1999/cs02db/index.html#slides

