CSC 240 (Blum)1 Normalization. CSC 240 (Blum)2 Normalization Normalization is a procedure within...

CSC 240 (Blum)1 Normalization CSC 240 (Blum)2 Normalization Normalization is a procedure within relational database design to ensure that 1.Each unit of data (value that a property takes on) is accessible to querying 2.Relationships are properly modeled 3.Data redundancy (unnecessary repetition) is minimized. CSC 240 (Blum)3 Complementary Approaches ER modeling is more of a top-down approach to database design. One starts with an overview of the most generic categories of data. Normalization is more of a bottom-up approach. One starts with a collection of attributes and sees how they break into various tables. CSC 240 (Blum)4 Unique Data Specification A piece of data should be identifiable by specifying first its Table, then its column and its row within that table No further specification should be required. CSC 240 (Blum)5 The data Bart is found in the Character Table in the FirstName Column in the 78th row (better yet in the row with CharacterID SC0003). CSC 240 (Blum)6 Rule 1: No multi-valued fields A cell in the table should hold only one unit of data. If a cell contained more than one unit of data, there would be no way to specify one of these units within the relational (table) approach. (Multi-valued fields can also cause storage problems, which are hidden from us when we work at the 4GL level.) CSC 240 (Blum)7 personIDfNamelNamecharacters SP0004DanCastellenataHomer Simpson Krusty the Clown Grampa Simpson Barney Gumble Itchy (the mouse) Mayor Quimby Groundskeeper Willy Scott Christian Hans Moleman Arnie Pie Sideshow Mel No multi-valued fields! CSC 240 (Blum)8 Flattening the file (table) The easy way to eliminate multi-valued fields is to introduce multiple records, one for each value of the multi-valued field All of the other fields would be the same. This is called flattening. CSC 240 (Blum)9 personIDfNamelNamecharacter SP0004DanCastellenataHomer Simpson SP0004DanCastellenataKrusty the Clown SP0004DanCastellenataGrampa Simpson SP0004DanCastellenataBarney Gumble SP0004DanCastellenataItchy (the mouse) SP0004DanCastellenataMayor Quimby SP0004DanCastellenataGroundskeeper Willy SP0004DanCastellenataScott Christian CSC 240 (Blum)10 First Normal Form If all of the tables in a database are set up such that each entry (cell) of each table contains at most one piece of data (i.e. is single-valued), then the database is said to be in the First Normal Form (1NF). There is a series of criteria to apply, each more stringent than the next. There will be a second normal form, a third normal form and so on. CSC 240 (Blum)11 What has been gained? The advantage of flattening is that each piece of data is now accessible without further parsing. Barney Gumble is no longer part of an entry but is its own entry. CSC 240 (Blum)12 What has been lost? The obvious price of flattening a file is redundancy (repetition of data). In addition to the storage issues that repeated data causes (which are hidden from us), there are the so-called data anomalies. These make maintaining the datas integrity (correctness) more difficult. CSC 240 (Blum)13 Whats the key? Recall the key is used to identify (uniquely) a record and is important to querying. Note that personID was the key to the table that had multi-valued fields. But in this new table personID alone cannot be a key. CSC 240 (Blum)14 Key consideration The character field can be the key if we do not include the producers (who do not voice characters). Or we could have a composite key. One could introduce a noCharacter character and a noActor person to solve the NULL problem. Note that no part of a composite key should be NULL. CSC 240 (Blum)15 Insertion Anomaly Ignoring the issue of Maggie (who no one voices) and the producers (who do not voice characters) theres another problem. Since the character and real person information are both needed: We cannot add a character without knowing the real person who provides the voice. We cannot add a person without knowing at least one character he/she voices. This type of problem is known as an insertion anomaly. CSC 240 (Blum)16 Deletion Anomaly The way the data is arranged now, if Dan Castellenata is fired and eliminated from the database, so too are all of the characters he voices eliminated. This feature is sometimes desired and can be built in. When this feature is not desired, but occurs because of the database design, it is known as a deletion anomaly. CSC 240 (Blum)17 Update Anomaly Isnt his name Castellaneta, not Castellenata? Because of all of the repetition caused by the flattening, the name must be changed in many places. One either does a search to find all occurrences. Or one runs the risk of having inconsistent data and consequently queries that dont turn up complete and accurate information. Inaccurate data caused by the inconsistent updating of redundant information is known as an update anomaly. Doh CSC 240 (Blum)18 personIDfNamelNamecharacter SP0004DanCastellanetaHomer Simpson SP0004DanCastellanetaKrusty the Clown SP0004DanCastellanetaGrampa Simpson SP0004DanCastellanataBarney Gumble SP0004DanCastellenetaItchy (the mouse) SP0004DanCastellanetaMayor Quimby SP0004DanCastellanetaGroundskeeper Willy SP0004DanCastellanetaScott Christian CSC 240 (Blum)19 More Tables, Less Redundancy One can eliminate the data redundancy by introducing more tables. This will simplify updating but can complicate querying. The repeated data will be moved to its own table and in its place will be inserted a foreign key to identify the appropriate row of the new table. CSC 240 (Blum)20 Splitting The table PeopleVoicingCharacters(personID, characterID, personFName, personLName, regular,characterFName, characterLName, description) becomes two tables RealPerson(personID, fName,lName, regular) Character(characterID, fName, lName, personID, description) CSC 240 (Blum)21 Repetition cannot be completely eliminated When PeopleVoicingCharacters became RealPerson and Character, the personID field appeared in both tables. This repetition is required to maintain the relationship between a character and a real person. The decomposition minimized (but did not eliminate) the repetition. Another feature of the split is that we can add unvoiced characters to the Character table and people who are not actors to the RealPerson table. CSC 240 (Blum)22 Important Properties Two important properties of decomposition: One should be able to put the original table back together by using a join, thus no information is lost by the split. This is called the Lossless-join property. Any constraint on the data imposed on the original table should be imposed on the resulting tables. This is called the Dependency preservation property. CSC 240 (Blum)23 How to do the splits? We would like to have a systematic approach for splitting tables in a way that minimizing redundancy. This is where the notion of functional dependence comes in. How the fields depend on each other determines where they end up when the table splits. CSC 240 (Blum)24 Functional dependence If for all of the records in a table, one can determine what value field B takes on simply by knowing what value field A has, then field B is said to be functionally dependent on field A. Either of the fields above may be composite, that is, comprised of other fields (e.g. name is a composite field comprised of fName and lName fields). CSC 240 (Blum)25 Derived attributes are functionally dependent We called an attribute derived if it was determined by another attribute. E.g. age or ageCategory can be derived from dateOfBirth Thus age is functionally dependent on dateOfBirth. Functional dependence is not necessarily a two- way street You can determine ageCategory from dateOfBirth but you cannot determine dateOfBirth from ageCategory. CSC 240 (Blum)26 Other functional dependencies A dateOfBirth is functionally dependent on a socSecNum. Theres no mathematical formula relating the two. But the socSecNum identifies a person, and that person as a unique dateOfBirth. Knowing the socSecNum, one can determine the dateOfBirth. CSC 240 (Blum)27 All fields depend on the primary key. Recall that a tables primary key is a field or fields that can be used to identify uniquely each record belonging to that table. If a field can be uniquely identified by the primary key then it is functionally dependent on the primary key. CSC 240 (Blum)28 Composite primary keys When the primary key is composite, the question arises: Is a field uniquely determined by (functionally dependent on) part of the primary key or the whole primary key? Data that is dependent on part of a primary key can be split off into another table. Known as partial dependencies. CSC 240 (Blum)29 Splitting off fields that depend on part of a key. Imagine a MovieDirector table MovieDirector(movieID, directorID, movieTitle, movieType, movieDescription, directorFirstName, directorLastName, directorDateOfBirth) It would have a composite primary key comprised of movieID and directorID. CSC 240 (Blum)30 Depending on parts The movieTitle, movieType, movieDescription fields depend only on the movieID portion of the primary key. The directorFirstName, directorLastName and directorDateOfBirth fields depend only on the directorID portion of the primary key. CSC 240 (Blum)31 MovieDirector: Three-Way Split We split off any fields that depend only on movieID. Partial dependence (dependent on a part) We split off any fields that depend only on directorID. Partial dependence We are left with any fields that depend on both. Full dependence CSC 240 (Blum)32 MovieDirector: Three-Way Split One gets: Movie(movieID, movieTitle, movieType, movieDescription) Director(directorID, directorFirstName, directorLastName, directorDateOfBirth) MovieDirector2(movieID, directorID) (If you assume that a movie has one director, then the directorID would be functionally dependent on the movieID, and you would only need two tables above.) CSC 240 (Blum)33 CSC 240 (Blum)34 Two or Three-Way Split In the Simpsons example, the PeopleVoicingCharacters became two tables, while in film example, the MovieDirector table became three tables. The difference is that personID was functionally dependent on the characterID, whereas there was no dependency between movieId and directorID. This agrees with the notion that many-to-many relationships between entities require their own tables. CSC 240 (Blum)35 Second Normal Form If a table has no multi-valued fields AND does not have any non-key fields which are functionally dependent on only part of the primary key, then such a table is said to be in the second normal form (2NF). The higher normal forms will be presented next time. CSC 240 (Blum)36 References Database Systems, Rob and Coronel Database Systems, Connolly and Begg Fundamentals of Relational Databases, Mata-Toledo and Cushman Concepts of Database Management, Pratt and Adamski

Date post:	17-Jan-2018
Category:	Documents
Upload:	colin-townsend
View:	231 times
Download:	0 times

CSC 240 (Blum)1 Normalization. CSC 240 (Blum)2 Normalization Normalization is a procedure within...

Documents