DBMS Caracterization

    Databases

    Ken Moody

    Computer Laboratory, University of Cambridge, UK

    Lecture notes by Timothy G. Griffin

    Lent 2012

    Ken Moody (cl.cam.ac.uk) Databases DB 2012 1 / 175

    Lecture 01 : What is a DBMS?

    DB vs. IR

    Relational Databases

    ACID properties

    Two fundamental trade-offs

    OLTP vs. OLAP

    Course outline

    Example Database Management Systems (DBMSs)

    A few database examples

    Banking : supporting customer accounts, deposits and withdrawals

    University : students, past and present, marks, academic status

    Business : products, sales, suppliers

    Real Estate : properties, leases, owners, renters

    Aviation : flights, seat reservations, passenger info, prices, payments

    Aviation : aircraft, maintenance history, parts suppliers, parts orders

    Some observations about these DBMSs ...

    They contain highly structured data that has been engineered to model some restricted aspect of the real world

    They support the activity of an organization in an essential way

    They support concurrent access, both read and write

    They often outlive their designers

    Users need to know very little about the DBMS technology used

    Well-designed database systems are nearly transparent, just part of our infrastructure

    Databases vs Information Retrieval

    Always ask: What problem am I solving?

    DBMS                                IR system
    exact query results                 fuzzy query results
    optimized for concurrent updates    optimized for concurrent reads
    data models a narrow domain         domain often open-ended
    generates documents (reports)       search existing documents
    increase control over information   reduce information overload

    And of course there are many systems that combine elements of DB and IR.

    Still the dominant approach : Relational DBMSs

    The problem : in 1970 you could not write a database application without knowing a great deal about the low-level physical implementation of the data.

    Codd's radical idea [C1970]: give users a model of data and a language for manipulating that data which is completely independent of the details of its physical representation/implementation.

    This decouples development of Database Management Systems (DBMSs) from the development of database applications (at least in an idealized world).

    This is the kind of abstraction at the heart of Computer Science!

    What services do applications expect from a DBMS?

    Transactions : ACID properties (Concurrent Systems course)

    Atomicity : Either all actions are carried out, or none are
        logs needed to undo operations, if needed

    Consistency : If each transaction is consistent, and the database is initially consistent, then it is left consistent
        Application designers must exploit the DBMS's capabilities.

    Isolation : Transactions are isolated, or protected, from the effects of other scheduled transactions
        Serializability, 2-phase commit protocol

    Durability : If a transaction completes successfully, then its effects persist
        Logging and crash recovery

    These concepts should be familiar from Concurrent Systems and Applications.
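Atomicity can be seen in a few lines. The following is a minimal sketch, assuming Python's standard sqlite3 module rather than the MySQL used elsewhere in these notes; the Accounts table, the account names and the transfer helper are hypothetical and only serve the illustration. The credit is applied first, so that a failing debit leaves partial work behind that the DBMS has to undo.

```python
import sqlite3

# Hypothetical schema: balances may never go negative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table Accounts (id text primary key, "
    "balance int check (balance >= 0))"
)
conn.executemany("insert into Accounts values (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Credit dst, then debit src, as one transaction. True if committed."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "update Accounts set balance = balance + ? where id = ?",
                (amount, dst))
            conn.execute(
                "update Accounts set balance = balance - ? where id = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The debit violated the check constraint; the credit was undone too.
        return False


def balances(conn):
    """Current balances as a dict from account id to amount."""
    return dict(conn.execute("select id, balance from Accounts"))
```

Calling transfer(conn, "alice", "bob", 500) should fail and leave both balances exactly as they were: either all actions are carried out, or none are.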

    What constitutes a good DBMS application design?

    [Diagram: a square of four nodes. Top row: Domain of Interest, then Domain of Interest after a "real-world change". Bottom row: Database, then Database after the corresponding "database update(s)". Each Database is linked to its Domain of Interest by an arrow labelled "represent".]

    At the very least, this diagram should commute!

    Does your database design support all required changes?

    Can an update corrupt the database?

    Relational Database Design

    Our tools

    Entity-Relationship (ER) modeling : high-level, diagram-based design

    Relational modeling : formal model, normal forms based on Functional Dependencies (FDs)

    SQL implementation : where the rubber meets the road

    The ER and FD approaches are complementary

    ER facilitates design by allowing communication with domain experts who may know little about database technology.

    FD allows us to formally explore general design trade-offs, such as A Fundamental Trade-off in Database Design: the more we reduce data redundancy, the harder it is to enforce some types of data integrity. (An example of this is made precise when we look at 3NF vs. BCNF.)

    ER Demo Diagram (Notation follows SKS book) [1]

    [Figure: ER diagram. Labels, in the order they appear: Employee (Name, Number); ISA; Mechanic; Salesman; Does; RepairJob (Number, Description, Cost, Parts, Work); Repairs; Car (License, Model, Year, Manufacturer); Buys (Price, Date, Value); Sells (Date, Value, Commission); Client (ID, Name, Phone, Address); roles buyer and seller.]

    [1] By Pável Calado, http://www.texample.net/tikz/examples/entity-relationship-diagram

    A Fundamental Trade-off in Database Implementation : Query response vs. update throughput

    Redundancy is a Bad Thing.

    One of the main goals of ER and FD modeling is to reduce data redundancy. We seek normalized designs.

    A normalized database can support high update throughput and greatly facilitates the task of ensuring semantic consistency and data integrity.

    Update throughput is increased because in a normalized database a typical transaction need only lock a few data items, perhaps just one field of one row in a very large table.

    Redundancy is a Good Thing.

    A de-normalized database can greatly improve the response time of read-only queries.

    Selective and controlled de-normalization is often required in operational systems.

    OLAP vs. OLTP

    OLTP : Online Transaction Processing

    OLAP : Online Analytical Processing
        Commonly associated with terms like Decision Support, Data Warehousing, etc.

                      OLAP               OLTP
    Supports          analysis           day-to-day operations
    Data is           historical         current
    Transactions      mostly reads       updates
    optimized for     query processing   updates
    Normal Forms      not important      important

    Example : Data Warehouse (Decision support)

    [Figure: an Operational Database (fast updates) and a Data Warehouse (business analysis queries), connected by an Extract step.]

    Example : Embedded databases

    FIDO = Fetch Intensive Data Organization


    Example : Hinxton Bio-informatics


    NoSQL Movement

    Technologies
        Key-value store
        Directed Graph Databases
        Main memory stores
        Distributed hash tables

    Applications
        Facebook
        Google
        IMDb
        ...

    Always remember to ask : What problem am I solving?

    Term Outline

    Lecture 02 The relational data model.

    Lecture 03 Entity-Relationship (E/R) modelling

    Lecture 04 Relational algebra and relational calculus

    Lecture 05 SQL

    Lecture 06 Case Study - Cancer registry for the NHS - challenges

    Lecture 07 Schema refinement I

    Lecture 08 Schema refinement II

    Lecture 09 Schema refinement III and advanced design

    Lecture 10 On-line Analytical Processing (OLAP)

    Lecture 11 Case Study - Cancer registry for the NHS - experiences

    Lecture 12 XML as a data exchange format

    Recommended Reading

    Textbooks

    SKS Silberschatz, A., Korth, H.F. and Sudarshan, S. (2002). Database system concepts. McGraw-Hill (4th edition).
        (Adjust accordingly for other editions)
        Chapters 1 (DBMSs)
                 2 (Entity-Relationship Model)
                 3 (Relational Model)
                 4.1 to 4.7 (basic SQL)
                 6.1 to 6.4 (integrity constraints)
                 7 (functional dependencies and normal forms)
                 22 (OLAP)

    UW Ullman, J. and Widom, J. (1997). A first course in database systems. Prentice Hall.

    CJD Date, C.J. (2004). An introduction to database systems. Addison-Wesley (8th ed.).

    Reading for the fun of it ...

    Research Papers (Google for them)

    C1970 E.F. Codd (1970). "A Relational Model of Data for Large Shared Data Banks". Communications of the ACM.

    F1977 Ronald Fagin (1977). "Multivalued dependencies and a new normal form for relational databases". TODS 2 (3).

    L2003 L. Libkin. "Expressive power of SQL". TCS, 296 (2003).

    C+1996 L. Colby et al. "Algorithms for deferred view maintenance". SIGMOD 199.

    G+1997 J. Gray et al. (1997). "Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals". Data Mining and Knowledge Discovery.

    H2001 A. Halevy. "Answering queries using views: A survey". VLDB Journal. December 2001.

    Lecture 02 : The relational data model

    Mathematical relations and relational schema

    Using SQL to implement a relational schema

    Keys

    Database query languages

    The Relational Algebra

    The Relational Calculi (tuple and domain)

    a bit of SQL


    Let's start with mathematical relations

    Suppose that $S_1$ and $S_2$ are sets. The Cartesian product, $S_1 \times S_2$, is the set

    $$S_1 \times S_2 = \{(s_1, s_2) \mid s_1 \in S_1,\ s_2 \in S_2\}$$

    A (binary) relation over $S_1 \times S_2$ is any set $r$ with $r \subseteq S_1 \times S_2$.

    In a similar way, if we have $n$ sets, $S_1, S_2, \ldots, S_n$, then an $n$-ary relation $r$ is a set

    $$r \subseteq S_1 \times S_2 \times \cdots \times S_n = \{(s_1, s_2, \ldots, s_n) \mid s_i \in S_i\}$$
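The definitions of Cartesian product and relation translate almost directly into set code. This is a small sketch in Python; the concrete sets S1 and S2 and the helper is_relation are made up for illustration.

```python
from itertools import product

# Two small example domains (hypothetical values).
S1 = {1, 2, 3}
S2 = {"a", "b"}

# The Cartesian product S1 x S2: every pair (s1, s2).
cartesian = set(product(S1, S2))

# A binary relation over S1 x S2 is any subset of that product.
r = {(1, "a"), (3, "b")}


def is_relation(r, *domains):
    """True if every tuple of r takes its i-th component from the i-th domain."""
    return all(
        len(t) == len(domains) and all(x in d for x, d in zip(t, domains))
        for t in r
    )
```

The same is_relation check works for n-ary relations by passing n domains.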

    Relational Schema

    Let $X$ be a set of $k$ attribute names.

    We will often ignore domains (types) and say that $R(X)$ denotes a relational schema.

    When we write $R(Z, Y)$ we mean $R(Z \cup Y)$ and $Z \cap Y = \emptyset$.

    $u.[X] = v.[X]$ abbreviates $u.A_1 = v.A_1 \wedge \cdots \wedge u.A_k = v.A_k$.

    $X$ represents some (unspecified) ordering of the attribute names, $A_1, A_2, \ldots, A_k$.

    Mathematical vs. database relations

    Suppose we have an $n$-tuple $t \in S_1 \times S_2 \times \cdots \times S_n$. Extracting the $i$-th component of $t$, say as $\pi_i(t)$, feels a bit low-level.

    Solution: (1) Associate a name, $A_i$ (called an attribute name), with each domain $S_i$. (2) Instead of tuples, use records: sets of pairs, each associating an attribute name $A_i$ with a value in domain $S_i$.

    A database relation $R$ over the schema $A_1 : S_1 \times A_2 : S_2 \times \cdots \times A_n : S_n$ is a finite set

    $$R \subseteq \{\{(A_1, s_1), (A_2, s_2), \ldots, (A_n, s_n)\} \mid s_i \in S_i\}$$

    Example

    A relational schema

    Students(name: string, sid: string, age: integer)

    A relational instance of this schema

    Students = { {(name, Fatima), (sid, fm21), (age, 20)},
                 {(name, Eva), (sid, ev77), (age, 18)},
                 {(name, James), (sid, jj25), (age, 19)} }

    A tabular presentation

    name     sid    age
    Fatima   fm21   20
    Eva      ev77   18
    James    jj25   19
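The "records as sets of pairs" view can be written down literally. In this sketch the helper names record and get are assumptions for illustration; a record is a frozenset of (attribute name, value) pairs and a relation instance is a set of such records.

```python
def record(**fields):
    """A record is a set of (attribute name, value) pairs."""
    return frozenset(fields.items())


Students = {
    record(name="Fatima", sid="fm21", age=20),
    record(name="Eva", sid="ev77", age=18),
    record(name="James", sid="jj25", age=19),
}

# Attribute order carries no meaning: this spelling denotes the same record.
same = record(sid="ev77", age=18, name="Eva") in Students


def get(rec, attr):
    """Look up a value by attribute name rather than by position."""
    return dict(rec)[attr]
```

Because records are accessed by name, reordering the columns of the tabular presentation does not change the relation.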

    Key Concepts

    Relational Key

    Suppose $R(X)$ is a relational schema with $Z \subseteq X$. If for any records $u$ and $v$ in any instance of $R$ we have

    $$u.[Z] = v.[Z] \implies u.[X] = v.[X],$$

    then $Z$ is a superkey for $R$. If no proper subset of $Z$ is a superkey, then $Z$ is a key for $R$. We write $R(\underline{Z}, Y)$ to indicate that $Z$ is a key for $R(Z \cup Y)$.

    Note that this is a semantic assertion, and that a relation can have multiple keys.
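Since a key is a semantic assertion about every possible instance, a single instance can only refute a proposed superkey, never confirm it. The sketch below makes that concrete; the helper names and the extra student are hypothetical.

```python
def record(**fields):
    """A record is a set of (attribute name, value) pairs."""
    return frozenset(fields.items())


def restrict(rec, attrs):
    """u.[Z]: the part of record u that lies on the attributes in Z."""
    return frozenset((a, v) for a, v in rec if a in attrs)


def refutes_superkey(instance, Z):
    """True if two distinct records of the instance agree on all of Z."""
    seen = set()
    for rec in instance:
        z_part = restrict(rec, Z)
        if z_part in seen:
            return True
        seen.add(z_part)
    return False


Students = {
    record(name="Fatima", sid="fm21", age=20),
    record(name="Eva", sid="ev77", age=18),
    record(name="James", sid="jj25", age=19),
}

# A later instance with one more (made-up) student of age 20.
bigger = Students | {record(name="Flavia", sid="fl99", age=20)}
```

On the three-student instance neither {sid} nor {age} is refuted, yet only {sid} is intended as a key; the larger instance shows why {age} cannot be one.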

    Creating Tables in SQL

    create table Students
           (sid varchar(10),
            name varchar(50),
            age int);

    -- insert record with attribute names
    insert into Students set
           name = 'Fatima', age = 20, sid = 'fm21';

    -- or insert records with values in same order
    -- as in create table
    insert into Students values
           ('jj25', 'James', 19),
           ('ev77', 'Eva', 18);

    Listing a Table in SQL

    -- list by attribute order of create table
    mysql> select * from Students;
    +------+--------+------+
    | sid  | name   | age  |
    +------+--------+------+
    | ev77 | Eva    |   18 |
    | fm21 | Fatima |   20 |
    | jj25 | James  |   19 |
    +------+--------+------+
    3 rows in set (0.00 sec)

    Listing a Table in SQL

    -- list by specified attribute order
    mysql> select name, age, sid from Students;
    +--------+------+------+
    | name   | age  | sid  |
    +--------+------+------+
    | Eva    |   18 | ev77 |
    | Fatima |   20 | fm21 |
    | James  |   19 | jj25 |
    +--------+------+------+
    3 rows in set (0.00 sec)

    Keys in SQL

    A key is a set of attributes that will uniquely identify any record (row) in a table.

    -- with this create table
    create table Students
           (sid varchar(10),
            name varchar(50),
            age int,
            primary key (sid));

    -- if we try to insert this (fourth) student ...
    mysql> insert into Students set
              name = 'Flavia', age = 23, sid = 'fm21';
    ERROR 1062 (23000): Duplicate entry 'fm21' for key 'PRIMARY'

    What is a (relational) database query language?

    Input : a collection of relation instances $R_1, R_2, \ldots, R_k$

    Output : a single relation instance $Q(R_1, R_2, \ldots, R_k)$

    How can we express $Q$?

    In order to meet Codd's goals we want a query language that is high-level and independent of physical data representation.

    There are many possibilities ...

    The Relational Algebra (RA)

    Q ::= $R$              base relation
       |  $\sigma_p(Q)$    selection
       |  $\pi_X(Q)$       projection
       |  $Q \times Q$     product
       |  $Q - Q$          difference
       |  $Q \cup Q$       union
       |  $Q \cap Q$       intersection
       |  $\rho_M(Q)$      renaming

    $p$ is a simple boolean predicate over attribute values.

    $X = \{A_1, A_2, \ldots, A_k\}$ is a set of attributes.

    $M = \{A_1 \mapsto B_1, A_2 \mapsto B_2, \ldots, A_k \mapsto B_k\}$ is a renaming map.

    Relational Calculi

    The Tuple Relational Calculus (TRC)

    $$Q = \{t \mid P(t)\}$$

    The Domain Relational Calculus (DRC)

    $$Q = \{(A_1 = v_1, A_2 = v_2, \ldots, A_k = v_k) \mid P(v_1, v_2, \ldots, v_k)\}$$

    The SQL standard

    Origins at IBM in early 1970s.

    SQL has grown and grown through many rounds of standardization :
        ANSI : SQL-86
        ANSI and ISO : SQL-89, SQL-92, SQL:1999, SQL:2003, SQL:2006, SQL:2008

    SQL is made up of many sub-languages :
        Query Language
        Data Definition Language
        System Administration Language
        ...

    Selection

    R
    A    B    C    D
    20   10   0    55
    11   10   0    7
    4    99   17   2
    77   25   4    0

    =>

    Q(R)
    A    B    C    D
    20   10   0    55
    77   25   4    0

    RA   $Q = \sigma_{A > 12}(R)$

    TRC  $Q = \{t \mid t \in R \wedge t.A > 12\}$

    DRC  $Q = \{\{(A, a), (B, b), (C, c), (D, d)\} \mid \{(A, a), (B, b), (C, c), (D, d)\} \in R \wedge a > 12\}$

    SQL  select * from R where R.A > 12

    Projection

    R
    A    B    C    D
    20   10   0    55
    11   10   0    7
    4    99   17   2
    77   25   4    0

    =>

    Q(R)
    B    C
    10   0
    99   17
    25   4

    RA   $Q = \pi_{B,C}(R)$

    TRC  $Q = \{t \mid \exists u \in R \wedge t.[B, C] = u.[B, C]\}$

    DRC  $Q = \{\{(B, b), (C, c)\} \mid \exists \{(A, a), (B, b), (C, c), (D, d)\} \in R\}$

    SQL  select distinct B, C from R

    Why the distinct in the SQL?

    The SQL query

    select B, C from R

    will produce a bag (multiset)!

    R
    A    B    C    D
    20   10   0    55
    11   10   0    7
    4    99   17   2
    77   25   4    0

    =>

    Q(R)
    B    C
    10   0
    10   0
    99   17
    25   4

    SQL is actually based on multisets, not sets. We will look into this more in Lecture 11.
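Selection and projection, and the difference between set and bag semantics, can be sketched over the example relation R. The function names below are assumptions for illustration; records are frozensets of (attribute, value) pairs, and the bag version uses a list so that duplicates survive.

```python
def record(**fields):
    """A record is a set of (attribute name, value) pairs."""
    return frozenset(fields.items())


R = {
    record(A=20, B=10, C=0, D=55),
    record(A=11, B=10, C=0, D=7),
    record(A=4, B=99, C=17, D=2),
    record(A=77, B=25, C=4, D=0),
}


def select(p, r):
    """sigma_p(r): keep the records satisfying predicate p (given as a dict)."""
    return {t for t in r if p(dict(t))}


def restrict(t, attrs):
    """The part of record t that lies on the given attributes."""
    return frozenset((a, v) for a, v in t if a in attrs)


def project(attrs, r):
    """pi_attrs(r) with set semantics: duplicate records collapse."""
    return {restrict(t, attrs) for t in r}


def project_bag(attrs, r):
    """Projection with bag semantics, like SQL without distinct."""
    return [restrict(t, attrs) for t in r]
```

Here project({"B", "C"}, R) has three records while project_bag({"B", "C"}, R) has four, because (B=10, C=0) arises from two different rows.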

    Lecture 03 : Entity-Relationship (E/R) modelling

    Outline

    Entities

    Relationships

    Their relational implementations

    n-ary relationships

    Generalization

    On the importance of SCOPE

    Some real-world data ...

    ... from the Internet Movie Database (IMDb).

    Title                                         Year   Actor
    Austin Powers: International Man of Mystery   1997   Mike Myers
    Austin Powers: The Spy Who Shagged Me         1999   Mike Myers
    Dude, Where's My Car?                         2000   Bill Chott
    Dude, Where's My Car?                         2000   Marc Lynn

    Entities diagrams and Relational Schema

    [Diagram: entity Movie with attributes MovieID, Title, Year; entity Person with attributes PersonID, FirstName, LastName.]

    These diagrams represent relational schema

    Movie(MovieID, Title, Year)

    Person(PersonID, FirstName, LastName)

    Yes, this ignores types ...

    Entity sets (relational instances)

    Movie
    MovieID   Title                                         Year
    55871     Austin Powers: International Man of Mystery   1997
    55873     Austin Powers: The Spy Who Shagged Me         1999
    171771    Dude, Where's My Car?                         2000

    (Tim used line number from IMDb raw file movies.list as MovieID.)

    Person
    PersonID   FirstName   LastName
    6902836    Mike        Myers
    1757556    Bill        Chott
    5882058    Marc        Lynn

    (Tim used line number from IMDb raw file actors.list as PersonID.)

    Relationships

    [Diagram: entities Movie (MovieID, Title, Year) and Person (PersonID, FirstName, LastName), connected by the relationship ActsIn.]

    Foreign Keys and Referential Integrity

    Foreign Key

    Suppose we have $R(\underline{Z}, Y)$, that is, $Z$ is a key for $R$. Furthermore, let $S(W)$ be a relational schema with $Z \subseteq W$. We say that $Z$ represents a Foreign Key in $S$ for $R$ if for any instance we have $\pi_Z(S) \subseteq \pi_Z(R)$. This is a semantic assertion.

    Referential integrity

    A database is said to have referential integrity when all foreign key constraints are satisfied.

    A relational representation

    A relational schema

    ActsIn(MovieID, PersonID)

    With referential integrity constraints

    $$\pi_{MovieID}(ActsIn) \subseteq \pi_{MovieID}(Movie)$$

    $$\pi_{PersonID}(ActsIn) \subseteq \pi_{PersonID}(Person)$$

    ActsIn
    PersonID   MovieID
    6902836    55871
    6902836    55873
    1757556    171771
    5882058    171771

    Foreign Keys in SQL

    create table ActsIn
           ( MovieID int not NULL,
             PersonID int not NULL,
             primary key (MovieID, PersonID),
             constraint actsin_movie
                foreign key (MovieID)
                references Movie(MovieID),
             constraint actsin_person
                foreign key (PersonID)
                references Person(PersonID))
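To watch a foreign key constraint reject a dangling reference, the schema above can be loaded into an in-memory database. This sketch assumes Python's standard sqlite3 module instead of MySQL; note that SQLite only enforces foreign keys after `pragma foreign_keys = on`, and the try_insert helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign key enforcement off unless asked.
conn.execute("pragma foreign_keys = on")
conn.executescript("""
create table Movie  (MovieID int primary key, Title text, Year int);
create table Person (PersonID int primary key, FirstName text, LastName text);
create table ActsIn
       ( MovieID int not NULL,
         PersonID int not NULL,
         primary key (MovieID, PersonID),
         constraint actsin_movie
            foreign key (MovieID) references Movie(MovieID),
         constraint actsin_person
            foreign key (PersonID) references Person(PersonID));
insert into Movie values
       (55871, 'Austin Powers: International Man of Mystery', 1997);
insert into Person values (6902836, 'Mike', 'Myers');
""")


def try_insert(movie_id, person_id):
    """Insert into ActsIn; return False if a constraint rejects the row."""
    try:
        with conn:
            conn.execute("insert into ActsIn values (?, ?)",
                         (movie_id, person_id))
        return True
    except sqlite3.IntegrityError:
        return False
```

Inserting (55871, 6902836) should succeed, while a row naming a MovieID or PersonID that does not exist in the referenced table should be refused, which keeps the two subset constraints true.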

    Relational representation of relationships, in general?

    That depends ...

    Mapping Cardinalities for binary relations, $R \subseteq S \times T$

    Relation R is    meaning
    many to many     no constraints
    one to many      $\forall t \in T,\ s_1, s_2 \in S.\ (R(s_1, t) \wedge R(s_2, t)) \implies s_1 = s_2$
    many to one      $\forall s \in S,\ t_1, t_2 \in T.\ (R(s, t_1) \wedge R(s, t_2)) \implies t_1 = t_2$
    one to one       one to many and many to one

    Note that the database terminology differs slightly from standard mathematical terminology.
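The cardinality conditions are easy to test on a concrete instance. In this sketch a relation R is a Python set of (s, t) pairs; the function names are assumptions chosen to match the table above.

```python
def is_one_to_many(R):
    """Every t is related to at most one s, where R is a set of (s, t) pairs."""
    owner = {}
    for s, t in R:
        # setdefault returns the s already recorded for t, if any.
        if owner.setdefault(t, s) != s:
            return False
    return True


def is_many_to_one(R):
    """Every s is related to at most one t."""
    return is_one_to_many({(t, s) for (s, t) in R})


def is_one_to_one(R):
    """Both conditions at once; no element has to participate at all."""
    return is_one_to_many(R) and is_many_to_one(R)
```

Note that is_one_to_one accepts the empty relation and any relation that leaves elements of S or T unrelated, which is where the database usage differs from a mathematical one-to-one correspondence.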

    Diagrams for Mapping Cardinalities

    [Table of ER diagrams: each row draws the relationship R between the entities T and S with a different line or arrow style on each side; the styles did not survive extraction.]

    Relation R is
    many to many (M : N)
    one to many (1 : M)
    many to one (M : 1)
    one to one (1 : 1)

    Relationships to Relational Schema

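    [Diagram: entity T with attributes X and Y, relationship R with attribute U, entity S with attributes Z and W.]

    Underlined attributes form the key of R.

    Relation R is           Schema
    many to many (M : N)    $R(\underline{X}, \underline{Z}, U)$
    one to many (1 : M)     $R(\underline{X}, Z, U)$
    many to one (M : 1)     $R(X, \underline{Z}, U)$
    one to one (1 : 1)      $R(\underline{X}, Z, U)$ and/or $R(X, \underline{Z}, U)$ (alternate keys)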

    one to one does not mean a "1-to-1 correspondence"

    [Diagram: entity T with attributes X and Y, relationship R with attribute U, entity S with attributes Z and W.]

    This database instance is OK

    S
    Z    W
    z1   w1
    z2   w2
    z3   w3

    R
    Z    X    U
    z1   x2   u1

    T
    X    Y
    x1   y1
    x2   y2
    x3   y3
    x4   y4

    Some more real-world data ... (a slight change of SCOPE)

    Title                                         Year   Actor        Role
    Austin Powers: International Man of Mystery   1997   Mike Myers   Austin Powers
    Austin Powers: International Man of Mystery   1997   Mike Myers   Dr. Evil
    Austin Powers: The Spy Who Shagged Me         1999   Mike Myers   Austin Powers
    Austin Powers: The Spy Who Shagged Me         1999   Mike Myers   Dr. Evil
    Austin Powers: The Spy Who Shagged Me         1999   Mike Myers   Fat Bastard
    Dude, Where's My Car?                         2000   Bill Chott   Big Cult Guard 1
    Dude, Where's My Car?                         2000   Marc Lynn    Cop with Whips

    How will this change our model?

    Will ActsIn remain a binary Relationship?

    [Diagram: Movie (MovieID, Title, Year) and Person (PersonID, FirstName, LastName) connected by the relationship ActsIn, which carries the attribute Role.]

    No! An actor can have many roles in the same movie!

    Could ActsIn be modeled as a Ternary Relationship?

    [Diagram: the relationship ActsIn connects three entities: Movie (MovieID, Title, Year), Person (PersonID, FirstName, LastName) and Role (Description).]

    Yes, this works!

    Can a ternary relationship be modeled with multiple binary relationships?

    [Diagram: Movie, HasCasting, Casting, ActsIn, Person in a row; Casting is also connected through RequiresRole to Role.]

    The Casting entity seems artificial. What attributes would it have?

    Sometimes ternary to multiple binary makes more sense ...

    [Diagram, first: a ternary relationship Works-On connecting Branch, Employee and Job.]

    [Diagram, second: Branch, Involves, Project, Assigned-To, Employee in a row, with a further relationship Requires leading to Job.]

    Generalization

    [Diagram: Comedy and Drama are each connected to Movie through an ISA node.]

    Questions

    Is every movie either a comedy or a drama?

    Can a movie be a comedy and a drama?

    But perhaps this isn't a good model ...

    What attributes would distinguish Drama and Comedy entities?

    What about Science Fiction?

    Perhaps Genre would make a nice entity, which could have a relationship with Movie.

    Would a ternary relationship be better?

    Question: What is the right model?

    Answer: The question doesn't make sense! There is no right model ...

    It depends on the intended use of the database.

    What activity will the DBMS support?

    What data is needed to support that activity?

    The issue of SCOPE is missing from most textbooks

    Suppose that all databases begin life with beautifully designed schemas.

    Observe that many operational databases are in a sorry state.

    Conclude that the scope and goals of a database continually change, and that schema evolution is a difficult problem to solve, in practice.

    Another change of SCOPE ...

    Movies with detailed release dates

    Title                                         Country   Day   Month   Year
    Austin Powers: International Man of Mystery   USA       02    05      1997
    Austin Powers: International Man of Mystery   Iceland   24    10      1997
    Austin Powers: International Man of Mystery   UK        05    09      1997
    Austin Powers: International Man of Mystery   Brazil    13    02      1998
    Austin Powers: The Spy Who Shagged Me         USA       08    06      1999
    Austin Powers: The Spy Who Shagged Me         Iceland   02    07      1999
    Austin Powers: The Spy Who Shagged Me         UK        30    07      1999
    Austin Powers: The Spy Who Shagged Me         Brazil    08    10      1999
    Dude, Where's My Car?                         USA       10    12      2000
    Dude, Where's My Car?                         Iceland   9     02      2001
    Dude, Where's My Car?                         UK        9     02      2001
    Dude, Where's My Car?                         Brazil    9     03      2001
    Dude, Where's My Car?                         Russia    18    09      2001

    ... and an attribute becomes an entity with a connecting relation.

    [Diagram, before: entity Movie with attributes MovieID, Title, Year.]

    [Diagram, after: entity Movie (MovieID, Title, Year) connected by the relationship Released to the entity MovieRelease, with attributes Country and Date (Year, Month, Day).]

    Lecture 04 : Relational algebra and relational calculus

    Outline

    Constructing new tuples!

    Joins

    Limitations of Relational Algebra

    Renaming

    R
    A    B    C    D
    20   10   0    55
    11   10   0    7
    4    99   17   2
    77   25   4    0

    =>

    Q(R)
    A    E    C    F
    20   10   0    55
    11   10   0    7
    4    99   17   2
    77   25   4    0

    RA   $Q = \rho_{\{B \mapsto E,\ D \mapsto F\}}(R)$

    TRC  $Q = \{t \mid \exists u \in R \wedge t.A = u.A \wedge t.E = u.B \wedge t.C = u.C \wedge t.F = u.D\}$

    DRC  $Q = \{\{(A, a), (E, b), (C, c), (F, d)\} \mid \exists \{(A, a), (B, b), (C, c), (D, d)\} \in R\}$

    SQL  select A, B as E, C, D as F from R

    Union

    R
    A    B
    20   10
    11   10
    4    99

    S
    A    B
    20   10
    77   1000

    =>

    Q(R, S)
    A    B
    20   10
    11   10
    4    99
    77   1000

    RA   $Q = R \cup S$

    TRC  $Q = \{t \mid t \in R \vee t \in S\}$

    DRC  $Q = \{\{(A, a), (B, b)\} \mid \{(A, a), (B, b)\} \in R \vee \{(A, a), (B, b)\} \in S\}$

    SQL  (select * from R) union (select * from S)

    Intersection

    R
    A    B
    20   10
    11   10
    4    99

    S
    A    B
    20   10
    77   1000

    =>

    Q(R, S)
    A    B
    20   10

    RA   $Q = R \cap S$

    TRC  $Q = \{t \mid t \in R \wedge t \in S\}$

    DRC  $Q = \{\{(A, a), (B, b)\} \mid \{(A, a), (B, b)\} \in R \wedge \{(A, a), (B, b)\} \in S\}$

    SQL  (select * from R) intersect (select * from S)

    Difference

    R
    A    B
    20   10
    11   10
    4    99

    S
    A    B
    20   10
    77   1000

    =>

    Q(R, S)
    A    B
    11   10
    4    99

    RA   $Q = R - S$

    TRC  $Q = \{t \mid t \in R \wedge t \notin S\}$

    DRC  $Q = \{\{(A, a), (B, b)\} \mid \{(A, a), (B, b)\} \in R \wedge \{(A, a), (B, b)\} \notin S\}$

    SQL  (select * from R) except (select * from S)
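Union, intersection and difference on relations with the same schema are just the corresponding set operations. A short sketch, assuming the hypothetical record helper used for illustration, reproduces the three result tables for R and S.

```python
def record(**fields):
    """A record is a set of (attribute name, value) pairs."""
    return frozenset(fields.items())


R = {record(A=20, B=10), record(A=11, B=10), record(A=4, B=99)}
S = {record(A=20, B=10), record(A=77, B=1000)}

union = R | S          # R union S
intersection = R & S   # R intersect S
difference = R - S     # R minus S
```

The record (A=20, B=10) occurs in both inputs but only once in the union, because these are sets and not bags.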

    Wait, are we missing something?

    Suppose we want to add information about college membership to our Student database. We could add an additional attribute for the college.

    StudentsWithCollege :
    +--------+------+------+---------+
    | name   | age  | sid  | college |
    +--------+------+------+---------+
    | Eva    |   18 | ev77 | Kings   |
    | Fatima |   20 | fm21 | Clare   |
    | James  |   19 | jj25 | Clare   |
    +--------+------+------+---------+

    Put logically independent data in distinct tables?

    Students :
    +--------+------+------+-----+
    | name   | age  | sid  | cid |
    +--------+------+------+-----+
    | Eva    |   18 | ev77 | k   |
    | Fatima |   20 | fm21 | cl  |
    | James  |   19 | jj25 | cl  |
    +--------+------+------+-----+

    Colleges :
    +-----+---------------+
    | cid | college_name  |
    +-----+---------------+
    | k   | Kings         |
    | cl  | Clare         |
    | sid | Sidney Sussex |
    | q   | Queens        |
    ... .....

    But how do we put them back together again?

    Product

    R
    A    B
    20   10
    11   10
    4    99

    S
    C    D
    14   99
    77   100

    =>

    Q(R, S)
    A    B    C    D
    20   10   14   99
    20   10   77   100
    11   10   14   99
    11   10   77   100
    4    99   14   99
    4    99   77   100

    Note the automatic flattening

    RA   $Q = R \times S$

    TRC  $Q = \{t \mid \exists u \in R, v \in S,\ t.[A, B] = u.[A, B] \wedge t.[C, D] = v.[C, D]\}$

    DRC  $Q = \{\{(A, a), (B, b), (C, c), (D, d)\} \mid \{(A, a), (B, b)\} \in R \wedge \{(C, c), (D, d)\} \in S\}$

    SQL  select A, B, C, D from R, S

    Product is special!

    R
    A    B
    20   10
    4    99

    =>

    $R \times \rho_{\{A \mapsto C,\ B \mapsto D\}}(R)$
    A    B    C    D
    20   10   20   10
    20   10   4    99
    4    99   20   10
    4    99   4    99

    $\times$ is the only operation in the Relational Algebra that creates new records (ignoring renaming),

    But $\times$ usually creates too many records!

    Joins are the typical way of using products in a constrained manner.
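Product and renaming can be sketched on the same record representation. The function names are assumptions for illustration, and the product below assumes the two relations have disjoint attribute names, which is exactly why the example renames R before multiplying it with itself.

```python
def record(**fields):
    """A record is a set of (attribute name, value) pairs."""
    return frozenset(fields.items())


def rename(mapping, r):
    """rho_mapping(r): rename attributes, leaving unmapped names alone."""
    return {frozenset((mapping.get(a, a), v) for a, v in t) for t in r}


def product(r, s):
    """r x s for disjoint attribute names; merging the pairs 'flattens' them."""
    return {u | v for u in r for v in s}


R = {record(A=20, B=10), record(A=4, B=99)}

# R x rho_{A -> C, B -> D}(R)
Q = product(R, rename({"A": "C", "B": "D"}, R))
```

Two input records give four output records, and in general the product of relations with m and n records has m * n, which is why joins restrict it.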

    Natural Join

    Given $R(X, Y)$ and $S(Y, Z)$, we define the natural join, denoted $R \bowtie S$, as a relation over attributes $X, Y, Z$ defined as

    $$R \bowtie S \equiv \{t \mid \exists u \in R, v \in S,\ u.[Y] = v.[Y] \wedge t = u.[X] \cup u.[Y] \cup v.[Z]\}$$

    In the Relational Algebra:

    $$R \bowtie S = \pi_{X,Y,Z}(\sigma_{Y = Y'}(R \times \rho_{Y \mapsto Y'}(S)))$$

    Join example

    Students
    name     sid    age   cid
    Fatima   fm21   20    cl
    Eva      ev77   18    k
    James    jj25   19    cl

    Colleges
    cid   cname
    k     Kings
    cl    Clare
    q     Queens
    ...   ...

    =>

    $\pi_{name, cname}(Students \bowtie Colleges)$
    name     cname
    Fatima   Clare
    Eva      Kings
    James    Clare

    The same in SQL

    select name, cname
    from Students, Colleges
    where Students.cid = Colleges.cid

    +--------+--------+
    | name   | cname  |
    +--------+--------+
    | Eva    | Kings  |
    | Fatima | Clare  |
    | James  | Clare  |
    +--------+--------+
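The natural join can also be sketched directly from its definition: merge every pair of records that agree on all shared attribute names. The helper names below are assumptions for illustration, and the nested loop is the naive method, not how a DBMS would evaluate a join.

```python
def record(**fields):
    """A record is a set of (attribute name, value) pairs."""
    return frozenset(fields.items())


def natural_join(r, s):
    """Merge every pair of records that agree on all shared attributes."""
    out = set()
    for u in r:
        for v in s:
            du, dv = dict(u), dict(v)
            if all(du[a] == dv[a] for a in du.keys() & dv.keys()):
                out.add(u | v)  # shared pairs coincide, so the union is a record
    return out


def project(attrs, r):
    """pi_attrs(r) with set semantics."""
    return {frozenset((a, v) for a, v in t if a in attrs) for t in r}


Students = {
    record(name="Fatima", sid="fm21", age=20, cid="cl"),
    record(name="Eva", sid="ev77", age=18, cid="k"),
    record(name="James", sid="jj25", age=19, cid="cl"),
}
Colleges = {
    record(cid="k", cname="Kings"),
    record(cid="cl", cname="Clare"),
    record(cid="q", cname="Queens"),
}

result = project({"name", "cname"}, natural_join(Students, Colleges))
```

Queens has no students in this instance, so it does not appear in the result.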

    Division

    Given $R(X, Y)$ and $S(Y)$, the division of $R$ by $S$, denoted $R \div S$, is the relation over attributes $X$ defined as (in the TRC)

    $$R \div S \equiv \{x \mid \forall s \in S,\ x \cup s \in R\}.$$

    name     award
    Fatima   writing
    Fatima   music
    Eva      music
    Eva      writing
    Eva      dance
    James    dance

    $\div$

    award
    music
    writing
    dance

    =

    name
    Eva

    Division in the Relational Algebra?

    Clearly, $R \div S \subseteq \pi_X(R)$. So $R \div S = \pi_X(R) - C$, where $C$ represents counterexamples to the division condition. That is, in the TRC,

    $$C = \{x \mid \exists s \in S,\ x \cup s \notin R\}.$$

    $U = \pi_X(R) \times S$ represents all possible $x \cup s$ for $x \in \pi_X(R)$ and $s \in S$,

    so $T = U - R$ represents all those $x \cup s$ that are not in $R$,

    so $C = \pi_X(T)$ represents those records $x$ that are counterexamples.

    Division in RA

    $$R \div S \equiv \pi_X(R) - \pi_X((\pi_X(R) \times S) - R)$$
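The derivation of division can be checked on the awards example. For brevity this sketch represents R as a set of (name, award) pairs and S as a set of awards, which is a simplification of the record notation; the two function names are hypothetical. One function follows the TRC definition, restricted to names that occur in R, and the other follows the RA expression step by step.

```python
R = {
    ("Fatima", "writing"), ("Fatima", "music"),
    ("Eva", "music"), ("Eva", "writing"), ("Eva", "dance"),
    ("James", "dance"),
}
S = {"music", "writing", "dance"}


def divide(r, s):
    """Definition: the x in pi_X(r) such that (x, y) is in r for every y in s."""
    xs = {x for (x, _) in r}
    return {x for x in xs if all((x, y) in r for y in s)}


def divide_ra(r, s):
    """pi_X(R) - pi_X((pi_X(R) x S) - R), one step per line."""
    xs = {x for (x, _) in r}            # pi_X(R)
    U = {(x, y) for x in xs for y in s}  # pi_X(R) x S
    T = U - r                            # pairs missing from R
    C = {x for (x, _) in T}              # counterexamples
    return xs - C
```

Both functions should return only Eva here, since she is the only student holding every award in S.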

    Query Safety

    A query like $Q = \{t \mid t \in R \wedge t \notin S\}$ raises some interesting questions. Should we allow the following query?

    $$Q = \{t \mid t \notin S\}$$

    We want our relations to be finite!

    Safety

    A (TRC) query $Q = \{t \mid P(t)\}$ is safe if it is always finite for any database instance.

    Problem : query safety is not decidable!

    Solution : define a restricted syntax that guarantees safety.

    Safe queries can be represented in the Relational Algebra.

    Limitations of simple relational query languages

    The expressive power of RA, TRC, and DRC is essentially the same. None can express the transitive closure of a relation.

    We could extend RA to more powerful languages (like Datalog).

    SQL has been extended with many features beyond the Relational Algebra.
    stored procedures
    recursive queries
    ability to embed SQL in standard procedural languages
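To make the recursive-query extension concrete, here is a small sketch that computes a transitive closure with a recursive common table expression. It runs against SQLite through Python's `sqlite3` module; the `edge` table and its three rows are hypothetical, and other systems may spell recursive queries slightly differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table edge (src text, dst text)")
conn.executemany("insert into edge values (?, ?)",
                 [("a", "b"), ("b", "c"), ("c", "d")])

# reach starts as a copy of edge and grows by one hop per iteration.
# UNION (rather than UNION ALL) drops duplicates, so the recursion
# also stops on cyclic data.
rows = conn.execute("""
    with recursive reach(src, dst) as (
        select src, dst from edge
        union
        select reach.src, edge.dst
        from reach join edge on reach.dst = edge.src
    )
    select src, dst from reach order by src, dst
""").fetchall()

print(rows)
```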


    Lecture 05 : SQL and integrity constraints

    Outline

    NULL in SQL

    three-valued logic

    Multisets and aggregation in SQL

    Views

    General integrity constraints


    What is NULL in SQL?

    What if you don't know Kim's age?

    mysql> select * from students;

    +------+--------+------+

    | sid | name | age |

    +------+--------+------+

    | ev77 | Eva | 18 |

    | fm21 | Fatima | 20 |

    | jj25 | James | 19 |

    | ks87 | Kim | NULL |

    +------+--------+------+


    What is NULL?

    NULL is a place-holder, not a value!

    NULL is not a member of any domain (type).

    For records with NULL for age, an expression like age > 20 must be unknown!

    This means we need (at least) three-valued logic.

    Let ⊥ represent "We don't know!"

    ∧ | T F ⊥
    T | T F ⊥
    F | F F F
    ⊥ | ⊥ F ⊥

    ∨ | T F ⊥
    T | T T T
    F | T F ⊥
    ⊥ | T ⊥ ⊥

    v | ¬v
    T | F
    F | T
    ⊥ | ⊥
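The three truth tables can be sketched in Python by letting `None` stand for the unknown value. The function names `and3`, `or3` and `not3` are made up for this illustration.

```python
def and3(a, b):
    """Three-valued AND: a definite False wins, otherwise unknown spreads."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def or3(a, b):
    """Three-valued OR: a definite True wins, otherwise unknown spreads."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

def not3(a):
    """Three-valued NOT: unknown stays unknown."""
    return None if a is None else (not a)

# Kim's age is unknown, so "age > 20" is unknown for her record.
kim_older_than_20 = None
print(and3(kim_older_than_20, False), or3(kim_older_than_20, True))
```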


    NULL can lead to unexpected results

    mysql> select * from students;

    +------+--------+------+

    | sid | name | age |

    +------+--------+------+

    | ev77 | Eva | 18 |

    | fm21 | Fatima | 20 |

    | jj25 | James | 19 |

    | ks87 | Kim | NULL |

    +------+--------+------+

    mysql> select * from students where age <> 19;

    +------+--------+------+

    | sid | name | age |

    +------+--------+------+

    | ev77 | Eva | 18 |

    | fm21 | Fatima | 20 |

    +------+--------+------+

    select ... where P

    The select statement only returns those records where the where predicate evaluates to true.
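A quick way to see this behaviour is to rebuild the students table in an in-memory SQLite database from Python. This is only a sketch: SQLite stands in for MySQL here, and the column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table students (sid text, name text, age integer)")
conn.executemany("insert into students values (?, ?, ?)", [
    ("ev77", "Eva", 18), ("fm21", "Fatima", 20),
    ("jj25", "James", 19), ("ks87", "Kim", None),   # None is stored as NULL
])

def names(where):
    """Names of the students for which the given where predicate is true."""
    sql = "select name from students where " + where + " order by sid"
    return [row[0] for row in conn.execute(sql)]

not_19 = names("age <> 19")    # Kim is missing: NULL <> 19 is unknown
is_19 = names("age = 19")      # Kim is missing here too
unknown = names("age is null") # NULL has to be tested with "is null"
print(not_19, is_19, unknown)
```

Kim appears in neither of the first two results, even though the two predicates look like complements of each other.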


    The ambiguity of NULL

    Possible interpretations of NULL

    There is a value, but we don't know what it is.

    No value is applicable.

    The value is known, but you are not allowed to see it.

    ...

    A great deal of semantic muddle is created by conflating all of these interpretations into one non-value.

    On the other hand, introducing distinct NULLs for each possible interpretation leads to very complex logics ...


    Not everyone approves of NULL

    C. J. Date [D2004], Chapter 19

    "Before we go any further, we should make it very clear that in our opinion (and in that of many other writers too, we hasten to add), NULLs and 3VL are and always were a serious mistake and have no place in the relational model."


    age is not a good attribute ...

    The age column is guaranteed to go out of date! Let's record dates of birth instead!

    create table Students

    ( sid varchar(10) not NULL,

    name varchar(50) not NULL,

    birth_date date,

    cid varchar(3) not NULL,

    primary key (sid),

    constraint student_college foreign key (cid)

    references Colleges(cid) )


    age is not a good attribute ...

    mysql> select * from Students;

    +------+---------+------------+-----+

    | sid | name | birth_date | cid |

    +------+---------+------------+-----+

    | ev77 | Eva | 1990-01-26 | k |

    | fm21 | Fatima | 1988-07-20 | cl |

    | jj25 | James | 1989-03-14 | cl |

    +------+---------+------------+-----+


    Use a view to recover original table

    (Note : the age calculation here is not correct!)

    create view StudentsWithAge as

    select sid, name,

    (year(current_date()) - year(birth_date)) as age,

    cid

    from Students;

    mysql> select * from StudentsWithAge;

    +------+---------+------+-----+

    | sid | name | age | cid |

    +------+---------+------+-----+

    | ev77 | Eva | 19 | k |

    | fm21 | Fatima | 21 | cl |

    | jj25 | James | 20 | cl |

    +------+---------+------+-----+

    Views are simply identifiers that represent a query. The view's name can be used as if it were a stored table.


    But that calculation is not correct ...

    Clearly the calculation of age does not take into account the day and month of year.

    From 2010 Database Contest (winner : Sebastian Probst Eide)

    SELECT year(CURRENT_DATE()) - year(birth_date) -
           CASE WHEN month(CURRENT_DATE()) < month(birth_date)
                THEN 1
                ELSE
                  CASE WHEN month(CURRENT_DATE()) = month(birth_date)
                       THEN
                         CASE WHEN day(CURRENT_DATE()) < day(birth_date)
                              THEN 1
                              ELSE 0
                         END
                       ELSE 0
                  END
           END
           AS age FROM Students
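The same rule (subtract one if this year's birthday has not happened yet) can be sketched in Python, which makes the nested CASE logic easier to check by hand. The function name `age_on` and the idea of passing `today` explicitly are choices made for this illustration.

```python
from datetime import date

def age_on(birth_date, today):
    """Whole years from birth_date to today."""
    # Tuples compare month first, then day, mirroring the nested CASE.
    before_birthday = (today.month, today.day) < (birth_date.month, birth_date.day)
    return today.year - birth_date.year - int(before_birthday)

eva = date(1990, 1, 26)
print(age_on(eva, date(2012, 1, 25)), age_on(eva, date(2012, 1, 26)))
```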


    An Example ...

    mysql> select * from marks;

    +-------+-----------+------+

    | sid | course | mark |

    +-------+-----------+------+

    | ev77 | databases | 92 |

    | ev77 | spelling | 99 |

    | tgg22 | spelling | 3 |

    | tgg22 | databases | 100 |

    | fm21 | databases | 92 |

    | fm21 | spelling | 100 |

    | jj25 | databases | 88 |

    | jj25 | spelling | 92 |

    +-------+-----------+------+


    ... of duplicates

    mysql> select mark from marks;

    +------+

    | mark |

    +------+

    | 92 |

    | 99 |

    | 3 |

    | 100 |

    | 92 |

    | 100 |

    | 88 |

    | 92 |

    +------+


    Why Multisets?

    Duplicates are important for aggregate functions.

    mysql> select min(mark),

    max(mark),

    sum(mark),

    avg(mark)

    from marks;

    +-----------+-----------+-----------+-----------+

    | min(mark) | max(mark) | sum(mark) | avg(mark) |

    +-----------+-----------+-----------+-----------+

    | 3 | 100 | 666 | 83.2500 |

    +-----------+-----------+-----------+-----------+
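To see why duplicates matter, the following sketch loads the marks table into an in-memory SQLite database and compares the average over the multiset of marks with the average over the set of distinct marks. SQLite is used here only as a convenient stand-in for MySQL.

```python
import sqlite3

MARKS = [
    ("ev77", "databases", 92), ("ev77", "spelling", 99),
    ("tgg22", "spelling", 3), ("tgg22", "databases", 100),
    ("fm21", "databases", 92), ("fm21", "spelling", 100),
    ("jj25", "databases", 88), ("jj25", "spelling", 92),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table marks (sid text, course text, mark integer)")
conn.executemany("insert into marks values (?, ?, ?)", MARKS)

# avg(mark) sees all eight rows; avg(distinct mark) collapses the
# repeated 92s and 100s first, which changes the answer.
multiset_avg, set_avg = conn.execute(
    "select avg(mark), avg(distinct mark) from marks"
).fetchone()
print(multiset_avg, set_avg)
```

If select returned a set rather than a multiset, the second (smaller) figure is what an average over the mark column would report.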


    The group by clause

    mysql> select course,

    min(mark),

    max(mark),

    avg(mark)

    from marks

    group by course;

    +-----------+-----------+-----------+-----------+

    | course | min(mark) | max(mark) | avg(mark) |

    +-----------+-----------+-----------+-----------+

    | databases | 88 | 100 | 93.0000 |

    | spelling | 3 | 100 | 73.5000 |

    +-----------+-----------+-----------+-----------+


    Visualizing group by

    sid    course     mark
    ev77   databases  92
    ev77   spelling   99
    tgg22  spelling   3
    tgg22  databases  100
    fm21   databases  92
    fm21   spelling   100
    jj25   databases  88
    jj25   spelling   92

    group by =>

    course    mark
    spelling  99
    spelling  3
    spelling  100
    spelling  92

    course     mark
    databases  92
    databases  100
    databases  92
    databases  88


    Visualizing group by

    course    mark
    spelling  99
    spelling  3
    spelling  100
    spelling  92

    course     mark
    databases  92
    databases  100
    databases  92
    databases  88

    min(mark) =>

    course     min(mark)
    spelling   3
    databases  88


    The having clause

    How can we select on the aggregated columns?

    mysql> select course,

    min(mark),

    max(mark),

    avg(mark)

    from marks

    group by course

    having min(mark) > 60;

    +-----------+-----------+-----------+-----------+

    | course | min(mark) | max(mark) | avg(mark) |

    +-----------+-----------+-----------+-----------+

    | databases | 88 | 100 | 93.0000 |

    +-----------+-----------+-----------+-----------+


    Use renaming to make things nicer ...

    mysql> select course,

    min(mark) as minimum,

    max(mark) as maximum,

    avg(mark) as average

    from marks

    group by course

    having minimum > 60;

    +-----------+---------+---------+---------+

    | course | minimum | maximum | average |

    +-----------+---------+---------+---------+

    | databases | 88 | 100 | 93.0000 |

    +-----------+---------+---------+---------+


    Materialized Views

    Suppose Q is a very expensive, and very frequent query.

    Why not de-normalize some data to speed up the evaluation of Q?
      This might be a reasonable thing to do, or ...
      ... it might be the first step to destroying the integrity of your data design.

    Why not store the value of Q in a table?
      This is called a materialized view.
      But now there is a problem: How often should this view be refreshed?


    General integrity constraints

    Suppose that $C$ is some constraint we would like to enforce on our database.

    Let $Q_C$ be a query that captures all violations of $C$.

    Enforce (somehow) the assertion that $Q_C$ is always empty.

    Example

    $C = Z \to W$, an FD that was not preserved for relation $R(X)$,

    Let $Q_R$ be a join that reconstructs $R$,

    Let $Q'_R$ be this query with $X \mapsto X'$ and

    $Q_C = \sigma_{W \neq W'}(\sigma_{Z = Z'}(Q_R \times Q'_R))$


    Assertions in SQL

    create view C_violations as ....

    create assertion check_C

    check (not exists (select * from C_violations))
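Few widely used systems accept create assertion, so in practice the violations view is often checked by hand or from a trigger. The sketch below shows the idea for an FD z → w using SQLite from Python. The table `r`, its columns, the view body and the helper `violates` are all hypothetical stand-ins for the elided view definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table r (z text, w text, y text)")

# Pairs of rows that agree on z but disagree on w violate z -> w.
conn.execute("""
    create view c_violations as
    select r1.z, r1.w as w1, r2.w as w2
    from r as r1 join r as r2 on r1.z = r2.z
    where r1.w <> r2.w
""")

def violates(conn):
    """True if the violations view is non-empty."""
    return conn.execute(
        "select exists (select * from c_violations)"
    ).fetchone()[0] == 1

conn.executemany("insert into r values (?, ?, ?)", [
    ("bb44", "New Hall", "Databases"),
    ("bb44", "New Hall", "Algorithms II"),
])
ok_before = not violates(conn)

conn.execute("insert into r values ('bb44', 'Trinity', 'Logic')")
broken_after = violates(conn)
print(ok_before, broken_after)
```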


    Lecture 06 : Case Study - Cancer registry for the NHS

    ECRIC is a cancer registry, recording details about all tumours in people in the East of England. This data is particularly sensitive, and its use is strictly controlled. The lecture focusses on the challenges of scaling up the registration system to cover all cancer patients in England, while still maintaining the long term accuracy and continuity of the data set.


    Lecture 07 : Schema refinement I

    Outline

    ER is for top-down and informal (but rigorous) design

    FDs are used for bottom-up and formal design and analysis

    update anomalies

    Reasoning about Functional Dependencies

    Heath's rule


    Update anomalies

    Big Table

    sid   name  college   course         part  term_name
    yy88  Yoni  New Hall  Algorithms I   IA    Easter
    uu99  Uri   Kings     Algorithms I   IA    Easter
    bb44  Bin   New Hall  Databases      IB    Lent
    bb44  Bin   New Hall  Algorithms II  IB    Michaelmas
    zz70  Zip   Trinity   Databases      IB    Lent
    zz70  Zip   Trinity   Algorithms II  IB    Michaelmas

    How can we tell if an inserted record is consistent with current records?

    Can we record data about a course before students enroll?

    Will we wipe out information about a college when the last student associated with the college is deleted?


    Redundancy implies more locking ...

    ... at least for correct transactions!

    Big Table

    sid   name  college   course         part  term_name
    yy88  Yoni  New Hall  Algorithms I   IA    Easter
    uu99  Uri   Kings     Algorithms I   IA    Easter
    bb44  Bin   New Hall  Databases      IB    Lent
    bb44  Bin   New Hall  Algorithms II  IB    Michaelmas
    zz70  Zip   Trinity   Databases      IB    Lent
    zz70  Zip   Trinity   Algorithms II  IB    Michaelmas

    Change New Hall to Murray Edwards College
      Conceptually simple update
      May require locking entire table.


    Redundancy is the root of (almost) all database evils

    It may not be obvious, but redundancy is also the cause of update anomalies.

    By redundancy we do not mean that some values occur many times in the database!
      A foreign key value may have millions of copies!

    But then, what do we mean?


    Functional Dependency

    Functional Dependency (FD)

    Let $R(X)$ be a relational schema and $Y \subseteq X$, $Z \subseteq X$ be two attribute sets. We say $Y$ functionally determines $Z$, written $Y \to Z$, if for any two tuples $u$ and $v$ in an instance of $R(X)$ we have

    $u.Y = v.Y \implies u.Z = v.Z$.

    We call $Y \to Z$ a functional dependency.

    A functional dependency is a semantic assertion. It represents a rule that should always hold in any instance of schema $R(X)$.
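The definition can be tested against a single instance with a few lines of Python. This is only a sketch: the helper `satisfies_fd` is hypothetical, the rows are those of the Big Table used in these slides, and a passing check never proves an FD, because an FD is a claim about every instance. A failing check does refute it.

```python
COLUMNS = ("sid", "name", "college", "course", "part", "term_name")
BIG_TABLE = [dict(zip(COLUMNS, row)) for row in [
    ("yy88", "Yoni", "New Hall", "Algorithms I", "IA", "Easter"),
    ("uu99", "Uri", "Kings", "Algorithms I", "IA", "Easter"),
    ("bb44", "Bin", "New Hall", "Databases", "IB", "Lent"),
    ("bb44", "Bin", "New Hall", "Algorithms II", "IB", "Michaelmas"),
    ("zz70", "Zip", "Trinity", "Databases", "IB", "Lent"),
    ("zz70", "Zip", "Trinity", "Algorithms II", "IB", "Michaelmas"),
]]

def satisfies_fd(rows, lhs, rhs):
    """True if all rows that agree on lhs also agree on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # setdefault records the first rhs value seen for this lhs value.
        if seen.setdefault(key, val) != val:
            return False
    return True

print(satisfies_fd(BIG_TABLE, ["sid"], ["name", "college"]))
print(satisfies_fd(BIG_TABLE, ["college"], ["sid"]))
```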


    Example FDs

    Big Table

    sid   name  college   course         part  term_name
    yy88  Yoni  New Hall  Algorithms I   IA    Easter
    uu99  Uri   Kings     Algorithms I   IA    Easter
    bb44  Bin   New Hall  Databases      IB    Lent
    bb44  Bin   New Hall  Algorithms II  IB    Michaelmas
    zz70  Zip   Trinity   Databases      IB    Lent
    zz70  Zip   Trinity   Algorithms II  IB    Michaelmas

    sid → name

    sid → college

    course → part

    course → term_name


    Keys, revisited

    Candidate Key

    Let $R(X)$ be a relational schema and $Y \subseteq X$. $Y$ is a candidate key if
    1 The FD $Y \to X$ holds, and
    2 for no proper subset $Z \subset Y$ does $Z \to X$ hold.

    Prime and Non-prime attributes

    An attribute $A$ is prime for $R(X)$ if it is a member of some candidate key for $R$. Otherwise, $A$ is non-prime.

    Database redundancy roughly means the existence of non-key functional dependencies!


    Semantic Closure

    Notation

    $F \models Y \to Z$

    means that any database instance that satisfies every FD of $F$ must also satisfy $Y \to Z$.

    The semantic closure of $F$, denoted $F^+$, is defined to be

    $F^+ = \{Y \to Z \mid Y \cup Z \subseteq \mathrm{atts}(F) \text{ and } F \models Y \to Z\}$.

    The membership problem is to determine if $Y \to Z \in F^+$.


    Reasoning about Functional Dependencies

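    We write $F \vdash Y \to Z$ when $Y \to Z$ can be derived from $F$ via the following rules.

    Armstrong's Axioms

    Reflexivity: If $Z \subseteq Y$, then $F \vdash Y \to Z$.

    Augmentation: If $F \vdash Y \to Z$ then $F \vdash Y, W \to Z, W$.

    Transitivity: If $F \vdash Y \to Z$ and $F \vdash Z \to W$, then $F \vdash Y \to W$.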


    Logical Closure (of a set of attributes)

    Notation

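    $\mathrm{closure}(F, X) = \{A \mid F \vdash X \to A\}$

    Claim 1

    If $Y \to W \in F$ and $Y \subseteq \mathrm{closure}(F, X)$, then $W \subseteq \mathrm{closure}(F, X)$.

    Claim 2

    $Y \to W \in F^+$ if and only if $W \subseteq \mathrm{closure}(F, Y)$.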


    Soundness and Completeness

    Soundness

    $F \vdash f \implies f \in F^+$

    Completeness

    $f \in F^+ \implies F \vdash f$


    Proof of Completeness (soundness left as an exercise)

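    Show $\neg(F \vdash f) \implies \neg(F \models f)$:

    Suppose $\neg(F \vdash Y \to Z)$ for $R(X)$.

    Let $Y^+ = \mathrm{closure}(F, Y)$.

    $\exists B \in Z$, with $B \notin Y^+$.

    Construct an instance of $R$ with just two records, $u$ and $v$, that agree on $Y^+$ but not on $X - Y^+$.

    By construction, this instance does not satisfy $Y \to Z$.

    But it does satisfy $F$! Why?
      let $S \to T$ be any FD in $F$, with $u.[S] = v.[S]$.
      So $S \subseteq Y^+$, and so $T \subseteq Y^+$ by claim 1,
      and so $u.[T] = v.[T]$.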


    Closure

    By soundness and completeness

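    $\mathrm{closure}(F, X) = \{A \mid F \vdash X \to A\} = \{A \mid X \to A \in F^+\}$

    Claim 2 (from previous lecture)

    $Y \to W \in F^+$ if and only if $W \subseteq \mathrm{closure}(F, Y)$.

    If we had an algorithm for $\mathrm{closure}(F, X)$, then we would have a (brute force!) algorithm for enumerating $F^+$:

    $F^+$
      for every subset $Y \subseteq \mathrm{atts}(F)$
        for every subset $Z \subseteq \mathrm{closure}(F, Y)$,
          output $Y \to Z$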


    Attribute Closure Algorithm

    Input : a set of FDs $F$ and a set of attributes $X$.

    Output : $Y = \mathrm{closure}(F, X)$

    1 $Y := X$

    2 while there is some $S \to T \in F$ with $S \subseteq Y$ and $T \not\subseteq Y$, then $Y := Y \cup T$.
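The two steps above translate almost directly into Python. In this sketch an FD is represented as a pair of sets `(lhs, rhs)`, which is a representation chosen for illustration, and the sample FDs are the ones from the exercise used in these slides (A, B → C; C → D; D → A).

```python
def closure(fds, attrs):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) set pairs."""
    result = set(attrs)                       # step 1: Y := X
    changed = True
    while changed:                            # step 2: repeat until nothing applies
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs                 # Y := Y union T
                changed = True
    return result

F = [({"A", "B"}, {"C"}), ({"C"}, {"D"}), ({"D"}, {"A"})]

print(sorted(closure(F, {"A", "B"})))
print(sorted(closure(F, {"C"})))
```

Each pass either adds at least one attribute or stops, so the loop runs at most as many times as there are attributes.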


    An Example (UW1997, Exercise 3.6.1)

    $R(A, B, C, D)$ with $F$ made up of the FDs

    $A, B \to C$
    $C \to D$
    $D \to A$

    What is $F^+$?

    Brute force!

    Let's just consider all possible nonempty sets $X$: there are only 15...


    Example (cont.)

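    $F = \{A, B \to C,\ C \to D,\ D \to A\}$

    For the single attributes we have

    $\{A\}^+ = \{A\}$,

    $\{B\}^+ = \{B\}$,

    $\{C\}^+ = \{A, C, D\}$,

    $\{C\} \overset{C \to D}{\Longrightarrow} \{C, D\} \overset{D \to A}{\Longrightarrow} \{A, C, D\}$

    $\{D\}^+ = \{A, D\}$

    $\{D\} \overset{D \to A}{\Longrightarrow} \{A, D\}$

    The only new dependency we get with a single attribute on the left is $C \to A$.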


    Example (cont.)

    $F = \{A, B \to C,\ C \to D,\ D \to A\}$

    Now consider pairs of attributes.

    $\{A, B\}^+ = \{A, B, C, D\}$, so $A, B \to D$ is a new dependency

    $\{A, C\}^+ = \{A, C, D\}$, so $A, C \to D$ is a new dependency

    $\{A, D\}^+ = \{A, D\}$, so nothing new.

    $\{B, C\}^+ = \{A, B, C, D\}$, so $B, C \to A, D$ is a new dependency

    $\{B, D\}^+ = \{A, B, C, D\}$, so $B, D \to A, C$ is a new dependency

    $\{C, D\}^+ = \{A, C, D\}$, so $C, D \to A$ is a new dependency


    Example (cont.)

    $F = \{A, B \to C,\ C \to D,\ D \to A\}$

    For the triples of attributes:

    $\{A, C, D\}^+ = \{A, C, D\}$,

    $\{A, B, D\}^+ = \{A, B, C, D\}$, so $A, B, D \to C$ is a new dependency

    $\{A, B, C\}^+ = \{A, B, C, D\}$, so $A, B, C \to D$ is a new dependency

    $\{B, C, D\}^+ = \{A, B, C, D\}$, so $B, C, D \to A$ is a new dependency

    And since $\{A, B, C, D\}^+ = \{A, B, C, D\}$, we get no new dependencies with four attributes.


    Example (cont.)

    We generated 11 new FDs:

    $C \to A$        $A, B \to D$
    $A, C \to D$     $B, C \to A$
    $B, C \to D$     $B, D \to A$
    $B, D \to C$     $C, D \to A$
    $A, B, C \to D$  $A, B, D \to C$
    $B, C, D \to A$

    Can you see the Key?

    $\{A, B\}$, $\{B, C\}$, and $\{B, D\}$ are keys.

    Note: this schema is already in 3NF! Why?
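The keys can also be found mechanically: try every non-empty attribute set in order of size and keep those whose closure is the whole schema and that contain no smaller key. The sketch below does this for the example; the helper names and the `(lhs, rhs)` set-pair encoding of FDs are assumptions, and the search is exponential, so it only suits small schemas.

```python
from itertools import combinations

def closure(fds, attrs):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def subsets(attrs):
    """All non-empty subsets of attrs, smallest first."""
    items = sorted(attrs)
    for n in range(1, len(items) + 1):
        for combo in combinations(items, n):
            yield frozenset(combo)

def candidate_keys(fds, attrs):
    """Minimal attribute sets whose closure covers the whole schema."""
    keys = []
    for x in subsets(attrs):
        is_superkey = closure(fds, x) == set(attrs)
        # Smaller sets were tried first, so any key inside x is already known.
        if is_superkey and not any(k < x for k in keys):
            keys.append(x)
    return keys

F = [({"A", "B"}, {"C"}), ({"C"}, {"D"}), ({"D"}, {"A"})]
keys = candidate_keys(F, "ABCD")
print([sorted(k) for k in keys])
```

Every attribute appears in at least one of the keys found, which bears on the 3NF question above.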


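    Consequences of Armstrong's Axioms

    Union: If $F \models Y \to Z$ and $F \models Y \to W$, then $F \models Y \to W, Z$.

    Pseudo-transitivity: If $F \models Y \to Z$ and $F \models U, Z \to W$, then $F \models Y, U \to W$.

    Decomposition: If $F \models Y \to Z$ and $W \subseteq Z$, then $F \models Y \to W$.

    Exercise : Prove these using Armstrong's axioms!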


    Proof of the Union Rule

    Suppose we have $F \models Y \to Z$, $F \models Y \to W$.

    By augmentation we have

    $F \models Y, Y \to Y, Z$,

    that is, $F \models Y \to Y, Z$.

    Also using augmentation we obtain

    $F \models Y, Z \to W, Z$.

    Therefore, by transitivity we obtain

    $F \models Y \to W, Z$.


    Example application of functional reasoning.

    Heath's Rule

    Suppose $R(A, B, C)$ is a relational schema with functional dependency $A \to B$, then

    $R = \pi_{A,B}(R) \bowtie_A \pi_{A,C}(R)$.


    Proof of Heath's Rule

    We first show that $R \subseteq \pi_{A,B}(R) \bowtie_A \pi_{A,C}(R)$.

    If $u = (a, b, c) \in R$, then $u_1 = (a, b) \in \pi_{A,B}(R)$ and $u_2 = (a, c) \in \pi_{A,C}(R)$.

    Since $\{(a, b)\} \bowtie_A \{(a, c)\} = \{(a, b, c)\}$ we know $u \in \pi_{A,B}(R) \bowtie_A \pi_{A,C}(R)$.

    In the other direction we must show $R' = \pi_{A,B}(R) \bowtie_A \pi_{A,C}(R) \subseteq R$.

    If $u = (a, b, c) \in R'$, then there must exist tuples $u_1 = (a, b) \in \pi_{A,B}(R)$ and $u_2 = (a, c) \in \pi_{A,C}(R)$.

    This means that there must exist a $u' = (a, b', c) \in R$ such that $u_2 = \pi_{A,C}(\{(a, b', c)\})$.

    However, the functional dependency tells us that $b = b'$, so $u = (a, b, c) \in R$.
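Heath's Rule is easy to try out on small instances. In this sketch a relation over (A, B, C) is a Python set of triples; the helper names and both sample instances are invented for illustration. The first instance satisfies A → B and is recovered exactly by the join, while the second does not and picks up spurious tuples.

```python
def project(rows, positions):
    """Projection onto the given column positions (duplicates collapse)."""
    return {tuple(row[i] for i in positions) for row in rows}

def join_on_a(ab, ac):
    """Natural join of an (A, B) relation with an (A, C) relation on A."""
    return {(a, b, c) for (a, b) in ab for (a2, c) in ac if a == a2}

def rejoin(rows):
    """pi_{A,B}(R) joined with pi_{A,C}(R)."""
    return join_on_a(project(rows, (0, 1)), project(rows, (0, 2)))

R_fd = {(1, "x", "p"), (1, "x", "q"), (2, "y", "p")}   # A -> B holds
R_no = {(1, "x", "p"), (1, "y", "q")}                  # A -> B fails

print(rejoin(R_fd) == R_fd)
print(sorted(rejoin(R_no) - R_no))
```

In both cases the original relation is contained in the join, as the first half of the proof shows; only the FD guarantees the reverse inclusion.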


    Closure Example

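    $R(A, B, C, D, E, F)$ with

    $A, B \to C$
    $B, C \to D$
    $D \to E$
    $C, F \to B$

    What is the closure of $\{A, B\}$?

    $\{A, B\} \overset{A,B \to C}{\Longrightarrow} \{A, B, C\} \overset{B,C \to D}{\Longrightarrow} \{A, B, C, D\} \overset{D \to E}{\Longrightarrow} \{A, B, C, D, E\}$

    So $\{A, B\}^+ = \{A, B, C, D, E\}$ and $A, B \to C, D, E$.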


    Lecture 08 : Normal Forms

    Outline

    First Normal Form (1NF)

    Second Normal Form (2NF)

    3NF and BCNF

    Multi-valued dependencies (MVDs)

    Fourth Normal Form


    The Plan

    Given a relational schema $R(X)$ with FDs $F$:

    Reason about FDs
      Is $F$ missing FDs that are logically implied by those in $F$?

    Decompose each $R(X)$ into smaller $R_1(X_1), R_2(X_2), \ldots, R_k(X_k)$, where each $R_i(X_i)$ is in the desired Normal Form.

    Are some decompositions better than others?


    Desired properties of any decomposition

    Lossless-join decomposition

    A decomposition of schema $R(X)$ to $S(Y \cup Z)$ and $T(Y \cup (X - Z))$ is a lossless-join decomposition if for every database instance we have $R = S \bowtie T$.

    Dependency preserving decomposition

    A decomposition of schema $R(X)$ to $S(Y \cup Z)$ and $T(Y \cup (X - Z))$ is dependency preserving, if enforcing FDs on $S$ and $T$ individually has the same effect as enforcing all FDs on $S \bowtie T$.

    We will see that it is not always possible to achieve both of these goals.


    First Normal Form (1NF)

    We will assume every schema is in 1NF.

    1NF

    A schema $R(A_1 : S_1, A_2 : S_2, \ldots, A_n : S_n)$ is in First Normal Form (1NF) if the domains $S_i$ are elementary: their values are atomic.

    name
    Timothy George Griffin

    =>

    first_name  middle_name  last_name
    Timothy     George       Griffin


    Second Normal Form (2NF)

    Second Normal Form (2NF)

    A relational schema $R$ is in 2NF if for every functional dependency $X \to A$ either

    $A \in X$, or

    $X$ is a superkey for $R$, or

    $A$ is a member of some key, or

    $X$ is not a proper subset of any key.


    3NF and BCNF

    Third Normal Form (3NF)

    A relational schema $R$ is in 3NF if for every functional dependency $X \to A$ either

    $A \in X$, or

    $X$ is a superkey for $R$, or

    $A$ is a member of some key.

    Boyce-Codd Normal Form (BCNF)

    A relational schema $R$ is in BCNF if for every functional dependency $X \to A$ either

    $A \in X$, or

    $X$ is a superkey for $R$.

    Is something missing?


    Another look at Heath's Rule

    Given $R(Z, W, Y)$ with FDs $F$

    If $Z \to W \in F^+$, then

    $R = \pi_{Z,W}(R) \bowtie \pi_{Z,Y}(R)$

    What about an implication in the other direction? That is, suppose we have

    $R = \pi_{Z,W}(R) \bowtie \pi_{Z,Y}(R)$.

    Q Can we conclude anything about FDs on $R$? In particular, is it true that $Z \to W$ holds?

    A No!


    We just need one counter example ...

    $R = \pi_{A,B}(R) \bowtie \pi_{A,C}(R)$

    $R$
    A  B   C
    a  b1  c1
    a  b2  c2
    a  b1  c2
    a  b2  c1

    $\pi_{A,B}(R)$
    A  B
    a  b1
    a  b2

    $\pi_{A,C}(R)$
    A  C
    a  c1
    a  c2

    Clearly $A \to B$ is not an FD of $R$.


    A concrete example

    course_name  lecturer  text
    Databases    Tim       Ullman and Widom
    Databases    Fatima    Date
    Databases    Tim       Date
    Databases    Fatima    Ullman and Widom

    Assuming that texts and lecturers are assigned to courses independently, then a better representation would be in two tables:

    course_name  lecturer
    Databases    Tim
    Databases    Fatima

    course_name  text
    Databases    Ullman and Widom
    Databases    Date


    Time for a definition! MVDs

    Multivalued Dependencies (MVDs)

    Let $R(Z, W, Y)$ be a relational schema. A multivalued dependency, denoted $Z \twoheadrightarrow W$, holds if whenever $t$ and $u$ are two records that agree on the attributes of $Z$, then there must be some tuple $v$ such that

    1 $v$ agrees with both $t$ and $u$ on the attributes of $Z$,
    2 $v$ agrees with $t$ on the attributes of $W$,
    3 $v$ agrees with $u$ on the attributes of $Y$.
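The three conditions say exactly which tuple v must exist for each pair (t, u), so the definition can be checked on an instance by building that tuple and looking it up. The helper `satisfies_mvd` below is a hypothetical sketch that uses the course, lecturer and text rows from the concrete example in these slides.

```python
def satisfies_mvd(rows, z, w, y):
    """Check Z ->> W on rows (list of dicts); z, w, y partition the attributes."""
    present = {tuple(sorted(r.items())) for r in rows}
    for t in rows:
        for u in rows:
            if all(t[a] == u[a] for a in z):
                # The required v: Z and W values from t, Y values from u.
                v = {a: t[a] for a in z + w}
                v.update({a: u[a] for a in y})
                if tuple(sorted(v.items())) not in present:
                    return False
    return True

COURSES = [
    {"course_name": "Databases", "lecturer": "Tim", "text": "Ullman and Widom"},
    {"course_name": "Databases", "lecturer": "Fatima", "text": "Date"},
    {"course_name": "Databases", "lecturer": "Tim", "text": "Date"},
    {"course_name": "Databases", "lecturer": "Fatima", "text": "Ullman and Widom"},
]

full = satisfies_mvd(COURSES, ["course_name"], ["lecturer"], ["text"])
# Dropping one row breaks the "every combination" pattern.
partial = satisfies_mvd(COURSES[:3], ["course_name"], ["lecturer"], ["text"])
print(full, partial)
```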


    A few observations

    Note 1

    Every functional dependency is a multivalued dependency,

    $(Z \to W) \implies (Z \twoheadrightarrow W)$.

    To see this, just let $v = u$ in the above definition.

    Note 2

    Let $R(Z, W, Y)$ be a relational schema, then

    $(Z \twoheadrightarrow W) \iff (Z \twoheadrightarrow Y)$,

    by symmetry of the definition.


    MVDs and lossless-join decompositions

    Fun Fun Fact

    Let $R(Z, W, Y)$ be a relational schema. The decomposition $R_1(Z, W)$, $R_2(Z, Y)$ is a lossless-join decomposition of $R$ if and only if the MVD $Z \twoheadrightarrow W$ holds.


    Proof of Fun Fun Fact

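    Proof of $(Z \twoheadrightarrow W) \implies R = \pi_{Z,W}(R) \bowtie \pi_{Z,Y}(R)$

    Suppose $Z \twoheadrightarrow W$.

    We know (from proof of Heath's rule) that $R \subseteq \pi_{Z,W}(R) \bowtie \pi_{Z,Y}(R)$. So we only need to show $\pi_{Z,W}(R) \bowtie \pi_{Z,Y}(R) \subseteq R$.

    Suppose $r \in \pi_{Z,W}(R) \bowtie \pi_{Z,Y}(R)$.

    So there must be a $t \in R$ and $u \in R$ with $\{r\} = \pi_{Z,W}(\{t\}) \bowtie \pi_{Z,Y}(\{u\})$.

    In other words, there must be a $t \in R$ and $u \in R$ with $t.Z = u.Z$.

    So the MVD tells us that then there must be some tuple $v \in R$ such that

    1 $v$ agrees with both $t$ and $u$ on the attributes of $Z$,
    2 $v$ agrees with $t$ on the attributes of $W$,
    3 $v$ agrees with $u$ on the attributes of $Y$.

    This $v$ must be the same as $r$, so $r \in R$.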


    Proof of Fun Fun Fact (cont.)

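    Proof of $R = \pi_{Z,W}(R) \bowtie \pi_{Z,Y}(R) \implies (Z \twoheadrightarrow W)$

    Suppose $R = \pi_{Z,W}(R) \bowtie \pi_{Z,Y}(R)$.

    Let $t$ and $u$ be any records in $R$ with $t.Z = u.Z$.

    Let $v$ be defined by $\{v\} = \pi_{Z,W}(\{t\}) \bowtie \pi_{Z,Y}(\{u\})$ (and we know $v \in R$ by the assumption).

    Note that by construction we have

    1 $v.Z = t.Z = u.Z$,
    2 $v.W = t.W$,
    3 $v.Y = u.Y$.

    Therefore, $Z \twoheadrightarrow W$ holds.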


    Fourth Normal Form

    Trivial MVD

    The MVD $Z \twoheadrightarrow W$ is trivial for relational schema $R(Z, W, Y)$ if

    1 $Z \cap W \neq \{\}$, or
    2 $Y = \{\}$.

    4NF

    A relational schema $R(Z, W, Y)$ is in 4NF if for every MVD $Z \twoheadrightarrow W$ either

    $Z \twoheadrightarrow W$ is a trivial MVD, or

    $Z$ is a superkey for $R$.

    Note : 4NF $\subset$ BCNF $\subset$ 3NF $\subset$ 2NF


    Summary

    We always want the lossless-join property. What are our options?

                               3NF    BCNF   4NF
    Preserves FDs              Yes    Maybe  Maybe
    Preserves MVDs             Maybe  Maybe  Maybe
    Eliminates FD-redundancy   Maybe  Yes    Yes
    Eliminates MVD-redundancy  No     No     Yes


    Inclusions

    Clearly BCNF $\subseteq$ 3NF $\subseteq$ 2NF. These are proper inclusions:

    In 2NF, but not 3NF

    $R(A, B, C)$, with $F = \{A \to B,\ B \to C\}$.

    In 3NF, but not BCNF

    $R(A, B, C)$, with $F = \{A, B \to C,\ C \to B\}$.

    This is in 3NF since $AB$ and $AC$ are keys, so there are no non-prime attributes.

    But not in BCNF since $C$ is not a key and we have $C \to B$.
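Both examples can be checked by brute force: enumerate every non-trivial FD X → A implied by F and test it against the conditions of the chosen normal form as stated in these slides. The function names and the `(lhs, rhs)` set-pair encoding are assumptions for this sketch, which is exponential in the number of attributes.

```python
from itertools import combinations

def closure(fds, attrs):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def subsets(attrs):
    """All non-empty subsets of attrs, smallest first."""
    items = sorted(attrs)
    for n in range(1, len(items) + 1):
        for combo in combinations(items, n):
            yield frozenset(combo)

def candidate_keys(fds, attrs):
    keys = []
    for x in subsets(attrs):
        if closure(fds, x) == set(attrs) and not any(k < x for k in keys):
            keys.append(x)
    return keys

def violations(fds, attrs, form):
    """Non-trivial X -> A in F+ that break form ('2NF', '3NF' or 'BCNF')."""
    attrs = set(attrs)
    keys = candidate_keys(fds, attrs)
    prime = set().union(*keys)
    bad = []
    for x in subsets(attrs):
        x_plus = closure(fds, x)
        for a in sorted(x_plus - x):               # A not in X
            if x_plus == attrs:                    # X is a superkey
                continue
            if form in ("2NF", "3NF") and a in prime:
                continue                           # A is a member of some key
            if form == "2NF" and not any(x < k for k in keys):
                continue                           # X not a proper subset of a key
            bad.append((sorted(x), a))
    return bad

F1 = [({"A"}, {"B"}), ({"B"}, {"C"})]
F2 = [({"A", "B"}, {"C"}), ({"C"}, {"B"})]
print(violations(F1, "ABC", "3NF"), violations(F2, "ABC", "BCNF"))
```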


    Schema refinement III and advanced design

    Outline

    General Decomposition Method (GDM)

    The lossless-join condition is guaranteed by GDM

    The GDM does not always preserve dependencies!

    FDs vs ER models?

    Weak entities

    Using FDs and MVDs to refine ER models

    Another look at ternary relationships


    General Decomposition Method (GDM)

    GDM
    1 Understand your FDs $F$ (compute $F^+$),
    2 find $R(X) = R(Z, W, Y)$ (sets $Z$, $W$ and $Y$ are disjoint) with FD $Z \to W \in F^+$ violating a condition of desired NF,
    3 split $R$ into two tables $R_1(Z, W)$ and $R_2(Z, Y)$
    4 wash, rinse, repeat

    Reminder

    For $Z \to W$, if we assume $Z \cap W = \{\}$, then the conditions are

    1 $Z$ is a superkey for $R$ (2NF, 3NF, BCNF)
    2 $W$ is a subset of some key (2NF, 3NF)
    3 $Z$ is not a proper subset of any key (2NF)
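For BCNF the four steps can be sketched as a short recursive function. This is one possible reading rather than a definitive implementation: FDs are `(lhs, rhs)` set pairs, violations are searched smallest left-hand side first, and W is taken to be everything Z determines inside the current table. A different choice of violation can lead to a different (equally valid) decomposition.

```python
from itertools import combinations

def closure(fds, attrs):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def find_bcnf_violation(fds, schema):
    """Return (Z, W) with Z -> W non-trivial on schema and Z not a superkey."""
    schema = set(schema)
    items = sorted(schema)
    for n in range(1, len(items)):
        for combo in combinations(items, n):
            z = set(combo)
            z_plus = closure(fds, z)
            w = (z_plus & schema) - z        # what Z determines inside schema
            if w and not schema <= z_plus:   # non-trivial and Z is no superkey
                return z, w
    return None

def decompose_bcnf(fds, schema):
    """Split schema until no table has a BCNF violation (GDM steps 2 to 4)."""
    schema = set(schema)
    violation = find_bcnf_violation(fds, schema)
    if violation is None:
        return [schema]
    z, w = violation
    r1 = z | w                 # R1(Z, W)
    r2 = schema - w            # R2(Z, Y)
    return decompose_bcnf(fds, r1) + decompose_bcnf(fds, r2)

F = [({"A", "B"}, {"C"}), ({"C"}, {"D"}), ({"D"}, {"A"})]
parts = decompose_bcnf(F, "ABCD")
print([sorted(p) for p in parts])
```

Each split produces two strictly smaller tables, so the recursion stops; whether the FDs of F can still be enforced on the resulting tables is a separate question.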


    The lossless-join condition is guaranteed by GDM

    This method will produce a lossless-join decomposition because of (repeated applications of) Heath's Rule! That is, each time we replace an $S$ by $S_1$ and $S_2$, we will always be able to recover $S$ as $S_1 \bowtie S_2$.

    Note that in GDM step 3, the FD $Z \to W$ may represent a key constraint for $R_1$.

    But does the method always terminate? Please think about this ....


    General Decomposition Method Revisited

    GDM++
    1 Understand your FDs and MVDs $F$ (compute $F^+$),
    2 find $R(X) = R(Z, W, Y)$ (sets $Z$, $W$ and $Y$ are disjoint) with either FD $Z \to W \in F^+$ or MVD $Z \twoheadrightarrow W \in F^+$ violating a condition of desired NF,
    3 split $R$ into two tables $R_1(Z, W)$ and $R_2(Z, Y)$
    4 wash, rinse, repeat


    Return to Example: Decompose to BCNF

    $R(A, B, C, D)$

    $F = \{A, B \to C,\ C \to D,\ D \to A\}$

    Which FDs in $F^+$ violate BCNF?

    $C \to A$
    $C \to D$
    $D \to A$
    $A, C \to D$
    $C, D \to A$


    Return to Example: Decompose to BCNF

    Decompose $R(A, B, C, D)$ to BCNF

    Use $C \to D$ to obtain

    $R_1(C, D)$. This is in BCNF. Done.

    $R_2(A, B, C)$. This is not in BCNF. Why? $A, B$ and $B, C$ are the only keys, and $C \to A$ is a FD for $R_2$. So use $C \to A$ to obtain
      $R_{2.1}(A, C)$. This is in BCNF. Done.
      $R_{2.2}(B, C)$. This is in BCNF. Done.

    Exercise : Try starting with any of the other BCNF violations and see where you end up.


    The GDM does not always preserve dependencies!

    $R(A, B, C, D, E)$

    $A, B \to C$
    $D, E \to C$
    $B \to D$

    $\{A, B\}^+ = \{A, B, C, D\}$,

    so $A, B \to C, D$,

    and $\{A, B, E\}$ is a key.

    $\{B, E\}^+ = \{B, C, D, E\}$,

    so $B, E \to C, D$,

    and $\{A, B, E\}$ is a key (again)

    Let's try for a BCNF decomposition ...


    Decomposition 1

    Decompose $R(A, B, C, D, E)$ using $A, B \to C, D$:

    $R_1(A, B, C, D)$. Decompose this using $B \to D$:
      $R_{1.1}(B, D)$. Done.
      $R_{1.2}(A, B, C)$. Done.

    $R_2(A, B, E)$. Done.

    But in this decomposition, how will we enforce this dependency?

    $D, E \to C$


    Decomposition 2

    Decompose $R(A, B, C, D, E)$ using $B, E \to C, D$:

    $R_3(B, C, D, E)$. Decompose this using $D, E \to C$:
      $R_{3.1}(C, D, E)$. Done.
      $R_{3.2}(B, D, E)$. Decompose this using $B \to D$:
        $R_{3.2.1}(B, D)$. Done.
        $R_{3.2.2}(B, E)$. Done.

    $R_4(A, B, E)$. Done.

    But in this decomposition, how will we enforce this dependency?

    $A, B \to C$


    Summary

    It is always possible to obtain BCNF that has the lossless-join property (using GDM).
      But the result may not preserve all dependencies.

    It is always possible to obtain 3NF that preserves dependencies and has the lossless-join property.
      Using methods based on minimal covers (for example, see EN2000).


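    Recall : a small change of scope ...

    ... changed this entity

    [ER diagram: entity Movie with attributes Title, Year, MovieID]

    into two entities and a relationship :

    [ER diagram: entity Movie (attributes Title, MovieID), relationship Released, entity MovieRelease (attributes Country, Date; Date with components Year, Month, Day)]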

    But is there something odd about the MovieRelease entity?


    MovieRelease represents a Weak entity set

    [ER diagram: entity Movie (attributes Title, MovieID), relationship Released, entity MovieRelease (attributes Country, Date; Date with components Year, Month, Day)]

    Definition

    Weak entity sets do not have a primary key.

    The existence of a weak entity depends on an identifying entity set through an identifying relationship.

    The primary key of the identifying entity together with the weak entity's discriminators (dashed underline in diagram) identify each weak entity element.


    Can FDs help us think about implementation?

    R(I, T, D, C)        I → T

    I = MovieID
    T = Title
    D = Date
    C = Country

    Turn the decomposition crank to obtain

    R1(I, T)    R2(I, D, C)

    π_I(R2) ⊆ π_I(R1)
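
    One way to read the inclusion constraint between R2 and R1 is as a foreign key. The sketch below uses SQLite through Python's sqlite3 module. The column types, the choice of (I, D, C) as the key of R2, and the sample rows are assumptions made for illustration, not part of the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign-key enforcement off unless it is requested.
conn.execute("PRAGMA foreign_keys = ON")

# R1(I, T): I -> T holds because I is the primary key.
conn.execute("CREATE TABLE R1 (I INTEGER PRIMARY KEY, T TEXT NOT NULL)")

# R2(I, D, C): the weak entity set. Assumption: the whole row is its key.
conn.execute(
    """CREATE TABLE R2 (
           I INTEGER NOT NULL REFERENCES R1(I),
           D TEXT NOT NULL,
           C TEXT NOT NULL,
           PRIMARY KEY (I, D, C)
       )"""
)

# Hypothetical sample data.
conn.execute("INSERT INTO R1 VALUES (1, 'Example Movie')")
conn.execute("INSERT INTO R2 VALUES (1, '2000-01-01', 'UK')")

try:
    # Movie 99 does not exist in R1, so this release would have no owner.
    conn.execute("INSERT INTO R2 VALUES (99, '2000-01-01', 'UK')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print("orphan release rejected:", orphan_rejected)
```

    The foreign key plays the role of the identifying relationship: a row of R2 can only exist while the movie it refers to exists in R1.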


    Movie Ratings example

    Scope = UK

    Title                                         Year   Rating
    Austin Powers: International Man of Mystery   1997   15
    Austin Powers: The Spy Who Shagged Me         1999   12
    Dude, Where's My Car?                         2000   15

    Scope = Earth

    Title                                         Year   Country    Rating
    Austin Powers: International Man of Mystery   1997   UK         15
    Austin Powers: International Man of Mystery   1997   Malaysia   18SX
    Austin Powers: International Man of Mystery   1997   Portugal   M/12
    Austin Powers: International Man of Mystery   1997   USA        PG-13
    Austin Powers: The Spy Who Shagged Me         1999   UK         12
    Austin Powers: The Spy Who Shagged Me         1999   Portugal   M/12
    Austin Powers: The Spy Who Shagged Me         1999   USA        PG-13
    Dude, Where's My Car?                         2000   UK         15
    Dude, Where's My Car?                         2000   USA        PG-13
    Dude, Where's My Car?                         2000   Malaysia   18PL
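
    The change of scope changes which functional dependencies hold. The following sketch tests an FD against a set of rows; the helper name fd_holds is hypothetical, and the rows are three of the entries from the table above.

```python
def fd_holds(rows, lhs, rhs):
    """True if no two rows agree on the lhs columns but differ on the rhs columns."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = tuple(row[c] for c in rhs)
        # setdefault stores the first value seen for this key and returns it.
        if seen.setdefault(key, value) != value:
            return False
    return True


AP1 = "Austin Powers: International Man of Mystery"
earth = [
    {"Title": AP1, "Year": 1997, "Country": "UK", "Rating": "15"},
    {"Title": AP1, "Year": 1997, "Country": "Malaysia", "Rating": "18SX"},
    {"Title": AP1, "Year": 1997, "Country": "USA", "Rating": "PG-13"},
]
uk = [row for row in earth if row["Country"] == "UK"]

print(fd_holds(uk, ["Title", "Year"], ["Rating"]))                  # scope = UK
print(fd_holds(earth, ["Title", "Year"], ["Rating"]))               # scope = Earth
print(fd_holds(earth, ["Title", "Year", "Country"], ["Rating"]))    # Country added
```

    On this data, Title, Year → Rating holds for the UK rows but fails once other countries are included, while Title, Year, Country → Rating holds. A check on sample data can only refute a dependency, never prove that it holds in general.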


    Oh, but the real world is such a bother!

    from IMDb raw data file certificates.list

    2 Fast 2 Furious (2003) Switzerland:14 (canton of Vaud)

    2 Fast 2 Furious (2003) Switzerland:16 (canton of Zurich)

    28 Days (2000) Canada:13+ (Quebec)

    28 Days (2000) Canada:14 (Nova Scotia)

    28 Days (2000) Canada:14A (Alberta)

    28 Days (2000) Canada:AA (Ontario)

    28 Days (2000) Canada:PA (Manitoba)

    28 Days (2000) Canada:PG (British Columbia)


    Ternary or multiple binary relationships?

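    [Diagram: a ternary relationship R among entities S, T and U, set against an alternative in which a new entity E is linked to S, U and T by binary relationships R1, R2 and R3]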


    Ternary or multiple binary relationships?

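    [Diagram: the ternary relationship R among entities S, T and U, set against an alternative that uses only two binary relationships, R1 and R2]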


    Look again at ER Demo Diagram (2)

    How might this be refined using FDs or MVDs?

    [ER diagram. Entities: Employee, Mechanic, Salesman, RepairJob, Car, Client. Relationships: ISA, Does, Repairs, Buys, Sells. Roles: buyer, seller. Attribute labels: Name, Number (Employee); Number, Description, Cost, Parts, Work (RepairJob); License, Model, Year, Manufacturer (Car); Price, Date, Value (Buys); Date, Value, Commission (Sells); ID, Name, Phone, Address (Client)]

    (2) By Pável Calado, http://www.texample.net/tikz/examples/entity-relationship-diagram


    Lecture 10 : On-line Analytical Processing (OLAP)

    Outline
      Limits of SQL aggregation
      OLAP : Online Analytic Processing
      Data cubes
      Star schema


    Limits of SQL aggregation

    Flat tables are great for processing, but hard for people to read
    and understand.

    Pivot tables and cross tabulations (spreadsheet terminology) are
    very useful for presenting data in ways that people can
    understand.

    SQL does not handle pivot tables and cross tabulations well.
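
    To illustrate the last point, here is a small sketch using SQLite through Python's sqlite3 module. The sales table and its figures are hypothetical. The flat GROUP BY result is easy to compute, but the cross tabulation needs one hand-written CASE expression per column value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (model TEXT, year INTEGER, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("Chevy", 1994, 50), ("Chevy", 1995, 85), ("Ford", 1994, 60), ("Ford", 1995, 75)],
)

# Flat result: one row per (model, year) combination.
flat = conn.execute(
    "SELECT model, year, SUM(units) FROM sales "
    "GROUP BY model, year ORDER BY model, year"
).fetchall()

# Cross tabulation: years become columns, each spelled out by hand.
pivot = conn.execute(
    """SELECT model,
              SUM(CASE WHEN year = 1994 THEN units ELSE 0 END) AS y1994,
              SUM(CASE WHEN year = 1995 THEN units ELSE 0 END) AS y1995,
              SUM(units) AS total
       FROM sales
       GROUP BY model
       ORDER BY model"""
).fetchall()

for row in flat:
    print(row)
for row in pivot:
    print(row)
```

    The pivot query must be rewritten whenever a new year appears in the data, which is one sense in which plain SQL aggregation handles cross tabulations poorly.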


    OLAP vs. OLTP

    OLTP : Online Transaction Processing (traditional databases)
      Data is normalized for the sake of updates.

    OLAP : Online Analytic Processing
      These are (almost) read-only databases.
      Data is de-normalized for the sake of queries!
      Multi-dimensional data cube emerging as common data model.
        This can be seen as a generalization of SQL's group by


    OLAP Databases : Data Models and Design

    The big question

    Are the relational model and its associated query language (SQL) well
    suited for OLAP databases?

    Aggregations (sums, averages, totals, ...) are very common in OLAP queries
      Problem : SQL aggregation quickly runs out of steam.
      Solution : Data Cube and associated operations (spreadsheets on steroids)

    Relational design is obsessed with normalization
      Problem : Need to organize data well since all analysis queries
      cannot be anticipated in advance.
      Solution : Multi-dimensional fact tables, with hierarchy in
      dimensions, star-schema design.
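
    A star-schema design can be sketched as one fact table surrounded by dimension tables. In the example below all table names, column names and rows are hypothetical. The location dimension carries a small hierarchy (city within country), and the query aggregates the facts at the country level.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: descriptive attributes, including hierarchy levels.
    CREATE TABLE location (location_id INTEGER PRIMARY KEY, city TEXT, country TEXT);
    CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, name TEXT, category TEXT);

    -- Fact table: one row per measurement, keyed by the dimensions.
    CREATE TABLE sales_fact (
        location_id INTEGER REFERENCES location(location_id),
        product_id  INTEGER REFERENCES product(product_id),
        amount      INTEGER
    );

    INSERT INTO location VALUES (1, 'Cambridge', 'UK'), (2, 'London', 'UK'), (3, 'Paris', 'France');
    INSERT INTO product  VALUES (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware');
    INSERT INTO sales_fact VALUES (1, 1, 10), (2, 1, 20), (3, 2, 5), (2, 2, 7);
    """
)

# Aggregate at the country level of the location hierarchy.
by_country = conn.execute(
    """SELECT l.country, SUM(f.amount)
       FROM sales_fact f JOIN location l USING (location_id)
       GROUP BY l.country
       ORDER BY l.country"""
).fetchall()

print(by_country)
```

    The dimension table deliberately repeats the country for every city. That redundancy is the kind of de-normalization accepted in OLAP designs for the sake of simpler queries.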


    A very influential paper [G+1997]


    From aggregates to data cubes


    The Data Cube

    Data modeled as an n-dimensional (hyper-) cube

    Each dimension is associated with a hierarchy

    Each point records facts

    Aggregation and cross-tabulation possible along all dimensions
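
    One way to picture aggregation along all dimensions is to run a group-by for every subset of the dimensions and collect the results in one structure. The sketch below does this for a two-dimensional example. The function name data_cube, the placeholder value "ALL" and the sales figures are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def data_cube(rows, dims, measure):
    """Sum the measure for every subset of dims; dropped dimensions show as 'ALL'."""
    cube = defaultdict(int)
    for k in range(len(dims) + 1):
        for kept in combinations(dims, k):
            # One group-by per subset of dimensions that is kept.
            for row in rows:
                cell = tuple(row[d] if d in kept else "ALL" for d in dims)
                cube[cell] += row[measure]
    return dict(cube)


# Hypothetical facts.
rows = [
    {"model": "Chevy", "year": 1994, "units": 50},
    {"model": "Chevy", "year": 1995, "units": 85},
    {"model": "Ford", "year": 1994, "units": 60},
    {"model": "Ford", "year": 1995, "units": 75},
]

cube = data_cube(rows, ["model", "year"], "units")
for cell in sorted(cube, key=str):
    print(cell, cube[cell])
```

    With n dimensions this performs 2^n group-bys, so the sketch only conveys the idea; it is not meant as an efficient way to compute a cube.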


    Hierarchy for Location Dimension


    Cube Operations


    The Star Schema as a design tool


    Lecture 11 : Case Study - Cancer registry for the NHS, Part II

    The extension of ECRIC to cover all of England requires schema
    reconciliation, a problem that has remained unresolved since it was first
    encountered in the 1980s. Jem Rashbass has a long track record in
    NHS IT, and is now CEO of ECRIC. Jem will explain what the NHS
    needs and why - some of the existing challenges and future
    opportunities. The session will close with an open forum in which the
    DBA of the now national-level Cancer Registry DBMS will join Jem.


    Lecture 12 : XML as a data exchange format

    Outline
      HTML vs. XML
      Using XML to solve the data exchange problem
      Domain-specific XML schema
      Native XML databases


    HTML vs XML

    HTML

    HTML = Content + (fixed) Schema + (fixed) presentation

    Untangle these and generalize to

    XML

    XML = Content
    XSL = defines presentations
    DTD or XSchema = defines schema

    HTML : Hypertext Markup Language
    XML : eXtensible Markup Language
    XSL : Extensible Stylesheet Language (similar to CSS)
    CSS : Cascading Style Sheets
    DTD : Document Type Definition


    XML data is semi-structured Unicode text

    Body of text, and possibly nested tags.

    An XML schema defines
      tag names
      which associated values are optional or required
      types of associated values
      type of the associated body
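
    The kind of rule a schema expresses can be imitated by hand. The sketch below parses a small document with Python's xml.etree.ElementTree and checks three hypothetical rules: every movie needs an id attribute, a title is required, and a year is optional but must be an integer when present. This is not DTD or XML Schema validation, which ElementTree does not perform; the document, tag names and rules are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with nested tags.
doc = """<movies>
  <movie id="m1"><title>28 Days</title><year>2000</year></movie>
  <movie id="m2"><title>2 Fast 2 Furious</title></movie>
  <movie id="m3"><year>soon</year></movie>
</movies>"""

REQUIRED_CHILDREN = ("title",)  # hypothetical rule: every movie needs a title


def check_movie(movie):
    """Return a list of problems found in one <movie> element."""
    problems = []
    if "id" not in movie.attrib:
        problems.append("missing id attribute")
    for tag in REQUIRED_CHILDREN:
        if movie.find(tag) is None:
            problems.append("missing <%s>" % tag)
    year = movie.findtext("year")  # None when the optional tag is absent
    if year is not None and not year.isdigit():
        problems.append("year is not an integer")
    return problems


root = ET.fromstring(doc)
report = {m.get("id"): check_movie(m) for m in root.findall("movie")}
for movie_id, problems in report.items():
    print(movie_id, problems)
```

    A real schema language states such rules declaratively, so that any validating parser can enforce them without custom code.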

    What would Churchill say?
    XML is the worst form of data representation, except for all those other
    forms that have been tried from time to time.


    The data exchange problem


    XML as a data exchange standard

    Domain-specific schemas can become standards.

    There are now thousands of domain-specific schemas

    WML: Wireless markup language (WAP)
    OFX: Open financial exchange
    CML: Chemical markup language
    AML: Astronomical markup language
    MathML: Mathematics markup language
    SMIL: Synchronized multimedia integration language
    ThML: Theological markup language
    .....

    The public XML schema is in many ways dual to the many
    private SQL schemas involved in data exchange.


    Two basic kinds of XML databases (hybrids possible)

    XML-enabled databases            Native XML databases
    Relational (XML for exchange)    Direct storage of XML data
    Data-centric                     Document-centric
    SQL                              XPath and XQuery
    http://www.mysql.com/            http://basex.org
                                     http://exist.sourceforge.net
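
    The contrast between the two query styles can be sketched with Python's standard library alone. The same two movies are stored once as rows in SQLite and once as an XML document, and the same question is asked in SQL and in the limited XPath subset that xml.etree.ElementTree supports. The table layout and the XML structure are assumptions; a native XML database would offer full XPath and XQuery.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Relational, data-centric: rows in a table, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO movie VALUES (?, ?)",
    [("28 Days", 2000), ("2 Fast 2 Furious", 2003)],
)
sql_titles = [t for (t,) in conn.execute("SELECT title FROM movie WHERE year = 2000")]

# Document-centric: the XML is kept as it is and queried with a path expression.
root = ET.fromstring(
    "<movies>"
    '<movie year="2000"><title>28 Days</title></movie>'
    '<movie year="2003"><title>2 Fast 2 Furious</title></movie>'
    "</movies>"
)
xpath_titles = [e.text for e in root.findall("./movie[@year='2000']/title")]

print(sql_titles)
print(xpath_titles)
```

    Both queries select the title of the movie from 2000. The SQL version depends on a fixed table schema, while the path expression navigates the nesting of the document itself.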


    The End

    (http://xkcd.com/327)


