Post on 31-Mar-2015
Relative Information Capacity of Simple Relational Database Schemata
Paper by: Richard Hull
Presented by: Jose Picado
Outline
• Problem: Data relativism and information capacity
– Definition
– Examples
– Importance
• Hierarchy of dominance measures
• Basic results
• Discussion
Data relativism
• Represent the same data in different ways
• Represent the same data under different schemas
Schema 1: Person(name, sex, spouse)
Schema 2: Female(name), Male(name), Marriage(husband, wife)
Example taken from: Kosky, Anthony. Transforming Databases with Recursive Data Structures, 1996.
Relative information capacity
• Expressiveness of a schema
• Different schemas representing the same data may have different information capacity
Schema 1: Person(name, sex, spouse)
Schema 2: Female(name), Male(name), Marriage(husband, wife)
Schema 1:
• Does not require that the spouse attribute of a man refers to a woman.
• Does not require that for each spouse attribute in one direction there is a corresponding spouse attribute in the other direction.
Schema 2:
• Allows unmarried people to be represented in the database.
Relative information capacity
• Possible solution:
– Transform existing schema to new schema by structural manipulations
– Information capacity preserving?
Existing schema: Person(name, sex, spouse)
New schema (after transformation): Female(name), Male(name), Marriage(husband, wife)
Importance
• Schema evolution
– None of the information stored in the initial database is lost
Person(name, sex, spouse)
Female(name), Male(name), Marriage(husband, wife)
Importance
• Data integration
– All information in one of the component databases is reflected in the integrated database
Component schemas:
City(name, state), State(name, capital)
City(name, isCapital, country), Country(name, language, currency)
Integrated schema:
City(name, place), Country(name, language, currency, capital), State(name, capital)
Example taken from: Kosky, Anthony. Transforming Databases with Recursive Data Structures, 1996.
Importance
• Database normalization theory
• User view construction
• Schema simplification
• Translation between data models
Hull’s paper
• Introduces theoretical tools for studying measures of relative information capacity
– Theoretical frameworks at the time were complex
– There was no clear definition of the concept
– Hull introduced nice ways of comparing schemata and their information capacity
• Defines a hierarchy of measures to compare information capacity of schemata
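Hull’s paper
• Gives some basic results concerning the previous measures
• Considers only non-keyed relations
Relations: Person(id, name), shown once as non-keyed and once as keyed
Instances: (123, John), (123, Mary)
– Non-keyed: both tuples may appear together
– Keyed: if id is the key, the two tuples cannot appear together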
Definitions
• Schema P is a set of relations
• Relations are composed of attributes, which may be of different basic types
• Basic types are domain designators (have a fixed domain of possible values)
• I(P) is the set of instances of P, usually infinite
Schema P: Person(id, name)
Instances I(P): {(111, John), (222, Mary)}, {(123, Anne), (234, Joe)}, {(aaa, Jack), (bbb, Ted)}, …
Transformation
• P and Q are relational schemata
• A transformation from P to Q is a map σ: I(P) → I(Q)
P: Person(id, name), Birth(id, date)
Q: PersonInfo(id, name, bdate)
PersonInfo(x,y,z) :- Person(x,y), Birth(x,z).
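The Datalog rule above is a join of Person and Birth on the shared id. A minimal Python sketch of the same map follows; the sample tuples, the dates, and the function name to_person_info are illustrative assumptions, not taken from the slides.

```python
# Hypothetical instance of P: Person(id, name) and Birth(id, date).
person = {(111, "John"), (222, "Mary")}
birth = {(111, "1980-01-01"), (222, "1985-06-15")}


def to_person_info(person, birth):
    """PersonInfo(x,y,z) :- Person(x,y), Birth(x,z)."""
    return {
        (x, y, z)
        for (x, y) in person
        for (x2, z) in birth
        if x == x2  # join condition: both atoms share the variable x
    }


person_info = to_person_info(person, birth)
print(sorted(person_info))
```

A person with no Birth tuple would be dropped by this join, so whether such a map loses information is exactly what the dominance definition is meant to capture.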
Dominance
• P and Q are relational schemata
• Q dominates P via (σ, τ), where σ: I(P) → I(Q) and τ: I(Q) → I(P), if the composition of σ followed by τ is the identity on I(P)
Dominance
P: Person(name, sex, spouse)
Q: Female(name), Male(name), Marriage(husband, wife)
Dominance
1. Take instances of P: I(P)
Person
name  sex     spouse
John  male    Mary
Mary  female  John
Anne  female  Joe
Joe   male    Anne
Dominance
2. Apply σ, the transformation from P to Q, to I(P)
Male(x) :- Person(x,y,z), y="male".
Female(x) :- Person(x,y,z), y="female".
Marriage(x,y) :- Person(x,u,y), Person(y,v,x), u="male", v="female".
σ(I(P)):
Male: John, Joe
Female: Mary, Anne
Marriage: (John, Mary), (Joe, Anne)
Dominance
3. Apply τ, the transformation from Q back to P, to σ(I(P))
Person(x,"male",z) :- Male(x), Marriage(x,z).
Person(x,"female",z) :- Female(x), Marriage(z,x).
σ(I(P)):
Male: John, Joe
Female: Mary, Anne
Marriage: (John, Mary), (Joe, Anne)
τ(σ(I(P))):
Person
name  sex     spouse
John  male    Mary
Mary  female  John
Anne  female  Joe
Joe   male    Anne
Dominance
4. Compare I(P) and τ(σ(I(P)))
I(P):
Person
name  sex     spouse
John  male    Mary
Mary  female  John
Anne  female  Joe
Joe   male    Anne
τ(σ(I(P))):
Person
name  sex     spouse
John  male    Mary
Mary  female  John
Anne  female  Joe
Joe   male    Anne
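The four steps can be replayed in a few lines of Python. This is only a sketch: the names sigma and tau are introduced here for the two transformations, and the rule for women reads Marriage as (husband, wife), so it looks the woman up in the wife column.

```python
def sigma(person):
    """P -> Q: split Person into Male, Female and Marriage."""
    male = {(x,) for (x, sex, _) in person if sex == "male"}
    female = {(x,) for (x, sex, _) in person if sex == "female"}
    marriage = {
        (x, y)
        for (x, u, y) in person
        for (y2, v, x2) in person
        if u == "male" and v == "female" and y2 == y and x2 == x
    }
    return male, female, marriage


def tau(male, female, marriage):
    """Q -> P: rebuild Person from Male, Female and Marriage."""
    husbands = {(x, "male", w) for (x,) in male for (h, w) in marriage if h == x}
    wives = {(x, "female", h) for (x,) in female for (h, w) in marriage if w == x}
    return husbands | wives


instance = {
    ("John", "male", "Mary"),
    ("Mary", "female", "John"),
    ("Anne", "female", "Joe"),
    ("Joe", "male", "Anne"),
}
round_trip = tau(*sigma(instance))
print(round_trip == instance)

# Hypothetical one-directional entry, which Schema 1 permits.
one_sided = {("John", "male", "Mary")}
print(tau(*sigma(one_sided)) == one_sided)
```

The first check succeeds for the instance on the slide. The second instance does not survive the round trip, so a passing check on one instance does not by itself show that the composition is the identity on all of I(P).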
Dominance
• P and Q are relational schemata
• Q dominates P via (σ, τ) if the composition of σ followed by τ is the identity on I(P)
• Q has at least as much capacity for storing information as P
• Information structured according to P can be restructured to “fit” into Q, and restructured again to “fit” back into P
Equivalence
• P and Q are equivalent (xxx) if they have equivalent information capacity under the dominance measure xxx
• P and Q are equivalent (xxx) if
– Q dominates P (xxx) and
– P dominates Q (xxx)
Information dominance measures
1. Calculous dominance (most restrictive)
2. Generic dominance
3. Internal dominance
4. Absolute dominance (least restrictive)
Types of equivalency
1. P and Q are equivalent (calc) (most restrictive)
2. P and Q are equivalent (gen)
3. P and Q are equivalent (int)
4. P and Q are equivalent (abs) (least restrictive)
Level 1: Calculous dominance
• Only allow transformations to be relational calculus expressions
• Relational calculus:
– First order logic or predicate calculus
– Predicates: atoms, and formulas built from them with logical connectives and quantifiers
– Each query Q(x1, …, xn) is a predicate P
• Q dominates P calculously if Q dominates P via (σ, τ) and both σ and τ are relational calculus expressions
Level 2: Generic dominance
• Only allow transformations that treat domain elements as “essentially uninterpreted objects”
• Treat all elements as equals except some set of constants
• Property of all query languages, such as SQL and Datalog
• Q dominates P generically if Q dominates P via (σ, τ) and both σ and τ treat all elements as equals (except some set of constants)
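One way to read "treats all elements as equals" is that the transformation commutes with every renaming (permutation) of the domain values. The sketch below spot-checks that property on a single small instance with no distinguished constants. It is a simplification for illustration, not Hull's formal definition, and all names in it are hypothetical.

```python
from itertools import permutations


def apply_perm(perm, relation):
    """Rename every value in every tuple according to perm."""
    return {tuple(perm[v] for v in t) for t in relation}


def commutes_with_all_permutations(transform, relation, domain):
    """True if transform(perm(d)) == perm(transform(d)) for every permutation."""
    for image in permutations(domain):
        perm = dict(zip(domain, image))
        renamed_first = transform(apply_perm(perm, relation))
        renamed_last = apply_perm(perm, transform(relation))
        if renamed_first != renamed_last:
            return False
    return True


def swap_columns(relation):
    """Only rearranges values, never inspects them."""
    return {(b, a) for (a, b) in relation}


def keep_smallest(relation):
    """Relies on the ordering of values, so it interprets them."""
    return {min(relation)} if relation else set()


domain = ["a", "b", "c"]
relation = {("a", "b"), ("b", "c")}
print(commutes_with_all_permutations(swap_columns, relation, domain))
print(commutes_with_all_permutations(keep_smallest, relation, domain))
```

swap_columns passes the check, while keep_smallest fails it because it depends on how values compare to each other.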
Level 3: Internal dominance
• Only allow transformations that do not invent any data
• Invent data: numerical computations or string manipulations
Example: (player, goals, games) → (player, performance), where performance = goals/games
• Q dominates P internally if Q dominates P via (σ, τ) and neither σ nor τ invents data
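A rough test for "inventing data" is whether the output contains any value that was not already present in the input. The sketch below applies that test to the goals/games example. The sample tuples and function names are hypothetical, and the test allows no extra constants at all, which is a simplification.

```python
def symbols(relation):
    """All values occurring anywhere in the relation."""
    return {v for t in relation for v in t}


def invents_data(transform, relation):
    """True if the output uses a value that the input does not contain."""
    return not symbols(transform(relation)) <= symbols(relation)


def performance(relation):
    """(player, goals, games) -> (player, goals/games): computes new numbers."""
    return {(player, goals / games) for (player, goals, games) in relation}


def drop_games(relation):
    """(player, goals, games) -> (player, goals): only rearranges input values."""
    return {(player, goals) for (player, goals, _) in relation}


# Hypothetical (player, goals, games) tuples.
stats = {("Ann", 6, 4), ("Bob", 3, 3)}
print(invents_data(performance, stats))
print(invents_data(drop_games, stats))
```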
Level 4: Absolute dominance
• Y: some finite set of values
• I_Y(P): the set of instances of P that contain only values in Y
• |I_Y(P)|: the number of instances of P containing only values in Y
• If |I_Y(P)| ≤ |I_Y(Q)| for all sufficiently large finite Y, then Q dominates P absolutely
• Easy to compute: based on counting of instances, instead of transformations
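For non-keyed relations the count is direct: a relation whose attributes can take n1, …, nk values from Y has n1 × … × nk possible tuples, and any subset of them is an instance. A minimal sketch under that assumption follows; the schemas P and Q and the sizes are hypothetical examples, and count_instances is a name introduced here.

```python
from math import prod


def count_instances(schema, domain_sizes):
    """Number of instances of a non-keyed schema using only values in Y.

    schema: list of relations, each given as a tuple of basic-type names.
    domain_sizes: how many values of each basic type Y contains.
    """
    total = 1
    for relation in schema:
        possible_tuples = prod(domain_sizes[t] for t in relation)
        total *= 2 ** possible_tuples  # every subset of tuples is an instance
    return total


# Hypothetical schemas: P has one relation over (A, B), Q one over (A, B, B).
P = [("A", "B")]
Q = [("A", "B", "B")]
sizes = {"A": 2, "B": 3}

p_count = count_instances(P, sizes)
q_count = count_instances(Q, sizes)
print(p_count, q_count, p_count <= q_count)
```

Checking a single Y does not establish absolute dominance, which needs the inequality for all sufficiently large Y, but a failing count for large Y rules it out.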
Basic results
• Q dominates P calculously
⇒ Q dominates P generically
⇒ Q dominates P internally
⇒ Q dominates P absolutely
Basic results
• Sometimes absolute and internal dominance hold, but generic and calculous dominance don’t
Q: relations over (A, A) and (B, B)
P: relation over (A, B)
• Q dominates P (abs, int)
– The transformations σ and τ (int) do not invent data
• Q does not dominate P (gen, calc)
– There is no transformation (gen, calc) that takes instances of P to Q and then back to P
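The absolute-dominance half of this example can be checked by counting. The derivation assumes, as the slide layout suggests, that P has a single relation over (A, B) and Q has two relations over (A, A) and (B, B). The symbols a and b are introduced here for the number of values of type A and type B in Y.

```latex
|I_Y(P)| = 2^{ab}, \qquad
|I_Y(Q)| = 2^{a^2} \cdot 2^{b^2} = 2^{a^2 + b^2}

a^2 + b^2 \;\ge\; 2ab \;\ge\; ab
\quad\Longrightarrow\quad
|I_Y(P)| \le |I_Y(Q)|
```

The first inequality is (a − b)² ≥ 0 rearranged, so the count for Q is never smaller than the count for P.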
Basic results
• Absolute dominance is useful for verifying calculous (non-)dominance
Relations shown for Q and P: [A B], [A C] and [A B C]
• Q dominates P calculously ⇒ Q dominates P absolutely
• P does not dominate Q absolutely ⇒ P does not dominate Q calculously*
*under certain constraints
Basic results
• Dominance is preserved by re-namings of basic types (homomorphism)
– h(P): homomorphism of P
– If Q dominates P then h(Q) dominates h(P), for any measure of dominance (calc, gen, int, abs)
Basic results
• Calculous dominance does not accurately measure the presence of “semantic correspondence”
P:
R1(name, goals, minutes), R2(title, pages, edition): NAME NUMBER NUMBER
S1(name, position, goals), S2(title, publisher, pages): NAME NAME NUMBER
Q:
T: NAME NAME NUMBER NUMBER
• Q dominates P (calc), but there is no semantic mapping from P to Q
Basic results
• For non-keyed relational schemata with only one basic type, all types of dominance are equivalent
Theorem: Let P and Q be non-keyed relational schemata over a single basic type B. Then the following are equivalent:
a. Q dominates P (calc)
b. Q dominates P (gen)
c. Q dominates P (int)
d. Q dominates P (abs)
Basic results
• With any reasonable measure of relative information capacity, two non-keyed relational schemata are equivalent iff they are identical
• In the relational model (non-keyed), there is essentially at most one way to represent a given data set
Discussion
• Strong points:
1. Provides a theory to study relative information capacity
2. Data relativism is important as it arises in many areas
3. Defines a hierarchy of dominance measures
4. Gives important results about the relational model
Discussion
• Weak points:
1. Does not support dependencies/constraints
– Hierarchy of dominance measures
– Basic results
Discussion
• Functional dependency (FD): Given attribute sets X and Y in relation R, the functional dependency X → Y means that all tuples in R that agree on the attributes X must also agree on Y.
id   name  address
123  John  21 Kings St.
234  Mary  31 Kings St.
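The FD condition can be checked mechanically: group tuples by their left-hand-side values and confirm each group has a single right-hand-side value. The sketch below uses the table above and assumes the dependency of interest is id → name, address. The function name and the violating row are hypothetical.

```python
def satisfies_fd(rows, lhs, rhs):
    """True if tuples that agree on lhs also agree on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first rhs value seen for this lhs value
        if seen.setdefault(key, value) != value:
            return False
    return True


people = [
    {"id": 123, "name": "John", "address": "21 Kings St."},
    {"id": 234, "name": "Mary", "address": "31 Kings St."},
]
print(satisfies_fd(people, ["id"], ["name", "address"]))

# Hypothetical extra row that reuses id 123 with a different address.
violating = people + [{"id": 123, "name": "John", "address": "5 Queens Rd."}]
print(satisfies_fd(violating, ["id"], ["name", "address"]))
```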
Discussion
• Multivalued dependency (MVD): For MVD X →→ Y, if two tuples of R agree on all the attributes of X, then their components in Y may be swapped, and the result will be two tuples that are also in the relation.
course                   book                 lecturer
Machine Learning         Pattern Recognition  John
Artificial Intelligence  AIMA                 Mary
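The swap condition for an MVD can be tested by trying every pair of tuples that agree on X. The sketch below assumes the dependency of interest is course →→ book. The function name and the extra row (Book B, Ann) are hypothetical.

```python
def satisfies_mvd(rows, attrs, x, y):
    """True if swapping the y-components of tuples that agree on x stays in the relation."""
    present = {tuple(r[a] for a in attrs) for r in rows}
    for t1 in rows:
        for t2 in rows:
            if all(t1[a] == t2[a] for a in x):
                # t2 with its y-components replaced by those of t1
                swapped = {**t2, **{a: t1[a] for a in y}}
                if tuple(swapped[a] for a in attrs) not in present:
                    return False
    return True


attrs = ["course", "book", "lecturer"]
courses = [
    {"course": "Machine Learning", "book": "Pattern Recognition", "lecturer": "John"},
    {"course": "Artificial Intelligence", "book": "AIMA", "lecturer": "Mary"},
]
print(satisfies_mvd(courses, attrs, ["course"], ["book"]))

# Hypothetical second book and lecturer for the same course, without the swapped tuples.
incomplete = courses + [
    {"course": "Machine Learning", "book": "Book B", "lecturer": "Ann"},
]
print(satisfies_mvd(incomplete, attrs, ["course"], ["book"]))
```

The second relation fails because (Machine Learning, Pattern Recognition, Ann) and (Machine Learning, Book B, John) are missing.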
Discussion
• Inclusion dependency (IND): For R1[X] ⊆ R2[Y], for any tuple t1 in R1, there must exist a tuple t2 in R2 such that t1[X] = t2[Y].
Book
id   title
111  Pattern Recognition
222  AIMA
Order
bookid  customer
111     John
222     Mary
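An inclusion dependency amounts to a subset test on projections. The sketch below assumes the dependency intended by the example is Order[bookid] ⊆ Book[id]. The function name and the dangling order are hypothetical.

```python
def satisfies_ind(r1, x, r2, y):
    """True if every x-projection of r1 appears as a y-projection of r2."""
    targets = {tuple(t[a] for a in y) for t in r2}
    return all(tuple(t[a] for a in x) in targets for t in r1)


book = [
    {"id": 111, "title": "Pattern Recognition"},
    {"id": 222, "title": "AIMA"},
]
order = [
    {"bookid": 111, "customer": "John"},
    {"bookid": 222, "customer": "Mary"},
]
print(satisfies_ind(order, ["bookid"], book, ["id"]))

# Hypothetical order that refers to a book id not present in Book.
dangling = order + [{"bookid": 333, "customer": "Ann"}]
print(satisfies_ind(dangling, ["bookid"], book, ["id"]))
```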
Discussion
• Dependencies change the final result of the paper
Discussion
• Weak points:
1. Does not support dependencies/constraints
– Hierarchy of dominance measures
– Basic results
2. Open questions:
– Does absolute dominance imply internal dominance?
– Does generic dominance imply calculous dominance?
– Is there a measure for “semantic correspondence”?
Thank you
Quiz
• What are the four formal measures of relative information capacity defined by Hull? Write them in order from most restrictive to least restrictive.