
Tsimmis

Date post: 21-Jan-2016
Upload: vinnie
Page 1: Tsimmis

2005 Integration/tsimmis 1

Tsimmis

The Stanford-IBM Manager of Multiple Information Sources

• Overview
• Mediator specification
• A reduction to Datalog
• Using object id’s for information fusion
• Querying and query processing

Page 2: Tsimmis


Overview

• A GAV system: global data defined in terms of sources

• Simple, schema-less (semi-structured) data model: self-describing data, a precursor of XML

• Supports a notion of object identity – used for proper fusion of data from multiple sources

• Relationship between sources (wrappers) and mediator specified by a declarative language – variant of Datalog

• Query execution planning & optimization tailored for integration environment

• Semi-declarative mechanism for wrapper construction

Page 3: Tsimmis


Data model: OEM (Object Exchange Model)

Each piece of data describes itself: no schema, no fixed structure

An object: <o-id, label, type, value>, where:

• label is a description of what this data is

• type – the type of the value; can be atomic or set

• value – the value of the object

• o-id – an identifier that uniquely identifies the object

Example (atomic types):

<&f1, first_name, string, ‘Joe’>

<&l1, last_name, string, ‘Chung’>

<&t1, title, string, ‘Professor’>

<&b1, birth_date, integer, 1976>

Page 4: Tsimmis


Example (with a set type):

<&e1, employee, set, {&f1, &l1, &t1, &rep1}>

<&f1, first_name, string, ‘Joe’>

<&l1, last_name, string, ‘Chung’>

<&t1, title, string, ‘Professor’>

<&rep1, reports_to, string, ‘John Doe’>

A set type is used to represent an object with sub-objects

Here, a record structure

In other cases, the sets may be real sets, with the same label repeating many times.

Note: no order on oid’s in a set (contrast to XML)
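The OEM quadruples above can be sketched in Python. This is only an illustration of the data model, not the TSIMMIS implementation: the class name `OEMObject`, the `store` dictionary, and the helper `children` are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class OEMObject:
    oid: str                            # e.g. "&f1"
    label: str                          # describes what the data is
    type: str                           # "string", "integer", or "set"
    value: Union[str, int, frozenset]   # atomic value, or a frozenset of sub-object oids


# The employee record from the example, keyed by oid (hypothetical storage layout).
store = {o.oid: o for o in [
    OEMObject("&e1", "employee", "set", frozenset({"&f1", "&l1", "&t1", "&rep1"})),
    OEMObject("&f1", "first_name", "string", "Joe"),
    OEMObject("&l1", "last_name", "string", "Chung"),
    OEMObject("&t1", "title", "string", "Professor"),
    OEMObject("&rep1", "reports_to", "string", "John Doe"),
]}


def children(store, oid):
    """Return the (label, value) pairs of a set-valued object's sub-objects, unordered."""
    obj = store[oid]
    if obj.type != "set":
        raise ValueError(f"{oid} is atomic and has no sub-objects")
    return {(store[c].label, store[c].value) for c in obj.value}
```

Using a `frozenset` for set values mirrors the remark that the oid's in a set carry no order.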

Page 5: Tsimmis


On object id’s:

• Usually temporary id’s assigned during query processing

– Used for relating an object to its sub-objects

– Valid only for duration of a query

– Of no interest to the user

• Can be used by mediator writer for specifying data fusion (later)

Notation (for queries, examples):

• Types are usually omitted – can be inferred from the data; hence objects written as triples

• When o-id’s are irrelevant, write < -- , label, value>, or even <label, value>

Page 6: Tsimmis


Interim summary:

• relational data can be exported in this format

• The format allows records that have some common fields, but each may have extra fields (semi-structured data) – like XML

• The lack of any schema seems, in retrospect, a disadvantage

Page 7: Tsimmis


Mediator specification

Each source is assumed to be wrapped by a wrapper that exports data in the OEM format

A mediator specification determines how source data is imported to the mediator and combined with data from other sources.

The language MSL (Mediator Specification Language) is an adaptation of non-recursive Datalog to this data model, and the needs of integration

Page 8: Tsimmis


An example:

Two sources export data (via wrappers) on university people (both related to the CS dept):

• CS: a relational source, with two tables:

employee(first_name, last_name, title, reports_to)

student(first_name, last_name, year)

• Whois : A university facility that contains information about employees and students; usually name, dept are given but fields change between records

Page 9: Tsimmis


Some data from CS:

<&e1, employee, set, {&f1, &l1, &t1, &rep1}>

<&f1, first_name, string, ‘Joe’>

<&l1, last_name, string, ‘Chung’>

<&t1, title, string, ‘Professor’>

<&rep1, reports_to, string, ‘John Doe’>

<&e2, employee, set, {&f2, &l2, &t2}>

<&f2, first_name, string, ‘John’>

<&l2, last_name, string, ‘Doe’>

<&s3, student, set, {&f3, &l3, &y3}>

<&f3, first_name, string, ‘Jack’>

<&l3, last_name, string, ‘Dean’>

<&y3, year, integer, 3>

Page 10: Tsimmis


Some data from whois:

<&p1, person, set, {&n1, &d1, &rel1, &em1}>

<&n1, name, string, ‘Joe Chung’>

<&d1, dept, string, ‘CS’>

<&rel1, relation, string, ‘employee’>

<&em1, email, string, ‘chung@cs’>

<&p2, person, set, {&n2, &d2, &rel2, &y2}>

<&n2, name, string, ‘Nick Cave’>

<&d2, dept, string, ‘CS’>

<&rel2, relation, string, ‘student’>

<&y2, year, integer, 3>

Page 11: Tsimmis


A comparison of the sources:

• Domain mismatch: Different representations for name in the two sources (The resolution of such issues is the responsibility of the mediator)

• Schematic discrepancy: employee, student are relation names in CS, data in whois

• In one source (whois), there is no fixed schema – different objects may have different fields;

but, we would like in some cases to import all data about a person, w/o knowing what data exists

• The sources (e.g. CS) may evolve; we would like the mediator spec to be insensitive to most changes

Page 12: Tsimmis


Specification of a mediator med (by examples)

(MS0) : Show in mediator names & relationship of CS people that exist in both sources:

<cs_person {<name N>, <relation R>}> @med :-

<person {<name N>, <dept ‘CS’>, <relation R>}> @whois ,

<R {<first_name FN>, <last_name LN>}> @CS ,

decompose_name(N, LN, FN)

External:

decompose_name(string, string, string)(b,f,f) name_to_lnfn

decompose_name(string, string, string)(f,b,b) lnfn_to_name

Explanation:

• Capital letters denote variables

• External: declares a (conversion) function implemented in some programming language; the name after the adornment is the implementing function, and in the adornment b means bound (in), f means free (out)

• o-id and type were omitted!
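The External declaration can be read as a dispatch table from binding patterns to implementing functions. The sketch below is one possible reading, not the TSIMMIS code: the function bodies, the `IMPLEMENTATIONS` table, and the assumption that a name has the form "First Last" are all introduced here for illustration.

```python
def name_to_lnfn(name):
    """(b,f,f): the full name is bound; returns (last_name, first_name).

    Assumes names of the form "First Last".
    """
    first, last = name.split(" ", 1)
    return last, first


def lnfn_to_name(last, first):
    """(f,b,b): last and first names are bound; returns the full name."""
    return f"{first} {last}"


# Adornment -> implementing function, mirroring the two External lines.
IMPLEMENTATIONS = {"bff": name_to_lnfn, "fbb": lnfn_to_name}


def decompose_name(n=None, ln=None, fn=None):
    """Return all (N, LN, FN) triples consistent with the bound arguments."""
    if n is not None:
        last, first = IMPLEMENTATIONS["bff"](n)
        # If LN or FN were also bound, they act as a filter on the result.
        if ln not in (None, last) or fn not in (None, first):
            return []
        return [(n, last, first)]
    if ln is not None and fn is not None:
        return [(IMPLEMENTATIONS["fbb"](ln, fn), ln, fn)]
    raise ValueError("no declared adornment matches this binding pattern")
```

Calling `decompose_name(n="Joe Chung")` uses the (b,f,f) implementation, while `decompose_name(ln="Doe", fn="John")` uses (f,b,b); any other binding pattern is rejected because no implementation was declared for it.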

Page 13: Tsimmis


<cs_person {<name N>, <relation R>}> @med :-

<person {<name N>, <dept ‘CS’>, <relation R> }> @whois ,

<R {<first_name,FN>, <last_name LN> }> @CS ,

More explanation:

• {, } represent sets

• In body: <person {<name N>, <dept ‘CS’>, <relation R>}> means that there is an object with

– label person,

– a value that is a set containing at least objects with labels name, dept, relation, and possibly more

• In head: These are the elements that go into the mediated object

Page 14: Tsimmis


How are the problems addressed?

• Domain mismatch: Different representations for names – use conversion functions

• Schematic discrepancy: employee, student are relation names in CS, data in whois – variables can range on both data and labels (see the variable R in query) (same now in XQuery)

• In one source (whois), there is no fixed schema – same fields will be retrieved

• The sources may evolve; we would like the mediator spec to be insensitive to most changes -- same as previous point

Page 15: Tsimmis


(MS1) : similar, but now we want all fields from both sources

<cs_person {<name N>, <relation R>, Rest1 Rest2}> @med :-

<person {<name N>, <dept ‘CS’>, <relation R> | Rest1}> @whois ,

<R {<first_name FN>, <last_name LN> | Rest2}> @CS ,

decompose_name(N, LN, FN)

External:

decompose_name(string, string, string)(b,f,f) name_to_lnfn

decompose_name(string, string, string)(f,b,b) lnfn_to_name

Explanation:

• Rest variables are distinguished in the body by occurring after |

• Each is bound to the fields of the object not mentioned explicitly (~ set difference)

The language is called MSL (mediator specification language)

Page 16: Tsimmis


An object generated by the mediator for MS1 in med:

<&cp1, cs_person, set, {&mn1, &mrel1, &t1, &rep1, &em1}>

<&mn1, name, string, ‘Joe Chung’>

<&mrel1, relation, string, ‘employee’>

<&t1, title, string, ‘Professor’>

<&rep1, reports_to, string, ‘John Doe’>

<&em1, email, string, ‘chung@cs’>

Note: this is a virtual object; materialization only for user queries

Page 17: Tsimmis


Q: How is it generated?

• Match each body atom (a pattern <o-id, l, t, val>) with objects in the specified source

– If label is a constant – can match only this constant

– Same for value

– A variable matches any label/value

– {…} match only sets

– Rest matches any components not matched explicitly

• A successful match binds the matched variables

This is essentially a (flexible) notion of a valuation from a query body to data
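The matching steps listed above can be sketched as a small backtracking matcher. This is a simplified illustration under stated assumptions, not the TSIMMIS matcher: objects are written as (label, value) pairs with o-id and type omitted, a set value is a Python list, a set pattern is a dict with hypothetical keys `items` and `rest`, each sub-pattern is assumed to match a distinct sub-object, and a Rest variable is assumed to occur only once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str


def unify(pat, val, env):
    """Match a label or value pattern; return the extended env, or None on failure."""
    if isinstance(pat, Var):
        if pat.name in env:
            return env if env[pat.name] == val else None
        return {**env, pat.name: val}       # a variable matches any label/value
    return env if pat == val else None      # a constant matches only itself


def match(pattern, obj, env):
    """Yield every env under which pattern (label, value_pattern) matches obj."""
    plabel, pvalue = pattern
    label, value = obj
    env = unify(plabel, label, env)
    if env is None:
        return
    if isinstance(pvalue, dict):             # {...} pattern: matches only sets
        if isinstance(value, list):
            yield from match_set(pvalue["items"], pvalue.get("rest"), value, env)
    else:
        env2 = unify(pvalue, value, env)
        if env2 is not None:
            yield env2


def match_set(items, rest, members, env):
    """Match each sub-pattern against a distinct member; bind rest to what is left."""
    if not items:
        yield env if rest is None else {**env, rest.name: list(members)}
        return
    first, others = items[0], items[1:]
    for i, member in enumerate(members):
        for env2 in match(first, member, env):
            yield from match_set(others, rest, members[:i] + members[i + 1:], env2)


# The two body atoms of MS1, applied to Joe Chung's objects from the example data.
joe = ("person", [("name", "Joe Chung"), ("dept", "CS"),
                  ("relation", "employee"), ("email", "chung@cs")])
emp = ("employee", [("first_name", "Joe"), ("last_name", "Chung"),
                    ("title", "Professor"), ("reports_to", "John Doe")])
whois_pat = ("person", {"items": [("name", Var("N")), ("dept", "CS"),
                                  ("relation", Var("R"))], "rest": Var("Rest1")})
cs_pat = (Var("R"), {"items": [("first_name", Var("FN")), ("last_name", Var("LN"))],
                     "rest": Var("Rest2")})

envs = [e2 for e1 in match(whois_pat, joe, {}) for e2 in match(cs_pat, emp, e1)]
```

Here the label variable `R` is bound to `employee` by the whois atom and then constrains which CS object can match, which is how a variable ranges over both data and labels.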

o-ids for the result are generated by med, since here they are not specified explicitly

Q: can you generate a few other objects from the given data?

Page 18: Tsimmis


How are the problems addressed?

• In one source (whois), there is no fixed schema – different objects may have different fields, but we want all fields

• The sources may evolve; we would like the mediator spec to be insensitive to most changes

– the variables Rest1, Rest2 are bound to the set of all the sub-objects not explicitly specified

Page 19: Tsimmis


The rest variables can be removed:

<cs_person {<name N>, <relation R>, <R1-id R1-l R1-v>

<R2-id R2-l R2-v>}> @med :-

<person {<name N>, <dept ‘CS’>, <relation R>, <R1-id R1-l R1-v>}> @whois ,

R1-l notin {name, dept, relation} ,

<R {<first_name FN>, <last_name LN>, <R2-id R2-l R2-v>}> @CS ,

R2-l notin {first_name, last_name} ,

decompose_name(N, LN, FN)

External:

………

Note: notin can be replaced here by a conjunction of neq

Page 20: Tsimmis


A reduction to Datalog

I. Model each source as a relational database:

top(src, oid) – the object identified by oid is top-level in src

object(src, oid, lab, val) – the object identified by oid exists in src, has label lab and atomic value val

object(src, oid, lab, set) – the object identified by oid exists in src, has label lab and a set value

set is here a special constant

member(src, o1, o2) – in src, o1 has a set value, o2 is in the set

The original OO database is essentially a graph of objects and relationships; the above captures this graph, relationally
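The relational encoding can be sketched as a function that flattens OEM objects into the three relations. The function name `flatten` and the input layout (a dict from oid to a label and value pair) are assumptions made for this illustration.

```python
def flatten(src, objects, top_oids):
    """Encode OEM objects as top/object/member tuples.

    objects maps oid -> (label, value), where value is an atomic value
    or a set of sub-object oids.
    """
    top = {(src, oid) for oid in top_oids}
    obj, member = set(), set()
    for oid, (label, value) in objects.items():
        if isinstance(value, (set, frozenset)):
            obj.add((src, oid, label, "set"))       # "set" is the special constant
            member.update((src, oid, child) for child in value)
        else:
            obj.add((src, oid, label, value))
    return top, obj, member


# The first whois object from the example data.
whois = {
    "&p1": ("person", {"&n1", "&d1", "&rel1", "&em1"}),
    "&n1": ("name", "Joe Chung"),
    "&d1": ("dept", "CS"),
    "&rel1": ("relation", "employee"),
    "&em1": ("email", "chung@cs"),
}
top, obj, member = flatten("whois", whois, ["&p1"])
```

One caveat of this sketch: an atomic string value equal to "set" would be indistinguishable from the special constant, so a real encoding would need a reserved marker.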

Page 21: Tsimmis


Some obvious integrity constraints:

If member(src, o1, o2) then also object(src, o1, lab1, set) and object(src, o2, lab2, v2) hold for some lab1, lab2, v2
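As a sketch, the constraint above can be checked directly over the tuple sets. The function name `violations` and the sample tuples are hypothetical.

```python
def violations(obj, member):
    """Return the member tuples that break the constraint stated above."""
    set_valued = {(s, oid) for (s, oid, _lab, val) in obj if val == "set"}
    existing = {(s, oid) for (s, oid, _lab, _val) in obj}
    return {
        (s, o1, o2)
        for (s, o1, o2) in member
        if (s, o1) not in set_valued or (s, o2) not in existing
    }


obj = {
    ("whois", "&p1", "person", "set"),
    ("whois", "&n1", "name", "Joe Chung"),
}
good = {("whois", "&p1", "&n1")}
bad = {("whois", "&n1", "&p1"),      # parent &n1 is atomic, not a set
       ("whois", "&p1", "&x9")}      # child &x9 does not exist in whois
```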

Any more?

Page 22: Tsimmis


II. Translate MSL rules to use these relations:

(MS0) <cs_person {<name N>, <relation R>}> @med :-

(1) <person {<name N>, <dept ‘CS’>, <relation R> }> @whois ,

(2) <R {<first_name,FN>, <last_name LN> }> @CS ,

(3) decompose_name(N, LN, FN) ……..

(1): top(whois, &P1),

object(whois, &P1, person, set),

object(whois, &N1, name, N),

object(whois, &D1, dept, ‘CS’) ,

object(whois, &Rel1, relation, R)

• (2) is similar: top(CS, &P2), ….

• what is your suggestion?

Page 23: Tsimmis


<cs_person {<name N>, <relation R>}> @med :-

(1) <person {<name N>, <dept ‘CS’>, <relation R> }> @whois ,

(2) <R {<first_name,FN>, <last_name LN> }> @CS ,

(3) decompose_name(N, LN, FN) ……..

head: top(med, f(&P1, &P2)),

object(med, f(&P1, &P2), cs_person, set),

……..

Here f is a new function symbol (a ‘syntactic’ function)

The term f(&P1, &P2) states that the new object id is determined by that of the two objects retrieved from whois and CS

But, it seems we generate a multi-head rule?

Page 24: Tsimmis


There is no inherent difficulty with multi-head rules:

• We can introduce an intermediate relation binds(..) to collect all the bindings from (the translation of) the body

• And a new rule with one atom in head, and binds(..) as the body, for each of the components of the head.

The term f(&P1, &P2) ensures that the facts that are generated refer to the same object
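The binds(..) construction can be sketched over the relational encoding. Several things here are assumptions that go beyond the slides: the body is joined through member (the listed translation of (1) shows only top and object atoms), a string comparison stands in for decompose_name, and the Skolem terms `f_name` and `f_rel` for the sub-object ids are hypothetical.

```python
whois_top = {("whois", "&p1")}
whois_obj = {
    ("whois", "&p1", "person", "set"),
    ("whois", "&n1", "name", "Joe Chung"),
    ("whois", "&d1", "dept", "CS"),
    ("whois", "&rel1", "relation", "employee"),
}
whois_member = {("whois", "&p1", c) for c in ("&n1", "&d1", "&rel1")}
cs_top = {("CS", "&e1")}
cs_obj = {
    ("CS", "&e1", "employee", "set"),
    ("CS", "&f1", "first_name", "Joe"),
    ("CS", "&l1", "last_name", "Chung"),
}
cs_member = {("CS", "&e1", c) for c in ("&f1", "&l1")}


def fields(obj, member, parent, label):
    """Atomic values of parent's sub-objects that carry the given label."""
    kids = {o2 for (_s, o1, o2) in member if o1 == parent}
    return [v for (_s, oid, lab, v) in obj if oid in kids and lab == label]


def binds():
    """One tuple (P1, P2, N, R) per successful valuation of the MS0 body."""
    out = set()
    for (_s, p1, lab, val) in whois_obj:
        if ("whois", p1) not in whois_top or lab != "person" or val != "set":
            continue
        if "CS" not in fields(whois_obj, whois_member, p1, "dept"):
            continue
        for n in fields(whois_obj, whois_member, p1, "name"):
            for r in fields(whois_obj, whois_member, p1, "relation"):
                for (_s2, p2, lab2, val2) in cs_obj:
                    if ("CS", p2) not in cs_top or lab2 != r or val2 != "set":
                        continue
                    for fn in fields(cs_obj, cs_member, p2, "first_name"):
                        for ln in fields(cs_obj, cs_member, p2, "last_name"):
                            if n == f"{fn} {ln}":   # stands in for decompose_name
                                out.add((p1, p2, n, r))
    return out


def head_facts(bindings):
    """One single-head rule per head component, each reading only from binds."""
    top, obj, member = set(), set(), set()
    for (p1, p2, n, r) in bindings:
        oid = ("f", p1, p2)                      # the Skolem term f(&P1, &P2)
        top.add(("med", oid))
        obj.add(("med", oid, "cs_person", "set"))
        for tag, label, value in (("f_name", "name", n), ("f_rel", "relation", r)):
            child = (tag, p1, p2)                # hypothetical Skolem term for a sub-object
            obj.add(("med", child, label, value))
            member.add(("med", oid, child))
    return top, obj, member
```

Because every derived fact uses the same term built from `p1` and `p2`, the separately derived head facts all describe one mediated object.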

Page 25: Tsimmis


Using object id’s for information fusion

So far, oid’s -- (almost) an implementation feature: enabling references to sub-objects

But, they can be used logically: if several rules use the same oid in the head, then the information produced by the rules is fused together, into a collection of sub-objects for a unique object

For this to work, the oid in the head must be a function of some of the variables (or constants) in the body (safety); then each tuple of bindings for these variables produces a unique oid

id-based object fusion

Semantic oid’s

Page 26: Tsimmis


Example:

Two sources about technical reports, use same report numbers; one has a title, the other -- the postscript

(MS3): <trep(RN) techreport {<title T>}> @cs :-

<report {<report_num RN> <title T>}> @cs1

<trep(RN) techreport {<postscript P>}> @cs :-

<report {<report_num RN> <postscript P>}> @cs2

• If report #5 occurs in both sources, the first rule attaches the title to the fused object, the second rule attaches the postscript

• If it occurs only in cs1, then only a title field will be attached

• If it occurs only in cs2, then only a ps field will be attached
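The three cases above can be reproduced with a small sketch of id-based fusion. The report titles and file names are made-up sample data, and representing each source record as a Python dict is a simplification (it cannot hold repeated labels).

```python
from collections import defaultdict

# Hypothetical sample data: report 5 is in both sources, 7 only in cs1, 9 only in cs2.
cs1 = [{"report_num": 5, "title": "Mediators"}, {"report_num": 7, "title": "Wrappers"}]
cs2 = [{"report_num": 5, "postscript": "tr5.ps"}, {"report_num": 9, "postscript": "tr9.ps"}]


def fuse(cs1, cs2):
    """Apply the two MS3 rules; facts that share an oid land in the same object."""
    med = defaultdict(set)          # semantic oid -> set of (label, value) sub-objects
    for report in cs1:              # rule 1: attach the title
        if "report_num" in report and "title" in report:
            med[("trep", report["report_num"])].add(("title", report["title"]))
    for report in cs2:              # rule 2: attach the postscript
        if "report_num" in report and "postscript" in report:
            med[("trep", report["report_num"])].add(("postscript", report["postscript"]))
    return dict(med)


fused = fuse(cs1, cs2)
```

The key `("trep", RN)` plays the role of the semantic oid trep(RN): it depends only on the report number, so both rules write into the same object.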

Page 27: Tsimmis


We can retrieve all the fields from both sources, w/o having to know their labels

(MS4): <trep(RN) techreport V> @cs :-

<report V:{<report_num RN>}> @cs1

<trep(RN) techreport V> @cs :-

<report V:{<report_num RN>}> @cs2

Variable V binds to a set of objects (provided one has a report_num field)

In this example, if both sources contain a title field, the mediated object will have both

The mediated object certainly has two report_num fields!

Can this be avoided?

Hint: use the same idea for the object with report_num label

Page 28: Tsimmis


We can choose to retrieve all fields from cs1, and from cs2 only the fields that cs1 does not provide

(MS5): <trep(RN) techreport {<field(RN, F) F V>}> @cs :-

<report {<report_num RN> <F V> }> @cs1

provided(RN, F) :- <report {<report_num RN> <F V>}> @cs1

<trep(RN) techreport {<field(RN, F) F V>}> @cs :-

not provided(RN, F), <report {<report_num RN> <F V> }> @cs2

• Use of predicates, in addition to objects, is useful

• This is a case of stratified negation, has well-defined semantics

• When evaluating against cs1, it makes sense to collect the bindings for the first two rules together
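The stratified evaluation can be sketched in two passes: first everything that depends only on cs1 (the cs1 rule and the provided predicate, computed together), then the rule with negation. The sample records are made up, and the sketch assumes that the pattern <F V> matches a sub-object other than the one matched by report_num.

```python
# Hypothetical sample data: both sources know report 5 and both have a title for it.
cs1 = [[("report_num", 5), ("title", "Mediators")]]
cs2 = [[("report_num", 5), ("title", "Mediators (draft)"), ("postscript", "tr5.ps")]]


def field_pairs(report):
    """Yield (RN, F, V) for each field of a report other than its report_num."""
    for i, (lab, rn) in enumerate(report):
        if lab == "report_num":
            for j, (f, v) in enumerate(report):
                if j != i:
                    yield rn, f, v


def mediate(cs1, cs2):
    provided = set()
    med = {}
    # Stratum 1: the cs1 rule and provided(RN, F), collected in a single pass over cs1.
    for report in cs1:
        for rn, f, v in field_pairs(report):
            provided.add((rn, f))
            med.setdefault(("trep", rn), {})[("field", rn, f)] = (f, v)
    # Stratum 2: "not provided" is tested only after provided is complete.
    for report in cs2:
        for rn, f, v in field_pairs(report):
            if (rn, f) not in provided:
                med.setdefault(("trep", rn), {})[("field", rn, f)] = (f, v)
    return med


med = mediate(cs1, cs2)
```

Since the sub-object id field(RN, F) depends only on the report number and the label, two same-labelled fields of one report would collide in this sketch; it keeps the last one seen.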

Page 29: Tsimmis


Assume reports have a field related that references another report. How can we transform them into mediator objects?

(MS6): <trep(RN) techreport {<L V>}> @cs :-

<report {<report_num RN> <L V> }> @cs1, L neq related

<trep(RN) techreport {<related trep(RRN)>}> @cs :-

<report {<report_num RN> <related {<report {<report_num RRN>}>}>}> @cs1

But, this solution assumes

• We know which sub-objects contain references

• They are at a fixed, known, depth

In XML, both assumptions may fail

Assuming we have a construct like // can we address these issues?

Page 30: Tsimmis


Querying and query processing

Can use a variety of languages for querying

We illustrate querying using MSL

Example:

Find all info about Joe Chung:

(Q1) JC :- JC: <cs_person {<name ‘Joe Chung’>}> @med

New feature: object variable JC

When Q1 is processed, it binds to any object with name ‘Joe Chung’

Each such binding inserts the object into the answer

Page 31: Tsimmis


Processing the query:

• Remove the object variable

<cs_person {<L V>}> :- <cs_person {<name ‘Joe Chung’> <L V>}> @med

Note: L neq name is not needed; why?

• Match the body condition with the head of the rule defining med (after the rest variables were also removed, p. 19):

<cs_person {<name N>, <relation R>, <R1-id R1-l R1-v>

<R2-id R2-l R2-v>}> @med :-

<person {<name N>, <dept ‘CS’>, <relation R>, <R1-id R1-l R1-v>}> @whois ,

R1-l notin {name, dept, relation} ,

<R {<first_name FN>, <last_name LN>, <R2-id R2-l R2-v>}> @CS ,

R2-l notin {first_name, last_name} ,

decompose_name(N, LN, FN)

Page 32: Tsimmis


A match:

<cs_person {<name ‘Joe Chung’> <L V>}>

<cs_person {<name N>, <relation R>, <R1-id R1-l R1-v>

<R2-id R2-l R2-v>}>

N is replaced by ‘Joe Chung’ (L and V match each of the fields)

The rule becomes:

<cs_person {<name ‘Joe Chung’>, <relation R>, <R1-id R1-l R1-v>

<R2-id R2-l R2-v>}> @med :-

<person {<name ‘Joe Chung’>, <dept ‘CS’>, <relation R>, <R1-id R1-l R1-v>}> @whois ,

R1-l notin {name, dept, relation} ,

<R {<first_name FN>, <last_name LN>, <R2-id R2-l R2-v>}> @CS ,

R2-l notin {first_name, last_name} ,

decompose_name(‘Joe Chung’, LN, FN)

Page 33: Tsimmis


So, we have replaced a view by its definition, with query bindings accounted for

• Add an object id to the head: f(‘Joe Chung’) (we do not show it)

Why is it needed?

• Now, decompose the result into source queries:

(1) <person {<name ‘Joe Chung’>, <dept ‘CS’>, <relation R> ,

< R1-id R1-l R1-v>}> @whois ,

R1-l notin {name, dept, relation}

(2) <R {<first_name,FN>, <last_name LN> , <R2-id R2-l R2-v>}> @CS ,

R2-l notin {first_name, last_name}

And a glue:

(3) decompose_name(‘Joe Chung’, LN, FN)

What are the options for query processing?

Page 34: Tsimmis


I. Obtain R bindings from whois, and FN, LN bindings from decompose, then use these in queries on CS

II. Obtain FN, LN bindings from decompose, use these in (two) queries on CS,

also use the given bindings to query whois

then join

III. Query CS, then use decompose and the results to query whois

The selection of the query processing strategy requires an optimizer
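Strategy I can be sketched as a bind join over simulated sources. Everything about the source interfaces here is an assumption: `query_whois` and `query_cs` are hypothetical stand-ins for wrapper calls, records are plain dicts built from the example data, and merging Rest1 and Rest2 into one dict would overwrite fields that share a label.

```python
WHOIS = [
    {"name": "Joe Chung", "dept": "CS", "relation": "employee", "email": "chung@cs"},
    {"name": "Nick Cave", "dept": "CS", "relation": "student", "year": 3},
]
CS = {
    "employee": [
        {"first_name": "Joe", "last_name": "Chung", "title": "Professor",
         "reports_to": "John Doe"},
        {"first_name": "John", "last_name": "Doe"},
    ],
    "student": [{"first_name": "Jack", "last_name": "Dean", "year": 3}],
}


def query_whois(name, dept):
    """Source query (1): persons with the given name and dept."""
    return [p for p in WHOIS if p.get("name") == name and p.get("dept") == dept]


def query_cs(relation, first_name, last_name):
    """Source query (2) with R, FN and LN already bound."""
    return [r for r in CS.get(relation, [])
            if r.get("first_name") == first_name and r.get("last_name") == last_name]


def name_to_lnfn(name):
    """Glue (3), assuming names of the form "First Last"."""
    first, last = name.split(" ", 1)
    return last, first


def strategy_one(name):
    """Strategy I: whois gives R, decompose gives LN and FN, then bound queries on CS."""
    ln, fn = name_to_lnfn(name)
    answers = []
    for person in query_whois(name, "CS"):
        if "relation" not in person:
            continue
        rest1 = {k: v for k, v in person.items() if k not in ("name", "dept", "relation")}
        for row in query_cs(person["relation"], fn, ln):
            rest2 = {k: v for k, v in row.items() if k not in ("first_name", "last_name")}
            answers.append({"name": name, "relation": person["relation"], **rest1, **rest2})
    return answers
```

With bound arguments pushed into `query_cs`, the CS source returns only matching rows; strategies II and III would issue the same calls in a different order, and which order is cheaper depends on source sizes and capabilities.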

Page 35: Tsimmis


Summary

Tsimmis combines

• Semi-structured data

• Semantic object id’s for object fusion

• A GAV approach

Advantages:

• Offers a solution to the schematic discrepancy problem

• Can deal with source evolution, unknown structure, …

• Semantic oid’s are a nice mechanism for information fusion

Some problems:

• Does not provide easy access to deeply nested data

• Nor to data whose depth is variable/unknown

