
Berendt: Advanced databases, 2009, http://www.cs.kuleuven.be/~berendt/teaching

Advanced databases –

Defining and combining heterogeneous databases:

Core ideas of federated databases; Schema matching

Bettina Berendt

Katholieke Universiteit Leuven, Department of Computer Science

http://www.cs.kuleuven.ac.be/~berendt/teaching/2009-10-1stsemester/adb/

Last update: 12 October 2009

Until now ...

... we have looked into modelling

... we have seen how the languages RDF and OWL allow us to combine different schemas and data

... we have seen how Linked Data on the Web uses HTTP as a connecting protocol/architecture

... we have assumed that such combinations can be done effortlessly (unique names etc.)

Now we need to ask:

What are the challenges of such combinations?

What approaches have been proposed to address them?

– from the databases & the Semantic Web / ontologies fields

– from architectural and logical points of view

Motivation 1: Price comparison engines search & combine heterogeneous travel-agency DBs, which search & combine heterogeneous airline DBs

Motivation 2: Schemas coming from different languages

A river is a natural stream of water, usually freshwater, flowing toward an ocean, a lake, or another stream. In some cases a river flows into the ground or dries up completely before reaching another body of water. Usually larger streams are called rivers while smaller streams are called creeks, brooks, rivulets, rills, and many other terms, but there is no general rule that defines what can be called a river. Sometimes a river is said to be larger than a creek, but this is not always the case.

Une rivière est un cours d'eau qui s'écoule sous l'effet de la gravité et qui se jette dans une autre rivière ou dans un fleuve, contrairement au fleuve qui se jette, lui, dans la mer ou dans l'océan. (English: A rivière is a watercourse that flows under the effect of gravity and empties into another rivière or into a fleuve, unlike a fleuve, which flows into the sea or the ocean.)

Een rivier is een min of meer natuurlijke waterstroom. We onderscheiden oceanische rivieren (in België ook wel stroom genoemd) die in een zee of oceaan uitmonden, en continentale rivieren die in een meer, een moeras of woestijn uitmonden. Een beek is de aanduiding voor een kleine rivier. Tussen beek en rivier ligt meestal een bijrivier. (English: A rivier is a more or less natural stream of water. A distinction is made between oceanic rivers (in Belgium also called stroom), which flow into a sea or ocean, and continental rivers, which flow into a lake, a marsh, or a desert. A beek is a small river. Between beek and rivier there is usually a bijrivier, i.e. a tributary.)

Motivation 3: „Who was that?“ – Re-identification

Agenda

The match problem & what info to use for matching

(Semi-)automated matching: Example CUPID

(Semi-)automated matching: Example iMAP

Evaluating matching

Involving the user: Explanations; mass collaboration

Core ideas of federated databases

Overview

goal: interoperability through data integration:

combining heterogeneous data sources under a single query interface

A federated database system is a type of meta-database management system (DBMS) which transparently integrates multiple autonomous database systems into a single federated database.

The constituent databases are interconnected via a computer network, and may be geographically decentralized.

Since the constituent database systems remain autonomous, a federated database system is a contrastable alternative to the (sometimes daunting) task of merging together several disparate databases.

A federated database (or virtual database) is the fully-integrated, logical composite of all constituent databases in a federated database system.

Issues in federating data sources

Interconnection and cooperation of autonomous and heterogeneous databases must address

Distribution

Autonomy

Heterogeneity

Architectures: Dealing differently with autonomy

Tightly coupled: global schema integration, e.g. data warehousing

More loosely coupled: federated databases with schema matching/mapping:

Global as View (GaV): the global schema is defined in terms of the underlying schemas

Local as View (LaV): the local schemas are defined in terms of the global schema
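
The GaV idea can be illustrated with a minimal sketch. It assumes two hypothetical airline sources with different column names and units (all table and column names are invented for illustration), and uses a single in-memory SQLite database as a stand-in for separate autonomous systems. The global relation `flights` is defined as a view over the sources, so a user query against the global schema is answered by unfolding the view.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Two autonomous sources with different schemas (hypothetical names).
    CREATE TABLE airline_a (flight_no TEXT, origin TEXT, dest TEXT, price_eur REAL);
    CREATE TABLE airline_b (code TEXT, from_city TEXT, to_city TEXT, price_cents INTEGER);
    INSERT INTO airline_a VALUES ('A1', 'Brussels', 'Rome', 120.0);
    INSERT INTO airline_b VALUES ('B7', 'Brussels', 'Rome', 9900);

    -- GaV: the global relation is defined as a query over the source schemas.
    CREATE VIEW flights AS
        SELECT flight_no AS flight, origin, dest, price_eur AS price FROM airline_a
        UNION ALL
        SELECT code, from_city, to_city, price_cents / 100.0 FROM airline_b;
""")

def cheapest(origin, dest):
    """Query the global schema only; the view definition hides the sources."""
    return con.execute(
        "SELECT flight, price FROM flights "
        "WHERE origin = ? AND dest = ? ORDER BY price LIMIT 1",
        (origin, dest),
    ).fetchone()

print(cheapest("Brussels", "Rome"))
```

Under LaV the direction is reversed: each source would be described as a view over the global schema, and answering a query requires rewriting it in terms of those source views.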

Issues in query processing

In both GaV and LaV systems, a user poses conjunctive queries over a virtual schema represented by a set of views, or "materialized" conjunctive queries.

Integration seeks to rewrite the queries represented by the views so that their results are equivalent to, or maximally contained in, the user's query.

This corresponds to the problem of answering queries using views.

An example developed in-house: SQI - PLQL

Purpose: For federated search in learning-object repositories

An approach with conceptual-level abstraction from data sources

Integratable data source types:

Relational, XML, IR systems, (search engine) Web services, search APIs

Full abstraction of user from data sources:

Yes

User-specific data source selection for integration:

Depends on application

User-specific data modeling for integration:

No

Explicit, queryable semantics:

(delegated to the sources: LOM etc.)

Heterogeneity

Heterogeneity is independent of location of data

When is an information system homogeneous?

Software that creates and manipulates data is the same

All data follows same structure and data model and is part of a single universe of discourse

Different levels of heterogeneity:

Different languages to write applications

Different query languages

Different models

Different DBMSs

Different file systems

Semantic heterogeneity

etc.

Agenda

The match problem & what info to use for matching

(Semi-)automated matching: Example CUPID

(Semi-)automated matching: Example iMAP

Evaluating matching

Involving the user: Explanations; mass collaboration

Core ideas of federated databases

The match problem (Running example 1)

Given two schemas S1 and S2, find a mapping between elements of S1 and S2 that correspond semantically to each other

Running example 2

Motivation: application areas

Schema integration in multi-database systems

Data integration systems on the Web

Translating data (e.g., for data warehousing)

E-commerce message translation

P2P data management

Model management (tools for easily manipulating models of data)

Based on what information can the matchings/mappings be found?

(work on the two running examples)

The match operator

Match operator: f(S1,S2) = mapping between S1 and S2

for schemas S1, S2

Mapping

a set of mapping elements

Mapping elements

elements of S1, elements of S2, mapping expression

Mapping expression

different functions and relationships
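
The definitions above can be read as a small data structure. The following is a minimal sketch, not a real matcher: the class and field names are invented for illustration, and the toy `match` function pairs elements only when their attribute names are equal ignoring case, which is far weaker than the semantic correspondence the match problem asks for.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

@dataclass(frozen=True)
class MappingElement:
    s1_elements: FrozenSet[str]   # elements of S1
    s2_elements: FrozenSet[str]   # elements of S2
    expression: str               # mapping expression relating the two sides

def match(s1: List[str], s2: List[str]) -> List[MappingElement]:
    """Toy match operator f(S1, S2): a mapping is a list of mapping elements.

    Assumption: two elements correspond iff their attribute names
    (the part after the last dot) are equal ignoring case.
    """
    mapping = []
    for e1 in s1:
        for e2 in s2:
            if e1.split(".")[-1].lower() == e2.split(".")[-1].lower():
                mapping.append(
                    MappingElement(frozenset({e1}), frozenset({e2}), f"{e1} = {e2}")
                )
    return mapping

mapping = match(["HOUSES.location", "HOUSES.price"],
                ["LISTINGS.area", "LISTINGS.Price"])
for element in mapping:
    print(element.expression)
```

Note that this name-only matcher misses the correspondence between location and area, which is exactly why additional information is needed for matching.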

Matching expressions: examples

Scalar relations (=, ≥, ...)

S.HOUSES.location = T.LISTINGS.area

Functions

T.LISTINGS.list-price = S.HOUSES.price * (1+S.AGENTS.fee-rate)

T.LISTINGS.agent-address = concat(S.AGENTS.city,S.AGENTS.state)

ER-style relationships (is-a, part-of, ...)

Set-oriented relationships (overlaps, contains, ...)

Any other terms that are defined in the expression language used

Matching and mapping

1. Find the schema match („declarative“)

2. Create a procedure (e.g., a query expression) to enable automated data translation or exchange (mapping, „procedural“)

Example of result of step 2: To create T.LISTINGS from S (simplified notation):

area = SELECT location FROM HOUSES

agent-name = SELECT name FROM AGENTS

agent-address = SELECT concat(city,state) FROM AGENTS

list-price = SELECT price * (1+fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id
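
The simplified notation above can be turned into a runnable data translation. This is a sketch under stated assumptions: hyphens in attribute names are replaced by underscores to get valid SQL identifiers, the per-attribute queries are combined into one INSERT ... SELECT over the join, SQLite's `||` operator stands in for concat, a space is inserted between city and state, and the sample rows are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Source schema S (sample data is invented for illustration).
    CREATE TABLE AGENTS (id INTEGER PRIMARY KEY, name TEXT, city TEXT,
                         state TEXT, fee_rate REAL);
    CREATE TABLE HOUSES (location TEXT, price REAL,
                         agent_id INTEGER REFERENCES AGENTS(id));
    INSERT INTO AGENTS VALUES (1, 'Mike Brown', 'Athens', 'GA', 0.25);
    INSERT INTO HOUSES VALUES ('Atlanta, GA', 360000, 1);

    -- Target schema T.
    CREATE TABLE LISTINGS (area TEXT, list_price REAL,
                           agent_name TEXT, agent_address TEXT);

    -- Step 2 ("procedural"): translate S into T.LISTINGS in one statement.
    INSERT INTO LISTINGS (area, list_price, agent_name, agent_address)
    SELECT h.location,
           h.price * (1 + a.fee_rate),
           a.name,
           a.city || ' ' || a.state
    FROM HOUSES AS h JOIN AGENTS AS a ON h.agent_id = a.id;
""")

rows = con.execute(
    "SELECT area, list_price, agent_name, agent_address FROM LISTINGS"
).fetchall()
print(rows)
```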

Based on what information can the matchings/mappings be found?

Rahm & Bernstein's classification of schema matching approaches

Challenges

Semantics of the involved elements often need to be inferred

Often need to base (heuristic) solutions on cues in schema and data, which are unreliable

e.g., homonyms (area), synonyms (area, location)

Schema and data clues are often incomplete

e.g., date: date of what?

Global nature of matching: to choose one matching possibility, must typically exclude all others as worse

Matching is often subjective and/or context-dependent

e.g., does house-style match house-description or not?

Extremely laborious and error-prone process

e.g., Li & Clifton 200: project at GTE telecommunications:

40 databases, 27K elements, no access to the original developers of the DB

estimated time for just finding and documenting the matches: 12 person years

Semi-automated schema matching (1)

Rule-based solutions

Hand-crafted rules

Exploit schema information

+ relatively inexpensive

+ do not require training

+ fast (operate only on schema, not data)

+ can work very well in certain types of applications & domains

+ rules can provide a quick & concise method of capturing user knowledge about the domain

– cannot exploit data instances effectively

– cannot exploit previous matching efforts

(other than by re-use)

Semi-automated schema matching (2)

Learning-based solutions

Rules/mappings learned from attribute specifications and statistics of data content (Rahm&Bernstein: „instance-level matching“)

Exploit schema information and data

Some approaches: external evidence

Past matches

Corpus of schemas and matches („matchings in real-estate applications will tend to be alike“)

Corpus of users (more details later in this slide set)

+ can exploit data instances effectively

+ can exploit previous matching efforts

– relatively expensive

– require training

– slower (operate on data)

– results may be opaque (e.g., neural network output) → explanation components! (more details later)

Agenda

The match problem & what info to use for matching

(Semi-)automated matching: Example CUPID

(Semi-)automated matching: Example iMAP

Evaluating matching

Involving the user: Explanations; mass collaboration

Core ideas of federated databases

Overview (1)

Rule-based approach

Schema types:

Relational, XML

Metadata representation:

Extended ER

Match granularity:

Element, structure

Match cardinality:

1:1, n:1

Overview (2)

Schema-level match:

Name-based: name equality, synonyms, hypernyms, homonyms, abbreviations

Constraint-based: data type and domain compatibility, referential constraints

Structure matching: matching subtrees, weighted by leaves

Re-use, auxiliary information used:

Thesauri, glossaries

Combination of matchers:

Hybrid

Manual work / user input:

User can adjust threshold weights

Basic representation: Schema trees

Computation overview:

1. Compute similarity coefficients between elements of these graphs

2. Deduce a mapping from these coefficients

Computing similarity coefficients (1): Linguistic matching

Operates on schema element names (= nodes in schema tree)

1. Normalization

Tokenization (parse names into tokens based on punctuation, case, etc.)

e.g., Product_ID → {Product, ID}

Expansion (of abbreviations and acronyms)

Elimination (of prepositions, articles, etc.)

2. Categorization / clustering

Based on data types, schema hierarchy, linguistic content of names

e.g., „real-valued elements“, „money-related elements“

3. Comparison (within the categories)

Compute linguistic similarity coefficients (lsim) based on thesaurus (synonymy, hypernymy)

Output: Table of lsim coefficients (in [0,1]) between schema elements
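
The normalization and comparison steps can be sketched in a few lines. This is an illustration only, not the published algorithm: the abbreviation list, stop-word list, toy thesaurus, the synonym score of 0.8, and the averaging formula in `lsim` are all assumptions, and the categorization step is skipped (all names are compared with each other).

```python
import re

ABBREVIATIONS = {"id": "identifier", "qty": "quantity"}   # hypothetical entries
STOPWORDS = {"of", "the", "a", "an", "in", "for"}
SYNONYMS = {frozenset({"price", "cost"}), frozenset({"area", "location"})}  # toy thesaurus

def normalize(name):
    """Tokenization, expansion, and elimination for one schema element name."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", name)       # split on case change
    tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]        # expansion
    return [t for t in tokens if t not in STOPWORDS]          # elimination

def token_sim(a, b):
    if a == b:
        return 1.0
    if frozenset({a, b}) in SYNONYMS:
        return 0.8            # assumed score for thesaurus synonyms
    return 0.0

def lsim(name1, name2):
    """Linguistic similarity in [0, 1]: average best-match score over all tokens."""
    t1, t2 = normalize(name1), normalize(name2)
    if not t1 or not t2:
        return 0.0
    best1 = sum(max(token_sim(a, b) for b in t2) for a in t1)
    best2 = sum(max(token_sim(a, b) for a in t1) for b in t2)
    return (best1 + best2) / (len(t1) + len(t2))

print(normalize("Product_ID"))
print(lsim("ItemCost", "priceOfItem"))
```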

How to identify synonyms and homonyms: Example WordNet

How to identify hypernyms: Example WordNet

Computing similarity coefficients (2): Structure matching

Intuitions:

Leaves are similar if they are linguistic & data-type similar, and if they have similar neighbourhoods

Non-leaf elements are similar if linguistically similar & have similar subtrees (where leaf sets are most important)

Procedure:

1. Initialize structural similarity of leaves based on data types

Identical data types: compat. = 0.5; otherwise in [0,0.5]

2. Process the tree in post-order

3. Stronglink(leaf1, leaf2) iff their weighted sim. ≥ threshold

4. .

The structure matching algorithm

Output: a 1:n mapping for leaves

To generate non-leaf mappings: 2nd post-order traversal
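
The leaf-level part of the procedure can be sketched as follows. Everything beyond what the slides state is an assumption: the weighted similarity is taken to be a linear combination of structural and linguistic similarity, the weight and threshold values are invented, the type-compatibility table is a toy, and the similarity of two non-leaf elements is taken to be the fraction of their leaves that take part in a strong link.

```python
W_STRUCT = 0.5    # assumed weight of structural vs. linguistic similarity
TH_ACCEPT = 0.5   # assumed threshold for a strong link

NUMERIC = {"int", "float", "decimal"}

def type_compat(t1, t2):
    """Initial structural similarity of two leaves, in [0, 0.5]."""
    if t1 == t2:
        return 0.5
    if t1 in NUMERIC and t2 in NUMERIC:
        return 0.4            # assumed value for compatible numeric types
    return 0.0

def wsim(ssim, lsim):
    return W_STRUCT * ssim + (1 - W_STRUCT) * lsim

def strong_links(leaves1, leaves2, lsim_table):
    """Leaf pairs whose weighted similarity reaches the threshold."""
    links = set()
    for n1, ty1 in leaves1.items():
        for n2, ty2 in leaves2.items():
            lsim = lsim_table.get((n1, n2), 0.0)
            if wsim(type_compat(ty1, ty2), lsim) >= TH_ACCEPT:
                links.add((n1, n2))
    return links

def subtree_ssim(leaves1, leaves2, links):
    """Structural similarity of two non-leaf elements from their leaf sets."""
    linked1 = {a for a, _ in links}
    linked2 = {b for _, b in links}
    return (len(linked1) + len(linked2)) / (len(leaves1) + len(leaves2))

# Leaves (name -> data type) of two subtrees and a precomputed lsim table.
houses = {"price": "float", "location": "string"}
listings = {"list_price": "float", "area": "string", "date": "date"}
lsim_table = {("price", "list_price"): 0.9, ("location", "area"): 0.8}

links = strong_links(houses, listings, lsim_table)
print(sorted(links), subtree_ssim(houses, listings, links))
```

In a full implementation these computations would run during the post-order traversal, so that leaf similarities are available before their parents are compared.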

Matching shared types

Solution: expand the schema into a schema tree, then proceed as before

Can help to generate context-dependent mappings

Fails if a cycle of containment and IsDerivedFrom relationships is present (e.g., recursive type definitions)

Agenda

The match problem & what info to use for matching

(Semi-)automated matching: Example CUPID

(Semi-)automated matching: Example iMAP

Evaluating matching

Involving the user: Explanations; mass collaboration

Core ideas of federated databases

Main ideas

A learning-based approach

Main goal: discover complex matches

In particular: functions such as

T.LISTINGS.list-price = S.HOUSES.price * (1+S.AGENTS.fee-rate)

T.LISTINGS.agent-address = concat(S.AGENTS.city,S.AGENTS.state)

Works on relational schemas

Basic idea: reformulate schema matching as search

Architecture

Searchers are specialized in discovering certain types of complex matches → makes search more efficient

Overview of implemented searchers

Example: The textual searcher

For target attribute T.LISTINGS.agent-address:

Examine attributes and concatenations of attributes from S

Restrict examined set by analyzing textual properties

Data type information in schema, heuristics (proportion of non-numeric characters etc.)

Evaluate match candidates based on data correspondences, prune inferior candidates

Example: The numerical searcher

For target attribute T.LISTINGS.list-price:

Examine attributes and arithmetic expressions over them from S

Restrict examined set by analyzing numeric properties

Data type information in schema, heuristics

Evaluate match candidates based on data correspondences, prune inferior candidates

Search strategy (1): Example textual searcher

1. Learn a (Naive Bayes) classifier

text → class („agent-address“ or „other“)

from the data instances in T.LISTINGS.agent-address

2. Apply this classifier to each match candidate (e.g., location, concat(city,state))

3. Score of the candidate = average over instance probabilities

4. For expansion: beam search – only the k top-scoring candidates
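
The four steps can be sketched with a tiny word-level Naive Bayes classifier written from scratch. This is an illustration, not the iMAP implementation: the sample data is invented, the „other“ class is assumed to be trained on values from other target columns, Laplace smoothing is an added assumption, and candidates are expanded by concatenating one more source column per round.

```python
import math
from collections import Counter

def tokens(text):
    return text.lower().split()

class NaiveBayes:
    """Multinomial Naive Bayes over word tokens with Laplace smoothing."""
    def __init__(self, labelled):
        self.docs, self.words, self.vocab = Counter(), {}, set()
        for text, label in labelled:
            self.docs[label] += 1
            bag = self.words.setdefault(label, Counter())
            for tok in tokens(text):
                bag[tok] += 1
                self.vocab.add(tok)

    def prob(self, text, label):
        logs = {}
        for lab, bag in self.words.items():
            denom = sum(bag.values()) + len(self.vocab)
            score = math.log(self.docs[lab] / sum(self.docs.values()))
            for tok in tokens(text):
                score += math.log((bag[tok] + 1) / denom)
            logs[lab] = score
        top = max(logs.values())
        return math.exp(logs[label] - top) / sum(math.exp(v - top) for v in logs.values())

def score(clf, instances):
    """Step 3: average probability of class 'agent-address' over the instances."""
    return sum(clf.prob(x, "agent-address") for x in instances) / len(instances)

def beam_search(clf, columns, k=2, rounds=2):
    """Step 4: keep only the k best candidates, then expand them by concatenation."""
    beam = [((name,), vals) for name, vals in columns.items()]
    best = None
    for _ in range(rounds):
        ranked = sorted(beam, key=lambda c: score(clf, c[1]), reverse=True)[:k]
        if best is None or score(clf, ranked[0][1]) > score(clf, best[1]):
            best = ranked[0]
        beam = [(names + (other,), [f"{a} {b}" for a, b in zip(vals, columns[other])])
                for names, vals in ranked for other in columns if other not in names]
    return best[0], score(clf, best[1])

# Step 1: train on target instances (invented sample data).
training = [(t, "agent-address") for t in ["Seattle WA", "Portland OR", "Boise ID"]]
training += [(t, "other") for t in ["great view", "near park", "quiet street"]]
clf = NaiveBayes(training)

# Step 2: source columns from which match candidates are built.
source = {"location": ["quiet street", "near park"],
          "city": ["Seattle", "Portland"],
          "state": ["WA", "OR"]}
best_columns, best_score = beam_search(clf, source)
print(best_columns, round(best_score, 3))
```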

Search strategy (2): Example numeric searcher

1. Get value distributions of target attribute and each candidate

2. Compare the value distributions (Kullback-Leibler divergence measure)

3. Score of the candidate = Kullback-Leibler measure
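
A minimal sketch of these three steps, assuming that value distributions are approximated by histograms over fixed bin edges with a small smoothing constant (both assumptions), and using invented sample values. A smaller divergence means the candidate's distribution is closer to that of the target attribute.

```python
import math

def histogram(values, edges, eps=1e-6):
    """Smoothed relative frequencies of values over the bins [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts) + eps * len(counts)
    return [(c + eps) / total for c in counts]

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) of two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Invented sample data: target attribute and two source attributes.
list_price = [110.0, 220.0, 330.0, 440.0]    # T.LISTINGS.list-price
price = [100.0, 200.0, 300.0, 400.0]         # S.HOUSES.price
fee_rate = [0.1, 0.1, 0.1, 0.1]              # S.AGENTS.fee-rate

edges = [100, 205, 310, 415, 520]            # assumed bin edges
target = histogram(list_price, edges)

candidates = {
    "price": price,
    "price * (1 + fee-rate)": [p * (1 + f) for p, f in zip(price, fee_rate)],
}
scores = {name: kl_divergence(target, histogram(vals, edges))
          for name, vals in candidates.items()}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```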

Evaluation strategies of implemented searchers

Pruning by domain constraints

Multiple attributes of S: „attributes name and beds are unrelated“ → do not generate match candidates with these 2 attributes

Properties of a single attribute of T: „the average value of num-rooms does not exceed 10“ → use in evaluation of candidates

Properties of multiple attributes of T: „lot-area and num-baths are unrelated“ → at match selector level, „clean up“:

Example

– T.num_baths = S.baths

– ? T.lot-area = (S.lot-sq-feet/43560) + 1.3e-15 * S.baths

Based on the domain constraint, drop the term involving S.baths

Pruning by using knowledge from overlap data

When S and T share the same data

Consider fraction of data for which mapping is correct

e.g., house locations:

S.HOUSES.location overlaps more with T.LISTINGS.area than with T.LISTINGS.agent-address

Discard the candidate T.LISTINGS.agent-address = S.HOUSES.location,

keep only T.LISTINGS.agent-address = concat(S.AGENTS.city,S.AGENTS.state)
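
The overlap idea can be sketched as a simple value-overlap fraction. This is a simplification under stated assumptions: overlap is measured on bare column values rather than on matched tuples, and the sample values are invented.

```python
def overlap_fraction(source_values, target_values):
    """Fraction of source values that also occur in the target column."""
    if not source_values:
        return 0.0
    target = set(target_values)
    return sum(v in target for v in source_values) / len(source_values)

# Invented sample data for S.HOUSES.location and two target attributes.
location = ["Leuven", "Gent", "Brugge", "Namur"]
area = ["Leuven", "Gent", "Brugge", "Liège"]
agent_address = ["Brussel", "Gent", "Antwerpen"]

with_area = overlap_fraction(location, area)
with_address = overlap_fraction(location, agent_address)
print(with_area, with_address)

# The candidate with the clearly lower overlap is discarded.
keep = "T.LISTINGS.area" if with_area > with_address else "T.LISTINGS.agent-address"
print(keep)
```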

Agenda

The match problem & what info to use for matching

(Semi-)automated matching: Example CUPID

(Semi-)automated matching: Example iMAP

Evaluating matching

Involving the user: Explanations; mass collaboration

Core ideas of federated databases

How to compare?

Input: What kind of input data? (What languages? Only toy examples? What external information?)

Output: mapping between attributes or tables, nodes or paths? How much information does the system report?

Quality measures: metrics for accuracy and completeness?

Effort: how much savings of manual effort, how quantified?

Pre-match effort (training of learners, dictionary preparation, ...)

Post-match effort (correction and improvement of the match output)

How are these measured?

Match quality measures

Need a „gold standard“ (the „true“ match)

Measures from information retrieval:

(standard choice: F1, α = 0.5)

Quantifies post-match effort
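
The formulas on this slide did not survive in the transcript. The sketch below uses the standard information-retrieval definitions of precision, recall, and the α-weighted F-measure, plus an "Overall" measure in the form commonly used in schema-matching evaluations, (true positives − false positives) / size of the gold standard, which is taken here to be the measure that quantifies post-match effort. Treat these definitions, the function name, and the sample matches as assumptions.

```python
def match_quality(found, gold, alpha=0.5):
    """Compare a set of found correspondences with the gold standard."""
    true_pos = len(found & gold)
    false_pos = len(found - gold)
    precision = true_pos / len(found) if found else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision == 0.0 or recall == 0.0:
        f_measure = 0.0
    else:
        # alpha = 0.5 gives F1 = 2PR / (P + R)
        f_measure = (precision * recall) / ((1 - alpha) * precision + alpha * recall)
    # Overall can be negative when there are more wrong than correct matches.
    overall = (true_pos - false_pos) / len(gold) if gold else 0.0
    return precision, recall, f_measure, overall

gold = {("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")}
found = {("a", "x"), ("b", "y"), ("c", "q")}

precision, recall, f1, overall = match_quality(found, gold)
print(precision, recall, f1, overall)
```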

Benchmarking

Do, Melnik, and Rahm (2003) found that evaluation studies were not comparable

Need more standardized conditions (benchmarks)

Now a tradition of competitions in ontology matching (more in the next session):

Test cases and contests at http://www.ontologymatching.org/evaluation.html

Agenda

The match problem & what info to use for matching

(Semi-)automated matching: Example CUPID

(Semi-)automated matching: Example iMAP

Evaluating matching

Involving the user: Explanations; mass collaboration

Core ideas of federated databases

Example in iMAP

User sees ranked candidates:

1. List-price = price

2. List-price = price * (1 + fee-rate)

Explanation:

a) Both generated from numeric searcher, 2 ranked higher than 1

b) But:

c) Match month-posted = fee-rate

d) domain constraint: matches for month-posted and price do not share attributes

e) → cannot match list-price to anything to do with fee-rate

f) Why c)?

g) Data instances of fee-rate were classified as of type date

User corrects this wrong step g), the rest is repaired accordingly

Background knowledge structure for explanation: dependency graph

MOBS: Using mass collaboration to automate data integration

1. Initialization: a correct but partial match (e.g. title = a1, title = b2, etc.)

2. Soliciting user feedback: User query → user must answer a simple question → user gets answer to initial query

3. Computing user weights (e.g., trustworthiness = fraction of correct answers to known mappings)

4. Combining user feedback (e.g., majority count)

Important: „instant gratification“ (e.g., include the new field in the results page after a user has given helpful input)
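
Steps 3 and 4 can be sketched as a trust-weighted vote. This is an illustration under assumptions: weighting each vote by the user's trustworthiness is one possible way to combine feedback (the slide mentions a majority count as an example), and the users, questions, and answers are invented.

```python
from collections import defaultdict

def user_weights(answers, known):
    """Trustworthiness = fraction of correct answers on already-known mappings."""
    weights = {}
    for user, given in answers.items():
        checked = [q for q in given if q in known]
        if checked:
            weights[user] = sum(given[q] == known[q] for q in checked) / len(checked)
        else:
            weights[user] = 0.0   # no evidence yet: assume no trust
    return weights

def combine(answers, weights, question):
    """Weighted vote over all users who answered the question."""
    votes = defaultdict(float)
    for user, given in answers.items():
        if question in given:
            votes[given[question]] += weights[user]
    return max(votes, key=votes.get) if votes else None

# Step 1: correct but partial match used to judge users (invented data).
known = {"title": "a1", "author": "a2"}

# Step 2: each user's answers; "year" is the mapping still to be decided.
answers = {
    "ann": {"title": "a1", "author": "a2", "year": "a3"},
    "bob": {"title": "a1", "author": "a5", "year": "a4"},
    "eve": {"title": "a9", "author": "a7", "year": "a4"},
}

weights = user_weights(answers, known)
decision = combine(answers, weights, "year")
print(weights, decision)
```

In this example an unweighted majority count would pick a4, while the weighted vote follows the one user who answered all known mappings correctly.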

Next lecture

The match problem & what info to use for matching

(Semi-)automated matching: Example CUPID

(Semi-)automated matching: Example iMAP

Evaluating matching

Involving the user: Explanations; mass collaboration

Ontology matching / alignment

Core ideas of federated databases

References / background reading; acknowledgements

http://en.wikipedia.org/wiki/Federated_database_system

http://en.wikipedia.org/wiki/Data_integration

Rahm, E. & Bernstein, P.A. (2001). A survey of approaches to automatic schema matching. The VLDB Journal, 10, 334-350.

http://research.microsoft.com/~philbe/VLDBJ-Dec2001.pdf

Doan, A. & Halevy, A.Y. (2004). Semantic Integration Research in the Database Community: A brief survey. AI Magazine.

http://dit.unitn.it/~p2p/RelatedWork/Matching/si-survey-db-community.pdf

Madhavan, J., Bernstein, P.A., & Rahm, E. (2001). Generic Schema Matching with Cupid. In Proc. of the 27th VLDB Conference.

http://dbs.uni-leipzig.de/de/publication/title/generic_schema_matching_with_cupid

Dhamankar, R., Lee, Y., Doan, A., Halevy, A., & Domingos, P. (2004). iMAP: Discovering complex semantic matches between database schemas. In Proc. of SIGMOD 2004.

http://citeseer.ist.psu.edu/680053.html

Do, H.-H., Melnik, S., & Rahm, E. (2003). Comparison of schema matching evaluations. In Web, Web-Services, and Database Systems: NODe 2002, Web- and Database-Related Workshops, Erfurt, Germany, October 7-10, 2002. Revised Papers (pp. 221-237). Springer.

http://dit.unitn.it/~p2p/RelatedWork/Comparison%20of%20Schema%20Matching%20Evaluations.pdf

McCann, R., Doan, A., Varadarajan, V., & Kramnik, A. (2003). Building data integration systems via mass collaboration. In Proc. International Workshop on the Web and Databases (WebDB).

http://citeseer.ist.psu.edu/675796.html

Please see the Powerpoint slide-specific „notes“ for URLs of used pictures and formulae

