Benchmarks: From Usage To Evaluation
Schema Matching and Mapping Systems
Yannis Velegrakis (University of Trento), Angela Bonifati (CNR)
EDBT 2011, Uppsala, Sweden, March 21st-25th
2 EDBT 2011, A. Bonifati & Y. Velegrakis
Talk Outline
Introduction
Matching and mapping: techniques & tools
Benchmarks and evaluation principles
Designing a matching & mapping benchmark
Using real-world scenarios for evaluation
Generating synthetic evaluation scenarios
Measuring efficiency and effectiveness
Conclusions and future directions
Introduction
Data is inherently heterogeneous
  Due to the explosion of online data repositories
  Due to the variety of users, who develop a wealth of applications
    at different times
    with disparate requirements in mind
A fundamental requirement is to translate data across different formats
Mappings specify how data is transformed from one format to another
Mappings Are All Around
Data integration [Lenzerini 2002]
to specify the relationship between local and global schemas
[Figure: source schemas S1, S2, S3 with instances I1, I2, I3, mapped to a global schema T.]
Mappings Are All Around
Schema integration [Batini et al. 1986]
to specify the relationship between the input schemas and the integrated schemas
[Figure: input schemas S1, S2, S3 combined into an integrated schema.]
Mappings Are All Around
Data exchange [Fagin et al. 2005]
to specify the relationship between source and target schemas
[Figure: mappings between a source schema S and a target schema T; a source instance I is translated into a target instance J.]
Mappings Are All Around
Schema evolution [Lerner 2000]
to specify the relationship between the old and new version of an evolved schema
[Figure: an evolving schema S1 with successive versions S1’ and S1’’.]
How Did It All Start
One of the first systems to deal with this problem was developed at IBM in 1977: EXPRESS (EXtraction, Processing and REStructuring System) [Shu et al. 1977]. It consists of two languages:
  DEFINE, which works as a DDL (Data Definition Language)
  CONVERT, which works as a DTL (Data Translation Language) and has a total of 9 operators, each of which receives a data file as input, performs the respective transformation, and generates an output data file
EXPRESS required the user's familiarity with the languages and was customized to only one model (hierarchical)
After that, inter-model transformations were also studied [Tork-Roth et al. 1997] [Atzeni et al. 1997]
Emphasis on Data Translation
[Abiteboul et al. 1997] proposed a declarative framework for data translation
[Davidson et al. 1997] focused on constraint satisfaction
[Milo et al. 1998] leveraged a library of transformation rules and pattern-matching techniques
[Cluet et al. 1998] emphasized type-checking
[Beeri et al. 1999] focused on tree-based transformations for XML data structures
A Data Transfer Example
Source: Rcd
  projects: Set of
    project: Rcd
      name
      status
  grants: Set of
    grant: Rcd
      gid
      project
      recipient
      manager
      supervisor
  contacts: Set of
    contact: Rcd
      cid
      email
      phone
  companies: Set of
    company: Rcd
      name
      official

Target: Rcd
  projects: Set of
    project: Rcd
      code
      funds: Set of
        fund: Rcd
          fid
          finId
  finances: Set of
    finance: Rcd
      finId
      mPhone
      company
  companies: Set of
    company: Rcd
      coid
      name

Source Instance

Projects
  name        status
  PIX         Active
  E-services  Active
  Clio        Inactive

Grants
  gid  project     recipient  manager      supervisor
  g1   PIX         AT&T       Fernandez    Belanger
  g2   PIX         AT&T       Shrivastava  Belanger
  g3   E-services  Bell-labs  Benedikt     Hull

Contacts
  cid          email                    phone
  Benedikt     [email protected]     5827766
  Hull         [email protected]         5824509
  Shrivastava  [email protected]  3608776
  Belanger     [email protected]     3608600
  Fernandez    [email protected]        3608679

Companies
  name    official
  AT&T    AT&T Research Labs
  Lucent  Lucent Technologies, Bell Labs Innovations
Desired Target Instance
(Source and target schemas as in the data transfer example.)

Target Instance

Projects
  code: PIX
    Funds
      fid  finId
      g1   ???
      g2   ???
  code: E-services
    Funds
      fid  finId
      g3   ???

Finances
  finId  mPhone   company
  ???    3608679  ???
  ???    3608776  ???
  ???    5827766  ???

Companies
  coid         name
  Sk2(AT&T)    AT&T
  Sk2(Lucent)  Lucent
  ???          ???
  ???          ???
  ???          ???
The Needed Transformation Query

LET $doc0 := document("inputXMLfile")
RETURN
<T> {
  distinct-values (
    FOR $x0 IN $doc0/S/grant, $x1 IN $doc0/S/project,
        $x2 IN $doc0/S/contact, $x3 IN $doc0/S/contact
    WHERE $x2/cid/text() = $x0/manager/text()
      AND $x0/supervisor/text() = $x3/cid/text()
      AND $x0/project/text() = $x1/name/text()
    RETURN
    <project>
      <code> { $x0/project/text() } </code>
      { distinct-values (
          FOR $x0L1 IN $doc0/S/grant, $x1L1 IN $doc0/S/project,
              $x2L1 IN $doc0/S/contact, $x3L1 IN $doc0/S/contact
          WHERE $x2L1/cid/text() = $x0L1/manager/text()
            AND $x0L1/supervisor/text() = $x3L1/cid/text()
            AND $x0L1/project/text() = $x1L1/name/text()
            AND $x0/project/text() = $x0L1/project/text()
          RETURN
          <funding>
            <fid> { $x0L1/gid/text() } </fid>
            <finId> { "Sk52(", $x0L1/gid/text(), ", ", $x0L1/project/text(), ")" } </finId>
          </funding> ) }
    </project> ) }
  { distinct-values (
      FOR $x0 IN $doc0/S/grant, $x1 IN $doc0/S/project,
          $x2 IN $doc0/S/contact, $x3 IN $doc0/S/contact
      WHERE $x2/cid/text() = $x0/manager/text()
        AND $x0/supervisor/text() = $x3/cid/text()
        AND $x0/project/text() = $x1/name/text()
      RETURN
      <finance>
        <finId> { $x0/gid/text() } </finId>
        <mPhone> { $x2/phone/text() } </mPhone>
        <company> { "Sk46(", $x2/phone/text(), ", ", $x0/gid/text(), ")" } </company>
      </finance> ) }
  { distinct-values (
      FOR $x0 IN $doc0/S/grant, $x1 IN $doc0/S/project,
          $x2 IN $doc0/S/contact, $x3 IN $doc0/S/contact
      WHERE $x2/cid/text() = $x0/manager/text()
        AND $x0/supervisor/text() = $x3/cid/text()
        AND $x0/project/text() = $x1/name/text()
      RETURN
      <company>
        <coid> { "Sk46(", $x2/phone/text(), ", ", $x0/gid/text(), ")" } </coid>
        <name> { "Sk49(", $x2/phone/text(), ", ", $x0/gid/text(), ")" } </name>
      </company> ) }
  { distinct-values (
      FOR $x0 IN $doc0/S/company
      RETURN
      <company>
        <coid> { "Sk93(", $x0/cname/text(), ")" } </coid>
        <name> { $x0/cname/text() } </name>
      </company> ) }
</T>
The Road To Mapping Systems
The design of data transformations has long been a manual task
Designers had to be familiar with the transformation language
As schemas became larger and more complex, the task became too laborious, time-consuming, and error-prone
The need to raise the level of abstraction and to automate the task was soon realized
The idea: Mapping Systems
Generating Mappings
Different techniques exist to generate mappings:
  Manual, e.g.,
    by means of high-level mapping languages [Bernstein et al. 2007]
    by means of sophisticated user interfaces [Altova 2008]
  Semi-automatic, e.g.,
    by means of designer guidance [Alexe et al. 2008]
    via advanced algorithms that do the reasoning instead of the mapping designer [Madhavan et al. 2001] [Popa et al. 2002] [Do et al. 2002] [Bonifati et al. 2008]
The First Step of a Mapping Task
[Figure: the source and target schemas and the source instance of the data transfer example, shown side by side.]
Matching
Given two schemas as input, a source and a target schema, matching is a process that produces as output a set of matches, also called correspondences or (simply) lines, between the elements of the two schemas.
A match is a triple <Es, Et, e>, where Es is a set of elements of the source schema, Et is a set of elements of the target schema, and e specifies a simple relationship (equality or set inclusion) or a complex relationship between the elements of Es and Et.
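As a concrete (hypothetical) illustration, the match triple <Es, Et, e> could be encoded as follows; the class and field names are ours, not from any particular matching tool:

```python
from dataclasses import dataclass

# A match is a triple <Es, Et, e> relating a set of source elements Es
# to a set of target elements Et through a relationship label e.
@dataclass(frozen=True)
class Match:
    source_elements: frozenset   # Es: elements of the source schema
    target_elements: frozenset   # Et: elements of the target schema
    relationship: str            # e: "=", "subset", or a complex expression

# A 1-1 equality match and a 2-1 complex match (illustrative values):
m1 = Match(frozenset({"Name"}), frozenset({"Title"}), "=")
m2 = Match(frozenset({"firstname", "lastname"}), frozenset({"name"}), "concat")

matches = {m1, m2}   # the matcher's output: a set of matches
```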
The Matching Relationship e
Depends on the cardinalities of Es and Et
Depends on the semantics:
Can be a function
Can be an arithmetic operation
Can be a set-theoretic relation (e.g. ≡,overlaps)
Can be a conceptual modeling relationship (e.g. part-of, subclass-of)
Matching: An Alternative Definition
The matching process [Euzenat et al. 2007] can be seen as a function f from a pair of schemas S and T, an optional input alignment A, a set of matching parameters p and a set of resources r:
A’ = f(S, T, A?, p, r)
Ultimately, an alignment is a set of correspondences between elements in S and elements in T
Matching Examples
Simple relationships:
  Name ≡ Title
  Location ≡ Address
Complex relationships:
  speed = velocity × 2.237
  speed × 0.447 = velocity
  speed = concat(velocity × 2.237, 'MPH')
  speed ≥ velocity
Source: Company { Name, Location }    Target: Organization { Title, Address }
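A complex relationship such as speed = velocity × 2.237 can be carried by the match as an executable value transformation. A minimal sketch (the function names are ours; 2.237 converts m/s to MPH):

```python
# Executable forms of the complex correspondences above (illustrative only).
def speed_from_velocity(velocity_ms: float) -> float:
    """Apply the complex match expression: speed (MPH) = velocity (m/s) * 2.237."""
    return velocity_ms * 2.237

def speed_label(velocity_ms: float) -> str:
    """The concat variant: speed = concat(velocity * 2.237, 'MPH')."""
    return f"{velocity_ms * 2.237:g} MPH"
```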
The matching process
Can be roughly divided into three steps:
  Pre-match: training of classifiers for machine-learning-based matchers, setting of matching parameters (weights, thresholds), and adjustment of resources such as thesauri and constraints
  Match: the actual matching task
  Post-match: the user may check and modify the displayed matches
Some Schema Matchers
Cupid [Madhavan et al. 2001] : based on structural and name similarity
S-Match [Giunchiglia et al. 2004]: based on semantic closeness
Coma++ [Aumueller et al. 2005]: based on matching reuse
LSD [Doan et al. 2001]: based on data value analysis and machine-learning techniques
iMap [Dhamankar et al. 2004]: suited for complex match expressions e
Similarity Flooding [Melnik et al. 2002]: based on graph similarity
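To give the flavor of the name-similarity component used by matchers such as Cupid, here is a toy matcher; it is purely illustrative (a difflib ratio with an arbitrary threshold) and far simpler than any of the systems above:

```python
from difflib import SequenceMatcher

# Toy name-based matcher: score element-name pairs and keep those
# above a similarity threshold. Real systems combine many heuristics.
def name_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_by_name(source_elems, target_elems, threshold=0.6):
    return sorted(
        (s, t, round(name_similarity(s, t), 2))
        for s in source_elems for t in target_elems
        if name_similarity(s, t) >= threshold
    )

# Only the (phone, mPhone) pair survives the threshold here.
pairs = match_by_name({"phone", "companyName"}, {"mPhone", "name"})
```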
Similarity Flooding
COMA++
Matchings Are Not Enough
[Figure: the source and target schemas and the source instance of the data transfer example, connected by correspondence lines.]

Matches alone cannot describe the full details of the transformation
The Mapping Generation Process

[Diagram: the Matcher takes the Source Schema and the Target Schema and produces Matchings.]

Matching is just the beginning of any mapping generation process
Mappings
Given the source and the target schema, mapping is a process that takes as input a set of matches between the elements of the two schemas and produces a relationship or constraint e that must hold between their respective instances.
In other words, a mapping is a triple <S, T, e>, where S is the source schema, T is the target schema, and e specifies either a constraint that any instances adhering to S and T must satisfy, or an executable statement to transform instances of S into instances of T.
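A mapping whose constraint e is a source-to-target dependency could be encoded as follows; this is an illustrative sketch, and the Atom/Tgd names are ours:

```python
from dataclasses import dataclass

# A relational atom such as project(na, st).
@dataclass(frozen=True)
class Atom:
    relation: str
    args: tuple          # variable names

# A source-to-target dependency: a conjunction of source atoms
# implies a conjunction of target atoms.
@dataclass(frozen=True)
class Tgd:
    lhs: tuple           # phi(x): source atoms
    rhs: tuple           # psi(x, y): target atoms

    def __str__(self):
        fmt = lambda atoms: " AND ".join(
            f"{a.relation}({', '.join(a.args)})" for a in atoms)
        return f"{fmt(self.lhs)} -> {fmt(self.rhs)}"

# A one-atom fragment of the mapping example above.
m = Tgd(lhs=(Atom("project", ("na", "st")),),
        rhs=(Atom("project", ("na", "FUND")),))
```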
A Mapping Example
(Source and target schemas as in the data transfer example.)

project(na,st) ∧ grant(gid,na,re,ma,su) ∧ contact(ma,em,ph)
  → project(na,FUND) ∧ fund(gid,finId) ∧ finance(finId,ph,company) ∧ company(company,name)
Mappings & Instances
Mappings are the basic ingredients of many tasks, such as information integration, P2P query answering, data exchange, etc.
In particular, mappings as inter-schema constraints may not be enough to fully specify a unique target instance: there may exist multiple target instances satisfying the mappings.
Finding the best target instance is the goal of the data exchange problem [Fagin et al. 2005]: the mapping is converted into an executable transformation script to obtain that particular instance.
A Data Exchange Example

(Source schema S and target schema T as in the data transfer example.)

project(na,st) ∧ grant(gid,na,re,ma,su) ∧ contact(ma,em,ph)
  → project(na,FUND) ∧ fund(gid,finId) ∧ finance(finId,ph,company) ∧ company(company,name)

Target Instance

Projects
  code: PIX
    Funds
      fid  finId
      g1   ???
      g2   ???
  code: E-services
    Funds
      fid  finId
      g3   ???

Finances
  finId  mPhone   company
  ???    3608679  ???
  ???    3608776  ???
  ???    3608600  ???

Companies
  coid  name
  ???   AT&T
  ???   Lucent
The Mapping Generation Process

[Diagram: the Matcher takes the Source Schema and the Target Schema and produces Matchings; with user input, the Mapping Generation Engine turns the matchings into Mappings (Dependencies); the Query Engine compiles these into Transformation Scripts; finally, the Data Exchange Engine applies the scripts to the Source Instance to produce the Target Instance.]
Research Prototype Systems
Mapping generation and data exchange are separate tasks Clio[Popa et al. 2002], HePToX[Bonifati et al. 2010],
Spicy[Bonifati et al. 2008]
Mappings Generation the mappings are expressed as high-level assertions in a
logical formalism
A mapping is a source-to-target tuple-generating dependency (or s-t tgd in short)
𝜙 𝑥 → ∃ 𝜓 𝑥, 𝑦 where
φ(x) (ψ(x,y), resp.) is a conjunction of atoms over the source (target, resp.)
Data exchange The respective module transforms the high-level mappings
into transformation scripts (in SQL or XQuery) to generate the target instance.
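As a rough sketch of this compilation step, assuming a purely relational setting with a single target atom and no existential variables (all names are ours, not any system's actual API):

```python
# Compile a simplified relational s-t tgd into an executable SQL string:
# the source atoms become the FROM clause, the join conditions the WHERE
# clause, and the target atom an INSERT ... SELECT.
def tgd_to_sql(src_atoms, tgt_relation, tgt_cols, join_cond="1=1"):
    from_clause = ", ".join(f"{rel} AS {alias}" for rel, alias in src_atoms)
    select = ", ".join(tgt_cols)
    return (f"INSERT INTO {tgt_relation} "
            f"SELECT {select} FROM {from_clause} WHERE {join_cond}")

# A fragment of the running example: populate fund from grant x project.
sql = tgd_to_sql([("grant", "g"), ("project", "p")],
                 "fund", ["g.gid", "p.name"], "g.project = p.name")
```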
Clio
Spicy
HepToX
Commercial Mapping Systems

[Diagram: the same pipeline, but with a single Mapping Engine in place of the separate mapping generation and data exchange stages.]

Mapping generation and data exchange are merged into one step: the system directly creates the final transformation script in some native language.
Popular Commercial Systems
Altova Mapforce
Stylus Studio
IBM Rational Data Architect
BizTalk mapper
Adeptia
BEA Aqualogic
Stylus Studio
Altova MapForce
Adeptia
BizTalk Mapper
IBM Rational Data Architect
A Mapping Tool Categorization

All tools provide the mapping designer with:
  A graphical representation of the two schemas
  A set of graphical transformation constructs
The granularity and power of these constructs is a main factor of differentiation among the tools.

[Spectrum: from detailed specification by the designer (roughly, commercial mapping tools) to reliance on the intelligence of the mapping tool, traded against effort in post-verification (roughly, research prototypes).]
Issues in Data Exchange
When multiple target instances exist, how do we compute the best one?
Is a given target instance better than another?
Universal solutions
Introduced in [Fagin et al. 2005]
These are the “most general” target instances, and also represent the entire space of solutions
Among the universal solutions, the smallest of all and the most compact one is called the “core”
Universal Core Instances

S: Rcd
  PTStud: Rcd { age, name }
  GradStud: Rcd { age, name }

T: Rcd
  Advised: Rcd { sname, facid }
  WorksWith: Rcd { sname, facid }

Dependencies:
  PTStud(x,y) → ∃z Advised(y,z)
  GradStud(x,y) → ∃z (Advised(y,z) ∧ WorksWith(y,z))

Source instance:
  PTStud:   (27, Bob), (30, Ann)
  GradStud: (32, John), (30, Ann)

A solution:
  Advised:   (Bob, N3), (Ann, N4), (John, N1), (N1, Cathy)
  WorksWith: (Bob, N3), (Ann, N4)

A universal solution:
  Advised:   (Bob, N3), (Ann, N4), (John, N1), (N2, Ann)
  WorksWith: (Bob, N3), (Ann, N4)

The core:
  Advised:   (Bob, N3), (Ann, N4), (John, N1)
  WorksWith: (Bob, N3), (Ann, N4)
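A universal solution for dependencies like these can be produced by a naive chase. The sketch below is illustrative only, not a real data exchange engine: it invents its own labeled-null names (so they need not coincide with those on the slide), and it does not minimize the result to the core:

```python
from itertools import count

# Naive chase for the two dependencies:
#   PTStud(x,y)  -> EXISTS z Advised(y,z)
#   GradStud(x,y) -> EXISTS z (Advised(y,z) AND WorksWith(y,z))
# Existential variables are satisfied with fresh labeled nulls N1, N2, ...
def chase(ptstud, gradstud):
    nulls = (f"N{i}" for i in count(1))
    advised, workswith = [], []
    for _age, name in ptstud:
        advised.append((name, next(nulls)))
    for _age, name in gradstud:
        z = next(nulls)                 # one shared null for both atoms
        advised.append((name, z))
        workswith.append((name, z))
    return advised, workswith

advised, workswith = chase([(27, "Bob"), (30, "Ann")],
                           [(32, "John"), (30, "Ann")])
```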
Commercial vs. Research Prototype Systems
Research prototypes (e.g., Clio, Spicy) tend to produce target instances that look more and more like the core, whereas commercial tools leave the task to the users, who have to manually interact with sophisticated GUIs and write pieces of the transformation by hand. No core definition is even considered.
Mixing Matching and Mapping
Matching and mapping are not always done by separate tools:
  Clio has, as an add-on, a matcher based on attribute feature analysis [Naumann et al. 2002]
  Bernstein's model management considers the matcher a fully integrated and indistinguishable component
  Spicy [Bonifati et al. 2008] has a matcher based on instance-based structural analysis
Limitations of Current Systems
Manual approaches are not applicable to large-scale mapping tasks
The user/developer has to become familiar with the mapping language and the user interfaces
The outcome of the mapping process may not respect the user requirements and desired semantics (unsurprisingly!)
Specifications may be incomplete and dependent on system peculiarities
Thus, there is a need for a verification and guidance process
The Verification Process

[Diagram: the mapping generation pipeline, extended with a verification-and-selection step in which the user supplies data examples and an expected target instance to check the produced mappings and the generated target instance.]
A-Posteriori Verification
The main problem with matching and mapping is the dichotomy between the expected results and the generated answers.
Some tools allow a post-verification:
  by using data examples: Tupelo [Fletcher et al. 2006], Muse [Alexe et al. 2008], Clio [Alexe et al. 2010]
  by using automatic instance comparison: Spicy [Bonifati et al. 2008]
  by means of manual user feedback: unfeasible for large-scale tasks
  via debugging techniques: Routes [Chiticariu et al. 2006]
ETL systems
Extract-Transform-Load tools are data transformation tools based on graphical flowcharts, with nodes encoding transformation primitives and edges encoding the transformation flow.
They can be considered a special form of mapping system. They generate transformations through:
  a GUI
  an intermediate language (an algebra for ETL)
  an output (transformation scripts)
They are not mapping tools in the classical sense: they focus only on data transformation operators.
[Figure: an ETL data flowchart. Numbered nodes apply transformation primitives such as Not Null(CustKey), SK(custkey), PhoneFormat, and New - Old to the inputs Customer.new and Customer.old, producing the outputs Cnew and Cold; tuples failing a check are routed to an Error output.]

An ETL data flowchart
How Can One Decide If A Product is Good ?
Importance Of A Benchmark
A benchmark helps:
  Designers and developers to improve their tools, by assessing their usefulness and constantly evaluating their performance
  Users to compare the different available tools and evaluate their suitability for their needs [Haas et al. 2007]
  Researchers to compare their systems to others
It should exist for a long term, to allow adequate measurement of the evolution of the field
It helps assess absolute results: the properties of the results, and how they compare to others
Benchmark
Well-designed tests (scenarios) with which the results of a system can be evaluated [Castro et al. 2004]
A standardized application scenario that serves as a basis for testing, evaluation, and comparison [Merriam-Webster]
Clearly specified scenarios that everyone can implement
Clearly specified factors that are measured, and the conditions under which they should be measured
Should measure the degree of achievement
Should be reproducible and stable
Can be used repeatedly
Principles
Systematic Procedure
Continuity
Quality and equity
Dissemination
Intelligibility
Types of evaluation
Competence benchmarks
  Measure competence and performance with respect to a task
  Aim at characterizing the kind of tasks each method is good for
  For designers to improve their systems
Comparative evaluation
  Comparison of the results of various systems on a common task
  Aims at finding the best system (tuning of the systems is an issue)
  Comparison of systems, aiming at general field improvement
Application-specific evaluation
  Comparison of various systems on a specific task
Competitive evaluation
Evaluation Steps
Planning
Specifying task, software, hardware, input, output
Processing
Analysis
Result evaluation according to predefined measures
Bottom Line: Benchmarks Are Great !
Generic Matching/Mapping Benchmark Goals
Compare in terms of
Performance
Usability
Effectiveness
Applicability to real-world scenarios
Improve the quality of the matching and mapping generation process
Query vs Matching/Mapping Benchmarks

Query Benchmark:
  Evaluation scenarios: a setting (a database instance/schema + a query) and the expected outcome
  The query engine should support the scenarios (mainly, it should be able to evaluate the query of each scenario)
  Supporting a scenario: query engine result = expected outcome
  Good query engine = fast (correct) responses
  Vary the characteristics of the data instance to measure how well the engine scales

Schema Mapping Benchmark:
  For a mapping system, what is the input and what is the output?
  A mapping tool's input language should be able to express the transformations of interest. What are they?
  How do you compare a mapping system's output with an expected output?
  What do we measure? Effort? Expressiveness?
  What do we scale?
The Scenario Input
Source schema S
Target schema T
Possibly an instance of the source schema S
A specification of what we need to achieve

Matching systems:
  Typically there is no specification: just the source and the target schema
  A complete set of matches is assumed as correct

Mapping systems:
  The specification itself is a major issue …
The Specification for Mapping Systems
An expected (desired) transformation
  Mapping systems try to guess it
Issues:
  No formal semantic framework to express it
  No formal relationship to the outcome
Note: query engine benchmarks (e.g., TPC-H or XMark) leverage the semantics of the query language
  It is clear what the scenario is asking
  It is clear how to compare the result sets
Expressing Desired Transformations
Natural language
  Too generic and ambiguous
A complete specification formalism (a query?)
  Defeats the purpose of a mapping system
  Comparing the generated mapping of the tool to the precise specification amounts to checking the equivalence of two mappings (a hard problem!)
Expressing Desired Transformations
Graphical interface
  Different constructs are exposed by different tools
  Typically a GUI for the query language
  Continuously evolving
Simple specification (correspondences?)
  1-1, many-1, between atomic or complex elements, nested, with or without annotations, GUI constructs
  Can get so complex that they become the same as the actual mappings
  Ambiguous: the same set of correspondences is interpreted differently by different mapping tools. Without a standard way to interpret them? Risky!
A Simple Ambiguous Scenario

Source: Company { Name, Location }    Target: Organization { Title, Address }

Different interpretations may arise from a simple "copy" scenario:

<Source>
  <Company>
    <Name>IBM</Name>
    <Location>NY</Location>
  </Company>
  <Company>
    <Name>MS</Name>
    <Location>WA</Location>
  </Company>
</Source>

<Target>
  <Organization>
    <Title>IBM</Title>
    <Title>MS</Title>
    <Address>NY</Address>
    <Address>WA</Address>
  </Organization>
</Target>
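The ambiguity can be made concrete with a small sketch: the same two correspondences (Name to Title, Location to Address) admit one reading that groups all values under a single Organization and another that creates one Organization per Company. Python dictionaries stand in for the XML here; the function names are ours:

```python
# Source instance: two Company records.
companies = [{"Name": "IBM", "Location": "NY"},
             {"Name": "MS", "Location": "WA"}]

def grouped(src):
    """Reading 1: one Organization collecting all Titles and Addresses."""
    return [{"Title": [c["Name"] for c in src],
             "Address": [c["Location"] for c in src]}]

def per_company(src):
    """Reading 2: one Organization per source Company."""
    return [{"Title": c["Name"], "Address": c["Location"]} for c in src]
```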
A Simple Ambiguous Scenario

Source: Company { Name, Location }    Target: Organization { Title, Address }

Different tools might generate different instances from the same arrows:

<Source>
  <Company>
    <Name>IBM</Name>
    <Location>NY</Location>
  </Company>
  <Company>
    <Name>MS</Name>
    <Location>WA</Location>
  </Company>
</Source>

<Target>
  <Organization>
    <Title>IBM</Title>
    <Address>NY</Address>
  </Organization>
</Target>
A Simple Ambiguous Scenario

Source: Company { Name, Location }    Target: Organization { Title, Address }

Arrows between non-leaf nodes are not allowed in all tools:

<Source>
  <Company>
    <Name>IBM</Name>
    <Location>NY</Location>
  </Company>
  <Company>
    <Name>MS</Name>
    <Location>WA</Location>
  </Company>
</Source>

<Target>
  <Organization>
    <Title>IBM</Title>
    <Address>NY</Address>
  </Organization>
  <Organization>
    <Title>MS</Title>
    <Address>WA</Address>
  </Organization>
</Target>
The Issue of The User Input
Many tools allow the mapping designer to manually edit the generated transformation
  With power equal to that of the underlying language
  With shortcuts and different abstraction levels
The Scenario Output
The output has to be correct:
  Satisfy the desired transformation
  Compare to the expected transformation

For matching: a set of matches
For mapping: it is not clear what the output is
  The transformation scripts?
  The transformed data? The same data may be generated by different mappings
Evaluation Challenges
What are we testing?
Expressiveness
Performance
of the tool?
of the generated mappings?
Quality
of the generated mappings?
of the integrated schema?
of the target data? [Dong et al. 2009]
User effort
Heavily depends on the mapping interface
Measuring these factors is hard without a formal (and standard) agreement on expressing specifications
Matching vs. Mapping Benchmarking
Matching system evaluation is typically a set comparison
  Considers only the semantics of the schemas
  More automatic
Mapping system evaluation is more challenging
  Considers the semantics of both schemas and transformations
  Requires more human intervention
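A minimal sketch of such a set comparison, scoring a produced match set against a gold standard with precision and recall (the function name and example matches are ours):

```python
# Matching evaluation as set comparison: precision is the fraction of
# produced matches that are correct, recall the fraction of expected
# matches that were found.
def precision_recall(produced: set, expected: set):
    if not produced or not expected:
        return 0.0, 0.0
    tp = len(produced & expected)
    return tp / len(produced), tp / len(expected)

gold = {("Name", "Title"), ("Location", "Address")}
found = {("Name", "Title"), ("Name", "Address")}
p, r = precision_recall(found, gold)   # p = 0.5, r = 0.5
```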
The User Can Also Help …
By being presented with …
  The mappings: difficult to overcome the heterogeneity of languages
  The generated target instance: not feasible for large and complex instances [Velegrakis et al. 2005]
  A representative sample of the target instance: an appealing alternative based on positive/negative examples, but still in its infancy [Alexe et al. 2008] (presented in detail later on)
How do all these get into the evaluation function?
A Common Design Pattern
Example: TPC-H
Sets of test cases (with their expected output)
Find those that the system can successfully execute
Characterize the system accordingly
For matching/mapping: sets of matching or mapping scenarios
Examples of Data Sets
Public and well-designed schemas
Meaningful overlap
Limited by the existence of real schemas
Need to be discriminating
Who is the Oracle? External knowledge or a human?
Large Scale Ontology Sets
[Zhang et al. 2004]
Two large ontologies from the anatomy domain
Foundational Model of Anatomy & GALEN
Thousands of classes, no instances
[Lambrix et al. 2003]
Gene & Signal Ontology
Partial overlap
OAEI Data
Artificial data set
33 classes, 64 properties, 76 individuals
Initial ontology distorted
Result: ~50 pairs of ontologies
Correct by construction
Data Set Factors Affecting The Evaluation
Heterogeneity of the modeling language (schemas/ontologies)
The language itself
Number of schemas (1-to-1 or many-to-1)
From-scratch matching, or with a head start
Multiplicity: how many elements in one schema can match how many in the other
Are oracles permitted? Is user input permitted?
Can there be a-priori training?
External methods and auxiliary inputs
Is justification of the output needed?
Relations of the correspondences (only = or others as well)
Is there a time limit for the matching/mapping?
Can the matching be on leaves only, or not?
83 EDBT 2011, A. Bonifati & Y. Velegrakis
OAEI Evaluation Example
[Euzenat et al., 2006]
Ontology Alignment Evaluation Initiative
oaei.ontologymatching.org
Yearly contest
Participants:
Provided with OAEI API
Execute all tests
Provide their results & Paper
Make the results public
84 EDBT 2011, A. Bonifati & Y. Velegrakis
Building Large Ontology Sets
[Avesani et al. 2005]
Test sets for matching web directories and classifications
Two web directories are similar if their web pages are similar
It can be considered a matching technique by itself
85 EDBT 2011, A. Bonifati & Y. Velegrakis
Thesauri
Thesauri covering large hierarchies of concepts and textual knowledge
Digital Libraries and Museums
Large need to match them
Example:
AGROVOC (FAO): 16K terms
NAL (US Agricultural dep.): 41K terms
86 EDBT 2011, A. Bonifati & Y. Velegrakis
Various examples
Illinois Semantic Integration Archive
http://pages.cs.wisc.edu/~anhai/wisc-si-archive
Collection of different schemas & Data
Faculties
Courses
Real Estate
87 EDBT 2011, A. Bonifati & Y. Velegrakis
Real Examples Lack Systematic Design
Existing datasets are not systematically designed:
Completeness?
Correctness?
Deduplicated?
Clarity?
They are mainly testbeds or standardized tests
But not benchmarks
Benchmark tests should be:
Consistent
Complete
Minimal
88 EDBT 2011, A. Bonifati & Y. Velegrakis
Real-World Matching Problems
[Kopcke et al. 2010]
Collection of matching problems
[Giunchiglia et al. 2009]
4500 matches between 3 web directories
Error free
Low complexity
High discriminative capacity
89 EDBT 2011, A. Bonifati & Y. Velegrakis
XBenchMatch
[Duchateau et al. 2007]
Criteria for testing and evaluating matching tools
Focuses on assessment of matching tools
Quality
Time
10 Datasets for matching
Classified according to:
Data level, e.g., degree of heterogeneity
Process level, e.g., scale
90 EDBT 2011, A. Bonifati & Y. Velegrakis
STBenchmark
[Alexe et al. 2008] www.stbenchmark.org
Evaluate the effectiveness of the mapping system
Derived from real applications DBLP, BioWarehouse, …
Derived from Information Integration Literature [Lerner, 2000], [Carey, 2000],
etc.
Minimum set of transformations that should be supported
1 scenario – 1 transformation
Each scenario is described by:
Source & target schemas
Transformation query
Instance of the source schema
Captures the most practically relevant transformation cases:
Copying, Constant Value Generation, Horizontal Partition, Surrogate Key Assignment, Vertical Partition, Unnesting (Flattening), Nesting, Self-Joins, Denormalization, Keys and Object Fusion, Atomic Value Management, Aggregation, Ordering (Order By), Flipping Metadata to Data, Flipping Data to Metadata, Flipping Data to Nested Metadata
91 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenario: Copy
Source:
Protein * name accession created
Target:
Protein * Name Accession Created
for $x0 in $doc/Source/Protein return
  <Protein>
    <Name>{ $x0/name/text() }</Name>
    <Accession>{ $x0/accession/text() }</Accession>
    <Created>{ $x0/created/text() }</Created>
  </Protein>

Each scenario comes with a textual description and the transformation query
92 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenario: Value Generation
Target:
DataSet * Name LoadingDate
“SwissProt” “July 4th”
<DataSet> <Name>SwissProt</Name> <LoadingDate>July 4th</LoadingDate> </DataSet>
93 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenario: Horizontal Partitioning
Source:
gene * name type protein
Target:
Gene * Name Protein
Synonym * Name Protein
If type = "primary"

for $x0 in $doc/Source/Gene
where $x0/type/text() = "primary"
return
  <Gene>
    <Name>{ $x0/name/text() }</Name>
    <Protein>{ $x0/protein/text() }</Protein>
  </Gene>

for $x0 in $doc/Source/Gene
where $x0/type/text() != "primary"
return
  <Synonym>
    <Name>{ $x0/name/text() }</Name>
    <Protein>{ $x0/protein/text() }</Protein>
  </Synonym>
94 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenario: Surrogate Key Assignment
Source:
gene * name type protein
Target:
Gene * Name Protein WID
Synonym * Name Protein WID
If type = "primary"
Id(), Id'(): surrogate-key generation functions

for $x0 in $doc/Source/Gene
where $x0/type/text() = "primary"
return
  <Gene>
    <Name>{ $x0/name/text() }</Name>
    <Protein>{ $x0/protein/text() }</Protein>
    <WID>{ genID() }</WID>
  </Gene>

for $x0 in $doc/Source/Gene
where $x0/type/text() != "primary"
return
  <Synonym>
    <Name>{ $x0/name/text() }</Name>
    <Protein>{ $x0/protein/text() }</Protein>
    <WID>{ genID() }</WID>
  </Synonym>
95 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenario: Vertical Partition
Source:
Reaction * entry name comment orthology definition equation
Target:
Reaction * Entry Name Comment CoFactor
ChemicalInfo * Orthology Definition Equation CoFactor
Labeled Nulls
for $x0 in $doc/Source/Reaction
let $id := genID()
return (
  <Reaction>
    <Entry>{ $x0/entry/text() }</Entry>
    <Name>{ $x0/name/text() }</Name>
    <Comment>{ $x0/comment/text() }</Comment>
    <CoFactor>{ $id }</CoFactor>
  </Reaction>,
  <ChemicalInfo>
    <Orthology>{ $x0/orthology/text() }</Orthology>
    <Definition>{ $x0/definition/text() }</Definition>
    <Equation>{ $x0/equation/text() }</Equation>
    <CoFactor>{ $id }</CoFactor>
  </ChemicalInfo>
)

Normalization. Note that no key information is assumed; as such, duplication is allowed
96 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenario: Join Path Selection
Target:
Taxon * Id Name UniqueName Class Parent Rank EmblCode
Source:
Name * id name uniqueName class
Node * taxid parentId rank emblCode
for $x0 in $doc/Source/Name, $x1 in $doc/Source/Node
where $x0/id/text() = $x1/taxId/text()
return
  <Taxon>
    <Id>{ $x0/id/text() }</Id>
    <Name>{ $x0/name/text() }</Name>
    <UniqueName>{ $x0/uniqueName/text() }</UniqueName>
    <Class>{ $x0/class/text() }</Class>
    <Parent>{ $x1/parentId/text() }</Parent>
    <Rank>{ $x1/rank/text() }</Rank>
    <EmblCode>{ $x1/emblCode/text() }</EmblCode>
  </Taxon>
Denormalization
97 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenario: Cyclic Joins
Source:
Gene * name type protein
Target:
Gene * Name Protein
Synonym * Name GeneWID
If type = "primary"

for $x0 in $doc/Source/Gene
where $x0/type/text() = "primary"
return (
  <Gene>
    <Name>{ $x0/name/text() }</Name>
    <Protein>{ $x0/protein/text() }</Protein>
  </Gene>,
  for $x1 in $doc/Source/Gene
  where $x1/type/text() != "primary"
    and $x1/protein/text() = $x0/protein/text()
  return
    <Synonym>
      <Name>{ $x1/name/text() }</Name>
      <GeneWID>{ $x0/name/text() }</GeneWID>
    </Synonym>
)
98 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenario: Un-Nesting Structures
Target:
Publication * Title Year PublishedIn Name
Source:
Reference * title year publishedIn Author * name
for $x0 in $doc/Source/Reference,
    $x1 in $x0/Author
return
  <Publication>
    <Title>{ $x0/title/text() }</Title>
    <Year>{ $x0/year/text() }</Year>
    <PublishedIn>{ $x0/publishedIn/text() }</PublishedIn>
    <Name>{ $x1/name/text() }</Name>
  </Publication>
99 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenario: Nesting Structures
Target:
Period * Year Author * Name Publication * Title PublishedIn
Source:
Publication * title year publishedIn name
for $x0 in distinct-values($doc/Source/Publication/year)
return
  <Period>
    <Year>{ $x0 }</Year>
    {
      for $x1 in distinct-values($doc/Source/Publication[year=$x0]/name)
      return
        <Author>
          <Name>{ $x1 }</Name>
          {
            for $x2 in $doc/Source/Publication
            where $x2/year/text() = $x0 and $x2/name/text() = $x1
            return
              <Publication>
                <Title>{ $x2/title/text() }</Title>
                <PublishedIn>{ $x2/publishedIn/text() }</PublishedIn>
              </Publication>
          }
        </Author>
    }
  </Period>
100 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenario: Keys & Object Fusion
Target:
Experiment * Contact Date Description ExperimentalData * Data Role
Source:
Experiment * eid contact date description ExperimentalData * data role
FlowCytometrySample id contact date Probe * data type
<Source2>{
  for $x0 in $doc/Source/Experiment, $x1 in $x0/ExperimentalData
  return
    <Datum>
      <id>{ genID($x0/contact/text(), $x0/date/text()) }</id>
      <Contact>{ $x0/contact/text() }</Contact>
      <Date>{ $x0/date/text() }</Date>
      <Description>{ $x0/description/text() }</Description>
      <Data>{ $x1/data/text() }</Data>
      <Role>{ $x1/role/text() }</Role>
    </Datum>,
  for $x0 in $doc/Source/FlowCytometrySample, $x1 in $x0/Probe
  return
    <Datum>
      <id>{ genID($x0/contact/text(), $x0/date/text()) }</id>
      <Contact>{ $x0/contact/text() }</Contact>
      <Date>{ $x0/date/text() }</Date>
      <Data>{ $x1/data/text() }</Data>
      <Role>{ $x1/type/text() }</Role>
    </Datum>
}</Source2>

for $x0 in distinct-values($doc/Source2/Datum/id)
return
  <Experiment>{
    for $x1 in ($doc/Source2/Datum[id=$x0])[1]
    return ( $x1/Contact, $x1/Date, $x1/Description ),
    for $x3 in $doc/Source2/Datum
    where $x3/id/text() = $x0
    return
      <ExperimentalData>{ $x3/Data, $x3/Role }</ExperimentalData>
  }</Experiment>
101 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenario: Atomic Value Manipulation
Target:
Contact * FirstName LastName Address Phone
Source:
Contact * name address street city zip phone
GetFirstName(…)
Type Discrepancy Handling
for $x0 in $doc/Source/Contact
return
  <Contact>
    <FirstName>{ GetFirstName($x0/name/text()) }</FirstName>
    <LastName>{ GetLastName($x0/name/text()) }</LastName>
    <Address>{ Concat($x0/street/text(), $x0/city/text(), $x0/zip/text()) }</Address>
    <Phone>{ String2Int($x0/phone/text()) }</Phone>
  </Contact>
102 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenarios: Aggregation & Order
103 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenarios: Data Meta-data
104 EDBT 2011, A. Bonifati & Y. Velegrakis
Thalia
[Hammer et al. 2005]
Integration tools benchmark
Rich set of test data
Source schemas
Syntactic and semantic heterogeneity
12 test queries for the integrated schema
105 EDBT 2011, A. Bonifati & Y. Velegrakis
Real Data Is Not Enough for Benchmarking...
… but definitely not for the reason Dilbert thinks !
106 EDBT 2011, A. Bonifati & Y. Velegrakis
Talk Outline
Introduction
Matching and mapping: techniques & tools
Benchmarks and evaluation principles
Designing a matching & mapping benchmark
Using real-world scenarios for evaluation
Generating synthetic evaluation scenarios
Measuring efficiency and effectiveness
Conclusions and future directions
107 EDBT 2011, A. Bonifati & Y. Velegrakis
Why Synthetic Data & Scenarios
To stress test the system
To understand performance in diverse situations
To create additional realistic test cases
To ensure that unforeseen situations are also tested
108 EDBT 2011, A. Bonifati & Y. Velegrakis
Top-down Scenario Construction
Start with a big schema and divide/extract
TaxME2 [Giunchiglia et al. 2009]: preserves correctness, complexity, performance
[Okawara et al. 2006]: techniques for how such a benchmark should be built
[Hammer et al. 2005] – Thalia: large dataset + filters
[Lee et al. 2007] – eTuner: duplicate the schema, split the data in two, modify the first half
Limited kinds of modifications, so not very natural
109 EDBT 2011, A. Bonifati & Y. Velegrakis
Bottom-up Scenario Construction
Create the schema from scratch
STBenchmark [Alexe et al. 2008]
Schema Generator
Expands basic scenarios
Changing basic characteristics of the scenario
Data Generator
Hand-in-hand with the Schema Generator
ToXGen [Barbosa et al. 2002]
110 EDBT 2011, A. Bonifati & Y. Velegrakis
Parameters of Schema Generation
• Number of subelements
• Nesting depth
• Join size
• Join width
• Join kind (star / chain)
• Function arity
f(…)
Source R1 [0…*] A1 A2 A3
R2 [0…*] A4 A5
R3 [0…*] A6
R4 [0…*] A7 A8 A9
R5 [0…*] A10 A11
R6 [0…*] A12
Parameter values: sampled from normal distributions given by an average and a standard deviation
111 EDBT 2011, A. Bonifati & Y. Velegrakis
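The sampling step above can be sketched in a few lines of Python. The parameter names mirror the slide's list, but the average/deviation values are illustrative assumptions, not the generator's actual defaults:

```python
import random

# Each schema-generation parameter is drawn from a normal distribution
# specified as (average, standard deviation). Values are hypothetical.
PARAMS = {
    "num_subelements": (3.0, 1.0),
    "nesting_depth":   (2.0, 0.5),
    "join_size":       (2.0, 1.0),
    "join_width":      (2.0, 0.5),
    "function_arity":  (1.5, 0.5),
}

def sample_parameters(params, rng):
    """Draw one value per parameter; clamp to >= 1 and round to an int."""
    return {name: max(1, round(rng.gauss(avg, std)))
            for name, (avg, std) in params.items()}

config = sample_parameters(PARAMS, random.Random(42))
print(config)
```

Re-sampling with different seeds yields a family of structurally different, yet statistically comparable, scenarios.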
Stretching The Unnesting Scenario
Source Reference [0…*] title year publishedIn Author [0…*] name Affiliation [0…*] university country Students [0…*] sname
Target Publication [0…*] Title Year PublishedIn Name University Country StudentName
Unnesting
Basic Scenario
112 EDBT 2011, A. Bonifati & Y. Velegrakis
Stretching The Horizontal Partitioning
Vary the number of partitions
Vary the number of elements that exist in each partition
Source:
gene * name type protein hpAttr1 hpAttr2
Target:
Gene * Name Protein HpAttr1 HpAttr2
Synonym * Name Protein HpAttr1 HpAttr2
HPRel1 * Name Protein HpAttr1 HpAttr2
113 EDBT 2011, A. Bonifati & Y. Velegrakis
Combining Mapping Scenarios
Lack of diversity → combine scenarios
Based on a set of configuration parameters, generate a complex mapping scenario by concatenating scaled-up mapping scenarios
S1 T1
S2 T2
P1
P2
Concatenation of (S1, T1, P1) and (S2, T2, P2)
114 EDBT 2011, A. Bonifati & Y. Velegrakis
Stretched Mapping Scenarios
Source Reference [0…*] title year publishedIn Author [0…*] name Affiliation [0…*] university country Students [0…*] sname
Target Publication [0…*] Title Year PublishedIn Name University Country StudentName
Horizontal
Partitioning
Source Schema Target Schema
Horizontal
Partitioning
Copy
Unnesting Unnesting
Copy
Complex Mapping Scenario
115 EDBT 2011, A. Bonifati & Y. Velegrakis
Composing Mapping Scenarios
Intermix of basic mapping scenario transformations
capture cases where different types of transformations occur simultaneously on the same part of a schema
Main idea:
Generate a source schema S
Evolve S to obtain a target schema.
All based on the configuration parameters
Example next:
116 EDBT 2011, A. Bonifati & Y. Velegrakis
Composed Mapping Scenarios

Example: a generated source schema S is evolved into a target schema T, together with the transformation query P.

Source schema S:
R1 [0…*] A1 A2, SE1 [0…*] A3 A4, SE2 [0…*] A5
R2 [0…*] A7 A8, SE3 [0…*] A9, SE4 [0…*] A10 A11
R3 [0…*] A12 A13 A14

Evolution steps:
1. Unnesting: F1(A1 A2 A3 A4 A5), F2(A7 A8 A9 A10 A11), F3(A12 A13 A14)
2. Removal: F1(A1 A3 A5), F2(A7 A8 A10 A11), F3(A12 A13 A14)
3. Duplication & migration: F1(A1 A3 A5 A7b), F2(A7 A8 A10 A11 A12), F3(A13 A14)
4. Addition: F1(A1 A3 A5 A7b A15(id)), F2(A7 A8 A10 A11 A12 A16(="June")), F3(A13 A14 A17(=A3*A14))
5. Nesting (target schema T): R4 [0…*] A1 A3, SE5 [0…*] A5 A7b A15; R6 [0…*] A7 A8 A10 A11 A12 A16; R7 [0…*] A13 A14, SE6 [0…*] A17
117 EDBT 2011, A. Bonifati & Y. Velegrakis
Synthetic Examples for Matching
[Ferrara et al. 2008]
ISLab Instance Matching Benchmark
Creates a reference ontology and populates it
Using the web
Performs a sequence of modifications
Variations in data values
Structural heterogeneity
Semantic variations
118 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenarios Are Useless Without Metrics
119 EDBT 2011, A. Bonifati & Y. Velegrakis
Talk Outline
Introduction
Matching and mapping: techniques & tools
Benchmarks and evaluation principles
Designing a matching & mapping benchmark
Using real-world scenarios for evaluation
Generating synthetic evaluation scenarios
Measuring efficiency and effectiveness
Conclusions and future directions
120 EDBT 2011, A. Bonifati & Y. Velegrakis
Metric Categorization
Qualitative metrics
Compliance measures
Quantitative metrics
Performance measures
User-specific metrics
Application specific metrics
121 EDBT 2011, A. Bonifati & Y. Velegrakis
Qualitative = Compliance
Evaluate the degree of compliance of the system with respect to some standard
Matching: Precision, Recall, F-measure and Fallout [Euzenat et al. 2007]
Measure the difference of the system output from some reference (expected) output
An expert user is typically assumed to provide the expected matches [Duchateau et al. 2007] [Euzenat et al. 2004]
They do not provide any measure of the post-match effort
They do not consider the time spent by the user in verification during intermediate stages
122 EDBT 2011, A. Bonifati & Y. Velegrakis
Terminology
E − G : False Negatives
G − E : False Positives
G ∩ E : True Positives
U − (G ∪ E) : True Negatives

E: Expected, G: Generated, U: universe of all possible correspondences
123 EDBT 2011, A. Bonifati & Y. Velegrakis
Hamming Distance
Measures the dissimilarity between matches
H(G, E) = 1 − |G∩E| / |G∪E|

Example:
E = {Book-Volume, Person-Human, Science-Essay}
G = {Product-Volume, Person-Writer, Science-Essay}
H(G, E) = 1 − 1/5 = 4/5
124 EDBT 2011, A. Bonifati & Y. Velegrakis
Precision
Originated from IR [van Rijsbergen, 1975]
Adopted to matching [Do et al. 2002]
Ratio of correctly found correspondences (true positives) over the total number of returned correspondences (true and false positives)
Intuitively: The degree of correctness
Precision(G, E) = |G∩E| / |G|
125 EDBT 2011, A. Bonifati & Y. Velegrakis
Recall
Originated from IR [van Rijsbergen, 1975]
Adopted to matching [Do et al. 2002]
Ratio of correctly found correspondences (true positives) over the total number of expected correspondences (true positives and false negatives)
Intuitively: The degree of completeness
Recall(G, E) = |G∩E| / |E|
126 EDBT 2011, A. Bonifati & Y. Velegrakis
Fallout
The percentage of the found matches that are false positives
Intuitively: How much error has been made
Fallout(G, E) = (|G| − |G∩E|) / |G| = |G−E| / |G|
127 EDBT 2011, A. Bonifati & Y. Velegrakis
F-Measure
Precision & Recall not always consistent
Their complements Noise & Silence neither
Aggregate Precision & Recall
F-Measure_α(G, E) = Precision(G, E) × Recall(G, E) / ((1−α) × Precision(G, E) + α × Recall(G, E))

For α=1, F-Measure equals Precision; for α=0, Recall. For α=0.5, it is the harmonic mean
128 EDBT 2011, A. Bonifati & Y. Velegrakis
Overall
Like an edit-distance [Melnik et al. 2002]
Ratio of errors over the total number of expected correspondences (true positives and false negatives)
Overall < F-Measure. Ranges [-1,1]
Intuitively: The effort required to fix a matching
Overall(G, E) = Recall(G, E) × (2 − 1/Precision(G, E))
             = 1 − (|G∪E| − |G∩E|) / |E|
             = 1 − (|E−G| + |G−E|) / |E|
129 EDBT 2011, A. Bonifati & Y. Velegrakis
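The set-based compliance metrics of the last few slides can be sketched in a few lines of Python, over correspondences represented as pairs (the example sets are the Book/Volume ones used for the Hamming distance):

```python
# Compliance metrics over sets of correspondences.
# G: generated correspondences, E: expected correspondences.

def precision(G, E):
    return len(G & E) / len(G)

def recall(G, E):
    return len(G & E) / len(E)

def fallout(G, E):
    return len(G - E) / len(G)

def f_measure(G, E, alpha=0.5):
    p, r = precision(G, E), recall(G, E)
    return p * r / ((1 - alpha) * p + alpha * r)

def overall(G, E):
    # Melnik et al.'s edit-distance-like accuracy; can be negative.
    return recall(G, E) * (2 - 1 / precision(G, E))

def hamming(G, E):
    return 1 - len(G & E) / len(G | E)

E = {("Book", "Volume"), ("Person", "Human"), ("Science", "Essay")}
G = {("Product", "Volume"), ("Person", "Writer"), ("Science", "Essay")}

print(precision(G, E))  # 1/3
print(overall(G, E))    # -1/3: fixing G costs more than it contributes
```

With one true positive out of three correspondences on each side, precision, recall, and the balanced F-measure all equal 1/3, while Overall drops below zero.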
Strength-based Similarity
Takes into consideration the degree of confidence
SBS(G, E) = 2 × Σ_{c ∈ G∩E} |strength_G(c) − strength_E(c)| / (|G| + |E|)
130 EDBT 2011, A. Bonifati & Y. Velegrakis
All-or-nothing vs. Approximation
Product
DVD
Book
Science
Textbook
Popular
……
Volume
Essay
Politics
Biography
…… Expected
Far
Close
131 EDBT 2011, A. Bonifati & Y. Velegrakis
Relaxed Precision & Recall
[Ehrig et al. 2005]
Precision_ω(G, E) = ω(G, E) / |G|
Recall_ω(G, E) = ω(G, E) / |E|
ω(G, E) = Σ_{(a,r) ∈ M(G,E)} σ(a, r)

σ: correspondence similarity function
M(G, E): best matches with regard to σ
ω(G, E) is used instead of |G ∩ E|
132 EDBT 2011, A. Bonifati & Y. Velegrakis
Weighted Harmonic Mean
Integrates multiple similarity measures
Given n similarity measures M_i, each with a weight w_i such that w_i ∈ [0,1] and Σ_{i=1..n} w_i = 1:

WHM(G, E) = Π_{i∈I} M_i(G, E) / Σ_{i∈I} w_i × M_i(G, E)
133 EDBT 2011, A. Bonifati & Y. Velegrakis
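A sketch of relaxed precision and recall in Python. The greedy one-to-one matching used here for M(G, E) is an illustrative simplification of the best matching assumed by [Ehrig et al. 2005], and `exact` is a hypothetical similarity function σ:

```python
def omega(G, E, sigma):
    """Sum of similarities over a greedy one-to-one matching of G and E."""
    pairs = sorted(((sigma(a, r), a, r) for a in G for r in E), reverse=True)
    used_g, used_e, total = set(), set(), 0.0
    for s, a, r in pairs:
        if s > 0 and a not in used_g and r not in used_e:
            used_g.add(a); used_e.add(r); total += s
    return total

def relaxed_precision(G, E, sigma):
    return omega(G, E, sigma) / len(G)

def relaxed_recall(G, E, sigma):
    return omega(G, E, sigma) / len(E)

G = {("Science", "Essay"), ("Person", "Writer")}
E = {("Science", "Essay"), ("Person", "Human")}

def exact(a, r):
    # With a 0/1 similarity, the relaxed measures reduce to the classical ones.
    return 1.0 if a == r else 0.0

print(relaxed_precision(G, E, exact))  # 0.5
```

Replacing `exact` with a graded σ (e.g. giving partial credit when only the source element agrees) rewards "close" misses instead of treating them like arbitrary errors.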
Evaluating Mapping Systems
Efficiency
Mapping Generation Time
Data Translation Performance
Time
Parallelization
Human effort
134 EDBT 2011, A. Bonifati & Y. Velegrakis
Matching/Mapping Generation Time
Matching tools require little human intervention
Time has been measured for matching tasks in [Yatskevic et al. 2003], discussed in a recent benchmark [Kopcke et al. 2010], and also addressed in XBenchMatch [Duchateau et al. 2007]
Mapping tools, e.g., Clio, HePToX, Spicy, and STBenchmark, do not elaborate on the issue
It is hard to measure time in a process in which human participation is part of the process
It includes the time to guide, verify, and tune the mapping tool
135 EDBT 2011, A. Bonifati & Y. Velegrakis
Translation Time as Performance Metric
Time to execute the transformation script
Indirectly a quality metric on the generated mappings
Attention to avoid evaluation of the query engines
Same engine & Hardware
Need to be fair
Same target instance
Efficient Core generation
[Mecca et al. 2009] [tenCate et al. 2009]
Time performance of ETL workflows
[Simitsis et al. 2009]
136 EDBT 2011, A. Bonifati & Y. Velegrakis
Time and Parallelization for ETL tools
Beyond time performance, other factors may be quite relevant, such as:
Workflow execution throughput (with or without failures)
Average latency per tuple
Along with the above factors, it is important to increase parallelization:
Pipelining: tasks of the ETL workflow are executed in parallel by different processors, and the output can be consumed by the next task without waiting for overall completion
Partitioning: data is partitioned and the transformation is applied to chunks of data
137 EDBT 2011, A. Bonifati & Y. Velegrakis
Human Effort In Matching Tools
The amount of work required to remove false positives and add false negatives
Since no human intervention takes place during the matching process
Human-spared resources [Duchateau 2009]
It counts the number of user interactions to obtain a 100% F-measure, i.e. the effort to remove false positives and add false negatives, and also to discover missing correspondences
138 EDBT 2011, A. Bonifati & Y. Velegrakis
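The human-spared-resources idea can be approximated by a simple interaction count, as in this Python sketch (the match sets are hypothetical, and real measures also weigh the cost of discovering the missing correspondences, not just adding them):

```python
def post_match_effort(G, E):
    """Interactions needed to turn the generated match set G into the
    expected set E: remove each false positive, add each false negative."""
    false_positives = G - E   # must be removed by the user
    false_negatives = E - G   # must be added by the user
    return len(false_positives) + len(false_negatives)

G = {"a-x", "b-y", "c-z"}
E = {"a-x", "b-w"}
print(post_match_effort(G, E))  # 3: remove b-y and c-z, add b-w
```

A perfect matcher (G = E) scores 0; the count grows with every correction needed to reach a 100% F-measure.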
Human Effort In Mapping Tools
Mapping tools can be seen as graphic tools
HCI studies can be used
Comparing the GUIs of the tools is hard
Schema mapping tools are a new technology; the tools are evolving and keep improving their interfaces
STBenchmark provides a first-cut measure on the effort required to implement a mapping scenario through the visual interface of a mapping system [Alexe et al. 2008]
139 EDBT 2011, A. Bonifati & Y. Velegrakis
A Simple Model
STBenchmark model Cost of implementing a mapping scenario:
4*L + S + 2*D + 4*K
L – mouse dragging actions
S – single mouse clicks
D – double mouse clicks
K – keystrokes
[MacKenzie et al. ’91]:
Mouse dragging is slower and more error-prone than clicking
It is easier to make mistakes when typing
Scenario / System    A    B    C    D
Scenario 1           8   16   16    4
Scenario 2          72   78   65  110
Scenario 3          32   52   37    7
Scenario 4          66   81   65  200
140 EDBT 2011, A. Bonifati & Y. Velegrakis
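The STBenchmark cost model is straightforward to encode; a minimal Python sketch (the action counts in the example call are hypothetical):

```python
def scenario_cost(L, S, D, K):
    """STBenchmark GUI-effort model: 4*L + S + 2*D + 4*K, where
    L = mouse drags, S = single clicks, D = double clicks, K = keystrokes.
    Drags and keystrokes are weighted highest because they are the
    slowest and most error-prone actions [MacKenzie et al. '91]."""
    return 4 * L + S + 2 * D + 4 * K

# e.g., a scenario needing 3 drags, 4 clicks, 1 double click, 2 keystrokes:
print(scenario_cost(L=3, S=4, D=1, K=2))  # 26
```

Counting the four action types per scenario and comparing the totals yields tables like the one above.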
Usability study for HePToX/Clio
• [Bonifati et al. 2010]
• Whereas Clio could implement considerably more scenarios, HePToX required less effort than Clio in the majority of the scenarios it could implement.
• HePToX is click-and-drag oriented, while Clio is click-and-select oriented.
141 EDBT 2011, A. Bonifati & Y. Velegrakis
Evaluating Mapping Systems
Efficiency
Mapping Generation Time
Data Translation Performance
Time
Parallelization
Human effort
Effectiveness
Supported Scenarios
Quality
Generated Mappings
Target instance
Target schema
142 EDBT 2011, A. Bonifati & Y. Velegrakis
Enumerating Supported Scenarios
System A System B System C
Scenario 1
Scenario 2
Scenario 3
Scenario 4
Scenario 5
Scenario 6
Scenario 7
……………..
143 EDBT 2011, A. Bonifati & Y. Velegrakis
Before talking about quality…
In order to check the quality of the generated mappings or the generated target instance:
Somebody has to provide you with the 'ideal' or 'desired' mappings or target instance
Normally, the user provides those
Or, alternatively, a 'core' target instance can be used
Or the set of mappings suggested by a benchmark
The focus here is not on who provides the above components, but on quality!
144 EDBT 2011, A. Bonifati & Y. Velegrakis
Quality of the Generated Mappings
Things are tricky for mapping tools:
Measuring the quality of generated mappings amounts to checking Query Containment or Query Equivalence
An NP-complete problem
It is preferable to check the quality of the results of a mapping:
i.e. the generated target instance
Current efforts try to characterize mappings in a quantitative way:
By their information loss, while inverting them
The notion of maximum extended recovery has been introduced in [Arenas2008, Fagin2009]
145 EDBT 2011, A. Bonifati & Y. Velegrakis
Quality of the Generated Target instance
A mapping is inherently a query
Same transformation, multiple ways.
Generation time
Readability
Mapping result
Whether it produces the respective results
Yes/No is too harsh
Similarity of queries …. too difficult to compute
Target instance is an alternative
Observe the result of the mappings
146 EDBT 2011, A. Bonifati & Y. Velegrakis
The Spicy Way
[Bonifati et al. 2008] – Spicy
Schema Matcher (internal or external)
Mapping Generator (internal or external)
The Spicy pipeline: the source and target schemas are first matched (e.g., s.A–t.E 0.87, s.C–t.F 0.76, s.D–t.F 0.98, …); line selection and mapping generation produce candidate mappings; structural analysis and mapping verification then return ranked mappings (e.g., mapping 1: 0.97, mapping 2: 0.87, mapping 3: 0.72, …)
147 EDBT 2011, A. Bonifati & Y. Velegrakis
The Spicy Way
Structural Analysis
uses the model of electrical circuits
uses sampling
uses a set of features on samples, e.g.:
length and character distribution
entropy of values
density of null values etc.
148 EDBT 2011, A. Bonifati & Y. Velegrakis
Structural Analysis
For an attribute A with sample sample(A), the atomic piece of circuit is the following
149 EDBT 2011, A. Bonifati & Y. Velegrakis
Trees Into Circuits
Circuits can be obtained for nested structures
150 EDBT 2011, A. Bonifati & Y. Velegrakis
The Spicy way: lessons learned
Spicy is a system for schema mapping verification
using comparison of instances to gauge the quality
iterating the mapping search algorithm until mapping quality is acceptable
Open issues:
Special element types (images, full-text), complex models (e.g. ontologies) and more complex classes of lines
Other instance comparison techniques other than structural analysis may be needed
What if we look at the efficiency of transformations (e.g. core computation) and their quality at the same time?
151 EDBT 2011, A. Bonifati & Y. Velegrakis
Quality of target instance for ETL
The quality of target instances is also important for ETL systems
Target instances can be characterized in terms of:
Ease of maintenance [Simitsis et al. 2009]
Resilience to failures
Data freshness
Compliance to business rules
152 EDBT 2011, A. Bonifati & Y. Velegrakis
Data examples
Since the expected/generated target instance may be large and the generated mappings may be numerous, samples of the expected/generated target instance can be used instead
The importance of data examples goes back to [Yan et al. 2001]:
Each mapping is a connected graph G = (N, E), where N is the set of nodes (source schema relations) and E represents conjunctions of join predicates among the nodes
Data associations (subgraphs of G) can be derived as relations that contain the maximum number of attributes that can be joined along E
Data associations are leveraged to understand what has to be included in a mapping
153 EDBT 2011, A. Bonifati & Y. Velegrakis
Routes
Source-to-target dependencies Σst:
m1: CardHolders(cn, l, s, n) → Accounts(cn, l, s), Clients(s, n)
m2: Dependents(an, s, n) → Clients(s, n)
Target dependencies Σt:
m3: Clients(s, n) → ∃A, L: Accounts(A, L, s)
MANHATTAN CREDIT (source schema S):
CardHolders: cardNo, limit, ssn, name
Dependents: accNo, ssn, name

FARGO FINANCE (target schema T):
Accounts: accNo, creditLine, accHolder
Clients: ssn, name

(m1 maps CardHolders to Accounts and Clients; m2 maps Dependents to Clients; m3 relates Clients to Accounts)

Source instance I:
CardHolders: (123, $15K, ID1, Alice)
Dependents: (123, ID2, Bob)

Target instance J, a solution for I under the schema mapping:
Accounts: (123, L1, ID1), (A2, L2, ID2)
Clients: (ID1, Alice), (ID2, Bob)
Allow the Inspection of the flow of the mappings [Chiticariu and Tan, 2006]
154 EDBT 2011, A. Bonifati & Y. Velegrakis
Example Debugging Scenario 1
Unknown credit limit?
15K is not copied over to the target
Source instance I:
CardHolders: (123, $15K, ID1, Alice)
Dependents: (123, ID2, Bob)

Target instance J:
Accounts: (123, L1, ID1), (A2, L2, ID2)
Clients: (ID1, Alice), (ID2, Bob)

A route for the Accounts tuple (123, L1, ID1):
CardHolders(123, $15K, ID1, Alice) —m1→ Accounts(123, L1, ID1), Clients(ID1, Alice)
155 EDBT 2011, A. Bonifati & Y. Velegrakis
Example Debugging Scenario 2
Unknown account number?
123 is not copied over to the target
as Bob’s account number
Source instance I:
CardHolders: (123, $15K, ID1, Alice)
Dependents: (123, ID2, Bob)

Target instance J:
Accounts: (123, L1, ID1), (A2, L2, ID2)
Clients: (ID1, Alice), (ID2, Bob)

A route for the Accounts tuple with accNo A2:
Dependents(123, ID2, Bob) —m2→ Clients(ID2, Bob) —m3→ Accounts(A2, L2, ID2)
157 EDBT 2011, A. Bonifati & Y. Velegrakis
The SPIDER System
Based on Routes
159 EDBT 2011, A. Bonifati & Y. Velegrakis
Data Examples as Evaluation Tools
The use of data examples as evaluation tools is underway [Chiticariu et al. 2008]
Examples are used to understand and refine mappings towards the desired specification
Universal data examples are data examples derived from universal solutions [Alexe et al. 2010]
If S and T contain only unary relations, with only Σst, a mapping is characterized by a set of:
Positive data examples (I, J) such that (I, J) ⊨ Σ
Negative data examples (I, J) such that (I, J) ⊭ Σ
160 EDBT 2011, A. Bonifati & Y. Velegrakis
Muse
[Chiticariu et al. 2008] Muse
Builds ad-hoc probes for each attribute, such that a small source example is built and two differentiating target examples are obtained
After that, Muse asks the designer: "Which target instance looks correct?"
This eliminates the mappings that lead to the unchosen target instance
The result is:
A set of correct homomorphically equivalent target instances
It also allows the design of Skolem functions, not addressed in Routes
161 EDBT 2011, A. Bonifati & Y. Velegrakis
MUSE Workflow
Input: mapping specification + real source instance (if available)
Generation: real/synthetic data examples
Examination: the mapping designer inspects the data examples (essentially yes/no answers)
Refinement: grouping semantics, disambiguation
162 EDBT 2011, A. Bonifati & Y. Velegrakis
Example
CompDB: Rcd
Companies: Set of
Company: Rcd
cbranch
cname
location
Projects: Set of
Project: Rcd
pid
pname
cbranch
manager
Employees: Set of
Employee: Rcd
eid
ename
contact
OrgDB: Rcd
Orgs: Set of
Org: Rcd
oname
Projects: Set of
Project: Rcd
pname
manager
Employees: Set of
Employee: Rcd
eid
ename
Declarative Mapping
for
c in CompDB.Companies
p in CompDB.Projects
e in CompDB.Employees
satisfy
p.cbranch = c.cbranch
e.eid = p.manager
exists
o in OrgDB.Orgs
p1 in o.Projects
e1 in OrgDB.Employees
satisfy
p1.manager = e1.eid
where
c.cname = o.oname
e.eid = e1.eid
e.ename = e1.ename
p.pname = p1.pname
163 EDBT 2011, A. Bonifati & Y. Velegrakis
Example
Grouping Projects:

Example source:
Companies: (Redmond, Microsoft, USA), (S. Valley, Microsoft, USA)
Projects: (P1, DB, Redmond, e4), (P2, Web, S. Valley, e5)

Group by cbranch → Orgs: Microsoft [Projects: DB e4]; Microsoft [Projects: Web e5]
Group by cname → Orgs: Microsoft [Projects: DB e4, Web e5]
CompDB: Rcd
Companies: Set of
Company: Rcd
cbranch
cname
location
Projects: Set of
Project: Rcd
pid
pname
cbranch
manager
Employees: Set of
Employee: Rcd
eid
ename
contact
OrgDB: Rcd
Orgs: Set of
Org: Rcd
oname
Projects: Set of
Project: Rcd
pname
manager
Employees: Set of
Employee: Rcd
eid
ename
164 EDBT 2011, A. Bonifati & Y. Velegrakis
Example
Declarative Mapping
for
c in CompDB.Companies
p in CompDB.Projects
e in CompDB.Employees
satisfy
p.cbranch = c.cbranch
e.eid = p.manager
exists
o in OrgDB.Orgs
p1 in o.Projects
e1 in OrgDB.Employees
satisfy
p1.manager = e1.eid
where
c.cname = o.oname
e.eid = e1.eid
e.ename = e1.ename
p.pname = p1.pname
o.Projects = SKProjects(c.cbranch, c.cname, c.location)
Grouping
Function
But group by what subset of {cbranch, cname, location} ?
CompDB: Rcd
Companies: Set of
Company: Rcd
cbranch
cname
location
Projects: Set of
Project: Rcd
pid
pname
cbranch
manager
Employees: Set of
Employee: Rcd
eid
ename
contact
OrgDB: Rcd
Orgs: Set of
Org: Rcd
oname
Projects: Set of
Project: Rcd
pname
manager
Employees: Set of
Employee: Rcd
eid
ename
165 EDBT 2011, A. Bonifati & Y. Velegrakis
Muse-G: Grouping Semantics Design
Goal: infer a grouping function that has the same effect as the one intended by the designer
Muse-G probes each possible grouping attribute: start with cbranch
Example source:
Companies: (Redmond, Microsoft, USA), (S. Valley, Microsoft, USA)
Projects: (P1, DB, Redmond, e4), (P2, Web, S. Valley, e5)
Employees: (e4, John, x234), (e5, Anna, x888)

Target Scenario 1 — group by cbranch:
Orgs: Microsoft [Projects: DB e4]; Microsoft [Projects: Web e5]
Employees: e4 John, e5 Anna
Skolem terms: SK(Redmond, y), SK(S. Valley, y), with y a subset of {Microsoft, USA}

Target Scenario 2 — do not group by cbranch:
Orgs: Microsoft [Projects: DB e4, Web e5]
Employees: e4 John, e5 Anna
Skolem term: SK(y)
166 EDBT 2011, A. Bonifati & Y. Velegrakis
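The effect of choosing different grouping attributes can be sketched as follows in Python; the tuple layout and the attribute indices are illustrative assumptions mirroring the slide's example:

```python
from collections import defaultdict

# Projects as flat tuples: (pname, manager, cbranch, cname).
projects = [
    ("DB",  "e4", "Redmond",   "Microsoft"),
    ("Web", "e5", "S. Valley", "Microsoft"),
]

def group_projects(projects, key):
    """Nest (pname, manager) lists under the grouping attribute.
    key is the index of the grouping attribute: 2 = cbranch, 3 = cname."""
    groups = defaultdict(list)
    for tup in projects:
        pname, manager = tup[0], tup[1]
        groups[tup[key]].append((pname, manager))
    return dict(groups)

# Group by cbranch: two groups (Redmond, S. Valley), one project each.
print(group_projects(projects, key=2))
# Group by cname: a single Microsoft group holding both projects.
print(group_projects(projects, key=3))
```

The two outputs correspond to Muse-G's two target scenarios: the designer's yes/no answer decides whether cbranch stays in the Skolem (grouping) function.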
Muse-G: Second Question
The next probed attribute is cname
Example source:
Companies: (S. Valley, Microsoft, USA), (Mt. View, Google, USA)
Projects: (P1, DB, S. Valley, e4), (P4, Web, Mt. View, e6)
Employees: (e4, John, x234), (e6, Kat, x331)

Target Scenario 1 — group by cname:
Orgs: Microsoft [Projects: DB e4]; Google [Projects: Web e6]
Employees: e4 John, e6 Kat
Skolem terms: SK(Microsoft, y), SK(Google, y), with y a subset of {USA}

Target Scenario 2 — do not group by cname:
Orgs: Microsoft [Projects: DB e4, Web e6]; [Projects: DB e4, Web e6]
Employees: e4 John, e6 Kat
Skolem terms: SK(y), SK(y)

The wizard continues to probe the remaining possible grouping attributes
167 EDBT 2011, A. Bonifati & Y. Velegrakis
Quality of the Generated Target Schema
When mappings are used in schema integration:
The quality of the generated integrated schema is measured wrt. the intended integrated schema
Three metrics have been conceived:
Completeness [Batista07]: the fraction of the intended schema's elements that also appear in the generated schema
Minimality [Batista07]: penalizes the α elements of the generated schema that do not appear in the intended one
Structurality: rewards shared elements that preserve their ancestors from the intended schema
A weighted average of the three yields the overall schema proximity
168 EDBT 2011, A. Bonifati & Y. Velegrakis
Quality of the Generated Target Schema
169 EDBT 2011, A. Bonifati & Y. Velegrakis
An example
Nr. of common elements = 6; α = 2
Completeness = 6/7; Minimality = 5/7
Structurality = (1 + 1 + 1 + 0 + 1/4 + 1/2)/6 = 0.625
Proximity = (6/7 + 5/7 + 0.625)/3 ≈ 0.73
[Figure: two schema trees rooted at A — the generated schema contains B, C, D, E, F plus the extra elements X and Z; the intended schema contains B, C, D, E, F and G]
(generated) (intended)
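The slide's arithmetic can be replayed directly. The formulas below are reconstructed from the example's numbers (minimality taken as 1 − α/|intended|, proximity as the equal-weight average of the three metrics), not quoted verbatim from [Batista07]:

```python
# Sketch: recomputing the quality metrics of the example.
from fractions import Fraction

common = 6      # elements shared by both schemas (A..F)
intended = 7    # elements of the intended schema (A..F plus G)
alpha = 2       # extra generated elements (X and Z)

completeness = Fraction(common, intended)    # 6/7
minimality = 1 - Fraction(alpha, intended)   # 5/7
# per-element ancestor-preservation scores read off the two trees:
structurality = (1 + 1 + 1 + 0 + Fraction(1, 4) + Fraction(1, 2)) / 6
proximity = (completeness + minimality + structurality) / 3

print(float(structurality))         # -> 0.625
print(round(float(proximity), 2))   # -> 0.73
```

The equal-weight average reproduces the slide's Proximity of 0.73, which is what suggests this reading of the garbled formula.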
170 EDBT 2011, A. Bonifati & Y. Velegrakis
Talk Outline
Introduction
Matching and mapping: techniques & tools
Benchmarks and evaluation principles
Challenges in evaluating matching & mapping
Using real-world scenarios for evaluation
Generating synthetic evaluation scenarios
Measuring efficiency and effectiveness
Conclusions and References
171 EDBT 2011, A. Bonifati & Y. Velegrakis
What We Talked About
The importance of Matching/Mappings
Explained the matching/mapping tasks
Presented existing tools & their functionality
What is a benchmark
Generic evaluation principles
Why a benchmark for mapping systems is a challenge
Finding real-world evaluation scenarios
Generating synthetic evaluation scenarios
Metrics for effectiveness, efficiency and quality
172 EDBT 2011, A. Bonifati & Y. Velegrakis
Main Sources
[Euzenat et al. 2007] J. Euzenat and P. Shvaiko, "Ontology matching", Springer-Verlag, 2007
[Bellahsene et al. 2011] Z. Bellahsene, A. Bonifati and E. Rahm, "Schema Matching and Mapping", Springer-Verlag, 2011
173 EDBT 2011, A. Bonifati & Y. Velegrakis
Thank you !!
Are you taking questions?
Of course! Go ahead
174 EDBT 2011, A. Bonifati & Y. Velegrakis
Partial List of References
[Alexe et al. 2008] Alexe B, Chiticariu L, Miller RJ, Tan WC (2008) “Muse: Mapping Understanding and deSign by Example”. In: ICDE, pp 10-19
[Bonifati et al. 2008] Bonifati A, Mecca G, Pappalardo A, Raunich S, Summa G (2008) “Schema Mapping Verification: The Spicy Way”. In: EDBT, pp 85-96
[Lenzerini 2002] Lenzerini M (2002) “Data Integration: A Theoretical Perspective”. In: PODS, pp 233-246
[Miller et al. 2000] Miller RJ, Haas LM, Hernandez MA (2000) “Schema Mapping as Query Discovery”. In: VLDB, pp 77-88
[Rahm et al. 2001] Rahm E, Bernstein PA (2001) “A survey of approaches to automatic schema matching”. VLDB Journal 10(4):334-350
[Velegrakis 2005] Velegrakis Y (2005) “Managing Schema Mappings in Highly Heterogeneous Environments”. PhD thesis, University of Toronto
175 EDBT 2011, A. Bonifati & Y. Velegrakis
Partial List of References (cont’d)
[Altova 2008] Altova (2008) MapForce. http://www.altova.com
[Batini et al. 1986] Batini C, Lenzerini M, Navathe SB (1986) A Comparative Analysis of Methodologies for Database Schema Integration. ACM Comp. Surv. 18(4):323-364
[Bernstein et al. 2007] Bernstein PA, Melnik S (2007) Model management 2.0: manipulating richer mappings. In: SIGMOD, pp 1–12
[Do et al. 2002] Do HH, Rahm E (2002) COMA - A System for Flexible Combination of Schema Matching Approaches. In: VLDB, pp 610–621
[Euzenat et al. 2007] Euzenat J, Shvaiko P (2007) Ontology matching. Springer Verlag, Heidelberg
[Fagin et al. 2005] Fagin R, Kolaitis PG, Miller RJ, Popa L (2005) Data exchange: semantics and query answering. Theoretical Computer Science 336(1):89-124
[Lerner 2000] Lerner BS (2000) A Model for Compound Type Changes Encountered in Schema Evolution. ACM TODS 25(1):83–127
[Popa et al. 2002] Popa L, Velegrakis Y, Miller RJ, Hernandez MA, Fagin R (2002) Translating Web Data. In: VLDB, pp 598–609
176 EDBT 2011, A. Bonifati & Y. Velegrakis
Partial List of References (cont’d)
[Alexe et al. 2010] Alexe B, Kolaitis PG, Tan WC (2010) Characterizing Schema Mappings via Data Examples. In: PODS
[Aumueller et al. 2005] Aumueller D, Do HH, Massmann S, Rahm E (2005) Schema and ontology matching with COMA++. In: SIGMOD, pp 906–908
[Bonifati et al. 2010] Bonifati A, Chang EQ, Ho T, Lakshmanan LVS, Pottinger R, Chung Y (2010) Schema mapping and query translation in heterogeneous P2P XML databases. VLDB J. 19(2):231-256
[Chiticariu et al. 2006] Chiticariu L, Tan WC (2006) Debugging Schema Mappings with Routes. In: VLDB, pp 79–90
[Dhamankar et al. 2004] Dhamankar R, Lee Y, Doan A, Halevy AY, Domingos P (2004) iMAP: Discovering Complex Mappings between Database Schemas. In: SIGMOD, pp 383-394
[Doan et al. 2001] Doan A, Domingos P, Halevy AY (2001) Reconciling schemas of disparate data sources: A machine-learning approach. In: SIGMOD, pp 509–520
177 EDBT 2011, A. Bonifati & Y. Velegrakis
Partial List of References (cont’d)
[Fletcher et al. 2006] Fletcher GHL, Wyss CM (2006) Data Mapping as Search. In: EDBT, pp 95–111
[Giunchiglia et al. 2004] Giunchiglia F, Shvaiko P, Yatskevich M (2004) S-Match: an Algorithm and an Implementation of Semantic Matching. In: ESWS, pp 61–75
[Madhavan et al. 2001] Madhavan J, Bernstein PA, Rahm E (2001) Generic Schema Matching with Cupid. In: VLDB, pp 49–58
[Naumann et al. 2002] Naumann F, Ho CT, Tian X, Haas LM, Megiddo N (2002) Attribute Classification Using Feature Analysis. In: ICDE, p 271
[Shu et al. 1977] Shu NC, Housel BC, Taylor RW, Ghosh SP, Lum VY (1977) EXPRESS: A Data EXtraction, Processing and REstructuring System. ACM TODS 2(2):134–174