Benchmarks: From Usage To Evaluation
Schema Matching and Mapping Systems
Yannis Velegrakis (University of Trento), Angela Bonifati (CNR)
EDBT 2011, Uppsala, Sweden, March 21st-25th
2 EDBT 2011, A. Bonifati & Y. Velegrakis
Talk Outline
Introduction
Matching and mapping: techniques & tools
Benchmarks and evaluation principles
Designing a matching & mapping benchmark
Using real-world scenarios for evaluation
Generating synthetic evaluation scenarios
Measuring efficiency and effectiveness
Conclusions and future directions
Introduction
Data is inherently heterogeneous
  Due to the explosion of online data repositories
  Due to the variety of users, who develop a wealth of applications
    at different times
    with disparate requirements in mind
A fundamental requirement is to translate data across different formats
Mappings specify how data is transformed from one format to another
Mappings Are All Around
Data integration [Lenzerini 2002]
to specify the relationship between local and global schemas
[Figure: source schemas S1, S2, S3 with instances I1, I2, I3, mapped to a global schema T.]
Mappings Are All Around
Schema integration [Batini et al. 1986]
to specify the relationship between the input schemas and the integrated schemas
[Figure: input schemas S1, S2, S3 combined into an integrated schema.]
Mappings Are All Around
Data exchange [Fagin et al. 2005]
to specify the relationship between source and target schemas
[Figure: mappings between a source schema S and a target schema T; a source instance I is translated into a target instance J.]
Mappings Are All Around
Schema evolution [Lerner 2000]
to specify the relationship between the old and new version of an evolved schema
[Figure: an evolving schema S1 with successive versions S1’ and S1’’.]
How Did It All Start
One of the first systems to deal with this problem was developed at IBM in 1977: EXPRESS (EXtraction, Processing and REStructuring System) [Shu et al. 1977]. It consists of two languages:
  DEFINE, which works as a DDL (Data Definition Language)
  CONVERT, which works as a DTL (Data Translation Language) and has a total of 9 operators, each of which receives a data file as input, performs the respective transformation, and generates an output data file
EXPRESS required the user's familiarity with the languages and was customized to only one model (hierarchical)
After that, inter-model transformations were also studied [Tork-Roth et al. 1997] [Atzeni et al. 1997]
Emphasis on Data Translation
[Abiteboul et al. 1997] proposed a declarative framework for data translation
[Davidson et al. 1997] focused on constraint satisfaction
[Milo et al. 1998] leveraged a library of transformation rules and pattern-matching techniques
[Cluet et al. 1998] emphasized type-checking
[Beeri et al. 1999] focused on tree-based transformations for XML data structures
A Data Transfer Example
Source: Rcd
  projects: Set of
    project: Rcd
      name
      status
  grants: Set of
    grant: Rcd
      gid
      project
      recipient
      manager
      supervisor
  contacts: Set of
    contact: Rcd
      cid
      email
      phone
  companies: Set of
    company: Rcd
      name
      official

Target: Rcd
  projects: Set of
    project: Rcd
      code
      funds: Set of
        fund: Rcd
          fid
          finId
  finances: Set of
    finance: Rcd
      finId
      mPhone
      company
  companies: Set of
    company: Rcd
      coid
      name

Source Instance

Projects
  name        status
  PIX         Active
  E-services  Active
  Clio        Inactive

Grants
  gid  project     recipient  manager      supervisor
  g1   PIX         AT&T       Fernandez    Belanger
  g2   PIX         AT&T       Shrivastava  Belanger
  g3   E-services  Bell-labs  Benedikt     Hull

Contacts
  cid          email                    phone
  Benedikt     [email protected]     5827766
  Hull         [email protected]         5824509
  Shrivastava  [email protected]  3608776
  Belanger     [email protected]     3608600
  Fernandez    [email protected]        3608679

Companies
  name    official
  AT&T    AT&T Research Labs
  Lucent  Lucent Technologies, Bell Labs Innovations
Desired Target Instance
(Source and target schemas as in the data transfer example.)

Target Instance

Projects
  code: PIX
    Funds
      fid  finId
      g1   ???
      g2   ???
  code: E-services
    Funds
      fid  finId
      g3   ???

Finances
  finId  mPhone   company
  ???    3608679  ???
  ???    3608776  ???
  ???    5827766  ???

Companies
  coid         name
  Sk2(AT&T)    AT&T
  Sk2(Lucent)  Lucent
  ???          ???
  ???          ???
  ???          ???
The Needed Transformation Query

LET $doc0 := document("inputXMLfile")
RETURN
<T> {
  distinct-values (
    FOR $x0 IN $doc0/S/grant, $x1 IN $doc0/S/project,
        $x2 IN $doc0/S/contact, $x3 IN $doc0/S/contact
    WHERE $x2/cid/text() = $x0/manager/text()
      AND $x0/supervisor/text() = $x3/cid/text()
      AND $x0/project/text() = $x1/name/text()
    RETURN
    <project>
      <code> { $x0/project/text() } </code>
      { distinct-values (
          FOR $x0L1 IN $doc0/S/grant, $x1L1 IN $doc0/S/project,
              $x2L1 IN $doc0/S/contact, $x3L1 IN $doc0/S/contact
          WHERE $x2L1/cid/text() = $x0L1/manager/text()
            AND $x0L1/supervisor/text() = $x3L1/cid/text()
            AND $x0L1/project/text() = $x1L1/name/text()
            AND $x0/project/text() = $x0L1/project/text()
          RETURN
          <funding>
            <fid> { $x0L1/gid/text() } </fid>
            <finId> { "Sk52(", $x0L1/gid/text(), ", ", $x0L1/project/text(), ")" } </finId>
          </funding> ) }
    </project> ) }
  { distinct-values (
      FOR $x0 IN $doc0/S/grant, $x1 IN $doc0/S/project,
          $x2 IN $doc0/S/contact, $x3 IN $doc0/S/contact
      WHERE $x2/cid/text() = $x0/manager/text()
        AND $x0/supervisor/text() = $x3/cid/text()
        AND $x0/project/text() = $x1/name/text()
      RETURN
      <finance>
        <finId> { $x0/gid/text() } </finId>
        <mPhone> { $x2/phone/text() } </mPhone>
        <company> { "Sk46(", $x2/phone/text(), ", ", $x0/gid/text(), ")" } </company>
      </finance> ) }
  { distinct-values (
      FOR $x0 IN $doc0/S/grant, $x1 IN $doc0/S/project,
          $x2 IN $doc0/S/contact, $x3 IN $doc0/S/contact
      WHERE $x2/cid/text() = $x0/manager/text()
        AND $x0/supervisor/text() = $x3/cid/text()
        AND $x0/project/text() = $x1/name/text()
      RETURN
      <company>
        <coid> { "Sk46(", $x2/phone/text(), ", ", $x0/gid/text(), ")" } </coid>
        <name> { "Sk49(", $x2/phone/text(), ", ", $x0/gid/text(), ")" } </name>
      </company> ) }
  { distinct-values (
      FOR $x0 IN $doc0/S/company
      RETURN
      <company>
        <coid> { "Sk93(", $x0/cname/text(), ")" } </coid>
        <name> { $x0/cname/text() } </name>
      </company> ) }
</T>
The Road To Mapping Systems
The design of data transformations has long been a manual task
Designers had to be familiar with the transformation language
As schemas became larger and more complex, the task became too laborious, time-consuming, and error-prone
The need to raise the level of abstraction and to automate the task was soon realized
The idea: Mapping Systems
Generating Mappings
Different techniques exist to generate mappings:
  Manual, e.g.,
    by means of high-level mapping languages [Bernstein et al. 2007]
    by means of sophisticated user interfaces [Altova 2008]
  Semi-automatic, e.g.,
    by means of designer guidance [Alexe et al. 2008]
    via advanced algorithms that do the reasoning instead of the mapping designer [Madhavan et al. 2001] [Popa et al. 2002] [Do et al. 2002] [Bonifati et al. 2008]
The First Step of a Mapping Task
[Figure: the source and target schemas and the source instance of the data transfer example, shown side by side.]
Matching
Given two schemas as input, a source and a target schema, matching is a process that produces as output a set of matches, also called correspondences or (simply) lines, between the elements of the two schemas.
A match is a triple <Es, Et, e>, where Es is a set of elements of the source schema, Et is a set of elements of the target schema, and e specifies a simple relationship (equality or set inclusion) or a complex relationship between the elements of Es and Et.
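As a concrete (hypothetical) illustration, the match triple <Es, Et, e> could be encoded as follows; the class and field names are ours, not from any particular matching tool:

```python
from dataclasses import dataclass

# A match is a triple <Es, Et, e> relating a set of source elements Es
# to a set of target elements Et through a relationship label e.
@dataclass(frozen=True)
class Match:
    source_elements: frozenset   # Es: elements of the source schema
    target_elements: frozenset   # Et: elements of the target schema
    relationship: str            # e: "=", "subset", or a complex expression

# A 1-1 equality match and a 2-1 complex match (illustrative values):
m1 = Match(frozenset({"Name"}), frozenset({"Title"}), "=")
m2 = Match(frozenset({"firstname", "lastname"}), frozenset({"name"}), "concat")

matches = {m1, m2}   # the matcher's output: a set of matches
```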
The Matching Relationship e
Depends on the cardinalities of Es and Et
Depends on the semantics:
Can be a function
Can be an arithmetic operation
Can be a set-theoretic relation (e.g. ≡,overlaps)
Can be a conceptual modeling relationship (e.g. part-of, subclass-of)
Matching: An Alternative Definition
The matching process [Euzenat et al. 2007] can be seen as a function f from a pair of schemas S and T, an optional input alignment A, a set of matching parameters p and a set of resources r:
A’ = f(S, T, A?, p, r)
Ultimately, an alignment is a set of correspondences between elements in S and elements in T
Matching Examples
Simple relationships:
  Name ≡ Title
  Location ≡ Address
Complex relationships:
  speed = velocity × 2.237
  speed × 0.447 = velocity
  speed = concat(velocity × 2.237, 'MPH')
  speed ≥ velocity
Source: Company { Name, Location }    Target: Organization { Title, Address }
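A complex relationship such as speed = velocity × 2.237 can be carried by the match as an executable value transformation. A minimal sketch (the function names are ours; 2.237 converts m/s to MPH):

```python
# Executable forms of the complex correspondences above (illustrative only).
def speed_from_velocity(velocity_ms: float) -> float:
    """Apply the complex match expression: speed (MPH) = velocity (m/s) * 2.237."""
    return velocity_ms * 2.237

def speed_label(velocity_ms: float) -> str:
    """The concat variant: speed = concat(velocity * 2.237, 'MPH')."""
    return f"{velocity_ms * 2.237:g} MPH"
```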
The matching process
Can be roughly divided into three steps:
  Pre-match: training of classifiers for machine-learning-based matchers, setting of matching parameters (weights, thresholds), and adjustment of resources such as thesauri and constraints
  Match: the actual matching task
  Post-match: the user may check and modify the displayed matches
Some Schema Matchers
Cupid [Madhavan et al. 2001] : based on structural and name similarity
S-Match [Giunchiglia et al. 2004]: based on semantic closeness
Coma++ [Aumueller et al. 2005]: based on matching reuse
LSD [Doan et al. 2001]: based on data value analysis and machine-learning techniques
iMap [Dhamankar et al. 2004]: suited for complex match expressions e
Similarity Flooding [Melnik et al. 2002]: based on graph similarity
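To give the flavor of the name-similarity component used by matchers such as Cupid, here is a toy matcher; it is purely illustrative (a difflib ratio with an arbitrary threshold) and far simpler than any of the systems above:

```python
from difflib import SequenceMatcher

# Toy name-based matcher: score element-name pairs and keep those
# above a similarity threshold. Real systems combine many heuristics.
def name_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_by_name(source_elems, target_elems, threshold=0.6):
    return sorted(
        (s, t, round(name_similarity(s, t), 2))
        for s in source_elems for t in target_elems
        if name_similarity(s, t) >= threshold
    )

# Only the (phone, mPhone) pair survives the threshold here.
pairs = match_by_name({"phone", "companyName"}, {"mPhone", "name"})
```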
Similarity Flooding
COMA++
Matchings Are Not Enough
[Figure: the source and target schemas and the source instance of the data transfer example, connected by correspondence lines.]

Matches alone cannot describe the full details of the transformation
The Mapping Generation Process

[Diagram: the Matcher takes the Source Schema and the Target Schema and produces Matchings.]

Matching is just the beginning of any mapping generation process
Mappings
Given the source and the target schema, mapping is a process that takes as input a set of matches between the elements of the two schemas and produces a relationship or constraint e that must hold between their respective instances.
In other words, a mapping is a triple <S, T, e>, where S is the source schema, T is the target schema, and e specifies either a constraint that any instances adhering to S and T must satisfy, or an executable statement to transform instances of S into instances of T.
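A mapping whose constraint e is a source-to-target dependency could be encoded as follows; this is an illustrative sketch, and the Atom/Tgd names are ours:

```python
from dataclasses import dataclass

# A relational atom such as project(na, st).
@dataclass(frozen=True)
class Atom:
    relation: str
    args: tuple          # variable names

# A source-to-target dependency: a conjunction of source atoms
# implies a conjunction of target atoms.
@dataclass(frozen=True)
class Tgd:
    lhs: tuple           # phi(x): source atoms
    rhs: tuple           # psi(x, y): target atoms

    def __str__(self):
        fmt = lambda atoms: " AND ".join(
            f"{a.relation}({', '.join(a.args)})" for a in atoms)
        return f"{fmt(self.lhs)} -> {fmt(self.rhs)}"

# A one-atom fragment of the mapping example above.
m = Tgd(lhs=(Atom("project", ("na", "st")),),
        rhs=(Atom("project", ("na", "FUND")),))
```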
A Mapping Example
(Source and target schemas as in the data transfer example.)

project(na,st) ∧ grant(gid,na,re,ma,su) ∧ contact(ma,em,ph)
  → project(na,FUND) ∧ fund(gid,finId) ∧ finance(finId,ph,company) ∧ company(company,name)
Mappings & Instances
Mappings are the basic ingredients of many tasks, such as information integration, P2P query answering, data exchange, etc.
In particular, mappings as inter-schema constraints may not be enough to fully specify a unique target instance: there may exist multiple target instances satisfying the mappings.
Finding the best target instance is the goal of the data exchange problem [Fagin et al. 2005]: the mapping is converted into an executable transformation script to obtain that particular instance.
A Data Exchange Example

(Source schema S and target schema T as in the data transfer example.)

project(na,st) ∧ grant(gid,na,re,ma,su) ∧ contact(ma,em,ph)
  → project(na,FUND) ∧ fund(gid,finId) ∧ finance(finId,ph,company) ∧ company(company,name)

Target Instance

Projects
  code: PIX
    Funds
      fid  finId
      g1   ???
      g2   ???
  code: E-services
    Funds
      fid  finId
      g3   ???

Finances
  finId  mPhone   company
  ???    3608679  ???
  ???    3608776  ???
  ???    3608600  ???

Companies
  coid  name
  ???   AT&T
  ???   Lucent
The Mapping Generation Process

[Diagram: the Matcher takes the Source Schema and the Target Schema and produces Matchings; with user input, the Mapping Generation Engine turns the matchings into Mappings (Dependencies); the Query Engine compiles these into Transformation Scripts; finally, the Data Exchange Engine applies the scripts to the Source Instance to produce the Target Instance.]
Research Prototype Systems
Mapping generation and data exchange are separate tasks Clio[Popa et al. 2002], HePToX[Bonifati et al. 2010],
Spicy[Bonifati et al. 2008]
Mappings Generation the mappings are expressed as high-level assertions in a
logical formalism
A mapping is a source-to-target tuple-generating dependency (or s-t tgd in short)
𝜙 𝑥 → ∃ 𝜓 𝑥, 𝑦 where
φ(x) (ψ(x,y), resp.) is a conjunction of atoms over the source (target, resp.)
Data exchange The respective module transforms the high-level mappings
into transformation scripts (in SQL or XQuery) to generate the target instance.
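As a rough sketch of this compilation step, assuming a purely relational setting with a single target atom and no existential variables (all names are ours, not any system's actual API):

```python
# Compile a simplified relational s-t tgd into an executable SQL string:
# the source atoms become the FROM clause, the join conditions the WHERE
# clause, and the target atom an INSERT ... SELECT.
def tgd_to_sql(src_atoms, tgt_relation, tgt_cols, join_cond="1=1"):
    from_clause = ", ".join(f"{rel} AS {alias}" for rel, alias in src_atoms)
    select = ", ".join(tgt_cols)
    return (f"INSERT INTO {tgt_relation} "
            f"SELECT {select} FROM {from_clause} WHERE {join_cond}")

# A fragment of the running example: populate fund from grant x project.
sql = tgd_to_sql([("grant", "g"), ("project", "p")],
                 "fund", ["g.gid", "p.name"], "g.project = p.name")
```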
Clio
Spicy
HepToX
Commercial Mapping Systems

[Diagram: the same pipeline, but with a single Mapping Engine in place of the separate mapping generation and data exchange stages.]

Mapping generation and data exchange are merged into one step: the system directly creates the final transformation script in some native language.
Popular Commercial Systems
Altova Mapforce
Stylus Studio
IBM Rational Data Architect
BizTalk mapper
Adeptia
BEA Aqualogic
Stylus Studio
Altova MapForce
Adeptia
BizTalk Mapper
IBM Rational Data Architect
A Mapping Tool Categorization

All tools provide the mapping designer with:
  A graphical representation of the two schemas
  A set of graphical transformation constructs
The granularity and power of these constructs is a main factor of differentiation among the tools.

[Spectrum: from detailed specification by the designer (roughly, commercial mapping tools) to reliance on the intelligence of the mapping tool, traded against effort in post-verification (roughly, research prototypes).]
Issues in Data Exchange
When multiple target instances exist, how do we compute the best one?
Is a given target instance better than another?
Universal solutions
Introduced in [Fagin et al. 2005]
These are the “most general” target instances, and also represent the entire space of solutions
Among the universal solutions, the smallest of all and the most compact one is called the “core”
Universal Core Instances

S: Rcd
  PTStud: Rcd { age, name }
  GradStud: Rcd { age, name }

T: Rcd
  Advised: Rcd { sname, facid }
  WorksWith: Rcd { sname, facid }

Dependencies:
  PTStud(x,y) → ∃z Advised(y,z)
  GradStud(x,y) → ∃z (Advised(y,z) ∧ WorksWith(y,z))

Source instance:
  PTStud:   (27, Bob), (30, Ann)
  GradStud: (32, John), (30, Ann)

A solution:
  Advised:   (Bob, N3), (Ann, N4), (John, N1), (N1, Cathy)
  WorksWith: (Bob, N3), (Ann, N4)

A universal solution:
  Advised:   (Bob, N3), (Ann, N4), (John, N1), (N2, Ann)
  WorksWith: (Bob, N3), (Ann, N4)

The core:
  Advised:   (Bob, N3), (Ann, N4), (John, N1)
  WorksWith: (Bob, N3), (Ann, N4)
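A universal solution for dependencies like these can be produced by a naive chase. The sketch below is illustrative only, not a real data exchange engine: it invents its own labeled-null names (so they need not coincide with those on the slide), and it does not minimize the result to the core:

```python
from itertools import count

# Naive chase for the two dependencies:
#   PTStud(x,y)  -> EXISTS z Advised(y,z)
#   GradStud(x,y) -> EXISTS z (Advised(y,z) AND WorksWith(y,z))
# Existential variables are satisfied with fresh labeled nulls N1, N2, ...
def chase(ptstud, gradstud):
    nulls = (f"N{i}" for i in count(1))
    advised, workswith = [], []
    for _age, name in ptstud:
        advised.append((name, next(nulls)))
    for _age, name in gradstud:
        z = next(nulls)                 # one shared null for both atoms
        advised.append((name, z))
        workswith.append((name, z))
    return advised, workswith

advised, workswith = chase([(27, "Bob"), (30, "Ann")],
                           [(32, "John"), (30, "Ann")])
```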
Commercial vs. Research Prototype Systems
Research prototypes (e.g., Clio, Spicy) tend to produce target instances that look more and more like the core, whereas commercial tools leave the task to the users, who have to manually interact with sophisticated GUIs and write pieces of the transformation by hand. No core definition is even considered.
Mixing Matching and Mapping
Matching and mapping are not always done by separate tools:
  Clio has, as an add-on, a matcher based on attribute feature analysis [Naumann et al. 2002]
  Bernstein's model management considers the matcher a fully integrated and indistinguishable component
  Spicy [Bonifati et al. 2008] has a matcher based on instance-based structural analysis
Limitations of Current Systems
Manual approaches are not applicable to large-scale mapping tasks
The user/developer has to become familiar with the mapping language and the user interfaces
The outcome of the mapping process may not respect the user requirements and desired semantics (unsurprisingly!)
Specifications may be incomplete and dependent on system peculiarities
Thus, there is a need for a verification and guidance process
The Verification Process

[Diagram: the mapping generation pipeline, extended with a verification-and-selection step in which the user supplies data examples and an expected target instance to check the produced mappings and the generated target instance.]
A-Posteriori Verification
The main problem with matching and mapping is the dichotomy between the expected results and the generated answers.
Some tools allow a post-verification:
  by using data examples: Tupelo [Fletcher et al. 2006], Muse [Alexe et al. 2008], Clio [Alexe et al. 2010]
  by using automatic instance comparison: Spicy [Bonifati et al. 2008]
  by means of manual user feedback: unfeasible for large-scale tasks
  via debugging techniques: Routes [Chiticariu et al. 2006]
ETL systems
Extract-Transform-Load tools are data transformation tools based on graphical flowcharts, with nodes encoding transformation primitives and edges encoding the transformation flow.
They can be considered a special form of mapping system. They generate transformations through:
  a GUI
  an intermediate language (an algebra for ETL)
  an output (transformation scripts)
They are not mapping tools in the classical sense: they focus only on data transformation operators.
[Figure: an ETL data flowchart. Numbered nodes apply transformation primitives such as Not Null(CustKey), SK(custkey), PhoneFormat, and New - Old to the inputs Customer.new and Customer.old, producing the outputs Cnew and Cold; tuples failing a check are routed to an Error output.]

An ETL data flowchart
How Can One Decide If A Product is Good ?
Importance Of A Benchmark
A benchmark helps:
  Designers and developers to improve their tools, by assessing their usefulness and constantly evaluating their performance
  Users to compare the different available tools and evaluate their suitability for their needs [Haas et al. 2007]
  Researchers to compare their systems to others
It should exist for a long term, to allow adequate measurement of the evolution of the field
It helps assess absolute results: the properties of the results, and how they compare to others
Benchmark
Well-designed tests (scenarios) with which the results of a system can be evaluated [Castro et al. 2004]
A standardized application scenario that serves as a basis for testing, evaluation, and comparison [Merriam-Webster]
Clearly specified scenarios that everyone can implement
Clearly specified factors that are measured, and the conditions under which they should be measured
Should measure the degree of achievement
Should be reproducible and stable
Can be used repeatedly
Principles
Systematic Procedure
Continuity
Quality and equity
Dissemination
Intelligibility
Types of evaluation
Competence benchmarks
  Measure competence and performance with respect to a task
  Aim at characterizing the kind of tasks each method is good for
  For designers to improve their systems
Comparative evaluation
  Comparison of the results of various systems on a common task
  Aims at finding the best system (tuning of the systems is an issue)
  Comparison of systems, aiming at general field improvement
Application-specific evaluation
  Comparison of various systems on a specific task
Competitive evaluation
Evaluation Steps
Planning
Specifying task, software, hardware, input, output
Processing
Analysis
Result evaluation according to predefined measures
Bottom Line: Benchmarks Are Great !
Generic Matching/Mapping Benchmark Goals
Compare in terms of
Performance
Usability
Effectiveness
Applicability to real-world scenarios
Improve the quality of the matching and mapping generation process
Query vs Matching/Mapping Benchmarks

Query Benchmark:
  Evaluation scenarios: a setting (a database instance/schema + a query) and the expected outcome
  The query engine should support the scenarios (mainly, it should be able to evaluate the query of each scenario)
  Supporting a scenario: query engine result = expected outcome
  Good query engine = fast (correct) responses
  Vary the characteristics of the data instance to measure how well the engine scales

Schema Mapping Benchmark:
  For a mapping system, what is the input and what is the output?
  A mapping tool's input language should be able to express the transformations of interest. What are they?
  How do you compare a mapping system's output with an expected output?
  What do we measure? Effort? Expressiveness?
  What do we scale?
The Scenario Input
Source schema S
Target schema T
Possibly an instance of the source schema S
A specification of what we need to achieve

Matching systems:
  Typically there is no specification: just the source and the target schema
  A complete set of matches is assumed as correct

Mapping systems:
  The specification itself is a major issue …
The Specification for Mapping Systems
An expected (desired) transformation
  Mapping systems try to guess it
Issues:
  No formal semantic framework to express it
  No formal relationship to the outcome
Note: query engine benchmarks (e.g., TPC-H or XMark) leverage the semantics of the query language
  It is clear what the scenario is asking
  It is clear how to compare the result sets
Expressing Desired Transformations
Natural language
  Too generic and ambiguous
A complete specification formalism (a query?)
  Defeats the purpose of a mapping system
  Comparing the generated mapping of the tool to the precise specification amounts to checking the equivalence of two mappings (a hard problem!)
Expressing Desired Transformations
Graphical interface
  Different constructs are exposed by different tools
  Typically a GUI for the query language
  Continuously evolving
Simple specification (correspondences?)
  1-1, many-1, between atomic or complex elements, nested, with or without annotations, GUI constructs
  Can get so complex that they become the same as the actual mappings
  Ambiguous: the same set of correspondences is interpreted differently by different mapping tools. Without a standard way to interpret them? Risky!
A Simple Ambiguous Scenario

Source: Company { Name, Location }    Target: Organization { Title, Address }

Different interpretations may arise from a simple "copy" scenario:

<Source>
  <Company>
    <Name>IBM</Name>
    <Location>NY</Location>
  </Company>
  <Company>
    <Name>MS</Name>
    <Location>WA</Location>
  </Company>
</Source>

<Target>
  <Organization>
    <Title>IBM</Title>
    <Title>MS</Title>
    <Address>NY</Address>
    <Address>WA</Address>
  </Organization>
</Target>
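The ambiguity can be made concrete with a small sketch: the same two correspondences (Name to Title, Location to Address) admit one reading that groups all values under a single Organization and another that creates one Organization per Company. Python dictionaries stand in for the XML here; the function names are ours:

```python
# Source instance: two Company records.
companies = [{"Name": "IBM", "Location": "NY"},
             {"Name": "MS", "Location": "WA"}]

def grouped(src):
    """Reading 1: one Organization collecting all Titles and Addresses."""
    return [{"Title": [c["Name"] for c in src],
             "Address": [c["Location"] for c in src]}]

def per_company(src):
    """Reading 2: one Organization per source Company."""
    return [{"Title": c["Name"], "Address": c["Location"]} for c in src]
```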
A Simple Ambiguous Scenario

Source: Company { Name, Location }    Target: Organization { Title, Address }

Different tools might generate different instances from the same arrows:

<Source>
  <Company>
    <Name>IBM</Name>
    <Location>NY</Location>
  </Company>
  <Company>
    <Name>MS</Name>
    <Location>WA</Location>
  </Company>
</Source>

<Target>
  <Organization>
    <Title>IBM</Title>
    <Address>NY</Address>
  </Organization>
</Target>
A Simple Ambiguous Scenario

Source: Company { Name, Location }    Target: Organization { Title, Address }

Arrows between non-leaf nodes are not allowed in all tools:

<Source>
  <Company>
    <Name>IBM</Name>
    <Location>NY</Location>
  </Company>
  <Company>
    <Name>MS</Name>
    <Location>WA</Location>
  </Company>
</Source>

<Target>
  <Organization>
    <Title>IBM</Title>
    <Address>NY</Address>
  </Organization>
  <Organization>
    <Title>MS</Title>
    <Address>WA</Address>
  </Organization>
</Target>
The Issue of The User Input
Many tools allow the mapping designer to manually edit the generated transformation
  With power equal to that of the underlying language
  With shortcuts and different abstraction levels
The Scenario Output
The output has to be correct:
  Satisfy the desired transformation
  Compare to the expected transformation

For matching: a set of matches
For mapping: it is not clear what the output is
  The transformation scripts?
  The transformed data? The same data may be generated by different mappings
Evaluation Challenges
What are we testing?
Expressiveness
Performance
of the tool?
of the generated mappings?
Quality
of the generated mappings?
of the integrated schema?
of the target data? [Dong et al. 2009]
User effort
Heavily depends on the mapping interface
Measuring these factors is hard without a formal (and standard) agreement on expressing specifications
Matching vs. Mapping Benchmarking
Matching system evaluation is typically a set comparison
  Considers only the semantics of the schemas
  More automatic
Mapping system evaluation is more challenging
  Considers the semantics of both schemas and transformations
  Requires more human intervention
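A minimal sketch of such a set comparison, scoring a produced match set against a gold standard with precision and recall (the function name and example matches are ours):

```python
# Matching evaluation as set comparison: precision is the fraction of
# produced matches that are correct, recall the fraction of expected
# matches that were found.
def precision_recall(produced: set, expected: set):
    if not produced or not expected:
        return 0.0, 0.0
    tp = len(produced & expected)
    return tp / len(produced), tp / len(expected)

gold = {("Name", "Title"), ("Location", "Address")}
found = {("Name", "Title"), ("Name", "Address")}
p, r = precision_recall(found, gold)   # p = 0.5, r = 0.5
```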
The User Can Also Help …
By being presented with …
  The mappings: difficult to overcome the heterogeneity of languages
  The generated target instance: not feasible for large and complex instances [Velegrakis et al. 2005]
  A representative sample of the target instance: an appealing alternative based on positive/negative examples, but still in its infancy [Alexe et al. 2008] (presented in detail later on)
How do all these get into the evaluation function?
A Common Design Pattern
Example: TPC-H
Sets of test cases (with their expected output)
Find those that the system can successfully execute
Characterize the system accordingly
For matching/mapping: sets of matching or mapping scenarios
Examples of Data Sets
Public and well-designed schemas
Meaningful overlap
Limited by the existence of real schemas
Need to be discriminating
Who is the Oracle? External knowledge or a human?
Large Scale Ontology Sets
[Zhang et al. 2004]
Two large ontologies from the anatomy domain
Foundational Model of Anatomy & GALEN
Thousands of classes, no instances
[Lambrix et al. 2003]
Gene & Signal Ontology
Partial overlap
OAEI Data
Artificial data set
33 classes, 64 properties, 76 individuals
Initial ontology distorted
Result: ~50 pairs of ontologies
Correct by construction
Data Set Factors Affecting The Evaluation
Heterogeneity of the modeling language (schemas/ontologies)
The language itself
Number of schemas (1-to-1 or many-to-1)
From-scratch matching, or with a head start
Multiplicity: how many elements in one schema can match how many in the other
Are oracles permitted? Is user input permitted?
Can there be a-priori training?
External methods and auxiliary inputs
Is justification of the output needed?
Relations of the correspondences (only = or others as well)
Is there a time limit for the matching/mapping?
Can the matching be on leaves only, or not?
83 EDBT 2011, A. Bonifati & Y. Velegrakis
OAEI Evaluation Example
[Euzenat et al., 2006]
Ontology Alignment Evaluation Initiative
oaei.ontologymatching.org
Yearly contest
Participants:
Provided with OAEI API
Execute all tests
Provide their results & Paper
Make the results public
84 EDBT 2011, A. Bonifati & Y. Velegrakis
Building Large Ontology Sets
[Avesani et al. 2005]
Test sets for matching web directories and classifications
Two web directories are similar if their web pages are similar
It can be considered a matching technique by itself
85 EDBT 2011, A. Bonifati & Y. Velegrakis
Thesauri
Thesauri covering large hierarchies of concepts and textual knowledge
Digital Libraries and Museums
Large need to match them
Example:
AGROVOC (FAO): 16K terms
NAL (US Agricultural dep.): 41K terms
86 EDBT 2011, A. Bonifati & Y. Velegrakis
Various examples
Illinois Semantic Integration Archive
http://pages.cs.wisc.edu/~anhai/wisc-si-archive
Collection of different schemas & Data
Faculties
Courses
Real Estate
87 EDBT 2011, A. Bonifati & Y. Velegrakis
Real Examples Lack Systematic Design
Existing datasets are not systematically designed:
Completeness?
Correctness?
Deduplicated?
Clarity?
They are mainly testbeds or standardized tests
But not benchmarks
Benchmark tests should be:
Consistent
Complete
Minimal
88 EDBT 2011, A. Bonifati & Y. Velegrakis
Real-World Matching Problems
[Kopcke et al. 2010]
Collection of matching problems
[Giunchiglia et al. 2009]
4500 matches between 3 web directories
Error free
Low complexity
High discriminative capacity
89 EDBT 2011, A. Bonifati & Y. Velegrakis
XBenchMatch
[Duchateau et al. 2007]
Criteria for testing and evaluating matching tools
Focuses on assessment of matching tools
Quality
Time
10 Datasets for matching
Classified according to:
Data level, e.g., degree of heterogeneity
Process level, e.g., scale
90 EDBT 2011, A. Bonifati & Y. Velegrakis
STBenchmark
[Alexe et al. 2008] www.stbenchmark.org
Evaluate the effectiveness of the mapping system
Derived from real applications DBLP, BioWarehouse, …
Derived from Information Integration Literature [Lerner, 2000], [Carey, 2000],
etc.
Minimum set of transformations that should be supported
1 scenario – 1 transformation
Each scenario is described by:
Source & target schemas
Transformation query
Instance of the source schema
Captures the most practically relevant transformation cases:
Copying, Constant Value Generation, Horizontal Partition, Surrogate Key Assignment, Vertical Partition, Unnesting (Flattening), Nesting, Self-Joins, Denormalization, Keys and Object Fusion, Atomic Value Management, Aggregation, Ordering (Order By), Flipping Metadata to Data, Flipping Data to Metadata, Flipping Data to Nested Metadata
91 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenario: Copy
Source:
Protein * name accession created
Target:
Protein * Name Accession Created
for $x0 in $doc/Source/Protein return
  <Protein>
    <Name>{ $x0/name/text() }</Name>
    <Accession>{ $x0/accession/text() }</Accession>
    <Created>{ $x0/created/text() }</Created>
  </Protein>

Each scenario comes with a textual description and the transformation query
92 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenario: Value Generation
Target:
DataSet * Name LoadingDate
“SwissProt” “July 4th”
<DataSet> <Name>SwissProt</Name> <LoadingDate>July 4th</LoadingDate> </DataSet>
93 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenario: Horizontal Partitioning
Source:
gene * name type protein
Target:
Gene * Name Protein
Synonym * Name Protein
If type = "primary"

for $x0 in $doc/Source/Gene
where $x0/type/text() = "primary"
return
  <Gene>
    <Name>{ $x0/name/text() }</Name>
    <Protein>{ $x0/protein/text() }</Protein>
  </Gene>

for $x0 in $doc/Source/Gene
where $x0/type/text() != "primary"
return
  <Synonym>
    <Name>{ $x0/name/text() }</Name>
    <Protein>{ $x0/protein/text() }</Protein>
  </Synonym>
94 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenario: Surrogate Key Assignment
Source:
gene * name type protein
Target:
Gene * Name Protein WID
Synonym * Name Protein WID
If type = "primary"
Id(), Id'(): surrogate-key generation functions

for $x0 in $doc/Source/Gene
where $x0/type/text() = "primary"
return
  <Gene>
    <Name>{ $x0/name/text() }</Name>
    <Protein>{ $x0/protein/text() }</Protein>
    <WID>{ genID() }</WID>
  </Gene>

for $x0 in $doc/Source/Gene
where $x0/type/text() != "primary"
return
  <Synonym>
    <Name>{ $x0/name/text() }</Name>
    <Protein>{ $x0/protein/text() }</Protein>
    <WID>{ genID() }</WID>
  </Synonym>
95 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenario: Vertical Partition
Source:
Reaction * entry name comment orthology definition equation
Target:
Reaction * Entry Name Comment CoFactor
ChemicalInfo * Orthology Definition Equation CoFactor
Labeled Nulls
for $x0 in $doc/Source/Reaction
let $id := genID()
return (
  <Reaction>
    <Entry>{ $x0/entry/text() }</Entry>
    <Name>{ $x0/name/text() }</Name>
    <Comment>{ $x0/comment/text() }</Comment>
    <CoFactor>{ $id }</CoFactor>
  </Reaction>,
  <ChemicalInfo>
    <Orthology>{ $x0/orthology/text() }</Orthology>
    <Definition>{ $x0/definition/text() }</Definition>
    <Equation>{ $x0/equation/text() }</Equation>
    <CoFactor>{ $id }</CoFactor>
  </ChemicalInfo>
)

Normalization. Note that no key information is assumed; as such, duplication is allowed
96 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenario: Join Path Selection
Target:
Taxon * Id Name UniqueName Class Parent Rank EmblCode
Source:
Name * id name uniqueName class
Node * taxid parentId rank emblCode
for $x0 in $doc/Source/Name, $x1 in $doc/Source/Node
where $x0/id/text() = $x1/taxId/text()
return
  <Taxon>
    <Id>{ $x0/id/text() }</Id>
    <Name>{ $x0/name/text() }</Name>
    <UniqueName>{ $x0/uniqueName/text() }</UniqueName>
    <Class>{ $x0/class/text() }</Class>
    <Parent>{ $x1/parentId/text() }</Parent>
    <Rank>{ $x1/rank/text() }</Rank>
    <EmblCode>{ $x1/emblCode/text() }</EmblCode>
  </Taxon>
Denormalization
97 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenario: Cyclic Joins
Source:
Gene * name type protein
Target:
Gene * Name Protein
Synonym * Name GeneWID
If type = "primary"

for $x0 in $doc/Source/Gene
where $x0/type/text() = "primary"
return (
  <Gene>
    <Name>{ $x0/name/text() }</Name>
    <Protein>{ $x0/protein/text() }</Protein>
  </Gene>,
  for $x1 in $doc/Source/Gene
  where $x1/type/text() != "primary"
    and $x1/protein/text() = $x0/protein/text()
  return
    <Synonym>
      <Name>{ $x1/name/text() }</Name>
      <GeneWID>{ $x0/name/text() }</GeneWID>
    </Synonym>
)
98 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenario: Un-Nesting Structures
Target:
Publication * Title Year PublishedIn Name
Source:
Reference * title year publishedIn Author * name
for $x0 in $doc/Source/Reference,
    $x1 in $x0/Author
return
  <Publication>
    <Title>{ $x0/title/text() }</Title>
    <Year>{ $x0/year/text() }</Year>
    <PublishedIn>{ $x0/publishedIn/text() }</PublishedIn>
    <Name>{ $x1/name/text() }</Name>
  </Publication>
99 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenario: Nesting Structures
Target:
Period * Year Author * Name Publication * Title PublishedIn
Source:
Publication * title year publishedIn name
for $x0 in distinct-values($doc/Source/Publication/year)
return
  <Period>
    <Year>{ $x0 }</Year>
    {
      for $x1 in distinct-values($doc/Source/Publication[year=$x0]/name)
      return
        <Author>
          <Name>{ $x1 }</Name>
          {
            for $x2 in $doc/Source/Publication
            where $x2/year/text() = $x0 and $x2/name/text() = $x1
            return
              <Publication>
                <Title>{ $x2/title/text() }</Title>
                <PublishedIn>{ $x2/publishedIn/text() }</PublishedIn>
              </Publication>
          }
        </Author>
    }
  </Period>
100 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenario: Keys & Object Fusion
Target:
Experiment * Contact Date Description ExperimentalData * Data Role
Source:
Experiment * eid contact date description ExperimentalData * data role
FlowCytometrySample id contact date Probe * data type
<Source2>{
  for $x0 in $doc/Source/Experiment, $x1 in $x0/ExperimentalData
  return
    <Datum>
      <id>{ genID($x0/contact/text(), $x0/date/text()) }</id>
      <Contact>{ $x0/contact/text() }</Contact>
      <Date>{ $x0/date/text() }</Date>
      <Description>{ $x0/description/text() }</Description>
      <Data>{ $x1/data/text() }</Data>
      <Role>{ $x1/role/text() }</Role>
    </Datum>,
  for $x0 in $doc/Source/FlowCytometrySample, $x1 in $x0/Probe
  return
    <Datum>
      <id>{ genID($x0/contact/text(), $x0/date/text()) }</id>
      <Contact>{ $x0/contact/text() }</Contact>
      <Date>{ $x0/date/text() }</Date>
      <Data>{ $x1/data/text() }</Data>
      <Role>{ $x1/type/text() }</Role>
    </Datum>
}</Source2>

for $x0 in distinct-values($doc/Source2/Datum/id)
return
  <Experiment>{
    for $x1 in ($doc/Source2/Datum[id=$x0])[1]
    return ( $x1/Contact, $x1/Date, $x1/Description ),
    for $x3 in $doc/Source2/Datum
    where $x3/id/text() = $x0
    return
      <ExperimentalData>{ $x3/Data, $x3/Role }</ExperimentalData>
  }</Experiment>
101 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenario: Atomic Value Manipulation
Target:
Contact * FirstName LastName Address Phone
Source:
Contact * name address street city zip phone
GetFirstName(…)
Type Discrepancy Handling
for $x0 in $doc/Source/Contact
return
  <Contact>
    <FirstName>{ GetFirstName($x0/name/text()) }</FirstName>
    <LastName>{ GetLastName($x0/name/text()) }</LastName>
    <Address>{ Concat($x0/street/text(), $x0/city/text(), $x0/zip/text()) }</Address>
    <Phone>{ String2Int($x0/phone/text()) }</Phone>
  </Contact>
102 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenarios: Aggregation & Order
103 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenarios: Data Meta-data
104 EDBT 2011, A. Bonifati & Y. Velegrakis
Thalia
[Hammer et al. 2005]
Integration tools benchmark
Rich set of test data
Source schemas
Syntactic and semantic heterogeneity
12 test queries for the integrated schema
105 EDBT 2011, A. Bonifati & Y. Velegrakis
Real Data Is Not Enough for Benchmarking...
… but definitely not for the reason Dilbert thinks !
106 EDBT 2011, A. Bonifati & Y. Velegrakis
Talk Outline
Introduction
Matching and mapping: techniques & tools
Benchmarks and evaluation principles
Designing a matching & mapping benchmark
Using real-world scenarios for evaluation
Generating synthetic evaluation scenarios
Measuring efficiency and effectiveness
Conclusions and future directions
107 EDBT 2011, A. Bonifati & Y. Velegrakis
Why Synthetic Data & Scenarios
To stress test the system
To understand performance in diverse situations
To create additional realistic test cases
To ensure that unforeseen situations are also tested
108 EDBT 2011, A. Bonifati & Y. Velegrakis
Top-down Scenario Construction
Start with a big schema and divide/extract
TaxME2 [Giunchiglia et al. 2009]: preserves correctness, complexity, performance
[Okawara et al. 2006]: techniques for how such a benchmark should be built
[Hammer et al. 2005] – Thalia: large dataset + filters
[Lee et al. 2007] – eTuner: duplicate the schema, split the data in two, modify the first half
Limited kinds of modifications, so not very natural
109 EDBT 2011, A. Bonifati & Y. Velegrakis
Bottom-up Scenario Construction
Create the schema from scratch
STBenchmark [Alexe et al. 2008]
Schema Generator
Expands basic scenarios
Changing basic characteristics of the scenario
Data Generator
Hand-in-hand with the Schema Generator
ToXGen [Barbosa et al. 2002]
110 EDBT 2011, A. Bonifati & Y. Velegrakis
Parameters of Schema Generation
• Number of subelements
• Nesting depth
• Join size
• Join width
• Join kind (star / chain)
• Function arity
f(…)
Source R1 [0…*] A1 A2 A3
R2 [0…*] A4 A5
R3 [0…*] A6
R4 [0…*] A7 A8 A9
R5 [0…*] A10 A11
R6 [0…*] A12
Parameter values: sampled from normal distributions given by an average and a standard deviation
111 EDBT 2011, A. Bonifati & Y. Velegrakis
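The sampling step above can be sketched in a few lines of Python. The parameter names mirror the slide's list, but the average/deviation values are illustrative assumptions, not the generator's actual defaults:

```python
import random

# Each schema-generation parameter is drawn from a normal distribution
# specified as (average, standard deviation). Values are hypothetical.
PARAMS = {
    "num_subelements": (3.0, 1.0),
    "nesting_depth":   (2.0, 0.5),
    "join_size":       (2.0, 1.0),
    "join_width":      (2.0, 0.5),
    "function_arity":  (1.5, 0.5),
}

def sample_parameters(params, rng):
    """Draw one value per parameter; clamp to >= 1 and round to an int."""
    return {name: max(1, round(rng.gauss(avg, std)))
            for name, (avg, std) in params.items()}

config = sample_parameters(PARAMS, random.Random(42))
print(config)
```

Re-sampling with different seeds yields a family of structurally different, yet statistically comparable, scenarios.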
Stretching The Unnesting Scenario
Source Reference [0…*] title year publishedIn Author [0…*] name Affiliation [0…*] university country Students [0…*] sname
Target Publication [0…*] Title Year PublishedIn Name University Country StudentName
Unnesting
Basic Scenario
112 EDBT 2011, A. Bonifati & Y. Velegrakis
Stretching The Horizontal Partitioning
Vary the number of partitions
Vary the number of elements that exist in each partition
Source:
gene * name type protein hpAttr1 hpAttr2
Target:
Gene * Name Protein HpAttr1 HpAttr2
Synonym * Name Protein HpAttr1 HpAttr2
HPRel1 * Name Protein HpAttr1 HpAttr2
113 EDBT 2011, A. Bonifati & Y. Velegrakis
Combining Mapping Scenarios
Lack of diversity → combine scenarios
Based on a set of configuration parameters, generate a complex mapping scenario by concatenating scaled-up mapping scenarios
S1 T1
S2 T2
P1
P2
Concatenation of (S1, T1, P1) and (S2, T2, P2)
114 EDBT 2011, A. Bonifati & Y. Velegrakis
Stretched Mapping Scenarios
Source Reference [0…*] title year publishedIn Author [0…*] name Affiliation [0…*] university country Students [0…*] sname
Target Publication [0…*] Title Year PublishedIn Name University Country StudentName
Horizontal
Partitioning
Source Schema Target Schema
Horizontal
Partitioning
Copy
Unnesting Unnesting
Copy
Complex Mapping Scenario
115 EDBT 2011, A. Bonifati & Y. Velegrakis
Composing Mapping Scenarios
Intermix of basic mapping scenario transformations
capture cases where different types of transformations occur simultaneously on the same part of a schema
Main idea:
Generate a source schema S
Evolve S to obtain a target schema.
All based on the configuration parameters
Example next:
116 EDBT 2011, A. Bonifati & Y. Velegrakis
Composed Mapping Scenarios

Example: a generated source schema S is evolved into a target schema T, together with the transformation query P.

Source schema S:
R1 [0…*] A1 A2, SE1 [0…*] A3 A4, SE2 [0…*] A5
R2 [0…*] A7 A8, SE3 [0…*] A9, SE4 [0…*] A10 A11
R3 [0…*] A12 A13 A14

Evolution steps:
1. Unnesting: F1(A1 A2 A3 A4 A5), F2(A7 A8 A9 A10 A11), F3(A12 A13 A14)
2. Removal: F1(A1 A3 A5), F2(A7 A8 A10 A11), F3(A12 A13 A14)
3. Duplication & migration: F1(A1 A3 A5 A7b), F2(A7 A8 A10 A11 A12), F3(A13 A14)
4. Addition: F1(A1 A3 A5 A7b A15(id)), F2(A7 A8 A10 A11 A12 A16(="June")), F3(A13 A14 A17(=A3*A14))
5. Nesting (target schema T): R4 [0…*] A1 A3, SE5 [0…*] A5 A7b A15; R6 [0…*] A7 A8 A10 A11 A12 A16; R7 [0…*] A13 A14, SE6 [0…*] A17
117 EDBT 2011, A. Bonifati & Y. Velegrakis
Synthetic Examples for Matching
[Ferrara et al. 2008]
ISLab Instance Matching Benchmark
Creates a reference ontology and populates it
Using the web
Performs a sequence of modifications
Variations in data values
Structural heterogeneity
Semantic variations
118 EDBT 2011, A. Bonifati & Y. Velegrakis
Scenarios Are Useless Without Metrics
119 EDBT 2011, A. Bonifati & Y. Velegrakis
Talk Outline
Introduction
Matching and mapping: techniques & tools
Benchmarks and evaluation principles
Designing a matching & mapping benchmark
Using real-world scenarios for evaluation
Generating synthetic evaluation scenarios
Measuring efficiency and effectiveness
Conclusions and future directions
120 EDBT 2011, A. Bonifati & Y. Velegrakis
Metric Categorization
Qualitative metrics
Compliance measures
Quantitative metrics
Performance measures
User-specific metrics
Application specific metrics
121 EDBT 2011, A. Bonifati & Y. Velegrakis
Qualitative = Compliance
Evaluate the degree of compliance of the system with respect to some standard
Matching: Precision, Recall, F-measure and Fallout [Euzenat et al. 2007]
Measure the difference of the system output from some reference (expected) output
An expert user is typically assumed to provide the expected matches [Duchateau et al. 2007] [Euzenat et al. 2004]
They do not provide any measure of the post-match effort
They do not consider the time spent by the user in verification during intermediate stages
122 EDBT 2011, A. Bonifati & Y. Velegrakis
Terminology
E − G : False Negatives
G − E : False Positives
G ∩ E : True Positives
U − (G ∪ E) : True Negatives

E: Expected, G: Generated, U: universe of all possible correspondences
123 EDBT 2011, A. Bonifati & Y. Velegrakis
Hamming Distance
Measures the dissimilarity between matches
H(G, E) = 1 − |G∩E| / |G∪E|

Example:
E = {Book-Volume, Person-Human, Science-Essay}
G = {Product-Volume, Person-Writer, Science-Essay}
H(G, E) = 1 − 1/5 = 4/5
124 EDBT 2011, A. Bonifati & Y. Velegrakis
Precision
Originated from IR [van Rijsbergen, 1975]
Adopted to matching [Do et al. 2002]
Ratio of correctly found correspondences (true positives) over the total number of returned correspondences (true and false positives)
Intuitively: The degree of correctness
Precision(G, E) = |G∩E| / |G|
125 EDBT 2011, A. Bonifati & Y. Velegrakis
Recall
Originated from IR [van Rijsbergen, 1975]
Adopted to matching [Do et al. 2002]
Ratio of correctly found correspondences (true positives) over the total number of expected correspondences (true positives and false negatives)
Intuitively: The degree of completeness
Recall(G, E) = |G∩E| / |E|
126 EDBT 2011, A. Bonifati & Y. Velegrakis
Fallout
The percentage of the found matches that are false positives
Intuitively: How much error has been made
Fallout(G, E) = (|G| − |G∩E|) / |G| = |G−E| / |G|
127 EDBT 2011, A. Bonifati & Y. Velegrakis
F-Measure
Precision & Recall not always consistent
Their complements Noise & Silence neither
Aggregate Precision & Recall
F-Measure_α(G, E) = Precision(G, E) × Recall(G, E) / ((1−α) × Precision(G, E) + α × Recall(G, E))

For α=1, F-Measure equals Precision; for α=0, Recall. For α=0.5, it is the harmonic mean
128 EDBT 2011, A. Bonifati & Y. Velegrakis
Overall
Like an edit-distance [Melnik et al. 2002]
Ratio of errors over the total number of expected correspondences (true positives and false negatives)
Overall < F-Measure. Ranges [-1,1]
Intuitively: The effort required to fix a matching
Overall(G, E) = Recall(G, E) × (2 − 1/Precision(G, E))
             = 1 − (|G∪E| − |G∩E|) / |E|
             = 1 − (|E−G| + |G−E|) / |E|
129 EDBT 2011, A. Bonifati & Y. Velegrakis
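The set-based compliance metrics of the last few slides can be sketched in a few lines of Python, over correspondences represented as pairs (the example sets are the Book/Volume ones used for the Hamming distance):

```python
# Compliance metrics over sets of correspondences.
# G: generated correspondences, E: expected correspondences.

def precision(G, E):
    return len(G & E) / len(G)

def recall(G, E):
    return len(G & E) / len(E)

def fallout(G, E):
    return len(G - E) / len(G)

def f_measure(G, E, alpha=0.5):
    p, r = precision(G, E), recall(G, E)
    return p * r / ((1 - alpha) * p + alpha * r)

def overall(G, E):
    # Melnik et al.'s edit-distance-like accuracy; can be negative.
    return recall(G, E) * (2 - 1 / precision(G, E))

def hamming(G, E):
    return 1 - len(G & E) / len(G | E)

E = {("Book", "Volume"), ("Person", "Human"), ("Science", "Essay")}
G = {("Product", "Volume"), ("Person", "Writer"), ("Science", "Essay")}

print(precision(G, E))  # 1/3
print(overall(G, E))    # -1/3: fixing G costs more than it contributes
```

With one true positive out of three correspondences on each side, precision, recall, and the balanced F-measure all equal 1/3, while Overall drops below zero.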
Strength-based Similarity
Takes into consideration the degree of confidence
SBS(G, E) = 2 × Σ_{c ∈ G∩E} |strength_G(c) − strength_E(c)| / (|G| + |E|)
130 EDBT 2011, A. Bonifati & Y. Velegrakis
All-or-nothing vs. Approximation
Product
DVD
Book
Science
Textbook
Popular
……
Volume
Essay
Politics
Biography
…… Expected
Far
Close
131 EDBT 2011, A. Bonifati & Y. Velegrakis
Relaxed Precision & Recall
[Ehrig et al. 2005]
Precision_ω(G, E) = ω(G, E) / |G|
Recall_ω(G, E) = ω(G, E) / |E|
ω(G, E) = Σ_{(a,r) ∈ M(G,E)} σ(a, r)

σ: correspondence similarity function
M(G, E): best matches with regard to σ
ω(G, E) is used instead of |G ∩ E|
132 EDBT 2011, A. Bonifati & Y. Velegrakis
Weighted Harmonic Mean
Integrates multiple similarity measures
Given n similarity measures M_i, each with a weight w_i such that w_i ∈ [0,1] and Σ_{i=1..n} w_i = 1:

WHM(G, E) = Π_{i∈I} M_i(G, E) / Σ_{i∈I} w_i × M_i(G, E)
133 EDBT 2011, A. Bonifati & Y. Velegrakis
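A sketch of relaxed precision and recall in Python. The greedy one-to-one matching used here for M(G, E) is an illustrative simplification of the best matching assumed by [Ehrig et al. 2005], and `exact` is a hypothetical similarity function σ:

```python
def omega(G, E, sigma):
    """Sum of similarities over a greedy one-to-one matching of G and E."""
    pairs = sorted(((sigma(a, r), a, r) for a in G for r in E), reverse=True)
    used_g, used_e, total = set(), set(), 0.0
    for s, a, r in pairs:
        if s > 0 and a not in used_g and r not in used_e:
            used_g.add(a); used_e.add(r); total += s
    return total

def relaxed_precision(G, E, sigma):
    return omega(G, E, sigma) / len(G)

def relaxed_recall(G, E, sigma):
    return omega(G, E, sigma) / len(E)

G = {("Science", "Essay"), ("Person", "Writer")}
E = {("Science", "Essay"), ("Person", "Human")}

def exact(a, r):
    # With a 0/1 similarity, the relaxed measures reduce to the classical ones.
    return 1.0 if a == r else 0.0

print(relaxed_precision(G, E, exact))  # 0.5
```

Replacing `exact` with a graded σ (e.g. giving partial credit when only the source element agrees) rewards "close" misses instead of treating them like arbitrary errors.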
Evaluating Mapping Systems
Efficiency
Mapping Generation Time
Data Translation Performance
Time
Parallelization
Human effort
134 EDBT 2011, A. Bonifati & Y. Velegrakis
Matching/Mapping Generation Time
Matching tools require little human intervention
Time has been measured for matching tasks in [Yatskevic et al. 2003], discussed in a recent benchmark [Kopcke et al. 2010], and also addressed in XBenchMatch [Duchateau et al. 2007]
Mapping tools, e.g., Clio, HePToX, Spicy, and STBenchmark, do not elaborate on the issue
It is hard to measure time in a process in which human participation is part of the process
It includes the time to guide, verify, and tune the mapping tool
135 EDBT 2011, A. Bonifati & Y. Velegrakis
Translation Time as Performance Metric
Time to execute the transformation script
Indirectly a quality metric on the generated mappings
Attention to avoid evaluation of the query engines
Same engine & Hardware
Need to be fair
Same target instance
Efficient Core generation
[Mecca et al. 2009] [tenCate et al. 2009]
Time performance of ETL workflows
[Simitsis et al. 2009]
136 EDBT 2011, A. Bonifati & Y. Velegrakis
Time and Parallelization for ETL tools
Beyond time performance, other factors may be quite relevant, such as:
Workflow execution throughput (with or without failures)
Average latency per tuple
Along with the above factors, it is important to increase parallelization:
Pipelining: tasks of the ETL workflow are executed in parallel by different processors, and the output can be consumed by the next task without waiting for overall completion
Partitioning: data is partitioned and the transformation is applied to chunks of data
137 EDBT 2011, A. Bonifati & Y. Velegrakis
Human Effort In Matching Tools
The amount of work required to remove false positives and add false negatives
Since no human intervention takes place during the matching process
Human-spared resources [Duchateau 2009]
It counts the number of user interactions to obtain a 100% F-measure, i.e. the effort to remove false positives and add false negatives, and also to discover missing correspondences
138 EDBT 2011, A. Bonifati & Y. Velegrakis
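The human-spared-resources idea can be approximated by a simple interaction count, as in this Python sketch (the match sets are hypothetical, and real measures also weigh the cost of discovering the missing correspondences, not just adding them):

```python
def post_match_effort(G, E):
    """Interactions needed to turn the generated match set G into the
    expected set E: remove each false positive, add each false negative."""
    false_positives = G - E   # must be removed by the user
    false_negatives = E - G   # must be added by the user
    return len(false_positives) + len(false_negatives)

G = {"a-x", "b-y", "c-z"}
E = {"a-x", "b-w"}
print(post_match_effort(G, E))  # 3: remove b-y and c-z, add b-w
```

A perfect matcher (G = E) scores 0; the count grows with every correction needed to reach a 100% F-measure.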
Human Effort In Mapping Tools
Mapping tools can be seen as graphic tools
HCI studies can be used
Comparing the GUIs of the tools is hard
Schema mapping tools are a new technology; the tools are evolving and keep improving their interfaces
STBenchmark provides a first-cut measure on the effort required to implement a mapping scenario through the visual interface of a mapping system [Alexe et al. 2008]
139 EDBT 2011, A. Bonifati & Y. Velegrakis
A Simple Model
STBenchmark model Cost of implementing a mapping scenario:
4*L + S + 2*D + 4*K
L – mouse dragging actions
S – single mouse clicks
D – double mouse clicks
K – keystrokes
[MacKenzie et al. ’91]:
Mouse dragging is slower and more error-prone than clicking
It is easier to make mistakes when typing
Scenario / System    A    B    C    D
Scenario 1           8   16   16    4
Scenario 2          72   78   65  110
Scenario 3          32   52   37    7
Scenario 4          66   81   65  200
140 EDBT 2011, A. Bonifati & Y. Velegrakis
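The STBenchmark cost model is straightforward to encode; a minimal Python sketch (the action counts in the example call are hypothetical):

```python
def scenario_cost(L, S, D, K):
    """STBenchmark GUI-effort model: 4*L + S + 2*D + 4*K, where
    L = mouse drags, S = single clicks, D = double clicks, K = keystrokes.
    Drags and keystrokes are weighted highest because they are the
    slowest and most error-prone actions [MacKenzie et al. '91]."""
    return 4 * L + S + 2 * D + 4 * K

# e.g., a scenario needing 3 drags, 4 clicks, 1 double click, 2 keystrokes:
print(scenario_cost(L=3, S=4, D=1, K=2))  # 26
```

Counting the four action types per scenario and comparing the totals yields tables like the one above.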
Usability study for HePToX/Clio
• [Bonifati et al. 2010]
• Whereas Clio could implement considerably more scenarios, HePToX required less effort than Clio in the majority of the scenarios it could implement.
• HePToX is click-and-drag oriented, while Clio is click-and-select oriented.
141 EDBT 2011, A. Bonifati & Y. Velegrakis
Evaluating Mapping Systems
Efficiency
Mapping Generation Time
Data Translation Performance
Time
Parallelization
Human effort
Effectiveness
Supported Scenarios
Quality
Generated Mappings
Target instance
Target schema
142 EDBT 2011, A. Bonifati & Y. Velegrakis
Enumerating Supported Scenarios
System A System B System C
Scenario 1
Scenario 2
Scenario 3
Scenario 4
Scenario 5
Scenario 6
Scenario 7
……………..
143 EDBT 2011, A. Bonifati & Y. Velegrakis
Before talking about quality…
In order to check the quality of the generated mappings or the generated target instance:
Somebody has to provide you with the 'ideal' or 'desired' mappings or target instance
Normally, the user provides those
Or, alternatively, a 'core' target instance can be used
Or the set of mappings suggested by a benchmark
The focus here is not on who provides the above components, but on quality!
144 EDBT 2011, A. Bonifati & Y. Velegrakis
Quality of the Generated Mappings
Things are tricky for mapping tools:
Measuring the quality of generated mappings amounts to checking Query Containment or Query Equivalence
An NP-complete problem
It is preferable to check the quality of the results of a mapping:
i.e. the generated target instance
Current efforts try to characterize mappings in a quantitative way:
By their information loss, while inverting them
The notion of maximum extended recovery has been introduced in [Arenas2008, Fagin2009]
145 EDBT 2011, A. Bonifati & Y. Velegrakis
Quality of the Generated Target instance
A mapping is inherently a query
Same transformation, multiple ways.
Generation time
Readability
Mapping result
Whether it produces the respective results
Yes/No is too harsh
Similarity of queries …. too difficult to compute
Target instance is an alternative
Observe the result of the mappings
146 EDBT 2011, A. Bonifati & Y. Velegrakis
The Spicy Way
[Bonifati et al. 2008] – Spicy
Schema Matcher (internal or external)
Mapping Generator (internal or external)
The Spicy pipeline: the source and target schemas are first matched (e.g., s.A–t.E 0.87, s.C–t.F 0.76, s.D–t.F 0.98, …); line selection and mapping generation produce candidate mappings; structural analysis and mapping verification then return ranked mappings (e.g., mapping 1: 0.97, mapping 2: 0.87, mapping 3: 0.72, …)
147 EDBT 2011, A. Bonifati & Y. Velegrakis
The Spicy Way
Structural Analysis
uses the model of electrical circuits
uses sampling
uses a set of features on samples, e.g.:
length and character distribution
entropy of values
density of null values etc.
148 EDBT 2011, A. Bonifati & Y. Velegrakis
Structural Analysis
For an attribute A with sample sample(A), the atomic piece of circuit is the following
149 EDBT 2011, A. Bonifati & Y. Velegrakis
Trees Into Circuits
Circuits can be obtained for nested structures
150 EDBT 2011, A. Bonifati & Y. Velegrakis
The Spicy way: lessons learned
Spicy is a system for schema mapping verification
using comparison of instances to gauge the quality
iterating the mapping search algorithm until mapping quality is acceptable
Open issues:
Special element types (images, full-text), complex models (e.g. ontologies) and more complex classes of lines
Other instance comparison techniques other than structural analysis may be needed
What if we look at the efficiency of transformations (e.g. core computation) and their quality at the same time?
151 EDBT 2011, A. Bonifati & Y. Velegrakis
Quality of target instance for ETL
The quality of target instances is also important for ETL systems
Target instances can be characterized in terms of:
Ease of maintenance [Simitsis et al. 2009]
Resilience to failures
Data freshness
Compliance to business rules
152 EDBT 2011, A. Bonifati & Y. Velegrakis
Data examples
Since the expected/generated target instance may be large and the generated mappings may be numerous, samples of the expected/generated target instance can be used instead
The importance of data examples goes back to [Yan et al. 2001]:
Each mapping is a connected graph G = (N, E), where N is the set of nodes (source schema relations) and E represents conjunctions of join predicates among the nodes
Data associations (subgraphs of G) can be derived as relations that contain the maximum number of attributes that can be joined along E
Data associations are leveraged to understand what has to be included in a mapping
153 EDBT 2011, A. Bonifati & Y. Velegrakis
Routes
Source-to-target dependencies Σst:
m1: CardHolders(cn, l, s, n) → Accounts(cn, l, s), Clients(s, n)
m2: Dependents(an, s, n) → Clients(s, n)
Target dependencies Σt:
m3: Clients(s, n) → ∃A, L: Accounts(A, L, s)
MANHATTAN CREDIT (source schema S):
CardHolders: cardNo, limit, ssn, name
Dependents: accNo, ssn, name

FARGO FINANCE (target schema T):
Accounts: accNo, creditLine, accHolder
Clients: ssn, name

(m1 maps CardHolders to Accounts and Clients; m2 maps Dependents to Clients; m3 relates Clients to Accounts)

Source instance I:
CardHolders: (123, $15K, ID1, Alice)
Dependents: (123, ID2, Bob)

Target instance J, a solution for I under the schema mapping:
Accounts: (123, L1, ID1), (A2, L2, ID2)
Clients: (ID1, Alice), (ID2, Bob)
Allow the Inspection of the flow of the mappings [Chiticariu and Tan, 2006]
154 EDBT 2011, A. Bonifati & Y. Velegrakis
Example Debugging Scenario 1
Unknown credit limit?
15K is not copied over to the target
Source instance I:
CardHolders: (123, $15K, ID1, Alice)
Dependents: (123, ID2, Bob)

Target instance J:
Accounts: (123, L1, ID1), (A2, L2, ID2)
Clients: (ID1, Alice), (ID2, Bob)

A route for the Accounts tuple (123, L1, ID1):
CardHolders(123, $15K, ID1, Alice) —m1→ Accounts(123, L1, ID1), Clients(ID1, Alice)
155 EDBT 2011, A. Bonifati & Y. Velegrakis
Example Debugging Scenario 2
Unknown account number?
123 is not copied over to the target
as Bob’s account number
Source instance I:
CardHolders: (123, $15K, ID1, Alice)
Dependents: (123, ID2, Bob)

Target instance J:
Accounts: (123, L1, ID1), (A2, L2, ID2)
Clients: (ID1, Alice), (ID2, Bob)

A route for the Accounts tuple with accNo A2:
Dependents(123, ID2, Bob) —m2→ Clients(ID2, Bob) —m3→ Accounts(A2, L2, ID2)
157 EDBT 2011, A. Bonifati & Y. Velegrakis
The SPIDER System
Based on Routes
159 EDBT 2011, A. Bonifati & Y. Velegrakis
Data Examples as Evaluation Tools
The use of data examples as evaluation tools is underway [Chiticariu et al. 2008]
Examples are used to understand and refine mappings towards the desired specification
Universal data examples are data examples derived from universal solutions [Alexe et al. 2010]
If S and T contain only unary relations, with only Σst, a mapping is characterized by a set of:
Positive data examples (I, J) such that (I, J) ⊨ Σ
Negative data examples (I, J) such that (I, J) ⊭ Σ
160 EDBT 2011, A. Bonifati & Y. Velegrakis
Muse
[Chiticariu et al. 2008] Muse
Builds ad-hoc probes for each attribute, such that a small source example is built and two differentiating target examples are obtained
After that, Muse asks the designer: "Which target instance looks correct?"
This eliminates the mappings that lead to the unchosen target instance
The result is:
A set of correct homomorphically equivalent target instances
It also allows the design of Skolem functions, not addressed in Routes
161 EDBT 2011, A. Bonifati & Y. Velegrakis
MUSE Workflow
Input: mapping specification + real source instance (if available)
Generation: real/synthetic data examples
Examination: the mapping designer inspects the data examples (essentially yes/no answers)
Refinement: grouping semantics, disambiguation
162 EDBT 2011, A. Bonifati & Y. Velegrakis
Example
CompDB: Rcd
Companies: Set of
Company: Rcd
cbranch
cname
location
Projects: Set of
Project: Rcd
pid
pname
cbranch
manager
Employees: Set of
Employee: Rcd
eid
ename
contact
OrgDB: Rcd
Orgs: Set of
Org: Rcd
oname
Projects: Set of
Project: Rcd
pname
manager
Employees: Set of
Employee: Rcd
eid
ename
Declarative Mapping
for
c in CompDB.Companies
p in CompDB.Projects
e in CompDB.Employees
satisfy
p.cbranch = c.cbranch
e.eid = p.manager
exists
o in OrgDB.Orgs
p1 in o.Projects
e1 in OrgDB.Employees
satisfy
p1.manager = e1.eid
where
c.cname = o.oname
e.eid = e1.eid
e.ename = e1.ename
p.pname = p1.pname
163 EDBT 2011, A. Bonifati & Y. Velegrakis
Example
Grouping Projects:

Example source:
Companies: (Redmond, Microsoft, USA), (S. Valley, Microsoft, USA)
Projects: (P1, DB, Redmond, e4), (P2, Web, S. Valley, e5)

Group by cbranch → Orgs: Microsoft [Projects: DB e4]; Microsoft [Projects: Web e5]
Group by cname → Orgs: Microsoft [Projects: DB e4, Web e5]
CompDB: Rcd
Companies: Set of
Company: Rcd
cbranch
cname
location
Projects: Set of
Project: Rcd
pid
pname
cbranch
manager
Employees: Set of
Employee: Rcd
eid
ename
contact
OrgDB: Rcd
Orgs: Set of
Org: Rcd
oname
Projects: Set of
Project: Rcd
pname
manager
Employees: Set of
Employee: Rcd
eid
ename
164 EDBT 2011, A. Bonifati & Y. Velegrakis
Example
Declarative Mapping
for
c in CompDB.Companies
p in CompDB.Projects
e in CompDB.Employees
satisfy
p.cbranch = c.cbranch
e.eid = p.manager
exists
o in OrgDB.Orgs
p1 in o.Projects
e1 in OrgDB.Employees
satisfy
p1.manager = e1.eid
where
c.cname = o.oname
e.eid = e1.eid
e.ename = e1.ename
p.pname = p1.pname
o.Projects = SKProjects(c.cbranch, c.cname, c.location)
Grouping
Function
But group by what subset of {cbranch, cname, location} ?
CompDB: Rcd
Companies: Set of
Company: Rcd
cbranch
cname
location
Projects: Set of
Project: Rcd
pid
pname
cbranch
manager
Employees: Set of
Employee: Rcd
eid
ename
contact
OrgDB: Rcd
Orgs: Set of
Org: Rcd
oname
Projects: Set of
Project: Rcd
pname
manager
Employees: Set of
Employee: Rcd
eid
ename
165 EDBT 2011, A. Bonifati & Y. Velegrakis
Muse-G: Grouping Semantics Design
Goal: infer a grouping function that has the same effect as the one intended by the designer
Muse-G probes each possible grouping attribute: start with cbranch
Example source:
Companies: (Redmond, Microsoft, USA), (S. Valley, Microsoft, USA)
Projects: (P1, DB, Redmond, e4), (P2, Web, S. Valley, e5)
Employees: (e4, John, x234), (e5, Anna, x888)

Target Scenario 1 — group by cbranch:
Orgs: Microsoft [Projects: DB e4]; Microsoft [Projects: Web e5]
Employees: e4 John, e5 Anna
Skolem terms: SK(Redmond, y), SK(S. Valley, y), with y a subset of {Microsoft, USA}

Target Scenario 2 — do not group by cbranch:
Orgs: Microsoft [Projects: DB e4, Web e5]
Employees: e4 John, e5 Anna
Skolem term: SK(y)
166 EDBT 2011, A. Bonifati & Y. Velegrakis
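The effect of choosing different grouping attributes can be sketched as follows in Python; the tuple layout and the attribute indices are illustrative assumptions mirroring the slide's example:

```python
from collections import defaultdict

# Projects as flat tuples: (pname, manager, cbranch, cname).
projects = [
    ("DB",  "e4", "Redmond",   "Microsoft"),
    ("Web", "e5", "S. Valley", "Microsoft"),
]

def group_projects(projects, key):
    """Nest (pname, manager) lists under the grouping attribute.
    key is the index of the grouping attribute: 2 = cbranch, 3 = cname."""
    groups = defaultdict(list)
    for tup in projects:
        pname, manager = tup[0], tup[1]
        groups[tup[key]].append((pname, manager))
    return dict(groups)

# Group by cbranch: two groups (Redmond, S. Valley), one project each.
print(group_projects(projects, key=2))
# Group by cname: a single Microsoft group holding both projects.
print(group_projects(projects, key=3))
```

The two outputs correspond to Muse-G's two target scenarios: the designer's yes/no answer decides whether cbranch stays in the Skolem (grouping) function.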
Muse-G: Second Question
The next probed attribute is cname
Example source:
Companies: (S. Valley, Microsoft, USA), (Mt. View, Google, USA)
Projects: (P1, DB, S. Valley, e4), (P4, Web, Mt. View, e6)
Employees: (e4, John, x234), (e6, Kat, x331)

Target Scenario 1 — group by cname:
Orgs: Microsoft [Projects: DB e4]; Google [Projects: Web e6]
Employees: e4 John, e6 Kat
Skolem terms: SK(Microsoft, y), SK(Google, y), with y a subset of {USA}

Target Scenario 2 — do not group by cname:
Orgs: Microsoft [Projects: DB e4, Web e6]; [Projects: DB e4, Web e6]
Employees: e4 John, e6 Kat
Skolem terms: SK(y), SK(y)

The wizard continues to probe the remaining possible grouping attributes
167 EDBT 2011, A. Bonifati & Y. Velegrakis
Quality of the Generated Target Schema
When mappings are used in schema integration:
The quality of the generated integrated schema is measured wrt. the intended integrated schema
Three metrics have been conceived:
Completeness [Batista07]: the fraction of the intended schema's elements that also appear in the generated schema
Minimality [Batista07]: penalizes the α elements of the generated schema that do not appear in the intended one
Structurality: rewards shared elements that preserve their ancestors from the intended schema
A weighted average of the three yields the overall schema proximity
168 EDBT 2011, A. Bonifati & Y. Velegrakis
Quality of the Generated Target Schema
169 EDBT 2011, A. Bonifati & Y. Velegrakis
An example
Nr. of common elements = 6; α = 2
Completeness = 6/7; Minimality = 5/7
Structurality = (1 + 1 + 1 + 0 + 1/4 + 1/2)/6 = 0.625
Proximity = (6/7 + 5/7 + 0.625)/3 ≈ 0.73
[Figure: two schema trees rooted at A — the generated schema contains B, C, D, E, F plus the extra elements X and Z; the intended schema contains B, C, D, E, F and G]
(generated) (intended)
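The slide's arithmetic can be replayed directly. The formulas below are reconstructed from the example's numbers (minimality taken as 1 − α/|intended|, proximity as the equal-weight average of the three metrics), not quoted verbatim from [Batista07]:

```python
# Sketch: recomputing the quality metrics of the example.
from fractions import Fraction

common = 6      # elements shared by both schemas (A..F)
intended = 7    # elements of the intended schema (A..F plus G)
alpha = 2       # extra generated elements (X and Z)

completeness = Fraction(common, intended)    # 6/7
minimality = 1 - Fraction(alpha, intended)   # 5/7
# per-element ancestor-preservation scores read off the two trees:
structurality = (1 + 1 + 1 + 0 + Fraction(1, 4) + Fraction(1, 2)) / 6
proximity = (completeness + minimality + structurality) / 3

print(float(structurality))         # -> 0.625
print(round(float(proximity), 2))   # -> 0.73
```

The equal-weight average reproduces the slide's Proximity of 0.73, which is what suggests this reading of the garbled formula.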
170 EDBT 2011, A. Bonifati & Y. Velegrakis
Talk Outline
Introduction
Matching and mapping: techniques & tools
Benchmarks and evaluation principles
Challenges in evaluating matching & mapping
Using real-world scenarios for evaluation
Generating synthetic evaluation scenarios
Measuring efficiency and effectiveness
Conclusions and References
171 EDBT 2011, A. Bonifati & Y. Velegrakis
What We Talked About
The importance of Matching/Mappings
Explained the matching/mapping tasks
Presented existing tools & their functionality
What is a benchmark
Generic evaluation principles
Why a benchmark for mapping systems is a challenge
Finding real-world evaluation scenarios
Generating synthetic evaluation scenarios
Metrics for effectiveness, efficiency and quality
172 EDBT 2011, A. Bonifati & Y. Velegrakis
Main Sources
[Euzenat et al. 2007] J. Euzenat and P. Shvaiko, "Ontology matching", Springer-Verlag, 2007
[Bellahsene et al. 2011] Z. Bellahsene, A. Bonifati and E. Rahm, "Schema Matching and Mapping", Springer-Verlag, 2011
173 EDBT 2011, A. Bonifati & Y. Velegrakis
Thank you !!
Are you taking questions?
Of course! Go ahead
174 EDBT 2011, A. Bonifati & Y. Velegrakis
Partial List of References
[Alexe et al. 2008] Alexe B, Chiticariu L, Miller RJ, Tan WC (2008) “Muse: Mapping Understanding and deSign by Example”. In: ICDE, pp 10-19
[Bonifati et al. 2008] Bonifati A, Mecca G, Pappalardo A, Raunich S, Summa G (2008) “Schema Mapping Verification: The Spicy Way”. In: EDBT, pp 85-96
[Lenzerini 2002] Lenzerini M (2002) “Data Integration: A Theoretical Perspective”. In: PODS, pp 233-246
[Miller et al. 2000] Miller RJ, Haas LM, Hernandez MA (2000) “Schema Mapping as Query Discovery”. In: VLDB, pp 77-88
[Rahm et al. 2001] Rahm E, Bernstein PA (2001) “A survey of approaches to automatic schema matching”. VLDB Journal 10(4):334-350
[Velegrakis 2005] Velegrakis Y (2005) “Managing Schema Mappings in Highly Heterogeneous Environments”. PhD thesis, University of Toronto
175 EDBT 2011, A. Bonifati & Y. Velegrakis
Partial List of References (cont’d)
[Altova 2008] Altova (2008) MapForce. http://www.altova.com
[Batini et al. 1986] Batini C, Lenzerini M, Navathe SB (1986) A Comparative Analysis of Methodologies for Database Schema Integration. ACM Comp. Surv. 18(4):323-364
[Bernstein et al. 2007] Bernstein PA, Melnik S (2007) Model management 2.0: manipulating richer mappings. In: SIGMOD, pp 1–12
[Do et al. 2002] Do HH, Rahm E (2002) COMA - A System for Flexible Combination of Schema Matching Approaches. In: VLDB, pp 610–621
[Euzenat et al. 2007] Euzenat J, Shvaiko P (2007) Ontology matching. Springer Verlag, Heidelberg
[Fagin et al. 2005] Fagin R, Kolaitis PG, Miller RJ, Popa L (2005) Data exchange: semantics and query answering. Theoretical Computer Science 336(1):89-124
[Lerner 2000] Lerner BS (2000) A Model for Compound Type Changes Encountered in Schema Evolution. ACM TODS 25(1):83–127
[Popa et al. 2002] Popa L, Velegrakis Y, Miller RJ, Hernandez MA, Fagin R (2002) Translating Web Data. In: VLDB, pp 598–609
176 EDBT 2011, A. Bonifati & Y. Velegrakis
Partial List of References (cont’d)
[Alexe et al. 2010] Alexe B, Kolaitis PG, Tan WC (2010) Characterizing Schema Mappings via Data Examples. In: PODS
[Aumueller et al. 2005] Aumueller D, Do HH, Massmann S, Rahm E (2005) Schema and ontology matching with COMA++. In: SIGMOD, pp 906–908
[Bonifati et al. 2010] Bonifati A, Chang EQ, Ho T, Lakshmanan LVS, Pottinger R, Chung Y (2010) Schema mapping and query translation in heterogeneous P2P XML databases. VLDB J. 19(2):231-256
[Chiticariu et al. 2006] Chiticariu L, Tan WC (2006) Debugging Schema Mappings with Routes. In: VLDB, pp 79–90
[Dhamankar et al. 2004] Dhamankar R, Lee Y, Doan A, Halevy AY, Domingos P (2004) iMAP: Discovering Complex Mappings between Database Schemas. In: SIGMOD, pp 383-394
[Doan et al. 2001] Doan A, Domingos P, Halevy AY (2001) Reconciling schemas of disparate data sources: A machine-learning approach. In: SIGMOD, pp 509–520
177 EDBT 2011, A. Bonifati & Y. Velegrakis
Partial List of References (cont’d)
[Fletcher et al. 2006] Fletcher GHL, Wyss CM (2006) Data Mapping as Search. In: EDBT, pp 95–111
[Giunchiglia et al. 2004] Giunchiglia F, Shvaiko P, Yatskevich M (2004) S-Match: an Algorithm and an Implementation of Semantic Matching. In: ESWS, pp 61–75
[Madhavan et al. 2001] Madhavan J, Bernstein PA, Rahm E (2001) Generic Schema Matching with Cupid. In: VLDB, pp 49–58
[Naumann et al. 2002] Naumann F, Ho CT, Tian X, Haas LM, Megiddo N (2002) Attribute Classification Using Feature Analysis. In: ICDE, p 271
[Shu et al. 1977] Shu NC, Housel BC, Taylor RW, Ghosh SP, Lum VY (1977) EXPRESS: A Data EXtraction, Processing and REstructuring System. ACM TODS 2(2):134–174