1Berendt: Advanced databases, 2009, http://www.cs.kuleuven.be/~berendt/teaching
Advanced databases –
Defining and combining heterogeneous databases:
Core ideas of federated databases; Schema matching
Bettina Berendt
Katholieke Universiteit Leuven, Department of Computer Science
http://www.cs.kuleuven.ac.be/~berendt/teaching/2009-10-1stsemester/adb/
Last update: 12 October 2009
Until now ...
... we have looked into modelling
... we have seen how the languages RDF and OWL allow us to combine different schemas and data
... we have seen how Linked Data on the Web uses HTTP as a connecting protocol/architecture
... we have assumed that such combinations can be done effortlessly (unique names etc.)
Now we need to ask:
What are the challenges of such combinations?
What approaches have been proposed to solve them?
– from the databases & the Semantic Web / ontologies fields
– from architectural and logical points of view
Motivation 1: Price comparison engines search & combine heterogeneous travel-agency DBs, which in turn search & combine heterogeneous airline DBs
Motivation 2: Schemas coming from different languages
A river is a natural stream of water, usually freshwater, flowing toward an ocean, a lake, or another stream. In some cases a river flows into the ground or dries up completely before reaching another body of water. Usually larger streams are called rivers while smaller streams are called creeks, brooks, rivulets, rills, and many other terms, but there is no general rule that defines what can be called a river. Sometimes a river is said to be larger than a creek, but this is not always the case.
Une rivière est un cours d'eau qui s'écoule sous l'effet de la gravité et qui se jette dans une autre rivière ou dans un fleuve, contrairement au fleuve qui se jette, lui, dans la mer ou dans l'océan. [English: A rivière is a watercourse that flows under the effect of gravity and empties into another rivière or into a fleuve, unlike a fleuve, which empties into the sea or the ocean.]
Een rivier is een min of meer natuurlijke waterstroom. We onderscheiden oceanische rivieren (in België ook wel stroom genoemd) die in een zee of oceaan uitmonden, en continentale rivieren die in een meer, een moeras of woestijn uitmonden. Een beek is de aanduiding voor een kleine rivier. Tussen beek en rivier ligt meestal een bijrivier. [English: A rivier is a more or less natural stream of water. We distinguish oceanic rivers (in Belgium also called stroom), which flow into a sea or ocean, and continental rivers, which flow into a lake, a marsh, or a desert. A beek is the term for a small river. Between a beek and a rivier there usually lies a bijrivier (tributary).]
Motivation 3: „Who was that?“ – Re-identification
Agenda
The match problem & what info to use for matching
(Semi-)automated matching: Example CUPID
(Semi-)automated matching: Example iMAP
Evaluating matching
Involving the user: Explanations; mass collaboration
Core ideas of federated databases
Overview
goal: interoperability through data integration:
combining heterogeneous data sources under a single query interface
A federated database system is a type of meta-database management system (DBMS) which transparently integrates multiple autonomous database systems into a single federated database.
The constituent databases are interconnected via a computer network, and may be geographically decentralized.
Since the constituent database systems remain autonomous, a federated database system is a contrastable alternative to the (sometimes daunting) task of merging together several disparate databases.
A federated database (or virtual database) is the fully-integrated, logical composite of all constituent databases in a federated database system.
Issues in federating data sources
Interconnection and cooperation of autonomous and heterogeneous databases must address
Distribution
Autonomy
Heterogeneity
Architectures: Dealing differently with autonomy
Tightly coupled: global schema integration, e.g. data warehousing
More loosely coupled: federated databases with schema matching/mapping:
Global as View (GaV): the global schema is defined in terms of the underlying schemas
Local as View (LaV): the local schemas are defined in terms of the global schema
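The GaV direction above can be illustrated with a minimal sketch (all schema and attribute names below are the running example's; the concrete data values are made up): the global relation is computed as a query over the local sources, so answering a global query just means unfolding this definition.

```python
# Global-as-View sketch: the global relation LISTINGS(area, list_price)
# is *defined* as a query over the local sources HOUSES and AGENTS.
# Toy rows; attribute names follow the running example.

def gav_listings(houses, agents):
    """Materialize the global relation from the autonomous sources."""
    agents_by_id = {a["id"]: a for a in agents}
    return [
        {"area": h["location"],
         "list_price": h["price"] * (1 + agents_by_id[h["agent_id"]]["fee_rate"])}
        for h in houses
        if h["agent_id"] in agents_by_id  # join condition agent-id = id
    ]

houses = [{"location": "Leuven", "price": 200000, "agent_id": 1}]
agents = [{"id": 1, "fee_rate": 0.03}]
print(gav_listings(houses, agents))

# In LaV the direction is reversed: each *source* would be described as a
# view over the global schema, and a global query must then be rewritten
# using those view definitions ("answering queries using views").
```
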
Issues in query processing
In both GAV and LAV systems, a user poses conjunctive queries over a virtual (mediated) schema; the sources are described by a set of views, i.e., conjunctive queries over the schemas.
Integration seeks to rewrite the user's query in terms of these views so that its result is equivalent to, or maximally contained in, the answer to the original query.
This corresponds to the problem of answering queries using views.
An example developed in-house: SQI - PLQL
Purpose: For federated search in learning-object repositories
An approach with conceptual-level abstraction from data sources
Integratable data source types:
Relational, XML, IR systems, (search engine) Web services, search APIs
Full abstraction of user from data sources:
Yes
User-specific data source selection for integration:
Depends on application
User-specific data modeling for integration:
No
Explicit, queryable semantics:
(delegated to the sources: LOM etc.)
Heterogeneity
Heterogeneity is independent of location of data
When is an information system homogeneous?
Software that creates and manipulates data is the same
All data follows same structure and data model and is part of a single universe of discourse
Different levels of heterogeneity:
Different languages to write applications
Different query languages
Different models
Different DBMSs
Different file systems
Semantic heterogeneity
etc.
Agenda
The match problem & what info to use for matching
(Semi-)automated matching: Example CUPID
(Semi-)automated matching: Example iMAP
Evaluating matching
Involving the user: Explanations; mass collaboration
Core ideas of federated databases
The match problem (Running example 1)
Given two schemas S1 and S2, find a mapping between elements of S1 and S2 that correspond semantically to each other
Running example 2
Motivation: application areas
Schema integration in multi-database systems
Data integration systems on the Web
Translating data (e.g., for data warehousing)
E-commerce message translation
P2P data management
Model management (tools for easily manipulating models of data)
Based on what information can the matchings/mappings be found?
(work on the two running examples)
The match operator
Match operator: f(S1,S2) = mapping between S1 and S2
for schemas S1, S2
Mapping
a set of mapping elements
Mapping elements
elements of S1, elements of S2, mapping expression
Mapping expression
different functions and relationships
Matching expressions: examples
Scalar relations (=, ≥, ...): S.HOUSES.location = T.LISTINGS.area
Functions: T.LISTINGS.list-price = S.HOUSES.price * (1 + S.AGENTS.fee-rate); T.LISTINGS.agent-address = concat(S.AGENTS.city, S.AGENTS.state)
ER-style relationships (is-a, part-of, ...)
Set-oriented relationships (overlaps, contains, ...)
Any other terms that are defined in the expression language used
Matching and mapping
1. Find the schema match („declarative“)
2. Create a procedure (e.g., a query expression) to enable automated data translation or exchange (mapping, „procedural“)
Example of result of step 2: To create T.LISTINGS from S (simplified notation):
area = SELECT location FROM HOUSES
agent-name = SELECT name FROM AGENTS
agent-address = SELECT concat(city,state) FROM AGENTS
list-price = SELECT price * (1+fee-rate)
FROM HOUSES, AGENTS
WHERE agent-id = id
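The translation queries above can be sketched as one executable procedure; this is a toy illustration of step 2 (the procedural mapping), with made-up rows and Pythonized attribute names, not the actual system's code:

```python
# Sketch of the data translation: build T.LISTINGS from S.HOUSES and
# S.AGENTS according to the match expressions on the slide. Toy data.

def translate_to_listings(houses, agents):
    agents_by_id = {a["id"]: a for a in agents}
    listings = []
    for h in houses:
        a = agents_by_id[h["agent_id"]]                # WHERE agent-id = id
        listings.append({
            "area": h["location"],                     # area = location
            "agent_name": a["name"],                   # agent-name = name
            "agent_address": f'{a["city"]},{a["state"]}',  # concat(city,state)
            "list_price": h["price"] * (1 + a["fee_rate"]),
        })
    return listings

houses = [{"location": "Atlanta", "price": 250000, "agent_id": 7}]
agents = [{"id": 7, "name": "M. Jones", "city": "Decatur", "state": "GA",
           "fee_rate": 0.04}]
print(translate_to_listings(houses, agents))
```
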
Based on what information can the matchings/mappings be found?
Rahm & Bernstein‘s classification of schema matching approaches
Challenges
Semantics of the involved elements often need to be inferred
Often need to base (heuristic) solutions on cues in schema and data, which are unreliable
e.g., homonyms (area), synonyms (area, location)
Schema and data clues are often incomplete e.g., date: date of what?
Global nature of matching: to choose one matching possibility, must typically exclude all others as worse
Matching is often subjective and/or context-dependent e.g., does house-style match house-description or not?
Extremely laborious and error-prone process, e.g., Li & Clifton 2000: project at GTE telecommunications:
40 databases, 27K elements, no access to the original developers of the DBs; estimated time for just finding and documenting the matches: 12 person-years
Semi-automated schema matching (1)
Rule-based solutions
Hand-crafted rules
Exploit schema information
+ relatively inexpensive
+ do not require training
+ fast (operate only on schema, not data)
+ can work very well in certain types of applications & domains
+ rules can provide a quick & concise method of capturing user knowledge about the domain
– cannot exploit data instances effectively
– cannot exploit previous matching efforts
(other than by re-use)
Semi-automated schema matching (2)
Learning-based solutions
Rules/mappings learned from attribute specifications and statistics of data content (Rahm&Bernstein: „instance-level matching“)
Exploit schema information and data
Some approaches use external evidence:
Past matches
Corpus of schemas and matches („matchings in real-estate applications will tend to be alike“)
Corpus of users (more details later in this slide set)
+ can exploit data instances effectively
+ can exploit previous matching efforts
– relatively expensive
– require training
– slower (operate on data)
– results may be opaque (e.g., neural network output) → explanation components! (more details later)
Agenda
The match problem & what info to use for matching
(Semi-)automated matching: Example CUPID
(Semi-)automated matching: Example iMAP
Evaluating matching
Involving the user: Explanations; mass collaboration
Core ideas of federated databases
Overview (1)
Rule-based approach
Schema types:
Relational, XML
Metadata representation:
Extended ER
Match granularity:
Element, structure
Match cardinality:
1:1, n:1
Overview (2)
Schema-level match:
Name-based: name equality, synonyms, hypernyms, homonyms, abbreviations
Constraint-based: data type and domain compatibility, referential constraints
Structure matching: matching subtrees, weighted by leaves
Re-use, auxiliary information used:
Thesauri, glossaries
Combination of matchers:
Hybrid
Manual work / user input:
User can adjust threshold weights
Basic representation: Schema trees
Computation overview:
1. Compute similarity coefficients between elements of these graphs
2. Deduce a mapping from these coefficients
Computing similarity coefficients (1): Linguistic matching
Operates on schema element names (= nodes in schema tree)
1. Normalization:
Tokenization (parse names into tokens based on punctuation, case, etc.), e.g., Product_ID → {Product, ID}
Expansion (of abbreviations and acronyms)
Elimination (of prepositions, articles, etc.)
2. Categorization / clustering: based on data types, schema hierarchy, linguistic content of names, e.g., „real-valued elements“, „money-related elements“
3. Comparison (within the categories): compute linguistic similarity coefficients (lsim) based on a thesaurus (synonymy, hypernymy)
Output: Table of lsim coefficients (in [0,1]) between schema elements
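The three steps above can be sketched as follows. This is a toy illustration, not Cupid's actual algorithm: the hand-made SYNONYMS table stands in for the thesaurus, and the lsim formula (best token match averaged over the larger token set) is a simple stand-in for Cupid's linguistic similarity:

```python
import re

# Toy linguistic matching: tokenize element names, then compute an lsim
# coefficient in [0,1] from token overlap plus a tiny synonym table
# (a real system would consult a thesaurus such as WordNet).

SYNONYMS = {("area", "location"), ("price", "cost")}  # illustrative only

def tokenize(name):
    """Split on punctuation/underscores and camelCase, lowercase tokens."""
    tokens = []
    for part in re.split(r"[_\W]+", name):
        tokens += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part)
    return {t.lower() for t in tokens if t}

def tok_sim(t1, t2):
    if t1 == t2 or (t1, t2) in SYNONYMS or (t2, t1) in SYNONYMS:
        return 1.0
    return 0.0

def lsim(name1, name2):
    ts1, ts2 = tokenize(name1), tokenize(name2)
    if not ts1 or not ts2:
        return 0.0
    best = [max(tok_sim(a, b) for b in ts2) for a in ts1]
    return sum(best) / max(len(ts1), len(ts2))

print(lsim("Product_ID", "ProductId"))  # same tokens {product, id}
print(lsim("area", "location"))         # matched via the synonym table
```
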
How to identify synonyms and homonyms: Example WordNet
How to identify hypernyms: Example WordNet
Computing similarity coefficients (2): Structure matching
Intuitions:
Leaves are similar if they are linguistic & data-type similar, and if they have similar neighbourhoods
Non-leaf elements are similar if linguistically similar & have similar subtrees (where leaf sets are most important)
Procedure:
1. Initialize structural similarity of leaves based on data types
Identical data types: compat. = 0.5; otherwise in [0,0.5]
2. Process the tree in post-order
3. Stronglink(leaf1, leaf2) iff their weighted sim. ≥ threshold
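The per-pair decision in step 3 can be sketched as a weighted combination of linguistic and structural similarity. The weight and threshold values below are illustrative placeholders, not Cupid's tuned defaults:

```python
# Sketch of Cupid-style weighted similarity: a pair of leaves is
# strongly linked when the weighted combination of structural (ssim)
# and linguistic (lsim) similarity reaches a threshold.
# W_STRUCT and TH_ACCEPT are made-up illustrative constants.

W_STRUCT = 0.5
TH_ACCEPT = 0.6

def wsim(lsim, ssim, w_struct=W_STRUCT):
    return w_struct * ssim + (1 - w_struct) * lsim

def strong_link(lsim, ssim, threshold=TH_ACCEPT):
    return wsim(lsim, ssim) >= threshold

print(strong_link(lsim=0.9, ssim=0.5))  # 0.70 >= 0.6 -> True
print(strong_link(lsim=0.2, ssim=0.5))  # 0.35 >= 0.6 -> False
```
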
The structure matching algorithm
Output: a 1:n mapping for leaves
To generate non-leaf mappings: 2nd post-order traversal
Matching shared types
Solution: expand the schema into a schema tree, then proceed as before
Can help to generate context-dependent mappings
Fails if a cycle of containment and IsDerivedFrom relationships is present (e.g., recursive type definitions)
Agenda
The match problem & what info to use for matching
(Semi-)automated matching: Example CUPID
(Semi-)automated matching: Example iMAP
Evaluating matching
Involving the user: Explanations; mass collaboration
Core ideas of federated databases
Main ideas
A learning-based approach
Main goal: discover complex matches
In particular: functions such as
T.LISTINGS.list-price = S.HOUSES.price * (1+S.AGENTS.fee-rate)
T.LISTINGS.agent-address = concat(S.AGENTS.city,S.AGENTS.state)
Works on relational schemas
Basic idea: reformulate schema matching as search
Architecture
Specialized searchers each focus on discovering certain types of complex matches → this makes the search more efficient
Overview of implemented searchers
Example: The textual searcher
For target attribute T.LISTINGS.agent-address:
Examine attributes and concatenations of attributes from S
Restrict the examined set by analyzing textual properties:
data type information in the schema, heuristics (proportion of non-numeric characters etc.)
Evaluate match candidates based on data correspondences, prune inferior candidates
Example: The numerical searcher
For target attribute T.LISTINGS.list-price:
Examine attributes and arithmetic expressions over them from S
Restrict examined set by analyzing numeric properties
Data type information in schema, heuristics
Evaluate match candidates based on data correspondences, prune inferior candidates
Search strategy (1): Example textual searcher
1. Learn a (Naive Bayes) classifier
text → class („agent-address“ or „other“)
from the data instances in T.LISTINGS.agent-address
2. Apply this classifier to each match candidate (e.g., location, concat(city,state))
3. Score of the candidate = average over instance probabilities
4. For expansion: beam search – keep only the k top-scoring candidates
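Steps 1–3 can be sketched with a toy Naive Bayes scorer. All column names and data values are invented for illustration; iMAP's actual classifier and features are richer:

```python
import math
from collections import Counter

# Toy textual-searcher scoring: train word-level Naive Bayes on the
# target attribute's instances vs. "other" text, then score each
# candidate column by the average target-class probability.

def train_nb(pos_texts, neg_texts):
    model = {}
    for label, texts in (("target", pos_texts), ("other", neg_texts)):
        words = Counter(w for t in texts for w in t.lower().split())
        total = sum(words.values())
        vocab = len(set(words)) + 1   # +1 for unseen words (Laplace)
        model[label] = (words, total, vocab)
    return model

def log_p(model, label, text):
    words, total, vocab = model[label]
    return sum(math.log((words[w] + 1) / (total + vocab))
               for w in text.lower().split())

def score_candidate(model, values):
    """Average P(target | value) over a candidate column's instances."""
    probs = []
    for v in values:
        lt, lo = log_p(model, "target", v), log_p(model, "other", v)
        m = max(lt, lo)
        probs.append(math.exp(lt - m) / (math.exp(lt - m) + math.exp(lo - m)))
    return sum(probs) / len(probs)

target = ["12 Elm St Decatur GA", "4 Oak Ave Athens GA"]  # agent-address
other = ["3", "2", "4", "250000", "310000"]               # numeric columns
model = train_nb(target, other)

candidates = {
    "concat(city,state)": ["Decatur GA", "Athens GA"],
    "beds":               ["3", "2"],
}
scores = {name: score_candidate(model, vals) for name, vals in candidates.items()}
print(scores)  # the address-like candidate should score higher
```

In the full searcher these scores drive the beam search of step 4: only the k top-scoring candidates are expanded further.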
Search strategy (2): Example numeric searcher
1. Get value distributions of target attribute and each candidate
2. Compare the value distributions (Kullback-Leibler divergence measure)
3. Score of the candidate = Kullback-Leibler measure
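The three steps can be sketched as follows; the binning range, bin count, and toy values are made up, and the smoothing is a simple Laplace correction so that empty bins never produce log(0):

```python
import math

# Toy numeric-searcher comparison: bin the values of the target
# attribute and of each candidate expression into histograms, then
# compare them with the Kullback-Leibler divergence.

def histogram(values, bins, lo, hi):
    counts = [0] * bins
    for v in values:
        i = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[i] += 1
    total = sum(counts)
    # Laplace smoothing keeps every bin probability positive
    return [(c + 1) / (total + bins) for c in counts]

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

target = [205000, 310000, 150000, 480000]   # T.LISTINGS.list-price
cand_a = [199000, 305000, 152000, 470000]   # e.g. price * (1 + fee-rate)
cand_b = [2, 3, 3, 4]                       # e.g. S.HOUSES.beds

p = histogram(target, bins=10, lo=0, hi=500000)
qa = histogram(cand_a, bins=10, lo=0, hi=500000)
qb = histogram(cand_b, bins=10, lo=0, hi=500000)
print(kl_divergence(p, qa), kl_divergence(p, qb))
# the better-matching candidate has the smaller divergence
```
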
Evaluation strategies of implemented searchers
Pruning by domain constraints
Multiple attributes of S: „attributes name and beds are unrelated“ → do not generate match candidates involving both attributes
Properties of a single attribute of T: „the average value of num-rooms does not exceed 10“ → use in evaluation of candidates
Properties of multiple attributes of T: „lot-area and num-baths are unrelated“ → at match-selector level, „clean up“:
Example
– T.num-baths = S.baths
– T.lot-area = (S.lot-sq-feet / 43560) + 1.3e-15 * S.baths
Based on the domain constraint, drop the term involving S.baths
Pruning by using knowledge from overlap data
When S and T share the same data
Consider fraction of data for which mapping is correct
e.g., house locations:
S.HOUSES.location overlaps more with T.LISTINGS.area than with T.LISTINGS.agent-address
Discard the candidate T.LISTINGS.agent-address = S.HOUSES.location,
keep only T.LISTINGS.agent-address = concat(S.AGENTS.city, S.AGENTS.state)
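The overlap check above can be sketched as a simple agreement count over shared rows (toy values, assuming the rows of S and T have already been aligned to the same houses):

```python
# Toy overlap check: for what fraction of shared rows does a candidate
# mapping reproduce the target attribute's value?

def overlap_fraction(target_values, candidate_values):
    hits = sum(t == c for t, c in zip(target_values, candidate_values))
    return hits / len(target_values)

area          = ["Decatur", "Athens", "Decatur"]          # T.LISTINGS.area
agent_address = ["Decatur,GA", "Athens,GA", "Smyrna,GA"]  # T.LISTINGS.agent-address
location      = ["Decatur", "Athens", "Decatur"]          # S.HOUSES.location

print(overlap_fraction(area, location))           # high -> keep this match
print(overlap_fraction(agent_address, location))  # low  -> discard it
```
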
Agenda
The match problem & what info to use for matching
(Semi-)automated matching: Example CUPID
(Semi-)automated matching: Example iMAP
Evaluating matching
Involving the user: Explanations; mass collaboration
Core ideas of federated databases
How to compare?
Input: What kind of input data? (What languages? Only toy examples? What external information?)
Output: mapping between attributes or tables, nodes or paths? How much information does the system report?
Quality measures: metrics for accuracy and completeness?
Effort: how much savings of manual effort, how quantified?
Pre-match effort (training of learners, dictionary preparation, ...)
Post-match effort (correction and improvement of the match output)
How are these measured?
Match quality measures
Need a „gold standard“ (the „true“ match)
Measures from information retrieval: Precision, Recall, F-measure
(standard choice: F1, i.e., α = 0.5, weighting Precision and Recall equally)
These quantify post-match effort
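The slide's formulas were images; against a gold standard of "true" correspondences they amount to the standard set-based definitions, sketched here with toy element-name pairs:

```python
# Match quality against a gold standard: Precision = |found ∩ gold| / |found|,
# Recall = |found ∩ gold| / |gold|, F1 = harmonic mean of the two.

def match_quality(found, gold):
    tp = len(found & gold)
    precision = tp / len(found) if found else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold  = {("location", "area"), ("price", "list-price"), ("name", "agent-name")}
found = {("location", "area"), ("price", "list-price"), ("name", "comments")}

print(match_quality(found, gold))  # two of three correspondences found
```
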
Benchmarking
Do, Melnik, and Rahm (2003) found that evaluation studies were not comparable
Need more standardized conditions (benchmarks)
Now a tradition of competitions in ontology matching (more in the next session):
Test cases and contests at http://www.ontologymatching.org/evaluation.html
Agenda
The match problem & what info to use for matching
(Semi-)automated matching: Example CUPID
(Semi-)automated matching: Example iMAP
Evaluating matching
Involving the user: Explanations; mass collaboration
Core ideas of federated databases
Example in iMAP
User sees ranked candidates:
1. List-price = price
2. List-price = price * (1 + fee-rate)
Explanation:
a) Both generated from numeric searcher, 2 ranked higher than 1
b) But:
c) Match month-posted = fee-rate
d) domain constraint: matches for month-posted and price do not share attributes
e) cannot match list-price to anything to do with fee-rate
f) Why c)?
g) Data instances of fee-rate were classified as of type date
The user corrects the wrong step g), and the rest is repaired accordingly
Background knowledge structure for explanation: dependency graph
MOBS: Using mass collaboration to automate data integration
1. Initialization: a correct but partial match (e.g. title = a1, title = b2, etc.)
2. Soliciting user feedback: a user poses a query → the user must answer a simple question → the user gets the answer to the initial query
3. Computing user weights (e.g., trustworthiness = fraction of correct answers to known mappings)
4. Combining user feedback (e.g., majority count)
Important: „instant gratification“ (e.g., include the new field in the results page after a user has given helpful input)
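Steps 3 and 4 can be sketched as weighted voting (user names, answers, and the simple weighting scheme below are illustrative; MOBS's actual combination scheme is more elaborate):

```python
# Toy MOBS-style feedback combination: a user's weight is the fraction
# of correct answers on questions with known answers; a candidate match
# is accepted by weighted majority vote.

def user_weight(answers_on_known):
    """answers_on_known: list of (given, correct) answer pairs."""
    hits = sum(g == c for g, c in answers_on_known)
    return hits / len(answers_on_known)

def combine_votes(votes, weights):
    """votes: {user: True/False on 'is this candidate match correct?'}"""
    yes = sum(weights[u] for u, v in votes.items() if v)
    no = sum(weights[u] for u, v in votes.items() if not v)
    return yes > no

weights = {
    "ann": user_weight([("a", "a"), ("b", "b"), ("c", "c")]),  # always right
    "bob": user_weight([("a", "a"), ("b", "x")]),              # half right
    "eve": user_weight([("x", "a"), ("y", "b")]),              # always wrong
}
votes = {"ann": True, "bob": False, "eve": False}
print(combine_votes(votes, weights))  # ann's reliable 'yes' outweighs the rest
```
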
Next lecture
The match problem & what info to use for matching
(Semi-)automated matching: Example CUPID
(Semi-)automated matching: Example iMAP
Evaluating matching
Involving the user: Explanations; mass collaboration
Ontology matching / alignment
Core ideas of federated databases
References / background reading; acknowledgements
http://en.wikipedia.org/wiki/Federated_database_system
http://en.wikipedia.org/wiki/Data_integration
Rahm, E. & Bernstein, P.A. (2001). A survey of approaches to automatic schema matching. The VLDB Journal, 10(4), 334-350.
http://research.microsoft.com/~philbe/VLDBJ-Dec2001.pdf
Doan, A. & Halevy, A.Y. (2004). Semantic Integration Research in the Database Community: A brief survey. AI Magazine.
http://dit.unitn.it/~p2p/RelatedWork/Matching/si-survey-db-community.pdf
Madhavan, J., Bernstein, P.A., & Rahm, E. (2001). Generic schema matching with Cupid. In Proc. of the 27th VLDB Conference.
http://dbs.uni-leipzig.de/de/publication/title/generic_schema_matching_with_cupid
Dhamankar, R., Lee, Y., Doan, A., Halevy, A., & Domingos, P. (2004). iMAP: Discovering complex semantic matches between database schemas. In Proc. of SIGMOD 2004.
http://citeseer.ist.psu.edu/680053.html
Do, H.-H., Melnik, S., & Rahm, E. (2003). Comparison of schema matching evaluations. In Web, Web-Services, and Database Systems: NODe 2002, Web- and Database-Related Workshops, Erfurt, Germany, October 7-10, 2002. Revised Papers (pp. 221-237). Springer.
http://dit.unitn.it/~p2p/RelatedWork/Comparison%20of%20Schema%20Matching%20Evaluations.pdf
McCann, R., Doan, A., Varadarajan, V., & Kramnik, A. (2003). Building data integration systems via mass collaboration. In Proc. International Workshop on the Web and Databases (WebDB).
http://citeseer.ist.psu.edu/675796.html
Please see the Powerpoint slide-specific „notes“ for URLs of used pictures and formulae