Data Integration: A Status Report
Alon Halevy
University of Washington, Seattle
BTW 2003
February 27th, 2003 BTW 2003
Data Integration Report
Recent progress:
• Mediation languages
• Query processing (XML and other)
• Commercial
Current challenges:
• Flexible architectures: peer-data mgmt.
• Getting to the root of semantic heterogeneity: schema mapping.
[Figure: the mybooks.com mediated schema (Books, Inventory, Orders, Shipping, Reviews) above its data sources, connected over links labeled WAN and Internet. Source labels: West, East, Orders, Customer Reviews, FedEx, UPS, alt.books.reviews, NY Times, Morgan-Kaufman, Prentice-Hall.]
Data Integration Systems
• This is one possible architecture (virtual integration)
• Only logical mediated schema is central. Data stays at the sources.
Motivation and Activity
Application areas of data integration:
• Enterprise information integration ($$)
• The government
• Data sources on the web
• Scientific data sharing.
Many research projects:
• Mine: Information Manifold, Tukwila, LSD.
Companies:
• Many startups, big guys getting in.
Outline
Recent progress:
• Mediation languages
• Adaptive Query processing
• XML data management
• Commercial
Current challenges:
• Flexible architectures: peer-data mgmt.
• Getting to the root of semantic heterogeneity: schema mapping.
• Crossing the Structure Chasm.
Mediation Languages
Goal: a language for specifying semantic relationships between the mediated schema and the sources.
[Figure: a query Q over the mediated schema is reformulated into queries Q’ over five sources.]
Global-as-View (GAV)
[Figure: mediated schema (Title, Actor, …) above five sources R1, R2, R3, R4, R5.]
Create View Actor as
  R1
  Union
  Select A, B From S2
  Union …
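The GAV definition above can be tried out with a small sketch. It assumes two hypothetical source tables, R1 and S2, with invented rows and column names; SQLite stands in for the mediator, and the view plays the role of the mediated relation Actor. A query on the mediated relation is answered by unfolding the view over the sources.

```python
import sqlite3

# In-memory database standing in for the mediator. R1 and S2 play the role
# of two sources; their columns and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R1 (title TEXT, actor TEXT);
    CREATE TABLE S2 (A TEXT, B TEXT, C TEXT);
    INSERT INTO R1 VALUES ('Movie X', 'Actor One');
    INSERT INTO S2 VALUES ('Movie Y', 'Actor Two', 'extra');
    -- GAV: the mediated relation is defined as a view over the sources.
    CREATE VIEW Actor AS
        SELECT title, actor FROM R1
        UNION
        SELECT A, B FROM S2;
""")

def actors_in(title):
    """Query the mediated relation; SQLite unfolds the view over the sources."""
    rows = conn.execute(
        "SELECT actor FROM Actor WHERE title = ? ORDER BY actor", (title,)
    )
    return [r[0] for r in rows]

gav_answers = actors_in("Movie Y")
```

Adding a new source under GAV means editing the view definition itself, which is the main maintenance cost of this style.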
Local-as-View (LAV)
[Figure: mediated schema (Title, Actor, …) above five sources R1, R2, R3, R4, R5.]
Create View R1 as
  Select title, name
  From Title Join Actor
  Where Year > 1970
Create View R5 as
  Select *
  From Movie
  Where lang = 'German'
(GLAV)
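Under LAV, each source is described as a view over the mediated schema, and the mediator must work out which sources can contribute to a query. The sketch below is a deliberately simplified illustration: it handles only equality constraints on a single mediated relation Movie, not joins or comparisons such as Year > 1970. The class, the function, and the sources R6 and R7 are hypothetical, and the attribute set given for R5 is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceDescription:
    """LAV description: the source holds tuples of the mediated relation
    Movie that satisfy `constraints`, projected onto `attributes`."""
    name: str
    attributes: frozenset
    constraints: tuple  # equality constraints as (attribute, value) pairs

def relevant_sources(sources, wanted, selection):
    """Return names of sources that can contribute sound answers to
    'select wanted from Movie where attr = value', (attr, value) = selection."""
    attr, value = selection
    usable = []
    for s in sources:
        fixed = dict(s.constraints)
        if fixed.get(attr, value) != value:
            continue  # the source description contradicts the query
        # The selection must be checkable: either guaranteed by the
        # description or testable on an exported attribute.
        checkable = fixed.get(attr) == value or attr in s.attributes
        if checkable and wanted <= s.attributes:
            usable.append(s.name)
    return usable

SOURCES = [
    # R5 follows the slide: all Movie tuples with lang = 'German'.
    SourceDescription("R5", frozenset({"title", "year", "lang"}), (("lang", "German"),)),
    # R6 and R7 are invented for contrast.
    SourceDescription("R6", frozenset({"title", "year"}), (("lang", "French"),)),
    SourceDescription("R7", frozenset({"title", "lang"}), ()),
]

lav_answer = relevant_sources(SOURCES, {"title"}, ("lang", "German"))
```

Adding a source under LAV only requires adding one more description; the mediated schema and the other descriptions stay untouched.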
Adaptive Query Processing
• Problem: no stats, network unstable
• Cannot ‘Plan and then execute’
• Need to adapt plan during execution.
• Idea already in Ingres (1976)
• Proposed before data integration:
  • Cole and Graefe (choose nodes)
  • Kabra and DeWitt (mid-query re-opt).
Convergent Query Processing [Zack Ives, Ph.D 2002, U. Penn]
• Processor starts with initial plan
• Monitors execution, accumulating stats.
• Switches plan when a better one found
• Reuses intermediate results.
• Final, cleanup phase.
• Possible transformation types:
  • Plan partitioning, data partitioning, low-level rescheduling.
• Can be aggressive (e.g., with aggregations).
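The monitor-and-switch loop can be illustrated with a toy example. This is not the convergent query processing algorithm itself, only a sketch of the general idea under simple assumptions: the "plan" is the order in which selection predicates are applied, statistics are observed pass rates, and the plan is revised between batches. The function name and batch size are illustrative.

```python
def adaptive_filter(rows, predicates, batch_size=100):
    """Apply all predicates to rows, re-ordering them between batches so the
    predicate with the lowest observed pass rate runs first."""
    order = list(range(len(predicates)))   # initial plan: the given order
    seen = [0] * len(predicates)           # rows each predicate was evaluated on
    passed = [0] * len(predicates)         # rows each predicate let through
    out = []
    for start in range(0, len(rows), batch_size):
        for row in rows[start:start + batch_size]:
            keep = True
            for i in order:
                seen[i] += 1
                if predicates[i](row):
                    passed[i] += 1
                else:
                    keep = False
                    break  # short-circuit: later predicates are skipped
            if keep:
                out.append(row)
        # Re-plan from the accumulated statistics. Predicates that were never
        # evaluated are treated as if they passed everything.
        order.sort(key=lambda i: passed[i] / seen[i] if seen[i] else 1.0)
    return out, order

rows = list(range(1000))
predicates = [lambda x: x % 2 == 0, lambda x: x < 10]
result, final_order = adaptive_filter(rows, predicates)
```

The answer is the same whatever order is used; only the amount of work changes. Here the second predicate turns out to be far more selective, so it is moved to the front after the first batch.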
XML Query Processing
• XML facilitates integration.
• Mediator query processor may manipulate XML directly.
• Progress on:
  • Publishing to XML, XML views on relations
  • Physical algebras for manipulating XML
  • Optimization of XQuery.
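As a small illustration of publishing relational data to XML, the sketch below turns rows into one XML element per row with one child element per column. The function name, the element naming scheme, and the sample rows are assumptions made for the example; real publishing systems define such views declaratively.

```python
import xml.etree.ElementTree as ET

def publish_as_xml(relation_name, columns, rows):
    """Publish relational rows as an XML document: one element per row,
    one child element per column."""
    root = ET.Element(relation_name + "s")
    for row in rows:
        elem = ET.SubElement(root, relation_name)
        for col, value in zip(columns, row):
            ET.SubElement(elem, col).text = str(value)
    return ET.tostring(root, encoding="unicode")

books_xml = publish_as_xml(
    "book", ["title", "year"], [("Book A", 2001), ("Book B", 2002)]
)
```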
The Commercial World
• Some startups:
  • Nimble, MetaMatrix, Calixa, Enosys, …
• Big guys making announcements:
  • IBM, BEA, MS, (Oracle still being defiant).
• Progress: analysts have buzzword -- EII.
• Challenges:
  • Integration with EAI?
  • Yet another middleware?
  • Horizontal vs. vertical?
Outline
Recent progress:
• Mediation languages
• Adaptive Query processing
• XML data management
• Commercial
Current challenges:
• Flexible architectures: peer-data mgmt.
• Getting to the root of semantic heterogeneity: schema mapping.
Peer Data-Management
• PDMS: a network of peers
• Peers can:
  • Export base data
  • Provide views on base data
  • Serve as logical mediators for other peers
• A peer can be both a server and a client.
• Semantic relationships are specified locally (between small sets of peers).
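Network of Mappings (Piazza)
[Figure: peers UW, Stanford, DBLP, Saarbruecken, Leipzig, CiteSeer and Berlin connected by GAV, LAV and GLAV mappings; a query Q posed at one peer is reformulated into Q’ and Q’’ as it follows the mappings.]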
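The way a query spreads through a network of mappings can be sketched as a graph traversal. The peer names come from the figure, but the edges below are invented for illustration, and real reformulation rewrites the query at each step rather than just visiting peers.

```python
from collections import deque

# Hypothetical directed mappings: an edge A -> B means "a query over A's
# schema can be reformulated over B's schema". The edges are invented.
MAPPINGS = {
    "UW": ["Stanford", "DBLP"],
    "Stanford": ["DBLP"],
    "DBLP": ["CiteSeer"],
    "Berlin": ["Leipzig"],
    "Leipzig": ["Saarbruecken"],
    "Saarbruecken": ["DBLP"],
}

def reachable_peers(start, mappings):
    """Breadth-first search: return each peer that can contribute answers to
    a query posed at `start`, with the number of mappings to follow."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        peer = queue.popleft()
        for nxt in mappings.get(peer, []):
            if nxt not in hops:  # a peer already reached is not revisited
                hops[nxt] = hops[peer] + 1
                queue.append(nxt)
    return hops

from_uw = reachable_peers("UW", MAPPINGS)
```

In this toy network DBLP is reachable from UW both directly and through Stanford; the traversal keeps only the shorter route.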
Advantages of PDMS
• No need for a central mediated schema.
• Can map data opportunistically, as is most convenient.
• Queries are posed using the peer’s schema.
• Answers come from anywhere in the system.
• Semantic Web.
• This is not P2P file sharing.
  • Data has rich semantics
  • Membership is not as dynamic.
Schema Mediation
[Figure: the Piazza peer network (UW, Stanford, DBLP, Saarbruecken, Leipzig, CiteSeer, Berlin) with GAV, LAV and GLAV mappings and reformulated queries Q, Q’, Q’’.]
When can LAV and GAV be combined to form such a network structure? [ICDE-03], [WWW-03 for XML]
Query Optimization
[Figure: the Piazza peer network (UW, Stanford, DBLP, Saarbruecken, Leipzig, CiteSeer, Berlin) with reformulated queries Q, Q’, Q’’.]
Problems:
• redundant paths
• expensive reformulation.
Possible solution:
• Pre-compose some paths
Mapping Composition
• Incredibly subtle! [w/ Madhavan]
• In general, composition can be an infinite set of GLAV formulas.
• Results:
  • Finite in many cases
  • Even when infinite, often has finite, useful encoding.
• Hence, compositions can usually be pre-optimized.
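To make the idea of pre-composing mappings concrete, here is a sketch of the simplest possible case: mappings that only rename attributes. This is an assumption made for illustration and is far easier than the GLAV composition discussed above, where the result may not even be finite. All schema and attribute names are hypothetical.

```python
def compose(m_ab, m_bc):
    """Compose attribute correspondences A->B and B->C into A->C.
    Attributes of A whose image has no counterpart in C are dropped."""
    return {a: m_bc[b] for a, b in m_ab.items() if b in m_bc}

def translate(record, mapping):
    """Rename the fields of a record according to an attribute mapping."""
    return {mapping[k]: v for k, v in record.items() if k in mapping}

# Hypothetical renaming mappings between three peers.
uw_to_stanford = {"title": "paper_title", "author": "writer", "venue": "conf"}
stanford_to_dblp = {"paper_title": "name", "writer": "creator"}

uw_to_dblp = compose(uw_to_stanford, stanford_to_dblp)
record = {"title": "T", "author": "A", "venue": "V"}
direct = translate(record, uw_to_dblp)
two_step = translate(translate(record, uw_to_stanford), stanford_to_dblp)
```

The composed mapping gives the same result as following the two mappings in sequence, but it is computed once and can then be reused for every record.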
Management of Updates [w/ Mork, Gribble]
[Figure: the Piazza peer network (UW, Stanford, DBLP, Saarbruecken, Leipzig, CiteSeer, Berlin) with reformulated queries Q, Q’, Q’’.]
Problem: when updates are generated, we don’t know who will use them.
Solution:
• represent updates as first-class citizens
• Complement with boosters
• Rules for usage.
Other Research Issues
[Figure: the Piazza peer network (UW, Stanford, DBLP, Saarbruecken, Leipzig, CiteSeer, Berlin) with reformulated queries Q, Q’, Q’’.]
• Intelligent data placement
• Management of mapping networks
• Improving networks: finding additional connections.
• Indexing of views
Schema Matching/Mapping
• Given:
  • S1 and S2: a pair of schemas/DTDs/ontologies, …
  • Possibly, data accompanying instances
  • Additional domain knowledge
• Find:
  • A match between S1 and S2: a set of correspondences between the terms.
  • Ultimately, a mapping: should enable translating data between the schemas.
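Example: House Listings
[Figure: two "house" schema trees whose elements include location, address, view (front, back), Water view (Lake, Mountains), num-baths, full-baths and half-baths. Some correspondences are labeled "1-1 mapping", others "non 1-1 mapping?".]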
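The difference between the two kinds of correspondence can be shown with a tiny translation function. The reading of the figure used here is an assumption: location is taken to correspond 1-1 to address, and num-baths is taken to be the sum of full-baths and half-baths. The field names are adapted to Python identifiers.

```python
def translate_listing(listing):
    """Translate a listing from a hypothetical schema (address, full_baths,
    half_baths) to a hypothetical schema (location, num_baths)."""
    return {
        # 1-1 mapping: a single element corresponds to a single element.
        "location": listing["address"],
        # Non 1-1 mapping: one target element is computed from two source
        # elements (assumed here to be a simple sum).
        "num_baths": listing["full_baths"] + listing["half_baths"],
    }

translated = translate_listing(
    {"address": "Seattle, WA", "full_baths": 2, "half_baths": 1}
)
```

A match would only record that num-baths is related to full-baths and half-baths; the mapping additionally states how to compute one from the others.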
Motivations
• Heart of any data sharing architecture
  • Virtual, warehouse, messaging, web services, semantic web
  • Translation of legacy data, EAI, …
• Key operator in model management
  • Algebra for manipulating models of data
  • See [Bernstein, CIDR-03], Melnik et al. [SIGMOD 03].
• Currently, a bottleneck. Done mostly by hand.
Approaches to Matching
• Matching is hard because schema does not fully capture the semantics.
• Many techniques proposed. They consider similarities in:
  • Attribute names (synonyms)
  • Data values, data types
  • Relationships between columns
  • Structural similarities
• Anything a human expert would try!
• Hence, let’s try to simulate a human.
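Two of the signals listed above, attribute names and data types, can be combined in a few lines. This is a minimal sketch, not any of the matchers cited in the talk: string similarity from Python's difflib stands in for name matching, the equal weights and threshold are arbitrary choices, and the sample schemas are invented.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """String similarity in [0, 1] between two attribute names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_schemas(s1, s2, threshold=0.5):
    """s1, s2: non-empty dicts mapping attribute name -> data type.
    Return, for each attribute of s1, its best candidate in s2 if the
    combined score reaches the threshold."""
    matches = {}
    for a1, t1 in s1.items():
        scored = []
        for a2, t2 in s2.items():
            type_score = 1.0 if t1 == t2 else 0.0
            # Equal-weight combination of the two signals (arbitrary choice).
            scored.append((0.5 * name_similarity(a1, a2) + 0.5 * type_score, a2))
        score, best = max(scored)
        if score >= threshold:
            matches[a1] = best
    return matches

found = match_schemas(
    {"num_baths": "int", "location": "str"},
    {"baths": "int", "address": "str"},
)
```

Here location and address share few characters, so the name signal alone would miss the pair; the type signal is what carries it over the threshold, which is why combining several techniques matters.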
Philosophy of Solutions
• Effective schema matching requires a principled combination of techniques.
• Like human experts, the matcher should improve over time
  • Learn from seeing many schemas, matches.
• LSD [Doan, Ph.D 2002, U. of Illinois]
• COMA [Do et al.]
Corpus Based Solution [Madhavan, Bernstein, Chen, Halevy, Shenoy]
• Collect a corpus of schemas and matches.
• Learn from the corpus:
  • Create a classifier for every corpus element
  • Use multi-strategy learning.
• Given S1 and S2:
  • Compare each schema element to corpus elements.
  • If two elements’ similarity vectors are close, then maybe they match each other.
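The similarity-vector step can be sketched as follows. This is only an illustration under strong assumptions: a plain string-similarity score stands in for the learned per-element classifiers, closeness is measured by cosine similarity, and the corpus, threshold and function names are all invented.

```python
import math
from difflib import SequenceMatcher

def similarity_vector(element, corpus):
    """Score an element against every corpus element. A string-similarity
    function stands in for the learned per-element classifiers."""
    return [SequenceMatcher(None, element.lower(), c.lower()).ratio() for c in corpus]

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def candidate_matches(s1, s2, corpus, threshold=0.95):
    """Pair elements of s1 and s2 whose similarity vectors are close."""
    vecs1 = {e: similarity_vector(e, corpus) for e in s1}
    vecs2 = {e: similarity_vector(e, corpus) for e in s2}
    return [(a, b) for a in s1 for b in s2 if cosine(vecs1[a], vecs2[b]) >= threshold]

CORPUS_ELEMENTS = ["cost", "price", "address", "street"]
vector = similarity_vector("price", CORPUS_ELEMENTS)
```

The point of the indirection is that two elements never have to resemble each other directly; it is enough that they relate to the corpus in the same way.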
Learning from Corpus vs. Learning from the schemas
[Figure: Shipping Domain. Recall (0 to 1) for schema pairs P1a, P1b, P2a, P2b, P3a, P3b, P4a, P4b; series: MKB and BASIC.]
Finding Different Matches
[Figure: Shipping Domain. Average number of matches (-15 to 15) for schema pairs P1a, P1b, P2a, P2b, P3a, P3b, P4a, P4b; series: Extra Matches and Missed Matches.]
Other Corpus Based Tools
• Conjecture: a corpus of schemas can be the basis for many useful tools.
• Auto-complete:
  • I start creating a schema (or show sample data), and the tool suggests a completion.
• Query reformulation:
  • I ask a query using my terminology, and it gets reformulated appropriately.
• Improving structured queries over structured web sites (and focused crawling, a la BINGO!)
The Corpus
• Contents:
  • Schemas, ontologies, meta-data, data, queries.
• Sample statistics:
  • How often does a word appear as a relation name?
  • When it does, what tend to be the attribute names?
  • What other tables are there? What are the foreign keys?
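The first two statistics above are simple counts over the corpus. The sketch below computes them for a toy corpus; the corpus contents and the representation (each schema as a dict from relation name to attribute list) are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Toy corpus: each schema maps relation names to attribute lists (invented).
CORPUS = [
    {"book": ["title", "author", "price"], "order": ["id", "book_id"]},
    {"book": ["title", "isbn"], "review": ["book_id", "text"]},
    {"item": ["title", "price"]},
]

def corpus_statistics(corpus):
    """Count how often each word is a relation name, and which attribute
    names occur in relations with that name."""
    relation_counts = Counter()
    attribute_counts = defaultdict(Counter)
    for schema in corpus:
        for relation, attributes in schema.items():
            relation_counts[relation] += 1
            attribute_counts[relation].update(attributes)
    return relation_counts, attribute_counts

relation_counts, attribute_counts = corpus_statistics(CORPUS)
```

Counts like `attribute_counts["book"].most_common()` are the kind of evidence an auto-complete tool could draw on when a user starts defining a relation named book.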
Conclusion: Crossing the Structure Chasm
• Data authoring, querying and sharing is everywhere; done by novices too.
• Semantic web: the extreme example.
[Figure: corpus of schemas; schema mapping.]
Some References
• www.cs.washington.edu/homes/alon
• Piazza: WebDB01, ICDE03, WWW03
• The Structure Chasm: CIDR-03
• Mediation surveys:
  • VLDB Journal 01
  • Lenzerini, PODS 02 tutorial.
• Schema matching:
  • Rahm and Bernstein, VLDB Journal 01.