Bootstrapping Pay-As-You-Go Data Integration Systems
By Anish D. Sarma, Xin Dong, Alon Halevy. Proceedings of SIGMOD'08, Vancouver, British Columbia, Canada, June 2008
Presented by Andrew Zitzelberger
Data Integration
- Offers a single-point interface to a set of data sources
  - Mediated schema
  - Semantic mappings
  - Queries are posed through the mediated schema
- Pay-as-you-go
  - Many contexts can be useful without full integration
  - System starts with few (or inaccurate) semantic mappings
  - Mappings are improved over time
- Problem: full integration requires significant upfront and ongoing effort
Contributions
- A self-configuring data integration system
  - Provides an advanced starting point for pay-as-you-go systems
  - Initial configuration provides good precision and recall
- Algorithms
  - Mediated schema generation
  - Semantic mapping generation
- Concept: the probabilistic mediated schema
Mediated Schema Generation
1) Remove infrequent attributes
   - Ensures the mediated schema contains the most relevant attributes
2) Construct a weighted graph
   - Nodes are the remaining attributes
   - Edges carry the value of a similarity measure s(ai, aj)
   - Cull edges whose weight falls below a threshold τ
3) Cluster the nodes
   - Each cluster is a connected component of the graph
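The three steps above can be sketched in Python. The attribute names and the `sim` function here are illustrative; the presentation's experiments compute s(ai, aj) with Jaro-Winkler similarity via SecondString.

```python
from collections import defaultdict

def cluster_attributes(attrs, sim, tau):
    """Cluster attributes for the mediated schema: connect any pair
    whose similarity is at least tau, then take the connected
    components of the resulting graph as clusters."""
    adj = defaultdict(set)
    for i, a in enumerate(attrs):
        for b in attrs[i + 1:]:
            if sim(a, b) >= tau:       # cull edges below the threshold
                adj[a].add(b)
                adj[b].add(a)
    seen, clusters = set(), []
    for a in attrs:                    # depth-first search per component
        if a in seen:
            continue
        stack, comp = [a], set()
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(adj[x] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters
```

For example, with four attributes where only name/title and price/cost score above τ, the function returns two clusters.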
Probabilistic Mediated Schema Generation
- Allow for an error margin ε in the weighted graph:
  - Certain edges: s(ai, aj) ≥ τ + ε
  - Uncertain edges: τ - ε ≤ s(ai, aj) < τ + ε
  - Culled edges: s(ai, aj) < τ - ε
- Remove unnecessary uncertain edges
- Create a mediated schema from every subset of the remaining uncertain edges
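The subset enumeration can be sketched as follows. The helper names `components` and `probabilistic_schemas` are mine, and the sketch omits the pruning of unnecessary uncertain edges and the assignment of probabilities to the resulting schemas.

```python
from itertools import chain, combinations

def components(attrs, edges):
    """Connected components of the graph (attrs, edges), via union-find."""
    parent = {a: a for a in attrs}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    comps = {}
    for a in attrs:
        comps.setdefault(find(a), set()).add(a)
    return sorted(comps.values(), key=sorted)

def probabilistic_schemas(attrs, certain, uncertain):
    """One candidate mediated schema per subset of the uncertain edges;
    certain edges are always included. Duplicate clusterings collapse."""
    subsets = chain.from_iterable(
        combinations(uncertain, k) for k in range(len(uncertain) + 1))
    schemas = []
    for sub in subsets:
        schema = components(attrs, list(certain) + list(sub))
        if schema not in schemas:
            schemas.append(schema)
    return schemas
```

With one certain edge (a, b) and one uncertain edge (b, c), this yields two candidate schemas: one keeping c separate, and one merging all three attributes.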
Probabilistic Mapping Generation
- Compute weighted correspondences between source attributes and mediated-schema attributes
- Choose the consistent p-mapping with the maximum entropy
Probabilistic Mapping Generation
1) Enumerate one-to-one mappings
   - Each mapping must contain a subset of the correspondences
2) Assign probabilities that maximize entropy
   - Solve the resulting constrained maximization problem
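The entropy-maximization step can be sketched as a small constrained optimization. This is a toy illustration: the experiments solve the problem with Knitro, whereas here scipy's SLSQP solver is substituted, and the constraints are simplified to one equality per weighted correspondence (the total probability of the mappings containing it must equal its weight) plus the requirement that the probabilities sum to one.

```python
import numpy as np
from scipy.optimize import minimize

def max_entropy_probs(mappings, weights):
    """Assign probabilities p_l to candidate mappings, maximizing
    entropy -sum(p log p) subject to: for each correspondence c with
    weight w_c, sum of p_l over mappings containing c equals w_c,
    and sum(p_l) = 1.
    mappings: list of sets of correspondences
    weights:  dict correspondence -> weight in [0, 1]"""
    n = len(mappings)
    def neg_entropy(p):
        p = np.clip(p, 1e-12, 1.0)      # avoid log(0)
        return float(np.sum(p * np.log(p)))
    cons = [{"type": "eq", "fun": lambda p: np.sum(p) - 1.0}]
    for c, w in weights.items():
        idx = [i for i, m in enumerate(mappings) if c in m]
        cons.append({"type": "eq",
                     "fun": lambda p, idx=idx, w=w: np.sum(p[idx]) - w})
    res = minimize(neg_entropy, np.full(n, 1.0 / n), method="SLSQP",
                   bounds=[(0.0, 1.0)] * n, constraints=cons)
    return res.x
```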
Probabilistic Mediated Schema Consolidation
- Why?
  - The user expects a single deterministic mediated schema
  - More efficient query answering
- How? See the example on the next slide.
Schema Consolidation Example
- M = {M1, M2}
- M1 contains the clusters {a1, a2, a3}, {a4}, and {a5, a6}
- M2 contains the clusters {a2, a3, a4} and {a1, a5, a6}
- The consolidated schema T contains {a1}, {a2, a3}, {a4}, and {a5, a6}
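In this example, T groups two attributes together exactly when every Mi does, i.e., T is the common refinement of the clusterings. A minimal sketch consistent with the example (the function name is mine):

```python
def consolidate(schemas):
    """Common refinement: two attributes land in the same consolidated
    cluster only if every schema places them in the same cluster.
    schemas: list of clusterings, each a list of frozensets of attrs."""
    def cluster_of(schema, a):
        for c in schema:
            if a in c:
                return c
        return frozenset([a])
    attrs = set().union(*[set().union(*s) for s in schemas])
    result = {}
    for a in attrs:
        # The tuple of clusters an attribute belongs to (one per schema)
        # is its signature; attributes sharing a signature stay together.
        key = tuple(cluster_of(s, a) for s in schemas)
        result.setdefault(key, set()).add(a)
    return sorted(result.values(), key=sorted)
```

Running this on M1 and M2 from the example reproduces T.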
Probabilistic Mapping Consolidation
- Modify each p-mapping: update its mappings to target the new consolidated mediated schema
- Modify the probabilities: weight each mapping's probability by Pr(Mi), the probability of its mediated schema
- Consolidate: add all modified mappings to a new set; if a mapping is already in the set, add the probabilities together
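The weighting and merging steps above can be sketched as follows, assuming the mappings have already been rewritten against the consolidated schema (the function name and data representation are illustrative):

```python
def consolidate_pmappings(per_schema, schema_probs):
    """per_schema: for each mediated schema Mi, a dict
    {mapping: probability}, where a mapping is a frozenset of
    (source_attr, mediated_attr) pairs already rewritten to the
    consolidated schema. schema_probs: Pr(Mi) for each Mi.
    Each mapping's probability is weighted by Pr(Mi); duplicate
    mappings are merged by adding their probabilities."""
    combined = {}
    for probs, pr_mi in zip(per_schema, schema_probs):
        for mapping, p in probs.items():
            combined[mapping] = combined.get(mapping, 0.0) + p * pr_mi
    return combined
```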
Experimental Setup
- UDI, the data integration system, accepts select-project queries over a single table
- Source data: MySQL
- Query processor: Java
- Jaro-Winkler similarity computation: SecondString
- Entropy-maximization problem: Knitro
- Operating system: Windows Vista
- CPU: Intel Core 2 GHz
- Memory: 2 GB
Experiments
- Domains: Movie, Car, People, Course, Bibliography
- Gold standards: manually created for People and Bibliography; partially created for the other domains
- 10 test queries: one to four attributes in the SELECT clause, zero to three predicates in the WHERE clause
Experiments
- Compared to keyword-search methods over MySQL:
  - KEYWORDNAIVE
  - KEYWORDSTRUCT
  - KEYWORDSTRICT
- SOURCE: unions the results from each data source
- TOPMAPPING: considers only the p-mapping with the highest probability
Experiments
- Compared against other query-answering methods:
  - SINGLEMED: a single deterministic mediated schema
  - UNIONALL: a single deterministic mediated schema that contains a singleton cluster for each frequent source attribute
Experiments and Results
- Setup efficiency: 3.5 minutes for 817 data sources
- Time increases roughly linearly with the number of data sources
- Solving the maximum-entropy problem is the most time-consuming step
Future Work
- A different schema matcher
- Dealing with multiple-table sources
- Including multi-table schemas
- Normalizing mediated schemas