
Slide 1: Statistical Schema Matching across Web Query Interfaces

Bin He, Kevin Chen-Chuan Chang

SIGMOD 2003

Slide 2: Background: Large-Scale Integration of the Deep Web

[Figure: a query issued against the Deep Web and the result returned; labels: Query, Result, The Deep Web]

Slide 3: Challenge: Matching Query Interfaces (QIs)

[Figure: example query interfaces from the Book domain and the Music domain]

Slide 4: Traditional Approaches of Schema Matching: Pairwise Attribute Correspondence

Scale is a challenge: traditional pairwise matching works only at small scale, but large scale is a must for our task.
Scale is an opportunity: many sources provide useful context.

Pairwise attribute correspondence example:
S1: author, title, subject, ISBN
S2: writer, title, category, format
S3: name, title, keyword, binding
Correspondences: S1.author = S3.name, S1.subject = S2.category

Slide 5: Deep Web Observation

- Proliferating sources
- Converging vocabularies

Slide 6: A Hidden Schema Model Exists?

Our view (hypothesis): a hidden model M, consisting of a finite vocabulary and a statistical model P, generates the observed QIs with different probabilities. Instantiation probability: P(QI1|M).

Slide 7: A Hidden Schema Model Exists?

Our view (hypothesis): a hidden model M, with a finite vocabulary and a statistical model P, generates QIs with instantiation probability P(QI1|M).

Now the problem is: given the QIs, can we discover M?

Slide 8: MGS Framework & Goal

MGS framework: hypothesis modeling, hypothesis generation, hypothesis selection.

Goal:
- Verify the observed phenomena
- Validate MGSsd with two metrics

Slide 9: Comparison with Related Work

             Related Work                                        Authors' Work
Paradigms    match two input sources                             match many sources
Techniques   machine learning, constraint-based, hybrid ones     statistical approach
Input data   relational or structured schemas with inconsistency interfaces with consistency
Focuses      name match, structure match, etc.                   synonym discovery

Slide 10: Outline

- MGS
- MGSsd: Hypothesis Modeling, Generation, Selection
- Dealing with Real-World Data
- Final Algorithm
- Case Study
- Metrics
- Experimental Results
- Conclusion and Future Issues
- My Assessment

Slide 11: Towards Hidden Model Discovery: Statistical Schema Matching (MGS)

1. Hypothesis modeling: define the abstract model structure M to solve a target question; specify P(QI|M).
2. Hypothesis generation: given the observed QIs, generate the model candidates M1, M2, ... such that P(QIs|M) > 0.
3. Hypothesis selection: select the candidate with the highest confidence. What is the confidence of M1 given the QIs?

Slide 12: MGSsd: Specialize MGS for Synonym Discovery

MGS is generally applicable to a wide range of schema matching tasks, e.g., attribute grouping.

Focus here: discover synonym attributes, e.g., Author vs. Writer, Subject vs. Category.
- No hierarchical matching: a query interface is treated as a flat schema.
- No complex matching: e.g., (LastName, FirstName) vs. Author.

Slide 13: Hypothesis Modeling: Structure

Goal: capture synonym relationships with a two-level model structure:
- Concepts (top level): mutually independent; no overlapping concepts.
- Attributes (within a concept): mutually exclusive.

Possible schemas: I1 = {author, title, subject, ISBN}, I2 = {title, category, ISBN}

Slide 14: Hypothesis Modeling: Formula

Probability that M can generate schema I (reconstructed from the instantiation examples on the next slide): each concept C is selected independently with probability P(C|M), and a selected concept contributes exactly one of its attributes A with probability P(A|C). Thus, writing C(A) for the concept containing attribute A:

P(I|M) = [product over attributes A in I] P(C(A)|M) * P(A|C(A)) * [product over concepts C with no attribute in I] (1 - P(C|M))

Slide 15: Hypothesis Modeling: Instantiation Probability

1. Observing an attribute: P(author|M) = P(C1|M) * P(author|C1) = α1 * β1
2. Observing a schema: P({author, ISBN, subject}|M) = P(author|M) * P(ISBN|M) * P(subject|M) * (1 - P(C2|M))
3. Observing a schema set: P(QIs|M) = Π_i P(QI_i|M)
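The instantiation probabilities above can be sketched in code. This is a minimal illustration with a model representation of my own devising (a dict mapping concept names to (alpha, {attribute: beta}) pairs), not the paper's actual data structures:

```python
# Sketch of the two-level schema model's instantiation probability P(I|M).
# Hypothetical representation: model maps each concept to (alpha, betas),
# where alpha = P(concept|M) and betas = {attribute: P(attribute|concept)}.

def p_schema(schema, model):
    """P(I|M): for each concept, multiply alpha*beta if one of its
    attributes is observed, or (1 - alpha) if the concept is absent.
    Two synonyms in one schema violate mutual exclusion -> probability 0."""
    p = 1.0
    for alpha, betas in model.values():
        chosen = [a for a in betas if a in schema]
        if len(chosen) > 1:          # mutual exclusion within a concept
            return 0.0
        p *= alpha * betas[chosen[0]] if chosen else (1.0 - alpha)
    return p

def p_schema_set(schemas, model):
    """P(QIs|M) = product of P(QI_i|M), assuming independent schemas."""
    p = 1.0
    for s in schemas:
        p *= p_schema(s, model)
    return p
```

For example, with M = {C1: (0.9, {author: 1.0}), C2: (1.0, {subject: 0.7, category: 0.3})}, P({author, subject}|M) = 0.9 * 1.0 * 1.0 * 0.7 = 0.63, and P({subject, category}|M) = 0 because subject and category are mutually exclusive synonyms.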

Slide 16: Consistency Check

A set of schemas I is taken as schema observations <Ii, Bi>, where Bi is the number of occurrences of each distinct schema Ii. M is consistent with the observations if Pr(I|M) > 0. Only consistent models are kept as candidates.

Slide 17: Hypothesis Generation

Two sub-steps:
1. Consistent concept construction
2. Build the hypothesis space

Slide 18: Hypothesis Generation: Space Pruning

Prune the space of model candidates: generate only models M with P(QI|M) > 0 for every observed QI, using the mutual exclusion assumption and a co-occurrence graph.

Example: observations QI1 = {author, subject} and QI2 = {author, category}. The space of models is any set partition of {author, subject, category}: five candidates M1 through M5, ranging from three singleton concepts to one concept containing all three attributes.

Slide 19: Hypothesis Generation (Cont.)

Same example: observations QI1 = {author, subject} and QI2 = {author, category}; the space of models is any set partition of {author, subject, category}. After pruning by mutual exclusion, only the candidates in which author forms its own concept survive, since author co-occurs with both subject and category: {(author), (subject), (category)} and {(author), (subject, category)}.
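The pruning step can be sketched as follows: enumerate all set partitions of the vocabulary, then discard any partition that places two co-occurring attributes in the same concept. Function names here are mine; this is an illustration of the idea, not the paper's implementation (which uses a co-occurrence graph rather than brute-force enumeration):

```python
from itertools import combinations

def partitions(items):
    """Yield all set partitions of a list (candidate concept groupings)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # put `first` into each existing block in turn...
        for i in range(len(part)):
            yield part[:i] + [part[i] | {first}] + part[i + 1:]
        # ...or into a new singleton block
        yield part + [{first}]

def prune(vocab, observed_schemas):
    """Keep only partitions where no two attributes that co-occur in some
    observed QI share a concept (the mutual exclusion assumption)."""
    cooccur = set()
    for qi in observed_schemas:
        cooccur |= {frozenset(pair) for pair in combinations(qi, 2)}
    return [part for part in partitions(list(vocab))
            if all(frozenset(pair) not in cooccur
                   for block in part for pair in combinations(block, 2))]
```

On the slide's example, prune(["author", "subject", "category"], [{"author", "subject"}, {"author", "category"}]) reduces the five partitions to the two surviving candidates, both with author as a singleton concept.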

Slide 20: Hypothesis Generation (Cont.)

Build the probability functions by maximum likelihood estimation: estimate the αi and βj that maximize Pr(I|M).
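Under the two-level model, the maximum-likelihood estimates have a simple closed form (my reading of the setup; the paper may derive them differently): α for a concept is the fraction of observed schemas in which the concept appears, and β for an attribute is its share of that concept's observations. A sketch with hypothetical names:

```python
from collections import Counter

def mle(schemas, concepts):
    """Closed-form MLE sketch for the two-level model.
    schemas: list of (attribute_set, occurrence_count) observations.
    concepts: list of attribute sets from a candidate model.
    Returns {concept_index: (alpha, {attribute: beta})}."""
    total = sum(count for _, count in schemas)
    model = {}
    for j, concept in enumerate(concepts):
        attr_counts = Counter()
        for schema, count in schemas:
            for a in concept & schema:   # mutual exclusion: at most one hit
                attr_counts[a] += count
        observed = sum(attr_counts.values())
        alpha = observed / total
        betas = {a: n / observed for a, n in attr_counts.items()} if observed else {}
        model[j] = (alpha, betas)
    return model
```

For instance, given observations {author, subject} seen 6 times, {author, category} 3 times, and {subject} once, the candidate grouping [(author), (subject, category)] gets α_author = 9/10, α_(subject,category) = 10/10, β_subject = 0.7, β_category = 0.3.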

Slide 21: Hypothesis Selection

Rank the model candidates and select the model that generates the distribution closest to the observations. Approach: hypothesis testing.

Example: select the schema model at significance level 0.05.
- Candidate with statistic 3.93 < 7.815: accept
- Candidate with statistic 20.20 > 14.067: reject
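The selection step can be sketched as a Pearson chi-square goodness-of-fit test: compare observed schema counts with the counts the candidate model predicts, and reject the candidate if the statistic exceeds the critical value at the chosen significance level (7.815 and 14.067 on the slide are the standard 0.05-level chi-square critical values for 3 and 7 degrees of freedom). The exact test construction is my assumption; this sketch uses hard-coded critical values:

```python
def chi_square_stat(observed, expected):
    """Pearson chi-square statistic: sum((O - E)^2 / E) over matched cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

# Chi-square critical values at significance level 0.05, by degrees of freedom
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 7: 14.067}

def accept(observed, expected, dof):
    """Accept the candidate model if its statistic is below the 0.05 cutoff."""
    return chi_square_stat(observed, expected) < CRITICAL_05[dof]
```

A candidate whose predicted counts closely track the observations (statistic near 0) is accepted; a candidate that badly misallocates probability mass produces a large statistic and is rejected.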

Slide 22: Dealing with Real-World Data

The observed schemas follow a head-often, tail-rare distribution. Three techniques:
- Attribute selection: systematically remove rare attributes.
- Rare schema smoothing: aggregate infrequent schemas into one conceptual event I(rare).
- Consensus projection: following the concept mutual independence assumption, extract and aggregate the common parts of candidates into new input schemas with re-estimated parameters.
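Rare schema smoothing can be sketched as pooling every schema whose relative frequency falls below the threshold into a single I(rare) event; the function name and exact threshold semantics here are my own illustration:

```python
def smooth_rare(observations, f=0.10):
    """observations: {schema (frozenset): occurrence count}.
    Schemas with relative frequency below f are pooled into one
    conceptual 'I(rare)' event; frequent schemas pass through."""
    total = sum(observations.values())
    kept, rare = {}, 0
    for schema, count in observations.items():
        if count / total < f:
            rare += count
        else:
            kept[schema] = count
    if rare:
        kept["I(rare)"] = rare
    return kept
```

This keeps the total observation count intact while preventing the long tail of one-off schemas from dominating the hypothesis space.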

Slide 23: Final Algorithm

Two phases:
1. Build the initial hypothesis space: attribute selection, then combine rare interfaces.
2. Discover the hidden model: iterate hypothesis generation and hypothesis selection, each iteration extracting the common parts of the model candidates from the previous iteration.

Slide 24: Experiment Setup in the Case Studies

Over 200 sources in four domains. Threshold f = 10%; significance level 0.05. Both can be specified by users.

Slide 25: Example of the MGSsd Algorithm

M1={(ti), (is), (kw), (pr), (fm), (pd), (pu), (su,cg), (au,ln), (fn)}

M2={(ti), (is), (kw), (pr), (fm), (pd), (pu), (su,cg), (au,fn), (ln)}

Slide 26: Metrics

1. How close the result is to the correct schema model: precision and recall.
2. How well it can answer the target question: precision and recall.

Slide 27: Examples on Metrics

I = {<I1, 6>, <I2, 3>, <I3, 1>}, where I1 = {author, subject}, I2 = {author, category}, I3 = {subject}
M1 = {(author:1):0.6, (subject:0.7, category:0.3):1}
M2 = {(author:1):0.6, (subject:1):0.7, (category:1):0.3}

Metrics 1: Pm(M2, Mc) = 0.196 + 0.036 + 0.249 + 0.054 = 0.58; Rm(M2, Mc) = 0.28 + 0.12 + 0.42 + 0.18 = 1

Slide 28: Experimental Results

- The approach identifies most concepts correctly; the incorrect matchings are due to a small number of observations.
- Both suites of metrics are needed.
- Time complexity is exponential.
- It can generate all correct instances.
- The discovered synonyms are all correct ones.

Slide 29: Advantages

- Scalability: large-scale matching
- Solvability: exploits statistical information
- Generality

Holistic model discovery vs. pairwise attribute correspondence:

S1: author, title, subject, ISBN
S2: writer, title, category, format
S3: name, title, keyword, binding

Holistic: one discovered model groups {author, name, writer} and {subject, category} across all sources at once. Pairwise: separate correspondences such as S1.author = S3.name and S1.subject = S2.category.

Slide 30: Conclusions & Future Work

Holistic statistical schema matching of massive sources:
- The MGS framework finds synonym attributes by discovering hidden models.
- Suited for large-scale databases.
- Results verify the observed phenomena and show accuracy and effectiveness.

Future issues:
- Complex matching, e.g., (Last Name, First Name) vs. Author
- More efficient approximation algorithms
- Incorporating other matching techniques

Slide 31: My Assessment

Promise:
- Uses minimal, lightweight information: attribute names only
- Effective with sufficient instances
- Leverages the scale challenge as an opportunity

Limitations:
- Needs sufficient observations
- Simple assumptions
- Exponential time complexity
- Homonyms are not handled

Slide 32: Questions