Page 1: 20090411

26 pages

1

CIKM’08

Shui-Lung Chuang, Kevin Chen-Chuan Chang

Yen-Ling Lin, 2009/04/13

Integrating Web Query Results: Holistic Schema Matching

Page 2: 20090411


Outline

Introduction
Approach
Framework
Algorithm
Experiments

Page 3: 20090411


Introduction


Page 4: 20090411


Introduction


Page 5: 20090411


Page 6: 20090411


Introduction – Schema Matching on Query Results

Data fields are the basic units processed by matching. A data field can be viewed as a label plus a set of values.

We lack explicit and complete schema information.

To overcome such challenges, we observe two opportunities in the context of integrating query results:

1) First, we often need to integrate multiple sources. Some useful effects naturally occur when cross-referencing many sources.

2) Second, although no schema-based constraint is available, there are useful regularities that can be observed across many sources. These regularities, treated as observed domain constraints, are very helpful for matching discovery.

Page 7: 20090411


Introduction – Approach

The enrichment occurs at three levels:
1. The content of a field
2. The kinds of fields
3. The constraints of fields

With all of the above enrichment, we learn a more complete schema to describe the whole input data.

This learned schema can then help us discover further matchings.

Page 8: 20090411


Framework – Problem Statement

Suppose A = {a1, a2, …} for the book domain. For source S1, the fields X1 = (x11, x12, …, x17) can be assigned the matching Y1 = (a1, a2, …, a7).

Matching is actually discovering the assignment of the groups in A to the fields of each source:

Ys = (y_{s1}, …, y_{sl_s}), where l_s is the number of fields of source Ss and each y_{si} ∊ A is the group to which source field x_{si} ∊ Xs is assigned.

Page 9: 20090411


Framework – Matching as Domain Schema Discovery

Let the domain schema be M = (A, B)
A: the set of domain fields
B: the statistical constraints

For each source Ss:

1) Project M onto a source schema Ms = (Ys, Vs)
   Ys: a subset of A to be the fields of source Ss
   Vs: a set of constraints instantiated from B

2) Construct the source instances Xs

3) Vs → Us, Ys → Xs: Is = (Xs, Us)

4) Output: Xs

Page 10: 20090411


Framework – Matching as Domain Schema Discovery

This procedure of data generation can be conceptually sketched as follows:

M = (A, B), where A = {a1, …, a11} and B = {first(a1): .67, first(a2): .33, pos≻(a2, a3): 1}

M1 = (Y1, V1), where Y1 = {a1, …, a5, a7, a8} and V1 = {first(a1): .67, first(a2): .33, pos≻(a2, a3): 1}

We generate data using the source schema M1:

Map Y1 to X1, e.g., a2 is mapped to x12

first(a1) in V1 is rewritten as first(x11) in U1, and pos≻(a2, a3) as pos≻(x12, x13)
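
The projection and rewriting steps of this example can be sketched in Python. This is only a minimal sketch under stated assumptions: constraints are stored in a dict keyed by (predicate, elements), a constraint is instantiated for a source only when all of its elements belong to that source, and the names `project_schema` and `rewrite_constraints` are hypothetical, not taken from the paper.

```python
# Hypothetical representation: {(predicate_name, element_tuple): confidence}.
B = {
    ("first", ("a1",)): 0.67,
    ("first", ("a2",)): 0.33,
    ("pos>", ("a2", "a3")): 1.0,
}
Y1 = ["a1", "a2", "a3", "a4", "a5", "a7", "a8"]


def project_schema(domain_constraints, source_fields):
    """Assumed rule: keep the constraints whose elements all occur in the source."""
    present = set(source_fields)
    return {
        (name, elems): conf
        for (name, elems), conf in domain_constraints.items()
        if all(e in present for e in elems)
    }


def rewrite_constraints(source_constraints, source_fields, source_id):
    """Map the k-th field of Y_s to x_{s,k} and rewrite V_s into U_s."""
    to_x = {a: f"x{source_id}{k}" for k, a in enumerate(source_fields, start=1)}
    rewritten = {
        (name, tuple(to_x[e] for e in elems)): conf
        for (name, elems), conf in source_constraints.items()
    }
    return to_x, rewritten


V1 = project_schema(B, Y1)
mapping, U1 = rewrite_constraints(V1, Y1, source_id=1)
print(mapping["a2"])  # x12
print(U1)
```

Because a1, a2 and a3 all belong to Y1, every constraint in B survives the projection, which agrees with V1 on the slide.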

Page 11: 20090411


Framework – Matching as Domain Schema Discovery

Let the data observed from source Ss be Is = (Xs, Us). Given the matching Y = {Ys : s ∊ S}, learning the best domain schema can be described as a probabilistic optimization expression:

$$M^* = \arg\max_{M} \prod_{s \in S} p(I_s \mid Y_s, M)$$

Similarly, if the domain schema M is given, the best matching can be discovered, again using statistical techniques to find the most likely assignment of domain fields to the fields of each source: for each s ∊ S,

$$Y^* = \{Y_s^* : s \in S\}, \qquad Y_s^* = \arg\max_{Y_s} p(I_s \mid Y_s, M)$$

Page 12: 20090411


Framework – Matching as Domain Schema Discovery

Suppose X1 = {x11, x12, x13} and X2 = {x21, x22}, and suppose we have one predicate function to check: first. Then I1 = (X1, U1) where U1 = {first(x11): 1}, and I2 = (X2, U2) where U2 = {first(x21): 1}.

Suppose Y1 = {a1, a2, a3} and Y2 = {a2, a3}. Construct M1 = (Y1, V1) with V1 = {first(a1): 1}, and M2 = (Y2, V2) with V2 = {first(a2): 1}.

first(a1) holds for M1 but not for M2, so first(a1) has confidence 0.5; likewise, first(a2) holds for M2 but not for M1. Combining the source schemas M1 and M2, the domain schema becomes M = (A, B), where A = {a1, a2, a3} and B = {first(a1): .5, first(a2): .5}.
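
The arithmetic of this example can be reproduced with a short Python sketch. It assumes that the confidence of a constraint is its confidence averaged over all source schemas, counting a source where the constraint does not hold as 0; this rule reproduces the numbers on the slide but may not be the paper's exact estimator. The function name `combine_source_schemas` is hypothetical.

```python
from collections import defaultdict


def combine_source_schemas(source_schemas):
    """Merge source schemas (Y_s, V_s) into a domain schema (A, B).

    Assumption: B averages each constraint's confidence over all sources.
    """
    fields = set()
    totals = defaultdict(float)
    for ys, vs in source_schemas:
        fields.update(ys)
        for constraint, confidence in vs.items():
            totals[constraint] += confidence
    n_sources = len(source_schemas)
    constraints = {c: total / n_sources for c, total in totals.items()}
    return sorted(fields), constraints


M1 = (["a1", "a2", "a3"], {("first", ("a1",)): 1.0})
M2 = (["a2", "a3"], {("first", ("a2",)): 1.0})

A, B = combine_source_schemas([M1, M2])
print(A)  # ['a1', 'a2', 'a3']
print(B)  # {('first', ('a1',)): 0.5, ('first', ('a2',)): 0.5}
```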

Page 13: 20090411


Framework – Formulation and Overview

Field Model
A field model is a statistical model specifying how to generate instances.
A field model a is a function that accepts an instance z and produces p(z|a), indicating the likelihood that z is an instance produced by the field model a.

Statistical Constraint
A statistical constraint b is written as f(e):c, where f is a predicate name, e is the vector of elements, and c is a confidence value in the range [0, 1].

Page 14: 20090411


Framework – Formulation and Overview

Overall, our framework translates the problem of instance-based matching into a schema-discovery problem.

With such a strategy, we leverage not only the data instances but also the regularities observed from the data in a principled way.

Page 15: 20090411


Algorithm

To solve our matching problem, we need to discover either an optimal matching Y* or an optimal schema M*.

If one of them is obtained, the other can be derived.

The basic idea is to start from an initial guess of the matching Y and iteratively improve it using the schema M derived from the current estimate of Y.
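
The alternation between matching and schema can be illustrated with a small Python loop. This is a sketch of the control flow only: `iterate_matching` is a hypothetical name, and the toy `learn_schema` / `schema_match` functions below (group means and nearest-mean assignment over plain numbers) are stand-ins chosen to keep the example runnable, not the models used in the paper.

```python
def iterate_matching(initial_matching, learn_schema, schema_match, max_iters=20):
    """Alternate schema learning and matching until the matching stops changing."""
    matching = initial_matching
    for _ in range(max_iters):
        schema = learn_schema(matching)   # derive M from the current Y
        updated = schema_match(schema)    # derive a new Y from M
        if updated == matching:
            break
        matching = updated
    return matching, learn_schema(matching)


# Toy stand-in (an assumption, not the paper's models): fields are numbers,
# a "schema" is the mean of each group, and matching picks the nearest mean.
VALUES = [1.0, 1.2, 5.0, 5.3]


def toy_learn_schema(matching):
    groups = {}
    for value, label in zip(VALUES, matching):
        groups.setdefault(label, []).append(value)
    return {label: sum(vs) / len(vs) for label, vs in groups.items()}


def toy_schema_match(schema):
    return [min(schema, key=lambda label: abs(schema[label] - v)) for v in VALUES]


final_matching, final_schema = iterate_matching(
    [0, 1, 0, 1], toy_learn_schema, toy_schema_match
)
print(final_matching)  # [0, 0, 1, 1]
```

Starting from a poor initial guess, the loop settles after one update, because the schema learned from the improved matching reproduces the same matching.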

Page 16: 20090411


Algorithm

InitMatch
This function generates an initial matching, to be the starting point for the iterations.

EnumRelations
We need to identify the constraints occurring in the input data.

Predicate Function
f(i1, …, ik, X): i1, …, ik specify which elements to check for satisfaction of the predicate f, and X is the original data.
True: the input satisfies the predicate
False: otherwise
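
As an illustration, two predicate functions in the spirit of first and pos≻ can be written in Python, together with a brute-force enumeration of the constraints that hold in the data. The data layout is an assumption: X is taken to be a list of records, each listing the field identifiers in the order they appear. The names `pos_before`, `PREDICATES` and `enum_relations` are hypothetical.

```python
from itertools import permutations


def first(i, X):
    """True if field i leads every record in which it appears (at least one)."""
    records = [rec for rec in X if i in rec]
    return bool(records) and all(rec[0] == i for rec in records)


def pos_before(i, j, X):
    """True if field i precedes field j in every record containing both."""
    records = [rec for rec in X if i in rec and j in rec]
    return bool(records) and all(rec.index(i) < rec.index(j) for rec in records)


# Predicate name -> (function, number of elements it checks).
PREDICATES = {"first": (first, 1), "pos>": (pos_before, 2)}


def enum_relations(fields, X):
    """Try every predicate on every ordered tuple of fields; keep those that hold."""
    relations = []
    for name, (predicate, arity) in PREDICATES.items():
        for elems in permutations(fields, arity):
            if predicate(*elems, X):
                relations.append((name, elems))
    return relations


X1 = [["x11", "x12", "x13"], ["x11", "x13"]]
relations = enum_relations(["x11", "x12", "x13"], X1)
print(relations)
```

Exhaustive enumeration is only practical for predicates over a small number of elements; it is used here to keep the sketch short.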

Page 17: 20090411


Algorithm

LearnSchema – From Matching to Schema
The aim is to construct a schema based on a given matching.
First, group the matched source fields together.
Each group is trained as a field model, which is modeled as a 2-state HMM.

Learning an HMM a from a given set of instances and computing the probability p(z|a) for a given instance z follow the standard HMM training and inference algorithms.
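
For the inference side, p(z|a) can be computed with the standard forward algorithm. The sketch below assumes a discrete 2-state HMM whose observations are coarse token classes ("digit", "word"); the alphabet and all parameter values are hypothetical and chosen only for illustration, since the slides do not specify them.

```python
from itertools import product

# Hypothetical parameters of a 2-state field model a.
START = [0.6, 0.4]                      # initial state probabilities
TRANS = [[0.7, 0.3], [0.4, 0.6]]        # TRANS[r][s] = p(next state s | state r)
EMIT = [
    {"digit": 0.9, "word": 0.1},        # emissions of state 0
    {"digit": 0.2, "word": 0.8},        # emissions of state 1
]


def forward_likelihood(obs, start, trans, emit):
    """Return p(obs | HMM) using the forward algorithm."""
    if not obs:
        return 1.0
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][symbol]
            for s in range(n_states)
        ]
    return sum(alpha)


p = forward_likelihood(["digit", "digit", "word"], START, TRANS, EMIT)

# Sanity check: likelihoods of all length-3 sequences sum to 1.
total = sum(
    forward_likelihood(list(seq), START, TRANS, EMIT)
    for seq in product(["digit", "word"], repeat=3)
)
print(p, total)
```

Training (for example with Baum-Welch) is omitted; only the likelihood computation used when scoring an instance against a field model is shown.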

Page 18: 20090411


Algorithm

SchemaMatch – From Schema to Matching
Given the domain schema, matching becomes labeling the elements of the sources with the appropriate domain fields.

For each hj ∈ Vs with the corresponding bj ∈ B, let their constraint be fj(yi1, …, yik). We define a per-constraint score q_{j,i}(a), computed from p(hj | bj) over the assignments to yi1, …, yik, and an overall score q_i(a) that combines the term for the instance z_i under field a with the scores q_{j,i}(a) of all constraints j.

The most likely value for each yi is thus:

$$y_i^* = \arg\max_{a \in A} q_i(a)$$

Page 19: 20090411


Algorithm

MetaMatch: adopt the F-measure to measure consistency.

For two matchings m1 and m2, using m1 as the testee and m2 as the tester,

$$F_{i,j} = \frac{2 P_{i,j} R_{i,j}}{P_{i,j} + R_{i,j}}, \qquad F(m_1, m_2) = \sum_{i \in m_1} \frac{n_i}{n} \max_{j \in m_2} \{F_{i,j}\}$$

Let the candidates generated during this process be C and the n matchings be R = {r1, …, rn}. The final matching is obtained as:

$$m^* = \arg\max_{m \in C} \sum_{r \in R} F(m, r)$$

InitMatch aims to guess an initial matching, to be the starting point of the iterative computation.
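
A consistency score of this kind can be sketched in Python. The sketch makes several assumptions that the slide does not spell out: a matching is a list of groups (sets of source-field identifiers); for a testee group i and a tester group j, precision is |i ∩ j| / |j| and recall is |i ∩ j| / |i|; each testee group is weighted by its share of the fields; and a candidate's scores against the reference matchings are summed. All function names are hypothetical.

```python
def group_f(gi, gj):
    """F-measure between one testee group gi and one tester group gj."""
    overlap = len(gi & gj)
    if overlap == 0:
        return 0.0
    precision = overlap / len(gj)
    recall = overlap / len(gi)
    return 2 * precision * recall / (precision + recall)


def matching_f(m1, m2):
    """Size-weighted best-match F-measure of testee m1 against tester m2."""
    n = sum(len(g) for g in m1)
    return sum(len(gi) / n * max(group_f(gi, gj) for gj in m2) for gi in m1)


def meta_match(candidates, references):
    """Pick the candidate that agrees most with the reference matchings."""
    return max(candidates, key=lambda m: sum(matching_f(m, r) for r in references))


m_a = [{"x11", "x21"}, {"x12", "x22"}]   # groups fields across sources
m_b = [{"x11", "x12"}, {"x21", "x22"}]   # groups fields within a source

print(matching_f(m_a, m_a))  # 1.0
print(matching_f(m_a, m_b))  # 0.5
best = meta_match([m_a, m_b], [m_a, m_a])
```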

Page 20: 20090411


Algorithm

HoliMatch’s algorithm

Page 21: 20090411


Experiments

Data set
Four domains
For each domain, collect 10 sources

Page 22: 20090411


Experiments

Comparison Methods
PairMatch: adopts a corpus-based approach
ClusMatch:
ChainMatch: e.g., 1-2-3-4
ProgMatch: e.g., becoming (((1-2)-3)-4)
InitMatch: an extension of using pairwise matching
HoliMatch

Performance
The matching accuracy is measured using the F-measure. Given the result matching m and the correct matching c, the F-measure is F(m, c), indicating how close m is to c.

Page 23: 20090411


Experiments

Matching on Correct Extraction Data: Matchers, Iterations

Page 24: 20090411


Experiments

Matching on Correct Extraction Data: Sources

Page 25: 20090411


Experiments

Matching on Correct Extraction Data: Pairwise

Page 26: 20090411


Experiments

Matching on Real Extraction Data

