Invited talk @ Cardiff University, 2008: Approximate entity reconciliation for on-the-fly...

Date post: 11-May-2015
Upload: paolo-missier
Approximate entity reconciliation for on-the-fly integration in data mashups Paolo Missier , Alvaro A. A. Fernandes School of Computer Science, University of Manchester Roald Lengu, Giovanna Guerrini DISI, Universita' di Genova, Italy Marco Mesiti DiCo, Universita' di Milano, Italy
Transcript
Page 1: Invited talk @ Cardiff University, 2008: Approximate entity reconciliation for on-the-fly integration in data mashups

Approximate entity reconciliation for on-the-fly integration in data mashups

Paolo Missier, Alvaro A. A. Fernandes
School of Computer Science, University of Manchester

Roald Lengu, Giovanna Guerrini
DISI, Universita' di Genova, Italy

Marco Mesiti
DiCo, Universita' di Milano, Italy

Outline

• New data integration scenarios:
– occasional integration with little prior knowledge about the sources
• Context: data mashups and personal dataspaces
• How to ensure that we are not missing any data in the process?
– how costly (i.e. response time) is it to guarantee completeness?
– can we trade completeness for response time?
• Technically speaking: convergence of
– record linkage (an old data quality favourite)
– approximate joins
– adaptive query processing

Early example

• sources 1..n: collection of car insurance DBs
• data changes frequently
• schemas can be analysed / integrated using traditional techniques
• source n+1: reference street atlas
• target app: mapping accident hotspots
• alert service to drivers, for example
• useful information for decision makers

(image from housingmaps.com)

Mashups

The IBM view, 2006

VLDB 2006 keynote by Anant Jhingran (CTO, Information Management, IBM Silicon Valley Laboratory, San Jose, CA):
Enterprise information mashups: integrating information, simply

Situational Applications
• Applications that come together for solving some immediate business problems
• constructed “on the fly” for some transient need, and possibly short-lasting
• Data never seen before, consumed on the spot
– would take too long for the IT department to provide them
– RSS feeds / data streams

IBM Mashup Center

• IBM Mashup Center
– mashup workflow
– leverages Lotus, DB2 plus LDAP, Web Services, ...

Yahoo Pipes

Is there actually a “join” in the set of operators?
Also Google Mashup Editor, and more...

Dataspaces

Integration in dataspaces

Assumptions

• sources 1..n: collection of car insurance DBs
• source n+1: reference street atlas
• target app: mapping accident hotspots

– no prior knowledge of data sets (streams) to be joined
– assumptions on implicit parent-child attribute relationships
– no guarantee of matching values

The broad context: record linkage

• Record values incomplete
– Name: John Smith | SSN: | Address: 477 Cedar Street
– Name: John Smith | SSN: 123-45-6789 | Address:
• Twins or typo?
– Brendan Hughes | Address: 564 Hickory Pl.
– Brenda Hughes | Address: 564 Hickory Pl.
• Conflict between forenames and phone number
– Name: Jean Smith | Phone #: (337) 555-6676
– Name: | Phone #: (337) 555 5676
• Same SSN, different names: ??
– Name: Alice Jones | SSN: 123-45-6789
– Name: Lois Avon | SSN: 123-45-6789

• Are two (slightly) different records two different surface representations of the same real-world entity?
• A difficult / uncertain decision process
– which attributes should I consider for matching?
– what are the different weights?
– context: relative frequency of values?
– external knowledge, user input

Results on record linkage

A mature field - ample literature
– 1969: I.P. Fellegi and A.B. Sunter, “A Theory for Record Linkage,” J. Am. Statistical Assoc., vol. 64, no. 328, pp. 1183-1210, Dec. 1969
– 2007: A. K. Elmagarmid, P. G. Ipeirotis, V. S. Verykios, “Duplicate Record Detection: A Survey”, IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, Jan 2007

Record Linkage: Similarity Measures and Algorithms
Nick Koudas (University of Toronto), Sunita Sarawagi (IIT Bombay), Divesh Srivastava (AT&T Labs-Research)
SIGMOD 2006 Data Quality tutorial

Application: Merging Lists
• Application: merge address lists (customer lists, company lists) to avoid redundancy
• Current status: “standardize”, different values treated as distinct for analysis
• Lot of heterogeneity
• Need approximate joins
• Relevant technologies
– Approximate joins
– Clustering/partitioning

Offline vs online linkage

• Offline linkage:
– performed once before queries involving joins
– reconcile R and S on joining attributes R.A, S.B using your favourite record linkage technique: $R \to R'$, $S \to S'$
– perform regular equijoin on the transformed tables: $R' \bowtie S'$
➡ ok for tables that can be analysed ahead of the join
➡ suitable when multiple queries issued on integrated tables
• Online linkage:
– performed just-in-time before a query
– exact join ⇒ approximate join

Integration with approximate joins

• Assume relational data: tables R, S
• Assume schema integration is understood
– we focus on data integration only
• Ultimately, data integration involves joining tables: $R \bowtie_{A=B} S$
• ordinary “exact” match misses out on the similar values
• compromises integration completeness

Example (figure): the value “Mcrosoft” in one table does not join with “Microsoft” in the other under an exact match.

Approximate joins

Historical timeline (figure)

from: N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. Tutorial in SIGMOD '06.

Edit distance / similarity functions

• Core sub-problem in approximate join:
– define / choose distance function between values in pairs of joining attributes

1. Similarity function between record pairs: $\langle r_1, r_2 \rangle \mapsto sim(r_1, r_2)$
2. Decision rules of the form:
$sim(r_1, r_2) < \theta_1 \Rightarrow$ not match
$\theta_1 \le sim(r_1, r_2) \le \theta_2 \Rightarrow$ unknown
$\theta_2 < sim(r_1, r_2) \Rightarrow$ match

A common choice of similarity function in the context of approximate joins is one based on string q-grams
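The three-way decision rule above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_pair` and the threshold names `theta_1`, `theta_2` are assumptions, and the similarity score is taken as already computed by whatever measure is in use.

```python
def classify_pair(similarity: float, theta_1: float, theta_2: float) -> str:
    """Three-way decision on a record pair, given its similarity score.

    theta_1 and theta_2 are hypothetical names for the lower and upper
    thresholds of the decision rule; scores are assumed to lie in [0, 1].
    """
    if not 0.0 <= theta_1 <= theta_2 <= 1.0:
        raise ValueError("expected 0 <= theta_1 <= theta_2 <= 1")
    if similarity < theta_1:
        return "not match"
    if similarity <= theta_2:
        return "unknown"  # left to a human or to a more expensive test
    return "match"
```

Pairs that fall in the "unknown" band are the ones that would typically need external knowledge or user input.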

Measuring string similarity using q-grams

• q-grams map a string s to the set q(s) of its substrings of length q

Ex.: 3-grams (first twelve shown):

q(“Microsoft Corporation”) = {‘Mic’, ‘icr’, ‘cro’, ‘ros’, ‘oso’, ‘sof’, ‘oft’, ‘ft ’, ‘t C’, ‘ Co’, ‘Cor’, ‘orp’, ...}

q(“Mcrosoft Corporation”) = {‘Mcr’, ‘cro’, ‘ros’, ‘oso’, ‘sof’, ‘oft’, ‘ft ’, ‘t C’, ‘ Co’, ‘Cor’, ‘orp’, ...}

$sim(s_1, s_2) = \frac{|q(s_1) \cap q(s_2)|}{|q(s_1) \cup q(s_2)|}$ (Jaccard coefficient)

This is a commonly used measure of string similarity
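As a minimal sketch of the measure above, the following Python computes unpadded q-gram sets and their Jaccard coefficient. The function names are assumptions made for illustration, and the shortened strings "Microsoft Corp" / "Mcrosoft Corp" are used in place of the slide's example to keep the sets small.

```python
def qgrams(s: str, q: int = 3) -> set[str]:
    """Set of all substrings of length q (no padding characters)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(s1: str, s2: str, q: int = 3) -> float:
    """Jaccard coefficient of the q-gram sets of s1 and s2."""
    g1, g2 = qgrams(s1, q), qgrams(s2, q)
    union = g1 | g2
    if not union:
        # Both strings are shorter than q: fall back to plain equality.
        return 1.0 if s1 == s2 else 0.0
    return len(g1 & g2) / len(union)


score = jaccard("Microsoft Corp", "Mcrosoft Corp")  # 10 shared 3-grams out of 13
```

A single dropped character changes at most q of the q-grams, which is why the score stays high for near-identical strings.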

Online linkage using q-grams

– approximate join is a θ-join: $R \bowtie_{\theta_{A,B}} S$
– where $\theta_{A,B}$ incorporates a similarity measure, e.g. Jaccard

• Naïve method: for each record pair, compute similarity score
– I/O and CPU intensive, not scalable
• Goal: reduce $O(n^2)$ cost to $O(n \cdot w)$, where w << n
– Reduce number of pairs on which similarity is computed
– Take advantage of efficient relational join methods
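One common way to avoid scoring every pair, shown here only as a sketch and not as the method used in the talk, is to index the q-grams of one input and score only pairs that share at least one q-gram. The function names, the 0.7 threshold and the sample values are assumptions.

```python
from collections import defaultdict


def qgrams(s: str, q: int = 3) -> set[str]:
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard_sets(g1: set[str], g2: set[str]) -> float:
    union = g1 | g2
    return len(g1 & g2) / len(union) if union else 0.0


def approximate_join(left: list[str], right: list[str],
                     threshold: float = 0.7, q: int = 3) -> list[tuple[str, str]]:
    """Join two lists of strings on q-gram Jaccard similarity >= threshold."""
    right_grams = [qgrams(v, q) for v in right]
    index = defaultdict(set)  # q-gram -> positions in `right` containing it
    for pos, grams in enumerate(right_grams):
        for g in grams:
            index[g].add(pos)

    result = []
    for value in left:
        grams = qgrams(value, q)
        # Candidate generation: only values sharing at least one q-gram.
        candidates = set()
        for g in grams:
            candidates |= index.get(g, set())
        # Verification: compute the similarity on the candidates only.
        for pos in sorted(candidates):
            if jaccard_sets(grams, right_grams[pos]) >= threshold:
                result.append((value, right[pos]))
    return result


pairs = approximate_join(["Mcrosoft Corp", "Oracle"],
                         ["Microsoft Corp", "Oracle", "IBM"])
```

Values with no q-gram in common are never compared, so the number of similarity computations depends on the candidate set size rather than on the full cross product.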

Efficient relational approximate joins

Idea: reduce approximate join to aggregated set intersection:

$dis(s_1, s_2) \le d$ only if $|q(s_1) \cap q(s_2)| \ge \max(|s_1|, |s_2|) - (d - 1) \cdot q - 1$

In practice:
• known similarity measures can be used to compare pairs of records
• cheap filters (length, count, position) to prune non-matches
• implementation using standard SQL
• cost-based join methods

Efficient relational representation: [CGK06] S. Chaudhuri, V. Ganti and R. Kaushik, “A primitive operator for similarity joins in data cleaning” (ICDE ’06)
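The count filter above can be illustrated with a short Python sketch. It assumes that `dis` is the edit (Levenshtein) distance and that q-grams are taken as bags over strings padded with q-1 special characters at each end, which is the usual setting for this bound; the padding characters `#` and `$` and all function names are assumptions, and the inputs must not contain those characters.

```python
from collections import Counter


def padded_qgram_bag(s: str, q: int = 3) -> Counter:
    """Bag (multiset) of q-grams of s, padded with q-1 markers on each side."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def passes_count_filter(s1: str, s2: str, d: int, q: int = 3) -> bool:
    """Necessary condition for edit distance <= d: enough shared q-grams."""
    shared = sum((padded_qgram_bag(s1, q) & padded_qgram_bag(s2, q)).values())
    return shared >= max(len(s1), len(s2)) - (d - 1) * q - 1


def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance by row-wise dynamic programming."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        cur = [i]
        for j, c2 in enumerate(s2, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (c1 != c2)))  # substitution or match
        prev = cur
    return prev[-1]


def within_distance(s1: str, s2: str, d: int, q: int = 3) -> bool:
    """Cheap filter first; the exact distance is computed only on survivors."""
    return passes_count_filter(s1, s2, d, q) and edit_distance(s1, s2) <= d
```

The filter can only discard pairs that cannot match, so pairs that pass it still need the exact distance check.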

Is full approximate join always necessary?

• Remaining sources of complexity:
– overhead for storing and indexing q-grams
– cost of computing set intersection
• Typical mismatch rate in real datasets is around 5%
• Complexity of a full-fledged approximate join is not fully justified

Research hypothesis: time-completeness trade-offs

Offer users the option to trade completeness of integration for the time required to complete the join

Adaptive query processing

Idea: implement a hybrid join algorithm that combines exact and approximate join

Intuition: leverage known results on Adaptive Query Processing
– developed in the context of query re-optimization
– switch physical join operators in mid-flight

[DIR07] A. Deshpande, Z. G. Ives, and V. Raman. Adaptive query processing. Foundations and Trends in Databases, 1(1):1–140, 2007

See also VLDB 2007 Tutorial at http://www.vldb2007.org/program/slides/s1426-deshpande.pdf

Autonomic computing framework

[KC03] J. O. Kephart and D. M. Chess. The vision of autonomic computing. IEEE Computer, 36(1):41–50, 2003.

Monitor-assess-respond loop (figure):
• monitor: incremental result size
• assess: estimate result size, compute divergence
• respond: switch join operators

Start with an exact join (optimistically). At step t during the execution:
• estimate the expected size of the join result Ōt at that point
• monitor the actual size Ot of the result
• when using exact join: if Ōt and Ot diverge “too much”, then switch to approximate join
• when using approximate join: if Ōt and Ot are very close, then switch to exact join

Technical approach and challenges

Need to add several new capabilities to a standard query processing infrastructure

• Assess:
– estimating result size at specific points during join execution
• Respond:
– switching between join operators at specific points during execution
• Adaptive Query Processing (AQP): operator replacement in pipelined query plans [EFP06]
– adding an approximate join operator to the query processor [CGK06]

[EFP06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the replacement of pipelined physical join operators in adaptive query processing. In EDBT Workshops 2006, LNCS 4254

[CGK06] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE 2006, p. 5.

Symmetric hash join

Well-known join operator
– basis for approximate join [CGK06]
– can be applied to streams of data
• such operators can read tuples from whichever input is available, and they incrementally produce output based on the tuples received so far
– a pipelined operator ← this is a key requirement for use in AQP

Figure: each of the inputs R and S has its own hash table (R hash table, S hash table); an incoming tuple goes through a “build” step on its own table and a “probe” step on the opposite table. Example output tuples: [R.m, S.s], [R.n, S.r]

When a tuple appears at either input, it is incrementally added to the corresponding hash table and probed against the opposite hash table.
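The build-then-probe behaviour described above can be sketched as a Python generator. This is a minimal in-memory illustration, not the operator used in the talk; the `("R", row)` / `("S", row)` stream encoding and the key-extractor parameters are assumptions.

```python
from collections import defaultdict
from typing import Callable, Hashable, Iterable, Iterator


def symmetric_hash_join(
    stream: Iterable[tuple[str, tuple]],
    r_key: Callable[[tuple], Hashable],
    s_key: Callable[[tuple], Hashable],
) -> Iterator[tuple[tuple, tuple]]:
    """Pipelined equijoin over tuples arriving from either input.

    `stream` yields ("R", row) or ("S", row) in arrival order; results are
    emitted as (r_row, s_row) as soon as both sides of a match have arrived.
    """
    tables = {"R": defaultdict(list), "S": defaultdict(list)}
    for side, row in stream:
        key = r_key(row) if side == "R" else s_key(row)
        tables[side][key].append(row)            # build: own hash table
        other = "S" if side == "R" else "R"
        for match in tables[other].get(key, []):  # probe: opposite table
            yield (row, match) if side == "R" else (match, row)


arrivals = [("R", ("x", 1)), ("S", ("x", "a")), ("S", ("y", "b")), ("R", ("y", 2))]
first_field = lambda row: row[0]
joined = list(symmetric_hash_join(arrivals, first_field, first_field))
```

Because neither input has to be read to completion before output starts, the operator keeps producing results while tuples are still arriving, which is the pipelining property the slide highlights.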

Estimating result size

• Exploit implicit parent-child key assumption:
– at the end of the join, we expect a result of size |S|

Figure: R (parent), S (child)

• When there are no mismatches, after scanning n < |S| tuples on S:
P(a = x in S has been matched) = P(tuple c = x is in top n of R) = n/|R|

Thus, join result size $O_n$ is a binomial random variable:

$O_n \sim bin\left(n, \frac{n}{|R|}\right)$
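To make the binomial model above concrete, the sketch below computes the match probability n/|R|, the expected result size and the probability of observing exactly k results. The function names are assumptions, and the model is only the no-mismatch case described on the slide.

```python
from math import comb


def match_probability(n: int, parent_size: int) -> float:
    """p(n) = n / |R|: chance that a scanned child tuple already has its parent."""
    if not 0 < n <= parent_size:
        raise ValueError("expected 0 < n <= |R|")
    return n / parent_size


def expected_result_size(n: int, parent_size: int) -> float:
    """Mean of bin(n, n/|R|), i.e. n * p(n)."""
    return n * match_probability(n, parent_size)


def result_size_pmf(k: int, n: int, parent_size: int) -> float:
    """P(O_n = k) under the binomial model."""
    p = match_probability(n, parent_size)
    return comb(n, k) * p**k * (1 - p) ** (n - k)
```

For example, with |R| = 100 and n = 50 tuples scanned, the expected result size is 50 * 0.5 = 25.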

Detecting divergent observed result size

An observation O is an outlier with respect to the expected result size $O_n$ after n tuples have been scanned, if:

$P_{n,p(n)}(O_n \le O) \le \theta_{out}$

where $P_{n,p(n)}(\cdot)$ is the cumulative distribution function for a binomial with parameters n, p(n)
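The outlier test above amounts to a lower-tail check on a binomial distribution. A minimal sketch follows, assuming p(n) = n/|R| as in the result size estimate and an illustrative threshold of 0.05; both the names and the default value are assumptions.

```python
from math import comb


def binomial_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ bin(n, p), computed by direct summation."""
    k = min(k, n)
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def is_outlier(observed: int, n: int, parent_size: int,
               theta_out: float = 0.05) -> bool:
    """True if an observed result size this small is unlikely under the model."""
    p = n / parent_size
    return binomial_cdf(observed, n, p) <= theta_out
```

With |R| = 100 and n = 50, observing only 5 results is flagged as an outlier, while observing 25 (the expected value) is not.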

Instantiating the MAR framework

Monitor-assess-respond loop (figure):
• monitor: incremental result size
• assess: estimate result size $O_n$, compute divergence predicates
• respond: switch join operators

Divergence predicates σ(t), µ(t), π(t):
• Discrepancy detected: $\sigma(n) \equiv P_{n,p(n)}(O_n \le O) \le \theta_{out}$
• Current perturbations on left/right? $\mu_i(t) \equiv \frac{A_{t,W}}{W} \ge \theta_{curpert}$
• Past perturbations on left/right? $\pi_i(t) \equiv \sum_{t' < t} I(\mu_i(t')) \ge \theta_{pastpert}$

Responder’s state machine

• Operator switch defined in terms of state transitions
• Owing to symmetry, we can use a different operator on each of the two tables

States:
• left: exact, right: exact
• left: approximate, right: approximate
• left: approximate, right: exact
• left: exact, right: approximate

Rationale for state transitions

States (figure): lex/rex, lap/rap, lex/rap, lap/rex
• transitions towards approximate: evidence that left and/or right input is perturbed
• transitions towards exact: evidence that left and/or right input is no longer perturbed

Predicates σ(t), µ(t), π(t) provide the evidence needed to drive the transitions
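The four-state responder can be sketched as a small state machine in Python. The transition rule used here (a side switches to approximate when evidence says its input is perturbed, back to exact when it is no longer perturbed, and stays put otherwise) is a simplified assumption for illustration, not the talk's actual transition conditions; all names are hypothetical.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    EXACT = "exact"
    APPROXIMATE = "approximate"


State = tuple[Mode, Mode]  # (left operator, right operator)


def switch(mode: Mode, perturbed: Optional[bool]) -> Mode:
    """New mode for one side; None means no evidence, so keep the current mode."""
    if perturbed is None:
        return mode
    return Mode.APPROXIMATE if perturbed else Mode.EXACT


def next_state(state: State,
               left_perturbed: Optional[bool] = None,
               right_perturbed: Optional[bool] = None) -> State:
    """Apply the (assumed) per-side transition rule to both sides."""
    left, right = state
    return switch(left, left_perturbed), switch(right, right_perturbed)


state = (Mode.EXACT, Mode.EXACT)                  # lex / rex
state = next_state(state, left_perturbed=True)    # lap / rex
```

Treating the two sides independently is what yields the two mixed states, in line with the symmetry remark on the responder slide.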

Assessment → state transitions

Assessment predicates:
$\sigma(n) \equiv P_{n,p(n)}(O_n \le O) \le \theta_{out}$
$\mu_i(t) \equiv \frac{A_{t,W}}{W} \ge \theta_{curpert}$
$\pi_i(t) \equiv \sum_{t' < t} I(\mu_i(t')) \ge \theta_{pastpert}$

Transition conditions:
$\phi_0(t) = \neg\sigma(t) \wedge \mu_{left}(t) \wedge \mu_{right}(t)$
$\phi_1(t) = \sigma(t) \wedge \mu_{left}(t) \wedge \mu_{right}(t)$
$\phi_2(t) = \sigma(t) \wedge \neg\mu_{left}(t) \wedge \mu_{right}(t) \wedge \pi_{left}(t)$

Completing the loop

Monitor-assess-respond loop (figure): monitor (incremental result size), assess (estimate result size, compute divergence), respond (switch join operators), annotated with the discrepancy predicate, the current- and past-perturbation predicates, the three state-transition conditions from the preceding slides, and an adaptation parameter (subscript “adapt”). ✔✔

Note on operator replacement

• Details on how to switch operators on the fly are omitted
– main point: pipelined operators expose specific quiescent states where replacement can take place with no loss of work [EFP06]

[EFP06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the replacement of pipelined physical join operators in adaptive query processing. In EDBT Workshops 2006, LNCS 4254

Experimental evaluation

Trade-off analysis
• Benefits:
– achieved level of result completeness
– baseline: approximate join throughout
• model marginal gain of hybrid algorithm
• Cost:
– baseline: exact join throughout
• model marginal cost of hybrid algorithm

Test datasets

Datasets chosen as representative of 4 distinct patterns

We expect our results to vary:
• uniform perturbation: evidence grows slowly => slow reaction
• bursty perturbation: strong evidence => timely reaction

Parameter tuning and gain/cost models

• Each of the MAR parameters tuned empirically
• Experiments executed using the best possible configuration
• Nice result: parameter setting is quite independent of the specific variant pattern

Relative gain $g_{rel}$:
• R: result size for approximate join only
• r: result size for exact join only
• $r_{abs}$: result size actually observed

$g_{rel} = \frac{r_{abs} - r}{R - r}$

(details on cost model omitted)
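As a worked instance of the relative gain defined above, the sketch below uses made-up result sizes purely for illustration; the function and parameter names are assumptions.

```python
def relative_gain(r_abs: float, r_exact: float, r_approx: float) -> float:
    """g_rel = (r_abs - r) / (R - r): share of the recoverable results actually recovered."""
    if r_approx == r_exact:
        raise ValueError("exact and approximate joins return the same size; gain is undefined")
    return (r_abs - r_exact) / (r_approx - r_exact)


# Hypothetical numbers: exact join finds 900 results, approximate join 1000,
# and the hybrid run 950, so half of the missing results were recovered.
gain = relative_gain(r_abs=950, r_exact=900, r_approx=1000)
```

A gain of 0 means the hybrid did no better than the exact join, and a gain of 1 means it matched the approximate join.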

Cost model

• unit cost of executing one step in state i: $w_i$
– weights determined experimentally
• number of steps in each state: $t_i$
• unit state transition cost (experimental): $v_i$
• number of state transitions: $tr_i$

total absolute cost: $c_{abs} = \sum_i sc_i + \sum_i tc_i$ (presumably with $sc_i = w_i \cdot t_i$ and $tc_i = v_i \cdot tr_i$)

relative cost:
• c: best cost (exact only)
• C: worst cost (approx only)

$c_{rel} = c_{abs} / (C - c)$
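The cost model above can be sketched as follows. The slide does not spell out the per-state terms, so this sketch assumes that the step cost of state i is w_i * t_i and its transition cost is v_i * tr_i; the state names, weights and counts below are invented for illustration.

```python
def absolute_cost(steps: dict[str, int], step_weights: dict[str, float],
                  transitions: dict[str, int],
                  transition_weights: dict[str, float]) -> float:
    """c_abs = sum_i w_i * t_i + sum_i v_i * tr_i (assumed reading of sc_i, tc_i)."""
    state_cost = sum(step_weights[s] * n for s, n in steps.items())
    transition_cost = sum(transition_weights[s] * n for s, n in transitions.items())
    return state_cost + transition_cost


def relative_cost(c_abs: float, c_exact: float, c_approx: float) -> float:
    """c_rel = c_abs / (C - c), with c the exact-only and C the approx-only cost."""
    return c_abs / (c_approx - c_exact)


# Hypothetical run: 80 steps in the exact state, 20 in the approximate state,
# one transition out of each state.
c_abs = absolute_cost(
    steps={"exact": 80, "approx": 20},
    step_weights={"exact": 1.0, "approx": 4.0},
    transitions={"exact": 1, "approx": 1},
    transition_weights={"exact": 5.0, "approx": 5.0},
)
c_rel = relative_cost(c_abs, c_exact=100.0, c_approx=400.0)
```

Keeping the weights as explicit inputs mirrors the slide's remark that they are determined experimentally.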

Results

Discussion

• Results similar across different variant patterns
– good!
• Transition cost is not overwhelming:
– we never pay more for hybrid than for approx
– this gives us a good space for trade-offs
– we could let users tune the algorithm without fear of “breaking” it

Conclusions

• An exact / approximate hybrid approach to joins with violations of implicit referential integrity across tables
– relational setting
• Approach based on autonomic computing principles
– adaptive query processing techniques
• Application: on-the-fly integration scenarios (mashups, personal dataspaces)
• Results: cost / completeness trade-off analysis
– initial encouraging experimental conclusions

The study requires additional testing on real datasets

References used in the presentation

• A. Halevy and D. Maier, Dataspaces: the Tutorial, VLDB 2008 tutorial, Auckland, NZ, Aug 2008
• N. Koudas, S. Sarawagi, D. Srivastava, Record Linkage: Similarity Measures and Algorithms, VLDB 2006 tutorial, Seoul, Korea, 2006
• [FS69] I.P. Fellegi and A.B. Sunter, A Theory for Record Linkage, J. Am. Statistical Assoc., vol. 64, no. 328, pp. 1183-1210, Dec. 1969
• [EIV07] A. K. Elmagarmid, P. G. Ipeirotis, V. S. Verykios, Duplicate Record Detection: A Survey, IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, Jan 2007
• [KC03] J. O. Kephart and D. M. Chess. The vision of autonomic computing. IEEE Computer, 36(1):41–50, 2003.
• [EFP06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the replacement of pipelined physical join operators in adaptive query processing. In EDBT Workshops 2006, LNCS 4254
• [CGK06] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE 2006, p. 5