Date post: | 11-May-2015 |
Category: |
Technology |
Upload: | paolo-missier |
Approximate entity reconciliation for on-the-fly integration in data mashups
Paolo Missier, Alvaro A. A. Fernandes
School of Computer Science, University of Manchester
Roald Lengu, Giovanna Guerrini
DISI, Universita' di Genova, Italy
Marco Mesiti
DiCo, Universita' di Milano, Italy
Outline
• New data integration scenarios:
  – occasional integration with little prior knowledge about the sources
• Context: data mashups and personal dataspaces
• How to ensure that we are not missing any data in the process?
  – how costly (i.e. response time) is it to guarantee completeness?
  – can we trade completeness for response time?
• Technically speaking: convergence of
  – record linkage (an old data quality favourite)
  – approximate joins
  – adaptive query processing
Early example
• sources 1..n: collection of car insurance DBs
• data changes frequently
• schemas can be analysed / integrated using traditional techniques
• source n+1: reference street atlas
• target app: mapping accident hotspots
• alert service to drivers, for example
• useful information for decision makers
(image from housingmaps.com)
The IBM view, 2006
VLDB 2006 Keynote by Anant Jhingran (CTO, Information Management, IBM Silicon Valley Laboratory, San Jose, CA):
Enterprise information mashups: integrating information, simply
Situational Applications
• Applications that come together for solving some immediate business problems
• constructed “on the fly” for some transient need and possibly short-lasting
• Data never seen before, consumed on the spot
  – would take too long for the IT department to provide them
  – RSS feeds / data streams
Mashups
IBM Mashup Center
– mashup workflow
– leverages Lotus, DB2 plus LDAP, Web Services, ...
Yahoo Pipes
Is there actually a “join” in the set of operators?
Also Google Mashup Editor, and more...
Dataspaces
Integration in dataspaces
Assumptions
• sources 1..n: collection of car insurance DBs
• source n+1: reference street atlas
• target app: mapping accident hotspots
– no prior knowledge of data sets (streams) to be joined
– assumptions on implicit parent-child attribute relationships
– no guarantee of matching values
The broad context: record linkage
• Record values incomplete:
    Name: John Smith; SSN: ; Address: 477 Cedar Street
    Name: John Smith; SSN: 123-45-6789; Address:
• Twins or typo?
    Brendan Hughes; Address: 564 Hickory Pl.
    Brenda Hughes; Address: 564 Hickory Pl.
• Conflict between forenames and phone number:
    Name: Jean Smith; Phone #: (337) 555-6676
    Name: ; Phone #: (337) 555 5676
• Same SSN, different names: ??
    Name: Alice Jones; SSN: 123-45-6789
    Name: Lois Avon; SSN: 123-45-6789
• Are two (slightly) different records two different surface representations of the same real-world entity?
• A difficult / uncertain decision process
  – which attributes should I consider for matching
  – what are the different weights
  – context: relative frequency of values?
  – external knowledge, user input
Results on record linkage
A mature field - ample literature
– 1969: I.P. Fellegi and A.B. Sunter, “A Theory for Record Linkage,” J. Am. Statistical Assoc., vol. 64, no. 328, pp. 1183-1210, Dec. 1969
– 2007: A. K. Elmagarmid, P. G. Ipeirotis, V. S. Verykios, “Duplicate Record Detection: A Survey”, IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, Jan 2007
Record Linkage: Similarity Measures and Algorithms
Nick Koudas (University of Toronto), Sunita Sarawagi (IIT Bombay), Divesh Srivastava (AT&T Labs-Research)
SIGMOD 2006 Data Quality tutorial (7/3/06)
Application: Merging Lists
• Application: merge address lists (customer lists, company lists) to avoid redundancy
• Current status: “standardize”, different values treated as distinct for analysis
  – Lot of heterogeneity
  – Need approximate joins
• Relevant technologies
  – Approximate joins
  – Clustering/partitioning
Offline vs online linkage
• Offline linkage:
  – performed once before queries involving joins
  – reconcile R and S on joining attributes R.A, S.B using your favourite record linkage technique: $R \to R'$, $S \to S'$
  – perform regular equijoin on the transformed tables: $R' \bowtie S'$
  ➡ ok for tables that can be analysed ahead of the join
  ➡ suitable when multiple queries issued on integrated tables
• Online linkage:
  – performed just-in-time before a query
  – exact join ⇒ approximate join
Integration with approximate joins
• Assume relational data: tables R, S
• Assume schema integration is understood
  – we focus on data integration only
• Ultimately, data integration involves joining tables: $R \bowtie_{A=B} S$
• ordinary “exact” match misses out on the similar values
• compromises integration completeness
(figure: example tables in which the join value “Mcrosoft” in one table fails to match “Microsoft” in the other under an exact join)
Approximate joins
Historical timeline:
(figure: timeline of approximate join techniques)
from: N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. Tutorial in SIGMOD '06.
Edit distance / similarity functions
• Core sub-problem in approximate join:
  – define / choose distance function between values in pairs of joining attributes
1. Similarity function between record pairs: $\langle r_1, r_2 \rangle \mapsto sim(r_1, r_2)$
2. Decision rules of the form
$$sim(r_1, r_2) < \theta_1 \Rightarrow \text{not match}$$
$$\theta_1 \le sim(r_1, r_2) \le \theta_2 \Rightarrow \text{unknown}$$
$$\theta_2 < sim(r_1, r_2) \Rightarrow \text{match}$$
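The three-way decision rule can be sketched in a few lines of Python. This is only an illustration: the names `Decision`, `decide`, `theta1` and `theta2` are assumptions introduced here, and the threshold values in the usage note are arbitrary.

```python
from enum import Enum


class Decision(Enum):
    NOT_MATCH = "not match"
    UNKNOWN = "unknown"
    MATCH = "match"


def decide(similarity: float, theta1: float, theta2: float) -> Decision:
    """Classify a record pair from its similarity score and two thresholds."""
    if theta1 > theta2:
        raise ValueError("theta1 must not exceed theta2")
    if similarity < theta1:
        return Decision.NOT_MATCH   # clearly dissimilar
    if similarity <= theta2:
        return Decision.UNKNOWN     # grey zone: needs further evidence or review
    return Decision.MATCH           # clearly similar
```

For instance, with hypothetical thresholds 0.4 and 0.8, a pair scoring 0.9 is classified as a match and a pair scoring 0.5 falls in the "unknown" band.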
A common choice of similarity function in the context of approximate joins is one based on string q-grams
Measuring string similarity using q-grams
• q-grams map string s to a set q(s) of substrings of length q
Ex.: 3-grams:
q(“Microsoft Corporation”) = {‘Mic’, ‘icr’, ‘cro’, ‘ros’, ‘oso’, ‘sof’, ‘oft’, ‘ft ’, ‘t C’, ‘ Co’, ‘Cor’, ‘orp’ }.
q(“Mcrosoft Corporation”) = {‘Mcr’, ‘cro’, ‘ros’, ‘oso’, ‘sof’, ‘oft’, ‘ft ’, ‘t C’, ‘ Co’, ‘Cor’, ‘orp’, ‘rp#’ }.
$$sim(s_1, s_2) = \frac{|q(s_1) \cap q(s_2)|}{|q(s_1) \cup q(s_2)|}$$
(Jaccard coefficient)
This is a commonly used measure of string similarity
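A minimal Python sketch of the q-gram Jaccard similarity follows. The function names `qgrams` and `jaccard` are assumptions made for illustration, and the sketch uses plain (unpadded) q-grams, which may differ from the padding convention used on the slide.

```python
from __future__ import annotations


def qgrams(s: str, q: int = 3) -> set[str]:
    """Return the set of all substrings of s of length q (no padding)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(s1: str, s2: str, q: int = 3) -> float:
    """Jaccard coefficient of the q-gram sets of s1 and s2."""
    g1, g2 = qgrams(s1, q), qgrams(s2, q)
    union = g1 | g2
    if not union:
        # Both strings are shorter than q: fall back to exact comparison.
        return 1.0 if s1 == s2 else 0.0
    return len(g1 & g2) / len(union)
```

With q = 3, "Microsoft" and "Mcrosoft" share 5 of the 8 distinct 3-grams they contain between them, so `jaccard("Microsoft", "Mcrosoft")` evaluates to 0.625.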
Online linkage using q-grams
– approximate join is a θ join: $R \bowtie_{\theta_{A,B}} S$
– where $\theta_{A,B}$ incorporates a similarity measure, e.g. Jaccard
• Naïve method: for each record pair, compute similarity score
  – I/O and CPU intensive, not scalable
• Goal: reduce $O(n^2)$ cost to $O(n \cdot w)$, where $w \ll n$
  – Reduce number of pairs on which similarity is computed
  – Take advantage of efficient relational join methods
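One way to reduce the number of pairs compared is sketched below: an in-memory inverted index from q-grams to values, so that similarity is computed only for pairs sharing at least one q-gram. This is an illustrative stand-in, not the relational method of [CGK06]; the names `approximate_join` and `threshold` are assumptions, and the threshold is assumed to be strictly positive (pairs with no q-gram in common have Jaccard score 0 and are never candidates).

```python
from collections import defaultdict


def qgrams(s, q=3):
    """Set of all substrings of s of length q (no padding)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def approximate_join(r_values, s_values, threshold=0.5, q=3):
    """Return (r, s, score) for pairs whose q-gram Jaccard score >= threshold."""
    # Inverted index: q-gram -> positions of the S values that contain it.
    s_grams = [qgrams(s, q) for s in s_values]
    index = defaultdict(set)
    for j, grams in enumerate(s_grams):
        for g in grams:
            index[g].add(j)

    results = []
    for r in r_values:
        r_grams = qgrams(r, q)
        # Candidates: only S values sharing at least one q-gram with r.
        candidates = set()
        for g in r_grams:
            candidates |= index.get(g, set())
        for j in sorted(candidates):
            score = len(r_grams & s_grams[j]) / len(r_grams | s_grams[j])
            if score >= threshold:
                results.append((r, s_values[j], score))
    return results
```

Joining `["Mcrosoft", "Oracle"]` with `["Microsoft", "IBM", "Oracle"]` pairs "Mcrosoft" with "Microsoft" and "Oracle" with "Oracle", without ever scoring either value against "IBM".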
Efficient relational approximate joins
Idea: reduce approximate join to aggregated set intersection:
$$dis(s_1, s_2) \le d \ \text{ if } \ |q(s_1) \cap q(s_2)| \ge \max(|s_1|, |s_2|) - (d - 1) \cdot q - 1$$
In practice:
• known similarity measures can be used to compare pairs of records
• cheap filters (length, count, position) to prune non-matches
• implementation using standard SQL
• cost-based join methods
Efficient relational representation:
[CGK06] S. Chaudhuri, V. Ganti and R. Kaushik, “A primitive operator for similarity joins in data cleaning” (ICDE’06)
Is full approximate join always necessary?
• Remaining source of complexity:
  – overhead for storing and indexing q-grams
  – cost of computing set intersection
• Typical mismatch rate in real datasets around 5%
• Complexity of full-fledged approximate join not fully justified
Research hypothesis: time-completeness trade-offs
Offer users the option to trade completeness of integration with the time required to complete the join
Adaptive query processing
Idea: implement a hybrid join algorithm that combines exact and approximate join
Intuition: leverage known results on Adaptive Query Processing
– developed in the context of query re-optimization
– switch physical join operators in mid-flight
[DIR07] A. Deshpande, Z. G. Ives, and V. Raman. Adaptive query processing. Foundations and Trends in Databases, 1(1):1–140, 2007
See also VLDB 2007 Tutorial at http://www.vldb2007.org/program/slides/s1426-deshpande.pdf
Autonomic computing framework
[KC03] J. O. Kephart and D. M. Chess. The vision of autonomic computing. IEEE Computer, 36(1):41–50, 2003.
(figure: the monitor / assess / respond loop)
(figure labels: incremental result size; estimate result size; compute divergence; switch join operators)
Start with an exact join (optimistically). At step t during the execution:
• estimate the expected size of the join result $\bar{O}_t$ at that point
• monitor the actual size $O_t$ of the result
• when using exact join: if $\bar{O}_t$ and $O_t$ diverge “too much”, then switch to approximate join
• when using approximate join: if $\bar{O}_t$ and $O_t$ are very close, then switch to exact join
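The switching policy described above can be sketched as a small Python function. This is a simplified illustration only: the function name `next_operator` and the two tolerances `diverge_tol` and `converge_tol` are hypothetical, and the slides use statistical predicates rather than fixed relative tolerances to decide when the sizes diverge "too much".

```python
def next_operator(current: str, expected: float, observed: int,
                  diverge_tol: float = 0.2, converge_tol: float = 0.05) -> str:
    """Return 'exact' or 'approximate' as the join operator for the next step."""
    if expected <= 0:
        return current  # nothing to compare against yet
    # Relative shortfall of the observed result size w.r.t. the expected one.
    shortfall = (expected - observed) / expected
    if current == "exact" and shortfall > diverge_tol:
        return "approximate"   # result is falling behind: matches are being missed
    if current == "approximate" and abs(shortfall) <= converge_tol:
        return "exact"         # sizes agree again: the cheaper operator suffices
    return current
```

Starting from "exact" with an expected size of 100, an observed size of 70 triggers a switch to "approximate", while an observed size of 95 does not.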
Technical approach and challenges
Need to add several new capabilities to a standard query processing infrastructure
• Assess:
  – estimating result size at specific points during join execution
• Respond:
  – switching between join operators at specific points during execution
• Adaptive Query Processing (AQP): operator replacement in pipelined query plans [EFP06]
  – adding an approximate join operator to the query processor [CGK06]
[EFP06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the replacement of pipelined physical join operators in adaptive query processing. In EDBT Workshops 2006, LNCS 4254
[CGK06] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE 2006, p. 5.
Symmetric hash join
Well-known join operator
– basis for approximate join [CGK06]
– can be applied to streams of data
– a pipelined operator ← this is a key requirement for use in AQP
  • pipelined operators can read tuples from whichever input is available, and they incrementally produce output based on the tuples received so far
(figure: inputs R and S; each arriving tuple is built into its own input's hash table (R hash table, S hash table) and probed against the opposite one; example outputs [R.m, S.s] and [R.n, S.r])
When a tuple appears at either input, it is incrementally added to the corresponding hash table and probed against the opposite hash table.
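The build-then-probe behaviour can be sketched in Python as follows. This is a minimal in-memory illustration of the exact (equality) variant; the class name `SymmetricHashJoin`, the `insert` method and the key-extractor arguments are assumptions introduced here, not part of any query processor's API.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Pipelined equijoin over two inputs, 'left' and 'right'."""

    def __init__(self, left_key, right_key):
        # Functions extracting the join key from a tuple of each input.
        self.key = {"left": left_key, "right": right_key}
        # One hash table per input: join key -> tuples seen so far.
        self.tables = {"left": defaultdict(list), "right": defaultdict(list)}

    def insert(self, side, tuple_):
        """Add one tuple arriving on `side`; return the join results it produces."""
        other = "right" if side == "left" else "left"
        k = self.key[side](tuple_)
        # Build: store the tuple in its own input's hash table.
        self.tables[side][k].append(tuple_)
        # Probe: find tuples with the same key that already arrived on the other input.
        matches = self.tables[other].get(k, [])
        if side == "left":
            return [(tuple_, m) for m in matches]
        return [(m, tuple_) for m in matches]
```

Because every tuple is both stored and probed on arrival, each matching pair is emitted exactly once, by whichever of its two tuples arrives second, regardless of the order in which the inputs deliver their tuples.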
Estimating result size
• Exploit implicit parent-child key assumption:
  – at the end of join, we expect a result of size |S|
(figure: tables R (parent) and S (child), with child tuples referencing key values of the parent)
• When there are no mismatches, after scanning n < |S| tuples on S:
  P(tuple with a = x in S has been matched) = P(tuple with c = x is in top n of R) = n / |R|
Thus, join result size $O_n$ is a binomial random variable:
$$O_n \sim bin\left(n, \frac{n}{|R|}\right)$$
Detecting divergent observed result size
An observation O is an outlier wrt expected result size $O_n$ after n tuples have been scanned, if:
$$P_{n,p(n)}(O_n \le O) \le \theta_{out}$$
where $P_{n,p(n)}(\cdot)$ is the cumulative distribution function for a binomial with parameters n, p(n)
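A small Python sketch of this outlier test is given below, assuming p(n) = n / |R| as in the binomial model for the result size. The function names and the default threshold value 0.05 are hypothetical. The CDF is computed by direct summation, which is adequate only for small n; for large inputs a library routine such as `scipy.stats.binom.cdf` would be the safer choice.

```python
from math import comb


def binomial_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), by direct summation."""
    if k < 0:
        return 0.0
    k = min(k, n)
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def is_outlier(observed: int, n: int, r_size: int, theta_out: float = 0.05) -> bool:
    """True if the observed result size is improbably small after n scanned tuples."""
    p = min(1.0, n / r_size)  # assumed match probability p(n) = n / |R|
    return binomial_cdf(observed, n, p) <= theta_out
```

With n = 50 tuples scanned and |R| = 100, the expected result size is 25: an observed size of 25 is not flagged, whereas an observed size of 10 is.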
Instantiating the MAR framework
(figure: the monitor / assess / respond loop with activities: incremental result size $O_n$; estimate result size; compute divergence predicates; switch join operators)
Divergence predicates σ(t), µ(t), π(t):
• Discrepancy detected: $\sigma(n) \equiv P_{n,p(n)}(O_n \le O) \le \theta_{out}$
• Current perturbations on left/right? $\mu_i(t) \equiv \frac{A_{t,W}}{W} \ge \theta_{curpert}$
• Past perturbations on left/right? $\pi_i(t) \equiv \sum_{t' < t} I(\mu_i(t')) \ge \theta_{pastpert}$
Responder’s state machine
• Operator switch defined in terms of state transitions
• Owing to symmetry, we can use a different operator on each of the two tables
States:
– left: exact, right: exact
– left: approximate, right: approximate
– left: approximate, right: exact
– left: exact, right: approximate
Rationale for state transitions
(figure: state diagram over the states lex/rex, lap/rap, lex/rap, lap/rex)
• transition label: evidence that left and/or right input perturbed
• transition label: evidence that left and/or right input no longer perturbed
Predicates σ(t), µ(t), π(t) provide the evidence needed to drive the transitions
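Assessment → state transitions
$$\sigma(n) \equiv P_{n,p(n)}(O_n \le O) \le \theta_{out}$$
$$\mu_i(t) \equiv \frac{A_{t,W}}{W} \ge \theta_{curpert}$$
$$\pi_i(t) \equiv \sum_{t' < t} I(\mu_i(t')) \ge \theta_{pastpert}$$
$$\varphi_0(t) = \neg\sigma(t) \wedge \mu_{left}(t) \wedge \mu_{right}(t)$$
$$\varphi_1(t) = \sigma(t) \wedge \mu_{left}(t) \wedge \mu_{right}(t)$$
$$\varphi_2(t) = \sigma(t) \wedge \neg\mu_{left}(t) \wedge \mu_{right}(t) \wedge \pi_{left}(t)$$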
Completing the loop
(figure: the complete monitor / assess / respond loop: incremental result size $O_n$; estimate result size; compute divergence, using the predicates σ, µ, π; switch join operators, using the state-transition conditions and a parameter labelled “adapt”)
Note on operator replacement
• Details on how to switch operators on the fly are omitted
  – main point: pipelined operators expose specific quiescent states where replacement can take place with no loss of work [EFP06]
[EFP06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the replacement of pipelined physical join operators in adaptive query processing. In EDBT Workshops 2006, LNCS 4254
Experimental evaluation
Trade-off analysis
• Benefits:
  – achieved level of result completeness
  – baseline: approximate join throughout
    • model marginal gain of hybrid algorithm
• Cost:
  – baseline: exact join throughout
    • model marginal cost of hybrid algorithm
Test datasets
Datasets chosen as representative of 4 distinct patterns
We expect our results to vary:
• uniform perturbation: evidence grows slowly => slow reaction
• bursty perturbation: strong evidence => timely reaction
Parameters tuning and gain/cost models
• Each of the MAR parameters tuned empirically
• Experiments executed using the best possible configuration
• Nice result: parameter setting is quite independent from the specific variant pattern
Relative gain $g_{rel}$:
• R: result size for approx join only
• r: result size for exact only
• $r_{abs}$: result size actually observed
$$g_{rel} = \frac{r_{abs} - r}{R - r}$$
(details on cost model omitted)
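As a worked illustration of the relative gain, the following Python sketch computes g_rel from the three result sizes. The function and argument names are assumptions, and the numbers in the usage note are made up for the example, not taken from the experiments.

```python
def relative_gain(r_abs: int, r_exact: int, r_approx: int) -> float:
    """Fraction of the extra results of an all-approximate join that the hybrid recovers."""
    if r_approx == r_exact:
        # No mismatches at all: exact and approximate joins coincide.
        raise ValueError("approximate and exact result sizes are equal; gain is undefined")
    return (r_abs - r_exact) / (r_approx - r_exact)
```

For example, if the exact join returns 900 tuples, the approximate join 1000, and the hybrid 950, then the relative gain is (950 - 900) / (1000 - 900) = 0.5.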
Cost model
• unit cost of executing one step in state i: $w_i$
  – weights determined experimentally
• number of steps in each state: $t_i$
• unit state transition cost (experimental): $v_i$
• number of state transitions: $tr_i$
Total absolute cost:
$$c_{abs} = \sum_i sc_i + \sum_i tc_i$$
Relative cost:
• c: best cost (exact only)
• C: worst cost (approx only)
$$c_{rel} = \frac{c_{abs}}{C - c}$$
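The cost model can be sketched in Python under one explicit assumption that the slide does not spell out: that the per-state cost is sc_i = w_i * t_i and the per-state transition cost is tc_i = v_i * tr_i. The function names and the figures in the usage note are likewise hypothetical.

```python
def absolute_cost(step_weights, steps, transition_costs, transitions):
    """c_abs = sum_i sc_i + sum_i tc_i, assuming sc_i = w_i * t_i and tc_i = v_i * tr_i."""
    state_cost = sum(w * t for w, t in zip(step_weights, steps))
    transition_cost = sum(v * tr for v, tr in zip(transition_costs, transitions))
    return state_cost + transition_cost


def relative_cost(c_abs, c_exact, c_approx):
    """c_rel = c_abs / (C - c), with c the exact-only cost and C the approx-only cost."""
    return c_abs / (c_approx - c_exact)
```

With two states of unit step costs 1.0 and 3.0, executed for 100 and 50 steps, and two transitions into each state at unit cost 10.0, the absolute cost is 100 + 150 + 20 + 20 = 290.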
Results
Discussion
• Results similar across different variant patterns
  – good!
• Transition cost is not overwhelming:
  – we never pay more for hybrid than for approx
  – this gives us a good space for trade-offs
  – we could let users tune the algorithm without fear of “breaking” it
Conclusions
• An exact / approximate hybrid approach to joins in the presence of violations of implicit referential integrity across tables
  – relational setting
• Approach based on autonomic computing principles
  – Adaptive query processing techniques
• Application: on-the-fly integration scenarios (mashups, personal dataspaces)
• Results: cost / completeness trade-off analysis
  – initial encouraging experimental conclusions
Study requires additional testing on real datasets
References used in the presentation
• A. Halevy and D. Maier, Dataspaces: the Tutorial, VLDB 2008 tutorial, Auckland, NZ, Aug 2008
• N. Koudas, S. Sarawagi, D. Srivastava, Record Linkage: Similarity Measures and Algorithms, VLDB 2006 tutorial, Seoul, Korea, 2006
• [FS69] I.P. Fellegi and A.B. Sunter, A Theory for Record Linkage, J. Am. Statistical Assoc., vol. 64, no. 328, pp. 1183-1210, Dec. 1969
• [EIV07] A. K. Elmagarmid, P. G. Ipeirotis, V. S. Verykios, Duplicate Record Detection: A Survey, IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, Jan 2007
• [KC03] J. O. Kephart and D. M. Chess. The vision of autonomic computing. IEEE Computer, 36(1):41–50, 2003.
• [EFP06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the replacement of pipelined physical join operators in adaptive query processing. In EDBT Workshops 2006, LNCS 4254
• [CGK06] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE 2006, p. 5