
Linking Records with Erroneous Values

Date post: 25-Feb-2016
Page 1: Linking Records with Erroneous Values


Linking Records with Erroneous Values

Songtao Guo, Xin Luna Dong, Divesh Srivastava, and Remi Zajac

AT&T Labs

Page 2: Linking Records with Erroneous Values

Motivation

[Figure: listings from several sources (s) go through integration to produce cleaned data, which serves a search box.]

Src | Name | Phone | Address | City
V | A-Link Wireless | 8185491449 | 2148 GLENDALE GALLERIA | GLENDALE
V | Abercrombie | 8185020728 | 2229 GLENDALE GALLERIA | GLENDALE
V | Abercrombie & Fitch | 8185507492 | 2151 GLENDALE GALLERIA | GLENDALE
V | Aeropostale | 8185458972 | 2187 GLENDALE GALLERIA | GLENDALE
V | Aerosoles | 8182462455 | 1163 GLENDALE GALLERIA | GLENDALE
V | Newtown Pizza Palace | 2034266114 | 65 Church hill Rd | NEWTOWN
V | Pizza Palace Of Newtown | 2034266114 | 65 Church hill Rd | NEWTOWN

Src | Name | Phone | Address | City
D | Aerosoles | 8182462455 | 1163 GLENDALE GALLERIA | GLENDALE
D | Aldo Shoes | 8184090612 | 1157 GLENDALE GALLERIA | GLENDALE
D | Newtown Pizza Palace | 2034266114 | 65 Church hill Rd | Newtown
D | Pizza Palace of Newtown | 2034266114 | Church Hill Rd | Newtown

Src | Name | Phone | Address | City
A | A 24 Hour 1 A 1 Locksmith | 8182404644 | 3210 GLENDALE GALLERIA | GLENDALE
A | A Link Wireless | 8185491449 | 2148 GLENDALE GALLERIA | GLENDALE
A | Abercrombie | 8185020728 | 2229 GLENDALE GALLERIA | GLENDALE
A | Abercrombie & Fitch | 8185507492 | 2151 GLENDALE GALLERIA | GLENDALE
A | Newtown Pizza Palace | 2034266114 | 65 Church hill Rd | Newtown
A | Aldo Shoes | 8185482540 | 2154 GLENDALE GALLERIA | GLENDALE
A | Alert Cellular | 8182404779 | 2148 GLENDALE GALLERIA | GLENDALE

Src | Name | Phone | Address | City
T | Newtown Pizza Palace | 2034266114 | 65 Church hill Rd | Newtown
T | Aldo Shoes | 8185482540 | 2154 GLENDALE GALLERIA | GLENDALE
T | American Eagle Outfitters | 8189561893 | 2182 GLENDALE GALLERIA | GLENDALE
T | ANN TAYLOR | 8182460350 | 2178 GLENDALE GALLERIA | GLENDALE
T | Ann Taylor Stores | 8182460350 | 1108 GLENDALE GALLERIA | GLENDALE

Page 3: Linking Records with Erroneous Values

Motivation

Which type of listing are they?

• A: the same business

• B: different businesses sharing the same phone#

• C: different businesses, only one correctly associated with the given phone#

Page 4: Linking Records with Erroneous Values

Current Solution

• Uniqueness constraint
– Each real-world entity has a unique value, e.g., phone, address
• The data may not satisfy the constraint
– Erroneous values
– Small number of exceptions
• Current two-step solution
– Step 1: Record Linkage
  • Link records that are likely to refer to the same real-world entity [A. K. Elmagarmid, TKDE'07], [W. Winkler, Tech Report'06]
– Step 2: Data Fusion
  • Decide the correct values in the presence of conflicts [J. Bleiholder et al., ACM Computing Surveys]

Page 5: Linking Records with Erroneous Values

Limitations of Current Solution

SOURCE | NAME | PHONE | ADDRESS
s1 | Microsofe Corp. | xxx-1255 | 1 Microsoft Way
s1 | Microsofe Corp. | xxx-9400 | 1 Microsoft Way
s1 | Macrosoft Inc. | xxx-0500 | 2 Sylvan W.
s2 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way
s2 | Microsofe Corp. | xxx-9400 | 1 Microsoft Way
s2 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s3 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way
s3 | Microsoft Corp. | xxx-9400 | 1 Microsoft Way
s3 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s4 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way
s4 | Microsoft Corp. | xxx-9400 | 1 Microsoft Way
s4 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s5 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way
s5 | Microsoft Corp. | xxx-9400 | 1 Microsoft Way
s5 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s6 | Microsoft Corp. | xxx-2255 | 1 Microsoft Way
s6 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s7 | MS Corp. | xxx-1255 | 1 Microsoft Way
s7 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s8 | MS Corp. | xxx-1255 | 1 Microsoft Way
s8 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s9 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s10 | MS Corp. | xxx-0500 | 2 Sylvan Way

Locally resolving conflicts for linked records may overlook important global evidence

Erroneous values may prevent correct matching

Traditional techniques may fall short when exceptions to the uniqueness constraints exist

(Microsoft Corp., Microsofe Corp., MS Corp.) (xxx-1255, xxx-9400) (1 Microsoft Way)
(Macrosoft Inc.) (xxx-0500) (2 Sylvan Way, 2 Sylvan W.)

Page 6: Linking Records with Erroneous Values

Our Solution

• Perform linkage and fusion simultaneously
– Able to identify incorrect values from the beginning, so can improve linkage
• Make global decisions
– Consider sources that associate a pair of values in the same record, so can improve fusion
• Allow a small number of violations to capture possible exceptions in the real world

Page 7: Linking Records with Erroneous Values

Road Map

• Motivation and overview
• Problem definition
• Solution
• Evaluations on YP data
• Conclusions

Page 8: Linking Records with Erroneous Values

Problem Input

• A set of independent data sources, each providing a set of records
• A set of (soft) uniqueness constraints
– Uniqueness constraint (hard constraint):
  • Business Name, Business Phone, Business Address
– Soft uniqueness constraint (soft constraint):
  • Business Phone

[Diagram labels: 1-p1, 1-p2]

Page 10: Linking Records with Erroneous Values

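K-Partite Graph Encoding

[Figure: a k-partite graph with one node per distinct value and one partition per attribute. Name nodes N1-N4: Microsofe Corp., Microsoft Corp., MS Corp., Macrosoft Inc. Phone nodes P1-P4: xxx-1255, xxx-2255, xxx-9400, xxx-0500. Address nodes A1-A3: 1 Microsoft Way, 2 Sylvan Way, 2 Sylvan W. Each edge joins two values that appear in the same record and is labeled with the sources providing it: s(1), s(1-2), s(1-5), s(1-5,7,8), s(2-5), s(2-6), s(3-5), s(6), s(7-8), s(10), s(1-9), s(2-9), s(2-10).]

Example record: S1 | Microsofe Corp. | xxx-1255 | 1 Microsoft Way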
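
The encoding can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `build_kpartite`, the `(attribute, value)` node representation, and the choice to store edge labels as sets of source ids are all assumptions made here. The records are the s1 and s2 rows from the "Limitations of Current Solution" slide.

```python
from collections import defaultdict
from itertools import combinations

# Attribute order of a record; each attribute forms one partition of the graph.
ATTRS = ("name", "phone", "address")

def build_kpartite(records):
    """Return {((attr_a, value_a), (attr_b, value_b)): set of source ids}.

    records: iterable of (source, name, phone, address) tuples.
    Nodes are (attribute, value) pairs, so edges only ever join
    values of different attributes.
    """
    edges = defaultdict(set)
    for source, *values in records:
        nodes = list(zip(ATTRS, values))
        for a, b in combinations(nodes, 2):
            edges[(a, b)].add(source)
    return dict(edges)

# Records of sources s1 and s2, transcribed from the slide deck.
records = [
    ("s1", "Microsofe Corp.", "xxx-1255", "1 Microsoft Way"),
    ("s1", "Microsofe Corp.", "xxx-9400", "1 Microsoft Way"),
    ("s1", "Macrosoft Inc.", "xxx-0500", "2 Sylvan W."),
    ("s2", "Microsoft Corp.", "xxx-1255", "1 Microsoft Way"),
    ("s2", "Microsofe Corp.", "xxx-9400", "1 Microsoft Way"),
    ("s2", "Macrosoft Inc.", "xxx-0500", "2 Sylvan Way"),
]
graph = build_kpartite(records)
```

With only s1 and s2 loaded, the edge between Microsofe Corp. and xxx-9400 carries both sources, which matches the s(1-2) label in the figure.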

Page 11: Linking Records with Erroneous Values

Solution Encoding

[Figure: the same graph without edge labels, showing name nodes N1-N4, phone nodes P1-P4, and address nodes A1-A3.]

Clustering problem & Matching problem

Page 12: Linking Records with Erroneous Values

Solution Encoding with Hard Constraint

[Figure: the same graph with clusters C1, C2, C3, and C4 marked.]

Clustering problem

Page 13: Linking Records with Erroneous Values

Road Map

• Motivation and overview
• Problem definition
• Solution
– Clustering w.r.t. hard constraint
– Matching w.r.t. soft constraint
• Evaluations on YP data
• Conclusions

Page 14: Linking Records with Erroneous Values

Clustering w.r.t. Hard Constraints

[Figure: the example graph restricted to name nodes N1-N4, phone nodes P1 (xxx-1255) and P4 (xxx-0500), and address nodes A1-A3, with clusters C1 and C4 marked.]

• Ideal clustering:
– high cohesion within each cluster
– low correlation between different clusters
• Objective function
– Davies-Bouldin index (minimization)
• Distance is the average of
– similarity distance
– association distance
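
The slide names the index but does not spell out the formula. As a hedged sketch, the block below uses the standard Davies-Bouldin definition: for each cluster, take the worst ratio of summed intra-cluster distances to inter-cluster distance, then average over clusters. The function name, the dictionary layout, and the numeric values are hypothetical. Each distance is assumed to already be the average of the similarity and association distances.

```python
def davies_bouldin(intra, inter):
    """Standard Davies-Bouldin index (lower is better).

    intra: dict cluster_id -> d(Ci, Ci), the distance within a cluster.
    inter: dict frozenset({i, j}) -> d(Ci, Cj), the distance between clusters.
    Needs at least two clusters and non-zero inter-cluster distances.
    """
    clusters = list(intra)
    total = 0.0
    for i in clusters:
        total += max(
            (intra[i] + intra[j]) / inter[frozenset((i, j))]
            for j in clusters
            if j != i
        )
    return total / len(clusters)

# Hypothetical distances, for illustration only (not taken from the slides).
example_intra = {"C1": 0.1, "C4": 0.2}
example_inter = {frozenset(("C1", "C4")): 0.6}
phi = davies_bouldin(example_intra, example_inter)
```

Tight clusters (small numerators) that sit far apart (large denominators) drive the index down, which is why the objective is a minimization.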

Page 15: Linking Records with Erroneous Values

Similarity Distance

• Similarity of values
• Defined for each attribute

[Figure: the example graph with clusters C1 and C4. Pairwise value similarities label the node pairs: 0.95, 0.65, 0.65 among the names in C1; 0.7, 0.7, 0.4 between the names in C1 and the name in C4; the remaining labels are 0.9 and 0.]

Within C1:
$d_S^1(C_1,C_1) = 1 - (0.95+0.65+0.65)/3 = 0.25$ (name)
$d_S^2(C_1,C_1) = 0$ (phone)
$d_S^3(C_1,C_1) = 0$ (address)
$d_S(C_1,C_1) = (0.25+0+0)/3 = 0.083$

Between C1 and C4:
$d_S^1(C_1,C_4) = 1 - (0.7+0.7+0.4)/3 = 0.4$
$d_S^2(C_1,C_4) = 1 - 0 = 1$
$d_S^3(C_1,C_4) = 1 - 0 = 1$
$d_S(C_1,C_4) = (0.4+1+1)/3 = 0.8$
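
The arithmetic on this slide can be reproduced with a short sketch. The function names are hypothetical, and the rule that an attribute with no value pairs to compare contributes distance 0 is an assumption read off the worked numbers (C1 holds a single phone and a single address). The similarity scores themselves are copied from the figure, not computed.

```python
from statistics import mean

def attribute_distance(similarities):
    """1 minus the average pairwise similarity for one attribute.

    An empty list (no pairs to compare, e.g. a single value in the
    cluster) is treated as distance 0, matching the slide's numbers.
    """
    return 1 - mean(similarities) if similarities else 0.0

def similarity_distance(per_attribute):
    """Average the per-attribute distances (name, phone, address)."""
    return mean(attribute_distance(s) for s in per_attribute)

# Within C1: three name pairs; one phone and one address, so no pairs there.
d_s_c1_c1 = similarity_distance([[0.95, 0.65, 0.65], [], []])

# C1 vs C4: three name pairs, one phone pair, two address pairs (all 0).
d_s_c1_c4 = similarity_distance([[0.7, 0.7, 0.4], [0], [0, 0]])
```

The two results agree with the slide's 0.083 and 0.8 up to rounding.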

Page 16: Linking Records with Erroneous Values

Association Distance

• Association by edges
• Defined for each pair of attributes

[Figure: the example graph with clusters C1 and C4 and source-labeled edges, as in the k-partite graph encoding.]

Within C1:
– 9 sources (S1-S8, S10) mention (N1,N2,N3,P1); 7 sources (S1-S5, S7, S8) support (N1,N2,N3)-P1
$d_A^{1,2}(C_1,C_1) = 1 - 7/9 = 0.22$
$d_A^{1,3}(C_1,C_1) = 1 - 8/9 = 0.11$
$d_A^{2,3}(C_1,C_1) = 1 - 7/8 = 0.125$
$d_A(C_1,C_1) = (0.22+0.11+0.125)/3 = 0.153$

Between C1 and C4:
– 10 sources (S1-S10) mention (N1,N2,N3,N4), (P1,P4); 1 source (S10) supports (N1,N2,N3)-P4; no connection between (N4,P1)
$d_A^{1,2}(C_1,C_4) = 1 - \max(1/10, 0/10) = 0.9$
$d_A^{1,3}(C_1,C_4) = 0.9$
$d_A^{2,3}(C_1,C_4) = 1$
$d_A(C_1,C_4) = (0.9+0.9+1)/3 = 0.93$
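
The counts on this slide can be recomputed from the source table. The sketch below is one reading of the worked numbers, not a formula stated in the deck: a source "mentions" a cluster pair if any of its records contains a value of either attribute from either cluster, and "supports" it if a single record links the two clusters. All function names and the tuple-of-sets cluster layout are assumptions.

```python
# (source, name, phone, address) rows from "Limitations of Current Solution".
def _rows(sources, name, phone, address):
    return [(s, name, phone, address) for s in sources]

RECORDS = (
    _rows([1], "Microsofe Corp.", "xxx-1255", "1 Microsoft Way")
    + _rows([1, 2], "Microsofe Corp.", "xxx-9400", "1 Microsoft Way")
    + _rows([1], "Macrosoft Inc.", "xxx-0500", "2 Sylvan W.")
    + _rows(range(2, 6), "Microsoft Corp.", "xxx-1255", "1 Microsoft Way")
    + _rows(range(3, 6), "Microsoft Corp.", "xxx-9400", "1 Microsoft Way")
    + _rows(range(2, 10), "Macrosoft Inc.", "xxx-0500", "2 Sylvan Way")
    + _rows([6], "Microsoft Corp.", "xxx-2255", "1 Microsoft Way")
    + _rows([7, 8], "MS Corp.", "xxx-1255", "1 Microsoft Way")
    + _rows([10], "MS Corp.", "xxx-0500", "2 Sylvan Way")
)

# A cluster is (names, phones, addresses), indexed 0, 1, 2.
C1 = ({"Microsofe Corp.", "Microsoft Corp.", "MS Corp."},
      {"xxx-1255"}, {"1 Microsoft Way"})
C4 = ({"Macrosoft Inc."}, {"xxx-0500"}, {"2 Sylvan Way", "2 Sylvan W."})

def supporting(records, ca, cb, i, j):
    """Sources with one record joining attribute i of ca to attribute j of cb."""
    return {r[0] for r in records if r[1 + i] in ca[i] and r[1 + j] in cb[j]}

def mentioning(records, ca, cb, i, j):
    """Sources with a record holding any attribute-i or attribute-j value of ca or cb."""
    return {r[0] for r in records
            if any(r[1 + i] in c[i] or r[1 + j] in c[j] for c in (ca, cb))}

def association_distance(records, ca, cb, i, j):
    """1 minus the best support ratio in either direction; 1.0 if nobody mentions."""
    total = len(mentioning(records, ca, cb, i, j))
    if total == 0:
        return 1.0
    best = max(len(supporting(records, ca, cb, i, j)),
               len(supporting(records, cb, ca, i, j)))
    return 1 - best / total

name_phone_within = association_distance(RECORDS, C1, C1, 0, 1)   # 1 - 7/9
name_phone_between = association_distance(RECORDS, C1, C4, 0, 1)  # 1 - 1/10
```

Under this reading, the same function covers both the within-cluster and the between-cluster case, and it reproduces every per-pair value shown on the slide.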

Page 17: Linking Records with Erroneous Values

Greedy Algorithm

• Obtaining optimal clustering is intractable
– [T.F. Gonzales, 82], [J. Simal et al., 06]
• Hill climbing approximation: CLUSTER
– Step 1: Initialization
  • Cluster value representations by their similarity. Do majority voting to associate clusters
– Step 2: Adjustment
  • For each node, move it to the cluster that minimizes the DB index
– Step 3: Convergence checking
  • Terminate if step 2 doesn't change the clustering result. Otherwise, repeat step 2
• The algorithm converges
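
Steps 2 and 3 follow a generic hill-climbing loop, sketched below under stated assumptions. The initialization of Step 1 is taken as given, the objective is passed in as a callable, and the toy objective (sum of pairwise gaps inside each cluster) merely stands in for the DB index. All names are hypothetical. Because a node moves only when the objective strictly decreases and there are finitely many assignments, the loop must stop.

```python
def hill_climb(assignment, clusters, objective):
    """Move nodes between clusters while the objective keeps dropping.

    assignment: dict node -> cluster id (the initial clustering).
    clusters: iterable of allowed cluster ids.
    objective: callable(assignment) -> number, lower is better.
    Returns (final assignment, final objective value).
    """
    assignment = dict(assignment)
    clusters = list(clusters)
    best = objective(assignment)
    changed = True
    while changed:                      # Step 3: repeat until nothing moves
        changed = False
        for node in sorted(assignment):  # Step 2: try to relocate each node
            original = assignment[node]
            best_cluster = original
            for candidate in clusters:
                if candidate == original:
                    continue
                assignment[node] = candidate
                score = objective(assignment)
                if score < best:        # accept strict improvements only
                    best, best_cluster = score, candidate
            assignment[node] = best_cluster
            if best_cluster != original:
                changed = True
    return assignment, best

# Toy data: four points on a line; the objective sums gaps within clusters.
points = {"a": 0, "b": 1, "c": 10, "d": 11}

def toy_objective(assignment):
    nodes = sorted(assignment)
    return sum(abs(points[u] - points[v])
               for k, u in enumerate(nodes) for v in nodes[k + 1:]
               if assignment[u] == assignment[v])

final, score = hill_climb({"a": 0, "b": 0, "c": 0, "d": 1}, [0, 1], toy_objective)
```

Hill climbing only guarantees a local optimum, so the result depends on the initial clustering.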

Page 18: Linking Records with Erroneous Values

[Figure: a run of CLUSTER on the example graph (name nodes N1-N4, phone nodes P1-P4, address nodes A1-A3, clusters C1-C4). DB-index values shown in the figure: Φ=0.94, Φ=1.16, Φ=0.93, Φ=0.89, Φ=0.71, Φ=0.45.]

Page 19: Linking Records with Erroneous Values

Road Map

• Motivation and overview
• Problem definition
• Solution
– Clustering w.r.t. hard constraint
– Matching w.r.t. soft constraint
• Evaluations on YP data
• Conclusions

Page 20: Linking Records with Erroneous Values

Matching w.r.t. Soft Constraints

• Next? Matching problem
• How to match?

[Figure: GRAPH TRANSFORM. The value-level graph (N1-N4, P1-P4, A1-A3) is collapsed into a cluster-level graph with nodes NC1, NC4, PC1-PC4, AC1, AC4. Each edge is weighted by its number of supporting sources: 7 s(1-5,7,8), 1 s(6), 5 s(1-5), 1 s(10), 9 s(1-9), 9 s(1-9), 1 s(10), 8 s(1-8).]
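
The transform can be sketched as merging value-level edges whose endpoints fall in the same clusters and counting distinct sources. This is an illustration only: the function name `collapse` is hypothetical, and labeling xxx-2255 as PC2 and xxx-9400 as PC3 is an assumption based on the order in which the slide lists them. The edge sets are the name-phone labels from the k-partite graph slide.

```python
from collections import defaultdict

def collapse(edges, cluster_of):
    """Merge value-level edges into cluster-level weighted edges.

    edges: dict (value_u, value_v) -> set of supporting source ids.
    cluster_of: dict value -> cluster id.
    Returns dict (cluster_u, cluster_v) -> number of distinct sources.
    """
    merged = defaultdict(set)
    for (u, v), sources in edges.items():
        merged[(cluster_of[u], cluster_of[v])] |= sources
    return {pair: len(sources) for pair, sources in merged.items()}

# Name-phone edges with their supporting sources, from the example graph.
name_phone_edges = {
    ("Microsofe Corp.", "xxx-1255"): {1},
    ("Microsoft Corp.", "xxx-1255"): {2, 3, 4, 5},
    ("MS Corp.", "xxx-1255"): {7, 8},
    ("Microsofe Corp.", "xxx-9400"): {1, 2},
    ("Microsoft Corp.", "xxx-9400"): {3, 4, 5},
    ("Microsoft Corp.", "xxx-2255"): {6},
    ("MS Corp.", "xxx-0500"): {10},
    ("Macrosoft Inc.", "xxx-0500"): set(range(1, 10)),
}

# Cluster membership; the PC2/PC3 labels are assumed, see the lead-in.
cluster_of = {
    "Microsofe Corp.": "NC1", "Microsoft Corp.": "NC1", "MS Corp.": "NC1",
    "Macrosoft Inc.": "NC4",
    "xxx-1255": "PC1", "xxx-2255": "PC2", "xxx-9400": "PC3", "xxx-0500": "PC4",
}
weights = collapse(name_phone_edges, cluster_of)
```

Taking the union of source sets, rather than adding edge counts, keeps a source from being counted twice when it supports several value pairs inside the same cluster pair.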

Page 21: Linking Records with Erroneous Values

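Matching w.r.t. Soft Constraint

• Intuitions
– Largest sum of weights
– Smallest gap
– How to balance these two goals?
• Optimization problem
– Maximize: [formula garbled in extraction; its recoverable terms are the matching M, pairs (u,v), the weight w(u,v), Gap(u), and Gap(v)]
– Subject to: [formula garbled in extraction; its recoverable terms are the bounds p1 and p2 and ratios of the form |Â| / |A|]
• Two-phase greedy algorithm: MATCH

[Figure: a name node N linked to phone nodes P1, P2, P3 by edges with weights 1 (s1), 9 (s2-s10), and 10 (s1-s10), shown for three candidate solutions:]
– Solution 1: Gap(N) = 1
– Solution 2: Gap(N) = 9
– Solution 3: Gap(N) = 0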

Page 22: Linking Records with Erroneous Values

Road Map

• Motivation and overview
• Problem definition
• Solution
• Evaluations on YP data
• Conclusions

Page 23: Linking Records with Erroneous Values

Experiment Settings

• Dataset I
– Business listings for two zip codes (07035 - Lincoln Park, NJ; 07715 - Belmar, NJ) from multiple sources

Zip | Businesses | #Sources | #Srcs/business
07035 | 662 | 15 | 1-7
07715 | 149 | 6 | 1-3

Records:
Zip | #Recs | #Names | #Phones | #Addresses | #(Err Ps)
07035 | 1629 | 1154 | 839 | 735 | 72
07715 | 266 | 243 | 184 | 55 | 12

Constraint Violation:
Zip | NP | PN | NA | AN
07035 | 8%(2.6) | .8%(2.7) | 2%(2.3) | 12.6%(5.1)
07715 | 4%(2) | 1%(3) | 4%(2) | 4%(8.5)

Page 24: Linking Records with Erroneous Values

Experiment Settings

• Implementation
– MATCH (invoking CLUSTER first)
– LINK: record linkage only
– FUSE: data fusion only
– LINKFUSE: first LINK, then FUSE
• Golden Standard: by manually checking
• Measures: Precision/Recall/F-measure

Matching of values of different attributes:
$P = |G_M \cap R_M| / |R_M|$ (Precision)
$R = |G_M \cap R_M| / |G_M|$ (Recall)
$F = 2PR / (P + R)$ (F-measure)

Clustering of values of the same attribute:
$P = |G_A \cap R_A| / |R_A|$ (Precision)
$R = |G_A \cap R_A| / |G_A|$ (Recall)
$F = 2PR / (P + R)$ (F-measure)

Notation | Description
$G_M$ | Matched pairs for the golden standard
$R_M$ | Matched pairs for our results
$G_A$ | Clustered pairs for the golden standard
$R_A$ | Clustered pairs for our results
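
The three measures are plain set arithmetic over pairs, so one helper serves both the matching and the clustering case. The sketch below is illustrative: the function name, the convention of returning 0.0 for empty inputs, and the sample pairs are assumptions.

```python
def pair_scores(gold, result):
    """Precision, recall, and F-measure of result pairs against gold pairs.

    gold, result: iterables of hashable pairs, e.g. (name, phone) for
    matching or (value_a, value_b) for clustering.
    Empty inputs or zero overlap yield 0.0 rather than raising.
    """
    gold, result = set(gold), set(result)
    hits = len(gold & result)
    precision = hits / len(result) if result else 0.0
    recall = hits / len(gold) if gold else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_measure

# Hypothetical pairs: two of four reported pairs are in the gold standard.
gold_pairs = {("a", "b"), ("c", "d"), ("e", "f")}
our_pairs = {("a", "b"), ("c", "d"), ("x", "y"), ("z", "w")}
precision, recall, f_measure = pair_scores(gold_pairs, our_pairs)
```

For clustering, unordered pairs should be normalized first (for example by sorting each pair) so that (a, b) and (b, a) compare equal.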

Page 25: Linking Records with Erroneous Values

Accuracy

[Figure: six accuracy plots. 07035: Matching (NAME-PHONE), Matching (NAME-ADDRESS), Clustering (NAME). 07715: Matching (NAME-PHONE), Matching (NAME-ADDRESS), Clustering (NAME).]

• MATCH achieves highest F-measure in most cases
– Improves LINK by 11% on name-phone matching, by 20% on name clustering
• LINK vs. FUSE vs. LINKFUSE
– LINK: high recall in matching
– FUSE: high precision in matching, high precision in name clustering
– LINKFUSE: only slightly better than FUSE in matching and similar to LINK in clustering

Page 26: Linking Records with Erroneous Values

Efficiency and Scalability

• Data set II
– Entire listing: 40+M records
• Hadoop-based linkage framework
– Fuzzy self-join using Hadoop
– Partition records into strongly connected components
• Efficiency
– Linear growth
– Execution time

Module | Execution time (hour)
Record extraction | 0.002
Fuzzy self join | 0.89
Connected component | 0.89
Linkage | 1.36
Overall | 3.26

median | 95th percentile | 99th percentile | max
2 | 5 | 7 | 2103
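
The partitioning step can be illustrated with an in-memory union-find. This is a small stand-in, not the Hadoop job the slide describes: it assumes the fuzzy self-join has already produced candidate pairs and treats them as undirected edges. The function name and sample ids are hypothetical.

```python
def connected_components(nodes, pairs):
    """Group record ids into components linked by candidate pairs.

    nodes: list of record ids.
    pairs: iterable of (id_a, id_b) produced by a similarity join.
    Returns a list of sets, ordered by each component's smallest id.
    """
    parent = {n: n for n in nodes}

    def find(x):
        # Walk to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(groups.values(), key=min)

components = connected_components([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (4, 5)])
```

Each component can then be linked independently, which is what makes the per-component linkage step easy to parallelize.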

Page 27: Linking Records with Erroneous Values

Conclusions

• In the real world, we need to resolve duplicates and conflicts at the same time.
• We reduce the problem to a k-partite graph clustering and matching problem
– Combine linkage and fusion
– Apply them in a global fashion
• Experiments show high accuracy and scalability

Page 28: Linking Records with Erroneous Values


Thank You!

