Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets
Hector Garcia-Molina, Stanford University
Work with: Omar Benjelloun, Qi Su, Jennifer Widom, Tyson Condie,
Nicolas Pombourcq, David Menestrina, Steven Whang
2
Entity Resolution
N: a A: b CC#: c Ph: e
e1
N: a Exp: d Ph: e
e2
3
Applications
• comparison shopping
• mailing lists
• classified ads
• customer files
• counter-terrorism
N: a A: b CC#: c Ph: e
e1
N: a Exp: d Ph: e
e2
4
Outline
• Why is ER challenging?
• How is ER done?
• Some ER work at Stanford
• Confidences
5
Challenges (1)
• No keys!
• Value matching
  – “Kaddafi”, “Qaddafi”, “Kadafi”, “Kaddaffi”...
• Record matching
  Nm: Tom; Ad: 123 Main St; Ph: (650) 555-1212; Ph: (650) 777-7777
  Nm: Thomas; Ad: 132 Main St; Ph: (650) 555-1212
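Value matching of the spelling variants above can be sketched with a string-similarity score. This is only an illustration: the helper name `values_match` and the 0.8 cutoff are assumptions, and `difflib.SequenceMatcher` is simply one standard-library similarity measure, not the one used in the talk.

```python
from difflib import SequenceMatcher

def values_match(v1: str, v2: str, threshold: float = 0.8) -> bool:
    """Treat two strings as the same value when their similarity ratio is high enough."""
    ratio = SequenceMatcher(None, v1.lower(), v2.lower()).ratio()
    return ratio >= threshold  # the threshold is an assumed tuning knob

spellings = ["Kaddafi", "Qaddafi", "Kadafi", "Kaddaffi"]
# Compare every variant against the first spelling.
all_match = all(values_match(spellings[0], s) for s in spellings[1:])
print(all_match)
```

A real system would pick the similarity measure and cutoff per attribute; names, addresses and phone numbers rarely share one.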
6
Challenges (2)
• Merging records
  Nm: Tom; Ad: 123 Main St; Ph: (650) 555-1212; Ph: (650) 777-7777
  Nm: Thomas; Ad: 132 Main St; Ph: (650) 555-1212; Zp: 94305
  Nm: Tom; Nm: Thomas; Ad: 123 Main St; Ph: (650) 555-1212; Ph: (650) 777-7777; Zp: 94305
7
Challenges (3)
• Chaining
  Nm: Tom; Wk: IBM; Oc: laywer; Sal: 500K
  Nm: Tom; Ad: 123 Main; BD: Jan 1, 85; Wk: IBM
  Nm: Thomas; Ad: 123 Maim; Oc: lawyer
  Nm: Tom; Ad: 123 Main; BD: Jan 1, 85; Wk: IBM; Oc: lawyer
  Nm: Tom; Ad: 123 Main; BD: Jan 1, 85; Wk: IBM; Oc: lawyer; Sal: 500K
8
Challenges (4)
• Un-merging
  Nm: Tom; Ad: 123 Main; BD: Jan 1, 85; Wk: IBM; Oc: lawyer; Sal: 500K
  too young to make 500K at IBM!!
9
Challenges (5)
• Confidences in data
  Nm: Tom (0.9); Ad: 123 Main St (1.0); Ph: (650) 555-1212 (0.6); Ph: (650) 777-7777 (0.8)
  (0.8)
• In value matching, match rules, merge:
  conf = ?
10
Taxonomy
• Pairwise snaps vs. clustering
• De-duplication vs. fidelity enhancement
• Schema differences
• Relationships
• Exact vs. approximate
• Generic vs. application specific
• Confidences
11
Schema Differences
Name: Tom; Address: 123 Main St; Ph: (650) 555-1212; Ph: (650) 777-7777
FirstName: Tom; StreetName: Main St; StreetNumber: 123; Tel: (650) 777-7777
12
Pair-Wise Snaps vs. Clustering
[Figure: two panels of record nodes. First panel: r1, r2, r3, r4, r5, r6, s7, s8, s9, s10. Second panel: r1, r2, r3, r4, r5, r6, r7, r8, r9, r10.]
13
De-Duplication vs. Fidelity Enhancement
R SB S
N
14
Relationships
[Figure: records r1, r2, r5, r7 linked by relationship edges labeled brother, father, business, business]
15
Using Relationships
[Figure: graph linking authors a1, a2, a3, a4, a5 to papers p1, p2, p5, p7; annotation: “same??”]
16
Exact vs Approximate ER
products:
  cameras => ER => resolved cameras
  CDs => ER => resolved CDs
  books => ER => resolved books
  ...
17
Exact vs Approximate ER
terrorists => sort by age => terrorists
Widom, 30: match against ages 25-35
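The sort-then-window idea on this slide can be sketched as follows. The list contents, the helper name `candidates` and the slack of 5 years are illustrative assumptions; only the "age 30 is compared against ages 25-35" behavior comes from the slide.

```python
from bisect import bisect_left, bisect_right

def candidates(sorted_people, age, slack=5):
    """Return the entries whose age lies in [age - slack, age + slack].

    sorted_people must be a list of (age, name) tuples sorted by age.
    """
    ages = [a for a, _ in sorted_people]
    lo = bisect_left(ages, age - slack)
    hi = bisect_right(ages, age + slack)
    return sorted_people[lo:hi]

# Hypothetical list, sorted once by age.
people = sorted([(52, "A"), (28, "B"), (35, "C"), (19, "D"), (25, "E"), (41, "F")])
window = candidates(people, 30)   # only these need a full match test
print(window)
```

The result is approximate by design: a true match whose recorded age is off by more than the slack is never compared.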
18
Generic vs Application Specific
• Match function M(r, s)
• Merge function <r, s> => t
19
Taxonomy
• Pairwise snaps vs. clustering
• De-duplication vs. fidelity enhancement
• Schema differences
• Relationships
• Exact vs. approximate
• Generic vs. application specific
• Confidences
20
Outline
• Why is ER challenging?
• How is ER done?
• Some ER work at Stanford
• Confidences
21
Taxonomy
• Pairwise snaps vs. clustering
• De-duplication vs. fidelity enhancement
• Schema differences: No
• Relationships: No
• Exact vs. approximate
• Generic vs. application specific
• Confidences ... later on
22
Model
Nm: Tom; Wk: IBM; Oc: laywer; Sal: 500K
Nm: Tom; Ad: 123 Main; BD: Jan 1, 85; Wk: IBM
Nm: Thomas; Ad: 123 Maim; Oc: lawyer
Nm: Tom; Ad: 123 Main; BD: Jan 1, 85; Wk: IBM; Oc: lawyer
Nm: Tom; Ad: 123 Main; BD: Jan 1, 85; Wk: IBM; Oc: lawyer; Sal: 500K
r1 r3 r2
r4:
M(r1, r2) M(r4, r3)
23
Correct Answer
[Figure: record nodes r1, r2, r3, r4, r5, r6, s7, s8, s9, s10]
ER(R) = All derivable records, minus “dominated” records
24
Question
• What is the best sequence of match and merge calls that gives us the right answer?
25
Brute Force Algorithm
• Input R:
  – r1 = [a:1, b:2]
  – r2 = [a:1, c:4, e:5]
  – r3 = [b:2, c:4, f:6]
  – r4 = [a:7, e:5, f:6]
26
Brute Force Algorithm
• Input R:
  – r1 = [a:1, b:2]
  – r2 = [a:1, c:4, e:5]
  – r3 = [b:2, c:4, f:6]
  – r4 = [a:7, e:5, f:6]
• Match all pairs:
  – r1 = [a:1, b:2]
  – r2 = [a:1, c:4, e:5]
  – r3 = [b:2, c:4, f:6]
  – r4 = [a:7, e:5, f:6]
  – r12 = [a:1, b:2, c:4, e:5]
27
Brute Force Algorithm
• Match all pairs:
  – r1 = [a:1, b:2]
  – r2 = [a:1, c:4, e:5]
  – r3 = [b:2, c:4, f:6]
  – r4 = [a:7, e:5, f:6]
  – r12 = [a:1, b:2, c:4, e:5]
• Repeat:
  – r1 = [a:1, b:2]
  – r2 = [a:1, c:4, e:5]
  – r3 = [b:2, c:4, f:6]
  – r4 = [a:7, e:5, f:6]
  – r12 = [a:1, b:2, c:4, e:5]
  – r123 = [a:1, b:2, c:4, e:5, f:6]
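The brute-force procedure on these slides can be sketched as a fixpoint loop. The slides do not state the match rule, so the rule below (same `a` value, or same `b` and `c` values) is an assumption chosen to reproduce r12 and r123; the merge (union of attribute-value pairs) and all function names are likewise illustrative.

```python
from itertools import combinations

def rec(**attrs):
    """Build a record as a frozenset of (attribute, value) pairs."""
    return frozenset(attrs.items())

def shares(r, s, attr):
    """True if r and s have a common value for the given attribute."""
    return any(a == attr and (a, v) in s for a, v in r)

def match(r, s):
    # Assumed rule (not on the slide): same a, or same b and same c.
    return shares(r, s, "a") or (shares(r, s, "b") and shares(r, s, "c"))

def merge(r, s):
    # Assumed merge: keep every attribute-value pair of both records.
    return r | s

def brute_force_er(records):
    result = set(records)
    while True:
        # Compare every pair and collect merges not seen before.
        new = {merge(r, s) for r, s in combinations(result, 2) if match(r, s)}
        new -= result
        if not new:
            break
        result |= new
    # Drop dominated records: those strictly contained in another record.
    return {r for r in result if not any(r < s for s in result)}

r1 = rec(a=1, b=2)
r2 = rec(a=1, c=4, e=5)
r3 = rec(b=2, c=4, f=6)
r4 = rec(a=7, e=5, f=6)
answer = brute_force_er([r1, r2, r3, r4])
print(answer)
```

Under these assumptions the loop derives r12, then r123, and the final answer keeps only r123 and r4. Every round re-compares all pairs, which is the cost the next slides question.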
28
Question # 1
Brute Force Algorithm
• Input R:
  – r1 = [a:1, b:2]
  – r2 = [a:1, c:4, e:5]
  – r3 = [b:2, c:4, f:6]
  – r4 = [a:7, e:5, f:6]
• Match all pairs:
  – r1 = [a:1, b:2]
  – r2 = [a:1, c:4, e:5]
  – r3 = [b:2, c:4, f:6]
  – r4 = [a:7, e:5, f:6]
  – r12 = [a:1, b:2, c:4, e:5]
Can we delete r1, r2?
29
Question # 2
Brute Force Algorithm
• Match all pairs:
  – r1 = [a:1, b:2]
  – r2 = [a:1, c:4, e:5]
  – r3 = [b:2, c:4, f:6]
  – r4 = [a:7, e:5, f:6]
  – r12 = [a:1, b:2, c:4, e:5]
• Repeat:
  – r1 = [a:1, b:2]
  – r2 = [a:1, c:4, e:5]
  – r3 = [b:2, c:4, f:6]
  – r4 = [a:7, e:5, f:6]
  – r12 = [a:1, b:2, c:4, e:5]
  – r123 = [a:1, b:2, c:4, e:5, f:6]
Can we avoid comparisons?
30
ICAR Properties
• Idempotence:
  – M(r1, r1) = true; <r1, r1> = r1
• Commutativity:
  – M(r1, r2) = M(r2, r1)
  – <r1, r2> = <r2, r1>
• Associativity:
  – <r1, <r2, r3>> = <<r1, r2>, r3>
31
More Properties
• Representativity
  – If <r1, r2> = r3, then for any r4 such that M(r1, r4) is true
    we also have M(r3, r4) = true.
r1
r2
r3
r4
32
ICAR Properties => Efficiency
• Commutativity
• Idempotence
• Associativity
• Representativity

• Can discard records
• ER result independent of processing order
33
Swoosh Algorithms
• Record Swoosh
  – Merges records as soon as they match
  – Optimal in terms of record comparisons
• Feature Swoosh
  – Remembers values seen for each feature
  – Avoids redundant value comparisons
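A minimal sketch of the Record Swoosh idea, merging as soon as a match is found and discarding the two source records. It is written from the description on this slide, not taken from the authors' code; the record encoding, the match rule (same `a`, or same `b` and `c`) and the union merge are assumptions carried over from the r1..r4 example of the brute-force slides.

```python
def rec(**attrs):
    """Build a record as a frozenset of (attribute, value) pairs."""
    return frozenset(attrs.items())

def shares(r, s, attr):
    return any(a == attr and (a, v) in s for a, v in r)

def match(r, s):
    # Assumed rule: same a, or same b and same c.
    return shares(r, s, "a") or (shares(r, s, "b") and shares(r, s, "c"))

def merge(r, s):
    return r | s  # assumed merge: union of attribute-value pairs

def r_swoosh(records, match, merge):
    todo = list(records)   # records still to be processed
    resolved = []          # records known not to match each other
    while todo:
        r = todo.pop(0)
        buddy = next((s for s in resolved if match(r, s)), None)
        if buddy is None:
            resolved.append(r)
        else:
            # Merge immediately; both source records are discarded,
            # which is safe only when the ICAR properties hold.
            resolved.remove(buddy)
            todo.append(merge(r, buddy))
    return resolved

r1 = rec(a=1, b=2)
r2 = rec(a=1, c=4, e=5)
r3 = rec(b=2, c=4, f=6)
r4 = rec(a=7, e=5, f=6)
result = r_swoosh([r1, r2, r3, r4], match, merge)
print(result)
```

Each incoming record is compared only against the resolved set, so r1 and r2 are never compared again once r12 exists.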
34
Swoosh Performance
35
If ICAR Properties Do Not Hold?
r1: [Joe Sr., 123 Main, DL:X]
r23: [Joe Jr., 123 Main, Ph: 123, DL:Y]
r12: [Joe Sr., 123 Main, Ph: 123, DL:X]
r2: [Joe, 123 Main, Ph:123]
r3: [Joe Jr., 123 Main, DL:Y]
36
If ICAR Properties Do Not Hold?
r1: [Joe Sr., 123 Main, DL:X]
r23: [Joe Jr., 123 Main, Ph: 123, DL:Y]
r12: [Joe Sr., 123 Main, Ph: 123, DL:X]
r2: [Joe, 123 Main, Ph:123]
r3: [Joe Jr., 123 Main, DL:Y]
Full Answer: ER(R) = {r12, r23, r1, r2, r3}
Minus Dominated: ER(R) = {r12, r23}
37
If ICAR Properties Do Not Hold?
r1: [Joe Sr., 123 Main, DL:X]
r23: [Joe Jr., 123 Main, Ph: 123, DL:Y]
r12: [Joe Sr., 123 Main, Ph: 123, DL:X]
r2: [Joe, 123 Main, Ph:123]
r3: [Joe Jr., 123 Main, DL:Y]
Full Answer: ER(R) = {r12, r23, r1, r2, r3}
Minus Dominated: ER(R) = {r12, r23}
R-Swoosh Yields: ER(R) = {r12, r3} or {r1, r23}
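The order dependence on this slide can be reproduced with a small sketch. The match and merge rules below are assumptions picked so that r2 matches both r1 and r3 while "Joe Sr." never matches "Joe Jr."; the merge-as-you-go loop is a simplified R-Swoosh-style procedure, not the authors' implementation.

```python
from collections import namedtuple

Rec = namedtuple("Rec", "name addr ph dl")

def compatible(x, y):
    """Two values are compatible if either is missing or they are equal."""
    return x is None or y is None or x == y

def match(r, s):
    # Assumed rule: same address, one name a prefix of the other,
    # and no conflicting phone or driver's license.
    names_ok = r.name.startswith(s.name) or s.name.startswith(r.name)
    return (r.addr == s.addr and names_ok
            and compatible(r.ph, s.ph) and compatible(r.dl, s.dl))

def merge(r, s):
    name = max(r.name, s.name, key=len)  # keep the more specific name
    return Rec(name, r.addr, r.ph or s.ph, r.dl or s.dl)

def swoosh(records):
    todo, resolved = list(records), []
    while todo:
        r = todo.pop(0)
        buddy = next((s for s in resolved if match(r, s)), None)
        if buddy is None:
            resolved.append(r)
        else:
            resolved.remove(buddy)
            todo.append(merge(r, buddy))
    return set(resolved)

r1 = Rec("Joe Sr.", "123 Main", None, "X")
r2 = Rec("Joe", "123 Main", "123", None)
r3 = Rec("Joe Jr.", "123 Main", None, "Y")
r12, r23 = merge(r1, r2), merge(r2, r3)

first = swoosh([r1, r2, r3])    # r2 is absorbed by r1
second = swoosh([r3, r2, r1])   # r2 is absorbed by r3
print(first == {r12, r3}, second == {r1, r23})
```

Representativity fails here: r2 matches r3, but the merged r12 does not, so discarding r2 after the first merge loses r23.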
38
Swoosh Without ICAR Properties
39
Distributed Swoosh
P1 P2 P3
r1 r2 r3 r4 r5 r6 ...
40
Distributed Swoosh
P1: r1, r3, r4, r6, ...
P2: r1, r2, r4, r5, ...
P3: r2, r3, r5, r6, ...
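One assignment scheme consistent with the picture above: split the records into three buckets round-robin and give each processor one pair of buckets, so that every pair of records meets on at least one processor. This is a sketch under that assumption; the function name `assign` and the bucket scheme are illustrative and may differ from what D-Swoosh actually does.

```python
from itertools import combinations

def assign(records, n_buckets=3):
    """Round-robin records into buckets; one processor per pair of buckets."""
    buckets = [records[i::n_buckets] for i in range(n_buckets)]
    return {
        (i, j): sorted(buckets[i] + buckets[j])
        for i, j in combinations(range(n_buckets), 2)
    }

records = ["r1", "r2", "r3", "r4", "r5", "r6"]
processors = assign(records)

# Check that every pair of records shares at least one processor.
covered = all(
    any(x in recs and y in recs for recs in processors.values())
    for x, y in combinations(records, 2)
)
print(processors, covered)
```

Each record is replicated on two of the three processors, which is the price paid for comparing all pairs in parallel.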
41
DSwoosh Performance
42
Outline
• Why is ER challenging?
• How is ER done?
• Some ER work at Stanford
• Confidences
43
Conclusion
• ER is an old and important problem
• Our approach: generic
• Confidences
  – challenging
  – two ways to tame:
    • thresholds
    • packages
44
Thanks.
45
Generic Confidence Model
• r1 = 0.7 [a:v1, b:v2, c:v3]
• match: .7[a, b, c], .9[a, c, d] => yes (or no)
• merge: .7[a, b, c], .9[a, c, d] => .65[a, b, c, d, x]
46
Problem: Properties May Not Hold
• r1 = 0.9 [a, b, c]
• r2 = 0.8 [a, d]
• say confidences multiplied on merge
• <r1, r2> = 0.72 [a, b, c, d]
• <<r1, r2>, r1> = 0.648 [a, b, c, d]
• <<r1, r1>, r2> = <r1, r2> = 0.72 [a, b, c, d]
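The arithmetic on this slide can be replayed with exact fractions. The grouping of the merges below is an assumption chosen to reproduce the 0.648 and 0.72 shown above; the tuple encoding and the `merge` helper are illustrative.

```python
from fractions import Fraction

def merge(r, s):
    """Assumed merge: union the attributes and multiply the confidences.

    Merging a record with itself returns it unchanged (idempotence).
    """
    if r == s:
        return r
    (conf_r, attrs_r), (conf_s, attrs_s) = r, s
    return (conf_r * conf_s, attrs_r | attrs_s)

r1 = (Fraction(9, 10), frozenset("abc"))
r2 = (Fraction(8, 10), frozenset("ad"))

left = merge(merge(r1, r2), r1)    # r1 counted twice
right = merge(merge(r1, r1), r2)   # r1 counted once
print(float(left[0]), float(right[0]))
```

Both results carry the same attributes but different confidences, so the outcome depends on how the merges are grouped and records cannot simply be discarded after merging.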
47
ER with Confidences
• Very Expensive:
  – must compute “all derivations”
  – cannot delete records after they merge
• What can we do??
  – thresholds
  – packages
48
Important Property
• If conf(Rx) < threshold
• Then for any Ry derived from Rx: conf(Ry) < threshold
r1
r2
r3 r4
C= 0.7 C
49
Thresholds - Example
0.9 [a:v1, b:v2]
0.8 [a:v1, c:v3]
0.6 [b:v2, c:v3, d:v4]
0.75 [a:v1, b:v2, c:v3]
0.5 [a:v1, b:v2, c:v3, d:v4]
...
T = 0.7
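Applying the threshold to the records listed above is a simple filter. A minimal sketch, assuming records are (confidence, attributes) tuples; the helper name `above_threshold` is illustrative.

```python
def above_threshold(records, threshold):
    """Keep only records whose confidence is at least the threshold."""
    return [r for r in records if r[0] >= threshold]

records = [
    (0.9, {"a": "v1", "b": "v2"}),
    (0.8, {"a": "v1", "c": "v3"}),
    (0.6, {"b": "v2", "c": "v3", "d": "v4"}),
    (0.75, {"a": "v1", "b": "v2", "c": "v3"}),
    (0.5, {"a": "v1", "b": "v2", "c": "v3", "d": "v4"}),
]
kept = above_threshold(records, 0.7)
print([conf for conf, _ in kept])
```

When the threshold property holds, a pruned record never needs to be merged further, because nothing derived from it can rise back above T.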
51
Goal: C-Swoosh
base records => all possible merges => eliminate dominated => eliminate below threshold
52
Goal: C-Swoosh
base records => all possible merges => eliminate dominated => eliminate below threshold (earlier)
53
Does Threshold Property Hold?
• NO: records are evidence
merge: .7[a, b, c], .8[a, b, c] => .9[a, b, c]
54
Does Threshold Property Hold?
• YES: records are beliefs
merge: .7[a, b, c], .8[a, b, c] => .8[a, b, c]
55
Simple Confidence Model
• 0.7 [a, b]
[a, b] [a, b] [a, b, c] [a, b] [a, b]
[a, b] [a, b, d] ??? ??? ???
Alternate Worlds:
56
Rules
• 0.7 [a, b, c], 0.7 [a, b, c] ⇒ 0.7 [a, b, c]
• 0.7 [a, b], 0.5 [a, b] ⇒ 0.7 [a, b]
• 0.7 [a, b, c], 0.5 [a, b] ⇒ 0.7 [a, b, c]
• 0.7 [a, b, c], 0.9 [a, b] ⇒ 0.7 [a, b, c], 0.9 [a, b]
• etc.
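The rules above are consistent with one reading of domination: a record is dropped when another record has at least its attributes and at least its confidence. A sketch under that assumption; the function names and the (confidence, attributes) encoding are illustrative.

```python
def dominates(r, s):
    """r dominates s if r has at least s's attributes and at least s's confidence."""
    (conf_r, attrs_r), (conf_s, attrs_s) = r, s
    return attrs_r >= attrs_s and conf_r >= conf_s

def remove_dominated(records):
    unique = list(dict.fromkeys(records))  # drop exact duplicates, keep order
    return [s for s in unique
            if not any(r != s and dominates(r, s) for r in unique)]

abc = frozenset("abc")
ab = frozenset("ab")

rule1 = remove_dominated([(0.7, abc), (0.7, abc)])
rule2 = remove_dominated([(0.7, ab), (0.5, ab)])
rule3 = remove_dominated([(0.7, abc), (0.5, ab)])
rule4 = remove_dominated([(0.7, abc), (0.9, ab)])
print(rule1, rule2, rule3, rule4)
```

In the fourth rule neither record dominates the other: the larger record is less certain, so both survive.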
57
Matches
0.9 [a, b, c]; 0.8 [a, b, d]; [a, x]; [c, d, y]
Match with confidence 0.5
worlds
[a,b,c][a,b,d]
[a,b,c,d]
1 2 3 4 5 6 7 8 9 10
58
Matches
0.9 [a, b, c]; 0.8 [a, b, d]; [a, x]; [c, d, y]
worlds
[a,b,c][a,b,d]
[a,b,c,d]
1 2 3 4 5 6 7 8 9 10
0.4 [a, b, c, d]; 0.9 [a, b, c]; 0.8 [a, b, d]; [a, x]; [c, d, y]
59
Summary
• Belief model well suited for ER
• Evidence model is very complex and expensive!
60
Packages
• Match does not use confidences
  – merge does compute confidences
• 4 properties hold for deterministic attributes
  – e.g., = ignoring confidences
61
Partition Records
– r1 = .9 [a:1, b:2]
– r2 = .8 [a:1, c:4, e:5]
– r3 = .7 [b:2, c:4, f:6]
– r4 = .8 [a:7, e:5, f:6]
– r5 = .9 [a:7, b:2]
[Figure: one partition containing r1, r2, r3, r12, r123; another containing r4, r5, r45]
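Partitioning into packages can be sketched by running a merge-as-you-go loop that ignores confidences and remembers which base records ended up together. The match rule (same `a`, or same `b` and `c`) is an assumption, as are the function names; it happens to reproduce the two groups in the figure.

```python
def shares(x, y, attr):
    return any(a == attr and (a, v) in y for a, v in x)

def match(x, y):
    # Assumed rule on attributes only; confidences are not consulted.
    return shares(x, y, "a") or (shares(x, y, "b") and shares(x, y, "c"))

def partition(base):
    """base maps a record name to (confidence, attribute dict)."""
    todo = [(frozenset(attrs.items()), frozenset([name]))
            for name, (_, attrs) in base.items()]
    packages = []
    while todo:
        attrs, members = todo.pop(0)
        for other in packages:
            if match(attrs, other[0]):
                packages.remove(other)
                todo.append((attrs | other[0], members | other[1]))
                break
        else:
            packages.append((attrs, members))
    return [set(members) for _, members in packages]

base = {
    "r1": (0.9, {"a": 1, "b": 2}),
    "r2": (0.8, {"a": 1, "c": 4, "e": 5}),
    "r3": (0.7, {"b": 2, "c": 4, "f": 6}),
    "r4": (0.8, {"a": 7, "e": 5, "f": 6}),
    "r5": (0.9, {"a": 7, "b": 2}),
}
packages = partition(base)
print(packages)
```

Confidence computation, which is the expensive part, can then be confined to the base records inside each package.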
62
Expand Packages
– r1 = .9 [a:1, b:2]
– r2 = .8 [a:1, c:4, e:5]
– r3 = .7 [b:2, c:4, f:6]
– r4 = .8 [a:7, e:5, f:6]
– r5 = .9 [a:7, b:2]
r1
r2
r3
r12r123
r1, r2, r3
...
63
Conclusion
• ER is an old and important problem
• Our approach: generic
• Confidences
  – challenging
  – two ways to tame:
    • thresholds
    • packages
Thanks.
65
Extra Slides
66
Taxonomy
• Pairwise snaps vs. clustering
• De-duplication vs. fidelity enhancement
• Schema differences: No
• Relationships: No
• Exact vs. approximate
• Generic vs. application specific
• Confidences ... later on
67
One Confidence Model
[id1, a, b, c, d][id1, a, b, d][id1, a, x][id1, b, y]
[id1, a, b, c, d]
[id3, a, b, c][id3, a, b, d][id3, a, b, f, g][id3, a, b, f, g]
[id3, a, b, f, g]
[id2, a, b, c][id2, a, c, e][id2, a, c, e][id2, a, c, e]
[id2, a, c, e]
shorthand
68
Records Are Evidence
[id1, a, b, c, d][id1, a, b, d][id1, a, x][id1, b, y]
[id1, a, b, c, d]
[id1, (3/4)a, (3/4)b, (1/4)c, (2/4)d, (1/4)x, (1/4)y]
not 0.25
69
New Evidence
[id1, a, b, c, d][id1, a, b, d][id1, a, x][id1, b, y]+ [id1, a, b, c, d]
[id1, a, b, c, d]
[id1, (4/5)a, (4/5)b, (2/5)c, (3/5)d, (1/5)x, (1/5)y]
[id1, (3/4)a, (3/4)b, (1/4)c, (2/4)d, (1/4)x, (1/4)y]+ [id1, a, b, c, d]
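The per-attribute fractions on the last two slides are relative frequencies over the evidence records. A minimal sketch of that computation; the function name `attribute_confidences` is illustrative, and records are reduced to lists of attribute symbols.

```python
from collections import Counter
from fractions import Fraction

def attribute_confidences(records):
    """Fraction of evidence records in which each attribute appears."""
    counts = Counter(attr for r in records for attr in set(r))
    return {attr: Fraction(n, len(records)) for attr, n in counts.items()}

evidence = [list("abcd"), list("abd"), list("ax"), list("by")]
before = attribute_confidences(evidence)

# New evidence arrives: one more observation of [a, b, c, d].
after = attribute_confidences(evidence + [list("abcd")])
print(before, after)
```

Note that d appears as 1/2 rather than 2/4 because `Fraction` reduces automatically; the values are the same as on the slide.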
70
No Ids
[a, b, c][a, b, d][a, x][c, d, y] [a, b, (1/2)c, (1/2)d]
[a, x][c, d, y]
[a, b, c][a, b, d][a, x][c, d, y]
0.7
0.3
71
No Ids
[a, b, c][a, b, d][a, x][c, d, y] [a, b, (1/2)c, (1/2)d]
[a, x][c, d, y]
[a, b, c][a, b, d][a, x][c, d, y]
0.7
0.3
[(2/3)a, (2/3)b, (2/3)c, (2/3)d, (1/3)y][a, x]
[a, b, (1/2)c, (1/2)d][a, x][c, d, y]
0.9
0.1
72
Queries?
[a, b, c][a, b, d][a, x][c, d, y] [a, b, (1/2)c, (1/2)d]
[a, x][c, d, y]
[a, b, c][a, b, d][a, x][c, d, y]
0.7
0.3
[(2/3)a, (2/3)b, (2/3)c, (2/3)d, (1/3)y][a, x]
[a, b, (1/2)c, (1/2)d][a, x][c, d, y]
0.9
0.1
Threshold = 0.5; Support = 2
Maximal Record
Example: [a, b, c, d]
74
Need Simpler Model?
75
Bonus Material
• Entity Resolution, Confidences,and their relationship to Information Privacy
76
Privacy
Nm: Alice; Ad: 32 Fox; Ph: 5551212 (1.0)
Nm: Alice; Ad: 32 Fox; Ph: 5551212 (1.0)
Nm: Alice; Ad: 32 Fox (1.0)
Nm: Alice; Ad: 32 Fox; Ph: 5551212 (0.7)
Nm: Alice; Ad: 32 Fox; Ph: 5551212; Ad: 14 Cat (1.0)
Bob
Alice
77
Leakage
Nm: Alice; Ad: 32 Fox; Ph: 5551212 (1.0)
Nm: Alice; Ad: 32 Fox; Ph: 5550000 (0.7)
Bob
Alice
L = 0.6 (between 0 and 1)
78
Multi-Record Leakage
Nm: Alice; Ad: 32 Fox; Ph: 5551212 (1.0)
Bob
Alice
LL = 0.9 (between 0 and 1, e.g., max L)
r1, L = 0.9
r2, L = 0.8
r3, L = 0.7
79
Q1: Added Vulnerability?
Bob
Alice
ΔLL = ??
r1 r2 r3 r4 p
r4 may cause Bob’s records to snap together!
80
Q2: Disinformation?
Bob
Alice
ΔLL = ??
r1 r2 r3 r4 (lies) p
What is the most cost-effective disinformation?
81
Q3: Verification?
Bob
Alice
p
What is the best fact to verify to increase confidence in the hypothesis?
r1, 0.9
r2, 0.8
r3, 0.7
...
hypothesis h (0.6)
82
Summary
• Entity resolution is critical
• Efficient resolution important
• Confidences are important, but how?
• ER is key aspect of info privacy
  – check www-db.stanford.edu for Swoosh paper & forthcoming paper
83
Thanks.
84
Extra Slides
85
Challenges
• Exponential growth in complexity
0.9 [a:v1, b:v2]
0.8 [a:v1, c:v3]
0.6 [b:v2, c:v3, d:v4]
0.75 [a:v1, b:v2, c:v3]
0.5 [a:v1, b:v2, c:v3, d:v4]
...
86
Three Ideas to Tame Complexity
• Thresholds
• Domination
• Packages
87
Thresholds
0.9 [a:v1, b:v2]
0.8 [a:v1, c:v3]
0.6 [b:v2, c:v3, d:v4]
0.75 [a:v1, b:v2, c:v3]
0.5 [a:v1, b:v2, c:v3, d:v4]
...
T = 0.7
88
Domination
0.9 [a:v1, b:v2, c:v3]
0.8 [a:v1, b:v2, c:v3]
0.8 [b:v2, c:v3]
...
90
Summary
• Our approach: pairwise, generic, Swoosh
• Confidences
• Making Tractable:
  – threshold
  – domination
  – packages
91
Thank You
92
What Swoosh Does NOT Do
• Hash table with every pair seen:
  – records ri, rj
  – compared values vi, vj
• Swoosh achieves the same effect without N^2 space
93
Swoosh Performance (I)
94
Swoosh Performance (II)
95
Swoosh Performance (III)