Date post: 02-Feb-2021
Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su, Jennifer Widom, Tyson Condie, Nicolas Pombourcq, David Menestrina, Steven Whang
Transcript
  • Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets

    Hector Garcia-Molina, Stanford University

    Work with: Omar Benjelloun, Qi Su, Jennifer Widom, Tyson Condie, Nicolas Pombourcq, David Menestrina, Steven Whang

  • 2

    Entity Resolution

    e1: N: a, A: b, CC#: c, Ph: e

    e2: N: a, Exp: d, Ph: e

  • 3

    Applications

    • comparison shopping
    • mailing lists
    • classified ads
    • customer files
    • counter-terrorism

    e1: N: a, A: b, CC#: c, Ph: e

    e2: N: a, Exp: d, Ph: e

  • 4

    Outline

    • Why is ER challenging?
    • How is ER done?
    • Some ER work at Stanford
    • Confidences

  • 5

    Challenges (1)

    • No keys!
    • Value matching
      – “Kaddafi”, “Qaddafi”, “Kadafi”, “Kaddaffi”...
    • Record matching

    Nm: Tom, Ad: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777

    Nm: Thomas, Ad: 132 Main St, Ph: (650) 555-1212
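Value matching can be illustrated with a small approximate string comparison. This is a minimal sketch, assuming Python's standard `difflib` similarity ratio and a hypothetical cutoff of 0.8; the slides do not say which similarity measure is used.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two strings as the same value when their similarity ratio is high enough.

    The 0.8 threshold is an assumption made for this example.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


variants = ["Kaddafi", "Qaddafi", "Kadafi", "Kaddaffi"]

# Compare every spelling against the first one.
all_close = all(similar(variants[0], v) for v in variants[1:])
unrelated = similar("Kaddafi", "Smith")
```

With this cutoff the four spellings are treated as one value, while an unrelated name is not.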

  • 6

    Challenges (2)

    • Merging records

    Nm: Tom, Ad: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777

    Nm: Thomas, Ad: 132 Main St, Ph: (650) 555-1212, Zp: 94305

    Merged: Nm: Tom, Nm: Thomas, Ad: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777, Zp: 94305

  • 7

    Challenges (3)

    • Chaining

    Nm: Tom, Wk: IBM, Oc: laywer, Sal: 500K

    Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM

    Nm: Thomas, Ad: 123 Maim, Oc: lawyer

    Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer

    Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer, Sal: 500K

  • 8

    Challenges (4)

    • Un-merging

    Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer, Sal: 500K

    Too young to make 500K at IBM!!

  • 9

    Challenges (5)

    • Confidences in data

    Nm: Tom (0.9), Ad: 123 Main St (1.0), Ph: (650) 555-1212 (0.6), Ph: (650) 777-7777 (0.8)

    (0.8)

    • In value matching, match rules, merge: conf = ?

  • 10

    Taxonomy

    • Pairwise snaps vs. clustering
    • De-duplication vs. fidelity enhancement
    • Schema differences
    • Relationships
    • Exact vs. approximate
    • Generic vs. application specific
    • Confidences

  • 11

    Schema Differences

    Name: Tom, Address: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777

    FirstName: Tom, StreetName: Main St, StreetNumber: 123, Tel: (650) 777-7777

  • 12

    Pair-Wise Snaps vs. Clustering

    [Figure: one panel shows records r1 to r6 and s7 to s10; the other panel shows r1, r2, r3, r4r5, r6 and r7 to r10]

  • 13

    De-Duplication vs. Fidelity Enhancement

    R SB S

    N

  • 14

    Relationships

    [Figure: records r1, r2, r5 and r7 connected by relationship edges labeled brother, father, business, business]

  • 15

    Using Relationships

    [Figure: authors a1 to a5 linked to papers p1, p2, p5, p7; annotation: “same??”]

  • 16

    Exact vs Approximate ER

    [Figure: products are split into cameras, CDs, books, ...; ER is run on each group separately, producing resolved cameras, resolved CDs, resolved books, ...]

  • 17

    Exact vs Approximate ER

    [Figure: the terrorists list is sorted by age; the record (Widom, 30) is matched only against ages 25-35]
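The sort-then-window idea can be sketched as follows. The names and ages below are made up for illustration, and the slack of 5 years simply mirrors the 25-35 window for a record of age 30.

```python
from bisect import bisect_left, bisect_right

# Hypothetical list of (name, age) records; all values are invented.
watch_list = [("A", 52), ("B", 27), ("C", 33), ("D", 41), ("E", 25), ("F", 19)]

# Sort once by age so that candidate ranges are contiguous slices.
by_age = sorted(watch_list, key=lambda rec: rec[1])
ages = [age for _, age in by_age]


def candidates(age: int, slack: int = 5):
    """Return the records whose age lies within [age - slack, age + slack]."""
    lo = bisect_left(ages, age - slack)
    hi = bisect_right(ages, age + slack)
    return by_age[lo:hi]


# A query record of age 30 is compared only against ages 25-35.
window = candidates(30)
window_names = [name for name, _ in window]
```

This is approximate ER: a true match whose recorded age is off by more than the slack would be missed, in exchange for far fewer comparisons.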

  • 18

    Generic vs Application Specific

    • Match function M(r, s)
    • Merge function <r, s> => t

  • 19

    Taxonomy

    • Pairwise snaps vs. clustering
    • De-duplication vs. fidelity enhancement
    • Schema differences
    • Relationships
    • Exact vs. approximate
    • Generic vs. application specific
    • Confidences

  • 20

    Outline

    • Why is ER challenging?
    • How is ER done?
    • Some ER work at Stanford
    • Confidences

  • 21

    Taxonomy

    • Pairwise snaps vs. clustering
    • De-duplication vs. fidelity enhancement
    • Schema differences: No
    • Relationships: No
    • Exact vs. approximate
    • Generic vs. application specific
    • Confidences ... later on

  • 22

    Model

    Nm: Tom, Wk: IBM, Oc: laywer, Sal: 500K

    Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM

    Nm: Thomas, Ad: 123 Maim, Oc: lawyer

    Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer

    Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer, Sal: 500K

    r1 r3r2

    r4:

    M(r1, r2) M(r4, r3)

  • 23

    Correct Answer

    [Figure: records r1 to r6 and s7 to s10]

    ER(R) = all derivable records, minus “dominated” records

  • 24

    Question

    • What is the best sequence of match and merge calls that gives us the right answer?

  • 25

    Brute Force Algorithm

    • Input R:
      – r1 = [a:1, b:2]
      – r2 = [a:1, c:4, e:5]
      – r3 = [b:2, c:4, f:6]
      – r4 = [a:7, e:5, f:6]

  • 26

    Brute Force Algorithm

    • Input R:
      – r1 = [a:1, b:2]
      – r2 = [a:1, c:4, e:5]
      – r3 = [b:2, c:4, f:6]
      – r4 = [a:7, e:5, f:6]

    • Match all pairs:
      – r1 = [a:1, b:2]
      – r2 = [a:1, c:4, e:5]
      – r3 = [b:2, c:4, f:6]
      – r4 = [a:7, e:5, f:6]
      – r12 = [a:1, b:2, c:4, e:5]

  • 27

    Brute Force Algorithm

    • Match all pairs:
      – r1 = [a:1, b:2]
      – r2 = [a:1, c:4, e:5]
      – r3 = [b:2, c:4, f:6]
      – r4 = [a:7, e:5, f:6]
      – r12 = [a:1, b:2, c:4, e:5]

    • Repeat:
      – r1 = [a:1, b:2]
      – r2 = [a:1, c:4, e:5]
      – r3 = [b:2, c:4, f:6]
      – r4 = [a:7, e:5, f:6]
      – r12 = [a:1, b:2, c:4, e:5]
      – r123 = [a:1, b:2, c:4, e:5, f:6]
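The brute-force procedure can be sketched in a few lines. The slides do not state the match rule, so the one used here (same `a` value, or same `b` and same `c` value) is an assumption chosen because it reproduces r12 and r123 from the example; merge is assumed to be a plain union of attribute-value pairs.

```python
from itertools import combinations


def rec(**attrs):
    """Build a record as a frozenset of (attribute, value) pairs."""
    return frozenset(attrs.items())


def shares(r, s, attr):
    """True when r and s have a common value for the given attribute."""
    return bool({v for k, v in r if k == attr} & {v for k, v in s if k == attr})


def match(r, s):
    # Assumed rule: same 'a' value, or same 'b' and same 'c' value.
    return shares(r, s, "a") or (shares(r, s, "b") and shares(r, s, "c"))


def merge(r, s):
    # Assumed merge: union of all attribute-value pairs.
    return r | s


def brute_force_er(records):
    derived = set(records)
    while True:
        # Compare every pair, collect merges that are not already known.
        new = {merge(r, s) for r, s in combinations(derived, 2) if match(r, s)} - derived
        if not new:
            break
        derived |= new
    # Drop dominated records: those strictly contained in another derived record.
    return {r for r in derived if not any(r < s for s in derived)}


r1 = rec(a=1, b=2)
r2 = rec(a=1, c=4, e=5)
r3 = rec(b=2, c=4, f=6)
r4 = rec(a=7, e=5, f=6)
result = brute_force_er([r1, r2, r3, r4])
```

Under these assumptions the answer contains r123 and r4. Every round re-compares all pairs, including pairs already compared, which is the redundancy the next slides ask about.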

  • 28

    Question # 1

    Brute Force Algorithm

    • Input R:
      – r1 = [a:1, b:2]
      – r2 = [a:1, c:4, e:5]
      – r3 = [b:2, c:4, f:6]
      – r4 = [a:7, e:5, f:6]

    • Match all pairs:
      – r1 = [a:1, b:2]
      – r2 = [a:1, c:4, e:5]
      – r3 = [b:2, c:4, f:6]
      – r4 = [a:7, e:5, f:6]
      – r12 = [a:1, b:2, c:4, e:5]

    Can we delete r1, r2?

  • 29

    Question # 2

    Brute Force Algorithm

    • Match all pairs:
      – r1 = [a:1, b:2]
      – r2 = [a:1, c:4, e:5]
      – r3 = [b:2, c:4, f:6]
      – r4 = [a:7, e:5, f:6]
      – r12 = [a:1, b:2, c:4, e:5]

    • Repeat:
      – r1 = [a:1, b:2]
      – r2 = [a:1, c:4, e:5]
      – r3 = [b:2, c:4, f:6]
      – r4 = [a:7, e:5, f:6]
      – r12 = [a:1, b:2, c:4, e:5]
      – r123 = [a:1, b:2, c:4, e:5, f:6]

    Can we avoid comparisons?

  • 30

    ICAR Properties

    • Idempotence:
      – M(r1, r1) = true; <r1, r1> = r1

    • Commutativity:
      – M(r1, r2) = M(r2, r1)
      – <r1, r2> = <r2, r1>

    • Associativity:
      – <r1, <r2, r3>> = <<r1, r2>, r3>

  • 31

    More Properties

    • Representativity
      – If <r1, r2> = r3, then for any r4 such that M(r1, r4) is true we also have M(r3, r4) = true.

    [Figure: r1 and r2 merge into r3; r4 matches r1 and therefore also matches r3]
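The four properties can be checked mechanically on a sample of records. This is only a sketch: the match rule (records share a value for attribute `a`) and the union merge are assumptions picked because they satisfy the properties, and passing on a sample is not a proof.

```python
from itertools import product


def rec(**attrs):
    """Build a record as a frozenset of (attribute, value) pairs."""
    return frozenset(attrs.items())


def match(r, s):
    # Assumed rule: the records share a value for attribute 'a'.
    a_values = lambda x: {v for k, v in x if k == "a"}
    return bool(a_values(r) & a_values(s))


def merge(r, s):
    # Assumed merge: union of attribute-value pairs.
    return r | s


def icar_holds(records):
    """Check idempotence, commutativity, associativity, representativity on a sample."""
    for r in records:
        if not (match(r, r) and merge(r, r) == r):           # idempotence
            return False
    for r, s in product(records, repeat=2):
        if match(r, s) != match(s, r):                         # commutativity of match
            return False
        if match(r, s) and merge(r, s) != merge(s, r):         # commutativity of merge
            return False
    for r, s, t in product(records, repeat=3):
        # Associativity is checked for all triples, which is stricter than needed.
        if merge(r, merge(s, t)) != merge(merge(r, s), t):
            return False
        # Representativity: whatever matches r must also match <r, s>.
        if match(r, s) and match(r, t) and not match(merge(r, s), t):
            return False
    return True


sample = [rec(a=1, b=2), rec(a=1, c=4), rec(a=7, c=4), rec(a=7, b=2, e=5)]
ok = icar_holds(sample)
```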

  • 32

    ICAR Properties ⇒ Efficiency

    • Commutativity
    • Idempotence
    • Associativity
    • Representativity

    • Can discard records
    • ER result independent of processing order

  • 33

    Swoosh Algorithms

    • Record Swoosh
      – Merges records as soon as they match
      – Optimal in terms of record comparisons

    • Feature Swoosh
      – Remembers values seen for each feature
      – Avoids redundant value comparisons
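The "merge as soon as they match" idea can be sketched as a loop over a pending list and a resolved list. This is an illustration in the spirit of Record Swoosh, not the authors' implementation; the match rule (same `a`, or same `b` and `c`) and the union merge are assumptions carried over from the brute-force example.

```python
def rec(**attrs):
    """Build a record as a frozenset of (attribute, value) pairs."""
    return frozenset(attrs.items())


def shares(r, s, attr):
    return bool({v for k, v in r if k == attr} & {v for k, v in s if k == attr})


def match(r, s):
    # Assumed rule: same 'a' value, or same 'b' and same 'c' value.
    return shares(r, s, "a") or (shares(r, s, "b") and shares(r, s, "c"))


def merge(r, s):
    return r | s


def r_swoosh(records):
    pending = list(records)  # records still to be processed
    resolved = []            # records that do not match each other
    while pending:
        r = pending.pop(0)
        partner = next((s for s in resolved if match(r, s)), None)
        if partner is None:
            resolved.append(r)
        else:
            # Merge immediately and discard both source records.
            resolved.remove(partner)
            pending.append(merge(r, partner))
    return resolved


r1 = rec(a=1, b=2)
r2 = rec(a=1, c=4, e=5)
r3 = rec(b=2, c=4, f=6)
r4 = rec(a=7, e=5, f=6)
result = r_swoosh([r1, r2, r3, r4])
```

Discarding r1 and r2 once r12 exists is only safe when the ICAR properties hold; the merged record then stands in for both sources.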

  • 34

    Swoosh Performance

  • 35

    If ICAR Properties Do Not Hold?

    r1: [Joe Sr., 123 Main, DL:X]

    r23: [Joe Jr., 123 Main, Ph: 123, DL:Y]

    r12: [Joe Sr., 123 Main, Ph: 123, DL:X]

    r2: [Joe, 123 Main, Ph:123]

    r3: [Joe Jr., 123 Main, DL:Y]

  • 36

    If ICAR Properties Do Not Hold?

    r1: [Joe Sr., 123 Main, DL:X]

    r23: [Joe Jr., 123 Main, Ph: 123, DL:Y]

    r12: [Joe Sr., 123 Main, Ph: 123, DL:X]

    r2: [Joe, 123 Main, Ph:123]

    r3: [Joe Jr., 123 Main, DL:Y]

    Full Answer: ER(R) = {r12, r23, r1, r2, r3}

    Minus Dominated: ER(R) = {r12, r23}

  • 37

    If ICAR Properties Do Not Hold?

    r1: [Joe Sr., 123 Main, DL:X]

    r23: [Joe Jr., 123 Main, Ph: 123, DL:Y]

    r12: [Joe Sr., 123 Main, Ph: 123, DL:X]

    r2: [Joe, 123 Main, Ph:123]

    r3: [Joe Jr., 123 Main, DL:Y]

    Full Answer: ER(R) = {r12, r23, r1, r2, r3}

    Minus Dominated: ER(R) = {r12, r23}

    R-Swoosh Yields: ER(R) = {r12, r3} or {r1, r23}
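The order dependence can be reproduced with a small experiment. The match rule below (same address, and no conflicting Sr./Jr. suffix) and the merge rule (keep the more specific name, union the other fields) are assumptions made for this sketch; the field names are hypothetical.

```python
def suffix(name):
    """Return 'Sr.' or 'Jr.' if the name carries one, else None."""
    last = name.split()[-1]
    return last if last in ("Sr.", "Jr.") else None


def match(r, s):
    # Assumed rule: same address and the name suffixes do not conflict.
    su_r, su_s = suffix(r["name"]), suffix(s["name"])
    return r["addr"] == s["addr"] and (su_r is None or su_s is None or su_r == su_s)


def merge(r, s):
    # Union of the fields; keep the name that carries a suffix, if any.
    merged = {**s, **r}
    if suffix(r["name"]) is None:
        merged["name"] = s["name"]
    return merged


def r_swoosh(records):
    pending, resolved = list(records), []
    while pending:
        r = pending.pop(0)
        partner = next((s for s in resolved if match(r, s)), None)
        if partner is None:
            resolved.append(r)
        else:
            resolved.remove(partner)
            pending.append(merge(r, partner))
    return resolved


r1 = {"name": "Joe Sr.", "addr": "123 Main", "dl": "X"}
r2 = {"name": "Joe", "addr": "123 Main", "ph": "123"}
r3 = {"name": "Joe Jr.", "addr": "123 Main", "dl": "Y"}

forward = r_swoosh([r1, r2, r3])   # r2 is absorbed by r1, giving r12 and r3
backward = r_swoosh([r3, r2, r1])  # r2 is absorbed by r3, giving r1 and r23
```

Representativity fails here: r2 matches r3, but r12 does not, so whichever merge happens first hides r2 from the other record.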

  • 38

    Swoosh Without ICAR Properties

  • 39

    Distributed Swoosh

    [Figure: processors P1, P2, P3; input records r1, r2, r3, r4, r5, r6, ...]

  • 40

    Distributed Swoosh

    P1: r1, r3, r4, r6, ...

    P2: r1, r2, r4, r5, ...

    P3: r2, r3, r5, r6, ...

  • 41

    DSwoosh Performance

  • 42

    Outline

    • Why is ER challenging?
    • How is ER done?
    • Some ER work at Stanford
    • Confidences

  • 43

    Conclusion

    • ER is an old and important problem
    • Our approach: generic
    • Confidences
      – challenging
      – two ways to tame:
        • thresholds
        • packages

  • 44

    Thanks.

  • 45

    Generic Confidence Model

    • r1 = 0.7 [a:v1, b:v2, c:v3]

    match: 0.7[a, b, c], 0.9[a, c, d] ⇒ yes (or no)

    merge: 0.7[a, b, c], 0.9[a, c, d] ⇒ 0.65[a, b, c, d, x]

  • 46

    Problem: Properties May Not Hold

    • r1 = 0.9 [a, b, c]
    • r2 = 0.8 [a, d]
    • say confidences multiplied on merge
    • <r1, r2> = 0.72 [a, b, c, d]
    • <<r1, r2>, r1> = 0.648 [a, b, c, d]
    • <<r1, r1>, r2> = <r1, r2> = 0.72 [a, b, c, d]
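The arithmetic can be replayed directly. This sketch assumes, as the slide does, that confidences multiply on merge, and additionally assumes that merging a record with itself returns it unchanged (idempotence); exact fractions are used to avoid floating-point noise.

```python
from fractions import Fraction


def merge(r, s):
    """Merge two (confidence, attributes) records."""
    if r == s:
        return r  # assumed idempotence: a record merged with itself is unchanged
    # Assumed rule from the slide: confidences multiply, attributes are unioned.
    return (r[0] * s[0], r[1] | s[1])


r1 = (Fraction(9, 10), frozenset("abc"))
r2 = (Fraction(8, 10), frozenset("ad"))

r12 = merge(r1, r2)               # <r1, r2>
left = merge(r12, r1)             # <<r1, r2>, r1>
right = merge(merge(r1, r1), r2)  # <<r1, r1>, r2>, which equals <r1, r2>
```

Both groupings combine the same three records and yield the same attributes, yet the confidences differ (0.648 versus 0.72), so the merge result depends on how the merges are grouped.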

  • 47

    ER with Confidences

    • Very Expensive:
      – must compute “all derivations”
      – cannot delete records after they merge

    • What can we do??
      – thresholds
      – packages

  • 48

    Important Property

    • If conf(Rx) < threshold
    • Then for any Ry derived from Rx: conf(Ry) < threshold

    [Figure: derivation diagram over r1, r2, r3, r4; surviving labels: “C = 0.7”, “C”]

  • 49

    Thresholds - Example

    0.9 [a:v1, b:v2]
    0.8 [a:v1, c:v3]
    0.6 [b:v2, c:v3, d:v4]
    0.75 [a:v1, b:v2, c:v3]
    0.5 [a:v1, b:v2, c:v3, d:v4]
    ...

    T = 0.7
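Applied to the listed records, threshold pruning is a simple filter. The sketch assumes the threshold property holds, so a record below T can be dropped as soon as it is produced because nothing derived from it can climb back above T.

```python
T = 0.7  # threshold from the slide

# (confidence, attributes) records as listed on the slide.
records = [
    (0.9, {"a": "v1", "b": "v2"}),
    (0.8, {"a": "v1", "c": "v3"}),
    (0.6, {"b": "v2", "c": "v3", "d": "v4"}),
    (0.75, {"a": "v1", "b": "v2", "c": "v3"}),
    (0.5, {"a": "v1", "b": "v2", "c": "v3", "d": "v4"}),
]

# Keep only records that reach the threshold; the rest need no further merging.
kept = [r for r in records if r[0] >= T]
kept_confidences = [conf for conf, _ in kept]
```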

  • 50

    Thresholds - Example

    0.9 [a:v1, b:v2]
    0.8 [a:v1, c:v3]
    0.6 [b:v2, c:v3, d:v4]
    0.75 [a:v1, b:v2, c:v3]
    0.5 [a:v1, b:v2, c:v3, d:v4]
    ...

    T = 0.7

  • 51

    Goal: C-Swoosh

    base records → all possible merges → eliminate dominated → eliminate below threshold

  • 52

    Goal: C-Swoosh

    base records → all possible merges → eliminate dominated → eliminate below threshold

    earlier

  • 53

    Does Threshold Property Hold?

    • NO: records are evidence

    merge: 0.7[a, b, c], 0.8[a, b, c] ⇒ 0.9[a, b, c]

  • 54

    Does Threshold Property Hold?

    • YES: records are beliefs

    merge: 0.7[a, b, c], 0.8[a, b, c] ⇒ 0.8[a, b, c]

  • 55

    Simple Confidence Model

    • 0.7 [a, b]

    Alternate Worlds: [a, b], [a, b], [a, b, c], [a, b], [a, b], [a, b], [a, b, d], ???, ???, ???

  • 56

    Rules

    • 0.7[a, b, c], 0.7[a, b, c] ⇒ 0.7[a, b, c]
    • 0.7[a, b], 0.5[a, b] ⇒ 0.7[a, b]
    • 0.7[a, b, c], 0.5[a, b] ⇒ 0.7[a, b, c]
    • 0.7[a, b, c], 0.9[a, b] ⇒ 0.7[a, b, c], 0.9[a, b]
    • etc.
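The four rules are consistent with a single domination test: a record is dropped when another record holds all of its attributes with at least the same confidence. The sketch below encodes that reading, which is an interpretation of the rules rather than a definition given on the slide.

```python
def rec(conf, attrs):
    """Build a (confidence, attribute set) record."""
    return (conf, frozenset(attrs))


def dominates(r, s):
    """r dominates s when r has every attribute of s with at least s's confidence."""
    return r != s and r[1] >= s[1] and r[0] >= s[0]


def prune(records):
    """Remove exact duplicates and dominated records."""
    unique = set(records)
    return {s for s in unique if not any(dominates(r, s) for r in unique)}


rule1 = prune([rec(0.7, "abc"), rec(0.7, "abc")])
rule2 = prune([rec(0.7, "ab"), rec(0.5, "ab")])
rule3 = prune([rec(0.7, "abc"), rec(0.5, "ab")])
rule4 = prune([rec(0.7, "abc"), rec(0.9, "ab")])
```

In the last case neither record dominates the other: one has more attributes, the other more confidence, so both survive.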

  • 57

    Matches

    Records: 0.9[a, b, c], 0.8[a, b, d], [a, x], [c, d, y]

    Match with confidence 0.5

    [Figure: ten alternate worlds, numbered 1 to 10; surviving labels: [a,b,c], [a,b,d], [a,b,c,d]]

  • 58

    Matches

    Records: 0.9[a, b, c], 0.8[a, b, d], [a, x], [c, d, y]

    [Figure: ten alternate worlds, numbered 1 to 10; surviving labels: [a,b,c], [a,b,d], [a,b,c,d]]

    Result: 0.4[a, b, c, d], 0.9[a, b, c], 0.8[a, b, d], [a, x], [c, d, y]

  • 59

    Summary

    • Belief model well suited for ER
    • Evidence model is very complex and expensive!

  • 60

    Packages

    • Match does not use confidences
      – merge does compute confidences

    • 4 properties hold for deterministic attributes
      – e.g., = (ignoring confidences)

  • 61

    Partition Records

    – r1 = 0.9 [a:1, b:2]
    – r2 = 0.8 [a:1, c:4, e:5]
    – r3 = 0.7 [b:2, c:4, f:6]
    – r4 = 0.8 [a:7, e:5, f:6]
    – r5 = 0.9 [a:7, b:2]

    [Figure: r1, r2 and r3 form one package (via r12 and r123); r4 and r5 form another (r45)]
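Partitioning into packages can be sketched by running a merge-on-match loop on the attributes alone while tracking which base records end up together. The match rule (same `a`, or same `b` and `c`) is an assumption chosen to reproduce the grouping in the figure; confidences are carried in the input but deliberately ignored.

```python
def shares(r, s, attr):
    return bool({v for k, v in r if k == attr} & {v for k, v in s if k == attr})


def match(r, s):
    # Assumed rule on attributes only: same 'a', or same 'b' and same 'c'.
    return shares(r, s, "a") or (shares(r, s, "b") and shares(r, s, "c"))


def partition(named_records):
    """Group base records into packages, ignoring confidences."""
    pending = [
        (frozenset(attrs.items()), frozenset([name]))
        for name, (_conf, attrs) in named_records.items()
    ]
    packages = []
    while pending:
        attrs, members = pending.pop(0)
        partner = next((p for p in packages if match(attrs, p[0])), None)
        if partner is None:
            packages.append((attrs, members))
        else:
            packages.remove(partner)
            pending.append((attrs | partner[0], members | partner[1]))
    return sorted(sorted(members) for _, members in packages)


base = {
    "r1": (0.9, {"a": 1, "b": 2}),
    "r2": (0.8, {"a": 1, "c": 4, "e": 5}),
    "r3": (0.7, {"b": 2, "c": 4, "f": 6}),
    "r4": (0.8, {"a": 7, "e": 5, "f": 6}),
    "r5": (0.9, {"a": 7, "b": 2}),
}
packages = partition(base)
```

Confidence-aware resolution then only has to consider derivations inside each package, never across packages.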

  • 62

    Expand Packages

    – r1 = 0.9 [a:1, b:2]
    – r2 = 0.8 [a:1, c:4, e:5]
    – r3 = 0.7 [b:2, c:4, f:6]
    – r4 = 0.8 [a:7, e:5, f:6]
    – r5 = 0.9 [a:7, b:2]

    [Figure: the package built from r1, r2, r3 (via r12 and r123) is expanded back into r1, r2, r3, ...]

  • 63

    Conclusion

    • ER is an old and important problem
    • Our approach: generic
    • Confidences
      – challenging
      – two ways to tame:
        • thresholds
        • packages

  • Thanks.

  • 65

    Extra Slides

  • 66

    Taxonomy

    • Pairwise snaps vs. clustering
    • De-duplication vs. fidelity enhancement
    • Schema differences: No
    • Relationships: No
    • Exact vs. approximate
    • Generic vs. application specific
    • Confidences ... later on

  • 67

    One Confidence Model

    [id1, a, b, c, d], [id1, a, b, d], [id1, a, x], [id1, b, y]

    [id1, a, b, c, d]

    [id3, a, b, c], [id3, a, b, d], [id3, a, b, f, g], [id3, a, b, f, g]

    [id3, a, b, f, g]

    [id2, a, b, c], [id2, a, c, e], [id2, a, c, e], [id2, a, c, e]

    [id2, a, c, e]

    shorthand

  • 68

    Records Are Evidence

    [id1, a, b, c, d], [id1, a, b, d], [id1, a, x], [id1, b, y]

    [id1, a, b, c, d]

    [id1, (3/4)a, (3/4)b, (1/4)c, (2/4)d, (1/4)x, (1/4)y]

    not 0.25

  • 69

    New Evidence

    [id1, a, b, c, d], [id1, a, b, d], [id1, a, x], [id1, b, y] + [id1, a, b, c, d]

    [id1, a, b, c, d]

    [id1, (4/5)a, (4/5)b, (2/5)c, (3/5)d, (1/5)x, (1/5)y]

    [id1, (3/4)a, (3/4)b, (1/4)c, (2/4)d, (1/4)x, (1/4)y] + [id1, a, b, c, d]
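The fractions on these slides are per-value frequencies over the observed records for id1. A short sketch, assuming each record counts as one equally weighted observation:

```python
from collections import Counter
from fractions import Fraction


def evidence(records):
    """Fraction of records in which each value appears."""
    counts = Counter(value for record in records for value in set(record))
    return {value: Fraction(n, len(records)) for value, n in counts.items()}


# The four observations for id1 (the id itself is left out).
observed = [list("abcd"), list("abd"), list("ax"), list("by")]

before = evidence(observed)                  # a: 3/4, b: 3/4, c: 1/4, d: 2/4, ...
after = evidence(observed + [list("abcd")])  # one more observation of [a, b, c, d]
```

Each new record changes every fraction, including those of values it does not mention, which is one reason the evidence model is expensive to maintain.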

  • 70

    No Ids

    [a, b, c][a, b, d][a, x][c, d, y] [a, b, (1/2)c, (1/2)d]

    [a, x][c, d, y]

    [a, b, c][a, b, d][a, x][c, d, y]

    0.7

    0.3

  • 71

    No Ids

    [a, b, c][a, b, d][a, x][c, d, y] [a, b, (1/2)c, (1/2)d]

    [a, x][c, d, y]

    [a, b, c][a, b, d][a, x][c, d, y]

    0.7

    0.3

    [(2/3)a, (2/3)b, (2/3)c, (2/3)d, (1/3)y][a, x]

    [a, b, (1/2)c, (1/2)d][a, x][c, d, y]

    0.9

    0.1

  • 72

    Queries?

    [a, b, c][a, b, d][a, x][c, d, y] [a, b, (1/2)c, (1/2)d]

    [a, x][c, d, y]

    [a, b, c][a, b, d][a, x][c, d, y]

    0.7

    0.3

    [(2/3)a, (2/3)b, (2/3)c, (2/3)d, (1/3)y][a, x]

    [a, b, (1/2)c, (1/2)d][a, x][c, d, y]

    0.9

    0.1

    Threshold = 0.5; Support = 2

    Maximal Record

    Example: [a, b, c, d]

  • 73

    Queries?

    [a, b, c][a, b, d][a, x][c, d, y] [a, b, (1/2)c, (1/2)d]

    [a, x][c, d, y]

    [a, b, c][a, b, d][a, x][c, d, y]

    0.7

    0.3

    [(2/3)a, (2/3)b, (2/3)c, (2/3)d, (1/3)y][a, x]

    [a, b, (1/2)c, (1/2)d][a, x][c, d, y]

    0.9

    0.1

    Threshold = 0.5; Support = 2

    Maximal Record

    Example: [a, b, c, d]

  • 74

    Need Simpler Model?

  • 75

    Bonus Material

    • Entity Resolution, Confidences, and their relationship to Information Privacy

  • 76

    Privacy

    Nm: Alice, Ad: 32 Fox, Ph: 5551212 (1.0)

    Nm: Alice, Ad: 32 Fox, Ph: 5551212 (1.0)

    Nm: Alice, Ad: 32 Fox (1.0)

    Nm: Alice, Ad: 32 Fox, Ph: 5551212 (0.7)

    Nm: Alice, Ad: 32 Fox, Ph: 5551212, Ad: 14 Cat (1.0)

    Bob

    Alice

  • 77

    Leakage

    Nm: Alice, Ad: 32 Fox, Ph: 5551212 (1.0)

    Nm: Alice, Ad: 32 Fox, Ph: 5550000 (0.7)

    Bob

    Alice

    L = 0.6 (between 0 and 1)

  • 78

    Multi-Record Leakage

    Nm: Alice, Ad: 32 Fox, Ph: 5551212 (1.0)

    Bob

    Alice

    LL = 0.9 (between 0 and 1, e.g., max L)

    r1, L = 0.9
    r2, L = 0.8
    r3, L = 0.7

  • 79

    Q1: Added Vulnerability?

    Bob

    Alice

    ΔLL = ??

    r1 r2 r3 r4 p

    r4 may cause Bob’s records to snap together!

  • 80

    Q2: Disinformation?

    Bob

    Alice

    ΔLL = ??

    r1 r2 r3 r4 (lies) p

    What is the most cost-effective disinformation?

  • 81

    Q3: Verification?

    Bob

    Alice

    p

    What is the best fact to verify to increase confidence in the hypothesis?

    r1, 0.9
    r2, 0.8
    r3, 0.7
    ...

    hypothesis h (0.6)

  • 82

    Summary

    • Entity resolution is critical
    • Efficient resolution is important
    • Confidences are important, but how?
    • ER is a key aspect of info privacy
      – check www-db.stanford.edu for the Swoosh paper and a forthcoming paper

  • 83

    Thanks.

  • 84

    Extra Slides

  • 85

    Challenges

    • Exponential growth in complexity

    0.9 [a:v1, b:v2]
    0.8 [a:v1, c:v3]
    0.6 [b:v2, c:v3, d:v4]
    0.75 [a:v1, b:v2, c:v3]
    0.5 [a:v1, b:v2, c:v3, d:v4]
    ...

  • 86

    Three Ideas to Tame Complexity

    • Thresholds
    • Domination
    • Packages

  • 87

    Thresholds

    0.9 [a:v1, b:v2]
    0.8 [a:v1, c:v3]
    0.6 [b:v2, c:v3, d:v4]
    0.75 [a:v1, b:v2, c:v3]
    0.5 [a:v1, b:v2, c:v3, d:v4]
    ...

    T = 0.7

  • 88

    Domination

    0.9 [a:v1, b:v2, c:v3]
    0.8 [a:v1, b:v2, c:v3]
    0.8 [b:v2, c:v3]
    ...

  • 89

    Domination

    0.9 [a:v1, b:v2, c:v3]
    0.8 [a:v1, b:v2, c:v3]
    0.8 [b:v2, c:v3]
    ...

  • 90

    Summary

    • Our approach: pairwise, generic, Swoosh
    • Confidences
    • Making Tractable:
      – threshold
      – domination
      – packages

  • 91

    Thank You

  • 92

    What Swoosh Does NOT Do

    • Hash table with every pair seen:
      – records ri, rj
      – compared values vi, vj

    • Swoosh achieves the same effect without N² space

  • 93

    Swoosh Performance (I)

  • 94

    Swoosh Performance (II)

  • 95

    Swoosh Performance (III)
