Adaptive Graphical Approach to Entity Resolution Dmitri V. Kalashnikov Stella Chen, Dmitri V. Kalashnikov, Sharad Mehrotra Computer Science Department University of California, Irvine Additional information is available at http://www.ics.uci.edu/~dvk Copyright © by Dmitri V. Kalashnikov, 2007 ACM IEEE Joint Conference on Digital Libraries 2007
Transcript
Page 1: Adaptive Graphical Approach to Entity Resolution Dmitri V. Kalashnikov Stella Chen, Dmitri V. Kalashnikov, Sharad Mehrotra Computer Science Department.

Adaptive Graphical Approach to Entity Resolution

Dmitri V. Kalashnikov

Stella Chen, Dmitri V. Kalashnikov, Sharad Mehrotra

Computer Science Department

University of California, Irvine

Additional information is available at http://www.ics.uci.edu/~dvk

Copyright © by Dmitri V. Kalashnikov, 2007

ACM IEEE Joint Conference on Digital Libraries 2007

Structure of the Talk

• Motivation

• Generic Disambiguation Framework
– High-level

• Entity Resolution Approach
– Part of the Framework

• Experiments

Entity Resolution & Data Cleaning

Figure: Raw Dataset(s), with references such as "... J. Smith ...", "... John Smith ...", "... Jane Smith ...", "MIT", and "Intel Inc. ?", next to a "nice" regular Database.

Analysis on bad data leads to wrong conclusions!

• Uncertainty
• Errors
• Missing data

Why do we need “Entity Resolution”?

Characters: Jane Smith (fresh Ph.D.), Tom (recruiter)

– “Hi, I’m Jane Smith. I’d like to apply for a faculty position.”

– “Wow! I am sure we will accept a strong candidate like that!”

– “OK, let me check something quickly …”

– ???

Figure labels: CiteSeer Rank; two publication lists (Publications: 1. …… 2. …… 3. ……)

Suspicious entries
– Let’s go to the DBLP website
  – which stores bibliographic entries of many CS authors
– Let’s check two people
  – “A. Gupta”
  – “L. Zhang”

What is the problem?

Figure: CiteSeer’s top-k most cited authors, next to two DBLP pages

Comparing raw and cleaned CiteSeer

Rank           Author               Location
1  (100.00%)   douglas schmidt      cs@wustl
2  (100.00%)   rakesh agrawal       almaden@ibm
3  (100.00%)   hector garciamolina  @
4  (100.00%)   sally floyd          @aciri
5  (100.00%)   jennifer widom       @stanford
6  (100.00%)   david culler         cs@berkeley
6  (100.00%)   thomas henzinger     eecs@berkeley
7  (100.00%)   rajeev motwani       @stanford
8  (100.00%)   willy zwaenepoel     cs@rice
9  (100.00%)   van jacobson         lbl@gov
10 (100.00%)   rajeev alur          cis@upenn
11 (100.00%)   john ousterhout      @pacbell
12 (100.00%)   joseph halpern       cs@cornell
13 (100.00%)   andrew kahng         @ucsd
14 (100.00%)   peter stadler        tbi@univie
15 (100.00%)   serge abiteboul      @inria

Raw CiteSeer’s Top-K Most Cited Authors

Cleaned CiteSeer’s Top-K Most Cited Authors

What is the lesson?

– Data should be cleaned first
  – E.g., determine the (unique) real authors of publications

– Solving such challenges is not always “easy”
  – This explains the large body of work on Entity Resolution

“Garbage in, garbage out” principle: making decisions based on bad data can lead to wrong results.

Typical Data Processing Flow

Diagram labels: Raw Data | Representation | Data Cleaning | Extraction | Analysis

Two most common types of Entity Resolution

Figure: references "... J. Smith ...", "... John Smith ...", "... Jane Smith ...", "MIT", "Intel Inc."

Fuzzy lookup
– match references to objects
– list of all objects is given
– [SDM’05], [TODS’06]

Fuzzy grouping
– group references that co-refer
– [IQIS’05], [JCDL’07]
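
To make the difference between the two problem types concrete, here is a minimal Python sketch. It is not the authors' method: the function names, the use of `difflib.SequenceMatcher` as the similarity measure, and the 0.8 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]; a stand-in for real feature functions."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_lookup(reference: str, known_objects: list[str]) -> str:
    """Lookup: the list of all objects is given, so pick the best match."""
    return max(known_objects, key=lambda obj: name_similarity(reference, obj))


def fuzzy_grouping(references: list[str], threshold: float = 0.8) -> list[list[str]]:
    """Grouping: no object list is given, so cluster the references themselves.

    Greedy assumption: a reference joins the first group whose first member
    is at least `threshold` similar; otherwise it starts a new group.
    """
    groups: list[list[str]] = []
    for ref in references:
        for group in groups:
            if name_similarity(ref, group[0]) >= threshold:
                group.append(ref)
                break
        else:
            groups.append([ref])
    return groups


print(fuzzy_lookup("Jon Smith", ["John Smith", "Intel Inc.", "MIT"]))
print(fuzzy_grouping(["John Smith", "Jon Smith", "MIT"]))
```

A reference such as "J. Smith" is close to both "John Smith" and "Jane Smith" under any string measure, which is why string similarity alone is not enough for either task.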

Structure of the Talk

• Motivation

• Generic Framework
– High-level

• Approach
– Part of the Framework

• Experiments

Traditional Approach to Entity Resolution

"J. Smith"

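Figure (Traditional Methods: features and context): "J. Smith" and "Jane Smith" are shown as references X and Y. Each has features (an e-mail address, f2, f3), and corresponding features are compared pairwise ("?").

s(X,Y) = f(X,Y)

Similarity = Similarity of Features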
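
As an illustration of s(X,Y) = f(X,Y), the sketch below scores two references by a weighted average of per-feature similarities. The feature names, the weights, and the string measure are hypothetical; the slides do not specify how f is computed.

```python
from difflib import SequenceMatcher

# Hypothetical per-feature weights (an assumption, not from the talk).
FEATURE_WEIGHTS = {"name": 0.6, "email": 0.3, "affiliation": 0.1}


def string_sim(a: str, b: str) -> float:
    """Similarity of two strings in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def feature_similarity(x: dict, y: dict) -> float:
    """f(X, Y): weighted average over the features present in both references.

    Features missing on either side are skipped and the remaining weights
    are renormalized; with no shared feature the score is 0.0.
    """
    total, weight_sum = 0.0, 0.0
    for feature, weight in FEATURE_WEIGHTS.items():
        if x.get(feature) and y.get(feature):
            total += weight * string_sim(x[feature], y[feature])
            weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


x = {"name": "J. Smith"}
y = {"name": "Jane Smith", "affiliation": "MIT"}
print(round(feature_similarity(x, y), 3))
```

Only the name is shared here, so the score rests entirely on one ambiguous feature. This is the limitation the next slides address.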

Key Observation: More Info is Available

A "nice" regular Database

Jane Smith

John Smith

J. Smith

=

Solution: Main Idea

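Figure, left panel (Traditional Methods: features and context): references X and Y, each with features f1, f2, f3, f4, compared feature by feature ("?").

Figure, right panel (Relationship Analysis on the ARG): a graph over X, Y and the nodes A, B, C, D, E, F.

New Paradigm: Traditional Methods + Relationship Analysis

s(X,Y) = c(X,Y) + γ f(X,Y)

Similarity = Similarity of Features + “Connection Strength”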
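
The combined score s(X,Y) = c(X,Y) + γ f(X,Y) can be sketched as follows. The slide does not define how the connection strength c is computed, so this example assumes one simple possibility: the sum, over all simple paths of bounded length between X and Y, of the product of edge weights. The graph encoding, `max_len`, and all function names are hypothetical.

```python
def undirected(edges):
    """Build an adjacency dict {node: {neighbor: weight}} from (u, v, w) triples."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


def connection_strength(graph, x, y, max_len=4):
    """c(X, Y), assumed here to be the sum over simple paths from x to y
    (at most max_len edges) of the product of the edge weights on the path."""
    def walk(node, visited, strength, depth):
        if node == y:
            return strength
        if depth == max_len:
            return 0.0
        total = 0.0
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in visited:
                total += walk(neighbor, visited | {neighbor}, strength * weight, depth + 1)
        return total

    return walk(x, {x}, 1.0, 0)


def combined_similarity(graph, x, y, f_xy, gamma=1.0):
    """s(X, Y) = c(X, Y) + gamma * f(X, Y), with f(X, Y) supplied by the caller."""
    return connection_strength(graph, x, y) + gamma * f_xy


g = undirected([("X", "A", 0.5), ("A", "Y", 0.5),
                ("X", "B", 0.5), ("B", "C", 0.5), ("C", "Y", 0.5)])
print(connection_strength(g, "X", "Y"))
print(combined_similarity(g, "X", "Y", f_xy=0.5, gamma=2.0))
```

With weights below 1, longer paths contribute less, so short chains of relationships count as stronger evidence than long ones.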

Illustrative Example

“Indirect connections”
– Suppose your co-worker’s name is “John White”
– Suppose you see on the Web, on my homepage
  – My name: “Dmitri …”
  – Somebody named: “John White”
– Who is the “John White”?
– From data you might establish a connection:
  – “Dmitri” might be connected to more “John White”’s…

Figure: Dmitri -[Visited]- JCDL’07 -[Visited]- <you> -[WorksAT]- <your ORG> -[WorksAT]- John White
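
The chain of relationships in this example can be found mechanically. The sketch below encodes the figure's nodes and edge labels as a small edge list and uses breadth-first search to recover the connection. The exact edge layout is an assumption read off the figure labels, and `find_connection` is a hypothetical helper.

```python
from collections import deque

# Edge list assumed from the figure labels: (node, relationship, node).
EDGES = [
    ("Dmitri", "Visited", "JCDL'07"),
    ("<you>", "Visited", "JCDL'07"),
    ("<you>", "WorksAT", "<your ORG>"),
    ("John White", "WorksAT", "<your ORG>"),
]


def find_connection(edges, source, target):
    """Return the shortest chain [node, label, node, ...] from source to target,
    or None if the two nodes are not connected. Edges are treated as undirected."""
    adjacency = {}
    for u, label, v in edges:
        adjacency.setdefault(u, []).append((label, v))
        adjacency.setdefault(v, []).append((label, u))

    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for label, neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [label, neighbor])
    return None


print(find_connection(EDGES, "Dmitri", "John White"))
```

If several people named "John White" exist, the one reachable through such a chain becomes a more plausible match than one with no connection at all.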

Key Features of the Framework

Our goal was to create a framework with the following properties:
– solid theoretical foundation
– lookup
– domain-independent framework
– self-tuning
– scales to large datasets
– robust under uncertainty
– high disambiguation quality

Structure of the Talk

• Motivation

• Generic Framework
– High-level

• Approach
– Part of the Framework

• Experiments

Approach

• Graph Creation
– Entity-Relationship Graph

• Consolidation Algorithm
– Bottom-up clustering

• Adaptiveness to data
– That is, self-tuning
– Supervised learning

• External Data
– To improve the quality further
– A theoretic possibility
  – Not tested yet

ER Graph Creation

Virtual Connected Subgraph (VCS)

Legend
– Nodes: person, publication, department, organization
– Edges: similarity, regular
– VCS

• VCS
– Similarity edges form VCSs
– Subgraphs in the ER graph

1. “Virtual”
– Contains only similarity edges

2. “Connected”
– A path between any 2 nodes

3. Completeness
– Adding more nodes/edges would violate (1) and (2)

• Logically, the Goal is
– Partition each VCS properly
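
The three properties above amount to saying that a VCS is a connected component of the graph restricted to similarity edges. A minimal sketch under that reading follows; the edge encoding and the function name are assumptions, not taken from the talk.

```python
from collections import defaultdict


def virtual_connected_subgraphs(edges):
    """Return the VCSs of an ER graph as a list of node sets.

    `edges` is an iterable of (u, v, kind) with kind "similarity" or "regular".
    Only similarity edges are kept ("virtual"), components are grown until no
    node can be added ("connected" and complete).
    """
    adjacency = defaultdict(set)
    for u, v, kind in edges:
        if kind == "similarity":
            adjacency[u].add(v)
            adjacency[v].add(u)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


edges = [
    ("J. Smith", "John Smith", "similarity"),
    ("J. Smith", "Jane Smith", "similarity"),
    ("John Smith", "MIT", "regular"),
    ("A. Gupta (paper 1)", "A. Gupta (paper 2)", "similarity"),
]
for vcs in virtual_connected_subgraphs(edges):
    print(sorted(vcs))
```

Each resulting node set is a candidate group of co-referring references; the consolidation step then decides how to partition it.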

Consolidation Algorithm: Merging
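
The talk describes the consolidation algorithm only as bottom-up clustering. As a rough sketch of that general technique (not the authors' exact merging rule), the code below starts from singleton clusters and repeatedly merges the most similar pair until no pair reaches a threshold. The average-link cluster similarity, the threshold, and the names are assumptions.

```python
from itertools import combinations


def consolidate(references, similarity, threshold):
    """Bottom-up (agglomerative) clustering of references.

    `similarity(x, y)` scores two references; clusters are compared by the
    average pairwise score (an assumption). Merging stops when the best
    pair of clusters scores below `threshold`.
    """
    clusters = [[ref] for ref in references]

    def cluster_sim(a, b):
        return sum(similarity(x, y) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), cluster_sim(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[i].extend(clusters[j])  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters


# Toy similarity table: the two "a" references co-refer, "b1" does not.
scores = {frozenset(("a1", "a2")): 0.9}
sim = lambda x, y: scores.get(frozenset((x, y)), 0.1)
print(consolidate(["a1", "a2", "b1"], sim, threshold=0.5))
```

In the framework, `similarity` would be the combined score of feature similarity and connection strength, applied within one VCS at a time.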

Self-tuning via Supervised Learning
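
The slide content for this step is not in the transcript, so the following is only a generic illustration of tuning a parameter from labeled data, not the authors' procedure. It assumes labeled pairs with known connection strength c and feature similarity f, and picks the γ from a candidate list that maximizes the F1 score of the rule c + γ f ≥ threshold. All names and the selection criterion are hypothetical.

```python
def f1_score(predicted, actual):
    """F1 of boolean predictions against boolean ground truth."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_gamma(labeled_pairs, gammas, threshold):
    """Pick the gamma (from `gammas`) that best separates the training pairs.

    `labeled_pairs` holds (c, f, same_entity) triples; a pair is predicted
    to co-refer when c + gamma * f >= threshold.
    """
    actual = [same for _, _, same in labeled_pairs]

    def score(gamma):
        predicted = [c + gamma * f >= threshold for c, f, _ in labeled_pairs]
        return f1_score(predicted, actual)

    return max(gammas, key=score)


training = [
    (0.5, 0.5, True),
    (0.0, 0.75, False),
    (0.25, 0.25, False),
    (0.75, 0.0, True),
]
print(tune_gamma(training, gammas=[0.0, 0.5, 2.0], threshold=0.75))
```

On this toy data γ = 0.5 classifies every pair correctly, while γ = 0 misses a true match and γ = 2 accepts every pair.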

Self-tuning (2)

External Knowledge to Improve Quality

Structure of the Talk

• Motivation

• Generic Framework
– High-level

• Approach
– Part of the Framework

• Experiments

Quality

“Context” is proposed in [Bhattacharya et al., DMKD’04]

The two algos are proposed in [Dong et al., SIGMOD’05]

Scalability & Efficiency

Impact of Random Relationships

Contact Information

• Info about our disambiguation project
– http://www.ics.uci.edu/~dvk

• Overall design
– Dmitri V. Kalashnikov
– dvk [at] domain

• Implementation details in JCDL’07
– Zhaoqi (Stella) Chen
– chenz [at] domain
– domain = ics.uci.edu

