Page 1

CS295: Info Quality & Entity Resolution

University of California, Irvine
Fall 2010

Course introduction slides
Instructor: Dmitri V. Kalashnikov

Copyright © Dmitri V. Kalashnikov, 2010

Page 2

Class Organizational Issues

• Class Webpage
– http://www.ics.uci.edu/~dvk/CS295.html
– These slides will be posted there

• Rescheduling of Class Time
– Now: Tue, Thu 3:30-4:50 PM @ ICS209 (twice a week)
– New: Thu 3:00-5:20 PM @ ? (once a week)
– 10 min break in the middle
– Easier on students
– Is this time slot OK?

Page 3

Class Structure

• Student Presentation-Based Class
– Students will present publications
– Papers cover recent trends (not comprehensive)
– Prepare slides; slides will be collected after the presentation
– Discussion

• Final grade
– Quality of slides
– Quality of presentations
– Participation & attendance
– We are a small class, so please attend!
– No exams!

Page 4

Tentative Syllabus

Page 5

Presentation

• Topics
– All papers are split into “topics”
– One topic is covered per day

• Student Presentations
– Each student will choose a day/topic to present
– A student presents on only 1 day in the quarter (if possible), to reduce workload
– A student will present 2(?) papers on that day
– Please start preparing early!

Page 6

Tentative List of Publications

Page 7

How to present a paper

• Present the “high-level” idea
– Main idea of the paper [should be clear]

• Present the technical depth of the techniques
1) Cover the techniques in detail
2) Try to analyze the paper [if you can]
– Discuss what you like about the paper
– Criticize the technique
– Do you see flaws/weaknesses in the proposed methodology?
– Do you think the techniques can be improved?
– Do you think the authors should have included additional info/algorithms?

• Present the experiments
– Explain datasets/setup/graphs (analyze results)
– Criticize the experiments [if you can]
– Large enough data? More experiments needed? Unexplained trends? etc.

Page 8

Who wants to present first?

Page 9

Talk Overview

• Class Organizational Issues
• Intro to Data Quality & Entity Resolution

Page 10

Data Processing Flow

Flow: Data → Analysis → Decisions

• Data
– Organizations & people collect large amounts of data
– Many types of data: textual, semi-structured, multimodal

• Analysis
– Data is analyzed for a variety of purposes
– Automated analysis: Data Mining
– Human in the loop: OLAP
– Ad hoc

• Decisions
– Analysis for decision making
– Business decision making
– Etc.

Page 11

Quality of decisions depends on quality of data

• Quality of data is critical

• $1 Billion market – Estimated by Forrester Group

• Data Quality
– Very old research area
– But no comprehensive textbook exists yet!

Quality of Data → Quality of Analysis → Quality of Decisions

Page 12

Example of Analysis on Bad Data: CiteSeer

Figure: CiteSeer’s list of the top-k most cited authors, and the corresponding DBLP entries.

• Unexpected entries
– Let’s check two people in DBLP
– “A. Gupta”
– “L. Zhang”

• Analysis on bad data can lead to incorrect results.
• Fix errors before analysis.

More than 80% of researchers working on data mining projects spend more than 40% of their project time on cleaning and preparation of data.

Page 13

*Why* Do Data Quality Issues Arise?

Types of DQ Problems
– Ambiguity
– Uncertainty
– Erroneous data values
– Missing values
– Duplication
– etc.

Page 14

Example of Ambiguity

– Ambiguity
– Categorical data
– “Location: Washington”
– D.C.? State? Something else?

Page 15

Example of Uncertainty

– Uncertainty
– Numeric data
– “John’s salary is between $50K and $80K”
– Query: find all people with salary > $70K
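One illustrative way to answer such a query is to model the uncertain value probabilistically. The sketch below assumes, purely for illustration, a uniform distribution over the stated range; the function name is made up and not part of the slides.

# Uncertain numeric value modeled as a uniform range (assumption for illustration).
def prob_above(low, high, threshold):
    """P(value > threshold) if the value is uniform on [low, high]."""
    if threshold <= low:
        return 1.0
    if threshold >= high:
        return 0.0
    return (high - threshold) / (high - low)

# "John's salary is between $50K and $80K"; query: salary > $70K
print(prob_above(50_000, 80_000, 70_000))  # 0.333... -> John qualifies with probability 1/3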

Page 16

Example of Erroneous Data

– Erroneous data values
– <Name: “John Smixth”, Salary: “Irvine, CA”, Loc: $50K>

Page 17

Example of Missing Values

– Missing values
– <Name: “John Smixth”, Salary: null, Loc: null>

Page 18

Example of Duplication

– Duplication
– <1/01/2010, “John Smith”, “Irvine, CA”, 50k>
– <6/06/2010, “John Smith”, “CA”, 55k>
– Same? Different?

Page 19: CS295: Info Quality & Entity Resolution University of California, Irvine Fall 2010 Course introduction slides Instructor: Dmitri V. Kalashnikov Copyright.

Inherent Problems vs Errors in Inherent Problems vs Errors in PreprocessingPreprocessing

– Inherent Problems with Data
– The dataset contains errors (as in the previous slide)
– <Name: “John Smixth”, Salary: null, Loc: null>

– Errors in Preprocessing
– The dataset might not contain errors
– But the preprocessing algorithm (e.g., extraction) fails
– Text: “John Smith lives in Irvine CA at 100 main st, his salary is $25K”
– Extractor output: <Person: Irvine, Location: CA, Salary: $100K, Address: null>

Page 20

*When* Do Data Quality Issues Arise? Past.

Page 21

*When* Do DQ Issues Arise? Present.

– Automated generation of DB content
– Prime reason for DQ issues nowadays
– Analyzing unstructured or semi-structured raw data (text / Web)
– Extraction

– Merging DBs or data sources
– Duplicate information
– Inconsistent information
– Missing data

– Inherent problems with well-structured data
– As in the shown examples

Page 22

Data Flow wrt Data Quality

Flow: Raw Data → Handle Data Quality → Analysis → Decisions

Two general ways to deal with DQ problems

1) Resolve them and then apply analysis on clean data
– Classic Data Quality approach

2) Account for them in the analysis on dirty data
– E.g., put the data into a probabilistic DBMS
– Often not considered as DQ

Page 23

Resolve only what is needed!

Flow: Raw Data → Handle Data Quality → Analysis → Decisions

– Data might have many different (types of) problems in it
– Solve only those that might impact your analysis

– Example
– Publication DB: <paper_id, author_name, title, venue>
– <1, John Smith, “Title 1…”, SIGMxOD>
– <2, John Smith, “Title 2…”, SIGIR>
– All papers by John Smith
– Venues might have errors
– The rest is accurate
– Task: count papers => Do not fix venues!!!

Page 24

Focus of this class: Entity Resolution (ER)

• ER is a very common Data Quality challenge
• Disambiguating uncertain references to objects
• Multiple variations:

− Record Linkage [winkler:tr99]
− Merge/Purge [hernandez:sigmod95]
− De-duplication [ananthakrishna:vldb02, sarawagi:kdd02]
− Hardening soft databases [cohen:kdd00]
− Reference Matching [mccallum:kdd00]
− Object identification [tejada:kdd02]
− Identity uncertainty [pasula:nips02, mccallum:iiweb03]
− Coreference resolution [ng:acl02]
− Fuzzy match and fuzzy grouping [@microsoft]
− Name Disambiguation [han:jcdl04, li:aaai04]
− Reference Disambiguation [km:siam05]
− Object Consolidation [mccallum:kdd03wkshp, chen:iqis05]
− Reference Reconciliation [dong:sigmod05]

Ironically, some of them are the same (duplication)!

Page 25

Entity Resolution: Lookup and Grouping

Figure: references “J. Smith”, “John Smith”, “Jane Smith” to be resolved against objects (MIT, Intel Inc.).

• Lookup
– A list of all objects is given
– Match references to objects

• Grouping
– No list of objects is given
– Group references that co-refer

Page 26

When does the ER challenge arise?

• Merging multiple data sources (even structured ones)
– “J. Smith” in DataBase1
– “John Smith” in DataBase2
– Do they co-refer?

• References to people/objects/organizations in raw data
– Who is the “J. Smith” mentioned as an author of a publication?

• Location ambiguity
– “Washington” (D.C.? WA? Other?)

• Automated extraction from text
– “He’s got his PhD/BS from UCSD and UCLA respectively.”
– PhD: UCSD or UCLA?

• Natural Language Processing (NLP)
− “John met Jim and then he went to school”
− “he”: John or Jim?

Page 27

Standard Approach to Entity Resolution

– Choosing features to use
– For comparing two references

– Choosing blocking functions
– To avoid comparing all pairs

– Choosing a similarity function
– Outputs how similar two references are

– Choosing a problem representation
– How to represent it internally, e.g., as a graph

– Choosing a clustering algorithm
– How to group references

– Choosing a quality metric
– In experimental work
– How to measure the quality of the results
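As a rough illustration of how these six choices fit together, here is a skeleton of a pair-based ER pipeline in Python. Every function passed in (block_key, similarity, cluster, evaluate) is a placeholder for one of the design choices above, not the API of any particular system.

def entity_resolution(records, block_key, similarity, threshold, cluster, evaluate):
    """Skeleton of the standard pipeline; each argument is one design choice."""
    # 1. Features are assumed to be computed inside block_key() / similarity().
    # 2. Blocking: group records so that we do not compare all pairs.
    blocks = {}
    for r in records:
        blocks.setdefault(block_key(r), []).append(r)
    # 3. Similarity: score candidate pairs within each block.
    matches = []
    for block in blocks.values():
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                if similarity(block[i], block[j]) > threshold:
                    matches.append((block[i], block[j]))
    # 4-5. Representation + clustering: group co-referring records
    #      (here the representation is simply the list of matching pairs).
    clusters = cluster(records, matches)
    # 6. Quality metric: measure the result (needs ground truth in experiments).
    return clusters, evaluate(clusters)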

Page 28

Inherent Features: Standard Approach

s(u,v) = f(u,v)   (“similarity function”, “feature-based similarity”)

Figure: two references u (“J. Smith”) and v (“John Smith”) compared feature by feature (name, feature 2, feature 3, e-mail).

• Deciding if two references u and v co-refer
• By analyzing their features
• If s(u,v) > t, then u and v are declared to co-refer

Page 29

Advanced Approach: Information Used

Figure: in addition to comparing the inherent features of u (“J. Smith”) and v (“John Smith”), the advanced approach also uses the context around the references (e.g., “Jane Smith”).

Information that can be used:

• Inherent features
• Context features
• Entity-relationship graph (social network)
• The Web
• External data
– Public datasets, e.g., DBLP, IMDB
– Encyclopedias, e.g., Wikipedia
– Ontologies, e.g., DMOZ
• The dataset itself
• Asking a person
– Not frequently
– Might not work well
• (Conditional) functional dependencies & consistency constraints

Page 30

Blocking Functions

• Comparing each reference pair
− N >> 1 references in the dataset
− Each can co-refer with the remaining N-1
− Complexity N(N-1)/2 is too high…

• Blocking functions
– A fast function that finds potential matches quickly (see the sketch after the figure below)

Figure: candidate pairs for references R1–R6 under four settings: Ground Truth, Naïve (all pairs for R1), BF1 (one extra pair), and BF2 (one pair lost, one extra).
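A small sketch of the complexity argument above: with a blocking function (here, hypothetically, the first letter of the last token of the name), only pairs sharing a block key are compared instead of all N(N-1)/2 pairs. The records are invented.

from itertools import combinations
from collections import defaultdict

records = ["J. Smith", "John Smith", "Jane Smith", "A. Gupta", "L. Zhang", "Li Zhang"]

all_pairs = list(combinations(records, 2))          # N(N-1)/2 = 15 comparisons

blocks = defaultdict(list)                          # toy blocking function:
for r in records:                                   # first letter of the last token
    blocks[r.split()[-1][0].lower()].append(r)

blocked_pairs = [p for b in blocks.values() for p in combinations(b, 2)]
print(len(all_pairs), len(blocked_pairs))           # 15 vs 4 candidate pairs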

Page 31

Blocking Functions (contd)

• Multiple BFs could be used
− Better if independent
− Use different record fields for blocking

• Examples − from [Winkler 1984]
1) Fst3(ZipCode) + Fst4(NAME)
2) Fst5(ZipCode) + Fst6(Street name)
3) 10-digit phone #
4) Fst3(ZipCode) + Fst4(LngstSubstring(NAME))
5) Fst10(NAME)

− BF4 is the best single blocking function
− BF1 + BF4 is the best pair
− BF1 + BF5 is the second-best pair
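A hedged sketch of how such blocking keys could be built, reading Fst-n as “the first n characters”; the field names and the multi-key indexing scheme are illustrative assumptions, not Winkler's implementation.

from collections import defaultdict

def fst(s, n):
    """First n characters of a field, as in the Fst-n notation above."""
    return (s or "")[:n].upper()

def blocking_keys(rec):
    """A few of the Winkler-style keys listed above; field names are assumed."""
    return [
        fst(rec["zip"], 3) + fst(rec["name"], 4),     # key 1
        fst(rec["zip"], 5) + fst(rec["street"], 6),   # key 2
        rec["phone"],                                 # key 3
        fst(rec["name"], 10),                         # key 5 (key 4 would need a
    ]                                                 # longest-substring helper)

def build_blocks(records):
    """Index each record under every key; a record may land in several blocks."""
    blocks = defaultdict(set)
    for i, rec in enumerate(records):
        for key in blocking_keys(rec):
            blocks[key].add(i)
    return blocks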

Page 32

BFs: Other Interpretations

• The dataset is split into smaller blocks
− Matching operations are performed within blocks
− A block is a clique

• Blocking as applying somebody else’s technique first
− It will not only find candidates
− But also merge many (even most) cases
− It will only leave the “tough cases”
− Apply your technique on these “tough cases”

Page 33

Basic Similarity Functions

Figure: references u (“J. Smith”) and v (“John Smith”) compared attribute by attribute (name, feature 2, feature 3, e-mail), with per-attribute similarities 0.8, 0.2, 0.3, and 0.0.

− How to compare attribute values
− Lots of metrics, e.g., Edit Distance, Jaro, ad hoc

− Cross-attribute comparisons

− How to combine attribute similarities
− Many methods, e.g., supervised learning, ad hoc

− How to mix it with other types of evidence
− Not only inherent features

s(u,v) = f(u,v)
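One common way to realize s(u,v) = f(u,v), shown here only as an illustration, is a weighted combination of the per-attribute similarities; the weights are hypothetical (they could also be learned), and the scores mirror the 0.8/0.2/0.3/0.0 values in the figure.

def combine(attr_sims, weights):
    """Weighted average of per-attribute similarities (one of many possible f's)."""
    total = sum(weights)
    return sum(s * w for s, w in zip(attr_sims, weights)) / total

# Per-attribute scores from the figure: name, feature 2, feature 3, e-mail.
attr_sims = [0.8, 0.2, 0.3, 0.0]
weights   = [3.0, 1.0, 1.0, 2.0]   # hypothetical weights

s_uv = combine(attr_sims, weights)
print(s_uv, s_uv > 0.5)            # ~0.41, False -> below a hypothetical threshold t = 0.5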

Page 34

Standardization & Parsing

• Standardization
− Converting attribute values into the same format
− For proper comparison

• Examples
− Convert all dates into MM/DD/YYYY format
− So that “Jun 1, 2010” matches “6/01/10”
− Convert times into HH:mm:ss format
− So that 3:00PM and 15:00 match
− Convert Doctor -> Dr.; Professor -> Prof.

• Parsing
− Subdividing values into proper fields
− “Dr. John Smith Jr.” becomes <PREF: Dr.; FNAME: John; LNAME: Smith; SFX: Jr.>
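A minimal standardization sketch for the date example above, using only the Python standard library; the list of accepted input formats is an assumption.

from datetime import datetime

INPUT_FORMATS = ["%b %d, %Y", "%m/%d/%y", "%m/%d/%Y"]   # assumed source formats

def standardize_date(value):
    """Convert a date string into MM/DD/YYYY so that equal dates compare equal."""
    for fmt in INPUT_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    return value   # leave unrecognized values untouched

print(standardize_date("Jun 1, 2010"))   # 06/01/2010
print(standardize_date("6/01/10"))       # 06/01/2010 -- now the two match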

Page 35

Example of Similarity Function

• Edit Distance (1965)
− Compares two strings
− The minimum number of edits (insertions, deletions, substitutions) needed to transform one string into another
− Ex.: “Smith” vs. “Smithx”: one deletion is needed
− Dynamic programming solution

• Example of an Advanced Version
− Assign different costs to insertions, deletions, substitutions
− Some errors are more expensive (unlikely) than others
− The distance d(s1,s2) is the minimum-cost transformation
− Learn the costs from data (supervised learning)
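A compact dynamic-programming sketch of the basic unit-cost edit distance described above (not the learned-cost variant).

def edit_distance(s1, s2):
    """Minimum number of insertions, deletions, and substitutions (unit costs)."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                     # delete all of s1[:i]
    for j in range(n + 1):
        d[0][j] = j                     # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[m][n]

print(edit_distance("Smith", "Smithx"))   # 1 -- one deletion, as in the example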

Page 36

Clustering

• Lots of methods exist
− Really a lot!

• Basic Methods
− Hierarchical
− Agglomerative: decide a threshold t; if s(u,v) > t then merge(u,v)
− Partitioning

• Advanced Issues
− How to decide the number of clusters K
− How to handle negative evidence & constraints
− Two-step clustering & cluster refinement
− Etc.; a very vast area

Figure: a graph over nodes a, b, c, d, e, f with positive (+1) and negative (-1) edge evidence between them.
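A sketch of the simplest agglomerative rule above (“if s(u,v) > t then merge(u,v)”), implemented with union-find so that merges are transitive; the items and pair scores are invented for illustration.

def cluster_by_threshold(items, scored_pairs, t):
    """Merge u and v whenever s(u,v) > t; returns the resulting groups."""
    parent = {x: x for x in items}

    def find(x):                          # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v, s in scored_pairs:
        if s > t:
            parent[find(u)] = find(v)     # merge the two clusters

    groups = {}
    for x in items:
        groups.setdefault(find(x), []).append(x)
    return list(groups.values())

pairs = [("a", "b", 0.9), ("b", "c", 0.8), ("d", "e", 0.7), ("c", "d", 0.2)]
print(cluster_by_threshold("abcdef", pairs, t=0.5))   # [['a','b','c'], ['d','e'], ['f']]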

Page 37

Quality Metrics

• Purity of clusters
− Do clusters contain mixed elements? (~precision)

• Completeness of clusters
− Does each cluster contain all of its elements? (~recall)

• Tradeoff between them
− A single metric that combines them (~F-measure)

Figure: four example clusterings of elements labeled 1 and 2: Ideal Clustering, One Misassigned (Example 1), Half Misassigned, and One Misassigned (Example 2).

Page 38

Precision, Recall, and F-measure

• Assume
− You perform an operation to find relevant (“+”) items
− E.g., Google “UCI” or some other terms
− R is the ground-truth set, i.e., the set of relevant entries
− A is the answer returned by some algorithm

• Precision
− P = |A ∩ R| / |A|
− Which fraction of the answer A consists of correct (“+”) elements

• Recall
− R = |A ∩ R| / |R|
− Which fraction of the ground-truth elements was found (is in A)

• F-measure
− F = 2 / (1/P + 1/R), the harmonic mean of precision and recall
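The three definitions above in a few lines of code; the sets A and R reuse the example from the next slide.

def precision_recall_f(answer, relevant):
    """P = |A∩R|/|A|, R = |A∩R|/|R|, F = harmonic mean of P and R."""
    hits = len(answer & relevant)
    p = hits / len(answer) if answer else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = 0.0 if hits == 0 else 2 / (1 / p + 1 / r)
    return p, r, f

A = {"a1", "a3", "a5", "a7", "a9"}                 # algorithm's answer
R = {"a1", "a2", "a5", "a6", "a9", "a10"}          # ground-truth ("+") items
print(precision_recall_f(A, R))                    # (0.6, 0.5, 0.545...)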

Page 39

Quality Metric: Pairwise F-measure

− Example
− R = {a1, a2, a5, a6, a9, a10}
− A = {a1, a3, a5, a7, a9}
− A ∩ R = {a1, a5, a9}
− Pre = |A ∩ R| / |A| = 3/5
− Rec = |A ∩ R| / |R| = 3/6 = 1/2

− Pairwise F-measure
− “+” are pairs that should be merged
− “−” are pairs that should not be merged
− Now, given an answer, we can compute Pre, Rec, and F-measure

− A widely used metric in ER
− But a bad choice in many circumstances!
− What is a good choice?
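A sketch of how pairwise F-measure can be computed: treat every pair of references placed in the same cluster as a predicted “+” pair, every pair sharing a ground-truth entity as a true “+” pair, and apply the precision/recall/F definitions from the previous slide. The clusterings below are invented.

from itertools import combinations

def positive_pairs(clusters):
    """All unordered pairs that fall in the same cluster (the '+' pairs)."""
    return {frozenset(p) for c in clusters for p in combinations(c, 2)}

def pairwise_prf(predicted_clusters, true_clusters):
    A, R = positive_pairs(predicted_clusters), positive_pairs(true_clusters)
    hits = len(A & R)
    p = hits / len(A) if A else 0.0
    r = hits / len(R) if R else 0.0
    f = 0.0 if hits == 0 else 2 / (1 / p + 1 / r)
    return p, r, f

truth     = [["a1", "a2", "a3"], ["b1", "b2"]]
predicted = [["a1", "a2"], ["a3", "b1", "b2"]]
print(pairwise_prf(predicted, truth))   # (0.5, 0.5, 0.5)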

Page 40

Web People Search (WePS)

Figure:
1. Query Google with a person name (“John Smith”)
2. Top-K webpages (related to any John Smith)
3. Task: cluster the webpages (a cluster per person: Person 1, Person 2, Person 3, …, Person N; the number of people is unknown beforehand)

• Web domain
• Very active research area
• Many problem variations
− E.g., context keywords

Page 41

Recall that…

Figure: references “J. Smith”, “John Smith”, “Jane Smith” to be resolved against objects (MIT, Intel Inc.).

• Lookup
– A list of all objects is given
– Match references to objects

• Grouping
– No list of objects is given
– Group references that co-refer

WePS is a grouping task

Page 42

User Interface

Figure: the system’s user interface, showing the user input and the results.

Page 43

System Architecture

Figure: system architecture. The query (“John Smith”) goes to a search engine, which returns the Top-K webpages. Preprocessing (TF/IDF, NE/URL extraction, ER graph) produces preprocessed webpages; clustering, aided by auxiliary information, groups them; postprocessing (cluster sketches, cluster rank, webpage rank) produces the results (Person 1, Person 2, Person 3, …).

