Entity Resolution
David Inouye, Georgia Institute of Technology
2011 DIMACS REU Intern at Rutgers University
William M. Pottenger, Ph.D., Mentor
* The content of this presentation has been adapted from a presentation given by Nir Grinberg.
06/07/2011 1
Introduction to Entity Resolution
Entity resolution is the problem of deciding whether two sets of data elements refer to the same real-world entity.
[Figure: Elements from Source 1 and Elements from Source 2, with question marks between them.]
Objective/Approach
Standardize and Encode -> Calculate Similarity Scores -> Classify Using Ground Truth Data
Incidents in GTD*
Month: 6  Day: 28  Year: 2005  City: Dardsun, Kupwara  Type: Arson
Incidents in WITS*
Date: 06/27/2005  City: Kupwara  Type: Fire attack
* WITS - https://wits.nctc.gov/; GTD - http://www.start.umd.edu/gtd/
Phase 1: Standardize and Encode
WITS
Incident_ID  Date       City      State_Prov         Country
40426        12/3/06    Udhampur  Jammu and Kashmir  India
15649        6/27/2005  Kupwara   Jammu and Kashmir  India
GTD
Eventid       Iyear  Imonth  Iday  City             Provstate                country
200404140003  2004   4       14    Patna            Bihar                    India
200506280004  2005   6       28    Dardsun Kupwara  Jammu & Kashmir (State)  India
Phase 1: Standardize and Encode
Standardize dates
Map WITS weapon types to GTD weapon types
Geocode location to latitude and longitude
Extract topic model distribution using LDA
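
The date step can be illustrated with a minimal sketch. It assumes WITS dates arrive as month/day/year strings with either a two-digit or a four-digit year (as in the sample rows), and that GTD dates arrive as separate year, month, and day integers. The function names `parse_wits_date` and `parse_gtd_date` are hypothetical and are not taken from the project.

```python
from datetime import date, datetime


def parse_wits_date(text: str) -> date:
    """Parse a WITS-style string such as '6/27/2005' or '12/3/06'."""
    # Try the four-digit-year format first, then the two-digit-year format.
    for fmt in ("%m/%d/%Y", "%m/%d/%y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def parse_gtd_date(iyear: int, imonth: int, iday: int) -> date:
    """Build a date from GTD's separate Iyear, Imonth, and Iday columns."""
    return date(iyear, imonth, iday)


# The two sample incidents from the tables above.
wits = parse_wits_date("6/27/2005")
gtd = parse_gtd_date(2005, 6, 28)
gap_in_days = abs((gtd - wits).days)
```

Once both sources share one date type, the gap between two incidents is a plain subtraction, which is what the similarity phase needs.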
Phase 1: Latent Dirichlet Allocation
Generative probabilistic model
Assumes topics are probability distributions over words
Assumes documents are probability distributions over topics
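[Figure: two bar charts. Left: word probabilities (Money, Loan, Bank, River, Stream) for Topic 1 and Topic 2, vertical axis 0 to 0.5. Right: topic proportions (Topic 1, Topic 2) for Doc 1, Doc 2, and Doc 3, vertical axis 0 to 1.]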
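
The two assumptions on this slide describe a generative process: draw a topic mixture for the document, then for each word draw a topic from that mixture and a word from that topic. The sketch below samples one document this way using only the Python standard library. The topic-word probabilities in `TOPICS` and the Dirichlet parameter `alpha` are made-up illustrative values, not numbers from the slide, and the sketch covers generation only, not the inference that fits LDA to data.

```python
import random

VOCAB = ["money", "loan", "bank", "river", "stream"]

# Hypothetical topic-word distributions; each row sums to 1.
TOPICS = [
    [0.4, 0.3, 0.3, 0.0, 0.0],  # a finance-like topic
    [0.0, 0.0, 0.3, 0.4, 0.3],  # a river-like topic
]


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha, rng):
    """Sample a topic mixture, then one (topic, word) pair per position."""
    theta = sample_dirichlet(alpha, rng)  # document's distribution over topics
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(TOPICS)), weights=theta)[0]  # pick a topic
        words.append(rng.choices(VOCAB, weights=TOPICS[z])[0])  # pick a word
    return theta, words


rng = random.Random(0)
theta, words = generate_document(10, alpha=[0.5, 0.5], rng=rng)
```

Fitting LDA runs this process in reverse: given only the words, it estimates the topic-word distributions and each document's `theta`, and `theta` is the topic distribution used as a feature in the later phases.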
Phase 1: LDA Example
* Example from “Probabilistic Topic Models” by Mark Steyvers. http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf
Phase 1: Latent “Topics” (most probable words in topic)
killed, kashmir, attack, injured, militants, suspected, blast, kill, bomb
fired, upon, armed, killed, manipur, civilian, imphal, member, former
civilian, kashmir, jammu, night, residence, kidnapped, one, village, doda
police, one, killing, wounding, officers, two, officer, others, injuring
jammu, kashmir, baramula, security, one, armed, anantnag, hizbul, mujahedin
assam, explosive, front, improvised, device, liberation, united, ied, ulfa
widely, two, civilians, national, tripura, kidnapped, three, village, karbi
causing, injuries, damage, damaging, fire, station, set, detonated, train
maoist, party, communist, cpi, widely, pradesh, andhra, chhattisgarh, village
grenade, threw, civilians, wounding, srinagar, vehicle, two, kashmir, jammu
Phase 2: Compute Similarity
Dates: 05/23/2001 vs. 05/22/2001
Nominal strings such as country or city: “Jammu” vs. “Jammuu”
GeoLocation: Lat 32.8/Long 74.7 vs. Lat 32.27/Long 75.6
Topic distribution
[Figure: two bar charts, each showing a distribution over Topic 1 to Topic 4, vertical axis 0 to 0.4.]
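
The slide names four kinds of comparison but not the measures behind them, so the following is only one plausible sketch. It assumes a linear decay over a hypothetical 30-day window for dates, `difflib`'s matching ratio for strings, great-circle (haversine) distance for locations, and cosine similarity for topic distributions. All four function names are illustrative.

```python
import math
from datetime import date
from difflib import SequenceMatcher


def date_similarity(a: date, b: date, max_days: int = 30) -> float:
    """1.0 for identical dates, falling linearly to 0.0 at max_days apart."""
    gap = abs((a - b).days)
    return max(0.0, 1.0 - gap / max_days)


def string_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] for nominal strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two lat/long points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))


def cosine_similarity(p, q) -> float:
    """Cosine of the angle between two topic-distribution vectors."""
    dot = sum(x * y for x, y in zip(p, q))
    norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0


# The example pairs from the slide.
d_sim = date_similarity(date(2001, 5, 23), date(2001, 5, 22))
s_sim = string_similarity("Jammu", "Jammuu")
dist_km = haversine_km(32.8, 74.7, 32.27, 75.6)
```

Each function yields one number per record pair, and together those numbers form the feature vector passed to the classifier in Phase 3. A distance such as `dist_km` would still need to be converted to a similarity, for example by the same kind of linear decay used for dates.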
Phase 3: Classify as Match/Non-match
Similarity Scores -> Classifier (Model Based on Ground Truth*) -> Match or Non-match
* The Center for the Study of Terrorism and Responses to Terrorism (START) at the University of Maryland provided the human-annotated ground truth data.
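
The slides do not say which classifier was used, so the sketch below substitutes the simplest possible stand-in: a single cut-off on a combined similarity score, chosen to maximize accuracy on labeled pairs. The training scores and labels are hypothetical toy values, not the START annotations, and `best_threshold` and `classify` are illustrative names.

```python
def best_threshold(scores, labels):
    """Return the cut-off that classifies the most training pairs correctly."""
    best_t, best_correct = 0.0, -1
    for t in sorted(set(scores)):
        # A pair is predicted to be a match when its score reaches the cut-off.
        correct = sum((s >= t) == y for s, y in zip(scores, labels))
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t


def classify(score, threshold):
    """Label one record pair from its combined similarity score."""
    return "Match" if score >= threshold else "Non-match"


# Hypothetical toy ground truth: combined score per pair and its human label.
train_scores = [0.95, 0.90, 0.85, 0.40, 0.30, 0.10]
train_labels = [True, True, True, False, False, False]

threshold = best_threshold(train_scores, train_labels)
label = classify(0.88, threshold)
```

A real model would learn from the individual similarity scores rather than one combined number, but the flow is the same: fit on ground truth, then label unseen pairs.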
Phase 3: Classifier Results
                  Classified
Class             Non-match   Match
Non-match         9875        511
Match             116         246
Accuracy: 0.94   Precision: 0.32   Recall: 0.68
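
The three reported metrics can be recomputed from the confusion matrix. This sketch assumes the rows labeled "Class" are the true classes and the columns are the classifier's output, which is the reading that reproduces the reported values.

```python
# Counts from the results table (rows: true class, columns: classified as).
tn, fp = 9875, 511  # true Non-match: classified Non-match / classified Match
fn, tp = 116, 246   # true Match:     classified Non-match / classified Match

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all pairs labeled correctly
precision = tp / (tp + fp)                  # share of predicted matches that are real
recall = tp / (tp + fn)                     # share of real matches that were found

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# prints: accuracy=0.94 precision=0.32 recall=0.68
```

Accuracy is high mainly because non-matches dominate the data. The lower precision shows that most pairs classified as a match are false positives.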
My research possibilities
Clean up the ground truth data
Improve upon the HO-LDA algorithm
Consider how to compute different similarity scores