Detecting Nearly Duplicated Records in Location Datasets
Microsoft Research Asia, Search Technology Center
Yu Zheng, Xing Xie, Shuang Peng, James Fu
Background
- Web maps and local search engines are frequently used
- The quality of these services depends on geographic data
Background
Name             | Address                  | GPS Position    | Phone Num. | Category | Type
The Matt’s Bar   | 701 5th Ave, Seattle, WA | 116.325, 35.364 | 1-56987452 | Café     | YP
Silver Cloud Inn | 314 7th Ave, Redmond, WA | 116.451, 35.209 | 1-25698716 | Hotel    | POI
- Point of interest (POI)
  - Collected by people holding GPS-enabled devices in the physical world
  - Accurate GPS coordinates
  - Less accurate address
- Yellow page (YP)
  - Entered by people in a cyber environment, e.g., online
  - Accurate address
  - Inaccurate GPS coordinates (translated by geocoding)
Problem
- Nearly duplicated POIs
  - The same entity in the physical world
  - With slightly different presentations of fields such as name and address
- Caused by multiple sources
  - Different vendors and channels
  - Different types: POI and YP
- Results
  - Bring trouble to data management
  - Confuse users
- Example: "Seattle Premier Outlet Mall" vs. "Seattle Premium Outlet"
What we do
- Infer the similarity between two location entities
  - Using a machine-learning approach
  - Considering multiple fields: name, address, coordinates, and categories
- Identify some useful features
- Evaluate our method using real datasets
Methodology
- Similarities between two entities
  - Name similarity
  - Address similarity
  - Category similarity
- Train an inference model
  - Using these similarities as features
  - With a small human-labeled training set
  - Apply it to a large-scale dataset
Name similarity
- Edit distance does not work
- The concept of IDF (inverse document frequency)
  - Shared part: $V_s$; different part: $V_d$
  - Output $S_1$ and $S_2$ as features

Record names                          | Same part | Different part     | Edit dist. | Result
Galaxies Coffee House / Galaxies Cafe | Galaxies  | Coffee House, Cafe | 9          | Same
Espresso Darer / Espresso Diana       | Espresso  | Diana, Darer       | 4          | Diff.
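The edit distances in the table can be checked with a standard Levenshtein implementation. The sketch below is only an illustration of why edit distance misleads here (the duplicate pair scores 9, the non-duplicate pair scores 4); the function name `edit_distance` is not from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


same_entity = edit_distance("Galaxies Coffee House", "Galaxies Cafe")
different_entity = edit_distance("Espresso Darer", "Espresso Diana")
print(same_entity, different_entity)  # prints: 9 4
```

A threshold on this distance would accept the Espresso pair before the Galaxies pair, which is the opposite of the desired result.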
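$$V_s = \langle w_1, w_2, \ldots, w_i \rangle, \qquad V_d = \langle w'_1, w'_2, \ldots, w'_i \rangle$$
$$S_1 = \sum_{i=1}^{|V_s|} \mathrm{idf}(w_i), \quad w_i \in V_s$$
$$S_2 = \max_{w_i \in V_d} \mathrm{idf}(w_i)$$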
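A minimal sketch of how the two name features might be computed, under several assumptions that the slides do not state: names are split on whitespace and lowercased, the shared and different parts are treated as word sets, IDF is taken as log(N / df) over a corpus of record names, and unseen words get a caller-supplied default. The toy corpus and the helper names `build_idf` and `name_features` are hypothetical.

```python
import math
from collections import Counter


def build_idf(names):
    """Assumed IDF definition: idf(w) = log(N / df(w)) over record names."""
    n = len(names)
    df = Counter(w for name in names for w in set(name.lower().split()))
    return {w: math.log(n / count) for w, count in df.items()}


def name_features(name_a, name_b, idf, default_idf=0.0):
    """Return (S1, S2): summed IDF of shared words, max IDF of differing words."""
    a, b = set(name_a.lower().split()), set(name_b.lower().split())
    shared, different = a & b, a ^ b
    s1 = sum(idf.get(w, default_idf) for w in shared)
    s2 = max((idf.get(w, default_idf) for w in different), default=0.0)
    return s1, s2


# Hypothetical corpus: "espresso" and "cafe" are common, "galaxies" is rarer.
corpus = [
    "Galaxies Coffee House", "Galaxies Cafe", "Espresso Darer",
    "Espresso Diana", "Espresso Cafe", "Espresso Coffee House",
    "Corner Cafe", "Corner Coffee House",
]
idf = build_idf(corpus)
s1_same, s2_same = name_features("Galaxies Coffee House", "Galaxies Cafe", idf)
s1_diff, s2_diff = name_features("Espresso Darer", "Espresso Diana", idf)
print(s1_same > s1_diff, s2_same < s2_diff)
```

On this toy corpus the duplicate pair shares a rarer word (higher S1) and differs only in common words (lower S2), while the non-duplicate pair shows the reverse pattern.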
Address similarity
- The geospatially closer two records are, the higher the probability that they are near duplicates
- Example: the same building with two different address presentations
  - 79 Beaver St, New York, NY 10005-2812
  - 92 Water St, New York, NY 10005-3511

[Figure: City structure. A hierarchy with levels City, Borough, Area, and Street; nodes shown include New York City (1xxxx), Manhattan (100xxx), Queens (113xxx), Lower East (1000x), Upper East (1002x), 5th Street, and Wall Street.]
Address similarity
- Insert YP data into the city structure according to their addresses
- Calculate the mean coordinates of each leaf node
- Insert POI data into the city structure according to their coordinates
- Find the co-parent node in the structure
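[Figure: three cases, A), B), and C), showing the positions of records R1 and R2 relative to their co-parent node np in the city structure.]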
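The four steps listed above could be sketched as follows. This is an illustration under assumptions, not the authors' implementation: each YP record is taken to arrive with an already-parsed path (city, borough, area, street), a POI is attached to the leaf with the nearest mean coordinates using plain squared Euclidean distance, and the address feature is taken to be the depth of the co-parent node. All record values and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical YP records: (path in the city structure, (x, y) coordinates).
yp_records = [
    (("City", "Borough A", "Area A1", "Street 1"), (0.0, 0.0)),
    (("City", "Borough A", "Area A1", "Street 1"), (0.2, 0.0)),
    (("City", "Borough A", "Area A1", "Street 2"), (1.0, 1.0)),
    (("City", "Borough B", "Area B1", "Street 3"), (9.0, 9.0)),
]


def leaf_means(records):
    """Steps 1-2: group YP records by leaf path and average their coordinates."""
    groups = defaultdict(list)
    for path, xy in records:
        groups[path].append(xy)
    return {path: (sum(x for x, _ in pts) / len(pts),
                   sum(y for _, y in pts) / len(pts))
            for path, pts in groups.items()}


def locate_poi(xy, means):
    """Step 3: attach a POI to the leaf whose mean coordinates are nearest."""
    return min(means, key=lambda path: (means[path][0] - xy[0]) ** 2
                                       + (means[path][1] - xy[1]) ** 2)


def co_parent_depth(path_a, path_b):
    """Step 4: depth of the deepest node shared by both paths (root = 1)."""
    depth = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        depth += 1
    return depth


means = leaf_means(yp_records)
poi_path = locate_poi((0.1, 0.1), means)
for yp_path, _ in yp_records:
    print(yp_path[-1], co_parent_depth(poi_path, yp_path))
```

A deeper co-parent (same street rather than same borough) indicates that the two records are geospatially closer in the structure.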
Category similarity
- Map each entity to a category hierarchy
- Find the co-parent node of the two entities
- The lower the level of the co-parent node, the higher the similarity

[Figure: category hierarchy with Levels 1 to 3; nodes shown include Entertainment, Education, Restaurant, Cinema, Chinese Restaurant, and Italian Restaurant.]
- E.g., some shops serve coffee, lunch, and wine at the same time, so different people may classify the same shop into different categories
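A small sketch of the co-parent idea for categories. The parent-child arrangement below is an assumption (the original figure only yields the node names), and using the co-parent's level directly as the feature is also an assumption; `PARENT`, `ancestors`, and `co_parent_level` are hypothetical names.

```python
# Assumed hierarchy: child -> parent, with None marking a Level 1 category.
PARENT = {
    "Chinese Restaurant": "Restaurant",
    "Italian Restaurant": "Restaurant",
    "Restaurant": "Entertainment",
    "Cinema": "Entertainment",
    "Entertainment": None,
    "Education": None,
}


def ancestors(category):
    """Path from the Level 1 category down to the given category."""
    path = []
    while category is not None:
        path.append(category)
        category = PARENT[category]
    return path[::-1]


def co_parent_level(cat_a, cat_b):
    """Level of the deepest shared node; 0 means nothing is shared."""
    level = 0
    for a, b in zip(ancestors(cat_a), ancestors(cat_b)):
        if a != b:
            break
        level += 1
    return level


print(co_parent_level("Chinese Restaurant", "Italian Restaurant"))  # prints: 2
print(co_parent_level("Chinese Restaurant", "Cinema"))              # prints: 1
print(co_parent_level("Chinese Restaurant", "Education"))           # prints: 0
```

Two records labeled with sibling categories still share a deep co-parent, so a labeling disagreement of the kind described above lowers the score only slightly instead of zeroing it.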
Experiments - Settings
- Beijing dataset
  - 0.7 million entities in total
  - 0.3 million POIs and 0.4 million YPs
- Human labeled
- Decision tree + Bagging
- Baselines
  - Exact match
  - Rule-based: edit distance and geo-distance

Datasets | Training set | Test set | Total
D1       | 200          | 200      | 400
D2       | 400          | 400      | 800
D3       | 600          | 600      | 1200
D4       | 800          | 800      | 1600
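To make the "Decision tree + Bagging" setting concrete, here is a dependency-free sketch of bagging. It substitutes one-level decision stumps for full decision trees to stay short, and the number of estimators, the seed, and the toy feature vectors (ordered as S1, S2, S3, S4) are all assumptions rather than values from the experiments.

```python
import random
from collections import Counter


def fit_stump(X, y):
    """Pick the (feature, threshold, polarity) with the best training accuracy."""
    best = (-1.0, 0, 0.0, 1)  # (accuracy, feature index, threshold, polarity)
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for pol in (1, -1):  # 1: predict 1 when x >= t; -1: when x <= t
                preds = [int(pol * (row[f] - t) >= 0) for row in X]
                acc = sum(p == label for p, label in zip(preds, y)) / len(y)
                if acc > best[0]:
                    best = (acc, f, t, pol)
    return best[1:]


def stump_predict(stump, row):
    f, t, pol = stump
    return int(pol * (row[f] - t) >= 0)


def fit_bagging(X, y, n_estimators=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models


def bagging_predict(models, row):
    """Majority vote over the ensemble."""
    votes = Counter(stump_predict(m, row) for m in models)
    return votes.most_common(1)[0][0]


# Hypothetical labeled pairs: 1 = nearly duplicated, 0 = not duplicated.
X = [[2.1, 0.3, 4, 3], [1.8, 0.5, 4, 2], [1.9, 0.4, 3, 3],
     [2.4, 0.2, 4, 3], [1.7, 0.6, 3, 2], [2.0, 0.5, 4, 3],
     [0.4, 2.0, 1, 0], [0.6, 1.8, 2, 1], [0.3, 2.2, 1, 1],
     [0.5, 1.9, 0, 0], [0.7, 2.1, 2, 0], [0.2, 1.7, 1, 1]]
y = [1] * 6 + [0] * 6
models = fit_bagging(X, y)
print(bagging_predict(models, [2.5, 0.1, 4, 3]),
      bagging_predict(models, [0.1, 2.5, 0, 0]))
```

Averaging many learners trained on resampled data reduces the variance of any single tree, which is useful when the human-labeled training set is small.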
Experiments - Results
- Single feature study
  - S1 and S2 are the name-similarity features
  - S3 denotes address similarity
  - S4 represents category similarity

[Figure: two line charts of precision and recall versus the number of entity pairs (400, 800, 1200, 1600), with one curve each for S1, S2, S3, and S4.]
Experiments - Results
Feature combination
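Features  | Duplicated Pre. | Duplicated Rec. | Non-duplicated Pre. | Non-duplicated Rec. | Overall accuracy
(missing) | 0.860           | 0.857           | 0.852               | 0.864               | 0.858
(missing) | 0.800           | 0.767           | 0.746               | 0.819               | 0.782
(missing) | 0.864           | 0.859           | 0.853               | 0.869               | 0.861
(missing) | 0.864           | 0.859           | 0.853               | 0.869               | 0.861
(missing) | 0.885           | 0.866           | 0.858               | 0.891               | 0.875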
Experiments - Results

Method            | Duplicated Pre. | Duplicated Rec. | Non-duplicated Pre. | Non-duplicated Rec. | Overall accuracy
Exact match       | 1               | 0.183           | 0.558               | 0.100               | 0.598
Rule-based method | 0.780           | 0.701           | 0.736               | 0.808               | 0.755
Our approach      | 0.885           | 0.866           | 0.858               | 0.891               | 0.875
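For reference, the per-class precision and recall and the overall accuracy reported in the table follow the usual definitions. The sketch below computes them on a small made-up set of labels (1 = duplicated, 0 = non-duplicated); the labels are illustrative only and are not data from the experiments.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall when `positive` is treated as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def accuracy(y_true, y_pred):
    """Fraction of pairs whose predicted label matches the human label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels for eight entity pairs.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

dup_p, dup_r = precision_recall(y_true, y_pred, positive=1)
non_p, non_r = precision_recall(y_true, y_pred, positive=0)
print(round(dup_p, 3), dup_r, non_p, non_r, accuracy(y_true, y_pred))
# prints: 0.667 0.5 0.6 0.75 0.625
```

Reporting both classes separately matters here: a method such as exact match can reach a duplicated-class precision of 1 while still missing most duplicates.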
[Figure: bar chart of performance measures (precision (Y), recall (Y), precision (N), recall (N), overall) for datasets D1 to D4; the vertical axis ranges from 0.65 to 0.95.]
Conclusion
- A classification model using
  - Name similarity
  - Address similarity
  - Category similarity
- Determines nearly duplicated location data
  - With an overall accuracy of 0.89