PhD Thesis Defense
Javier Artiles Picón
NLP & IR Group, UNED, Madrid
PhD supervisors:
Julio Gonzalo Arroyo
Enrique Amigó Cabrera
Web People Search
1
Finding people on the Web…
Web person profiling: 80% of U.S. companies check the Web before hiring someone.
In 30% of cases, Web results impact the hiring decision (source: notoriety.com).
Popularity and reputation management.
Further Natural Language Processing: biographical attribute extraction, summarization.
Or, simply, finding out information about an individual.
I. Introduction
Diving in… mixed results
1 - fitness guru
2 - photographer
3 - photographer
4 - photographer
5 - advertising Supervisor at Flamingo Las Vegas
6 - advertising Supervisor at Flamingo Las Vegas
7 - empty blog ?
8 - St. Louis, MO
9 - 55 years old, Las Vegas, Nevada, United States
10 - fitness guru
I. Introduction
I. Introduction
Wikipedia lists 19 different people named “Michael Moore” …
Diving in… multiple celebrities
4
… but, only one person monopolizes the top Web search results
Diving in… query refinements
Yes, users can (and do) refine their queries, but…
How do we know which refinement yields the best results?
I. Introduction
if too general, we might include non-relevant documents
… actually there are two politicians with that name
Michael Moore politician
if too specific, we might miss relevant documents
… he has had other occupations
Michael Moore Mississippi attorney general
How relevant is this problem?
11-17% of Web queries include a person name
4% of Web queries are just a person name
U.S. Census Bureau: 90,000 names shared by 100,000,000 people
Web People Search engines available since 2005 (Spock, Zoominfo, Arnetminer, etc.)
I. Introduction
What we get vs. what we want
fitness guru
• www.thatsfit.com/bloggers/martha-edwards/
• www.thecardioblog.com/bloggers/martha-edwards/
photographer
• www.marthaedwards.ca
• www.thecancerblog.com/bloggers/martha-edwards/
advertising Supervisor at Flamingo Las Vegas
• www.linkedin.com/pub/martha-edwards/4/378/136
St. Louis, MO
• www.facebook.com/meedwards?ref=mf
Stagecoach Plc, United Kingdom
• www.zoominfo.com/.../Edwards_Martha_1175619539.aspx
I. Introduction
This is not an easy task
I. Introduction
Goals
9
Formalize the name disambiguation problem in Web search results:
Review the name disambiguation problem in the state of the art.
Motivate empirically the need for automatic methods.
Create an evaluation framework:
Define a task.
Create a testbed corpus.
Adopt evaluation methodology and quality measures.
Analyze the impact of different document representations.
I. Introduction
How we addressed the problem
Task formalization
Preliminary studies
First evaluation campaign
Data acquisition
Community building
Evaluation methodology refinement
Second evaluation campaign
Consolidated methodology
Empirical studies
I. Introduction
Web People Search
11
I. Introduction.
II. Benchmarking.
I. The WePS-1 Campaign.
II. Comparison of Extrinsic Clustering Evaluation Metrics Based on Formal Constraints.
III. Combining Evaluation Metrics via the Unanimous Improvement Ratio.
IV. The WePS-2 Campaign.
III. Empirical Studies.
I. The Scope of Query Refinement in the WePS Task.
II. The Role of Named Entities in WePS.
Contributions.
WePS-1: clustering task
search engine
system
12
fitness guru
• www.thatsfit.com/bloggers/martha-edwards/
• www.thecardioblog.com/bloggers/martha-edwards/
photographer
• www.marthaedwards.ca
• www.thecancerblog.com/bloggers/martha-edwards/
advertising Supervisor at Flamingo Las Vegas
• www.linkedin.com/pub/martha-edwards/4/378/136
St. Louis, MO
• www.facebook.com/meedwards?ref=mf
Stagecoach Plc, United Kingdom
• www.zoominfo.com/.../Edwards_Martha_1175619539.aspx
II. i. The WePS-1 Campaign
Testbed generation process
Person name selection from several sources (Wikipedia, US Census, ACL'06).
Each name is sent as a query to a Web search engine.
Collect the top 100 search results for each name.
Manually group the pages according to the individual they refer to.
Example names: George Foster, James Hamilton, Martha Edwards, Thomas Fraser, Thomas Kirk.
13
II. i. The WePS-1 Campaign
Annotation: a nice page
II. i. The WePS-1 Campaign
Annotation: kind of difficult...
II. i. The WePS-1 Campaign
Annotation: frankly, no clue
II. i. The WePS-1 Campaign
SIGIR 2005 preliminary testbed
Manual annotation consisted of: Clustering of the pages according to the individual they refer to.
Biographical attributes.
Page classification (home page, part of h.p., reference, other).
Points for improvement:
WePS-1 should concentrate efforts on clustering annotation.
Add more name sources.
Names shared with non-person entities.
Also consider ambiguity within documents (overlapping clustering).
II. i. The WePS-1 Campaign
WePS-1 Training and Test collections
18
Training collection (averages per name):
name source   entities   documents
Wikipedia     23.14      99.00
ECDL06        15.30      99.20
WEB03          5.90      47.20
avg.          10.76      71.02
Test collection (averages per name):
name source   entities   documents
Wikipedia     56.50      99.30
ACL06         31.00      98.40
Census        50.30      99.10
avg.          45.93      98.93
The test data turned out to have a much higher average ambiguity, even for the same name sources.
Purity (P): rewards clusters without noise
Inverse Purity (IP): rewards grouping items from same category
Fα=0.5: harmonic mean of P, IP
Fα=0.2: bias for IP
Evaluation Metrics Baselines
One-in-one baseline (every page in its own cluster): P = 1.00, IP = 0.48, Fα=0.5 = 0.65
All-in-one baseline (all pages in a single cluster): P = 0.50, IP = 1.00, Fα=0.5 = 0.67
19
II. i. The WePS-1 Campaign
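As a reference, this is a minimal sketch of how purity, inverse purity and their F combination can be computed; the toy gold standard and system outputs below are invented for illustration and do not reproduce the exact baseline figures shown on the slide.

def purity(clusters, categories, n_docs):
    # For each cluster, count the overlap with its best-matching gold category.
    return sum(max(len(c & l) for l in categories) for c in clusters) / n_docs

def f_measure(p, ip, alpha=0.5):
    # Weighted harmonic mean of purity and inverse purity (van Rijsbergen's F).
    return 1.0 / (alpha / p + (1 - alpha) / ip)

# Toy gold standard: two people, three pages each.
gold = [{1, 2, 3}, {4, 5, 6}]
n_docs = 6
one_in_one = [{d} for d in range(1, 7)]   # every page in its own cluster
all_in_one = [set(range(1, 7))]           # all pages in a single cluster

for name, system in [("one-in-one", one_in_one), ("all-in-one", all_in_one)]:
    p = purity(system, gold, n_docs)
    ip = purity(gold, system, n_docs)     # inverse purity swaps the two roles
    print(name, round(p, 2), round(ip, 2), round(f_measure(p, ip), 2))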
Cheat system (Paul Kalmar): P = 0.75, IP = 1.00, Fα=0.5 = 0.86
Purity measures can be cheated in WePS!
Top of the WePS-1 rankings by purity, inverse purity and Fα=0.5:
purity           inv. purity       Fα=0.5
S4       0.81    Cheat S   1.00    S1        0.79
S3       0.75    S14       0.95    Cheat S   0.78
S2       0.73    S13       0.93    S2        0.77
S1       0.72    S15       0.91    S3        0.77
Cheat S  0.64    S5        0.90    S4        0.69
S6       0.60    S10       0.89    S5        0.67
S9       0.58    S7        0.88    S6        0.66
S8       0.55    S1        0.88    S7        0.64
S5       0.53    S12       0.83    S8        0.62
S7       0.50    S11       0.82    S9        0.61
WePS-1 Systems ranking
20
team F α=0.5 purity inv. purity
CU_COMSEM 0.79 0.72 0.88
CHEAT_SYSTEM 0.78 0.64 1.00
IRST-BP 0.77 0.75 0.80
PSNUS 0.77 0.73 0.82
UVA 0.69 0.81 0.60
FICO 0.67 0.53 0.90
UNN 0.66 0.60 0.73
ONE_IN_ONE 0.64 1.00 0.47
AUG 0.64 0.50 0.88
SWAT-IV 0.62 0.55 0.71
UA-ZSA 0.61 0.58 0.64
TITPI 0.60 0.45 0.89
JHU1-13 0.58 0.45 0.82
DFKI2 0.53 0.39 0.83
WIT 0.52 0.36 0.93
UC3M_13 0.51 0.35 0.95
UBC-AS 0.45 0.30 0.91
ALL_IN_ONE 0.45 0.29 1.00
The most common system configuration (a sketch follows below):
• Full-document BoW
• HAC (single link)
• Cosine similarity
• Trained similarity threshold
Frequent "singleton" people in WePS-1.
Alpha parametrization has a strong effect on the systems ranking.
II. i. The WePS-1 Campaign
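A minimal sketch of that common configuration, assuming scikit-learn and SciPy are available; the toy pages and the 0.2 distance threshold are purely illustrative, not the settings of any WePS participant.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_pages(pages, distance_threshold=0.2):
    # Full-document bag-of-words with TF-IDF weighting.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(pages)
    dist = cosine_distances(tfidf)                 # 1 - cosine similarity
    condensed = squareform(dist, checks=False)     # condensed form expected by linkage
    tree = linkage(condensed, method="single")     # single-link agglomeration
    # Cut the dendrogram at a threshold trained on held-out (e.g. WePS-1 training) data.
    return fcluster(tree, t=distance_threshold, criterion="distance")

pages = ["Martha Edwards fitness blogger ...",
         "Martha Edwards photography portfolio ...",
         "Martha Edwards fitness and cardio tips ..."]
print(cluster_pages(pages))   # one cluster label per page; the threshold controls how much merging happens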
WePS-1 Summary
21
Variability across test cases is large and unpredictable.
Testbed creation is more difficult and expensive than expected.
Purity measures can be cheated! Are purity and inverse purity the best options available among clustering metrics?
The combination of metrics has a strong effect on how we measure the contribution of systems. How does the combination of metrics affect the systems ranking?
II. i. The WePS-1 Campaign
Web People Search
22
I. Introduction.
II. Benchmarking.
I. The WePS-1 Campaign.
II. Comparison of Extrinsic Clustering Evaluation Metrics Based on Formal Constraints.
III. Combining Evaluation Metrics via the Unanimous Improvement Ratio.
IV. The WePS-2 Campaign.
III. Empirical Studies.
I. The Scope of Query Refinement in the WePS Task.
II. The Role of Named Entities in WePS.
Contributions.
Comparing clustering evaluation metrics
Which of the current clustering metrics is most appropriate for the WePS task?
We compare different families of clustering metrics.
We define constraints in order to characterize metric families.
We adapt metrics to the overlapping clustering problem.
II. ii. Clustering Evaluation Metrics
Formal constraints: Cluster homogeneity
24
o Let S be a set of items belonging to categories L1 … Ln.
o Let D1 be a cluster distribution with one cluster C containing items from two categories Li, Lj.
o Let D2 be a distribution identical to D1, except for the fact that the cluster C is split into two clusters containing the items with category Li and the items with category Lj, respectively.
o Then Q(D1) < Q(D2).
Human validation: 92 %
II. ii. Clustering Evaluation Metrics
Formal constraints: Cluster completeness
25
o Let D1 be a distribution such that two clusters C1, C2 only contain items belonging to the same category L.
o Let D2 be an identical distribution, except for the fact that C1 and C2 are merged into a single cluster.
o Then Q(D1) < Q(D2).
II. ii. Clustering Evaluation Metrics
Human validation: 90 %
Formal constraints: Rag Bag
26
o Let Cclean be a cluster with n items belonging to the same category.
o Let Cnoisy be a cluster merging n items from unary categories.
o Let D1 be a distribution with a new item from a new category merged with the highly clean cluster Cclean, and D2 another distribution with this new item merged with the highly noisy cluster Cnoisy .
o Then Q(D1) < Q(D2).
II. ii. Clustering Evaluation Metrics
Human validation: 95 %
Formal constraints: Cluster size vs. quantity
27
o Let us consider a distribution D containing a cluster Cl with n+1 items belonging to the same category L, and n additional clusters C1 … Cn, each of them containing two items from the same category L1 … Ln.
o If D1 is a new distribution similar to D, where each Ci is split in two unary clusters, and D2 is a distribution similar to D, where Cl is split in one cluster of size n and one cluster of size 1.
o Then Q(D1) < Q(D2).
II. ii. Clustering Evaluation Metrics
Human validation: 100%
Comparison of evaluation metrics
28
BCubed
Pairs counting
Entropy
Edit distance
Set matching
II. ii. Clustering Evaluation Metrics
BCubed Precision and Recall
II. ii. Clustering Evaluation Metrics
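The BCubed formulas for this slide did not survive the conversion to text; for reference, the standard (non-overlapping) BCubed definitions, reproduced here from Amigó et al. (2009) rather than from the slide itself, are:

Correct(e, e') = 1 if and only if C(e) = C(e') and L(e) = L(e')
Precision BCubed = Avg_e [ Avg_{e' : C(e') = C(e)} Correct(e, e') ]
Recall BCubed = Avg_e [ Avg_{e' : L(e') = L(e)} Correct(e, e') ]

where C(e) is the cluster assigned to item e and L(e) its gold category.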
Evaluation on overlapping clustering
30
If n people are mentioned in a document with the same name, this document should appear in n clusters.
The metrics reviewed so far do not consider overlapping clustering.
II. ii. Clustering Evaluation Metrics
BCubed extended for overlapping clustering
31
To extend BCubed we must take into account the multiplicity of item occurrences in clusters and classes:
Precision decreases when two elements share too many clusters.
Recall decreases when two elements share too few clusters.
Multiplicity precision and recall are integrated into the overall BCubed metrics.
II. ii. Clustering Evaluation Metrics
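The multiplicity formulas themselves also did not survive the text conversion; written out (again following Amigó et al. 2009, with C(e) the set of clusters and L(e) the set of categories containing item e), they should read:

MultiplicityPrecision(e, e') = min( |C(e) ∩ C(e')|, |L(e) ∩ L(e')| ) / |C(e) ∩ C(e')|
MultiplicityRecall(e, e') = min( |C(e) ∩ C(e')|, |L(e) ∩ L(e')| ) / |L(e) ∩ L(e')|

Precision is averaged over pairs of items that share at least one cluster and recall over pairs that share at least one category; the worked examples on the next slide follow directly from these two formulas.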
BCubed extended for overlapping clustering
32
Perfect clustering: Recall(e1, e2) = min(2,2)/2 = 1; Precision(e1, e2) = min(2,2)/2 = 1
Losing recall: Recall(e1, e2) = min(1,2)/2 = 0.5; Precision(e1, e2) = min(1,2)/1 = 1
Losing precision: Recall(e1, e2) = min(3,2)/2 = 1; Precision(e1, e2) = min(3,2)/3 = 0.66
II. ii. Clustering Evaluation Metrics
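A minimal sketch of these extended (multiplicity-based) BCubed scores, assuming each document id maps to the set of clusters the system placed it in and to the set of gold categories it belongs to; this is an illustration of the definition above, not the official WePS scorer.

def extended_bcubed(system, gold):
    # system, gold: dict mapping item id -> set of cluster ids / category ids.
    def avg(values):
        return sum(values) / len(values) if values else 0.0

    def score(use_clusters_as_denominator):
        per_item = []
        for e in system:
            pair_scores = []
            for e2 in system:
                shared_c = len(system[e] & system[e2])
                shared_l = len(gold[e] & gold[e2])
                denom = shared_c if use_clusters_as_denominator else shared_l
                if denom:   # precision: pairs sharing a cluster; recall: pairs sharing a category
                    pair_scores.append(min(shared_c, shared_l) / denom)
            per_item.append(avg(pair_scores))
        return avg(per_item)

    return score(True), score(False)   # (precision, recall)

# "Losing precision" example from the slide: e1 and e2 share three clusters
# but only two gold categories.
system = {"e1": {1, 2, 3}, "e2": {1, 2, 3}}
gold = {"e1": {"a", "b"}, "e2": {"a", "b"}}
print(extended_bcubed(system, gold))   # approximately (0.66, 1.0)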
WePS-1 results revisited
33
Figure panels: Purity and Inverse Purity vs. BCubed Precision and Recall.
II. ii. Clustering Evaluation Metrics
Clustering Metrics Summary
We have proposed a set of formal constraints for clustering evaluation metrics.
The combination of BCubed precision and recall is the only one that satisfies all constraints.
We have extended BCubed to handle overlapping clustering.
We have tested BCubed extended with the WePS-1 results and found that it effectively discriminates the baselines and the cheat system.
II. ii. Clustering Evaluation Metrics
Web People Search
35
I. Introduction.
II. Benchmarking.
I. The WePS-1 Campaign.
II. Comparison of Extrinsic Clustering Evaluation Metrics Based on Formal Constraints.
III. Combining Evaluation Metrics via the Unanimous Improvement Ratio.
IV. The WePS-2 Campaign.
III. Empirical Studies.
I. The Scope of Query Refinement in the WePS Task.
II. The Role of Named Entities in WePS.
Contributions.
How does the combination of metrics affect the systems ranking ?
36
Ranking is highly sensitive to α parametrization in F.
II. iii. Unanimous Improvement Ratio
Figure: F value of each WePS-1 system (S1–S16, plus the seq-1 and seq-100 runs) as a function of the F parameterization (α value), ranging from a bias towards recall (low α) to a bias towards precision (high α).
We can get statistical significance for contradictory results if α is changed:
           seq-1   S14    p (Wilcoxon)
Fα=0.5     0.61    0.49   0.022
Fα=0.2     0.52    0.66   0.015
Unanimous Improvement Ratio
37
Counts the number of topics for which system a improves system b according to all evaluation metrics.
II. iii. Unanimous Improvement Ratio
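A minimal sketch of UIR as described above: for two systems, count the topics where one is at least as good as the other on every metric, and normalise by the number of topics. The exact tie handling and normalisation of the published UIR may differ; the scores below are invented.

def uir(scores_a, scores_b):
    # scores_a, scores_b: per-topic metric dictionaries, e.g. {"P": 0.8, "IP": 0.6}.
    a_wins = b_wins = 0
    for sa, sb in zip(scores_a, scores_b):
        if all(sa[m] >= sb[m] for m in sa):   # a improves (or ties) b on every metric
            a_wins += 1
        if all(sb[m] >= sa[m] for m in sb):   # b improves (or ties) a on every metric
            b_wins += 1
    return (a_wins - b_wins) / len(scores_a)

a = [{"P": 0.9, "IP": 0.5}, {"P": 0.7, "IP": 0.8}, {"P": 0.6, "IP": 0.6}]
b = [{"P": 0.8, "IP": 0.4}, {"P": 0.7, "IP": 0.9}, {"P": 0.5, "IP": 0.5}]
print(uir(a, b))   # 0.33...: a is unanimously better on two topics, b on one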
Unanimous Improvement Ratio
38
UIR rewards robustness across α values.
Figure: for three pairs of systems, the improvement measured as ∆Fα=0.5 is small (0.07, 0.08 and 0.07) while the corresponding UIR values are much larger (0.32, 0.42 and 0.39).
II. iii. Unanimous Improvement Ratio
Metrics Combination Summary
The comparison of systems in clustering tasks is highly sensitive to the metrics combination criterion.
UIR allows us to combine metrics without assigning relative weights to each metric.
UIR rewards robust improvements across different alpha values of F-measure.
UIR is a complementary method to assess the best approach during the system training process.
II. iii. Unanimous Improvement Ratio
Web People Search
40
I. Introduction.
II. Benchmarking.
I. The WePS-1 Campaign.
II. Comparison of Extrinsic Clustering Evaluation Metrics Based on Formal Constraints.
III. Combining Evaluation Metrics via the Unanimous Improvement Ratio.
IV. The WePS-2 Campaign.
III. Empirical Studies.
I. The Scope of Query Refinement in the WePS Task.
II. The Role of Named Entities in WePS.
Contributions.
WePS clustering task and ...
search engine
system
41
fitness guru
• www.thatsfit.com/bloggers/martha-edwards/
• www.thecardioblog.com/bloggers/martha-edwards/
photographer
• www.marthaedwards.ca
• www.thecancerblog.com/bloggers/martha-edwards/
advertising Supervisor at Flamingo Las Vegas
• www.linkedin.com/pub/martha-edwards/4/378/136
St. Louis, MO
• www.facebook.com/meedwards?ref=mf
Stagecoach Plc, United Kingdom
• www.zoominfo.com/.../Edwards_Martha_1175619539.aspx
II. iv. The WePS-2 Campaign
Input → Output
• Name: John Tait
• Occupation: Specialist Information Systems Consulting Services
• Homepage: http://johntait.net
• Affiliation: Information Retrieval Facility
• Location: Vienna
• Work: Chief Scientific Officer
… we also included an Attribute Extraction task.
Satoshi Sekine and Javier Artiles. WePS2 Attribute Extraction Task. In 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference, 2009.
42
II. iv. The WePS-2 Campaign
WePS-2 data
43
Training set: WePS 1 dataset (same methodology & size).
Followed WePS 1 guidelines.
10 x 3 new ambiguous person names (Wikipedia, US census and ACL'08 PC members).
150 web pages from the top search results.
HTML pages as well as search results metadata (snippet, rank...).
Filtered out non-HTML documents and pages that did not contain the name.
We also developed a GUI for the annotation task.
II. iv. The WePS-2 Campaign
Annotation: public profiles from social networks
II. iv. The WePS-2 Campaign
Annotation: genealogies
II. iv. The WePS-2 Campaign
II. iv. The WePS-2 Campaign
WePS-1 vs. WePS-2 datasets
Average ambiguity is much lower on the WePS-2 data.
There is still a wide variety of ambiguity cases; as in WePS-1, this adds an extra challenge to the task.
46
Average number of entities per name, by source:
WePS-1 training: Wikipedia 23.14, ECDL06 15.30, Web03 5.90, avg. 10.76
WePS-1 test: Wikipedia 56.50, ACL06 31.00, Census 50.30, avg. 45.93
WePS-2 test: Wikipedia 10.70, ACL06 14.20, Census 30.30, avg. 18.46
WePS-2 Clustering Results using BCubed
Baselines:
• One-in-one
• All-in one
• Cheat system
• Hierarchical Agglomerative Clustering (HAC) with tokens
• HAC with bigrams
Upper bounds:
• Oracle HAC with tokens
• Oracle HAC with bigrams
II. iv. The WePS-2 Campaign
Macroaveraged Scores
F-measures BCubed
rank run α=0.5 α=0.2 Pre. Rec.
BEST-HAC-TOKENS 0.85 0.84 0.89 0.83
BEST-HAC-BIGRAMS 0.85 0.83 0.91 0.81
1 PolyUHK 0.82 0.80 0.87 0.79
2 UVA_1 0.81 0.80 0.85 0.80
3 ITC-UT_1 0.81 0.76 0.93 0.73
4 XMEDIA_3 0.72 0.68 0.82 0.66
5 UCI_2 0.71 0.77 0.66 0.84
6 LANZHOU_1 0.70 0.67 0.80 0.66
7 FICO_3 0.70 0.64 0.85 0.62
8 UMD_4 0.70 0.63 0.94 0.60
HAC-BIGRAMS 0.67 0.59 0.95 0.55
9 UGUELPH_1 0.63 0.75 0.54 0.93
10 CASIANED_4 0.63 0.68 0.65 0.75
HAC-TOKENS 0.59 0.52 0.95 0.48
11 AUG_4 0.57 0.56 0.73 0.58
12 UPM-SINT_4 0.56 0.59 0.60 0.66
ALL_IN_ONE 0.53 0.66 0.43 1.00
CHEAT_SYS 0.52 0.65 0.43 1.00
13 UNN_2 0.52 0.48 0.76 0.47
14 ECNU_1 0.41 0.44 0.50 0.55
15 UNED_3 0.40 0.38 0.66 0.39
16 PRIYAVEN 0.39 0.37 0.61 0.38
ONE_IN_ONE 0.34 0.27 1.00 0.24
17 BUAP_1 0.33 0.27 0.89 0.25
II. iv. The WePS-2 Campaign
Columns: system, F0.5, improved systems (UIR > 0.25), reference system, UIR for the reference system.
(S1) PolyUHK 0.82 S2 S4 S6 S7 S8 S11 … S17 B1 - -
(S2) ITC-UT_1 0.81 S4 S6 S7 S8 S11 … S17 B1 S1 0.26
(S3) UVA_1 0.81 S2 S4 S7 S8 S11 … S17 B1 - -
(S4) XMEDIA_3 0.72 S11 S13 … S17 S1 0.58
(S5) UCI_2 0.71 S12 … S16 - -
(S6) UMD_4 0.70 S4 S7 S11 S13 … S17 B1 S1 0.35
(S7) FICO_3 0.70 S11 S13 … S17 S2 0.65
(S8) LANZHOU_1 0.70 S11 … S17 S1 0.74
(S9) UGUELPH_1 0.63 S4 S12 S14 S16 - -
(S10) CASIANED_4 0.63 S12 … S16 - -
(S11) AUG_4 0.57 S14 … S17 S3 0.68
(S12) UPM-SINT_4 0.56 S14 S16 S1 0.71
(B100) ALL_IN_ONE 0.53 Bcheat - -
(S13) UNN_2 0.52 S15 S16 S1 0.90
(Bcheat) CHEAT_SYS 0.52 - B100 0.65
(S14) ECNU_1 0.41 - S1 0.90
(S15) UNED_3 0.40 S16 S1 0.97
(S16) PRIYAVEN 0.39 - S1 1.00
(B1) ONE_IN_ONE 0.34 S17 S1 0.29
(S17) BUAP_1 0.33 - S6 0.84
48
Results of UIR on the WePS-2 dataset
Run | Features | Feature weighting | Similarity | Clustering
PolyUHK | Local sentences, full text BoW, URL tokens, title tokens in root page, unigrams and bigrams, snippet-based features | TFIDF | Cosine similarity | HAC
UVA_1 | Stemmed words (Porter stemmer, standard stopword list) | Modified TFIDF | Cosine similarity | HAC
ITC-UT_1 | NEs, compound key words, link features | - | Overlap coefficient | Two-stage HAC
UMD_4 | Tokens, NEs, variations of the ambiguous name, hyperlinks | - | Jaro-Winkler, Jaccard | HAC
XMEDIA_3 | Local unigrams and bigrams | Self information | Cosine similarity and learned similarity metrics | QT variant
UCI_2 | NEs, web overlap statistics for person and organization | TFIDF | Cosine similarity, Skyline classifier for web-based features | Two-stage clustering
LANZHOU_1 | NEs, email, phone, date, occupation | TFIDF | Cosine similarity | HAC
FICO_3 | NEs, URL tokens, page title tokens, NE lists, name match, gender | - | Heuristic based on matching and non-matching features | Greedy agglomeration within a block
UGUELPH_1 | Full text BoW | Modified TFIDF | - | Chameleon clustering
CASIANED_4 | NEs, tokens, URL tokens, snippet | TFIDF for tokens, special weighting for NEs | Cosine similarity | Classify pages according to the person's profession
AUG_4 | Place/date of birth/death, NEs, IP address, geographic location coordinates, weighted keywords, URL, email address, telephone, fax | Gain ratio | Cosine similarity | Fuzzy ants clustering, Agnes (hierarchical clustering)
UPM-SINT_4 | Full text BoW | - | Word overlap | -
ECNU_1 | Stemmed words selected based on a chi^2 measure | Term frequency | Cosine similarity | K-means
PRIYAVEN | Full text BoW | TFIDF | Weighted Jaccard | Fuzzy ants clustering
UNED_3 | Relevant terms extracted with language model techniques | Kullback-Leibler divergence | Language models and cosine similarity | Heuristic
BUAP_1 | NEs | Term frequency | - | -
49
WePS-2 summary
Consolidation of the WePS community: 17 research teams took part in the WePS-2 clustering task.
WePS-2 now provides benchmarking datasets and standardized evaluation metrics for the clustering and attribute extraction subtasks.
Now we can empirically answer questions such as:
How good are manual query refinements in WePS?
What is the role of Named Entities in this task?
II. iv. The WePS-2 Campaign
Web People Search
51
I. Introduction.
II. Benchmarking.
I. The WePS-1 Campaign.
II. Comparison of Extrinsic Clustering Evaluation Metrics Based on Formal Constraints.
III. Combining Evaluation Metrics via the Unanimous Improvement Ratio.
IV. The WePS-2 Campaign.
III. Empirical Studies.
I. The Scope of Query Refinement in the WePS Task.
II. The Role of Named Entities in WePS.
Contributions.
Query Refinements in WePS
52
How good are manual query refinements ?
Are they a feasible people search strategy ?
III. i. The Scope of Query Refinement in the WePS Task
Query Refinements in WePS
53
Tokens, bigrams, trigrams …
•football, publications,
research
•curriculum vitae, full
professor
Named Entities(person, location, organization)
•John Smith, Mary Jones
•Kansas City, Suntherland
•University of Suntherland
Manually extracted attributes(occupation, affiliation, email…)
•Occupation: Full professor
•born in 1940
•born in London
Example: trying to find John Tait, the researcher, in the John Tait document collection from WePS.
Query: John Tait + refinement → Precision 6/8, Recall 6/6, Coverage 1
(Figure legend: retrieved documents, relevant, non-relevant.)
III. i. The Scope of Query Refinement in the WePS Task
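A minimal sketch of how a single candidate refinement can be scored against the gold clustering, under the simplifying assumption that a refinement retrieves every page of the name's document set containing the refinement string; precision and recall follow the slide, and coverage here is just whether the refinement retrieves anything at all. The pages and ids are invented.

def score_refinement(refinement, pages, relevant_ids):
    # pages: dict page_id -> text; relevant_ids: pages of the target person.
    retrieved = {pid for pid, text in pages.items()
                 if refinement.lower() in text.lower()}
    hits = retrieved & relevant_ids
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant_ids)
    coverage = 1 if retrieved else 0
    return precision, recall, coverage

pages = {1: "John Tait, information retrieval researcher",
         2: "John Tait, information retrieval and NLP publications",
         3: "John Tait's fishing flies and tackle shop"}
print(score_refinement("information retrieval", pages, relevant_ids={1, 2}))
# (1.0, 1.0, 1): this refinement isolates the researcher perfectly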
Query Refinements in WePS
54
We will simulate query refinements for the people in the WePS testbed.
The best query refinements will be obtained from the documents and applied to refine the corresponding name document set.
III. i. The Scope of Query Refinement in the WePS Task
III. i. The Scope of Query Refinement in the WePS Task
Results for popular people (clusters of size >= 3)
For each test case we select the best…
(columns: Fα=0.5, precision, recall, coverage)
token        0.87  0.90  0.86  1.00
bigram       0.79  0.95  0.70  1.00
trigram      0.75  0.96  0.65  1.00
…
Best n-gram  0.89  0.95  0.85  1.00
affiliation  0.51  0.96  0.39  0.81
occupation   0.52  0.93  0.40  0.80
email        0.35  0.96  0.23  0.33
…
Best manual attribute 0.60 0.97 0.47 0.92
location 0.62 0.87 0.53 1.00
organization 0.67 0.96 0.56 1.00
person 0.59 0.95 0.47 1.00
Best named entity 0.74 0.95 0.63 1.00
Best 0.89 0.96 0.85 1.00
55
Very good results using all refinement types…
… but tokens and word n-grams also achieve the highest results.
There is usually at least one QR that leads to the desired set of results…
… but not necessarily an intuitive choice
Lower coverage in the manually extracted QRs
Manually tagged attributes: very precise, but they are not always present
Scope of Query Refinements: Summary
There is not a single type of refinement that leads to optimal results, but a combination of diverse types.
Search results clustering might indeed be of practical help to users searching for people on the Web.
III. i. The Scope of Query Refinement in the WePS Task
Web People Search
57
I. Introduction.
II. Benchmarking.
I. The WePS-1 Campaign.
II. Comparison of Extrinsic Clustering Evaluation Metrics Based on Formal Constraints.
III. Combining Evaluation Metrics via the Unanimous Improvement Ratio.
IV. The WePS-2 Campaign.
III. Empirical Studies.
I. The Scope of Query Refinement in the WePS Task.
II. The Role of Named Entities in WePS.
Contributions.
How effective are Named Entities compared to other features for document representation in WePS ?
Document representation: NEs vs other approaches
58
Pipeline: document collection → document representation (term vectors such as D1: 0.1 0.23 0 0 0 0.43) → similarity → clustering. Here we focus on the document representation step.
III. ii. The Role of Named Entities in WePS
Reformulating the WePS task
59
Single features:
Classification task over coreferent document pairs, independent of any clustering algorithm.
WePS-1 and WePS-2 corpora: 293,000 document pairs.
Similarity between pairs is computed using each feature separately.
Results are evaluated with Precision and Recall.
III. ii. The Role of Named Entities in WePS
Token based features
60
Tokens provide the best overall performance
III. ii. The Role of Named Entities in WePS
Word n-gram based features
61
n-grams: more precise than single tokens at the cost of recall
III. ii. The Role of Named Entities in WePS
Named entities: Stanford NE tagger
62
Taken individually, NEs do not improve over tokens
III. ii. The Role of Named Entities in WePS
Reformulating the WePS task
63
Combination of features:
WePS-1 and WePS-2 corpora: 293,000 document pairs.
Similarity is computed using feature combinations.
Results are evaluated with a machine learning classifier and an upper boundary estimate.
III. ii. The Role of Named Entities in WePS
Combining similarity criteria
64
PWA measures the classification accuracy of one similarity criterion:
PWA(x) = Prob( Sim_x(D_A, D_A') > Sim_x(D_B, D_C) ), where (D_A, D_A') is a coreferent document pair and (D_B, D_C) is not.
MaxPWA estimates the upper boundary accuracy of a set of similarity criteria:
MaxPWA(<x_1, x_2, ..., x_n>) = Prob( ∃ x_i ∈ X : Sim_{x_i}(D_A, D_A') > Sim_{x_i}(D_B, D_C) )
III. ii. The Role of Named Entities in WePS
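A minimal sketch of estimating PWA and MaxPWA by sampling, assuming we already have per-feature similarity scores for a set of coreferent pairs and a set of non-coreferent pairs; the feature names and values are invented.

import random

def pwa(feature, coreferent, non_coreferent, trials=10000):
    # Probability that a random coreferent pair scores higher than a random non-coreferent one.
    wins = sum(random.choice(coreferent)[feature] > random.choice(non_coreferent)[feature]
               for _ in range(trials))
    return wins / trials

def max_pwa(features, coreferent, non_coreferent, trials=10000):
    # Probability that at least one feature ranks the coreferent pair higher.
    wins = 0
    for _ in range(trials):
        pos, neg = random.choice(coreferent), random.choice(non_coreferent)
        if any(pos[f] > neg[f] for f in features):
            wins += 1
    return wins / trials

# Toy per-pair similarity scores (one dict per document pair).
coref = [{"tokens": 0.8, "NEs": 0.3}, {"tokens": 0.4, "NEs": 0.9}]
noncoref = [{"tokens": 0.3, "NEs": 0.2}, {"tokens": 0.5, "NEs": 0.1}]
print(pwa("tokens", coref, noncoref), max_pwa(["tokens", "NEs"], coref, noncoref))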
Combining similarity criteria
65
Decision Tree and MaxPWA results are consistent.
Adding new features to tokens improves the classification.
NEs do not offer a competitive advantage when compared to non-linguistic features.
III. ii. The Role of Named Entities in WePS
(Figure legend: Tokens; Tokens + n-grams; All features, including NEs.)
Is our setting competitive with state-of-the-art systems?
Document representation: NEs vs other approaches
66
Pipeline: document collection → document representation → similarity → clustering. Here we focus on the clustering step.
III. ii. The Role of Named Entities in WePS
Results on the clustering task
67
The output of the Decision Tree classifier is used as the similarity metric.
These similarities were fed into a Hierarchical Agglomerative Clustering algorithm.
A distance threshold was trained using WePS-1 data (a sketch of this pipeline follows below).
Results are comparable to the best participant in WePS-2.
Adding NEs does not improve results.
III. ii. The Role of Named Entities in WePS
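A minimal sketch of that pipeline with scikit-learn and SciPy: a decision tree is trained on labelled document pairs, its predicted probability of coreference is used as the pairwise similarity, and agglomerative clustering is cut at a trained threshold. The feature values, the average linkage and the 0.5 threshold are placeholders, not the configuration used in the thesis.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Training pairs: per-feature similarities (e.g. tokens, n-grams, NEs) and a coreference label.
X_train = np.array([[0.8, 0.6, 0.4], [0.2, 0.1, 0.0], [0.7, 0.5, 0.6], [0.1, 0.0, 0.2]])
y_train = np.array([1, 0, 1, 0])
clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

def cluster(pair_features, n_docs, threshold=0.5):
    # pair_features: dict (i, j) -> feature vector, for every pair of documents i < j.
    dist = np.zeros((n_docs, n_docs))
    for (i, j), feats in pair_features.items():
        sim = clf.predict_proba([feats])[0][1]     # predicted probability of coreference
        dist[i, j] = dist[j, i] = 1.0 - sim        # turn similarity into a distance
    tree = linkage(squareform(dist, checks=False), method="average")
    return fcluster(tree, t=threshold, criterion="distance")

pairs = {(0, 1): [0.9, 0.7, 0.5], (0, 2): [0.1, 0.0, 0.1], (1, 2): [0.2, 0.1, 0.0]}
print(cluster(pairs, n_docs=3))   # one cluster label per document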
Document representation: NEs vs other approaches
68
Individual features
Feature combinations
Validation on the clustering task
III. ii. The Role of Named Entities in WePS
Named entities do not seem to provide a competitive advantage in the clustering process when compared to a combination of simpler features (tokens, n-grams, etc.).
This is not a prescription against the use of NEs:
They can be appropriate for presentation purposes.
Other approaches might be able to improve results using NE information.
Role of Named Entities Summary
III. ii. The Role of Named Entities in WePS
Web People Search
70
I. Introduction.
II. Benchmarking.
I. The WePS-1 Campaign.
II. Comparison of Extrinsic Clustering Evaluation Metrics Based on Formal Constraints.
III. Combining Evaluation Metrics via the Unanimous Improvement Ratio.
IV. The WePS-2 Campaign.
III. Empirical Studies.
I. The Scope of Query Refinement in the WePS Task.
II. The Role of Named Entities in WePS.
Contributions.
Contributions
Insights:
Study and characterization of the available evaluation metrics.
Extension of BCubed for overlapping clustering tasks.
Development of a metrics combination method (Unanimous Improvement Ratio) that is not dependent on weighting.
Query refinements are effective but very diverse, and unfeasible in a WePS scenario.
Named Entities do not seem to provide a competitive advantage over simpler features.
Products:
Development of a reference testbed for the task.
Currently more than 80 citations to the WePS-1 task description paper.
Several papers use WePS data as the de facto standard for the task.
A document clustering evaluation package.
An annotation GUI for document grouping tasks.
IV. Contributions
Further directions
Exploration of new approaches to the representation of documents (Wikipedia, Google n-gram corpus).
Application of the evaluation methods developed for WePS to other domains and tasks.
Future WePS evaluation campaigns.
Search for organizations.
Multilingual search (documents in different languages referring to the same person).
A new task integrating the clustering and attribute extraction problems.
IV. Contributions
73
Thank you !
Previous work
Related NLP tasks:
Cross Document Coreference.
Word Sense Disambiguation.
Word Sense Induction.
Test collections:
Until 2006 mostly newswire collections; Web collections are predominant now.
Manual annotation, but also pseudo-ambiguity generation.
In most cases, created ad hoc for a particular piece of research.
Disambiguation methods:
Hierarchical Agglomerative Clustering (HAC) is the most frequently employed method.
I. Introduction
Unanimous Improvement Ratio
75
UIR reflects the range of improvement (example values from the figure: UIR = 0.03, 0.45 and 0.77).
Web People Search vs. other NLP tasks
Cross-document Coreference: tries to link mentions of the same entities in a collection of texts.
Web People Search: groups documents that contain a mention of the same individual.
Example: Doc. 1: "[…] Captain John Smith (c. January 1580 – June 21, 1631) […]"; Doc. 2: "[…] John Smith was an English adventurer […]"
76
Web People Search vs. other NLP tasks
Word Sense Disambiguation: can rely on dictionaries to define the number of "senses" of an ambiguous term; disambiguation of common words.
Word Sense Induction: disambiguation of common words.
Citation disambiguation: handles very structured information in a closed domain (scientific literature).
Web People Search: the number of senses is not known a priori; person name disambiguation; Web pages, open domain.
77
WePS-1 Training and test collections
78
Attributes
Occupation, affiliation and work are the most common.
Most attributes appear in less than 1/10 of the documents.
79
Attribute Extraction Results
Difficult task!
80
Scores per attribute
81
Different Attributes, Different Results
Four types of attributes, based on their characteristics:

Attributes: Phone, FAX, email, Website
Description: there is a typical pattern
Performance: R: 74-40 (ECNU, UvA)
Comments: disambiguation is needed.

Attributes: Degree, Nationality
Description: unfamiliar NE, but candidates are limited
Performance: R: 43-42 (CASIANED)
Comments: we need a good NE tagger for the category. Maybe possible.

Attributes: Date of birth, Birth place, Other name, Affiliation, School, Mentor, Relative
Description: typical NE, disambiguation is needed
Performance: R: 55-17 (MIVTU, UvA, PolyUHK)
Comments: NE tagger is ready; we need good disambiguation.

Attributes: Award, Major, Occupation
Description: unfamiliar and difficult NE type
Performance: R: 17-38 (UvA)
Comments: we need a good NE tagger for the category. It looks very difficult.
82
Typical System Strategy
Most systems use two phase strategy
1. Find the candidates
• Use NE tagger, gazetteer, regular expression to find candidates which have the same type to the target attribute
2. Filter (verify) the candidates
• Select only those which are the attribute-values of the target person. It can be done by local pattern, supervised classification, distance & cue phrase.
83
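A minimal sketch of that two-phase strategy for a single, pattern-friendly attribute (email), assuming candidates are found with a regular expression and then filtered by a simple proximity heuristic with respect to the target name; actual WePS-2 systems used NE taggers, gazetteers and trained classifiers, and the window size here is invented.

import re

EMAIL = re.compile(r"[\w.+-]+@[\w.-]+\.[A-Za-z]{2,}")

def extract_email(text, person_name, window=40):
    # Phase 1: find all candidate values of the right type.
    candidates = [(m.group(), m.start()) for m in EMAIL.finditer(text)]
    # Phase 2: keep only candidates that occur close to a mention of the target person.
    mentions = [m.start() for m in re.finditer(re.escape(person_name), text, re.I)]
    return [value for value, pos in candidates
            if any(abs(pos - mention) <= window for mention in mentions)]

text = ("Contact John Tait at john.tait@example.org. "
        "For press enquiries write to press@unrelated-company.com, "
        "far away from any mention of the person we care about.")
print(extract_email(text, "John Tait"))   # ['john.tait@example.org']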
Combining similarity criteria: PWA and MaxPWA (upper boundary)
The classification accuracy of one similarity criterion is:
PWA(x) = Prob( Sim_x(D_A, D_A') > Sim_x(D_B, D_C) )
We want to learn the relative weight of feature classes (e.g. person names vs. tokens).
Evaluation by a machine learning algorithm (e.g. a decision tree).
Upper bound of any algorithm: when combining similarity criteria, at least one of them should identify the coreferent document pair:
MaxPWA(<x_1, x_2, ..., x_n>) = Prob( ∃ x_i ∈ X : Sim_{x_i}(D_A, D_A') > Sim_{x_i}(D_B, D_C) )
84
WePS-1 summary
We have built a manual testbed corpus for development and evaluation of WePS systems: 47 person names, almost 4,700 documents, with double annotation of the test data.
We have done a systematic evaluation and comparison of WePS systems: 29 teams expressed their interest in the task; 16 teams submitted results within the deadline.
Variability across test cases is large and unpredictable.
Testbed creation is more difficult and expensive than expected.
Purity measures can be cheated! Are purity and inverse purity the best options available among clustering metrics?
The combination of metrics has a strong effect on how we measure the contribution of systems (baselines are an extreme case). How does the combination of metrics affect the systems ranking?
87
Contributions
A study of the actual need for name disambiguation systems.
In most cases there is an optimal query refinement for an individual…
… but this refinement is unlikely to be known in advance:
a date, a related person name, a place, the title of a book?
Results support the interest raised in the scientific community and the Web search business.
88
Contributions
Development of reference test collections.
We have carried out two dedicated evaluation campaigns: WePS-1 and WePS-2.
The problem has been standardised as a search results mining task (clustering and IE).
Creation of standard benchmarks for the WePS task.
Around 8,000 manually annotated web documents.
Including biographical features in WePS-2.
Manual annotation for WePS has proven to be a difficult process:
Lack of context in some documents, uncertainty even when information is available, high ambiguity, etc.
Too much information (genealogies)… or too little (public profiles from social networks).
Importance of training assessors and reaching a consensus for clustering two documents.
89
Contributions
Development of improved clustering evaluation metrics.
We have defined four constraints for clustering quality metrics.
We have tested these constraints against several families of clustering metrics.
Only BCubed satisfies all constraints.
An additional constraint was defined to account for overlapping clustering.
BCubed has been extended for overlapping clustering and successfully applied to WePS results.
The Unanimous Improvement Ratio measure has been proposed:
It complements Precision and Recall weighting functions (F-measure).
It indicates the robustness of improvements across different α values of F.
90
Contributions
The relevance of the Clustering Stopping Criterion
Ambiguity of person names is very variable.
From one to more than 70 in the top 100 search results.
This variability represents a challenge for clustering systems:
A baseline system can achieve higher scores than the best team by using the best similarity threshold for each topic.
Training a stopping criterion with a baseline approach achieves poor results...
… but a competitive result is achieved if we also train the relative weights of the document similarity metrics.
In terms of evaluation, the trade-off between precision and recall metrics is determined by the stopping criterion.
This leads to high variability of rankings depending on the evaluation metrics combination.
UIR provides complementary information in this context.
91
Contributions
Study of the role of Named Entities and other features in the WePS task.
In the clustering process, NEs are not necessarily more useful than features such as word n-grams.
In our experiments, linguistic information (NEs, noun phrases, etc.) did not provide better results than computationally cheap features such as tokens and word n-grams.
More sophisticated ways of using this type of information might yield better results.
92
Contributions
A large testbed for the WePS task.
Manually annotated collections for WePS-1 and WePS-2 campaigns.
Also available pre-processed, annotated with NLP tools and indexed with Lucene.
An annotation GUI has been developed to ease the manual grouping of web documents.
It can be reused for annotation on other disambiguation problems.
An evaluation package.
Includes standard clustering evaluation metrics.
Implements BCubed metrics and the Unanimous Improvement Ratio measure.
93