A Two Tier Framework for Context-Aware Service Organization & Discovery
Wei Zhang1, Jian Su2, Bin Chen2, Wenting Wang2, Zhiqiang Toh2, Yanchuan Sim2, Yunbo Cao3, Chin Yew Lin3 and Chew Lim Tan1
1 National University of Singapore
2 Institute for Infocomm Research
3 Microsoft Research Asia
I2R-NUS-MSRA at TAC 2011: Entity Linking
Text Analysis Conference, November 14-15, 2011
Outline
- I2R-NUS team at TAC
  - Incorporate the new technologies proposed in our recent papers (IJCAI 2011, IJCNLP 2011)
    - Acronym Expansion
    - Semantic Features
    - Instance Selection
  - Investigate three algorithms for NIL query clustering
    - Spectral Graph Partitioning (SGP)
    - Hierarchical Agglomerative Clustering (HAC)
    - Latent Dirichlet Allocation (LDA)
    - Combination system
- Offline combination with the system of the MSRA team at the KB linking step
Acronym Expansion - Motivation
- Expanding an acronym from its context reduces the ambiguity of a name
  - E.g., TSE refers to 33 entries in Wikipedia, whereas Tokyo Stock Exchange is unambiguous.
Step 1 – Find Expansion Candidates
Identifying Candidate Expansions (e.g. for ACM)
Step 2 – Candidate Expansions Ranking
- Use an SVM classifier to rank the candidates, with features:
  - Number of common characters between the acronym and the leading characters of the expansion
  - Sentence distance between the acronym and the expansion
- Our SVM-based acronym expansion
  - can link acronyms and full strings that appear in different sentences of the article
  - can handle acronyms with swapped letters, e.g. Communist Party of China vs. CCP
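The two ranking features named on this slide can be sketched as follows. This is a minimal illustration, not the system's actual feature extractor: the function names are hypothetical, and treating "leading characters" as the initials of capitalised words, compared without regard to order, is an assumption made so that swapped letters (CCP vs. Communist Party of China) still match.

```python
from collections import Counter


def leading_chars(expansion):
    """Initial letter of every capitalised word (assumption: 'of', 'the' are skipped)."""
    return [word[0].upper() for word in expansion.split() if word[0].isupper()]


def common_char_count(acronym, expansion):
    """Order-insensitive overlap between acronym letters and leading letters."""
    overlap = Counter(acronym.upper()) & Counter(leading_chars(expansion))
    return sum(overlap.values())


def sentence_distance(acronym_sentence_idx, expansion_sentence_idx):
    """Number of sentences separating the acronym from the candidate expansion."""
    return abs(acronym_sentence_idx - expansion_sentence_idx)


def ranking_features(acronym, expansion, acronym_sentence_idx, expansion_sentence_idx):
    """Feature vector that a ranker such as an SVM could consume."""
    return [
        common_char_count(acronym, expansion),
        sentence_distance(acronym_sentence_idx, expansion_sentence_idx),
    ]
```

Because the overlap is computed on letter counts rather than positions, "CCP" matches all three leading letters of "Communist Party of China" even though the order differs.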
Related Work on Context Similarity
The 5th International Joint Conference on Natural Language Processing, November 8-13, 2011, Chiang Mai, Thailand
A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection
- Zhang et al., 2010; Zheng et al., 2010; Dredze et al., 2010
  - Term matching
- However, consider:
  1) Michael Jordan is a leading researcher in machine learning and artificial intelligence.
  2) Michael Jordan is currently a full professor at the University of California, Berkeley.
  3) Michael Jordan (born February, 1963) is a former American professional basketball player.
  4) Michael Jordan wins NBA MVP of 91-92 season.
- No term match between 1) and 2), or between 3) and 4)
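The "no term match" problem can be checked directly on the four example sentences. The sketch below is illustrative only: the stopword list, the tokeniser and the function names are assumptions, not part of any cited system.

```python
import re

# Small illustrative stopword list (an assumption, not taken from the slides).
STOPWORDS = {"is", "a", "an", "the", "in", "at", "of", "and"}

SENTENCES = [
    "Michael Jordan is a leading researcher in machine learning and artificial intelligence.",
    "Michael Jordan is currently a full professor at the University of California, Berkeley.",
    "Michael Jordan (born February, 1963) is a former American professional basketball player.",
    "Michael Jordan wins NBA MVP of 91-92 season.",
]


def content_terms(sentence, query_name):
    """Lower-cased tokens, minus stopwords and the tokens of the query name itself."""
    tokens = re.findall(r"[a-z0-9-]+", sentence.lower())
    name_tokens = set(query_name.lower().split())
    return {t for t in tokens if t not in STOPWORDS and t not in name_tokens}


def term_overlap(sentence_a, sentence_b, query_name):
    """Context terms shared by two mentions of the same name."""
    return content_terms(sentence_a, query_name) & content_terms(sentence_b, query_name)
```

Under these assumptions, sentences 1 and 2 share no context term although both describe the researcher, and the same holds for sentences 3 and 4, which both describe the basketball player.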
Our System - A Wikipedia-LDA model
1) Michael Jordan is a leading researcher in machine learning and artificial intelligence.
2) Michael Jordan is currently a full professor at the University of California, Berkeley.
Topic: Science
3) Michael Jordan (born February, 1963) is a former American professional basketball player.
4) Michael Jordan wins NBA MVP of 91-92 season.
Topic: Basketball
Wikipedia – LDA Model
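[Figure: Wikipedia-LDA model. Each category is described by word probabilities P(word i | category j), and each document by category probabilities P(category i | document j), plotted on a scale from 0 to 1.]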
Wikipedia – LDA Model
1) Michael Jordan is a leading researcher in machine learning and artificial intelligence.
2) Michael Jordan is currently a full professor at the University of California, Berkeley.
3) Michael Jordan (born February, 1963) is a former American professional basketball player.
4) Michael Jordan wins NBA MVP of 91-92 season.
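To illustrate why comparing contexts in category space helps, the sketch below maps each sentence to a distribution over two categories and compares those distributions. It is a deliberately crude stand-in: the P(word | category) values are hand-set toy numbers, and summing word probabilities replaces real LDA inference, so none of this reflects the trained Wikipedia-LDA model.

```python
import math
import re

# Toy, hand-set P(word | category) values (assumption, for illustration only).
P_WORD_GIVEN_CATEGORY = {
    "Science": {"researcher": 0.2, "machine": 0.1, "learning": 0.2,
                "intelligence": 0.1, "professor": 0.2, "university": 0.2},
    "Basketball": {"basketball": 0.3, "player": 0.2, "nba": 0.3,
                   "mvp": 0.1, "season": 0.1},
}
CATEGORIES = sorted(P_WORD_GIVEN_CATEGORY)

SENTENCES = [
    "Michael Jordan is a leading researcher in machine learning and artificial intelligence.",
    "Michael Jordan is currently a full professor at the University of California, Berkeley.",
    "Michael Jordan (born February, 1963) is a former American professional basketball player.",
    "Michael Jordan wins NBA MVP of 91-92 season.",
]


def category_distribution(text, smoothing=1e-6):
    """Crude estimate of P(category | document): normalised sum of word probabilities."""
    tokens = re.findall(r"[a-z0-9-]+", text.lower())
    mass = [smoothing + sum(P_WORD_GIVEN_CATEGORY[c].get(t, 0.0) for t in tokens)
            for c in CATEGORIES]
    total = sum(mass)
    return [m / total for m in mass]


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


TOPIC_VECTORS = [category_distribution(s) for s in SENTENCES]
```

With these toy numbers, sentences 1 and 2 end up close to each other in category space, as do sentences 3 and 4, while the two groups are nearly orthogonal, even though none of the pairs shares a context term.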
Related Work
- Vector Space Model
  - Difficult to combine bag of words (BOW) with other features
  - Performance needs to be improved
- Supervised approaches
  - Using manually annotated training instances
    - Dredze et al., 2010; Zheng et al., 2010
  - Using automatically generated training instances
    - Zhang et al., 2010
Related Work
- Auto-generating training instances (Zhang et al., 2010)
(News Article) Obama Campaign Drops The George W. Bush Talking Point …
Related Work
- From "George W. Bush" articles
  - No positive instances are generated for "George H. W. Bush", "George P. Bush" and "George Washington Bush"
  - No negative instances are generated for "George W. Bush"
- The resulting distribution of positive and negative training instances may differ from that of the original ambiguous cases in the raw text collection
- The distribution of the unambiguous mentions may not be the same in the test data
The Approach in Our System
- An instance selection approach
  - Select an informative, representative, and diverse subset from the auto-generated data set
  - Reduce the effect of the distribution differences
Instance Selection
[Figure: instance selection flow, illustrated on a 2-D data set with the SVM hyperplane]
1. Start from a small initial data set
2. Train an SVM classifier
3. Test on the auto-generated data set
4. Select informative, representative and diverse instances
5. Add these selected instances to the initial data set
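The selection step can be sketched as below. This is a simplified illustration under stated assumptions, not the paper's exact criterion: "informative" is approximated by a small absolute SVM decision value, "diverse" by a minimum Euclidean distance between selected instances, and the "representative" criterion is left out. The function name and parameters are hypothetical, and the decision values are assumed to come from the classifier trained on the current data set.

```python
import math


def select_batch(candidates, decision_values, batch_size, min_distance):
    """Return indices of up to batch_size candidates to add to the training set.

    candidates      -- feature vectors of the auto-generated instances
    decision_values -- signed distances to the hyperplane, one per candidate
    """
    # Most informative first: instances the classifier is least sure about.
    order = sorted(range(len(candidates)), key=lambda i: abs(decision_values[i]))
    selected = []
    for i in order:
        if len(selected) == batch_size:
            break
        # Diversity: skip candidates that nearly duplicate an already selected one.
        if all(math.dist(candidates[i], candidates[j]) >= min_distance for j in selected):
            selected.append(i)
    return selected


# Toy 2-D data set; the hyperplane x0 - x1 = 0 is an assumed stand-in for a trained SVM.
POINTS = [(0.0, 0.1), (0.05, 0.12), (2.0, 0.0), (1.0, 1.2), (0.0, 3.0)]
SCORES = [x0 - x1 for x0, x1 in POINTS]
CHOSEN = select_batch(POINTS, SCORES, batch_size=2, min_distance=0.5)
```

In the toy data, the two points closest to the hyperplane are almost identical, so only one of them is kept and the next most uncertain point that lies far enough away fills the batch.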
Spectral Clustering
- Advantages over other clustering techniques
  - Globally optimized results
  - Efficient in time and space
  - Generally produces better results
- Success in many areas
  - Image segmentation
  - Gene expression clustering
Spectral Clustering
$A = Q \Lambda Q^{-1}$
- Eigendecomposition of the graph Laplacian
- Dimensionality reduction
- (Luxburg, 2006)
[Figure labels: George W. Bush; George H.W. Bush]
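A two-way spectral partition can be sketched in plain Python as follows. This is only an illustration of the idea: it builds the unnormalised Laplacian L = D - W and approximates its second-smallest eigenvector (the Fiedler vector) by shifted power iteration instead of a full eigendecomposition, then splits the documents by sign. The function name, the toy similarity matrix and the choice of the unnormalised Laplacian are assumptions.

```python
def fiedler_bipartition(W, iterations=500):
    """Split graph nodes into two groups using the sign of the Fiedler vector.

    W is a symmetric similarity matrix with a zero diagonal.
    """
    n = len(W)
    degree = [sum(row) for row in W]
    # Unnormalised graph Laplacian L = D - W.
    L = [[(degree[i] if i == j else 0.0) - W[i][j] for j in range(n)] for i in range(n)]
    # Eigenvalues of L lie in [0, 2 * max degree], so (shift * I - L) is positive
    # semidefinite and its dominant direction orthogonal to the constant vector
    # is the Fiedler vector.
    shift = 2.0 * max(degree)
    v = [float(i) for i in range(n)]
    for _ in range(iterations):
        mean = sum(v) / n
        v = [x - mean for x in v]  # project out the constant eigenvector
        v = [shift * v[i] - sum(L[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return [1 if x > 0 else 0 for x in v]


# Toy graph: two tightly connected triples of documents joined by one weak edge.
W = [
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
]
LABELS = fiedler_bipartition(W)
```

For more than two clusters, the usual approach is to embed each node with several of the smallest eigenvectors and cluster the embedded points, which is the dimensionality reduction step the slide refers to.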
Hierarchical Agglomerative Clustering
- Convert each doc into a feature vector: Wikipedia concepts, bag-of-words and named entities.
- Estimate the weight of each feature using the Query Relevance Weighting Model (Long and Shi, 2010):
  - This model shows good performance in Web People Search
  - In our work, the original query name, its Wikipedia redirected names and its coreference chain mentions are all considered appearances of the query name in the text.
- Similarity scores: cosine similarity and overlap similarity.
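The two similarity scores can be sketched on sparse feature vectors stored as dictionaries mapping a feature to its weight. The slide does not define overlap similarity, so the overlap coefficient (shared features divided by the size of the smaller feature set) is assumed here; the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two weighted sparse feature vectors."""
    dot = sum(weight * v[feature] for feature, weight in u.items() if feature in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def overlap_similarity(u, v):
    """Overlap coefficient on feature sets (assumed definition, ignores weights)."""
    smaller = min(len(u), len(v))
    return len(set(u) & set(v)) / smaller if smaller else 0.0
```

For example, two documents with features {"nba", "bulls"} and {"nba", "berkeley"}, all with weight 1.0, share one of two features, so both scores come out at about one half.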
Hierarchical Agglomerative Clustering
- Docs referring to the same entity are clustered according to doc pair-wise similarity scores.
  - Start with singletons: each doc is a cluster
  - If there are two docs D and D' in clusters Ci and Cj respectively, the two clusters Ci and Cj are merged to form a new cluster Cij if Sim(D, D') > γ
  - Calculate the similarity between the new cluster Cij and all remaining clusters
  - γ = 0.25
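The merging procedure can be sketched as threshold-based agglomerative clustering. Reading the merge rule as single-link (two clusters merge when any cross-cluster document pair exceeds γ) is an interpretation of the slide; the function name and the toy similarity matrix are hypothetical.

```python
def threshold_hac(num_docs, sim, gamma=0.25):
    """Merge clusters while some cross-cluster document pair has similarity > gamma.

    sim(d, e) returns the pair-wise similarity of documents d and e.
    """
    clusters = [{i} for i in range(num_docs)]  # start with singletons

    def cluster_sim(a, b):
        # Single-link: the best document pair decides the cluster similarity.
        return max(sim(d, e) for d in a for e in b)

    while len(clusters) > 1:
        best_score, (i, j) = max(
            (cluster_sim(clusters[i], clusters[j]), (i, j))
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best_score <= gamma:
            break
        merged = clusters[i] | clusters[j]
        # Similarities to the new cluster are recomputed on the next pass.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(sorted(c) for c in clusters)


# Toy pair-wise similarity matrix for four documents.
SIM = [
    [1.0, 0.6, 0.1, 0.0],
    [0.6, 1.0, 0.3, 0.0],
    [0.1, 0.3, 1.0, 0.2],
    [0.0, 0.0, 0.2, 1.0],
]
CLUSTERS = threshold_hac(4, lambda d, e: SIM[d][e])
```

With γ = 0.25, documents 0 and 1 merge first, document 2 joins through its 0.3 link to document 1, and document 3 stays alone because its best link is only 0.2.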
Latent Dirichlet Allocation (LDA)
- LDA has been applied to many NLP tasks, such as summarization and text classification
- In our approach, the learned topics can represent the underlying entities of the ambiguous names
- Generative story:
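The standard LDA generative story can be sketched as sampling code: draw a per-document topic mixture from a Dirichlet prior, then for every word position draw a topic from that mixture and a word from the chosen topic's word distribution. This is the textbook formulation and may differ in detail from the variant shown on the original slide; the vocabulary, priors and function names below are toy assumptions.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(num_words, alpha, topic_word_dists, vocab, rng):
    """Generate one document; returns (theta, topic assignments, words)."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    topics, words = [], []
    for _ in range(num_words):
        z = rng.choices(range(len(theta)), weights=theta)[0]   # topic for this word
        w = rng.choices(vocab, weights=topic_word_dists[z])[0]  # word from that topic
        topics.append(z)
        words.append(w)
    return theta, topics, words


# Toy setup: topic 0 only emits the first two words, topic 1 only the last two.
VOCAB = ["basketball", "nba", "professor", "research"]
TOPIC_WORD_DISTS = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
THETA, TOPICS, WORDS = generate_document(
    8, [0.5, 0.5], TOPIC_WORD_DISTS, VOCAB, random.Random(0)
)
```

In the clustering setting described on the slide, each topic would stand for one underlying entity of the ambiguous name, and a document is assigned to the entity whose topic dominates its mixture.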
Three Clustering Systems Combination
- Three-class SVM classifier to decide which system to trust
- Features: scores given by the three systems
Combine with the system of MSRA team at KB linking step
- Binary SVM classifier to decide which system to trust
- Features: scores given by the two systems
Experiment for Three Clustering Algorithms
Algorithms     Eval 09   Eval 10   Eval 10+
SGP            0.745     0.954     0.809
HAC            0.666     0.950     0.789
LDA            0.782     0.981     0.841
Combination    0.795     0.982     0.852
Submissions
Systems    Acc.    Precision   Recall   F1
Full       0.863   0.815       0.849    0.831
Partial    0.844   0.797       0.829    0.813
Highest    -       -           -        0.846
Median     -       -           -        0.716
Conclusion
- Incorporate the new technologies proposed in our recent papers (IJCAI 2011, IJCNLP 2011)
  - Acronym Expansion
  - Semantic Features
  - Instance Selection
- Investigate three algorithms for NIL query clustering
  - Spectral Graph Partitioning (SGP)
  - Hierarchical Agglomerative Clustering (HAC)
  - Latent Dirichlet Allocation (LDA)