I2R-NUS-MSRA at TAC 2011: Entity Linking

Wei Zhang 1, Jian Su 2, Bin Chen 2, Wenting Wang 2, Zhiqiang Toh 2, Yanchuan Sim 2, Yunbo Cao 3, Chin Yew Lin 3 and Chew Lim Tan 1

1 National University of Singapore
2 Institute for Infocomm Research
3 Microsoft Research Asia

Text Analysis Conference, November 14-15, 2011
Outline

- I2R-NUS team at TAC
  - Incorporate the new technologies proposed in our recent papers (IJCAI 2011, IJCNLP 2011)
    - Acronym Expansion
    - Semantic Features
    - Instance Selection
  - Investigate three algorithms for NIL query clustering
    - Spectral Graph Partitioning (SGP)
    - Hierarchical Agglomerative Clustering (HAC)
    - Latent Dirichlet Allocation (LDA)
    - Combination system
- Offline combination with the system of the MSRA team at the KB linking step

Acronym Expansion - Motivation

- Expand an acronym from its context to reduce the ambiguity of a name
  - E.g. "TSE" refers to 33 entries in Wikipedia, whereas "Tokyo Stock Exchange" is unambiguous.

Step 1 – Find Expansion Candidates

Identifying Candidate Expansions (e.g. for ACM)

Step 2 – Candidate Expansions Ranking

- An SVM classifier ranks the candidate expansions, using features such as:
  - Number of common characters between the acronym and the leading characters of the expansion
  - Sentence distance between the acronym and the expansion
- Our SVM-based acronym expansion
  - can link acronyms and full strings that appear in different sentences of the article
  - can handle acronyms with swapped letters, e.g. Communist Party of China vs. CCP
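The two ranking features named on this slide can be sketched as follows. This is only an illustration: the function names and the order-insensitive multiset matching are assumptions, not the system's actual feature code, and the SVM ranker itself is left out.

```python
from collections import Counter


def leading_char_overlap(acronym, expansion):
    """Count acronym letters that also occur as word-initial letters of the
    expansion. The comparison ignores order, so swapped letters still match."""
    acronym_letters = Counter(acronym.upper())
    initials = Counter(word[0].upper() for word in expansion.split())
    # Multiset intersection keeps the minimum count of each shared letter.
    return sum((acronym_letters & initials).values())


def expansion_features(acronym, expansion, acronym_sentence, expansion_sentence):
    """Build a feature dict for one (acronym, candidate expansion) pair.
    Sentence positions are 0-based indices within the article (hypothetical)."""
    return {
        "common_leading_chars": leading_char_overlap(acronym, expansion),
        "sentence_distance": abs(acronym_sentence - expansion_sentence),
    }


# The acronym appears in sentence 4, the full name in sentence 1.
features = expansion_features("CCP", "Communist Party of China", 4, 1)
```

Because the letter comparison is a multiset intersection, "CCP" matches all three of its letters against the initials C, P, O, C even though the word order differs.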

Related Work on Context Similarity

The 5th International Joint Conference on Natural Language Processing, November 8-13, 2011, Chiang Mai, Thailand

A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection

- Zhang et al., 2010; Zheng et al., 2010; Dredze et al., 2010
  - Term Matching
- However,
  1) Michael Jordan is a leading researcher in machine learning and artificial intelligence.
  2) Michael Jordan is currently a full professor at the University of California, Berkeley.
  3) Michael Jordan (born February, 1963) is a former American professional basketball player.
  4) Michael Jordan wins NBA MVP of 91-92 season.
- No Term Match
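The "no term match" problem can be checked directly on the example sentences. The sketch below assumes a simple regex tokenizer, a small hand-picked stopword list and Jaccard overlap. None of these come from the slides; they only illustrate why pure term matching gives no signal here.

```python
import re

# Hypothetical stopword list, just large enough for the four example sentences.
STOPWORDS = {"is", "a", "in", "and", "at", "the", "of"}
NAME_TOKENS = {"michael", "jordan"}


def content_terms(sentence):
    """Lowercase, tokenize, and drop stopwords and the query name itself."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return {t for t in tokens if t not in STOPWORDS and t not in NAME_TOKENS}


def term_overlap(sentence_a, sentence_b):
    """Jaccard overlap between the content terms of two sentences."""
    a, b = content_terms(sentence_a), content_terms(sentence_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


s1 = "Michael Jordan is a leading researcher in machine learning and artificial intelligence."
s2 = "Michael Jordan is currently a full professor at the University of California, Berkeley."
s3 = "Michael Jordan (born February, 1963) is a former American professional basketball player."
s4 = "Michael Jordan wins NBA MVP of 91-92 season."

same_person_researcher = term_overlap(s1, s2)  # both about the researcher
same_person_player = term_overlap(s3, s4)      # both about the basketball player
```

Both overlaps are zero, even though each pair of sentences refers to the same person.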

Our System - A Wikipedia-LDA model

Topic: Science
  1) Michael Jordan is a leading researcher in machine learning and artificial intelligence.
  2) Michael Jordan is currently a full professor at the University of California, Berkeley.

Topic: Basketball
  3) Michael Jordan (born February, 1963) is a former American professional basketball player.
  4) Michael Jordan wins NBA MVP of 91-92 season.

Wikipedia – LDA Model

[Figure: Wikipedia-LDA model. Labels: P(word i | category j) and, per document, P(category i | document j); vertical axis from 0 to 1.]
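To make the role of P(word | category) and P(category | document) concrete, here is a toy sketch that compares contexts by their category distributions instead of their terms. The probability table is made up, and the posterior is computed with a naive-Bayes-style product under a uniform prior. That is a simplified stand-in for LDA inference, not the Wikipedia-LDA model itself.

```python
import math

# Hypothetical P(word | category); a real model would estimate this from Wikipedia.
P_WORD_GIVEN_CATEGORY = {
    "Science": {"researcher": 0.30, "professor": 0.30, "university": 0.30,
                "nba": 0.05, "basketball": 0.05},
    "Basketball": {"researcher": 0.05, "professor": 0.05, "university": 0.10,
                   "nba": 0.40, "basketball": 0.40},
}


def category_distribution(words):
    """Approximate P(category | document) with a uniform prior over categories."""
    log_scores = {
        category: sum(math.log(probs[w]) for w in words if w in probs)
        for category, probs in P_WORD_GIVEN_CATEGORY.items()
    }
    max_log = max(log_scores.values())  # subtract the max for numerical stability
    unnormalized = {c: math.exp(s - max_log) for c, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}


def cosine(p, q):
    """Cosine similarity between two distributions over the same categories."""
    dot = sum(p[c] * q[c] for c in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)


doc1 = category_distribution(["researcher"])               # sentence 1
doc2 = category_distribution(["professor", "university"])  # sentence 2
doc4 = category_distribution(["nba"])                      # sentence 4

science_pair = cosine(doc1, doc2)
mixed_pair = cosine(doc1, doc4)
```

Sentences 1 and 2 share no terms, yet their category distributions are close, while sentence 4 lands on a different category.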

Related Work

- Vector Space Model
  - Difficult to combine bag of words (BOW) with other features
  - Performance needs to be improved
- Supervised Approaches
  - Using manually annotated training instances
    - Dredze et al., 2010; Zheng et al., 2010
  - Using automatically generated training instances
    - Zhang et al., 2010

Related Work

- Auto-generated training instances (Zhang et al., 2010)
  - (News Article) Obama Campaign Drops The George W. Bush Talking Point …

Related Work

- From "George W. Bush" articles
  - No positive instances are generated for "George H. W. Bush", "George P. Bush" and "George Washington Bush"
  - No negative instances are generated for "George W. Bush"
- The resulting distribution of positive and negative training instances may differ from that of the originally ambiguous cases in the raw text collection
- The distribution of the unambiguous mentions may not be the same as in the test data

The Approach in Our System

- An instance selection approach
  - Select an informative, representative and diverse subset from the auto-generated data set
  - Reduce the effect of the distribution differences

Instance Selection

[Flowchart, illustrated on a 2-D data set with an SVM hyperplane]
  1) Start from a small initial data set
  2) Train an SVM classifier
  3) Test on the auto-generated data set
  4) Select informative, representative and diverse instances
  5) Add these selected instances to the initial data set
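The selection step of this loop might look like the sketch below. It assumes that "informative" means close to the SVM hyperplane (small absolute decision value) and that "diverse" means not too similar to instances already picked. The representativeness criterion is omitted for brevity, and the function name, `max_similarity` threshold and toy data are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_batch(pool, margins, batch_size, max_similarity=0.9):
    """Pick up to batch_size indices from the auto-generated pool.

    pool    -- feature vectors of the auto-generated instances
    margins -- signed distance of each instance to the current SVM hyperplane
    """
    # Informative first: instances the current classifier is least sure about.
    by_uncertainty = sorted(range(len(pool)), key=lambda i: abs(margins[i]))
    selected = []
    for i in by_uncertainty:
        if len(selected) == batch_size:
            break
        # Diverse: skip near-duplicates of instances already selected.
        if all(cosine(pool[i], pool[j]) <= max_similarity for j in selected):
            selected.append(i)
    return selected


pool = [(1.0, 0.0), (1.0, 0.01), (0.0, 1.0), (1.0, 1.0)]
margins = [0.10, 0.12, 0.20, 2.00]
batch = select_batch(pool, margins, batch_size=2)
```

In a full loop, the selected instances would be added to the training set, the classifier retrained, and the margins recomputed before the next batch is chosen.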

Spectral Clustering

- Advantages over other clustering techniques
  - Globally optimized results
  - Efficient in time and space
  - Generally produces better results
- Success in many areas
  - Image segmentation
  - Gene expression clustering

Spectral Clustering

- Eigendecomposition of the graph Laplacian: $A = Q \Lambda Q^{-1}$
- Dimensionality reduction
  - (Luxburg, 2006)

[Figure: example with the entities George W. Bush and George H.W. Bush]
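For reference, the decomposition on this slide can be written out for the unnormalized graph Laplacian, following the standard spectral clustering recipe. The symbols $W$ (pairwise document similarity matrix), $D$ (degree matrix), $L$ and $k$ (number of clusters) are notation assumed here; the slides do not say which Laplacian variant the system used, and the slide's $A$ is taken to stand for the Laplacian.

```latex
\begin{aligned}
L &= D - W, \qquad D_{ii} = \sum_{j} W_{ij} \\
L &= Q \Lambda Q^{-1} = Q \Lambda Q^{\top}
  \quad \text{(since $L$ is symmetric, $Q$ can be chosen orthogonal)} \\
U &= [\, q_1 \; q_2 \; \cdots \; q_k \,] \in \mathbb{R}^{n \times k}
  \quad \text{(eigenvectors of the $k$ smallest eigenvalues)}
\end{aligned}
```

Row $i$ of $U$ is then a $k$-dimensional representation of document $i$, which is the dimensionality reduction step. The rows are typically clustered with k-means to obtain the final partition.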

Hierarchical Agglomerative Clustering

- Convert a doc into a feature vector: Wikipedia concepts, bag-of-words and named entities
- Estimate the weight of each feature using the Query Relevance Weighting Model (Long and Shi, 2010)
  - This model shows good performance in Web People Search
  - In our work, the original query name, its Wikipedia redirected names and its coreference chain mentions are all considered as appearances of the query name in the text
- Similarity scores: cosine similarity and overlap similarity
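The two similarity scores can be sketched over sparse weighted feature vectors. The slides do not define "overlap similarity", so the overlap coefficient (shared features divided by the size of the smaller feature set) is assumed here. The feature weights below are invented and do not come from the Query Relevance Weighting Model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors given as {feature: weight}."""
    dot = sum(weight * v[feature] for feature, weight in u.items() if feature in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def overlap_similarity(u, v):
    """Overlap coefficient on the feature sets (weights are ignored)."""
    smaller = min(len(u), len(v))
    return len(set(u) & set(v)) / smaller if smaller else 0.0


# Hypothetical document vectors mixing Wikipedia concepts and bag-of-words features.
doc_a = {"concept:Machine_learning": 1.0, "word:professor": 1.0}
doc_b = {"concept:Machine_learning": 1.0, "word:berkeley": 1.0}

cos_ab = cosine_similarity(doc_a, doc_b)
overlap_ab = overlap_similarity(doc_a, doc_b)
```

Cosine takes the feature weights into account, while the overlap coefficient only looks at which features are present.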

Hierarchical Agglomerative Clustering

- Docs referring to the same entity are clustered according to pairwise doc similarity scores
- Start with singletons: each doc is a cluster
- If there are two docs D and D' in clusters Ci and Cj respectively, the two clusters Ci and Cj are merged to form a new cluster Cij if Sim(D, D') > γ, with γ = 0.25
- Calculate the similarity between the new cluster Cij and all remaining clusters
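The merging rule above amounts to single-link agglomerative clustering with a stopping threshold. A minimal sketch follows, in which Jaccard similarity on term sets is a placeholder for the system's actual similarity score and the toy documents are made up.

```python
def jaccard(a, b):
    """Placeholder document similarity on term sets (an assumption)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def hac_single_link(docs, sim, gamma=0.25):
    """Merge clusters while some cross-cluster doc pair has similarity > gamma."""
    clusters = [[i] for i in range(len(docs))]  # start with singletons
    while True:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: the most similar pair (D, D') across the two clusters.
                score = max(sim(docs[i], docs[j])
                            for i in clusters[a] for j in clusters[b])
                if score > gamma and (best is None or score > best[0]):
                    best = (score, a, b)
        if best is None:
            return clusters  # no remaining pair exceeds the threshold
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # Ci and Cj become Cij
        del clusters[b]                          # b > a, so index a stays valid


docs = [{"bush", "president", "texas"},
        {"bush", "texas", "governor"},
        {"bush", "pioneer", "oregon"}]
clusters = hac_single_link(docs, jaccard, gamma=0.25)
```

Here the first two documents share two of four terms (similarity 0.5) and are merged, while the third stays in its own cluster because its best similarity to either of them is 0.2.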

Latent Dirichlet Allocation (LDA)

- LDA has been applied to many NLP tasks, such as summarization and text classification
- In our approach, the learned topics can represent the underlying entities of the ambiguous names
- Generative story:
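The slide's own figure did not survive extraction. For orientation only, the generative story of standard LDA is given below, with $K$ topics, Dirichlet hyperparameters $\alpha$ and $\beta$, and documents $d$ with word positions $n$. The variant used in the system may differ from this textbook form.

```latex
\begin{aligned}
\phi_k &\sim \mathrm{Dirichlet}(\beta) && \text{for each topic } k = 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{for each document } d \\
z_{d,n} &\sim \mathrm{Multinomial}(\theta_d) && \text{for each word position } n \text{ in } d \\
w_{d,n} &\sim \mathrm{Multinomial}(\phi_{z_{d,n}})
\end{aligned}
```

Under the interpretation on this slide, each topic $k$ plays the role of one underlying entity, and a document's dominant topic in $\theta_d$ indicates which entity its mention of the ambiguous name refers to.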

Three Clustering Systems Combination

- Three-class SVM classifier to decide which system to trust
  - Features: scores given by the three systems

Combine with the system of MSRA team at KB linking step

- Binary SVM classifier to decide which system to trust
  - Features: scores given by the two systems

Experiment for Three Clustering Algorithms

Algorithms    Eval 09   Eval 10   Eval 10+
SGP           0.745     0.954     0.809
HAC           0.666     0.950     0.789
LDA           0.782     0.981     0.841
Combination   0.795     0.982     0.852

Submissions

Systems   Acc.    Precision   Recall   F1
Full      0.863   0.815       0.849    0.831
Partial   0.844   0.797       0.829    0.813
Highest   -       -           -        0.846
Median    -       -           -        0.716

Conclusion

- Incorporate the new technologies proposed in our recent papers (IJCAI 2011, IJCNLP 2011)
  - Acronym Expansion
  - Semantic Features
  - Instance Selection
- Investigate three algorithms for NIL query clustering
  - Spectral Graph Partitioning (SGP)
  - Hierarchical Agglomerative Clustering (HAC)
  - Latent Dirichlet Allocation (LDA)

