Proceedings of the 22nd Conference on Computational Natural Language Learning (CoNLL 2018), pages 476–485, Brussels, Belgium, October 31 – November 1, 2018. © 2018 Association for Computational Linguistics


Aggregated Semantic Matching for Short Text Entity Linking

Feng Nie^1∗, Shuyan Zhou^2∗, Jing Liu^3∗, Jinpeng Wang^4, Chin-Yew Lin^4, Rong Pan^1∗

^1 Sun Yat-Sen University  ^2 Harbin Institute of Technology  ^3 Baidu Inc.  ^4 Microsoft Research Asia
{fengniesysu, alexisxy0418}@gmail.com,

[email protected], {jinpwa, cyl}@microsoft.com, [email protected]

Abstract

The task of entity linking aims to identify concepts mentioned in a text fragment and link them to a reference knowledge base. Entity linking in long text has been well studied in previous work. However, short text entity linking is more challenging since the texts are noisy and less coherent. To better utilize the local information provided in short texts, we propose a novel neural network framework, Aggregated Semantic Matching (ASM), in which two different aspects of semantic information between the local context and the candidate entity are captured via representation-based and interaction-based neural semantic matching models, and the two matching signals then work jointly for disambiguation through a rank aggregation mechanism. Our evaluation shows that the proposed model outperforms state-of-the-art systems on public tweet datasets.

1 Introduction

The task of entity linking aims to link a mention that appears in a piece of text to an entry (i.e., entity) in a knowledge base. For example, as shown in Table 1, given the mention Trump in a tweet, it should be linked to the entity Donald Trump^1 in Wikipedia. Recent research has shown that entity linking can help better understand the text of a document (Schuhmacher and Ponzetto, 2014) and benefits several tasks, including named entity recognition (Luo et al., 2015) and information retrieval (Xiong et al., 2017b). Research on entity linking mainly considers two types of documents: long text (e.g., news articles and web documents) and short text (e.g., tweets). In this paper, we focus on short text, particularly tweet entity linking.

∗ The corresponding author is Rong Pan. This work was done while the first and second authors were interns and the third author was an employee at Microsoft Research Asia.

^1 https://en.wikipedia.org/wiki/Donald_Trump

Tweet       The vile #Trump humanity raises its gentle face in Canada ... chapeau to #Trudeau
Candidates  Donald Trump, Trump (card games), ...

Table 1: An illustration of short text entity linking, with the mention Trump underlined.

One of the major challenges in the entity linking task is ambiguity, where an entity mention can denote multiple entities in a knowledge base. As shown in Table 1, the mention Trump can refer to the U.S. president Donald Trump and also to the card name Trump (card games). Many recent approaches for long text entity linking take advantage of global context, which captures the coherence among the mapped entities for a set of related mentions in a single document (Cucerzan, 2007; Han et al., 2011; Globerson et al., 2016; Heinzerling et al., 2017). However, short texts like tweets are often concise and less coherent, and lack the information necessary for the global methods. In the NEEL dataset (Weller et al., 2016), there are only 3.4 mentions per tweet on average. Several studies (Liu et al., 2013; Huang et al., 2014) investigate collective tweet entity linking by pre-collecting and considering multiple tweets simultaneously. However, multiple texts are not always available for collection and the process is time-consuming. Thus, we argue that an efficient entity disambiguation approach that requires only a single short text (e.g., a tweet) and can make good use of local contexts is better suited to real-world applications.

In this paper, we investigate entity disambiguation in a setting where only local information is available. Recent neural approaches have shown their superiority in capturing rich semantic similarities from mention contexts and entity contents. Sun et al. (2015) and Francis-Landau et al. (2016) proposed using convolutional neural networks (CNN) with a Siamese (symmetric) architecture to capture the similarity between texts. These approaches can be viewed as representation-focused semantic matching models. The representation-focused model first builds a representation for a single text (e.g., a context or an entity description) with a neural network, and then conducts matching between the abstract representations of two pieces of text. Even though such models capture distinguishable information from both the mention and entity side, some concrete matching signals are lost (e.g., exact match), since the matching between two texts happens after their individual abstract representations have been obtained. To enhance the representation-focused models, inspired by recent advances in information retrieval (Lu and Li, 2013; Guo et al., 2016; Xiong et al., 2017a), we propose using an interaction-focused approach to capture the concrete matching signals. The interaction-focused method builds local interactions (e.g., cosine similarity) between two pieces of text, and then uses neural networks to learn the final matching score based on the local interactions.

The representation- and interaction-focused approaches capture abstract- and concrete-level matching signals respectively, and they can complement each other if designed appropriately. One straightforward way to combine multiple semantic matching signals is to apply a linear regression layer to learn a static weight for each matching signal (Francis-Landau et al., 2016). However, we observe that the importance of different signals can vary case by case. For example, as shown in Table 1, the context word Canada is the most important word for the disambiguation of Trudeau; in this case, the concrete-level matching signal is required. By contrast, for the tweet "#StarWars #theForceAwakens #StarWarsForceAwakens @StarWars", @StarWars is linked to the entity Star Wars^2. In this case the whole tweet describes the same topic, "Star Wars", so the abstract-level semantic matching signal is helpful. To address this issue, we propose using a rank aggregation method to dynamically combine multiple semantic matching signals for disambiguation.

In summary, we focus on entity disambiguation by leveraging only the local information. Specifically, we propose using both a representation-focused model and an interaction-focused model for semantic matching and view them as complementary to each other. To overcome the issue of the static weights in linear regression, we apply rank aggregation to combine multiple semantic matching signals captured by the two neural models on multiple text pairs. We conduct extensive experiments to examine the effectiveness of our proposed approach, ASM, on both the NEEL dataset and the MSR tweet entity linking (MSR-TEL for short) dataset.

^2 https://en.wikipedia.org/wiki/Star_Wars

2 Background

2.1 Notations

A tweet t contains a set of identified queries Q = (q1, ..., qn). Each query q in a tweet t consists of m and ctx, where m denotes an entity mention and ctx denotes the context of the mention, i.e., a piece of text surrounding m in the tweet t. An entity is an unambiguous page (e.g., Donald Trump) in a referent Knowledge Base (KB). Each entity e consists of ttl and desc, where ttl denotes the title of e and desc denotes the description of e (e.g., the article defining e).

2.2 An Overview of the Linking System

Typically, an entity linking system consists of three components: mention detection, candidate generation and entity disambiguation. In this section, we briefly present the existing solutions for the first two components. In the next section, we will introduce our proposed aggregated semantic matching for entity disambiguation.

2.2.1 Mention Detection

Given a tweet t with a sequence of words w1, ..., wn, our goal is to identify the possible entity mentions in the tweet t. Specifically, every word wi in tweet t requires a label indicating whether it is an entity mention word or not. Therefore, we view this as a traditional named entity recognition (NER) problem and use the BIO tagging schema. Given the tweet t, we aim to assign labels y = (y1, ..., yn) to the words in the tweet t.

$$y_i = \begin{cases} B & w_i \text{ is a begin word of a mention,} \\ I & w_i \text{ is a non-begin word of a mention,} \\ O & w_i \text{ is not a mention word.} \end{cases}$$
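As a concrete illustration (not from the paper), the BIO labels for the tweet in Table 1, together with a small routine that recovers mention spans from such labels, might look as follows; the tokenization is assumed:

```python
# Illustrative BIO labels for the tweet in Table 1 (tokenization assumed).
tokens = ["The", "vile", "#Trump", "humanity", "raises", "its", "gentle",
          "face", "in", "Canada", "...", "chapeau", "to", "#Trudeau"]
labels = ["O", "O", "B", "O", "O", "O", "O",
          "O", "O", "O", "O", "O", "O", "B"]

# Recover mentions as maximal B/I spans.
mentions, span = [], []
for tok, lab in zip(tokens, labels):
    if lab == "B":
        if span:
            mentions.append(" ".join(span))
        span = [tok]
    elif lab == "I" and span:
        span.append(tok)
    else:
        if span:
            mentions.append(" ".join(span))
        span = []
if span:
    mentions.append(" ".join(span))

print(mentions)  # ['#Trump', '#Trudeau']
```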

In our implementation, we apply an LSTM-CRF based NER tagging model, which automatically learns contextual features for sequence tagging via recurrent neural networks (Lample et al., 2016).

[Figure 1: An overview of aggregated semantic matching for entity disambiguation. Tweet data flows through mention detection and candidate generation; the query and the candidate entities from the knowledge base are then compared by two semantic matching models (a convolution neural network with max-pooling and a neural relevance model with kernel-pooling), and rank aggregation combines their signals to produce the linking results.]

2.2.2 Candidate Generation

Given a mention m, we use several heuristic rules to generate candidate entities, similar to (Bunescu and Pasca, 2006; Huang et al., 2014; Sun et al., 2015). Specifically, given a mention m, we retrieve an entity from the KB as a candidate if it matches one of the following conditions: (a) the entity title exactly matches the mention, (b) the anchor text of the entity exactly matches the mention, or (c) the title of the entity's redirected page exactly matches the mention. Additionally, we add a special candidate NIL for each mention, which refers to a new entity outside the KB. Given a mention, multiple candidates can be retrieved; hence, we need to do entity disambiguation.
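A minimal sketch of these heuristics, assuming pre-built lookup tables extracted from the Wikipedia dump; the index names and the NIL sentinel string are illustrative rather than taken from the paper:

```python
def generate_candidates(mention, title_index, anchor_index, redirect_index):
    """Retrieve candidate entities for a mention with the three heuristic rules.

    title_index    : dict mapping entity titles           -> entity id
    anchor_index   : dict mapping anchor texts            -> set of entity ids
    redirect_index : dict mapping redirected-page titles  -> entity id
    All keys are assumed to be lower-cased during index construction.
    """
    key = mention.lower()
    candidates = set()

    # (a) the entity title exactly matches the mention
    if key in title_index:
        candidates.add(title_index[key])
    # (b) an anchor text of the entity exactly matches the mention
    candidates.update(anchor_index.get(key, set()))
    # (c) the title of the entity's redirected page exactly matches the mention
    if key in redirect_index:
        candidates.add(redirect_index[key])

    # A special NIL candidate stands for an entity outside the KB.
    candidates.add("NIL")
    return candidates
```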

3 Aggregated Semantic Matching Model

We investigate entity disambiguation using only the local information provided in short texts. Here, the local information includes a mention and its context in a tweet. Similar to (Francis-Landau et al., 2016), given a query q and an entity e, we consider semantic matching on four text pairs for disambiguation: (1) the similarity sim(m, ttl) between the mention and the entity title, (2) the similarity sim(m, desc) between the mention and the entity description, (3) the similarity sim(ctx, desc) between the context and the entity description, and (4) the similarity sim(ctx, ttl) between the context and the entity title. Fig. 1 illustrates an overview of our proposed Aggregated Semantic Matching for entity disambiguation. First, we use a representation-focused model and an interaction-focused neural model for semantic matching on the four text pairs. Then, we introduce a pairwise rank aggregation to combine the multiple semantic matching signals captured by the two neural models on the four text pairs.

3.1 Semantic Matching

Formally, given two texts T1 and T2, the semantic similarity of the two texts is measured as a score produced by a matching function based on the representation of each text:

$$\text{match}(T_1, T_2) = F(\Phi(T_1), \Phi(T_2)) \qquad (1)$$

where Φ is a function that learns the text representation, and F is the matching function based on the interaction between the representations.

Existing neural semantic matching models can be categorized into two types: (a) the representation-focused model, which uses a complex representation learning function and a relatively simple matching function, and (b) the interaction-focused model, which usually uses a simple representation learning function and a complex matching function. In the remainder of this section, we present the details of a representation-focused model (M-CNN) and an interaction-focused model (K-NRM). We also discuss the advantages of these two models in the entity linking task.

3.1.1 Convolution Neural Matching with Max Pooling (M-CNN)

Given two pieces of text T1 = {w^1_1, ..., w^1_n} and T2 = {w^2_1, ..., w^2_m}, M-CNN aims to learn compositional and abstract representations (Φ) for T1 and T2 using a convolution neural network with a max-pooling layer (Francis-Landau et al., 2016).

Figure 2a illustrates the architecture of the M-CNN model. Given a sequence of words w1, ..., wn, we embed each word into a d-dimensional vector, which yields a set of word vectors v1, ..., vn. We then map those word vectors into a fixed-size vector using a convolution network with a filter bank M ∈ R^{u×d}, where the window size is l and u is the number of filters. The convolution feature matrix H ∈ R^{u×(n−l+1)} is obtained by concatenating the convolution outputs h_j:

$$\vec{h}_j = \max\{0, M v_{j:(j+l)}\}, \qquad H = [\vec{h}_1, \ldots, \vec{h}_{n-l+1}] \qquad (2)$$

where v_{j:(j+l)} is a concatenation of the given word vectors and the max is element-wise. In this way, we extract word-level n-gram features of T1 and T2 respectively.

[Figure 2: The architecture of the models. (a) The M-CNN model applies a CNN with max-pooling to the mention context and the entity description and measures the semantic similarity between the resulting representations. (b) The K-NRM model builds an interaction matrix between the CNN n-gram features of the two texts and applies kernel-pooling to obtain soft-TF features that feed the final similarity score.]

To capture the distinguishable information of T1 and T2, a max-pooling layer is applied, which yields fixed-length vectors z1 and z2 for T1 and T2. The semantic similarity between T1 and T2 is then measured using the cosine similarity match(T1, T2) = cosine(z1, z2).
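A minimal NumPy sketch of this scoring path, assuming each filter operates on the concatenation of the l word vectors in a window (so the filter bank here has shape u x (l*d)); it is an illustration of the idea, not the authors' implementation:

```python
import numpy as np

def mcnn_encode(word_vecs, filters, l):
    """Encode a text (n x d word vectors) into a fixed-length vector z.

    filters : filter bank of shape (u, l * d); each row is one convolution
              filter applied to a window of l concatenated word vectors.
    """
    n, d = word_vecs.shape
    windows = [word_vecs[j:j + l].reshape(-1) for j in range(n - l + 1)]
    H = np.maximum(0.0, filters @ np.stack(windows, axis=1))  # Eq. (2): u x (n-l+1)
    return H.max(axis=1)                                       # max-pooling over positions

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# Toy usage with random embeddings and filters.
rng = np.random.default_rng(0)
d, l, u = 8, 2, 4
ctx  = rng.normal(size=(6, d))   # mention context, 6 tokens
desc = rng.normal(size=(9, d))   # entity description, 9 tokens
filters = rng.normal(size=(u, l * d))
score = cosine(mcnn_encode(ctx, filters, l), mcnn_encode(desc, filters, l))
```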

In summary, M-CNN extracts distinguishable information representing the overall semantics (i.e., representations) of a piece of text by using a convolution neural network with max-pooling. However, the concrete matching signals (e.g., exact match) are lost, as the matching happens after the individual representations have been built. We therefore introduce an interaction-focused model in the next section to better capture the concrete matching.

3.1.2 Neural Relevance Model with Kernel Pooling (K-NRM)

As shown in Fig. 2b, K-NRM captures the local interactions between T1 and T2, and then uses a kernel-pooling layer (Xiong et al., 2017a) to softly count the frequencies of the local patterns. The final matching score is computed based on these patterns. Therefore, the concrete matching information is captured.

Different from M-CNN, K-NRM builds the local interactions between T1 and T2 based on the word-level n-gram feature matrices calculated in Eq. 2. Formally, we construct a translation matrix M, where each element of M is the cosine similarity between an n-gram feature vector h^q_i in T1 and an n-gram feature vector h^e_j in T2, calculated as M_ij = cosine(h^q_i, h^e_j). Then, a scoring feature vector φ(M) is generated by a kernel-pooling technique.

$$\phi(M) = \sum_{i=1}^{n-l+1} \sqrt{\vec{K}(M_i)}, \qquad \vec{K}(M_i) = \{K_1(M_i), \ldots, K_K(M_i)\} \qquad (3)$$

where K(M_i) applies K kernels to the i-th row of the translation matrix and generates a K-dimensional scoring feature vector for the i-th n-gram feature in the query. The sqrt-sum of the scoring feature vectors of all n-gram features in the query forms the scoring feature vector φ for the whole query, where the sqrt reduces the range of the values in each kernel vector. Note that the effect of K depends on the kernel used. We use the RBF kernel in this paper.

$$K_k(M_i) = \sum_j \exp\!\left(-\frac{(M_{ij} - \mu_k)^2}{2\sigma^2}\right) \qquad (4)$$

The RBF kernel K_k calculates how the pairwise similarities between n-gram feature vectors are distributed around its mean µ_k: the more similarities are close to the mean µ_k, the higher the output value is. The kernel functions act as 'soft-TF' bins, where µ defines the similarity level that the 'soft-TF' focuses on and σ defines the range of its 'soft-TF' count. The semantic similarity is then computed with a linear layer, match(T1, T2) = w^T φ(M) + b, where φ(M) is the scoring feature vector.
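A minimal NumPy sketch of this kernel-pooling score (Eqs. 3-4), assuming the n-gram feature matrices from Eq. 2 are given; the kernel means, the kernel width and the linear-layer weights are placeholders, not learned values from the paper:

```python
import numpy as np

def knrm_score(H_q, H_e, mus, sigma, w, b):
    """Score two texts from their n-gram feature matrices.

    H_q, H_e : n-gram feature matrices of shape (u, n_q) and (u, n_e),
               e.g. the convolution outputs H from Eq. (2).
    mus      : kernel means mu_k; sigma : shared kernel width.
    w, b     : parameters of the final linear layer.
    """
    # Translation matrix M: cosine similarity between every n-gram pair.
    Hq = H_q / (np.linalg.norm(H_q, axis=0, keepdims=True) + 1e-8)
    He = H_e / (np.linalg.norm(H_e, axis=0, keepdims=True) + 1e-8)
    M = Hq.T @ He                                   # (n_q, n_e)

    # RBF kernels act as soft-TF bins (Eq. 4): one value per kernel per query n-gram.
    K = np.exp(-(M[:, :, None] - mus[None, None, :]) ** 2 / (2 * sigma ** 2))
    K = K.sum(axis=1)                               # (n_q, num_kernels)

    # Sqrt-sum over query n-grams gives the scoring feature vector phi(M) (Eq. 3).
    phi = np.sqrt(K).sum(axis=0)                    # (num_kernels,)
    return float(w @ phi + b)

# Toy usage with small random feature matrices and 10 evenly spaced kernels.
rng = np.random.default_rng(1)
H_q, H_e = rng.normal(size=(4, 5)), rng.normal(size=(4, 8))
mus = np.linspace(-0.9, 0.9, 10)
score = knrm_score(H_q, H_e, mus, sigma=0.1, w=rng.normal(size=10), b=0.0)
```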

In summary, K-NRM captures the concrete matching signals based on the word-level n-gram feature interactions between T1 and T2. In contrast, M-CNN captures the compositional and abstract meaning of a whole text. Thus, we produce semantic matching signals using both models to capture the different aspects of semantics that are useful for entity linking.

3.2 Normalization Scoring Layer

We compute 4 types of semantic similarities between the query q and the candidate entity e (i.e., sim(m, ttl), sim(m, desc), sim(ctx, ttl) and sim(ctx, desc)) with each of the two semantic matching models above. We thus obtain 8 semantic matching signals in total, denoted as f1(q, e), ..., f8(q, e). The normalized ranking score for each semantic matching signal fi(q, e) is calculated as

$$s_i(q, e, f) = \frac{\exp(f_i(q, e))}{\sum_{e'} \exp(f_i(q, e'))} \qquad (5)$$

where e' stands for any of the candidate entities for the given mention m. We then produce 8 semantic matching scores for each candidate entity of m, denoted as S_{q,e} = {s1, ..., s8}.

3.3 Rank Aggregation

Given a query q, we obtain multiple semantic matching signals for each entity candidate after the last step. To take advantage of the different semantic matching models on different text pairs, a straightforward approach is to use a linear regression layer to combine the multiple semantic matching signals (Francis-Landau et al., 2016). The linear combination learns a static weight for each matching signal. However, as we pointed out previously, the importance of different signals varies for different queries: in some cases the abstract-level signals are important, while in other cases the concrete-level signals matter more. To address this issue, we introduce a pairwise rank aggregation method to aggregate the multiple semantic matching signals.

In the area of information retrieval, rank aggregation combines the rankings from multiple retrieval systems and produces a better new ranking (Carterette and Petkova, 2006). In our problem, given a query q, we have one ranking of the entity candidates for each semantic matching signal, and we aim to find the final ranking by aggregating the multiple rankings. Specifically, given a ranking of entities for one semantic matching signal, e1 ≻ e2 ≻ e3 ..., where i ≻ j means entity i is ranked above entity j, we extract all entity pairs (ei, ej) from the ranking and assume that if ei ≻ ej, then ei is preferred to ej. We union all pairwise preferences generated from the multiple rankings into a single set, from which the final ranking is learned. In this paper, we apply TrueSkill (Herbrich et al., 2006), which is a Bayesian skill rating model. We use a two-layer version of TrueSkill with no draws.
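A small sketch of this preference-extraction step, assuming each of the eight signals has already produced a ranked candidate list; the function and variable names are illustrative:

```python
from itertools import combinations

def pairwise_preferences(rankings):
    """Turn per-signal candidate rankings into one set of pairwise preferences.

    rankings : list of lists, each listing candidate entities from best to worst
               for one semantic matching signal (eight rankings in our setting).
    Returns (winner, loser) pairs; each pair is treated as one 'game' that can
    be fed to a rating model such as TrueSkill.
    """
    games = []
    for ranking in rankings:
        for i, j in combinations(range(len(ranking)), 2):
            games.append((ranking[i], ranking[j]))  # entity at rank i beats rank j
    return games

# Toy usage: two signals that disagree on the top candidate.
games = pairwise_preferences([
    ["Donald Trump", "Trump (card games)", "NIL"],
    ["Trump (card games)", "Donald Trump", "NIL"],
])
```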

TrueSkill assumes that the practical performance of each player in a game follows a normal distribution N(µ, σ²), where µ is the skill level of the player and σ is the uncertainty of the estimated skill level. Basically, TrueSkill learns the skill levels of players by leveraging Bayes' theorem. Given the current estimated skill levels of two players (prior probability) and the outcome of a new game between them (likelihood), the TrueSkill model updates its estimation of the players' skill levels (posterior probability). TrueSkill updates the skill level µ and the uncertainty σ intuitively: (a) if the outcome of a new competition is expected, i.e., the player with the higher skill level wins the game, it causes small updates to the skill level µ and the uncertainty σ; (b) if the outcome of a new competition is unexpected, i.e., the player with the lower skill level wins the game, it causes large updates to the skill level µ and the uncertainty σ. Following these intuitions, the equations to update the skill level µ and the uncertainty σ are as follows:

$$\begin{aligned}
\mu_{\text{winner}} &= \mu_{\text{winner}} + \frac{\sigma^2_{\text{winner}}}{c} \cdot v\!\left(\frac{t}{c}, \frac{\varepsilon}{c}\right) \\
\mu_{\text{loser}} &= \mu_{\text{loser}} - \frac{\sigma^2_{\text{loser}}}{c} \cdot v\!\left(\frac{t}{c}, \frac{\varepsilon}{c}\right) \\
\sigma^2_{\text{winner}} &= \sigma^2_{\text{winner}} \cdot \left[1 - \frac{\sigma^2_{\text{winner}}}{c^2} \cdot w\!\left(\frac{t}{c}, \frac{\varepsilon}{c}\right)\right] \\
\sigma^2_{\text{loser}} &= \sigma^2_{\text{loser}} \cdot \left[1 - \frac{\sigma^2_{\text{loser}}}{c^2} \cdot w\!\left(\frac{t}{c}, \frac{\varepsilon}{c}\right)\right]
\end{aligned} \qquad (6)$$

where t = µ_winner − µ_loser and c² = 2β² + σ²_winner + σ²_loser. Here, ε is a parameter representing the probability of a draw in one game, and v(t, ε) and w(t, ε) are weighting factors for the skill level µ and the standard deviation σ respectively. β is a parameter representing the range of skills. In this paper, we set the initial values of the skill level µ and the standard deviation σ of each player to the default values used in (Herbrich et al., 2006). We use µ − 3σ to rank entities, following (Herbrich et al., 2006).
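A minimal sketch of one such update for a single no-draw game, following Eq. 6; the weighting functions v and w are written in their standard no-draw form, and the default parameter values are assumptions taken from Herbrich et al. (2006) rather than details reported in this paper:

```python
import math
from scipy.stats import norm

def trueskill_update(winner, loser, beta=25 / 6, eps=0.0):
    """One TrueSkill update (Eq. 6) for a no-draw game between two candidates.

    winner, loser : (mu, sigma) tuples; returns the updated tuples.
    """
    (mu_w, s_w), (mu_l, s_l) = winner, loser
    c = math.sqrt(2 * beta ** 2 + s_w ** 2 + s_l ** 2)
    t = (mu_w - mu_l) / c                    # Eq. 6 uses t = mu_winner - mu_loser, then t/c
    e = eps / c
    v = norm.pdf(t - e) / norm.cdf(t - e)    # standard no-draw weighting factors
    w = v * (v + t - e)

    mu_w_new = mu_w + (s_w ** 2 / c) * v
    mu_l_new = mu_l - (s_l ** 2 / c) * v
    s_w_new = math.sqrt(s_w ** 2 * (1 - (s_w ** 2 / c ** 2) * w))
    s_l_new = math.sqrt(s_l ** 2 * (1 - (s_l ** 2 / c ** 2) * w))
    return (mu_w_new, s_w_new), (mu_l_new, s_l_new)

# All candidates start at the defaults mu=25, sigma=25/3; after replaying every
# pairwise preference as a game, candidates are ranked by mu - 3 * sigma.
ratings = {"Donald Trump": (25.0, 25 / 3), "Trump (card games)": (25.0, 25 / 3)}
ratings["Donald Trump"], ratings["Trump (card games)"] = trueskill_update(
    ratings["Donald Trump"], ratings["Trump (card games)"])
```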

4 Experiments

In this section, we describe our experimental results on tweet entity linking. In particular, we investigate the difference between the two semantic matching models and the effectiveness of jointly combining the two semantic matching signals.

4.1 Datasets & Evaluation Metric

In our experiments, we evaluate our proposed model ASM on the following two datasets.

NEEL (Weller et al., 2016). We use the dataset of the Named Entity Extraction & Linking Challenge 2016. The training dataset consists of 6,025 tweets and includes 6,374 non-NIL queries and 2,291 NIL queries. The validation dataset consists of 100 tweets and includes 253 non-NIL queries and 85 NIL queries. The testing dataset consists of 300 tweets and includes 738 non-NIL queries and 284 NIL queries.

MSR-TEL (Guo et al., 2013).^3 This dataset consists of 428 tweets and 770 non-NIL queries. Since the NEEL test dataset has a distribution bias problem, we add MSR-TEL as another dataset for the evaluation: in the NEEL testing dataset, 384 out of 1,022 queries refer to only three entities, 'Donald Trump', 'Star Wars' and 'Star Wars (The Force Awakens)'.

In this paper, we use accuracy as the major evaluation metric for entity disambiguation. Formally, we denote by N the number of queries and by M the number of correctly linked mentions given the gold mentions (i.e., the top-ranked entity is the golden entity); then accuracy = M / N. Besides, we use precision, recall and the F1 measure to evaluate the end-to-end system. Formally, we denote by N' the number of mentions identified by a system and by M' the number of correctly linked mentions among them. Thus, precision = M' / N', recall = M' / N and F1 = 2 · precision · recall / (precision + recall).
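A small worked illustration of these definitions (the counts below are made up):

```python
def disambiguation_accuracy(M, N):
    """Accuracy with gold mentions: M correctly linked queries out of N."""
    return M / N

def end_to_end_scores(M_prime, N_prime, N):
    """End-to-end precision, recall and F1.

    N_prime : number of mentions identified by the system (N')
    M_prime : number of correctly linked mentions among them (M')
    N       : number of gold queries
    """
    precision = M_prime / N_prime
    recall = M_prime / N
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 700 predicted mentions, 500 of them linked correctly, 770 gold queries
print(end_to_end_scores(500, 700, 770))
```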

4.2 Data Preprocessing

Tweet data. All tweets are normalized in the following way. First, we use the Twitter-aware tokenizer in NLTK^4 to tokenize the words in a tweet. We convert each hyperlink in a tweet to a special token URL. Since hashtags usually do not contain any space between words, we use a web service^5 to break hashtags into tokens (e.g., the service breaks '#TheForceAwakens' into 'the force awakens'), following (Guo et al., 2013). As for usernames (@) in tweets, we replace them with their screen names (e.g., the screen name of the user '@jimmyfallon' is 'jimmy fallon').

Wikipedia data. We use the Wikipedia dump of December 2015 as the reference knowledge base. Since the most important information about an entity is usually at the beginning of its Wikipedia article, we use only the first 200 words of the article as its entity description. We use the default English word tokenizer in NLTK to tokenize each Wikipedia article.

Word embedding. We use the word2vec toolkit (Mikolov et al., 2013) to pre-train word embeddings on the whole English Wikipedia dump. The dimensionality of the word embeddings is set to 400. Note that we do not update the word embeddings during training.

^3 Guo et al. (2013) only used a subset of this dataset for evaluation. Instead, we test on the full dataset.
^4 Natural Language Toolkit. http://www.nltk.org
^5 http://web-ngram.research.microsoft.com/info/break.html
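A sketch of the tweet normalization described above, assuming a pluggable hashtag segmenter and a username-to-screen-name table as stand-ins for the web service and the Twitter lookup used in the paper:

```python
from nltk.tokenize import TweetTokenizer

def normalize_tweet(text, break_hashtag, screen_names):
    """Normalize a tweet: tokenize, replace URLs, break hashtags, expand usernames.

    break_hashtag : callable splitting a hashtag body into words (stand-in for
                    the hashtag-breaking web service).
    screen_names  : dict mapping usernames (without '@') to screen names.
    """
    tokens = []
    for tok in TweetTokenizer().tokenize(text):
        if tok.startswith("http://") or tok.startswith("https://"):
            tokens.append("URL")                   # hyperlinks -> special token URL
        elif tok.startswith("#") and len(tok) > 1:
            tokens.extend(break_hashtag(tok[1:]))  # '#TheForceAwakens' -> 'the force awakens'
        elif tok.startswith("@") and len(tok) > 1:
            name = screen_names.get(tok[1:], tok[1:])
            tokens.extend(name.split())            # '@jimmyfallon' -> 'jimmy fallon'
        else:
            tokens.append(tok)
    return tokens

# Toy usage with a naive one-token hashtag splitter and a tiny screen-name table.
toks = normalize_tweet(
    "RT @jimmyfallon my plan to avoid spoilers about #theForceAwakens https://t.co/x",
    break_hashtag=lambda body: [body.lower()],
    screen_names={"jimmyfallon": "jimmy fallon"},
)
```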

4.3 Experimental Setup

In our main experiment, we compare our proposed approaches with the following baselines: (a) the officially ranked 1st and 2nd systems in the NEEL 2016 challenge, denoted as Rank1 and Rank2; (b) TagMe (Ferragina and Scaiella, 2010), an end-to-end linking system which jointly performs mention detection and entity disambiguation and focuses on short texts, including tweets; (c) Cucerzan (Cucerzan, 2007), a supervised entity disambiguation system that won the TAC KBP competition in 2010; (d) M-CNN (Francis-Landau et al., 2016), to the best of our knowledge the state-of-the-art neural disambiguation model; (e) Ensemble, the rank-aggregated combination of two M-CNN models with different random seeds.

To compare fairly with the Cucerzan and M-CNN baselines, we use the same mention detection and candidate generation for them and for our approaches. We train an LSTM-CRF based tagger (Lample et al., 2016) for mention detection using the NEEL training dataset. The precision, recall, and F1 of mention detection on the NEEL testing dataset are 96.1%, 89.2% and 92.6% respectively. The precision, recall, and F1 of mention detection on the MSR-TEL dataset are 80.3%, 83.8% and 82% respectively. As described in the previous section, we use the heuristic rules for candidate generation. The recall of candidate generation on NEEL and MSR-TEL is 88.7% and 92.5% respectively.

When training our models, we use the stochastic gradient descent algorithm with the AdaDelta optimizer (Zeiler, 2012). The gradients are computed via back-propagation. The dimensionality of the hidden units in the convolution neural network is set to 300. All the parameters are initialized with a uniform distribution U(−0.01, 0.01). Since there are NIL entities in the dataset, we tune a NIL threshold for the prediction of NIL entities on the validation dataset.

4.4 Main Results

The end-to-end performance of the various approaches on the two datasets is shown in Table 2.


Methods        NEEL                       MSR-TEL^6
               Precision  Recall  F1      Precision  Recall  F1
Rank 1         -          -       50.1    -          -       -
Rank 2         -          -       39.6    -          -       -
TagMe          25.3       62.9    36.2    14.5       69.2    23.8
Cucerzan       65.4       57.9    61.4    62.6       63.3    62.9
M-CNN          69.5       64.9    67.1    61.6       62.3    62.1
  +pre-train   69.7       65.1    67.3    64.5       65.2    64.8
Ensemble       69.7       65.1    67.3    63.5       64.2    63.8
  +pre-train   70.2       65.5    67.8    64.9       65.6    65.2
ASM            70.6       65.9    68.2    64.2       64.9    64.5
  +pre-train   72.2       67.4    69.7    66.2       66.9    66.5

Table 2: End-to-end performance of the systems on the two datasets.

Methods        NEEL   MSR-TEL
Cucerzan       65.4   75.5
M-CNN          72.8   74.7
  +pre-train   72.9   77.6
Ensemble       72.9   76.4
  +pre-train   73.5   78.1
ASM            73.9   77.4
  +pre-train   75.5   79.4

Table 3: The accuracy of entity disambiguation with golden mentions on the two datasets.

Since there are no publicly available codes for Rank1 and Rank2, we give only the F1 scores of these two systems on the NEEL dataset, according to Weller et al. (2016). Note that the baseline systems Rank1, Rank2 and TagMe use different mention detection.

The Rank1, Rank2, TagMe and Cucerzan systems are feature-engineering based approaches, while M-CNN and ASM are neural approaches. From Table 2, we can observe that the neural approaches are superior to the feature-engineering based approaches. Table 2 also shows that ASM outperforms the neural method M-CNN. Our proposed method ASM also shows improvements over Ensemble, which indicates the necessity of combining representation- and interaction-focused models for entity disambiguation.

Moreover, we pre-train M-CNN, Ensemble and ASM using 0.5 million anchors in Wikipedia, and fine-tune the model parameters using the non-NIL queries in the NEEL training dataset. From Table 2, we can observe that the performance of the neural models improves with pre-training. The results in Table 2 also show that our proposed ASM is still superior to M-CNN and Ensemble in the pre-training setting.

Method   (m, ttl)   (ctx, desc)   All Pairs
M-CNN    64.8       66.7          72.8
K-NRM    64.1       66.8          72.7
ASM      65.1       69.7          73.9

Table 4: The performance of the two semantic matching models and their combination on the NEEL dataset.

Since entity disambiguation is our focus, we also give the disambiguation accuracy of the different approaches using the golden mentions in Table 3. Similarly, we observe that our proposed ASM outperforms the baseline systems.

4.5 Model Analysis

In this section, we discuss several key observations based on the experimental results, and we mainly report the entity disambiguation accuracy given the golden mentions.

4.5.1 Effect of Different Semantic Matching Methods

We empirically analyze the difference between the two semantic matching models (M-CNN and K-NRM) and show the benefits of combining the semantic matching signals from these two models.

^6 Note that the performance of all systems on the MSR-TEL dataset might be underestimated, since not all mentions in each tweet were manually annotated. For example, a correctly identified mention given by a system that was not manually annotated will be judged as wrong. We nevertheless report the comparisons of the different approaches on the MSR-TEL dataset.


             M-CNN win   M-CNN loss
K-NRM win    58.3%       6.3%
K-NRM loss   5.8%        29.6%

Table 5: The win-loss analysis of M-CNN and K-NRM on the pair (ctx, desc).

Query:   the vile #Trump humanity raises its gentle face in Canada ... chapeau to #Trudeau, URL
M-CNN:   Kevin Trudeau
K-NRM:   Justin Trudeau

Query:   RT @ MingNa : What is my plan to avoid spoiler about #theForceAwakens ? No Internet except to post my @StarWars
M-CNN:   Star Wars
K-NRM:   Comparison of Star Trek and Star Wars

Table 6: The top-1 results of M-CNN and K-NRM using the (ctx, desc) pair for two queries. The mention is in bold and the golden answer is underlined.

We first compare the performance of the two semantic matching models over two text pairs: (a) (m, ttl) and (b) (ctx, desc). These two pairs represent the two extremes of the information used in the systems: (m, ttl) consumes the minimum amount of information from a query and an entity, while (ctx, desc) consumes the maximum amount. From the first two columns in Table 4, we can observe that M-CNN performs comparably with K-NRM on the two text pairs. ASM, which combines the two models, obtains performance gains on both individual text pairs. The third column in Table 4 also shows that ASM gives performance gains when using all text pairs. This indicates that M-CNN and K-NRM capture complementary information for entity disambiguation.

Moreover, we observe that the performance gains differ between the two pairs (m, ttl) and (ctx, desc). The gain on (ctx, desc) is relatively larger, which indicates that M-CNN and K-NRM capture more distinct information when the text is long. Additionally, we show the win-loss analysis of the two semantic matching models for non-NIL queries on (ctx, desc) in Table 5. The 12.1% (= 6.3% + 5.8%) difference between the two models confirms the necessity of the combination.

Method   Without Pre-Train       With Pre-Train
         NEEL      MSR-TEL       NEEL      MSR-TEL
Linear   73.1      75.7          73.8      78.1
ASM      73.9      77.4          75.5      79.4

Table 7: Comparison of rank aggregation and linear combination on the two datasets.

To further investigate the difference between the two semantic matching models on short text, we conducted a case study. Table 6 gives two examples. In the first example, the correct answer is 'Justin Trudeau', whose entity description contains the words 'Canada' and 'trump'. However, M-CNN fails to capture this concrete matching information, since the concrete information of the text may be lost after the convolution and max-pooling layers. In contrast, K-NRM builds the n-gram level local interactions between the texts, and thus successfully captures the concrete matching information (e.g., exact match) that leads to a correct linking result. In the second example, both candidate entities, 'Star Wars' and 'Comparison of Star Trek and Star Wars', contain the phrase 'Star Wars' multiple times in their entity descriptions. In this case, K-NRM fails to distinguish the correct entity 'Star Wars' from the wrong entity 'Comparison of Star Trek and Star Wars', because it relies too heavily on the soft-TF information for matching, and the soft-TF information in the descriptions of the two entities is similar. In contrast, M-CNN captures the whole meaning of the text and links the mention to the correct entity. A detailed analysis of the n-grams extracted by M-CNN is provided in the Appendix.

4.6 Effect of Rank Aggregation

Table 4 shows that the combination of multiple semantic matching signals yields the best performance. Table 7 compares two different ways of combining the M-CNN and K-NRM models; the results show that the rank aggregation method outperforms the linear combination. The rank aggregation method dynamically summarizes the win-loss results for each signal and generates the final overall ranking by considering all win-loss results. The improvement of our method over the linear combination confirms that the importance of the different semantic signals varies for different queries, and that our method is more suitable for combining multiple semantic signals.


5 Related Work

Existing entity linking methods roughly fall into two categories. Early work focuses on local approaches, which identify one mention at a time and disambiguate each mention separately using hand-crafted features (Bunescu and Pasca, 2006; Ji and Grishman, 2008; Milne and Witten, 2008; Zheng et al., 2010). Recent work on entity linking has largely focused on global methods, which take the mentions in a document as input and find their corresponding entities simultaneously by considering the coherency of the entity assignments within the document (Cucerzan, 2007; Hoffart et al., 2011; Globerson et al., 2016; Ganea and Hofmann, 2017).

Global models can tap into highly discriminative semantic signals (e.g., coreference and entity relatedness) that are unavailable to local methods, and have significantly outperformed local approaches on standard datasets (Globerson et al., 2016). However, global approaches are difficult to apply in domains where only short and noisy text is available (e.g., tweets). Many techniques have been proposed for short texts, including tweets. Liu et al. (2013) and Huang et al. (2014) investigate collective tweet entity linking by considering multiple tweets simultaneously. Meij et al. (2012) and Guo et al. (2013) perform joint detection and disambiguation of mentions for tweet entity linking using feature-based learning methods.

Recently, several neural network methods have been applied to entity linking to model the local contextual information. He et al. (2013) investigate stacked denoising auto-encoders to learn entity representations. Sun et al. (2015) and Francis-Landau et al. (2016) apply convolutional neural networks to entity linking. Eshel et al. (2017) use recurrent neural networks to model the mention contexts. Nie et al. (2018) use a co-attention mechanism to select informative contexts and entity descriptions for entity disambiguation. However, none of these methods combine representation- and interaction-focused semantic matching methods to capture the semantic similarity for entity linking, or use a rank aggregation method to combine multiple semantic signals.

6 Conclusion

We propose an aggregated semantic matching framework, ASM, for short text entity linking. The combination of the representation-focused semantic matching method and the interaction-focused semantic matching method captures both compositional and concrete matching signals (e.g., exact match). Moreover, pairwise rank aggregation is applied to better combine the multiple semantic signals. We have shown the effectiveness of ASM on two datasets through comprehensive experiments. In the future, we will try our model on long text entity linking.

7 Acknowledgement

We thank the anonymous reviewers for their helpful comments. We also thank Jin-Ge Yao, Zhirui Zhang, Shuangzhi Wu and Yin Lin for helpful conversations and comments on this work.

References

R. Bunescu and M. Pasca. 2006. Using encyclopedic knowledge for named entity disambiguation. In EACL, Trento, Italy.

Ben Carterette and Desislava Petkova. 2006. Learning a ranking from pairwise preferences. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 629–630. ACM.

S. Cucerzan. 2007. Large-scale named entity disambiguation based on Wikipedia data. In EMNLP-CoNLL, volume 2007.

Yotam Eshel, Noam Cohen, and Kira Radinsky. 2017. Named entity disambiguation for noisy text. In CoNLL, volume 2017.

Paolo Ferragina and Ugo Scaiella. 2010. TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of CIKM 2010, pages 1625–1628.

Matthew Francis-Landau, Greg Durrett, and Dan Klein. 2016. Capturing semantic similarity for entity linking with convolutional neural networks. In Proceedings of NAACL-HLT 2016, pages 1256–1261.

Octavian-Eugen Ganea and Thomas Hofmann. 2017. Deep joint entity disambiguation with local neural attention. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.

Amir Globerson, Nevena Lazic, Soumen Chakrabarti, Amarnag Subramanya, Michael Ringgaard, and Fernando Pereira. 2016. Collective entity resolution with multi-focal attention. In Proceedings of ACL 2016.

Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. 2016. A deep relevance matching model for ad-hoc retrieval. In Proceedings of CIKM 2016, pages 55–64.

Stephen Guo, Ming-Wei Chang, and Emre Kiciman. 2013. To link or not to link? A study on end-to-end tweet entity linking. In NAACL-HLT 2013.

Xianpei Han, Le Sun, and Jun Zhao. 2011. Collective entity linking in web text: a graph-based method. In Proceedings of SIGIR 2011, pages 765–774.

Zhengyan He, Shujie Liu, Mu Li, Ming Zhou, Longkai Zhang, and Houfeng Wang. 2013. Learning entity representation for entity disambiguation. In Proceedings of ACL 2013.

Benjamin Heinzerling, Michael Strube, and Chin-Yew Lin. 2017. Trust, but verify! Better entity linking through automatic verification. In EACL.

Ralf Herbrich, Tom Minka, and Thore Graepel. 2006. TrueSkill™: A Bayesian skill rating system. In Proceedings of NIPS 2006, pages 569–576.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Furstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In Proceedings of EMNLP 2011.

Hongzhao Huang, Yunbo Cao, Xiaojiang Huang, Heng Ji, and Chin-Yew Lin. 2014. Collective tweet wikification based on semi-supervised graph regularization. In Proceedings of ACL 2014.

Heng Ji and Ralf Grishman. 2008. Refining event extraction through cross-document inference. In Proceedings of ACL 2008.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of NAACL-HLT 2016, pages 260–270.

Xiaohua Liu, Yitong Li, Haocheng Wu, Ming Zhou, Furu Wei, and Yi Lu. 2013. Entity linking for tweets. In ACL (1), pages 1304–1311.

Zhengdong Lu and Hang Li. 2013. A deep architecture for matching short texts. In Proceedings of NIPS 2013, pages 1367–1375.

Gang Luo, Xiaojiang Huang, Chin-Yew Lin, and Zaiqing Nie. 2015. Joint entity recognition and disambiguation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015.

Edgar Meij, Wouter Weerkamp, and Maarten de Rijke. 2012. Adding semantics to microblog posts. In Proceedings of WSDM 2012, pages 563–572.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.

David N. Milne and Ian H. Witten. 2008. Learning to link with Wikipedia. In Proceedings of CIKM 2008.

Feng Nie, Yunbo Cao, Jinpeng Wang, Chin-Yew Lin, and Rong Pan. 2018. Mention and entity description co-attention for entity disambiguation. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018.

Michael Schuhmacher and Simone Paolo Ponzetto. 2014. Knowledge-based graph document modeling. In Proceedings of CIKM 2014, pages 543–552.

Yaming Sun, Lei Lin, Duyu Tang, Nan Yang, Zhenzhou Ji, and Xiaolong Wang. 2015. Modeling mention, context and entity with neural networks for entity disambiguation. In Proceedings of IJCAI 2015.

Katrin Weller, Aba-Sah Dadzie, and Danica Radovanovic. 2016. Making sense of microposts (#Microposts2016) social sciences track. In Proceedings of the 6th Workshop on 'Making Sense of Microposts' co-located with the 25th International World Wide Web Conference (WWW 2016), Montreal, Canada, April 11, 2016, pages 29–32.

Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. 2017a. End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of SIGIR 2017, pages 55–64.

Chenyan Xiong, Russell Power, and Jamie Callan. 2017b. Explicit semantic ranking for academic search via knowledge graph embedding. In Proceedings of the 26th International Conference on World Wide Web, WWW 2017, Perth, Australia, April 3-7, 2017, pages 1271–1279.

Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method.

Zhicheng Zheng, Fangtao Li, Minlie Huang, and Xiaoyan Zhu. 2010. Learning to link entities with knowledge base. In NAACL-HLT 2010.

