
Probabilistic Bag-Of-Hyperlinks Model for Entity Linking

Octavian-Eugen Ganea, Marina Ganea*, Aurelien Lucchi, Carsten Eickhoff, Thomas Hofmann
Dept. of Computer Science, ETH Zurich, Switzerland
* Currently at Google Inc.

arXiv:1509.02301v3 [cs.CL] 29 Jan 2016

ABSTRACT

Many fundamental problems in natural language processing rely on determining what entities appear in a given text. Commonly referred to as entity linking, this step is a fundamental component of many NLP tasks such as text understanding, automatic summarization, semantic search or machine translation. Name ambiguity, word polysemy, context dependencies and a heavy-tailed distribution of entities contribute to the complexity of this problem.

We here propose a probabilistic approach that makes use of an effective graphical model to perform collective entity disambiguation. Input mentions (i.e., linkable token spans) are disambiguated jointly across an entire document by combining a document-level prior of entity co-occurrences with local information captured from mentions and their surrounding context. The model is based on simple sufficient statistics extracted from data, thus relying on few parameters to be learned.

Our method does not require extensive feature engineering, nor an expensive training procedure. We use loopy belief propagation to perform approximate inference. The low complexity of our model makes this step sufficiently fast for real-time usage. We demonstrate the accuracy of our approach on a wide range of benchmark datasets, showing that it matches, and in many cases outperforms, existing state-of-the-art methods.

Keywords

Entity linking; Entity disambiguation; Wikification; Probabilistic graphical models; Approximate inference; Loopy belief propagation

1. INTRODUCTION

Digital systems are producing increasing amounts of data every day. With daily global volumes of several terabytes of new textual content, there is a growing need for automatic methods for text aggregation, summarization, and, eventually, semantic understanding. Entity linking is a key step towards these goals as it reveals the semantics of spans of text that refer to real-world entities. In practice, this is achieved by establishing a mapping between potentially ambiguous surface forms of entities and their canonical representations, such as corresponding Wikipedia^1 articles or Freebase^2 entries. Figure 1 illustrates the difficulty of this task when dealing with real-world data. The main challenges arise from word ambiguities inherent to natural language: surface form synonymy, i.e., different spans of text referring to the same entity, and homonymy, i.e., the same name being shared by multiple entities.

^1 http://en.wikipedia.org/
^2 https://www.freebase.com/

Figure 1: An entity disambiguation problem showcasing five given mentions and their potential entity candidates.

We here describe and evaluate a novel light-weight and fast alternative to heavy machine-learning approaches for document-level entity disambiguation with Wikipedia. Our model is primarily based on simple empirical statistics acquired from a training dataset and relies on a very small number of learned parameters. This has certain advantages, like a very fast training procedure that can be applied to massive amounts of data, as well as a better understanding of the model compared to increasingly popular deep learning architectures (e.g., He et al. [14]).


As a prerequisite, we assume that a given input set of mentions was already discovered via a mention detection procedure^3. Our starting point is the natural assumption that each entity depends (i) on its mention, (ii) on its neighboring local contextual words, and (iii) on other entities that appear in the same document.

^3 For example, using a named-entity recognition system. However, note that our approach is not restricted to named entities, but targets any Wikipedia entity.

In order to enforce these conditions, we rely on a conditional probabilistic model that consists of two parts: (1) the likelihood of a candidate entity given the referring token span and its surrounding context, and (2) the prior joint distribution of the candidate entities corresponding to all the mentions in a document. Our model relies on the max-product algorithm to collectively infer entities for all mentions in a given document.

We further illustrate these modeling decisions. In the example depicted in Figure 1, each highlighted mention constrains the possible entity candidates to a small set, yet leaves a significant level of ambiguity. However, there is one collective way of linking that is jointly consistent with all the chosen entities and supported by contextual cues. Intuitively, the related entities Thomas_Muller and Germany_national_football_team are likely to appear in the same document, especially in the presence of contextual words related to soccer, like "team" or "goal".

Our main contributions are outlined below: (1) We employ rigorous probabilistic semantics for the entity disambiguation problem by introducing a principled probabilistic graphical model that requires a simple and fast training procedure. (2) At the core of our joint probabilistic model, we derive a minimal set of potential functions that proficiently explain statistics of observed training data. (3) Throughout a range of experiments performed on several standard datasets using the Gerbil platform [37], we demonstrate competitive or state-of-the-art quality compared to some of the best existing approaches. (4) Moreover, our training procedure is solely based on publicly available Wikipedia hyperlink statistics, and the method does not require extensive hyperparameter tuning, nor feature engineering, making this paper a self-contained manual for implementing an entity disambiguation system from scratch.

The remainder of this paper is structured as follows: Section 2 briefly discusses relevant entity linking literature. Section 3 formally introduces our probabilistic graphical model and details the initialization and learning procedure of the model's parameters. Section 4 describes the inference process used for collective entity resolution. Section 5 empirically demonstrates the merits of the proposed method on multiple standard collections of manually annotated documents. Finally, in Section 6, we conclude with a summary of our findings and an overview of ongoing and future work.

2. RELATED WORK

There is a substantial body of existing work dedicated to the task of entity linking with Wikipedia (Wikification). We can identify four major paradigms of how this challenge is approached.

Local models consider the individual context of each entity mention in isolation in order to reduce the size of the decision space. In one of the early entity linking papers, Mihalcea and Csomai [21] propose an entity disambiguation scheme based on similarity statistics between the mention context and the entity's Wikipedia page. Milne and Witten [22] further refine their scheme with special focus on the mention detection step. Bunescu and Pasca [2] present a Wikipedia-driven approach, making use of manually created resources such as redirect and disambiguation pages. Dredze et al. [7] cast the entity linking task as a retrieval problem, treating mentions and their contexts as queries, and ranking candidate entities according to their likelihood of being referred to.

Global models attempt to jointly disambiguate all mentions in a document based on the assumption that the underlying entities are correlated and consistent with the main topic of the document. While this approach tends to result in superior accuracy, the space of possible entity assignments grows combinatorially. As a consequence, many approaches in this group rely on approximate inference mechanisms. Cucerzan [5] uses high-dimensional vector space representations of candidate entities and attempts to iteratively choose candidates that optimize the mutual proximity to existing candidates. Kulkarni et al. [19] exploit topical information about candidate entities and try to harmonize these topics across all assigned entities. Ratinov et al. [27] prune the list of entity mentions using support vector machines trained on a range of similarity and term overlap features between entity representations. Ferragina and Scaiella [10] focus on short documents such as tweets or search engine snippets. Based on evidence across all mentions, the authors employ a voting scheme for entity disambiguation. Cheng et al. [4] and Singh et al. [31] describe models for jointly capturing the interdependence between the tasks of entity tagging, relation extraction and co-reference resolution. Similarly, Durrett and Klein [8] describe a graphical model for collectively addressing the tasks of named entity recognition, entity disambiguation and co-reference resolution.

Graph-based models establish relationships between candidate entities and mentions using structural models. For inference, various approaches are employed, ranging from densest graph estimation algorithms (Hoffart et al. [15]) to graph traversal methods such as random graph walks (Guo and Barbosa [11], Han et al. [13]). In a similar fashion, these techniques can be combined to enhance the quality of both entity linking and word sense disambiguation in a synergistic solution (Moro et al. [23]).

The above approaches are limited because they assume a single topic per document. Naturally, topic modelling can be used for entity disambiguation by attempting to harmonize the individual distribution of latent topics across candidate entities. Houlsby and Ciaramita [16] and Pilz and Paaß [26] rely on Latent Dirichlet Allocation (LDA) and compare the resulting topic distribution of the input document to the topic distributions of the disambiguated entities' Wikipedia pages. Han and Sun [12] propose a joint model of mention context compatibility and topic coherence, allowing them to simultaneously draw from both local (terms, mentions) as well as global (topic distributions) information. Kataria et al. [18] use a semi-supervised hierarchical LDA model based on a wide range of features extracted from Wikipedia pages and topic hierarchies.

In contrast to previous work on this problem, our method exploits co-occurrence statistics in a fully probabilistic manner using a graph-based model that addresses collective entity disambiguation. It combines a clean and light-weight probabilistic model with an elegant, real-time inference algorithm. An advantage over increasingly popular deep learning architectures for entity linking (e.g., Sun et al. [34], He et al. [14]) is the speed of our training procedure, which relies on count statistics from data and learns only very few parameters. State-of-the-art accuracy is achieved without the need for special-purpose computational heuristics.

3. PROBABILISTIC MODEL

In this section, we formally define the entity linking task that we address in this work and describe our modeling approach in detail.

3.1 Problem Definition and Formulation

Let $\mathcal{E}$ be a knowledge base (KB) of entities, $\mathcal{V}$ a finite dictionary of phrases or names and $\mathcal{C}$ a context representation. Formally, we seek a mapping $F : (\mathcal{V}, \mathcal{C})^n \to \mathcal{E}^n$ that takes as input a sequence of linkable mentions $\mathbf{m} = (m_1, \ldots, m_n)$ along with their contexts $\mathbf{c} = (c_1, \ldots, c_n)$ and produces a joint entity assignment $\mathbf{e} = (e_1, \ldots, e_n)$. Here $n$ refers to the number of linkable spans in a document. Our problem is also known as entity disambiguation or link generation in the literature.^4

^4 Note that we do not address the issues of mention detection or nil identification in this work. Rather, our input is a document along with a fixed set of linkable mentions corresponding to existing KB entities.
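To make the interface concrete, the following is a minimal sketch, not taken from the paper, of the mapping F as a Python signature; the names Mention and link_document are hypothetical placeholders.

```python
# Illustrative sketch of F : (V, C)^n -> E^n, mapping mentions plus their
# contexts to one knowledge-base entity per mention.
from dataclasses import dataclass
from typing import List

@dataclass
class Mention:
    surface: str          # the linkable token span, an element of V
    context: List[str]    # bag of surrounding words, the representation C

def link_document(mentions: List[Mention]) -> List[str]:
    """Hypothetical signature of the mapping F: returns one KB entity id
    (e.g., a Wikipedia title) per input mention."""
    raise NotImplementedError  # instantiated by the probabilistic model below
```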

We can construct such a mapping $F$ in a probabilistic approach, by learning a conditional probability model $p(\mathbf{e} \mid \mathbf{m}, \mathbf{c})$ from data and then employing (approximate) probabilistic inference in order to find the maximum a posteriori (MAP) assignment, hence:

$$F(\mathbf{m}, \mathbf{c}) := \arg\max_{\mathbf{e} \in \mathcal{E}^n} p(\mathbf{e} \mid \mathbf{m}, \mathbf{c}). \qquad (1)$$

In the sequel, we describe how to estimate such a model from a corpus of entity-linked documents. Finally, we show in Section 4 how to apply belief propagation (max-product) for approximate inference in this model.

3.2 Maximum Entropy Models

Assume a corpus of entity-linked documents is available. Specifically, we used the set of Wikipedia pages together with their respective Wiki hyperlinks. These hyperlinks are considered ground truth annotations, the mention being the linked span of text and the true entity being the Wikipedia page it refers to. One can extract two kinds of basic statistics from such a corpus: first, counts of how often each entity was referred to by a specific name; second, pairwise co-occurrence counts for entities in documents. Our fundamental conjecture is that most of the relevant information needed for entity disambiguation is contained in these counts, i.e., that they are sufficient statistics. We thus request that our probability model reproduces these counts in expectation. As this alone typically yields an ill-defined problem, we follow the maximum entropy principle of Jaynes [17]: among the feasible set of distributions we favor the one with maximal entropy.

Formally, let $\mathcal{D}$ be an entity-linked document collection. Ignoring mention contexts for now, we extract for each document $d \in \mathcal{D}$ a sequence of mentions $\mathbf{m}^{(d)}$ and their corresponding target entities $\mathbf{e}^{(d)}$, both of length $n^{(d)}$. Assuming exchangeability of random variables within these sequences, we reduce each $(\mathbf{e}, \mathbf{m})$ to statistics (or features) about mention-entity and entity-entity co-occurrence as follows:

$$\phi_{e,m}(\mathbf{e}, \mathbf{m}) := \sum_{i=1}^{n} \mathbf{1}[e_i = e] \cdot \mathbf{1}[m_i = m], \quad \forall (e, m) \in \mathcal{E} \times \mathcal{V} \qquad (2)$$

$$\psi_{\{e,e'\}}(\mathbf{e}) := \sum_{i<j} \mathbf{1}[\{e_i, e_j\} = \{e, e'\}], \quad \forall e, e' \in \mathcal{E}, \qquad (3)$$

where $\mathbf{1}[\cdot]$ is the indicator function. Note that we use the subscript notation $\{e, e'\}$ for $\psi$ to take into account the symmetry in $e, e'$ as well as the fact that one may have $e = e'$.

The document collection provides us with empirical estimates for the expectation of these statistics under an i.i.d. sampling model for documents, namely the averages

$$\phi_{e,m}(\mathcal{D}) := \frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} \phi_{e,m}(\mathbf{e}^{(d)}, \mathbf{m}^{(d)}), \qquad (4)$$

$$\psi_{\{e,e'\}}(\mathcal{D}) := \frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} \psi_{\{e,e'\}}(\mathbf{e}^{(d)}). \qquad (5)$$
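As an illustration of how such statistics can be gathered, the sketch below (assuming a corpus represented as lists of (mention, entity) pairs per document; these names are not from the authors' released code) accumulates the mention-entity counts of Eq. (2) and the unordered entity-pair counts of Eq. (3), then averages them over documents as in Eqs. (4) and (5).

```python
# Minimal sketch: sufficient statistics phi and psi from an annotated corpus.
from collections import Counter
from itertools import combinations

def corpus_statistics(corpus):
    """corpus: iterable of documents, each a list of (mention, entity) pairs."""
    phi = Counter()   # phi[(e, m)]   ~ Eq. (2), summed over documents
    psi = Counter()   # psi[(e, e')]  ~ Eq. (3), keyed by a sorted tuple
    n_docs = 0
    for doc in corpus:
        n_docs += 1
        for mention, entity in doc:
            phi[(entity, mention)] += 1
        entities = [e for _, e in doc]
        for e1, e2 in combinations(entities, 2):   # all positions i < j
            psi[tuple(sorted((e1, e2)))] += 1
    # empirical averages over the collection D, as in Eqs. (4)-(5)
    phi_avg = {k: v / n_docs for k, v in phi.items()}
    psi_avg = {k: v / n_docs for k, v in psi.items()}
    return phi_avg, psi_avg
```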

Note that in entity disambiguation, the mention sequence $\mathbf{m}$ is always considered given, while we seek to predict the corresponding entity sequence $\mathbf{e}$. It is thus not necessary to try to model the joint distribution $p(\mathbf{e}, \mathbf{m})$, but sufficient to construct a conditional model $p(\mathbf{e} \mid \mathbf{m})$. Following Berger et al. [1], this can be accomplished by taking the empirical distribution $p(\mathbf{m} \mid \mathcal{D})$ of mention sequences and combining it with a conditional model via $p(\mathbf{e}, \mathbf{m}) = p(\mathbf{e} \mid \mathbf{m}) \cdot p(\mathbf{m} \mid \mathcal{D})$. We then require that:

$$\mathbb{E}_p[\phi_{e,m}] = \phi_{e,m}(\mathcal{D}) \quad \text{and} \quad \mathbb{E}_p[\psi_{\{e,e'\}}] = \psi_{\{e,e'\}}(\mathcal{D}), \qquad (6)$$

which yields $|\mathcal{E}| \cdot |\mathcal{V}| + \binom{|\mathcal{E}|}{2} + |\mathcal{E}|$ moment constraints on $p(\mathbf{e} \mid \mathbf{m})$.

The maximum entropy distributions fulfilling the constraints stated in Eq. (6) form a conditional exponential family for which $\phi(\cdot, \mathbf{m})$ and $\psi(\cdot, \cdot)$ are sufficient statistics. We thus know that there are canonical parameters $\rho_{e,m}$ and $\lambda_{\{e,e'\}}$ (formally corresponding to Lagrange multipliers) such that the maximum entropy distribution can be written as

$$p(\mathbf{e} \mid \mathbf{m}; \rho, \lambda) = \frac{1}{Z(\mathbf{m})} \exp\left[\langle \rho, \phi(\mathbf{e}, \mathbf{m}) \rangle + \langle \lambda, \psi(\mathbf{e}) \rangle\right], \qquad (7)$$

where $Z(\mathbf{m})$ is the partition function

$$Z(\mathbf{m}) := \sum_{\mathbf{e} \in \mathcal{E}^n} \exp\left[\langle \rho, \phi(\mathbf{e}, \mathbf{m}) \rangle + \langle \lambda, \psi(\mathbf{e}) \rangle\right]. \qquad (8)$$

Here we interpret $(e, m)$ and $\{e, e'\}$ as multi-indices and suggestively define the shorthands

$$\langle \rho, \phi \rangle := \sum_{e,m} \rho_{e,m}\, \phi_{e,m}, \quad \langle \lambda, \psi \rangle := \sum_{\{e,e'\}} \lambda_{\{e,e'\}}\, \psi_{\{e,e'\}}. \qquad (9)$$

Note that we can switch between the statistics view and the raw data view by observing that

$$\langle \rho, \phi(\mathbf{e}, \mathbf{m}) \rangle = \sum_{i=1}^{n} \rho_{e_i, m_i}, \quad \langle \lambda, \psi(\mathbf{e}) \rangle = \sum_{i<j} \lambda_{\{e_i, e_j\}}. \qquad (10)$$

While the maximum entropy principle applied to our fundamental conjecture restricts the form of our model to a finite-dimensional exponential family, we need to investigate ways of finding the optimal or, as we will see, an approximately optimal distribution in this family. To that extent, we first re-interpret the obtained model as a factor graph model.

Figure 2: Proposed factor graph for a document with four mentions. Each mention node $m_i$ is paired with its corresponding entity node $E_i$, while all entity nodes are connected through entity-entity pair factors.

3.3 Markov Network and Factor Graph

Complementary to the maximum entropy estimation perspective, we want to present a view on our model in terms of probabilistic graphical models and factor graphs. Inspecting Eq. (7) and interpreting $\phi$ and $\psi$ as potential functions, we can recover a Markov network that makes conditional independence assumptions of the following type: an entity link $e_i$ and a mention $m_j$ with $i \neq j$ are independent, given $m_i$ and $\mathbf{e}_{-i}$, where $\mathbf{e}_{-i}$ denotes the set of entity variables in the document excluding $e_i$. This means that a mention $m_j$ only influences a variable $e_i$ through the intermediate variable $e_j$. However, the functional form in Eq. (7) goes beyond these conditional independences in that it limits the order of interaction among the variables. A variable $e_i$ interacts with neighbors in its Markov blanket through pairwise potentials. In terms of a factor graph decomposition, $p(\mathbf{e} \mid \mathbf{m})$ decomposes into functions of two arguments only, modeling pairwise interactions between entities on one hand, and between entities and their corresponding mentions on the other hand.

We emphasize the factor model view by rewriting (7) as

$$p(\mathbf{e} \mid \mathbf{m}; \rho, \lambda) \propto \prod_i \exp\left[\rho_{e_i, m_i}\right] \cdot \prod_{i<j} \exp\left[\lambda_{\{e_i, e_j\}}\right], \qquad (11)$$

where we think of $\rho$ and $\lambda$ as functions

$$\rho : \mathcal{E} \times \mathcal{V} \to \mathbb{R}, \; (e, m) \mapsto \rho_{e,m}, \qquad \lambda : \mathcal{E} \cup \mathcal{E}^2 \to \mathbb{R}, \; \{e, e'\} \mapsto \lambda_{\{e,e'\}}.$$

An example of a factor graph ($n = 4$) is shown in Figure 2. We will investigate in the sequel how the factor graph structure can be further exploited.

3.4 (Pseudo-)Likelihood Maximization

While the maximum entropy approach directly motivates the exponential form of Eq. (7) and is amenable to a plausible factor graph interpretation, it does not by itself suggest an efficient parameter fitting algorithm. As is known by convex duality, the optimal parameters can be obtained by maximizing the conditional likelihood of the model under the data,

$$L(\rho, \lambda; \mathcal{D}) = \sum_{d} \log p(\mathbf{e}^{(d)} \mid \mathbf{m}^{(d)}; \rho, \lambda). \qquad (12)$$

However, specialized algorithms for maximum entropy estimation such as generalized iterative scaling [6] are known to be slow, whereas gradient-based methods require the computation of gradients of $L$, which involves evaluating expectations with regard to the model, since

$$\nabla_\rho \log Z(\mathbf{m}) = \mathbb{E}_p\, \phi(\mathbf{e}, \mathbf{m}), \quad \nabla_\lambda \log Z(\mathbf{m}) = \mathbb{E}_p\, \psi(\mathbf{e}). \qquad (13)$$

The exact inference problem of computing these model expectations, however, is not generally tractable due to the pairwise couplings through the $\psi$-statistics.

As an alternative to maximizing the likelihood in Eq. (12), we have investigated an approximation known as pseudo-likelihood maximization [35, 38]. Its main benefits are low computational complexity, simplicity and practical success. Switching to the Markov network view, the pseudo-likelihood estimator predicts each variable conditioned on the value of all variables in its Markov blanket. The latter consists of the minimal set of variables that renders a variable conditionally independent of everything else. In our case the Markov blanket consists of all variables that share a factor with a given variable. Consequently, the Markov blanket of $e_i$ is $\mathcal{N}(e_i) := (m_i, \mathbf{e}_{-i})$. The posterior is then approximated in the pseudo-likelihood approach as:

$$p(\mathbf{e} \mid \mathbf{m}; \rho, \lambda) := \prod_{i=1}^{n} p(e_i \mid \mathcal{N}(e_i); \rho, \lambda), \qquad (14)$$

which results in the tractable log-likelihood function

$$L(\rho, \lambda; \mathcal{D}) := \sum_{d \in \mathcal{D}} \sum_{i=1}^{n^{(d)}} \log p(e_i^{(d)} \mid \mathcal{N}(e_i^{(d)}); \rho, \lambda). \qquad (15)$$

Introducing additional L2-norm penalties $\gamma(\|\lambda\|_2^2 + \|\rho\|_2^2)$ to further regularize $L$, we have utilized parallel stochastic gradient descent (SGD) [28] with sparse updates to learn the parameters $\rho, \lambda$. From a practical perspective, we only keep, for each token span $m$, parameters $\rho_{e,m}$ for the most frequently observed entities $e$. Moreover, we only use $\lambda_{\{e,e'\}}$ for entity pairs $(e, e')$ that co-occurred together a sufficient number of times in the collection $\mathcal{D}$.^5 As we will discuss in more detail in Section 5, our experimental findings suggest this brute-force learning approach to be somewhat ineffective, which has motivated us to develop simpler, yet more effective plug-in estimators as described below.

^5 For the Wikipedia collection, even after these pruning steps, we ended up with more than 50 million parameters in total.
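For concreteness, a minimal sketch of the per-variable conditional that pseudo-likelihood training optimizes is given below; the dictionaries rho and lam and the candidates function are assumed placeholders for the learned parameters and a candidate generator, not the authors' actual data structures.

```python
# Sketch of log p(e_i | N(e_i)) from Eq. (14), up to the per-variable
# normalizer over the candidate set; this is what SGD on Eq. (15) needs.
import math

def log_local_conditional(e_i, m_i, other_entities, rho, lam, candidates):
    def score(e):
        s = rho.get((e, m_i), -1e9)  # unseen (entity, mention) pairs get a large penalty
        s += sum(lam.get(frozenset((e, e_j)), 0.0) for e_j in other_entities)
        return s
    scores = [score(e) for e in candidates(m_i)]
    mx = max(scores)
    log_den = mx + math.log(sum(math.exp(s - mx) for s in scores))  # stable log-sum-exp
    return score(e_i) - log_den
```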

3.5 Bethe Approximation

The major computational difficulty with our model lies in the pairwise couplings between entities and the fact that these couplings are dense: the Markov dependency graph between different entity links in a document is always a complete graph. Let us consider what would happen if the dependency structure were loop-free, i.e., if it formed a tree. Then we could rewrite the prior probability in terms of marginal distributions in the so-called Bethe form. Encoding the tree structure in a symmetric relation $\mathcal{T}$, we would get

$$p(\mathbf{e}) = \frac{\prod_{\{i,j\} \in \mathcal{T}} p(e_i, e_j)}{\prod_{i=1}^{n} p(e_i)^{\,d_i - 1}}, \quad d_i := |\{j : \{i, j\} \in \mathcal{T}\}|. \qquad (16)$$

The Bethe approximation [39] pursues the idea of using the above representation as an unnormalized approximation for $p(\mathbf{e})$, even when the Markov network has cycles. How does this relate to the exponential form in Eq. (7)? By simple pattern matching, we see that if we choose

$$\lambda_{\{e,e'\}} = \log\left(\frac{p(e, e')}{p(e)\, p(e')}\right), \quad \forall e, e' \in \mathcal{E}, \qquad (17)$$

we can apply Eq. (16) to get an approximate distribution

$$p(\mathbf{e}) \propto \frac{\prod_{i<j} p(e_i, e_j)}{\prod_{i=1}^{n} p(e_i)^{\,n-2}} = \prod_{i=1}^{n} p(e_i) \prod_{i<j} \frac{p(e_i, e_j)}{p(e_i)\, p(e_j)} = \exp\left[\sum_i \log p(e_i) + \sum_{i<j} \lambda_{\{e_i, e_j\}}\right], \qquad (18)$$

where we see the same exponential form in $\lambda$ appearing as in Eq. (10). We complete this argument by observing that with

$$\rho_{e,m} = \log p(e) + \log p(m \mid e) \qquad (19)$$

we obtain a representation of a joint distribution that exactly matches the form in Eq. (7).
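The plug-in estimators of Eqs. (17) and (19) reduce to a few lines once (smoothed) marginal and pairwise probabilities are available; the sketch below assumes they are stored in plain dictionaries and is only meant to illustrate the formulas.

```python
# Plug-in parameters sketched from Eqs. (17) and (19): pairwise parameters are
# the pointwise mutual information of entity co-links, unary parameters combine
# the entity prior with the name-given-entity probability.
import math

def lambda_param(e1, e2, p_pair, p_marg):
    # Eq. (17): lambda_{e,e'} = log( p(e,e') / (p(e) p(e')) )
    return math.log(p_pair[(e1, e2)] / (p_marg[e1] * p_marg[e2]))

def rho_param(e, m, p_marg, p_name_given_entity):
    # Eq. (19): rho_{e,m} = log p(e) + log p(m|e)
    return math.log(p_marg[e]) + math.log(p_name_given_entity[(m, e)])
```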

What have we gained so far? We started from the desire of constructing a model that would agree with the observed data on the co-occurrence probabilities of token spans and their linked entities as well as on the co-link probability of entity pairs within a document. This has led to the conditional exponential family in Eq. (7). We have then proposed pseudo-likelihood maximization as a way to arrive at a tractable learning algorithm to try to fit the massive number of parameters $\rho$ and $\lambda$. Alternatively, we have now seen that a Bethe approximation of the joint prior $p(\mathbf{e})$ yields a conditional distribution $p(\mathbf{e} \mid \mathbf{m})$ that (i) is a member of the same exponential family, (ii) has explicit formulas for how to choose the parameters from pairwise marginals, and (iii) would be exact in the case of a dependency tree. We claim that the benefits of computational simplicity together with the correctness guarantee for non-dense dependency networks outweigh the approximation loss, relative to the model with the best generalization performance within the conditional exponential family. In order to close the suboptimality gap further, we suggest some important refinements below.

3.6 Parameter Calibration

With the previous suggestion, one issue comes into play: the total contribution coming from the pairwise interactions between entities will scale with $\binom{n}{2}$, while the entity-mention compatibility contributions will scale with $n$, the total number of mentions. This is a direct observation of the number of terms contributing to the sums in (10). However, for practical reasons, it is somewhat implausible that, as $n$ grows, the prior $p(\mathbf{e})$ should dominate and the contribution of the likelihood term should vanish. The model is not well-calibrated with regard to $n$.

We propose to correct for this effect by adding a normalization factor to the $\lambda$-parameters, replacing (17) with:

$$\lambda^{n}_{e,e'} = \frac{2}{n-1} \log\left(\frac{p(e, e')}{p(e) \cdot p(e')}\right), \quad \forall e, e' \in \mathcal{E}, \qquad (20)$$

where now these parameters scale inversely with $n$, the number of entity links in a document, making the corresponding sum in Eq. (7) scale with $n$. With this simple change, a substantial accuracy improvement was observed empirically, the details of which are reported in our experiments.

The re-calibration in Eq. (20) can also be justified by the following combinatorial argument. For a given set $\mathcal{Y}$ of random variables, define a $\mathcal{Y}$-cycle as a graph containing as nodes all variables in $\mathcal{Y}$, each with degree exactly 2, connected in a single cycle. Let $\Xi$ be the set enumerating all possible $\mathcal{Y}$-cycles. Then $|\Xi| = (n-1)!$, where $n$ is the size of $\mathcal{Y}$.

In our case, if the entity variables $\mathbf{e}$ per document had formed a cycle of length $n$ instead of a complete subgraph, the Bethe approximation would have been written as:

$$p_\pi(\mathbf{e}) \propto \frac{\prod_{(i,j) \in E(\pi)} p(e_i, e_j)}{\prod_i p(e_i)}, \quad \forall \pi \in \Xi, \qquad (21)$$

where $E(\pi)$ is the set of edges of the $\mathbf{e}$-cycle $\pi$. However, as we do not desire to further constrain our graph with additional independence assumptions, we propose to approximate the joint prior $p(\mathbf{e})$ by the average of the Bethe approximations of all possible $\pi$, that is

$$\log p(\mathbf{e}) \approx \frac{1}{|\Xi|} \sum_{\pi \in \Xi} \log p_\pi(\mathbf{e}). \qquad (22)$$

Since each pair $(e_i, e_j)$ would appear in exactly $2(n-2)!$ $\mathbf{e}$-cycles, one can derive the final approximation:

$$p(\mathbf{e}) \approx \frac{\prod_{i<j} p(e_i, e_j)^{\frac{2}{n-1}}}{\prod_i p(e_i)}. \qquad (23)$$

Distributing marginal probabilities over the parameters starting from Eq. (23) and applying a similar argument as in Eq. (18) results in the assignment given by Eq. (20). While the above line of argument is not a strict mathematical derivation, we believe it sheds further light on the empirically observed effectiveness of the parameter re-scaling.

3.7 Integrating Context

The model that we have discussed so far does not consider the local context of a mention. This is a powerful source of information that a competitive entity linking system should utilize. For example, words like "computer", "company" or "device" are more likely to appear near references of the entity Apple_Inc. than of the entity Apple_fruit. We demonstrate in this section how this integration can be easily done in a principled way on top of the current probabilistic model. This showcases the extensibility of our approach. Enhancing our model with additional knowledge such as entity categories or word co-reference can also be done in a rigorous way, so we hope that this provides a template for future extensions.

As stated in Section 3.1, for each mention $m_i$ in a document, we maintain a context representation $c_i$ consisting of the bag of words surrounding the mention within a window of length $K$.^6 Hence, $c_i$ can be viewed as an additional random variable with an observed outcome. At this stage, we make additional reasonable independence assumptions that increase the tractability of our model.

^6 Throughout our experiments, we used a context window of size K = 100, intuitively chosen and without extensive validation.

First, we assume that, knowing the identity of the linked entity $e_i$, the mention token span $m_i$ is just the surface form of the entity, so it brings no additional information for the generative process describing the surrounding context $c_i$. Formally, this means that $m_i$ and $c_i$ are conditionally independent given $e_i$. Consequently, we obtain a factorial expression for the joint model

$$p(\mathbf{e}, \mathbf{m}, \mathbf{c}) = p(\mathbf{e})\, p(\mathbf{m}, \mathbf{c} \mid \mathbf{e}) = p(\mathbf{e}) \prod_{i=1}^{n} p(m_i \mid e_i)\, p(c_i \mid e_i). \qquad (24)$$

This is a simple extension of the previous factor graph that includes context variables. Second, we assume conditional independence of the words in $c_i$ given an entity $e_i$, which lets us factorize the context probabilities as

$$p(c_i \mid e_i) = \prod_{w_j \in c_i} p(w_j \mid e_i). \qquad (25)$$

Note that this assumption is commonly made in models using bag-of-words representations or naïve Bayes classifiers.

While this completes the argument from a joint model point of view, we need to consider one more aspect for the conditional distribution $p(\mathbf{e} \mid \mathbf{m}, \mathbf{c})$ that we are interested in. If we cannot afford (computationally as well as with regard to training data size) a full-blown discriminative learning approach, then how do we balance the relative influence of the context $c_i$ and the mention token span $m_i$ on $e_i$? For instance, the effect of $c_i$ will depend on the chosen window size $K$, which is not realistic.

To address this issue, we resort to a hybrid approach where, in the spirit of the Bethe approximation, we continue to express our model in terms of simple marginal distributions that can be easily estimated independently from data, yet that allow for a small number of parameters (in our case "small" equals 2) to be chosen to optimize the conditional log-likelihood $p(\mathbf{e} \mid \mathbf{m}, \mathbf{c})$. We thus introduce weights $\zeta$ and $\tau$ that control the importance of the context factors and of the entity-entity interaction factors, respectively. Putting equations (19), (20), (24) and (25) together, we arrive at the final model that will be subsequently referred to as the PBoH model (Probabilistic Bag of Hyperlinks):

$$\log p(\mathbf{e} \mid \mathbf{m}, \mathbf{c}) = \sum_{i=1}^{n} \Big[ \log p(e_i \mid m_i) + \zeta \sum_{w_j \in c_i} \log p(w_j \mid e_i) \Big] + \frac{2\tau}{n-1} \sum_{i<j} \log\left(\frac{p(e_i, e_j)}{p(e_i)\, p(e_j)}\right) + \text{const}. \qquad (26)$$

Here we used the identity $p(m \mid e)\, p(e) = p(e \mid m)\, p(m)$ and absorbed all $\log p(m)$ terms in the constant. We use grid search on a validation set for the remaining problem of optimizing over the parameters $\zeta, \tau$. Details are provided in Section 5.
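Putting the pieces together, a minimal scoring sketch of the PBoH objective of Eq. (26) for a single candidate assignment might look as follows; the probability tables, the sorted-tuple keying of pairs and the weights zeta and tau are assumed inputs, and this is an illustration rather than the released implementation.

```python
# Sketch: PBoH log-score of Eq. (26) for one joint entity assignment.
import math
from itertools import combinations

def pboh_log_score(entities, mentions, contexts,
                   p_e_given_m, p_w_given_e, p_pair, p_marg,
                   zeta, tau):
    n = len(entities)
    score = 0.0
    for e, m, ctx in zip(entities, mentions, contexts):
        score += math.log(p_e_given_m[(e, m)])                     # local name prior
        score += zeta * sum(math.log(p_w_given_e[(w, e)]) for w in ctx)
    if n > 1:
        coh = sum(math.log(p_pair[tuple(sorted((ei, ej)))] /       # pairs keyed by sorted tuple
                           (p_marg[ei] * p_marg[ej]))
                  for ei, ej in combinations(entities, 2))
        score += (2.0 * tau / (n - 1)) * coh                        # calibrated entity prior
    return score  # up to the additive constant of Eq. (26)
```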

3.8 Smoothing Empirical Probabilities

In order to estimate the probabilities involved in Eq. (26), we rely on an entity-annotated corpus of text documents, e.g., Wikipedia Web pages together with their hyperlinks, which we view as ground truth annotations. From this corpus, we derive empirical probabilities for a name-to-entity dictionary $p(m \mid e)$ based on counting how many times an entity appeared referenced by a given name^7. We also compute the pairwise probabilities $p(e, e')$ obtained by counting the pairwise co-occurrence of entities $e$ and $e'$ within the same document. Similarly, we obtained empirical values for the marginals $p(e) = \sum_{e'} p(e, e')$ and for the context word-entity statistics $p(w \mid e)$.

^7 In our implementation we summed the mention-entity counts from Wikipedia hyperlinks with the Crosswikis counts [32].

In the absence of huge amounts of data, estimating such probabilities from counts is subject to sparsity. For instance, in our statistics, there are 8 times more distinct pairs of entities that co-occur in at most 3 Wikipedia documents compared to the total number of distinct pairs of entities that appear together in at least 4 documents. Thus, it is expected that the heavy tail of infrequent pairs of entities will have a strong impact on the accuracy of our system.

Traditionally, various smoothing techniques are employed to address sparsity issues arising commonly in areas such as natural language processing. Out of the wealth of methods, we decided to use the absolute discounting smoothing technique [40] that involves interpolation of higher and lower order (backoff) models. In our case, whenever insufficient data is available for a pair of entities $(e, e')$, we assume the two entities are drawn from independent distributions. Thus, if we denote by $N(e, e')$ the total number of corpus documents that link both $e$ and $e'$, and by $N_{ep}$ the total number of pairs of entities referenced in each document, then the final formula for the smoothed entity pairwise probabilities is:

$$p(e, e') = \frac{\max(N(e, e') - \delta, 0)}{N_{ep}} + (1 - \mu_e)\, p(e)\, p(e'), \qquad (27)$$

where $\delta \in [0, 1]$ is a fixed discount and $\mu_e$ is a constant that assures that $\sum_e \sum_{e'} p(e, e') = 1$. $\delta$ was set by performing a coarse grid search on a validation set. The best $\delta$ value was found to be 0.5.
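A small sketch of this interpolation is shown below; N_pair, N_ep and mu_e are assumed precomputed inputs, and whereas the paper chooses mu_e so that the smoothed distribution normalizes, here it is simply passed in.

```python
# Sketch of the absolute-discounting interpolation of Eq. (27).
def smoothed_pair_prob(e1, e2, N_pair, N_ep, p_marg, delta=0.5, mu_e=0.9):
    # discounted relative frequency of the observed co-link count
    discounted = max(N_pair.get((e1, e2), 0) - delta, 0.0) / N_ep
    # backoff to the independence assumption p(e) * p(e')
    backoff = (1.0 - mu_e) * p_marg[e1] * p_marg[e2]
    return discounted + backoff
```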

The word-entity empirical probabilities $p(w \mid e)$ were computed based on the Wikipedia corpus by counting the frequency with which word $w$ appears in the context windows of size $K$ around the hyperlinks pointing to $e$. In order to avoid memory explosion, we only considered the entity-word pairs for which these counts are at least 3. These empirical estimates are also sparse, so we used absolute discounting smoothing for their correction by backing off to the unbiased estimates $p(w)$. The latter can be much more accurately estimated from any text corpus. Finally, we obtain:

$$p(w \mid e) = \frac{\max(N(w, e) - \xi, 0)}{N_{wp}} + (1 - \mu_w)\, p(w). \qquad (28)$$

Again, $\xi \in [0, 1]$ was optimized by grid search to be 0.5.

4. INFERENCE

After introducing our model and showing how to train it in the previous section, we now explain the inference process used for prediction.

4.1 Candidate Selection

At test time, for each mention to be disambiguated, we first select a set of potential candidates by considering the top $R$ ranked entities based on the local mention-entity probability dictionary $p(e \mid m)$. We found $R = 64$ to be a good compromise between efficiency and accuracy loss. Second, we want to keep the average number of candidates per mention as small as possible in order to reduce the running time, which is quadratic in this number (see the next section for details). Consequently, we further limit the number of candidates per mention by keeping only the top 10 entity candidates re-ranked by the local mention-context-entity compatibility defined as

$$\log p(e_i \mid m_i, c_i) = \log p(e_i \mid m_i) + \zeta \sum_{w_j \in c_i} \log p(w_j \mid e_i) + \text{const}. \qquad (29)$$

These pruning heuristics result in a significantly improved running time at an insignificant accuracy loss.

If the given mention is not found in our map $p(e \mid m)$, we try to replace it by the closest name in this dictionary. Such a name is picked only if the Jaccard distance between the sets of letter trigrams of the two strings is smaller than a threshold that we empirically set to 0.5. Otherwise, the mention is not linked at all.
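The fallback lookup can be sketched as follows (illustrative only; dictionary_names stands for the keys of the p(e|m) map).

```python
# Sketch: match an out-of-dictionary mention to the closest dictionary name by
# Jaccard distance over letter trigrams, accepting only distances below 0.5.
def letter_trigrams(s):
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)} or {s}  # fall back for short strings

def closest_name(mention, dictionary_names, threshold=0.5):
    t_m = letter_trigrams(mention)
    best_name, best_dist = None, 1.0
    for name in dictionary_names:
        t_n = letter_trigrams(name)
        jaccard = len(t_m & t_n) / len(t_m | t_n)
        dist = 1.0 - jaccard
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < threshold else None  # None: leave the mention unlinked
```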

4.2 Belief Propagation

Collectively disambiguating all mentions in a text involves iterating through an exponential number of possible entity resolutions. Exact inference in general graphical models is NP-hard, therefore approximations are employed. We propose solving the inference problem through the loopy belief propagation (LBP) [24] technique, using the max-product algorithm that approximates the MAP solution in a run-time polynomial in $n$, the number of input mentions. For the sake of brevity, we only present the algorithm for the maximum entropy model described by Eq. (7); a similar approach was used for the enhanced PBoH model given by Eq. (26).

Our proposed graphical model is a fully connected graph where each node corresponds to an entity random variable. Unary potentials $\exp(\rho_{e,m})$ model the entity-mention compatibility, while pairwise potentials $\exp(\lambda_{\{e,e'\}})$ express entity-entity correlations. For the posterior in Eq. (7), one can derive the update equation of the logarithmic message that is sent in round $t+1$ from entity random variable $E_i$ to the outcome $e_j$ of the entity random variable $E_j$:

$$m^{t+1}_{E_i \to E_j}(e_j) = \max_{e_i} \Big[ \rho_{e_i, m_i} + \lambda_{\{e_i, e_j\}} + \sum_{1 \le k \le n;\, k \ne j} m^{t}_{E_k \to E_i}(e_i) \Big]. \qquad (30)$$

Note that, for simplicity, we skip the factor graph framework and send messages directly between each pair of entity variables. This is equivalent to the original BP framework.

We chose to update messages synchronously: in each round $t$, every two entity nodes $E_i$ and $E_j$ exchange messages. This is done until convergence or until an allowed maximum number of iterations (15 in our experiments) is reached. The convergence criterion is:

$$\max_{1 \le i, j \le n;\, e_j \in \mathcal{E}} \left| m^{t+1}_{E_i \to E_j}(e_j) - m^{t}_{E_i \to E_j}(e_j) \right| \le \epsilon, \qquad (31)$$

where $\epsilon = 10^{-5}$. This setting was sufficient in most of the cases to reach convergence.

Dataset      | # non-NIL mentions | # documents
AIDA test A  | 4791               | 216
AIDA test B  | 4485               | 231
MSNBC        | 656                | 20
AQUAINT      | 727                | 50
ACE04        | 257                | 35

Table 1: Statistics on some of the used datasets.

In the end, the final entity assignment is determined by:

$$e^{*}_i = \arg\max_{e_i} \Big[ \rho_{e_i, m_i} + \sum_{1 \le k \le n} m^{t}_{E_k \to E_i}(e_i) \Big]. \qquad (32)$$

The complexity of the belief propagation algorithm is, in our case, $O(n^2 \cdot r^2)$, with $n$ being the number of mentions in a document and $r$ being the average number of candidate entities per mention (10 in our case). More details regarding the run-time and convergence of the loopy BP algorithm can be found in Section 5.
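For illustration, a compact max-product loopy BP sketch following Eqs. (30)-(32) is given below; candidates, unary and pair are assumed stand-ins for the per-mention candidate lists, the ρ values and the (rescaled) λ values, and the message schedule is synchronous as described above. This is an illustration, not the released code.

```python
# Sketch: synchronous max-product loopy belief propagation in log space.
def max_product_lbp(candidates, unary, pair, max_rounds=15, eps=1e-5):
    n = len(candidates)
    # messages[(i, j)][e_j]: log message from entity variable E_i to value e_j of E_j
    messages = {(i, j): {e: 0.0 for e in candidates[j]}
                for i in range(n) for j in range(n) if i != j}
    for _ in range(max_rounds):
        new_messages, delta = {}, 0.0
        for (i, j), msg in messages.items():
            new_msg = {}
            for e_j in candidates[j]:
                # Eq. (30): maximize over the candidates of E_i
                new_msg[e_j] = max(
                    unary[i][e_i] + pair(e_i, e_j) +
                    sum(messages[(k, i)][e_i] for k in range(n) if k not in (i, j))
                    for e_i in candidates[i])
            new_messages[(i, j)] = new_msg
            delta = max(delta, max(abs(new_msg[e] - msg[e]) for e in msg))
        messages = new_messages
        if delta <= eps:          # convergence criterion of Eq. (31)
            break
    # Eq. (32): decide each entity from its unary score plus incoming messages
    return [max(candidates[i],
                key=lambda e: unary[i][e] +
                sum(messages[(k, i)][e] for k in range(n) if k != i))
            for i in range(n)]
```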

5. EXPERIMENTS

We now present the experimental evaluation of our method. We first uncover some practical details of our approach. Further, we show an empirical comparison between PBoH and well-known or recent competitive entity disambiguation systems. We use the Gerbil testing platform [37] version 1.1.4 with the D2KB setting, in which a document together with a fixed set of mentions to be annotated is given as input. We run additional experiments that allow us to compare against more recent approaches, such as [16] and [11].

Note that in all the experiments we assume that we have access to a set of linkable token spans for each document. In practice this set is obtained by first applying a mention detection approach, which is not part of our method. Our main goal is then to annotate each token span with a Wikipedia entity^8.

^8 In PBoH, we refrain from annotating mentions for which no candidate entity is found according to the procedure described in Section 4.1.

Evaluation metrics. We quantify the quality of an entity linking system by measuring common metrics such as precision, recall and F1 scores.

Let $M^*$ be the ground truth entity annotations associated with a given set of mentions $X$. Note that in all the results reported, mentions that contain NIL or empty ground truth entities are discarded before the evaluation; this decision is taken as well in Gerbil version 1.1.4. Let $M$ be the output annotations of an entity disambiguation system on the same input. Then, our quality metrics are computed as follows:

• Precision: $P = \frac{|M \cap M^*|}{|M|}$
• Recall: $R = \frac{|M \cap M^*|}{|M^*|}$
• F1 score: $F1 = \frac{2 \cdot P \cdot R}{P + R}$

We mostly report results in terms of F1 scores, namely macro-averaged F1@MA (aggregated across documents) and micro-averaged F1@MI (aggregated across mentions). For a fair comparison with Houlsby and Ciaramita [16], we also report micro-recall R@MI and macro-recall R@MA on the AIDA datasets.
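As a sketch of how these metrics are typically computed (assuming per-document sets of (mention, entity) annotations; this is not Gerbil's code), micro scores pool counts over all mentions while macro scores average per-document F1:

```python
# Sketch: micro- and macro-averaged precision/recall/F1 over documents.
def prf(pred, gold):
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

def micro_macro_f1(pred_docs, gold_docs):
    """pred_docs, gold_docs: parallel lists, one set of (mention, entity) pairs per document."""
    tp = sum(len(p & g) for p, g in zip(pred_docs, gold_docs))
    n_pred = sum(len(p) for p in pred_docs)
    n_gold = sum(len(g) for g in gold_docs)
    micro_p = tp / n_pred if n_pred else 0.0
    micro_r = tp / n_gold if n_gold else 0.0
    micro_f1 = (2 * micro_p * micro_r / (micro_p + micro_r)
                if micro_p + micro_r else 0.0)
    macro_f1 = sum(prf(p, g)[2] for p, g in zip(pred_docs, gold_docs)) / len(gold_docs)
    return micro_f1, macro_f1
```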

Table 2: Micro and macro F1 scores (each cell: F1@MI / F1@MA) reported by Gerbil for 14 datasets and 11 entity linking systems including PBoH. In the original table, for each dataset and each metric, the best system is highlighted in red and the second best in blue. Columns, in order: ACE2004, AIDA/CoNLL-Complete, AIDA/CoNLL-Test A, AIDA/CoNLL-Test B, AIDA/CoNLL-Training, AQUAINT, DBpedia Spotlight, IITB, KORE50, Microposts2014-Test, Microposts2014-Train, MSNBC, N3-Reuters-128, N3-RSS-500.

AGDISTIS: 65.83/77.63, 60.27/56.97, 59.06/53.36, 58.32/58.03, 61.05/57.53, 60.10/58.62, 36.61/33.25, 41.23/43.38, 34.16/30.20, 42.43/61.08, 50.39/62.87, 75.42/73.82, 67.95/75.52, 59.88/70.80
Babelfy: 63.20/76.71, 78.00/73.81, 75.77/71.26, 80.36/74.52, 78.01/74.22, 72.27/73.23, 51.05/51.97, 57.13/55.36, 73.12/69.77, 47.20/62.11, 50.60/61.02, 78.17/75.73, 58.61/59.87, 69.17/76.00
DBpedia Spotlight: 70.38/80.02, 58.84/60.59, 54.90/54.11, 57.69/61.34, 60.04/62.23, 74.03/73.13, 69.27/67.23, 65.44/62.81, 37.59/32.90, 56.43/71.63, 56.26/67.99, 69.27/69.82, 56.44/58.77, 57.63/65.03
Dexter: 18.72/16.97, 48.46/45.29, 45.44/42.17, 48.59/46.20, 49.25/45.85, 38.28/38.15, 26.70/22.75, 28.53/28.48, 17.20/12.54, 31.27/44.02, 35.21/42.07, 36.86/39.42, 32.74/31.85, 31.11/33.55
Entityclassifier.eu: 12.74/12.3, 46.6/42.86, 44.13/42.36, 44.02/41.31, 47.83/43.36, 21.67/19.59, 22.59/18.0, 18.46/19.54, 27.97/25.2, 29.12/39.53, 32.69/38.41, 41.24/40.3, 28.4/24.84, 21.77/22.2
Kea: 80.08/87.57, 73.39/73.26, 70.9/67.91, 72.64/73.31, 74.22/74.47, 81.84/81.27, 73.63/76.60, 72.03/70.52, 57.95/53.17, 63.4/76.54, 64.67/74.32, 85.49/87.4, 63.2/64.45, 69.29/75.93
NERD-ML: 54.89/72.22, 54.62/52.35, 52.85/49.6, 52.59/51.34, 55.55/53.23, 49.68/46.06, 46.8/45.59, 51.08/49.91, 29.96/24.75, 38.65/57.91, 39.83/53.74, 64.03/67.28, 54.96/62.9, 61.22/67.3
TagMe 2: 81.93/89.09, 72.07/71.19, 69.07/66.5, 70.62/70.38, 73.2/72.45, 76.27/75.12, 63.31/65.1, 57.23/55.8, 57.34/54.67, 56.81/71.66, 59.14/70.45, 75.96/77.05, 59.32/67.55, 78.05/83.2
WAT: 80.0/86.49, 83.82/83.59, 81.82/80.25, 84.34/84.12, 84.21/84.22, 76.82/77.64, 65.18/68.24, 61.14/59.36, 58.99/53.13, 59.56/73.89, 61.96/72.65, 77.72/79.08, 64.38/65.81, 68.21/76.0
Wikipedia Miner: 77.14/86.36, 64.72/66.17, 61.65/61.67, 60.71/63.19, 66.48/67.93, 75.96/74.63, 62.57/61.43, 58.59/56.98, 41.63/35.0, 54.88/69.29, 55.93/67.0, 64.25/64.68, 60.05/66.51, 64.54/72.23
PBoH: 87.19/90.40, 86.72/86.85, 86.63/85.48, 87.39/86.32, 86.59/87.30, 86.64/86.14, 79.48/80.13, 62.47/61.04, 61.70/55.83, 74.19/84.48, 73.08/81.25, 89.54/89.62, 76.54/83.31, 71.24/78.33

Systems        | AIDA test A (R@MI, R@MA) | AIDA test B (R@MI, R@MA)
LocalMention   | 69.73, 69.30             | 67.98, 72.75
TagMe reimpl.  | 76.89, 74.57             | 78.64, 78.21
AIDA           | 79.29, 77.00             | 82.54, 81.66
S & Y          | -, 84.22                 | -, -
Houlsby et al. | 79.65, 76.61             | 84.89, 83.51
PBoH           | 85.70, 85.26             | 87.61, 86.44

Table 3: Results on the AIDA test-a and AIDA test-b datasets.

Note that, in our case, the precision and recall are not necessarily identical, since a method may not consider annotating certain mentions^8.

Pseudo-likelihood training. We briefly mention some of the practical issues that we encountered with the likelihood maximization described in Section 3.4. From the practical perspective, for each mention $m$, we only considered the set of parameters $\rho_{m,e}$ limited to the top 64 candidate entities $e$ per mention, ranked by $p(e \mid m)$. Additionally, we restricted the set $\lambda_{e,e'}$ to entity pairs $(e, e')$ that co-occurred together in at least 7 documents throughout the Wikipedia corpus. In total, a set of 26 million $\rho$ and 39 million $\lambda$ parameters were learned using the previously described procedure. Note that the universe of all Wikipedia entities is of size ~4 million.

For the SGD procedure, we tried different initializations of these parameters, including $\rho_{m,e} = \log p(e \mid m)$, $\lambda_{e,e'} = 0$, as well as the parameters given by Eq. (17). However, in all cases, the accuracy gain on a sample of 1000 Wikipedia test pages was small or negligible compared to the LocalMention baseline (described below). One reason is the inherent sparsity of the data: the parameters associated with the long tail of infrequent entity pairs are updated rarely and expected to be poorly estimated at the end of the SGD procedure. However, these scattered pairs are crucial for the effectiveness and coverage of the entity disambiguation system. To overcome this problem, we refined our model as described in Section 3.5 and subsequent sections.

PBoH training details. Wikipedia itself is a valuable resource for entity linking since each internal hyperlink can be considered as the ground truth annotation for the respective anchor text. In our system, the training is solely done on the entire Wikipedia corpus^9. Hyper-parameters are grid-searched such that the sum of the micro F1 and macro F1 scores is maximized over the combined held-out set containing only the AIDA Test-A dataset and a Wikipedia validation set consisting of 1000 random pages. As a preprocessing step in our training procedure, we removed all annotations and hyperlinks that point to non-existing, disambiguation or list Wikipedia pages.

^9 We used the Wikipedia dump from February 2014.

The PBoH system used in the experimental comparison is the model given by Eq. (26), for which grid search of the hyper-parameters suggested using ζ = 0.075, τ = 0.5, δ = 0.5, ξ = 0.5.

Systems      | new MSNBC (F1@MI, F1@MA) | new AQUAINT (F1@MI, F1@MA) | new ACE2004 (F1@MI, F1@MA)
LocalMention | 73.64, 77.71             | 87.33, 86.80               | 84.75, 85.70
Cucerzan     | 88.34, 87.76             | 78.67, 78.22               | 79.30, 78.22
M & W        | 78.43, 80.37             | 85.13, 84.84               | 81.29, 84.25
Han et al.   | 88.46, 87.93             | 79.46, 78.80               | 73.48, 66.80
AIDA         | 78.81, 76.26             | 56.47, 56.46               | 80.49, 84.13
GLOW         | 75.37, 77.33             | 83.14, 82.97               | 81.91, 83.18
RI           | 90.22, 90.87             | 87.72, 87.74               | 86.60, 87.13
REL-RW       | 91.37, 91.73             | 90.74, 90.58               | 87.68, 89.23
PBoH         | 91.06, 91.19             | 89.27, 88.94               | 88.71, 88.46

Table 4: Results on the newer versions of the MSNBC, AQUAINT and ACE04 datasets.

Datasets. We evaluate our approach on 14 well-known public entity linking datasets built from various sources. Statistics for some of them are shown in Table 1, and their descriptions are provided below. For information on the other datasets used only in the Gerbil experiments, refer to [37].

• The CoNLL-AIDA dataset is an entity-annotated corpus of Reuters news documents introduced by Hoffart et al. [15]. It is much larger than most of the other existing EL datasets, making it an excellent evaluation target. The data is divided into three parts: Train (not used in our current setting for training, but only in the Gerbil evaluation), Test-A (used for validation) and Test-B (used for blind evaluation). Similar to Houlsby and Ciaramita [16] and others, we also report results on the validation set Test-A.

• The AQUAINT dataset introduced by Milne and Witten [22] contains documents from a news corpus from the Xinhua News Service, the New York Times and the Associated Press.

• MSNBC [5] is a dataset of news documents that includes many mentions which do not easily map to Wikipedia titles because of their rare surface forms or distinctive lexicalization.

• The ACE04 dataset [27] is a subset of ACE2004 Coreference documents annotated using Amazon Mechanical Turk. Note that the ACE04 dataset contains mentions that are annotated with NIL entities, meaning that no proper Wikipedia entity was found. Following common practice, we removed all the mentions corresponding to these NIL entities prior to our evaluation.

Note that the Gerbil platform uses an old version of the AQUAINT, MSNBC and ACE04 datasets that contain some no-longer existing Wikipedia entities. A new cleaned version of these sets^10 was released by Guo & Barbosa [11]. We report results for the new cleaned datasets in Table 4, while Table 2 contains results for the old versions currently used by Gerbil.

^10 http://www.cs.ualberta.ca/~denilson/data/deos14_ualberta_experiments.tgz

                           | AIDA test A | AIDA test B | MSNBC  | AQUAINT | ACE04
Avg. num. mentions per doc | 22.18       | 19.41       | 32.8   | 14.54   | 7.34
Conv. rate                 | 100%        | 99.56%      | 100%   | 100%    | 100%
Avg. running time (ms/doc) | 445.56      | 203.66      | 371.65 | 40.42   | 10.88
Avg. num. rounds           | 2.86        | 2.83        | 3.0    | 2.56    | 2.25

Table 5: Loopy belief propagation statistics. Average running time, number of rounds and convergence rate of our inference procedure are provided.

Systems. For comparison, we selected a broad range of competitor systems from the vast literature in this field. The Gerbil platform already integrates the methods of Agdistis [36], Babelfy [23], DBpedia Spotlight [20], Dexter [3], Kea [33], Nerd-ML [29], Tagme2 [9], WAT [25], Wikipedia Miner [22] and Illinois Wikifier [27]. We furthermore compare against Cucerzan [5], the first collective EL system that uses optimization techniques; M & W [22], a popular machine learning approach; Han et al. [13], a graph-based disambiguation system that uses random walks for joint disambiguation; AIDA [15], a performant graph-based approach; GLOW [27], a system that uses local and global context to perform joint entity disambiguation; RI [4], an approach using relational inference for mention disambiguation; and REL-RW [11], a recent system that iteratively resolves mentions relying on an online updating random walk model. In addition, on the AIDA datasets we also compare against S & Y [30], an apparatus for combining the NER and EL tasks, and Houlsby et al. [16], a topic modelling LDA-based approach for EL.

To empirically assess the accuracy gain introduced by each incremental step of our approach, we ran experiments on several of our method's components individually: LocalMention links mentions to entities solely based on the token span statistics, i.e., $e^* = \arg\max_e p(e \mid m)$; Unnorm uses the unnormalized mention-entity model described in Section 3.5; Rescaled relies on the rescaled model presented in Section 3.6; LocalContext disambiguates an entity based on the mention and the local context probability given by Equation (29), i.e., $e^* = \arg\max_e p(e \mid m, c)$. Note that Unnorm, Rescaled and PBoH use the loopy belief propagation procedure for inference.


                        | MSNBC    | AQUAINT  | ACE2004
Avg. # mentions per doc | 36.95    | 14.54    | 8.68
PBoH (# entities)       | 247.19   | 95.38    | 66.66
REL-RW (# entities)     | 382773.6 | 242443.1 | 256235.49

Table 6: Average number of entities that appear in the graph built by PBoH and by REL-RW.

5.1 Results

Results of the experiments run on the Gerbil platform are shown in Table 2. Detailed results are also provided^11,12. We obtain the highest performance on 11 datasets and the second highest performance on 2 datasets, showing the effectiveness of our method.

Other results are presented in Table 3 and Table 4. The highest accuracy for the cleaned versions of AQUAINT, MSNBC and ACE04 was previously reported by Guo & Barbosa [11], while Houlsby et al. [16] dominate the AIDA datasets. Note that the performance of the baseline systems shown in these two tables is taken from [11] and [16].

All these methods are tested in the setting where a fixed set of mentions is given as input, without requiring the mention detection step.

Discussion. Several observations are worth noting here. First, the simple LocalMention component alone outperforms many EL systems. However, our experimental results show that PBoH consistently beats LocalMention on all the datasets. Second, PBoH produces state-of-the-art results on both the development (Test-A) and blind evaluation (Test-B) parts of the AIDA dataset. Third, on the AQUAINT, MSNBC and ACE04 datasets, PBoH outperforms all but one of the presented EL systems and is competitive with the state-of-the-art approaches. The method whose performance is closest to ours is REL-RW [11], whose average F1 score is only slightly higher than ours (+0.6 on average). However, there are significant advantages of our method that make it easier to use for practitioners. First, our approach is conceptually simpler and only requires sufficient statistics computed from Wikipedia. Second, PBoH shows a superior computational complexity manifested in significantly lower run times (Table 5), making it a good fit for large-scale real-time entity linking systems; this is not the case for REL-RW, qualified as "time consuming" by its authors. Third, the number of entities in the underlying graph, and thus the required memory, is significantly lower for PBoH (see statistics provided in Table 6).

Incremental accuracy gains. To give further insight into our method, Table 7 provides an overview of the contribution brought step by step by each incremental component of the full PBoH system. It can be noted that PBoH performs best, outranking all its individual components.

11 The PBoH Gerbil experiment is available at http://gerbil.aksw.org/gerbil/experiment?id=201510160025.

12 The detailed Gerbil results of the baseline systems can be accessed at http://gerbil.aksw.org/gerbil/experiment?id=201510160026.

Datasets          AIDA test A           AIDA test B
Systems           R@MI      R@MA        R@MI      R@MA
LocalMention      69.73     69.30       67.98     72.75
Unnorm            69.77     69.95       75.87     75.12
Rescaled          75.09     74.25       74.76     78.28
LocalContext      82.50     81.56       85.46     84.08
PBoH              85.53     85.09       87.51     86.39

Table 7: Accuracy gains of individual PBoH components.

Reproducibility of the experiments. Our experiments are easily reproducible using the details provided in this paper. Our learning procedure is based solely on statistics computed from the set of Wikipedia webpages. As a consequence, one can implement a real-time, highly accurate entity disambiguation system solely based on the details described in this paper.

Our code is publicly available at https://github.com/dalab/pboh-entity-linking.

6. CONCLUSION
In this paper, we described a light-weight graphical model for entity linking via approximate inference. Our method employs simple sufficient statistics that rely on three sources of information: first, a probabilistic name-to-entity map p(e|m) derived from a large corpus of hyperlinks; second, observational data about the pairwise co-occurrence of entities within documents from a Web collection; third, entity-context word statistics. Our experiments on a number of popular entity linking benchmarking collections show improved performance compared to several well-known or recent systems.
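To illustrate how light-weight these sufficient statistics are, the first ingredient, the name-to-entity map p(e|m), can be estimated by simple counting over hyperlink anchors. The sketch below is ours and assumes an already-extracted stream of (anchor text, target entity) pairs; it ignores any smoothing or pruning that a full system would need.

    from collections import Counter, defaultdict

    def build_p_e_given_m(anchor_pairs):
        """Estimate p(e|m) from an iterable of (anchor_text, target_entity) pairs.

        Every hyperlink contributes one count to its (mention, entity) pair;
        normalizing the counts per mention yields the empirical conditional p(e|m).
        """
        counts = defaultdict(Counter)
        for anchor_text, entity in anchor_pairs:
            counts[anchor_text.lower()][entity] += 1

        p_e_given_m = {}
        for mention, entity_counts in counts.items():
            total = sum(entity_counts.values())
            p_e_given_m[mention] = {e: c / total for e, c in entity_counts.items()}
        return p_e_given_m

    # Toy usage with three hypothetical anchors:
    model = build_p_e_given_m([
        ("Jordan", "Michael_Jordan"),
        ("Jordan", "Jordan_(country)"),
        ("Jordan", "Michael_Jordan"),
    ])
    # model["jordan"] == {"Michael_Jordan": 2/3, "Jordan_(country)": 1/3}

The pairwise entity co-occurrence counts could be collected analogously from documents in a Web collection.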

There are several promising directions for future work. Currently, our model considers only pairwise potentials. In the future, it would be interesting to investigate the use of higher-order potentials and submodular optimization in an entity linking pipeline, thus allowing us to capture the interplay between entire groups of entity candidates (e.g., through the use of entity categories). Additionally, we will further enrich our probabilistic model with statistics from new sources of information. We expect some of the performance gains that other papers report from using entity categories or semantic relations to be additive with respect to our system's current accuracy.

7. REFERENCES
[1] A. L. Berger, V. J. D. Pietra, and S. A. D. Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71, 1996.

[2] R. C. Bunescu and M. Pasca. Using encyclopedic knowledge for named entity disambiguation. In EACL, volume 6, pages 9–16, 2006.

[3] D. Ceccarelli, C. Lucchese, S. Orlando, R. Perego, and S. Trani. Dexter: an open source framework for entity linking. In Proceedings of the Sixth International Workshop on Exploiting Semantic Annotations in Information Retrieval, pages 17–20. ACM, 2013.

[4] X. Cheng and D. Roth. Relational inference for wikification. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013.


[5] S. Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In EMNLP-CoNLL, volume 7, pages 708–716, 2007.

[6] J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, pages 1470–1480, 1972.

[7] M. Dredze, P. McNamee, D. Rao, A. Gerber, and T. Finin. Entity disambiguation for knowledge base population. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 277–285. Association for Computational Linguistics, 2010.

[8] G. Durrett and D. Klein. A joint model for entity analysis: Coreference, typing, and linking. Transactions of the Association for Computational Linguistics, 2:477–490, 2014.

[9] P. Ferragina and U. Scaiella. Fast and accurate annotation of short texts with Wikipedia pages. arXiv preprint arXiv:1006.3498, 2010.

[10] P. Ferragina and U. Scaiella. Tagme: on-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pages 1625–1628. ACM, 2010.

[11] Z. Guo and D. Barbosa. Robust entity linking via random walks. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, CIKM '14, pages 499–508, New York, NY, USA, 2014. ACM.

[12] X. Han and L. Sun. An entity-topic model for entity linking. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 105–115. Association for Computational Linguistics, 2012.

[13] X. Han, L. Sun, and J. Zhao. Collective entity linking in web text: a graph-based method. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 765–774. ACM, 2011.

[14] Z. He, S. Liu, M. Li, M. Zhou, L. Zhang, and H. Wang. Learning entity representation for entity disambiguation. In ACL (2), pages 30–34, 2013.

[15] J. Hoffart, M. A. Yosef, I. Bordino, H. Furstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, and G. Weikum. Robust disambiguation of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 782–792. Association for Computational Linguistics, 2011.

[16] N. Houlsby and M. Ciaramita. A scalable Gibbs sampler for probabilistic entity linking. In Advances in Information Retrieval, pages 335–346. Springer, 2014.

[17] E. T. Jaynes. On the rationale of maximum-entropy methods. Proceedings of the IEEE, 70(9):939–952, 1982.

[18] S. S. Kataria, K. S. Kumar, R. R. Rastogi, P. Sen, and S. H. Sengamedu. Entity disambiguation with hierarchical topic models. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1037–1045. ACM, 2011.

[19] S. Kulkarni, A. Singh, G. Ramakrishnan, and S. Chakrabarti. Collective annotation of Wikipedia entities in web text. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 457–466. ACM, 2009.

[20] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems, pages 1–8. ACM, 2011.

[21] R. Mihalcea and A. Csomai. Wikify!: linking documents to encyclopedic knowledge. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pages 233–242. ACM, 2007.

[22] D. Milne and I. H. Witten. Learning to link with Wikipedia. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 509–518. ACM, 2008.

[23] A. Moro, A. Raganato, and R. Navigli. Entity Linking meets Word Sense Disambiguation: a Unified Approach. Transactions of the Association for Computational Linguistics (TACL), 2:231–244, 2014.

[24] K. P. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, UAI '99, pages 467–475, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.

[25] F. Piccinno and P. Ferragina. From TagME to WAT: a new entity annotator. In Proceedings of the First International Workshop on Entity Recognition & Disambiguation, pages 55–62. ACM, 2014.

[26] A. Pilz and G. Paaß. From names to entities using thematic context distance. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 857–866. ACM, 2011.

[27] L. Ratinov, D. Roth, D. Downey, and M. Anderson. Local and global algorithms for disambiguation to Wikipedia. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 1375–1384. Association for Computational Linguistics, 2011.

[28] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 693–701, 2011.

[29] G. Rizzo, M. van Erp, and R. Troncy. Benchmarking the extraction and disambiguation of named entities on the semantic web. In Proceedings of the 9th International Conference on Language Resources and Evaluation, 2014.

[30] A. Sil and A. Yates. Re-ranking for joint named-entity recognition and linking. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 2369–2374. ACM, 2013.

[31] S. Singh, S. Riedel, B. Martin, J. Zheng, and A. McCallum. Joint inference of entities, relations, and coreference. In Proceedings of the 2013 Workshop on Automated Knowledge Base Construction, pages 1–6. ACM, 2013.

[32] V. I. Spitkovsky and A. X. Chang. A cross-lingual dictionary for English Wikipedia concepts. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey, May 2012.

[33] N. Steinmetz and H. Sack. Semantic multimedia information retrieval based on contextual descriptions. In The Semantic Web: Semantics and Big Data, pages 382–396. Springer, 2013.

[34] Y. Sun, L. Lin, D. Tang, N. Yang, Z. Ji, and X. Wang. Modeling mention, context and entity with neural networks for entity disambiguation.

[35] C. Sutton and A. McCallum. Piecewise training for structured prediction. Machine Learning, 77(2-3):165–194, 2009.

[36] R. Usbeck, A.-C. N. Ngomo, M. Roder, D. Gerber, S. A. Coelho, S. Auer, and A. Both. Agdistis - graph-based disambiguation of named entities using linked data. In The Semantic Web – ISWC 2014, pages 457–471. Springer, 2014.

[37] R. Usbeck, M. Roder, A.-C. Ngonga Ngomo, C. Baron, A. Both, M. Brummer, D. Ceccarelli, M. Cornolti, D. Cherix, B. Eickmann, et al. Gerbil: General entity annotator benchmarking framework. In Proceedings of the 24th International Conference on World Wide Web, pages 1133–1143. International World Wide Web Conferences Steering Committee, 2015.

[38] S. V. N. Vishwanathan, N. N. Schraudolph, M. W. Schmidt, and K. P. Murphy. Accelerated training of conditional random fields with stochastic gradient methods. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 969–976, New York, NY, USA, 2006. ACM.

[39] J. Yedidia, W. Freeman, and Y. Weiss. Generalized belief propagation. In Advances in Neural Information Processing Systems (NIPS), volume 13, pages 689–695, Dec. 2000.

[40] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems (TOIS), 22(2):179–214, 2004.

