
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), pages 97–109, Melbourne, Australia, July 15–20, 2018. ©2018 Association for Computational Linguistics


Hierarchical Losses and New Resources for Fine-grained Entity Typing and Linking

Shikhar Murty*, UMass Amherst, [email protected]

Patrick Verga*, UMass Amherst, [email protected]

Luke Vilnis, UMass Amherst, [email protected]

Irena Radovanovic, Chan Zuckerberg Initiative, [email protected]

Andrew McCallum, UMass Amherst, [email protected]

Abstract

Extraction from raw text to a knowledge base of entities and fine-grained types is often cast as prediction into a flat set of entity and type labels, neglecting the rich hierarchies over types and entities contained in curated ontologies. Previous attempts to incorporate hierarchical structure have yielded little benefit and are restricted to shallow ontologies. This paper presents new methods using real and complex bilinear mappings for integrating hierarchical information, yielding substantial improvement over flat predictions in entity linking and fine-grained entity typing, and achieving new state-of-the-art results for end-to-end models on the benchmark FIGER dataset. We also present two new human-annotated datasets containing wide and deep hierarchies which we will release to the community to encourage further research in this direction: MedMentions, a collection of PubMed abstracts in which 246k mentions have been mapped to the massive UMLS ontology; and TypeNet, which aligns Freebase types with the WordNet hierarchy to obtain nearly 2k entity types. In experiments on all three datasets we show substantial gains from hierarchy-aware training.

1 Introduction

Identifying and understanding entities is a central component in knowledge base construction (Roth et al., 2015) and essential for enhancing downstream tasks such as relation extraction (Yaghoobzadeh et al., 2017b), question answering (Das et al., 2017; Welbl et al., 2017) and search (Dalton et al., 2014). This has led to considerable research in automatically identifying entities in text, predicting their types, and linking them to existing structured knowledge sources.

*Equal contribution. Data and code for experiments: https://github.com/MurtyShikhar/Hierarchical-Typing

Current state-of-the-art models encode a textual mention with a neural network and classify the mention as being an instance of a fine-grained type or entity in a knowledge base. Although in many cases the types and their entities are arranged in a hierarchical ontology, most approaches ignore this structure, and previous attempts to incorporate hierarchical information yielded little improvement in performance (Shimaoka et al., 2017). Additionally, existing benchmark entity typing datasets only consider small label sets arranged in very shallow hierarchies. For example, FIGER (Ling and Weld, 2012), the de facto standard fine-grained entity type dataset, contains only 113 types in a hierarchy only two levels deep.

In this paper we investigate models that explicitly integrate hierarchical information into the embedding space of entities and types, using a hierarchy-aware loss on top of a deep neural network classifier over textual mentions. By using this additional information, we learn a richer, more robust representation, gaining statistical efficiency when predicting similar concepts and aiding the classification of rarer types. We first validate our methods on the narrow, shallow type system of FIGER, outperforming state-of-the-art methods not incorporating hand-crafted features and matching those that do.

To evaluate on richer datasets and stimulate further research into hierarchical entity/typing prediction with larger and deeper ontologies, we introduce two new human-annotated datasets. The first is MedMentions, a collection of PubMed abstracts in which 246k concept mentions have been annotated with links to the Unified Medical Language System (UMLS) ontology (Bodenreider, 2004), an order of magnitude more annotations than comparable datasets. UMLS contains over 3.5 million concepts in a hierarchy with average depth 14.4. Interestingly, UMLS does not distinguish between types and entities (an approach we heartily endorse), and the technical details of linking to such a massive ontology lead us to refer to our MedMentions experiments as entity linking. Second, we present TypeNet, a curated mapping from the Freebase type system into the WordNet hierarchy. TypeNet contains over 1900 types with an average depth of 7.8.

In experimental results, we show improvements with a hierarchically-aware training loss on each of the three datasets. In entity-linking MedMentions to UMLS, we observe a 6% relative increase in accuracy over the base model. In experiments on entity-typing from Wikipedia into TypeNet, we show that incorporating the hierarchy of types and including a hierarchical loss provides a dramatic 29% relative increase in MAP. Our models even provide benefits for shallow hierarchies, allowing us to match the state-of-the-art results of Shimaoka et al. (2017) on the FIGER (GOLD) dataset without requiring hand-crafted features.

We will publicly release the TypeNet and MedMentions datasets to the community to encourage further research in truly fine-grained, hierarchical entity typing and linking.

2 New Corpora and Ontologies

2.1 MedMentions

Over the years researchers have constructed many large knowledge bases in the biomedical domain (Apweiler et al., 2004; Davis et al., 2008; Chatr-aryamontri et al., 2017). Many of these knowledge bases are specific to a particular sub-domain encompassing a few particular types such as genes and diseases (Pinero et al., 2017).

UMLS (Bodenreider, 2004) is particularly comprehensive, containing over 3.5 million concepts (UMLS does not distinguish between entities and types), defining their relationships and a curated hierarchical ontology. For example, LETM1 Protein IS-A Calcium Binding Protein IS-A Binding Protein IS-A Protein IS-A Genome Encoded Entity. This fact makes UMLS particularly well suited for methods explicitly exploiting hierarchical structure.

Accurately linking textual biological entity mentions to an existing knowledge base is extremely important, but few richly annotated resources are available. Even when resources do exist, they often contain no more than a few thousand annotated entity mentions, which is insufficient for training state-of-the-art neural network entity linkers. State-of-the-art methods must instead rely on string matching between entity mentions and canonical entity names (Leaman et al., 2013; Wei et al., 2015; Leaman and Lu, 2016). To address this, we constructed MedMentions, a new, large dataset identifying and linking entity mentions in PubMed abstracts to specific UMLS concepts. Professional annotators exhaustively annotated UMLS entity mentions from 3704 PubMed abstracts, resulting in 246,000 linked mention spans. The average depth in the hierarchy of a concept from our annotated set is 14.4 and the maximum depth is 43.

MedMentions contains an order of magnitude more annotations than similar biological entity linking PubMed datasets (Dogan et al., 2014; Wei et al., 2015; Li et al., 2016). Additionally, these datasets contain annotations for only one or two entity types (genes, or chemicals and diseases, etc.). MedMentions instead contains annotations for a wide diversity of entities linking to UMLS. Statistics for several other datasets are in Table 1 and further statistics are in Table 2.

Dataset              Mentions    Unique entities
MedMentions           246,144             25,507
BCV-CDR                28,797              2,356
NCBI Disease            6,892                753
BCII-GN Train           6,252              1,411
NLM Citation GIA        1,205                310

Table 1: Statistics from various biological entity linking data sets from scientific articles. NCBI Disease (Dogan et al., 2014) focuses exclusively on disease entities. BCV-CDR (Li et al., 2016) contains both chemicals and diseases. BCII-GN and NLM (Wei et al., 2015) both contain genes.

Statistic       Train      Dev      Test
#Abstracts      2,964      370       370
#Sentences     28,457    3,497     3,268
#Mentions     199,977   24,026    22,141
#Entities      22,416    5,934     5,521

Table 2: MedMentions statistics.


2.2 TypeNet

TypeNet is a new dataset of hierarchical entity types for extremely fine-grained entity typing. TypeNet was created by manually aligning Freebase types (Bollacker et al., 2008) to noun synsets from the WordNet hierarchy (Fellbaum, 1998), naturally producing a hierarchical type set.

To construct TypeNet, we first consider all Freebase types that were linked to more than 20 entities. This is done to eliminate types that are either very specific or very rare. We also remove all Freebase API types, e.g. the [/freebase, /dataworld, /schema, /atom, /scheme, and /topics] domains.

For each remaining Freebase type, we generate a list of candidate WordNet synsets through a substring match. An expert annotator then attempted to map the Freebase type to one or more synsets in the candidate list with a parent-of, child-of or equivalence link by comparing the definitions of each synset with example entities of the Freebase type. If no match was found, the annotator manually formulated queries for the online WordNet API until an appropriate synset was found. See Table 9 for an example annotation.

Two expert annotators independently aligned each Freebase type before meeting to resolve any conflicts. The annotators were conservative with assigning equivalence links, resulting in a greater number of child-of links. The final dataset contained 13 parent-of, 727 child-of, and 380 equivalence links. Note that some Freebase types have multiple child-of links to WordNet, making TypeNet, like WordNet, a directed acyclic graph. We then took the union of each of our annotated Freebase types, the synset that they linked to, and any ancestors of that synset.

Typeset                   Count   Depth   Gold KB links
CoNLL-YAGO                    4       1   Yes
OntoNotes 5.0                19       1   No
Gillick et al. (2014)        88       3   Yes
FIGER                       112       2   Yes
HYENA                       505       9   No
Freebase                     2k       2   Yes
WordNet                     16k      14   No
TypeNet*                  1,941      14   Yes

Table 3: Statistics from various type sets. TypeNet is the largest type hierarchy with a gold mapping to KB entities. *The entire WordNet could be added to TypeNet, increasing the total size to 17k types.

We also added an additional set of 614 FB → FB links (Table 4). This was done by computing conditional probabilities of Freebase types given other Freebase types from a collection of 5 million randomly chosen Freebase entities. The conditional probability $P(t_2 \mid t_1)$ of a Freebase type $t_2$ given another Freebase type $t_1$ was calculated as $\frac{\#(t_1, t_2)}{\#t_1}$. Links with a conditional probability less than or equal to 0.7 were discarded. The remaining links were manually verified by an expert annotator and valid links were added to the final dataset, preserving acyclicity.
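For illustration, the FB → FB link candidates can be generated with a simple co-occurrence count as sketched below; the entity-to-types mapping and all names are our own assumptions, not the released construction script.

```python
from collections import Counter
from itertools import permutations

def fb_link_candidates(entity_types, threshold=0.7):
    """entity_types: dict mapping each sampled Freebase entity to its set of types.
    Returns candidate (t1, t2) links where P(t2 | t1) = #(t1, t2) / #t1 > threshold."""
    type_count = Counter()
    pair_count = Counter()
    for types in entity_types.values():
        type_count.update(types)
        pair_count.update(permutations(types, 2))   # ordered (t1, t2) co-occurrences
    return [(t1, t2, pair_count[(t1, t2)] / type_count[t1])
            for (t1, t2) in pair_count
            if pair_count[(t1, t2)] / type_count[t1] > threshold]
```

The surviving candidates would then be checked manually, as described above.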

Freebase Types               1,081
WordNet Synsets                860
child-of links                 727
equivalence links              380
parent-of links                 13
Freebase-Freebase links        614

Table 4: Statistics for the final TypeNet dataset. child-of, parent-of, and equivalence links are from Freebase types → WordNet synsets.

3 Model

3.1 Background: Entity Typing and Linking

We define a textual mention m as a sentence with an identified entity. The goal is then to classify m with one or more labels. For example, we could take the sentence m = "Barack Obama is the President of the United States." with the identified entity string Barack Obama. In the task of entity linking, we want to map m to a specific entity in a knowledge base, such as "/m/02mjmr" in Freebase. In mention-level typing, we label m with one or more types from our type system T, such as tm = {president, leader, politician} (Ling and Weld, 2012; Gillick et al., 2014; Shimaoka et al., 2017). In entity-level typing, we instead consider a bag of mentions Be which are all linked to the same entity. We label Be with te, the set of all types expressed in all m ∈ Be (Yao et al., 2013; Neelakantan and Chang, 2015; Verga et al., 2017; Yaghoobzadeh et al., 2017a).

3.2 Mention Encoder

Our model converts each mention m to a d-dimensional vector. This vector is used to classify the type or entity of the mention. The basic model, depicted in Figure 1, concatenates the averaged word embeddings of the mention string with the output of a convolutional neural network (CNN).


Figure 1: Sentence encoder for all our models. The input to the CNN consists of the concatenation of position embeddings with word embeddings. The output of the CNN is concatenated with the mean of mention surface form embeddings, and then passed through a 2-layer MLP.

The word embeddings of the mention string capture global, context-independent semantics, while the CNN encodes a context-dependent representation.

3.2.1 Token Representation

Each sentence is made up of s tokens which are mapped to dw-dimensional word embeddings. Because sentences may contain mentions of more than one entity, we explicitly encode a distinguished mention in the text using position embeddings, which have been shown to be useful in state-of-the-art relation extraction models (dos Santos et al., 2015; Lin et al., 2016) and machine translation (Vaswani et al., 2017). Each word embedding is concatenated with a dp-dimensional learned position embedding encoding the token's relative distance to the target entity. Each token within the distinguished mention span has position 0, tokens to the left have a negative distance in [−s, 0), and tokens to the right of the mention span have a positive distance in (0, s]. We denote the final sequence of token representations as M.

3.2.2 Sentence Representation

The embedded sequence M is then fed into our context encoder. Our context encoder is a single-layer CNN followed by a tanh non-linearity to produce C. The outputs are max pooled across time to get a final context embedding, $m_{\text{CNN}}$:

$$c_i = \tanh\Big(b + \sum_{j=0}^{w} W[j]\, M\big[i - \lfloor \tfrac{w}{2} \rfloor + j\big]\Big)$$

$$m_{\text{CNN}} = \max_{0 \le i \le n - w + 1} c_i$$

Each $W[j] \in \mathbb{R}^{d \times d}$ is a CNN filter, the bias $b \in \mathbb{R}^d$, $M[i] \in \mathbb{R}^d$ is a token representation, and the max is taken pointwise. In all of our experiments we set w = 5.

In addition to the contextually encoded mention, we create a global mention encoding, $m_G$, by averaging the word embeddings of the tokens within the mention span.

The final mention representation $m_F$ is constructed by concatenating $m_{\text{CNN}}$ and $m_G$ and applying a two-layer feed-forward network with tanh non-linearity (see Figure 1):

$$m_F = W_2 \tanh\Big(W_1 \begin{bmatrix} m_G \\ m_{\text{CNN}} \end{bmatrix} + b_1\Big) + b_2$$
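The following is a minimal PyTorch sketch of this encoder. It is our own illustration rather than the authors' released code; the class and argument names, padding scheme, and embedding initialization are assumptions.

```python
import torch
import torch.nn as nn

class MentionEncoder(nn.Module):
    """Sketch of the sentence encoder: CNN over word+position embeddings,
    max pooled over time, concatenated with the mean of the mention's
    word embeddings, then a 2-layer MLP (Figure 1)."""
    def __init__(self, vocab_size, d_word=300, d_pos=25, d=300, max_len=50, width=5):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_word)
        # relative positions in [-max_len, max_len] are shifted to be non-negative
        self.pos_emb = nn.Embedding(2 * max_len + 1, d_pos)
        self.conv = nn.Conv1d(d_word + d_pos, d, kernel_size=width, padding=width // 2)
        self.mlp = nn.Sequential(nn.Linear(d_word + d, d), nn.Tanh(), nn.Linear(d, d))
        self.max_len = max_len

    def forward(self, tokens, positions, mention_mask):
        # tokens, positions: (batch, seq_len); mention_mask: 1 inside the mention span
        w = self.word_emb(tokens)
        p = self.pos_emb(positions + self.max_len)
        x = torch.cat([w, p], dim=-1).transpose(1, 2)        # (batch, d_word+d_pos, seq)
        c = torch.tanh(self.conv(x))                         # context vectors c_i
        m_cnn = c.max(dim=2).values                          # max pool over time
        # global mention encoding: mean of word embeddings inside the span
        span = mention_mask.unsqueeze(-1).float()
        m_g = (w * span).sum(1) / span.sum(1).clamp(min=1)
        return self.mlp(torch.cat([m_g, m_cnn], dim=-1))     # m_F
```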

4 Training

4.1 Mention-Level Typing

Mention-level entity typing is treated as multi-label prediction. Given the sentence vector $m_F$, we compute a score for each type in typeset T as:

$$y_j = t_j^\top m_F$$

where $t_j$ is the embedding for the $j$th type in T and $y_j$ is its corresponding score. The mention is labeled with $t^m$, a binary vector over all types where $t^m_j = 1$ if the $j$th type is in the set of gold types for m and 0 otherwise. We optimize a multi-label binary cross entropy objective:

$$L_{\text{type}}(m) = -\sum_j t^m_j \log y_j + (1 - t^m_j) \log(1 - y_j)$$
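A minimal sketch of this objective in PyTorch follows; it assumes, as is standard, that the dot-product scores are passed through a sigmoid before the binary cross entropy, and the function and variable names are our own.

```python
import torch
import torch.nn.functional as F

def mention_typing_loss(m_F, type_emb, gold_types):
    """Multi-label typing loss.
    m_F: (batch, d) mention vectors; type_emb: (|T|, d) type embeddings;
    gold_types: (batch, |T|) binary vector t^m."""
    logits = m_F @ type_emb.t()                # y_j = t_j . m_F for every type
    return F.binary_cross_entropy_with_logits(logits, gold_types.float())
```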

4.2 Entity-Level Typing

In the absence of mention-level annotations, we instead must rely on distant supervision (Mintz et al., 2009) to noisily label all mentions of entity e with all types belonging to e. This procedure inevitably leads to noise, as not all mentions of an entity express each of its known types. To alleviate this noise, we use multi-instance multi-label learning (MIML) (Surdeanu et al., 2012), which operates over bags rather than mentions. A bag of mentions $B_e = \{m_1, m_2, \ldots, m_n\}$ is the set of all mentions belonging to entity e. The bag is labeled with $t^e$, a binary vector over all types where $t^e_j = 1$ if the $j$th type is in the set of gold types for e and 0 otherwise.

For every entity, we subsample k mentions from its bag of mentions. Each mention is then encoded independently using the model described in Section 3.2, resulting in a bag of vectors. Each of the k sentence vectors $m^i_F$ is used to compute a score for each type in $t^e$:

$$y^i_j = t_j^\top m^i_F$$

where $t_j$ is the embedding for the $j$th type in $t^e$ and $y^i$ is a vector of logits corresponding to the $i$th mention. The final bag predictions are obtained using element-wise LogSumExp pooling across the k logit vectors in the bag to produce entity-level logits y:

$$y = \log \sum_i \exp(y^i)$$

We use these final bag-level predictions to optimize a multi-label binary cross entropy objective:

$$L_{\text{type}}(B_e) = -\sum_j t^e_j \log y_j + (1 - t^e_j) \log(1 - y_j)$$
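The following short PyTorch sketch shows the bag-level objective under the same sigmoid assumption as above; names and shapes are our own illustration.

```python
import torch
import torch.nn.functional as F

def entity_typing_loss(bag_m_F, type_emb, gold_types):
    """MIML entity-level loss with LogSumExp pooling.
    bag_m_F: (k, d) encoded mentions sampled from one entity's bag;
    type_emb: (|T|, d) type embeddings; gold_types: (|T|,) binary vector t^e."""
    mention_logits = bag_m_F @ type_emb.t()               # (k, |T|), y^i_j = t_j . m^i_F
    bag_logits = torch.logsumexp(mention_logits, dim=0)   # pool over the k mentions
    return F.binary_cross_entropy_with_logits(bag_logits, gold_types.float())
```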

4.3 Entity Linking

Entity linking is similar to mention-level entity typing with a single correct class per mention. Because the set of possible entities is in the millions, linking models typically integrate an alias table mapping entity mentions to a set of possible candidate entities. Given a large corpus of entity-linked data, one can compute conditional probabilities from mention strings to entities (Spitkovsky and Chang, 2012). In many scenarios this data is unavailable. However, knowledge bases such as UMLS contain a canonical string name for each of their curated entities. State-of-the-art biological entity linking systems tend to operate on various string edit metrics between the entity mention string and the set of canonical entity strings in the existing structured knowledge base (Leaman et al., 2013; Wei et al., 2015).

For each mention in our dataset, we generate 100 candidate entities $e_c = (e_1, e_2, \ldots, e_{100})$, each with an associated string similarity score $c_{\text{sim}}$. See Appendix A.5.1 for more details on candidate generation. We generate the sentence representation $m_F$ using our encoder and compute a similarity score between $m_F$ and the learned embedding $\mathbf{e}$ of each of the candidate entities. This score and the string cosine similarity $c_{\text{sim}}$ are combined via a learned linear combination to generate our final score. The final prediction at test time, $\hat{e}$, is the maximally similar entity to the mention:

$$\phi(m, e) = \alpha\, \mathbf{e}^\top m_F + \beta\, c_{\text{sim}}(m, e)$$

$$\hat{e} = \operatorname*{argmax}_{e \in e_c} \phi(m, e)$$

We optimize this model by multinomial cross entropy over the set of candidate entities and the correct entity e:

$$L_{\text{link}}(m, e_c) = -\phi(m, e) + \log \sum_{e' \in e_c} \exp \phi(m, e')$$
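A compact PyTorch sketch of this scoring and loss, with alpha and beta as learned scalars; all names are our own and this is an illustration rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinkingScorer(nn.Module):
    """Combines the embedding dot product with the TFIDF string similarity via a
    learned linear combination; training uses softmax cross entropy over candidates."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, m_F, cand_emb, c_sim):
        # m_F: (d,); cand_emb: (100, d) candidate embeddings; c_sim: (100,) string scores
        return self.alpha * (cand_emb @ m_F) + self.beta * c_sim   # phi(m, e) per candidate

def linking_loss(scores, gold_index):
    # multinomial cross entropy over candidates; gold_index marks the correct entity
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([gold_index]))
```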

5 Encoding Hierarchies

Both entity typing and entity linking treat the label space as prediction into a flat set. To explicitly incorporate the structure between types/entities into our training, we add an additional loss. We consider two methods for modeling the hierarchy of the embedding space: real and complex bilinear maps, two state-of-the-art knowledge graph embedding models.

5.1 Hierarchical Structure Models

Bilinear: Our standard bilinear model scores a hypernym link between $(c_1, c_2)$ as:

$$s(c_1, c_2) = c_1^\top A\, c_2$$

where $A \in \mathbb{R}^{d \times d}$ is a learned real-valued non-diagonal matrix and $c_1$ is the child of $c_2$ in the hierarchy. This model is equivalent to RESCAL (Nickel et al., 2011) with a single IS-A relation type. The type embeddings are the same whether used on the left or right side of the relation. We merge this with the base model by using the parameter A as an additional map before type/entity scoring.
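A minimal sketch of the bilinear IS-A score, with the single relation matrix A shared across all links; the class and initialization are our own assumptions.

```python
import torch
import torch.nn as nn

class BilinearISA(nn.Module):
    """RESCAL-style score s(c1, c2) = c1^T A c2 for a single IS-A relation."""
    def __init__(self, d=300):
        super().__init__()
        self.A = nn.Parameter(torch.empty(d, d))
        nn.init.xavier_uniform_(self.A)

    def forward(self, c1, c2):
        # c1, c2: (batch, d) child and parent embeddings
        return ((c1 @ self.A) * c2).sum(dim=-1)
```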

Complex Bilinear: We also experiment with a complex bilinear map based on the ComplEx model (Trouillon et al., 2016), which was shown to have strong performance predicting the hypernym relation in WordNet, suggesting suitability for asymmetric, transitive relations such as those in our type hierarchy. ComplEx uses complex-valued vectors for types and diagonal complex matrices for relations, using Hermitian inner products (taking the complex conjugate of the second argument, equivalent to treating the right-hand-side type embedding as the complex conjugate of the left-hand side), and finally taking the real part of the score[1]. The score of a hypernym link between $(c_1, c_2)$ in the ComplEx model is defined as:

$$
\begin{aligned}
s(c_1, c_2) &= \mathrm{Re}(\langle \mathbf{c}_1, \mathbf{r}_{\text{IS-A}}, \bar{\mathbf{c}}_2 \rangle) = \mathrm{Re}\Big(\sum_k \mathbf{c}_{1k}\, \mathbf{r}_k\, \bar{\mathbf{c}}_{2k}\Big) \\
&= \langle \mathrm{Re}(\mathbf{c}_1), \mathrm{Re}(\mathbf{r}_{\text{IS-A}}), \mathrm{Re}(\mathbf{c}_2) \rangle \\
&\quad + \langle \mathrm{Re}(\mathbf{c}_1), \mathrm{Im}(\mathbf{r}_{\text{IS-A}}), \mathrm{Im}(\mathbf{c}_2) \rangle \\
&\quad + \langle \mathrm{Im}(\mathbf{c}_1), \mathrm{Re}(\mathbf{r}_{\text{IS-A}}), \mathrm{Im}(\mathbf{c}_2) \rangle \\
&\quad - \langle \mathrm{Im}(\mathbf{c}_1), \mathrm{Im}(\mathbf{r}_{\text{IS-A}}), \mathrm{Re}(\mathbf{c}_2) \rangle
\end{aligned}
$$

where $\mathbf{c}_1$, $\mathbf{c}_2$ and $\mathbf{r}_{\text{IS-A}}$ are complex-valued vectors representing $c_1$, $c_2$ and the IS-A relation respectively, $\mathrm{Re}(z)$ is the real component of $z$, and $\mathrm{Im}(z)$ is the imaginary component. As noted in Trouillon et al. (2016), the above function is antisymmetric when $\mathbf{r}_{\text{IS-A}}$ is purely imaginary.

Since entity/type embeddings are complex vectors, in order to combine this with our base model we also need to represent mentions with complex vectors for scoring. To do this, we pass the output of the mention encoder through two different affine transformations to generate a real and an imaginary component:

$$\mathrm{Re}(m_F) = W_{\text{real}}\, m_F + b_{\text{real}}$$
$$\mathrm{Im}(m_F) = W_{\text{img}}\, m_F + b_{\text{img}}$$

where $m_F$ is the output of the mention encoder, and $W_{\text{real}}, W_{\text{img}} \in \mathbb{R}^{d \times d}$ and $b_{\text{real}}, b_{\text{img}} \in \mathbb{R}^d$.
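For illustration, a sketch of the ComplEx IS-A score together with the real/imaginary projection of the mention vector; this is a hedged reimplementation under our own naming, not the authors' code.

```python
import torch
import torch.nn as nn

class ComplexISA(nn.Module):
    """ComplEx-style score Re(<c1, r, conj(c2)>) with a single IS-A relation,
    plus affine maps that lift a real mention vector into the complex space."""
    def __init__(self, d=300):
        super().__init__()
        self.r_re = nn.Parameter(torch.randn(d) * 0.01)   # Re(r_IS-A)
        self.r_im = nn.Parameter(torch.randn(d) * 0.01)   # Im(r_IS-A)
        self.W_real = nn.Linear(d, d)                      # Re(m_F) = W_real m_F + b_real
        self.W_img = nn.Linear(d, d)                       # Im(m_F) = W_img m_F + b_img

    def score(self, c1_re, c1_im, c2_re, c2_im):
        # Re(<c1, r, conj(c2)>) expanded into its four real-valued terms
        return ((c1_re * self.r_re * c2_re)
                + (c1_re * self.r_im * c2_im)
                + (c1_im * self.r_re * c2_im)
                - (c1_im * self.r_im * c2_re)).sum(dim=-1)

    def mention_score(self, m_F, type_re, type_im):
        # project the real mention vector to complex parts, then score against a type
        return self.score(self.W_real(m_F), self.W_img(m_F), type_re, type_im)
```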

5.2 Training with Hierarchies

Learning a hierarchy is analogous to learning embeddings for nodes of a knowledge graph with a single hypernym/IS-A relation. To train these embeddings, we sample $(c_1, c_2)$ pairs, where each pair is a positive link in our hierarchy. For each positive link, we sample a set N of n negative links. We encourage the model to output high scores for positive links, and low scores for negative links, via a binary cross entropy (BCE) loss:

$$L_{\text{struct}} = -\Big(\log \sigma(s(c_{1i}, c_{2i})) + \sum_{N} \log\big(1 - \sigma(s(c_{1i}, c'_{2i}))\big)\Big)$$

$$L = L_{\text{type/link}} + \gamma\, L_{\text{struct}}$$

where $s(c_1, c_2)$ is the score of a link $(c_1, c_2)$ and $\sigma(\cdot)$ is the logistic sigmoid. The weighting parameter $\gamma$ is chosen from $\{0.1, 0.5, 0.8, 1.0, 2.0, 4.0\}$. The final loss function that we optimize is L.

[1] This step makes the scoring function technically not bilinear, as it commutes with addition but not complex multiplication, but we term it bilinear for ease of exposition.
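A sketch of the negative-sampling BCE structure loss and the combined objective; the scorer can be either of the bilinear models sketched above, and the negative sampling details and names are our own assumptions.

```python
import torch
import torch.nn.functional as F

def structure_loss(score_fn, c1, c2_pos, c2_negs):
    """BCE over one positive hypernym link and n sampled negative links.
    c1, c2_pos: (d,) child and parent embeddings; c2_negs: (n, d) corrupted parents."""
    pos = score_fn(c1.unsqueeze(0), c2_pos.unsqueeze(0))   # s(c1, c2), shape (1,)
    neg = score_fn(c1.expand_as(c2_negs), c2_negs)         # s(c1, c2') for each negative
    # -log sigma(pos) - sum_N log(1 - sigma(neg))
    return (F.binary_cross_entropy_with_logits(pos, torch.ones_like(pos), reduction='sum')
            + F.binary_cross_entropy_with_logits(neg, torch.zeros_like(neg), reduction='sum'))

# Combined objective, weighted by gamma as in the paper:
# total_loss = typing_or_linking_loss + gamma * structure_loss(scorer, c1, c2_pos, c2_negs)
```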

6 Experiments

We perform three sets of experiments: mention-level entity typing on the benchmark dataset FIGER, entity-level typing using Wikipedia and TypeNet, and entity linking using MedMentions.

6.1 Models

CNN: Each mention is encoded using the model described in Section 3.2. The resulting embedding is used for classification into a flat set of labels. Specific implementation details can be found in Appendix A.2.

CNN+Complex: The CNN+Complex model is equivalent to the CNN model but uses complex embeddings and Hermitian dot products.

Transitive: This model does not add an additional hierarchical loss to the training objective (unless otherwise stated). We add additional labels to each entity corresponding to the transitive closure, i.e., the union of all ancestors of its known types. This provides a rich additional learning signal that greatly improves classification of specific types.

Hierarchy: These models add an explicit hierarchical loss to the training objective, as described in Section 5, using either complex or real-valued bilinear mappings, and the associated parameter sharing.

6.2 Mention-Level Typing in FIGER

To evaluate the efficacy of our methods, we first compare against the current state-of-the-art models of Shimaoka et al. (2017). The most widely used type system for fine-grained entity typing is FIGER, which consists of 113 types organized in a 2-level hierarchy. For training, we use the publicly available W2M data (Ren et al., 2016) and optimize the mention typing loss function defined in Section 4.1, with the additional hierarchical loss where specified. For evaluation, we use the manually annotated FIGER (GOLD) data by Ling and Weld (2012). See Appendix A.2 and A.3 for specific implementation details.

6.2.1 Results

Model                        Acc    Macro F1   Micro F1
Ling and Weld (2012)         47.4   69.2       65.5
Shimaoka et al. (2017)†      55.6   75.1       71.7
Gupta et al. (2017)†         57.7   72.8       72.1
Shimaoka et al. (2017)‡      59.6   78.9       75.3
CNN                          57.0   75.0       72.2
  + hierarchy                58.4   76.3       73.6
CNN+Complex                  57.2   75.3       72.9
  + hierarchy                59.7   78.3       75.4

Table 5: Accuracy and Macro/Micro F1 on FIGER (GOLD). † is an LSTM model. ‡ is an attentive LSTM along with additional hand-crafted features.

In Table 5 we see that our base CNN models (CNN and CNN+Complex) match the LSTM models of Shimaoka et al. (2017) and Gupta et al. (2017), the previous state of the art for models without hand-crafted features. When incorporating structure into our models, we gain 2.5 points of accuracy in our CNN+Complex model, matching the overall state-of-the-art attentive LSTM that relied on hand-crafted features from syntactic parses, topic models, and character n-grams. The structure can help our model predict lower-frequency types, a role similar to that played by hand-crafted features.

6.3 Entity-Level Typing in TypeNet

Next we evaluate our models on entity-level typing in TypeNet using Wikipedia. For each entity, we follow the procedure outlined in Section 4.2. We predict labels for each instance in the entity's bag and aggregate them into entity-level predictions using LogSumExp pooling. Each type is assigned a predicted score by the model. We then rank these scores and calculate average precision for each of the types in the test set, and use these scores to calculate mean average precision (MAP). We evaluate using MAP instead of accuracy, which is standard in large knowledge base link prediction tasks (Verga et al., 2017; Trouillon et al., 2016). These scores are calculated only over Freebase types, which tend to be lower in the hierarchy. This avoids artificial score inflation caused by trivial predictions such as 'entity'. See Appendix A.4 for more implementation details.
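As a point of reference, MAP over types can be computed as below with scikit-learn's average_precision_score; this is our own sketch of the evaluation, not the released evaluation script.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def typing_map(scores, gold):
    """scores, gold: (num_entities, num_types) arrays of predicted scores and binary
    gold labels, restricted to Freebase types. Returns mean average precision,
    averaging AP over types that have at least one positive entity."""
    aps = [average_precision_score(gold[:, t], scores[:, t])
           for t in range(gold.shape[1]) if gold[:, t].any()]
    return float(np.mean(aps))
```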

6.3.1 Results

Model                           Low Data   Full Data
CNN                             51.72      68.15
  + hierarchy                   54.82      75.56
  + transitive                  57.68      77.21
  + hierarchy + transitive      58.74      78.59
CNN+Complex                     50.51      69.83
  + hierarchy                   55.30      72.86
  + transitive                  53.71      72.18
  + hierarchy + transitive      58.81      77.21

Table 6: MAP of entity-level typing in Wikipedia data using TypeNet. The second column shows results using 5% of the total data. The last column shows results using the full set of 344,246 entities.

Model               original   normalized
mention tfidf       61.09      74.66
CNN                 67.42      82.40
  + hierarchy       67.73      82.77
CNN+Complex         67.23      82.17
  + hierarchy       68.34      83.52

Table 7: Accuracy on entity linking in MedMentions. Maximum recall is 81.82% because we use an imperfect alias table to generate candidates. Normalized scores consider only mentions which contain the gold entity in the candidate set. Mention tfidf is csim from Section 4.3.

Table 6 shows the results for entity-level typing on our Wikipedia TypeNet dataset. We see that both the basic CNN and the CNN+Complex models perform similarly, with the CNN+Complex model doing slightly better in the full data regime. We also see that both models get an improvement when adding an explicit hierarchy loss, even before adding in the transitive closure. The transitive closure itself gives an additional increase in performance for both models. In both of these cases, the basic CNN model improves by a greater amount than CNN+Complex. This could be a result of the complex embeddings being more difficult to optimize and therefore more susceptible to variations in hyperparameters. When adding in both the transitive closure and the explicit hierarchy loss, the performance improves further. We observe similar trends when training our models in a lower data regime with ~150,000 examples, or about 5% of the total data.

In all cases, we note that the baseline models that do not incorporate any hierarchical information (neither the transitive closure nor the hierarchy loss) perform ~9 MAP worse, demonstrating the benefits of incorporating structure information.

6.4 MedMentions Entity Linking with UMLS

In addition to entity typing, we evaluate our model's performance on an entity linking task using MedMentions, our new PubMed/UMLS dataset described in Section 2.1.

6.4.1 Results

Example 1: "Tips and Pitfalls in Direct Ligation of Large Spontaneous Splenorenal Shunt during Liver Transplantation Patients with large spontaneous splenorenal shunt ..."
  baseline: Direct [Direct → General Modifier → Qualifier → Property or Attribute]
  +hierarchy: Ligature (correct) [Ligature → Surgical Procedures → medical treatment approach]

Example 2: "A novel approach for selective chemical functionalization and localized assembly of one-dimensional nanostructures."
  baseline: Structure [Structure → order or structure → general epistemology]
  +hierarchy: Nanomaterials (correct) [Nanomaterials → Nanoparticle Complex → Drug or Chemical by Structure]

Example 3: "Gcn5 is recruited onto the il-2 promoter by interacting with the NFAT in T cells upon TCR stimulation."
  baseline: Interleukin-27 [Interleukin-27 → IL2 → Interleukin Gene]
  +hierarchy: IL2 Gene (correct) [IL2 Gene → Interleukin Gene]

Table 8: Example predictions from MedMentions. Each example shows the sentence containing the entity mention span. "baseline" shows the predicted entity and its ancestors for a model not incorporating structure; "+hierarchy" shows the prediction and ancestors for a model which explicitly incorporates the hierarchical structure information.

Table 7 shows results for baselines and our proposed variant with the additional hierarchical loss. None of these models incorporate transitive closure information, due to the difficulty of incorporating it in our candidate generation, which we leave to future work. The Normalized metric considers performance only on mentions with an alias table hit; all models have 0 accuracy for mentions otherwise. We also report the overall score for comparison in future work with improved candidate generation. We see that incorporating structure information results in a 1.1% reduction in absolute error, corresponding to a ~6% reduction in relative error on this large-scale dataset.

Table 8 shows qualitative predictions for models with and without hierarchy information incorporated. Each example contains the sentence (with the target entity mention), predictions from the baseline and hierarchy-aware models, and the ancestors of the predicted entity. In the first and second examples, the baseline model becomes extremely dependent on TFIDF string similarities when the gold candidate is rare (≤ 10 occurrences). This shows that modeling the structure of the entity hierarchy helps the model disambiguate rare entities. In the third example, structure helps the model understand the hierarchical nature of the labels and prevents it from predicting an entity that is overly specific (e.g. predicting Interleukin-27 rather than the correct and more general entity IL2 Gene).

Note that, in contrast with the previous tasks, the complex hierarchical loss provides a significant boost, while the real-valued bilinear model does not. A possible explanation is that UMLS is a far larger and deeper ontology than even TypeNet, and the additional ability of complex embeddings to model intricate graph structure is key to realizing gains from hierarchical modeling.

7 Related Work

In directly linking a large set of mentions and typing a large set of entities with respect to a new ontology and corpus, and in incorporating structural learning between the many entities and types in our ontologies of interest, our work draws on many different but complementary threads of research in information extraction, knowledge base population, and knowledge base completion.

Our structural, hierarchy-aware loss between types and entities draws on research in knowledge base inference such as Jain et al. (2018), Trouillon et al. (2016) and Nickel et al. (2011). Combining KB completion with hierarchical structure in knowledge bases has been explored in (Dalvi et al., 2015; Xie et al., 2016). Recently, Wu et al. (2017) proposed a hierarchical loss for text classification.

Linking mentions to a flat set of entities, often in Freebase or Wikipedia, is a long-standing task in NLP (Bunescu and Pasca, 2006; Cucerzan, 2007; Durrett and Klein, 2014; Francis-Landau et al., 2016). Typing of mentions at varying levels of granularity, from CoNLL-style named entity recognition (Tjong Kim Sang and De Meulder, 2003) to the more fine-grained recent approaches (Ling and Weld, 2012; Gillick et al., 2014; Shimaoka et al., 2017), is also related to our task. A few prior attempts to incorporate a very shallow hierarchy into fine-grained entity typing have not led to significant or consistent improvements (Gillick et al., 2014; Shimaoka et al., 2017).

The knowledge base YAGO (Suchanek et al., 2007) includes integration with WordNet, and type hierarchies have been derived from its type system (Yosef et al., 2012). Del Corro et al. (2015) use manually crafted rules and patterns (Hearst patterns (Hearst, 1992), appositives, etc.) to automatically match entity types to WordNet synsets.

Recent work has moved towards unifying these two highly related tasks, improving entity linking by simultaneously learning a fine-grained entity type predictor (Gupta et al., 2017). Learning hierarchical structures or transitive relations between concepts has been the subject of much recent work (Vilnis and McCallum, 2015; Vendrov et al., 2016; Nickel and Kiela, 2017).

We draw inspiration from all of this prior work, and contribute datasets and models to address previous challenges in jointly modeling the structure of large-scale hierarchical ontologies and mapping textual mentions into an extremely fine-grained space of entities and types.

8 Conclusion

We demonstrate that explicitly incorporating and modeling hierarchical information leads to increased performance in experiments on entity typing and linking across three challenging datasets. Additionally, we introduce two new human-annotated datasets: MedMentions, a corpus of 246k mentions from PubMed abstracts linked to the UMLS knowledge base, and TypeNet, a new hierarchical fine-grained entity typeset an order of magnitude larger and deeper than previous datasets.

While this work already demonstrates considerable improvement over non-hierarchical modeling, future work will explore techniques such as box embeddings (Vilnis et al., 2018) and Poincaré embeddings (Nickel and Kiela, 2017) to represent the hierarchical embedding space, as well as methods to improve recall in the candidate generation process for entity linking. Most of all, we are excited to see new techniques from the NLP community using the resources we have presented.

9 Acknowledgements

We thank Nicholas Monath, Haw-Shiuan Chang and Emma Strubell for helpful comments on early drafts of the paper. Creation of the MedMentions corpus is supported and managed by the Meta team at the Chan Zuckerberg Initiative. A pre-release of the dataset is available at http://github.com/chanzuckerberg/MedMentions. This work was supported in part by the Center for Intelligent Information Retrieval and the Center for Data Science, in part by the Chan Zuckerberg Initiative under the project Scientific Knowledge Base Construction, and in part by the National Science Foundation under Grant No. IIS-1514053. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

References

Rolf Apweiler, Amos Bairoch, Cathy H Wu, Winona C Barker, Brigitte Boeckmann, Serenella Ferro, Elisabeth Gasteiger, Hongzhan Huang, Rodrigo Lopez, Michele Magrane, et al. 2004. UniProt: the universal protein knowledgebase. Nucleic Acids Research, 32(suppl 1):D115–D119.

Olivier Bodenreider. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(suppl 1):D267–D270.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250. ACM.

Razvan C Bunescu and Marius Pasca. 2006. Using encyclopedic knowledge for named entity disambiguation. In EACL, volume 6, pages 9–16.

Andrew Chatr-aryamontri, Rose Oughtred, Lorrie Boucher, Jennifer Rust, Christie Chang, Nadine K Kolas, Lara O'Donnell, Sara Oster, Chandra Theesfeld, Adnane Sellam, et al. 2017. The BioGRID interaction database: 2017 update. Nucleic Acids Research, 45(D1):D369–D379.

Silviu Cucerzan. 2007. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

Jeffrey Dalton, Laura Dietz, and James Allan. 2014. Entity query feature expansion using knowledge base links. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 365–374. ACM.

Bhavana Dalvi, Einat Minkov, Partha P Talukdar, and William W Cohen. 2015. Automatic gloss finding for a knowledge base using ontological constraints. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 369–378. ACM.

Rajarshi Das, Manzil Zaheer, Siva Reddy, and Andrew McCallum. 2017. Question answering on knowledge bases and text using universal schema and memory networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 358–365, Vancouver, Canada. Association for Computational Linguistics.

Allan Peter Davis, Cynthia G Murphy, Cynthia A Saraceni-Richards, Michael C Rosenstein, Thomas C Wiegers, and Carolyn J Mattingly. 2008. Comparative Toxicogenomics Database: a knowledgebase and discovery tool for chemical–gene–disease networks. Nucleic Acids Research, 37(suppl 1):D786–D792.

Luciano Del Corro, Abdalghani Abujabal, Rainer Gemulla, and Gerhard Weikum. 2015. FINET: Context-aware fine-grained named entity typing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Rezarta Islamaj Dogan, Robert Leaman, and Zhiyong Lu. 2014. NCBI disease corpus: a resource for disease name recognition and concept normalization. Journal of Biomedical Informatics, 47:1–10.

Greg Durrett and Dan Klein. 2014. A joint model for entity analysis: Coreference, typing, and linking. Transactions of the Association for Computational Linguistics, 2:477–490.

Christiane Fellbaum. 1998. WordNet. Wiley Online Library.

Matthew Francis-Landau, Greg Durrett, and Dan Klein. 2016. Capturing semantic similarity for entity linking with convolutional neural networks. In Proceedings of NAACL-HLT, pages 1256–1261.

Dan Gillick, Nevena Lazic, Kuzman Ganchev, Jesse Kirchner, and David Huynh. 2014. Context-dependent fine-grained entity type tagging. CoRR, abs/1412.1820.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).

Nitish Gupta, Sameer Singh, and Dan Roth. 2017. Entity linking via joint encoding of types, descriptions, and context. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2671–2680, Copenhagen, Denmark. Association for Computational Linguistics.

Marti A Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the International Conference on Computational Linguistics (COLING).

Prachi Jain, Shikhar Murty, Mausam, and Soumen Chakrabarti. 2018. Mitigating the effect of out-of-vocabulary entity pairs in matrix factorization for knowledge base inference. In The 27th International Joint Conference on Artificial Intelligence (IJCAI), Stockholm, Sweden.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Robert Leaman, Rezarta Islamaj Dogan, and Zhiyong Lu. 2013. DNorm: disease name normalization with pairwise learning to rank. Bioinformatics, 29(22):2909–2917.

Robert Leaman and Zhiyong Lu. 2016. TaggerOne: joint named entity recognition and normalization with semi-Markov models. Bioinformatics, 32(18):2839–2846.

Jiao Li, Yueping Sun, Robin J Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J Mattingly, Thomas C Wiegers, and Zhiyong Lu. 2016. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database, 2016.

Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2124–2133, Berlin, Germany. Association for Computational Linguistics.

Xiao Ling and Daniel S Weld. 2012. Fine-grained entity recognition. In Twenty-Sixth AAAI Conference on Artificial Intelligence.

Edward Loper and Steven Bird. 2002. NLTK: The Natural Language Toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, pages 63–70. Association for Computational Linguistics.

Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1003–1011, Suntec, Singapore. Association for Computational Linguistics.

Arvind Neelakantan and Ming-Wei Chang. 2015. Inferring missing entity type instances for knowledge base completion: New dataset and methods. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 515–525, Denver, Colorado. Association for Computational Linguistics.

Maximilian Nickel and Douwe Kiela. 2017. Poincaré embeddings for learning hierarchical representations. arXiv preprint arXiv:1705.08039.

Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In Proceedings of the International Conference on Machine Learning (ICML).


Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Janet Pinero, Alex Bravo, Nuria Queralt-Rosinach, Alba Gutierrez-Sacristan, Jordi Deu-Pons, Emilio Centeno, Javier Garcia-Garcia, Ferran Sanz, and Laura I Furlong. 2017. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Research, 45(D1):D833–D839.

Xiang Ren, Wenqi He, Meng Qu, Clare R. Voss, Heng Ji, and Jiawei Han. 2016. Label noise reduction in entity typing by heterogeneous partial-label embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 1825–1834.

Benjamin Roth, Nicholas Monath, David Belanger, Emma Strubell, Patrick Verga, and Andrew McCallum. 2015. Building knowledge bases with universal schema: Cold start and slot-filling approaches.

Cicero Nogueira dos Santos, Bing Xiang, and Bowen Zhou. 2015. Classifying relations by ranking with convolutional neural networks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL).

Sonse Shimaoka, Pontus Stenetorp, Kentaro Inui, and Sebastian Riedel. 2017. Neural architectures for fine-grained entity type classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1271–1280, Valencia, Spain. Association for Computational Linguistics.

Valentin I Spitkovsky and Angel X Chang. 2012. A cross-lingual dictionary for English Wikipedia concepts.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: a core of semantic knowledge. In Proceedings of the International Conference on World Wide Web (WWW).

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 455–465. Association for Computational Linguistics.

Erik F Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pages 142–147. Association for Computational Linguistics.

Theo Trouillon, Johannes Welbl, Sebastian Riedel, Eric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In Proceedings of the International Conference on Machine Learning (ICML).

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Conference on Advances in Neural Information Processing Systems (NIPS).

Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. 2016. Order-embeddings of images and language. ICLR.

Patrick Verga, Arvind Neelakantan, and Andrew McCallum. 2017. Generalizing to unseen entities and entity pairs with row-less universal schema. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 613–622, Valencia, Spain. Association for Computational Linguistics.

Luke Vilnis, Xiang Li, Shikhar Murty, and Andrew McCallum. 2018. Probabilistic embedding of knowledge graphs with box lattice measures. In The 56th Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia.

Luke Vilnis and Andrew McCallum. 2015. Word representations via Gaussian embedding. ICLR.

Chih-Hsuan Wei, Hung-Yu Kao, and Zhiyong Lu. 2015. GNormPlus: an integrative approach for tagging genes, gene families, and protein domains. BioMed Research International, 2015.

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2017. Constructing datasets for multi-hop reading comprehension across documents. arXiv preprint arXiv:1710.06481.

Cinna Wu, Mark Tygert, and Yann LeCun. 2017. Hierarchical loss for classification. CoRR, abs/1709.01062.

Ruobing Xie, Zhiyuan Liu, and Maosong Sun. 2016. Representation learning of knowledge graphs with hierarchical types. In IJCAI, pages 2965–2971.

Yadollah Yaghoobzadeh, Heike Adel, and Hinrich Schutze. 2017a. Corpus-level fine-grained entity typing. arXiv preprint arXiv:1708.02275.


Yadollah Yaghoobzadeh, Heike Adel, and Hinrich Schutze. 2017b. Noise mitigation for neural entity typing and relation extraction. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1183–1194, Valencia, Spain. Association for Computational Linguistics.

Limin Yao, Sebastian Riedel, and Andrew McCallum. 2013. Universal schema for entity type prediction. In Proceedings of the 2013 Workshop on Automated Knowledge Base Construction, pages 79–84. ACM.

Mohamed Amir Yosef, Sandro Bauer, Johannes Hoffart, Marc Spaniol, and Gerhard Weikum. 2012. HYENA: Hierarchical type classification for entity names. In Proceedings of the International Conference on Computational Linguistics (COLING).


A Supplementary Materials

A.1 TypeNet Construction

Freebase type: musical chord
Example entities: psalms chord, power chord, harmonic seventh chord
Candidate synsets:
  chord.n.01: a straight line connecting two points on a curve
  chord.n.02: a combination of three or more notes that blend harmoniously when sounded together
  musical.n.01: a play or film whose action and dialogue is interspersed with singing and dancing

Table 9: Example given to TypeNet annotators. Here, the Freebase type to be linked is musical chord. This type is annotated in Freebase as belonging to the entities psalms chord, harmonic seventh chord, and power chord. Below the list of example entities are candidate WordNet synsets obtained by substring matching between the Freebase type and all WordNet synsets. The correctly aligned synset is chord.n.02.

A.2 Model Implementation Details

For all of our experiments, we use pretrained 300-dimensional word vectors from Pennington et al. (2014). These embeddings are fixed during training. The type vectors and entity vectors are all 300-dimensional vectors initialized using Glorot initialization (Glorot and Bengio, 2010). The number of negative links for hierarchical training is n ∈ {16, 32, 64, 128, 256}.

For regularization, we use dropout (Srivastava et al., 2014) with p ∈ {0.5, 0.75, 0.8} on the sentence encoder output and L2-regularize all learned parameters with λ ∈ {1e-5, 5e-5, 1e-4}. All our parameters are optimized using Adam (Kingma and Ba, 2014) with a learning rate of 0.001. We tune our hyper-parameters via grid search and early stopping on the development set.

A.3 FIGER Implementation Details

To train our models, we use the mention typing loss function defined in Section 4.1. For models with structure training, we additionally add in the hierarchical loss, along with a weight that is obtained by tuning on the dev set. We follow the same inference-time procedure as Shimaoka et al. (2017): for each mention, we first assign the type with the largest probability according to the logits, and then assign additional types on the condition that their corresponding probability is greater than 0.5.
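A small sketch of this inference rule, as our own illustration of the thresholding described above:

```python
import torch

def figer_decode(logits, threshold=0.5):
    """Assign the highest-probability type, plus any type whose probability
    exceeds the threshold. logits: (|T|,) tensor of type scores."""
    probs = torch.sigmoid(logits)
    predicted = {int(probs.argmax())}
    predicted |= {i for i, p in enumerate(probs.tolist()) if p > threshold}
    return sorted(predicted)
```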

A.4 Wikipedia Data and Implementation Details

At train time, each training example randomly samples an entity bag of 10 mentions. At test time we classify bags of 20 mentions of an entity. The dataset contains a total of 344,246 entities mapped to the 1081 Freebase types from TypeNet. We consider all sentences in Wikipedia between 10 and 50 tokens long. Tokenization and sentence splitting were performed using NLTK (Loper and Bird, 2002). From these sentences, we considered all entities annotated with a cross-link in Wikipedia that we could link to Freebase and assign types in TypeNet. We then split the data by entities into a 90-5-5 train, dev, test split.

A.5 UMLS Implementation Details

We pre-process each string by lowercasing and removing stop words. We consider ngrams of sizes 1 to 5 and keep the top 100,000 features, and the final vectors are L2 normalized. In our experiments, for each mention we consider the top 100 most similar entities as the candidate set.

A.5.1 Candidate Generation Details

Each mention and each canonical entity string in UMLS are mapped to TFIDF character ngram vectors. We pre-process each string by lowercasing and removing stop words. We consider ngrams of sizes 1 to 5, keep the top 100,000 features, and L2-normalize the final vectors. For each mention, we calculate the cosine similarity, csim, between the mention string and each canonical entity string. In our experiments we consider the top 100 most similar entities as the candidate set.
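A sketch of this candidate generation step using scikit-learn's character-ngram TfidfVectorizer; this is our own illustration, and the exact preprocessing (e.g. stop word handling) of the released data may differ.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def build_candidate_generator(canonical_names, top_k=100):
    """Fit a char-ngram TFIDF model over UMLS canonical entity names and return a
    function mapping a mention string to its top_k candidates with csim scores."""
    vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 5),
                          max_features=100_000, lowercase=True)
    name_matrix = vec.fit_transform(canonical_names)   # rows are L2-normalized by default

    def candidates(mention):
        sims = (vec.transform([mention]) @ name_matrix.T).toarray().ravel()  # cosine sim
        top = np.argsort(-sims)[:top_k]
        return [(canonical_names[i], float(sims[i])) for i in top]

    return candidates
```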

