
SenseBERT: Driving Some Sense into BERT

Yoav Levine  Barak Lenz  Or Dagan  Ori Ram  Dan Padnos  Or Sharir  Shai Shalev-Shwartz  Amnon Shashua  Yoav Shoham

AI21 Labs, Tel Aviv, Israel

{yoavl,barakl,ord,orir,...}@ai21.com

Abstract

The ability to learn from large unlabeled corpora has allowed neural language models to advance the frontier in natural language understanding. However, existing self-supervision techniques operate at the word form level, which serves as a surrogate for the underlying semantic content. This paper proposes a method to employ weak-supervision directly at the word sense level. Our model, named SenseBERT, is pre-trained to predict not only the masked words but also their WordNet supersenses. Accordingly, we attain a lexical-semantic level language model, without the use of human annotation. SenseBERT achieves significantly improved lexical understanding, as we demonstrate by experimenting on SemEval Word Sense Disambiguation, and by attaining a state of the art result on the 'Word in Context' task.

1 Introduction

Neural language models have recently undergone a qualitative leap forward, pushing the state of the art on various NLP tasks. Together with advances in network architecture (Vaswani et al., 2017), the use of self-supervision has proven to be central to these achievements, as it allows the network to learn from massive amounts of unannotated text.

The self-supervision strategy employed in BERT (Devlin et al., 2019) involves masking some of the words in an input sentence, and then training the model to predict them given their context. Other proposed approaches for self-supervised objectives, including unidirectional (Radford et al., 2019), permutational (Yang et al., 2019), or word insertion-based (Chan et al., 2019) methods, operate similarly, over words. However, since a given word form can possess multiple meanings (e.g., the word 'bass' can refer to a fish, a guitar, a type of singer, etc.), the word itself is merely a surrogate of its actual meaning in a given context, referred to as its sense. Indeed, the word-form level is viewed as a surface level which often introduces challenging ambiguity (Navigli, 2009).

In this paper, we bring forth a novel methodology for applying weak-supervision directly on the level of a word's meaning. By infusing word-sense information into BERT's pre-training signal, we explicitly expose the model to lexical semantics when learning from a large unannotated corpus. We call the resultant sense-informed model SenseBERT. Specifically, we add a masked-word sense prediction task as an auxiliary task in BERT's pre-training. Thereby, jointly with the standard word-form level language model, we train a semantic-level language model that predicts the missing word's meaning. Our method does not require sense-annotated data; self-supervised learning from unannotated text is facilitated by using WordNet (Miller, 1998), an expert constructed inventory of word senses, as weak supervision.

We focus on a coarse-grained variant of a word's sense, referred to as its WordNet supersense, in order to mitigate an identified brittleness of fine-grained word-sense systems, caused by arbitrary sense granularity, blurriness, and general subjectiveness (Kilgarriff, 1997; Schneider, 2014). WordNet lexicographers organize all word senses into 45 supersense categories, 26 of which are for nouns, 15 for verbs, 3 for adjectives and 1 for adverbs (see full supersense table in the supplementary materials). Disambiguating a word's supersense has been widely studied as a fundamental lexical categorization task (Ciaramita and Johnson, 2003; Basile, 2012; Schneider and Smith, 2015).

We employ the masked word's allowed supersenses list from WordNet as a set of possible labels for the sense prediction task. The labeling of words with a single supersense (e.g., 'sword' has only the supersense noun.artifact) is straightforward: we train the network to predict this supersense given the masked word's context.

As for words with multiple supersenses (e.g., 'bass' can be: noun.food, noun.animal, noun.artifact, noun.person, etc.), we train the model to predict any of these senses, leading to a simple yet effective soft-labeling scheme.

We show that SenseBERT_BASE outscores both BERT_BASE and BERT_LARGE by a large margin on a supersense variant of the SemEval Word Sense Disambiguation (WSD) data set standardized in Raganato et al. (2017). Notably, SenseBERT receives competitive results on this task without fine-tuning, i.e., when training a linear classifier over the pretrained embeddings, which serves as a testament to its self-acquisition of lexical semantics. Furthermore, we show that SenseBERT_BASE surpasses BERT_LARGE in the Word in Context (WiC) task (Pilehvar and Camacho-Collados, 2019) from the SuperGLUE benchmark (Wang et al., 2019), which directly depends on word-supersense awareness. A single SenseBERT_LARGE model achieves state of the art performance on WiC with a score of 72.14, improving the score of BERT_LARGE by 2.5 points.

2 Related Work

Neural network based word embeddings first appeared as a static mapping (non-contextualized), where every word is represented by a constant pretrained embedding (Mikolov et al., 2013; Pennington et al., 2014). Such embeddings were shown to contain some amount of word-sense information (Iacobacci et al., 2016; Yuan et al., 2016; Arora et al., 2018; Le et al., 2018). Additionally, sense embeddings computed for each word sense in the word-sense inventory (e.g., WordNet) have been employed, relying on hypernymity relations (Rothe and Schutze, 2015) or the gloss for each sense (Chen et al., 2014). These approaches rely on static word embeddings and require a large amount of annotated data per word sense.

The introduction of contextualized word embeddings (Peters et al., 2018), for which a given word's embedding is context-dependent rather than precomputed, has brought forth a promising prospect for sense-aware word embeddings. Indeed, visualizations in Reif et al. (2019) show that sense-sensitive clusters form in BERT's word embedding space. Nevertheless, we identify a clear gap in this ability. We show that a vanilla BERT model trained with the current word-level self-supervision, burdened with the implicit task of disambiguating word meanings, often fails to grasp lexical semantics, exhibiting high supersense misclassification rates. Our suggested weakly-supervised word-sense signal allows SenseBERT to significantly bridge this gap.

Moreover, SenseBERT exhibits an improvement in lexical semantics ability (reflected by the Word in Context task score) even when compared to models with WordNet infused linguistic knowledge. Specifically, we compare to Peters et al. (2019), who re-contextualize word embeddings via a word-to-entity attention mechanism (where entities are WordNet lemmas and synsets), and to Loureiro and Jorge (2019), who construct sense embeddings from BERT's word embeddings and use the WordNet graph to enhance coverage (see quantitative comparison in table 3).

3 Incorporating Word-Supersense Information in Pre-training

In this section, we present our proposed method for integrating word-sense information within SenseBERT's pre-training. We start by describing the vanilla BERT architecture in subsection 3.1. We conceptually divide it into an internal transformer encoder and an external mapping W which translates the observed vocabulary space into and out of the transformer encoder space [see illustration in figure 1(a)].

In the subsequent subsections, we frame our contribution to the vanilla BERT architecture as an addition of a parallel external mapping to the words' supersenses space, denoted S [see illustration in figure 1(b)]. Specifically, in section 3.2 we describe the loss function used for learning S in parallel to W, effectively implementing word-form and word-sense multi-task learning in the pre-training stage. Then, in section 3.3 we describe our methodology for adding supersense information in S to the initial Transformer embedding, in parallel to word-level information added by W. In section 3.4 we address the issue of supersense prediction for out-of-vocabulary words, and in section 3.5 we describe our modification of BERT's masking strategy, prioritizing single-supersensed words which carry a clearer semantic signal.

3.1 Background

The input to BERT is a sequence of words $\{x^{(j)} \in \{0,1\}^{D_W}\}_{j=1}^{N}$, where 15% of the words are replaced by a [MASK] token (see the treatment of sub-word tokenization in section 3.4).

Figure 1 [architecture diagrams omitted; panels: (a) BERT, (b) SenseBERT]: SenseBERT includes a masked-word supersense prediction task, pre-trained jointly with BERT's original masked-word prediction task (Devlin et al., 2019) (see section 3.2). As in the original BERT, the mapping from the Transformer dimension to the external dimension is the same both at input and at output (W for words and S for supersenses), where M denotes a fixed mapping between word-forms and their allowed WordNet supersenses (see section 3.3). The vectors $p^{(j)}$ denote positional embeddings. For clarity, we omit a reference to a sentence-level Next Sentence Prediction task trained jointly with the above.

Here $N$ is the input sentence length, $D_W$ is the word vocabulary size, and $x^{(j)}$ is a 1-hot vector corresponding to the $j$-th input word. For every masked word, the output of the pretraining task is a word-score vector $y^{\text{words}} \in \mathbb{R}^{D_W}$ containing the per-word score. BERT's architecture can be decomposed into (1) an internal Transformer encoder architecture (Vaswani et al., 2017) wrapped by (2) an external mapping to the word vocabulary space, denoted by $W$.¹

The Transformer encoder operates over a sequence of word embeddings $v^{(j)}_{\text{input}} \in \mathbb{R}^d$, where $d$ is the Transformer encoder's hidden dimension. These are passed through multiple attention-based Transformer layers, producing a new sequence of contextualized embeddings at each layer. The Transformer encoder output is the final sequence of contextualized word embeddings $v^{(j)}_{\text{output}} \in \mathbb{R}^d$.

The external mapping $W \in \mathbb{R}^{d \times D_W}$ is effectively a translation between the external word vocabulary dimension and the internal Transformer dimension. Original words in the input sentence are translated into the Transformer block by applying this mapping (and adding positional encoding vectors $p^{(j)} \in \mathbb{R}^d$):

$v^{(j)}_{\text{input}} = W x^{(j)} + p^{(j)}$     (1)

¹For clarity, we omit a description of the Next Sentence Prediction task which we employ as in Devlin et al. (2019).

The word-score vector for a masked word at position $j$ is extracted from the Transformer encoder output by applying the transpose: $y^{\text{words}} = W^{\top} v^{(j)}_{\text{output}}$ [see illustration in figure 1(a)]. The use of the same matrix $W$ as the mapping in and out of the transformer encoder space is referred to as weight tying (Inan et al., 2017; Press and Wolf, 2017).

Given a masked word in position $j$, BERT's original masked-word prediction pre-training task is to have the softmax of the word-score vector $y^{\text{words}} = W^{\top} v^{(j)}_{\text{output}}$ get as close as possible to a 1-hot vector corresponding to the masked word. This is done by minimizing the cross-entropy loss between the softmax of the word-score vector and a 1-hot vector corresponding to the masked word:

$\mathcal{L}_{\text{LM}} = -\log p(w \mid \text{context})$,     (2)

where w is the masked word, the context is composed of the rest of the input sequence, and the probability is computed by:

$p(w \mid \text{context}) = \dfrac{\exp(y^{\text{words}}_{w})}{\sum_{w'} \exp(y^{\text{words}}_{w'})}$,     (3)

where $y^{\text{words}}_{w}$ denotes the $w$-th entry of the word-score vector.
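To make the weight-tying mechanism of eqs. (1)-(3) concrete, the following PyTorch sketch (our own simplified rendering under assumed dimensions, not the authors' released code) reuses the same embedding matrix W both for building the input vectors and for producing the output word scores; the encoder is a stock Transformer encoder rather than BERT's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedWordPredictionHead(nn.Module):
    """Minimal sketch of BERT's weight-tied masked-word prediction (eqs. 1-3)."""

    def __init__(self, vocab_size=30000, hidden_dim=768, max_len=512):
        super().__init__()
        self.W = nn.Embedding(vocab_size, hidden_dim)    # external mapping W, stored as (D_W, d)
        self.pos = nn.Embedding(max_len, hidden_dim)     # positional embeddings p^(j)
        layer = nn.TransformerEncoderLayer(hidden_dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=12)

    def forward(self, token_ids):                        # token_ids: (batch, N)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        v_input = self.W(token_ids) + self.pos(positions)    # eq. (1): W x^(j) + p^(j)
        v_output = self.encoder(v_input)                      # contextualized embeddings
        return v_output @ self.W.weight.T                     # y^words = W^T v_output (weight tying)

# Usage (sketch): eq. (2) is a cross-entropy at the masked positions, e.g.
#   y_words = head(token_ids)
#   loss = F.cross_entropy(y_words[mask], target_ids[mask])   # `head`, `mask`, `target_ids` are ours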


3.2 Weakly-Supervised Supersense Prediction Task

Jointly with the above procedure for training the word-level language model of SenseBERT, we train the model to predict the supersense of every masked word, thereby training a semantic-level language model. This is done by adding a parallel external mapping to the words' supersenses space, denoted $S \in \mathbb{R}^{d \times D_S}$ [see illustration in figure 1(b)], where $D_S = 45$ is the size of the supersenses vocabulary. Ideally, the objective is to have the softmax of the sense-score vector $y^{\text{senses}} := S^{\top} v^{(j)}_{\text{output}} \in \mathbb{R}^{D_S}$ get as close as possible to a 1-hot vector corresponding to the word's supersense in the given context.

For each word w in our vocabulary, we employ the WordNet word-sense inventory for constructing A(w), the set of its "allowed" supersenses. Specifically, we apply a WordNet Lemmatizer on w, extract the different synsets that are mapped to the lemmatized word in WordNet, and define A(w) as the union of supersenses coupled to each of these synsets. As exceptions, we set A(w) = ∅ for the following: (i) short words (up to 3 characters), since they are often treated as abbreviations, (ii) stop words, as WordNet does not contain their main synset (e.g., 'he' is either the element helium or the Hebrew language according to WordNet), and (iii) tokens that represent part-of-word (see section 3.4 for further discussion on these tokens).
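A minimal sketch of how the allowed-supersense set A(w) could be assembled with NLTK's WordNet interface is given below; the lemmatization and lexname lookup follow the construction described above, while the illustrative stop-word list is our own placeholder rather than the paper's exact list.

from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
STOP_WORDS = {"they", "them", "this", "that"}        # illustrative subset, not the paper's exact list

def allowed_supersenses(word):
    """A(w): the union of WordNet supersenses (lexnames) over all synsets of the
    lemmatized word, with the exceptions above mapped to the empty set."""
    if len(word) <= 3 or word.lower() in STOP_WORDS:
        return set()                                  # short words and stop words
    senses = set()
    for pos in (wn.NOUN, wn.VERB, wn.ADJ, wn.ADV):
        lemma = lemmatizer.lemmatize(word.lower(), pos=pos)
        for synset in wn.synsets(lemma, pos=pos):
            senses.add(synset.lexname())              # e.g. 'noun.food', 'verb.contact'
    return senses

print(allowed_supersenses("bass"))   # expected to include noun.food, noun.animal, noun.artifact, ...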

Given the above construction, we employ a combination of two loss terms for the supersense-level language model. The following allowed-senses term maximizes the probability that the predicted sense is in the set of allowed supersenses of the masked word w:

$\mathcal{L}^{\text{allowed}}_{\text{SLM}} = -\log p\bigl(s \in A(w) \mid \text{context}\bigr) = -\log \sum_{s \in A(w)} p(s \mid \text{context})$,     (4)

where the probability for a supersense s is given by:

$p(s \mid \text{context}) = \dfrac{\exp(y^{\text{senses}}_{s})}{\sum_{s'} \exp(y^{\text{senses}}_{s'})}$.     (5)

The soft-labeling scheme given above, which treats all the allowed supersenses of the masked word equally, introduces noise to the supersense labels. We expect that encountering many contexts in a sufficiently large corpus will reinforce the correct labels whereas the signal of incorrect labels will diminish. To illustrate this, consider the following examples for the food context:

1. "This bass is delicious" (supersenses: noun.food, noun.artifact, etc.)

2. "This chocolate is delicious" (supersenses: noun.food, noun.attribute, etc.)

3. "This pickle is delicious" (supersenses: noun.food, noun.state, etc.)

Masking the marked word in each of the examples results in three identical input sequences, each with a different set of labels. The ground truth label, noun.food, appears in all cases, so that its probability in contexts indicating food is increased whereas the signals supporting other labels cancel out.

While $\mathcal{L}^{\text{allowed}}_{\text{SLM}}$ pushes the network in the right direction, minimizing this loss could result in the network becoming overconfident in predicting a strict subset of the allowed senses for a given word, i.e., a collapse of the prediction distribution. This is especially acute in the early stages of the training procedure, when the network could converge to the noisy signal of the soft-labeling scheme.

To mitigate this issue, the following regularization term is added to the loss, which encourages a uniform prediction distribution over the allowed supersenses:

$\mathcal{L}^{\text{reg}}_{\text{SLM}} = -\sum_{s \in A(w)} \frac{1}{|A(w)|} \log p(s \mid \text{context})$,     (6)

i.e., a cross-entropy loss with a uniform distribution over the allowed supersenses.

Overall, jointly with the regular word-level language model trained with the loss in eq. 2, we train the semantic-level language model with a combined loss of the form:

$\mathcal{L}_{\text{SLM}} = \mathcal{L}^{\text{allowed}}_{\text{SLM}} + \mathcal{L}^{\text{reg}}_{\text{SLM}}$.     (7)
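The two loss terms can be sketched directly from eqs. (4)-(7). The snippet below is a simplified, single-word rendering (assuming a float 0/1 mask over the 45 supersenses), not the authors' training code.

import torch
import torch.nn.functional as F

def supersense_loss(y_senses, allowed_mask):
    """Sketch of eqs. (4)-(7) for one masked word.
    y_senses: (D_S,) sense scores; allowed_mask: (D_S,) float 0/1 vector marking A(w),
    assumed non-empty (words with A(w) = empty set simply contribute no sense loss)."""
    log_p = F.log_softmax(y_senses, dim=-1)                      # log p(s | context), eq. (5)
    # Allowed-senses term, eq. (4): -log of the total probability mass on A(w)
    loss_allowed = -torch.logsumexp(log_p + torch.log(allowed_mask), dim=-1)
    # Regularizer, eq. (6): cross-entropy against the uniform distribution over A(w)
    uniform = allowed_mask / allowed_mask.sum()
    loss_reg = -(uniform * log_p).sum()
    return loss_allowed + loss_reg                               # eq. (7)

# Example with hypothetical indices, for 'bass' masked in "This [MASK] is delicious":
# mask = torch.zeros(45); mask[[food_idx, animal_idx, artifact_idx, person_idx]] = 1.0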

3.3 Supersense Aware Input Embeddings

Though in principle two different matrices could have been used for converting in and out of the Transformer encoder, the BERT architecture employs the same mapping W. This approach, referred to as weight tying, was shown to yield theoretical and practical benefits (Inan et al., 2017; Press and Wolf, 2017). Intuitively, constructing the Transformer encoder's input embeddings from the same mapping with which the scores are computed improves their quality as it makes the input more sensitive to the training signal.

Figure 2 [scatter plots omitted]: UMAP visualization of supersense vectors (rows of the classifier S) learned by SenseBERT at pre-training. (a) All supersenses: clustering by the supersense's part of speech (legend: Verb Supersenses, Noun Supersenses, Other (adv./adj.); noun supersenses further marked as Abstract, Concrete, and Concrete - Entities). (b) Noun supersenses: semantically similar supersenses are clustered together (see more details in the supplementary materials).

We follow this approach, and insert our newly proposed semantic-level language model matrix S in the input in addition to W [as depicted in figure 1(b)], such that the input vector to the Transformer encoder (eq. 1) is modified to obey:

$v^{(j)}_{\text{input}} = (W + SM)x^{(j)} + p^{(j)}$,     (8)

where $p^{(j)}$ are the regular positional embeddings as used in BERT, and $M \in \mathbb{R}^{D_S \times D_W}$ is a static 0/1 matrix converting between words and their allowed WordNet supersenses A(w) (see construction details above).
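A small sketch of eq. (8) follows, under the assumption that M is precomputed from the allowed-supersense sets A(w) described in section 3.2; the helper names are ours.

import torch

def build_M(vocab, allowed_supersenses, supersense_index):
    """Static 0/1 matrix M of shape (D_S, D_W): M[s, w] = 1 iff supersense s is in A(w).
    `vocab` maps word -> column index, `supersense_index` maps supersense name -> row index,
    and `allowed_supersenses` is e.g. the WordNet helper sketched in section 3.2."""
    M = torch.zeros(len(supersense_index), len(vocab))
    for word, col in vocab.items():
        for sense in allowed_supersenses(word):
            M[supersense_index[sense], col] = 1.0
    return M

def sense_aware_input(x_onehot, W, S, M, p):
    """Eq. (8): v_input^(j) = (W + S M) x^(j) + p^(j), with
    W: (d, D_W), S: (d, D_S), M: (D_S, D_W), x_onehot: (D_W,), p: (d,)."""
    return (W + S @ M) @ x_onehot + p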

The above strategy for constructing $v^{(j)}_{\text{input}}$ allows for the semantic-level vectors in S to come into play and shape the input embeddings even for words which are rarely observed in the training corpus. For such a word, the corresponding row in W is potentially less informative, since due to the low word frequency the model did not have sufficient chance to adequately learn it. However, since the model learns a representation of its supersense, the corresponding row in S is informative of the semantic category of the word. Therefore, the input embedding in eq. 8 can potentially help the model to elicit meaningful information even when the masked word is rare, allowing for better exploitation of the training corpus.

3.4 Rare Words Supersense Prediction

At the pre-processing stage, when an out-of-vocabulary (OOV) word is encountered in the corpus, it is divided into several in-vocabulary sub-word tokens. For the self-supervised word prediction task (eq. 2), masked sub-word tokens are straightforwardly predicted as described in section 3.1. In contrast, word-sense supervision is only meaningful at the word level. We compare two alternatives for dealing with tokenized OOV words for the supersense prediction task (eq. 7).

In the first alternative, called 60K vocabulary, we augment BERT's original 30K-token vocabulary (which roughly contained the most frequent words) with an additional 30K new words, chosen according to their frequency in Wikipedia. This vocabulary increase allows us to see more of the corpus as whole words for which supersense prediction is a meaningful operation. Additionally, in accordance with the discussion in the previous subsection, our sense-aware input embedding mechanism can help the model extract more information from lower-frequency words. For the cases where a sub-word token is chosen for masking, we only propagate the regular word level loss and do not train the supersense prediction task.

The above addition to the vocabulary results in an increase of approximately 23M parameters over the 110M parameters of BERT_BASE and an increase of approximately 30M parameters over the 340M parameters of BERT_LARGE (due to different embedding dimensions d = 768 and d = 1024, respectively). It is worth noting that similar vocabulary sizes in leading models have not resulted in increased sense awareness, as reflected for example in the WiC task results (Liu et al., 2019).

As a second alternative, referred to as average embedding, we employ BERT's regular 30K-token vocabulary and employ a whole-word-masking strategy.

Figure 3 [diagram omitted]: (a) A demonstration of supersense probabilities assigned to a masked position within context, as given by SenseBERT's word-supersense level semantic language model (capped at 5%). For "The [MASK] fell to the floor." the top supersenses are noun.artifact (sword, chair, ...) at 52% and noun.person (man, girl, ...) at 17%; for "Gill [MASK] the bread." they are verb.contact (cut, buttered, ...) 33%, verb.consumption (ate, chewed, ...) 20%, verb.change (heated, baked, ...) 11%, and verb.possession (took, bought, ...) 6%. Example words corresponding to each supersense are presented in parentheses. (b) Examples of SenseBERT's predictions on raw text, when the unmasked input sentence is given to the model (e.g., "Dan cooked a bass on the grill." and "The bass player was exceptional."). This beyond word-form abstraction ability facilitates a more natural elicitation of semantic content at pre-training.

Accordingly, all of the tokens of a tokenized OOV word are masked together. In this case, we train the supersense prediction task to predict the WordNet supersenses of this word from the average of the output embeddings at the locations of the masked sub-word tokens.
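A brief sketch of this average embedding variant, assuming the positions of the masked sub-word tokens of the OOV word are known; this is our own simplified rendering.

import torch

def oov_supersense_scores(v_output, subword_positions, S):
    """Predict an OOV word's supersense scores from the mean of the Transformer
    outputs at its masked sub-word positions.
    v_output: (N, d); subword_positions: indices of the word's sub-word tokens; S: (d, D_S)."""
    avg = v_output[subword_positions].mean(dim=0)     # (d,) averaged output embedding
    return S.T @ avg                                   # y^senses: (D_S,)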

3.5 Single-Supersensed Word Masking

Words that have a single supersense are good anchors for obtaining an unambiguous semantic signal. These words teach the model to accurately map contexts to supersenses, such that it is then able to make correct context-based predictions even when a masked word has several supersenses. We therefore favor such words in the masking strategy, choosing 50% of the single-supersensed words in each input sequence to be masked. We stop if 40% of the overall 15% masking budget is filled with single-supersensed words (this rarely happens), and in any case we randomize the choice of the remaining words to complete this budget. As in the original BERT, 1 out of 10 words chosen for masking is shown to the model as itself rather than replaced with [MASK].
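The masking policy above can be sketched as follows; the helper is our own simplification (it omits, for instance, the 1-in-10 'keep the original word' step just mentioned).

import random

def choose_masked_positions(words, allowed, budget_frac=0.15, single_cap=0.4, single_rate=0.5):
    """Prefer single-supersensed words up to 40% of the 15% masking budget,
    then fill the rest at random. `allowed` maps word -> A(w)."""
    budget = max(1, int(budget_frac * len(words)))
    singles = [i for i, w in enumerate(words) if len(allowed.get(w, set())) == 1]
    k_single = min(int(single_rate * len(singles)), int(single_cap * budget))
    chosen = set(random.sample(singles, k=k_single))
    remaining = [i for i in range(len(words)) if i not in chosen]
    chosen.update(random.sample(remaining, k=budget - len(chosen)))
    return sorted(chosen)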

4 Semantic Language Model Visualization

A SenseBERT pretrained as described in section 3 (with training hyperparameters as in Devlin et al. (2019)) has an immediate non-trivial by-product. The pre-trained mapping to the supersenses space, denoted S, acts as an additional head predicting a word's supersense given context [see figure 1(b)].

SenseBERT_BASE variant    SemEval-SS Fine-tuned
30K, no OOV               81.9
30K, average OOV          82.7
60K, no OOV               83

Table 1: Testing variants for predicting supersenses of rare words during SenseBERT's pretraining, as described in section 5.1. Results are reported on the SemEval-SS task (see section 5.2). 30K/60K stand for vocabulary size, and no/average OOV stand for not predicting senses for OOV words or predicting senses from the average of the sub-word token embeddings, respectively.

We thereby effectively attain a semantic-level language model that predicts the missing word's meaning jointly with the standard word-form level language model.

We illustrate the resultant mapping in figure 2, showing a UMAP dimensionality reduction (McInnes et al., 2018) of the rows of S, which correspond to the different supersenses. A clear clustering according to the supersense part of speech is apparent in figure 2(a). We further identify finer-grained semantic clusters, as shown for example in figure 2(b) and given in more detail in the supplementary materials.
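A visualization like figure 2 can be approximately reproduced with the umap-learn package, assuming the learned supersense vectors are exported as a (45, d) array; the file names below are hypothetical placeholders.

import numpy as np
import umap                          # pip install umap-learn
import matplotlib.pyplot as plt

# Assumed inputs: the 45 learned supersense vectors and their labels (hypothetical files).
S_rows = np.load("supersense_vectors.npy")                 # shape (45, d)
names = [line.strip() for line in open("supersense_names.txt")]

coords = umap.UMAP(n_neighbors=10, min_dist=0.1, random_state=0).fit_transform(S_rows)
plt.scatter(coords[:, 0], coords[:, 1], s=12)
for (x, y), name in zip(coords, names):
    plt.annotate(name, (x, y), fontsize=7)                 # label each supersense point
plt.title("UMAP of supersense vectors (cf. Figure 2)")
plt.show()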

SenseBERT's semantic language model allows predicting a distribution over supersenses rather than over words in a masked position. Figure 3(a) shows the supersense probabilities assigned by SenseBERT in several contexts, demonstrating the model's ability to assign semantically meaningful categories to the masked position.

Finally, we demonstrate that SenseBERT enjoys an ability to view raw text at a lexical semantic level.

Figure 4 [diagram omitted]: Example entries of (a) the SemEval-SS task, where a model is to predict the supersense of the marked word (e.g., "The team used a battery of the newly developed 'gene probes'"; "Ten shirt-sleeved ringers stand in a circle, one foot ahead of the other in a prize-fighter's stance"), and (b) the Word in Context (WiC) task, where a model must determine whether the underlined word is used in the same/different supersense within sentences A and B (e.g., Sent. A: "The kick must be synchronized with the arm movements." / Sent. B: "A sidecar is a smooth drink but it has a powerful kick."; Sent. A: "Plant bugs in the dissident's apartment." / Sent. B: "Plant a spy in Moscow."). In all displayed examples, taken from the corresponding development sets, SenseBERT predicted the correct label while BERT failed to do so. A quantitative comparison between models is presented in table 2.

Figure 3(b) shows example sentences and their supersense predictions by the pretrained model. Where a vanilla BERT would see only the words of the sentence "Dan cooked a bass on the grill", SenseBERT would also have access to the supersense abstraction: "[Person] [created] [food] on the [artifact]". This sense-level perspective can help the model extract more knowledge from every training example, and to generalize semantically similar notions which do not share the same phrasing.

5 Lexical Semantics Experiments

In this section, we present quantitative evaluations of SenseBERT, pre-trained as described in section 3. We test the model's performance on a supersense-based variant of the SemEval WSD test sets standardized in Raganato et al. (2017), and on the Word in Context (WiC) task (Pilehvar and Camacho-Collados, 2019) (included in the recently introduced SuperGLUE benchmark (Wang et al., 2019)), both directly relying on the network's ability to perform lexical semantic categorization.

5.1 Comparing Rare Words Supersense Prediction Methods

We first report a comparison of the two methods described in section 3.4 for predicting the supersenses of rare words which do not appear in BERT's original vocabulary. The first 60K vocabulary method enriches the vocabulary and the second average embedding method predicts a supersense from the average embeddings of the sub-word tokens comprising an OOV word. During fine-tuning, when encountering an OOV word we predict the supersenses from the rightmost sub-word token in the 60K vocabulary method and from the average of the sub-word tokens in the average embedding method.

As shown in table 1, both methods perform comparably on the SemEval supersense disambiguation task (see following subsection), yielding an improvement over the baseline of learning supersense information only for whole words in BERT's original 30K-token vocabulary. We continue with the 60K-token vocabulary for the rest of the experiments, but note the average embedding option as a viable competitor for predicting word-level semantics.

5.2 SemEval-SS: Supersense Disambiguation

We test SenseBERT on a Word Supersense Disambiguation task, a coarse-grained variant of the common WSD task. We use SemCor (Miller et al., 1993) as our training dataset (226,036 annotated examples), and the SensEval (Edmonds and Cotton, 2001; Snyder and Palmer, 2004) / SemEval (Pradhan et al., 2007; Navigli et al., 2013; Moro and Navigli, 2015) suite for evaluation (overall 7253 annotated examples), following Raganato et al. (2017). For each word in both training and test sets, we change its fine-grained sense label to its corresponding WordNet supersense, and therefore train the network to predict a given word's supersense. We name this supersense disambiguation task SemEval-SS. See figure 4(a) for an example from this modified data set.
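The label coarsening used to build SemEval-SS amounts to mapping each fine-grained WordNet sense annotation to the lexicographer file (supersense) of its synset. A minimal sketch using NLTK, under the assumption that the annotations are WordNet sense keys as in SemCor, is:

from nltk.corpus import wordnet as wn

def sense_key_to_supersense(sense_key):
    """Map a fine-grained WordNet sense key to its coarse supersense label,
    e.g. the key of the food sense of 'bass' maps to 'noun.food'."""
    lemma = wn.lemma_from_key(sense_key)
    return lemma.synset().lexname()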


                     SemEval-SS Frozen   SemEval-SS Fine-tuned   Word in Context
BERT_BASE            65.1                79.2                    –
BERT_LARGE           67.3                81.1                    69.6
SenseBERT_BASE       75.6                83.0                    70.3
SenseBERT_LARGE      79.5                83.7                    72.1

Table 2: Results on a supersense variant of the SemEval WSD test set standardized in Raganato et al. (2017), which we denote SemEval-SS, and on the Word in Context (WiC) dataset (Pilehvar and Camacho-Collados, 2019) included in the recently introduced SuperGLUE benchmark (Wang et al., 2019). These tasks require a high level of lexical semantic understanding, as can be seen in the examples in figure 4. For both tasks, SenseBERT demonstrates a clear improvement over BERT in the regular fine-tuning setup, where network weights are modified during training on the task. Notably, SenseBERT_LARGE achieves state of the art performance on the WiC task. In the SemEval-SS Frozen setting, we train a linear classifier over pretrained embeddings, without changing the network weights. The results show that SenseBERT introduces a dramatic improvement in this setting, implying that its word-sense aware pre-training (section 3) yields embeddings that carry lexical semantic information which is easily extractable for the benefit of downstream tasks. Results for BERT on the SemEval-SS task are attained by employing the published pre-trained BERT models, and the BERT_LARGE result on WiC is taken from the baseline scores published on the SuperGLUE benchmark (Wang et al., 2019) (no result has been published for BERT_BASE).

                              Word in Context
ELMo†                         57.7
BERT sense embeddings††       67.7
BERT_LARGE‡                   69.6
RoBERTa‡‡                     69.9
KnowBERT-W+W§                 70.9
SenseBERT                     72.1

Table 3: Test set results for the WiC dataset. †Pilehvar and Camacho-Collados (2019); ††Loureiro and Jorge (2019); ‡Wang et al. (2019); ‡‡Liu et al. (2019); §Peters et al. (2019).

We show results on the SemEval-SS task for two different training schemes. In the first, we trained a linear classifier over the 'frozen' output embeddings of the examined model; we do not change the trained SenseBERT's parameters in this scheme. This Frozen setting is a test for the amount of basic lexical semantics readily present in the pre-trained model, easily extractable by further downstream tasks (reminiscent of the semantic probes employed in Hewitt and Manning (2019); Reif et al. (2019)).
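A sketch of the Frozen setting is given below: only a linear classifier over the target word's contextual embedding is trained, while the encoder stays fixed. We use a HuggingFace-style interface and a vanilla BERT checkpoint as stand-ins; the actual SenseBERT checkpoint and probe details are not specified here.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer     # assumption: a HuggingFace-style checkpoint

encoder = AutoModel.from_pretrained("bert-base-uncased")      # placeholder, not a released SenseBERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
for param in encoder.parameters():
    param.requires_grad = False                                # Frozen setting: encoder weights fixed

probe = nn.Linear(encoder.config.hidden_size, 45)              # linear classifier over 45 supersenses
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)      # only the probe is trained

def target_word_embedding(sentence, target_word_index):
    """Contextual embedding of the target word (first sub-word token, a simplification)."""
    enc = tokenizer(sentence.split(), is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state[0]           # (num_tokens, d)
    token_pos = enc.word_ids().index(target_word_index)        # first sub-word of the target word
    return hidden[token_pos]

# Training step (sketch): logits = probe(target_word_embedding(sent, idx));
# loss = nn.functional.cross_entropy(logits.unsqueeze(0), gold_supersense_id)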

In the second training scheme we fine-tuned the examined model on the task, allowing its parameters to change during training (see full training details in the supplementary materials). Results attained by employing this training method reflect the model's potential to acquire word-supersense information given its pre-training.

Table 2 shows a comparison between vanilla BERT and SenseBERT on the supersense disambiguation task. Our semantic level pre-training signal clearly yields embeddings with enhanced word-meaning awareness, relative to embeddings trained with BERT's vanilla word-level signal. SenseBERT_BASE improves the score of BERT_BASE in the Frozen setting by over 10 points and SenseBERT_LARGE improves that of BERT_LARGE by over 12 points, demonstrating competitive results even without fine-tuning. In the setting of model fine-tuning, we see a clear demonstration of the model's ability to learn word-level semantics, as SenseBERT_BASE surpasses the score of BERT_LARGE by 2 points.

5.3 Word in Context (WiC) Task

We test our model on the recently introduced WiC binary classification task. Each instance in WiC has a target word w for which two contexts are provided, each invoking a specific meaning of w. The task is to determine whether the occurrences of w in the two contexts share the same meaning or not, clearly requiring an ability to identify the word's semantic category. The WiC task is defined over supersenses (Pilehvar and Camacho-Collados, 2019): the negative examples include a word used in two different supersenses and the positive ones include a word used in the same supersense. See figure 4(b) for an example from this data set.


                    Score   CoLA   SST-2   MRPC        STS-B       QQP         MNLI   QNLI   RTE
BERT_BASE (ours)    77.5    50.1   92.6    88.7/84.3   85.7/84.6   71.0/88.9   83.6   89.4   67.9
SenseBERT_BASE      77.9    54.6   92.2    89.2/85.2   83.5/82.3   70.3/88.8   83.6   90.6   67.5

Table 4: Results on the GLUE benchmark test set.

Results on the WiC task comparing SenseBERT to vanilla BERT are shown in table 2. SenseBERT_BASE surpasses a larger vanilla model, BERT_LARGE. As shown in table 3, a single SenseBERT_LARGE model achieves the state of the art score in this task, demonstrating unprecedented lexical semantic awareness.

5.4 GLUE

The General Language Understanding Evaluation (GLUE; Wang et al. (2018)) benchmark is a popular testbed for language understanding models. It consists of 9 different NLP tasks, covering different linguistic phenomena. We evaluate our model on GLUE, in order to verify that SenseBERT gains its lexical semantic knowledge without compromising performance on other downstream tasks. Due to slight differences in the data used for pretraining BERT and SenseBERT (BookCorpus is not publicly available), we trained a BERT_BASE model with the same data used for our models. BERT_BASE and SenseBERT_BASE were both fine-tuned using the exact same procedures and hyperparameters. The results are presented in table 4. Indeed, SenseBERT performs on par with BERT, achieving an overall score of 77.9, compared to 77.5 achieved by BERT_BASE.

6 Conclusion

We introduce lexical semantic information into a neural language model's pre-training objective. This results in a boosted word-level semantic awareness of the resultant model, named SenseBERT, which considerably outperforms a vanilla BERT on a SemEval based Supersense Disambiguation task and achieves state of the art results on the Word in Context task. This improvement was obtained without human annotation, but rather by harnessing an external linguistic knowledge source. Our work indicates that semantic signals extending beyond the lexical level can be similarly introduced at the pre-training stage, allowing the network to elicit further insight without human supervision.

Acknowledgments

We acknowledge useful comments and assistance from our colleagues at AI21 Labs. We would also like to thank the anonymous reviewers for their valuable feedback.

References

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2018. Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6:483–495.

Pierpaolo Basile. 2012. Super-sense tagging using support vector machines and distributional features. In International Workshop on Evaluation of Natural Language and Speech Tools for Italian, pages 176–185. Springer.

William Chan, Nikita Kitaev, Kelvin Guu, Mitchell Stern, and Jakob Uszkoreit. 2019. KERMIT: Generative insertion-based modeling for sequences. arXiv preprint arXiv:1906.01604.

Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. 2014. A unified model for word sense representation and disambiguation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1025–1035, Doha, Qatar. Association for Computational Linguistics.

Massimiliano Ciaramita and Mark Johnson. 2003. Supersense tagging of unknown nouns in WordNet. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 168–175.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Philip Edmonds and Scott Cotton. 2001. SENSEVAL-2: Overview. In Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems, pages 1–5, Toulouse, France. Association for Computational Linguistics.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.

Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. Embeddings for word sense disambiguation: An evaluation study. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 897–907, Berlin, Germany. Association for Computational Linguistics.

Hakan Inan, Khashayar Khosravi, and Richard Socher. 2017. Tying word vectors and word classifiers: A loss framework for language modeling. In ICLR.

Adam Kilgarriff. 1997. I don't believe in word senses. Computers and the Humanities, 31(2):91–113.

Minh Le, Marten Postma, Jacopo Urbani, and Piek Vossen. 2018. A deep dive into word sense disambiguation with LSTM. In Proceedings of the 27th International Conference on Computational Linguistics, pages 354–365, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Daniel Loureiro and Alípio Jorge. 2019. Language modelling makes sense: Propagating representations through WordNet for full-coverage word sense disambiguation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5682–5691, Florence, Italy. Association for Computational Linguistics.

Leland McInnes, John Healy, and James Melville. 2018. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

George A Miller. 1998. WordNet: An electronic lexical database. MIT Press.

George A. Miller, Claudia Leacock, Randee Tengi, and Ross T. Bunker. 1993. A semantic concordance. In Human Language Technology: Proceedings of a Workshop Held at Plainsboro, New Jersey, March 21-24, 1993.

Andrea Moro and Roberto Navigli. 2015. SemEval-2015 task 13: Multilingual all-words sense disambiguation and entity linking. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 288–297, Denver, Colorado. Association for Computational Linguistics.

Roberto Navigli. 2009. Word sense disambiguation: A survey. ACM Comput. Surv., 41(2).

Roberto Navigli, David Jurgens, and Daniele Vannella. 2013. SemEval-2013 task 12: Multilingual word sense disambiguation. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 222–231, Atlanta, Georgia, USA. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Robert Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge enhanced contextual word representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 43–54, Hong Kong, China. Association for Computational Linguistics.

Mohammad Taher Pilehvar and Jose Camacho-Collados. 2019. WiC: The word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1267–1273, Minneapolis, Minnesota. Association for Computational Linguistics.

Sameer Pradhan, Edward Loper, Dmitriy Dligach, and Martha Palmer. 2007. SemEval-2007 task-17: English lexical sample, SRL and all words. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 87–92, Prague, Czech Republic. Association for Computational Linguistics.

Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 157–163, Valencia, Spain. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Alessandro Raganato, Jose Camacho-Collados, and Roberto Navigli. 2017. Word sense disambiguation: A unified evaluation framework and empirical comparison. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 99–110, Valencia, Spain. Association for Computational Linguistics.

Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B Viegas, Andy Coenen, Adam Pearce, and Been Kim. 2019. Visualizing and measuring the geometry of BERT. In Advances in Neural Information Processing Systems 32, pages 8594–8603. Curran Associates, Inc.

Sascha Rothe and Hinrich Schutze. 2015. AutoExtend: Extending word embeddings to embeddings for synsets and lexemes. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1793–1803, Beijing, China. Association for Computational Linguistics.

Nathan Schneider. 2014. Lexical semantic analysis in natural language text. Unpublished doctoral dissertation, Carnegie Mellon University.

Nathan Schneider and Noah A. Smith. 2015. A corpus and model integrating multiword expressions and supersenses. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1537–1547, Denver, Colorado. Association for Computational Linguistics.

Benjamin Snyder and Martha Palmer. 2004. The English all-words task. In Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pages 41–43, Barcelona, Spain. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems 32, pages 3266–3280. Curran Associates, Inc.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems 32, pages 5753–5763. Curran Associates, Inc.

Dayu Yuan, Julian Richardson, Ryan Doherty, Colin Evans, and Eric Altendorf. 2016. Semi-supervised word sense disambiguation with neural models. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1374–1385, Osaka, Japan. The COLING 2016 Organizing Committee.

A Supersenses and Their Representation in SenseBERT

We present in table 5 a comprehensive list of WordNet supersenses, as they appear in the WordNet documentation. In fig. 5 we present a dendrogram of an agglomerative hierarchical clustering over the supersense embedding vectors learned by SenseBERT in pre-training. The clustering shows a clear separation between noun senses and verb senses. Furthermore, we can observe that semantically related supersenses are clustered together (e.g., noun.animal and noun.plant).

B Training Details

As hyperparameters for the fine-tuning, we used max_seq_length = 128, chose learning rates from {5e-6, 1e-5, 2e-5, 3e-5, 5e-5}, batch sizes from {16, 32}, and fine-tuned up to 10 epochs for all the datasets.
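For concreteness, this search space can be written as a small grid; the training entry point in the comment is a hypothetical placeholder.

import itertools

# Appendix B search space: max_seq_length fixed at 128, up to 10 fine-tuning epochs.
learning_rates = [5e-6, 1e-5, 2e-5, 3e-5, 5e-5]
batch_sizes = [16, 32]

for lr, bs in itertools.product(learning_rates, batch_sizes):
    config = {"max_seq_length": 128, "learning_rate": lr, "batch_size": bs, "max_epochs": 10}
    # fine_tune(model, task_data, **config)   # hypothetical training entry point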

Figure 5 [dendrogram omitted; leaves are the 45 supersense labels, with verb supersenses and noun supersenses forming the two top-level branches]: Dendrogram visualization of an agglomerative hierarchical clustering over the supersense vectors (rows of the classifier S) learned by SenseBERT.

Name                  Content
adj.all               All adjective clusters
adj.pert              Relational adjectives (pertainyms)
adj.ppl               Participial adjectives
adv.all               All adverbs
noun.Tops             Unique beginner for nouns
noun.act              Nouns denoting acts or actions
noun.animal           Nouns denoting animals
noun.artifact         Nouns denoting man-made objects
noun.attribute        Nouns denoting attributes of people and objects
noun.body             Nouns denoting body parts
noun.cognition        Nouns denoting cognitive processes and contents
noun.communication    Nouns denoting communicative processes and contents
noun.event            Nouns denoting natural events
noun.feeling          Nouns denoting feelings and emotions
noun.food             Nouns denoting foods and drinks
noun.group            Nouns denoting groupings of people or objects
noun.location         Nouns denoting spatial position
noun.motive           Nouns denoting goals
noun.object           Nouns denoting natural objects (not man-made)
noun.person           Nouns denoting people
noun.phenomenon       Nouns denoting natural phenomena
noun.plant            Nouns denoting plants
noun.possession       Nouns denoting possession and transfer of possession
noun.process          Nouns denoting natural processes
noun.quantity         Nouns denoting quantities and units of measure
noun.relation         Nouns denoting relations between people or things or ideas
noun.shape            Nouns denoting two and three dimensional shapes
noun.state            Nouns denoting stable states of affairs
noun.substance        Nouns denoting substances
noun.time             Nouns denoting time and temporal relations
verb.body             Verbs of grooming, dressing and bodily care
verb.change           Verbs of size, temperature change, intensifying, etc.
verb.cognition        Verbs of thinking, judging, analyzing, doubting
verb.communication    Verbs of telling, asking, ordering, singing
verb.competition      Verbs of fighting, athletic activities
verb.consumption      Verbs of eating and drinking
verb.contact          Verbs of touching, hitting, tying, digging
verb.creation         Verbs of sewing, baking, painting, performing
verb.emotion          Verbs of feeling
verb.motion           Verbs of walking, flying, swimming
verb.perception       Verbs of seeing, hearing, feeling
verb.possession       Verbs of buying, selling, owning
verb.social           Verbs of political and social activities and events
verb.stative          Verbs of being, having, spatial relations
verb.weather          Verbs of raining, snowing, thawing, thundering

Table 5: A list of the supersense categories from the WordNet lexicographer files.

