
Proceedings of NAACL-HLT 2019, pages 3216–3225, Minneapolis, Minnesota, June 2 - June 7, 2019. ©2019 Association for Computational Linguistics


Exploiting Noisy Data in Distant Supervision Relation Classification

Kaijia Yang, Liang He, Xin-yu Dai, Shujian Huang, Jiajun Chen
National Key Laboratory for Novel Software Technology,
Nanjing University, Nanjing, 210023, China
{yangkj,heliang}@nlp.nju.edu.cn
{daixinyu,huangsj,chenjj}@nju.edu.cn

Abstract

Distant supervision has made great progress on the relation classification task. However, it still suffers from the noisy labeling problem. Unlike previous works that underutilize noisy data, which inherently characterizes the properties of classification, in this paper we propose RCEND, a novel framework to enhance Relation Classification by Exploiting Noisy Data. First, an instance discriminator with reinforcement learning is designed to split the noisy data into correctly labeled data and incorrectly labeled data. Second, we learn a robust relation classifier in a semi-supervised way, whereby the correctly and incorrectly labeled data are treated as labeled and unlabeled data respectively. The experimental results show that our method outperforms state-of-the-art models.

1 Introduction

Relation classification plays a crucial role in natural language processing (NLP) tasks such as question answering and knowledge base completion (Xu et al., 2016; Han et al., 2018a). The goal of relation classification is to predict the relation of a target entity pair given a plain text. Traditional supervised learning methods (Zelenko et al., 2002; Bunescu and Mooney, 2005; Zhou et al., 2005) rely heavily on large-scale annotated data, which is time-consuming and labor-intensive to produce. Mintz et al. (2009) proposed distant supervision (DS) to automatically generate training data for relation classification, based on the assumption that if two target entities have a relation in a knowledge base (KB), sentences containing this entity pair might express the relation. For example, if the relational fact <Apple, founder, Steve Jobs> exists in the KB, distant supervision will assign founder as the label of all sentences that contain "Apple" and "Steve Jobs" together.

Sentence | DS | Gold
S1: Al Gore was waiting to board a commercial flight from Nashville to Miami... | LivedIn | NA
S2: There were also performers who were born in Louisiana, including Lucinda Williams... | LivedIn | BornIn
S3: Boggs was married, had three young children and lived in Brewster. | NA | LivedIn

Table 1: Examples of the noisy labeling problem in distant supervision relation classification. S1 and S2 are heuristically labeled as LivedIn by DS, but neither of them mentions that relation; S2 instead mentions the BornIn relation. S3 expresses the LivedIn relation but is mislabeled as NA, since no relation for the entity pair exists in the KB.

However, distant supervision suffers from the noisy labeling problem, caused by the irrelevance of the aligned text and the incompleteness of the KB, and consisting of false positives and false negatives. False positives arise because not all sentences containing the two entities mention the relation recorded in the KB, such as S1 and S2 in Table 1. False negatives are sentences mislabeled as no relation (NA) because the relational fact is absent from the KB, even though they express the target relation, such as S3 in Table 1.

In order to reduce the impact of noisy data, previous works (Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012; Zeng et al., 2015; Lin et al., 2016; Han et al., 2018b) adopt Multi-Instance Learning (MIL) for relation classification. Recent studies (Feng et al., 2018; Qin et al., 2018b,a) introduce reinforcement learning (RL) and adversarial learning to filter out incorrectly labeled sentences and achieve significant improvements. However, two challenges of the noisy labeling problem remain.

• Most of these approaches focus on solving the false positives but overlook the false negatives. As illustrated in Figure 1, they concentrate on discovering the false positive instances¹, which are suppressed or removed in the end, and thereby obtain a better decision boundary (green dashed line) than without consideration of false positive instances. Nevertheless, there remain many false negative instances that express semantic information similar to the positive data. These instances also provide evidence for the target relation. If their incorrect labels stay unchanged, they weaken the discriminative capability of the available features and confuse the model; when we remedy the labels correctly, we obtain the optimal decision boundary (red solid line).

[Figure 1: Illustration of the false positive and false negative cases, contrasting DS true/false positive data with DS true/false negative data.]

¹In this paper, "instance" is used interchangeably with "sentence".

• There is no effective method to fully utilize the noisy data of distant supervision. Xu et al. (2013) and Liu et al. (2017) apply methods such as pseudo-labels to directly correct the labels of noisy data, and Luo et al. (2017) design a dynamic transition matrix to model noise patterns. These methods still suffer from error propagation during training.

To tackle the above challenges, we propose a novel framework exploiting noisy data to enhance distant supervision relation classification. We design an instance discriminator with reinforcement learning to recognize both false positive and false negative instances simultaneously, and further split the noisy dataset into two sets, representing correctly labeled and incorrectly labeled data respectively. Additionally, we learn a robust relation classifier by applying a semi-supervised learning method, whereby the correctly and incorrectly labeled data are regarded as labeled and unlabeled data. On the one hand, we mitigate the side effect of incorrectly labeled data by recognizing them and treating them as unlabeled data. On the other hand, taking full advantage of the incorrectly labeled data in a semi-supervised way makes the model more robust and improves its generalization performance. Our contributions in this work are three-fold:

• We propose a deep reinforcement learning framework to discriminate both false-positive and false-negative instances simultaneously.

• We introduce a semi-supervised learning method to fully exploit the noisy data in distant supervision relation classification.

• We conduct experiments on a widely used benchmark dataset and the results show that our method achieves significant improvements as compared with strong baselines.

2 Related Work

Many efforts based on supervised learning (Zelenko et al., 2002; Bunescu and Mooney, 2005; Zhou et al., 2005) have been devoted to relation classification. Achieving good performance under the supervised learning paradigm requires a large amount of high-quality annotated data. To address this issue of data sparsity, Mintz et al. (2009) propose distant supervision to automatically annotate large-scale training data, which inevitably results in the noisy labeling problem.

To tolerate noisy instances among the positive examples, most early approaches employ the multi-instance learning framework, including multi-instance single-label learning (Riedel et al., 2010) and multi-instance multi-label learning (Hoffmann et al., 2011; Surdeanu et al., 2012). Deep learning has also been introduced, such as an end-to-end convolutional neural network for relation classification (Zeng et al., 2014). Within the sentence bag of one entity pair, Zeng et al. (2015) select the most reliable sentence, and Lin et al. (2016) propose attention schemes to de-emphasize unreliable sentences. Han et al. (2018b) incorporate hierarchical information of relations to enhance the attention scheme. However, these methods fail to handle the case where all sentences in one bag are mislabeled.

Feng et al. (2018) and Qin et al. (2018b,a) further achieve improvements by using reinforcement learning and adversarial learning to explicitly remove incorrectly labeled sentences. However, they neglect the useful inherent information of those sentences, whose labels should instead be corrected. In other words, they remove the noise rather than utilizing it in the right way.

[Figure 2: The framework of the training process. The instance discriminator, consisting of PosAgent and NegAgent, aims to recognize false-positive (FP) and false-negative (FN) instances from the positive dataset (D_POS) and the NA dataset (D_NA) respectively. Afterward, true-positive (TP) and true-negative (TN) instances are split into the labeled data (D_l) while FP and FN instances are split into the unlabeled data (D_u). We adopt SemiVAE, which consists of an encoder, a decoder and a classifier, to train a robust relation classifier with semi-supervised learning utilizing D_l and D_u. More details are introduced in Sections 3.2 and 3.3.]

Furthermore, Xu et al. (2013) correct false negative instances by using pseudo-relevance feedback to expand the original knowledge base. Liu et al. (2017) apply a dynamic soft label instead of the immutable hard label produced by DS during the training process. Luo et al. (2017) design a transition matrix which characterizes the underlying noise pattern to correct noisy labels. These methods utilize the noisy data and address the false negative problem to some extent, but they still suffer from the drawback that errors may be propagated, because the model is unable to correct its own mistakes.

In this work, we propose a unified framework that learns a discriminator to recognize both false-positive and false-negative instances with reinforcement learning, and that utilizes the incorrectly labeled data as unlabeled data in a semi-supervised way.

3 Methodology

In this section, we introduce our framework and the details of the instance discriminator and the relation classifier.

3.1 Framework

In the MIL paradigm, all instances are split into multiple entity-pair bags {B_{h_i,t_i}}_{i=1}^{k}. The sentences in B_{h,t} mention both the head entity h and the tail entity t. We denote the dataset as D = {(x_i, y_i)}_{i=1}^{n}, where x_i is a sentence associated with the corresponding entity pair, y_i is the noisy relation label produced by distant supervision, and n is the total number of sentences in the bag. As mentioned above, NA is a special relation indicating that the sentence does not express any relation in the KB; we define the other relations in the KB as positive relations. Accordingly, we split the dataset into D_POS and D_NA.
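To make this data layout concrete, the following minimal Python sketch groups DS instances into entity-pair bags and splits them into D_POS and D_NA; the field names and the "NA" string are illustrative assumptions, not from the paper.

```python
# Hypothetical sketch of the MIL data layout described above.
from collections import defaultdict

def build_bags(dataset):
    """dataset: iterable of (sentence, head, tail, ds_label) tuples."""
    bags = defaultdict(list)
    for sentence, head, tail, label in dataset:
        bags[(head, tail)].append((sentence, label))
    # NA marks sentences whose entity pair has no known relation in the KB;
    # every other label is a positive relation.
    d_pos = [(s, y) for bag in bags.values() for s, y in bag if y != "NA"]
    d_na = [(s, y) for bag in bags.values() for s, y in bag if y == "NA"]
    return bags, d_pos, d_na
```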

In the scenario of distant supervision, an ideal model is capable not only of capturing valid supervision information from the correctly labeled, less noisy data, but also of leveraging the information contained in incorrectly labeled data by implicitly correcting the noisy labels.

As a result, we solve the task of distant supervision relation classification in two steps. As depicted in Figure 2, we design an instance discriminator that heuristically recognizes false positive and false negative instances in the noisy distantly-supervised dataset with reinforcement learning. The correctly labeled instances discovered by the discriminator are split into labeled data, while the incorrectly labeled ones are split into unlabeled data; the details of the instance discriminator are introduced in Section 3.2. After scanning the entire noisy dataset, we train a robust classifier with semi-supervised learning utilizing the above labeled and unlabeled data; the details of the relation classifier are introduced in Section 3.3. Meanwhile, the relation classifier provides rewards to the instance discriminator for updating the parameters of its policy function.

3.2 Instance discriminator

We regard recognizing incorrectly labeled instances as a reinforcement learning problem. The instance discriminator acts as an agent interacting with an environment that consists of the noisy dataset and a relation classifier. The agent is parameterized with a policy network π(a|s; θ) which gives the probability distribution over actions a at each state s, and it receives a reward r from the relation classifier to update the parameters θ. Note that NA indicates either that there is no relation between the two entities or that the relation is of no interest. The relation NA is very ambiguous, since its instances share no unified pattern; we therefore cannot decide that a sentence belongs to NA only from the fact that it does not express any positive relation. Under this consideration, we adopt two agents, PosAgent and NegAgent, to recognize false positive and false negative instances respectively. The components of the RL formulation are defined as follows.

State: The state includes the semantic and syntactic information of the current sentence and the relation label given by DS. We use a piecewise convolutional neural network (PCNN) (Zeng et al., 2015) to convert each sentence into a real-valued vector x, and we build a class representation matrix M to represent each relation type. Since we decide whether the current sentence is correctly labeled according to the similarity between the semantic meanings of the sentence and the relation, we only take the current sentence into consideration, without the sentences from earlier states. For PosAgent, the state s_p is the concatenation of the current sentence vector x and the corresponding relation embedding. For NegAgent, we represent the state s_n by the vector of relational scores derived from the representation of the current sentence x:

$$ s_p = [x; M[y]], \qquad s_n = Mx + b \qquad (1) $$

where y is the relation label of the current sentence, b ∈ R^{n_r} is a bias vector, and n_r is the number of classes.
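As a concrete illustration, the two states of Eq. (1) can be assembled as below. This is a sketch with random placeholder tensors standing in for the PCNN sentence vector x, the class matrix M, and the bias b; it is not the authors' code.

```python
import numpy as np

n_r, d_s = 53, 690               # number of relations; sentence dim (3 * 230)
x = np.random.randn(d_s)         # PCNN encoding of the current sentence
M = np.random.randn(n_r, d_s)    # class representation matrix, one row per relation
b = np.random.randn(n_r)         # bias vector
y = 7                            # noisy DS label of the current sentence

s_p = np.concatenate([x, M[y]])  # PosAgent state: [x ; M[y]]
s_n = M @ x + b                  # NegAgent state: vector of relational scores
```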

Action: We desire the agent to distinguish whether the current sentence is mislabeled. Therefore, the action of our agent is defined as a_i ∈ {0, 1}, where 0 indicates the sentence is incorrectly labeled and 1 indicates the sentence is correctly labeled.

Reward: The reward function reflects the benefit of redistributing the noisy data. As previously mentioned, the actions of our agent redistribute the noisy data into labeled and unlabeled data, corresponding to correctly and incorrectly labeled instances. Therefore, the average likelihood of the labeled data will be larger than that of the unlabeled data when the agent takes correct actions. We define the difference in likelihood between them as the reward that evaluates the performance of our policy:

$$ r = \lambda \left( \frac{1}{|L|} \sum_{x \in L} p_\phi(y|x) - \frac{1}{|U|} \sum_{x \in U} p_\phi(y|x) \right) \qquad (2) $$

where L and U are the subsets of labeled and unlabeled data respectively, and y is the relation label given by DS. p_φ(y|x) is calculated by the relation classifier from the semi-supervised learning framework, and λ scales the difference to a rational numeric range.
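In code, the reward is simply the scaled gap between the average DS-label likelihood of the kept and the discarded instances. The sketch below assumes a callable p_phi standing in for the classifier's p_φ(y|x):

```python
def reward(p_phi, labeled, unlabeled, lam=100.0):
    """Eq. (2): labeled/unlabeled are lists of (sentence, ds_label) pairs."""
    def avg_likelihood(pairs):
        return sum(p_phi(x, y) for x, y in pairs) / max(len(pairs), 1)
    return lam * (avg_likelihood(labeled) - avg_likelihood(unlabeled))
```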

Training the Policy-based Agent: The objective of the agent is to maximize the expected reward of the actions sampled from the probability distribution. Given a mini-batch B, our agent, following the policy, produces a probability distribution over actions π(a_i|s_i; θ) for each instance. Based on the sampled actions, the agent receives a performance-driven reward r. We use a policy gradient strategy to compute the gradient and update our agent, following the policy gradient theorem (Sutton et al., 1999) and the REINFORCE algorithm (Williams, 1992). The parameters of the policy network are updated according to:

$$ \theta \leftarrow \theta + \alpha \sum_{i=1}^{|B|} r \, \nabla_\theta \log \pi(a_i|s_i; \theta) \qquad (3) $$

As the goal of our agent is to determine whether an annotated sentence expresses the target relation under weak supervision, we need a relation classifier to compute the reward for updating the policy network. We first pre-train our classifier on the entire dataset with supervised learning until rough convergence. Then we pre-train the policy network by receiving rewards from the pre-trained classifier with its parameters frozen. This pre-training strategy is necessary, as it saves time that would otherwise be spent training the model by trial and error, and it is widely used in related work (Silver et al., 2016; Bahdanau et al., 2016). The training procedure for the instance discriminator is summarized in Algorithm 1.
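For concreteness, one REINFORCE-style update implementing Eq. (3) could look as follows, assuming PyTorch and a policy module that maps a batch of states to a two-way action distribution; the names are illustrative, not the authors' implementation.

```python
import torch

def reinforce_step(policy, optimizer, states, actions, r):
    """states: (B, d) float tensor; actions: (B,) 0/1 long tensor; r: scalar reward."""
    probs = policy(states)                                 # (B, 2) action probabilities
    log_p = torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1))
    loss = -(r * log_p).sum()                              # minimize -r * log pi
    optimizer.zero_grad()
    loss.backward()                                        # gradient of Eq. (3)
    optimizer.step()
```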

3.3 Relation Classifier

To make maximum use of the noisy data, we train our relation classifier with semi-supervised learning. Below, we introduce PCNN, the sentence encoder, and SemiVAE, the method we adopt for semi-supervised learning.

PCNN

We take the widely used CNN architecture PCNN (Zeng et al., 2015; Lin et al., 2016) to encode input sentences into low-dimensional vectors and predict their corresponding relation labels.

Given a sentence containing an entity pair, we represent the i-th word as v_i by concatenating its word embedding w_i and its position embedding p_i, which encodes the relative distances from the word to the two entities (v_i ∈ R^d, w_i ∈ R^{d_w}, p_i ∈ R^{d_p}, d = d_w + d_p).

Afterward, the convolution layer applies a kernel of window size l sliding over the input sequence {v_1, v_2, ..., v_m} and outputs the hidden embeddings h, where h ∈ R^{m×d_c} and d_c is the number of feature maps.

Then, piecewise max-pooling divides the hidden embeddings into three parts {h_1, h_2, h_3} by the positions of the head and tail entities. We perform max-pooling on each part separately and obtain the final embedding x by concatenating the pooling results, where x ∈ R^{d_s} (d_s = d_c × 3):

$$ x = [\max(h_1); \max(h_2); \max(h_3)] \qquad (4) $$

Finally, we formalize the probability of predicting y given sentence x as follows:

$$ o = Mx + b, \qquad p_\phi(y|x) = \frac{\exp(o_y)}{\sum_{k=1}^{n_r} \exp(o_k)} \qquad (5) $$

where M ∈ R^{n_r×d_s} contains the class embeddings of the relations and b ∈ R^{n_r} is a bias vector.
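A compact PyTorch sketch of the PCNN encoder and classifier in Eqs. (4)-(5) is given below. The padding scheme, the tanh nonlinearity, and the assumption that both entities lie strictly inside the sentence (so all three segments are non-empty) are implementation choices of this sketch, not details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PCNN(nn.Module):
    def __init__(self, vocab_size, n_rel, d_w=50, d_p=5, d_c=230, window=3, max_len=120):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_w)
        # two position embeddings: relative distances to the head and tail entity
        self.pos1_emb = nn.Embedding(2 * max_len, d_p)
        self.pos2_emb = nn.Embedding(2 * max_len, d_p)
        self.conv = nn.Conv1d(d_w + 2 * d_p, d_c, kernel_size=window, padding=window // 2)
        self.fc = nn.Linear(3 * d_c, n_rel)                 # o = Mx + b of Eq. (5)

    def encode(self, tokens, pos1, pos2, head_idx, tail_idx):
        v = torch.cat([self.word_emb(tokens), self.pos1_emb(pos1),
                       self.pos2_emb(pos2)], dim=-1)        # (B, m, d_w + 2*d_p)
        h = self.conv(v.transpose(1, 2))                    # (B, d_c, m)
        pooled = []
        for i in range(h.size(0)):
            a, b = sorted((int(head_idx[i]), int(tail_idx[i])))
            # piecewise max-pooling over the three segments cut by the entities, Eq. (4)
            segs = [h[i, :, :a + 1], h[i, :, a + 1:b + 1], h[i, :, b + 1:]]
            pooled.append(torch.cat([s.max(dim=-1).values for s in segs]))
        return torch.tanh(torch.stack(pooled))              # x in R^{3*d_c}

    def forward(self, tokens, pos1, pos2, head_idx, tail_idx):
        x = self.encode(tokens, pos1, pos2, head_idx, tail_idx)
        return F.softmax(self.fc(x), dim=-1)                # p_phi(y|x), Eq. (5)
```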

Semi-supervised VAE

SemiVAE, a semi-supervised learning method based on variational inference, was introduced and developed by Kingma et al. (2014) and Xu et al. (2017). The inference model consists of three components. An encoder network p_ϕ(z|x, y) encodes the data x and label y into a latent variable z. The decoder network p_ψ(x|z, y) estimates the probability of generating x given z and the categorical label y. Finally, the classifier p_φ(y|x) predicts the corresponding label y of x. We model both the encoder and the decoder as multilayer perceptrons (MLPs) and employ the PCNN model as the classifier in SemiVAE.

For the case of labeled data (x_l, y_l), the evidence lower bound is:

$$ \log p_\psi(x_l, y_l) \ge \mathbb{E}_{p_\varphi(z|x_l, y_l)}\big[\log p_\psi(x_l|y_l, z)\big] + \log p_\psi(y_l) - D_{KL}\big(p_\varphi(z|x_l, y_l) \,\|\, p(z)\big) = -\mathcal{L}(x_l, y_l) \qquad (6) $$

where the first term is the expectation of the conditional log-likelihood over the latent variable z, and the last term is the Kullback-Leibler divergence between the prior distribution p(z) and the latent posterior p_ϕ(z|x_l, y_l).

For the case of unlabeled data x_u, the unobserved label y_u is obtained from the classifier in the inference model. The variational lower bound is:

$$ \log p_\psi(x_u) \ge \sum_{y} p_\phi(y_u|x_u)\big(-\mathcal{L}(x_u, y_u)\big) + \mathcal{H}\big(p_\phi(y_u|x_u)\big) = -\mathcal{U}(x_u) \qquad (7) $$

where H denotes the entropy of p_φ(y_u|x_u).

Since the classifier p_φ(y|x) cannot learn directly from the labeled data through the bounds above, an explicit classification loss is introduced:

$$ \mathcal{C} = \mathbb{E}_{(x,y)\in D_l}\big[-\log p_\phi(y|x)\big] \qquad (8) $$

To maximize the evidence lower bounds of both the labeled and the unlabeled data and to minimize the classification loss, the objective is defined as:

$$ J = \sum_{(x,y)\in D_l} \mathcal{L}(x, y) + \sum_{x\in D_u} \mathcal{U}(x) + \beta \mathcal{C} \qquad (9) $$

where D_l and D_u are the labeled and unlabeled data respectively, and β is a factor that scales the classification loss on the labeled data.
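To make the bookkeeping of Eqs. (6)-(9) explicit, here is a schematic PyTorch sketch. The Gaussian encoder/decoder interfaces, the squared-error reconstruction term, and a classifier returning logits are all assumptions of this sketch rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def neg_elbo(encoder, decoder, x, y_onehot):
    """Per-example L(x, y) of Eq. (6): reconstruction plus KL to the prior."""
    mu, logvar = encoder(x, y_onehot)                      # Gaussian posterior over z
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterized sample
    rec = ((decoder(z, y_onehot) - x) ** 2).sum(dim=1)     # -log p(x|z,y) up to a constant
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)
    return rec + kl                                        # shape (B,)

def unlabeled_bound(encoder, decoder, classifier, x, n_rel):
    """Per-example U(x) of Eq. (7): marginalize the label, subtract the entropy."""
    probs = F.softmax(classifier(x), dim=1)                # p_phi(y|x)
    total = torch.zeros(x.size(0))
    for y in range(n_rel):
        y1 = F.one_hot(torch.full((x.size(0),), y), n_rel).float()
        total = total + probs[:, y] * neg_elbo(encoder, decoder, x, y1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)
    return total - entropy

def objective(encoder, decoder, classifier, xl, yl, xu, n_rel, beta=2.0):
    """The joint loss J of Eq. (9), minimized over all module parameters."""
    ce = F.cross_entropy(classifier(xl), yl, reduction="sum")             # C of Eq. (8)
    l = neg_elbo(encoder, decoder, xl, F.one_hot(yl, n_rel).float()).sum()
    u = unlabeled_bound(encoder, decoder, classifier, xu, n_rel).sum()
    return l + u + beta * ce
```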


Algorithm 1 Reinforcement Learning Algorithm for Instance Discriminator.

Input: Original dataset D_POS = {(x_i, y_i)}_{i=1}^{n}. (For clarity, we show the training procedure of PosAgent; NegAgent is trained in the same way.)
Output: Labeled data D_l, unlabeled data D_u.
1: Initialize the parameters of the policy network as θ
2: for step t = 1 → T do
3:     Partition D_POS into minibatches of size bs
4:     for minibatch B ⊂ D_POS do
5:         B = {(x_j, y_j)}_{j=1}^{bs}
6:         Sample an action for each sentence in B: a_j ∼ π(a_j|s_j; θ)
7:         if a_j == 0 then
8:             Add x_j to D_u
9:         else
10:            Add (x_j, y_j) to D_l
11:        end if
12:        Calculate the reward r by Eq. (2)
13:        Update θ by Eq. (3)
14:    end for
15: end for

Algorithm 2 Semi-supervised Learning Algorithm for Relation Classifier.

Input: Labeled data D_l, unlabeled data D_u.
1: Initialize the parameters of the relation classifier as φ
2: for epoch i = 1 → N do
3:     Sample m data pairs (x_l, y_l) from D_l
4:     Sample m data x_u from D_u and predict their unobserved labels y_u via p_φ(y|x)
5:     Update φ by Eq. (9)
6: end for

After the reinforcement learning process, we obtain an instance discriminator capable of recognizing incorrectly labeled instances in the noisy dataset, and the entire DS dataset D is split into labeled data D_l and unlabeled data D_u. We then use these data to train the SemiVAE model and obtain a robust relation classifier, which learns explicitly from the correctly labeled data and implicitly corrects the incorrectly labeled data. The training procedure for the relation classifier is summarized in Algorithm 2.
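Putting the two stages together, the overall training flow can be summarized as the sketch below; every name here (split, train, classifier) is an illustrative placeholder rather than the authors' API.

```python
def rcend_pipeline(d_pos, d_na, pos_agent, neg_agent, semivae):
    # Step 1: the RL discriminator partitions each noisy subset (Algorithm 1).
    tp, fp = pos_agent.split(d_pos)          # true / false positives
    tn, fn = neg_agent.split(d_na)           # true / false negatives
    d_l = tp + tn                            # correctly labeled: keep the DS labels
    d_u = [x for x, _ in fp + fn]            # incorrectly labeled: treat as unlabeled
    # Step 2: semi-supervised training of the relation classifier (Algorithm 2).
    semivae.train(labeled=d_l, unlabeled=d_u)
    return semivae.classifier
```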

4 Experiment

4.1 Datasets and Evaluation

We evaluate our model on a widely used dataset generated by aligning entity pairs from Freebase with the New York Times (NYT) corpus² and developed by Riedel et al. (2010). Entity mentions are recognized by the Stanford named entity recognizer (Finkel et al., 2005). The relation facts in Freebase are divided into two parts, for training and testing respectively. Sentences from the corpus of the years 2005-2006 are used as training instances, and sentences from 2007 are used as testing instances. There are 52 positive relations and a special relation NA.

Table 2: Hyperparameter settings

Batch size bs: 160
Word dimension d_w: 50
Position dimension d_p: 5 × 2
Convolution filter dimension d_c: 230
Convolution window size l: 3
Latent variable dimension d_z: 100
Dropout p: 0.5
Regulators λ, β: 100, 2

²http://iesl.cs.umass.edu/riedel/ecml/

Following previous works, we evaluate our model with held-out evaluation, which compares the relation facts extracted from the test corpus with those in Freebase. We adopt aggregated precision/recall curves and precision@N (P@N) to illustrate the performance of our model.
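Both metrics can be computed from confidence-ranked predictions, as in this small sketch, where preds is an assumed list of (score, is_correct) pairs over extracted facts:

```python
def precision_at_n(preds, n):
    """P@N: precision among the n most confident extracted facts."""
    top = sorted(preds, key=lambda p: p[0], reverse=True)[:n]
    return sum(correct for _, correct in top) / n

def pr_curve(preds, total_facts):
    """Points of the aggregated precision-recall curve over the ranked list."""
    ranked = sorted(preds, key=lambda p: p[0], reverse=True)
    hits, points = 0, []
    for i, (_, correct) in enumerate(ranked, 1):
        hits += correct
        points.append((hits / total_facts, hits / i))   # (recall, precision)
    return points
```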

4.2 Parameter Settings

We adopt the Adam optimizer (Kingma and Ba, 2014) to optimize our instance discriminator and relation classifier, with learning rates of 0.0001 and 0.001 respectively. We also apply dropout to prevent overfitting. More detailed hyperparameter settings are presented in Table 2.
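A minimal setup matching these settings might look as follows; the placeholder network shapes are illustrative only.

```python
import torch
import torch.nn as nn

discriminator = nn.Linear(740, 2)   # placeholder policy network
classifier = nn.Linear(690, 53)     # placeholder relation classifier

# Adam with the stated learning rates; dropout p = 0.5 as in Table 2.
disc_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
clf_opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
dropout = nn.Dropout(p=0.5)
```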

4.3 Overall Evaluation Results

We compare our model with the following baselines:

• Mintz (Mintz et al., 2009) is the original distantly supervised model. MultiR (Hoffmann et al., 2011) and MIML (Surdeanu et al., 2012) handle the overlapping relation problem with graphical models in the multi-instance and multi-instance multi-label frameworks. All of the above models are based on handcrafted features.

• PCNN+ONE (Zeng et al., 2015) and PCNN+ATT (Lin et al., 2016) are both robust models for the noisy labeling problem, based on the at-least-one assumption and selective attention respectively. PCNN+HATT (Han et al., 2018b) is an attention-based method which employs hierarchical attention to exploit correlations among relations.

• PCNN+ONE+SL and PCNN+ATT+SL (Liu et al., 2017) use a soft-label method to alleviate the negative impact of the noisy labeling problem.

[Figure 3: Precision-recall curves of our model and the baselines (RCEND, PCNN+HATT, PCNN+ATT+SL, PCNN+ONE+SL, PCNN+ATT, PCNN+ONE, MIMLRE, MultiR, Mintz).]

Table 3: Top-N precision (P@N) of our model and the baselines

P@N         | 100  | 200  | 300  | Mean
PCNN+ONE    | 72.3 | 69.7 | 64.1 | 68.7
PCNN+ATT    | 76.2 | 73.1 | 67.4 | 72.2
PCNN+ONE+SL | 84.0 | 81.0 | 74.0 | 79.7
PCNN+ATT+SL | 87.0 | 84.5 | 77.0 | 82.8
PCNN+HATT   | 88.0 | 79.5 | 75.3 | 80.9
RCEND       | 95.0 | 87.5 | 84.4 | 88.9

We compare our model with the aforementioned baselines; the results are shown in Figure 3. From the overall results we can see that: (1) All feature-based models perform poorly, as their features are derived from NLP tools whose errors propagate through the model. (2) PCNN+ONE and PCNN+ATT boost performance because they reduce noise within each entity-pair bag, either by selecting the most confident sentence or by de-emphasizing the incorrectly labeled sentences with an attention mechanism. (3) When PCNN+ONE and PCNN+ATT use soft labels, they achieve a further improvement, which indicates that correcting noisy labels is helpful for relation classification in the MIL scheme. (4) PCNN+HATT further enhances performance, as it incorporates hierarchical information of relations to improve the attention mechanism. (5) Our method RCEND achieves the best precision over the entire recall range compared with all baselines. The performance improves further when we regard the incorrectly labeled sentences as unlabeled data and adopt a semi-supervised learning method to train our model, which shows that exploiting noisy data with our method is beneficial for distant supervision relation classification.

We also report Precision@N (N = 100, 200, 300) in Table 3. Our method outperforms the baselines on the precision of the top-N extracted triples.

[Figure 4: Precision-recall curves of our model with different settings (RCEND, RCEND w/o Semi, PCNN+HATT).]

Table 4: Top-N precision (P@N) of our model with different settings

P@N            | 100  | 200  | 300  | Mean
RCEND          | 95.0 | 87.5 | 84.4 | 88.9
RCEND w/o Semi | 90.0 | 84.6 | 79.1 | 84.6
RCEND(P)       | 87.1 | 82.1 | 80.1 | 83.3
RCEND(N)       | 89.1 | 85.1 | 81.1 | 85.1

4.4 Impact of Unlabeled Data

To further verify the impact of the unlabeled data, we conduct experiments with and without its utilization; the results are presented in Figure 4. Note that the method RCEND w/o Semi is similar to the method proposed by Feng et al. (2018), which only removes the incorrectly labeled sentences and does not further utilize them. It achieves higher precision over the entire range of recall compared to PCNN+HATT, the best noise-tolerant method in the MIL scheme, which shows that removing noise is better than handling it with soft attention weights. However, it is still unable to surpass our method: in Table 4, our method shows a notable improvement over RCEND w/o Semi. This demonstrates that fully utilizing noisy data is more advantageous than merely removing it, which can be partially explained by the label rectification of the incorrectly labeled data during semi-supervised learning with the correctly labeled data, improving generalization performance.

Table 5: Examples for the case study. The first three sentences are examples of the false negative case and the last three are examples of the false positive case.

Type | Sentence | Predict | DS
FN | C1: [Oliver O'Grady] is now a silver-haired, twinkly-eyed resident of [Ireland], where Ms. Berg often films him in parks... | Nationality | NA
FN | C2: ...said [John Allison], editor of [Opera] magazine, based in London. | EmployedBy | NA
FN | C3: [Jean-Pierre Bacri] is a famous writer, who is too self-centered to care about his lonely, overweight, 20-year-old daughter, [Lolita Marilou Berry]... | ChildrenOf | NA
FP | C4: They wanted to interview [Bill Cosby] after they met with a former Temple University employee who has accused him of groping her in his home in suburban [Philadelphia]. | LivedIn | BornIn
FP | C5: "Without the fog, [London] wouldn't be a beautiful city," the French painter Claude Monet wrote to his wife, Alice, during one of his long visits to [England] from France. | NA | Capital
FP | C6: MTV Goes to Africa: MTV opened its first local music channel in Africa this week, a step touted by the singer [Lebo Mathosa], above, at an MTV event in [Johannesburg]. | NA | DieIn

4.5 Impact of False Positives and False Negatives

The goal of this experiment is to inspect whether the relation classifier benefits more from the utilization of false negatives or from that of false positives. As depicted in Figure 5, RCEND(P) only recognizes the false positive sentences in D_POS via PosAgent and regards them as unlabeled data. Likewise, RCEND(N) only discovers and utilizes false negative sentences. RCEND(P) and RCEND(N) behave similarly, and further improvement is achieved when both false-positive and false-negative sentences are utilized, which implies that both are important and promote the ability of our relation classifier. The results in Table 4 also show that utilizing false negative data performs slightly better than utilizing false positives, since false negative data may be predicted as a positive relation, increasing the samples of the target relation and allowing a more accurate decision boundary to be learned.

[Figure 5: Precision-recall curves of our model with different settings (RCEND, RCEND(P), RCEND(N), PCNN+HATT).]

4.6 Case Study

We sample some examples of incorrectly labeled data that are regarded as unlabeled data during training. As shown in Table 5, our discriminator recognizes both false positive and false negative instances. For example, though the fact (John Allison, EmployedBy, Opera) is absent from the KB due to its incompleteness, C2 expresses the EmployedBy relation and provides evidence for the target relation. Additionally, C4 is mislabeled as BornIn due to the relational fact (Bill Cosby, BornIn, Philadelphia), even though it mentions the LivedIn relation. Furthermore, all of these examples are eventually predicted correctly by our relation classifier, which shows that our model indeed captures the valid information in noisy data and exploits it to enhance its ability.

5 Conclusion

In this paper, we proposed RCEND to fully exploit the valid information in noisy data for distant supervision relation classification. The instance discriminator is trained with reinforcement learning and aims to recognize the instances mislabeled by distant supervision. We treat the correctly labeled instances as labeled data and the incorrectly labeled ones as unlabeled data. Afterward, we adopt a semi-supervised learning method to learn a robust relation classifier that utilizes both. In this way, our model not only reduces the side effect of noisy labels but also takes full advantage of the valid information contained in noisy data. Experiments demonstrate that our model outperforms state-of-the-art baselines.

Acknowledgments

We would like to express our gratitude to Robert Ridley and the anonymous reviewers for their valuable feedback on the paper. This work is supported by the National Natural Science Foundation of China (No. 61672277, U1836221) and the Jiangsu Provincial Research Foundation for Basic Research (No. BK20170074).

References

Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. 2016. An actor-critic algorithm for sequence prediction. Computing Research Repository, arXiv:1607.07086.

Razvan Bunescu and Raymond Mooney. 2005. A shortest path dependency kernel for relation extraction. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing.

Jun Feng, Minlie Huang, Li Zhao, Yang Yang, and Xiaoyan Zhu. 2018. Reinforcement learning for relation classification from noisy data. In Proceedings of the AAAI Conference on Artificial Intelligence.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 363–370. Association for Computational Linguistics.

Xu Han, Zhiyuan Liu, and Maosong Sun. 2018a. Neural knowledge acquisition via mutual attention between knowledge graph and text. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4832–4839.

Xu Han, Pengfei Yu, Zhiyuan Liu, Maosong Sun, and Peng Li. 2018b. Hierarchical relation extraction with coarse-to-fine grained attention. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 2236–2245. Association for Computational Linguistics.

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 541–550. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. Computing Research Repository, arXiv:1412.6980.

Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. 2014. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589.

Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 2124–2133. Association for Computational Linguistics.

Tianyu Liu, Kexiang Wang, Baobao Chang, and Zhifang Sui. 2017. A soft-label method for noise-tolerant distantly supervised relation extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1790–1795. Association for Computational Linguistics.

Bingfeng Luo, Yansong Feng, Zheng Wang, Zhanxing Zhu, Songfang Huang, Rui Yan, and Dongyan Zhao. 2017. Learning with noise: Enhance distantly supervised relation extraction with dynamic transition matrix. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 430–439. Association for Computational Linguistics.

Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 1003–1011. Association for Computational Linguistics.

Pengda Qin, Weiran Xu, and William Yang Wang. 2018a. DSGAN: Generative adversarial training for distant supervision relation extraction. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 496–505. Association for Computational Linguistics.

Pengda Qin, Weiran Xu, and William Yang Wang. 2018b. Robust distant supervision relation extraction via deep reinforcement learning. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 2137–2147. Association for Computational Linguistics.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 148–163.

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 455–465. Association for Computational Linguistics.

Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. 1999. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256.

Kun Xu, Siva Reddy, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2016. Question answering on Freebase via relation extraction and textual evidence. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Wei Xu, Raphael Hoffmann, Le Zhao, and Ralph Grishman. 2013. Filling knowledge base gaps for distant supervision of relation extraction. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 665–670. Association for Computational Linguistics.

Weidi Xu, Haoze Sun, Chao Deng, and Ying Tan. 2017. Variational autoencoder for semi-supervised text classification. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 3358–3364.

Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 2002. Kernel methods for relation extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1753–1762. Association for Computational Linguistics.

Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of the International Conference on Computational Linguistics, pages 2335–2344.

GuoDong Zhou, Jian Su, Jie Zhang, and Min Zhang. 2005. Exploring various knowledge in relation extraction. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 427–434. Association for Computational Linguistics.

