
Proceedings of the 3rd Workshop on Structured Prediction for NLP, pages 18–28, Minneapolis, Minnesota, June 7, 2019. ©2019 Association for Computational Linguistics


Lightly-supervised Representation Learning with Global Interpretability

Andrew Zupon, Maria Alexeeva, Marco A. Valenzuela-Escárcega, Ajay Nagesh, and Mihai Surdeanu

University of Arizona, Tucson, AZ, USA
{zupon, alexeeva, marcov, ajaynagesh, msurdeanu}@email.arizona.edu

Abstract

We propose a lightly-supervised approach for information extraction, in particular named entity classification, which combines the benefits of traditional bootstrapping, i.e., use of limited annotations and interpretability of extraction patterns, with the robust learning approaches proposed in representation learning. Our algorithm iteratively learns custom embeddings for both the multi-word entities to be extracted and the patterns that match them from a few example entities per category. We demonstrate that this representation-based approach outperforms three other state-of-the-art bootstrapping approaches on two datasets: CoNLL-2003 and OntoNotes. Additionally, using these embeddings, our approach outputs a globally-interpretable model consisting of a decision list, by ranking patterns based on their proximity to the average entity embedding in a given class. We show that this interpretable model performs close to our complete bootstrapping model, proving that representation learning can be used to produce interpretable models with small loss in performance. This decision list can be edited by human experts to mitigate some of that loss and in some cases outperform the original model.

1 Introduction

One strategy for mitigating the cost of supervised learning in information extraction (IE) is to bootstrap extractors with light supervision from a few provided examples (or seeds). Traditionally, bootstrapping approaches iterate between learning extraction patterns such as word n-grams, e.g., the pattern “@ENTITY , former president” could be used to extract person names,1 and applying these patterns to extract the desired structures (entities, relations, etc.) (Carlson et al., 2010; Gupta and Manning, 2014, 2015, inter alia). One advantage of this direction is that these patterns are interpretable, which mitigates the maintenance cost associated with machine learning systems (Sculley et al., 2014).

1 In this work we use surface patterns, but the proposed algorithm is agnostic to the types of patterns learned.

On the other hand, representation learning has proven to be useful for natural language processing (NLP) applications (Mikolov et al., 2013; Riedel et al., 2013; Toutanova et al., 2015, 2016, inter alia). Representation learning approaches often include a component that is trained in an unsupervised manner, e.g., predicting words based on their context from large amounts of data, mitigating the brittle statistics affecting traditional bootstrapping approaches. However, the resulting real-valued embedding vectors are hard to interpret.

Here we argue that these two directions are complementary, and should be combined. We propose such a bootstrapping approach for information extraction (IE), which blends the advantages of both directions. As a use case, we instantiate our idea for named entity classification (NEC), i.e., classifying a given set of unknown entities into a predefined set of categories (Collins and Singer, 1999). The contributions of this work are:

(1) We propose an approach for bootstrapping NEC that iteratively learns custom embeddings for both the multi-word entities to be extracted and the patterns that match them from a few example entities per category. Our approach changes the objective function of a neural network language model (NNLM) to include a semi-supervised component that models the known examples, i.e., by attracting entities and patterns in the same category to each other and repelling them from elements in different categories, and it adds an external iterative process that “cautiously” augments the pools of known examples (Collins and Singer, 1999). In other words, our contribution is an example of combining representation learning and bootstrapping.

(2) We demonstrate that our representation learning approach is suitable for semi-supervised NEC. We compare our approach against several state-of-the-art semi-supervised approaches on two datasets: CoNLL-2003 (Tjong Kim Sang and De Meulder, 2003) and OntoNotes (Pradhan et al., 2013). We show that, despite its simplicity, our method outperforms all other approaches.

(3) Our approach also outputs an interpretation of the learned model, consisting of a decision list of patterns, where each pattern gets a score per class based on the proximity of its embedding to the average entity embedding in the given class. This interpretation is global, i.e., it explains the entire model rather than local predictions. We show that this decision-list model performs comparably to the complete model on the two datasets.

(4) We also demonstrate that the resulting system can be understood, debugged, and maintained by non-machine learning experts. We compare the decision-list model edited by human domain experts with the unedited decision-list model and see a modest improvement in overall performance, with some categories getting a bigger boost. This improvement shows that, for non-ambiguous categories that are well-defined by the local contexts captured by our patterns, these patterns truly are interpretable to end users.

2 Related Work

Bootstrapping is an iterative process that alternates between learning representative patterns, and acquiring new entities (or relations) belonging to a given category (Riloff, 1996; McIntosh, 2010). Patterns and extractions are ranked using either formulas that measure their frequency and association with a category, or classifiers, which increases robustness due to regularization (Carlson et al., 2010; Gupta and Manning, 2015). While semi-supervised learning is not novel (Yarowsky, 1995; Gupta and Manning, 2014), our approach performs better than some modern implementations of these methods, such as Gupta and Manning (2014).

Distributed representations of words (Deerwester et al., 1990; Mikolov et al., 2013; Levy and Goldberg, 2014) serve as underlying representation for many NLP tasks such as information extraction and question answering (Riedel et al., 2013; Toutanova et al., 2015, 2016; Sharp et al., 2016). Mrkšić et al. (2017) build on traditional distributional models by incorporating synonymy and antonymy relations as supervision to fine-tune word vector spaces, using an Attract/Repel method similar to our idea. However, most of these works that customize embeddings for a specific task rely on some form of supervision. In contrast, our approach is lightly supervised, with only a few seed examples per category. Batista et al. (2015) perform bootstrapping for relation extraction using pre-trained word embeddings. They do not learn custom pattern embeddings that apply to multi-word entities and patterns. We show that customizing embeddings for the learned patterns is important for interpretability.

Recent work has focused on explanations of machine learning models that are model-agnostic but local, i.e., they interpret individual model predictions (Ribeiro et al., 2018, 2016a). In contrast, our work produces a global interpretation, which explains the entire extraction model rather than individual decisions.

Lastly, our work addresses the interpretability aspect of information extraction methods. Interpretable models mitigate the technical debt of machine learning (Sculley et al., 2014). For example, they allow domain experts to make manual, gradual improvements to the models. This is why rule-based approaches are commonly used in industry applications, where software maintenance is crucial (Chiticariu et al., 2013). Furthermore, the need for interpretability also arises in critical systems, e.g., recommending treatment to patients, where these systems are deployed to aid human decision makers (Lakkaraju and Rudin, 2016). The benefits of interpretability have encouraged efforts to either extract interpretable models from opaque ones (Craven and Shavlik, 1996), or to explain their decisions (Ribeiro et al., 2016b).

As machine learning models are becoming more complex, the focus on interpretability has become more important, with new funding programs focused on this topic.2 Our approach for exporting an interpretable model (§3) is similar to Valenzuela-Escárcega et al. (2016), but we start from distributed representations, whereas they started from a logistic regression model with explicit features.

2 DARPA’s Explainable AI program: http://www.darpa.mil/program/explainable-artificial-intelligence.


3 Approach

Bootstrapping with representation learning

Our algorithm iteratively grows a pool of multi-word entities (entPoolc) and n-gram patterns (patPoolc) for each category of interest c, and learns custom embeddings for both, which we will show are crucial for both performance and interpretability.

The entity pools are initialized with a few seed examples (seedsc) for each category. For example, in our experiments we initialize the pool for a person names category with 10 names such as Mother Teresa. Then the algorithm iteratively applies the following three steps for T epochs:

(1) Learning custom embeddings: The algorithm learns custom embeddings for all entities and patterns in the dataset, using the current entPoolcs as supervision. This is a key contribution, and is detailed in the second part of this section.

(2) Pattern promotion: We generate the patterns that match the entities in each pool entPoolc, rank those patterns using point-wise mutual information (PMI) with the corresponding category, and select the top ranked patterns for promotion to the corresponding pattern pool patPoolc (a PMI-ranking sketch follows this list). In this work, we use surface patterns consisting of up to 4 words before/after the entity of interest, e.g., the pattern “@ENTITY , former president” matches any entity followed by the three tokens ,, former, and president. However, our method is agnostic to the types of patterns learned, and can be trivially adapted to other types of patterns, e.g., over syntactic dependency paths.

(3) Entity promotion: Entities are promoted to entPoolc using a multi-class classifier that estimates the likelihood of an entity belonging to each class (Gupta and Manning, 2015). Our feature set includes, for each category c: (a) edit distance over characters between the candidate entity e and current ecs ∈ entPoolc, (b) the PMI (with c) of the patterns in patPoolc that matched e in the training documents, and (c) similarity between e and ecs in a semantic space. For the latter feature group, we use the set of embedding vectors learned in step (1). These features are taken from Gupta and Manning (2015). We use these vectors to compute the cosine similarity score of a given candidate entity e to the entities in entPoolc, and add the average and maximum similarities as features (see the second sketch after this list). The top 10 entities classified with the highest confidence for each class are promoted to the corresponding entPoolc after each epoch.
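The two minimal sketches below illustrate steps (2) and (3). They are assumptions for illustration, not the authors' code: the toy counts, the helper names (pmi, semantic_features), and the 15-dimensional vectors are invented here.

```python
import math
from collections import Counter

def pmi(joint, pat_total, cat_total, total):
    """Point-wise mutual information between a pattern and a category,
    estimated from co-occurrence counts."""
    return math.log((joint / total) / ((pat_total / total) * (cat_total / total)))

# Toy (pattern, category) match counts standing in for corpus statistics.
matches = Counter({
    ("@ENTITY , former president", "PER"): 40,
    ("@ENTITY , former president", "ORG"): 2,
    ("based in @ENTITY", "LOC"): 30,
})
total = sum(matches.values())
pat_counts, cat_counts = Counter(), Counter()
for (p, c), n in matches.items():
    pat_counts[p] += n
    cat_counts[c] += n

# Rank patterns per category by PMI; the top-ranked ones go into patPool_c.
ranked = sorted(((pmi(n, pat_counts[p], cat_counts[c], total), p, c)
                 for (p, c), n in matches.items()), reverse=True)
top_k = ranked[:10]
```

The semantic-similarity features of step (3) reduce to average and maximum cosine similarity against the current pool:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def semantic_features(candidate, pool):
    """Average and maximum cosine similarity between a candidate entity
    embedding and the embeddings of entities already in entPool_c."""
    sims = [cosine(candidate, e) for e in pool]
    return {"avg_sim": float(np.mean(sims)), "max_sim": float(np.max(sims))}

rng = np.random.default_rng(0)
pool = [rng.normal(size=15) for _ in range(10)]  # 15d embeddings, as in Sec. 4
print(semantic_features(rng.normal(size=15), pool))
```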

Learning custom embeddings

We train our embeddings for both entities and patterns by maximizing the objective function J:

J = SG + Attract + Repel (1)

where SG, Attract, and Repel are individual components of the objective function designed to model both the unsupervised, language-model part of the task as well as the light supervision coming from the seed examples, as detailed below. A similar approach is proposed by Mrkšić et al. (2017), who use an objective function modified with Attract and Repel components to fine-tune word embeddings with synonym and antonym pairs.

The SG term is formulated identically to the original objective function of the Skip-Gram model of Mikolov et al. (2013), but, crucially, adapted to operate over multi-word entities and contexts consisting not of bags of context words, but of the patterns that match each entity. Thus, intuitively, our SG term encourages the embeddings of entities to be similar to the embeddings of the patterns matching them:

SG = \sum_{e} \Big[ \log(\sigma(V_e^\top V_{pp})) + \sum_{np} \log(\sigma(-V_e^\top V_{np})) \Big] \quad (2)

where e represents an entity, pp represents a positive pattern, i.e., a pattern that matches entity e in the training texts, np represents a negative pattern, i.e., it has not been seen with this entity, and σ is the sigmoid function. Intuitively, this component forces the embeddings of entities to be similar to the embeddings of the patterns that match them, and dissimilar to the negative patterns.

The second component, Attract, encourages entities or patterns in the same pool to be close to each other. For example, if we have two entities in the pool known to be person names, they should be close to each other in the embedding space:

Attract = \sum_{P} \sum_{x_1, x_2 \in P} \log(\sigma(V_{x_1}^\top V_{x_2})) \quad (3)

where P is the entity/pattern pool for a category, and x_1, x_2 are entities/patterns in said pool.


Lastly, the third term, Repel, encourages that the pools be mutually exclusive, which is a soft version of the counter training approach of Yangarber (2003) or the weighted mutual-exclusive bootstrapping algorithm of McIntosh and Curran (2008). For example, person names should be far from organization names in the semantic embedding space:

Repel = \sum_{P_1, P_2 : P_1 \neq P_2} \sum_{x_1 \in P_1} \sum_{x_2 \in P_2} \log(\sigma(-V_{x_1}^\top V_{x_2})) \quad (4)

where P_1, P_2 are different pools, and x_1 and x_2 are entities/patterns in P_1 and P_2, respectively.
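To make the three components concrete, here is a minimal numpy sketch of how J decomposes for a single entity and two small pools. It is an illustration under stated assumptions, not the authors' implementation: real training samples negative patterns from the corpus and updates the vectors by gradient ascent on J, both omitted here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sg_term(v_e, v_pp, neg_patterns):
    """Eq. (2) for one entity: one positive pattern plus sampled negatives."""
    total = np.log(sigmoid(v_e @ v_pp))
    total += sum(np.log(sigmoid(-v_e @ v_np)) for v_np in neg_patterns)
    return total

def attract_term(pool):
    """Eq. (3): pull embeddings within the same pool together."""
    return sum(np.log(sigmoid(u @ v))
               for i, u in enumerate(pool) for v in pool[i + 1:])

def repel_term(pool_a, pool_b):
    """Eq. (4): push embeddings in different pools apart."""
    return sum(np.log(sigmoid(-u @ v)) for u in pool_a for v in pool_b)

rng = np.random.default_rng(0)
per = [rng.normal(size=15) for _ in range(3)]  # toy PER pool vectors
org = [rng.normal(size=15) for _ in range(3)]  # toy ORG pool vectors
J = (sg_term(per[0], per[1], [org[0]])         # Eq. (1): J = SG + Attract + Repel
     + attract_term(per) + attract_term(org)
     + repel_term(per, org))
```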

We term the complete algorithm that learns and uses custom embeddings Emboot (Embeddings for bootstrapping), and the stripped-down version without them EPB (Explicit Pattern-based Bootstrapping). EPB is similar to Gupta and Manning (2015); the main difference is that we use pre-trained embeddings in the entity promotion classifier rather than Brown clusters. In other words, EPB relies on pretrained embeddings for both patterns and entities rather than the custom ones that Emboot learns.3

Interpretable model

In addition to its output (entPoolcs), Emboot produces custom entity and pattern embeddings that can be used to construct a decision-list model, which provides a global, deterministic interpretation of what Emboot learned.

This interpretable model is constructed as follows. First, we produce an average embedding per category by averaging the embeddings of the entities in each entPoolc. Second, we estimate the cosine similarity between each of the pattern embeddings and these category embeddings, and convert them to a probability distribution using a softmax function; probc(p) is the resulting probability of pattern p for class c.
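A sketch of this construction, under the assumption that embeddings are stored as numpy arrays (the function names here are ours, not the paper's):

```python
import numpy as np

def category_embedding(entity_vecs):
    """Average the embeddings of the entities in entPool_c."""
    return np.mean(entity_vecs, axis=0)

def pattern_distribution(pattern_vec, category_vecs):
    """Cosine similarity of one pattern to every category embedding,
    turned into prob_c(p) with a softmax."""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    sims = np.array([cos(pattern_vec, c) for c in category_vecs])
    exp = np.exp(sims - sims.max())  # numerically stable softmax
    return exp / exp.sum()
```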

After being constructed, the interpretable model is used as follows. First, each candidate entity to be classified, e, receives a score for a given class c from all patterns in patPoolc that match it. The entity score aggregates the relevant pattern probabilities using Noisy-Or:

Score(e, c) = 1 - \prod_{\{p_c \in patPool_c \,\mid\, matches(p_c, e)\}} \big(1 - prob_c(p_c)\big) \quad (5)

3 For multi-word entities and patterns, we simply average word embeddings to generate entity and pattern embeddings for EPB.

Each entity is then assigned to the category with the highest overall score.
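For example, the Noisy-Or aggregation of Eq. (5) can be sketched as follows; the toy pool, the probabilities, and the second PER pattern are invented for illustration:

```python
def noisy_or_score(entity, category, pat_pool, prob, matches):
    """Eq. (5): combine the probabilities of all patterns in patPool_c
    that match the entity."""
    prod = 1.0
    for p in pat_pool[category]:
        if matches(p, entity):
            prod *= 1.0 - prob[(p, category)]
    return 1.0 - prod

pat_pool = {"PER": ["@ENTITY , former president", "Mother @ENTITY"]}
prob = {("@ENTITY , former president", "PER"): 0.7,
        ("Mother @ENTITY", "PER"): 0.4}
matches = lambda p, e: True  # stand-in for a corpus lookup
print(noisy_or_score("Mother Teresa", "PER", pat_pool, prob, matches))
# 1 - (1 - 0.7) * (1 - 0.4) = 0.82
```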

4 Experiments

We evaluate the above algorithms on the task of named entity classification from free text.

Datasets: We used two datasets, the CoNLL-2003 shared task dataset (Tjong Kim Sang and De Meulder, 2003), which contains 4 entity types, and the OntoNotes dataset (Pradhan et al., 2013), which contains 11.4 These datasets contain marked entity boundaries with labels for each marked entity. Here we only use the entity boundaries but not the labels of these entities during the training of our bootstrapping systems. To simulate learning from large texts, we tuned hyperparameters on development, but ran the actual experiments on the train partitions.

Baselines: In addition to the EPB algorithm, we compare against the approach proposed by Gupta and Manning (2014).5 This algorithm is a simpler version of the EPB system, where entities are promoted with a PMI-based formula rather than an entity classifier.6 Further, we compare against label propagation (LP) (Zhu and Ghahramani, 2002), with the implementation available in the scikit-learn package.7 In each bootstrapping epoch, we run LP, select the entities with the lowest entropy, and add them to their top category. Each entity is represented by a feature vector that contains the co-occurrence counts of the entity and each of the patterns that matches it in text.8
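A minimal sketch of this baseline with scikit-learn, using random toy data in place of the real entity-by-pattern count matrix; the seed layout and default kernel here are assumptions, not the paper's exact configuration:

```python
import numpy as np
from scipy.stats import entropy
from sklearn.semi_supervised import LabelPropagation

# X: entity-by-pattern co-occurrence counts; y: class index, -1 = unlabeled.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(50, 20)).astype(float)
y = np.full(50, -1)
y[:5], y[5:10] = 0, 1  # a few seed entities per category

lp = LabelPropagation().fit(X, y)
dist = lp.label_distributions_          # per-entity distribution over classes
ent = entropy(dist.T)                   # lowest entropy = most confident
promoted = np.argsort(ent)[:10]         # entities promoted to their top class
top_classes = dist[promoted].argmax(axis=1)
```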

Settings: For all baselines and proposed models, we used the same set of 10 seeds/category, which were manually chosen from the most frequent entities in the dataset.

4 We excluded numerical categories such as DATE.

5 https://nlp.stanford.edu/software/patternslearning.shtml

6 We did not run this system on the OntoNotes dataset as it uses a built-in NE classifier with a predefined set of labels which did not match the OntoNotes labels.

7 http://scikit-learn.org/stable/modules/generated/sklearn.semi_supervised.LabelPropagation.html

8 We experimented with other feature values, e.g., pattern PMI scores, but all performed worse than raw counts.


Figure 1: t-SNE visualizations of the entity embeddings at three stages during training: (a) embeddings initialized randomly, (b) bootstrapping epoch 5, and (c) bootstrapping epoch 10. Legend: LOC, ORG, PER, MISC (color-coded markers). This figure is best viewed in color.

Figure 2: t-SNE visualization of the entity embeddings learned by Emboot after training completes, with two PER clusters annotated “Mostly performers” and “Mostly politicians.” Legend: LOC, ORG, PER, MISC (color-coded markers).

For the custom embedding features, we used randomly initialized 15d embeddings. Here we consider patterns to be n-grams of size up to 4 tokens on either side of an entity. For instance, “@ENTITY , former President” is one of the patterns learned for the class person. We ran all algorithms for 20 bootstrapping epochs, and the embedding learning component for 100 epochs in each bootstrapping epoch. We add 10 entities and 10 patterns to each category during every bootstrapping epoch.
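A sketch of how such surface patterns can be generated around an entity span; the tokenization and helper name are ours, for illustration:

```python
def surface_patterns(tokens, ent_start, ent_end, max_len=4):
    """N-gram patterns of up to max_len tokens on either side of an
    entity span, with the entity itself replaced by @ENTITY."""
    patterns = []
    for n in range(1, max_len + 1):
        left = tokens[max(0, ent_start - n):ent_start]
        if left:
            patterns.append(" ".join(left + ["@ENTITY"]))
        right = tokens[ent_end:ent_end + n]
        if right:
            patterns.append(" ".join(["@ENTITY"] + right))
    return patterns

toks = "Bill Clinton , former President of the United States".split()
print(surface_patterns(toks, 0, 2))  # includes '@ENTITY , former President'
```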

5 Discussion

Qualitative Analysis

Before we discuss overall results, we provide a qualitative analysis of the learning process for Emboot on the CoNLL dataset in Figure 1. The figure shows t-SNE visualizations (van der Maaten and Hinton, 2008) of the entity embeddings at several stages of the algorithm. This visualization matches our intuition: as training advances, entities belonging to the same category are indeed grouped together. In particular, Figure 1c shows five clusters, four of which are dominated by one category (and centered around the corresponding seeds), and one, in the upper left corner, with the entities that haven't yet been added to any of the pools.

Figure 2 shows a more detailed view of the t-SNE projections of entity embeddings after Emboot's training completes. Again, this demonstrates that Emboot's semi-supervised approach clusters most entities based on their (unseen) categories. Interestingly, Emboot learned two clusters for the PER category. Upon a manual inspection of these clusters, we observed that one contains mostly performers (e.g., athletes or artists such as Stephen Ames, a professional golfer), while the other contains many politicians (e.g., Arafat and Clinton). Thus, Emboot learned correctly that, at least at the time when the CoNLL 2003 dataset was created, the context in which politicians and performers were mentioned was different. The cluster in the bottom left part of the figure contains the remaining working pool of patterns, which were not assigned to any category cluster after the training epochs.

Quantitative Analysis

A quantitative comparison of the different models on the two datasets is shown in Figure 3.

Figure 3 shows that Emboot considerably outperforms LP and Gupta and Manning (2014), and has an occasional improvement over EPB. While EPB sometimes outperforms Emboot, Emboot has the potential for manual curation of its model, which we will explore later in this section. This demonstrates the value of our approach, and the importance of custom embeddings.

Importantly, we compare Emboot against: (a) its interpretable version (Embootint), which is constructed as a decision list containing the patterns learned (and scored) after each bootstrapping epoch, and (b) an interpretable system built similarly for EPB (EPBint), using the pretrained Levy and Goldberg embeddings9 rather than our custom ones. This analysis shows that Embootint performs close to Emboot on both datasets, demonstrating that most of the benefits of representation learning are available in an interpretable model. Please see the discussion on the edited interpretable model in the next section.

Importantly, the figure also shows that EPBint, which uses generic entity embeddings rather than the custom ones learned for the task, performs considerably worse than the other approaches. This highlights the importance of learning a dedicated distributed representation for this task.

9 For multi-word entities, we averaged the embeddings of the individual words in the entity to create an overall entity embedding.

Interpretability Analysis

Is the list of patterns generated by the interpretable model actually interpretable to end users? To investigate this, we asked two linguists to curate the models learned by Embootint, by editing the list of patterns in a given model. First, the experts performed an independent analysis of all the patterns. Next, the two experts conferred with each other and came to a consensus when their independent decisions on a particular pattern disagreed. These first two stages took the experts 2 hours for the CoNLL dataset and 3 hours for the OntoNotes dataset. The experts did not have access to the original texts the patterns were pulled from, so they had to make their decisions based on the patterns alone. They made one of three decisions for each pattern: (1) no change, when the pattern is a good fit for the category; (2) changing the category, when the pattern clearly belongs to a different category; and (3) deleting the pattern, when the pattern is either not informative enough for any category or when the pattern could occur with entities from multiple categories. The experts did not have the option of modifying the content of the pattern, because each pattern is associated with an embedding learned during training. Table 1 shows several examples of the patterns and decisions made by the annotators. A summary of the changes made for the CoNLL dataset is given in Figure 4, and a summary of the changes made for the OntoNotes dataset is given in Figure 5.

As Figure 3 shows, this edited interpretable model (Embootint-edited) performs similarly to the unedited interpretable model. When we look a little deeper, the observed overall similarity between the unchanged and the edited interpretable Emboot models for both datasets depends on the specific categories and the specific patterns involved. For example, when we look at the CoNLL dataset, we observe that the edited model outperforms the unchanged model on PER entities, but performs worse than the unchanged model on ORG entities (Figure 6).


Figure 3: Overall results on the CoNLL and OntoNotes datasets (precision vs. throughput). Throughput is the number of entities classified, and precision is the proportion of entities that were classified correctly. Please see Sec. 4 for a description of the systems listed in the legend.

Pattern                        | Original Label | Decision      | Rationale
@ENTITY was the                | LOC            | delete        | the pattern is too broad
@ENTITY ) Ferrari              | LOC            | delete        | the pattern is uninformative
citizen of @ENTITY             | MISC           | change to LOC | the pattern is more likely to occur with a location
According to @ENTITY officials | ORG            | no change     | the pattern is likely to occur with an organization

Table 1: Examples of patterns and experts' decisions and rationales from the CoNLL dataset.


Figure 4: Summary of expert decisions when editing the Embootint model, for the CoNLL dataset by original category. Dark blue (bottom) is no change, medium blue (middle) is deletions, light blue (top) is change of category.

Figure 5: Summary of expert decisions when editing the Embootint model, for the OntoNotes dataset by original category. Dark blue (bottom) is no change, medium blue (middle) is deletions, light blue (top) is change of category.

We observe a similar pattern with the OntoNotes dataset, where the Embootint-edited model outperforms the Embootint model greatly for GPE but not for LAW (Figure 7). Overall, for OntoNotes, Embootint-edited outperforms Embootint for 5 categories out of 11. For the categories where Embootint-edited performs worse, the data is sparse, so few patterns are promoted (30 FAC patterns compared to 200 GPE patterns), and many of them were deleted or changed by the two linguists (14 FAC deletions and 13 FAC changes, with only 3 FAC patterns remaining).

This difference in outcome partially has to do with the amount of local vs. global information available in the patterns.

Figure 6: CoNLL results for PER and ORG (precision vs. throughput). The Embootint-edited model generally outperforms the Embootint model when it comes to PER entities, but not for ORG entities. This discrepancy seems to relate to the amount of local information available in PER patterns versus ORG patterns that can aid human domain experts in correcting the patterns. The EPBint model (incorrectly) classifies very few entities as ORG, which is why it only shows up as a single point in the bottom left of the lower plot.

For example, local patterns are common for the PER and MISC categories in CoNLL, and for the GPE category in OntoNotes, e.g., the entity “Syrian” is correctly classified as MISC (which includes demonyms) due to two patterns matching it in the CoNLL dataset: “@ENTITY President” and “@ENTITY troops”. In general, the majority of predictions are triggered by 1 or 2 patterns, which makes these decisions explainable. For the CoNLL dataset, 59% of Embootint's predictions are triggered by 1 or 2 patterns; 84% are generated by 5 or fewer patterns; only 1.1% of predictions are generated by 10 or more patterns.



Figure 7: OntoNotes results for GPE and LAW (precision vs. throughput). The Embootint-edited model greatly outperforms both the Emboot and Embootint models when it comes to GPE entities, but not for LAW entities. This discrepancy seems to relate to the amount of local information available in GPE patterns versus LAW patterns that can aid human domain experts in correcting the patterns. EPBint does not classify any entities as LAW.

On the other hand, without seeing the full source text, the experts were not able to make an accurate judgment on the validity of some patterns. For instance, while the pattern “@ENTITY and Portugal” clearly indicates a geo-political entity, the pattern “@ENTITY has been” (labeled facility originally when training on OntoNotes documents) can co-occur with entities from any category. Such patterns are common for the LAW category in OntoNotes, and for the ORG category in CoNLL (due to the focus on sport events in CoNLL, where location names are commonly used as a placeholder for the corresponding team), which partially explains the poor curation results on these categories in Figures 6 and 7. Additionally, lower performance on certain categories can be partially explained by the small amount of data in those categories and the fact that the edits made by the experts drastically changed the number of patterns that occur with some categories (see Figures 4 and 5).

6 Conclusion

This work introduced an example of representation learning being successfully combined with traditional, pattern-based bootstrapping for information extraction, in particular named entity classification. Our approach iteratively learns custom embeddings for multi-word entities and the patterns that match them, while cautiously augmenting the pools of known examples. This approach outperforms several state-of-the-art semi-supervised approaches to NEC on two datasets, CoNLL 2003 and OntoNotes.

Our approach can also export the model learned into an interpretable list of patterns, which human domain experts can use to understand why an extraction was generated. These patterns can be manually curated to improve the performance of the system by modifying the model directly, with minimal effort. For example, we used a team of two linguists to curate the model learned for OntoNotes in 3 hours. The model edited by human domain experts shows a modest improvement over the unedited model, demonstrating the usefulness of these interpretable patterns. Interestingly, the manual curation of these patterns performed better for some categories that rely mostly on local context that is captured by the type of patterns used in this work, and less well for categories that require global context that is beyond the n-gram patterns used here. This observation raises opportunities for future work, such as how to learn global context in an interpretable way, and how to adjust the amount of global information depending on the category learned.

7 Acknowledgments

We gratefully thank Yoav Goldberg for his suggestions for the manual curation experiments.

This work was supported by the Defense Advanced Research Projects Agency (DARPA) under the Big Mechanism program, grant W911NF-14-1-0395, and by the Bill and Melinda Gates Foundation HBGDki Initiative. Marco Valenzuela-Escárcega and Mihai Surdeanu declare a financial interest in lum.ai. This interest has been properly disclosed to the University of Arizona Institutional Review Committee and is managed in accordance with its conflict of interest policies.


References

David S. Batista, Bruno Martins, and Mario J. Silva. 2015. Semi-supervised bootstrapping of relationship extractors with distributional semantics. In Empirical Methods in Natural Language Processing. ACL.

Andrew Carlson, Justin Betteridge, Richard C. Wang, Estevam R. Hruschka Jr., and Tom M. Mitchell. 2010. Coupled semi-supervised learning for information extraction. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, pages 101–110. ACM.

Laura Chiticariu, Yunyao Li, and Frederick R. Reiss. 2013. Rule-based information extraction is dead! Long live rule-based information extraction systems! In EMNLP, October, pages 827–832.

Michael Collins and Yoram Singer. 1999. Unsupervised models for named entity classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Mark W. Craven and Jude W. Shavlik. 1996. Extracting tree-structured representations of trained networks. Advances in Neural Information Processing Systems, pages 24–30.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

Sonal Gupta and Christopher D. Manning. 2014. Improved pattern learning for bootstrapped entity extraction. In CoNLL, pages 98–108.

Sonal Gupta and Christopher D. Manning. 2015. Distributed representations of words to guide bootstrapped entity classifiers. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics.

Himabindu Lakkaraju and Cynthia Rudin. 2016. Learning cost-effective treatment regimes using Markov decision processes. CoRR, abs/1610.06972.

Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In ACL (2), pages 302–308.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. The Journal of Machine Learning Research, 9:2579–2605.

Tara McIntosh. 2010. Unsupervised discovery of negative categories in lexicon bootstrapping. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 356–365. Association for Computational Linguistics.

Tara McIntosh and James R. Curran. 2008. Weighted mutual exclusion bootstrapping for domain independent lexicon and template acquisition. In Proceedings of the Australasian Language Technology Association Workshop, volume 2008.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Nikola Mrkšić, Ivan Vulić, Diarmuid Ó Séaghdha, Ira Leviant, Roi Reichart, Milica Gašić, Anna Korhonen, and Steve Young. 2017. Semantic specialization of distributional word vector spaces using monolingual and cross-lingual constraints. Transactions of the Association for Computational Linguistics, 5:309–324.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using OntoNotes. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 143–152, Sofia, Bulgaria. Association for Computational Linguistics.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016a. “Why should I trust you?”: Explaining the predictions of any classifier. In Knowledge Discovery and Data Mining (KDD).

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016b. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Anchors: High-precision model-agnostic explanations. In AAAI Conference on Artificial Intelligence (AAAI).

Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In Proceedings of NAACL-HLT.

Ellen Riloff. 1996. Automatically generating extraction patterns from untagged text. In Proceedings of the National Conference on Artificial Intelligence, pages 1044–1049.

D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young. 2014. Machine learning: The high interest credit card of technical debt. In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop).

Rebecca Sharp, Mihai Surdeanu, Peter Jansen, Peter Clark, and Michael Hammond. 2016. Creating causal embeddings for question answering with minimal supervision. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of CoNLL-2003, pages 142–147. Edmonton, Canada.

Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing text for joint embedding of text and knowledge bases. In EMNLP, volume 15, pages 1499–1509. Citeseer.

Kristina Toutanova, Xi Victoria Lin, Wen-tau Yih, Hoifung Poon, and Chris Quirk. 2016. Compositional learning of embeddings for relation paths in knowledge bases and text. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1434–1444.

Marco A. Valenzuela-Escárcega, Gus Hahn-Powell, Dane Bell, and Mihai Surdeanu. 2016. SnapToGrid: From statistical to interpretable models for biomedical information extraction. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, pages 56–65.

Roman Yangarber. 2003. Counter-training in discovery of semantic patterns. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

X. Zhu and Z. Ghahramani. 2002. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University.

