
Global joint models for coreference resolution and named entity classification

Modelos juntos globales para la resolución de la correferencia y de la clasificación de las entidades nombradas

Pascal Denis
Alpage Project-Team
INRIA and Université Paris 7
30, rue Château des Rentiers
75013 Paris, France
[email protected]

Jason Baldridge
Department of Linguistics
University of Texas at Austin
1 University Station B5100
Austin, TX 78712-0198 USA
[email protected]

Resumen: En este artículo, combinamos modelos de correferencia, anaforicidad y clasificación de las entidades nombradas, como un problema de inferencia junta global utilizando la Programación Lineal Entera (ILP). Nuestras restricciones garantizan: (i) la coherencia entre las decisiones finales de los tres modelos locales, y (ii) la transitividad de las decisiones de correferencia. Este enfoque proporciona mejoras significativas en el f-score sobre los corpora ACE con las tres métricas de evaluación principales para la correferencia: MUC, B³ y CEAF. A través de ejemplos, modelos de oráculo y nuestros resultados, se muestra también que es fundamental utilizar estas tres métricas y, en particular, que no se puede confiar únicamente en la métrica MUC.
Palabras clave: Resolución de la correferencia, entidades nombradas, aprendizaje automático, Programación Lineal Entera (ILP)

Abstract: In this paper, we combine models for coreference, anaphoricity and named entity classification as a joint, global inference problem using Integer Linear Programming (ILP). Our constraints ensure: (i) coherence between the final decisions of the three local models, and (ii) transitivity of multiple coreference decisions. This approach provides significant f-score improvements on the ACE datasets for all three main coreference metrics: MUC, B³, and CEAF. Through examples, oracle models, and our results, we also show that it is fundamental to use all three of these metrics and, in particular, to never rely solely on the MUC metric.
Keywords: Coreference Resolution, Named Entities, Machine Learning, Integer Linear Programming (ILP)

1 Introduction

Coreference resolution involves imposing a partition on a set of mentions in a text; each partition corresponds to some entity in a discourse model. Early machine learning approaches for the task which rely on local, discriminative pairwise classifiers (Soon, Ng, and Lim, 2001; Ng and Cardie, 2002b; Morton, 2000; Kehler et al., 2004) made considerable progress in creating robust coreference systems, but their performance still left much room for improvement. This stems from two main deficiencies:

• Decision locality. Decisions are made independently of others; a separate clustering step forms chains from pairwise classifications. But coreference clearly should be conditioned on properties of an entity as a whole.

• Knowledge bottlenecks. Coreference involves many different factors, e.g., morphosyntax, discourse structure and reasoning. Yet most systems rely on small sets of shallow features. Accurately predicting such information and using it to constrain coreference is difficult, so its potential benefits often go unrealized due to error propagation.

More recent work has sought to address these limitations. For example, to address decision locality, McCallum and Wellner (2004) use conditional random fields with model structures in which pairwise decisions influence others. Denis (2007) and Klenner (2007) use integer linear programming (ILP) to perform global inference via transitivity constraints between different coreference decisions.[1] Haghighi and Klein (2007) provide a fully generative model that combines global properties of entities across documents with local attentional states. Denis and Baldridge (2008) use a ranker to compare antecedents for an anaphor simultaneously rather than in the standard pairwise manner. To address the knowledge bottleneck problem, Denis and Baldridge (2007) use ILP for joint inference using a pairwise coreference model and a model for determining the anaphoricity of mentions. Also, Denis and Baldridge (2008) and Bengtson and Roth (2008) use models and features, respectively, that attend to particular types of mentions (e.g., full noun phrases versus pronouns). Furthermore, Bengtson and Roth (2008) use a wider range of features than are normally considered, and in particular use predicted features for later classifiers, to considerably boost performance.

In this paper, we use ILP to extend the joint formulation of Denis and Baldridge (2007) with named entity classification and combine it with the transitivity constraints (Denis, 2007; Klenner, 2007). Intuitively, we should only identify antecedents for the mentions which are likely to have one (Ng and Cardie, 2002a), and we should only make a set of mentions coreferent if they are all instances of the same entity type (e.g., person or location). ILP enables such constraints to be declared between the outputs of independent classifiers to ensure coherent assignments are made. It also leads to global inference via both constraints on named entity types and transitivity constraints, since both relate multiple pairwise decisions.

We show that this strategy leads to improvements across the three main metrics proposed for coreference: the MUC metric (Vilain et al., 1995), the B³ metric (Bagga and Baldwin, 1998), and the CEAF metric (Luo, 2005). In addition, we contextualize the performance of our system with respect to cascades of multiple models and oracle systems that assume perfect information (e.g., about entity types).

[1] These were independent, simultaneous developments.

We furthermore demonstrate the inadequacy of using only the MUC metric and argue that results should always be given for all three. We also include a simple composite of the three metrics (their arithmetic mean), called MELA, for Mention, Entity, and Link Average score.[2]

2 Data and evaluation

We use the ACE corpus (Phase 2) for training and testing. The corpus has three parts: npaper, nwire, and bnews, and each is split into a train part and a devtest part. The corpus text was preprocessed with the OpenNLP Toolkit[3] (i.e., a sentence detector, a tokenizer, and a POS tagger). In our experiments, we consider only true ACE mentions instead of detecting them; our focus is on evaluating pairwise local approaches versus the global ILP approach rather than on building a full coreference resolution system.

Three primary metrics have been proposed for evaluating coreference performance: (i) the link-based MUC metric (Vilain et al., 1995), (ii) the mention-based B³ metric (Bagga and Baldwin, 1998), and (iii) the entity-based CEAF metric (Luo, 2005). All these metrics compare the set of chains S produced by a system against the true chains T, and report performance in terms of recall and precision. They differ, however, in how they compute these scores, and each embeds a different bias.

The MUC metric is the oldest and still the most commonly used. MUC operates by determining the number of links (i.e., pairs of mentions) that are common to S and T. Recall is the number of common links divided by the total number of links in T; precision is the number of common links divided by the total number of links in S. By focusing on the links, this metric has two main biases, which are now well-known (Bagga and Baldwin, 1998; Luo, 2005) but merit re-emphasis due to its continued use as the sole evaluation measure. First, it favors systems that create large chains (hence, fewer entities). For instance, a system that produces a single chain achieves 100% recall without severe degradation in precision. Second, it ignores recall for single-mention entities, since no link can be found in these; however, putting such mentions in the wrong chain does hurt precision.[4]

[2] Interestingly, MELA means "gathering" in Sanskrit, so this acronym seems appropriate.

[3] Available from opennlp.sf.net.
[4] It is worth noting that the MUC corpus for which the metric was devised does not annotate single-mention entities. However, the ACE corpus does include such entities.


T  = {m1, m3, m5}, {m2}, {m4, m6, m7}
S1 = {m1, m2, m3, m6}, {m4, m5, m7}
S2 = {m1, m2, m3, m4, m5, m6, m7}

Figure 1: Two competing partitionings for the mention set {m1, m2, m3, m4, m5, m6, m7}.

The B³ metric addresses the MUC metric's shortcomings by computing recall and precision scores for each mention m. Let S be the system chain containing m and T be the true chain containing m. The set of correct elements in S is thus |S ∩ T|. The recall score for a mention m is computed as |S ∩ T| / |T|, while the precision score for m is |S ∩ T| / |S|. Overall recall/precision is obtained by averaging over the individual mention scores. The fact that this metric is mention-based by definition solves the problem of single-mention entities. It also does not favor larger chains, since they will be penalized in the precision score of each mention.
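The per-mention computation can be sketched directly (again an illustrative implementation, assuming, as in this paper, that both partitions cover the same mention set):

```python
def b_cubed(system, truth):
    """B3 recall/precision (Bagga and Baldwin, 1998), averaged per mention."""
    sys_of = {m: c for c in system for m in c}    # mention -> system chain
    true_of = {m: c for c in truth for m in c}    # mention -> true chain
    mentions = list(true_of)
    r = sum(len(sys_of[m] & true_of[m]) / len(true_of[m]) for m in mentions)
    p = sum(len(sys_of[m] & true_of[m]) / len(sys_of[m]) for m in mentions)
    r, p = r / len(mentions), p / len(mentions)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return r, p, f

# Figure 1: b_cubed(S1, T) gives (.62, .45, .52), matching Table 1.
```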

The Constrained Entity Aligned F-Measure[5] (CEAF) aligns each system chain S with at most one true chain T. It finds the best one-to-one mapping between the sets of chains S and T, which is equivalent to finding the optimal alignment in a bipartite graph. The best mapping is that which maximizes the similarity over pairs of chains (Si, Ti), where the similarity of two chains is the number of common mentions between them. For CEAF, recall is the total similarity divided by the number of mentions in all the true chains, while precision is the total similarity divided by the number of mentions in all the system chains. Note that when true mentions are used, CEAF assigns the same recall and precision: this is because the two systems partition the same set of mentions.
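Finding the best one-to-one mapping is an instance of the assignment problem, so a sketch (ours, not Luo's original scorer) can delegate to the Hungarian algorithm in scipy:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ceaf(system, truth):
    """Mention-based CEAF (Luo, 2005): best one-to-one chain alignment,
    where chain similarity is the number of shared mentions."""
    sim = np.array([[len(s & t) for t in truth] for s in system])
    rows, cols = linear_sum_assignment(-sim)     # negate to maximize similarity
    total = sim[rows, cols].sum()
    r = total / sum(len(t) for t in truth)
    p = total / sum(len(s) for s in system)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return r, p, f

# Figure 1: the best alignment for S1 covers 2 + 2 = 4 of the 7 mentions,
# so ceaf(S1, T) gives R = P = F = 4/7 ≈ .57, matching Table 1.
```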

A simple example illustrating how the metrics operate is presented in Figure 1 (see Luo (2005) for more examples). T is the set of true chains; S1 and S2 are the partitions produced by two hypothetical resolvers. Recall, precision, and f-score for these metrics are given in Table 1.


[5] We use the mention-based CEAF measure (Luo, 2005). This is the same metric as ECM-F (Luo et al., 2004) used by Klenner (2007).

          MUC              B³            CEAF
      R    P    F      R    P    F       F
S1   .50  .40  .44    .62  .45  .52     .57
S2   1.0  .66  .79    1.0  .39  .56     .43

Table 1: Recall (R), precision (P), and f-score (F) using MUC, B³, and CEAF for the partitionings of Figure 1.

The bias of the MUC metric toward large chains is shown by the fact that it gives better recall and precision scores to S2 even though this partition is completely uninformative. More intuitively, B³ heavily penalizes the precision of this partition: precision errors are here computed for each mention. CEAF is the harshest on S2, and in fact is the only metric that prefers S1 over S2.

MUC is known for being an applicable metric when one is only interested in precision on pairwise links (Bagga and Baldwin, 1998). Given that much recent work, including the present paper, seeks to move beyond simple pairwise coreference and produce good entities, it is crucial that systems are scored on the other metrics as well as MUC. Most tellingly, our results show that both B³ and CEAF scores can show degradation even when MUC appears to show an improvement.

3 Base models

Here we define the three base classifiers for pairwise coreference, anaphoricity, and named entity classification. They form the basis for several cascades and for joint inference with ILP. Like Kehler et al. (2004) and Morton (2000), we estimate the parameters of all models using maximum entropy (Berger, Della Pietra, and Della Pietra, 1996); specifically, we use the limited memory variable metric algorithm (Malouf, 2002).[6] Gaussian priors for the models were optimized on development data.

3.1 The coreference classifier

Our coreference classifier is based on that of Soon, Ng, and Lim (2001), though the features have been extended and are similar (though not equivalent) to those used by Ng and Cardie (2002a). Features fall into three categories: (i) features of the anaphor, (ii) features of the antecedent mention, and (iii) pairwise features (e.g., the distance between the two mentions).

[6] This algorithm is implemented in the Toolkit for Advanced Discriminative Modeling (tadm.sf.net).


We omit details here for brevity (details on the different feature sets can be found in Denis (2007)); the ILP approach could be equally well applied to models using other, extended feature sets such as those discussed in Denis and Baldridge (2008) and Bengtson and Roth (2008).

Using the coreference classifier on its own involves: (i) estimating P_C(COREF|⟨i,j⟩), the probability of a coreferential outcome given a pair of mentions ⟨i,j⟩, and (ii) applying a selection algorithm that picks one or more mentions out of the candidates for which P_C(COREF|⟨i,j⟩) surpasses a given threshold (here, .5).

$$P_C(\textsc{coref}\mid\langle i,j\rangle) = \frac{\exp\left(\sum_{k=1}^{n} \lambda_k f_k(\langle i,j\rangle, \textsc{coref})\right)}{Z(\langle i,j\rangle)}$$

where f_k(⟨i,j⟩, COREF) is the number of times feature k occurs for i and j, λ_k is the weight assigned to feature k during training, and Z(⟨i,j⟩) is a normalization factor over both outcomes (COREF and ¬COREF).
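For a binary outcome, this maximum entropy model coincides with logistic regression, so a stand-in for P_C can be sketched with scikit-learn (the paper itself uses TADM; the features below are illustrative placeholders, not the actual feature set):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def pair_features(i, j):
    # placeholder pairwise features; the real system uses the richer
    # anaphor/antecedent/pairwise sets described above
    return {"head_match": i["head"] == j["head"],
            "distance": j["position"] - i["position"],
            "anaphor_is_pronoun": j["pos"] == "PRP"}

def train_pairwise(pairs, labels):
    """pairs: list of (antecedent, anaphor) mention dicts; labels: 1 = COREF."""
    vec = DictVectorizer()
    X = vec.fit_transform([pair_features(i, j) for i, j in pairs])
    # L2 regularization plays the role of the Gaussian prior
    model = LogisticRegression(max_iter=1000).fit(X, labels)
    def p_coref(i, j):
        return model.predict_proba(vec.transform([pair_features(i, j)]))[0, 1]
    return p_coref
```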

Training instances are constructed from pairs of mentions of the form ⟨i,j⟩, where j and i describe an anaphor and an antecedent candidate, respectively. Each such pair is assigned a label, either COREF or ¬COREF, depending on whether or not the two mentions corefer. We followed the sampling method of Soon, Ng, and Lim (2001) for creating the training material for each anaphor: (i) a positive instance for the pair ⟨i,j⟩ where i is the closest antecedent for j, and (ii) a negative instance for each pair ⟨k,j⟩ where k intervenes between i and j.
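A sketch of that sampling scheme (assuming mentions in document order with gold entity ids; the names are ours):

```python
def soon_instances(chain_ids):
    """Training triples (antecedent_idx, anaphor_idx, label) following
    Soon, Ng, and Lim (2001); chain_ids[i] is the gold entity of the
    i-th mention in document order."""
    instances = []
    for j in range(len(chain_ids)):
        # closest preceding mention of the same entity, if any
        closest = next((i for i in range(j - 1, -1, -1)
                        if chain_ids[i] == chain_ids[j]), None)
        if closest is None:
            continue  # first mention of its entity yields no instances
        instances.append((closest, j, 1))                           # positive
        instances.extend((k, j, 0) for k in range(closest + 1, j))  # negatives
    return instances
```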

Once trained, the classifier can be used to choose pairwise coreference links, and thus determine the partition into entities, in two ways. The first is to pick a unique antecedent with closest-first link clustering (Soon, Ng, and Lim, 2001); this is the standard strategy, referred to as COREF_closest. The second is to simply take all links with probability above .5, which we refer to as COREF_above.5. The purpose of including this latter strategy is primarily to demonstrate an easy way to improve MUC scores that actually degrades B³ and CEAF scores. This strategy indeed results in positing significantly larger chains, since each anaphor is allowed to link to several antecedents.
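The two decoding strategies can be sketched as follows, given a pairwise probability function such as the stand-in above (the returned links would still be grouped into chains, e.g., by transitive closure):

```python
def closest_first(mentions, p_coref, threshold=0.5):
    """COREF_closest: link each anaphor to the closest preceding mention
    whose pairwise probability exceeds the threshold."""
    links = []
    for j in range(len(mentions)):
        for i in range(j - 1, -1, -1):               # scan right to left
            if p_coref(mentions[i], mentions[j]) > threshold:
                links.append((i, j))
                break                                 # unique antecedent
    return links

def all_above(mentions, p_coref, threshold=0.5):
    """COREF_above.5: keep every link whose probability exceeds the threshold."""
    return [(i, j) for j in range(len(mentions)) for i in range(j)
            if p_coref(mentions[i], mentions[j]) > threshold]
```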

3.2 The anaphoricity classifier

Ng and Cardie (2002a) introduced the use of an anaphoricity classifier to act as a filter for coreference resolution, to correct errors where non-anaphoric mentions are mistakenly resolved or where anaphoric mentions fail to be resolved. Their approach produces improvements in precision, but larger losses in recall. Ng (2004) improves recall by optimizing the anaphoricity threshold. By using joint inference for anaphoricity and coreference, Denis and Baldridge (2007) avoid cascade-induced errors without the need to separately optimize the threshold. They realize gains in both recall and precision; however, they report only MUC scores. As we will show, these improvements do not hold for B³ and CEAF.

The task for the anaphoricity determination component is the following: one wants to decide for each mention i in a document whether i is anaphoric or not. This task can be performed using a simple classifier with two outcomes: ANAPH and ¬ANAPH. The classifier estimates the conditional probabilities P_A(ANAPH|i) and predicts ANAPH for i when P_A(ANAPH|i) > .5. The anaphoricity model is as follows:

$$P_A(\textsc{anaph}\mid i) = \frac{\exp\left(\sum_{k=1}^{n} \lambda_k f_k(i, \textsc{anaph})\right)}{Z(i)}$$

The features used for the anaphoricity classifier are quite simple. They include information regarding (i) the mention itself, such as the number of words and whether it is a pronoun, and (ii) properties of the potential antecedent set, such as whether there is a previous mention with a matching string. This classifier achieves 80.8% accuracy on the entire ACE corpus (bnews: 80.1, npaper: 82.2, nwire: 80.1).

3.3 The named entity classifier

Named entity classification involves predicting one of the five ACE class labels. The set of named entity types T is: facility, gpe (geo-political entity), location, organization, person. The classifier estimates the conditional probabilities P_E(t|i) for each t ∈ T and predicts for mention i the named entity type t such that t = argmax_{t∈T} P_E(t|i).

Pascal Denis, Jason Baldridge

90

$$P_E(t\mid i) = \frac{\exp\left(\sum_{k=1}^{n} \lambda_k f_k(i, t)\right)}{Z(i)}$$

The features for this model include: (i) the string of the mention, (ii) features defined over the string (e.g., capitalization, punctuation, head word), and (iii) features describing the word and POS context around the mention. The classifier achieves 79.5% accuracy on the entire ACE corpus (bnews: 79.8, npaper: 73.0, nwire: 72.7).

4 Base model results

This section describes coreference performance when the pairwise coreference classifier is used alone with closest-first clustering (COREF_closest) or with the liberal all-links-above-.5 clustering (COREF_above.5), and when COREF_closest is constrained by the anaphoricity and named entity classifiers as filters in a cascade or by gold-standard information as filters in oracle systems. The cascades are:

• CASCADE_a→c: the anaphoricity classifier specifies which mentions to resolve

• CASCADE_e→c: the named entity classifier specifies which antecedents have the same type as the mention to be resolved; others are excluded from consideration

• CASCADE_a,e→c: the two classifiers acting as combined filters

We also provide results for the corresponding oracle systems, which have perfect knowledge about anaphoricity and/or named entity types: ORACLE_a,c, ORACLE_e,c, and ORACLE_a,e,c.

Table 2 summarizes the results in terms of recall (R), precision (P), and f-score (F) on the three coreference metrics: MUC, B³, and CEAF. The first thing to note is the contrast between COREF_closest and COREF_above.5. Recall that the only difference between the two clustering strategies is that the latter creates strictly larger entities than the former by adding all links above .5. By doing so, it gains about 10% in R for both MUC and B³. However, whereas MUC does not register a drop in precision, B³ P is 14% lower, which produces an overall 1% drop in F. CEAF punishes this strategy even more, with a 3.6% drop. Note that the resulting composite MELA scores are almost identical. Given the nature of the two strategies COREF_closest and COREF_above.5, these differences across metrics strongly support arguments that MUC is too indiscriminate and can in fact be gamed (knowingly or not) by simply creating larger chains.

Table 2 also shows that cascades in general fail to produce significant F improvements over the pairwise model COREF_closest. These systems are far behind the performance of their corresponding oracles. This tendency is even stronger when both classifiers filter possible assignments: CASCADE_a,e→c does much worse than COREF_closest on all metrics. In fact, this system has the lowest F on the B³ evaluation metric, suggesting that the errors of the two filters accumulate in this case. In contrast, the corresponding oracle, ORACLE_a,e,c, achieves the best results across all measures. It does so by capitalizing on the improvements given by the separate oracles.

Furthermore, note that the two auxiliary models have complementary effects on the MUC and B³ metrics, in both the cascade and the oracle systems. The use of the anaphoricity classifier improves recall (suggesting that some true anaphors get "rescued" by this model), while the use of the named entity model leads to precision improvements (suggesting that this model manages to filter out incorrect candidates that would have been chosen by the coreference model). In the case of the oracle systems, these gains translate into overall F improvements. But, as noted, this is generally not the case with the cascade systems. Only CASCADE_a→c shows significant gains, with MUC and CEAF (and not with B³). CASCADE_e→c underperforms on all three metrics. This latter system indeed shows a large drop in recall, suggesting that this filter is overzealous in removing true antecedents.

The oracle results suggest that joint modeling could deliver large performance gains by not falling prey to cascade errors. In the next section, we build on previous ILP formulations and show that such improvements can indeed be realized.

5 Integer programming formulations

ILP is an optimization framework for global inference over the outputs of various base classifiers (Roth and Yih, 2004). Previous uses of ILP for NLP tasks include, e.g., Roth and Yih (2004), Barzilay and Lapata (2006), and Clarke and Lapata (2006).


System            MUC: R / P / F       B³: R / P / F       CEAF R/P/F   MELA F-avg
COREF_closest     60.8  72.6  66.2     62.4  77.7  69.2     62.3         65.9
COREF_above.5     70.3  72.7  71.5     73.2  63.7  68.1     58.7         66.1
CASCADE_a→c       64.9  72.3  68.4     65.6  74.1  69.6     63.4         67.1
CASCADE_e→c       56.3  75.2  64.4     59.6  82.4  69.2     61.6         65.1
CASCADE_a,e→c     61.3  68.8  64.8     62.5  73.8  67.7     61.9         64.8
ORACLE_a,c        75.6  75.6  75.6     71.4  70.7  71.1     71.5         72.7
ORACLE_e,c        62.5  81.3  70.7     62.9  85.5  72.4     65.2         69.4
ORACLE_a,e,c      83.2  83.2  83.2     79.0  78.2  78.6     78.7         80.2

Table 2: Recall (R), precision (P), and f-score (F) using MUC, B³, and CEAF on the entire ACE corpus for the basic coreference system, the cascade systems, and the corresponding oracle systems.

Here, we provide several ILP formulations for coreference. The first formulation, ILP_c,a, is based on Denis and Baldridge (2007) and performs joint inference over the coreference classifier and the anaphoricity classifier. A second formulation, ILP_c,e, combines the coreference classifier with the named entity classifier. A third formulation, ILP_c,a,e, combines all three models together. In each of these joint formulations, a set of consistency constraints mutually constrains the ultimate assignments of each model. Finally, a fourth formulation, ILP_c,a,e|trans, adds to ILP_c,a,e a set of transitivity constraints (similar to those of Klenner (2007)). These latter constraints ensure better global coherence between the various pairwise coreference decisions, hence making this fourth formulation both a joint and a global model.

For solving the ILP problems, we use cplex, a commercial LP solver.[7] In practice, each document is processed to define a distinct ILP problem that is then submitted to the solver.

5.1 ILP_c,a: anaphoricity-coreference formulation

The ILP_c,a system of Denis and Baldridge (2007) brings the two decisions of coreference and anaphoricity together by including both in a single objective function and enforcing consistency constraints on the final outputs of both tasks. More technically, let M denote the set of mentions, and P the set of possible coreference links over M: P = {⟨i,j⟩ | ⟨i,j⟩ ∈ M × M and i < j}.

[7] http://www.ilog.com/products/cplex/

Each model introduces a set of indicator variables: (i) coreference variables x_⟨i,j⟩ ∈ {0,1}, depending on whether i and j corefer or not, and (ii) anaphoricity variables y_j ∈ {0,1}, depending on whether j is anaphoric or not. These variables are associated with assignment costs derived from the model probabilities p_C = P_C(COREF|⟨i,j⟩) and p_A = P_A(ANAPH|j), respectively. The cost of committing to a coreference link is c^C_⟨i,j⟩ = −log(p_C), and the complement cost of choosing not to establish a link is c̄^C_⟨i,j⟩ = −log(1−p_C). Analogously, we define costs on anaphoricity decisions as c^A_j = −log(p_A) and c̄^A_j = −log(1−p_A), the costs associated with making j anaphoric or not, respectively. The resulting objective function takes the following form:

$$\min \sum_{\langle i,j\rangle \in P} c^C_{\langle i,j\rangle}\cdot x_{\langle i,j\rangle} + \bar{c}^C_{\langle i,j\rangle}\cdot (1 - x_{\langle i,j\rangle}) \;+\; \sum_{j \in M} c^A_j\cdot y_j + \bar{c}^A_j\cdot (1 - y_j)$$

subject to:

$$x_{\langle i,j\rangle} \in \{0,1\}\;\;\forall \langle i,j\rangle \in P, \qquad y_j \in \{0,1\}\;\;\forall j \in M$$

The final assignments of the x_⟨i,j⟩ and y_j variables are forced to respect the following two consistency constraints (where M_j is the set of all mentions preceding mention j in the document):

Resolve all anaphors: if a mention is anaphoric (y_j = 1), it must have at least one antecedent.

$$y_j \leq \sum_{i \in M_j} x_{\langle i,j\rangle} \quad \forall j \in M$$


Resolve only anaphors: if a pair of mentions ⟨i,j⟩ is coreferent (x_⟨i,j⟩ = 1), then j is anaphoric (y_j = 1).

$$x_{\langle i,j\rangle} \leq y_j \quad \forall \langle i,j\rangle \in P$$

These constraints make sure that the anaphoricity classifier's decisions are not taken on faith as they were with CASCADE_a→c. Instead, we optimize over both possibilities in the objective function (relative to the probabilities output by the classifiers) while ensuring that the final assignments respect the significance of what it is to be anaphoric or non-anaphoric.
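The paper solves each document's program with cplex; the same formulation can be sketched with the open-source PuLP library (an illustrative reimplementation, not the authors' code; the probability tables are assumed given):

```python
import math
import pulp

def ilp_ca(n, p_coref, p_anaph, eps=1e-9):
    """ILP_c,a for one document. p_coref[(i, j)] = P_C(COREF | <i, j>) for
    i < j; p_anaph[j] = P_A(ANAPH | j). Returns the selected links."""
    cost = lambda p: -math.log(max(p, eps))      # -log probability, guarded
    pairs = [(i, j) for j in range(n) for i in range(j)]
    prob = pulp.LpProblem("ilp_ca", pulp.LpMinimize)
    x = {(i, j): pulp.LpVariable(f"x_{i}_{j}", cat="Binary") for i, j in pairs}
    y = {j: pulp.LpVariable(f"y_{j}", cat="Binary") for j in range(n)}

    # objective: link costs plus their complements, plus anaphoricity costs
    prob += (pulp.lpSum(cost(p_coref[p]) * x[p]
                        + cost(1 - p_coref[p]) * (1 - x[p]) for p in pairs)
             + pulp.lpSum(cost(p_anaph[j]) * y[j]
                          + cost(1 - p_anaph[j]) * (1 - y[j]) for j in y))

    for j in range(n):                  # resolve all anaphors
        prob += y[j] <= pulp.lpSum(x[(i, j)] for i in range(j))
    for i, j in pairs:                  # resolve only anaphors
        prob += x[(i, j)] <= y[j]

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [p for p in pairs if x[p].value() == 1]
```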

5.2 ILP_c,e: entity-coreference formulation

In this second joint formulation, we combine coreference decisions with named entity classification. New indicator variables for the assignments of this model are introduced, namely z_⟨i,t⟩, where ⟨i,t⟩ ∈ M × T. Since entity classification is not a binary decision, each assignment variable encodes both a mention i and a named entity type t. Each of these variables has an associated cost c^E_⟨i,t⟩, derived from the probability that mention i has type t: c^E_⟨i,t⟩ = −log(P_E(t|i)). The objective function for this formulation is:

$$\min \sum_{\langle i,j\rangle \in P} c^C_{\langle i,j\rangle}\cdot x_{\langle i,j\rangle} + \bar{c}^C_{\langle i,j\rangle}\cdot (1 - x_{\langle i,j\rangle}) \;+\; \sum_{\langle i,t\rangle \in M\times T} c^E_{\langle i,t\rangle}\cdot z_{\langle i,t\rangle}$$

subject to:

$$z_{\langle i,t\rangle} \in \{0,1\}\;\;\forall \langle i,t\rangle \in M\times T, \qquad \sum_{t \in T} z_{\langle i,t\rangle} = 1\;\;\forall i \in M$$

The last constraint ensures that each mention is assigned exactly one named entity type. Consistency between the two models is ensured with the following constraint:

Coreferential mentions have the same entity type: if i and j are coreferential (x_⟨i,j⟩ = 1), they must have the same type (z_⟨i,t⟩ − z_⟨j,t⟩ = 0):

$$1 - x_{\langle i,j\rangle} \geq z_{\langle i,t\rangle} - z_{\langle j,t\rangle} \quad \forall \langle i,j\rangle \in P,\; \forall t \in T$$
$$1 - x_{\langle i,j\rangle} \geq z_{\langle j,t\rangle} - z_{\langle i,t\rangle} \quad \forall \langle i,j\rangle \in P,\; \forall t \in T$$

These constraints make sure that the coreference decisions (the x values) are informed by the named entity classifier and vice versa. Furthermore, because these constraints ensure like assignments to coreferent pairs of mentions, they have a "propagating" effect that makes the overall system global. Coreference assignments that have low cost (i.e., high confidence) can influence named entity assignments (e.g., from org to per). This in turn influences other coreference assignments involving further mentions radiating out from one core, highly likely assignment.
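Continuing the PuLP sketch above (imports and the cost helper as before), the ILP_c,e variables and constraints would be added along these lines; the type inventory and probability table here are assumptions of the sketch:

```python
TYPES = ["FAC", "GPE", "LOC", "ORG", "PER"]   # the five ACE classes

def add_entity_constraints(prob, x, pairs, n, p_type, cost):
    """Extend the ILP_c,a sketch with the ILP_c,e part.
    p_type[(i, t)] = P_E(t | i) from the named entity classifier."""
    z = {(i, t): pulp.LpVariable(f"z_{i}_{t}", cat="Binary")
         for i in range(n) for t in TYPES}
    # add the named entity assignment costs to the objective
    prob.setObjective(prob.objective
                      + pulp.lpSum(cost(p_type[k]) * z[k] for k in z))
    for i in range(n):                  # exactly one type per mention
        prob += pulp.lpSum(z[(i, t)] for t in TYPES) == 1
    for i, j in pairs:                  # coreferent mentions share a type
        for t in TYPES:
            prob += 1 - x[(i, j)] >= z[(i, t)] - z[(j, t)]
            prob += 1 - x[(i, j)] >= z[(j, t)] - z[(i, t)]
    return z
```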

5.3 ILP_c,a,e: anaphoricity-entity-coreference formulation

For the third joint model, we combine all three base models with an objective function that is the composite of those of ILP_c,a and ILP_c,e, and incorporate all the constraints that go with them. By creating a triple joint model, we get constraints between anaphoricity and named entity classification for free, as a result of the interaction of the consistency constraints between anaphoricity and coreference and of those between named entity and coreference. For example, if a mention of type t is anaphoric, then there must be at least one mention of type t preceding it.

5.4 Adding transitivity constraints

The previous formulations relate coreference decisions to the decisions made by two auxiliary models in a joint formulation. In addition, one would also like to make coreference decisions dependent on one another, thus ensuring globally coherent entities. This is achieved through the use of transitivity constraints that relate triples of mentions ⟨i,j,k⟩ ∈ M × M × M with i < j < k (written M_{i<j<k} below) (Denis, 2007; Klenner, 2007). These constraints directly exploit the fact that coreference is an equivalence relation.

Transitivity: if x_⟨i,j⟩ and x_⟨j,k⟩ are coreferential pairs (i.e., x_⟨i,j⟩ = x_⟨j,k⟩ = 1), then so is x_⟨i,k⟩:

$$x_{\langle i,k\rangle} \geq x_{\langle i,j\rangle} + x_{\langle j,k\rangle} - 1 \quad \forall \langle i,j,k\rangle \in M_{i<j<k}$$

Euclideanity: if x_⟨i,j⟩ and x_⟨i,k⟩ are coreferential pairs (i.e., x_⟨i,j⟩ = x_⟨i,k⟩ = 1), then so is x_⟨j,k⟩:


$$x_{\langle j,k\rangle} \geq x_{\langle i,j\rangle} + x_{\langle i,k\rangle} - 1 \quad \forall \langle i,j,k\rangle \in M_{i<j<k}$$

Anti-Euclideanity: if x_⟨i,k⟩ and x_⟨j,k⟩ are coreferential pairs (i.e., x_⟨i,k⟩ = x_⟨j,k⟩ = 1), then so is x_⟨i,j⟩:

$$x_{\langle i,j\rangle} \geq x_{\langle i,k\rangle} + x_{\langle j,k\rangle} - 1 \quad \forall \langle i,j,k\rangle \in M_{i<j<k}$$

Enforcing Anti-Euclideanity alone guarantees that the final assignment will not produce any "implicit" anaphors: that is, a configuration wherein x_⟨j,k⟩ = 1, x_⟨i,k⟩ = 1, and y_j = 0. The interaction of this constraint with Resolve only anaphors indeed guarantees that such a configuration cannot arise, since all three equalities cannot hold together. This means that mention j must be a good match for mention i as well as for mention k.

Note that one could have a single transitivity constraint if we had symmetry in our model; concretely, capturing symmetry means: (i) adding a new indicator variable x_⟨j,i⟩ for each variable x_⟨i,j⟩, and (ii) making sure x_⟨j,i⟩ agrees with x_⟨i,j⟩.

Enforcing each of the constraints above means adding (1/6) × n × (n−1) × (n−2) constraints for a document containing n mentions, which comes to close to 500,000 constraints in total for the three types in a document containing just 100 mentions. The inclusion of such a large set of constraints turned out to be difficult, causing memory issues with large documents (some of the ACE documents have more than 250 mentions). Consequently, we investigated during development various simpler scenarios, such as enforcing these constraints only for documents that had a relatively small number of mentions (e.g., 100), or just using one of these types of constraint (in particular Anti-Euclideanity, given the way it interacts with the discourse status assignments). In the following, ILP_c,a,e|trans will refer to the ILP_c,a,e formulation augmented with the Anti-Euclideanity constraints.
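Continuing the same PuLP sketch, the Anti-Euclideanity constraints would be added as below; the loop over triples makes the cubic blow-up just discussed explicit:

```python
from itertools import combinations

def add_anti_euclideanity(prob, x, n):
    """x_{i,j} >= x_{i,k} + x_{j,k} - 1 for all i < j < k: C(n, 3)
    constraints per type, the source of the memory issues noted above."""
    for i, j, k in combinations(range(n), 3):
        prob += x[(i, j)] >= x[(i, k)] + x[(j, k)] - 1
```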

6 ILP Results

Table 3 summarizes the scores for the different ILP systems, along with COREF_closest. Like Denis and Baldridge (2007), we find that joint anaphoricity and coreference (ILP_c,a) greatly improves MUC F. However, we also see that this model suffers from the same problem as COREF_above.5: performance on the other metrics goes down. This is in fact unsurprising: COREF_above.5 can be viewed as an unconstrained ILP formulation; similarly, ILP_c,a takes all links above .5 subject to meeting the constraints on anaphoricity. The constraining effect of anaphoricity improves MUC R and P and B³ R over COREF_above.5, but not B³ P nor CEAF. Despite the encouraging MUC scores, more is thus needed.

The next thing to note is that joint named entity classification and coreference (ILP_c,e) nearly beats COREF_closest across the metrics, but falls short for CEAF. Like ILP_c,a, ILP_c,e can also be viewed as constraining COREF_above.5: in this case, precision is improved (compare MUC: 72.7 to 75.0 and B³: 63.7 to 71.2), while still retaining over half the gain in recall that COREF_above.5 obtained over COREF_closest. In doing so, the degradation in CEAF is just 1%, compared to ILP_c,a's 3.4%. In addition to improving coreference resolution performance, this joint formulation also yields a slight improvement in named entity classification: specifically, accuracy for that task went from 79.5% to over 80.0% using the ILP_c,e model.

Joint inference over all three models (ILP_c,a,e) delivers larger improvements for both MUC and B³ without any CEAF degradation, thus mirroring the improvements found with the corresponding oracle. In particular, R is boosted nearly to the level of COREF_above.5 without the dramatic loss in P (in fact, P is better than COREF_closest for MUC). By adding the Anti-Euclideanity constraints to this formulation (ILP_c,a,e|trans), we see the best across-the-metrics scores of any system. For MUC and B³, both P and R are boosted over COREF_closest, and there is a jump of 4% for CEAF. Both the MUC and CEAF improvements for ILP_c,a,e|trans are in line with the improvements that Klenner (2007) found using transitivity, though it should be noted that he scored on all mentions, not just true mentions as we do here.

The composite MELA metric provides an interesting overall view, showing step-wise improvements through the addition of the various models and the global constraints.

These results are in sharp contrast with those obtained by the cascade model CASCADE_a,e→c: recall that this system, while also using the two auxiliary models as filters, was worse than COREF_closest. The joint ILP formulation is clearly better able to integrate the extra information provided by the anaphoricity and named entity classifiers. In doing so, it does not require fine-tuning thresholds, and it can further benefit from constraints such as transitivity.


System            MUC: R / P / F       B³: R / P / F       CEAF R/P/F   MELA F
COREF_closest     60.8  72.6  66.2     62.4  77.7  69.2     62.3         65.9
COREF_above.5     70.3  72.7  71.5     73.2  63.7  68.1     58.7         66.1
ILP_c,a           73.2  73.4  73.3     75.3  62.0  68.0     58.9         66.7
ILP_c,e           66.2  75.0  70.4     69.6  71.2  70.4     61.2         67.3
ILP_c,a,e         69.6  75.4  72.4     72.2  69.7  70.9     62.3         68.5
ILP_c,a,e|trans   63.7  77.8  70.1     65.6  81.4  72.7     66.2         69.7

Table 3: Recall (R), precision (P), and f-score (F) using the MUC, B³, and CEAF evaluation metrics on the entire ACE dataset for the ILP coreference systems.


Further experiments reveal that bringing the other transitivity constraints into the ILP formulation results in additional precision gains, although not in overall F gains. The effect of these constraints is to withdraw incoherent links rather than to produce new links. At the global level, this results in the creation of smaller, more coherent clusters of mentions. In some cases, this will lead to a single entity being split across multiple chains. Switching on these constraints may therefore be useful for certain applications where precision is more important than recall.

Though in general CEAF appears to be the most discriminating metric, this point brings up the reason why using CEAF on its own is not ideal. When one entity is split across two or more chains, all the links between the mentions are indeed correct and will thus be useful for applications like information retrieval. MUC and B³ give points to such assignments, whereas only the largest of such chains will be used for CEAF, leaving the others, and their correct links, out of the score. It is also interesting to consider MUC and B³ as they can be useful for teasing apart the behavior of different models, for example, ILP_c,a,e compared to COREF_closest, where CEAF was the same but the other metrics were different.

There is an interesting point of comparison with our results using rankers rather than classifiers and using models specialized to particular types of mentions (Denis and Baldridge, 2008). That work does not use ILP, but the best system there, with f-scores of 71.6, 72.7, and 67.0 for MUC, B³, and CEAF, respectively, actually slightly beats ILP_c,a,e|trans, our best ILP system. This underscores the importance of attending carefully to the base classifiers and features used (see also Bengtson and Roth (2008) in this regard). The ILP approach in this paper could straightforwardly swap in these better base models. We expect this to lead to further performance improvements, which we intend to test in future work, along with testing the performance of these models and methods when using predicted, rather than gold, mentions.

7 Conclusion

We have shown that joint inference over coreference, anaphoricity, and named entity classification using ILP leads to improvements on all three main coreference metrics: MUC, B³, and CEAF. The fact that B³ and CEAF scores were also improved is significant: the ILP formulations tend to construct larger coreference chains, which are rewarded by MUC without precision penalties, but B³ and CEAF are not as lenient.

As importantly, we have provided a careful study of cascaded systems, oracle systems, and the joint systems with respect to all of the metrics. We demonstrated that the MUC metric's bias for larger chains leads it to give much higher scores while performance according to the other metrics actually drops. Nonetheless, B³ and CEAF also have weaknesses; it is thus important to report all of these scores. We also include the MELA score as a simple at-a-glance composite metric.

Acknowledgments

We would like to thank Nicholas Asher, David Beaver, Andrew Kehler, Ray Mooney, and the three anonymous reviewers for their comments, as well as the audience at the workshop for their questions. This work was supported by NSF grant IIS-0535154.

Global joint models for coreference resolution and named entity classification

95

References

Bagga, A. and B. Baldwin. 1998. Algorithms for scoring coreference chains. In Proceedings of LREC 1998, pages 563–566.

Barzilay, Regina and Mirella Lapata. 2006. Aggregation via set partitioning for natural language generation. In Proceedings of HLT-NAACL 2006, pages 359–366, New York City, USA.

Bengtson, Eric and Dan Roth. 2008. Understanding the value of features for coreference resolution. In Proceedings of EMNLP 2008, pages 294–303, Honolulu, Hawaii.

Berger, A., S. Della Pietra, and V. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.

Clarke, James and Mirella Lapata. 2006. Constraint-based sentence compression: An integer programming approach. In Proceedings of COLING-ACL 2006, pages 144–151.

Denis, P. 2007. New Learning Models for Robust Reference Resolution. Ph.D. thesis, University of Texas at Austin.

Denis, P. and J. Baldridge. 2007. Joint determination of anaphoricity and coreference resolution using integer programming. In Proceedings of HLT-NAACL 2007, Rochester, NY.

Denis, Pascal and Jason Baldridge. 2008. Specialized models and ranking for coreference resolution. In Proceedings of EMNLP 2008, pages 660–669, Honolulu, Hawaii.

Haghighi, A. and D. Klein. 2007. Unsupervised coreference resolution in a nonparametric Bayesian model. In Proceedings of ACL 2007, pages 848–855, Prague, Czech Republic.

Kehler, A., D. Appelt, L. Taylor, and A. Simma. 2004. The (non)utility of predicate-argument frequencies for pronoun interpretation. In Proceedings of HLT-NAACL 2004.

Klenner, M. 2007. Enforcing coherence on coreference sets. In Proceedings of RANLP 2007.

Luo, X. 2005. On coreference resolution performance metrics. In Proceedings of HLT-NAACL 2005, pages 25–32.

Luo, Xiaoqiang, Abe Ittycheriah, Hongyan Jing, Nanda Kambhatla, and Salim Roukos. 2004. A mention-synchronous coreference resolution algorithm based on the Bell tree. In Proceedings of ACL 2004, pages 135–142, Barcelona, Spain.

Malouf, R. 2002. A comparison of algorithms for maximum entropy parameter estimation. In Proceedings of the Sixth Workshop on Natural Language Learning, pages 49–55, Taipei, Taiwan.

McCallum, A. and B. Wellner. 2004. Conditional models of identity uncertainty with application to noun coreference. In Proceedings of NIPS 2004.

Morton, T. 2000. Coreference for NLP applications. In Proceedings of ACL 2000, Hong Kong.

Ng, V. 2004. Learning noun phrase anaphoricity to improve coreference resolution: Issues in representation and optimization. In Proceedings of ACL 2004.

Ng, V. and C. Cardie. 2002a. Identifying anaphoric and non-anaphoric noun phrases to improve coreference resolution. In Proceedings of COLING 2002.

Ng, V. and C. Cardie. 2002b. Improving machine learning approaches to coreference resolution. In Proceedings of ACL 2002, pages 104–111.

Roth, Dan and Wen-tau Yih. 2004. A linear programming formulation for global inference in natural language tasks. In Proceedings of CoNLL 2004.

Soon, W. M., H. T. Ng, and D. Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544.

Vilain, M., J. Burger, J. Aberdeen, D. Connolly, and L. Hirschman. 1995. A model-theoretic coreference scoring scheme. In Proceedings of the 6th Message Understanding Conference (MUC-6), pages 45–52, San Mateo, CA. Morgan Kaufmann.


