
Importance sampling for unbiased on-demand evaluation of knowledge base population

Arun Tejasvi Chaganty∗ and Ashwin Pradeep Paranjape∗ and Percy Liang
Computer Science Department
Stanford University
{chaganty,ashwinp,pliang}@cs.stanford.edu

Christopher D. Manning
Computer Science Department
Stanford University
{manning}@cs.stanford.edu

Abstract

Knowledge base population (KBP) systems take in a large document corpus and extract entities and their relations. Thus far, KBP evaluation has relied on judgements on the pooled predictions of existing systems. We show that this evaluation is problematic: when a new system predicts a previously unseen relation, it is penalized even if it is correct. This leads to significant bias against new systems, which counterproductively discourages innovation in the field. Our first contribution is a new importance-sampling based evaluation which corrects for this bias by annotating a new system's predictions on-demand via crowdsourcing. We show this eliminates bias and reduces variance using data from the 2015 TAC KBP task. Our second contribution is an implementation of our method made publicly available as an online KBP evaluation service. We pilot the service by testing diverse state-of-the-art systems on the TAC KBP 2016 corpus and obtain accurate scores in a cost effective manner.

1 Introduction

Harnessing the wealth of information present in unstructured text online has been a long standing goal for the natural language processing community. In particular, knowledge base population seeks to automatically construct a knowledge base consisting of relations between entities from a document corpus. Knowledge bases have found many applications including question answering (Berant et al., 2013; Fader et al., 2014; Reddy et al., 2014), automated reasoning (Kalyanpur et al., 2012) and dialogue (Han et al., 2015).

∗ Authors contributed equally.

Figure 1: An example describing entities and relations in knowledge base population. From the sentence "Fisher's mother, entertainer Debbie Reynolds, said on Twitter on Sunday that her daughter was stabilizing.", the system extracts relation instances linking Debbie Reynolds to the title entertainer and to her child Carrie Fisher, links the entity mentions (Fisher, Debbie Reynolds, daughter, her, Twitter), and adds the corresponding relations to the knowledge base.

Evaluating these systems remains a challenge as it is not economically feasible to exhaustively annotate every possible candidate relation from a sufficiently large corpus. As a result, a pooling-based methodology is used in practice to construct datasets, similar to the methodology used in information retrieval (Jones and Rijsbergen, 1975; Harman, 1993). For instance, at the annual NIST TAC KBP evaluation, all relations predicted by participating systems are pooled together, annotated and released as a dataset for researchers to develop and evaluate their systems on. However, during development, if a new system predicts a previously unseen relation it is considered to be wrong even if it is correct. The discrepancy between a system's true score and the score on the pooled dataset is called pooling bias and is typically assumed to be insignificant in practice (Zobel, 1998).

The key finding of this paper contradicts this assumption and shows that the pooling bias is actually significant, and it penalizes newly developed systems by 2% F1 on average (Section 3). Novel improvements, which typically increase scores by less than 1% F1 on existing datasets, are therefore likely to be clouded by pooling bias during development. Worse, the bias is larger for a system which predicts qualitatively different relations systematically missing from the pool. Of course, systems participating in the TAC KBP evaluation do not suffer from pooling bias, but this requires researchers to wait a year to get credible feedback on new ideas.

This bias is particularly counterproductive for machine learning methods as they are trained assuming the pool is the complete set of positives. Predicting unseen relations and learning novel patterns is penalized. The net effect is that researchers are discouraged from developing innovative approaches, in particular from applying machine learning, thereby slowing progress on the task.

Our second contribution, described in Section 4, addresses this bias through a new evaluation methodology, on-demand evaluation, which avoids pooling bias by querying crowdworkers, while minimizing cost by leveraging previous systems' predictions when possible. We then compute the new system's score based on the predictions of past systems using importance weighting. As more systems are evaluated, the marginal cost of evaluating a new system decreases. We show how the on-demand evaluation methodology can be applied to knowledge base population in Section 5. Through a simulated experiment on evaluation data released through the TAC KBP 2015 Slot Validation track, we show that we are able to obtain unbiased estimates of a new system's scores while significantly reducing variance.

Finally, our third contribution is an implementation of our framework as a publicly available evaluation service at https://kbpo.stanford.edu, where researchers can have their own KBP systems evaluated. The data collected through the evaluation process could even be valuable for relation extraction, entity linking and coreference, and will also be made publicly available through the website. We evaluate three systems on the 2016 TAC KBP corpus for about $150 each (a fraction of the cost of official evaluation). We believe the public availability of this service will speed the pace of progress in developing KBP systems.

Figure 2: In pooled evaluation, an evaluation dataset is constructed by labeling relation instances i1, . . . , i6 (each of the form (s1, r, oj, pk)) collected from the pooled systems (A and B) and from a team of human annotators (Humans). However, when a new system (C) is evaluated on this dataset, some of its predictions (i6) are missing and can not be fairly evaluated. Here, the precision and recall for C should be 3/3 and 3/4 respectively, but its evaluation scores are estimated to be 2/3 and 2/3. The discrepancy between these two scores is called pooling bias.

2 Background

In knowledge base population, each relation is a triple (SUBJECT, PREDICATE, OBJECT) where SUBJECT and OBJECT are some globally unique entity identifiers (e.g. Wikipedia page titles) and PREDICATE belongs to a specified schema.1 A KBP system returns an output in the form of relation instances (SUBJECT, PREDICATE, OBJECT, PROVENANCE), where PROVENANCE is a description of where exactly in the document corpus the relation was found. In the example shown in Figure 1, CARRIE FISHER and DEBBIE REYNOLDS are identified as the subject and object, respectively, of the predicate CHILD OF, and the whole sentence is provided as provenance. The provenance also identifies that CARRIE FISHER is referenced by Fisher within the sentence. Note that the same relation can be expressed in multiple sentences across the document corpus; each of these is a different relation instance.
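For concreteness, a relation instance can be represented as a small record. This is a minimal sketch; the class and field names are ours, not part of the TAC KBP specification, and the provenance string is hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RelationInstance:
    """One KBP system prediction: a relation triple plus where it was found."""
    subject: str     # globally unique entity identifier, e.g. a Wikipedia page title
    predicate: str   # predicate from the schema, e.g. "per:parents"
    object: str      # entity identifier, or a date / number / string value
    provenance: str  # description of where in the corpus the relation was found

# The relation itself is just the (subject, predicate, object) triple; the same
# relation expressed in another sentence is a different relation instance.
example = RelationInstance(
    subject="Carrie Fisher",
    predicate="per:parents",
    object="Debbie Reynolds",
    provenance="(hypothetical) doc42, sentence 3: Fisher's mother, entertainer Debbie Reynolds, ...",
)
```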

Pooled evaluation. The primary source of evaluation data for KBP comes from the annual TAC KBP competition organized by NIST (Ji et al., 2011). Let E be a held-out set of evaluation entities. There are two steps performed in parallel: First, each participating system is run on the document corpus to produce a set of relation instances; those whose subjects are in E are labeled as either positive or negative by annotators. Second, a team of annotators identify and label correct relation instances for the evaluation entities E by manually searching the document corpus within a time budget (Ellis et al., 2012). These labeled relation instances from the two steps are combined and released as the evaluation dataset. In the example in Figure 2, systems A and B were used in constructing the pooling dataset, and there are 3 distinct relations in the dataset, between s1 and o1, o2, o3.

1 The TAC KBP guidelines specify a total of 65 predicates (including inverses) such as per:title or org:founded_on, etc. Subject entities can be people, organizations, geopolitical entities, while object entities also include dates, numbers and arbitrary string values like job titles.

A system is evaluated on the precision of its predicted relation instances for the evaluation entities E and on the recall of the corresponding predicted relations (not instances) for the same entities (see Figure 2 for a worked example). When using the evaluation data during system development, it is common practice to use the more lenient anydoc score that ignores the provenance when checking if a relation instance is true. Under this metric, predicting the relation (CARRIE FISHER, CHILD OF, DEBBIE REYNOLDS) from an ambiguous provenance like "Carrie Fisher and Debbie Reynolds arrived together at the awards show" would be considered correct even though it would be marked wrong under the official metric.
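The difference between the two scoring modes comes down to whether the provenance is checked. The sketch below is our own simplification, not the official scorer (which also handles entity equivalence classes and provenance offsets):

```python
def instance_is_correct(pred, key, anydoc=False):
    """Score one predicted instance against the evaluation key.

    `key` maps a (subject, predicate, object) triple to the set of provenances
    that annotators judged correct. The official metric requires the predicted
    provenance to be among them; the lenient anydoc metric accepts any
    provenance once the relation itself is known to be correct.
    """
    relation = (pred.subject, pred.predicate, pred.object)
    if relation not in key:
        return False   # unseen relations are marked wrong -- the source of pooling bias
    if anydoc:
        return True    # relation is correct somewhere in the corpus; ignore provenance
    return pred.provenance in key[relation]
```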

3 Measuring pooling bias

The example in Figure 2 makes it apparent that pooling-based evaluation can introduce a systematic bias against unpooled systems. However, it has been assumed that the bias is insignificant in practice given the large number of systems pooled in the TAC KBP evaluation. We will now show that this assumption is not valid using data from the TAC KBP 2015 evaluation.2

Measuring bias. In total, there are 70 system submissions from 18 teams for 317 evaluation entities (E), and the evaluation set consists of 11,008 labeled relation instances.3 The original evaluation dataset gives us a good measure of the true scores for the participating systems. Similar to Zobel (1998), which studied pooling bias in information retrieval, we simulate the condition of a team not being part of the pooling process by removing any predictions that are unique to its systems from the evaluation dataset. The pooling bias is then the difference between the true and unpooled scores.

              Median bias
           Precision   Recall   Macro F1
Official     17.93%    17.00%    15.51%
anydoc        2.34%     1.93%     2.05%

Figure 3: Median pooling bias (difference between pooled and unpooled scores) on the top 40 systems of the TAC KBP 2015 evaluation using the official and anydoc scores. The bias is much smaller for the lenient anydoc metric, but even so, it is larger than the largest difference between adjacent systems (1.5% F1) and typical system improvements (around 1% F1).

Results. Figure 3 shows the results of measuring pooling bias on the TAC KBP 2015 evaluation on the F1 metric using the official and anydoc scores.4,5 We observe that even with the lenient anydoc heuristic, the median bias (2.05% F1) is much larger than the largest difference between adjacently ranked systems (1.5% F1). This experiment shows that pooled evaluation is significantly and systematically biased against systems that make novel predictions!

2 Our results are not qualitatively different on data from previous years of the shared task.

3 The evaluation set is actually constructed from compositional queries like, "what does Carrie Fisher's parents do?": these queries select relation instances that answer the question "who are Carrie Fisher's parents?", and then use those answers (e.g. "Debbie Reynolds") to select relation instances that answer "what does Debbie Reynolds do?". We only consider instances selected in the first part of this process.

4 We note that anydoc scores are on average 0.88% F1 larger than the official scores.

5 The outlier at rank 36 corresponds to a University of Texas, Austin system that only filtered predictions from other systems and hence has no unique predictions itself.


4 On-demand evaluation with importance sampling

Pooling bias is fundamentally a sampling bias problem where relation instances from new systems are underrepresented in the evaluation dataset. We could of course sidestep the problem by exhaustively annotating the entire document corpus, by annotating all mentions of entities and checking relations between all pairs of mentions. However, that would be a laborious and prohibitively expensive task: using the interfaces we've developed (Section 6), it costs about $15 to annotate a single document by non-expert crowdworkers, resulting in an estimated cost of at least $1,350,000 for a reasonably large corpus of 90,000 documents (Dang, 2016). The annotation effort would cost significantly more with expert annotators. In contrast, labeling relation instances from system predictions can be an order of magnitude cheaper than finding them in documents: using our interfaces, it costs only about $0.18 to verify each relation instance compared to $1.60 per instance extracted through exhaustive annotations.

We propose a new paradigm called on-demand evaluation which takes a lazy approach to dataset construction by annotating predictions from systems only when they are underrepresented, thus correcting for pooling bias as it arises. In this section, we'll formalize the problem solved by on-demand evaluation independent of KBP and describe a cost-effective solution that allows us to accurately estimate evaluation scores without bias using importance sampling. We'll then instantiate the framework for KBP in Section 5.

4.1 Problem statement

Let X be the universe of (relation) instances, Y ⊆ X be the unknown subset of correct instances, X_1, . . . , X_m ⊆ X be the predictions for m systems, and let Y_i = X_i ∩ Y. Let 𝒳 = ⋃_{i=1}^m X_i and 𝒴 = ⋃_{i=1}^m Y_i. Let f(x) ≝ I[x ∈ Y] and g_i(x) ≝ I[x ∈ X_i]; then the precision, π_i, and recall, r_i, of the set of predictions X_i are

\pi_i \overset{\mathrm{def}}{=} \mathbb{E}_{x \sim p_i}[f(x)], \qquad r_i \overset{\mathrm{def}}{=} \mathbb{E}_{x \sim p_0}[g_i(x)],

where p_i is a distribution over X_i and p_0 is a distribution over Y. We assume that p_i is known, e.g. the uniform distribution over X_i, and that we know p_0 up to a normalization constant and can sample from it.

In on-demand evaluation, we can query f(x) (e.g. labeling an instance) or draw a sample from p_0; typically, querying f(x) is significantly cheaper than sampling from p_0. We obtain the prediction sets X_1, . . . , X_m sequentially as the systems are submitted for evaluation. Our goal is to estimate π_i and r_i for each system i = 1, . . . , m.

4.2 Simple estimators

We can estimate each π_i and r_i independently with simple Monte Carlo integration. Let X̃_1, . . . , X̃_m be multi-sets of n_1, . . . , n_m i.i.d. samples drawn from X_1, . . . , X_m respectively, and let Ỹ_0 be a multi-set of n_0 samples drawn from Y. Then, the simple estimators for precision and recall are:

\pi_i^{(\mathrm{simple})} = \frac{1}{n_i} \sum_{x \in \tilde{X}_i} f(x), \qquad r_i^{(\mathrm{simple})} = \frac{1}{n_0} \sum_{x \in \tilde{Y}_0} g_i(x).
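A minimal sketch of the simple estimators, assuming p_i is uniform over X_i and that the labeling function f (crowdworker judgments) is available as a callable:

```python
import random

def simple_precision(X_i, f, n_i, rng=random):
    """pi_i^(simple): mean of f(x) over n_i uniform samples from the predictions X_i."""
    samples = [rng.choice(X_i) for _ in range(n_i)]
    return sum(f(x) for x in samples) / n_i

def simple_recall(Y0_samples, X_i):
    """r_i^(simple): fraction of samples drawn from p_0 (over the true set Y) that system i predicted."""
    predicted = set(X_i)
    return sum(x in predicted for x in Y0_samples) / len(Y0_samples)
```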

4.3 Joint estimators

The simple estimators are unbiased but have wastefully large variance because evaluating a new system does not leverage labels acquired for previous systems.

On-demand evaluation with the joint estimator works as follows: First, Ỹ_0 is randomly sampled from Y once, when the evaluation framework is launched. For every new set of predictions X_m submitted for evaluation, the minimum number of samples n_m required to accurately evaluate X_m is calculated based on the current evaluation data, Ỹ_0 and X̃_1, . . . , X̃_{m−1}. Then, the set X̃_m is added to the evaluation data by evaluating f(x) on n_m samples drawn from X_m. Finally, the estimates π_i and r_i are updated for each system i = 1, . . . , m using the joint estimators that will be defined next. In the rest of this section, we will answer the following three questions:

1. How can we use all the samples X̃_1, . . . , X̃_m when estimating the precision π_i of system i?

2. How can we use all the samples X̃_1, . . . , X̃_m with Ỹ_0 when estimating recall r_i?

3. Finally, to form X̃_m, how many samples should we draw from X_m given the existing samples X̃_1, . . . , X̃_{m−1} and Ỹ_0?

(Proofs for claims made in this section can be found in Appendix B of the supplementary material.)

Estimating precision jointly. Intuitively, if two systems have very similar predictions X_i and X_j, we should be able to use samples from one to estimate precision on the other. However, it might also be the case that X_i and X_j only overlap on a small region, in which case the samples from X_j do not accurately represent instances in X_i and could lead to a biased estimate. We address this problem by using importance sampling (Owen, 2013), a standard statistical technique for estimating properties of one distribution using samples from another distribution.

In importance sampling, if X̃_i is sampled from q_i, then

\frac{1}{n_i} \sum_{x \in \tilde{X}_i} \frac{p_i(x)}{q_i(x)} f(x)

is an unbiased estimate of π_i. We would like the proposal distribution q_i to both leverage samples from all m systems and be tailored towards system i. To this end, we first define a distribution over systems j, represented by probabilities w_{ij}. Then, define q_i as sampling a j and drawing x ∼ p_j; formally,

q_i(x) = \sum_{j=1}^{m} w_{ij}\, p_j(x).

We note that q_i(x) not only differs significantly between systems, but also changes as new systems are added to the evaluation pool. Unfortunately, the standard importance sampling procedure requires us to draw and use samples from each distribution q_i(x) independently and thus can not effectively reuse samples drawn from different distributions. To this end, we introduce a practical refinement to the importance sampling procedure: we independently draw n_j samples according to p_j(x) from each of the m systems and then numerically integrate over these samples using the weights w_{ij} to "mix" them appropriately, producing an unbiased estimate of π_i while reducing variance. Formally, we define the joint precision estimator:

\pi_i^{(\mathrm{joint})} \overset{\mathrm{def}}{=} \sum_{j=1}^{m} \frac{w_{ij}}{n_j} \sum_{x \in \tilde{X}_j} \frac{p_i(x)\, f(x)}{q_i(x)},

where each X̃_j consists of n_j i.i.d. samples drawn from p_j.

It is a hard problem to determine what the optimal mixing weights w_{ij} should be. However, we can formally verify that if X_i and X_j are disjoint, then w_{ij} = 0 minimizes the variance of π_i, and if X_i = X_j, then w_{ij} ∝ n_j is optimal. This motivates the following heuristic choice, which interpolates between these two extremes:

w_{ij} \propto n_j \sum_{x \in \mathcal{X}} p_j(x)\, p_i(x).
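A sketch of the joint precision estimator and the heuristic mixing weights, assuming each p_j is uniform over X_j and that all drawn samples have already been labeled (labels[x] = f(x)):

```python
def mixing_weights(i, predictions, n_samples):
    """Heuristic w_ij proportional to n_j * sum_x p_j(x) p_i(x), normalized to sum to one."""
    X_i = set(predictions[i])
    raw = []
    for j, X_j in enumerate(predictions):
        overlap = len(X_i & set(X_j))
        raw.append(n_samples[j] * overlap / (len(X_i) * len(X_j)))
    total = sum(raw)
    return [w / total for w in raw]

def joint_precision(i, predictions, samples, labels, n_samples):
    """pi_i^(joint): reuse the labeled samples drawn from every system via importance weighting."""
    sizes = [len(X_j) for X_j in predictions]
    member = [set(X_j) for X_j in predictions]
    w = mixing_weights(i, predictions, n_samples)

    def p(j, x):  # uniform p_j(x) over X_j
        return 1.0 / sizes[j] if x in member[j] else 0.0

    def q(x):     # proposal q_i(x) = sum_j w_ij p_j(x)
        return sum(w[j] * p(j, x) for j in range(len(predictions)))

    estimate = 0.0
    for j, sampled in enumerate(samples):          # sampled plays the role of the multiset drawn from X_j
        if w[j] == 0.0:
            continue
        estimate += (w[j] / n_samples[j]) * sum(
            p(i, x) * labels[x] / q(x) for x in sampled
        )
    return estimate
```

Note that samples drawn from systems whose predictions do not overlap X_i contribute nothing, matching the w_ij = 0 case above.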

Estimating recall jointly. The recall of system i can be expressed as a product r_i = θ ν_i, where θ is the recall of the pool, which measures the fraction of all positive instances predicted by the pool (any system), and ν_i is the pooled recall of system i, which measures the fraction of the pool's positive instances predicted by system i. Letting g(x) ≝ I[x ∈ 𝒳], we can define these as:

\nu_i \overset{\mathrm{def}}{=} \mathbb{E}_{x \sim p_0}[g_i(x) \mid x \in \mathcal{X}], \qquad \theta \overset{\mathrm{def}}{=} \mathbb{E}_{x \sim p_0}[g(x)].

We can estimate θ analogously to the simple recall estimator, except we use the pool g instead of a system g_i. For ν_i, the key is to leverage the work from estimating precision. We already evaluated f(x) on X̃_i, so we can compute Ỹ_i ≝ X̃_i ∩ Y and form the subset Ỹ = ⋃_{i=1}^m Ỹ_i. Ỹ is an approximation of 𝒴 whose bias we can correct through importance reweighting. We then define the estimators as follows:

\nu_i \overset{\mathrm{def}}{=} \frac{\sum_{j=1}^{m} \frac{w_{ij}}{n_j} \sum_{x \in \tilde{Y}_j} \frac{p_0(x)\, g_i(x)}{q_i(x)}}{\sum_{j=1}^{m} \frac{w_{ij}}{n_j} \sum_{x \in \tilde{Y}_j} \frac{p_0(x)}{q_i(x)}}, \qquad r_i^{(\mathrm{joint})} \overset{\mathrm{def}}{=} \theta\, \nu_i, \qquad \theta \overset{\mathrm{def}}{=} \frac{1}{n_0} \sum_{x \in \tilde{Y}_0} g(x),

where q_i and w_{ij} are the same as before.
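Continuing the precision sketch above, the pooled recall ν_i and the pool recall θ can be estimated from the same labeled samples. In this sketch, correct_samples[j] plays the role of Ỹ_j (the labeled-correct subset of the samples drawn from X_j), p0 may be unnormalized (the constant cancels in the ratio), and q, w and n_samples are the proposal, weights and sample counts built for system i in the precision sketch:

```python
def joint_recall(i, predictions, correct_samples, p0, q, w, n_samples, Y0_samples):
    """r_i^(joint) = theta_hat * nu_hat_i, reusing the correctly-labeled samples from every system."""
    member_i = set(predictions[i])
    pool = set().union(*predictions)   # the pooled predictions (script X)

    num = den = 0.0
    for j, Y_tilde_j in enumerate(correct_samples):
        if w[j] == 0.0:
            continue
        scale = w[j] / n_samples[j]
        for x in Y_tilde_j:
            ratio = p0(x) / q(x)
            num += scale * ratio * (x in member_i)   # p_0(x) g_i(x) / q_i(x)
            den += scale * ratio                     # p_0(x) / q_i(x)
    nu_i = num / den if den > 0 else 0.0             # pooled recall of system i

    theta = sum(x in pool for x in Y0_samples) / len(Y0_samples)   # recall of the pool
    return theta * nu_i
```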

Adaptively choosing the number of samples. Finally, a desired property for on-demand evaluation is to label new instances only when the current evaluation data is insufficient, e.g. when a new set of predictions X_m contains many instances not covered by other systems. We can measure how well the current evaluation set covers the predictions X_m by using a conservative estimate of the variance of π_m^{(joint)}.7 In particular, the variance of π_m^{(joint)} is a monotonically decreasing function of n_m, the number of samples drawn from X_m. We can easily solve for the minimum number of samples required to estimate π_m^{(joint)} within a confidence interval ε by using the bisection method (Burden and Faires, 1985).

7 Further details can be found in Appendix B of the supplementary material.
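As a sketch of the adaptive scheme, the variance estimate can be treated as a black-box, monotonically decreasing function of n_m; the discrete bisection below then finds the smallest sample count that meets a target variance. The variance function itself (derived in the appendix) is left as a placeholder argument:

```python
def min_samples_needed(variance_of, target_variance, n_max):
    """Smallest n in [1, n_max] with variance_of(n) <= target_variance.

    `variance_of(n)` must be monotonically decreasing in n, e.g. a conservative
    estimate of Var[pi_m^(joint)] after drawing n fresh samples from X_m.
    """
    if variance_of(n_max) > target_variance:
        return n_max                       # even the full budget cannot reach the target
    lo, hi = 1, n_max
    while lo < hi:
        mid = (lo + hi) // 2
        if variance_of(mid) <= target_variance:
            hi = mid                       # mid samples suffice; try fewer
        else:
            lo = mid + 1                   # mid samples are not enough
    return lo
```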

5 On-demand evaluation for KBP

Applying the on-demand evaluation framework to a task requires us to answer three questions:

1. What is the desired distribution over system predictions p_i?

2. How do we label an instance x, i.e. check if x ∈ Y?

3. How do we sample from the unknown set of true instances, x ∼ p_0?

In this section, we present practical implementations for knowledge base population.

5.1 Sampling from system predictions

Both the official TAC KBP evaluation and the on-demand evaluation we propose use micro-averaged precision and recall as metrics. However, in the official evaluation, these metrics are computed over a fixed set of evaluation entities chosen by LDC annotators, resulting in two problems: (a) defining evaluation entities requires human intervention and (b) typically a large source of variability in evaluation scores comes from not having enough evaluation entities (see e.g. (Webber, 2010)). In our methodology, we replace manually chosen evaluation entities by sampling entities from each system's output according to p_i. In effect, p_i makes explicit the decision process of the annotator who chooses evaluation entities.

Identifying a reasonable distribution p_i is an important implementation decision that depends on what one wishes to evaluate. Our goal for the on-demand evaluation service we have implemented is to ensure that KBP systems are fairly evaluated on diverse subjects and predicates, while at the same time ensuring that entities with multiple relations are represented to measure the completeness of knowledge base entries. As a result, we propose a distribution that is inversely proportional to the frequency of the subject and predicate and is proportional to the number of unique relations identified for an entity (to measure knowledge base completeness). See Appendix A in the supplementary material for an analysis of this distribution and a study of other potential choices.
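As an illustration of such a distribution, the sketch below weights each predicted instance by the inverse frequency of its subject and predicate, scaled by the number of distinct relations predicted for the subject. The exact weighting used by the service is the one analyzed in Appendix A, so treat this as indicative only; instances are assumed to carry subject/predicate/object attributes as in the earlier sketch:

```python
from collections import Counter

def instance_distribution(instances):
    """Probability over one system's predicted instances: rare subjects and predicates
    are up-weighted, and subjects with many distinct relations are up-weighted too."""
    subj_freq = Counter(x.subject for x in instances)
    pred_freq = Counter(x.predicate for x in instances)
    unique_rels = {
        s: len({(x.predicate, x.object) for x in instances if x.subject == s})
        for s in subj_freq
    }
    weights = [
        unique_rels[x.subject] / (subj_freq[x.subject] * pred_freq[x.predicate])
        for x in instances
    ]
    total = sum(weights)
    return [w / total for w in weights]
```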

5.2 Labeling predicted instances

We label predicted relation instances by presenting the instance's provenance to crowdworkers and asking them to identify if a relation holds between the identified subject and object mentions (Figure 4a). Crowdworkers are also asked to link the subject and object mentions to their canonical mentions within the document and to pages on Wikipedia, if possible, for entity linking. On average, we find that crowdworkers are able to perform this task in about 20 seconds, corresponding to about $0.05 per instance. We requested 5 crowdworkers to annotate a small set of 200 relation instances from the 2015 TAC KBP corpus and measured a substantial inter-annotator agreement with a Fleiss' kappa of 0.61 with 3 crowdworkers and 0.62 with 5. Consequently, we take a majority vote over 3 workers in subsequent experiments.
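The aggregation step itself is a plain majority vote over the three binary judgments collected per instance; a minimal sketch:

```python
def majority_label(judgments):
    """Aggregate an odd number of binary crowdworker judgments (True = relation holds)."""
    return sum(judgments) > len(judgments) / 2

assert majority_label([True, False, True])
assert not majority_label([False, False, True])
```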

5.3 Sampling true instances

Sampling from the set of true instances Y is difficult because we can't even enumerate the elements of Y. As a proxy, we assume that relations are identically distributed across documents and have crowdworkers annotate a random subset of documents for relations using an interface we developed (Figure 4b). Crowdworkers begin by identifying every mention span in a document. For each mention, they are asked to identify its type, canonical mention within the document and associated Wikipedia page if possible. They are then presented with a separate interface to label predicates between pairs of mentions within a sentence that were identified earlier.

We compare crowdsourced annotations against those of expert annotators using data from the TAC KBP 2015 EDL task on 10 randomly chosen documents. We find that 3 crowdworkers together identify 92% of the entity spans identified by expert annotators, while 7 crowdworkers together identify 96%. When using a token-level majority vote to identify entities, 3 crowdworkers identify about 78% of the entity spans; this number does not change significantly with additional crowdworkers. We also measure substantial token-level inter-annotator agreement using Fleiss' kappa for identifying typed mention spans (κ = 0.83), canonical mentions (κ = 0.75) and entity links (κ = 0.75) with just three workers. Based on this analysis, we use token-level majority over 3 workers in subsequent experiments.

The entity annotation interface is far more involved and takes on average about 13 minutes per document, corresponding to about $2.60 per document, while the relation annotation interface takes on average about $2.25 per document. Because documents vary significantly in length and complexity, we set rewards for each document based on the number of tokens (0.75¢ per token) and mention pairs (5¢ per pair) respectively. With 3 workers per document, we paid about $15 per document on average. Each document contained an average of 9.2 relations, resulting in a cost of about $1.61 per relation instance. We note that this is about ten times as much as labeling a relation instance.


Figure 4: (a, b) Interfaces for annotating relations and entities respectively. (c, d) A comparison of bias for the pooling, simple and joint estimators on the TAC KBP 2015 challenge. Each point in the figure is a mean of 500 repeated trials; dotted lines show the 90% quartile. Both the simple and joint estimators are unbiased, and the joint estimator is able to significantly reduce variance. (e) A comparison of the number of samples used to estimate scores under the fixed and adaptive sample selection scheme. Each faint line shows the number of samples used during a single trial, while solid lines show the mean over 100 trials. The dashed line shows a square-root relationship between the number of systems evaluated and the number of samples required. Thus joint estimation combined with adaptive sample selection can reduce the number of labeled annotations required by an order of magnitude. (f) Precision (P), recall (R) and F1 scores from a pilot run of our evaluation service for ensembles of a rule-based system (P), a logistic classifier (L) and a neural network classifier (N) run on the TAC KBP 2016 document corpus:

Sys.     P        R        F1
TAC KBP evaluation
  P      47.6%    11.0%    17.9%
  P+L    35.5%    18.4%    24.2%
  P+L+N  26.3%    27.0%    26.6%
On-demand evaluation
  P      74.7%     5.8%    10.8%
  P+L    54.7%     7.6%    13.3%
  P+L+N  34.0%     9.8%    15.2%



We defer details regarding how documents themselves should be weighted to capture diverse entities that span documents to Appendix A.

6 Evaluation

Let us now see how well on-demand evaluation works in practice. We begin by empirically studying the bias and variance of the joint estimator proposed in Section 4 and find that it is able to correct for pooling bias while significantly reducing variance in comparison with the simple estimator. We then demonstrate that on-demand evaluation can serve as a practical replacement for the TAC KBP evaluations by piloting a new evaluation service we have developed to evaluate three distinct systems on the TAC KBP 2016 document corpus.

6.1 Bias and variance of the on-demand evaluation

Once again, we use the labeled system predictions from the TAC KBP 2015 evaluation and treat them as an exhaustively annotated dataset. To evaluate the pooling methodology, we construct an evaluation dataset using instances found by human annotators and labeled instances pooled from 9 randomly chosen teams (i.e. half the total number of participating teams), and use this dataset to evaluate the remaining 9 teams. On average, the pooled evaluation dataset contains between 5,000 and 6,000 labeled instances and evaluates 34 different systems (since each team may have submitted multiple systems). Next, we evaluated sets of 9 randomly chosen teams with our proposed simple and joint estimators using a total of 5,000 samples: about 150 of these samples are drawn from Y, i.e. the full TAC KBP 2015 evaluation data, and 150 samples from each of the systems being evaluated.

We repeat the above simulated experiment 500 times and compare the estimated precision and recall with their true values (Figure 4). The simulations once again highlight that the pooled methodology is biased, while the simple and joint estimators are not. Furthermore, the joint estimators significantly reduce variance relative to the simple estimators: the median 90% confidence intervals shrink from 0.14 to 0.06 for precision and from 0.14 to 0.08 for recall.

6.2 Number of samples required by on-demand evaluation

Separately, we evaluate the efficacy of the adaptive sample selection method described in Section 4.3 through another simulated experiment. In each trial of this experiment, we evaluate the top 40 systems in random order. As each subsequent system is evaluated, the number of samples to pick from the system is chosen to meet a target variance and added to the current pool of labeled instances. To make the experiment more interpretable, we choose the target variance to correspond with the estimated variance of having 500 samples. Figure 4 plots the results of the experiment. The number of samples required to estimate system scores quickly drops off from the benchmark of 500 samples as the pool of labeled instances covers more systems. This experiment shows that on-demand evaluation using joint estimation can scale up to an order of magnitude more submissions than a simple estimator for the same cost.

6.3 A mock evaluation for TAC KBP 2016

We have implemented the on-demand evaluation framework described here as an evaluation service to which researchers can submit their own system predictions. As a pilot of the service, we evaluated three relation extraction systems that also participated in the official 2016 TAC KBP competition. Each system uses Stanford CoreNLP (Manning et al., 2014) to identify entities, the Illinois Wikifier (Ratinov et al., 2011) to perform entity linking, and a combination of a rule-based system (P), a logistic classifier (L), and a neural network classifier (N) for relation extraction. We used 15,000 newswire documents from the 2016 TAC KBP evaluation as our document corpus. In total, 100 documents were exhaustively annotated for about $2,000, and 500 instances from each system were labeled for about $150 each. Evaluating all three systems only took about 2 hours.

Figure 4f reports the scores obtained through on-demand evaluation of these systems as well as their corresponding official TAC evaluation scores. While the relative ordering of systems between the two evaluations is the same, we note that precision and recall as measured through on-demand evaluation are respectively higher and lower than the official scores. This is to be expected because on-demand evaluation measures precision using each system's output as opposed to an externally defined set of evaluation entities. Likewise, recall is measured using exhaustive annotations of relations within the corpus instead of annotations from pooled output as in the official evaluation.

7 Related work

The subject of pooling bias has been extensively studied in the information retrieval (IR) community, starting with Zobel (1998), which examined the effects of pooling bias on the TREC AdHoc task but concluded that pooling bias was not a significant problem. However, when the topic was later revisited, Buckley et al. (2007) identified that the reason for the small bias was that the submissions to the task were too similar; upon repeating the experiment using a novel system as part of the TREC Robust track, they identified a 23% point drop in average precision scores!8

Many solutions to the pooling bias problem have been proposed in the context of information retrieval, e.g. adaptively constructing the pool to collect relevant data more cost-effectively (Zobel, 1998; Cormack et al., 1998; Aslam et al., 2006), or modifying the scoring metrics to be less sensitive to unassessed data (Buckley and Voorhees, 2004; Sakai and Kando, 2008; Aslam et al., 2006). Many of these ideas exploit the ranking of documents in IR, which does not apply to KBP. While both Aslam et al. (2006) and Yilmaz et al. (2008) estimate evaluation metrics by using importance sampling estimators, the techniques they propose require knowing the set of all submissions beforehand. In contrast, our on-demand methodology can produce unbiased evaluation scores for new development systems as well.

There have been several approaches taken to crowdsource data pertinent to knowledge base population (Vannella et al., 2014; Angeli et al., 2014; He et al., 2015; Liu et al., 2016). The most extensive annotation effort is probably Pavlick et al. (2016), which crowdsources a knowledge base for gun-violence related events. In contrast to previous work, our focus is on evaluating systems, not collecting a dataset. Furthermore, our main contribution is not a large dataset, but an evaluation service that allows anyone to use crowdsourcing to evaluate the predictions made by their system.

8 For the interested reader, Webber (2010) presents an excellent survey of the literature on pooling bias.

8 Discussion

Over the last ten years of the TAC KBP task, the gap between human and system performance has barely narrowed despite the community's best efforts: top automated systems score less than 36% F1 while human annotators score more than 60%. In this paper, we've shown that the current evaluation methodology may be a contributing factor because of its bias against novel system improvements. The new on-demand framework proposed in this work addresses this problem by obtaining human assessments of new system output through crowdsourcing. The framework is made economically feasible by carefully sampling the output to be assessed and correcting for sample bias through importance sampling.

Of course, simply providing better evaluation scores is only part of the solution, and it is clear that better datasets are also necessary. However, the very same difficulties in scale that make evaluating KBP difficult also make it hard to collect a high quality dataset for the task. As a result, existing datasets (Angeli et al., 2014; Adel et al., 2016) have relied on the output of existing systems, making it likely that they exhibit the same biases against novel systems that we've discussed in this paper. We believe that providing a fair and standardized evaluation platform as a service allows researchers to exploit such datasets while still being able to accurately measure their performance on the knowledge base population task.

There are many other tasks in NLP that are even harder to evaluate than KBP. Existing evaluation metrics for tasks with a generation component, such as summarization or dialogue, leave much to be desired. We believe that adapting the ideas of this paper to those tasks is a fruitful direction, as the progress of a research community is strongly tied to the fidelity of its evaluation.

Acknowledgments

We would like to thank Yuhao Zhang, Hoa Deng, Eduard Hovy, and Jacob Steinhardt for discussions, William E. Webber for his excellent thesis that helped shape this project, and the anonymous reviewers for their detailed and pertinent feedback. The first and second authors are supported under the DARPA DEFT program under AFRL prime contract no. FA8750-13-2-0040.


References

H. Adel, B. Roth, and H. Schutze. 2016. Comparing convolutional neural networks to traditional models for slot filling. In Human Language Technology and North American Association for Computational Linguistics (HLT/NAACL).

G. Angeli, J. Tibshirani, J. Y. Wu, and C. D. Manning. 2014. Combining distant and partial supervision for relation extraction. In Empirical Methods in Natural Language Processing (EMNLP).

J. A. Aslam, V. Pavlu, and E. Yilmaz. 2006. A statistical method for system evaluation using incomplete judgments. In ACM Special Interest Group on Information Retrieval (SIGIR), pages 541–548.

J. Berant, A. Chou, R. Frostig, and P. Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Empirical Methods in Natural Language Processing (EMNLP).

C. Buckley, D. Dimmick, I. Soboroff, and E. Voorhees. 2007. Bias and the limits of pooling for large collections. In ACM Special Interest Group on Information Retrieval (SIGIR).

C. Buckley and E. M. Voorhees. 2004. Retrieval evaluation with incomplete information. In ACM Special Interest Group on Information Retrieval (SIGIR), pages 25–32.

R. L. Burden and J. D. Faires. 1985. Numerical Analysis (3rd ed.). PWS Publishers.

G. V. Cormack, C. R. Palmer, and C. L. A. Clarke. 1998. Efficient construction of large test collections. In ACM Special Interest Group on Information Retrieval (SIGIR).

H. T. Dang. 2016. Cold start knowledge base population at TAC KBP 2016. Text Analytics Conference.

J. Ellis, X. Li, K. Griffitt, and S. M. Strassel. 2012. Linguistic resources for 2012 knowledge base population evaluations. Text Analytics Conference.

A. Fader, L. Zettlemoyer, and O. Etzioni. 2014. Open question answering over curated and extracted knowledge bases. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 1156–1165.

S. Han, J. Bang, S. Ryu, and G. G. Lee. 2015. Exploiting knowledge base to generate responses for natural language dialog listening agents. 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 129–133.

D. K. Harman. 1993. The first Text REtrieval Conference (TREC-1), Rockville, MD, U.S.A., 4–6 November 1992. Information Processing and Management, 29:411–414.

L. He, M. Lewis, and L. Zettlemoyer. 2015. Question-answer driven semantic role labeling: Using natural language to annotate natural language. In Empirical Methods in Natural Language Processing (EMNLP).

H. Ji, R. Grishman, and H. Trang Dang. 2011. Overview of the TAC 2011 knowledge base population track. In Text Analytics Conference.

K. S. Jones and C. V. Rijsbergen. 1975. Report on the need for and provision of an "ideal" test collection. Information Retrieval Test Collection.

A. Kalyanpur, B. K. Boguraev, S. Patwardhan, J. W. Murdock, A. Lally, C. A. Welty, J. M. Prager, B. Coppola, A. Fokoue-Nkoutche, L. Zhang, Y. Pan, and Z. M. Qui. 2012. Structured data and inference in DeepQA. IBM Journal of Research and Development, 56:351–364.

A. Liu, S. Soderland, J. Bragg, C. H. Lin, X. Ling, and D. S. Weld. 2016. Effective crowd annotation for relation extraction. In North American Association for Computational Linguistics (NAACL), pages 897–906.

C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL System Demonstrations.

A. B. Owen. 2013. Monte Carlo theory, methods and examples.

E. Pavlick, H. Ji, X. Pan, and C. Callison-Burch. 2016. The gun violence database: A new task and data set for NLP. In Empirical Methods in Natural Language Processing (EMNLP), pages 1018–1024.

L. Ratinov, D. Roth, D. Downey, and M. Anderson. 2011. Local and global algorithms for disambiguation to Wikipedia. In Association for Computational Linguistics (ACL).

S. Reddy, M. Lapata, and M. Steedman. 2014. Large-scale semantic parsing without question-answer pairs. Transactions of the Association for Computational Linguistics (TACL), 2(10):377–392.

T. Sakai and N. Kando. 2008. On information retrieval metrics designed for evaluation with incomplete relevance assessments. In ACM Special Interest Group on Information Retrieval (SIGIR), pages 447–470.

D. Vannella, D. Jurgens, D. Scarfini, D. Toscani, and R. Navigli. 2014. Validating and extending semantic knowledge bases using video games with a purpose. In Association for Computational Linguistics (ACL), pages 1294–1304.

W. E. Webber. 2010. Measurement in Information Retrieval Evaluation. Ph.D. thesis, University of Melbourne.

E. Yilmaz, E. Kanoulas, and J. A. Aslam. 2008. A simple and efficient sampling method for estimating AP and NDCG. In ACM Special Interest Group on Information Retrieval (SIGIR), pages 603–610.

J. Zobel. 1998. How reliable are the results of large-scale information retrieval experiments? In ACM Special Interest Group on Information Retrieval (SIGIR).

