(Budget, [accuracy]) Show Your Work: Improved Reporting of Experimental … · 2019. 9. 9. · ing...

Show Your Work: Improved Reporting of Experimental Results

Jesse Dodge♣ Suchin Gururangan♦ Dallas Card♥ Roy Schwartz♠♦ Noah A. Smith♠♦♣Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA

♦Allen Institute for Artificial Intelligence, Seattle, WA, USA♥Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA, USA

♠Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA{jessed,dcard}@cs.cmu.edu {suching,roys,noah}@allenai.org

Abstract

Research in natural language processing pro-ceeds, in part, by demonstrating that new mod-els achieve superior performance (e.g., accu-racy) on held-out test data, compared to pre-vious results. In this paper, we demonstratethat test-set performance scores alone are in-sufficient for drawing accurate conclusionsabout which model performs best. We arguefor reporting additional details, especially per-formance on validation data obtained duringmodel development. We present a novel tech-nique for doing so: expected validation per-formance of the best-found model as a func-tion of computation budget (i.e., the numberof hyperparameter search trials or the overalltraining time). Using our approach, we findmultiple recent model comparisons where au-thors would have reached a different conclu-sion if they had used more (or less) compu-tation. Our approach also allows us to esti-mate the amount of computation required toobtain a given accuracy; applying it to severalrecently published results yields massive vari-ation across papers, from hours to weeks. Weconclude with a set of best practices for report-ing experimental results which allow for ro-bust future comparisons, and provide code toallow researchers to use our technique.1

1 Introduction

In NLP and machine learning, improved perfor-mance on held-out test data is typically used asan indication of the superiority of one methodover others. But, as the field grows, there isan increasing gap between the large computa-tional budgets used for some high-profile exper-iments and the budgets used in most other work(Schwartz et al., 2019). This hinders meaning-ful comparison between experiments, as improve-ments in performance can, in some cases, be ob-

1https://github.com/allenai/allentune

(Budget, [accuracy])

Budget that favors LR

Budget that favors CNN

39.832.038.831.139.5

…

LR val. accuracy CNN val. accuracy

38.926.126.440.536.1

…

current practice:

max

max

report corresponding test-set accuracies

Hyp

erpa

ram

eter

as

sign

men

ts

Figure 1: Current practice when comparing NLP mod-els is to train multiple instantiations of each, choosethe best model of each type based on validation per-formance, and compare their performance on test data(inner box). Under this setup, (assuming test-set re-sults are similar to validation), one would concludefrom the results above (hyperparameter search for twomodels on the 5-way SST classification task) that theCNN outperforms Logistic Regression (LR). In ourproposed evaluation framework, we instead encouragepractitioners to consider the expected validation accu-racy (y-axis; shading shows ±1 standard deviation), asa function of budget (x-axis). Each point on a curveis the expected value of the best validation accuracyobtained (y) after evaluating x random hyperparame-ter values. Note that (1) the better performing modeldepends on the computational budget; LR has higherexpected performance for budgets up to 10 hyperpa-rameter assignments, while the CNN is better for largerbudgets. (2) Given a model and desired accuracy (e.g.,0.395 for CNN), we can estimate the expected budgetrequired to reach it (16; dotted lines).

arX

iv:1

909.

0300

4v1

[cs

.LG

] 6

Sep

201

9

https://github.com/allenai/allentune

tained purely through more intensive hyperparam-eter tuning (Melis et al., 2018; Lipton and Stein-hardt, 2018).2

Moreover, recent investigations into “state-of-the-art” claims have found competing methodsto only be comparable, without clear superiority,even against baselines (Reimers and Gurevych,2017; Lucic et al., 2018; Li and Talwalkar, 2019);this has exposed the need for reporting more thana single point estimate of performance.

Echoing calls for more rigorous scientific prac-tice in machine learning (Lipton and Steinhardt,2018; Sculley et al., 2018), we draw attention tothe weaknesses in current reporting practices andpropose solutions which would allow for fairercomparisons and improved reproducibility.

Our primary technical contribution is the intro-duction of a tool for reporting validation resultsin an easily interpretable way: expected valida-tion performance of the best model under a givencomputational budget.3 That is, given a budgetsufficient for training and evaluating n models,we calculate the expected performance of the bestof these models on validation data. Note thatthis differs from the best observed value after nevaluations. Because the expectation can be esti-mated from the distribution of N validation per-formance values, with N ≥ n, and these are ob-tained during model development,4 our methoddoes not require additional computation be-yond hyperparameter search or optimization. Weencourage researchers to report expected valida-tion performance as a curve, across values of n ∈{1, . . . , N}.

As we show in §4.3, our approach makes clearthat the expected-best performing model is a func-tion of the computational budget. In §4.4 weshow how our approach can be used to estimatethe budget that went into obtaining previous re-sults; in one example, we see a too-small bud-get for baselines, while in another we estimate abudget of about 18 GPU days was used (but notreported). Previous work on reporting validationperformance used the bootstrap to approximate themean and variance of the best performing model(Lucic et al., 2018); in §3.2 we show that our ap-

2Recent work has also called attention to the environmen-tal cost of intensive model exploration (Strubell et al., 2019).

3We use the term performance as a general evaluationmeasure, e.g., accuracy, F1, etc.

4We leave forecasting performance with larger budgetsn > N to future work.

proach computes these values with strictly less er-ror than the bootstrap.

We conclude by presenting a set of recommen-dations for researchers that will improve scientificreporting over current practice. We emphasizethis work is about reporting, not about runningadditional experiments (which undoubtedly canimprove evidence in comparisons among mod-els). Our reporting recommendations aim at repro-ducibility and improved understanding of sensitiv-ity to hyperparameters and random initializations.Some of our recommendations may seem obvious;however, our empirical analysis shows that out offifty EMNLP 2018 papers chosen at random, nonereport all items we suggest.

2 Background

Reproducibility Reproducibility in machinelearning is often defined as the ability to pro-duce the exact same results as reported by thedevelopers of the model. In this work, we followGundersen and Kjensmo (2018) and use anextended notion of this concept: when comparingtwo methods, two research groups with differentimplementations should follow an experimentalprocedure which leads to the same conclusionabout which performs better. As illustrated inFig. 1, this conclusion often depends on theamount of computation applied. Thus, to make areproducible claim about which model performsbest, we must also take into account the budgetused (e.g., the number of hyperparameter trials).

Notation We use the term model family to referto an approach subject to comparison and to hyper-parameter selection.5 Each model family M re-quires its own hyperparameter selection, in termsof a set of k hypermarameters, each of which de-fines a range of possible values. A hyperparame-ter value (denoted h) is a k-tuple of specific val-ues for each hyperparameter. We call the set of allpossible hyperparameter valuesHM.6 GivenHMand a computational budget sufficient for train-ing B models, the set of hyperparameter values is{h1, . . . , hB}, hi ∈ HM. We let mi ∈ M denotethe model trained with hyperparameter value hi.

5Examples include different architectures, but also abla-tions of the same architecture.

6The hyperparameter value space can also include the ran-dom seed used to initialize the model, and some specifica-tions such as the size of the hidden layers in a neural network,in addition to commonly tuned values such as learning rate.

Hyperparameter value selection There aremany ways of selecting hyperparameter values,hi. Grid search and uniform sampling are popularsystematic methods; the latter has been shown tobe superior for most search spaces (Bergstra andBengio, 2012). Adaptive search strategies suchas Bayesian optimization select hi after evaluat-ing h1, . . . , hi−1. While these strategies may findbetter results quickly, they are generally less repro-ducible and harder to parallelize (Li et al., 2017).Manual search, where practitioners use knowl-edge derived from previous experience to adjusthyperparameters after each experiment, is a typeof adaptive search that is the least reproducible,as different practitioners make different decisions.Regardless of the strategy adopted, we advocatefor detailed reporting of the method used for hy-perparmeter value selection (§5). We next intro-duce a technique to visualize results of sampleswhich are drawn i.i.d. (e.g., random initializationsor uniformly sampled hyperparameter values).

3 Expected Validation PerformanceGiven Budget

After selecting the best hyperparameter values hi∗from among {h1, . . . , hB} with actual budget B,NLP researchers typically evaluate the associatedmodel mi∗ on the test set and report its perfor-mance as an estimate of the family M’s abilityto generalize to new data. We propose to makebetter use of the intermediately-trained modelsm1, . . . ,mB .

For any set of n hyperparmeter values, denotethe validation performance of the best model as

v∗n = maxh∈{h1,...,hn}A(M, h,DT ,DV ), (1)

whereA denotes an algorithm that returns the per-formance on validation data DV after training amodel from family M with hyperparameter val-ues h on training data DT .7 We view evaluationsof A as the elementary unit of experimental cost.8

Though not often done in practice, procedure(1) could be repeated many times with differenthyperparameter values, yielding a distribution ofvalues for random variable V ∗n . This would allowus to estimate the expected performance, E[V ∗n |n] (given n hyperparameter configurations). The

7A captures standard parameter estimation, as well as pro-cedures that depend on validation data, like early stopping.

8Note that researchers do not always report validation, butrather test performance, a point we will return to in §5.

key insight used below is that, if we use randomsearch for hyperparameter selection, then the ef-fort that goes into a single round of random search(Eq. 1) suffices to construct a useful estimate ofexpected validation performance, without requir-ing any further experimentation.

Under random search, the n hyperparametervalues h1, . . . , hn are drawn uniformly at randomfromHM, so the values ofA(M, hi,DT ,DV ) arei.i.d. As a result, the maximum among these is it-self a random variable. We introduce a diagnosticthat captures information about the computationused to generate a result: the expectation of max-imum performance, conditioned on n, the amountof computation used in the maximization over hy-perparameters and random initializations:

E[maxh∈{h1,...,hn}A(M, h,DT ,DV ) | n

]. (2)

Reporting this expectation as we vary n ∈{1, 2, . . . , B} gives more information than themaximum v∗B (Eq. 1 with n = B); future re-searchers who use this model will know moreabout the computation budget required to achievea given performance. We turn to calculating thisexpectation, then we compare it to the bootstrap(§3.2), and discuss estimating variance (§3.3).

3.1 Expected MaximumWe describe how to estimate the expected maxi-mum validation performance (Eq. 2) given a bud-get of n hyperparameter values.9

Assume we draw {h1, . . . , hn} uniformly atrandom from hyperparameter space HM. Eachevaluation of A(M, h,DT ,DV ) is therefore ani.i.d. draw of a random variable, denoted Vi, withobserved value vi for hi ∼ HM. Let the maxi-mum among n i.i.d. draws from an unknown dis-tribution be

V ∗n = maxi∈{1,...,n} Vi (3)

We seek the expected value of V ∗n given n:

E[V ∗n | n] =∑

v v · P (V ∗n = v | n) (4)

where P (V ∗n | n) is the probability mass function(PMF) for the max-random variable.10 For dis-

9Conversion to alternate formulations of budget, suchas GPU hours or cloud-machine rental cost in dollars, isstraightforward in most cases.

10For a finite validation set DV , most performance mea-sures (e.g., accuracy) only take on a finite number of possiblevalues, hence the use of a sum instead of an integral in Eq. 4.

crete random variables,

P (V ∗n = v | n) = P (V ∗n ≤ v | n)− P (V ∗n < v | n),(5)

Using the definition of “max”, and the fact thatthe Vi are drawn i.i.d.,

P (V ∗n ≤ v | n) = P(maxi∈{1,...,n} Vi ≤ v | n

)= P (V1 ≤ v, V2 ≤ v, . . . , Vn ≤ v | n)=∏n

i=1 P (Vi ≤ v) = P (V ≤ v)n, (6)

and similarly for P (V ∗n < v | n).P (V ≤ v) and P (V < v) are cumulative distri-

bution functions, which we can estimate using theempirical distribution, i.e.

P (V ≤ v) = 1n

∑ni=1 1[Vi≤v] (7)

and similarly for strict inequality.Thus, our estimate of the expected maximum

validation performance is

E[V ∗n | n] =∑

v v · (P (Vi ≤ v)n − P (Vi < v)n).(8)

Discussion As we increase the amount of com-putation for evaluating hyperparameter values (n),the maximum among the samples will approachthe observed maximum v∗B . Hence the curve ofE[V ∗n | n] as a function of n will appear to asymp-tote. Our focus here is not on estimating that value,and we do not make any claims about extrapola-tion of V ∗ beyond B, the number of hyperparam-eter values to which A is actually applied.

Two points follow immediately from our deriva-tion. First, at n = 1, E[V ∗1 | n = 1] is the mean ofv1, . . . , vn. Second, for all n, E[V ∗n | n] ≤ v∗n =maxi vi, which means the curve is a lower boundon the selected model’s validation performance.

3.2 Comparison with BootstrapLucic et al. (2018) and Henderson et al. (2018)have advocated for using the bootstrap to estimatethe mean and variance of the best validation per-formance. The bootstrap (Efron and Tibshirani,1994) is a general method which can be used toestimate statistics that do not have a closed form.The bootstrap process is as follows: draw N i.i.d.samples (in our case, N model evaluations). Fromthese N points, sample n points (with replace-ment), and compute the statistic of interest (e.g.,the max). Do this K times (where K is large), and

average the computed statistic. By the law of largenumbers, as K → ∞ this average converges tothe sample expected value (Efron and Tibshirani,1994).

The bootstrap has two sources of error: the errorfrom the finite sample of N points, and the errorintroduced by resampling these points K times.Our approach has strictly less error than using thebootstrap: our calculation of the expected maxi-mum performance in §3.1 provides a closed-formsolution, and thus contains none of the resamplingerror (the finite sample error is the same).

3.3 Variance of V ∗nExpected performance becomes more useful withan estimate of variation. When using the boot-strap, standard practice is to report the standarddeviation of the estimates from the K resamples.As K →∞, this standard deviation approximatesthe sample standard error (Efron and Tibshirani,1994). We instead calculate this from the distribu-tion in Eq. 5 using the standard plug-in-estimator.

In most cases, we advocate for reporting a mea-sure of variability such as the standard deviationor variance; however, in some cases it might causeconfusion. For example, when the variance islarge, plotting the expected value plus the variancecan go outside of reasonable bounds, such as accu-racy greater than any observed (even greater than1). In such situations, we recommend shadingonly values within the observed range, such as inFig. 4. Additionally, in situations where the vari-ance is high and variance bands overlap betweenmodel families (e.g., Fig. 1), the mean is still themost informative statistic.

4 Case Studies

Here we show two clear use cases of our method.First, we can directly estimate, for a given budget,which approach has better performance. Second,we can estimate, given our experimental setup, thebudget for which the reported validation perfor-mance (V ∗) matches a desired performance level.We present three examples that demonstrate theseuse cases. First, we reproduce previous findingsthat compared different models for text classifi-cation. Second, we explore the time vs. perfor-mance tradeoff of models that use contextual wordembeddings (Peters et al., 2018). Third, fromtwo previously published papers, we examine thebudget required for our expected performance to

match their reported performance. We find thesebudget estimates vary drastically. Consistently, wesee that the best model is a function of the budget.We publicly release the search space and trainingconfigurations used for each case study. 11

Note that we do not report test performance inour experiments, as our purpose is not to establisha benchmark level for a model, but to demonstratethe utility of expected validation performance formodel comparison and reproducibility.

4.1 Experimental Details

For each experiment, we document the hyperpa-rameter search space, hardware, average runtime,number of samples, and links to model imple-mentations. We use public implementations forall models in our experiments, primarily in Al-lenNLP (Gardner et al., 2018). We use Tune (Liawet al., 2018) to run parallel evaluations of uni-formly sampled hyperparameter values.

4.2 Validating Previous Findings

We start by applying our technique on a text classi-fication task in order to confirm a well-establishedobservation (Yogatama and Smith, 2015): logisticregression has reasonable performance with mini-mal hyperparameter tuning, but a well-tuned con-volutional neural network (CNN) can perform bet-ter.

We experiment with the fine-grained Stan-ford Sentiment Treebank text classification dataset(Socher et al., 2013). For the CNN classifier, weembed the text with 50-dim GloVe vectors (Pen-nington et al., 2014), feed the vectors to a Con-vNet encoder, and feed the output representationinto a softmax classification layer. We use thescikit-learn implementation of logistic regressionwith bag-of-word counts and a linear classificationlayer. The hyperparameter spaces HCNN and HLRare detailed in Appendix B. For logistic regres-sion we used bounds suggested by Yogatama andSmith (2015), which include term weighting, n-grams, stopwords, and learning rate. For the CNNwe follow the hyperparameter sensitivity analysisin Zhang and Wallace (2015).

We run 50 trials of random hyperparametersearch for each classifier. Our results (Fig. 1) con-firm previous findings (Zhang and Wallace, 2015):under a budget of fewer than 10 hyperparameter

11https://github.com/allenai/show-your-work

30min 1h 6h 1d 3d 10dTraining duration

0.76

0.78

0.80

0.82

0.84

0.86

0.88

0.90

0.92

Exp

ecte

d va

lidat

ion

accu

racy

SST (binary)

GloVe + ELMo (FT)GloVe + ELMo (FR)GloVe

Figure 2: Expected maximum performance of a BCNclassifier on SST. We compare three embedding ap-proaches (GloVe embeddings, GloVe + frozen ELMo,and GloVe + fine-tuned ELMo). The x-axis is time, ona log scale. We omit the variance for visual clarity. Foreach of the three model families, we sampled 50 hy-perparameter values, and plot the expected maximumperformance with the x-axis values scaled by the aver-age training duration. The plot shows that for each ap-proach (GloVe, ELMo frozen, and ELMo fine-tuned),there exists a budget for which it is preferable.

search trials, logistic regression achieves a higherexpected validation accuracy than the CNN. Asthe budget increases, the CNN gradually improvesto a higher overall expected validation accuracy.For all budgets, logistic regression has lower vari-ance, so may be a more suitable approach for fastprototyping.

4.3 Contextual Representations

We next explore how computational budget affectsthe performance of contextual embedding models(Peters et al., 2018). Recently, Peters et al. (2019)compared two methods for using contextual rep-resentations for downstream tasks: feature extrac-tion, where features are fixed after pretraining andpassed into a task-specific model, or fine-tuning,where they are updated during task training. Pe-ters et al. (2019) found that feature extraction ispreferable to fine-tuning ELMo embeddings. Herewe set to explore whether this conclusion dependson the experimental budget.

Closely following their experimental setup, inFig. 2 we show the expected performance of thebiattentive classification network (BCN; McCannet al., 2017) with three embedding approaches(GloVe only, GloVe + ELMo frozen, and GloVe

https://github.com/allenai/show-your-work

https://github.com/allenai/show-your-work

+ ELMo fine-tuned), on the binary Stanford Sen-timent Treebank task.12

We use time for the budget by scaling thecurves by the average observed training durationfor each model. We observe that as the time bud-get increases, the expected best-performing modelchanges. In particular, we find that our experi-mental setup leads to the same conclusion as Pe-ters et al. (2019) given a budget between approxi-mately 6 hours and 1 day. For larger budgets (e.g.,10 days) fine-tuning outperforms feature extrac-tion. Moreover, for smaller budgets (< 2 hours),using GloVe embeddings is preferable to ELMo(frozen or fine-tuned).

4.4 Inferring Budgets in Previous Reports

Our method provides another appealing property:estimating the budget required for the expectedperformance to reach a particular level, which wecan compare against previously reported results.We present two case studies, and show that theamount of computation required to match the re-ported results varies drastically.

We note that in the two examples that follow,the original papers only reported partial experi-mental information; we made sure to tune the hy-perparameters they did list in addition to standardchoices (such as the learning rate). In neither casedo they report the method used to tune the hyper-parameters, and we suspect they tuned them man-ually. Our experiments here are meant give anidea of the budget that would be required to re-produce their results or to apply their models toother datasets under random hyperparameter valueselection.

SciTail When introducing the SciTail textual en-tailment dataset, Khot et al. (2018) comparedfour models: an n-gram baseline, which mea-sures word-overlap as an indicator of entailment,ESIM (Chen et al., 2017), a sequence-based en-tailment model, DAM (Parikh et al., 2016), a bag-of-words entailment model, and their proposedmodel, DGEM (Khot et al., 2018), a graph-basedstructured entailment model. Their conclusionwas that DGEM outperforms the other models.

12Peters et al. (2019) use a BCN with frozen embeddingsand a BiLSTM BCN for fine-tuning. We conducted experi-ments with both a BCN and a BiLSTM with frozen and fine-tuned embeddings, and found our conclusions to be consis-tent. We report the full hyperparameter search space, whichmatched Peters et al. (2019) as closely as their reporting al-lowed, in Appendix C.

5 10 50 100Hyperparameter assignments

0.625

0.650

0.675

0.700

0.725

0.750

0.775

0.800

0.825

Exp

ecte

d va

lidat

ion

accu

racy

reported DGEM accuracy

reported DAM accuracy

reported ESIM accuracy

reported n-gram baseline accuracy

SciTail

DGEMDAMESIMn-gram baseline

Figure 3: Comparing reported accuracies (dashedlines) on SciTail to expected validation performanceunder varying levels of compute (solid lines). The es-timated budget required for expected performance tomatch the reported result differs substantially acrossmodels, and the relative ordering varies with budget.We omit variance for visual clarity.

We use the same implementations of each ofthese models each with a hyperparameter searchspace detailed in Appendix D.13 We use a budgetbased on trials instead of runtime so as to empha-size how these models behave when given a com-parable number of hyperparameter configurations.

13The search space bounds we use are large neighborhoodsaround the hyperparameter assignments specified in the pub-lic implementations of these models. Note that these curvesdepend on the specific hyperparameter search space adopted;as the original paper does not report hyperparameter search ormodel selection details, we have chosen what we believe tobe reasonable bounds, and acknowledge that different choicescould result in better or worse expected performance.

8h 1d 3d 10d 18d 1moTraining duration

0.1

0.2

0.3

0.4

0.5

0.6

0.7E

xpec

ted

valid

atio

n E

Mreported BIDAF EM

SQuAD

BIDAF

Figure 4: Comparing reported development exact-match score of BIDAF (dashed line) on SQuAD toexpected performance of the best model with varyingcomputational budgets (solid line). The shaded arearepresents the expected performance ±1 standard de-viation, within the observed range of values. It takesabout 18 days (55 hyperparameter trials) for the ex-pected performance to match the reported results.

Our results (Fig. 3) show that the different mod-els require different budgets to reach their reportedperformance in expectation, ranging from 2 (n-gram) to 20 (DGEM). Moreover, providing a largebudget for each approach improves performancesubstantially over reported numbers. Finally, un-der different computation budgets, the top per-forming model changes (though the neural modelsare similar).

SQuAD Next, we turn our attention to SQuAD(Rajpurkar et al., 2016) and report performanceof the commonly-used BiDAF model (Seo et al.,2017). The set of hyperparameters we tune coversthose mentioned in addition to standard choices(details in Appendix D). We see in Fig. 4 thatwe require a budget of 18 GPU days in orderfor the expected maximum validation performanceto match the value reported in the original paper.This suggests that some combination of prior in-tuition and extensive hyperparameter tuning wereused by the original authors, though neither werereported.

X For all reported experimental results

� Description of computing infrastructure� Average runtime for each approach� Details of train/validation/test splits� Corresponding validation performance for each

reported test result� A link to implemented code

X For experiments with hyperparameter search

� Bounds for each hyperparameter� Hyperparameter configurations for best-

performing models� Number of hyperparameter search trials� The method of choosing hyperparameter values

(e.g., uniform sampling, manual tuning, etc.) andthe criterion used to select among them (e.g., ac-curacy)

� Expected validation performance, as introducedin §3.1, or another measure of the mean and vari-ance as a function of the number of hyperparam-eter trials.

Text Box 1: Experimental results checklist.

5 Recommendations

Experimental results checklist The findingsdiscussed in this paper and other similar effortshighlight methodological problems in experimen-tal NLP. In this section we provide a checklist toencourage researchers to report more comprehen-sive experimentation results. Our list, shown inText Box 1, builds on the reproducibility checklistthat was introduced for the machine learning com-munity during NeurIPS 2018 (which is required tobe filled out for each NeurIPS 2019 submission;Pineau, 2019).

Our focus is on improved reporting of exper-imental results, thus we include relevant pointsfrom their list in addition to our own. Similarto other calls for improved reporting in machinelearning (Mitchell et al., 2019; Gebru et al., 2018),we recommend pairing experimental results withthe information from this checklist in a structuredformat (see examples provided in Appendix A).

EMNLP 2018 checklist coverage. To estimatehow commonly this information is reported in theNLP community, we sample fifty random EMNLP2018 papers that include experimental results andevaluate how well they conform to our proposedreporting guidelines. We find that none of thepapers reported all of the items in our checklist.However, every paper reported at least one itemin the checklist, and each item is reported by at

least one paper. Of the papers we analyzed, 74%reported at least some of the best hyperparameterassignments. By contrast, 10% or fewer papers re-ported hyperparameter search bounds, the numberof hyperparameter evaluation trials, or measures ofcentral tendency and variation. We include the fullresults of this analysis in Table 1 in the Appendix.

Comparisons with different budgets. We haveargued that claims about relative model perfor-mance should be qualified by computational ex-pense. With varying amounts of computation, notall claims about superiority are valid. If two mod-els have similar budgets, we can claim one outper-forms the other (with that budget). Similarly, if amodel with a small budget outperforms a modelwith a large budget, increasing the small budgetwill not change this conclusion. However, if amodel with a large budget outperforms a modelwith a small budget, the difference might be dueto the model or the budget (or both). As a con-crete example, Melis et al. (2018) report the per-formance of an LSTM on language modeling thePenn Treebank after 1,500 rounds of Bayesian op-timization; if we compare to a new M with asmaller budget, we can only draw a conclusion ifthe new model outperforms the LSTM.14

In a larger sense, there may be no simple wayto make a comparison “fair.” For example, the twomodels in Fig. 1 have hyperparameter spaces thatare different, so fixing the same number of hy-perparameter trials for both models does not im-ply a fair comparison. In practice, it is often notpossible to measure how much past human experi-ence has contributed to reducing the hyperparame-ter bounds for popular models, and there might notbe a way to account for the fact that better under-stood (or more common) models can have betterspaces to optimize over. Further, the cost of oneapplication of A might be quite different depend-ing on the model family. Converting to runtimeis one possible solution, but implementation effortcould still affect comparisons at a fixed x-value.Because of these considerations, our focus is onreporting whatever experimental results exist.

6 Discussion: Reproducibility

In NLP, the use of standardized test sets and pub-lic leaderboards (which limit test evaluations) has

14This is similar to controlling for the amount of trainingdata, which is an established norm in NLP research.

helped to mitigate the so-called “replication cri-sis” happening in fields such as psychology andmedicine (Ioannidis, 2005; Gelman and Loken,2014). Unfortunately, leaderboards can createadditional reproducibility issues (Rogers, 2019).First, leaderboards obscure the budget that wasused to tune hyperparameters, and thus the amountof work required to apply a model to a new dataset.Second, comparing to a model on a leaderboard isdifficult if they only report test scores. For exam-ple, on the GLUE benchmark (Wang et al., 2018),the differences in test set performance between thetop performing models can be on the order of atenth of a percent, while the difference betweentest and validation performance might be one per-cent or larger. Verifying that a new implemen-tation matches established performance requiressubmitting to the leaderboard, wasting test eval-uations. Thus, we recommend leaderboards re-port validation performance for models evaluatedon test sets.

As an example, consider Devlin et al. (2019),which introduced BERT and reported state-of-the-art results on the GLUE benchmark. The authorsprovide some details about the experimental setup,but do not report a specific budget. Subsequentwork which extended BERT (Phang et al., 2018)included distributions of validation results, and wehighlight this as a positive example of how to re-port experimental results. To achieve comparabletest performance to Devlin et al. (2019), the au-thors report the best of twenty or one hundred ran-dom initializations. Their validation performancereporting not only illuminates the budget requiredto fine-tune BERT on such tasks, but also givesother practitioners results against which they cancompare without submitting to the leaderboard.

7 Related Work

Lipton and Steinhardt (2018) address a number ofproblems with the practice of machine learning,including incorrectly attributing empirical gainsto modeling choices when they came from othersources such as hyperparameter tuning. Sculleyet al. (2018) list examples of similar evaluation is-sues, and suggest encouraging stronger standardsfor empirical evaluation. They recommend de-tailing experimental results found throughout theresearch process in a time-stamped document, asis done in other experimental science fields. Ourwork formalizes these issues and provides an ac-

tionable set of recommendations to address them.Reproducibility issues relating to standard data

splits (Schwartz et al., 2011; Gorman and Bedrick,2019; Recht et al., 2019a,b) have surfaced in anumber of areas. Shuffling standard training, val-idation, and test set splits led to a drop in perfor-mance, and in a number of cases the inability toreproduce rankings of models. Dror et al. (2017)studied reproducibility in the context of consis-tency among multiple comparisons.

Limited community standards exist for docu-menting datasets and models. To address this, Ge-bru et al. (2018) recommend pairing new datasetswith a “datasheet” which includes informationsuch as how the data was collected, how it wascleaned, and the motivation behind building thedataset. Similarly, Mitchell et al. (2019) advocatefor including a “model card” with trained mod-els which document training data, model assump-tions, and intended use, among other things. Ourrecommendations in §5 are meant to document rel-evant information for experimental results.

8 Conclusion

We have shown how current practice in experi-mental NLP fails to support a simple standard ofreproducibility. We introduce a new technique forestimating the expected validation performance ofa method, as a function of computation budget,and present a set of recommendations for report-ing experimental findings.

Acknowledgments

This work was completed while the first authorwas an intern at the Allen Institute for ArtificialIntelligence. The authors thank Kevin Jamieson,Samuel Ainsworth, and the anonymous reviewersfor helpful feedback.

ReferencesJames Bergstra and Yoshua Bengio. 2012. Random

search for hyper-parameter optimization. JMLR,13:281–305.

Qian Chen, Xiao-Dan Zhu, Zhen-Hua Ling, Si Wei,Hui Jiang, and Diana Inkpen. 2017. EnhancedLSTM for natural language inference. In Proc. ofACL.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. 2019. BERT: Pre-training ofdeep bidirectional transformers for language under-standing. In Proc. of NAACL.

Rotem Dror, Gili Baumer, Marina Bogomolov, and RoiReichart. 2017. Replicability analysis for naturallanguage processing: Testing significance with mul-tiple datasets. TACL, 5:471–486.

Bradley Efron and Robert Tibshirani. 1994. An Intro-duction to the Bootstrap. CRC Press.

Matt Gardner, Joel Grus, Mark Neumann, OyvindTafjord, Pradeep Dasigi, Nelson F. Liu, Matthew E.Peters, Michael Schmitz, and Luke S. Zettlemoyer.2018. AllenNLP: A deep semantic natural languageprocessing platform. In Proc. of NLP-OSS.

Timnit Gebru, Jamie H. Morgenstern, Briana Vec-chione, Jennifer Wortman Vaughan, Hanna M.Wallach, Hal Daume, and Kate Crawford. 2018.Datasheets for datasets. arXiv:1803.09010.

Andrew Gelman and Eric Loken. 2014. The statisticalcrisis in science. American Scientist, 102:460.

Kyle Gorman and Steven Bedrick. 2019. We need totalk about standard splits. In Proc. of ACL.

Odd Erik Gundersen and Sigbjrn Kjensmo. 2018. Stateof the art: Reproducibility in artificial intelligence.In Proc. of AAAI.

Peter Henderson, Riashat Islam, Philip Bachman,Joelle Pineau, Doina Precup, and David Meger.2018. Deep reinforcement learning that matters. InProc. of AAAI.

John P. A. Ioannidis. 2005. Why most published re-search findings are false. PLoS Med, 2(8).

Tushar Khot, Ashutosh Sabharwal, and Peter Clark.2018. SciTaiL: A textual entailment dataset fromscience question answering. In Proc. of AAAI.

Liam Li and Ameet Talwalkar. 2019. Random searchand reproducibility for neural architecture search. InProc. of UAI.

Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Ros-tamizadeh, and Ameet Talwalkar. 2017. Hyperband:Bandit-based configuration evaluation for hyperpa-rameter optimization. In Proc. of ICLR.

Richard Liaw, Eric Liang, Robert Nishihara, PhilippMoritz, Joseph E Gonzalez, and Ion Stoica. 2018.Tune: A research platform for distributed model se-lection and training. In Proc. of the ICML Workshopon AutoML.

Zachary C. Lipton and Jacob Steinhardt. 2018.Troubling trends in machine learning scholarship.arXiv:1807.03341.

Mario Lucic, Karol Kurach, Marcin Michalski, OlivierBousquet, and Sylvain Gelly. 2018. Are GANscreated equal? A large-scale study. In Proc. ofNeurIPS.

http://dl.acm.org/citation.cfm?id=2188385.2188395

http://dl.acm.org/citation.cfm?id=2188385.2188395

https://arxiv.org/abs/1609.06038





https://doi.org/10.1162/tacl_a_00074






https://doi.org/10.1511/2014.111.460

https://doi.org/10.1511/2014.111.460

https://www.aclweb.org/anthology/P19-1267/

https://www.aclweb.org/anthology/P19-1267/

https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewFile/17248/15864

https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewFile/17248/15864


https://doi.org/10.1371/journal.pmed.0020124

https://doi.org/10.1371/journal.pmed.0020124

http://ai2-website.s3.amazonaws.com/publications/scitail-aaai-2018_cameraready.pdf

http://ai2-website.s3.amazonaws.com/publications/scitail-aaai-2018_cameraready.pdf



https://openreview.net/pdf?id=ry18Ww5ee








Bryan McCann, James Bradbury, Caiming Xiong, andRichard Socher. 2017. Learned in translation: Con-textualized word vectors. In Proc. of NeurIPS.

Gabor Melis, Chris Dyer, and Phil Blunsom. 2018. Onthe state of the art of evaluation in neural languagemodels. In Proc. of EMNLP.

Margaret Mitchell, Simone Wu, Andrew Zaldivar,Parker Barnes, Lucy Vasserman, Ben Hutchinson,Elena Spitzer, Inioluwa Deborah Raji, and TimnitGebru. 2019. Model cards for model reporting. InProc. of FAT*.

Ankur P. Parikh, Oscar Tackstrom, Dipanjan Das, andJakob Uszkoreit. 2016. A decomposable attentionmodel for natural language inference. In Proc. ofEMNLP.

Jeffrey Pennington, Richard Socher, and ChristopherManning. 2014. GloVe: Global vectors for wordrepresentation. In Proc. of EMNLP.

Matthew Peters, Sebastian Ruder, and Noah A. Smith.2019. To tune or not to tune? Adapting pretrainedrepresentations to diverse tasks. In Proc. of theRepL4NLP Workshop at ACL.

Matthew E. Peters, Mark Neumann, Mohit Iyyer,Matt Gardner, Christopher Clark, Kenton Lee, andLuke S. Zettlemoyer. 2018. Deep contextualizedword representations. In Proc. of NAACL.

Jason Phang, Thibault Fevry, and Samuel R. Bow-man. 2018. Sentence encoders on STILTs: Supple-mentary training on intermediate labeled-data tasks.arXiv:1811.01088.

Joelle Pineau. 2019. Machine learning reproducibilitychecklist. https://www.cs.mcgill.ca/˜jpineau/ReproducibilityChecklist.pdf. Accessed: 2019-5-14.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, andPercy Liang. 2016. SQuAD: 100,000+ questionsfor machine comprehension of text. In Proc. ofEMNLP.

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt,and Vaishaal Shankar. 2019a. Do CIFAR-10 classi-fiers generalize to CIFAR-10? arXiv:1806.00451.

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt,and Vaishaal Shankar. 2019b. Do ImageNet classi-fiers generalize to ImageNet? In Proc. of ICML.

Nils Reimers and Iryna Gurevych. 2017. Reportingscore distributions makes a difference: Performancestudy of LSTM-networks for sequence tagging. InProc. of EMNLP.

Anna Rogers. 2019. How the transformers broke NLPleaderboards. https://hackingsemantics.xyz/2019/leaderboards/. Accessed: 2019-8-29.

Roy Schwartz, Omri Abend, Roi Reichart, and AriRappoport. 2011. Neutralizing linguistically prob-lematic annotations in unsupervised dependencyparsing evaluation. In Proc. of ACL.

Roy Schwartz, Jesse Dodge, Noah A. Smith, and OrenEtzioni. 2019. Green AI. arXiv:1907.10597.

D. Sculley, Jasper Snoek, Ali Rahimi, and AlexWiltschko. 2018. Winner’s curse? On pace,progress, and empirical rigor. In Proc. of ICLR(Workshop Track).

Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, andHannaneh Hajishirzi. 2017. Bidirectional attentionflow for machine comprehension. In Proc. of ICLR.

Richard Socher, Alex Perelygin, Jean Wu, JasonChuang, Christopher D. Manning, Andrew Y. Ng,and Christopher Potts. 2013. Recursive deep mod-els for semantic compositionality over a sentimenttreebank. In Proc. of EMNLP.

Emma Strubell, Ananya Ganesh, and Andrew McCal-lum. 2019. Energy and policy considerations fordeep learning in NLP. In Proc. of ACL.

Alex Wang, Amanpreet Singh, Julian Michael, FelixHill, Omer Levy, and Samuel R. Bowman. 2018.GLUE: A multi-task benchmark and analysis plat-form for natural language understanding. In Proc.of ICLR.

Dani Yogatama and Noah A. Smith. 2015. Bayesianoptimization of text representations. In Proc. ofEMNLP.

Ye Zhang and Byron Wallace. 2015. A sensitivityanalysis of (and practitioners’ guide to) convolu-tional neural networks for sentence classification.arXiv:1510.03820.









http://www.aclweb.org/anthology/D14-1162

http://www.aclweb.org/anthology/D14-1162







https://www.cs.mcgill.ca/~jpineau/ReproducibilityChecklist.pdf



https://doi.org/10.18653/v1/D16-1264

https://doi.org/10.18653/v1/D16-1264








https://hackingsemantics.xyz/2019/leaderboards/

https://hackingsemantics.xyz/2019/leaderboards/

http://www.aclweb.org/anthology/P11-1067




https://openreview.net/references/pdf?id=HyT0zqkwG

https://openreview.net/references/pdf?id=HyT0zqkwG



https://www.aclweb.org/anthology/D13-1170/












A EMNLP 2018 Checklist Survey

Checklist item Percentage of EMNLP 2018 papers

Reports train/validation/test splits 92%

Reports best hyperparameter assignments 74%

Reports code 30%

Reports dev accuracy 24%

Reports computing infrastructure 18%

Reports empirical runtime 14%

Reports search strategy 14%

Reports score distribution 10%

Reports number of hyperparameter trials 10%

Reports hyperparameter search bounds 8%

Table 1: Presence of checklist items from §5 across 50 randomly sampled EMNLP 2018 papers that involvedmodeling experiments.

B Hyperparameter Search Spaces for Section 4.2

Computing infrastructure GeForce GTX 1080 GPU

Number of search trials 50

Search strategy uniform sampling

Best validation accuracy 40.5

Training duration 39 sec

Model implementation http://github.com/allenai/show-your-work

Hyperparameter Search space Best assignmentnumber of epochs 50 50

patience 10 10

batch size 64 64

embedding GloVe (50 dim) GloVe (50 dim)

encoder ConvNet ConvNet

max filter size uniform-integer[3, 6] 4

number of filters uniform-integer[64, 512] 332

dropout uniform-float[0, 0.5] 0.4

learning rate scheduler reduce on plateau reduce on plateau

learning rate scheduler patience 2 epochs 2 epochs

learning rate scheduler reduction factor 0.5 0.5

learning rate optimizer Adam Adam

learning rate loguniform-float[1e-6, 1e-1] 0.0008

Table 2: SST (fine-grained) CNN classifier search space and best assignments.

http://github.com/allenai/show-your-work

Computing Infrastructure 3.1 GHz Intel Core i7 CPU




Training duration 1.56 seconds


Hyperparameter Search space Best assignment

penalty choice[L1, L2] L2

no. of iter 100 100

solver liblinear liblinear

regularization uniform-float[0, 1] 0.13

n-grams choice[(1, 2), (1, 2, 3), (2, 3)] [1, 2]

stopwords choice[True, False] True

weight choice[tf, tf-idf, binary] binary

tolerance loguniform-float[10e-5, 10e-3] 0.00014

Table 3: SST (fine-grained) logistic regression search space and best assignments.


C Hyperparameter Search Spaces for Section 4.3

Computing Infrastructure GeForce GTX 1080 GPU







number of epochs 50 50

patience 10 10

batch size 64 64

gradient norm uniform-float[5, 10] 9.0

embedding dropout uniform-float[0, 0.5] 0.3

number of pre-encode feedforward layers choice[1, 2, 3] 3

number of pre-encode feedforward hidden dims uniform-integer[64, 512] 232

pre-encode feedforward activation choice[relu, tanh] tanh

pre-encode feedforward dropout uniform-float[0, 0.5] 0.0

encoder hidden size uniform-integer[64, 512] 424

number of encoder layers choice[1, 2, 3] 2

integrator hidden size uniform-integer[64, 512] 337

number of integrator layers choice[1, 2, 3] 3

integrator dropout uniform-float[0, 0.5] 0.1

number of output layers choice[1, 2, 3] 3

output hidden size uniform-integer[64, 512] 384

output dropout uniform-float[0, 0.5] 0.2

output pool sizes uniform-integer[3, 7] 6






Table 4: SST (binary) BCN GloVe search space and best assignments.










patience 10 10

batch size 64 64


freeze ELMo True True




pre-encode feedforward activation choice[relu, tanh] relu













use integrator output ELMo choice[True, False] True




Table 5: SST (binary) BCN GLoVe + ELMo (frozen) search space and best assignments.


Computing Infrastructure NVIDIA Titan Xp GPU








patience 10 10

batch size 64 64


freeze ELMo False False




pre-encode feedforward activation choice[relu, tanh] relu












learning rate loguniform-float[1e-6, 1e-1] 2.1e-5

use integrator output ELMo choice[True, False] True




Table 6: SST (binary) BCN GloVe + ELMo (fine-tuned) search space and best assignments.


D Hyperparameter Search Spaces for Section 4.4









patience 20 20

batch size 64 64

gradient clip uniform-float[5, 10] 5.28

embedding projection dim uniform-integer[64, 300] 78

number of attend feedforward layers choice[1, 2, 3] 1

attend feedforward hidden dims uniform-integer[64, 512] 336

attend feedforward activation choice[relu, tanh] tanh

attend feedforward dropout uniform-float[0, 0.5] 0.1

number of compare feedforward layers choice[1, 2, 3] 1

compare feedforward hidden dims uniform-integer[64, 512] 370

compare feedforward activation choice[relu, tanh] relu

compare feedforward dropout uniform-float[0, 0.5] 0.2

number of aggregate feedforward layers choice[1, 2, 3] 2

aggregate feedforward hidden dims uniform-integer[64, 512] 370

aggregate feedforward activation choice[relu, tanh] relu

aggregate feedforward dropout uniform-float[0, 0.5] 0.1

learning rate optimizer Adagrad Adagrad


Table 7: SciTail DAM search space and best assignments.










patience 5 5

batch size 64 64




number of projection feedforward layers choice[1, 2, 3] 2

projection feedforward hidden dims uniform-integer[64, 512] 85

projection feedforward activation choice[relu, tanh] relu

number of inference encoder layers choice[1, 2, 3] 1

number of output feedforward layers choice[1, 2, 3] 2

output feedforward hidden dims uniform-integer[64, 512] 432

output feedforward activation choice[relu, tanh] tanh

output feedforward dropout uniform-float[0, 0.5] 0.03







learning rate scheduler mode max max

Table 8: SciTail ESIM search space and best assignments.










patience 20 20

batch size 64 64


hidden size uniform-integer[64, 512] 167

activation choice[relu, tanh] tanh

number of layers choice[1, 2, 3] 3




learning rate scheduler exponential exponential

learning rate scheduler gamma 0.5 0.5

Table 9: SciTail n-gram baseline search space and best assignments.










patience 20 20

batch size 16 16

embedding projection dim uniform-integer[64, 300] 100

edge embedding size uniform-integer[64, 512] 204

premise encoder hidden size uniform-integer[64, 512] 234

number of premise encoder layers choice[1, 2, 3] 2

premise encoder is bidirectional choice[True, False] True

number of phrase probability layers choice[1, 2, 3] 2

phrase probability hidden dims uniform-integer[64, 512] 268

phrase probability dropout uniform-float[0, 0.5] 0.2

phrase probability activation choice[tanh, relu] tanh

number of edge probability layers choice[1, 2, 3] 1

edge probability dropout uniform-float[0, 0.5] 0.2

edge probability activation choice[tanh, relu] tanh




learning rate scheduler exponential exponential

learning rate scheduler gamma 0.5 0.5

Table 10: SciTail DGEM search space and best assignments.





Best validation EM 68.2





patience 10 10

batch size 16 16

token embedding GloVe (100 dim) GloVe (100 dim)



character embedding dim uniform-integer[16, 64] 43

max character filter size uniform-integer[3, 6] 3

number of character filters uniform-integer[64, 512] 33

character embedding dropout uniform-float[0, 0.5] 0.15

number of highway layers choice[1, 2, 3] 3

phrase layer hidden size uniform-integer[64, 512] 122

number of phrase layers choice[1, 2, 3] 1

phrase layer dropout uniform-float[0, 0.5] 0.46

modeling layer hidden size uniform-integer[64, 512] 423

number of modeling layers choice[1, 2, 3] 3

modeling layer dropout uniform-float[0, 0.5] 0.32

span end encoder hidden size uniform-integer[64, 512] 138

span end encoder number of layers choice[1, 2, 3] 1

span end encoder dropout uniform-float[0, 0.5] 0.03



Adam β1 uniform-float[0.9, 1.0] 0.95

Adam β2 uniform-float[0.9, 1.0] 0.93




learning rate scheduler mode max max

Table 11: SQuAD BiDAF search space and best assignments.


Date post:	09-Oct-2020
Category:	Documents
Upload:	others
View:	1 times
Download:	0 times

(Budget, [accuracy]) Show Your Work: Improved Reporting of Experimental … · 2019. 9. 9. · ing...

Documents