

Predicting Performance for Natural Language Processing Tasks

Mengzhou Xia, Antonios Anastasopoulos, Ruochen Xu, Yiming Yang, Graham Neubig
Language Technologies Institute, Carnegie Mellon University

{mengzhox,aanastas,yiming,gneubig}@cs.cmu.edu, [email protected]

Abstract

Given the complexity of combinations of tasks, languages, and domains in natural language processing (NLP) research, it is computationally prohibitive to exhaustively test newly proposed models on each possible experimental setting. In this work, we attempt to explore the possibility of gaining plausible judgments of how well an NLP model can perform under an experimental setting, without actually training or testing the model. To do so, we build regression models to predict the evaluation score of an NLP experiment given the experimental settings as input. Experimenting on 9 different NLP tasks, we find that our predictors can produce meaningful predictions over unseen languages and different modeling architectures, outperforming reasonable baselines as well as human experts. Going further, we outline how our predictor can be used to find a small subset of representative experiments that should be run in order to obtain plausible predictions for all other experimental settings.¹

1 Introduction

Natural language processing (NLP) is an extraordinarily vast field, with a wide variety of models being applied to a multitude of tasks across a plenitude of domains and languages. In order to measure progress in all these scenarios, it is necessary to compare performance on test datasets representing each scenario. However, the cross-product of tasks, languages, and domains creates an explosion of potential application scenarios, and it is infeasible to collect high-quality test sets for each. In addition, even for tasks where we do have a wide variety of test data, e.g., for well-resourced tasks such as machine translation (MT), it is still computationally prohibitive, as well as not environmentally friendly (Strubell et al., 2019), to build and test systems for all languages or domains we are interested in. Because of this, the common practice is to test new methods on a small number of languages or domains, often semi-arbitrarily chosen based on previous work or the experimenters' intuition.

¹ Code, data and logs are publicly available at https://github.com/xiamengzhou/NLPerf.

As a result, this practice impedes the NLP community from gaining a comprehensive understanding of newly-proposed models. Table 1 illustrates this fact with an example from bilingual lexicon induction, a task that aims to find word translation pairs from cross-lingual word embeddings. As vividly displayed in Table 1, almost all the works report evaluation results on a different subset of language pairs. Evaluating only on a small subset raises concerns about making inferences when comparing the merits of these methods: there is no guarantee that performance on English–Spanish (EN–ES, the only common evaluation dataset) is representative of the expected performance of the models over all other language pairs (Anastasopoulos and Neubig, 2020). Such phenomena lead us to consider whether it is possible to make a reasonably accurate estimate of the performance on an untested language pair without actually running the NLP model, thereby bypassing the computational restriction.

Toward that end, drawing on the idea of characterizing an experiment from Lin et al. (2019), we propose a framework, which we call NLPERF, to provide an exploratory solution. We build regression models to predict the performance on a particular experimental setting given past experimental records of the same task, with each record consisting of a characterization of its training dataset and a performance score of the corresponding metric. Concretely, in §2, we start with a partly populated table (such as the one in Table 1) and attempt to infer the missing values with the predictor.


                         Evaluation Set
BLI Method               DE–EN  EN–DE  ES–EN  EN–ES  FR–EN  EN–FR  IT–EN  EN–IT  EN–PT  EN–RU  ES–DE  PT–RU
Zhang et al. (2017)        ?      X      X      X      ?      ?      X      ?      ?      ?      ?      ?
Chen and Cardie (2018)     X      X      X      X      X      X      X      X      X      ?      X      ?
Yang et al. (2019)         X      X      X      X      X      X      X      ?      ?      ?      ?      ?
Heyman et al. (2019)       ?      X      ?      X      ?      X      ?      X      ?      ?      ?      ?
Huang et al. (2019)        ?      ?      X      X      X      X      ?      ?      ?      ?      ?      ?
Artetxe et al. (2019)      X      X      X      X      X      X      ?      ?      ?      X      ?      ?

Table 1: An illustration of the comparability issues across methods and multiple evaluation datasets from the Bilingual Lexicon Induction task (X: results reported on this evaluation set; ?: no results reported). Our prediction model can reasonably fill in the blanks, as illustrated in Section 4.

We begin by introducing the process of characterizing an NLP experiment for each task in §3. We evaluate the effectiveness and robustness of NLPERF by comparing to multiple baselines and human experts, and by perturbing a single feature to simulate a grid search over that feature (§4). Evaluations on multiple tasks show that NLPERF is able to outperform all baselines. Notably, on a machine translation (MT) task, the predictions made by the predictor turn out to be more accurate than those of human experts.

An effective predictor can be very useful for multiple applications associated with practical scenarios. In §5, we show how it is possible to adopt the predictor as a scoring function to find a small subset of experiments that are most representative of a bigger set of experiments. We argue that this will allow researchers to make informed decisions on what datasets to use for training and evaluation, in the case where they cannot experiment on all experimental settings. Last, in §6, we show that we can adequately predict the performance of new models even with a minimal number of experimental records.

2 Problem Formulation

In this section we formalize the problem of predicting performance on supervised NLP tasks. Given an NLP model of architecture M trained over dataset(s) D of a specific task involving language(s) L with a training procedure (optimization algorithms, learning rate scheduling, etc.) P, we can test the model on a test dataset D′ and get a score S of a specific evaluation metric. The resulting score will surely vary depending on all the above mentioned factors, and we denote this relation as g:

S_{M,P,L,D,D'} = g(M, P, L, D, D') \qquad (1)

In the ideal scenario, for each test dataset D′ of a specific task, one could enumerate all different settings and find the one that leads to the best performance. As mentioned in §1, however, such a brute-force method is computationally infeasible. Thus, we turn to modeling the process and formulating our problem as a regression task, using a parametric function f_θ to approximate the true function g as follows:

S_{M,P,L,D,D'} = f_\theta([\Phi_M; \Phi_P; \Phi_L; \Phi_D; \Phi_{D'}])

where Φ_∗ denotes a set of features for each influencing factor.

For the purpose of this study, we mainly focus on the dataset and language features Φ_L and Φ_D, as this already results in a significant search space, and gathering extensive experimental results with fine-grained tuning over model and training hyperparameters is both expensive and relatively complicated. In the cases where we handle multiple models, we only use a single categorical model feature, denoted Φ_C, to represent the combination of model architecture and training procedure. We still use the term model to refer to this combination in the rest of the paper. We also omit the test set features, under the assumption that the data distributions for training and testing data are the same (a fairly reasonable assumption if we ignore possible domain shift). Therefore, for all experiments below, our final prediction function is the following:

S_{C,L,D} = f_\theta([\Phi_C; \Phi_L; \Phi_D])
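To make the formulation concrete, the following is a minimal sketch (not the authors' released code; all field names are illustrative assumptions) of how a single experimental record could be flattened into the feature vector [Φ_C; Φ_L; Φ_D] that f_θ consumes.

```python
# Illustrative sketch: flatten one experimental record into [Phi_C; Phi_L; Phi_D].
# The concrete feature names below are assumptions for demonstration only.
from typing import Dict, List

def featurize_record(model_id: int,
                     lang_feats: Dict[str, float],
                     data_feats: Dict[str, float]) -> List[float]:
    phi_c = [float(model_id)]                            # Phi_C: categorical model feature
    phi_l = [lang_feats[k] for k in sorted(lang_feats)]  # Phi_L: e.g. URIEL distance features
    phi_d = [data_feats[k] for k in sorted(data_feats)]  # Phi_D: e.g. dataset size, TTR, overlap
    return phi_c + phi_l + phi_d

# One hypothetical record of a transfer experiment:
x = featurize_record(
    model_id=0,
    lang_feats={"genetic": 0.61, "geographic": 0.35, "syntactic": 0.48},
    data_feats={"train_size": 61000.0, "ttr_src": 0.12, "word_overlap": 0.04},
)
```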

In the next section we describe concrete instantiations of this function for several NLP tasks.

3 NLP Task Instantiations

To build a predictor for NLP task performance, we must 1) select a task, 2) describe its featurization, and 3) train a predictor. We describe the details of these three steps in this section.


Task         Dataset Citation        Source Langs  Target Langs  Transfer Langs  # Models  # EXs   Task Metric
Wiki-MT      Schwenk et al. (2019)   39            39            –               single    995     BLEU
TED-MT       Qi et al. (2018)        54            1             –               single    54      BLEU
TSF-MT       Qi et al. (2018)        54            1             54              single    2862    BLEU
TSF-PARSING  Nivre et al. (2018)     –             30            30              single    870     Accuracy
TSF-POS      Nivre et al. (2018)     –             26            60              single    1531    Accuracy
TSF-EL       Rijhwani et al. (2019)  –             9             54              single    477     Accuracy
BLI          Lample et al. (2018)    44            44            –               3         88×3    Accuracy
MA           McCarthy et al. (2019)  –             66            –               6         107×6   F1
UD           Zeman et al. (2018a)    –             53            –               25        72×25   F1

Table 2: Statistics of the datasets we use for training predictors. # EXs denotes the total number of experiment instances; Task Metric reflects how the models are evaluated.

Tasks  We test on tasks including bilingual lexicon induction (BLI); machine translation trained on aligned Wikipedia data (Wiki-MT), on TED talks (TED-MT), and with cross-lingual transfer for translation into English (TSF-MT); cross-lingual dependency parsing (TSF-PARSING); cross-lingual POS tagging (TSF-POS); cross-lingual entity linking (TSF-EL); morphological analysis (MA); and universal dependency parsing (UD). Basic statistics on the datasets are outlined in Table 2.

For Wiki-MT tasks, we collect experimental records directly from the paper describing the corresponding datasets (Schwenk et al., 2019). For TED-MT and all the transfer tasks, we use the results of Lin et al. (2019). For BLI, we conduct experiments using published results from three papers, namely Artetxe et al. (2016), Artetxe et al. (2017) and Xu et al. (2018). For MA, we use the results of the SIGMORPHON 2019 shared task 2 (McCarthy et al., 2019). Last, the UD results are taken from the CoNLL 2018 Shared Task on universal dependency parsing (Zeman et al., 2018b).

Featurization  For language features, we utilize six distance features from the URIEL Typological Database (Littell et al., 2017), namely geographic, genetic, inventory, syntactic, phonological, and featural distance.

The complete set of dataset features includes the following (a code sketch for several of these features follows the list):

1. Dataset Size: The number of data entries used for training.

2. Word/Subword Vocabulary Size: The number of word/subword types.

3. Average Sentence Length: The average length of sentences from all experiments.

4. Word/Subword Overlap:

   |T_1 ∩ T_2| / (|T_1| + |T_2|)

   where T_1 and T_2 denote the vocabularies of any two corpora.

5. Type-Token Ratio (TTR): The ratio between the number of types and the number of tokens (Richards, 1987) of one corpus.

6. Type-Token Ratio Distance:

   (1 − TTR_1 / TTR_2)²

   where TTR_1 and TTR_2 denote the TTR of any two corpora.

7. Single Tag Type: Number of single tag types.

8. Fused Tag Type: Number of fused tag types.

9. Average Tag Length Per Word: Average number of single tags for each word.

10. Dependency Arcs Matching WALS Features: The proportion of dependency parsing arcs matching the following WALS features, computed over the training set: subject/object/oblique before/after verb and adjective/numeral before/after noun.
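As referenced above, here is a small sketch of how the corpus-level features 3–6 could be computed from tokenized text; it follows the definitions in the list rather than the exact released NLPerf feature extraction.

```python
# Sketch of corpus-level dataset features (definitions 3-6 above); not the released code.
from collections import Counter
from typing import List

Corpus = List[List[str]]  # a corpus represented as a list of tokenized sentences

def avg_sentence_length(corpus: Corpus) -> float:
    """Feature 3: average number of tokens per sentence."""
    return sum(len(s) for s in corpus) / len(corpus)

def word_overlap(corpus_a: Corpus, corpus_b: Corpus) -> float:
    """Feature 4: |T1 ∩ T2| / (|T1| + |T2|) over the two vocabularies."""
    t1 = {tok for s in corpus_a for tok in s}
    t2 = {tok for s in corpus_b for tok in s}
    return len(t1 & t2) / (len(t1) + len(t2))

def type_token_ratio(corpus: Corpus) -> float:
    """Feature 5: number of types divided by number of tokens."""
    counts = Counter(tok for s in corpus for tok in s)
    return len(counts) / sum(counts.values())

def ttr_distance(corpus_a: Corpus, corpus_b: Corpus) -> float:
    """Feature 6: (1 - TTR1 / TTR2) ** 2."""
    return (1.0 - type_token_ratio(corpus_a) / type_token_ratio(corpus_b)) ** 2
```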

For transfer tasks, we use the same set of dataset features Φ_D as Lin et al. (2019), including features 1–6 on the source and the transfer language side. We also include language distance features between the source and transfer language, as well as between the source and target language. For MT tasks, we use features 1–6 and language distance features, but only between the source and target language. For MA, we use features 1, 2, 5 and the morphological tag related features 7–9. For UD, we use features 1, 2, 5, and 10. For BLI, we use language distance features and URIEL syntactic features for the source and the target language.

Predictor  Our prediction model is based on gradient boosting trees (Friedman, 2001), implemented with XGBoost (Chen and Guestrin, 2016). This method is widely known as an effective means of solving problems including ranking, classification and regression. We also experimented with Gaussian processes (Williams and Rasmussen, 1996), but settled on gradient boosted trees because performance was similar and XGBoost's implementation is very efficient through the use of parallelism. We use squared error as the objective function for the regression and adopt a fixed learning rate of 0.1. To allow the model to fully fit the data, we set the maximum tree depth to 10 and the number of trees to 100, and use the default regularization terms to prevent the model from overfitting.
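A sketch of a predictor configured as described above (squared error objective, learning rate 0.1, maximum depth 10, 100 trees, default regularization), using XGBoost's scikit-learn-style wrapper; this mirrors the description in the text, not necessarily the exact released configuration.

```python
# Gradient boosted tree regressor with the hyperparameters described in the text.
import numpy as np
from xgboost import XGBRegressor

def make_predictor() -> XGBRegressor:
    return XGBRegressor(
        objective="reg:squarederror",  # squared-error regression objective
        learning_rate=0.1,
        max_depth=10,
        n_estimators=100,              # number of trees
    )

# Toy usage: rows are featurized experiments, y holds the corresponding metric scores.
X = np.random.rand(200, 12)
y = np.random.rand(200) * 40.0
model = make_predictor().fit(X, y)
predictions = model.predict(X[:5])
```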

4 Can We Predict NLP Performance?

In this section we investigate the effectiveness of NLPERF across different tasks on various metrics. Following Lin et al. (2019), we conduct k-fold cross validation for evaluation. To be specific, we randomly partition the experimental records of 〈L, D, C, S〉 tuples into k folds, and use k−1 folds to train a prediction model and evaluate on the remaining fold. Note that this scenario is similar to "filling in the blanks" in Table 1, where we have some experimental records that we can train the model on, and predict the remaining ones.

For evaluation, we calculate the average root mean square error (RMSE) between the predicted scores and the true scores.
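The evaluation protocol above can be sketched as follows, assuming featurized records X with scores y and the make_predictor helper sketched earlier; the released code may differ in details.

```python
# Sketch of k-fold evaluation: train on k-1 folds of records, report the average RMSE.
import numpy as np
from sklearn.model_selection import KFold

def kfold_rmse(model_factory, X: np.ndarray, y: np.ndarray, k: int = 5) -> float:
    fold_rmses = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True).split(X):
        model = model_factory()
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        fold_rmses.append(np.sqrt(np.mean((pred - y[test_idx]) ** 2)))
    return float(np.mean(fold_rmses))

# e.g. kfold_rmse(make_predictor, X, y, k=5), itself averaged over several random runs.
```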

Baselines  We compare against a simple mean value baseline, as well as against language-wise mean value and model-wise mean value baselines. The simple mean value baseline outputs an average of the scores s from the training folds for all test entries in the left-out fold (i) as follows:

s^{(i)}_{\text{mean}} = \frac{1}{|S \setminus S^{(i)}|} \sum_{s \in S \setminus S^{(i)}} s, \quad i \in 1 \ldots k \qquad (2)

Note that for tasks involving multiple models, we calculate the RMSE score separately for each model and use the mean RMSE of all models as the final RMSE score.

The language-wise baselines make more informed predictions, taking into account only training instances with the same transfer, source, or target language (depending on the task setting). For example, the source-language mean value baseline s^{(i,j)}_{s-lang} for the jth test instance in fold i outputs an average of the scores s of the training instances that share the same source language feature s-lang, as shown in Equation 3:

s^{(i,j)}_{\text{s-lang}} = \frac{\sum_{s,\phi} \delta(\phi_{L,\text{src}} = \text{s-lang}) \cdot s}{\sum_{s,\phi} \delta(\phi_{L,\text{src}} = \text{s-lang})}, \quad \forall (s, \phi) \in (S \setminus S^{(i)}, \Phi \setminus \Phi^{(i)}) \qquad (3)

where δ is the indicator function. Similarly, we define the target- and the transfer-language mean value baselines.
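A small sketch of the mean value baseline (Equation 2) and the source-language-wise baseline (Equation 3) over a list of experimental records; the record fields and toy scores are illustrative assumptions, not the paper's data.

```python
# Sketch of the mean and source-language-wise baselines (Eqs. 2 and 3).
from statistics import mean
from typing import Dict, List

def mean_baseline(train: List[Dict]) -> float:
    """Predict the average score of all training-fold records (Eq. 2)."""
    return mean(r["score"] for r in train)

def source_lang_baseline(train: List[Dict], src_lang: str) -> float:
    """Average the scores of training records sharing the source language (Eq. 3),
    falling back to the global mean if that language never occurs in training."""
    scores = [r["score"] for r in train if r["src_lang"] == src_lang]
    return mean(scores) if scores else mean_baseline(train)

# Toy records (scores are illustrative):
train_records = [
    {"src_lang": "tur", "tgt_lang": "eng", "score": 22.5},
    {"src_lang": "por", "tgt_lang": "eng", "score": 41.7},
    {"src_lang": "tur", "tgt_lang": "aze", "score": 9.3},
]
print(source_lang_baseline(train_records, "tur"))  # -> 15.9
```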

In a similar manner, we also compare against a model-wise mean value baseline for tasks that include experimental records from multiple models. Now, the prediction for the jth test instance in the left-out fold i is an average of the scores on the same dataset (as characterized by the language φ_L and dataset φ_D features) from all other models:

s^{(i,j)}_{\text{model}} = \frac{\sum_{s,\phi} \delta(\phi_L = \text{lang}, \phi_D = \text{data}) \cdot s}{\sum_{s,\phi} \delta(\phi_L = \text{lang}, \phi_D = \text{data})}, \quad \forall (s, \phi) \in (S \setminus S^{(i)}, \Phi \setminus \Phi^{(i)}) \qquad (4)

where lang = Φ^{(i,j)}_L and data = Φ^{(i,j)}_D respectively denote the language and dataset features of the test instance.

Main Results  For multi-model tasks, we can do either Single Model (SM) prediction, restricting training and testing of the predictor to a single model, or Multi-Model (MM) prediction using a categorical model feature. The RMSE scores of NLPERF along with the baselines are shown in Table 3. For all tasks, our single model predictor is able to more accurately estimate the evaluation score of unseen experiments compared to the single model baselines, confirming our hypothesis that there exists a correlation that can be captured between experimental settings and the downstream performance of NLP systems. The language-wise baselines are much stronger than the simple mean value baseline but still perform worse than our single model predictor. Similarly, the model-wise baseline significantly outperforms the mean value baseline because results from other models reveal much information about the dataset.


Model               Wiki-MT  TED-MT  TSF-MT  TSF-PARSING  TSF-POS  TSF-EL  BLI    MA    UD
Mean                6.40     12.65   10.77   17.58        29.10    18.65   20.10  9.47  17.69
Transfer Lang-wise  –        –       10.96   15.68        29.98    20.55   –      –     –
Source Lang-wise    5.69     12.65   2.24    –            –        –       20.13  –     –
Target Lang-wise    5.12     12.65   10.78   12.05        8.92     8.61    20.00  9.47  –
NLPERF (SM)         2.50     6.18    1.43    6.24         7.37     7.82    12.63  6.48  12.06
Model-wise          –        –       –       –            –        –       8.77   5.22  4.96
NLPERF (MM)         –        –       –       –            –        –       6.87   3.18  3.54

Table 3: RMSE scores of three baselines and our predictions under the single model and multi model settings (missing values correspond to settings not applicable to the task). All results are from k-fold (k = 5) evaluations averaged over 10 random runs.

Even so, our multi-model predictor still outperforms the model-wise baseline.

The results nicely imply that for a wide range of tasks, our predictor is able to reasonably estimate left-out slots in a partly populated table given the results of other experimental records, without actually running the system.

We should note that RMSE scores across different tasks should not be directly compared, mainly because the scale of each evaluation metric is different. For example, a BLEU score (Papineni et al., 2002) for MT experiments typically ranges from 1 to 40, while an accuracy score usually has a much larger range; for example, BLI accuracy ranges from 0.333 to 78.2 and TSF-POS accuracy ranges from 1.84 to 87.98, which consequently makes the RMSE scores of these tasks higher.

Comparison to Expert Human Performance  We constructed a small scale case study to evaluate whether NLPERF is competitive with the performance of NLP sub-field experts. We focused on the TED-MT task and recruited 10 MT practitioners,² all of whom had published at least 3 MT-related papers in ACL-related conferences.

In the first set of questions, the participants were presented with language pairs from one of the k data folds along with the dataset features and were asked to estimate an eventual BLEU score for each data entry. In the second part of the questionnaire, the participants were tasked with making estimations on the same set of language pairs, but this time they also had access to features and BLEU scores from all the other folds.³

² None of the study participants were affiliated with the authors' institutions, nor were they familiar with this paper's content.

³ The interested reader can find an example questionnaire (and make estimations over one of the folds) in Appendix A.

Predictor                    RMSE
Mean Baseline                12.64
Human (w/o training data)     9.38
Human (w/ training data)      7.29
NLPERF                        6.04

Table 4: Our model performs better than human MT experts on the TED-MT prediction task.

The partition of the folds is consistent between the human study and the training/evaluation for the predictor. While the first sheet is intended to familiarize the participants with the task, the second sheet mirrors the training/evaluation setting of our predictor. As shown in Table 4, our participants outperform the mean baseline even without information from other folds, demonstrating their own strong prior knowledge of the field. In addition, the participants make more accurate guesses after acquiring more information on experimental records in other folds. In neither case, though, are the human experts competitive with our predictor. In fact, only one of the participants achieved performance comparable to our predictor.

Feature Perturbation  Another question of interest concerning performance prediction is "how will the model perform when trained on data of a different size?" (Kolachina et al., 2012a). To test NLPERF's extrapolation ability in this regard, we conduct an array of experiments with various data sizes on the Wiki-MT task. We pick two language pairs, Turkish to English (TR–EN) and Portuguese to English (PT–EN), as our testbed.


Figure 1: Our model's predicted BLEU scores and true BLEU scores on sampled TR–EN datasets (sizes 10k/50k/100k/200k/478k) and PT–EN datasets (sizes 100k/500k/1000k/2000k/2462k), achieving RMSE scores of 1.83 and 9.97 respectively.

We sample parallel datasets of different sizes and train MT models on each sampled dataset to obtain the true BLEU scores. On the other hand, we collect the features of all sampled datasets and use our predictor (trained over all other language pairs) to obtain predictions. The true and predicted BLEU scores are shown in Figure 1. Our predictor achieves a very low average RMSE of 1.83 for the TR–EN pair but a relatively higher RMSE of 9.97 for the PT–EN pair. The favorable performance on the TR–EN pair demonstrates that our predictor can extrapolate over the dataset size feature. In contrast, the predictions on the PT–EN pair are significantly less accurate. This is due to the fact that there are only two other experimental settings scoring as high as 34 BLEU, with data sizes of 3378k (EN–ES) and 611k (GL–ES), leading to the predictor's inadequacy in predicting high BLEU scores for lower-resource datasets during extrapolation. This reveals that while the predictor is able to extrapolate performance on settings similar to what it has seen in the data, NLPERF may be less successful under circumstances unlike its training inputs.
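The data-size perturbation above can be approximated with a sketch like the following, which re-queries a trained predictor while varying a single size feature (reusing the illustrative featurize_record helper from §2). Note that in the actual experiment, all dataset features are recomputed on each sampled dataset; varying only the size feature here is a simplification for brevity.

```python
# Sketch: predicted BLEU as a function of (sub)sampled training-data size.
import numpy as np

sizes = [10_000, 50_000, 100_000, 200_000, 478_000]  # sampled TR-EN sizes from Figure 1

def predicted_curve(model, lang_feats, base_data_feats, sizes):
    preds = []
    for n in sizes:
        data_feats = dict(base_data_feats, train_size=float(n))  # perturb one feature
        x = featurize_record(0, lang_feats, data_feats)
        preds.append(float(model.predict(np.array([x]))[0]))
    return preds
```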

5 What Datasets Should We Test On?

As shown in Table 1, it is common practice to test models on a subset of all available datasets. The reason for this is practical: it is computationally prohibitive to evaluate on all settings. However, if we pick test sets that are not representative of the data as a whole, we may mistakenly reach unfounded conclusions about how well models perform on other data with distinct properties. For example, models trained on a small-sized dataset may not scale well to a large-sized one, or models that perform well on languages with a particular linguistic characteristic may not do well on languages with other characteristics (Bender and Friedman, 2018).

Here we ask the following question: if we are only practically able to test on a small number of experimental settings, which ones should we test on to achieve maximally representative results? Answering this question could have practical implications: organizers of large shared tasks like SIGMORPHON (McCarthy et al., 2019) or UD (Zeman et al., 2018a) could create a minimal subset of settings upon which they would ask participants to test to get representative results; similarly, participants could possibly expedite the iteration of model development by testing on the representative subset only. A similar avenue for researchers and companies deploying systems over multiple languages could lead not only to financial savings, but potentially to a significant cut-down of emissions from model training (Strubell et al., 2019).

We present an approximate explorative solution to the problem mentioned above. Formally, assume that we have a set N comprising experimental records (both features and scores) of n datasets for one task. We set a number m (< n) of datasets that we would like to select as the representative subset. Defining RMSE_A(B) to be the RMSE score obtained when a predictor trained on one subset of experimental records A is evaluated on another subset B, we consider the most representative subset D to be the one that minimizes the RMSE score when predicting all of the other datasets:

\arg\min_{D \subset N} \mathrm{RMSE}_{D}(N \setminus D) \qquad (5)

Naturally, enumerating all \binom{n}{m} possible subsets would be prohibitively costly, even though it would lead to the optimal solution. Instead, we employ a beam-search-like approach to efficiently search for an approximate solution to the best performing subset of arbitrary size. Concretely, we start our approximate search with an exhaustive enumeration of all subsets of size 2. At each following step t, we only take the best k subsets {D_t^{(i)}; i ∈ 1, …, k} into account and discard the rest.


Figure 2: Beam search results (beam size = 100) for up to the 5 most (and least) representative datasets for 4 NLP tasks (BLI, MA, TED-MT, Wiki-MT; y-axis: RMSE over all remaining datasets). We also show random search results averaged over 100 random runs.

As shown in Equation 6, for each candidate subset, we expand it with one more data point:

\{ D_t^{(i)} \cup \{s\} ; \; \forall i \in 1 \ldots k, \; s \in N \setminus D_t^{(i)} \} \qquad (6)

For tasks that involve multiple models, we take experimental records of the selected dataset from all models into account during expansion. Given all expanded subsets, we train a predictor for each to evaluate on the rest of the datasets, and keep the best performing k subsets {D_{t+1}^{(i)}; i ∈ 1, …, k} with minimum RMSE scores for the next step. Furthermore, note that by simply changing the arg min to an arg max in Equation 5, we can also find the least representative datasets.
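A compact sketch of this search procedure, under the simplifying assumption of one experimental record per dataset (the paper pools the records of all models for a selected dataset); subset_rmse implements RMSE_D(N \ D) from Equation 5, and model_factory is e.g. the make_predictor helper sketched earlier.

```python
# Beam-search-style selection of the most representative datasets (Eqs. 5-6).
import itertools
import numpy as np

def subset_rmse(subset, all_ids, X, y, model_factory):
    """RMSE_D(N \\ D): train on records in `subset`, evaluate on the remaining datasets."""
    rest = [i for i in all_ids if i not in subset]
    model = model_factory().fit(X[list(subset)], y[list(subset)])
    pred = model.predict(X[rest])
    return float(np.sqrt(np.mean((pred - y[rest]) ** 2)))

def beam_search_subsets(X, y, model_factory, max_size=5, beam=100):
    all_ids = list(range(len(X)))
    cands = [frozenset(p) for p in itertools.combinations(all_ids, 2)]  # exhaustive size-2 start
    best_per_size = []
    for _ in range(2, max_size + 1):
        scores = {d: subset_rmse(d, all_ids, X, y, model_factory) for d in cands}
        keep = sorted(scores, key=scores.get)[:beam]        # keep the `beam` best subsets
        best_per_size.append((sorted(keep[0]), scores[keep[0]]))
        cands = {d | {s} for d in keep for s in all_ids if s not in d}  # expand by one dataset
    return best_per_size
```

Flipping the sort order (arg max instead of arg min) yields the least representative subsets instead.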

We present search results for four tasks⁴ as beam search progresses in Figure 2, with the corresponding RMSE scores over all remaining datasets on the y-axis. For comparison, we also conduct random searches by expanding the subset with a randomly selected experimental record. In all cases, the most representative sets are an aggregation of datasets with diverse characteristics such as languages and dataset sizes. For example, in the Wiki-MT task, the 5 most representative datasets include languages that fall into a diverse range of language families such as Romance, Turkic, and Slavic, while the least representative ones include duplicate pairs (opposite directions) mostly involving English. The phenomenon is more pronounced in the TED-MT task, where not only are the 5 most representative source languages diverse, but so are the dataset sizes. Specifically, Malay–English (msa-eng) is a tiny dataset (5k parallel sentences), while Hebrew–English (heb-eng) is a high-resource case (212k parallel sentences).

⁴ Readers can find results on other tasks in Appendix B.

Notably, for the BLI task, to test how representative the commonly used datasets are, we select the 5 most frequent language pairs shown in Table 1, namely EN–DE, ES–EN, EN–ES, FR–EN, and EN–FR, for evaluation. Unsurprisingly, we get an RMSE score as high as 43.44, quite close to the performance of the least representative set found using beam search. This finding indicates that the standard practice of choosing datasets for evaluation is likely unrepresentative of results over the full dataset spectrum, well aligned with the claims of Anastasopoulos and Neubig (2020).

A particularly encouraging observation is that the predictor trained with only the 5 most representative datasets can achieve an RMSE score comparable to k-fold validation, which required using all of the datasets for training.⁵ This indicates that one would only need to train NLP models on a small set of representative datasets to obtain reasonably plausible predictions for the rest.

⁵ To be accurate, k − 1 folds of all datasets.


6 Can We Extrapolate Performance for New Models?

In another common scenario, researchers propose new models for an existing task. It is both time-consuming and computationally intensive to run experiments with all settings for a new model. In this section, we explore whether we can use past experimental records from other models and a minimal set of experiments from the new model to give a plausible prediction over the rest of the datasets, potentially reducing the time and resources needed for experimenting with the new model to a large extent. We use the task of UD parsing as our testbed,⁶ as it is the task with the most unique models (25 to be exact). Note that we still only use a single categorical feature for the model type.

To investigate how many experiments are needed to obtain a plausible prediction for a new model, we first split the experimental records equally into a sample set and a test set. Then we randomly sample n (0 ≤ n ≤ 5) experimental records from the sample set and add them to the collection of experimental records of past models. Each time, we re-train a predictor and evaluate it on the test set. The random split is repeated 50 times and the random sampling 50 times, adding up to a total of 2500 experiments. We use the mean value of the results from other models, shown in Equation 7, as the prediction baseline for the left-out model; because experiment results of other models reveal significant information about the dataset, this serves as a relatively strong baseline:

s_k = \frac{1}{n-1} \sum_{i=1}^{n} \mathbb{1}(i \in M \setminus \{k\}) \cdot s_i \qquad (7)

where M denotes the collection of models and k denotes the left-out model.
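A hedged sketch of this protocol (assumptions: featurized records include the categorical model feature, and model_factory is e.g. the earlier make_predictor helper): hold out one model, split its records into a sample pool and a test half, move n randomly sampled records into the training collection of past models, retrain, and score the held-out model's test half.

```python
# Sketch of the Section 6 protocol: extrapolating to a new model from n of its records.
import numpy as np

def extrapolate_new_model(X_past, y_past, X_new, y_new, model_factory, n, rng):
    order = rng.permutation(len(X_new))
    pool, test = order[: len(order) // 2], order[len(order) // 2 :]
    sampled = rng.choice(pool, size=n, replace=False) if n > 0 else np.array([], dtype=int)
    X_train = np.vstack([X_past, X_new[sampled]]) if n > 0 else X_past
    y_train = np.concatenate([y_past, y_new[sampled]]) if n > 0 else y_past
    model = model_factory().fit(X_train, y_train)
    pred = model.predict(X_new[test])
    return float(np.sqrt(np.mean((pred - y_new[test]) ** 2)))

# Repeated over many random splits and samples (50 x 50 in the paper) for n = 0..5.
rng = np.random.default_rng(0)
```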

We show the prediction performance (in RMSE) over 8 systems⁷ in Figure 3. Interestingly, the predictor trained with no new model records (0) outperforms the mean value baseline for the 4 best systems, while the opposite is the case for the 4 worst systems. Since there is no information provided about the newly introduced model, the predictions are solely based on dataset and language features. One reason might explain this phenomenon: the correlation between the features and the scores of the worse-performing systems is different from that of the better-performing systems, so the predictor is unable to generalize well (e.g., ONLP).

⁶ MA and BLI task results are in Appendix C.

⁷ The best and worst 4 systems from the shared task.

In the following discussion, we use RMSE@n to denote the RMSE from the predictor trained with n data points from a new model. The relatively low RMSE@0 scores indicate that other models' features and scores are informative for predicting the performance of the new model even without any new model information. Comparing RMSE@0 and RMSE@1, we observe a consistent improvement for almost all systems, indicating that NLPERF trained on even a single extra random example achieves more accurate estimates over the test sets. Adding more data points consistently leads to additional gains. However, predictions on worse-performing systems benefit more from this than those on better-performing systems, indicating that their feature–performance correlation might be considerably different. The findings here indicate that by extrapolating from past experiments, one can make plausible judgments for newly developed models.

7 Related Work

As discussed in Domhan et al. (2015), there are two main threads of work focusing on predicting the performance of machine learning algorithms. The first thread is to predict the performance of a method as a function of its training time, while the second thread is to predict a method's performance as a function of the training dataset size. Our work belongs in the second thread, but could easily be extended to encompass training time/procedure.

In the first thread, Kolachina et al. (2012b) attempt to infer learning curves based on training data features and extrapolate the initial learning curves based on BLEU measurements for statistical machine translation (SMT). By extrapolating the performance of initial learning curves, the predictions on the remainder allow for early termination of a bad run (Domhan et al., 2015).

In the second thread, Birch et al. (2008) adopt linear regression to capture the relationship between data features and SMT performance and find that the amount of reordering, the morphological complexity of the target language, and the relatedness of the two languages explain the majority of performance variability. More recently, Elsahar and Gallé (2019) use domain shift metrics such as H-divergence based metrics to predict the drop in performance under domain shift.


Figure 3: RMSE scores for the UD task from the dataset-wise mean value predictor (the dashed black line in each graph) and predictors trained with experimental records of other models and 0–5 records from a new model. Systems shown: HIT-SCIR (78.86), UDPipe (76.07), LATTICE (76.07), ICS (75.98), Phoenix (68.17), BOUN (66.69), CUNI (66.6), ONLP (61.92).

Rosenfeld et al. (2020) explore the functional form of the dependency of the generalization error of neural models on model and data size. We view our work as a generalization of such approaches, appropriate for application to any NLP task.

8 Conclusion and Future Work

In this work, we investigate whether the experimental setting itself is informative for predicting the evaluation scores of NLP tasks. Our findings promisingly show that, given a sufficient number of past training experimental records, our predictor can 1) outperform human experts; 2) make plausible predictions even for newly introduced models and languages; 3) extrapolate well on features like dataset size; and 4) provide a guide on how we should choose representative datasets for fast iteration.

While this discovery is a promising start, there are still several avenues for improvement in future work.

First, the dataset and language settings covered in our study are still limited. The experimental records we use are from relatively homogeneous settings, e.g., all datasets in the Wiki-MT task are sentence-pieced to have 5000 subwords, indicating that our predictor may fail for other subword settings. Our model also failed to generalize to cases where feature values are out of the range of the training experimental records. We attempted to apply the Wiki-MT predictor to a low-resource MT dataset, translating from Mapudungun (arn) to Spanish (spa) with the dataset from Duan et al. (2019), but ended up with a poor RMSE score. It turned out that the average sentence length of the arn–spa dataset is much lower than that of the training datasets, and our predictors fail to generalize to this different setting.

Second, using a categorical feature to denote model types constrains its expressive power for modeling performance. In reality, a slight change in model hyperparameters (Hoos and Leyton-Brown, 2014; Probst et al., 2019), optimization algorithms (Kingma and Ba, 2014), or even random seeds (Madhyastha and Jain, 2019) may give rise to a significant variation in performance, which our predictor is not able to capture. While investigating the systematic implications of model structures or hyperparameters is practically infeasible in this study, we may use additional information such as textual model descriptions for modeling NLP models and training procedures more elaborately in the future.

Lastly, we assume that the distribution of training and testing data is the same, which does not consider domain shift. On top of this, there might also be a domain shift between the datasets of training and testing experimental records. We believe that modeling domain shift is a promising future direction to improve performance prediction.

Acknowledgement

The authors sincerely thank all the reviewers for their insightful comments and suggestions; Philipp Koehn, Kevin Duh, Matt Post, Shuoyang Ding, Xuan Zhang, Adi Renduchintala, Paul McNamee, Toan Nguyen and Kenton Murray for conducting the human evaluation for the TED-MT task; Daniel Beck for discussions on Gaussian Processes; and Shruti Rijhwani, Xinyi Wang, and Paul Michel for discussions on this paper. This work is generously supported by the National Science Foundation under grant 1761548.


References

Antonios Anastasopoulos and Graham Neubig. 2020. Should all cross-lingual embeddings speak English? In Proc. ACL. To appear.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2289–2294.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 451–462, Vancouver, Canada. Association for Computational Linguistics.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2019. Bilingual lexicon induction through unsupervised machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5002–5007, Florence, Italy. Association for Computational Linguistics.

Emily M. Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604.

Alexandra Birch, Miles Osborne, and Philipp Koehn. 2008. Predicting success in machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 745–754. Association for Computational Linguistics.

Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM.

Xilun Chen and Claire Cardie. 2018. Unsupervised multilingual word embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 261–270, Brussels, Belgium. Association for Computational Linguistics.

Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. 2015. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Twenty-Fourth International Joint Conference on Artificial Intelligence.

Mingjun Duan, Carlos Fasola, Sai Krishna Rallabandi, Rodolfo M. Vega, Antonios Anastasopoulos, Lori Levin, and Alan W. Black. 2019. A resource for computational experiments on Mapudungun. In Proc. LREC. To appear.

Hady Elsahar and Matthias Gallé. 2019. To annotate or not? Predicting performance drop under domain shift. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2163–2173.

Jerome H. Friedman. 2001. Greedy function approximation: A gradient boosting machine. Annals of Statistics, pages 1189–1232.

Geert Heyman, Bregt Verreet, Ivan Vulić, and Marie-Francine Moens. 2019. Learning unsupervised multilingual word embeddings with incremental multilingual hubs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1890–1902.

Holger Hoos and Kevin Leyton-Brown. 2014. An efficient approach for assessing hyperparameter importance. In International Conference on Machine Learning, pages 754–762.

Jiaji Huang, Qiang Qiu, and Kenneth Church. 2019. Hubless nearest neighbor search for bilingual lexicon induction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4072–4080, Florence, Italy. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Prasanth Kolachina, Nicola Cancedda, Marc Dymetman, and Sriram Venkatapathy. 2012a. Prediction of learning curves in machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 22–30, Jeju Island, Korea. Association for Computational Linguistics.

Prasanth Kolachina, Nicola Cancedda, Marc Dymetman, and Sriram Venkatapathy. 2012b. Prediction of learning curves in machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 22–30. Association for Computational Linguistics.

Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In International Conference on Learning Representations.

Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios Anastasopoulos, Patrick Littell, and Graham Neubig. 2019. Choosing transfer languages for cross-lingual learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3125–3135, Florence, Italy. Association for Computational Linguistics.

Patrick Littell, David R. Mortensen, Ke Lin, Katherine Kairis, Carlisle Turner, and Lori Levin. 2017. URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 8–14.

Pranava Madhyastha and Rishabh Jain. 2019. On model stability as a function of random seed. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 929–939, Hong Kong, China. Association for Computational Linguistics.

Arya D. McCarthy, Ekaterina Vylomova, Shijie Wu, Chaitanya Malaviya, Lawrence Wolf-Sonkin, Garrett Nicolai, Christo Kirov, Miikka Silfverberg, Sebastian J. Mielke, Jeffrey Heinz, Ryan Cotterell, and Mans Hulden. 2019. The SIGMORPHON 2019 shared task: Morphological analysis in context and cross-lingual transfer for inflection. In Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 229–244, Florence, Italy. Association for Computational Linguistics.

Joakim Nivre, Mitchell Abrams, Željko Agić, et al. 2018. Universal Dependencies 2.2. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Philipp Probst, Anne-Laure Boulesteix, and Bernd Bischl. 2019. Tunability: Importance of hyperparameters of machine learning algorithms. Journal of Machine Learning Research, 20(53):1–32.

Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), New Orleans, USA.

Brian Richards. 1987. Type/token ratios: What do they really tell us? Journal of Child Language, 14(2):201–209.

Shruti Rijhwani, Jiateng Xie, Graham Neubig, and Jaime Carbonell. 2019. Zero-shot neural transfer for cross-lingual entity linking. In Thirty-Third AAAI Conference on Artificial Intelligence (AAAI), Honolulu, Hawaii.

Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. 2020. A constructive prediction of the generalization error across scales. In International Conference on Learning Representations.

Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong, and Francisco Guzmán. 2019. WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia. arXiv preprint arXiv:1907.05791.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, Florence, Italy. Association for Computational Linguistics.

Christopher K. I. Williams and Carl Edward Rasmussen. 1996. Gaussian processes for regression. In Advances in Neural Information Processing Systems, pages 514–520.

Ruochen Xu, Yiming Yang, Naoki Otani, and Yuexin Wu. 2018. Unsupervised cross-lingual transfer of word embedding spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2465–2474.

Pengcheng Yang, Fuli Luo, Peng Chen, Tianyu Liu, and Xu Sun. 2019. MAAM: A morphology-aware alignment model for unsupervised bilingual lexicon induction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3190–3196.

Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018a. CoNLL 2018 shared task: Multilingual parsing from raw text to Universal Dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–21, Brussels, Belgium. Association for Computational Linguistics.

Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018b. CoNLL 2018 shared task: Multilingual parsing from raw text to Universal Dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–21.

Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. 2017. Earth mover's distance minimization for unsupervised bilingual lexicon induction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1934–1945.


Appendix

A Questionnaire

An example of the first questionnaire from our user case study is shown below. The second sheet also included the results on 44 more language pairs. We provide an answer key after the second sheet.

Please provide your prediction of the BLEU score based on the language pair and dataset features (the domain of the training and test sets is TED talks). After you finish, please go to sheet v2.

idx  Source Language    Target Language  Parallel Sents (k)  Src vocab (k)  Src subword vocab (k)  Tgt vocab (k)  Tgt subword vocab (k)  BLEU
1    Basque (eus)       English          5                   20             8                      9              6
2    Slovak (slk)       English          61                  134            8                      36             8
3    Burmese (mya)      English          21                  101            8                      21             8
4    Korean (kor)       English          206                 386            9                      67             8
5    Lithuanian (lit)   English          42                  108            8                      29             8
6    Arabic (ara)       English          214                 308            8                      69             8
7    Czech (ces)        English          103                 181            8                      47             8
8    Esperanto (epo)    English          7                   21             8                      10             6
9    Finnish (fin)      English          24                  77             8                      22             8
10   Albanian (sqi)     English          45                  93             8                      30             8
11   Vietnamese (vie)   English          172                 66             8                      61             8

(The BLEU column is left blank for the participant to fill in.)


Please provide your prediction of the BLEU score in the yellow area given all the informationin this sheet. Note that all experiments are trained with the same model.

idx | Source Language | Target Language | Parallel Sentences (k) | Source vocab size (k) | Source subword vocab size (k) | Target vocab size (k) | Target subword vocab size (k) | BLEU
1 | Basque (eus) | English | 5 | 20 | 8 | 9 | 6 |
2 | Slovak (slk) | English | 61 | 134 | 8 | 36 | 8 |
3 | Burmese (mya) | English | 21 | 101 | 8 | 21 | 8 |
4 | Korean (kor) | English | 206 | 386 | 9 | 67 | 8 |
5 | Lithuanian (lit) | English | 42 | 108 | 8 | 29 | 8 |
6 | Arabic (ara) | English | 214 | 308 | 8 | 69 | 8 |
7 | Czech (ces) | English | 103 | 181 | 8 | 47 | 8 |
8 | Esperanto (epo) | English | 7 | 21 | 8 | 10 | 6 |
9 | Finnish (fin) | English | 24 | 77 | 8 | 22 | 8 |
10 | Albanian (sqi) | English | 45 | 93 | 8 | 30 | 8 |
11 | Vietnamese (vie) | English | 172 | 66 | 8 | 61 | 8 |
12 | French (fra) | English | 192 | 158 | 8 | 65 | 8 | 37.74
13 | Estonian (est) | English | 11 | 39 | 8 | 14 | 7 | 9.9
14 | Macedonian (mkd) | English | 25 | 61 | 8 | 23 | 8 | 21.75
15 | Bosnian (bos) | English | 6 | 23 | 8 | 9 | 6 | 32.42
16 | Swedish (swe) | English | 57 | 84 | 8 | 34 | 8 | 33.92
17 | Polish (pol) | English | 176 | 267 | 8 | 63 | 8 | 21.51
18 | Persian (fas) | English | 151 | 148 | 8 | 57 | 8 | 24.5
19 | Kurdish (kur) | English | 10 | 39 | 8 | 14 | 7 | 6.86
20 | Hungarian (hun) | English | 147 | 305 | 8 | 56 | 8 | 22.67
21 | Slovenian (slv) | English | 20 | 58 | 8 | 20 | 8 | 14.18
22 | Romanian (ron) | English | 181 | 205 | 8 | 63 | 8 | 32.42
23 | Russian (rus) | English | 208 | 291 | 8 | 68 | 8 | 22.6
24 | Serbian (srp) | English | 137 | 239 | 8 | 54 | 8 | 30.41
25 | Tamil (tam) | English | 6 | 27 | 8 | 10 | 6 | 1.82
26 | Kazakh (kaz) | English | 3 | 15 | 8 | 7 | 5 | 2.05
27 | Marathi (mar) | English | 10 | 29 | 8 | 13 | 7 | 3.68
28 | Ukrainian (ukr) | English | 108 | 191 | 8 | 48 | 8 | 24.09
29 | Thai (tha) | English | 98 | 323 | 8 | 45 | 8 | 20.34
30 | Belarusian (bel) | English | 5 | 20 | 8 | 8 | 5 | 2.85
31 | Turkish (tur) | English | 182 | 304 | 8 | 63 | 8 | 22.52
32 | Azerbaijani (aze) | English | 6 | 23 | 8 | 9 | 6 | 3.1
33 | German (deu) | English | 168 | 194 | 8 | 61 | 8 | 33.15
34 | Bulgarian (bul) | English | 174 | 216 | 8 | 62 | 8 | 35.78
35 | Norwegian (nob) | English | 16 | 36 | 8 | 17 | 7 | 29.63
36 | Georgian (kat) | English | 13 | 44 | 8 | 15 | 7 | 4.94
37 | Danish (dan) | English | 45 | 72 | 8 | 31 | 8 | 37.73
38 | Armenian (hye) | English | 21 | 56 | 8 | 20 | 8 | 13.97
39 | Mandarin (cmn) | English | 200 | 481 | 9 | 67 | 8 | 17.0
40 | Indonesian (ind) | English | 87 | 76 | 8 | 43 | 8 | 27.27
41 | Galician (glg) | English | 10 | 28 | 8 | 13 | 7 | 16.84
42 | Portuguese (por) | English | 185 | 165 | 8 | 64 | 8 | 41.67
43 | Urdu (urd) | English | 6 | 13 | 6 | 10 | 6 | 3.38
44 | Italian (ita) | English | 205 | 195 | 8 | 67 | 8 | 35.67
45 | Spanish (spa) | English | 196 | 179 | 8 | 66 | 8 | 39.48
46 | Greek (ell) | English | 134 | 171 | 8 | 54 | 8 | 34.94
47 | Bengali (ben) | English | 5 | 18 | 8 | 9 | 6 | 2.79
48 | Japanese (jpn) | English | 204 | 584 | 9 | 67 | 8 | 11.42
49 | Malay (msa) | English | 5 | 13 | 7 | 9 | 6 | 3.68
50 | Dutch (nld) | English | 184 | 172 | 8 | 63 | 8 | 34.27
51 | Croatian (hrv) | English | 122 | 191 | 8 | 52 | 8 | 31.84
52 | Hebrew (heb) | English | 212 | 276 | 8 | 68 | 8 | 33.89
53 | Mongolian (mon) | English | 8 | 21 | 8 | 11 | 6 | 2.96
54 | Hindi (hin) | English | 19 | 31 | 8 | 19 | 7 | 14.25

Answer key: eus: 3.37, slk: 25.36, mya: 3.93, kor: 16.23, lit: 13.75, ara: 28.38, ces: 25.07, epo: 3.28, fin: 13.79, sqi: 29.6, vie: 24.67.
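Responses to the first sheet can be scored against this key with root mean squared error (RMSE), the metric used throughout these appendices. The snippet below is a toy illustration of that computation; the participant predictions in it are invented, and only the answer-key values come from the sheet above.

```python
# Toy scoring of a questionnaire response against the answer key above.
import numpy as np

answer_key = {"eus": 3.37, "slk": 25.36, "mya": 3.93, "kor": 16.23, "lit": 13.75,
              "ara": 28.38, "ces": 25.07, "epo": 3.28, "fin": 13.79, "sqi": 29.6,
              "vie": 24.67}

# Hypothetical participant predictions for the 11 blank BLEU cells.
participant = {"eus": 5.0, "slk": 22.0, "mya": 8.0, "kor": 18.0, "lit": 15.0,
               "ara": 30.0, "ces": 27.0, "epo": 6.0, "fin": 12.0, "sqi": 25.0,
               "vie": 26.0}

errors = np.array([participant[k] - answer_key[k] for k in answer_key])
print(f"RMSE over the 11 held-out pairs: {np.sqrt((errors ** 2).mean()):.2f}")
```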


B Representative datasets

In this section, we show the search results for the most and least representative subsets for the remaining five tasks.

[Figure 4: five panels (UD, TSFMT, TSFPOS, TSFPARSING, TSFEL), each plotting RMSE against subset size (2–5) for the most representative, least representative, and randomly searched subsets; each bar is labelled with the datasets it selects.]

Figure 4: Beam search results (beam size=100) for up to the 5 most (and least) representative datasets for the remaining NLP tasks. We also show random search results of corresponding sizes.
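The subsets in Figure 4 come from a beam search over dataset combinations, where each candidate subset is scored by how well a predictor trained on it extrapolates to the remaining datasets (low RMSE = most representative, high RMSE = least representative). Below is a minimal sketch of this procedure on synthetic data; the regressor choice, helper names, and toy features are illustrative rather than taken from our released implementation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Toy stand-in for experimental records: one row per dataset,
# with dataset-level features and an observed evaluation score.
n_datasets, n_features = 30, 5
X = rng.normal(size=(n_datasets, n_features))
y = X @ rng.normal(size=n_features) + rng.normal(scale=0.1, size=n_datasets)

def subset_rmse(subset):
    """Train on the chosen subset and return RMSE on the held-out datasets."""
    train = list(subset)
    test = [i for i in range(n_datasets) if i not in subset]
    model = GradientBoostingRegressor().fit(X[train], y[train])
    return float(np.sqrt(mean_squared_error(y[test], model.predict(X[test]))))

def beam_search(k_max=5, beam_size=100, most_representative=True):
    """Grow subsets one dataset at a time, keeping the top `beam_size` at each size."""
    beam = [(i,) for i in range(n_datasets)]  # start from singletons, score from size 2 on
    best = {}
    for k in range(2, k_max + 1):
        scored = {}
        for subset in beam:
            for i in range(n_datasets):
                if i in subset:
                    continue
                candidate = tuple(sorted(subset + (i,)))
                if candidate not in scored:
                    scored[candidate] = subset_rmse(candidate)
        # Low RMSE = most representative; high RMSE = least representative.
        ranked = sorted(scored, key=scored.get, reverse=not most_representative)
        beam = ranked[:beam_size]
        best[k] = (beam[0], scored[beam[0]])
    return best

for k, (subset, rmse) in beam_search(k_max=4, beam_size=10).items():
    print(f"size {k}: datasets {subset}, held-out RMSE {rmse:.2f}")
```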


C New Model

In this section, we show the extrapolation performance for new models on BLI, MA, and the remaining systems of UD.
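The setup behind Figures 5–7 is: train the predictor on all records from the other models plus k ∈ {0, ..., 5} records sampled from the new model, then measure RMSE on the new model's remaining records and compare with a dataset-wise mean-value baseline. The sketch below illustrates this on synthetic records; the feature columns, model names, and regressor choice are placeholders, not our released code.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Toy experimental records: every model evaluated on every dataset.
models = [f"model_{m}" for m in range(4)]
datasets = [f"data_{d}" for d in range(20)]
rows = []
for m, model in enumerate(models):
    for d, dataset in enumerate(datasets):
        f1, f2, f3 = rng.normal(size=3)
        score = 10 + 2 * m + 3 * f1 + f2 + 0.5 * d + rng.normal(scale=0.2)
        rows.append((model, dataset, f1, f2, f3, score))
df = pd.DataFrame(rows, columns=["model", "dataset", "f1", "f2", "f3", "score"])

# One-hot encode model identity; keep a fixed column order for train/test.
all_columns = pd.get_dummies(df[["model", "f1", "f2", "f3"]], columns=["model"]).columns

def featurize(frame):
    feats = pd.get_dummies(frame[["model", "f1", "f2", "f3"]], columns=["model"])
    return feats.reindex(columns=all_columns, fill_value=0)

def extrapolate(new_model, k, seed=0):
    """Train on other models' records plus k records of `new_model`; test on the rest."""
    new = df[df["model"] == new_model]
    others = df[df["model"] != new_model]
    sampled = new.sample(n=k, random_state=seed)
    train = others if k == 0 else pd.concat([others, sampled])
    test = new.drop(sampled.index)
    reg = GradientBoostingRegressor().fit(featurize(train), train["score"])
    pred_rmse = np.sqrt(mean_squared_error(test["score"], reg.predict(featurize(test))))
    # Dataset-wise mean baseline: predict each dataset's mean score over the other models.
    baseline = test["dataset"].map(others.groupby("dataset")["score"].mean())
    base_rmse = np.sqrt(mean_squared_error(test["score"], baseline))
    return pred_rmse, base_rmse

for k in range(6):
    pred_rmse, base_rmse = extrapolate("model_0", k)
    print(f"k={k}: predictor RMSE {pred_rmse:.2f} | dataset-mean baseline {base_rmse:.2f}")
```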

[Figure 5: three panels, Sinkhorn (38.43), Artetxe17 (36.46), and Artetxe16 (46.7), each plotting RMSE against the number of records (0–5) from the new model.]

Figure 5: RMSE scores of BLI task from dataset-wise mean value predictor (the dashed black line in each graph) and predictors trained with experimental records of other models and 0–5 records from a new model (as indicated by the title of each graph).

[Figure 6: six panels, CHARLES-SAARLAND-02-2 (93.23), Unknown (93.19), EDINBURGH-01-2 (88.93), OHIOSTATE-01-2 (87.42), CMU-01-2-DataAug (86.53), and CARNEGIEMELLON-02-2 (85.06), each plotting RMSE against the number of records (0–5) from the new model.]

Figure 6: RMSE scores of MA task from dataset-wise mean value predictor (the dashed black line in each graph) and predictors trained with experimental records of other models and 0–5 records from a new model (as indicated by the title of each graph).


[Figure 7: fourteen panels, TurkuNLP (75.93), CEA (75.06), Stanford (75.05), Uppsala (74.76), AntNLP (74.1), ParisNLP (74.05), NLP-Cube (73.96), SLT-Interactions (72.92), IBM (71.88), LeisureX (71.7), UniMelb (71.54), Fudan (69.42), KParse (69.39), and BASELINE (68.5), each plotting RMSE against the number of records (0–5) from the new model.]

Figure 7: RMSE scores of UD task from dataset-wise mean value predictor (the dashed black line in each graph) and predictors trained with experimental records of other models and 0–5 records from a new model (as indicated by the title of each graph).


D Feature importance

In this section, we show the plots of feature importance for all the tasks.
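The F scores in these plots are split-count feature importances as reported by gradient-boosted tree packages; the plotting style and "F score" axis label match XGBoost's plot_importance with its default importance type ("weight", the number of splits using each feature). The sketch below shows how such a plot can be produced; the data, feature names, and hyperparameters here are synthetic and purely illustrative.

```python
import numpy as np
import pandas as pd
import xgboost as xgb
import matplotlib
matplotlib.use("Agg")  # write the plot to a file instead of opening a window
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic stand-ins for dataset features; the score depends mostly on data size.
feature_names = ["dataset size (sent)", "Source lang word TTR",
                 "Target lang vocab size", "GEOGRAPHIC", "SYNTACTIC"]
X = pd.DataFrame(rng.normal(size=(500, len(feature_names))), columns=feature_names)
y = 3 * X["dataset size (sent)"] + X["Source lang word TTR"] + 0.1 * rng.normal(size=500)

model = xgb.XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.1)
model.fit(X, y)

# "F score" = importance_type "weight": how many splits use each feature.
print(model.get_booster().get_score(importance_type="weight"))

xgb.plot_importance(model)  # horizontal bar chart with "F score" on the x-axis
plt.tight_layout()
plt.savefig("feature_importance.png")
```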

[Feature importance plots: one horizontal bar chart per task (Wiki-MT, TED-MT, TSFMT, TSFPARSING, TSFPOS, TSFEL, BLI, MA, and UD), ranking each predictor's input features by F score.]

