Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8625–8646, July 5–10, 2020. ©2020 Association for Computational Linguistics
Predicting Performance for Natural Language Processing Tasks
Mengzhou Xia, Antonios Anastasopoulos, Ruochen Xu, Yiming Yang, Graham Neubig
Language Technologies Institute, Carnegie Mellon University
{mengzhox,aanastas,yiming,gneubig}@[email protected]
Abstract
Given the complexity of combinations of tasks, languages, and domains in natural language processing (NLP) research, it is computationally prohibitive to exhaustively test newly proposed models on each possible experimental setting. In this work, we attempt to explore the possibility of gaining plausible judgments of how well an NLP model can perform under an experimental setting, without actually training or testing the model. To do so, we build regression models to predict the evaluation score of an NLP experiment given the experimental settings as input. Experimenting on 9 different NLP tasks, we find that our predictors can produce meaningful predictions over unseen languages and different modeling architectures, outperforming reasonable baselines as well as human experts. Going further, we outline how our predictor can be used to find a small subset of representative experiments that should be run in order to obtain plausible predictions for all other experimental settings.1
1 Introduction
Natural language processing (NLP) is an extraordinarily vast field, with a wide variety of models being applied to a multitude of tasks across a plenitude of domains and languages. In order to measure progress in all these scenarios, it is necessary to compare performance on test datasets representing each scenario. However, the cross-product of tasks, languages, and domains creates an explosion of potential application scenarios, and it is infeasible to collect high-quality test sets for each. In addition, even for tasks where we do have a wide variety of test data, e.g. for well-resourced tasks such as machine translation (MT), it is still
1 Code, data and logs are publicly available at https://github.com/xiamengzhou/NLPerf.
computationally prohibitive, as well as not environmentally friendly (Strubell et al., 2019), to build and test systems for all the languages or domains we are interested in. Because of this, the common practice is to test new methods on a small number of languages or domains, often semi-arbitrarily chosen based on previous work or the experimenters' intuition.
As a result, this practice impedes the NLP community from gaining a comprehensive understanding of newly-proposed models. Table 1 illustrates this fact with an example from bilingual lexicon induction, a task that aims to find word translation pairs from cross-lingual word embeddings. As Table 1 vividly displays, almost all of the works report evaluation results on a different subset of language pairs. Evaluating only on a small subset raises concerns about making inferences when comparing the merits of these methods: there is no guarantee that performance on English–Spanish (EN–ES, the only common evaluation dataset) is representative of the expected performance of the models over all other language pairs (Anastasopoulos and Neubig, 2020). Such phenomena lead us to consider whether it is possible to make a decently accurate estimate of the performance on an untested language pair without actually running the NLP model, bypassing the computational restriction.
Toward that end, drawing on the idea of characterizing an experiment from Lin et al. (2019), we propose a framework, which we call NLPERF, to provide an exploratory solution. We build regression models to predict the performance on a particular experimental setting given past experimental records of the same task, with each record consisting of a characterization of its training dataset and a performance score of the corresponding metric. Concretely, in §2, we start with a partly populated table (such as the one from
BLI Method              | DE–EN EN–DE ES–EN EN–ES FR–EN EN–FR IT–EN EN–IT EN–PT EN–RU ES–DE PT–RU
Zhang et al. (2017)     |   ?     X     X     X     ?     ?     X     ?     ?     ?     ?     ?
Chen and Cardie (2018)  |   X     X     X     X     X     X     X     X     X     ?     X     ?
Yang et al. (2019)      |   X     X     X     X     X     X     X     ?     ?     ?     ?     ?
Heyman et al. (2019)    |   ?     X     ?     X     ?     X     ?     X     ?     ?     ?     ?
Huang et al. (2019)     |   ?     ?     X     X     X     X     ?     ?     ?     ?     ?     ?
Artetxe et al. (2019)   |   X     X     X     X     X     X     ?     ?     ?     X     ?     ?
Table 1: An illustration of the comparability issues across methods and multiple evaluation datasets from the Bilingual Lexicon Induction task. Our prediction model can reasonably fill in the blanks, as illustrated in Section 4.
Table 1) and attempt to infer the missing values with the predictor. We begin by introducing the process of characterizing an NLP experiment for each task in §3. We evaluate the effectiveness and robustness of NLPERF by comparing to multiple baselines and to human experts, and by perturbing a single feature to simulate a grid search over that feature (§4). Evaluations on multiple tasks show that NLPERF is able to outperform all baselines. Notably, on a machine translation (MT) task, the predictions made by the predictor turn out to be more accurate than those of human experts.
An effective predictor can be very useful for multiple applications associated with practical scenarios. In §5, we show how it is possible to adopt the predictor as a scoring function to find a small subset of experiments that are most representative of a bigger set of experiments. We argue that this will allow researchers to make informed decisions on what datasets to use for training and evaluation, in the case where they cannot experiment on all experimental settings. Last, in §6, we show that we can adequately predict the performance of new models even with a minimal number of experimental records.
2 Problem Formulation
In this section we formalize the problem of predicting performance on supervised NLP tasks. Given an NLP model of architecture M trained over dataset(s) D of a specific task involving language(s) L with a training procedure (optimization algorithms, learning rate scheduling, etc.) P, we can test the model on a test dataset D′ and get a score S of a specific evaluation metric. The resulting score will surely vary depending on all the above-mentioned factors, and we denote this relation as g:
    S_{M,P,L,D,D′} = g(M, P, L, D, D′).   (1)
In the ideal scenario, for each test dataset D′ of a specific task, one could enumerate all different settings and find the one that leads to the best performance. As mentioned in §1, however, such a brute-force method is computationally infeasible. Thus, we turn to modeling the process, formulating our problem as a regression task that uses a parametric function fθ to approximate the true function g as follows:
    S_{M,P,L,D,D′} = f_θ([Φ_M; Φ_P; Φ_L; Φ_D; Φ_{D′}])
where Φ∗ denotes a set of features for each influencing factor.
For the purpose of this study, we mainly focus on the dataset and language features ΦL and ΦD, as this already results in a significant search space, and gathering extensive experimental results with fine-grained tuning over model and training hyperparameters is both expensive and relatively complicated. In the cases where we handle multiple models, we only use a single categorical model feature, denoted ΦC, to represent the combination of model architecture and training procedure. We still use the term model to refer to this combination in the rest of the paper. We also omit the test set features, under the assumption that the data distributions for training and testing are the same (a fairly reasonable assumption if we ignore possible domain shift). Therefore, for all experiments below, our final prediction function is the following:
    S_{C,L,D} = f_θ([Φ_C; Φ_L; Φ_D])
In the next section we describe concrete instantiations of this function for several NLP tasks.
3 NLP Task Instantiations
To build a predictor for NLP task performance, we must 1) select a task, 2) describe its featurization, and 3) train a predictor. We describe the details of these three steps in this section.
Task        | Dataset Citation       | Source Langs | Target Langs | Transfer Langs | # Models | # EXs | Task Metric
Wiki-MT     | Schwenk et al. (2019)  | 39           | 39           | –              | single   | 995   | BLEU
TED-MT      | Qi et al. (2018)       | 54           | 1            | –              | single   | 54    | BLEU
TSF-MT      | Qi et al. (2018)       | 54           | 1            | 54             | single   | 2862  | BLEU
TSF-PARSING | Nivre et al. (2018)    | –            | 30           | 30             | single   | 870   | Accuracy
TSF-POS     | Nivre et al. (2018)    | –            | 26           | 60             | single   | 1531  | Accuracy
TSF-EL      | Rijhwani et al. (2019) | –            | 9            | 54             | single   | 477   | Accuracy
BLI         | Lample et al. (2018)   | 44           | 44           | –              | 3        | 88×3  | Accuracy
MA          | McCarthy et al. (2019) | –            | 66           | –              | 6        | 107×6 | F1
UD          | Zeman et al. (2018a)   | –            | 53           | –              | 25       | 72×25 | F1
Table 2: Statistics of the datasets we use for training predictors. # EXs denotes the total number of experiment instances; Task Metric reflects how the models are evaluated.
Tasks  We test on tasks including bilingual lexicon induction (BLI); machine translation trained on aligned Wikipedia data (Wiki-MT), on TED talks (TED-MT), and with cross-lingual transfer for translation into English (TSF-MT); cross-lingual dependency parsing (TSF-Parsing); cross-lingual POS tagging (TSF-POS); cross-lingual entity linking (TSF-EL); morphological analysis (MA); and universal dependency parsing (UD). Basic statistics on the datasets are outlined in Table 2.
For the Wiki-MT task, we collect experimental records directly from the paper describing the corresponding datasets (Schwenk et al., 2019). For TED-MT and all the transfer tasks, we use the results of Lin et al. (2019). For BLI, we conduct experiments using published results from three papers, namely Artetxe et al. (2016), Artetxe et al. (2017), and Xu et al. (2018). For MA, we use the results of SIGMORPHON 2019 shared task 2 (McCarthy et al., 2019). Last, the UD results are taken from the CoNLL 2018 shared task on universal dependency parsing (Zeman et al., 2018b).
Featurization  For language features, we utilize six distance features from the URIEL Typological Database (Littell et al., 2017), namely geographic, genetic, inventory, syntactic, phonological, and featural distance.
The complete set of dataset features includes thefollowing:
1. Dataset Size: the number of data entries used for training.
2. Word/Subword Vocabulary Size: the number of word/subword types.
3. Average Sentence Length: the average length of sentences from all experiments.
4. Word/Subword Overlap: |T1 ∩ T2| / (|T1| + |T2|), where T1 and T2 denote the vocabularies of any two corpora.
5. Type-Token Ratio (TTR): the ratio between the number of types and the number of tokens (Richards, 1987) of one corpus.
6. Type-Token Ratio Distance: (1 − TTR1/TTR2)², where TTR1 and TTR2 denote the TTRs of any two corpora.
7. Single Tag Type: the number of single tag types.
8. Fused Tag Type: the number of fused tag types.
9. Average Tag Length Per Word: the average number of single tags for each word.
10. Dependency Arcs Matching WALS Features: the proportion of dependency parsing arcs matching the following WALS features, computed over the training set: subject/object/oblique before/after verb and adjective/numeral before/after noun.
For transfer tasks, we use the same set of dataset features ΦD as Lin et al. (2019), including features 1–6 on the source and the transfer language side. We also include language distance features between the source and transfer languages, as well as between the source and target languages. For MT tasks, we use features 1–6 and language distance features, but only between the source and target languages. For MA, we use features 1, 2, 5 and the morphological-tag-related features 7–9. For UD, we
use features 1, 2, 5, and 10. For BLI, we use language distance features and URIEL syntactic features for the source and the target language.
Predictor  Our prediction model is based on gradient boosting trees (Friedman, 2001), implemented with XGBoost (Chen and Guestrin, 2016). This method is widely known as an effective means of solving problems including ranking, classification, and regression. We also experimented with Gaussian processes (Williams and Rasmussen, 1996), but settled on gradient boosted trees because performance was similar and XGBoost's implementation is very efficient through the use of parallelism. We use squared error as the objective function for the regression and adopt a fixed learning rate of 0.1. To allow the model to fully fit the data we set the maximum tree depth to 10 and the number of trees to 100, and use the default regularization terms to prevent the model from overfitting.
4 Can We Predict NLP Performance?
In this section we investigate the effectiveness of NLPERF across different tasks on various metrics. Following Lin et al. (2019), we conduct k-fold cross validation for evaluation. To be specific, we randomly partition the experimental records of 〈L, D, C, S〉 tuples into k folds, use k − 1 folds to train a prediction model, and evaluate on the remaining fold. Note that this scenario is similar to "filling in the blanks" in Table 1, where we have some experimental records that we can train the model on, and predict the remaining ones.
For evaluation, we calculate the average rootmean square error (RMSE) between the predictedscores and the true scores.
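This evaluation loop can be sketched in a few lines of stdlib Python; here the left-out fold is scored with a simple mean-of-training-scores predictor as a stand-in for a trained NLPERF model, and the names are ours.

```python
import math
import random

def kfold_rmse(records, k=5, seed=0):
    """k-fold evaluation: average RMSE over the k left-out folds.

    records is a list of (features, score) pairs; the features are
    ignored by the stand-in mean predictor used here.
    """
    records = records[:]
    random.Random(seed).shuffle(records)
    folds = [records[i::k] for i in range(k)]

    rmses = []
    for i in range(k):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        test = folds[i]
        # Stand-in predictor: mean score of the k-1 training folds.
        pred = sum(score for _, score in train) / len(train)
        mse = sum((score - pred) ** 2 for _, score in test) / len(test)
        rmses.append(math.sqrt(mse))
    return sum(rmses) / len(rmses)
```

Swapping the mean prediction for a model trained on the k − 1 folds turns this into the evaluation used throughout the section.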
Baselines  We compare against a simple mean value baseline, as well as against language-wise and model-wise mean value baselines. The simple mean value baseline outputs the average of the scores s from the training folds for all test entries in the left-out fold i:

    s_mean^(i) = (1 / |S \ S^(i)|) Σ_{s ∈ S \ S^(i)} s,   i ∈ 1 . . . k   (2)
Note that for tasks involving multiple models, we calculate the RMSE score separately on each model and use the mean RMSE of all models as the final RMSE score.
The language-wise baselines make more informed predictions, taking into account only training instances with the same transfer, source, or target language (depending on the task setting). For example, the source-language mean value baseline s_s-lang^(i,j) for the jth test instance in fold i outputs the average of the scores s of the training instances that share the same source language s-lang, as shown in Equation 3:

    s_s-lang^(i,j) = Σ_(s,φ) δ(φ_{L,src} = s-lang) · s / Σ_(s,φ) δ(φ_{L,src} = s-lang),
        (s, φ) ∈ (S \ S^(i)) × (Φ \ Φ^(i))   (3)

where δ is the indicator function. Similarly, we define the target- and the transfer-language mean value baselines.
In a similar manner, we also compare against a model-wise mean value baseline for tasks that include experimental records from multiple models. Now, the prediction for the jth test instance in the left-out fold i is the average of the scores on the same dataset (as characterized by the language φ_L and dataset φ_D features) from all other models:

    s_model^(i,j) = Σ_(s,φ) δ(φ_L = lang, φ_D = data) · s / Σ_(s,φ) δ(φ_L = lang, φ_D = data),
        (s, φ) ∈ (S \ S^(i)) × (Φ \ Φ^(i))   (4)

where lang = Φ_L^(i,j) and data = Φ_D^(i,j) respectively denote the language and dataset features of the test instance.
Main Results  For multi-model tasks, we can do either Single-Model (SM) prediction, restricting training and testing of the predictor to a single model, or Multi-Model (MM) prediction using a categorical model feature. The RMSE scores of NLPERF along with the baselines are shown in Table 3. For all tasks, our single-model predictor is able to more accurately estimate the evaluation score of unseen experiments compared to the single-model baselines, confirming our hypothesis that there exists a correlation, which can be captured, between experimental settings and the downstream performance of NLP systems. The language-wise baselines are much stronger than the simple mean value baseline but still perform worse than our single-model predictor. Similarly, the model-wise baseline significantly outperforms the mean value baseline because results from other models reveal much information about the dataset.
Model              | Wiki-MT | TED-MT | TSF-MT | TSF-PARSING | TSF-POS | TSF-EL | BLI   | MA   | UD
Mean               |  6.40   | 12.65  | 10.77  | 17.58       | 29.10   | 18.65  | 20.10 | 9.47 | 17.69
Transfer Lang-wise |   –     |   –    | 10.96  | 15.68       | 29.98   | 20.55  |   –   |  –   |   –
Source Lang-wise   |  5.69   | 12.65  |  2.24  |   –         |   –     |   –    | 20.13 |  –   |   –
Target Lang-wise   |  5.12   | 12.65  | 10.78  | 12.05       |  8.92   |  8.61  | 20.00 | 9.47 |   –
NLPERF (SM)        |  2.50   |  6.18  |  1.43  |  6.24       |  7.37   |  7.82  | 12.63 | 6.48 | 12.06
Model-wise         |   –     |   –    |   –    |   –         |   –     |   –    |  8.77 | 5.22 |  4.96
NLPERF (MM)        |   –     |   –    |   –    |   –         |   –     |   –    |  6.87 | 3.18 |  3.54
Table 3: RMSE scores of three baselines and our predictions under the single-model and multi-model settings (missing values correspond to settings not applicable to the task). All results are from k-fold (k = 5) evaluations averaged over 10 random runs.
Even so, our multi-model predictor still outperforms the model-wise baseline.
These results imply that, for a wide range of tasks, our predictor is able to reasonably estimate left-out slots in a partly populated table given the results of other experimental records, without actually running the system.
We should note that RMSE scores across different tasks should not be directly compared, mainly because the scale of each evaluation metric is different. For example, a BLEU score (Papineni et al., 2002) for MT experiments typically ranges from 1 to 40, while an accuracy score usually has a much larger range: BLI accuracy ranges from 0.333 to 78.2 and TSF-POS accuracy ranges from 1.84 to 87.98, which consequently makes the RMSE scores of these tasks higher.
Comparison to Expert Human Performance  We constructed a small-scale case study to evaluate whether NLPERF is competitive with the performance of NLP sub-field experts. We focused on the TED-MT task and recruited 10 MT practitioners,2 all of whom had published at least 3 MT-related papers in ACL-related conferences.
In the first set of questions, the participants were presented with language pairs from one of the k data folds along with the dataset features, and were asked to estimate an eventual BLEU score for each data entry. In the second part of the questionnaire, the participants were tasked with making estimations on the same set of language pairs, but this time they also had access to the features and BLEU scores from all the other folds.3
2 None of the study participants were affiliated with the authors' institutions, nor were they familiar with this paper's content.
3 The interested reader can find an example questionnaire (and make estimations over one of the folds) in Appendix A.
Predictor                 | RMSE
Mean Baseline             | 12.64
Human (w/o training data) |  9.38
Human (w/ training data)  |  7.29
NLPERF                    |  6.04
Table 4: Our model performs better than human MT experts on the TED-MT prediction task.
The partition of the folds is consistent between the human study and the training/evaluation for the predictor. While the first sheet is intended to familiarize the participants with the task, the second sheet fairly adopts the training/evaluation setting of our predictor. As shown in Table 4, our participants outperform the mean baseline even without information from other folds, demonstrating their own strong prior knowledge of the field. In addition, the participants make more accurate guesses after acquiring more information on experimental records in other folds. In neither case, though, are the human experts competitive with our predictor. In fact, only one of the participants achieved performance comparable to our predictor.
Feature Perturbation  Another question of interest concerning predicting performance is "how will the model perform when trained on data of a different size" (Kolachina et al., 2012a). To test NLPERF's extrapolation ability in this regard, we conduct an array of experiments with various data sizes on the Wiki-MT task. We pick two language pairs, Turkish to English (TR–EN) and Portuguese to English (PT–EN), as our testbed. We sample parallel datasets of different sizes and train MT models on each sampled dataset to obtain the true BLEU scores. On the other hand, we collect the features of all sampled datasets and use our predictor (trained over all other language pairs) to obtain predictions. The true and predicted BLEU scores are plotted in Figure 1. Our predictor achieves a very low average RMSE of 1.83 for the TR–EN pair but a relatively higher RMSE of 9.97 for the PT–EN pair. The favorable performance on the TR–EN pair demonstrates the ability of our predictor to extrapolate over dataset size. In contrast, the predictions on the PT–EN pair are significantly less accurate. This is due to the fact that there are only two other experimental settings scoring as high as 34 BLEU, with data sizes of 3378k (EN–ES) and 611k (GL–ES), leading to the predictor's inadequacy in predicting high BLEU scores for low-resourced datasets during extrapolation. This reveals that while the predictor is able to extrapolate performance on settings similar to what it has seen in the data, NLPERF may be less successful under circumstances unlike its training inputs.

Figure 1: Our model's predicted BLEU scores and true BLEU scores, on sampled TR–EN datasets (sizes 10k/50k/100k/200k/478k) and PT–EN datasets (sizes 100k/500k/1000k/2000k/2462k), achieving RMSE scores of 1.83 and 9.97 respectively.
5 What Datasets Should We Test On?
As shown in Table 1, it is common practice to test models on a subset of all available datasets. The reason for this is practical: it is computationally prohibitive to evaluate on all settings. However, if we pick test sets that are not representative of the data as a whole, we may mistakenly reach unfounded conclusions about how well models perform on other data with distinct properties. For example, models trained on a small-sized dataset may not scale well to a large-sized one, and models that perform well on languages with a particular linguistic characteristic may not do well on languages with other characteristics (Bender and Friedman, 2018).
Here we ask the following question: if we are only practically able to test on a small number of experimental settings, which ones should we test on to achieve maximally representative results? Answering this question has practical implications: organizers of large shared tasks like SIGMORPHON (McCarthy et al., 2019) or UD (Zeman et al., 2018a) could create a minimal subset of settings upon which they would ask participants to test, in order to get representative results; similarly, participants could possibly expedite the iteration of model development by testing on the representative subset only. A similar avenue for researchers and companies deploying systems over multiple languages could lead not only to financial savings, but potentially to a significant cut-down of emissions from model training (Strubell et al., 2019).
We present an approximate explorative solution to the problem mentioned above. Formally, assume that we have a set N comprising experimental records (both features and scores) of n datasets for one task. We set a number m (< n) of datasets that we would like to select as the representative subset. Defining RMSE_A(B) to be the RMSE score obtained by training the predictor on a subset A of experimental records and evaluating it on another subset B, we consider the most representative subset D to be the one that minimizes the RMSE score when predicting all of the other datasets:

    argmin_{D ⊂ N} RMSE_D(N \ D).   (5)

Naturally, enumerating all (n choose m) possible subsets would be prohibitively costly, even though it would lead to the optimal solution. Instead, we employ a beam-search-like approach to efficiently search for an approximate solution to the best performing subset of arbitrary size. Concretely, we start our approximate search with an exhaustive enumeration of all subsets of size 2. At each following step t, we only take the best k subsets {D_t^(i); i ∈ 1, . . . , k} into account and discard the rest. As shown in Equation 6, for each candidate
[Figure 2 plots: RMSE (y-axis) against representative-subset size 2–5 (x-axis) for the BLI, MA, TED-MT, and Wiki-MT tasks, with the most representative, least representative, and random-search subsets labeled on the curves.]
Figure 2: Beam search results (beam size = 100) for up to the 5 most (and least) representative datasets for 4 NLP tasks. We also show random search results averaged over 100 random runs.
subset, we expand it with one more data point:

    {D_t^(i) ∪ {s}; ∀i ∈ 1 . . . k, s ∈ N \ D_t^(i)}.   (6)

For tasks that involve multiple models, we take experimental records of the selected dataset from all models into account during expansion. Given all expanded subsets, we train a predictor for each and evaluate on the rest of the datasets, keeping the best performing k subsets {D_{t+1}^(i); i ∈ 1, . . . , k} with minimum RMSE scores for the next step. Furthermore, note that by simply changing the argmin to an argmax in Equation 5, we can also find the least representative datasets.
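The beam search described above can be sketched as follows, with a pluggable `rmse(train, test)` callback standing in for training a predictor on `train` and computing its RMSE on `test` (Equation 5); the function names and toy interface are ours.

```python
from itertools import combinations

def beam_search_subsets(datasets, rmse, max_size=5, beam=100):
    """Approximate search for the most representative subsets.

    datasets: hashable dataset identifiers.
    rmse(train, test): RMSE of a predictor trained on `train` and
        evaluated on `test` (both sets of identifiers).
    Returns the best subset found at each size 2..max_size.
    """
    all_ds = set(datasets)
    # Step 1: exhaustively enumerate all subsets of size 2.
    cands = [set(c) for c in combinations(datasets, 2)]
    best_per_size = {}
    for size in range(2, max_size + 1):
        # Score each candidate on all remaining datasets, keep the beam.
        scored = sorted(cands, key=lambda d: rmse(d, all_ds - d))
        cands = scored[:beam]
        best_per_size[size] = cands[0]
        # Expand each kept subset with one more dataset (Equation 6).
        expanded, seen = [], set()
        for d in cands:
            for s in all_ds - d:
                nd = frozenset(d | {s})
                if nd not in seen:
                    seen.add(nd)
                    expanded.append(set(nd))
        cands = expanded
    return best_per_size
```

Sorting by `-rmse(...)` instead would return the least representative subsets, mirroring the argmin-to-argmax swap noted in the text.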
We present search results for four tasks4 as beam search progresses in Figure 2, with the corresponding RMSE scores over all remaining datasets as the y-axis. For comparison, we also conduct random searches by expanding the subset with a randomly selected experimental record. In all cases, the most representative sets are an aggregation of datasets with diverse characteristics such as languages and dataset sizes. For example, in the Wiki-MT task, the 5 most representative datasets include languages that fall into a diverse range of language families such as Romance, Turkic, and Slavic, while the least representative ones include duplicate pairs (opposite directions) mostly
4Readers can find results on other tasks in Appendix B.
involving English. The phenomenon is more pronounced in the TED-MT task, where not only are the 5 most representative source languages diverse, but so are the dataset sizes. Specifically, Malay–English (msa-eng) is a tiny dataset (5k parallel sentences), while Hebrew–English (heb-eng) is a high-resource case (212k parallel sentences).
Notably, for the BLI task, to test how representative the commonly used datasets are, we select the 5 most frequent language pairs shown in Table 1, namely en-de, es-en, en-es, fr-en, and en-fr, for evaluation. Unsurprisingly, we get an RMSE score as high as 43.44, quite close to the performance of the least representative set found using beam search. This finding indicates that the standard practice of choosing datasets for evaluation is likely unrepresentative of results over the full dataset spectrum, well aligned with the claims of Anastasopoulos and Neubig (2020).
A particularly encouraging observation is that a predictor trained with only the 5 most representative datasets can achieve an RMSE score comparable to k-fold validation, which requires using all of the datasets for training.5 This indicates that one would only need to train NLP models on a small set of representative datasets to obtain reasonably plausible predictions for the rest.
5 To be accurate, k − 1 folds of all datasets.
6 Can We Extrapolate Performance for New Models?
In another common scenario, researchers propose new models for an existing task. It is both time-consuming and computationally intensive to run experiments with all settings for a new model. In this section, we explore whether we can use past experimental records from other models and a minimal set of experiments from the new model to give a plausible prediction over the rest of the datasets, potentially reducing the time and resources needed for experimenting with the new model to a large extent. We use the task of UD parsing as our testbed,6 as it is the task with the most unique models (25, to be exact). Note that we still only use a single categorical feature for the model type.
To investigate how many experiments are needed to obtain a plausible prediction for a new model, we first split the experimental records equally into a sample set and a test set. Then we randomly sample n (0 ≤ n ≤ 5) experimental records from the sample set and add them into the collection of experimental records of past models. Each time we re-train a predictor and evaluate on the test set. The random split repeats 50 times and the random sampling repeats 50 times, adding up to a total of 2500 experiments. We use the mean value of the results from other models, shown in Equation 7, as the prediction baseline for the left-out model; because experiment results of other models reveal significant information about the dataset, this serves as a relatively strong baseline:

    s_k = (1 / (n − 1)) Σ_{i=1}^{n} 1(i ∈ M \ {k}) · s_i,   (7)

where M denotes the collection of models and k denotes the left-out model.
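A minimal sketch of this baseline for a single dataset (the function and argument names are ours): the left-out model's score is predicted as the mean of all other models' scores on the same dataset.

```python
def left_out_model_baseline(scores, left_out):
    """Equation 7 for one dataset: predict the left-out model's score
    as the average score of all other models on that dataset.

    scores: dict mapping model name -> evaluation score on the dataset.
    """
    others = [s for model, s in scores.items() if model != left_out]
    return sum(others) / len(others)
```

Applying this per dataset and averaging the squared errors recovers the dashed baseline curves in Figure 3.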
We show the prediction performance (in RMSE) over 8 systems7 in Figure 3. Interestingly, the predictor trained with no new-model records (0) outperforms the mean value baseline for the 4 best systems, while the opposite holds for the 4 worst systems. Since there is no information provided about the newly-coming model, the predictions are solely based on dataset and language features. One reason might explain the phenomenon: the correlation between the features and the scores of the worse-performing systems is different from
6 MA and BLI task results are in Appendix C.
7 The best and worst 4 systems from the shared task.
that of the better-performing systems, so the predictor is unable to generalize well (ONLP).
In the following discussion, we use RMSE@n to denote the RMSE of the predictor trained with n data points from a new model. The relatively low RMSE@0 scores indicate that other models' features and scores are informative for predicting the performance of the new model even without new-model information. Comparing RMSE@0 and RMSE@1, we observe a consistent improvement for almost all systems, indicating that NLPERF trained on even a single extra random example achieves more accurate estimates over the test sets. Adding more data points consistently leads to additional gains. However, predictions on worse-performing systems benefit more than those on better-performing systems, indicating that their feature-performance correlations might be considerably different. The findings here indicate that by extrapolating from past experiments, one can make plausible judgments for newly developed models.
7 Related Work
As discussed in Domhan et al. (2015), there are two main threads of work focusing on predicting the performance of machine learning algorithms. The first thread predicts the performance of a method as a function of its training time, while the second predicts a method's performance as a function of the training dataset size. Our work belongs to the second thread, but could easily be extended to encompass training time/procedure.
In the first thread, Kolachina et al. (2012b) attempt to infer learning curves based on training data features and extrapolate the initial learning curves based on BLEU measurements for statistical machine translation (SMT). Extrapolating the performance of initial learning curves to the remainder of training allows for early termination of a bad run (Domhan et al., 2015).
In the second thread, Birch et al. (2008) adopt linear regression to capture the relationship between data features and SMT performance, and find that the amount of reordering, the morphological complexity of the target language, and the relatedness of the two languages explain the majority of performance variability. More recently, Elsahar and Gallé (2019) use domain-shift metrics such as H-divergence-based metrics to predict drops in performance under domain shift. Rosenfeld et al.
[Plots omitted: for each of eight UD systems — HIT-SCIR (78.86), UDPipe (76.07), LATTICE (76.07), ICS (75.98), Phoenix (68.17), BOUN (66.69), CUNI (66.6), and ONLP (61.92) — RMSE is plotted against the number (0–5) of new-model records used.]
Figure 3: RMSE scores of the UD task from the dataset-wise mean value predictor (the dashed black line in each graph) and predictors trained with experimental records of other models and 0–5 records from a new model.
(2020) explore the functional form of the dependency of the generalization error of neural models on model and data size. We view our work as a generalization of such approaches, applicable to any NLP task.
8 Conclusion and Future Work
In this work, we investigate whether the experimental setting itself is informative for predicting the evaluation scores of NLP tasks. Our findings promisingly show that, given a sufficient number of past training experimental records, our predictor can 1) outperform human experts; 2) make plausible predictions even for new models and languages; 3) extrapolate well on features like dataset size; and 4) provide a guide on how to choose representative datasets for fast iteration.
While this discovery is a promising start, there are still several avenues for improvement in future work.
First, the dataset and language settings covered in our study are still limited. The experimental records we use come from relatively homogeneous settings; e.g., all datasets in the Wiki-MT task are processed with SentencePiece using a 5,000-subword vocabulary, so our predictor may fail under other subword settings. Our model also fails to generalize when feature values fall outside the range of the training experimental records. We attempted to apply the predictor of Wiki-MT to a low-resource MT dataset, translating from Mapudungun (arn) to Spanish (spa) with the dataset from Duan et al. (2019), but ended up with a poor RMSE score. It turned out that the average sentence length of the arn–spa dataset is much lower than that of the training datasets, and our predictors fail to generalize to this different setting.

Second, using a categorical feature to denote
model types constrains its expressive power for modeling performance. In reality, a slight change in model hyperparameters (Hoos and Leyton-Brown, 2014; Probst et al., 2019), optimization algorithms (Kingma and Ba, 2014), or even random seeds (Madhyastha and Jain, 2019) may give rise to significant variation in performance, which our predictor is unable to capture. While investigating the systematic implications of model structures or hyperparameters is practically infeasible in this study, in the future we may use additional information, such as textual model descriptions, to model NLP systems and training procedures more precisely.
Lastly, we assume that the training and testing data follow the same distribution, which does not account for domain shift. On top of this, there might also be a domain shift between the datasets of the training and testing experimental records. We believe that modeling domain shift is a promising future direction for improving performance prediction.
Acknowledgement
The authors sincerely thank all the reviewers for their insightful comments and suggestions; Philipp Koehn, Kevin Duh, Matt Post, Shuoyang Ding, Xuan Zhang, Adi Renduchintala, Paul McNamee, Toan Nguyen, and Kenton Murray for conducting the human evaluation for the TED-MT task; Daniel Beck for discussions on Gaussian Processes; and Shruti Rijhwani, Xinyi Wang, and Paul Michel for discussions on this paper. This work is generously supported by the National Science Foundation under grant 1761548.
References

Antonios Anastasopoulos and Graham Neubig. 2020. Should all cross-lingual embeddings speak English? In Proc. ACL. To appear.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2289–2294.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 451–462, Vancouver, Canada. Association for Computational Linguistics.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2019. Bilingual lexicon induction through unsupervised machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5002–5007, Florence, Italy. Association for Computational Linguistics.

Emily M. Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604.

Alexandra Birch, Miles Osborne, and Philipp Koehn. 2008. Predicting success in machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 745–754. Association for Computational Linguistics.

Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM.

Xilun Chen and Claire Cardie. 2018. Unsupervised multilingual word embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 261–270, Brussels, Belgium. Association for Computational Linguistics.

Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. 2015. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Twenty-Fourth International Joint Conference on Artificial Intelligence.

Mingjun Duan, Carlos Fasola, Sai Krishna Rallabandi, Rodolfo M. Vega, Antonios Anastasopoulos, Lori Levin, and Alan W. Black. 2019. A resource for computational experiments on Mapudungun. In Proc. LREC. To appear.

Hady Elsahar and Matthias Gallé. 2019. To annotate or not? Predicting performance drop under domain shift. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2163–2173.

Jerome H. Friedman. 2001. Greedy function approximation: A gradient boosting machine. Annals of Statistics, pages 1189–1232.

Geert Heyman, Bregt Verreet, Ivan Vulić, and Marie-Francine Moens. 2019. Learning unsupervised multilingual word embeddings with incremental multilingual hubs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1890–1902.

Holger Hoos and Kevin Leyton-Brown. 2014. An efficient approach for assessing hyperparameter importance. In International Conference on Machine Learning, pages 754–762.

Jiaji Huang, Qiang Qiu, and Kenneth Church. 2019. Hubless nearest neighbor search for bilingual lexicon induction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4072–4080, Florence, Italy. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Prasanth Kolachina, Nicola Cancedda, Marc Dymetman, and Sriram Venkatapathy. 2012a. Prediction of learning curves in machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 22–30, Jeju Island, Korea. Association for Computational Linguistics.

Prasanth Kolachina, Nicola Cancedda, Marc Dymetman, and Sriram Venkatapathy. 2012b. Prediction of learning curves in machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages 22–30. Association for Computational Linguistics.

Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In International Conference on Learning Representations.

Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios Anastasopoulos, Patrick Littell, and Graham Neubig. 2019. Choosing transfer languages for cross-lingual learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3125–3135, Florence, Italy. Association for Computational Linguistics.
Patrick Littell, David R. Mortensen, Ke Lin, Katherine Kairis, Carlisle Turner, and Lori Levin. 2017. URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 8–14.

Pranava Madhyastha and Rishabh Jain. 2019. On model stability as a function of random seed. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 929–939, Hong Kong, China. Association for Computational Linguistics.

Arya D. McCarthy, Ekaterina Vylomova, Shijie Wu, Chaitanya Malaviya, Lawrence Wolf-Sonkin, Garrett Nicolai, Christo Kirov, Miikka Silfverberg, Sebastian J. Mielke, Jeffrey Heinz, Ryan Cotterell, and Mans Hulden. 2019. The SIGMORPHON 2019 shared task: Morphological analysis in context and cross-lingual transfer for inflection. In Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 229–244, Florence, Italy. Association for Computational Linguistics.

Joakim Nivre, Mitchell Abrams, Željko Agić, et al. 2018. Universal Dependencies 2.2. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.
Philipp Probst, Anne-Laure Boulesteix, and Bernd Bischl. 2019. Tunability: Importance of hyperparameters of machine learning algorithms. Journal of Machine Learning Research, 20(53):1–32.

Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), New Orleans, USA.

Brian Richards. 1987. Type/token ratios: What do they really tell us? Journal of Child Language, 14(2):201–209.

Shruti Rijhwani, Jiateng Xie, Graham Neubig, and Jaime Carbonell. 2019. Zero-shot neural transfer for cross-lingual entity linking. In Thirty-Third AAAI Conference on Artificial Intelligence (AAAI), Honolulu, Hawaii.

Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. 2020. A constructive prediction of the generalization error across scales. In International Conference on Learning Representations.

Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong, and Francisco Guzmán. 2019. WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia. arXiv preprint arXiv:1907.05791.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, Florence, Italy. Association for Computational Linguistics.

Christopher K. I. Williams and Carl Edward Rasmussen. 1996. Gaussian processes for regression. In Advances in Neural Information Processing Systems, pages 514–520.

Ruochen Xu, Yiming Yang, Naoki Otani, and Yuexin Wu. 2018. Unsupervised cross-lingual transfer of word embedding spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2465–2474.

Pengcheng Yang, Fuli Luo, Peng Chen, Tianyu Liu, and Xu Sun. 2019. MAAM: A morphology-aware alignment model for unsupervised bilingual lexicon induction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3190–3196.

Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018a. CoNLL 2018 shared task: Multilingual parsing from raw text to universal dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–21, Brussels, Belgium. Association for Computational Linguistics.

Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018b. CoNLL 2018 shared task: Multilingual parsing from raw text to universal dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–21.

Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. 2017. Earth mover's distance minimization for unsupervised bilingual lexicon induction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1934–1945.
Appendix
A Questionnaire
An example of the first questionnaire from our user case study is shown below. The second sheet also included the results for 44 more language pairs. We provide an answer key after the second sheet.
Please provide your prediction of the BLEU score based on the language pair and dataset features (the domain of the training and test sets is TED talks). After you finish, please go to sheet v2.

idx | Source Language | Target Language | Parallel Sentences (k) | Source vocab size (k) | Source subword vocab size (k) | Target vocab size (k) | Target subword vocab size (k) | BLEU
1 | Basque (eus) | English | 5 | 20 | 8 | 9 | 6 |
2 | Slovak (slk) | English | 61 | 134 | 8 | 36 | 8 |
3 | Burmese (mya) | English | 21 | 101 | 8 | 21 | 8 |
4 | Korean (kor) | English | 206 | 386 | 9 | 67 | 8 |
5 | Lithuanian (lit) | English | 42 | 108 | 8 | 29 | 8 |
6 | Arabic (ara) | English | 214 | 308 | 8 | 69 | 8 |
7 | Czech (ces) | English | 103 | 181 | 8 | 47 | 8 |
8 | Esperanto (epo) | English | 7 | 21 | 8 | 10 | 6 |
9 | Finnish (fin) | English | 24 | 77 | 8 | 22 | 8 |
10 | Albanian (sqi) | English | 45 | 93 | 8 | 30 | 8 |
11 | Vietnamese (vie) | English | 172 | 66 | 8 | 61 | 8 |
Please provide your prediction of the BLEU score in the yellow area given all the information in this sheet. Note that all experiments are trained with the same model.

idx | Source Language | Target Language | Parallel Sentences (k) | Source vocab size (k) | Source subword vocab size (k) | Target vocab size (k) | Target subword vocab size (k) | BLEU
1 | Basque (eus) | English | 5 | 20 | 8 | 9 | 6 |
2 | Slovak (slk) | English | 61 | 134 | 8 | 36 | 8 |
3 | Burmese (mya) | English | 21 | 101 | 8 | 21 | 8 |
4 | Korean (kor) | English | 206 | 386 | 9 | 67 | 8 |
5 | Lithuanian (lit) | English | 42 | 108 | 8 | 29 | 8 |
6 | Arabic (ara) | English | 214 | 308 | 8 | 69 | 8 |
7 | Czech (ces) | English | 103 | 181 | 8 | 47 | 8 |
8 | Esperanto (epo) | English | 7 | 21 | 8 | 10 | 6 |
9 | Finnish (fin) | English | 24 | 77 | 8 | 22 | 8 |
10 | Albanian (sqi) | English | 45 | 93 | 8 | 30 | 8 |
11 | Vietnamese (vie) | English | 172 | 66 | 8 | 61 | 8 |
12 | French (fra) | English | 192 | 158 | 8 | 65 | 8 | 37.74
13 | Estonian (est) | English | 11 | 39 | 8 | 14 | 7 | 9.9
14 | Macedonian (mkd) | English | 25 | 61 | 8 | 23 | 8 | 21.75
15 | Bosnian (bos) | English | 6 | 23 | 8 | 9 | 6 | 32.42
16 | Swedish (swe) | English | 57 | 84 | 8 | 34 | 8 | 33.92
17 | Polish (pol) | English | 176 | 267 | 8 | 63 | 8 | 21.51
18 | Persian (fas) | English | 151 | 148 | 8 | 57 | 8 | 24.5
19 | Kurdish (kur) | English | 10 | 39 | 8 | 14 | 7 | 6.86
20 | Hungarian (hun) | English | 147 | 305 | 8 | 56 | 8 | 22.67
21 | Slovenian (slv) | English | 20 | 58 | 8 | 20 | 8 | 14.18
22 | Romanian (ron) | English | 181 | 205 | 8 | 63 | 8 | 32.42
23 | Russian (rus) | English | 208 | 291 | 8 | 68 | 8 | 22.6
24 | Serbian (srp) | English | 137 | 239 | 8 | 54 | 8 | 30.41
25 | Tamil (tam) | English | 6 | 27 | 8 | 10 | 6 | 1.82
26 | Kazakh (kaz) | English | 3 | 15 | 8 | 7 | 5 | 2.05
27 | Marathi (mar) | English | 10 | 29 | 8 | 13 | 7 | 3.68
28 | Ukrainian (ukr) | English | 108 | 191 | 8 | 48 | 8 | 24.09
29 | Thai (tha) | English | 98 | 323 | 8 | 45 | 8 | 20.34
30 | Belarusian (bel) | English | 5 | 20 | 8 | 8 | 5 | 2.85
31 | Turkish (tur) | English | 182 | 304 | 8 | 63 | 8 | 22.52
32 | Azerbaijani (aze) | English | 6 | 23 | 8 | 9 | 6 | 3.1
33 | German (deu) | English | 168 | 194 | 8 | 61 | 8 | 33.15
34 | Bulgarian (bul) | English | 174 | 216 | 8 | 62 | 8 | 35.78
35 | Norwegian (nob) | English | 16 | 36 | 8 | 17 | 7 | 29.63
36 | Georgian (kat) | English | 13 | 44 | 8 | 15 | 7 | 4.94
37 | Danish (dan) | English | 45 | 72 | 8 | 31 | 8 | 37.73
38 | Armenian (hye) | English | 21 | 56 | 8 | 20 | 8 | 13.97
39 | Mandarin (cmn) | English | 200 | 481 | 9 | 67 | 8 | 17.0
40 | Indonesian (ind) | English | 87 | 76 | 8 | 43 | 8 | 27.27
41 | Galician (glg) | English | 10 | 28 | 8 | 13 | 7 | 16.84
42 | Portuguese (por) | English | 185 | 165 | 8 | 64 | 8 | 41.67
43 | Urdu (urd) | English | 6 | 13 | 6 | 10 | 6 | 3.38
44 | Italian (ita) | English | 205 | 195 | 8 | 67 | 8 | 35.67
45 | Spanish (spa) | English | 196 | 179 | 8 | 66 | 8 | 39.48
46 | Greek (ell) | English | 134 | 171 | 8 | 54 | 8 | 34.94
47 | Bengali (ben) | English | 5 | 18 | 8 | 9 | 6 | 2.79
48 | Japanese (jpn) | English | 204 | 584 | 9 | 67 | 8 | 11.42
49 | Malay (msa) | English | 5 | 13 | 7 | 9 | 6 | 3.68
50 | Dutch (nld) | English | 184 | 172 | 8 | 63 | 8 | 34.27
51 | Croatian (hrv) | English | 122 | 191 | 8 | 52 | 8 | 31.84
52 | Hebrew (heb) | English | 212 | 276 | 8 | 68 | 8 | 33.89
53 | Mongolian (mon) | English | 8 | 21 | 8 | 11 | 6 | 2.96
54 | Hindi (hin) | English | 19 | 31 | 8 | 19 | 7 | 14.25
Answer key: eus: 3.37, slk: 25.36, mya: 3.93, kor: 16.23, lit: 13.75, ara: 28.38, ces: 25.07, epo: 3.28, fin: 13.79, sqi: 29.6, vie: 24.67.
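A participant's guesses (or a predictor's outputs) on the first sheet can be scored against this answer key with RMSE; a small sketch, where the guesses are hypothetical:

```python
import math

# true BLEU scores from the answer key above
answer_key = {"eus": 3.37, "slk": 25.36, "mya": 3.93, "kor": 16.23,
              "lit": 13.75, "ara": 28.38, "ces": 25.07, "epo": 3.28,
              "fin": 13.79, "sqi": 29.6, "vie": 24.67}

def score(guesses):
    """RMSE of a set of guesses against the answer key."""
    errs = [(guesses[lang] - true) ** 2 for lang, true in answer_key.items()]
    return math.sqrt(sum(errs) / len(errs))

# hypothetical guesses from one participant
guesses = {"eus": 5.0, "slk": 20.0, "mya": 8.0, "kor": 18.0, "lit": 15.0,
           "ara": 25.0, "ces": 22.0, "epo": 5.0, "fin": 12.0, "sqi": 25.0,
           "vie": 22.0}
print(round(score(guesses), 2))
```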
B Representative datasets
In this section, we show the search results for the most and least representative subsets for the remaining five tasks.
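The selection procedure can be sketched as greedy forward selection, a beam-size-1 simplification of the beam search (size 100) used here, with a mean-value predictor standing in for the full feature-based model; the per-dataset scores below are illustrative:

```python
import math

def rmse_on_rest(scores, subset):
    """Train a mean-value predictor on `subset`, score it on the rest."""
    mean = sum(scores[d] for d in subset) / len(subset)
    rest = [d for d in scores if d not in subset]
    return math.sqrt(sum((scores[d] - mean) ** 2 for d in rest) / len(rest))

def greedy_representative(scores, k):
    """Greedily grow the subset minimizing RMSE on the held-out datasets."""
    subset = []
    while len(subset) < k:
        best = min((d for d in scores if d not in subset),
                   key=lambda d: rmse_on_rest(scores, subset + [d]))
        subset.append(best)
    return subset

# hypothetical per-dataset scores for one task
scores = {"en_ewt": 78.0, "la_proiel": 60.0, "kmr_mg": 30.0,
          "bxr_bdt": 25.0, "vi_vtb": 55.0, "fr_gsd": 85.0}
print(greedy_representative(scores, 2))
```

The least representative subsets can be found symmetrically, by maximizing instead of minimizing the held-out RMSE.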
[Plots omitted: RMSE against subset size (2–5) for the UD, TSFMT, TSFPOS, TSFPARSING, and TSFEL tasks, showing the most representative, least representative, and randomly searched subsets.]
Figure 4: Beam search results (beam size=100) for up to the 5 most (and least) representative datasets for the remaining NLP tasks. We also show random search results of corresponding sizes.
C New Model
In this section, we show the extrapolation performance for new models on BLI, MA, and the remaining systems of UD.
[Plots omitted: RMSE against 0–5 new-model records for the BLI systems Sinkhorn (38.43), Artetxe17 (36.46), and Artetxe16 (46.7).]
Figure 5: RMSE scores of the BLI task from the dataset-wise mean value predictor (the dashed black line in each graph) and predictors trained with experimental records of other models and 0–5 records from a new model (as indicated by the title of each graph).
[Plots omitted: RMSE against 0–5 new-model records for the MA systems CHARLES-SAARLAND-02-2 (93.23), Unknown (93.19), EDINBURGH-01-2 (88.93), OHIOSTATE-01-2 (87.42), CMU-01-2-DataAug (86.53), and CARNEGIEMELLON-02-2 (85.06).]
Figure 6: RMSE scores of the MA task from the dataset-wise mean value predictor (the dashed black line in each graph) and predictors trained with experimental records of other models and 0–5 records from a new model (as indicated by the title of each graph).
[Plots omitted: RMSE against 0–5 new-model records for the UD systems TurkuNLP (75.93), CEA (75.06), Stanford (75.05), Uppsala (74.76), AntNLP (74.1), ParisNLP (74.05), NLP-Cube (73.96), SLT-Interactions (72.92), IBM (71.88), LeisureX (71.7), UniMelb (71.54), Fudan (69.42), KParse (69.39), and BASELINE (68.5).]
Figure 7: RMSE scores of the UD task from the dataset-wise mean value predictor (the dashed black line in each graph) and predictors trained with experimental records of other models and 0–5 records from a new model (as indicated by the title of each graph).
D Feature importance
In this section, we show plots of feature importance for all the tasks.
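The reported F-scores are split counts from the gradient-boosted trees. As an alternative illustration of feature attribution, the sketch below computes permutation importance (the increase in RMSE after shuffling one feature) for a toy linear predictor; all data are synthetic:

```python
import numpy as np

def permutation_importance(model_fn, X, y, seed=0):
    """Importance of each feature = increase in RMSE after shuffling it.
    An alternative to the split-count F-scores of tree boosters."""
    rng = np.random.default_rng(seed)
    base = np.sqrt(np.mean((model_fn(X) - y) ** 2))
    imps = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        rng.shuffle(Xp[:, j])  # destroy this feature's information
        imps.append(np.sqrt(np.mean((model_fn(Xp) - y) ** 2)) - base)
    return np.array(imps)

# toy records: the score depends strongly on feature 0, weakly on
# feature 1, and not at all on feature 2
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w = np.array([10.0, 1.0, 0.0])
y = X @ w

def model(features):
    """A perfectly fitted linear 'predictor' with the true weights."""
    return features @ w

imps = permutation_importance(model, X, y)
print(imps.round(2))
```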
[Plots omitted: per-task feature importance rankings by F-score. The highest-scoring features in each plot are:]

Wiki-MT: dataset size (sent), then source and target word TTR.
TED-MT: dataset size (sent), then source word TTR.
TSFMT: word-level overlap, then transfer language dataset size and subword-level overlap.
TSFPARSING: word overlap, then transfer language dataset size.
TSFPOS: word-level overlap, then transfer language dataset size.
TSFEL: SYNTACTIC distance, then entity overlap.
BLI: model indicator features (Artetxe17, Artetxe16), then FEATURAL and INVENTORY distances.
MA: average tag type length, then average type tag for word.
UD: average tag type length, then average type tag for word.