Proceedings of the 5th Conference on Machine Translation (WMT), pages 688–725, Online, November 19–20, 2020. ©2020 Association for Computational Linguistics
Results of the WMT20 Metrics Shared Task
Nitika Mathur, The University of Melbourne
Johnny Tian-Zheng Wei, University of Southern California
Markus Freitag, Google Research
Qingsong Ma, Tencent-CSIG, AI Evaluation Lab, [email protected]
Ondrej Bojar, Charles University
Abstract
This paper presents the results of the WMT20 Metrics Shared Task. Participants were asked to score the outputs of the translation systems competing in the WMT20 News Translation Task with automatic metrics. Ten research groups submitted 27 metrics, four of which are reference-less “metrics”. In addition, we computed five baseline metrics, including SENTBLEU, BLEU, TER and CHRF, using the SacreBLEU scorer. All metrics were evaluated on how well they correlate at the system-, document- and segment-level with the WMT20 official human scores.
We present an extensive analysis of the influence of reference translations on metric reliability and of how well automatic metrics score human translations, and we also flag major discrepancies between metric and human scores when evaluating MT systems. Finally, we investigate whether we can use automatic metrics to flag incorrect human ratings.
1 Introduction
The metrics shared task1 has been a key component of WMT since 2008, serving as a way to validate the use of automatic MT evaluation metrics and drive the development of new metrics.
We evaluate automatic metrics that score MT output by comparing it with a reference translation generated by human translators, who are instructed to translate “from scratch”, without post-editing from MT. In addition, following last year’s collaboration with the WMT Quality Estimation (QE) task, we also invited submissions of reference-free metrics that compare MT outputs directly with the source segment.
Similar to last year’s edition, the source and reference texts, as well as the MT system outputs for the metrics
1 http://www.statmt.org/wmt20/metrics-task.html
task come from the News Translation Task (Barrault et al., 2020, which we denote as Findings 2020). This year, the language pairs were English ↔ Chinese, Czech, German, Inuktitut, Japanese, Polish, Russian and Tamil. We further included systems participating in the WMT parallel corpus filtering task (Koehn et al., 2020): Khmer and Pashto to English.2
All metrics are evaluated based on their agreement with human evaluation. We evaluate metrics at three levels: comparing MT systems on the entire test set, on segments (either sentences or short paragraphs), and, new this year, on documents. We introduce document-level evaluation to incentivize the development of metrics that take into account the broader context of the evaluated sentences or paragraphs, following the recent emergence of document-level MT techniques.
Multiple References This year, we have two independently generated references for English ↔ German, English ↔ Russian, and Chinese → English. This lets us investigate the influence of references and the utility of multiple references. We instructed participants to score MT systems against the references individually as well as with all available references. In addition, we also supplied a set of references for English to German that were generated by asking linguists to paraphrase the WMT reference as much as possible (Freitag et al., 2020). These references are designed to minimise translationese in the reference, which could cause metrics to be biased against systems that generate more natural text.
2 Note that the metrics task inputs also included MT systems translating between German ↔ French in the News Translation Task, and English → Khmer and Pashto from the WMT parallel corpus filtering task. We are unable to evaluate metrics on these language pairs as human evaluation is not available.
Evaluating Human Translations Given that we have multiple human translations, we asked participants to evaluate each human translation using the other as a reference. For these language pairs, at least one of these human translations was included in the human evaluation, so we can directly evaluate metrics on how they rank the human translation compared to the MT systems.
Additional Human Evaluation Finally, we ask whether some of the discrepancies between metric and human scores can be explained by bad human ratings. We rerun some of the human evaluations using the same template, but switching the rater pool from non-experts to professional linguists. In particular, we rerun the human evaluation for a subset of translations where all metrics disagree with the WMT human evaluation. This experiment could reveal a new use case of automatic metrics and indicate that automatic metrics can be used to identify bad ratings in human evaluations.
We first give an overview of the task (Section 2) and summarize the baseline (Section 3.1) and submitted (Section 3.2) metrics. The results for system-, segment-, and document-level evaluation are provided in Section 4, followed by a joint discussion in Section 5. Section 6 describes our re-running of human evaluation with linguists before we summarise our findings in Section 7.
We will release data, code and additional visualisations in the metrics package to be made available at http://www.statmt.org/wmt20/results.html.
2 Task Setup
This year, we provided task participants with one test set for each examined language pair, i.e. a set of source texts (which are commonly ignored by MT metrics), corresponding MT outputs (these are the key inputs to be scored) and one or more reference translations.
At the system level, metrics aim to correlate with a system’s score, which is an average over many human judgments of the quality of the segment translations produced by the given system. At the segment level, metrics aim to produce scores that correlate best with a human ranking judgment of two output translations for a given source segment. Finally, we also trial document-level evaluation this year (more on the manual quality assessment in Section 2.3).
Segments are sentences for all language pairs except English ↔ German, English ↔ Czech and English → Chinese, whose test sets do not contain sentence boundaries and are translated and evaluated at the paragraph level.
Participants were free to choose which language pairs and tracks (system/segment/document and reference-based/reference-free) they wanted to take part in.
2.1 Source and Reference Texts
The source and reference texts we use are mainly sourced from this year’s WMT News Translation Task (see Findings 2020).
The test set typically contains between 1000 and 2000 segments for each translation direction, with fewer segments for some paragraph-segmented test sets; the English ↔ Inuktitut directions contain 2971 sentences.
All test sets are from the news domain, except the English ↔ Inuktitut datasets, which have a mix of in-domain text from the Canadian Parliament Hansards (1566 sentences) and out-of-domain news documents (1405 sentences).
We also have systems from the parallel corpus filtering task, which are from the Wikipedia domain (also labelled newstest2020 in the metrics test set). The Khmer → English and Pashto → English test sets contain 2320 and 2719 sentences respectively.
The reference translations provided in newstest2020 were created in the same direction as the MT systems were translating. The exceptions are English ↔ Inuktitut, Khmer → English and Pashto → English, where the test set is a mixture of “source-original” and “target-original” texts.
2.2 System Outputs
The results of the Metrics Task are affected by the actual set of MT systems participating in a given translation direction. On the one hand, if all systems are very close in their translation quality, then even humans will struggle to rank them, which in turn makes the task very hard for MT metrics. On the other hand, if the task includes a wide range of systems of varying quality, correlating with humans should generally be easier. One can also expect that if the evaluated systems are of different types, they will exhibit different error patterns, and various MT metrics can be differently sensitive to these patterns.
• Parallel Corpus Filtering Task. This task required participants to submit scores for each sentence in the provided noisy parallel texts. These scores were used to subsample sentence pairs, which were then used to train a neural machine translation system (fairseq). This was tested on a held-out subset of Wikipedia translations.
• Regular News Task Systems. These are all the other MT systems in the evaluation, differing in whether they are trained only on WMT-provided data (“Constrained”) or not (“Unconstrained”), as in previous years.
For all language pairs, in addition to the submissions to the task, the test sets also include translations from freely available web services (online MT systems), which are deemed unconstrained.
Overall, the results are based on 208 systemsacross 18 language pairs.
2.3 Manual Quality Assessment
Human scores were obtained using Direct Assessment, where annotators are asked to rate the adequacy of a translation compared to either the source segment or a reference translation of the same source. This year, human data was collected from reference-based evaluations (or “monolingual”) and reference-free evaluations (or “bilingual”). The reference-based (monolingual) evaluations were crowdsourced, while the reference-less (bilingual) evaluations mainly came from MT researchers who committed their time to contribute to the manual evaluation for each system submitted to the translation task.
Finally, following reports that MT system translations might seem adequate when scored in isolation but not in the context of the whole document, the ratings are collected, when possible, for each segment with document context. Table 1 summarises the details of how human annotations were collected for the various language pairs at WMT 2020.
The English → Inuktitut dataset, which contains a mix of in-domain (Hansard) and out-of-domain (news) data, was only evaluated on out-of-domain segments, so for system-level evaluation, we evaluate metric scores computed on the news domain only as well as on the full test set.
See Findings 2020 for details on human evaluation.
2.3.1 System-level Golden Truth: DA
For the system-level evaluation, the collected continuous DA scores, standardized for each annotator, are averaged across all assessed segments for each MT system to produce a scalar rating for the system’s performance.
The underlying set of assessed segments is different for each system. Thanks to the fact that the system-level DA score is an average over many judgments, mean scores are consistent and have been found to be reproducible (Graham et al., 2013). For more details see Findings 2020.
The score of an MT system is calculated as theaverage rating of the segments translated by thesystem.
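This scoring procedure can be sketched as follows (a minimal illustration with an input format of our choosing, not the official WMT scripts):

```python
from collections import defaultdict
from statistics import mean, pstdev

def system_da_scores(ratings):
    """System-level DA: z-standardize each annotator's raw scores,
    then average the standardized scores over all segments assessed
    for each MT system.

    ratings: iterable of (annotator, system, raw_score) tuples.
    """
    by_annotator = defaultdict(list)
    for annotator, _, score in ratings:
        by_annotator[annotator].append(score)
    # Per-annotator mean and stddev used for standardization
    # (guard against a zero stddev for degenerate annotators).
    stats = {a: (mean(s), pstdev(s) or 1.0) for a, s in by_annotator.items()}

    z_by_system = defaultdict(list)
    for annotator, system, score in ratings:
        mu, sigma = stats[annotator]
        z_by_system[system].append((score - mu) / sigma)
    return {system: mean(z) for system, z in z_by_system.items()}
```

Standardizing per annotator removes individual differences in how harshly annotators use the 0-100 scale before the per-system averages are compared.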
2.3.2 Segment-level Golden Truth: DARR
Starting from Bojar et al. (2017), when WMT fully switched to DA, we had to come up with a solid golden standard for segment-level judgements. Standard DA scores are reliable only when averaged over a sufficient number of judgments.3
Fortunately, when we have at least two DA scores for translations of the same source input, it is possible to convert those DA scores into a relative ranking judgement, if the difference in DA scores allows the conclusion that one translation is better than the other. In the following, we denote these re-interpreted DA judgements as “DARR”, to distinguish them clearly from the relative ranking (“RR”) golden truth used in past years.4
From the complete set of human assessments collected for the News Translation Task, all possible pairs of DA judgements attributed to distinct translations of the same source segment were converted into DARR better/worse judgements. Distinct translations of the same source input whose DA scores fell within 25 percentage points (which could have
3 For segment-level evaluation, one would need to collect many manual evaluations of the exact same segment as produced by each MT system. Such sampling would however be wasteful for the evaluation needed by WMT, so only some MT systems happen to be evaluated for a given input segment. In principle, we would like to return to DA’s standard segment-level evaluation in the future, where a minimum of 15 human judgements of translation quality are collected per translation and combined to get highly accurate scores for translations, but this would increase annotation costs.
4 Since the analogue rating scale employed by DA is marked at the 0-25-50-75-100 points, we use 25 points as the minimum required difference between two system scores to produce DARR judgements. Note that we rely on judgements collected from known-reliable volunteers and crowd-sourced workers who passed DA’s quality control mechanism. Any inconsistency that could arise from reliance on DA judgements collected from low-quality crowd-sourcing is thus prevented.
Language pairs     | source/reference | crowd/researcher             | document context
iu-en              | reference        | crowd                        | No
*-en except iu-en  | reference        | crowd                        | Yes
en-*, de-fr, fr-de | source           | mix of crowd and researcher* | Yes

Table 1: Direct Assessment at WMT20. *Note that researcher annotations can contain some amount of professional annotations.
been deemed equal quality) were omitted from the evaluation of segment-level metrics. Conversion of scores in this way produced a large set of DARR judgements for all language pairs, shown in Table 2, thanks to the combinatorial advantage of extracting DARR judgements from all possible pairs of translations of the same source input. We see that only km-en and ps-en can suffer from an insufficient number of these simulated pairwise comparisons.
The DARR judgements serve as the golden standard for segment-level evaluation at WMT20.
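The DA-to-DARR conversion can be sketched as follows (an illustrative reimplementation, not the official scripts; the input format is ours):

```python
from collections import defaultdict
from itertools import combinations

def da_to_darr(judgements, margin=25.0):
    """Convert raw segment-level DA scores into DARR better/worse pairs.

    judgements: dict mapping (segment_id, system) -> raw DA score (0-100).
    Pairs whose scores fall within `margin` points (possibly equal
    quality) are dropped; the rest become relative-ranking judgements.
    """
    by_segment = defaultdict(dict)
    for (seg, system), score in judgements.items():
        by_segment[seg][system] = score

    pairs = []
    for seg, scores in by_segment.items():
        # Combinatorial advantage: all possible pairs of distinct translations.
        for (sys_a, a), (sys_b, b) in combinations(scores.items(), 2):
            if abs(a - b) > margin:
                better, worse = (sys_a, sys_b) if a > b else (sys_b, sys_a)
                pairs.append((seg, better, worse))
    return pairs
```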
2.3.3 Document-level Golden Truth: DARR
As segments were scored in document context, we can compute document scores as the average human rating of the segments in the document. We acknowledge that this may be an oversimplification. First, we hope that human assessors have indicated errors in document-level coherence at at least one of the affected segments, but we have no evidence that they actually do so. Second, document-level phenomena are rather scarce, and averaging segment-level scores is likely to average out these sparse observations even if they were marked at individual sentences. And lastly, in some situations, a lack of cross-sentence coherence can be so critical that any strategy of composing sentence-level scores is bound to downplay the severity of the error, see e.g. Vojtechova et al. (2019). At the current point, we have nothing better to start with, but we believe that better techniques will be proposed in the future.
Graham et al. (2017) recommend averaging around 100 annotations per document to obtain reliable document scores. Since the average number of assessments we have is much lower than that, we compute the ground truth in the same way as in the segment-level evaluation.
We first compute document scores as the average of all segment scores in the document, which we denote as DOC-DA. We then generate DOC-DARR pairs of better and worse translations of the same source document when there is at least a 25-point difference in the raw DOC-DA scores. See Table 3 for details.
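A minimal sketch of the document-level ground truth (names are illustrative, not the official scripts):

```python
from statistics import mean

def doc_da(segment_scores):
    """DOC-DA: a document's score is the average raw DA score of its segments."""
    return mean(segment_scores)

def doc_darr(doc_a, doc_b, margin=25.0):
    """Return 'a' or 'b' for the better document translation, or None when
    the raw DOC-DA scores differ by less than the 25-point margin."""
    a, b = doc_da(doc_a), doc_da(doc_b)
    if abs(a - b) < margin:
        return None
    return "a" if a > b else "b"
```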
In the case of DARR (which we denote as DOC-DARR), all language pairs suffer from an insufficient number of these simulated pairwise comparisons.
Similar to segment-level evaluation, we use the Kendall Tau-like formula (Section 2) to evaluate metric agreement with humans on the generated pairwise DARR judgements.
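The Kendall Tau-like agreement counts concordant and discordant metric decisions over the DARR pairs; a sketch (tie handling, counting metric ties as discordant, is an assumption of this illustration):

```python
def kendall_tau_like(darr_pairs, metric_scores):
    """WMT Kendall Tau-like agreement:
    (concordant - discordant) / (concordant + discordant).

    darr_pairs: iterable of (seg_id, better_system, worse_system).
    metric_scores: dict (seg_id, system) -> metric score, higher = better.
    """
    concordant = discordant = 0
    for seg, better, worse in darr_pairs:
        if metric_scores[(seg, better)] > metric_scores[(seg, worse)]:
            concordant += 1
        else:
            discordant += 1  # disagreements and ties count against the metric
    return (concordant - discordant) / (concordant + discordant)
```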
Note that we do not include any human-translated segments in this evaluation. In addition, iu-en is excluded from document-level evaluation because its DA judgements were collected for isolated sentences.
3 Metrics
3.1 Baselines
We agree with the call to use SacreBLEU (Post, 2018) as the standard MT evaluation scorer. We no longer report scores of the metrics from the Moses scorer, which requires tokenized text. We use the following metrics from the SacreBLEU scorer as baselines, with the default parameters:
3.1.1 SacreBLEU baselines
• BLEU (Papineni et al., 2002a) is the precision of n-grams of the MT output compared to the reference, weighted by a brevity penalty to punish overly short translations. BLEU+case.mixed+lang.LANGPAIR+numrefs.1+smooth.exp+tok.13a+version.1.4.14

We run SacreBLEU with the --sentence-score option to obtain sentence scores for SENTBLEU; this uses the same parameters as BLEU. Although not its intended use, we also compute system- and document-level scores for SENTBLEU as the mean segment score.
• TER (Snover et al., 2006) measures the number of edits (insertions, deletions, shifts and substitutions) required
      DA>1   Ave  DA pairs   DARR
cs-en  664  11.3     39187  14018
de-en  785  11.0     43669  16584
iu-en 2620   4.5     26120   8162
ja-en  993   9.0     36169  15193
pl-en 1001  11.8     64670  21121
ru-en  991  10.0     44664  14024
ta-en  997   7.6     26662  12789
zh-en 2000  13.8    177492  62586
km-en 1963   3.2      8295   3706
ps-en 2204   3.1      7994   3507

en-cs 1418  10.3     68587  21121
en-de 1418   6.9     30567   9339
en-iu 1268   7.9     35384  13159
en-ja 1000   9.6     41576  12830
en-pl 1000  10.6     52003  17689
en-ru 1971   5.7     28274   8330
en-ta 1000   7.9     28974   9087
en-zh 1418  10.6     72581  12652

Table 2: Segment-level: number of judgements for DA converted to DARR data; “DA>1” is the number of source input segments in the manual evaluation where at least two translations of that same source input segment received a DA judgement; “Ave” is the average number of translations with at least one DA judgement available for the same source input segment; “DA pairs” is the number of all possible pairs of translations of the same source input resulting from “DA>1”; and “DARR” is the number of DA pairs with an absolute difference in DA scores greater than the 25 percentage point margin.
to transform the MT output to the reference. TER+lang.LANGPAIR+tok.tercom-nonorm-punct-noasian-uncased+version.1.4.14
• CHRF (Popovic, 2015) uses character n-grams instead of word n-grams to compare the MT output with the reference.5 Version string: chrF2+lang.LANGPAIR+numchars.6+space.false+version.1.4.14.
3.1.2 CHRF++
CHRF++ (Popovic, 2017) includes word unigrams and bigrams in addition to character n-grams. We ran the original Python implementation of the met-
5 Note that the SacreBLEU scorer does not yet implement CHRF with multiple references.
      DOC-DA>1   Ave  DOC-DA pairs  DOC-DARR
cs-en      102  11.4          6041      1424
de-en      118  11.0          6579      1866
ja-en       80   8.9          2850       790
pl-en       62  11.8          4012       635
ru-en       91   9.9          4077       753
ta-en       82   7.5          2126       684
zh-en      155  13.8         13897      3085

en-cs      130  10.2          6162      1442
en-de      130   6.9          2844       669
en-iu       35   7.8           969       203
en-ja       63   9.7          2686       469
en-pl       63  10.7          3359       677
en-ru      122   5.7          1768       387
en-ta       63   7.9          1834       389
en-zh      130  10.6          6667       651

Table 3: Document-level: number of judgements for DOC-DA converted to DOC-DARR data; “DOC-DA>1” is the number of source input documents in the manual evaluation where we have DOC-DA scores for at least two translations of that same source input document; “Ave” is the average number of translations with at least one DOC-DA judgement available for the same source input document; “DOC-DA pairs” is the number of all possible pairs of translations of the same source input resulting from “DOC-DA>1”; and “DOC-DARR” is the number of DOC-DA pairs with an absolute difference in DOC-DA scores greater than the 25 percentage point margin. Note that iu-en is not included as document context was not available for this evaluation.
ric6 with the default parameters --ncorder 6 --nwworder 2 --beta 2.
3.2 Submissions
The rest of this section summarizes the participating metrics.
3.2.1 BERT-BASE-L2, BERT-LARGE-L2, MBERT-L2
The three baselines were obtained by fine-tuning BERT (Devlin et al., 2019) on the ratings of the WMT Metrics tasks from 2015 to 2018, using a regression loss. What distinguishes the metrics is the initial BERT checkpoint: BERT-BASE-L2 uses a 12-layer Transformer architecture pre-trained on English data, MBERT-L2 is similar but trained
6 chrF++.py available at https://github.com/m-popovic/chrF
Metric | Features | Learned | seg/doc/sys | Citation / Participant | Availability

Baselines:
SENTBLEU | n-grams | | • ? ? | Papineni et al. (2002a) | https://github.com/mjpost/sacrebleu
BLEU | n-grams | | − − • | Papineni et al. (2002a) | https://github.com/mjpost/sacrebleu
TER | edit distance | | • • | Snover et al. (2006) | https://github.com/mjpost/sacrebleu
CHRF | character n-grams | | • • | Popovic (2015) | https://github.com/mjpost/sacrebleu
CHRF++ | character n-grams | | • • | Popovic (2017) | https://github.com/m-popovic/chrF

Reference-based metrics:
PARBLEU | paraphrases | | • ? • | Univ of Edinburgh, Univ of Tartu, JHU; Bawden et al. (2020) | not a public metric
PARCHRF++ | paraphrases | | • ? • | Univ of Edinburgh, Univ of Tartu, JHU; Bawden et al. (2020) | not a public metric
PARESIM | paraphrases | yes | • | Univ of Edinburgh, Univ of Tartu, JHU; Bawden et al. (2020) | not a public metric
PRISM | paraphrases | | • | Johns Hopkins University | https://github.com/thompsonb/prism
CHARACTER | character edit distance | | • | RWTH Aachen; Wang et al. (2016) | https://github.com/rwth-i6/CharacTER
EED | character edit distance | | • | RWTH Aachen; Stanchev et al. (2019) | https://github.com/rwth-i6/ExtendedEditDistance
SWSS+METEOR | semantic similarity | | • | Xu et al. (2020) | not a public metric
MEE | word embeddings | | • | IIIT-Hyderabad; Ananya Mukherjee and Sharma (2020) | not a public metric
YISI | contextual word embeddings | | • | NRC; Lo (2019, 2020) | http://chikiu-jackie-lo.org/home/index.php/yisi
BERT-BASE-L2 | contextual word embeddings | yes | • | Google (Devlin et al., 2019) | (BLEURT code, private checkpoint)
BERT-LARGE-L2 | contextual word embeddings | yes | • | Google (Devlin et al., 2019) | (BLEURT code, private checkpoint)
MBERT-L2 | contextual word embeddings | yes | • | Google (Devlin et al., 2019) | (BLEURT code, private checkpoint)
BLEURT | contextual word embeddings | yes | • | Google (Devlin et al., 2019) | https://github.com/google-research/bleurt
BLEURT-EXTENDED | contextual word embeddings | yes | • | Google (Devlin et al., 2019) | (BLEURT code, private checkpoint)
YISI-COMBI | contextual word embeddings | yes | • | Google (Devlin et al., 2019) | not a public metric
BLEURT-COMBI | contextual word embeddings | yes | • | Google (Devlin et al., 2019) | not a public metric
COMET | predictor-estimator model | yes | • • | Unbabel (Rei et al., 2020b) | https://github.com/Unbabel/COMET
COMET-RANK | predictor-estimator model | yes | • | Unbabel (Rei et al., 2020b) | https://github.com/Unbabel/COMET
COMET-HTER | predictor-estimator model | yes | • • | Unbabel (Rei et al., 2020b) | https://github.com/Unbabel/COMET
COMET-2R | predictor-estimator model | yes | • | Unbabel (Rei et al., 2020b) | https://github.com/Unbabel/COMET
COMET-MQM | predictor-estimator model | yes | • • | Unbabel (Rei et al., 2020b) | https://github.com/Unbabel/COMET
BAQ, EQ | ? | ? | • ? | ? | not a public metric

Source-based (reference-free) metrics:
COMET-QE | predictor-estimator model | yes | • | Unbabel (Rei et al., 2020b) | https://github.com/Unbabel/COMET
OPENKIWI-BERT | predictor-estimator model | yes | • | Unbabel; Kepler et al. (2019) | https://github.com/Unbabel/OpenKiwi
OPENKIWI-XLMR | predictor-estimator model | yes | • | Unbabel; Kepler et al. (2019) | https://github.com/Unbabel/OpenKiwi
YISI-2 | contextual word embeddings | | • | NRC; Lo and Larkin (2020) |

Table 4: Participants of the WMT20 Metrics Shared Task. “•” denotes that the metric took part in (some of the language pairs of) the segment- and/or document- and/or system-level evaluation. A blank indicates that the document- and system-level scores are implied, simply taking the arithmetic (macro-)average of segment-level scores. “−” indicates that the metric didn’t participate in the track (Seg/Doc/Sys-level). “?” indicates that we computed the metric’s document or system score for this track as the macro-average of segment scores, though the metric is not defined this way. A metric is learned if it is trained on a QE or metrics evaluation dataset (i.e. pretraining or parsers don’t count, but training on WMT 2019 metrics task data does).
on Wikipedia data in 102 languages, and BERT-LARGE-L2 is English-only with 24 layers.
3.2.2 BLEURT, BLEURT-EXTENDED, YISI-COMBI, BLEURT-YISI-COMBI
BLEURT (Sellam et al., 2020a) is a BERT-based regression model trained twice: first on millions of synthetic pairs obtained by random perturbations, then on ratings from the 2015 to 2019 editions of the WMT workshop. BLEURT-EXTENDED (Sellam et al., 2020b) is a BERT-based regression model trained on human ratings from 2015 to 2019 of the WMT workshop, combined with BERT-Chinese for to-Chinese sentence pairs. The main checkpoint is a 24-layer Transformer, trained on a mixture of Wikipedia articles and training data from WMT Newstest in 20 languages.
YISI-COMBI: We use YISI-1 on an mBERT model that is fine-tuned on WMT data for single-reference submissions. For the multi-reference submission, we aggregate YISI’s internal scores over the different references to produce the final output.
BLEURT-COMBI: We use the same output as YISI-COMBI for single-reference submissions. For the multi-reference submission, we mix YISI-1, YISI-2 and BLEURT scores over the different references.
3.2.3 CHARACTER
CHARACTER (Wang et al., 2016), identical to the 2016 setup, is a character-level metric inspired by the commonly applied translation edit rate (TER). It is defined as the minimum number of character edits required to adjust a hypothesis until it completely matches the reference, normalized by the length of the hypothesis sentence. CHARACTER calculates the character-level edit distance while performing the shift edit on the word level. Unlike the strict matching criterion in TER, a hypothesis word is considered to match a reference word, and can be shifted, if the edit distance between them is below a threshold value. The Levenshtein distance between the reference and the shifted hypothesis sequence is computed on the character level. In addition, the lengths of hypothesis sequences instead of reference sequences are used for normalizing the edit distance, which effectively counters the issue that shorter translations normally achieve lower TER. Like other character-level metrics, CHARACTER is generally applied to non-tokenized outputs and references, which also holds for this year’s submission, with one exception: this year, tokenization was carried out for en-ru hypotheses and references before calculating the scores, since this results in large improvements in terms of correlation. For the other language pairs, no tokenizer was used for pre-processing.
3.2.4 COMET
COMET* metrics (Rei et al., 2020b) were built using the Estimator model or the Translation Ranking model proposed in Rei et al. (2020a). These neural models use XLM-RoBERTa to encode the source, MT hypothesis and reference in the same cross-lingual space and are then optimised towards different objectives. COMET (the main metric) is an Estimator model that regresses on Direct Assessments (DA) from 2017 to 2019, and COMET-2R is a variant of COMET that was trained to handle multiple references at inference time. COMET-HTER and COMET-MQM follow the same architecture but regress on Human-mediated Translation Edit Rate (HTER) and on a proprietary metric compliant with the Multidimensional Quality Metrics framework (MQM), respectively. COMET-RANK uses the Translation Ranking architecture to directly minimise the distance between the “better” hypothesis and the respective source and reference, while pushing the “worse” hypothesis away. This Translation Ranking model was directly optimised on DA relative ranks from 2017 to 2019. Finally, COMET-QE removes the reference at the input and proportionately reduces the dimensions of the estimator network to accommodate the reduced input.
3.2.5 EED
EED (Stanchev et al., 2019) is a character-based metric which builds upon CDER. It is defined as the minimum number of operations of an extension of the conventional edit distance containing a “jump” operation. The edit distance operations (insertions, deletions and substitutions) are performed at the character level, and jumps are performed when a blank space is reached. Furthermore, the coverage of multiple characters in the hypothesis is penalised by the introduction of a coverage penalty. The sum of the length of the reference and the coverage penalty is used as the normalisation term.
3.2.6 MEE
MEE (Ananya Mukherjee and Sharma, 2020) is an automatic evaluation metric that leverages the similarity between embeddings of words in candi-
date and reference sentences to assess translation quality. Unigrams are matched based on their surface forms, root forms and meanings, which helps capture lexical, morphological and semantic equivalence. Semantic evaluation is achieved by using pretrained fastText embeddings provided by Facebook to calculate the word similarity score between the candidate and the reference words. MEE computes the evaluation score using three modules, namely exact match, root match and synonym match. In each module, an fmean-score is calculated as the harmonic mean of precision and recall, assigning more weight to recall. The final translation score is obtained by averaging the fmean-scores of the individual modules.
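The recall-weighted harmonic mean and the final averaging can be sketched as follows (a METEOR-style formulation; the weight alpha=0.9 is illustrative, not MEE's published value):

```python
def fmean(precision, recall, alpha=0.9):
    """Harmonic mean of precision and recall, weighted towards recall.
    With alpha=0.9, recall contributes far more than precision."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return (precision * recall) / (alpha * precision + (1 - alpha) * recall)

def mee_style_score(module_fmeans):
    """Final score: plain average of the per-module fmean-scores
    (exact match, root match, synonym match)."""
    return sum(module_fmeans) / len(module_fmeans)
```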
3.2.7 ESIM
The Enhanced Sequential Inference Model (Chen et al., 2017) is a neural model proposed for Natural Language Inference that has been adapted for MT evaluation by Mathur et al. (2019). It uses cross-sentence attention and sentence-matching heuristics to generate a representation of the translation and the reference, which is fed to a feedforward regressor. This year’s scores were submitted by Bawden et al. (2020) as part of the PARESIM submission.
3.3 OPENKIWI-BERT, OPENKIWI-XLMR
OPENKIWI-BERT and OPENKIWI-XLMR (Kepler et al., 2019) are state-of-the-art Quality Estimation models developed for the WMT20 QE shared task and are trained with WMT Metrics data from 2017 to 2019.
3.3.1 PARBLEU, PARCHRF++, PARESIM
PARBLEU, PARCHRF++, and PARESIM (Bawden et al., 2020) are variants of their respective core metrics computed against the provided human reference and a set of automatically generated paraphrases. PARBLEU used five paraphrases, while the other two used only one. Both BLEU and CHRF++ have built-in support for multiple references. For ESIM, we calculate the score for each reference separately and then average the scores to get the final score.
3.3.2 PRISM
PRISM (Thompson and Post, 2020) is a many-to-many multilingual neural machine translation system trained on data for 39 language pairs, with data derived largely from WMT and WikiMatrix. It casts machine translation evaluation as a zero-shot paraphrasing task, producing segment-level scores by force-decoding between a system output and a reference, in both directions, and averaging the model scores. System-level scores are produced by averaging segment-level ones. For evaluation in Inuktitut, Khmer, Pashto, and Tamil, we used a “Prism44” model that was retrained after adding WMT-provided data for these languages to its original training data set. All other languages were evaluated with the original “Prism39” model.
3.3.3 SWSS+METEOR
SWSS (Semantically Weighted Sentence Similarity; Xu et al. 2020) is an approach to extracting semantic core words, which are words that carry important semantic meanings in sentences, and using them in MT evaluation. It uses UCCA (Universal Conceptual Cognitive Annotation), a semantic representation framework, to identify semantic core words, and then calculates sentence similarity scores based on the overlap of the semantic core words of sentence pairs. By taking sentence-level semantic structure information into consideration, SWSS can improve the performance of lexical metrics when combined with them. The submitted metric (SWSS+METEOR) is a weighted combination of SWSS and Meteor.
3.3.4 YISI-0, YISI-1, YISI-2

YISI (Lo, 2019, 2020) is a unified semantic MT quality evaluation and estimation metric for languages with different levels of available resources. YISI-1 is a reference-based MT evaluation metric that measures the semantic similarity between a machine translation and human references by aggregating the idf-weighted lexical semantic similarities based on the contextual embeddings extracted from pretrained language models (BERT, CamemBERT, RoBERTa, XLM, XLM-RoBERTa, etc.) and optionally incorporating shallow semantic structures (denoted as YISI-1 SRL; not participating this year). YISI-0 is the degenerate version of YISI-1 that is ready-to-deploy to any language. It uses longest common character substring to measure the lexical similarity. YISI-2 (Lo and Larkin, 2020) is the bilingual, reference-less version for MT quality estimation, which uses bilingual mappings of the contextual embeddings extracted from pretrained language models (XLM or XLM-RoBERTa) to evaluate the crosslingual lexical semantic similarity between the input and
MT output. Like YISI-1, YISI-2 can exploit shallow semantic structures as well (denoted as YISI-2 SRL; does not participate this year).
3.4 Pre-processing

Since some metrics, such as BLEU, aim to achieve a strong positive correlation with human assessment, while error metrics, such as TER, aim for a strong negative correlation, in previous years we compared metrics via the absolute value |r| of a given metric's correlation with human assessment. However, this can mask instances of true negative correlation for metrics that aim for a positive correlation (and vice-versa).
For system-, document- and segment-level scores, we instead reverse the sign of the score of error metrics prior to the comparison with human scores: higher scores have to indicate better translation quality.
4 Results
4.1 System-Level Evaluation

As in previous years, we employ the Pearson correlation (r) as the main evaluation measure for system-level metrics. The Pearson correlation is as follows:

r = \frac{\sum_{i=1}^{n}(H_i - \bar{H})(M_i - \bar{M})}{\sqrt{\sum_{i=1}^{n}(H_i - \bar{H})^2}\sqrt{\sum_{i=1}^{n}(M_i - \bar{M})^2}}   (1)
where H_i are the human assessment scores of all systems in a given translation direction, M_i are the corresponding scores as predicted by a given metric, and \bar{H} and \bar{M} are their respective means.
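Equation 1 can be computed directly; a minimal sketch without external libraries:

```python
import math

def pearson_r(human, metric):
    """Pearson correlation between human scores H and metric scores M
    (Equation 1): the covariance of the two score lists normalised by
    the product of their (unnormalised) standard deviations."""
    n = len(human)
    h_mean = sum(human) / n
    m_mean = sum(metric) / n
    cov = sum((h - h_mean) * (m - m_mean) for h, m in zip(human, metric))
    h_norm = math.sqrt(sum((h - h_mean) ** 2 for h in human))
    m_norm = math.sqrt(sum((m - m_mean) ** 2 for m in metric))
    return cov / (h_norm * m_norm)
```

In practice `scipy.stats.pearsonr` computes the same quantity.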
As recommended by Graham and Baldwin (2014), we employ the Williams significance test (Williams, 1959) to identify differences in correlation that are statistically significant. The Williams test is a test of significance of a difference in dependent correlations and is therefore suitable for evaluation of metrics. Correlations not significantly outperformed by any other metric for the given language pair are highlighted in bold in all the results tables that show Pearson correlation of metric and human scores.
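For readers who want to reproduce the significance testing, a sketch of the Williams test statistic follows (the formulation popularised for MT metric evaluation by Graham and Baldwin, 2014); `williams_t` is a hypothetical helper name, and the resulting statistic is compared against a t distribution with n − 3 degrees of freedom:

```python
import math

def williams_t(r12, r13, r23, n):
    """Williams (1959) test statistic for the difference between two
    dependent correlations sharing one variable:
      r12 = corr(human, metric A), r13 = corr(human, metric B),
      r23 = corr(metric A, metric B), n = number of systems."""
    k = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23
    r_bar = (r12 + r13) / 2
    denom = 2 * k * (n - 1) / (n - 3) + r_bar ** 2 * (1 - r23) ** 3
    return (r12 - r13) * math.sqrt((n - 1) * (1 + r23) / denom)
```

The statistic is zero when both metrics correlate equally with human scores, and its power grows with the correlation r23 between the two metrics, as discussed in Section 5.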
Pearson correlation is ideal for reporting whether metric scores have the same trend as human scores. In practice, we use metrics to make decisions comparing MT systems, and Kendall's Tau appears to be closer to this use case, as it directly checks whether the metric ordering of a pair of MT systems agrees with the human ordering. However, unlike Pearson correlation, it is not sensitive to whether the metric score differences correspond to the human score differences. We stay with Pearson correlation for the official results, but also report Kendall's Tau correlation in the appendix.
The calculation of the Pearson correlation coefficient depends on the mean, which is very sensitive to outliers. So if we have systems whose scores are far away from the rest of the systems, the presence of these "outlier" systems can give a misleadingly high impression of the correlations, and potentially change the ranking of metrics. To avoid this, we also report correlations over non-outlier systems only.
To remove outliers, we are guided by the robust outlier detection method proposed for MT metric evaluation by Mathur et al. (2020). This method, recommended by the statistics literature (Iglewicz and Hoaglin, 1993; Rousseeuw and Hubert, 2011; Leys et al., 2013), depends on the median and the median absolute deviation (MAD), which is the median of the absolute difference between each point and the median. The method removes systems whose human scores are greater than 2.5 MAD away from the median.
The cutoff of 2.5 is subjective: Leys et al. (2013) suggest the guidelines of using 3 (very conservative), 2.5 (moderately conservative) or 2 (poorly conservative), and recommend 2.5. For some language pairs, we override the 2.5 cutoff for systems that are close to the cutoff. We give examples in Section 5, and list all identified outliers in Table 15 in the Appendix.
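The MAD-based procedure can be sketched as follows; this follows the literal description above (no normal-consistency constant applied to the MAD), and `remove_outliers` is a hypothetical helper name:

```python
def remove_outliers(human_scores, cutoff=2.5):
    """Drop systems whose human score is more than `cutoff` times the
    median absolute deviation (MAD) away from the median, following the
    robust method of Mathur et al. (2020)."""
    def median(xs):
        xs = sorted(xs)
        n, mid = len(xs), len(xs) // 2
        return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

    med = median(human_scores)
    # MAD: median of absolute deviations from the median.
    mad = median([abs(x - med) for x in human_scores])
    return [x for x in human_scores if abs(x - med) <= cutoff * mad]
```

A tightly clustered set of systems with one far-away score keeps the cluster and drops the outlier, regardless of how extreme the outlier is (the median and MAD are unaffected by it).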
4.1.1 System-Level Results

Tables 5 and 6 provide the system-level correlations of metrics. These tables include results for all MT systems, and in cases where we detect outliers, we also report correlation without outliers.
This year, we also carry out an extended analysis of the impact of (multiple) human references; see the following paragraphs.
Scoring Human Translation In this section, we investigate how well the metric submissions score human translations. We have five language pairs where two reference translations were provided by WMT. The manual DA scoring of the News Translation Task included all the out-of-English human references in the evaluation along with the MT systems.
[Table 5: Pearson correlation of to-English system-level metrics with DA human assessment over MT systems using the newstest2020 references. For language pairs that contain outlier systems, we also show correlation after removing outlier systems ("-out"). Correlations of metrics not significantly outperformed by any other for that language pair are highlighted in bold. (Table body lost in extraction.)]
[Table 6: Pearson correlation of out-of-English system-level metrics with DA human assessment over MT systems using the newstest2020 references. For language pairs that contain outlier systems, we also show correlation after removing outlier systems. Correlations of metrics not significantly outperformed by any other for that language pair are highlighted in bold. # The English→Inuktitut human evaluation only contained the news subset, so we recompute en-iu system scores of metrics on the news subset of the test set (1405 sentences). Note that the scores of PARBLEU and PARCHRF were computed as an average of segment scores. (Table body lost in extraction.)]
[Table 7: Evaluating Human translation: Pearson correlation of metrics with DA human assessment for all MT systems plus Human translation. The subscript B represents an alternate reference, P represents a paraphrased reference. N is the total number of MT systems (excluding outliers) and HID is the identity of the human translation evaluated. Correlations of metrics not significantly outperformed by any other for that language pair are highlighted in bold. (Table body lost in extraction.)]
For to-English language pairs, only the secondary human reference translations were manually scored with DA, as the primary human reference translation was shown to the monolingual annotators.
For these language pairs, the metrics can score a human translation by using the other one as the reference translation. For simplicity, we add the second human reference translation to the list of translation outputs and observe how its scoring by the given metric affects the correlation.
Table 7 shows how well the metrics correlate with the WMT human evaluation when including human translations as additional output. In most cases, the correlation decreases, as metrics struggle to correctly score translations that are different from MT systems. Metrics that rely on fine-tuning on existing human assessments from the previous WMT campaigns (e.g. BLEURT, ESIM, COMET) can handle human translations much better on average. Also, the paraphrased references help the lexical metrics correctly identify the high quality of human translations.
We present a deeper analysis of how metrics score human translations in Section 5.1.2. We base this discussion on scatterplots of human vs. metric scores. We include scatterplots of selected metrics in Appendix B.
Influence of References Rewarding multiple alternative translations is the primary motivation behind multiple-reference based evaluation. It is generally assumed that using multiple reference translations for automatic evaluation is helpful, as we cover a wider space of possible translations (Papineni et al., 2002b; Dreyer and Marcu, 2012; Bojar et al., 2013). Nevertheless, newer studies (Freitag et al., 2020) showed that multi-reference evaluation no longer improves the correlation for high quality output. Since we have multiple references available for five language pairs, we can look at how much the choice of reference(s) influences correlation.
Table 8 compares metric correlations on the primary reference set newstest2020, the alternative reference newstestB2020, the paraphrased reference newstestP2020 (only for English-German), or using all available references newstestM2020. We only report system-level correlations of metrics on MT systems after discarding outliers.
4.2 Segment- and Document-Level Evaluation

Segment-level evaluation relies on the manual judgements collected in the News Translation Task evaluation. This year, we were again unable to follow the methodology outlined in Graham et al. (2015) for evaluating segment-level metrics because the sampling of segments did not provide a sufficient number of assessments of the same segment. We therefore convert pairs of DA scores for competing translations to DARR better/worse preferences as described in Section 2.3.2. We further follow the same process to generate DARR ground truth for documents, as we do not have enough annotations to obtain accurate human scores.
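The DA-to-DARR conversion can be sketched as follows; the 25-point minimum score difference follows the convention of past WMT editions, and `da_to_darr` is a hypothetical name for illustration:

```python
from collections import defaultdict

def da_to_darr(da_scores, min_diff=25.0):
    """Convert per-segment DA scores into DARR better/worse preferences.

    `da_scores` maps (segment_id, system) -> DA score.  Every pair of
    systems scored on the same segment whose scores differ by at least
    `min_diff` yields one preference.  Returns a list of
    (segment_id, better_system, worse_system) tuples."""
    by_segment = defaultdict(dict)
    for (seg, system), score in da_scores.items():
        by_segment[seg][system] = score

    prefs = []
    for seg, scores in by_segment.items():
        systems = sorted(scores)
        for i, a in enumerate(systems):
            for b in systems[i + 1:]:
                if scores[a] - scores[b] >= min_diff:
                    prefs.append((seg, a, b))
                elif scores[b] - scores[a] >= min_diff:
                    prefs.append((seg, b, a))
    return prefs
```

Pairs whose DA scores are closer than the threshold produce no preference, which is how ties in the human judgement drop out of the later Kendall computation.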
We measure the quality of metrics' scores against the DARR golden truth using a Kendall's Tau-like formulation, which is an adaptation of the conventional Kendall's Tau coefficient. Since we do not have a total-order ranking of all translations, it is not possible to apply conventional Kendall's Tau given the current DARR human evaluation setup (Graham et al., 2015).
Our Kendall's Tau-like formulation, τ, is as follows:

\tau = \frac{|Concordant| - |Discordant|}{|Concordant| + |Discordant|}   (2)
where Concordant is the set of all human comparisons for which a given metric suggests the same order and Discordant is the set of all human comparisons for which a given metric disagrees. The formula is not specific with respect to ties, i.e. cases where the annotation says that the two outputs are equally good.
The way in which ties (both in human and metric judgement) were incorporated in computing Kendall's τ has changed across the years of the WMT Metrics Tasks. Here we adopt the version used in the WMT17 DARR evaluation. For a detailed discussion of other options, see also Machacek and Bojar (2014).
Whether or not a given comparison of a pair of distinct translations of the same source input, s1 and s2, is counted as a concordant (Conc) or discordant (Disc) pair is defined by the following matrix:
In previous years, we used bootstrap resampling (Koehn, 2004; Graham et al., 2014) to estimate confidence intervals for our Kendall's Tau formulation, and metrics with non-overlapping 95% confidence
[Table 8: Influence of references: Pearson correlation of metrics with DA human assessment for MT systems excluding outliers in WMT2020 for all language pairs with multiple references; correlations of metrics not significantly outperformed by any other for that language pair are highlighted in bold. The subscript B represents a secondary reference, P represents a paraphrased reference, M represents all available references. Note that we exclude reference-free metrics from this table, so the winners are not comparable with the main tables. (Table body lost in extraction.)]
                     Metric
             s1<s2   s1=s2   s1>s2
Human  s1<s2  Conc    Disc    Disc
       s1=s2   −       −       −
       s1>s2  Disc    Disc    Conc
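Combining Equation 2 with the WMT17 tie-handling convention (a metric tie on a human-ranked pair counts as discordant) gives a short scoring routine; `kendall_tau_like` is a hypothetical name for illustration:

```python
def kendall_tau_like(metric_scores, human_prefs):
    """Kendall's Tau-like correlation (Equation 2) against DARR
    preferences.  `human_prefs` is a list of (better, worse) translation
    ids from DARR; `metric_scores` maps translation id -> metric score.
    A metric tie on a human-ranked pair counts as discordant."""
    conc = disc = 0
    for better, worse in human_prefs:
        if metric_scores[better] > metric_scores[worse]:
            conc += 1
        else:
            disc += 1  # metric disagrees, or ties the pair
    return (conc - disc) / (conc + disc)
```

Because human ties are never turned into preference pairs, only the metric's side of the tie matrix needs handling here.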
intervals are identified as having a statistically significant difference in performance. The tests are inconclusive for most metric pairs this year, and we do not include them in the paper.
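A sketch of the bootstrap procedure over DARR preference pairs (resample the pairs with replacement, recompute τ on each resample, and read off empirical quantiles); this illustrates the general technique of Koehn (2004), not the exact WMT implementation:

```python
import random

def bootstrap_tau_ci(metric_scores, human_prefs, n_resamples=1000,
                     alpha=0.05, seed=0):
    """Empirical (1 - alpha) confidence interval for the Kendall's
    Tau-like score, via bootstrap resampling of preference pairs."""
    def tau(prefs):
        conc = sum(1 for better, worse in prefs
                   if metric_scores[better] > metric_scores[worse])
        disc = len(prefs) - conc
        return (conc - disc) / (conc + disc)

    rng = random.Random(seed)
    taus = sorted(
        tau([rng.choice(human_prefs) for _ in human_prefs])
        for _ in range(n_resamples)
    )
    lo = taus[int(n_resamples * alpha / 2)]
    hi = taus[int(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi
```

Two metrics are then called significantly different when their intervals do not overlap, the criterion described above.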
4.2.1 Segment-Level Results

Results of the segment-level human evaluation for translations sampled from the News Translation Task are shown in Tables 9 and 10. We expect that comparing between segments translated by two MT systems that are far apart in quality would be a relatively easier task for automatic metrics. So we also include results after discarding segments that were translated by outlier systems.
Note that we do not include any human-translated segments in this evaluation.
4.3 Document-level Results
Results of the document-level human evaluation for translations sampled from the News Translation Task are shown in Tables 11 and 12.
5 Discussion
5.1 System-Level Results
In general, there is no clear best metric this year across all language pairs. For most language pairs, the Williams significance test results in large clusters of metrics. The set of "winners" according to the test (i.e., the metrics that are not outperformed by any other metric) is typically not consistent across language pairs.
The sample of systems we employ to evaluate metrics is often small, as few as six MT systems for Pashto→English, for example. This can lead to inconclusive results, as identification of significant differences in correlations of metrics is unlikely at such a small sample size. Furthermore, the Williams test takes into account the correlation between each pair of metrics, in addition to the correlation between the metric scores themselves, and this latter correlation increases the likelihood of a significant difference being identified. In extreme cases, the test would have low power when comparing a metric that does not correlate well with other metrics, resulting in this metric not being outperformed by other metrics despite having a much lower value of correlation.
To strengthen the conclusions of our evaluation, in past years (Bojar et al., 2016, 2017; Ma et al., 2018), we included significance test results for large hybrid-super-samples of systems: 10K hybrid systems were created per language pair, with corresponding DA human assessment scores, by sampling pairs of systems from the News Translation Task and creating hybrid systems by randomly selecting each candidate translation from one of the two selected systems. However, as WMT human annotations were collected with document context in 2020, this style of hybridization is susceptible to breaking cross-segment references in MT outputs, and it would be unreasonable to shuffle individual segments. The creation of hybrid systems would instead need to be done by sampling documents rather than segments from all sets of systems. Finally, it is possible that including documents translated by outlier systems might falsely lead to high correlations. We believe that this merits further investigation based on data from previous metrics tasks, and we do not attempt it this year.
In the rest of this section, we present analysis of various aspects of system-level evaluation based on scatterplots of all metrics. Appendix B contains scatterplots of metrics for each language pair. We include BLEU, chrF, the "best" reference-based metric and the "best" reference-free metric (we acknowledge that this is not the best way to define the best metric, but we choose the metric that is most highly correlated with humans on the set of all MT systems after removing outliers).
5.1.1 Influence of Domain in English→Inuktitut

The English→Inuktitut training data was from the Canadian Hansards domain, and the development data contained a small amount of news data. The test set was a mix of in-domain data from the Hansards and news documents. The evaluation was only done on the out-of-domain news documents, so we also look at metric scores computed only on the subset of news sentences.
Figure 1 shows that BLEU scores on the out-of-domain dataset are considerably smaller than on the full dataset, showing that MT systems have a higher quality on the in-domain dataset. The relative scores of metrics remain mostly stable when we compare scores on the full test set to scores on
[Table 9: Segment-level metric results for to-English language pairs: Kendall's Tau formulation of segment-level metric scores with DARR scores. For language pairs that contain outlier systems, we also show correlation after discarding segments translated by outlier systems. (Table body lost in extraction.)]
704
Table 10: Segment-level metric results for out-of-English language pairs: Kendall's Tau formulation of segment-level metric scores with DARR scores. For language pairs that contain outlier systems, we also show correlation after discarding segments translated by outlier systems.
Table 11: Document-level metric results for to-English language pairs: Kendall's Tau formulation of document-level metric scores with DOC-DARR judgements. For language pairs that contain outlier systems, we also show correlation after discarding documents translated by outlier systems.
Table 12: Document-level metric results for out-of-English language pairs: Kendall's Tau formulation of document-level metric scores with DOC-DARR judgements. For language pairs that contain outlier systems, we also show correlation after discarding documents translated by outlier systems.
Figure 1: English→Inuktitut: Human vs. BLEU scores on the full dataset vs. the news subset. Only the news subset was included in the human evaluation. Each dot corresponds to an MT system; the outlier on the top-right is UQAM TANLE.
only the news subset that was evaluated. The main exception is UQAM TANLE; BLEU scores are really high on the out-of-domain data, and increase very little when computed on the full dataset. When looking at correlations with human scores (Table 7), we expected correlations to increase when computed over the news subset. This is true for most metrics such as COMET-QE, but the correlation stays the same or actually decreases for other metrics like PARBLEU.
5.1.2 Scoring Human Translations

The alternate reference was included in the manual evaluation for German→English, Russian→English and Chinese→English. All human references were included in the out-of-English manual evaluation.7
German→English: HUMAN-B was ranked third in the manual evaluation. The lexical metrics (BLEU, CHRF, CHARACTER, EED, MEE, YISI-0, CHRF++, PARBLEU, PARCHRF) give extremely low scores to the HUMAN-B reference. This is also true for PRISM and all the reference-free metrics except COMET-QE. The neural metrics also give low scores to the human reference; however, their margin of error is much smaller.
7 Findings 2020 in the official tables label the alternate reference in the into-English direction simply as HUMAN. The "first" reference, which serves as the primary reference for us, was not scored manually in DA into English. Out of English, the primary reference for us is labelled HUMAN-A in Findings.
COMET-QE is the only metric that gives high scores to the HUMAN-B reference.
Appendix B also shows the scatterplot "newstestB2020", where HUMAN-B served as the reference for the metrics. We see some differences in the vertical axis, but the general picture remains the same even with this fairly different human translation.
Russian→English: The HUMAN-B reference was ranked after 6 MT systems in the manual evaluation, but still within the same cluster, so it is not significantly distinguishable from them. Lexical metrics give relatively low scores to HUMAN-B. The neural metrics give relatively higher scores, but rank it above ONLINE-A and below ariel197197, i.e. differently than the DA judgements.
Chinese→English: The human translation is ranked 12th in the manual evaluation (in a giant cluster which puts together all but one top and one bottom system), and most metrics place it more or less correctly. Many metrics, including lexical metrics, still have correlations above 0.9 even after including the human translation.
English→German: According to the WMT human evaluation, the HUMAN-B reference receives the highest scores, the HUMAN-A reference is ranked fourth, and HUMAN-P, which was generated by linguists paraphrasing the WMT references, is ranked lower at 10th place. Each human reference falls into a separate cluster of significance.
Lexical metrics score around 10 MT systems above each WMT reference (using the other WMT human translation as reference). COMET-QE and some neural metrics (BLEURT, COMET-MQM, COMET-HTER and MBERT-L2) score HUMAN-A and HUMAN-B as better than all MT systems.
When using either of the WMT references, most metrics, including all the lexical metrics, score the paraphrased reference much lower than the rest of the systems. The COMET family of metrics and BLEURT-EXTENDED are the only metrics that are able to recognise the merit of the paraphrased references.
When using the paraphrased references, all reference-based metrics score the two human translations above all MT systems, often by a large margin. PRISM is the sole exception; it scores the HUMAN-B reference about half way between the MT systems. Interestingly, most of these metrics score HUMAN-A above HUMAN-B, i.e. disagreeing with the DA judgements. Metric correlations when including the HUMAN-A system drop dramatically when using the alternate WMT reference, but the correlations are higher with the paraphrased reference. This also holds when scoring HUMAN-B using the paraphrased vs. the main WMT reference (Table 7).
Of the reference-free metrics, COMET-QE scores the two WMT references above all MT systems, and ranks the paraphrased reference similarly to its rank in the manual evaluation. OPENKIWI-BERT and OPENKIWI-XLMR are a little biased against these human translations, and YISI-2 scores all human translations below all MT systems.
English→Chinese: The manual evaluation ranks the two human translations above all MT systems, but most metrics give them much lower scores.
To summarize, we see that current MT metrics generally struggle to reliably score human translations against machine translations. Rare exceptions primarily include trained neural metrics and the reference-less COMET-QE. While the metrics are not really prepared to score human translations, we find this type of test relevant as more and more language pairs get closer to the human translation benchmark. A general-enough metric should thus be able to score human translations comparably and not rely on some idiosyncratic properties of MT outputs. We hope that human translations will be included in WMT DA scoring in the upcoming years, too.
5.1.3 Influence of Outliers

There are no outlier systems for some language pairs like Khmer→English and English→Russian. For others, we have systems whose score is far away from the scores of the rest of the systems. As these outliers have a large influence on Pearson correlation, computing the correlation without outliers typically makes the task harder for metrics and results in a decrease in correlation.
For example, we identify three outliers in the German→English set; the quality of the last system is extremely low compared to the rest. All reference-based metrics have high correlations when including all systems, but correlations drop when discarding outliers. In particular, CHRF and PARESIM both had a correlation of 0.95 when computed over all systems, but this drops to 0.69 and 0.83 respectively after removing outliers, revealing that PARESIM is more reliable for this language pair. An even larger drop is observed for CHRF and CHRF++ in English→Czech, from 0.8 to 0.3. We find this particularly surprising because CHRF has always performed well on this language pair, including in the evaluation on the gradually reducing set of top N systems, i.e. in harder and harder conditions; see SACREBLEU-CHRF in Appendix A.4 of Ma et al. (2019).
In some cases, metrics can be inaccurate when scoring outliers, resulting in an increased correlation when correlation is recomputed over non-outlier systems. For example, with Chinese→English, the score of WMTBIOMEDBASELINE is much lower than that of the next system. Most metrics correctly rank it last as well, but COMET-HTER, COMET-MQM, COMET-QE and OPENKIWI-BERT give it a higher score than the next system(s). Note that the other metrics all have a correlation of above 0.9 even after removing the outlier.
In other cases, removing outliers decreases the correlation of a metric and yet helps its final outcome. For instance, SENTBLEU averaged over all sentences becomes one of the "winners" in the system-level evaluation of translation into English (Table 5). If we trust the results without outliers more, using averaged SENTBLEU seems better than using plain old BLEU, and it is not significantly worse than any other metric going from English into several target languages.
For some language pairs, we override the decisions made by the outlier detection algorithm, based on whether we believe including or removing these systems from consideration would have an impact on the correlations. For example, with Tamil→English, the last two systems are not classified as outliers by the algorithm, but their human scores are some distance away from the rest of the systems. CHRF, CHRF++ and PARCHRF++ are the only metrics that correctly order these two systems; OPENKIWI-BERT and OPENKIWI-XLMR both get these two systems wrong by a large margin. But for all metrics, removing these systems leads to a significant drop in correlation. Thus we count these two systems as outliers.
Another example is Japanese→English. For this language pair, we have two clusters of 7 and 3 systems. Metrics have high correlations when considering all systems, but when looking at MT systems within individual clusters, there are discrepancies between the metric scores compared to human scores. The outlier detection algorithm flags only the last two systems as outliers, but the presence of the third system has a disproportionate impact on the correlation. We include all three systems in the set of outliers.
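The outlier detection algorithm itself is not reproduced in this section. As an illustrative sketch only (the median-absolute-deviation approach and the cutoff of 2.5 are assumptions here, not the shared task's documented procedure), system-level outliers can be flagged with a robust z-score:

```python
def mad_outliers(scores, cutoff=2.5):
    """Flag systems whose human score is far from the rest (MAD-based sketch)."""
    def median(xs):
        s = sorted(xs)
        mid = len(s) // 2
        return s[mid] if len(s) % 2 else 0.5 * (s[mid - 1] + s[mid])

    med = median(scores)
    mad = median([abs(x - med) for x in scores])
    if mad == 0:
        return []
    # 1.4826 rescales the MAD so it is comparable to a standard deviation
    return [i for i, x in enumerate(scores)
            if abs(x - med) / (1.4826 * mad) > cutoff]

# Toy human z-scores: the last system is far below the rest
print(mad_outliers([0.21, 0.18, 0.16, 0.15, 0.14, -0.95]))  # [5]
```

Unlike mean/standard-deviation screening, the median-based statistic is not itself dragged by the outlier, which is why it suits this setting.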
The influence of references: For all language pairs where multiple references were available, the correlations are typically very close whether using the primary reference or the alternate reference. For metrics where we do see a difference, there is no consistent pattern of metrics preferring one reference or the other. We note that although the change in correlations is small when comparing across reference sets, the set of "winners" according to the Williams test for statistical significance is not stable, particularly for English→German. When combining references, in most cases the correlation with multiple references lies between the correlations of the individual references. For example, with English→German, BLEU correlates best with the secondary reference, with a correlation of 0.844. But with multiple references, the correlation is 0.825, just above the correlation with the primary reference, which is 0.822 (Table 8).
There are a few exceptions where there is a small increase in metric correlation above both individual references. For example, the correlation of CHARACTER with German→English increases from 0.687 and 0.696 with a single reference to 0.713 with both references (Table 8). But there are no metrics which consistently show an improvement with multiple references across multiple language pairs.
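The Williams test used for the "winner" sets compares two dependent Pearson correlations with human scores. A minimal sketch, following the formulation commonly used for MT metric evaluation (Graham and Baldwin, 2014); the example numbers are invented:

```python
import math

def williams_t(r12, r13, r23, n):
    """Williams test statistic: is metric 1's correlation with humans (r12)
    significantly higher than metric 2's (r13), given the correlation
    between the two metrics (r23), over n systems?"""
    k = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    rbar = (r12 + r13) / 2
    num = (r12 - r13) * math.sqrt((n - 1) * (1 + r23))
    den = math.sqrt(2 * k * (n - 1) / (n - 3) + rbar**2 * (1 - r23) ** 3)
    return num / den  # compare against a t distribution with n - 3 df

# Two metrics with close correlations over only 10 systems: t ≈ 0.91,
# far from significance, matching the instability discussed above.
print(williams_t(0.95, 0.90, 0.85, 10))
```

With so few systems per language pair, even sizeable correlation gaps can fail to reach significance, which is one reason the winner sets shift between reference sets.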
5.1.4 Neural vs. Lexical Metrics

For many language pairs, when we look at correlation clustering of the reference-based metrics based on their system-level scores, we end up with two major clusters: neural metrics and lexical metrics. We have seen that lexical and neural metrics differ in how they score the human translations. For English→German, all lexical metrics have a slightly higher correlation than any neural metric when evaluating MT systems. However, these metrics make major errors evaluating the HUMAN-A translations with the HUMAN-B reference.
We also see such differences with some MT systems. Selected examples:
• English→Czech: All lexical metrics including BLEU and CHRF are very biased towards ONLINE-B, with metric scores indicating that this system is better than all others by a large margin. It is ranked 7th in the human evaluation. Neural metrics and reference-free metrics are more or less correct when scoring this system. Surprisingly, ESIM is an exception to this, and also ranks ONLINE-B on top.
• Polish→English: Lexical metrics like BLEU give very low scores to ONLINE-G.
• Tamil→English: Lexical metrics consistently score ONLINE-Z above MICROSOFT STC INDIA, but the remaining metrics, including the reference-free metrics, rank them in the opposite order. The human evaluation agrees with the lexical metrics.
• Khmer→English: Lexical metrics score the best system lower than the next two, whereas most neural metrics get the order of the top systems right.
5.1.5 Other Discrepancies between Metric and Human Scores
Here we briefly draw attention to particularities we spotted when manually examining the results.
• German→English: All metrics score Tohoku-AIP-NTT higher than OPPO, and UEDIN higher than PROMT NMT.
• Russian→English: ONLINE-A, which is ranked 2nd in the human evaluation, receives low metric scores. In contrast, some metrics including BLEU and PARBLEU choose ARIEL197197, which is ranked 6th in the human evaluation, as the best system.
• Tamil→English: The highest-ranked system according to human scores, GTCOM, receives lower metric scores than the next three to six systems. Metrics are biased towards ONLINE-A and against ONLINE-Z.
• Chinese→English: HUOSHAN TRANSLATE is a clear winner according to the human evaluation, but BLEU ranks it lower than the next 3 systems. The difference between human scores for the next 8 systems is not statistically significant, so metrics order these systems differently than the human scores, and these discrepancies are not penalised harshly by Pearson correlation.
• English→Chinese: While many metrics including BLEU have high correlations, others make major errors scoring NIUTRANS; OPENKIWI-BERT, for example, assigns it really low scores.
Overall, we note that these metric-human discrepancies often feature online systems, which are probably more diverse than the MT system submissions to the WMT shared tasks.
5.1.6 Pearson vs. Kendall Tau

Overall, we found that Pearson correlation doesn't always give us the complete picture. In particular, outliers have a large influence on the correlation and can mask the presence of discrepancies between metric and human scores for the rest of the systems. But making a decision on which systems to discard is not easy.
In this paper, we also explore Kendall's Tau as an alternative to Pearson correlation. Tables 16 and 17 in the Appendix show the Kendall Tau correlation of metrics over all MT systems (not including human translations).
Kendall’s Tau is less sensitive to outliers, anddirectly measures whether metrics agree with hu-mans when comparing pairs of systems. However,Kendall’s Tau doesn’t consider the differences inscores, and two metrics whose errors differ in mag-nitude can have the same Kendall’s Tau correlation(Figure 2).
5.2 Segment and Document-Level Results

On the more fine-grained evaluation scales, PRISM and the trained neural metrics (the COMET and BLEURT families of metrics) have a better agreement with human judgements than lexical metrics.
The correlations for the to-English language pairs are consistently much lower, on average, compared to those of the out-of-English language pairs. The difference could be due to the differing set of annotators: the to-English human evaluation was crowd-sourced and is therefore likely to be noisier.
Finally, we find that correlations drop markedly for most language pairs if we consider only the
Figure 2: Scatterplots of human scores against two metrics that have the same Kendall Tau correlation with human scores (OPENKIWI-BERT: r = 0.73, tau = 0.62; EED: r = 0.97, tau = 0.62), though OPENKIWI-BERT has bigger errors.
segment/document pairs that do not contain outlier systems. We suspect that, as the quality of outlier system translations is typically low, most of the generated better-worse pairs that contain outliers are easy for metrics. Removing these pairs makes the task a lot harder. It is also very likely that the remaining pairs of translations are noisier, which decreases metric agreement with these pairwise judgements.
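The DARR-style Tau underlying these results can be sketched as follows; the 25-point DA gap reflects common WMT practice, but treat the details here as assumptions rather than the task's exact implementation:

```python
def darr_tau(da_scores, metric_scores, min_gap=25.0):
    """Kendall-Tau-like agreement over better-worse pairs derived from DA
    scores: only pairs whose DA scores differ by at least min_gap count."""
    conc = disc = 0
    n = len(da_scores)
    for i in range(n):
        for j in range(i + 1, n):
            gap = da_scores[i] - da_scores[j]
            if abs(gap) < min_gap:
                continue  # humans did not clearly prefer either translation
            agree = gap * (metric_scores[i] - metric_scores[j]) > 0
            conc += agree
            disc += not agree
    return (conc - disc) / (conc + disc)

da = [80.0, 40.0, 70.0, 10.0]       # raw human DA scores per translation
metric = [0.7, 0.5, 0.6, 0.2]       # metric scores for the same translations
print(darr_tau(da, metric))          # all usable pairs agree here: 1.0
```

Dropping an outlier system removes exactly the wide-gap, easy pairs, which is consistent with the correlation drops reported above.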
The document-level correlations are typically higher than segment-level correlations. This could be due to reduced noise in human scores when averaging the scores of multiple segments. Computing metric scores over documents that contain multiple segments also helps reduce metric noise.
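A minimal sketch of this averaging effect (an assumption about the general approach, not the task's exact scoring pipeline):

```python
def doc_scores(segment_scores, doc_ids):
    """Average per-segment metric scores within each document."""
    docs = {}
    for score, doc in zip(segment_scores, doc_ids):
        docs.setdefault(doc, []).append(score)
    return {d: sum(s) / len(s) for d, s in docs.items()}

# Two documents of two segments each; per-segment noise partly cancels
print(doc_scores([0.8, 0.6, 0.5, 0.3], ["d1", "d1", "d2", "d2"]))
```

Averaging within documents shrinks both human and metric noise before correlating, which is one plausible reason document-level correlations come out higher.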
5.3 Reference-Based Metrics vs. Reference-Free Metrics
We have four submissions of metrics that directly compare MT outputs with the source segment: COMET-QE, OPENKIWI-BERT, OPENKIWI-XLMR, and YISI-2. Other members of the COMET family of metrics use information from both the source and the reference. The remaining metrics compute scores by comparing the MT output with the reference.
While the task of comparing segments in different languages is harder than comparing segments in the same language, reference-free metrics have one advantage: they are not encumbered by reference bias. COMET-QE is the only metric that correctly gives a high score to the human translation in German→English, and one of the few metrics that does so for English→Chinese.
This year, the reference-free metrics are highly competitive with reference-based metrics for all language pairs. For English→Tamil, COMET-QE has a near-perfect correlation of 0.97 even after discarding outliers. In contrast, many reference-based metrics including BLEU and CHRF give really high scores to ONLINE-B, which results in low correlations.
6 Use Automatic Metrics to Detect Incorrect Human Preference
It has been argued that non-expert translators lack knowledge of translation and so might not notice subtle differences that make one translation better than another. Castilho et al. (2017) compared the evaluation of MT output by professional translators against crowd workers. The results showed that for all language pairs, the crowd workers tend to be more accepting of the MT output, giving higher fluency and adequacy scores. Toral et al. (2018) showed that ratings acquired from professional translators show a wider gap between human and machine translations compared to judgements by non-experts. They recommend using professional linguists for MT evaluation going forward. Läubli et al. (2020) show that non-experts assess parity between human and machine translation where professional translators do not, indicating that the former neglect more subtle differences between different translation outputs. Given the previous work and the fact that the WMT human evaluation has been conducted with a mix of researchers and crowd workers, we rerun the human evaluation for a subset of the submissions with professional linguists. In particular, we want to investigate if we can use the quality scores obtained by the automatic metrics to detect incorrect human ratings. We filtered out all pairs of systems where the human evaluation results disagree with all automatic metrics. Taking the metric scores as a signal, we rerun the human evaluation for a subset of submissions for 2 language pairs: German→English and English→German. We hired 10 professional linguists, who reran the source-based direct assessment human evaluation with the same document-based template that was used for the original WMT ratings.
6.1 German→English
For German→English, we found that all automatic metrics disagree with the human evaluation results for OPPO and TOHOKU: OPPO received a higher human rating, while all automatic metrics gave TOHOKU a higher score. To investigate which of the results to trust, we reran the source-based direct assessment for these 2 systems with professional linguists. The results in Table 13 show that professional linguists in fact prefer the output of TOHOKU, as predicted by all automatic metrics.
Evaluation                 OPPO     TOHOKU
avg metric (HUMAN-A ref)    8.85     8.95
avg metric (HUMAN-B ref)   10.15    10.26
WMT                        84.6     81.5
  z-score                   0.220    0.179
prof. linguist             81.0     81.7
  z-score                  -0.005    0.010

Table 13: WMT 2020 German→English comparing the reference-based ratings acquired with crowd workers/researchers (WMT) against source-based ratings acquired with professional linguists.
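The z-scores in Table 13 come from standardized human ratings. A hedged sketch of the usual per-annotator standardization (the use of the population standard deviation and the absence of any score filtering are assumptions here):

```python
def z_scores_by_annotator(ratings):
    """ratings: iterable of (annotator, system, raw_score) tuples.
    Standardize each annotator's scores by their own mean/std, then
    average the standardized scores per system."""
    by_ann = {}
    for ann, _, score in ratings:
        by_ann.setdefault(ann, []).append(score)
    stats = {}
    for ann, xs in by_ann.items():
        mean = sum(xs) / len(xs)
        std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
        stats[ann] = (mean, std if std > 0 else 1.0)  # guard degenerate annotators
    by_sys = {}
    for ann, system, score in ratings:
        mean, std = stats[ann]
        by_sys.setdefault(system, []).append((score - mean) / std)
    return {s: sum(z) / len(z) for s, z in by_sys.items()}

raw = [("a1", "sysA", 90), ("a1", "sysB", 70),
       ("a2", "sysA", 60), ("a2", "sysB", 40)]
print(z_scores_by_annotator(raw))  # annotator severity cancels out
```

This is why a system's z-score ranking can differ from its raw-average ranking, as seen when comparing the WMT and professional-linguist rows.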
6.2 English→German

For English→German, we reran the human evaluation for the top 2 ranked MT systems (based on human evaluation), OPPO and TOHOKU, and the human translation HUMAN-A. The quality of human translations is usually underestimated by automatic metrics when computed with standard references. This is also visible in this year's evaluation campaign, where the average metric score of all submissions for the human translation HUMAN-A is much lower when compared to the top MT submissions. To overcome this problem, Freitag et al. (2020) introduced paraphrased references that also value the translation quality of human translations and alternative (less simple/monotonic) MT output. As we can see in Table 14, the average metric scores of all submissions when computed with the paraphrased reference HUMAN-P yield a much higher score for the human translation HUMAN-A when compared to all MT outputs.
The official WMT human evaluation ranked the human translation third, right behind the two MT outputs from OPPO and TOHOKU. Interestingly, based on the z-scores, WMT predicts OPPO to be of higher quality than TOHOKU, which is in disagreement with most of the metric scores when calculated against both types of reference translations. Overall, the automatic metrics come to a very different ranking than the human evaluation for the top performing submissions.
Evaluation                  OPPO    TOHOKU   HUMAN-A
avg metric (HUMAN-B ref)    10.05    10.09     9.14
avg metric (HUMAN-P ref)    11.93    12.07    15.74
WMT                         87.39    88.62    85.10
    z-score                  0.495    0.468    0.379
prof. linguist              73.66    74.70    84.09
    z-score                 -0.051   -0.037    0.088
Table 14: WMT 2020 English→German, comparing the source-based ratings acquired with crowd workers/researchers (WMT) against source-based ratings acquired with professional linguists.
We reran the human evaluation with the same template, but with professional linguists. Interestingly, the human translation was ranked first by a large margin. Furthermore, the MT output of TOHOKU was rated as higher quality than the MT output from OPPO. The results of the human evaluation with professional linguists correlate perfectly with the metric scores calculated with the paraphrased reference. This indicates not only the advantages of paraphrased references when scoring human translations, but also that automatic metrics can be used to identify incorrect human ratings.
7 Conclusion
This paper summarizes the results of the WMT20 shared task in machine translation evaluation, the Metrics Shared Task. Participating metrics were evaluated in terms of their correlation with human judgement at the level of the whole test set (system-level evaluation), as well as at a more fine-grained level (document-level evaluation, and sentences or paragraphs for segment-level evaluation). We reported scores for standard metrics requiring the reference as well as for metrics that compare MT output directly with the source text. At the system level, the best metrics reach a Pearson correlation of 0.95 or better across several language pairs. In many cases, this correlation drops considerably when it is recomputed after discarding outlier systems.
Computing Pearson correlation without outliers can change the rankings of metrics, and selecting these outlier systems is not an exact science. We report results both with all systems and after discarding outliers, and also include Kendall's Tau correlation, in the hope that together they give a more complete picture than reporting only one of these numbers. In the end, we believe that it is impossible to adequately describe data with summary numbers alone, and that it is best to visualise the data to understand patterns.
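The outlier-discarding step discussed above can be sketched as follows. This is a minimal illustration, not the shared task's exact implementation: it flags systems whose human score has a large MAD-based modified z-score (in the spirit of Iglewicz and Hoaglin, 1993), then recomputes Pearson's r without them. The human and metric scores below are invented for illustration, and the cutoff value is an assumption.

```python
# Sketch: Pearson correlation with and without MAD-flagged outlier systems.
# Scores are invented; the 2.5 cutoff is an illustrative assumption.
import numpy as np
from scipy.stats import pearsonr

def remove_outliers(human, metric, cutoff=2.5):
    """Drop systems whose human score is a MAD-based outlier."""
    human = np.asarray(human, dtype=float)
    metric = np.asarray(metric, dtype=float)
    med = np.median(human)
    mad = np.median(np.abs(human - med))              # median absolute deviation
    modified_z = 0.6745 * (human - med) / mad         # Iglewicz-Hoaglin modified z-score
    keep = np.abs(modified_z) < cutoff
    return human[keep], metric[keep]

# Six systems; the last one is a clear low-quality outlier that the
# metric also scores very low, which inflates the full-set correlation.
human = [0.2, 0.1, 0.05, 0.0, -0.1, -1.5]
metric = [30.0, 29.5, 29.0, 28.0, 27.5, 15.0]

r_all, _ = pearsonr(human, metric)
h, m = remove_outliers(human, metric)
r_clean, _ = pearsonr(h, m)
```

With the outlier included, r is near-perfect; after discarding it, r drops, mirroring the effect described in the text.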
The results confirm the trends from previous years, namely that metrics based on word- or sentence-level embeddings achieve the highest performance (Ma et al., 2018, 2019).
For some language pairs, we had two references available. On these test sets, we found that computing scores with two references rarely helped metrics achieve a higher correlation than using either reference individually. This contradicts earlier research showing that multiple references improve correlation (Bojar et al., 2013), but is in line with more recent papers showing that additional independent references might not be helpful (Freitag et al., 2020). We believe that the utility of additional independent references depends on the MT systems evaluated: perhaps they are not as helpful when scoring high-quality MT systems as with low- or mid-quality MT.
In addition to scoring MT systems, this year we also requested scores for human reference translations. This highlighted the difference between lexical and embedding-based metrics, as lexical metrics consistently gave low scores to human translations. However, when using the English→German paraphrased references, all metrics scored the other human references above all MT systems, highlighting the advantages of using paraphrased references when scoring human translations.
In addition to human references, there are some MT systems for which metrics (either the majority of metrics, or only the lexical metrics) make major errors. It remains an open question what it is about these systems that makes metrics struggle to score them correctly.
Compared to last year, the performance of the reference-free metrics has improved: their correlations this year are competitive with the reference-based metrics and in many cases outperform BLEU. In particular, COMET-QE is good at recognising the high quality of human translations where BLEU falls short.
In terms of segment-level Kendall's τ results, the correlations of the standard metrics were very low for the to-English language pairs, particularly after discarding translations by outlier systems. The correlations for the out-of-English language pairs are more in line with recent years, reaching a maximum above 0.6.
It has been shown that context is very important when humans rate MT outputs (Toral et al., 2018), and the WMT human evaluation is moving towards evaluating segments with the document context (Barrault et al., 2019). This creates a mismatch with automatic metrics, all of which, this year, score each segment independently. This year, we introduced document-level evaluation of metrics. When computing document-level scores, some metrics from the COMET family include document context when computing segment scores within the document. All other metrics included in this year's evaluation either use the average of the segment scores or compute the document score based on statistics computed independently for each segment. In the future, we hope to see more metrics that consider broader context when evaluating translations at all three levels.
For this year, we are unable to draw any meaningful conclusions from the document-level evaluation task, as it is hard to tease apart the influence of noise in the ground truth, inadequate segment-level translations, and inadequate translation in the context of the document.
We believe that noise in the DARR judgements is a big factor in the low correlations for the to-English language pairs. Further research is needed to understand the factors that contribute to the Kendall's Tau scores and how much we can trust these results.
There are shortcomings in the methods used to evaluate metrics at the system-, document-, and segment-level, and we believe that improving methods for evaluating and analysing automatic metrics is a rich area for future research.
Finally, we assume that any discrepancy between metrics and the WMT manual evaluation is a metric error, and we acknowledge that this might not be true in all cases. There is always scope for improvement in human evaluation methodology, and the best-practice recommendations for human evaluation are always evolving.
We reran the human evaluation using the same template as the WMT evaluation, but switching the rater pool from non-experts to professional linguists for a subset of translations where all metrics disagree with the WMT human evaluation. This experiment revealed a new use case for automatic metrics and demonstrated that they can be used to identify bad ratings in human evaluations. The newly obtained ratings were in line with the scores suggested by the automatic metrics and also confirmed the higher translation quality of human translations compared to MT output.
In this paper, we looked at how outliers influence metric evaluation, and we wonder how the presence of these systems influences DA annotations. In a perfect world, annotators score each translation on its own merits without being influenced by previous instances. In this world, given the presence of much worse translations, do annotators assign high scores to the remaining translations that look relatively better? Does an MT system receive an unfair advantage if it is consistently scored alongside a low-scoring outlier? And does standardising the scores of individual annotators exacerbate this issue? These and other research questions remain open this year, keeping the WMT tasks increasingly interesting as MT systems get closer to human performance.
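The per-annotator standardisation mentioned above can be sketched as follows: each annotator's raw direct-assessment scores are mapped to z-scores using that annotator's own mean and standard deviation, so lenient and harsh raters become comparable. The annotator IDs and scores here are invented for illustration.

```python
# Sketch of per-annotator z-score standardisation of DA scores.
# Annotator IDs and raw scores are invented for illustration.
import numpy as np

def standardise(scores_by_annotator):
    """Map each annotator's raw scores to z-scores using their own mean/std."""
    out = {}
    for annotator, scores in scores_by_annotator.items():
        scores = np.asarray(scores, dtype=float)
        out[annotator] = (scores - scores.mean()) / scores.std()
    return out

raw = {
    "annotator_1": [70.0, 80.0, 90.0],   # a lenient rater
    "annotator_2": [30.0, 40.0, 50.0],   # a harsh rater, same ranking
}
z = standardise(raw)
```

After standardisation, the two raters' scores coincide, since only their offsets and scales differed; this is exactly why a consistently co-occurring outlier could shift an annotator's mean and, through it, every other system's z-score.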
Acknowledgments
Results in this shared task would not be possible without tight collaboration with the organizers of the WMT News Translation Task.
We are grateful to Google for sponsoring the evaluation of selected MT systems by professional linguists. We thank all participants of the task, particularly Weiyu Wang for pointing out some inconsistencies with the original release of inputs, Jackie Lo, Ricardo Rei, and Craig Stewart for helpful feedback, and Thibault Sellam for also submitting the scores of finetuned BERT at our request.
Nitika Mathur is supported by the Australian Research Council. Ondřej Bojar would like to acknowledge the support of the Czech Science Foundation (grant n. 19-26934X, NEUREM3).
References

Ananya Mukherjee, Hema Ala, Manish Shrivastava, and Dipti Misra Sharma. 2020. MEE: An automatic metric for evaluation using embeddings for machine translation. (in press).

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, et al. 2019. Findings of the 2019 Conference on Machine Translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61.
Loïc Barrault, Magdalena Biesialska, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Matthias Huck, Eric Joanis, Tom Kocmi, Philipp Koehn, Chi-kiu Lo, Nikola Ljubešić, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Santanu Pal, Matt Post, and Marcos Zampieri. 2020. Findings of the 2020 Conference on Machine Translation (WMT20). In Proceedings of the Fifth Conference on Machine Translation, Online. Association for Computational Linguistics.

Rachel Bawden, Biao Zhang, Andre Tättar, and Matt Post. 2020. ParBLEU: Augmenting metrics with automatic paraphrases for the WMT'20 metrics shared task. In Proceedings of the Fifth Conference on Machine Translation (Volume 2: Shared Task Papers), Online. Association for Computational Linguistics.

Ondřej Bojar, Matouš Macháček, Aleš Tamchyna, and Daniel Zeman. 2013. Scratching the surface of possible translations. In Text, Speech, and Dialogue, pages 465–474, Berlin, Heidelberg. Springer Berlin Heidelberg.

Ondřej Bojar, Yvette Graham, and Amir Kamran. 2017. Results of the WMT17 Metrics Shared Task. In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, pages 489–513, Copenhagen, Denmark.

Ondřej Bojar, Yvette Graham, Amir Kamran, and Miloš Stanojević. 2016. Results of the WMT16 Metrics Shared Task. In Proceedings of the First Conference on Machine Translation, pages 199–231, Berlin, Germany.

Sheila Castilho, Joss Moorkens, Federico Gaspari, Andy Way, Panayota Georgakopoulou, Maria Gialama, Vilelmini Sosoni, and Rico Sennrich. 2017. Crowdsourcing for NMT evaluation: Professional translators versus the crowd. Translating and the Computer, 39.

Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Markus Dreyer and Daniel Marcu. 2012. HyTER: Meaning-equivalent semantics for translation evaluation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 162–171, Montréal, Canada. Association for Computational Linguistics.

Markus Freitag, David Grangier, and Isaac Caswell. 2020. BLEU might be guilty but references are not innocent. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 61–71, Online. Association for Computational Linguistics.

Yvette Graham and Timothy Baldwin. 2014. Testing for significance of increased correlation with human judgment. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 172–176, Doha, Qatar. Association for Computational Linguistics.

Yvette Graham, Timothy Baldwin, and Nitika Mathur. 2015. Accurate evaluation of segment-level machine translation metrics. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1183–1191, Denver, Colorado. Association for Computational Linguistics.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2013. Continuous measurement scales in human evaluation of machine translation. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 33–41, Sofia, Bulgaria. Association for Computational Linguistics.

Yvette Graham, Qingsong Ma, Timothy Baldwin, Qun Liu, Carla Parra, and Carolina Scarton. 2017. Improving evaluation of document-level machine translation quality estimation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 356–361, Valencia, Spain. Association for Computational Linguistics.

Yvette Graham, Nitika Mathur, and Timothy Baldwin. 2014. Randomized significance tests in machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 266–274, Baltimore, Maryland, USA. Association for Computational Linguistics.

Boris Iglewicz and David Caster Hoaglin. 1993. How to Detect and Handle Outliers, volume 16. ASQ Press.

Fabio Kepler, Jonay Trenous, Marcos Treviso, Miguel Vera, and André F. T. Martins. 2019. OpenKiwi: An open source framework for quality estimation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 117–122, Florence, Italy. Association for Computational Linguistics.
Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388–395, Barcelona, Spain. Association for Computational Linguistics.

Philipp Koehn, Vishrav Chaudhary, Ahmed El-Kishky, Naman Goyal, Peng-Jen Chen, and Francisco Guzmán. 2020. Findings of the WMT 2020 shared task on parallel corpus filtering and alignment. In Proceedings of the Fifth Conference on Machine Translation, Online. Association for Computational Linguistics.

Samuel Läubli, Sheila Castilho, Graham Neubig, Rico Sennrich, Qinlan Shen, and Antonio Toral. 2020. A set of recommendations for assessing human–machine parity in language translation. Journal of Artificial Intelligence Research, 67:653–672.

Christophe Leys, Christophe Ley, Olivier Klein, Philippe Bernard, and Laurent Licata. 2013. Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. Journal of Experimental Social Psychology, 49(4):764–766.

Chi-kiu Lo. 2019. YiSi - a unified semantic MT quality evaluation and estimation metric for languages with different levels of available resources. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 507–513, Florence, Italy. Association for Computational Linguistics.

Chi-kiu Lo. 2020. Extended study on using pretrained language models and YiSi-1 for machine translation evaluation. In Proceedings of the Fifth Conference on Machine Translation: Shared Task Papers.

Chi-kiu Lo and Samuel Larkin. 2020. Machine translation reference-less evaluation using YiSi-2 with bilingual mappings of massive multilingual language model. In Proceedings of the Fifth Conference on Machine Translation: Shared Task Papers.

Qingsong Ma, Ondřej Bojar, and Yvette Graham. 2018. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 671–688, Belgium, Brussels. Association for Computational Linguistics.

Qingsong Ma, Johnny Wei, Ondřej Bojar, and Yvette Graham. 2019. Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 62–90, Florence, Italy. Association for Computational Linguistics.

Matouš Macháček and Ondřej Bojar. 2014. Results of the WMT14 metrics shared task. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 293–301, Baltimore, Maryland, USA. Association for Computational Linguistics.

Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2019. Putting evaluation in context: Contextual embeddings improve machine translation evaluation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2799–2808, Florence, Italy. Association for Computational Linguistics.

Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2020. Tangled up in BLEU: Reevaluating the evaluation of automatic machine translation evaluation metrics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4984–4997, Online. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002a. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002b. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), pages 311–318, Philadelphia, USA.

Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal.

Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Belgium, Brussels. Association for Computational Linguistics.

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020a. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020b. Unbabel's participation in the WMT20 metrics shared task. In Proceedings of the Fifth Conference on Machine Translation, Online. Association for Computational Linguistics.
Peter J Rousseeuw and Mia Hubert. 2011. Robust statistics for outlier detection. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1):73–79.

Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020a. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892. Association for Computational Linguistics.

Thibault Sellam, Amy Pu, Hyung Won Chung, Sebastian Gehrmann, Qijun Tan, Markus Freitag, Dipanjan Das, and Ankur Parikh. 2020b. Learning to evaluate translation beyond English: BLEURT submissions to the WMT Metrics 2020 shared task. In Proceedings of the Fifth Conference on Machine Translation, Online. Association for Computational Linguistics.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, pages 223–231.

Peter Stanchev, Weiyue Wang, and Hermann Ney. 2019. EED: Extended edit distance measure for machine translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 514–520, Florence, Italy. Association for Computational Linguistics.

Brian Thompson and Matt Post. 2020. Automatic machine translation evaluation in many languages via zero-shot paraphrasing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online. Association for Computational Linguistics.

Antonio Toral, Sheila Castilho, Ke Hu, and Andy Way. 2018. Attaining the unattainable? Reassessing claims of human parity in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 113–123, Belgium, Brussels. Association for Computational Linguistics.

Tereza Vojtěchová, Michal Novák, Miloš Klouček, and Ondřej Bojar. 2019. SAO WMT19 test suite: Machine translation of audit reports. In Proceedings of the Fourth Conference on Machine Translation, Florence, Italy. Association for Computational Linguistics.

Weiyue Wang, Jan-Thorsten Peter, Hendrik Rosendahl, and Hermann Ney. 2016. CharacTer: Translation edit rate on character level. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 505–510, Berlin, Germany. Association for Computational Linguistics.

Evan James Williams. 1959. Regression Analysis. Wiley.

Jin Xu, Yinuo Guo, and Junfeng Hu. 2020. Incorporate semantic structures into machine translation evaluation via UCCA. In Proceedings of the Fifth Conference on Machine Translation, Online. Association for Computational Linguistics.
A List of Outliers
lp          Outliers
cs-en       ZLABS-NLP.1149, CUNI-DOCTRANSFORMER.1457
de-en       YOLO.1052, ZLABS-NLP.1153, WMTBIOMEDBASELINE.387
iu-en       NIUTRANS.1206, FACEBOOK AI.729
ja-en       ONLINE-G.1564, ZLABS-NLP.66, ONLINE-Z.1640
pl-en       ZLABS-NLP.1162
ru-en       ZLABS-NLP.1164
ta-en       ONLINE-G.1568, TALP UPC.192
zh-en       WMTBIOMEDBASELINE.183
en-cs       ZLABS-NLP.1151, ONLINE-G.1555
en-de       ZLABS-NLP.179, WMTBIOMEDBASELINE.388, ONLINE-G.1556
en-iu news  UEDIN.1281, OPPO.722, UQAM TANLE.521
en-iu full  UEDIN.1281, OPPO.722, UQAM TANLE.521
en-iu       UEDIN.1281, OPPO.722, UQAM TANLE.521
en-pl       ONLINE-Z.1634, ZLABS-NLP.180, ONLINE-A.1576
en-ta       TALP UPC.1049, SJTU-NICT.386, ONLINE-G.1561
Table 15: List of all MT systems that we consider as outliers
B Scatterplots
Here we show scatterplots of human and metric scores for selected metrics. We report the correlation of each metric with human scores on all systems, as well as on all systems minus the outliers. Note that we do not exclude human translations when computing these correlations.

In the following scatterplots, violet triangles indicate MT system submissions by researchers, and pink downward triangles are online systems.8 Red crosses are outlier systems. Black diamonds are human translations: for the newstest2020 reference set, this is the HUMAN-A translation, and for the newstestB2020 reference set, this is the HUMAN-B translation. The plots for English→German include two human translations, and we annotate the label in the plot. In many cases, metric errors in scoring these translations stand out.

Metric scores of MT systems computed with multiple references do not deviate from the scores computed with either reference alone, so we do not include the scatterplots of the other reference sets unless a human translation is included (which is interesting).
Scatterplots for all metrics over all reference sets will be included in the metrics package to be made available at http://www.statmt.org/wmt20/results.html.
cs-en

[Scatterplots omitted; panels plot metric scores (y-axis) against human z-scores (x-axis). Correlations (all systems / without outliers): BLEU r = 0.85/0.80; chrF r = 0.87/0.81; EED r = 0.88/0.84; OpenKiwiBert r = 0.73/0.70.]
8 We distinguish between the two in these scatterplots as we notice that metrics often make errors when scoring online systems.
de-en newstest2020 (excluding YOLO.1052: including this extremely low-quality system would make it hard to distinguish between the rest of the systems)

[Scatterplots omitted. Correlations (all systems / without outliers): BLEU r = 0.81/0.48; chrF r = 0.88/0.44; BLEURText r = 0.96/0.81; COMETQE r = 0.96/0.81.]
iu-en

[Scatterplots omitted. Correlations (all systems / without outliers): BLEU r = 0.57/0.35; chrF r = 0.73/0.34; COMET2R r = 0.87/0.64; COMETQE r = 0.68/0.66.]
ja-en

[Scatterplots omitted. Correlations (all systems / without outliers): BLEU r = 0.97/0.83; chrF r = 0.97/0.86; EED r = 0.97/0.90; YiSi2 r = 0.97/0.78.]
pl-en

[Scatterplots omitted. Correlations (all systems / without outliers): BLEU r = 0.55/0.36; chrF r = 0.53/0.31; BLEURT r = 0.59/0.37; YiSi2 r = 0.44/0.23.]
ru-en newstest2020

[Scatterplots omitted. Correlations (all systems / without outliers): BLEU r = 0.87/0.74; chrF r = 0.88/0.82; EED r = 0.92/0.86; COMETQE r = 0.87/0.75.]
ru-en newstestB2020

[Scatterplots omitted. Correlations (all systems / without outliers): BLEU r = 0.88/0.78; chrF r = 0.89/0.84; MEE r = 0.91/0.88; YiSi2 r = 0.82/0.81.]
ta-en

[Scatterplots omitted. Correlations (all systems / without outliers): BLEU r = 0.92/0.81; chrF r = 0.95/0.83; CharacTER r = 0.96/0.88; YiSi2 r = 0.85/0.76.]
zh-en newstest2020

[Scatterplots omitted. Correlations (all systems / without outliers): BLEU r = 0.94/0.94; chrF r = 0.97/0.95; parchrf++ r = 0.97/0.95; OpenKiwiXLMR r = 0.90/0.89.]
zh-en newstestB2020

[Scatterplots omitted. Correlations (all systems / without outliers): BLEU r = 0.94/0.93; chrF r = 0.97/0.95; parchrf++ r = 0.97/0.95; YiSi2 r = 0.96/0.93.]
km-en

[Scatterplots omitted. Correlations (all systems / without outliers): BLEU r = 0.97/0.97; chrF r = 0.98/0.98; EED r = 0.99/0.99; COMETQE r = 0.90/0.90.]
ps-en

[Scatterplots omitted. Correlations (all systems / without outliers): BLEU r = 0.89/0.89; chrF r = 0.90/0.90; prism r = 0.97/0.97; YiSi2 r = 0.94/0.94.]
en-cs

[Scatterplots omitted. Correlations (all systems / without outliers): BLEU r = 0.82/0.39; chrF r = 0.83/0.31; BLEURText r = 0.99/0.96; COMETQE r = 0.99/0.97.]
en-de newstest2020

[Scatterplots omitted; human translations P and B are annotated in the plots. Correlations (all systems / without outliers): BLEU r = 0.54/0.31; chrF r = 0.64/0.36; BLEURText r = 0.97/0.87; COMETQE r = 0.91/0.89.]
en-de newstestB2020

[Scatterplots omitted; human translations P and A are annotated in the plots. Correlations (all systems / without outliers): BLEU r = 0.53/0.38; chrF r = 0.59/0.39; BLEURText r = 0.97/0.88; COMETQE r = 0.90/0.85.]
en-de newstestP2020

[Scatterplots omitted; human translations B and A are annotated in the plots. Correlations (all systems / without outliers): BLEU r = 0.81/0.65; chrF r = 0.85/0.68; BLEURText r = 0.95/0.86; COMETQE r = 0.91/0.89.]
en-ja

[Scatterplots omitted. Correlations (all systems / without outliers): BLEU r = 0.95/0.95; chrF r = 0.95/0.95; esim r = 0.99/0.99; OpenKiwiXLMR r = 0.99/0.99.]
en-pl

[Scatterplots omitted. Correlations (all systems / without outliers): BLEU r = 0.94/0.74; chrF r = 0.96/0.79; BLEURText r = 0.98/0.83; COMETQE r = 0.97/0.80.]
en-ru

[Scatterplots omitted. Correlations (all systems / without outliers): BLEU r = 0.98/0.98; chrF r = 0.98/0.98; chrF++ r = 0.98/0.98; OpenKiwiXLMR r = 0.87/0.87.]
en-ta

[Scatterplots omitted. Correlations (all systems / without outliers): BLEU r = 0.88/0.83; chrF r = 0.94/0.89; EED r = 0.96/0.91; YiSi2 r = 0.92/0.92.]
en-zh newstest2020

[Scatterplots omitted. Correlations (all systems / without outliers): BLEU r = 0.66/0.66; chrF r = 0.65/0.65; EQ_static r = 0.93/0.93; OpenKiwiBert r = 0.49/0.49.]
en-zh newstestB2020

[Scatterplots omitted. Correlations (all systems / without outliers): BLEU r = 0.81/0.81; chrF r = 0.81/0.81; esim r = 0.92/0.92; OpenKiwiBert r = 0.52/0.52.]
en-iu Full test set

[Scatterplots omitted. Correlations (all systems / without outliers): BLEU r = 0.16/0.13; chrF r = 0.35/0.12; esim r = 0.81/0.37; COMETQE r = 0.91/0.58.]
en-iu Out of domain (News) subset

[Scatterplots omitted. Correlations (all systems / without outliers): BLEU r = 0.07/0.11; chrF r = 0.34/0.09; paresim1 r = 0.76/0.42; COMETQE r = 0.93/0.65.]
C Additional System-level Results
We also report the Kendall's Tau correlation of metrics at the system level.
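The system-level Kendall's Tau reported below measures how often a metric orders pairs of systems the same way as the human assessment. A minimal sketch, assuming SciPy's kendalltau and invented human/metric scores (not shared-task data):

```python
# Sketch: system-level Kendall's tau between human z-scores and metric scores.
# The five systems' scores below are invented for illustration.
from scipy.stats import kendalltau

human = [0.5, 0.3, 0.1, -0.2, -0.4]      # human z-scores for 5 systems
metric = [32.0, 30.5, 31.0, 27.0, 25.5]  # metric scores for the same systems

# One of the 10 system pairs is ordered differently by human and metric
# (systems 2 and 3), so tau = (9 concordant - 1 discordant) / 10 = 0.8.
tau, p_value = kendalltau(human, metric)
```

A single swapped pair among five systems already costs 0.2 tau, which is why the small to-English system pools make these scores sensitive to noise, as discussed in the conclusion.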
Metric             cs-en  de-en  ja-en  pl-en  ru-en  ta-en  zh-en  iu-en  km-en  ps-en
(n systems)           12     12     10     14     11     14     16     11      7      6
HUMAN RAW          0.727  0.758  0.778  0.429  0.673  0.604  0.650  0.891  0.905  1.000
SENTBLEU           0.788  0.758  0.733  0.297  0.564  0.692  0.850  0.455  0.619  0.600
BLEU               0.848  0.697  0.778  0.407  0.455  0.692  0.833  0.309  0.714  0.600
TER                0.758  0.788  0.689  0.287  0.600  0.780  0.800  0.514  0.878  0.867
CHRF++             0.818  0.697  0.778  0.407  0.673  0.714  0.850  0.418  0.619  0.733
CHRF               0.818  0.727  0.822  0.363  0.709  0.714  0.833  0.418  0.619  0.733
PARBLEU            0.809  0.779  0.778  0.420  0.491  0.685  0.807  0.404  0.714  0.867
PARCHRF++          0.818  0.727  0.822  0.407  0.709  0.714  0.817  0.491  0.619  0.733
CHARACTER          0.758  0.758  0.822  0.341  0.745  0.692  0.800  0.527  0.810  0.733
EED                0.788  0.727  0.733  0.297  0.782  0.758  0.833  0.636  0.714  0.733
YISI-0             0.758  0.758  0.689  0.231  0.782  0.802  0.833  0.600  0.714  0.733
SWSS+METEOR            −      −  0.822  0.341  0.818  0.736  0.817  0.491  0.714  0.733
MEE                0.758  0.697  0.867  0.363  0.709  0.692  0.783  0.636  0.714  0.733
PRISM              0.758  0.727  0.867  0.341  0.564  0.648  0.800  0.673  0.714  0.867
YISI-1             0.758  0.758  0.778  0.451  0.564  0.692  0.817  0.673  1.000  0.867
BERT-BASE-L2       0.758  0.848  0.822  0.407  0.491  0.604  0.633  0.564  1.000  0.867
BERT-LARGE-L2      0.758  0.848  0.867  0.341  0.564  0.626  0.700  0.527  1.000  0.867
MBERT-L2           0.758  0.818  0.822  0.429  0.564  0.604  0.750  0.673  1.000  0.867
BLEURT             0.758  0.788  0.822  0.407  0.600  0.604  0.650  0.527  1.000  0.867
BLEURT-EXTENDED    0.727  0.848  0.778  0.341  0.455  0.582  0.617  0.527  0.905  0.867
ESIM               0.727  0.848  0.822  0.451  0.491  0.670  0.717  0.636  1.000  0.867
PARESIM-1          0.727  0.879  0.822  0.451  0.491  0.670  0.700  0.636  1.000  0.867
COMET              0.727  0.758  0.778  0.407  0.564  0.626  0.733  0.636  1.000  0.867
COMET-2R           0.727  0.788  0.778  0.451  0.527  0.582  0.717  0.600  1.000  0.867
COMET-HTER         0.667  0.788  0.822  0.275  0.491  0.604  0.533  0.564  1.000  0.867
COMET-MQM          0.667  0.727  0.822  0.275  0.455  0.582  0.517  0.636  1.000  1.000
COMET-RANK         0.576  0.727  0.822  0.341  0.455  0.626  0.650  0.309  0.810  1.000
BAQ DYN                −      −      −      −      −      −  0.817      −      −      −
BAQ STATIC             −      −      −      −      −      −  0.867      −      −      −
COMET-QE           0.697  0.788  0.778  0.297  0.455  0.516  0.550  0.491  0.905  0.733
OPENKIWI-BERT      0.697  0.667  0.733  0.187  0.455  0.429  0.450 -0.055  0.714  0.467
OPENKIWI-XLMR      0.727  0.636  0.822  0.275  0.418  0.560  0.567  0.018  1.000  0.867
YISI-2             0.576  0.515  0.778  0.319  0.527  0.582  0.750  0.491  0.810  0.867
Table 16: Kendall Tau correlation of system-level metrics with DA human assessment for all MT systems, not including human translations. In addition to the metrics, we also include raw human scores where annotator scores were not standardised.
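The values in Tables 16 and 17 are Kendall Tau correlations between a metric's system-level scores and the corresponding human assessment scores. As an illustration only (not the organisers' exact scoring script), the following is a minimal pure-Python sketch of the tau-a statistic over paired metric and human scores for a set of systems:

```python
def kendall_tau(metric_scores, human_scores):
    """Kendall's tau-a: (concordant - discordant) / number of system pairs.

    A pair of systems is concordant when the metric and the human
    scores rank the two systems the same way, discordant otherwise.
    """
    assert len(metric_scores) == len(human_scores)
    n = len(metric_scores)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            m = metric_scores[i] - metric_scores[j]
            h = human_scores[i] - human_scores[j]
            if m * h > 0:
                concordant += 1
            elif m * h < 0:
                discordant += 1
    total = n * (n - 1) // 2
    return (concordant - discordant) / total
```

A metric that orders all systems exactly as the human assessment does scores 1.0; one that inverts every pairwise ranking scores -1.0, which is why values such as -0.055 for OPENKIWI-BERT on iu-en indicate near-random system ordering.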
                     en-cs  en-de  en-ja  en-pl  en-ru  en-ta  en-zh  en-iu  en-iu
                                                                     (full) (news)
# systems               12     14     11     14      9     15     12     11     11

HUMAN RAW            1.000  0.868  0.964  0.846  0.778  0.810  0.818  0.600  0.600
SENTBLEU             0.515  0.802  0.855  0.604  0.944  0.867  0.727  0.236  0.273
BLEU                 0.515  0.802  0.818  0.582  0.889  0.829  0.727  0.236  0.236
TER                  0.515  0.824  0.018  0.641  0.556  0.752  0.242  0.309  0.309
CHRF++               0.485  0.868  0.782  0.604  0.889  0.829  0.727  0.309  0.309
CHRF                 0.485  0.868  0.818  0.604  0.889  0.810  0.727  0.345  0.309
PARBLEU              0.504  0.736  0.611  0.633  0.761  0.842  0.718  0.404  0.345
PARCHRF++            0.515  0.846  0.818  0.670  0.889      −  0.727      −      −
CHARACTER            0.515  0.890  0.782  0.560  0.944  0.771  0.697  0.236  0.345
EED                  0.545  0.868  0.782  0.604  0.833  0.867  0.727  0.273  0.273
YISI-0               0.545  0.846  0.818  0.604  0.944  0.790  0.515  0.236  0.345
MEE                  0.576  0.802      −  0.582  0.667  0.829      −  0.273  0.382
PRISM                0.818  0.868  0.818  0.670  0.611  0.562  0.576  0.418  0.600
YISI-1               0.606  0.868  0.782  0.626  0.833  0.810  0.758  0.091  0.273
YISI-COMBI               −  0.824      −      −      −      −      −      −      −
BLEURT-YISI-COMBI        −  0.824      −      −      −      −      −      −      −
MBERT-L2             0.788  0.846  0.782  0.736  0.778  0.752  0.909      −      −
BLEURT-EXTENDED      0.879  0.802  0.782  0.780  0.833  0.771  0.848  0.382  0.345
ESIM                 0.606  0.912  0.855  0.692  0.833  0.752  0.788  0.382  0.455
PARESIM-1            0.667  0.890  0.818  0.692  0.833  0.752  0.818  0.382  0.455
COMET                0.909  0.846  0.745  0.736  0.722  0.771  0.606  0.382  0.382
COMET-2R             0.909  0.890  0.891  0.714  0.611  0.790  0.606  0.309  0.418
COMET-HTER           0.909  0.802  0.818  0.736  0.667  0.619  0.576  0.491  0.491
COMET-MQM            0.909  0.802  0.818  0.736  0.667  0.619  0.545  0.527  0.455
COMET-RANK           0.848  0.780  0.782  0.692  0.556  0.524  0.515  0.127  0.345
BAQ DYN                  −      −      −      −      −      −  0.697      −      −
BAQ STATIC               −      −      −      −      −      −  0.788      −      −
EQ DYN                   −      −      −      −      −      −  0.727      −      −
EQ STATIC                −      −      −      −      −      −  0.818      −      −
COMET-QE             0.848  0.802  0.709  0.802  0.667  0.543  0.576  0.600  0.673
OPENKIWI-BERT        0.758  0.780  0.236  0.538  0.722  0.314  0.606 -0.273  0.200
OPENKIWI-XLMR        0.909  0.780  0.818  0.692  0.667  0.657  0.545  0.018  0.200
YISI-2               0.485  0.582  0.527  0.077  0.444  0.886  0.121  0.309  0.455
Table 17: Kendall Tau correlation of out-of-English system-level metrics with DA human assessment for all MT systems, not including human translations. In addition to the metrics, we also include raw human scores where annotator scores were not standardised.