Proceedings of the Fourth Conference on Machine Translation (WMT), Volume 2: Shared Task Papers (Day 1), pages 62–90, Florence, Italy, August 1-2, 2019. ©2019 Association for Computational Linguistics
Results of the WMT19 Metrics Shared Task: Segment-Level and Strong MT Systems Pose Big Challenges
Qingsong Ma (Tencent-CSIG, AI Evaluation Lab)
Johnny Tian-Zheng Wei (UMass Amherst, CICS)
Ondřej Bojar (Charles University, MFF ÚFAL)
Yvette Graham (Dublin City University, ADAPT)
Abstract
This paper presents the results of the WMT19 Metrics Shared Task. Participants were asked to score the outputs of the translation systems competing in the WMT19 News Translation Task with automatic metrics. 13 research groups submitted 24 metrics, 10 of which are reference-less “metrics” and constitute submissions to the joint task with the WMT19 Quality Estimation Task, “QE as a Metric”. In addition, we computed 11 baseline metrics, with 8 commonly applied baselines (BLEU, SentBLEU, NIST, WER, PER, TER, CDER, and chrF) and 3 reimplementations (chrF+, sacreBLEU-BLEU, and sacreBLEU-chrF). Metrics were evaluated at the system level (how well a given metric correlates with the WMT19 official manual ranking) and at the segment level (how well the metric correlates with human judgements of segment quality). This year, we use direct assessment (DA) as our only form of manual evaluation.
1 Introduction
To determine system performance in machine translation (MT), it is often more practical to use an automatic evaluation rather than a manual one. Manual (human) evaluation can be costly and time consuming, so an automatic evaluation metric, provided that it correlates sufficiently with manual evaluation, can be useful in development cycles. In studies involving hyperparameter tuning or architecture search, automatic metrics are necessary because the amount of human effort required for manual evaluation is generally prohibitively large. As objective, reproducible quantities, metrics can also facilitate cross-paper comparisons. The WMT Metrics Shared Task¹ annually serves as a venue to validate the use of existing metrics (including baselines such as BLEU) and to develop new ones; see Koehn and Monz (2006) through Ma et al. (2018).
In the setup of our Metrics Shared Task, an automatic metric compares an MT system’s output translations with manual reference translations to produce either (a) a system-level score, i.e. a single overall score for the given MT system, (b) segment-level scores for each of the output translations, or both.
This year we teamed up with the organizers of the QE Task and hosted “QE as a Metric” as a joint task. In the setup of the Quality Estimation Task (Fonseca et al., 2019), no human-produced translations are provided to estimate the quality of output translations. Quality estimation (QE) methods are built to assess MT output based on the source or based on the translation itself. In this task, QE developers were invited to perform the same scoring as standard metrics participants, with the exception that they refrain from using a reference translation when producing their scores. We then evaluate the QE submissions in exactly the same way as regular metrics are evaluated (see below). From the point of view of correlation with manual judgements, there is no difference between metrics that use references and metrics that do not.
The source, reference texts, and MT system outputs for the Metrics task come from the News Translation Task (Barrault et al., 2019, which we denote as Findings 2019). The texts were drawn from the news domain and involve translations of English (en) to/from Czech (cs), German (de), Finnish (fi), Gujarati (gu), Kazakh (kk), Lithuanian (lt), Russian (ru), and Chinese (zh), but excluding cs-en (15 language pairs). Three other language pairs not including English were also manually evaluated as part of the News Translation Task: German→Czech and German↔French. In total, metrics could participate in 18 language pairs, with 10 target languages.

¹ http://www.statmt.org/wmt19/metrics-task.html
In the following, we first give an overview of the task (Section 2) and summarize the baseline (Section 3) and submitted (Section 4) metrics. The results for system- and segment-level evaluation are provided in Sections 5.1 and 5.2, respectively, followed by a joint discussion in Section 6.
2 Task Setup
This year, we provided task participants with one test set for each examined language pair, i.e. a set of source texts (which are commonly ignored by MT metrics), corresponding MT outputs (these are the key inputs to be scored) and a reference translation (withheld from the participants of the “QE as a Metric” track).
At the system level, metrics aim to correlate with a system’s score, which is an average over many human judgments of the quality of segment translations produced by the given system. At the segment level, metrics aim to produce scores that correlate best with a human ranking judgment of two output translations for a given source segment (more on the manual quality assessment in Section 2.3). Participants were free to choose which language pairs and tracks (system/segment and reference-based/reference-free) they wanted to take part in.
2.1 Source and Reference Texts
The source and reference texts we use are newstest2019 from this year’s WMT News Translation Task (see Findings 2019). This set contains approximately 2,000 sentences for each translation direction (except Gujarati, Kazakh and Lithuanian, which have approximately 1,000 sentences each, and German to/from French, which has 1,701 sentences).
The reference translations provided in newstest2019 were created in the same direction as the MT systems were translating. The exceptions are German→Czech, where both sides are translations from English, and German↔French, which followed last year’s practice. Last year and in the years before, the dataset consisted of two halves, one originating in the source language and one in the target language. This, however, led to adverse artifacts in MT evaluation.
2.2 System Outputs
The results of the Metrics Task are affected by the actual set of MT systems participating in a given translation direction. On the one hand, if all systems are very close in their translation quality, then even humans will struggle to rank them. This in turn will make the task for MT metrics very hard. On the other hand, if the task includes a wide range of systems of varying quality, correlating with humans should be generally easier (see Section 6.1 for a discussion). One can also expect that if the evaluated systems are of different types, they will exhibit different error patterns, and various MT metrics can be differently sensitive to these patterns.
This year, all MT systems included in the Metrics Task come from the News Translation Task (see Findings 2019). There are, however, still noticeable differences among the various language pairs.
• Unsupervised MT Systems. The German→Czech research systems were trained in an unsupervised fashion, i.e. without access to parallel Czech-German texts (except for a couple of thousand sentences used primarily for validation). We thus expect the research German-Czech systems to be “more creative” and to depart further from the references. The online systems in this language direction are, however, standard MT systems, so the German-Czech evaluation could be to some extent bimodal.
• EU Election. The French↔German translation was focused on a sub-domain of news, namely texts related to the EU Election. Various MT system developers may have invested more or less time in domain adaptation.
• Regular News Tasks Systems. These are all the other MT systems in the evaluation. As in previous years, they differ in whether they are trained only on WMT-provided data (“Constrained”) or not (“Unconstrained”). All the freely available web services (online MT systems) are deemed unconstrained.
Overall, the results are based on 233 systems across 18 language pairs.²
2.3 Manual Quality Assessment
Direct Assessment (DA, Graham et al., 2013, 2014a, 2016) was employed as the source of the “golden truth” to evaluate metrics again this year. The details of this method of human evaluation are provided in Findings 2019.
The basis of DA is to collect a large number of quality assessments (a number on a scale of 1–100, i.e. effectively a continuous scale) for the outputs of all MT systems. These scores are then standardized per annotator.
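The per-annotator standardization can be illustrated with a minimal sketch. It assumes that standardization means converting each annotator's raw scores to z-scores using that annotator's own mean and standard deviation; the function name, the tuple layout and the use of the population standard deviation are illustrative assumptions, not details taken from the WMT tooling.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_per_annotator(judgements):
    """Convert raw DA scores to z-scores, separately for each annotator.

    `judgements` is a list of (annotator, system, segment_id, raw_score)
    tuples; this layout is a hypothetical one chosen for the example.
    """
    scores_by_annotator = defaultdict(list)
    for annotator, _system, _segment, score in judgements:
        scores_by_annotator[annotator].append(score)

    # Mean and (population) standard deviation of each annotator's scores.
    stats = {a: (mean(s), pstdev(s)) for a, s in scores_by_annotator.items()}

    standardized = []
    for annotator, system, segment, score in judgements:
        mu, sigma = stats[annotator]
        # An annotator with constant scores carries no ranking information.
        z = (score - mu) / sigma if sigma > 0 else 0.0
        standardized.append((annotator, system, segment, z))
    return standardized
```

With this scheme a strict annotator who scores everything between 10 and 30 and a lenient one who scores between 50 and 100 end up on a comparable scale.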
In the past years, the underlying manual scores were reference-based (human judges had access to the same reference translation as the MT quality metric). This year, the official WMT19 scores are reference-based (or “monolingual”) for some language pairs and reference-free (or “bilingual”) for others.³
Due to these different types of golden truth collection, reference-based language pairs are a closer match for the standard reference-based metrics, while the reference-free language pairs are a better fit for the “QE as a Metric” subtask.
Note that system-level manual scores are different from those of the segment level. Since collecting enough DA judgements for each segment is infeasible for segment-level evaluation, we resort to converting DA judgements to golden truth expressed as relative rankings, see Section 2.3.2.

² This year, we do not use the artificially constructed “hybrid systems” (Graham and Liu, 2016) because the confidence in the ranking of system-level metrics is sufficient even without hybrids.

³ Specifically, the reference-based language pairs were those where the anticipated translation quality was lower or where the manual judgements were obtained with the help of anonymous crowdsourcing. Most of these cases were translations into English (fi-en, gu-en, kk-en, lt-en, ru-en and zh-en) and then the language pairs not involving English (de-cs, de-fr and fr-de). The reference-less (bilingual) evaluations were those where mainly MT researchers themselves were involved in the annotations: en-cs, en-de, en-fi, en-gu, en-kk, en-lt, en-ru, en-zh.
The exact methods used to calculate correlations of participating metrics with the golden truth are described below, in the two sections for system-level evaluation (Section 5.1) and segment-level evaluation (Section 5.2).
2.3.1 System-level Golden Truth: DA
For the system-level evaluation, the collected continuous DA scores, standardized for each annotator, are averaged across all assessed segments for each MT system to produce a scalar rating for the system’s performance.

The underlying set of assessed segments is different for each system. Because the system-level DA score is an average over many judgments, mean scores are consistent and have been found to be reproducible (Graham et al., 2013). For more details see Findings 2019.
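The averaging step described above can be sketched in a few lines. The input layout (pairs of system name and already standardized score) and the function name are assumptions made for this illustration only.

```python
from collections import defaultdict
from statistics import mean


def system_level_scores(standardized_judgements):
    """Average standardized DA scores over all assessed segments per system.

    `standardized_judgements` is an iterable of (system, z_score) pairs;
    the set of assessed segments may differ from system to system.
    """
    scores_by_system = defaultdict(list)
    for system, z_score in standardized_judgements:
        scores_by_system[system].append(z_score)
    return {system: mean(zs) for system, zs in scores_by_system.items()}
```

Each system thus receives a single scalar, which is the quantity that system-level metrics are later correlated with.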
2.3.2 Segment-level Golden Truth: daRR
Starting from Bojar et al. (2017), when WMT fully switched to DA, we had to come up with a solid golden standard for segment-level judgements. Standard DA scores are reliable only when averaged over a sufficient number of judgments.⁴

Fortunately, when we have at least two DA scores for translations of the same source input, it is possible to convert those DA scores into a relative ranking judgement, if the difference in DA scores allows the conclusion that one translation is better than the other. In the following, we denote these re-interpreted DA judgements as “daRR”, to distinguish them clearly from the relative ranking (“RR”) golden truth used in the past years.⁵

⁴ For segment-level evaluation, one would need to collect many manual evaluations of the exact same segment as produced by each MT system. Such a sampling would, however, be wasteful for the evaluation needed by WMT, so only some MT systems happen to be evaluated for a given input sentence. In principle, we would like to return to DA’s standard segment-level evaluation in future, where a minimum of 15 human judgements of translation quality are collected per translation and combined to get highly accurate scores for translations, but this would increase annotation costs.

⁵ Since the analogue rating scale employed by DA is marked at the 0-25-50-75-100 points, we use 25 points as the minimum required difference between two system scores to produce daRR judgements. Note that we rely on judgements collected from known-reliable volunteers and crowd-sourced workers who passed DA’s quality control mechanism. Any inconsistency that could arise from reliance on DA judgements collected from low quality crowd-sourcing is thus prevented.

newstest2019  DA>1    Ave    DA pairs  daRR
de-en         2,000   16.0   239,220   85,365
fi-en         1,996   9.5    83,168    38,307
gu-en         1,016   11.0   55,880    31,139
kk-en         1,000   11.0   55,000    27,094
lt-en         1,000   11.0   55,000    21,862
ru-en         1,999   11.9   131,766   46,172
zh-en         2,000   10.1   95,174    31,070
en-cs         1,997   9.1    75,560    27,178
en-de         1,997   19.1   347,109   99,840
en-fi         1,997   8.1    59,129    31,820
en-gu         998     6.9    21,854    11,355
en-kk         998     9.0    37,032    18,172
en-lt         998     9.0    36,435    17,401
en-ru         1,997   8.7    69,503    24,334
en-zh         1,997   9.8    87,501    18,658
de-cs         1,997   8.5    65,039    35,793
de-fr         1,605   4.1    12,055    4,862
fr-de         1,224   3.0    4,258     1,369

Table 1: Number of judgements for DA converted to daRR data; “DA>1” is the number of source input sentences in the manual evaluation where at least two translations of that same source input segment received a DA judgement; “Ave” is the average number of translations with at least one DA judgement available for the same source input sentence; “DA pairs” is the number of all possible pairs of translations of the same source input resulting from “DA>1”; and “daRR” is the number of DA pairs with an absolute difference in DA scores greater than the 25 percentage point margin.

From the complete set of human assessments collected for the News Translation Task, all possible pairs of DA judgements attributed to distinct translations of the same source were converted into daRR better/worse judgements. Distinct translations of the same source input whose DA scores fell within 25 percentage points (which could have been deemed equal quality) were omitted from the evaluation of segment-level metrics. Conversion of scores in this way produced a large set of daRR judgements for all language pairs, shown in Table 1, due to the combinatorial advantage of extracting daRR judgements from all possible pairs of translations of the same source input. We see that only German-French and especially French-German can suffer from an insufficient number of these simulated pairwise comparisons.
The daRR judgements serve as the golden standard for segment-level evaluation in WMT19.
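The conversion from DA scores to daRR judgements can be sketched as follows. This is a minimal illustration under assumptions: the function name and the dictionary input are hypothetical, the scores are taken to be on the 0-100 DA scale, and a pair is kept only when the absolute difference is strictly greater than 25 points, following the wording of the Table 1 caption.

```python
from itertools import combinations

MARGIN = 25  # minimum DA difference, in points, as stated in the text


def da_to_darr(da_scores, margin=MARGIN):
    """Turn DA scores for one source segment into better/worse pairs.

    `da_scores` maps a system name to the DA score of its translation of
    the same source segment. Returns a list of (better, worse) tuples;
    pairs whose scores differ by `margin` or less are dropped.
    """
    judgements = []
    for (sys_a, score_a), (sys_b, score_b) in combinations(sorted(da_scores.items()), 2):
        if abs(score_a - score_b) > margin:
            if score_a > score_b:
                judgements.append((sys_a, sys_b))
            else:
                judgements.append((sys_b, sys_a))
    return judgements
```

The combinatorial growth is easy to verify against Table 1: a segment with 11 judged translations yields 11 × 10 / 2 = 55 candidate pairs, so the 1,000 kk-en segments with 11.0 translations on average give 55,000 DA pairs, of which 27,094 survive the margin filter.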
3 Baseline Metrics
In addition to validating popular metrics, including baseline metrics serves as a comparison and prevents “loss of knowledge” as mentioned by Bojar et al. (2016).
Moses scorer⁶ is one of the MT evaluation tools that have aggregated several useful metrics over time. Since Macháček and Bojar (2013), we have been using Moses scorer to provide most of the baseline metrics and have kept encouraging authors of well-performing MT metrics to include them in Moses scorer.⁷
The baselines we report are:
BLEU and NIST. The metrics BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) were computed using mteval-v13a.pl⁸ from the OpenMT Evaluation Campaign. The tool includes its own tokenization. We run mteval with the flag --international-tokenization.⁹
TER, WER, PER and CDER. The metrics TER (Snover et al., 2006), WER, PER and CDER (Leusch et al., 2006) were produced by the Moses scorer, which is used in Moses model optimization. We used the standard tokenizer script as available in the Moses toolkit for tokenization.
sentBLEU. The metric sentBLEU is computed using the script sentence-bleu, a part of the Moses toolkit. It is a smoothed version of BLEU for scoring at the segment level. We used the standard tokenizer script as available in the Moses toolkit for tokenization.

⁶ https://github.com/moses-smt/mosesdecoder/blob/master/mert/evaluator.cpp

⁷ If you prefer standard BLEU, we recommend sacreBLEU (Post, 2018a), found at https://github.com/mjpost/sacreBLEU.

⁸ http://www.itl.nist.gov/iad/mig/tools/

⁹ International tokenization is found to perform slightly better (Macháček and Bojar, 2013).

Metric                      Features                            Learned?  Seg  Sys  Citation/Participant                                      Availability

Baselines
sentBLEU                    n-grams                                       •    −                                                              (mosesdecoder) mert/sentence-bleu
BLEU                        n-grams                                       −    •    Papineni et al. (2002)                                    (mosesdecoder) scripts/generic/mteval-v13a.pl
NIST                        n-grams                                       −    •    Doddington (2002)                                         (mosesdecoder) scripts/generic/mteval-v13a.pl
WER                         Levenshtein distance                          −    •    Leusch et al. (2006)                                      (mosesdecoder) mert/evaluator
TER                         edit distance, edit types                     −    •    Snover et al. (2006)                                      (mosesdecoder) mert/evaluator
PER                         edit distance, edit types                     −    •    Leusch et al. (2003)                                      (mosesdecoder) mert/evaluator
CDER                        edit distance, edit types                     −    •    Leusch et al. (2006)                                      (mosesdecoder) mert/evaluator
chrF                        character n-grams                             •    ⊘    Popović (2015)                                            http://github.com/m-popovic/chrF
chrF+                       character n-grams                             •    ⊘    Popović (2017)                                            http://github.com/m-popovic/chrF
sacreBLEU-BLEU              n-grams                                       −    •    Post (2018a)                                              http://github.com/mjpost/sacreBLEU
sacreBLEU-chrF              n-grams                                       −    •    Post (2018a)                                              http://github.com/mjpost/sacreBLEU

Metrics
BEER                        char. n-grams, permutation trees    yes       •    ⊘    Univ. of Amsterdam, ILCC (Stanojević and Sima’an, 2015)   http://github.com/stanojevic/beer
BERTr                       contextual word embeddings                    •    ⊘    Univ. of Melbourne (Mathur et al., 2019)                  http://github.com/nitikam/mteval-in-context
characTER                   char. edit distance, edit types               •    ⊘    RWTH Aachen Univ. (Wang et al., 2016a)                    http://github.com/rwth-i6/CharacTER
EED                         char. edit distance, edit types               •    ⊘    RWTH Aachen Univ. (Stanchev et al., 2019)                 http://github.com/rwth-i6/ExtendedEditDistance
ESIM                        learned neural representations      yes       •    ⊘    Univ. of Melbourne (Mathur et al., 2019)                  http://github.com/nitikam/mteval-in-context
LEPORa                      surface linguistic features                   •    ⊘    Dublin City University, ADAPT (Han et al., 2012, 2013)    http://github.com/poethan/LEPOR
LEPORb                      surface linguistic features                   •    ⊘    Dublin City University, ADAPT (Han et al., 2012, 2013)    http://github.com/poethan/LEPOR
Meteor++_2.0 (syntax)       word alignments                               •    ⊘    Peking University (Guo and Hu, 2019)                      −
Meteor++_2.0 (syntax+copy)  word alignments                               •    ⊘    Peking University (Guo and Hu, 2019)                      −
PReP                        pseudo-references, paraphrases                •    ⊘    Tokyo Metropolitan Univ. (Yoshimura et al., 2019)         http://github.com/kokeman/PReP
WMDO                        word mover distance                           •    ⊘    Imperial College London (Chow et al., 2019a)              −
YiSi-0                      semantic similarity                           •    ⊘    NRC (Lo, 2019)                                            http://github.com/chikiulo/YiSi
YiSi-1                      semantic similarity                           •    ⊘    NRC (Lo, 2019)                                            http://github.com/chikiulo/YiSi
YiSi-1_srl                  semantic similarity                           •    ⊘    NRC (Lo, 2019)                                            http://github.com/chikiulo/YiSi

QE Systems
IBM1-morpheme               LM log probs., ibm1 lexicon                   •    ⊘    Dublin City University, ADAPT (Popovic, 2012)             −
IBM1-pos4gram               LM log probs., ibm1 lexicon                   •    ⊘    Dublin City University, ADAPT (Popovic, 2012)             −
LP                          contextual word emb., MT log prob.  yes       •    ⊘    Univ. of Tartu (Yankovskaya et al., 2019)                 −
LASIM                       contextual word embeddings          yes       •    ⊘    Univ. of Tartu (Yankovskaya et al., 2019)                 −
UNI                         ?                                   ?         •    ⊘    ?                                                         ?
UNI+                        ?                                   ?         •    ⊘    ?                                                         ?
USFD                        ?                                   ?         •    ⊘    Univ. of Sheffield                                        ?
USFD-TL                     ?                                   ?         •    ⊘    Univ. of Sheffield                                        ?
YiSi-2                      semantic similarity                           •    ⊘    NRC (Lo, 2019)                                            http://github.com/chikiulo/YiSi
YiSi-2_srl                  semantic similarity                           •    ⊘    NRC (Lo, 2019)                                            http://github.com/chikiulo/YiSi

Table 2: Participants of the WMT19 Metrics Shared Task. “•” denotes that the metric took part in (some of the language pairs of) the segment- and/or system-level evaluation. “⊘” indicates that the system-level scores are implied, simply taking the arithmetic (macro-)average of segment-level scores. “−” indicates that the metric did not participate in the track (Seg/Sys-level). A metric is learned if it is trained on a QE or metric evaluation dataset (i.e. pretraining or parsers don’t count, but training on WMT 2017 metrics task data does). For the baseline metrics available in the Moses toolkit, paths are relative to http://github.com/moses-smt/mosesdecoder/.
chrF and chrF+. The metrics chrF and chrF+ (Popović, 2015, 2017) are computed using their original Python implementation, see Table 2. We ran chrF++.py with the parameters -nw 0 -b 3 to obtain the chrF score and with -nw 1 -b 3 to obtain the chrF+ score. Note that chrF intentionally removes all spaces before matching the n-grams, detokenizing the segments but also concatenating words.¹⁰
sacreBLEU-BLEU and sacreBLEU-chrF. The metrics sacreBLEU-BLEU and sacreBLEU-chrF (Post, 2018a) are re-implementations of BLEU and chrF, respectively. We ran sacreBLEU-chrF with the same parameters as chrF, but their scores are slightly different. The signature strings produced by sacreBLEU for BLEU and chrF, respectively, are BLEU+case.lc+lang.de-en+numrefs.1+smooth.exp+tok.intl+version.1.3.6 and chrF3+case.mixed+lang.de-en+numchars.6+numrefs.1+space.False+tok.13a+version.1.3.6.
The baselines serve in system- and segment-level evaluations as customary: BLEU, TER, WER, PER, CDER, sacreBLEU-BLEU and sacreBLEU-chrF for system-level only; sentBLEU for segment-level only; and chrF for both.
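To make the character n-gram idea behind chrF concrete, the following is a simplified sketch, not the chrF++.py or sacreBLEU implementation. It assumes character n-grams of order 1 to 6, removal of all spaces before matching, uniform averaging of precision and recall over the orders, and an F-score with β = 3 (corresponding to the -b 3 setting above); the function names are hypothetical and the official tools may differ in details.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing all spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf_sketch(hypothesis, reference, max_n=6, beta=3.0):
    """Simplified chrF-style score in [0, 1] for one segment."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # segment too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: recall is weighted beta^2 times as much as precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Because spaces are removed, "the cat" and "thecat" receive identical n-gram profiles, which is the concatenation effect noted in the chrF description.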
Chinese word segmentation is unfortunately not supported by the tokenization scripts mentioned above. For scoring Chinese with baseline metrics, we thus pre-processed MT outputs and reference translations with the script tokenizeChinese.py¹¹ by Shujian Huang, which separates Chinese characters from each other and also from non-Chinese parts.
¹⁰ We originally planned to use the chrF implementation which was recently made available in Moses scorer, but it mishandles Unicode characters for now.

¹¹ http://hdl.handle.net/11346/WMT17-TVXH
4 Submitted Metrics
Table 2 lists the participants of the WMT19 Metrics Shared Task, along with their metrics and links to the source code where available. We have collected 24 metrics from a total of 13 research groups, with 10 reference-less “metrics” submitted to the joint task “QE as a Metric” with the WMT19 Quality Estimation Task.
The rest of this section provides a brief summary of all the metrics that participated.
4.1 BEER
BEER (Stanojević and Sima’an, 2015) is a trained evaluation metric with a linear model that combines sub-word feature indicators (character n-grams) and global word order features (skip bigrams) to achieve a language-agnostic and fast-to-compute evaluation metric. BEER has participated in previous years of the evaluation task.
4.2 BERTr
BERTr (Mathur et al., 2019) uses contextual word embeddings to compare the MT output with the reference translation.

The BERTr score of a translation is the average recall score over all tokens, using a relaxed version of token matching based on BERT embeddings: namely, computing the maximum cosine similarity between the embedding of a reference token and any token in the MT output. BERTr uses bert_base_uncased embeddings for the to-English language pairs, and bert_base_multilingual_cased embeddings for all other language pairs.
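The relaxed recall described above can be sketched independently of BERT. In the following illustration, plain lists of floats stand in for the contextual token embeddings, which in the actual metric would come from a BERT model; the function names are hypothetical and no claim is made about the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embedding_recall(ref_vectors, hyp_vectors):
    """Average, over reference tokens, of the best match in the MT output.

    Each reference token is scored by its maximum cosine similarity with
    any MT output token; the scores are then averaged.
    """
    if not ref_vectors or not hyp_vectors:
        return 0.0
    best = [max(cosine(r, h) for h in hyp_vectors) for r in ref_vectors]
    return sum(best) / len(best)
```

With two orthogonal reference token vectors and an MT output that covers only one of them, the score is 0.5: one reference token is matched perfectly and the other not at all.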
4.3 CharacTER
CharacTER (Wang et al., 2016b,a), identical to the 2016 setup, is a character-level metric inspired by the commonly applied translation edit rate (TER). It is defined as the minimum number of character edits required to adjust a hypothesis until it completely matches the reference, normalized by the length of the hypothesis sentence. CharacTER calculates the character-level edit distance while performing the shift edit on word level. Unlike the strict matching criterion in TER, a hypothesis word is considered to match a reference word and could be shifted, if the edit distance between them is below a threshold value. The Levenshtein distance between the reference and the shifted hypothesis sequence is computed on the character level. In addition, the lengths of hypothesis sequences instead of reference sequences are used for normalizing the edit distance, which effectively counters the issue that shorter translations normally achieve lower TER.
Similarly to other character-level metrics, CharacTER is generally applied to non-tokenized outputs and references, which also holds for this year’s submission with one exception. This year tokenization was carried out for en-ru hypotheses and references before calculating the scores, since this results in large improvements in terms of correlations. For other language pairs, no tokenizer was used for pre-processing.
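The normalization by hypothesis length can be illustrated with a reduced sketch. It computes only the character-level Levenshtein distance and divides by the hypothesis length; the word-level shift edit that is central to CharacTER is deliberately omitted, so this is not the CharacTER metric itself, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def char_edit_rate(hypothesis, reference):
    """Edit distance normalized by the hypothesis length (lower is better)."""
    if not hypothesis:
        return 1.0 if reference else 0.0
    return levenshtein(hypothesis, reference) / len(hypothesis)
```

Dividing by the hypothesis length means that a very short hypothesis needing many insertions receives a high (bad) rate, whereas normalizing by the reference length would cap the penalty.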
4.4 EED
EED (Stanchev et al., 2019) is a character-based metric which builds upon CDER. It is defined as the minimum number of operations of an extension to the conventional edit distance containing a “jump” operation. The edit distance operations (insertions, deletions and substitutions) are performed at the character level, and jumps are performed when a blank space is reached. Furthermore, the coverage of multiple characters in the hypothesis is penalised by the introduction of a coverage penalty. The sum of the length of the reference and the coverage penalty is used as the normalisation term.
4.5 ESIM
Enhanced Sequential Inference Model (ESIM; Chen et al., 2017; Mathur et al., 2019) is a neural model proposed for Natural Language Inference that has been adapted for MT evaluation. It uses cross-sentence attention and sentence matching heuristics to generate a representation of the translation and the reference, which is fed to a feedforward regressor. The metric is trained on singly-annotated Direct Assessment data that has been collected for evaluating WMT systems: all WMT 2018 to-English data for the to-English language pairs, and all WMT 2018 data for all other language pairs.
4.6 hLEPORb_baseline, hLEPORa_baseline
The submitted metric hLEPOR_baseline is based on the factor combination of length penalty, precision, recall, and position difference penalty. The weighted harmonic mean is applied to group the factors together with tunable weight parameters. The system-level score is calculated with the same formula but with each factor weighted using a weight estimated at the system level rather than at the segment level.

In this submitted baseline version, hLEPOR_baseline was not tuned for each language pair separately; the default weights were applied across all submitted language pairs. Further improvements can be achieved by tuning the weights according to the development data, adding morphological information and applying n-gram factor scores (e.g. part-of-speech, n-gram precision and n-gram recall, which were added to LEPOR in WMT13). The basic model factors and further development with parameter settings are described in Han et al. (2012) and Han et al. (2013).

For the sentence-level score, only hLEPORa_baseline was submitted, with scores calculated as the weighted harmonic mean of all the designed factors using default parameters.

For the system-level score, both hLEPORa_baseline and hLEPORb_baseline were submitted, where hLEPORa_baseline is the average score of all sentence-level scores, and hLEPORb_baseline is calculated via the same sentence-level hLEPOR equation but replacing each factor value with its system-level counterpart.
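The weighted harmonic mean used to combine the factors can be sketched generically. The sketch below shows only the mean itself; the factor values and weights in the example are made up for illustration, and the exact hLEPOR formula and its default weights are those given by Han et al. (2012, 2013), not this code.

```python
def weighted_harmonic_mean(factors, weights):
    """Weighted harmonic mean: sum(w_i) / sum(w_i / x_i).

    Returns 0.0 if any factor is zero or negative, since the harmonic
    mean is dominated by its smallest component.
    """
    if len(factors) != len(weights):
        raise ValueError("factors and weights must have the same length")
    if any(f <= 0 for f in factors):
        return 0.0
    return sum(weights) / sum(w / f for w, f in zip(weights, factors))
```

For instance, `weighted_harmonic_mean([0.5, 1.0], [1, 1])` gives 2/3, noticeably below the arithmetic mean of 0.75, which illustrates how a single weak factor pulls the combined score down.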
4.7 Meteor++_2.0 (syntax), Meteor++_2.0 (syntax+copy)
Meteor++ 2.0 (Guo and Hu, 2019) is a metric based on Meteor (Denkowski and Lavie, 2014) that takes syntactic-level paraphrase knowledge into consideration, where paraphrases may sometimes be skip-grams, e.g. (protect...from, protect...against). As the original Meteor-based metrics only pay attention to consecutive string matching, they perform badly when reference-hypothesis pairs contain skip n-gram paraphrases. Meteor++ 2.0 extracts the knowledge from the Paraphrase Database (PPDB; Bannard and Callison-Burch, 2005) and integrates it into Meteor-based metrics.
4.8 PReP
PReP (Yoshimura et al., 2019) is a method for filtering pseudo-references to achieve a good match with a gold reference.
At the beginning, the source sentence is translated with some off-the-shelf MT systems to create a set of pseudo-references. (Here the MT systems were Google Translate and Microsoft Bing Translator.) The pseudo-references are then filtered using BERT (Devlin et al., 2019) fine-tuned on the MRPC corpus (Dolan and Brockett, 2005), which estimates the probability that a pseudo-reference is a paraphrase of the gold reference. Thanks to the high quality of the underlying MT systems, a large portion of their outputs is indeed considered a valid paraphrase.
The final metric score is calculated simply with SentBLEU with these multiple references.
4.9 WMDO
WMDO (Chow et al., 2019b) is a metric based on distance between distributions in the semantic vector space. Matching in the semantic space has been investigated for translation evaluation, but the constraints of a translation’s word order have not been fully explored. Building on the Word Mover’s Distance metric and various word embeddings, WMDO introduces a fragmentation penalty to account for fluency of a translation. This word order extension is shown to perform better than standard WMD, with promising results against other types of metrics.
4.10 YiSi-0, YiSi-1, YiSi-1_srl, YiSi-2,YiSi-2_srl
YiSi (Lo, 2019) is a unified semantic MT quality evaluation and estimation metric for languages with different levels of available resources.
YiSi-1 is an MT evaluation metric that measures the semantic similarity between a machine translation and human references by aggregating the idf-weighted lexical semantic similarities based on the contextual embeddings extracted from BERT and optionally incorporating shallow semantic structures (denoted as YiSi-1_srl).
YiSi-0 is the degenerate version of YiSi-1 that is ready to deploy to any language. It uses the longest common character substring to measure the lexical similarity.
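A longest-common-substring similarity between two words can be sketched as follows. The dynamic-programming routine is standard; the normalization by the length of the longer word is an assumption made purely for this illustration, since the text above does not specify how YiSi-0 turns the substring length into a similarity, and the function names are hypothetical.

```python
def longest_common_substring_len(a, b):
    """Length of the longest contiguous character sequence shared by a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, 1):
            # Extend the common run ending at the previous characters, or reset.
            current.append(previous[j - 1] + 1 if char_a == char_b else 0)
            best = max(best, current[-1])
        previous = current
    return best


def lcs_similarity(word_a, word_b):
    """Similarity in [0, 1]; normalizing by the longer word is an assumption."""
    longest = max(len(word_a), len(word_b))
    if longest == 0:
        return 0.0
    return longest_common_substring_len(word_a, word_b) / longest
```

Such a measure needs no embeddings or other language resources, which is what makes a metric of this kind deployable to any language.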
YiSi-2 is the bilingual, reference-less version for MT quality estimation, which uses the contextual embeddings extracted from BERT to evaluate the crosslingual lexical semantic similarity between the input and MT output. Like YiSi-1, YiSi-2 can exploit shallow semantic structures as well (denoted as YiSi-2_srl).
4.11 QE Systems
In addition to the submitted standard metrics, 10 quality estimation systems were submitted to the “QE as a Metric” track. The submitted QE systems are evaluated in the same settings as metrics to facilitate comparison. Their descriptions can be found in the Findings of the WMT 2019 Shared Task on Quality Estimation (Fonseca et al., 2019).
5 Results
We discuss system-level results for news task systems in Section 5.1. The segment-level results are in Section 5.2.
5.1 System-Level Evaluation
As in previous years, we employ the Pearson correlation (r) as the main evaluation measure for system-level metrics. The Pearson correlation is as follows:
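$$ r = \frac{\sum_{i=1}^{n} (H_i - \bar{H})(M_i - \bar{M})}{\sqrt{\sum_{i=1}^{n} (H_i - \bar{H})^2}\,\sqrt{\sum_{i=1}^{n} (M_i - \bar{M})^2}} \tag{1} $$

where $H_i$ are the human assessment scores of all systems in a given translation direction and $M_i$ are the corresponding scores as predicted by a given metric. $\bar{H}$ and $\bar{M}$ are their means, respectively.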
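Equation (1) translates directly into code. The following is a minimal sketch with hypothetical names, written without external libraries; `human` holds the system-level human scores $H_i$ and `metric` the corresponding metric scores $M_i$ for the same systems in the same order.

```python
import math


def pearson_r(human, metric):
    """Pearson correlation between human scores H_i and metric scores M_i."""
    n = len(human)
    if n != len(metric) or n < 2:
        raise ValueError("need two equally long score lists with n >= 2")
    h_bar = sum(human) / n
    m_bar = sum(metric) / n
    covariance = sum((h - h_bar) * (m - m_bar) for h, m in zip(human, metric))
    spread_h = math.sqrt(sum((h - h_bar) ** 2 for h in human))
    spread_m = math.sqrt(sum((m - m_bar) ** 2 for m in metric))
    if spread_h == 0 or spread_m == 0:
        raise ValueError("correlation is undefined for constant scores")
    return covariance / (spread_h * spread_m)
```

Calling `abs(pearson_r(human, metric))` gives the magnitude of the correlation regardless of whether the metric rewards quality (positive r) or counts errors (negative r).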
Since some metrics, such as BLEU, aim to achieve a strong positive correlation with human assessment, while error metrics, such as TER, aim for a strong negative correlation, we compare metrics via the absolute value |r| of a
newstest2019                de-en  fi-en  gu-en  kk-en  lt-en  ru-en  zh-en
n                           16     12     11     11     11     14     15
Correlation                 |r|    |r|    |r|    |r|    |r|    |r|    |r|

BEER                        0.906  0.993  0.952  0.986  0.947  0.915  0.942
BERTr                       0.926  0.984  0.938  0.990  0.948  0.971  0.974
BLEU                        0.849  0.982  0.834  0.946  0.961  0.879  0.899
CDER                        0.890  0.988  0.876  0.967  0.975  0.892  0.917
CharacTER                   0.898  0.990  0.922  0.953  0.955  0.923  0.943
chrF                        0.917  0.992  0.955  0.978  0.940  0.945  0.956
chrF+                       0.916  0.992  0.947  0.976  0.940  0.945  0.956
EED                         0.903  0.994  0.976  0.980  0.929  0.950  0.949
ESIM                        0.941  0.971  0.885  0.986  0.989  0.968  0.988
hLEPORa_baseline            −      −      −      0.975  −      −      0.947
hLEPORb_baseline            −      −      −      0.975  0.906  −      0.947
Meteor++_2.0(syntax)        0.887  0.995  0.909  0.974  0.928  0.950  0.948
Meteor++_2.0(syntax+copy)   0.896  0.995  0.900  0.971  0.927  0.952  0.952
NIST                        0.813  0.986  0.930  0.942  0.944  0.925  0.921
PER                         0.883  0.991  0.910  0.737  0.947  0.922  0.952
PReP                        0.575  0.614  0.773  0.776  0.494  0.782  0.592
sacreBLEU.BLEU              0.813  0.985  0.834  0.946  0.955  0.873  0.903
sacreBLEU.chrF              0.910  0.990  0.952  0.969  0.935  0.919  0.955
TER                         0.874  0.984  0.890  0.799  0.960  0.917  0.840
WER                         0.863  0.983  0.861  0.793  0.961  0.911  0.820
WMDO                        0.872  0.987  0.983  0.998  0.900  0.942  0.943
YiSi-0                      0.902  0.993  0.993  0.991  0.927  0.958  0.937
YiSi-1                      0.949  0.989  0.924  0.994  0.981  0.979  0.979
YiSi-1_srl                  0.950  0.989  0.918  0.994  0.983  0.978  0.977

QE as a Metric:
ibm1-morpheme               0.345  0.740  −      −      0.487  −      −
ibm1-pos4gram               0.339  −      −      −      −      −      −
LASIM                       0.247  −      −      −      −      0.310  −
LP                          0.474  −      −      −      −      0.488  −
UNI                         0.846  0.930  −      −      −      0.805  −
UNI+                        0.850  0.924  −      −      −      0.808  −
YiSi-2                      0.796  0.642  0.566  0.324  0.442  0.339  0.940
YiSi-2_srl                  0.804  −      −      −      −      −      0.947
Table 3: Absolute Pearson correlation of to-English system-level metrics with DA human assessment innewstest2019; correlations of metrics not significantly outperformed by any other for that language pairare highlighted in bold.
                  en-cs  en-de  en-fi  en-gu  en-kk  en-lt  en-ru  en-zh
n                 11     22     12     11     11     12     12     12
Correlation       |r|    |r|    |r|    |r|    |r|    |r|    |r|    |r|
BEER              0.990  0.983  0.989  0.829  0.971  0.982  0.977  0.803
BLEU              0.897  0.921  0.969  0.737  0.852  0.989  0.986  0.901
CDER              0.985  0.973  0.978  0.840  0.927  0.985  0.993  0.905
CharacTER         0.994  0.986  0.968  0.910  0.936  0.954  0.985  0.862
chrF              0.990  0.979  0.986  0.841  0.972  0.981  0.943  0.880
chrF+             0.991  0.981  0.986  0.848  0.974  0.982  0.950  0.879
EED               0.993  0.985  0.987  0.897  0.979  0.975  0.967  0.856
ESIM              −      0.991  0.957  −      0.980  0.989  0.989  0.931
hLEPORa_baseline  −      −      −      0.841  0.968  −      −      −
hLEPORb_baseline  −      −      −      0.841  0.968  0.980  −      −
NIST              0.896  0.321  0.971  0.786  0.930  0.993  0.988  0.884
PER               0.976  0.970  0.982  0.839  0.921  0.985  0.981  0.895
sacreBLEU.BLEU    0.994  0.969  0.966  0.736  0.852  0.986  0.977  0.801
sacreBLEU.chrF    0.983  0.976  0.980  0.841  0.967  0.966  0.985  0.796
TER               0.980  0.969  0.981  0.865  0.940  0.994  0.995  0.856
WER               0.982  0.966  0.980  0.861  0.939  0.991  0.994  0.875
YiSi-0            0.992  0.985  0.987  0.863  0.974  0.974  0.953  0.861
YiSi-1            0.962  0.991  0.971  0.909  0.985  0.963  0.992  0.951
YiSi-1_srl        −      0.991  −      −      −      −      −      0.948
QE as a Metric:
ibm1-morpheme     0.871  0.870  0.084  −      −      0.810  −      −
ibm1-pos4gram     −      0.393  −      −      −      −      −      −
LASIM             −      0.871  −      −      −      −      0.823  −
LP                −      0.569  −      −      −      −      0.661  −
UNI               0.028  0.841  0.907  −      −      −      0.919  −
UNI+              −      −      −      −      −      −      0.918  −
USFD              −      0.224  −      −      −      −      0.857  −
USFD-TL           −      0.091  −      −      −      −      0.771  −
YiSi-2            0.324  0.924  0.696  0.314  0.339  0.055  0.766  0.097
YiSi-2_srl        −      0.936  −      −      −      −      −      0.118
Table 4: Absolute Pearson correlation of out-of-English system-level metrics with DA human assessment in newstest2019; correlations of metrics not significantly outperformed by any other for that language pair are highlighted in bold.
[Figure 1 panels, one metric-by-metric significance matrix per language pair: de-en, fi-en, gu-en, kk-en, lt-en, ru-en, zh-en, en-cs, en-de, en-fi, en-gu, en-kk, en-lt, en-ru, en-zh. Rows and columns of each matrix list the metrics that participated in that language pair.]
Figure 1: System-level metric significance test results for DA human assessment for into-English and out-of-English language pairs (newstest2019): green cells denote a statistically significant increase in correlation with human assessment for the metric in a given row over the metric in a given column according to the Williams test.
given metric’s correlation with human assessment.
5.1.1 System-Level Results
Tables 3, 4 and 5 provide the system-level correlations of metrics evaluating translation of newstest2019. The underlying texts are part of the WMT19 News Translation test set (newstest2019) and the underlying MT systems are all MT systems participating in the WMT19 News Translation Task.
As recommended by Graham and Baldwin (2014), we employ the Williams significance test (Williams, 1959) to identify differences in correlation that are statistically significant. The Williams test is a test of significance of a difference in dependent correlations and is therefore suitable for evaluation of metrics. Correlations not significantly outperformed by any other metric for the given language pair are highlighted in bold in Tables 3, 4 and 5.
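As an illustration, the Williams test statistic for two metrics scored against the same human assessments can be sketched as follows. The formula is the commonly stated form of the test for dependent correlations; the function name `williams_t` and the example correlations are hypothetical, and the sketch stops at the statistic. A p-value would be read from a Student's t distribution with n − 3 degrees of freedom, which is left out to keep the example free of external dependencies.

```python
import math


def williams_t(r12, r13, r23, n):
    """Williams test statistic for the difference of two dependent correlations.

    r12: correlation of human scores with metric A
    r13: correlation of human scores with metric B
    r23: correlation between metric A and metric B
    n:   number of MT systems (sample size)
    """
    if n <= 3:
        raise ValueError("the test needs more than three systems")
    # Determinant of the 3x3 correlation matrix.
    k = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23
    numerator = (r12 - r13) * math.sqrt((n - 1) * (1 + r23))
    denominator = math.sqrt(
        2 * k * (n - 1) / (n - 3) + ((r12 + r13) ** 2 / 4) * (1 - r23) ** 3
    )
    return numerator / denominator


# Hypothetical example: metric A correlates at 0.95 and metric B at 0.90 with
# human scores, the two metrics correlate at 0.92 with each other, 16 systems.
t_stat = williams_t(0.95, 0.90, 0.92, 16)
print(t_stat)
```

A positive statistic favours metric A; whether the difference is significant depends on the chosen significance level and on n.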
Since pairwise comparisons of metrics may also be of interest, e.g. to learn which metrics significantly outperform the most widely employed metric BLEU, we include significance test results for every competing pair of metrics, including our baseline metrics, in Figure 1 and Figure 2.
This year, the increased number of systems participating in the news tasks has provided a larger sample of system scores for testing metrics. Since we already have sufficiently conclusive results on genuine MT systems, we do not need to generate hybrid system results as in Graham and Liu (2016) and past metrics tasks.
5.2 Segment-Level Evaluation
Segment-level evaluation relies on the manual judgements collected in the News Translation Task evaluation. This year, we were again unable to follow the methodology outlined in Graham et al. (2015) for evaluation of segment-level metrics because the sampling of sentences did not provide a sufficient number of assessments of the same segment. We therefore convert pairs of DA scores for competing translations to daRR better/worse preferences as described in Section 2.3.2.
We measure the quality of metrics’ segment-level scores against the daRR golden truth using a Kendall’s Tau-like formulation, which is an adaptation of the conventional Kendall’s Tau coefficient. Since we do not have a total order ranking of all translations, it is not possible to apply conventional Kendall’s Tau (Graham et al., 2015).
Our Kendall’s Tau-like formulation, τ, is as follows:
\tau = \frac{|\mathit{Concordant}| - |\mathit{Discordant}|}{|\mathit{Concordant}| + |\mathit{Discordant}|} \quad (2)
where Concordant is the set of all human comparisons for which a given metric suggests the same order and Discordant is the set of all human comparisons for which a given metric disagrees. The formula is not specific with respect to ties, i.e. cases where the annotation says that the two outputs are equally good.
The way in which ties (both in human and metric judgement) were incorporated in computing Kendall τ has changed across the years of WMT Metrics Tasks. Here we adopt the version used in WMT17 daRR evaluation. For a detailed discussion on other options, see also Macháček and Bojar (2014).
Whether or not a given comparison of a pair of distinct translations of the same source input, s1 and s2, is counted as a concordant (Conc) or discordant (Disc) pair is defined by the following matrix:
                      Metric
                s1 < s2  s1 = s2  s1 > s2
Human  s1 < s2  Conc     Disc     Disc
       s1 = s2  −        −        −
       s1 > s2  Disc     Disc     Conc
In the notation of Macháček and Bojar (2014), this corresponds to the setup used in WMT12 (with a different underlying method of manual judgements, RR):
                Metric
WMT12       <    =    >
Human  <    1   -1   -1
       =    X    X    X
       >   -1   -1    1
The key differences between the evaluation used in WMT14–WMT16 and the evaluation used in WMT17–WMT19 were (1) the move from RR to daRR and (2) the treatment of ties. In the years 2014–2016, ties in metrics scores were not penalized. With the move to daRR, where the quality of the two candidate translations
                  de-cs  de-fr  fr-de
n                 11     11     10
Correlation       |r|    |r|    |r|
BEER              0.978  0.941  0.848
BLEU              0.941  0.891  0.864
CDER              0.864  0.949  0.852
CharacTER         0.965  0.928  0.849
chrF              0.974  0.931  0.864
chrF+             0.972  0.936  0.848
EED               0.982  0.940  0.851
ESIM              0.980  0.950  0.942
hLEPORa_baseline  0.941  0.814  −
hLEPORb_baseline  0.959  0.814  −
NIST              0.954  0.916  0.862
PER               0.875  0.857  0.899
sacreBLEU-BLEU    0.869  0.891  0.869
sacreBLEU-chrF    0.975  0.952  0.882
TER               0.890  0.956  0.895
WER               0.872  0.956  0.894
YiSi-0            0.978  0.952  0.820
YiSi-1            0.973  0.969  0.908
YiSi-1_srl        −      −      0.912
QE as a Metric:
ibm1-morpheme     0.355  0.509  0.625
ibm1-pos4gram     −      0.085  0.478
YiSi-2            0.606  0.721  0.530
Table 5: Absolute Pearson correlation of system-level metrics for language pairs not involving English with DA human assessment in newstest2019; correlations of metrics not significantly outperformed by any other for that language pair are highlighted in bold.
[Figure 2 panels, one metric-by-metric significance matrix per language pair: de-cs, de-fr, fr-de. Rows and columns of each matrix list the metrics that participated in that language pair.]
Figure 2: System-level metric significance test results for DA human assessment in newstest2019 for German to Czech, German to French and French to German; green cells denote a statistically significant increase in correlation with human assessment for the metric in a given row over the metric in a given column according to the Williams test.
                           de-en   fi-en   gu-en   kk-en   lt-en   ru-en   zh-en
Human Evaluation           daRR    daRR    daRR    daRR    daRR    daRR    daRR
n                          85,365  38,307  31,139  27,094  21,862  46,172  31,070
BEER                       0.128   0.283   0.260   0.421   0.315   0.189   0.371
BERTr                      0.142   0.331   0.291   0.421   0.353   0.195   0.399
CharacTER                  0.101   0.253   0.190   0.340   0.254   0.155   0.337
chrF                       0.122   0.286   0.256   0.389   0.301   0.180   0.371
chrF+                      0.125   0.289   0.257   0.394   0.303   0.182   0.374
EED                        0.120   0.281   0.264   0.392   0.298   0.176   0.376
ESIM                       0.167   0.337   0.303   0.435   0.359   0.201   0.396
hLEPORa_baseline           −       −       −       0.372   −       −       0.339
Meteor++_2.0(syntax)       0.084   0.274   0.237   0.395   0.291   0.156   0.370
Meteor++_2.0(syntax+copy)  0.094   0.273   0.244   0.402   0.287   0.163   0.367
PReP                       0.030   0.197   0.192   0.386   0.193   0.124   0.267
sentBLEU                   0.056   0.233   0.188   0.377   0.262   0.125   0.323
WMDO                       0.096   0.281   0.260   0.420   0.300   0.162   0.362
YiSi-0                     0.117   0.271   0.263   0.402   0.289   0.178   0.355
YiSi-1                     0.164   0.347   0.312   0.440   0.376   0.217   0.426
YiSi-1_srl                 0.199   0.346   0.306   0.442   0.380   0.222   0.431
QE as a Metric:
ibm1-morpheme              −0.074  0.009   −       −       0.069   −       −
ibm1-pos4gram              −0.153  −       −       −       −       −       −
LASIM                      −0.024  −       −       −       −       0.022   −
LP                         −0.096  −       −       −       −       −0.035  −
UNI                        0.022   0.202   −       −       −       0.084   −
UNI+                       0.015   0.211   −       −       −       0.089   −
YiSi-2                     0.068   0.126   −0.001  0.096   0.075   0.053   0.253
YiSi-2_srl                 0.068   −       −       −       −       −       0.246
Table 6: Segment-level metric results for to-English language pairs in newstest2019: absolute Kendall’s Tau formulation of segment-level metric scores with DA scores; correlations of metrics not significantly outperformed by any other for that language pair are highlighted in bold.
                  en-cs   en-de   en-fi   en-gu   en-kk   en-lt   en-ru   en-zh
Human Evaluation  daRR    daRR    daRR    daRR    daRR    daRR    daRR    daRR
n                 27,178  99,840  31,820  11,355  18,172  17,401  24,334  18,658
BEER              0.443   0.316   0.514   0.537   0.516   0.441   0.542   0.232
CharacTER         0.349   0.264   0.404   0.500   0.351   0.311   0.432   0.094
chrF              0.455   0.326   0.514   0.534   0.479   0.446   0.539   0.301
chrF+             0.458   0.327   0.514   0.538   0.491   0.448   0.543   0.296
EED               0.431   0.315   0.508   0.568   0.518   0.425   0.546   0.257
ESIM              −       0.329   0.511   −       0.510   0.428   0.572   0.339
hLEPORa_baseline  −       −       −       0.463   0.390   −       −       −
sentBLEU          0.367   0.248   0.396   0.465   0.392   0.334   0.469   0.270
YiSi-0            0.406   0.304   0.483   0.539   0.494   0.402   0.535   0.266
YiSi-1            0.475   0.351   0.537   0.551   0.546   0.470   0.585   0.355
YiSi-1_srl        −       0.368   −       −       −       −       −       0.361
QE as a Metric:
ibm1-morpheme     −0.135  −0.003  −0.005  −       −       −0.165  −       −
ibm1-pos4gram     −       −0.123  −       −       −       −       −       −
LASIM             −       0.147   −       −       −       −       −0.24   −
LP                −       −0.119  −       −       −       −       −0.158  −
UNI               0.060   0.129   0.351   −       −       −       0.226   −
UNI+              −       −       −       −       −       −       0.222   −
USFD              −       −0.029  −       −       −       −       0.136   −
USFD-TL           −       −0.037  −       −       −       −       0.191   −
YiSi-2            0.069   0.212   0.239   0.147   0.187   0.003   −0.155  0.044
YiSi-2_srl        −       0.236   −       −       −       −       −       0.034
Table 7: Segment-level metric results for out-of-English language pairs in newstest2019: absolute Kendall’s Tau formulation of segment-level metric scores with DA scores; correlations of metrics not significantly outperformed by any other for that language pair are highlighted in bold.
                  de-cs   de-fr   fr-de
Human Evaluation  daRR    daRR    daRR
n                 35,793  4,862   1,369
BEER              0.337   0.293   0.265
CharacTER         0.232   0.251   0.224
chrF              0.326   0.284   0.275
chrF+             0.326   0.284   0.278
EED               0.345   0.301   0.267
ESIM              0.331   0.290   0.289
hLEPORa_baseline  0.207   0.239   −
sentBLEU          0.203   0.235   0.179
YiSi-0            0.331   0.296   0.277
YiSi-1            0.376   0.349   0.310
YiSi-1_srl        −       −       0.299
QE as a Metric:
ibm1-morpheme     0.048   −0.013  −0.053
ibm1-pos4gram     −       −0.074  −0.097
YiSi-2            0.199   0.186   0.066
Table 8: Segment-level metric results for language pairs not involving English in newstest2019: absolute Kendall’s Tau formulation of segment-level metric scores with DA scores; correlations of metrics not significantly outperformed by any other for that language pair are highlighted in bold.
is deemed substantially different and no ties in human judgements arise, it makes sense to penalize ties in metrics’ predictions in order to promote discerning metrics.
Note that the penalization of ties makes our evaluation asymmetric, dependent on whether the metric predicted the tie for a pair where humans predicted <, or >. It is now important to interpret the meaning of the comparison identically for humans and metrics. For error metrics, we thus reverse the sign of the metric score prior to the comparison with human scores: higher scores have to indicate better translation quality. In WMT19, the original authors did this for CharacTER.
To summarize, the WMT19 Metrics Task for segment-level evaluation:
• ensures that error metrics are first converted to the same orientation as the human judgements, i.e. higher score indicating higher translation quality,
• excludes all human ties (this is already implied by the construction of daRR from DA judgements),
[Figure 3 panels, one metric-by-metric significance matrix per language pair: de-en, fi-en, gu-en, kk-en, lt-en, ru-en, zh-en, en-cs, en-de, en-fi, en-gu, en-kk, en-lt, en-ru, en-zh. Rows and columns of each matrix list the metrics that participated in that language pair.]
Figure 3: daRR segment-level metric significance test results for into-English and out-of-English language pairs (newstest2019): green cells denote a significant win for the metric in a given row over the metric in a given column according to bootstrap resampling.
[Figure 4 panels, one metric-by-metric significance matrix per language pair: de-cs, de-fr, fr-de. Rows and columns of each matrix list the metrics that participated in that language pair.]
Figure 4: daRR segment-level metric significance test results for German to Czech, German to French and French to German (newstest2019): green cells denote a significant win for the metric in a given row over the metric in a given column according to bootstrap resampling.
• counts metric ties as Discordant pairs.
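The three points above can be sketched as a small function. This is a minimal illustration, not the task's scoring script: the name `darr_tau`, the pair layout (metric score of the human-preferred translation first) and the example scores are all hypothetical. Human ties never occur in the input because every daRR pair already has a preferred translation.

```python
def darr_tau(pairs, higher_is_better=True):
    """Kendall's Tau-like score (Eq. 2) over daRR pairs.

    pairs: iterable of (score_of_better, score_of_worse), i.e. the metric
           scores of the translation humans preferred and of the other one.
    higher_is_better: set to False for error metrics such as TER, whose
           sign is reversed before the comparison.
    """
    sign = 1.0 if higher_is_better else -1.0
    concordant = 0
    discordant = 0
    for better, worse in pairs:
        if sign * better > sign * worse:
            concordant += 1
        else:
            # The metric disagrees with the humans or predicts a tie;
            # both cases are counted as discordant.
            discordant += 1
    total = concordant + discordant
    if total == 0:
        raise ValueError("no daRR pairs given")
    return (concordant - discordant) / total


# Hypothetical metric scores for four daRR pairs (one metric tie, one error).
quality_pairs = [(0.8, 0.5), (0.6, 0.6), (0.3, 0.7), (0.9, 0.1)]
print(darr_tau(quality_pairs))

# Hypothetical error-metric scores: lower is better, so the sign is reversed.
error_pairs = [(10.0, 20.0), (15.0, 15.0), (30.0, 25.0)]
print(darr_tau(error_pairs, higher_is_better=False))
```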
We employ bootstrap resampling (Koehn, 2004; Graham et al., 2014b) to estimate confidence intervals for our Kendall’s Tau formulation, and metrics with non-overlapping 95% confidence intervals are identified as having a statistically significant difference in performance.
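A percentile bootstrap over daRR pairs might look as follows. The sketch rests on several assumptions that the text does not specify: each pair is encoded as +1 (concordant) or −1 (discordant), 1000 resamples are drawn, and each metric is resampled independently. All function names and the example data are hypothetical.

```python
import random


def tau_like(outcomes):
    """Tau-like score for a list of +1 (concordant) / -1 (discordant) outcomes."""
    return sum(outcomes) / len(outcomes)


def bootstrap_ci(outcomes, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the tau-like score."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the pairs with replacement and recompute the score each time.
    stats = sorted(
        tau_like([outcomes[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = n_resamples - lo_idx - 1
    return stats[lo_idx], stats[hi_idx]


def significantly_different(ci_a, ci_b):
    """True if the two confidence intervals do not overlap."""
    return ci_a[1] < ci_b[0] or ci_b[1] < ci_a[0]


# Hypothetical outcomes for two metrics on 100 daRR pairs each.
metric_a = [1] * 90 + [-1] * 10  # tau-like score 0.8
metric_b = [1] * 55 + [-1] * 45  # tau-like score 0.1
ci_a = bootstrap_ci(metric_a)
ci_b = bootstrap_ci(metric_b)
print(ci_a, ci_b, significantly_different(ci_a, ci_b))
```

Comparing intervals for overlap is a conservative criterion: two metrics can differ significantly even when their intervals overlap slightly.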
5.2.1 Segment-Level Results
Results of the segment-level human evaluation for translations sampled from the News Translation Task are shown in Tables 6, 7 and 8, where metric correlations not significantly outperformed by any other metric are highlighted in bold. Head-to-head significance test results for differences in metric performance are included in Figures 3 and 4.
6 Discussion
This year, human data was collected from reference-based evaluations (or “monolingual”) and reference-free evaluations (or “bilingual”). The reference-based (monolingual) evaluations were obtained with the help of anonymous crowdsourcing, while the reference-free (bilingual) evaluations were mainly from MT researchers who committed their time to the manual evaluation for each submitted system.
6.1 Stability across MT Systems
The observed performance of metrics depends on the underlying texts and systems that participate in the News Translation Task (see Section 2). For the strongest MT systems, distinguishing which system outputs are better is
[Figure 5 plot: sacreBLEU-BLEU; x-axis: number of top systems considered, with ticks from 20 down to 4.]
Figure 5: Pearson correlations of sacreBLEU-BLEU for English-German system-level evaluation for all systems (left) down to only top 4 systems (right). The y-axis spans from -1 to +1; baseline metrics for the language pair are shown in grey.
hard, even for human assessors. On the other hand, if the systems are spread across a wide performance range, it will be easier for metrics to correlate with human judgements.
To provide a more reliable view, we created plots of Pearson correlation when the underlying set of MT systems is reduced to the top n ones. One such sample plot is in Figure 5; all language pairs and most of the metrics are in Appendix A.
As the plot documents, the official correlations reported in Tables 3 to 5 can lead to wrong conclusions. sacreBLEU-BLEU correlates at .969 when all systems are considered, but as we start considering only the top n systems, the correlation falls relatively quickly. With 10 systems, we are below .5 and when only the top 6 or 4 systems are considered, the correlation even falls to negative values. Note that correlation point estimates (the values on the y-axis) become noisier with the decreasing number of underlying MT systems.
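The top-n analysis can be sketched as follows. The sketch assumes that "top n" means the n systems with the highest human score; the function names and the seven hypothetical systems (including one low-scoring outlier) are illustrative only and do not reproduce the task data.

```python
import math


def pearson_r(xs, ys):
    """Plain Pearson correlation of two equally long score lists."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    x_var = sum((x - x_mean) ** 2 for x in xs)
    y_var = sum((y - y_mean) ** 2 for y in ys)
    return cov / math.sqrt(x_var * y_var)


def top_n_correlations(human, metric, sizes):
    """Pearson r restricted to the n systems with the highest human score."""
    order = sorted(range(len(human)), key=lambda i: human[i], reverse=True)
    results = {}
    for n in sizes:
        top = order[:n]
        results[n] = pearson_r([human[i] for i in top], [metric[i] for i in top])
    return results


# Hypothetical DA and metric scores; the last system is a clear outlier.
human = [0.40, 0.35, 0.30, 0.25, 0.10, 0.00, -1.80]
metric = [43.0, 44.5, 44.0, 45.0, 42.0, 41.0, 20.0]

for n, r in top_n_correlations(human, metric, [7, 6, 4]).items():
    print(n, round(r, 3))
```

In this toy data the correlation over all seven systems is high because of the outlier, while the correlation over the top four systems is negative.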
Figure 6 explains the situation and illustrates the sensitivity of the observed correlations to the exact set of systems. On the full set of systems, the single outlier (the worst-performing system called en_de_task) helps to achieve a great positive correlation. The majority of MT systems however form a cloud with Pearson correlation around .5 and the top 4 systems actually exhibit a negative correlation of the human score and sacreBLEU-BLEU.
[Figure 6 plot: axes labelled DA and sacreBLEU-BLEU; legend: Top 4, Top 6, Top 8, Top 10, Top 12, Top 15, All systems.]
Figure 6
In Appendix A, baseline metrics are plotted in grey in all the plots, so that their trends can be observed jointly. In general, most baselines have similar correlations, as most baselines use similar features (n-gram or word-level features, with the exception of chrF). In a number of language pairs (de-en, de-fr, en-de, en-kk, lt-en, ru-en, zh-en), baseline correlations tend towards 0 (no correlation) or even negative Pearson correlation. For a widely applied metric such as sacreBLEU-BLEU, our analysis reveals weak correlation in comparing top state-of-the-art systems in these language pairs, especially in en-de, de-en, ru-en, and zh-en.
We will restrict our analysis to those language pairs where the baseline metrics have an obvious downward trend (de-en, de-fr, en-de, en-kk, lt-en, ru-en, zh-en). Examining the top-n correlation in the submitted metrics (not including QE systems), most metrics show the same degradation in correlation as the baselines. We note BERTr as the one exception, consistently degrading less and retaining positive correlation compared to other submitted metrics and baselines, in the language pairs where it participated.
For QE systems, we noticed that in some instances, QE systems have upward correlation trends when other metrics and baselines have downward trends. For instance, LP, UNI, and UNI+ in the de-en language pair, YiSi-2 in en-kk, and UNI and UNI+ in ru-en. These results suggest that QE systems such as UNI and UNI+ perform worse on judging systems of wide-ranging quality, but better for top-performing systems, or perhaps for systems closer in quality.
If our method of human assessment is sound, we should believe that BLEU, a widely applied metric, is no longer a reliable metric for judging our best systems. Future investigations are needed to understand when BLEU applies well, and why BLEU is not effective for output from our state-of-the-art models.
Metrics and QE systems such as BERTr, ESIM, and YiSi that perform well at judging our best systems often use more semantic features compared to our n-gram/char-gram based baselines. Future metrics may want to explore a) whether semantic features such as contextual word embeddings are achieving semantic understanding and b) whether semantic understanding is the true source of a metric’s performance gains.
It should be noted that some language pairs do not show the strong degrading pattern with top-n systems this year, for instance en-cs, en-gu, en-ru, or kk-en. English-Chinese is particularly interesting because we see a clear trend towards better correlations as we reduce the set of underlying systems to the top-scoring ones.
6.2 Overall Metric Performance
6.2.1 System-Level Evaluation
In system-level evaluation, the series of YiSi metrics achieves the highest correlations in several language pairs and is not significantly outperformed by any other metric (denoted as a “win” in the following) for almost all language pairs.
The new metric ESIM performs best on 5 language pairs (out of 18 language pairs) and obtains 11 “wins” out of the 16 language pairs in which ESIM participated.
The metric EED performs better for language pairs out-of-English and excluding English compared to into-English language pairs, achieving 7 out of 11 “wins” there.
6.2.2 Segment-Level Evaluation
For segment-level evaluation, most language pairs are quite discerning, with only one or two metrics taking the “winner” position (of not being significantly surpassed by others). Only French-German differs, with all metrics performing similarly except the significantly worse sentBLEU.
YiSi-1_srl stands out as the “winner” for all language pairs in which it participated. The excluded language pairs were probably due to the lack of semantic information required by YiSi-1_srl. YiSi-1 participated in all language pairs and its correlations are comparable with those of YiSi-1_srl.
ESIM obtains 6 “wins” out of all 18 language pairs.
Both YiSi and ESIM are based on neural networks (YiSi via word and phrase embeddings, as well as other types of available resources, ESIM via sentence embeddings). This is a confirmation of a trend observed last year.
6.2.3 QE Systems as Metrics
Generally, correlations for the standard reference-based metrics are clearly better than those in the “QE as a Metric” track, both when using monolingual and bilingual golden truth.
In system-level evaluation, correlations for “QE as a Metric” range from 0.028 to 0.947 across all language pairs and all metrics but they are very unstable. Even for a single metric, take UNI for example, the correlations range from 0.028 to 0.930 across language pairs.
In segment-level evaluation, correlations for QE metrics range from -0.153 to 0.351 across all language pairs and show the same instability across language pairs for a given metric.
In either case, we do not see any pattern that could explain the behaviour, e.g. whether the manual evaluation was monolingual or bilingual, or the characteristics of the given language pair.
6.3 Dependence on Implementation
As has already happened in the past, we had multiple implementations for some metrics, BLEU and chrF in particular.
The detailed configurations of BLEU and sacreBLEU-BLEU differ and hence their scores and correlation results are different.
chrF and sacreBLEU-chrF use the same parameters and should thus deliver the same scores but we still observe some differences, leading to different correlations. For instance, for German-French Pearson correlation, chrF obtains 0.931 (no win) but sacreBLEU-chrF reaches 0.952, tying for a win with other metrics.
We thus fully support the call for clarity by Post (2018b) and invite authors of metrics to include their implementations either in Moses scorer or sacreBLEU to achieve a long-term assessment of their metric.
7 Conclusion
This paper summarizes the results of the WMT19 shared task in machine translation evaluation, the Metrics Shared Task. Participating metrics were evaluated in terms of their correlation with human judgement at the level of the whole test set (system-level evaluation), as well as at the level of individual sentences (segment-level evaluation).
We reported scores for standard metrics requiring the reference as well as quality estimation systems which took part in the track “QE as a metric”, joint with the Quality Estimation task.
At the system level, the best metrics reach over 0.95 Pearson correlation across several language pairs. As expected, QE systems are visibly worse in all language pairs, but they can also reach high system-level correlations, up to .947 (Chinese-English) or .936 (English-German) by YiSi-1_srl or over .9 for multiple language pairs by UNI.
An important caveat is that the correlations are heavily affected by the underlying set of MT systems. We explored this by reducing the set of systems to the top-n ones for various values of n and found that for many language pairs, system-level correlations are much worse when based on only the better performing systems. With both good and bad MT systems participating in the news task, the metrics results can be overly optimistic compared to what we get when evaluating state-of-the-art systems.
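The top-n analysis can be illustrated with a minimal sketch. It assumes that systems are ranked by their human score and that the Pearson correlation is recomputed on the n best ones; the helper names and the toy scores below are hypothetical and are not taken from the shared task data.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def top_n_correlation(human: Sequence[float], metric: Sequence[float], n: int) -> float:
    """Pearson correlation restricted to the n systems with the best human score."""
    best = sorted(range(len(human)), key=lambda i: human[i], reverse=True)[:n]
    return pearson([human[i] for i in best], [metric[i] for i in best])


# Toy scores for five hypothetical systems: three strong ones and two weak ones.
human = [0.30, 0.25, 0.20, -0.40, -0.90]
metric = [30.0, 31.0, 29.5, 20.0, 10.0]

r_all = top_n_correlation(human, metric, len(human))
r_top3 = top_n_correlation(human, metric, 3)
print(f"all systems: {r_all:.3f}, top 3: {r_top3:.3f}")
```

In this toy setting the two weak systems dominate the correlation over the full set, while the metric barely separates the three strong systems, which mirrors the effect described above.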
In terms of segment-level Kendall’s τ results, the correlations of standard metrics varied between 0.03 and 0.59, and QE systems obtained even negative correlations.
The results confirm last year's observation that metrics based on word or sentence-level embeddings (YiSi and ESIM) achieve the highest performance.
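To make the sign of such values concrete, the following sketch assumes the common Kendall's Tau-like formulation over relative-ranking pairs, τ = (Concordant − Discordant) / (Concordant + Discordant), where each pair records which of two translations humans judged better. The function name, the translation ids, and the scores are hypothetical, and counting metric ties as discordant is an assumption of this sketch.

```python
from typing import Dict, List, Tuple


def kendall_tau_like(pairs: List[Tuple[str, str]], metric: Dict[str, float]) -> float:
    """Kendall's Tau-like agreement between human pairwise judgements and a metric.

    `pairs` holds (better, worse) translation ids according to human judgement;
    `metric` maps each id to a metric score where higher means better.
    """
    if not pairs:
        raise ValueError("at least one human-judged pair is required")
    # A pair is concordant when the metric prefers the same translation.
    concordant = sum(1 for better, worse in pairs if metric[better] > metric[worse])
    # Assumption: ties in the metric score are counted as discordant.
    discordant = len(pairs) - concordant
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical human preferences and metric scores.
pairs = [("a", "b"), ("a", "c"), ("b", "c")]
scores = {"a": 0.9, "b": 0.5, "c": 0.7}
print(kendall_tau_like(pairs, scores))
```

A metric that disagrees with humans on more pairs than it agrees on gets a negative value, which is how negative segment-level correlations arise.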
Acknowledgments
Results in this shared task would not be possible without tight collaboration with the organizers of the WMT News Translation Task. We would like to thank Marcin Junczys-Dowmunt for the suggestion to examine metrics performance across varying subsets of MT systems, as we did in Appendix A.
This study was supported in part by grant 19-26934X (NEUREM3) of the Czech Science Foundation, the ADAPT Centre for Digital Content Technology (www.adaptcentre.ie) at Dublin City University, funded under the SFI Research Centres Programme (Grant 13/RC/2106) and co-funded under the European Regional Development Fund, and the Charles University Research Programme “Progres” Q18+Q48.
References
Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL ’05, pages 597–604, Stroudsburg, PA, USA. Association for Computational Linguistics.
Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 Conference on Machine Translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation, Florence, Italy. Association for Computational Linguistics.
Ondřej Bojar, Christian Federmann, Barry Haddow, Philipp Koehn, Matt Post, and Lucia Specia. 2016. Ten Years of WMT Evaluation Campaigns: Lessons Learnt. In Proceedings of the LREC 2016 Workshop “Translation Evaluation – From Fragmented Tools and Data Sets to an Integrated Ecosystem”, pages 27–34, Portorose, Slovenia.
Ondřej Bojar, Yvette Graham, and Amir Kamran. 2017. Results of the WMT17 metrics shared task. In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Tasks Papers, Copenhagen, Denmark. Association for Computational Linguistics.
Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668.
Julian Chow, Pranava Madhyastha, and Lucia Specia. 2019a. WMDO: Fluency-based word mover’s distance for machine translation evaluation. In Proceedings of the Fourth Conference on Machine Translation.
Julian Chow, Lucia Specia, and Pranava Madhyastha. 2019b. WMDO: Fluency-based Word Mover’s Distance for Machine Translation Evaluation. In Proceedings of the Fourth Conference on Machine Translation, Florence, Italy. Association for Computational Linguistics.
Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language Specific Translation Evaluation for Any Target Language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 376–380, Baltimore, Maryland, USA. Association for Computational Linguistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
George Doddington. 2002. Automatic Evaluation of Machine Translation Quality Using N-gram Co-occurrence Statistics. In Proceedings of the Second International Conference on Human Language Technology Research, HLT ’02, pages 138–145, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
Erick Fonseca, Lisa Yankovskaya, André F. T. Martins, Mark Fishel, and Christian Federmann. 2019. Findings of the WMT 2019 Shared Task on Quality Estimation. In Proceedings of the Fourth Conference on Machine Translation, Florence, Italy. Association for Computational Linguistics.
Yvette Graham and Timothy Baldwin. 2014. Testing for Significance of Increased Correlation with Human Judgment. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 172–176, Doha, Qatar. Association for Computational Linguistics.
Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2013. Continuous Measurement Scales in Human Evaluation of Machine Translation. In Proceedings of the 7th Linguistic Annotation Workshop & Interoperability with Discourse, pages 33–41, Sofia, Bulgaria. Association for Computational Linguistics.
Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2014a. Is Machine Translation Getting Better over Time? In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 443–451, Gothenburg, Sweden. Association for Computational Linguistics.
Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2016. Can machine translation systems be evaluated by the crowd alone. Natural Language Engineering, FirstView:1–28.
Yvette Graham and Qun Liu. 2016. Achieving Accurate Conclusions in Evaluation of Automatic Machine Translation Metrics. In Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA. Association for Computational Linguistics.
Yvette Graham, Nitika Mathur, and Timothy Baldwin. 2014b. Randomized significance tests in machine translation. In Proceedings of the ACL 2014 Ninth Workshop on Statistical Machine Translation, pages 266–274. Association for Computational Linguistics.
Yvette Graham, Nitika Mathur, and Timothy Baldwin. 2015. Accurate Evaluation of Segment-level Machine Translation Metrics. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics Human Language Technologies, Denver, Colorado.
Yinuo Guo and Junfeng Hu. 2019. Meteor++ 2.0: Adopt Syntactic Level Paraphrase Knowledge into Machine Translation Evaluation. In Proceedings of the Fourth Conference on Machine Translation, Florence, Italy. Association for Computational Linguistics.
Aaron L.-F. Han, Derek F. Wong, and Lidia S. Chao. 2012. LEPOR: A robust evaluation metric for machine translation with augmented factors. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 441–450. Association for Computational Linguistics.
Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He, Yi Lu, Junwen Xing, and Xiaodong Zeng. 2013. Language-independent model for machine translation evaluation with reinforced factors. In Machine Translation Summit XIV, pages 215–222. International Association for Machine Translation.
Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proc. of Empirical Methods in Natural Language Processing, pages 388–395, Barcelona, Spain. Association for Computational Linguistics.
Philipp Koehn and Christof Monz. 2006. Manual and Automatic Evaluation of Machine Translation Between European Languages. In Proceedings of the Workshop on Statistical Machine Translation, StatMT ’06, pages 102–121, Stroudsburg, PA, USA. Association for Computational Linguistics.
Gregor Leusch, Nicola Ueffing, and Hermann Ney. 2003. A novel string-to-string distance measure with applications to machine translation evaluation. In Proceedings of MT Summit IX, pages 240–247.
Gregor Leusch, Nicola Ueffing, and Hermann Ney. 2006. CDER: Efficient MT Evaluation Using Block Movements. In Proceedings of EACL, pages 241–248.
Chi-kiu Lo. 2019. YiSi - a Unified Semantic MT Quality Evaluation and Estimation Metric for Languages with Different Levels of Available Resources. In Proceedings of the Fourth Conference on Machine Translation, Florence, Italy. Association for Computational Linguistics.
Qingsong Ma, Ondřej Bojar, and Yvette Graham. 2018. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, Brussels, Belgium. Association for Computational Linguistics.
Matouš Macháček and Ondřej Bojar. 2014. Results of the WMT14 metrics shared task. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 293–301, Baltimore, MD, USA. Association for Computational Linguistics.
Matouš Macháček and Ondřej Bojar. 2013. Results of the WMT13 Metrics Shared Task. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 45–51, Sofia, Bulgaria. Association for Computational Linguistics.
Nitika Mathur, Tim Baldwin, and Trevor Cohn. 2019. Putting evaluation in context: Contextual embeddings improve machine translation evaluation. In Proc. of ACL (short papers). To appear.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 311–318.
Maja Popović. 2012. Morpheme- and POS-based IBM1 and language model scores for translation quality estimation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, WMT@NAACL-HLT 2012, June 7-8, 2012, Montréal, Canada, pages 133–137.
Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisboa, Portugal. Association for Computational Linguistics.
Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Tasks Papers, Copenhagen, Denmark. Association for Computational Linguistics.
Matt Post. 2018a. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Belgium, Brussels. Association for Computational Linguistics.
Matt Post. 2018b. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation, Belgium, Brussels. Association for Computational Linguistics.
Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of Association for Machine Translation in the Americas, pages 223–231.
Peter Stanchev, Weiyue Wang, and Hermann Ney. 2019. EED: Extended Edit Distance Measure for Machine Translation. In Proceedings of the Fourth Conference on Machine Translation, Florence, Italy. Association for Computational Linguistics.
Miloš Stanojević and Khalil Sima’an. 2015. BEER 1.1: ILLC UvA submission to metrics and tuning task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisboa, Portugal. Association for Computational Linguistics.
Weiyue Wang, Jan-Thorsten Peter, Hendrik Rosendahl, and Hermann Ney. 2016a. CharacTer: Translation edit rate on character level. In ACL 2016 First Conference on Machine Translation, pages 505–510, Berlin, Germany.
Weiyue Wang, Jan-Thorsten Peter, Hendrik Rosendahl, and Hermann Ney. 2016b. CharacTer: Translation Edit Rate on Character Level. In Proceedings of the First Conference on Machine Translation, Berlin, Germany. Association for Computational Linguistics.
Evan James Williams. 1959. Regression analysis, volume 14. Wiley New York.
Elizaveta Yankovskaya, Andre Tättar, and Mark Fishel. 2019. Quality Estimation and Translation Metrics via Pre-trained Word and Sentence Embeddings. In Proceedings of the Fourth Conference on Machine Translation, Florence, Italy. Association for Computational Linguistics.
Ryoma Yoshimura, Hiroki Shimanaka, Yukio Matsumura, Hayahide Yamagishi, and Mamoru Komachi. 2019. Filtering Pseudo-References by Paraphrasing for Automatic Evaluation of Machine Translation. In Proceedings of the Fourth Conference on Machine Translation, Florence, Italy. Association for Computational Linguistics.
A Correlations for Top-N Systems
A.1 de-cs
[Figure: correlation plots, one panel per metric: BEER, CharacTER, EED, ESIM, LEPORa, LEPORb, NIST, PER, TER, YiSi-0, YiSi-1, YiSi-2, ibm1-morpheme, sacreBLEU-BLEU, sacreBLEU-chrF.]
A.2 de-en
[Figure: correlation plots, one panel per metric: BEER, BERTr, CharacTER, EED, ESIM, LASIM, LP, Meteor++_2.0, Meteor++_2.0(+copy), NIST, PER, PReP, TER, UNI, UNI+, WMDO, YiSi-0, YiSi-1, YiSi-1_srl, YiSi-2, YiSi-2_srl, ibm1-morpheme, ibm1-pos4gram, sacreBLEU-BLEU, sacreBLEU-chrF.]
A.3 de-fr
[Figure: correlation plots, one panel per metric: BEER, CharacTER, EED, ESIM, LEPORa, LEPORb, NIST, PER, TER, YiSi-0, YiSi-1, YiSi-2, ibm1-morpheme, ibm1-pos4gram, sacreBLEU-BLEU, sacreBLEU-chrF.]
A.4 en-cs
[Figure: correlation plots, one panel per metric: BEER, CharacTER, EED, NIST, PER, TER, UNI, YiSi-0, YiSi-1, YiSi-2, ibm1-morpheme, sacreBLEU-BLEU, sacreBLEU-chrF.]
A.5 en-de
[Figure: correlation plots, one panel per metric: BEER, CharacTER, EED, ESIM, LASIM, LP, NIST, PER, TER, UNI, USFD, USFD-TL, YiSi-0, YiSi-1, YiSi-1_srl, YiSi-2, YiSi-2_srl, ibm1-morpheme, ibm1-pos4gram, sacreBLEU-BLEU, sacreBLEU-chrF.]
A.6 en-fi
[Figure: correlation plots, one panel per metric: BEER, CharacTER, EED, ESIM, NIST, PER, TER, UNI, YiSi-0, YiSi-1, YiSi-2, ibm1-morpheme, sacreBLEU-BLEU, sacreBLEU-chrF.]
A.7 en-gu
[Figure: correlation plots, one panel per metric: BEER, CharacTER, EED, LEPORa, LEPORb, NIST, PER, TER, YiSi-0, YiSi-1, YiSi-2, sacreBLEU-BLEU, sacreBLEU-chrF.]
A.8 en-kk
[Figure: correlation plots, one panel per metric: BEER, CharacTER, EED, ESIM, LEPORa, LEPORb, NIST, PER, TER, YiSi-0, YiSi-1, YiSi-2, sacreBLEU-BLEU, sacreBLEU-chrF.]
A.9 en-lt
[Figure: correlation plots, one panel per metric: BEER, CharacTER, EED, ESIM, LEPORb, NIST, PER, TER, YiSi-0, YiSi-1, YiSi-2, ibm1-morpheme, sacreBLEU-BLEU, sacreBLEU-chrF.]
A.10 en-ru
[Figure: correlation plots, one panel per metric: BEER, CharacTER, EED, ESIM, LASIM, LP, NIST, PER, TER, UNI, UNI+, USFD, USFD-TL, YiSi-0, YiSi-1, YiSi-2, sacreBLEU-BLEU, sacreBLEU-chrF.]
A.11 en-zh
[Figure: correlation plots, one panel per metric: BEER, CharacTER, EED, ESIM, NIST, PER, TER, YiSi-0, YiSi-1, YiSi-1_srl, YiSi-2, YiSi-2_srl, sacreBLEU-BLEU, sacreBLEU-chrF.]
A.12 fi-en
[Figure: correlation plots, one panel per metric: BEER, BERTr, CharacTER, EED, ESIM, Meteor++_2.0, Meteor++_2.0(+copy), NIST, PER, PReP, TER, UNI, UNI+, WMDO, YiSi-0, YiSi-1, YiSi-1_srl, YiSi-2, ibm1-morpheme, sacreBLEU-BLEU, sacreBLEU-chrF.]
A.13 fr-de
[Figure: correlation plots, one panel per metric: BEER, CharacTER, EED, ESIM, NIST, PER, TER, YiSi-0, YiSi-1, YiSi-1_srl, YiSi-2, ibm1-morpheme, ibm1-pos4gram, sacreBLEU-BLEU, sacreBLEU-chrF.]
A.14 gu-en
[Figure: correlation plots, one panel per metric: BEER, BERTr, CharacTER, EED, ESIM, Meteor++_2.0, Meteor++_2.0(+copy), NIST, PER, PReP, TER, WMDO, YiSi-0, YiSi-1, YiSi-1_srl, YiSi-2, sacreBLEU-BLEU, sacreBLEU-chrF.]
A.15 kk-en
[Figure: correlation plots, one panel per metric: BEER, BERTr, CharacTER, EED, ESIM, LEPORa, LEPORb, Meteor++_2.0, Meteor++_2.0(+copy), NIST, PER, PReP, TER, WMDO, YiSi-0, YiSi-1, YiSi-1_srl, YiSi-2, sacreBLEU-BLEU, sacreBLEU-chrF.]
A.16 lt-en
[Figure: correlation plots, one panel per metric: BEER, BERTr, CharacTER, EED, ESIM, LEPORb, Meteor++_2.0, Meteor++_2.0(+copy), NIST, PER, PReP, TER, WMDO, YiSi-0, YiSi-1, YiSi-1_srl, YiSi-2, ibm1-morpheme, sacreBLEU-BLEU, sacreBLEU-chrF.]
A.17 ru-en
[Figure: correlation plots, one panel per metric: BEER, BERTr, CharacTER, EED, ESIM, LASIM, LP, Meteor++_2.0, Meteor++_2.0(+copy), NIST, PER, PReP, TER, UNI, UNI+, WMDO, YiSi-0, YiSi-1, YiSi-1_srl, YiSi-2, sacreBLEU-BLEU, sacreBLEU-chrF.]
A.18 zh-en
[Figure: correlation plots, one panel per metric: BEER, BERTr, CharacTER, EED, ESIM, LEPORa, LEPORb, Meteor++_2.0, Meteor++_2.0(+copy), NIST, PER, PReP, TER, WMDO, YiSi-0, YiSi-1, YiSi-1_srl, YiSi-2, YiSi-2_srl, sacreBLEU-BLEU, sacreBLEU-chrF.]