Evaluation of Machine Translation Quality
Marco Turchi, FBK, Trento, Italy, turchi@fbk.eu
Slides from the presentation by Matteo Negri … and myself
Disclaimer
“More has been written about MT evaluation over the past 50 years than about MT itself”
Hovy et al.: Principles of Context-Based Machine Translation Evaluation. Machine Translation, 16, pp. 1–33, 2002 (attributed to Yorick Wilks)
“It is impossible to write a comprehensive overview of the MT evaluation literature”
Adam Lopez: Statistical Machine Translation. ACM Computing Surveys 40(3), pp. 1–49, August 2008.
MT Evaluation, Trento, Doctoral School - April 2016
Outline
• Importance of MT evaluation
• Difficulty of MT evaluation
• Human evaluation: fluency/adequacy
• Automatic evaluation:
  – Reference-based: BLEU, TER, HTER (chosen among MANY others)
  – Reference-free: quality estimation (estimating post-editing effort)
The importance of MT evaluation
• Answering “How good is an MT system?” is a way to:
  – Decide which system to use for a given task
  – Assess and compare systems’ performance
  – Define the state of the art
  – Drive system development and measure improvements
  – Decide whether to apply MT at all
• …Necessary (yes, not sufficient) conditions for progress in any research field
• Difficult task!
Difficulty of MT evaluation
• No formal definition of “translation” → no definition of “good translation”
• The notion of quality is inherently subjective
• Exact quantification is difficult (especially for long sentences)
• MT errors are very varied in nature
• Perfect or very poor translations are easy to score, but what happens in between?
Difficulty of MT evaluation
• Many different acceptable translations for the same sentence
  – I am [experiencing | suffering from | feeling] a throbbing pain.
  – I [feel | can feel | have] a [throbbing pain | painful throbbing].
  – [It is a | It’s in | I’ve got a] throbbing pain.
  – It’s throbbing [and it really hurts | with pain].
  – [It’s painful and | It hurts so much] it’s throbbing.
Difficulty of MT evaluation
• How would you translate:
  – It’s raining cats and dogs
  – Ace in the hole
  – Beat around the bush
  – Chew the fat
  – Wild goose chase
  – Tie one on
  – Sunny smile
• Literally, by its meaning, or with the corresponding idiom (if any)?
Difficulty of MT evaluation
MT Evaluation, Trento, ISIT School - November 2013
• Classification of errors: a quite rich taxonomy
Note: error types are not mutually exclusive and often co-occur (Vilar et al. 2006)
Human vs. automatic evaluation
• Human MT evaluation:
  – criteria: adequacy (fidelity) and fluency (intelligibility)
  – pros: very accurate, high quality
  – cons: expensive, slow, subjective
• Automatic MT evaluation:
  – criteria: “similarity” to professional human translation
  – pros: inexpensive, quick, objective
  – cons: quality is “slightly” lower than a human check
Human evaluation
• Given:
  – MT output, source and/or reference translation
• Task: assess the quality of the MT output
• Metrics
  – Adequacy: does the output convey the same meaning as the input sentence? Is part of the message lost, added, or distorted? …requires bilingual judges or a reference translation
  – Fluency: is the output good fluent English? This involves both grammatical correctness and idiomatic word choices. …monolingual judges are sufficient, no reference needed
Human evaluation: adequacy and fluency
• Source sentence: Le chat entre dans la chambre.
(a) Adequate fluent translation: The cat enters the room.
(b) Adequate disfluent translation: The cat enter in the room.
(c) Fluent inadequate translation: The cats enter the bedroom.
(d) Disfluent inadequate translation: Bedroom the dogs enters the
Human evaluation: Likert scales
Score   Adequacy         Fluency
5       all meaning      flawless English
4       most meaning     good English
3       much meaning     non-native English
2       little meaning   disfluent English
1       none             incomprehensible
Human evaluation: subjectivity
[Figure: three fluency/adequacy plots, one each for JUDGE 1, JUDGE 2, and JUDGE 3, showing where each judge places translations a–d]
• Perfect or very poor translations are easy to score… …but what happens in between?
(a) Adequate fluent translation: The cat enters the room.
(b) Adequate disfluent translation: The cat enter in the room.
(c) Fluent inadequate translation: The cats enter the bedroom.
(d) Disfluent inadequate translation: Bedroom the dogs enters the
Human evaluation: subjectivity
Evaluators disagree!
• …look at this histogram of adequacy judgments by different human evaluators
[Figure: histogram of adequacy judgments by different human evaluators]
Human evaluation: measuring agreement
• Kappa coefficient
$$ K = \frac{p(A) - p(E)}{1 - p(E)} $$
  – p(A): proportion of times that the evaluators agree
  – p(E): proportion of times that they would agree by chance (5-point scale → p(E) = 1/5)
  – Complete agreement: K = 1
  – No agreement higher than chance: K = 0
• Example: inter-evaluator agreement in WMT 2007
            p(A)   p(E)   K
Fluency     .400   .2     .250
Adequacy    .380   .2     .226
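The kappa formula can be checked against the WMT 2007 figures with a few lines of Python. This is a minimal sketch; the function name `kappa` is an illustrative choice, not something defined in the slides.

```python
def kappa(p_agree, p_chance):
    """Kappa coefficient: agreement above chance, rescaled so that 1 means complete agreement."""
    return (p_agree - p_chance) / (1.0 - p_chance)


# p(A) and p(E) values taken from the WMT 2007 table.
fluency_k = kappa(0.400, 0.2)
adequacy_k = kappa(0.380, 0.2)

print(f"Fluency  K = {fluency_k:.3f}")
print(f"Adequacy K = {adequacy_k:.3f}")
```

With the rounded p(A) = .380 the adequacy value comes out at about .225; the .226 reported in the table presumably derives from unrounded proportions.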
Human evaluation: alternatives
• Ranking translations: is translation X better than translation Y?
  – Evaluators are more consistent
• Informativeness: answer comprehension questions using the translation (who? where? when? names, numbers, dates etc.)
  – Very hard to devise questions
                   p(A)   p(E)   K
Fluency            .400   .2     .250
Adequacy           .380   .2     .226
Sentence ranking   .582   .333   .373
Human evaluation: alternatives
• Reading time
  – people read a well-formed text more quickly
• Post-editing effort (time/HTER)
  – Time required to turn MT into a good translation
  – HTER (Human-Targeted Translation Edit Rate): number of editing operations required to turn MT output into an acceptable translation
Automatic metrics for MT evaluation
Requirements for automatic metrics
• Low cost (w.r.t. human evaluation)
• Objective (unbiased)
• Meaningful: score should give an intuitive interpretation of translation quality
• Efficient: to be computed quickly and often
• Consistent: repeated use of the metric should give the same results
• Correct: metric must rank better systems higher
Reference-based metrics
• Idea: compute a similarity score between a candidate translation and one or more high-quality reference translations
  – References are created by human experts (e.g. professional translators)
  – Several references allow us to account for the variability of good translations
• Criterion for validating automatic metrics: automatic scores must correlate with human ones on test data
Reference-based metrics
• Typically:
$$ \frac{1}{k} \sum_{i=1}^{k} \mathrm{sim}(\mathrm{ref}_i, \mathrm{cand}), \qquad 1 \le k \le 4 $$
  – Sim is a similarity metric between sentences
  – Sim can use a variety of properties: string distance, word precision/recall, syntactic similarity, semantic distance, etc.
WER: ratio of the smallest edit distance to the output length
BLEU: weighted geometric mean of n-gram precisions
TER: normalized number of edits to match the closest reference
METEOR: harmonic mean of unigram precision/recall
NIST, PER, GTM, HTER, TERP, CDER, BLANC, ULC, MT-NCD, ATEC, TESLA, SEPIA, IQTM, BEWT-E, MEANT, etc.
“candidate”, “reference”, “n-grams”
Candidate (or “target” or “hypothesis”): the gunman was shot dead by police .
Reference translation: the gunman was shot to death by the police .
N-grams (shared by candidate and reference):
  1-grams: the, gunman, was, shot, by, police, .
  2-grams: the gunman, gunman was, was shot, police .
  3-grams: the gunman was, gunman was shot
  4-grams: the gunman was shot
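The shared n-grams listed for the gunman example can be reproduced programmatically. The sketch below assumes whitespace tokenization with the final period as its own token; the helper names `ngrams` and `shared_ngrams` are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def shared_ngrams(candidate, reference, n):
    """N-grams of order n that occur in both the candidate and the reference."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    return sorted(" ".join(gram) for gram in (cand & ref))


candidate = "the gunman was shot dead by police ."
reference = "the gunman was shot to death by the police ."

for n in range(1, 5):
    print(f"{n}-grams:", shared_ngrams(candidate, reference, n))
```

Using `Counter` intersection rather than plain sets keeps the counts, which is what the modified precision of BLEU relies on.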
The BLEU metric (BiLingual Evaluation Understudy)
• Proposed by IBM [Papineni et al., 2001] (name from IBM’s color)
• A numerical measure of closeness between texts
• Rationale: the closer MT is to human translation, the better
• Idea: check matches of words (unigrams) and phrases (n-grams) between:
  – one hypothesis (the translation produced by MT)
  – a set of references (professional human translations)
• Criterion: the more the matches, the better the hypothesis
• Needs good quality references to cover linguistic variety
Important: only the target language is taken into account!
The BLEU metric (BiLingual Evaluation Understudy)
[Figure: a reference (REF) compared with three hypotheses (HYP1, HYP2, HYP3); labels: VERY GOOD, BAD, VERY BAD]
The BLEU metric: modified n-gram precision
• n-gram precision: percentage of n-grams in the hypothesis that also occur in (any of the) references (0 ≤ p ≤ 1)
  – matches of shorter n-grams (n = 1, 2) capture adequacy
  – matches of longer n-grams (n = 3, 4, ...) capture fluency
• Modified: a reference word is considered exhausted after a matching word is identified in the hypothesis.
  – Example:
    Hyp: the the the the the the the
    Ref: the cat is on the mat
$$ p_1^{\text{standard}} = \frac{7}{7} \qquad p_1^{\text{modified}} = \frac{2}{7} $$
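The difference between standard and modified precision on the “the the the …” example can be sketched as follows. The function names are illustrative; clipping each hypothesis n-gram count to its maximum count in any single reference is the usual reading of “a reference word is exhausted after a match”.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the contiguous n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def standard_precision(hyp, refs, n=1):
    """(matched, total): every hypothesis n-gram found in any reference counts."""
    hyp_counts = ngram_counts(hyp, n)
    ref_ngrams = set()
    for ref in refs:
        ref_ngrams |= set(ngram_counts(ref, n))
    matched = sum(count for gram, count in hyp_counts.items() if gram in ref_ngrams)
    return matched, sum(hyp_counts.values())


def modified_precision(hyp, refs, n=1):
    """(matched, total): counts are clipped to the max count in any one reference."""
    hyp_counts = ngram_counts(hyp, n)
    max_ref = Counter()
    for ref in refs:
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return clipped, sum(hyp_counts.values())


hyp = "the the the the the the the".split()
refs = ["the cat is on the mat".split()]

print("standard:", standard_precision(hyp, refs))
print("modified:", modified_precision(hyp, refs))
```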
The BLEU metric: brevity penalty
• Brevity penalty (BP): to penalize too short hypotheses
  – Example:
    Hyp: the
    Ref: the cat is on the mat
    …Can’t just type out the single word “the” (precision 1.0!)
  – c = length of MT hypothesis, r = length of the closest reference
$$ BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases} $$
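A minimal sketch of the brevity penalty, assuming the standard BLEU definition (no penalty when the hypothesis is longer than the closest reference, exp(1 - r/c) otherwise). The helper `closest_ref_length` and its tie-breaking rule (prefer the shorter reference) are assumptions made for illustration.

```python
import math


def closest_ref_length(hyp_len, ref_lens):
    """Reference length closest to the hypothesis length (ties: the shorter one)."""
    return min(ref_lens, key=lambda r: (abs(r - hyp_len), r))


def brevity_penalty(c, r):
    """BP = 1 if c > r, else exp(1 - r/c); an empty hypothesis gets 0."""
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1.0 - r / c)


# Hyp "the" against Ref "the cat is on the mat": c = 1, r = 6.
one_word_bp = brevity_penalty(1, closest_ref_length(1, [6]))
print(f"BP for the one-word hypothesis: {one_word_bp:.4f}")
```

The one-word hypothesis keeps its perfect unigram precision but is multiplied by a penalty below 0.01, which removes the incentive to output very short translations.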
The BLEU metric: computation
BLEU = Brevity Penalty * Geometric mean of p1, p2, .. pn (where pn is the modified n-gram precision for 1 ≤ n ≤ 4)
Hypothesis: The gunman was shot dead by police.
  – Ref 1: The gunman was shot to death by the police.
  – Ref 2: The hitman was killed by the police forces.
  – Ref 3: Police killed the gunman.
  – Ref 4: The gunman was shot dead by the police.
• Precision: p1 = 1.0 (8/8), p2 = 0.86 (6/7), p3 = 0.67 (4/6), p4 = 0.6 (3/5)
• Brevity Penalty: c = 8, r = 9, BP = exp(1 - 9/8) = 0.8825
• Final Score:
$$ \mathrm{BLEU} = 0.8825 \times (1 \times 0.86 \times 0.67 \times 0.6)^{1/4} \approx 0.68 $$
NOTE: the geometric mean is built on a product! If one of the factors is 0 (e.g. no 4-gram matches), the final score will be 0! For this reason the final score is usually calculated on the entire evaluation corpus, not on single sentences!
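The worked example can be reproduced end to end with a small sentence-level BLEU sketch. It assumes lowercasing and a period split off as its own token (the `tokenize` helper is hypothetical), clipped n-gram counts, and the closest reference length for the brevity penalty; it is an illustration, not a replacement for a standard BLEU tool.

```python
import math
from collections import Counter


def tokenize(sentence):
    """Toy tokenizer (an assumption): lowercase and split off the final period."""
    return sentence.lower().replace(".", " .").split()


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(hyp, refs, n):
    """Clipped n-gram precision of the hypothesis against all references."""
    hyp_counts = ngram_counts(hyp, n)
    max_ref = Counter()
    for ref in refs:
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    return sum(min(c, max_ref[g]) for g, c in hyp_counts.items()) / total


def bleu(hyp, refs, max_n=4):
    """Brevity penalty times the geometric mean of p1..p_max_n."""
    precisions = [modified_precision(hyp, refs, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # one zero factor zeroes the whole product
    c = len(hyp)
    r = min((len(ref) for ref in refs), key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


hyp = tokenize("The gunman was shot dead by police.")
refs = [tokenize(s) for s in (
    "The gunman was shot to death by the police.",
    "The hitman was killed by the police forces.",
    "Police killed the gunman.",
    "The gunman was shot dead by the police.",
)]

score = bleu(hyp, refs)
no_4gram_score = bleu(tokenize("Police killed him."), refs)
print(f"BLEU = {score:.2f}")
print(f"BLEU with no 4-gram match = {no_4gram_score:.2f}")
```

The second call illustrates the note about zero factors: a short hypothesis with no matching 4-gram scores 0 even though several of its words are correct.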
The BLEU metric: correlation with training set size
[Figure: BLEU score plotted against the number of sentence pairs used in training. Experiments by Philipp Koehn]
The BLEU metric: correlation with human judgments
[Figure: from George Doddington, NIST, 2002]
The BLEU metric limitations: examples
• Reference: a b c d e f g h i j k l m n o p q r s
• Hyp1: a b c d f e g i h j l k m o n p r q s
• Hyp2: a b c d e f g x x x x x x x x x x x x
              Hyp1     Hyp2
1-gram        1.0000   0.3684
2-gram        0.1666   0.3333
3-gram        0.1176   0.2941
4-gram        0.0625   0.2500
BLEU Score    0.1871   0.3083
Longer n-grams dominate shorter n-grams!!!
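The figures in this table can be recomputed with a compact sketch. It assumes a single reference, one token per letter, and no brevity penalty (all three strings have the same length, so BP = 1); the function names are illustrative.

```python
import math
from collections import Counter


def precisions(hyp, ref, max_n=4):
    """Clipped n-gram precisions p1..p_max_n against a single reference."""
    result = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(zip(*[hyp[i:] for i in range(n)]))
        ref_counts = Counter(zip(*[ref[i:] for i in range(n)]))
        matched = sum((hyp_counts & ref_counts).values())
        result.append(matched / sum(hyp_counts.values()))
    return result


def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))


ref = list("abcdefghijklmnopqrs")
hyp1 = list("abcdfegihjlkmonprqs")      # every word right, local order scrambled
hyp2 = list("abcdefg" + "x" * 12)       # only the first seven words right

p_hyp1, p_hyp2 = precisions(hyp1, ref), precisions(hyp2, ref)
bleu_hyp1, bleu_hyp2 = geometric_mean(p_hyp1), geometric_mean(p_hyp2)

print("Hyp1:", [round(p, 4) for p in p_hyp1], round(bleu_hyp1, 4))
print("Hyp2:", [round(p, 4) for p in p_hyp2], round(bleu_hyp2, 4))
```

Hyp1 contains every reference word but loses almost all higher-order matches, so its score falls below that of Hyp2, where twelve of nineteen words are wrong.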
The BLEU metric limitations: examples
• Reference: George Bush will often take a holiday in Crawford Texas
HYPOTHESES                                                     BLEU
George Bush will often take a holiday in Crawford Texas        1.0000
Bush will often holiday in Texas                               0.4611
Bush will often holiday in Crawford Texas                      0.6363
George Bush will often holiday in Crawford Texas               0.7490
George Bush will not often vacation in Texas                   0.4491
George Bush will not often take a holiday in Crawford Texas    0.9129 (!)
Small changes in the text may determine big meaning changes!
The BLEU metric limitations: examples
• Reference: The President frequently makes his vacation in Crawford Texas
HYPOTHESES                                              BLEU (4-gram)
George Bush often takes a holiday in Crawford Texas     0.2627
holiday often Bush a takes George in Crawford Texas     0.2627
WHY? …The “invisible region” [Hovy & Ravichandran 2003]
The BLEU metric limitations: improvements
• Reference: The President frequently makes his vacation in Crawford Texas
  POS tags: DT NNP RB VBZ PRP$ NN IN NNP NNP
Solution #1: matches at POS level [Hovy & Ravichandran 2003]
HYPOTHESES                                              BLEU (4-gram)
George Bush often takes a holiday in Crawford Texas     0.2627
holiday often Bush a takes George in Crawford Texas     0.2627
NNP NNP RB VBZ DT NN IN NNP NNP                         0.5411
NN RB NNP DT VBZ NNP IN NNP NNP                         0.3117
The BLEU metric limitations: improvements
• Reference: The President frequently makes his vacation in Crawford Texas
  POS tags: DT NNP RB VBZ PRP$ NN IN NNP NNP
Solution #2: (Words + POS)/2 [Hovy & Ravichandran 2003]
HYPOTHESES                                              BLEU (4-gram)
George Bush often takes a holiday in Crawford Texas     0.2627
holiday often Bush a takes George in Crawford Texas     0.2627
NNP NNP RB VBZ DT NN IN NNP NNP                         0.4020
NN RB NNP DT VBZ NNP IN NNP NNP                         0.2966
The BLEU metric: pros and cons
• BLEU ranges from 0 to 1 (translation quality as “percentage”)
• The more references, the higher the score
• High correlation with human-assigned scores, especially on fluency
• Ranking of “similar” MT systems equivalent to human ranking
• Collecting references has a high cost
• Longer n-grams dominate shorter n-grams
• Small changes in the text (e.g. “not”) may determine big meaning changes
• Scores are not straightforward to interpret (BLEU = 30 … so what?)
• Syntax poorly modeled
• Ignores word relevance and semantic equivalence (string-level comparisons)
• Can fail in ranking systems based on different approaches
The TER metric (Translation Edit Rate)
• Idea: simulate post-editing [Snover et al. 2006]
  – Given a translation hypothesis (H) AND a reference translation (R)
  – Calculate the minimal number of edits to transform H into R (normalized by the average length of the references)
  – Possible edits: insertions/deletions/substitutions of single words, shifts of word sequences
• Criterion: the fewer the edits, the better the hypothesis
The TER metric: example
REF: Saudi Arabia denied this week information published in the American NYT
HYP: this week the Saudis denied information published in the NYT
• HYP: fluent, same meaning as the reference (except “American”)
• but not an exact match:
  – this week is shifted
  – Saudi Arabia in the REF appears as the Saudis in the HYP
  – American appears only in the REF
• Number of edits = 4 (1 shift, 2 substitutions, and 1 deletion):
  TER% = 4/11 * 100 = 36.36%
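The arithmetic of this example is easy to reproduce once the edits have been identified. The sketch below takes the edit counts from the slide as given and does not search for the optimal edit sequence (which, with shifts, needs an approximate search); the function name `ter` is illustrative.

```python
def ter(num_edits, ref_length):
    """Translation Edit Rate: number of edits normalized by the reference length."""
    return num_edits / ref_length


ref = "Saudi Arabia denied this week information published in the American NYT".split()

# Edit counts as listed in the example (not computed automatically here).
edits = {"shift": 1, "substitution": 2, "deletion": 1}

score = ter(sum(edits.values()), len(ref))
print(f"TER = {100 * score:.2f}%")
```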
The TER metric: discussion
• Evaluation close to a real task (post-editing)
• Results are more interpretable than for other metrics
• Can be computed only for a single sentence
• Insensitive to semantic closeness (e.g. synonyms, paraphrases)
• Complexity of computation (optimal calculation of edit distance with move operations: NP-complete)
  – approximate search via dynamic programming (decomposition into sub-problems)
The HTER metric (Human-targeted TER)
• TER ignores semantic equivalence and heavily depends on the reference translation
• Idea: references as human post-editions
  – Perform human post-editing to transform the hypothesis into the closest acceptable translation
  – HTER measures TER between the hypothesis and the resulting reference translation
• Criterion: the fewer the edits, the better the hypothesis (same as TER)
TER/HTER: pros/cons
• TER
  – intuitive measure of MT quality
  – adequate for fast development
  – reasonably correlates with human judgments (better than BLEU, worse than others, e.g. METEOR)
  – ignores semantic equivalence
• HTER
  – intuitive measure of MT quality
  – highest correlation with human judgments
  – possible substitute for human evaluations because less subjective
  – expensive: 3 to 7 minutes per sentence for a human to annotate
  – not suitable for use in the development cycle of an MT system
Application-oriented MT evaluation
Quality Estimation (QE)
• From controlled lab tests and evaluation campaigns…
• …to MT evaluation in real-life conditions (e.g. the CAT framework)
  – As a support to human translators
  – At run-time
  – Without reference translations
(One) scenario: the CAT framework
[Figure: CAT tool]
The CAT tool
1. Segments the input document
2. Provides, for each segment:
   • Suggestions from a translation memory (TM)
   • Suggestions from an MT engine
The translator, for each segment
1. Selects the best suggestion
2. Post-edits it (if necessary) to reach publication quality
(One) scenario: the CAT framework
• Questions:
  – Is this suggestion good enough to be published?
  – Can I trust it?
  – Can a reader get the gist?
  – Is it publishable “as is”?
  – If not, what is better: post-editing or rewriting?
• Huge market interest
  – Increased translators’ productivity
  – No manual intervention on reliable MT suggestions
Predicting MT output quality
• Task: automatically estimate MT output quality at run-time and without reference translations
• Approach: supervised learning. First (training step), a model is learned from human-labelled data. Then (prediction step), the model is used to label new, unseen data.
[Figure: positive/negative examples]
Possible features: hasWings, hasFeathers, sound, moves, hasPalmateFeet, etc.
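The two steps (training, then prediction) can be illustrated with the simplest possible supervised model: a one-feature least-squares line. Everything in this sketch is hypothetical: the single feature (source length), the toy HTER labels, and the helper names `fit_line` and `predict`. Real QE systems use many features and stronger learners.

```python
def fit_line(xs, ys):
    """Training step: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Prediction step: apply the learned line to an unseen example."""
    slope, intercept = model
    return slope * x + intercept


# Made-up training data: source sentence length -> human-derived HTER label.
train_lengths = [5, 10, 15, 20]
train_hter = [0.1, 0.2, 0.3, 0.4]

model = fit_line(train_lengths, train_hter)
estimate = predict(model, 12)  # new sentence, no reference translation needed
print(f"Estimated HTER for a 12-word sentence: {estimate:.2f}")
```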
Predicting MT output quality
• What is a good indicator of translation quality?
• It should take into account:
  – Correctness and usefulness of the translation
  – Cognitive effort needed by a human for the correction
• All these aspects can be summarized in the:
  – Post-editing effort
Predicting MT output quality
• What is post-editing?
  – A process of modification rather than revision (Loffler-Laurian 1985)
  – The “term used for the correction of machine translation output by human linguists/editors” (Veale and Way 1997)
  – Repairing texts (Krings, 2001)
  – “…the process of improving a machine-generated translation with a minimum of manual labor” (TAUS report, 2010)
Predicting MT output quality
• What is post-editing effort?
  – the effort made by a post-editor to manually improve a machine-generated translation
• Measure of post-editing effort:
  – Quality score (as estimated by humans on a 1-5 Likert scale)
  – Number of edit operations (HTER)
  – Post-editing time (total seconds or seconds per word)
  – Number of keystrokes
  – …
Quality scores
• Arbitrary choice of the levels of quality
  1 = requires complete retranslation;
  2 = requires some retranslation;
  3 = very little post-editing needed;
  4 = fit for purpose
• Labeling requires human intervention
• A precise measure
• Subjective/expensive/time-consuming task
Quality scores
• Workshop on SMT scoring schema:
  1. The MT output is incomprehensible, with little or no information transferred accurately. It cannot be edited, needs to be translated from scratch.
  2. About 50-70% of the MT output needs to be edited. It requires a significant editing effort in order to reach publishable level.
  3. About 25-50% of the MT output needs to be edited. It contains different errors and mistranslations that need to be corrected.
  4. About 10-25% of the MT output needs to be edited. It is generally clear and intelligible.
  5. The MT output is perfectly clear and intelligible. It is not necessarily a perfect translation, but requires little to no editing.
Post-editing time
• Seconds needed to post-edit a sentence
• normalized version in seconds per word
  – little time = good translation
  – large time = bad translation
• Usually includes:
  – reading time
  – searching for information on external resources
  – typing time
  – extra time for secondary activity (e.g. correction)
• High variability across sentences and translators
HTER (again!)
• Human-targeted TER is the standard edit distance between the original machine translation and its minimally post-edited version
  – edits: insertion, deletion, substitution, shift
• Lower variability (w.r.t. time) across sentences/translators
$$ \mathrm{HTER} = \frac{\#\text{edits}}{\#\text{words in the post-edited version}} $$
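The ratio can be approximated with a plain word-level edit distance. Note the simplification: this sketch counts insertions, deletions, and substitutions only and ignores shifts, so it is an upper bound on the edit count a TER-style tool would report. The post-edited sentence is a hypothetical example, and the function names are illustrative.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions; no shifts)."""
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, 1):
        curr = [i]
        for j, ref_word in enumerate(ref, 1):
            curr.append(min(
                prev[j] + 1,                           # delete hyp_word
                curr[j - 1] + 1,                       # insert ref_word
                prev[j - 1] + (hyp_word != ref_word),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def hter_no_shifts(mt_output, post_edited):
    """#edits divided by #words of the post-edited version, without shift operations."""
    mt_tokens, pe_tokens = mt_output.split(), post_edited.split()
    return word_edit_distance(mt_tokens, pe_tokens) / len(pe_tokens)


mt_output = "The cat enter in the room"
post_edited = "The cat enters the room"  # hypothetical minimal post-edit

hter_value = hter_no_shifts(mt_output, post_edited)
print(f"HTER (no shifts) = {hter_value:.2f}")
```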
Post-editing time vs. HTER
• Time: pros/cons
  – Accounts for different efforts in translating different words
  – Variability among post-editors
• HTER: pros/cons
  – Objective, easy-to-compute measure
  – less variance across post-editors (bad = bad for all)
  – Ignores different efforts in translating different words
Predicting MT output quality
• Tasks:
  – Automatic labeling
    • real values = regression
    • integers = classification
  – Automatic ranking
• Granularity
  – Word level (e.g. “The cat enter in the room”)
  – Sentence level (e.g. “The cat enter in the room”: 2.27)
  – Document level
Evaluation Metrics - Regression
• Regression (predictions as real values):
  – Mean Absolute Error (MAE)
  – Root Mean Squared Error (RMSE)
• Given a set of predicted scores H and a set of human scores V
$$ \mathrm{MAE} = \frac{\sum_{i=1}^{N} |H(s_i) - V(s_i)|}{N} \qquad \mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} (H(s_i) - V(s_i))^2}{N}} $$
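Both error measures take only a few lines to compute. The predicted and human scores below are made-up toy values, and the function names are illustrative.

```python
import math


def mae(predicted, human):
    """Mean Absolute Error between predicted scores H and human scores V."""
    return sum(abs(h - v) for h, v in zip(predicted, human)) / len(human)


def rmse(predicted, human):
    """Root Mean Squared Error: penalizes large errors more than MAE does."""
    return math.sqrt(sum((h - v) ** 2 for h, v in zip(predicted, human)) / len(human))


# Toy data (hypothetical): predicted vs. human HTER for four sentences.
H = [0.2, 0.5, 0.9, 0.4]
V = [0.1, 0.5, 0.6, 0.6]

mae_value = mae(H, V)
rmse_value = rmse(H, V)
print(f"MAE = {mae_value:.3f}, RMSE = {rmse_value:.3f}")
```

RMSE is never smaller than MAE on the same data; a large gap between the two indicates a few large errors rather than many small ones.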
Evaluation Metrics - Classification
• Classification (predictions as integers):
  – Precision (Pr)
  – Recall (Re)
  – F-score (F1)
• Given a set of predicted scores H and a set of human scores V
• An example for binary classification
            V = 1            V = -1
H = 1       True Positive    False Positive
H = -1      False Negative   True Negative
$$ Pr = \frac{tp}{tp + fp} \qquad Re = \frac{tp}{tp + fn} \qquad F_1 = \frac{2 \cdot Pr \cdot Re}{Pr + Re} $$
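The three measures follow directly from the confusion-matrix counts. The labels below are toy values chosen for illustration, and the function name is an assumption.

```python
def precision_recall_f1(predicted, human, positive=1):
    """Precision, recall and F1 of the predicted labels H against the human labels V."""
    pairs = list(zip(predicted, human))
    tp = sum(1 for h, v in pairs if h == positive and v == positive)
    fp = sum(1 for h, v in pairs if h == positive and v != positive)
    fn = sum(1 for h, v in pairs if h != positive and v == positive)
    pr = tp / (tp + fp) if tp + fp else 0.0
    re = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pr * re / (pr + re) if pr + re else 0.0
    return pr, re, f1


# Toy labels (hypothetical): 1 = good translation, -1 = bad translation.
H = [1, 1, -1, 1, -1, -1]
V = [1, -1, -1, 1, 1, -1]

pr, re, f1 = precision_recall_f1(H, V)
print(f"Pr = {pr:.3f}, Re = {re:.3f}, F1 = {f1:.3f}")
```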
Evaluation Metrics - Ranking
• Spearman’s Rank Coefficient
• Delta Average (introduced at WMT 2012)
System
      Score   Ranking
s1    3.2     3
s2    1       5
s3    5       1
s4    2.7     4
s5    4       2
Human
      Judgment   Ranking
s1    5          1
s2    1          5
s3    4          2
s4    2          4
s5    3          3
Rank Similarity Metric: compares the System ranking with the Human ranking
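Spearman's coefficient for the five sentences in the example can be computed with the classic formula 1 - 6·Σd² / (n(n² - 1)). This sketch assumes there are no tied scores (the formula needs adjusting otherwise); the function names are illustrative.

```python
def ranks(scores):
    """Rank items from best (1) to worst (n) by descending score; assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, index in enumerate(order, 1):
        result[index] = rank
    return result


def spearman(rank_a, rank_b):
    """Spearman's rank correlation from two tie-free rankings."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


system_scores = [3.2, 1, 5, 2.7, 4]   # s1..s5, system scores from the example
human_scores = [5, 1, 4, 2, 3]        # s1..s5, human judgments from the example

system_ranks = ranks(system_scores)
human_ranks = ranks(human_scores)
rho = spearman(system_ranks, human_ranks)
print(system_ranks, human_ranks, f"rho = {rho:.2f}")
```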
Quality indicators
• Features can be extracted from
  – The source sentence (“Complexity” indicators)
  – The translated sentence (“Fluency” indicators)
  – Source and target sentences (“Adequacy” and other indicators)
  – The MT system during the translation process (“Confidence” indicators)
[Diagram: Source sentence, MT system, Translated sentence]
Quality indicators - Complexity
• Capture the difficulty of translating the source sentence
• Complex sentences are harder to translate
  – source sentence length
  – n-gram language model probability
  – number of punctuation marks
  – source sentence type/token ratio (e.g. #nouns/#tokens)
  – avg. # of translations per word (as given by probabilistic dictionaries)
  – % of content/non-content words
  – …
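A few of the simpler complexity indicators can be extracted without any external resources. The sketch below covers only sentence length, punctuation count, and a lexical type/token ratio (distinct words over total words, which is one possible reading of the ratio on the slide); the function name, the feature names, and the whitespace tokenization are assumptions. Language-model and dictionary-based features need trained resources and are left out.

```python
import string


def complexity_features(source_sentence):
    """Extract a few surface-level complexity indicators from a tokenized source sentence."""
    tokens = source_sentence.split()
    n_tokens = len(tokens)
    n_punct = sum(1 for tok in tokens if all(ch in string.punctuation for ch in tok))
    distinct = {tok.lower() for tok in tokens}
    return {
        "length": n_tokens,
        "punctuation_marks": n_punct,
        "type_token_ratio": len(distinct) / n_tokens if n_tokens else 0.0,
    }


features = complexity_features("Le chat entre dans la chambre .")
print(features)
```

Each sentence becomes a fixed-length feature vector of this kind, which is what a supervised quality estimation model takes as input.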
Quality indicators - Fluency
• Capture the level of naturalness of the translation in the target language
• The translation should conform to the target language in terms of grammar, with lexical choices appropriate to the genre of the source text
  – n-gram language model probability
  – POS-tag target language model
  – …
Quality indicators - Adequacy
• Capture the level of semantic equivalence between source and translation
• Source and target sentences should convey the same meaning. Meaning drifts/losses from source to target sentence indicate a bad translation
  – % of aligned words in source and target
  – % of alignments between words with the same part of speech
  – % of aligned nouns/verbs/adjectives
  – aligned IDF mass (IDF as indicator of term relevance)
  – …
Quality indicators - Confidence
• Capture the level of confidence of the SMT system
• Sentences for which the translation process is complex are more likely to be bad translations
  – length N of the N-best list
  – number of pruned hypotheses
  – log-likelihood score
  – avg. edit distance of the 1-best from the first k-bests
  – …
Open Issues
• Lack of an objective quality score able to capture cognitive effort
  – A new score that contains the main features of HTER and correlates well with PE time
• Lack of a technique able to threshold the quality score (bad vs. good translations)
  – Is HTER = 0.3/0.5/0.7 a bad or good translation?
  – Useful in the CAT tool scenario, where it is necessary to discard bad translations
Open Issues
• More than 1,000 quality indicators have been developed in recent years.
  – Do we need all of them in a real application?
  – Which are the most reliable in each group?
  – Which is the best combination?
• Subjectivity in the post-editor’s work and in the task
  – A single quality estimator for very different post-editor behaviors and tasks
  – Adaptability/personalization
MT Evaluation Dilemma
Summary
• MT evaluation: a hot topic…
  – Shared evaluation methods/routines are a key asset in any field
• …but a difficult task
  – We talked about error variability, costs, speed, replicability, subjectivity, correlation with human judgments, etc.
Summary
• Human evaluation
  – Accurate, high quality, meaningful, expensive, slow, subjective
  – Fluency, adequacy
• Automatic evaluation
  – Cheap, quick, repeatable, objective, approximate, less accurate
  – Reference-based: BLEU, TER, HTER (pros and cons)
  – Reference-free: quality estimation (goal, methods, open issues)
Summary
• Key concepts: Adequacy, Reference translation, Agreement, Correlation, Post-editing effort, CAT tool, Feature, Cognitive effort, HTER, Mean Absolute Error