
Evaluation of Machine Translation Quality
Marco Turchi
FBK, Trento, Italy
turchi@fbk.eu

Slides from the presentation by Matteo Negri… and myself

Disclaimer

“More has been written about MT evaluation over the past 50 years than about MT itself”
Hovy et al.: Principles of Context-Based Machine Translation Evaluation. Machine Translation, 16, pp. 1–33, 2002 (attributed to Yorick Wilks)

“It is impossible to write a comprehensive overview of the MT evaluation literature”
Adam Lopez: Statistical Machine Translation. ACM Computing Surveys 40(3), pp. 1–49, August 2008.

MT Evaluation, Trento, Doctoral School - April 2016

Outline

• Importance of MT evaluation
• Difficulty of MT evaluation
• Human evaluation: fluency/adequacy
• Automatic evaluation:
  – Reference-based: BLEU, TER, HTER (chosen among MANY others)
  – Reference-free: quality estimation (estimating post-editing effort)


The importance of MT evaluation

• Answering “How good is an MT system?” is a way to:
  – Decide which system to use for a given task
  – Assess and compare systems’ performance
  – Define the state of the art
  – Drive system development and measure improvements
  – Decide whether to apply MT at all
• …Necessary (though not sufficient) conditions for progress in any research field
• Difficult task!








Difficulty of MT evaluation

• No formal definition of “translation” → no definition of “good translation”
• The notion of quality is inherently subjective
• Exact quantification is difficult (especially for long sentences)
• MT errors are very varied in nature
• Perfect or very poor translations are easy to score, but what happens in between?


• Many different acceptable translations for the same sentence
  – I am [experiencing | suffering from | feeling] a throbbing pain.
  – I [feel | can feel | have] a [throbbing pain | painful throbbing].
  – [It is a | It’s in | I’ve got a] throbbing pain.
  – It’s throbbing [and it really hurts | with pain].
  – [It’s painful and | It hurts so much] it’s throbbing.



• How would you translate:
  – It’s raining cats and dogs
  – Ace in the hole
  – Beat around the bush
  – Chew the fat
  – Wild goose chase
  – Tie one on
  – Sunny smile
• Literally, by its meaning, or with the corresponding idiom (if any)?



MT Evaluation, Trento, ISIT School - November 2013

• Classification of errors: quite a rich taxonomy

Note: error types are not mutually exclusive and often co-occur (Vilar et al. 2006)

Human vs. automatic evaluation

• Human MT evaluation:
  – criteria: adequacy (fidelity) and fluency (intelligibility)
  – pros: very accurate, high quality
  – cons: expensive, slow, subjective
• Automatic MT evaluation:
  – criteria: “similarity” to professional human translation
  – pros: inexpensive, quick, objective
  – cons: quality is “slightly” lower than a human check




Human evaluation

• Given:
  – MT output, source and/or reference translation
• Task: assess the quality of the MT output
• Metrics
  – Adequacy: does the output convey the same meaning as the input sentence? Is part of the message lost, added, or distorted? …requires bilingual judges or a reference translation
  – Fluency: is the output good, fluent English? This involves both grammatical correctness and idiomatic word choices. …monolingual judges are sufficient, no reference needed






Human evaluation: adequacy and fluency

• Source sentence: Le chat entre dans la chambre.

(a) Adequate fluent translation: The cat enters the room.
(b) Adequate disfluent translation: The cat enter in the room.
(c) Fluent inadequate translation: The cats enter the bedroom.
(d) Disfluent inadequate translation: Bedroom the dogs enters the


Human evaluation: Likert scales

Score   Adequacy         Fluency
5       all meaning      flawless English
4       most meaning     good English
3       much meaning     non-native English
2       little meaning   disfluent English
1       none             incomprehensible

Human evaluation: subjectivity

[Figure: fluency/adequacy plots placing translations (a)–(d), one panel per judge: JUDGE 1, JUDGE 2, JUDGE 3]

• Perfect or very poor translations are easy to score… but what happens in between?

(a) Adequate fluent translation: The cat enters the room.
(b) Adequate disfluent translation: The cat enter in the room.
(c) Fluent inadequate translation: The cats enter the bedroom.
(d) Disfluent inadequate translation: Bedroom the dogs enters the

Human evaluation: subjectivity

Evaluators disagree!
• …look at this histogram of adequacy judgments by different human evaluators


Human evaluation: measuring agreement

• Kappa coefficient

$$K = \frac{p(A) - p(E)}{1 - p(E)}$$

  – p(A): proportion of times that the evaluators agree
  – p(E): proportion of times that they would agree by chance (5-point scale → p(E) = 1/5)
  – Complete agreement: K = 1
  – No agreement higher than chance: K = 0
• Example: inter-evaluator agreement in WMT 2007

            p(A)   p(E)   K
Fluency     .400   .2     .250
Adequacy    .380   .2     .226
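The Kappa coefficient can be computed directly from the two proportions. The sketch below is only an illustration: the function name `kappa` and its argument names are hypothetical, and the input reuses the fluency row of the WMT 2007 table.

```python
def kappa(p_agree: float, p_chance: float) -> float:
    """Kappa: observed agreement above chance, rescaled so that 1 means complete agreement."""
    if p_chance >= 1.0:
        raise ValueError("p_chance must be below 1")
    return (p_agree - p_chance) / (1.0 - p_chance)


# Fluency row of the table: p(A) = .400, and a 5-point scale gives p(E) = 1/5.
fluency_kappa = kappa(0.400, 0.2)
print(f"fluency K = {fluency_kappa:.3f}")
```

A value of 0 means the evaluators agree no more often than chance would predict, so low positive values such as the ones in the table indicate weak agreement.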

Human evaluation: alternatives

• Ranking translations: is translation X better than translation Y?
  – Evaluators are more consistent
• Informativeness: answer comprehension questions using the translation (who? where? when? names, numbers, dates etc.)
  – Very hard to devise questions

                   p(A)   p(E)   K
Fluency            .400   .2     .250
Adequacy           .380   .2     .226
Sentence ranking   .582   .333   .373




• Reading time
  – people read a well-formed text more quickly
• Post-editing effort (time/HTER)
  – Time required to turn MT into a good translation
  – HTER (Human-Targeted Translation Edit Rate): number of editing operations required to turn MT output into an acceptable translation



Automatic metrics for MT evaluation

Requirements for automatic metrics

• Low cost (with respect to human evaluation)
• Objective (unbiased)
• Meaningful: the score should give an intuitive interpretation of translation quality
• Efficient: to be computed quickly and often
• Consistent: repeated use of the metric should give the same results
• Correct: the metric must rank better systems higher


Reference-based metrics

• Idea: compute a similarity score between a candidate translation and one or more high-quality reference translations
  – References are created by human experts (e.g. professional translators)
  – Several references allow us to account for the variability of good translations
• Criterion for validating automatic metrics: automatic scores must correlate with human ones on test data


Reference-based metrics

• Typically:

$$\frac{1}{k} \sum_{i=1}^{k} \mathrm{sim}(\mathrm{ref}_i, \mathrm{cand}), \qquad 1 \le k \le 4$$

  – sim is a similarity metric between sentences
  – sim can use a variety of properties: string distance, word precision/recall, syntactic similarity, semantic distance, etc.

• WER: ratio of smallest edit distance and output length
• BLEU: geometric mean of n-gram precisions (with a brevity penalty)
• TER: normalized number of edits to match the closest reference
• METEOR: harmonic mean of unigram precision/recall
• NIST, PER, GTM, HTER, TERP, CDER, BLANC, ULC, MT-NCD, ATEC, TESLA, SEPIA, IQTM, BEWT-E, MEANT, etc.

“candidate”, “reference”, “n-grams”

Candidate (or “target” or “hypothesis”): the gunman was shot dead by police .

Reference translation: the gunman was shot to death by the police .

N-grams of the candidate that also occur in the reference:
  – 1-grams: the, gunman, was, shot, by, police, .
  – 2-grams: the gunman, gunman was, was shot, police .
  – 3-grams: the gunman was, gunman was shot
  – 4-grams: the gunman was shot
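The n-gram lists for the gunman example can be reproduced with a few lines of code. This is a minimal sketch, assuming whitespace tokenization with the final period as its own token; the helper names `ngrams` and `shared_ngrams` are hypothetical.

```python
def ngrams(tokens, n):
    """All contiguous n-token windows of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def shared_ngrams(cand, ref, n):
    """Distinct candidate n-grams that also occur in the reference, in candidate order."""
    ref_set = set(ngrams(ref, n))
    seen = []
    for gram in ngrams(cand, n):
        if gram in ref_set and gram not in seen:
            seen.append(gram)
    return seen


candidate = "the gunman was shot dead by police .".split()
reference = "the gunman was shot to death by the police .".split()

for n in range(1, 5):
    print(f"{n}-grams:", [" ".join(g) for g in shared_ngrams(candidate, reference, n)])
```

Note that "dead" and every n-gram containing it are absent from the output, because the reference says "shot to death".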

The BLEU metric (BiLingual Evaluation Understudy)

• Proposed by IBM [Papineni et al., 2001] (name from IBM’s color)
• A numerical measure of closeness between texts
• Rationale: the closer MT is to human translation, the better
• Idea: check matches of words (unigrams) and phrases (n-grams) between:
  – one hypothesis (the translation produced by MT)
  – a set of references (professional human translations)
• Criterion: the more the matches, the better the hypothesis
• Needs good quality references to cover linguistic variety

Important: only the target language is taken into account!































[Figure: a reference (REF) and three hypotheses (HYP1, HYP2, HYP3), labeled VERY GOOD, BAD, VERY BAD]

The BLEU metric: modified n-gram precision

• n-gram precision: percentage of n-grams in the hypothesis that also occur in (any of the) references (0 ≤ p ≤ 1)
  – matches of shorter n-grams (n = 1, 2) capture adequacy
  – matches of longer n-grams (n = 3, 4, ...) capture fluency
• Modified: a reference word is considered exhausted after a matching word is identified in the hypothesis.
  – Example:
    Hyp: the the the the the the the
    Ref: the cat is on the mat

$$p_1^{\text{standard}} = \frac{7}{7} \qquad p_1^{\text{modified}} = \frac{2}{7}$$
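The "exhausted reference word" rule amounts to clipping each hypothesis n-gram count by the largest count of that n-gram in any single reference. The sketch below illustrates this on the "the the the…" example; the function names are hypothetical and whitespace tokenization is assumed.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def standard_precision(hyp, refs, n):
    """Share of hypothesis n-grams that occur in any reference (no clipping)."""
    hyp_counts = ngram_counts(hyp, n)
    ref_grams = set()
    for ref in refs:
        ref_grams.update(ngram_counts(ref, n))
    matched = sum(c for g, c in hyp_counts.items() if g in ref_grams)
    return matched / sum(hyp_counts.values()) if hyp_counts else 0.0


def modified_precision(hyp, refs, n):
    """Clipped precision: an n-gram is credited at most as often as it occurs in one reference."""
    hyp_counts = ngram_counts(hyp, n)
    if not hyp_counts:
        return 0.0
    max_ref = Counter()
    for ref in refs:
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())


hyp = "the the the the the the the".split()
refs = ["the cat is on the mat".split()]
print(standard_precision(hyp, refs, 1), modified_precision(hyp, refs, 1))
```

Because "the" occurs only twice in the reference, only two of the seven hypothesis tokens are credited under the modified precision.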

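The BLEU metric: brevity penalty

• Brevity penalty (BP): penalizes hypotheses that are too short
  – Example:
    Hyp: the
    …Can’t just type out the single word “the” (precision 1.0!)
  – c = length of the MT hypothesis, r = length of the closest reference

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$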
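A minimal sketch of the brevity penalty, assuming the standard BLEU definition (BP = 1 when c > r, otherwise exp(1 − r/c)), which agrees with the exp(1 − 9/8) value used in the computation example of these slides. The function name `brevity_penalty` is hypothetical.

```python
import math


def brevity_penalty(c: int, r: int) -> float:
    """BP for hypothesis length c and closest-reference length r."""
    if c == 0:
        return 0.0  # an empty hypothesis gets no credit
    return 1.0 if c > r else math.exp(1.0 - r / c)


# The one-word hypothesis "the" against a six-word reference is penalized heavily.
print(brevity_penalty(1, 6))
# The gunman example: c = 8, r = 9.
print(round(brevity_penalty(8, 9), 4))
```

Hypotheses longer than the reference are not penalized here, since extra words already lower the n-gram precisions.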


The BLEU metric: computation

BLEU = Brevity Penalty × geometric mean of p1, p2, …, pn (where pn is the modified n-gram precision for 1 ≤ n ≤ 4)

Hypothesis: The gunman was shot dead by police.
  – Ref 1: The gunman was shot to death by the police.
  – Ref 2: The hitman was killed by the police forces.
  – Ref 3: Police killed the gunman.
  – Ref 4: The gunman was shot dead by the police.

• Precision: p1 = 1.0 (8/8), p2 = 0.86 (6/7), p3 = 0.67 (4/6), p4 = 0.6 (3/5)
• Brevity penalty: c = 8, r = 9, BP = exp(1 − 9/8) = 0.8825
• Final score:

$$(1 \times 0.86 \times 0.67 \times 0.6)^{1/4} \times 0.8825 = 0.68$$

NOTE: the geometric mean is built on a product! If one of the factors is 0 (e.g. no 4-gram matches) the final score will be 0! For this reason the final score is usually calculated on the entire evaluation corpus, not on single sentences.
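The whole computation can be sketched for a single sentence as follows. This is an illustrative sentence-level version under stated assumptions (lowercased whitespace tokens with the period as a separate token, uniform weights over 1- to 4-grams, ties in reference length broken toward the shorter reference); it is not the reference implementation, and all function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(hyp, refs, n):
    """Clipped n-gram precision of hyp against a list of references."""
    hyp_counts = ngram_counts(hyp, n)
    if not hyp_counts:
        return 0.0
    max_ref = Counter()
    for ref in refs:
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())


def closest_ref_length(hyp, refs):
    """Reference length closest to the hypothesis length (shorter one on ties)."""
    return min((len(r) for r in refs), key=lambda length: (abs(length - len(hyp)), length))


def bleu(hyp, refs, max_n=4):
    precisions = [modified_precision(hyp, refs, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # one zero factor zeroes the geometric mean
    c, r = len(hyp), closest_ref_length(hyp, refs)
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return bp * geo_mean


hyp = "the gunman was shot dead by police .".split()
refs = [
    "the gunman was shot to death by the police .".split(),
    "the hitman was killed by the police forces .".split(),
    "police killed the gunman .".split(),
    "the gunman was shot dead by the police .".split(),
]
print(round(bleu(hyp, refs), 2))
```

The early return for a zero precision makes the point of the note explicit: a hypothesis with no 4-gram match scores 0 at sentence level, which is why corpus-level counts are normally pooled before the precisions are computed.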

The BLEU metric: correlation with training set size

[Figure: BLEU score (y-axis) against number of sentence pairs used in training (x-axis). Experiments by Philipp Koehn]

From George Doddington, NIST, 2002

The BLEU metric: correlation with human judgments

The BLEU metric limitations: examples

• Reference: a b c d e f g h i j k l m n o p q r s
• Hyp1:      a b c d f e g i h j l k m o n p r q s
• Hyp2:      a b c d e f g x x x x x x x x x x x x

             Hyp1     Hyp2
1-gram       1.0000   0.3684
2-gram       0.1666   0.3333
3-gram       0.1176   0.2941
4-gram       0.0625   0.2500
BLEU score   0.1871   0.3083

Longer n-grams dominate shorter n-grams!!!


• Reference: George Bush will often take a holiday in Crawford Texas

HYPOTHESES                                                    BLEU
George Bush will often take a holiday in Crawford Texas       1.0000
Bush will often holiday in Texas                              0.4611
Bush will often holiday in Crawford Texas                     0.6363
George Bush will often holiday in Crawford Texas              0.7490
George Bush will not often vacation in Texas                  0.4491
George Bush will not often take a holiday in Crawford Texas   0.9129!

Small changes in the text may determine big meaning changes!

• Reference: The President frequently makes his vacation in Crawford Texas

HYPOTHESES                                            BLEU (4-gram)
George Bush often takes a holiday in Crawford Texas   0.2627
holiday often Bush a takes George in Crawford Texas   0.2627

WHY? …The “invisible region” [Hovy & Ravichandran 2003]

The BLEU metric limitations: improvements

Reference POS tags: DT NNP RB VBZ PRP$ NN IN NNP NNP

Solution #1: matches at POS level [Hovy & Ravichandran 2003]

HYPOTHESES (POS tags)             BLEU
NNP NNP RB VBZ DT NN IN NNP NNP   0.5411
NN RB NNP DT VBZ NNP IN NNP NNP   0.3117

Solution #2: (Words + POS)/2 [Hovy & Ravichandran 2003]

HYPOTHESES (POS tags)             BLEU
NNP NNP RB VBZ DT NN IN NNP NNP   0.4020
NN RB NNP DT VBZ NNP IN NNP NNP   0.2966





The BLEU metric: pros and cons

• BLEU ranges from 0 to 1 (translation quality as “percentage”)
• The more the references, the higher the score
• High correlation with human assigned scores, especially on fluency
• Ranking of “similar” MT systems equivalent to human ranking
• Collecting references has a high cost
• Longer n-grams dominate shorter n-grams
• Small changes in the text (e.g. “not”) may determine big meaning changes
• Scores are not straightforward to interpret (BLEU = 30… so what?)
• Syntax poorly modeled
• Ignores word relevance and semantic equivalence (string level comparisons)
• Can fail in ranking systems based on different approaches


The TER metric (Translation Edit Rate)

• Idea: simulate post-editing [Snover et al. 2006]
  – Given a translation hypothesis (H) AND a reference translation (R)
  – Calculate the minimal number of edits to transform H into R (normalized by the average length of the references)
  – Possible edits: insertion/deletion/substitution of single words, shifts of word sequences
• Criterion: the fewer the edits, the better the hypothesis




The TER metric: example

REF: Saudi Arabia denied this week information published in the American NYT
HYP: this week the Saudis denied information published in the NYT

• HYP: fluent, same meaning as the reference (except “American”)
• but not an exact match:
  – this week is shifted
  – Saudi Arabia in the REF appears as the Saudis in the HYP
  – American appears only in the REF
• Number of edits = 4 (1 shift, 2 substitutions, and 1 deletion):

TER% = 4/11 * 100 = 36.36%
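To see what the shift operation buys, the example can be scored with a plain word-level edit distance. The sketch below is deliberately NOT full TER: it supports only insertion, deletion and substitution (no shifts), so its edit count can never be lower than TER's for the same pair. The function name `word_edit_distance` is hypothetical.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insert/delete/substitute, no shifts)."""
    prev = list(range(len(ref) + 1))  # distance from the empty hypothesis prefix
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h
                            curr[j - 1] + 1,      # insert r
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


ref = "Saudi Arabia denied this week information published in the American NYT".split()
hyp = "this week the Saudis denied information published in the NYT".split()

edits = word_edit_distance(hyp, ref)
print(edits, f"{100 * edits / len(ref):.2f}%")
```

Without shifts, the displaced phrase "this week" has to be repaired word by word, which is why TER treats moving a word sequence as a single edit.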


The TER metric: discussion

• Evaluation close to a real task (post-editing)
• Results are more interpretable than for other metrics
• Can be computed only for a single sentence
• Insensitive to semantic closeness (e.g. synonyms, paraphrases)
• Complexity of computation (optimal calculation of edit distance with move operations: NP-complete)
  – approximate search via dynamic programming (decomposition into sub-problems)


The HTER metric (Human-targeted TER)

• TER ignores semantic equivalence and heavily depends on the reference translation
• Idea: references as human post-editions
  – Perform human post-editing to transform the hypothesis into the closest acceptable translation
  – HTER measures TER between the hypothesis and the resulting reference translation
• Criterion: the fewer the edits, the better the hypothesis (same as TER)


TER/HTER: pros/cons

• TER
  – intuitive measure of MT quality
  – adequate for fast development
  – reasonably correlates with human judgments (better than BLEU, worse than others, e.g. METEOR)
  – ignores semantic equivalence
• HTER
  – intuitive measure of MT quality
  – highest correlation with human judgments
  – possible substitute for human evaluations because less subjective
  – expensive: 3 to 7 minutes per sentence for a human to annotate
  – not suitable for use in the development cycle of an MT system


Application-oriented MT evaluation

Quality Estimation (QE)

• From controlled lab tests and evaluation campaigns…
• …to MT evaluation in real-life conditions (e.g. the CAT framework)
  – As a support to human translators
  – At run time
  – Without reference translations


(One) scenario: the CAT framework

[Figure: CAT tool]

The CAT tool:
1. Segments the input document
2. Provides, for each segment:
   • Suggestions from a translation memory (TM)
   • Suggestions from an MT engine

The translator, for each segment:
1. Selects the best suggestion
2. Post-edits it (if necessary) to reach publication quality

(One) scenario: the CAT framework

• Questions:
  – Is this suggestion good enough to be published?
  – Can I trust it?
  – Can a reader get the gist?
  – Is it publishable “as is”?
  – If not, what is better: post-editing or rewriting?
• Huge market interest
  – Increased translators’ productivity
  – No manual intervention on reliable MT suggestions

Predicting MT output quality

• Task: automatically estimate MT output quality at run-time and without reference translations
• Approach: supervised learning. First (training step), a model is learned from human-labelled data. Then (prediction step), the model is used to label new, unseen data.


Positive/negative examples

Possible features: hasWings, hasFeathers, sound, moves, hasPalmateFeet, etc.

Predicting MT output quality

• What is a good indicator of translation quality?
• It should take into account:
  – Correctness and usefulness of the translation
  – Cognitive effort needed by a human for the correction
• All these aspects can be summarized in the:
  – Post-editing effort












• What is post-editing?
  – A process of modification rather than revision (Loffler-Laurian 1985)
  – The “term used for the correction of machine translation output by human linguists/editors” (Veale and Way 1997)
  – Repairing texts (Krings, 2001)
  – “…the process of improving a machine-generated translation with a minimum of manual labor” (TAUS report, 2010)



• What is post-editing effort?
  – the effort made by a post-editor to manually improve a machine-generated translation
• Measures of post-editing effort:
  – Quality score (as estimated by humans on a 1-5 Likert scale)
  – Number of edit operations (HTER)
  – Post-editing time (total seconds or seconds per word)
  – Number of keystrokes
  – …


Quality scores

• Arbitrary choice of the levels of quality
  1 = requires complete retranslation;
  2 = requires some retranslation;
  3 = very little post-editing needed;
  4 = fit for purpose
• Labeling requires human intervention
• A precise measure
• Subjective/expensive/time-consuming task


• Workshop on SMT scoring scheme:
  1. The MT output is incomprehensible, with little or no information transferred accurately. It cannot be edited, needs to be translated from scratch.
  2. About 50%-70% of the MT output needs to be edited. It requires a significant editing effort in order to reach publishable level.
  3. About 25-50% of the MT output needs to be edited. It contains different errors and mistranslations that need to be corrected.
  4. About 10-25% of the MT output needs to be edited. It is generally clear and intelligible.
  5. The MT output is perfectly clear and intelligible. It is not necessarily a perfect translation, but requires little to no editing.



Post-editing time

• Seconds needed to post-edit a sentence
• Normalized version in seconds per word
  – little time = good translation
  – large time = bad translation
• Usually includes:
  – reading time
  – searching for information on external resources
  – typing time
  – extra time for secondary activity (e.g. correction)
• High variability across sentences and translators


HTER (again!)

• Human-targeted TER is the standard edit distance between the original machine translation and its minimally post-edited version
  – edits: insertion, deletion, substitution, shift
• Lower variability (with respect to time) across sentences/translators

$$\mathrm{HTER} = \frac{\#\,\text{edits}}{\#\,\text{words in the post-edited version}}$$

Post-editing time vs. HTER

• Time: pros/cons
  – Accounts for different efforts in translating different words
  – Variability among post-editors
• HTER: pros/cons
  – Objective, easy to compute measure
  – Less variance across post-editors (bad = bad for all)
  – Ignores different efforts in translating different words


• Tasks:
  – Automatic labeling
    • real values = regression
    • integers = classification
  – Automatic ranking
• Granularity
  – Word level (e.g. “The cat enter in the room”)
  – Sentence level (e.g. “The cat enter in the room”: 2.27)
  – Document level


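Evaluation metrics - Regression

• Regression (predictions as real values):
  – Mean Absolute Error (MAE)
  – Root Mean Squared Error (RMSE)
• Given a set of predicted scores H and a set of human scores V

$$\mathrm{MAE} = \frac{\sum_{i=1}^{N} \left| H(s_i) - V(s_i) \right|}{N} \qquad \mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} \left( H(s_i) - V(s_i) \right)^2}{N}}$$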
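Both regression metrics follow directly from their definitions. In the sketch below, `pred` plays the role of the predicted scores H and `gold` the human scores V; the function names and the three example scores are made up for illustration.

```python
import math


def mae(pred, gold):
    """Mean Absolute Error between predicted and human scores."""
    if len(pred) != len(gold) or not gold:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(abs(h - v) for h, v in zip(pred, gold)) / len(gold)


def rmse(pred, gold):
    """Root Mean Squared Error; penalizes large individual errors more than MAE does."""
    if len(pred) != len(gold) or not gold:
        raise ValueError("score lists must be non-empty and of equal length")
    return math.sqrt(sum((h - v) ** 2 for h, v in zip(pred, gold)) / len(gold))


# Illustrative (invented) sentence-level scores, not data from the slides.
pred = [1.0, 2.0, 3.0]
gold = [2.0, 2.0, 5.0]
print(mae(pred, gold), rmse(pred, gold))
```

RMSE is never smaller than MAE on the same data, and the gap widens when a few predictions are far off.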


Evaluation metrics - Classification

• Classification (predictions as integers):
  – Precision (Pr)
  – Recall (Re)
  – F-score (F1)
• Given a set of predicted scores H and a set of human scores V
• An example for binary classification

           V = 1            V = -1
H = 1      True Positive    False Positive
H = -1     False Negative   True Negative

$$Pr = \frac{tp}{tp + fp} \qquad Re = \frac{tp}{tp + fn} \qquad F_1 = \frac{2 \cdot Pr \cdot Re}{Pr + Re}$$
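A small sketch of the binary case, using the same 1 / -1 labels as the table. The function name and the five example labels are hypothetical; returning 0.0 when a denominator is zero is a convention chosen here, not something the slides specify.

```python
def precision_recall_f1(pred, gold, positive=1):
    """Precision, recall and F1 for the positive class, from predicted (H) and human (V) labels."""
    tp = sum(1 for h, v in zip(pred, gold) if h == positive and v == positive)
    fp = sum(1 for h, v in zip(pred, gold) if h == positive and v != positive)
    fn = sum(1 for h, v in zip(pred, gold) if h != positive and v == positive)
    pr = tp / (tp + fp) if tp + fp else 0.0
    re = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pr * re / (pr + re) if pr + re else 0.0
    return pr, re, f1


# Illustrative (invented) labels: 1 = good translation, -1 = bad translation.
pred = [1, 1, -1, -1, 1]
gold = [1, -1, 1, -1, 1]
print(precision_recall_f1(pred, gold))
```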


Evaluation metrics - Ranking

• Spearman’s rank coefficient
• DeltaAverage (introduced at WMT 2012)

System                    Human
     Score   Ranking           Judgment   Ranking
s1   3.2     3            s1   5          1
s2   1       5            s2   1          5
s3   5       1            s3   4          2
s4   2.7     4            s4   2          4
s5   4       2            s5   3          3

Rank similarity metric: compares the System ranking with the Human ranking
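One way to compare the two rankings in the table is Spearman's rank coefficient. The sketch below uses the textbook formula ρ = 1 − 6·Σd² / (n·(n² − 1)), which assumes there are no tied scores; the function names are hypothetical and the scores are the ones shown for s1 to s5.

```python
def ranks(scores):
    """Rank 1 for the highest score; assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman's rank coefficient for two tie-free score lists of equal length (n >= 2)."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


system_scores = [3.2, 1, 5, 2.7, 4]  # s1..s5, system column
human_scores = [5, 1, 4, 2, 3]       # s1..s5, human column

print(ranks(system_scores), ranks(human_scores))
print(spearman(system_scores, human_scores))
```

Only the order of the scores matters, so a system whose scores are on a different scale from the human judgments can still obtain a high coefficient.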

Quality indicators

• Features can be extracted from
  – The source sentence (“Complexity” indicators)
  – The translated sentence (“Fluency” indicators)
  – Source and target sentences (“Adequacy” and other indicators)
  – The MT system during the translation process (“Confidence” indicators)

[Figure: source sentence → MT system → translated sentence]

Quality indicators - Complexity

• Capture the difficulty of translating the source sentence
• Complex sentences are harder to translate
  – source sentence length
  – n-gram language model probability
  – number of punctuation marks
  – source sentence type/token ratio (e.g. #nouns/#tokens)
  – avg. # of translations per word (as given by probabilistic dictionaries)
  – % of content/non-content words
  – …
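A few of the simpler complexity indicators can be read directly off the tokenized source sentence. The sketch below is illustrative only: the function and feature names are hypothetical, and the type/token ratio is computed as distinct tokens over all tokens, which is one common reading and an assumption here (the slide's own example is #nouns/#tokens, which would need a POS tagger).

```python
import string


def complexity_features(source_tokens):
    """Surface-level complexity indicators for one tokenized source sentence."""
    n_tokens = len(source_tokens)
    # A token counts as punctuation if it consists only of ASCII punctuation characters.
    n_punct = sum(
        1 for tok in source_tokens
        if tok and all(ch in string.punctuation for ch in tok)
    )
    n_types = len({tok.lower() for tok in source_tokens})
    return {
        "length": n_tokens,
        "punctuation_marks": n_punct,
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
    }


features = complexity_features("Le chat entre dans la chambre .".split())
print(features)
```

In a quality estimation system, a vector of such values would be one part of the input to the supervised model, alongside fluency, adequacy and confidence indicators.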


Quality indicators - Fluency

• Capture the level of naturalness of the translation in the target language
• The translation should conform to the target language in terms of grammar, with lexical choices appropriate to the genre of the source text
  – n-gram language model probability
  – POS-tag target language model
  – …


Quality indicators - Adequacy

• Capture the level of semantic equivalence between source and translation
• Source and target sentences should convey the same meaning. Meaning drifts/losses from source to target sentence indicate a bad translation
  – % of aligned words in source and target
  – % of alignments between words with the same part of speech
  – % of aligned nouns/verbs/adjectives
  – aligned IDF mass (IDF as indicator of term relevance)
  – …


Quality indicators - Confidence

• Capture the level of confidence of the SMT system
• Sentences for which the translation process is complex are more likely to be bad translations
  – length N of the N-best list
  – number of pruned hypotheses
  – log-likelihood score
  – avg. edit distance of the 1-best from the first k-bests
  – …


Open issues

• Lack of an objective quality score able to capture cognitive effort
  – A new score that contains the main features of HTER and correlates well with PE time
• Lack of a technique able to threshold the quality score (bad vs. good translations)
  – Is HTER = 0.3/0.5/0.7 a bad or good translation?
  – Useful in the CAT tool scenario, where it is necessary to discard bad translations
• More than 1,000 quality indicators have been developed in recent years.
  – Do we need all of them in a real application?
  – Which are the most reliable in each group?
  – Which is the best combination?
• Subjectivity in the post-editor’s work and in the task
  – A single quality estimator for very different post-editor behaviors and tasks
  – Adaptability/personalization


MT Evaluation Dilemma

Summary

• MT evaluation: a hot topic…
  – Shared evaluation methods/routines are a key asset in any field
• …but a difficult task
  – We talked about error variability, costs, speed, replicability, subjectivity, correlation with human judgments, etc.
• Human evaluation
  – Accurate, high quality, meaningful, expensive, slow, subjective
  – Fluency, adequacy
• Automatic evaluation
  – Cheap, quick, repeatable, objective, approximate, less accurate
  – Reference-based: BLEU, TER, HTER (pros and cons)
  – Reference-free: quality estimation (goal, methods, open issues)

Summary

• Key concepts:
  – Adequacy
  – Reference translation
  – Agreement
  – Correlation
  – Post-editing effort
  – CAT tool
  – Feature
  – Cognitive effort
  – HTER
  – Mean Absolute Error
