Adaptable Automatic Evaluation Metrics for
Machine Translation
Lucian Vlad Lita, joint work with Alon Lavie and Monica Rogati
Outline
- BLEU and ROUGE metric families
- BLANC – a family of adaptable metrics
  - All common skip n-grams
  - Local n-gram model
  - Overall model
- Experiments and results
- Conclusions
- Future work
- References
Automatic Evaluation Metrics
- Manual human judgments
- Edit distance (WER)
- Word overlap (PER)
- Metrics based on n-grams:
  - n-gram precision (BLEU)
  - weighted n-grams (NIST)
  - longest common subsequence (ROUGE-L)
  - skip 2-grams (pairs of ordered words – ROUGE-S)
- Metrics that integrate additional knowledge such as synonyms and stemming (METEOR)
[Figure: timeline of evaluation metrics, plotting translation quality (candidate | reference) over time]
Automatic Evaluation Metrics
Manual human judgments
- Machine translation (MT) evaluation metrics are manually created estimators of quality
- Improvements are often shown on the same data
- Rigid notion of quality, based on existing judgment guidelines
Goal: trainable evaluation metric
Goal: Trainable MT Metric
- Build on the features used by established metrics (BLEU, ROUGE)
- Extendable with additional features/processing
- Correlate well with human judgments
- Trainable models
- Support different notions of "translation quality", e.g. computer consumption vs. human consumption
- Different features will be more important for different languages and domains
The WER Metric
R: the students asked the professor
C: the students talk professor
$$\mathrm{WER} = \frac{\#\,\text{word insertions} + \#\,\text{deletions} + \#\,\text{substitutions}}{\#\,\text{words in } R}$$
- Transform the reference (human) translation R into the candidate (machine) translation C
- Computed as the Levenshtein (edit) distance (see the sketch below)
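A minimal sketch of this computation (Python, not from the original slides): WER as token-level Levenshtein distance.

```python
def wer(reference, candidate):
    """Word Error Rate: token-level Levenshtein distance divided by |R|."""
    r, c = reference.split(), candidate.split()
    # d[i][j] = edit distance between r[:i] and c[:j]
    d = [[0] * (len(c) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(c) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(c) + 1):
            sub = 0 if r[i - 1] == c[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete r[i-1]
                          d[i][j - 1] + 1,        # insert c[j-1]
                          d[i - 1][j - 1] + sub)  # substitute or match
    return d[len(r)][len(c)] / len(r)

# wer("the students asked the professor", "the students talk professor") == 0.4
# (one substitution "asked" -> "talk", one deletion of "the": 2/5)
```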
The PER Metric
- Word overlap between candidate (machine) translation C and reference (human) translation R
- Bag-of-words comparison
Position Independent Error Rate:

$$\mathrm{PER} = \frac{\sum_{w \in C} \left| \mathrm{count}(w, R) - \mathrm{count}(w, C) \right|}{\#\,\text{words in } R}$$

R: the students asked the professor
C: the students talk professor
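A minimal sketch (Python, not from the slides), implementing the formula above with the sum taken over the words of C:

```python
from collections import Counter

def per(reference, candidate):
    """Position-independent Error Rate: bag-of-words count differences,
    summed over the words of C (as in the slide's formula), divided by |R|."""
    r, c = Counter(reference.split()), Counter(candidate.split())
    diff = sum(abs(r[w] - c[w]) for w in set(c))
    return diff / sum(r.values())

# per("the students asked the professor", "the students talk professor")
# "the": |2 - 1| = 1, "talk": |0 - 1| = 1  ->  2/5
```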
The BLEU Metric
Modified n-gram precisions:
- 1-gram precision = 3/4
- 2-gram precision = 1/3
- …
Contiguous n-gram overlap between reference (human) translation R and candidate (machine) translation C
R: the students asked the professor
C: the students talk professor
$$\mathrm{BLEU} = \left( \prod_{i=1}^{n} P_{i\text{-gram}} \right)^{1/n} \cdot (\text{brevity penalty})$$
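A single-reference sketch of these two pieces (Python, not from the slides; real BLEU uses multiple references and corpus-level counts):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(c, r, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it appears in the reference."""
    c_counts, r_counts = Counter(ngrams(c, n)), Counter(ngrams(r, n))
    clipped = sum(min(cnt, r_counts[g]) for g, cnt in c_counts.items())
    return clipped / max(1, sum(c_counts.values()))

def bleu(c, r, n=4):
    """Geometric mean of modified precisions times the brevity penalty."""
    ps = [modified_precision(c, r, i) for i in range(1, n + 1)]
    if min(ps) == 0:
        return 0.0
    bp = min(1.0, math.exp(1 - len(r) / len(c)))  # brevity penalty
    return bp * math.exp(sum(map(math.log, ps)) / n)

# c = "the students talk professor".split()
# r = "the students asked the professor".split()
# modified_precision(c, r, 1) == 3/4; modified_precision(c, r, 2) == 1/3
```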
The BLEU Metric
BLEU is the most established evaluation metric in MT
- Basic feature: contiguous n-grams of all sizes
- Computes modified precision
- Uses a simple formula to combine all precision scores
- Bigram precision is "as important" as unigram precision
- Brevity penalty acts as quasi-recall
The Rouge-L Metric
R: the students asked the professor
C: the students talk professor
Longest common subsequence (LCS) of the candidate (machine) translation C and the reference (human) translation R:
LCS = 3 ("the students … professor")
$$\mathrm{Precision} = \frac{\mathrm{LCS}(C, R)}{\#\,\text{words in } C} \qquad \mathrm{Recall} = \frac{\mathrm{LCS}(C, R)}{\#\,\text{words in } R}$$

$$\text{Rouge-L} = \text{harmonic mean}(\mathrm{Precision}, \mathrm{Recall}) = \frac{2PR}{P + R}$$
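A minimal sketch (Python, not from the slides) of the LCS recurrence and the harmonic-mean combination:

```python
def lcs_length(c, r):
    """Length of the longest common subsequence of token lists c and r."""
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i in range(1, len(c) + 1):
        for j in range(1, len(r) + 1):
            if c[i - 1] == r[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(c)][len(r)]

def rouge_l(candidate, reference):
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    p, rec = lcs / len(c), lcs / len(r)
    return 2 * p * rec / (p + rec) if p + rec else 0.0

# lcs_length("the students talk professor".split(),
#            "the students asked the professor".split()) == 3
```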
The Rouge-S Metric
R: the students asked the professor
C: the students talk professor
Skip 2-gram overlap between the candidate (machine) translation C and the reference (human) translation R
Skip2(C) = 6: { "the students", "the talk", "the professor", "students talk", "students professor", "talk professor" }
Skip2(C,R) = 3: { "the students", "the professor", "students professor" }
The Rouge-S Metric
R: the students asked the professor
C: the students talk professor
Skip 2-gram overlap between the candidate (machine) translation C and the reference (human) translation R
$$\mathrm{Precision} = \frac{\mathrm{Skip2}(C, R)}{\binom{|C|}{2}} \qquad \mathrm{Recall} = \frac{\mathrm{Skip2}(C, R)}{\binom{|R|}{2}}$$
Rouge-S = harmonic mean (Precision, Recall)
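A minimal sketch (Python, not from the slides); following the example above, repeated pairs are collapsed into a set:

```python
from itertools import combinations
from math import comb

def skip2(tokens):
    """All ordered word pairs, with any gap size, as a set."""
    return set(combinations(tokens, 2))

def rouge_s(candidate, reference):
    c, r = candidate.split(), reference.split()
    common = len(skip2(c) & skip2(r))
    p = common / comb(len(c), 2)    # |C| choose 2 candidate pairs
    rec = common / comb(len(r), 2)  # |R| choose 2 reference pairs
    return 2 * p * rec / (p + rec) if p + rec else 0.0

# skip2("the students talk professor".split()) has 6 pairs;
# 3 of them also occur in the reference, as in the example above.
```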
The ROUGE Metrics
Rouge-L:
- Basic feature: longest common subsequence (LCS) – the size of the longest common skip n-gram
- Weighted LCS variant
Rouge-S:
- Basic feature: skip bigrams
- Skip bigram gap size is irrelevant
- Limited to n-grams of size 2
Both use harmonic mean (F1-measure) to combine precision and recall
Is BLEU Trainable?
Can we assign/learn relative importance between P2 and P3?
Simplest model: regression
- Train/test on past MT output [C, R]
- Inputs: P1, P2, P3, … and the brevity penalty

(P1, P2, P3, bp) → human judgment (HJ) fluency score
$$\mathrm{BLEU} = \left( \prod_{i=1}^{n} P_{i\text{-gram}} \right)^{1/n} \cdot (\text{brevity penalty})$$
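A minimal sketch of such a regression (Python with NumPy; the feature matrix and scores below are made-up placeholders, not TIDES data):

```python
import numpy as np

# Hypothetical training data: one row of [P1, P2, P3, P4, bp] per past
# [C, R] pair, and the matching human-judgment fluency scores.
X = np.array([[0.75, 0.33, 0.10, 0.05, 0.78],
              [0.90, 0.60, 0.41, 0.28, 1.00],
              [0.55, 0.21, 0.08, 0.02, 0.64],
              [0.82, 0.47, 0.25, 0.12, 0.95]])
y = np.array([2.1, 4.3, 1.7, 3.6])

# Least-squares fit with a bias column: the learned weights replace
# BLEU's fixed, uniform 1/n weighting of the n-gram precisions.
A = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_fluency(features):
    return np.append(features, 1.0) @ w
```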
Is Rouge Trainable?
Simple regression on:
- Size of the longest common skip n-gram
- Number of common skip 2-grams

Second-order parameters (dependencies) – the model is no longer linear in its inputs:
- Window size (for computational reasons)
- F-measure (replacing the brevity penalty)

Potential models:
- Iterative methods
- Hill climbing?

Non-linear model: (bp, |LCS|, Skip2, F, ws) → human judgment (HJ) fluency score
The BLANC Metric Family
- Generalization of established evaluation metrics: uses the n-gram features of BLEU and ROUGE
- Trainable parameters:
  - Skip n-gram contiguity in C
  - Relative importance of n (i.e. bigrams vs. trigrams)
  - Precision-recall balance
- Adaptable to different translation quality criteria, languages, and domains
- Allows additional processing/features (e.g. METEOR matching)
All Common Skip N-grams
C: the one pure student brought the necessary condiments
R: the new student brought the food
[Figure: lattice of matched words between C and R – the(0,0), student(2,3), brought(3,4), the(4,5), plus the cross matches the(0,5) and the(4,0); counting all common skip n-grams gives # 1-grams: 4, # 2-grams: 6, # 3-grams: 4, # 4-grams: 1]
All Common Skip N-grams
C: the one pure student brought the necessary condiments
R: the new student brought the food

[Figure: the same lattice, with counts replaced by scores – pairwise terms such as score(the(0,0), student(2,3)) are combined into score(1-grams), score(2-grams), score(3-grams), and score(4-grams)]
All Common Skip N-grams
- Algorithms literature: all common subsequences
- Listing vs. counting subsequences – here we are interested in counting
- # common subsequences of size 1, 2, 3, …
- Replace counting with a score over all n-grams of the same size:

$$\mathrm{Score}(w_1 \ldots w_i, w_{i+1} \ldots w_n) = \mathrm{Score}(w_1 \ldots w_i) \cdot \mathrm{Score}(w_{i+1} \ldots w_n)$$

$$\mathrm{BLANC}_i(C, R) = f(\text{common } i\text{-grams of } C, R)$$
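A sketch of the counting step (Python, not from the slides): a dynamic program that counts, for each size k, all monotone matchings of k-token subsequences between C and R. (The figure above restricts attention to one chain of matches, so its unigram count can differ from this exhaustive count.)

```python
def count_common_skip_ngrams(c, r, max_n):
    """For each k = 1..max_n, count the common skip k-grams of token
    lists c and r (pairs of increasing position tuples whose words match)."""
    lc, lr = len(c), len(r)
    # dp[i][j][k] = number of common skip k-grams within c[:i] and r[:j]
    dp = [[[0] * (max_n + 1) for _ in range(lr + 1)] for _ in range(lc + 1)]
    for i in range(lc + 1):
        for j in range(lr + 1):
            dp[i][j][0] = 1  # the empty n-gram always matches once
    for i in range(1, lc + 1):
        for j in range(1, lr + 1):
            for k in range(1, max_n + 1):
                # inclusion-exclusion over dropping the last token of c or r
                dp[i][j][k] = (dp[i - 1][j][k] + dp[i][j - 1][k]
                               - dp[i - 1][j - 1][k])
                if c[i - 1] == r[j - 1]:
                    dp[i][j][k] += dp[i - 1][j - 1][k - 1]
    return {k: dp[lc][lr][k] for k in range(1, max_n + 1)}
```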
Modeling Gap Size Importance
skip 3-grams
… the ____ ____ ____ ____ student ____ ____ has …
… the ____ student has …
… the student has …
Modeling Gap Size Importance
Model the importance of skip n-gram gap size as an exponential function with one trainable parameter (a sketch follows the example below).
Special cases:
- Gap size does not matter (Rouge-S): parameter = 0
- No gaps are allowed (BLEU): parameter = a large number
C: … the __ __ __ __ student __ __ has …
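A sketch of this weighting (Python; the parameter name `decay` is hypothetical, standing in for the slide's single unnamed parameter):

```python
import math

def gap_weight(total_gap, decay):
    """Exponential penalty on the total gap size of a skip n-gram.

    decay == 0     -> every gap size scores 1.0 (Rouge-S behavior)
    decay -> large -> only contiguous n-grams survive (BLEU behavior)
    """
    return math.exp(-decay * total_gap)

# gap_weight(0, 5.0) == 1.0; gap_weight(4, 5.0) ~ 2e-9 (contiguous-only)
```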
Modeling Candidate-Reference Gap Difference
skip 3-gram match
C1: … the ____ ____ ____ ____ student ____ ____ has …
R: … the ____ student has …
C2: … the student has …
Modeling Candidate-Reference Gap Difference
Model the importance of the gap size difference between the candidate and reference translations as an exponential function with one trainable parameter.
Special cases:
- Gap size differences do not matter: difference parameter = 0
- Skip 2-gram overlap (Rouge-S): both gap parameters = 0, n = 2
- Largest skip n-gram (Rouge-L): both gap parameters = 0, n = LCS
C: … the __ __ __ __ student __ __ has …
R: … the __ student has …
Skip N-gram Model
Incorporate simple scores into an exponential model (see the sketch below):
- Skip n-gram gap size
- Candidate-reference gap size difference
Possible to incorporate higher-level features:
- Partial skip n-gram matching (e.g. synonyms, stemming): "the __ students" vs. "the __ pupils", "the __ students" vs. "the __ student"
- From word classing to syntax, e.g. score("students __ __ professor") vs. score("the __ __ of")
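A sketch of the local exponential model over one matched skip n-gram pair (Python; the feature and weight names are hypothetical, since the slide's symbols did not survive extraction):

```python
import math

def skip_ngram_score(gaps_c, gaps_r, w_gap, w_diff):
    """Local exponential model over a matched skip n-gram pair.

    Features: total gap size in the candidate, and the total gap-size
    difference between candidate and reference.
    """
    f_gap = sum(gaps_c)
    f_diff = sum(abs(gc - gr) for gc, gr in zip(gaps_c, gaps_r))
    return math.exp(-(w_gap * f_gap + w_diff * f_diff))

# "the __ __ __ __ student __ __ has" vs. "the __ student has":
# candidate gaps (4, 2), reference gaps (1, 0)
# skip_ngram_score((4, 2), (1, 0), 0.1, 0.1) == math.exp(-1.1)
```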
BLANC Overview

[Diagram: candidates + references → find all common skip n-grams → compute skip n-gram pair features $e^{-\lambda_i f_i(s_n)}$ → combine all common skip n-gram scores (global parameters: precision/recall balance, f(skip n-gram size)) → compute correlation coefficient (Pearson, Spearman) against the training criterion (adequacy, fluency, f(adequacy, fluency), other) → trained metric]
Incorporating Global Features
Compute BLANC precision and recall for each n-gram size i.
Global exponential model based on:
- n-gram size i
- BLANC_i(C, R), i = 1..n
- An F-measure parameter for each size i
- Average reference segment size
- Other scores (e.g. BLEU, ROUGE-L, ROUGE-S) …
Train against the average human judgment vs. train for the best overall correlation (as the error function).
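A sketch of one plausible global combination (Python; an illustration of the idea under assumed parameter names, not the exact model from the paper):

```python
import math

def blanc_global(precisions, recalls, alpha, size_weights):
    """Combine per-size BLANC precision/recall into one score.

    alpha balances precision vs. recall (the trainable F-measure
    parameter); size_weights set the relative importance of each
    n-gram size, replacing BLEU's fixed uniform 1/n weighting.
    """
    total = 0.0
    for p, r, w in zip(precisions, recalls, size_weights):
        if p == 0 or r == 0:
            return 0.0
        f = p * r / (alpha * r + (1 - alpha) * p)  # weighted harmonic mean
        total += w * math.log(f)
    return math.exp(total / sum(size_weights))

# blanc_global([0.75, 0.33], [0.60, 0.25], alpha=0.5, size_weights=[1.0, 0.7])
```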
Experiment Setup
TIDES evaluation data: Arabic→English, 2003 and 2004
- Training and test sentences separated by year
- Optimized: n-gram contiguity, difference in gap size (C vs. R), and the balance between precision and recall
- Correlation measured with the Pearson correlation coefficient
- Compared BLANC to BLEU and ROUGE
- Trained BLANC for: fluency vs. adequacy, system level vs. sentence level
Tides 2003 Arabic Evaluation
Method  | System Adequacy | System Fluency | Sentence Adequacy | Sentence Fluency
BLEU    | 0.950           | 0.934          | 0.382             | 0.286
NIST    | 0.962           | 0.939          | 0.439             | 0.304
Rouge-L | 0.974           | 0.926          | 0.440             | 0.328
Rouge-S | 0.949           | 0.935          | 0.360             | 0.328
BLANC   | 0.988           | 0.979          | 0.492             | 0.391
Pearson [-1,1] correlation with human judgments at system level and sentence level
Tides 2004 Arabic Evaluation
Method  | System Adequacy | System Fluency | Sentence Adequacy | Sentence Fluency
BLEU    | 0.978           | 0.994          | 0.446             | 0.337
NIST    | 0.987           | 0.952          | 0.529             | 0.358
Rouge-L | 0.981           | 0.985          | 0.538             | 0.412
Rouge-S | 0.937           | 0.980          | 0.367             | 0.408
BLANC   | 0.982           | 0.994          | 0.565             | 0.438
Pearson [-1,1] correlation with human judgments at system level and sentence level
Advantages of BLANC
- Consistently good performance
- Candidate evaluation is fast
- Adaptable to fluency and adequacy, and to different languages and domains
- Can help train MT systems for specific tasks, e.g. information extraction, information retrieval
- Model complexity can be optimized for specific MT system performance levels
Disadvantages of BLANC
- Amount of training data vs. number of parameters
- Model complexity
- Guarantees of the training process
Conclusions
- Move towards learning evaluation metrics, adaptable to:
  - Quality criteria – e.g. fluency, adequacy
  - Correlation coefficients – e.g. Pearson, Spearman
  - Languages – e.g. English, Arabic, Chinese
- BLANC – a family of trainable evaluation metrics that consistently performs well on evaluating machine translation output
Future Work
Recently obtained a two-year NSF grant
- Try different models and improve the training mechanism for BLANC:
  - Is a local exponential model the best choice?
  - Is a global exponential model the best choice?
  - Explore different training methods
- Integrate additional features
- Apply BLANC to other tasks (e.g. summarization)
References
Leusch, Ueffing, Vilar and Ney, "Preprocessing and Normalization for Automatic Evaluation of Machine Translation", IEEMTS Workshop, ACL 2005
Lin and Och, "Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics", ACL 2004
Lita, Lavie and Rogati, "BLANC: Learning Evaluation Metrics for MT", HLT-EMNLP 2005
Papineni, Roukos, Ward and Zhu, "BLEU: A Method for Automatic Evaluation of Machine Translation", ACL 2002
Akiba, Imamura and Sumita, "Using Multiple Edit Distances to Automatically Rank Machine Translation Output", MT Summit VIII, 2001
Su, Wu and Chang, "A New Quantitative Quality Measure for Machine Translation Systems", COLING 1992
Thank you
Acronyms, acronyms …
Official: Broad Learning Adaptation for Numeric Criteria
Inspiration: white light contains light of all frequencies
Fun: Building on Legacy Acronym Naming Conventions – Bleu, Rouge, Orange, Pourpre … Blanc?