Quality Estimation Method HT vs MT Data analysis Conclusions
Predicting Human Translation Quality
Lucia Specia
University of Sheffield
[email protected]
QTLaunchPad Workshop, Dubrovnik
15 June 2014
(Joint work with Kashif Shah)
Predicting Human Translation Quality 1 / 23
Outline
1 Quality Estimation
2 Method
3 HT vs MT
4 Data analysis
5 Conclusions
Overview
Translation quality estimation (QE): automatic metrics that provide an estimate of the quality of a translated text
No access to reference translations; intended for MT systems in use
So far, only applied to machine-translated (MT) texts
Applications
Can a reader get the gist?
Is it worth post-editing it?
How much time to fix it?
Can we publish it as is?
Does it need human checking?
Learning
Supervised machine learning to build models based on training data:
annotated with quality labels (human input at “training” time)
described by features
“Quality” defined according to the problem (and data):
Post-editing time for a sentence
MQM issue for a word
Models predict such quality scores
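The learning setup above can be sketched in a few lines. This is only an illustration: the slides do not name a learning algorithm, so a k-nearest-neighbours regressor stands in for it, and the feature vectors (source length, target length) and post-editing times are hypothetical.

```python
from math import dist


def knn_predict(train_features, train_labels, query, k=3):
    """Predict a quality score as the mean label of the k nearest training examples."""
    ranked = sorted(
        zip(train_features, train_labels),
        key=lambda pair: dist(pair[0], query),
    )
    nearest = ranked[:k]
    return sum(label for _, label in nearest) / len(nearest)


# Hypothetical training data: each sentence is described by two features
# (source length, target length) and labelled with a post-editing time in seconds.
train_features = [(5, 6), (7, 7), (20, 24), (22, 21), (40, 45)]
train_labels = [10.0, 12.0, 60.0, 55.0, 150.0]

# Predict the post-editing time for an unseen sentence from its features alone.
predicted = knn_predict(train_features, train_labels, query=(21, 22), k=2)
print(predicted)
```

Any regressor or classifier could take the place of `knn_predict`; the point is that labels are needed only at training time, while prediction uses features alone.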
Can we predict HT quality?
Key objective in QTLP: (automated) metrics to evaluate and estimate translation quality of human and machine translations
MT quality estimation works well (at sentence-level):
WMT12-14 shared tasks
QuEst framework: www.quest.dcs.shef.uk
Large number of recent papers
Commercial adoption: Multilizer, SDL-LW, Yandex
Question: Can we apply the same framework to predict the quality of human translations?
Can we predict HT quality?
Motivation: Automate/sample for quality assurance
Encouraging fact: hard to distinguish MT and HT (EAMT-Tuesday). But:
1 Do (professional) human translators make mistakes?
2 Are HT errors the same/similar to MT errors?
3 Are current quality estimation tools good for HT?
Data analysis to answer these questions: sentence- and (partially) word-level
QTLP datasets
Not possible before QTLP: no large enough dataset available with both MTs and HTs
Our data is different from existing ’learner’ corpora:
HTs produced by professionals
2-3 state-of-the-art MT systems (RBMT, SMT, hybrid)
Both HT and MT annotated by professional translators
4 language pairs
News and ’customer’ data
Datasets also used for WMT14 QE shared tasks
QTLP datasets - sentence-level
Labels:
1 = Perfect translation, no post-editing needed at all
2 = Near miss translation: translation contains maximum of 2-3 errors, and possibly additional errors that can be easily fixed (capitalisation, punctuation, etc.)
3 = Very low quality translation, cannot be easily fixed
Sentences:
# Source         # HT+MTs   # Target
1,104 English    4          4,416 Spanish
500 English      4          2,000 German
500 German       3          1,500 English
500 Spanish      3          1,500 English
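The target counts in the table follow from multiplying the number of source sentences by the number of translations per sentence (the HT plus the MT outputs). A small check of that arithmetic, with the rows copied from the table and the variable names chosen here for illustration:

```python
# (source sentences, source language, translations per source, target language)
datasets = [
    (1104, "English", 4, "Spanish"),
    (500, "English", 4, "German"),
    (500, "German", 3, "English"),
    (500, "Spanish", 3, "English"),
]


def target_count(n_source, n_translations):
    """Number of target sentences: one per source sentence per translation."""
    return n_source * n_translations


totals = [target_count(n, k) for n, _, k, _ in datasets]
total_targets = sum(totals)

for (n, src, k, tgt), total in zip(datasets, totals):
    print(f"{src}-{tgt}: {n} x {k} = {total}")
print("all language pairs:", total_targets)
```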
QTLP datasets - word-level
Labels: Core MQM:
Sentences: subset of 2s, from 2-3 MT systems + HT
# Source-target
2,339 English-Spanish
865 English-German
450 German-English
1,050 Spanish-English
Translators are humans...
1 Do (professional) human translators make mistakes?
Sentence-level: de-en
[Figure: distribution of labels (1 - perfect, 2 - few errors, 3 - too bad) for HT, MT-1, MT-2]
Sentence-level: en-es
[Figure: distribution of labels (1 - perfect, 2 - few errors, 3 - too bad) for HT, MT-1, MT-2, MT-3]
Sentence-level: en-de
[Figure: distribution of labels (1 - perfect, 2 - few errors, 3 - too bad) for HT, MT-1, MT-2, MT-3]
Sentence-level: es-en
[Figure: distribution of labels (1 - perfect, 2 - few errors, 3 - too bad) for HT, MT-1, MT-2]
Word-level: OK vs BAD
Words tagged
Lang    HT      MT
de-en   808     7420
en-es   8933    48089
en-de   1241    12406
es-en   1206    21818
[Figure: "OK – HT" vs "OK – MT" per language pair (de-en, en-es, en-de, es-en)]
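The OK vs BAD comparison reduces to a per-word proportion. A minimal sketch, assuming each word carries a single "OK" or "BAD" tag; the tag sequences below are hypothetical and are not taken from the QTLP data:

```python
def ok_share(tags):
    """Fraction of words tagged OK in a sequence of word-level tags."""
    if not tags:
        raise ValueError("no tags given")
    return sum(tag == "OK" for tag in tags) / len(tags)


# Hypothetical word-level annotations for one human and one machine translation.
ht_tags = ["OK", "OK", "OK", "BAD"]
mt_tags = ["OK", "BAD", "BAD", "OK", "BAD"]

ht_ok = ok_share(ht_tags)
mt_ok = ok_share(mt_tags)
print(f"OK share, HT: {ht_ok:.2f}  MT: {mt_ok:.2f}")
```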
Humans and machines are not the same
2 Are HT errors the same/similar to MT errors?
Word-level: Accuracy vs Fluency
[Figure: Fluency vs Accuracy errors per language pair (de-en, en-es, en-de, es-en); two panels, HT and MT]
Word-level: Types of errors de-en
[Figure: error types, MT vs HT. Axis labels recovered: Addition, Agreement, Capitalization, Function_words, Grammar, Mistranslation, Morphology, Part_of_speech, Punctuation, Spelling, Style/register, Tense/aspect/mood, Terminology, Typography, Unintelligible, Untranslated, Word_order]
Word-level: Types of errors en-es
[Figure: error types, MT vs HT. Axis labels recovered: Addition, Capitalization, Function_words, Mistranslation, Omission, Punctuation, Style/register, Terminology, Unintelligible, Word_order]
Word-level: Types of errors en-de
[Figure: error types, MT vs HT. Axis labels recovered: Accuracy, Addition, Agreement, Capitalization, Function_words, Grammar, Mistranslation, Morphology, Omission, Part_of_speech, Punctuation, Spelling, Style/register, Tense/aspect/mood, Terminology, Typography, Unintelligible, Untranslated, Word_order]
Word-level: Types of errors es-en
[Figure: error types, MT vs HT. Axis labels recovered: Accuracy, Agreement, Fluency, Grammar, Morphology, Part_of_speech, Spelling, Tense/aspect/mood, Typography, Untranslated]
Humans are harder to predict
3 Are current quality estimation tools good for HT?
[Figure: increase in performance per language pair (en-de, de-en, en-es, es-en); series: MT Delta Baseline-Quest, HT Delta Baseline-Quest]
Humans are harder to predict - Why?
Common (baseline) features (sentence-level):
no. of tokens in the source & target texts
average source token length
average no. of occurrences of target words in target text
no. of punctuation marks in source & target texts
language model probability of source & target texts
avg. no. of translations per source word
% of 1-grams, 2-grams & 3-grams in frequency quartiles 1 & 4 (lower/higher frequency) in source language corpus
% of 1-grams in source text seen in source language corpus
Word-level: requires more labelled data
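A few of the sentence-level baseline features listed above need only the source and target strings. The sketch below is an illustration and not the QuEst implementation: it assumes whitespace tokenisation and ASCII punctuation, and it leaves out the features that require external resources (language models, translation tables, corpus n-gram frequencies). The function and key names are hypothetical.

```python
import string


def simple_features(source, target):
    """Compute resource-free baseline features for one sentence pair."""
    src_tokens = source.split()
    tgt_tokens = target.split()
    return {
        # no. of tokens in the source and target texts
        "src_tokens": len(src_tokens),
        "tgt_tokens": len(tgt_tokens),
        # average source token length, in characters
        "avg_src_token_len": sum(len(t) for t in src_tokens) / max(len(src_tokens), 1),
        # average no. of occurrences of target words in the target text
        # (tokens divided by distinct tokens)
        "avg_tgt_word_occurrences": len(tgt_tokens) / max(len(set(tgt_tokens)), 1),
        # no. of punctuation marks in the source and target texts
        "src_punct": sum(ch in string.punctuation for ch in source),
        "tgt_punct": sum(ch in string.punctuation for ch in target),
    }


features = simple_features(
    "the cat sat , the dog ran .",
    "el gato se sentó , el perro corrió .",
)
print(features)
```

None of these features looks at meaning, which is one plausible reason they carry less signal for human translations, where errors such as mistranslation leave surface statistics largely unchanged.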
Conclusions
Human translation quality is harder to estimate:
Need more labelled data - hard to collect
Need more linguistically motivated features, to capture e.g. mistranslation - hard to generalise, require tools
Ongoing work within QTLP: collecting larger sets of annotated data