
Bilingual reading and interaction enriched by alignment data

ANR project 2012 CORD 015

April 2014

Benoit Le Ny - Guillaume Wisniewski - François Yvon

TRANSREAD

DELIVERABLE 3.2: FEATURES FOR PREDICTING TRANSLATION QUALITY AT THE WORD AND SENTENCE LEVEL


Features for Predicting Translation Quality at the Word and the Sentence Levels

Benoit Le Ny, Guillaume Wisniewski, François Yvon

April 2014

Abstract

This document corresponds to deliverable D3.2 of the TransRead project. It describes the features used in the quality estimation systems we have developed to predict translation quality at the word and sentence levels. These systems have been tested during the WMT'13 and WMT'14 evaluation campaigns.

1 Introduction

The third work-package of the TransRead project aims at developing confidence estimation methods for machine and human translations. This information about translation quality will be one of the 'signals' displayed in the user interface developed in the other subtasks of this work-package, in order to help human translators and correctors easily revise and validate translations.

In the first 18 months of the project, we have considered two main tasks: quality prediction at the word level and at the sentence level. The first task consists in deciding, knowing only the source sentence and a translation hypothesis, whether each word of the translation hypothesis should be post-edited or not. The second task consists in predicting the time needed to post-edit a complete translation hypothesis or, alternatively, the number of edits required to correct it. Since 2012, an international evaluation campaign has been organized in the context of the WMT workshop 1 to assess the performance achieved by various quality prediction systems for these two tasks. LIMSI has taken part in this evaluation since its very beginning.

The rest of the report is organized as follows: in Section 2, we give a quick overview of the systems developed; then, in Section 3, we detail the different features used by our systems. The performance of our systems and the predictive power of the different

1. www.statmt.org/wmt14


features we have considered are analyzed in detail in the articles we have published. These publications are listed in Section 4.

2 Architecture of the Quality Estimation System

In the context of TransRead, LIMSI has developed two pipelines for predicting translation quality at the word and at the sentence levels. These pipelines rely on statistical machine learning methods to automatically learn to predict the quality of a new translation. The general architecture of our pipelines, reproduced in Figure 1, is similar to the architecture of any application based on supervised machine learning techniques:

– a first module is responsible for extracting features, which turn the textual input (a source sentence and a translation hypothesis; or a source sentence and a single word of the translation hypothesis) into a vector of real numbers. The features considered are documented in Section 3.

– a second module is used to learn, from a set of labeled examples (i.e. translation hypotheses for which an indicator of translation quality is known), a classifier able to predict the label of a new example.

[Figure 1: General architecture of our quality prediction system. A feature extraction module turns each example X (a source sentence and its machine translation) into a feature vector; a machine learning algorithm is trained from labels Y describing the quality of the examples; the resulting quality estimation system is able to predict the quality of new examples.]

The second module relies on the free machine learning library scikit-learn 2. The first one has been developed from scratch at LIMSI. Many models (word-based and part-of-speech (POS) based language models, translation models, etc.) are necessary to achieve

2. http://scikit-learn.org


good results for confidence estimation, which makes the code quite complex, as many specific I/O methods are required.

The two systems can be trained using the annotated datasets provided in the context of the WMT evaluation campaigns on translation quality. Experiments on other corpora are in progress.
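To make the two-module design concrete, here is a minimal sketch that pairs a toy feature extractor with a scikit-learn classifier. The features, data, and names are illustrative assumptions, not the actual LIMSI pipeline:

```python
# Minimal sketch of the two-module pipeline; toy features and data,
# not the actual LIMSI implementation.
from sklearn.linear_model import LogisticRegression

def extract_features(source, hypothesis):
    """Module 1: turn a (source, hypothesis) pair into a vector of real numbers."""
    src_tokens = source.split()
    tgt_tokens = hypothesis.split()
    return [
        len(src_tokens),                            # source length
        len(tgt_tokens),                            # target length
        len(tgt_tokens) / max(len(src_tokens), 1),  # length ratio
    ]

# Module 2: learn, from labeled examples, a classifier able to predict
# the label of a new example (1 = needs post-editing, 0 = acceptable).
train_pairs = [("la maison bleue", "the blue house"),
               ("un chat noir", "a cat black cat")]
train_labels = [0, 1]
X = [extract_features(s, t) for s, t in train_pairs]
clf = LogisticRegression().fit(X, train_labels)

# The trained system can now predict the quality of new examples.
print(clf.predict([extract_features("le chien dort", "the dog sleeps")]))
```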

3 Features

3.1 Overview

In this section, we give a high-level overview of the features that can be used to predict the quality of a translation. This overview is directly based on the description of existing tools for quality control (cf. Deliverable D3.1 of the project). We have identified five broad categories of features:

Typographic features Such features are part of most modern computer-assisted translation environments and include checks such as consistent use of whitespace, detection of multiple spaces, date, time, and number format verification, consistency of capitalization, brackets, and punctuation, as well as field and tag matching. They allow translators to ensure perfect formatting of the target. However, this particular type of information is not really the subject of the project, which is more focused on quality estimation and visualization for the actual translation content. Moreover, formatting information is not used in statistical translation quality estimation models, which makes it difficult to use in the bilingual tool. For these reasons, typographic features will not be included as an indicator of translation quality in the TransRead tool. They will, however, be used to clean collected corpora.

Basic translation consistency controls These simple and easily computed features include various checks for human-generated errors of the oversight type:
– source equals target control, to detect untranslated segments;
– detection of segments with missing or partial translations;
– detection of repeated words and repeated segments;
– consistency of terminology usage with a given dictionary;
– consistency of terminology usage with a given translation memory;
– consistency throughout the whole document.

Basic monolingual and bilingual statistics This type of information includes all the indicators that can be computed on the fly: various simple yet useful cues at the segment and document levels.
– segment-level statistics: length ratio of source versus translation;
– document-level statistics:
  – total number of words in source and in target (considered separately or together);
  – number and ratios of particular word types, POS, terms, etc.;


  – distribution of word occurrences;
  – average word length.

Complex monolingual cues These are the monolingual features that rely on external tools and models requiring substantial computational resources, both in time and in space. The tool will not include this type of processing as such, and will only use the results of the corresponding computations.
– Language model scores (both word language models and part-of-speech (POS) language models). Although the language model score is already one of the important features of the global phrase-level translation quality estimation score, it might also be useful to include and visualize it separately, to provide a particular viewpoint on the translation quality.

– Out-of-vocabulary and rare words: showing the user that a particular source segment contains out-of-vocabulary words may also be a useful quality indicator, both as part of the general quality measure and as an independent feature.

– Spellchecking: we note that the majority of CAT tools contain this useful feature. During the ANR/TRACE project, both LIMSI and Reverso-Softissimo developed spell-checking tools that could be used as yet another type of quality indicator for the TransRead tool. This would be particularly useful in the human translation revision scenario, since automatic translation rarely contains spelling errors.

Complex bilingual cues Some of the bilingual information also requires substantial computational resources and will be pre-computed before loading the data into the bilingual tool.
– Alignment links and alignment confidence scores: alignment between the source and the target at various levels is the most important feature of the bilingual reader. Alignment links indicate the correct alignment between different chunks of source and target text (if the confidence score is high) and highlight potential alignment problems (if the confidence score is low). This concerns basic alignments as well as alignments of idiomatic expressions. Alignments also allow computing and visualizing other related features that are useful for translation quality estimation, such as non-aligned words, crossing links, and the complexity of the reordering. Visualization of the alignment and of its quality is yet another viewpoint that helps the user quickly grasp the translation structure and the translation quality.

– Concordance information: alignment links also make it possible to visualize all the occurrences of a given word or phrase and its alignments. The user can thus rapidly check whether the same source is translated differently in different places.

– Translation model scores: these indicate the likelihood for the source to generate a given translation aligned with it. This is a potentially useful indicator and yet another viewpoint on the translation quality for both machine and human translation revision, but the model needs to be adapted to the type of the translated text.

– Global translation quality estimation score: the overall phrase-level translation quality estimation score is computed by including all of the above types of monolingual and bilingual cues as features in the statistical classification model. Again, this score is potentially useful for both human and automatic translation scenarios, but the model needs to be adapted for each type of task. A first prototype of a full-featured translation quality estimation system was developed and tested as part of the quality estimation shared task organized for the 2013 Workshop on Machine Translation, and very encouraging results have been obtained on these data [5].

Some indicators that require a substantial amount of computing resources and preparatory work may need to be calculated beforehand, whereas others can be calculated on the fly. Some useful features may also be obtained using external tools and services, such as spell-checking. Some of these indicators will be available for all possible use cases; others should only be activated for particular tool usages (for example, only for human translation revision). The next sections detail the features actually used in our sentence-level and word-level systems.
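As an illustration, here is a minimal sketch of a few of the simplest on-the-fly indicators listed above (a segment-level length ratio, the source-equals-target check, and repeated-word detection); the function names and toy inputs are ours, not part of the TransRead tool:

```python
def length_ratio(source, target):
    """Segment-level statistic: target/source length ratio."""
    return len(target.split()) / max(len(source.split()), 1)

def is_untranslated(source, target):
    """Consistency control: source equals target, i.e. an untranslated segment."""
    return source.strip().lower() == target.strip().lower()

def repeated_words(target):
    """Consistency control: immediately repeated words ('the the')."""
    tokens = target.split()
    return [a for a, b in zip(tokens, tokens[1:]) if a.lower() == b.lower()]

print(length_ratio("la maison bleue", "the blue house"))  # 1.0
print(is_untranslated("hello world", "hello world"))      # True
print(repeated_words("the the cat sat"))                  # ['the']
```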

3.2 Sentence Level Features

A wide variety of features have been proposed for the task of quality estimation and related problems, such as predicting readability [9] or developing automated essay scoring systems [6]. For instance, the final report of the 2004 Workshop on Confidence Estimation for Machine Translation [4] listed 91 features or feature sets for sentence-level confidence estimation, and the recent works of [7, 12] have significantly increased these numbers. These features can be monolingual (i.e. consider the source or the target side in isolation) or bilingual (i.e. look at the alignment); they can look at isolated words, short-span segments, or conversely take the whole sentence into account; some only look at surface characteristics, such as sentence length, while others require more sophisticated analyses (for instance POS tagging, or even parse trees); some features are easily derived from isolated sentences, while others require some kind of corpus analysis (e.g. average word frequency, or the number of hapaxes), or the training of statistical models.

The baseline feature set distributed for the WMT 2012 shared task was our starting point and has been extended in various ways. Starting from the 17 baseline features, we have arrived at a set of 107 features that can be classified into five broad categories:

– Association Features: measures of the quality of the 'association' between the source and the target sentences, such as features derived from the IBM model 1 scores;

– Fluency Features: measures of the 'fluency' or the 'grammaticality' of the target sentence, such as features based on language model scores;

– Surface Features: surface features extracted mainly from the source sentence, such as the number of words, the number of out-of-vocabulary words, or the number of words that are not aligned;

– Syntactic Features: some simple syntactic features, like the number of nouns, modifiers, verbs, function words, WH-words, number words, etc., in a sentence (for mappings from tags to categories, see Appendix B);


Since we plan to consider different MT systems, we have not considered any of the features derived from the decoder.

Here is the full list of the features that we have considered in our studies:

17 Baseline Features (surface features)
– BL1. Number of tokens in source sentence
– BL2. Number of tokens in target sentence
– BL3. Average source token length
– BL4. n-gram language model score (source)
– BL5. n-gram language model score (target)
– BL6. Average occurrences of word in sentence (target)
– BL7. Average translations per word in sentence, prob(t|s) > 0.2 (source)
– BL8. Average translations per word in sentence, prob(t|s) > 0.01 (source)
– BL9. Percentage of unigrams in quartile 1 (source)
– BL10. Percentage of unigrams in quartile 4 (source)
– BL11. Percentage of bigrams in quartile 1 (source)
– BL12. Percentage of bigrams in quartile 4 (source)
– BL13. Percentage of trigrams in quartile 1 (source)
– BL14. Percentage of trigrams in quartile 4 (source)
– BL15. Number of punctuation marks in sentence (source)
– BL16. Number of punctuation marks in sentence (target)
– BL17. Percentage of unigrams seen in a corpus (source)
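For illustration, the sketch below recomputes a handful of these baseline features on whitespace-tokenized sentences; it is an illustrative reimplementation under our own assumptions, not the official WMT baseline extractor:

```python
import string

def baseline_surface_features(source, target):
    """Illustrative subset of the 17 baseline features, on pre-tokenized text."""
    src, tgt = source.split(), target.split()
    avg_src_len = sum(len(w) for w in src) / max(len(src), 1)    # BL3
    avg_occ = sum(tgt.count(w) for w in tgt) / max(len(tgt), 1)  # BL6
    n_punct = sum(1 for w in tgt if w in string.punctuation)     # BL16
    return {"BL1": len(src), "BL2": len(tgt), "BL3": avg_src_len,
            "BL6": avg_occ, "BL16": n_punct}

print(baseline_surface_features("la maison bleue .", "the blue house ."))
```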

6 Source-Target Association Features
– IS1. Normalized source-to-target IBM1 score / source length
– IS2. Normalized target-to-source IBM1 score / source length
– IS3. Normalized source-to-target IBM1 score / target length
– IS4. Normalized target-to-source IBM1 score / target length
– IS5. Source-to-target IBM1 score
– IS6. Target-to-source IBM1 score
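These IBM1 features can be sketched as follows: the log IBM model 1 score sums, over the target words, the log of the lexical translation probabilities p(t|s) averaged over the source words (plus a NULL token). The toy probability table and the smoothing floor are our assumptions:

```python
import math

def ibm1_log_score(src_tokens, tgt_tokens, lex_prob, floor=1e-9):
    """Log IBM model 1 association score: for each target word, average the
    lexical probabilities p(t|s) over the source words (plus a NULL token),
    then sum the logs.  `lex_prob` is a {(s, t): p(t|s)} table."""
    source = ["NULL"] + src_tokens
    score = 0.0
    for t in tgt_tokens:
        avg = sum(lex_prob.get((s, t), 0.0) for s in source) / len(source)
        score += math.log(max(avg, floor))
    return score

lex = {("la", "the"): 0.6, ("maison", "house"): 0.8, ("bleue", "blue"): 0.7}
raw = ibm1_log_score("la maison bleue".split(), "the blue house".split(), lex)
print(raw)      # IS5-style raw source-to-target score
print(raw / 3)  # IS3-style score normalized by the target length
```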

4 SOUL LM Features (fluency features)
– SL1. Normalized SOUL language model score (source)
– SL2. Normalized SOUL language model score (target)
– SL3. 10-gram SOUL language model score (source)
– SL4. 10-gram SOUL language model score (target)

15 Simple LM Features (fluency features)
– LM1. Normalized n-gram language model score (source)
– LM2. Normalized n-gram language model score (target)
– LM3. Ratio of n-gram language model scores (source/target)
– LM4. Ratio of n-gram language model scores (target/source)
– LM5. n-gram language model score (source)
– LM6. n-gram language model score (target)
– LM7. 3-gram language model score
– LM8. 3-gram language model perplexity
– LM9. 3-gram language model perplexity (not considering sentence boundaries)
– LM10. 4-gram language model score
– LM11. 4-gram language model perplexity
– LM12. 4-gram language model perplexity (not considering sentence boundaries)
– LM13. 5-gram language model score
– LM14. 5-gram language model perplexity
– LM15. 5-gram language model perplexity (not considering sentence boundaries)
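The relation between the language model score and perplexity features can be sketched as follows; `logprob` stands for any function returning log p(w | history), and the uniform model is a toy stand-in for a real n-gram model:

```python
import math

def sentence_logprob(tokens, logprob):
    """n-gram LM score: sum over the words of log p(w | history)."""
    return sum(logprob(tokens[:i], w) for i, w in enumerate(tokens))

def perplexity(tokens, logprob):
    """Perplexity: exponential of the negated average log probability."""
    return math.exp(-sentence_logprob(tokens, logprob) / max(len(tokens), 1))

# Toy stand-in for a real model: uniform over a 1000-word vocabulary.
uniform = lambda history, word: math.log(1.0 / 1000)
print(perplexity("the blue house".split(), uniform))  # ~1000
```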

24 poscount Features (syntactic features)
– PC1. Normalized count of function words (source)
– PC2. Normalized count of modifier words (source)
– PC3. Normalized count of nouns (source)
– PC4. Normalized count of number words (source)
– PC5. Normalized count of verbs (source)
– PC6. Normalized count of WH words (source)
– PC7. Normalized count of function words (target)
– PC8. Normalized count of modifier words (target)
– PC9. Normalized count of nouns (target)
– PC10. Normalized count of number words (target)
– PC11. Normalized count of verbs (target)
– PC12. Normalized count of WH words (target)
– PC13. Count of function words (source)
– PC14. Count of modifier words (source)
– PC15. Count of nouns (source)
– PC16. Count of number words (source)
– PC17. Count of verbs (source)
– PC18. Count of WH words (source)
– PC19. Count of function words (target)
– PC20. Count of modifier words (target)
– PC21. Count of nouns (target)
– PC22. Count of number words (target)
– PC23. Count of verbs (target)
– PC24. Count of WH words (target)

24 POS Tagged LM Features (on sequences of POS tag and word pairs)
– PL1. Normalized 5-gram POS tagged language model score (source)
– PL2. Normalized 5-gram POS tagged language model perplexity (source)
– PL3. Normalized 5-gram POS tagged language model perplexity (source, no sentence boundaries)
– PL4. Normalized 5-gram POS tagged language model score (target)
– PL5. Normalized 5-gram POS tagged language model perplexity (target)
– PL6. Normalized 5-gram POS tagged language model perplexity (target, no sentence boundaries)
– PL7. 3-gram POS tagged language model score (source)
– PL8. 3-gram POS tagged language model perplexity (source)
– PL9. 3-gram POS tagged language model perplexity (source, no sentence boundaries)
– PL10. 4-gram POS tagged language model score (source)
– PL11. 4-gram POS tagged language model perplexity (source)
– PL12. 4-gram POS tagged language model perplexity (source, no sentence boundaries)
– PL13. 5-gram POS tagged language model score (source)
– PL14. 5-gram POS tagged language model perplexity (source)
– PL15. 5-gram POS tagged language model perplexity (source, no sentence boundaries)
– PL16. 3-gram POS tagged language model score (target)
– PL17. 3-gram POS tagged language model perplexity (target)
– PL18. 3-gram POS tagged language model perplexity (target, no sentence boundaries)
– PL19. 4-gram POS tagged language model score (target)
– PL20. 4-gram POS tagged language model perplexity (target)
– PL21. 4-gram POS tagged language model perplexity (target, no sentence boundaries)
– PL22. 5-gram POS tagged language model score (target)
– PL23. 5-gram POS tagged language model perplexity (target)
– PL24. 5-gram POS tagged language model perplexity (target, no sentence boundaries)

2 Sentence Length Features
– LL1. Source sentence length
– LL2. Target sentence length

4 Post-Edited Translation Features
– PE1. Number of tokens in the post-edited target sentence
– PE2. LM probability of the post-edited target sentence
– PE3. Average occurrences of word in the post-edited target sentence
– PE4. Number of punctuation marks in the post-edited target sentence

3.3 Word Level Features

We consider a much smaller number of features in our system for predicting the quality of a word in a translation, since most of the features described in the previous subsection do not easily decompose over words.

The current version of our system uses 16 features to describe a given target word $t_j$ in a translation hypothesis $t = (t_j)_{j=1}^{m}$. To avoid sparsity issues, we decided not to include any lexicalized information, such as the identity of the word or of the previous word. As the translation hypotheses can be generated by different MT systems, no white-box features (such as word alignments or model scores) are considered. Our features can be organized in two broad categories:

Association Features These features measure the quality of the 'association' between the source sentence and a target word: they characterize the likelihood for a target word


to appear in a translation of the source sentence. Two kinds of association features can be distinguished.

The first one is derived from the lexicalized probabilities $p(t|s)$ that estimate the probability that a source word $s$ is translated by the target word $t_j$. These probabilities are aggregated using an arithmetic mean:

$$p(t_j \mid \mathbf{s}) = \frac{1}{n} \sum_{i=1}^{n} p(t_j \mid s_i) \qquad (1)$$

where $\mathbf{s} = (s_i)_{i=1}^{n}$ is the source sentence (with an extra NULL token). We assume that $p(t_j \mid s_i) = 0$ if the words $t_j$ and $s_i$ have never been aligned in the training set. We also consider the geometric mean of the lexicalized probabilities, their maximum value (i.e. $\max_{s \in \mathbf{s}} p(t_j \mid s)$), as well as a binary feature that fires when the target word $t_j$ is not in the lexicalized probability table.
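Here is a minimal sketch of these four aggregations (the arithmetic mean of Equation (1), the geometric mean, the maximum, and the out-of-table indicator); the table format and the floor used for the geometric mean are illustrative assumptions:

```python
import math

def association_features(tgt_word, src_tokens, lex_prob, floor=1e-12):
    """Aggregations of the lexicalized probabilities p(t_j | s_i): the
    arithmetic mean of Equation (1), the geometric mean, the maximum,
    and a binary flag firing when the word is absent from the table."""
    source = ["NULL"] + src_tokens  # source sentence with an extra NULL token
    probs = [lex_prob.get((s, tgt_word), 0.0) for s in source]
    arith = sum(probs) / len(probs)
    geo = math.exp(sum(math.log(max(p, floor)) for p in probs) / len(probs))
    in_table = any((s, tgt_word) in lex_prob for s in source)
    return {"mean": arith, "geo_mean": geo, "max": max(probs),
            "not_in_table": 0.0 if in_table else 1.0}

lex = {("maison", "house"): 0.8, ("NULL", "house"): 0.1}
print(association_features("house", "la maison bleue".split(), lex))
```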

The second kind of association features relies on pseudo-references, that is to say, translations of the source sentence produced by an independent MT system. Many works have considered pseudo-references to design new MT metrics [1, 2] or for confidence estimation [13, 14] but, to the best of our knowledge, this is the first time that they are used to predict confidence at the word level.

Pseudo-references are used to define three binary features, which fire if the target word occurs in the pseudo-reference, in a 2-gram shared between the pseudo-reference and the translation hypothesis, or in a common 3-gram, respectively. The lattices representing the search space considered to generate these pseudo-references also allow us to estimate the posterior probability of a target word, which quantifies the probability that it is part of the system output [8]. Posteriors aggregate two pieces of information for each word in the final hypothesis: first, all the paths in the lattice (i.e. the number of translation hypotheses in the search space) in which the word appears are considered; second, the decoder scores of these paths are accumulated in order to derive a confidence measure at the word level. In our experiments, we considered pseudo-references and lattices produced by the n-gram based system developed by our team for last year's WMT evaluation campaign [3], which achieved very good performance.
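The three binary pseudo-reference features can be sketched as follows; the helper names and toy sentences are ours:

```python
def ngrams(tokens, n):
    """All n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pseudo_ref_features(j, hyp, ref):
    """Binary features for the word at position j of the hypothesis `hyp`:
    occurrence in the pseudo-reference `ref`, and membership in a 2-gram /
    3-gram shared by both sentences."""
    feats = {"in_ref": float(hyp[j] in ref)}
    for n in (2, 3):
        shared = ngrams(hyp, n) & ngrams(ref, n)
        # n-grams of the hypothesis that cover position j
        covering = {tuple(hyp[i:i + n])
                    for i in range(max(0, j - n + 1),
                                   min(j + 1, len(hyp) - n + 1))}
        feats["shared_%dgram" % n] = float(bool(covering & shared))
    return feats

hyp, ref = "the blue house".split(), "the blue home".split()
print(pseudo_ref_features(1, hyp, ref))  # in_ref and shared 2-gram fire, 3-gram does not
```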

Fluency Features These features measure the 'fluency' of the target sentence and are based on different language models: a 'traditional' 4-gram language model estimated on WMT monolingual and bilingual data (the language model used by our system to generate the pseudo-references); a continuous-space 10-gram language model estimated with SOUL [10] (also used by our MT system); and a 4-gram language model based on part-of-speech sequences. The latter model was estimated on the Spanish side of the bilingual data provided in the translation shared task in 2013. These data were POS-tagged with the FreeLing [11] tool.

All these language models have been used to define two sets of features:
– the probability of the word of interest, $p(t_j \mid h)$, where $h = t_{j-1}, \ldots, t_{j-n+1}$ is the history made of the $n-1$ previous words or POS;


– the ratio between the probability of the sentence and the 'best' probability that can be achieved if the target word is replaced by any other word, i.e. $\max_{v \in V} p(t_1, \ldots, t_{j-1}, v, t_{j+1}, \ldots, t_m)$, where the max runs over all the vocabulary words.

There is also a feature that describes the back-off behavior of the conventional language model: its value is the size of the largest n-gram of the translation hypothesis that can be scored by the language model without relying on back-off probabilities.
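A sketch of this back-off feature, approximating the language model by the set of n-grams it stores explicitly (a simplification of a real back-off model):

```python
def backoff_feature(tokens, known_ngrams, max_order=4):
    """Size of the largest n-gram of the hypothesis that the LM can score
    without backing off; the LM is approximated here by the set of
    n-grams it stores explicitly."""
    best = 0
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) in known_ngrams:
                best = n
    return best

known = {("the",), ("blue",), ("house",), ("the", "blue")}
print(backoff_feature("the blue house".split(), known))  # 2
```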

Finally, there is a feature describing, for each word that appears more than once in the training set, the probability that this word is labeled Bad. This probability is simply estimated as the ratio between the number of times this word is labeled Bad and the number of occurrences of this word.
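This last feature can be sketched as follows; the two-occurrence threshold follows the description above, and the data is a toy example:

```python
from collections import Counter

def bad_word_probabilities(words, labels, min_count=2):
    """P(Bad | word), estimated as #(word labeled Bad) / #(word) on the
    training set, for words appearing at least `min_count` times."""
    total, bad = Counter(words), Counter()
    for word, label in zip(words, labels):
        if label == "Bad":
            bad[word] += 1
    return {w: bad[w] / c for w, c in total.items() if c >= min_count}

words = ["the", "cat", "the", "cat", "sat"]
labels = ["OK", "Bad", "OK", "OK", "Bad"]
print(bad_word_probabilities(words, labels))  # {'the': 0.0, 'cat': 0.5}
```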

4 Publications

Here is the complete list of the articles we have published describing the performance of the quality estimation systems we have developed. They are available online from transread.limsi.fr.
– G. Wisniewski, N. Pécheux, A. Allauzen, F. Yvon. LIMSI Submission for WMT'14 QE Task. WMT, 2014.
– G. Wisniewski, A. K. Singh, F. Yvon. Quality Estimation for Machine Translation: Some Lessons Learned. MT Journal, 2013.
– A. K. Singh, G. Wisniewski, F. Yvon. LIMSI Submission for the WMT'13 Quality Estimation Task: an Experiment with n-gram Posteriors. WMT'13, 2013.
– G. Wisniewski, A. K. Singh, N. Segal, F. Yvon. Design and Analysis of a Large Corpus of Post-Edited Translations: Quality Estimation, Failure Analysis and the Variability of Post-Edition. MT Summit, 2013.
– Y. Zhuang, G. Wisniewski, F. Yvon. Non-linear Models for Confidence Estimation. Seventh Workshop on Statistical Machine Translation (WMT 2012), 2012.

References

[1] Joshua Albrecht and Rebecca Hwa. Regression for sentence-level MT evaluation with pseudo references. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 296–303, Prague, Czech Republic, June 2007. ACL.

[2] Joshua Albrecht and Rebecca Hwa. The role of pseudo references in MT evaluation. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 187–190, Columbus, Ohio, June 2008. ACL.

[3] Alexandre Allauzen, Nicolas Pécheux, Quoc Khanh Do, Marco Dinarelli, Thomas Lavergne, Aurélien Max, Hai-Son Le, and François Yvon. LIMSI @ WMT13. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 62–69, Sofia, Bulgaria, August 2013. ACL.

[4] John Blatz, Erin Fitzgerald, George Foster, Simona Gandrabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing. Confidence estimation for machine translation. In Proceedings of Coling 2004, pages 315–321, Geneva, Switzerland, 2004. COLING.

[5] Ondřej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44, Sofia, Bulgaria, August 2013. Association for Computational Linguistics.

[6] Jill Burstein, Karen Kukich, Susanne Wolff, Chi Lu, Martin Chodorow, Lisa Braden-Harder, and Mary Dee Harris. Automated scoring using a hybrid feature identification technique. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1, pages 206–210, Montreal, Quebec, Canada, August 1998. Association for Computational Linguistics.

[7] Mariano Felice and Lucia Specia. Linguistic features for quality estimation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 96–103, Montreal, Canada, June 2012. Association for Computational Linguistics.

[8] Adrià de Gispert, Graeme Blackwood, Gonzalo Iglesias, and William Byrne. N-gram posterior probability confidence measures for statistical machine translation: an empirical study. Machine Translation, 27(2):85–114, 2013.

[9] Tapas Kanungo and David Orr. Predicting the readability of short web summaries. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, pages 202–211, New York, NY, USA, 2009. ACM.

[10] Hai-Son Le, Ilya Oparin, Alexandre Allauzen, Jean-Luc Gauvain, and François Yvon. Structured output layer neural network language model. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5524–5527. IEEE, 2011.

[11] Lluís Padró and Evgeny Stanilovsky. FreeLing 3.0: Towards wider multilinguality. In Proceedings of the Language Resources and Evaluation Conference (LREC 2012), Istanbul, Turkey, May 2012. ELRA.

[12] Raphael Rubino, Jennifer Foster, Joachim Wagner, Johann Roturier, Rasul Samad Zadeh Kaljahi, and Fred Hollowood. DCU-Symantec submission for the WMT 2012 quality estimation task. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 138–144, Montreal, Canada, June 2012. Association for Computational Linguistics.

[13] Radu Soricut and Abdessamad Echihabi. TrustRank: Inducing trust in automatic translations via ranking. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 612–621, Uppsala, Sweden, July 2010. ACL.

[14] Radu Soricut and Sushant Narsale. Combining quality prediction and system selection for improved automatic translation output. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 163–170, Montreal, Canada, June 2012. ACL.
