
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

Description:
Abstract of Aaron Han's Presentation: The main topic of this presentation is the evaluation of machine translation. With the rapid development of machine translation (MT), MT evaluation becomes more and more important for telling whether systems are making progress. Traditional human judgments are time-consuming and expensive. On the other hand, the existing automatic MT evaluation metrics have several weaknesses: they perform well on certain language pairs but poorly on others, which we call the language-bias problem; they consider no linguistic information (leading to low correlation with human judgments) or too many linguistic features (making them difficult to replicate), which we call the extremism problem; and they rely on incomprehensive factors (e.g. precision only). To address these problems, he has developed several automatic evaluation metrics: designing tunable parameters for the language-bias problem; using concise linguistic features for the linguistic extremism problem; and designing augmented factors. Experiments on the ACL-WMT corpora show that the proposed metrics yield higher correlation with human judgments. The proposed metrics have been published at top international conferences, e.g. COLING and MT SUMMIT. The evaluation work is closely related to similarity measurement, so it can be further developed for other areas such as information retrieval, question answering, and search. A brief introduction to some of his other research will also be given, covering Chinese named entity recognition, word segmentation, and multilingual treebanks, which have been published in the Springer LNCS and LNAI series. Suggestions and comments are much appreciated, and opportunities for further collaboration would be very welcome.
Transcript
Page 1: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

Aaron L.-F. Han

Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory

University of Macau, Macau S.A.R., China

2013.08 @ CUHK, Hong Kong

Email: hanlifengaaron AT gmail DOT com

Homepage: http://www.linkedin.com/in/aaronhan

Page 2: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

The importance of machine translation (MT) evaluation

Automatic MT evaluation metrics introduction 1. Lexical similarity

2. Linguistic features

3. Metrics combination

Designed metric: LEPOR Series 1. Motivation

2. LEPOR Metrics Description

3. Performances on international ACL-WMT corpora

4. Publications and Open source tools

Other research interests and publications

Page 3: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• Eagerness of people of different nationalities to communicate with each other

– Promotes translation technology

• Rapid development of machine translation

– Machine translation (MT) began as early as the 1950s (Weaver, 1955)

– Big progress since the 1990s due to the development of computers (storage capacity and computational power) and enlarged bilingual corpora (Marino et al., 2006)

Page 4: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• Some recent works of MT research:

– Och (2003) presents MERT (Minimum Error Rate Training) for log-linear SMT

– Su et al. (2009) use the Thematic Role Templates model to improve the translation

– Xiong et al. (2011) employ the maximum-entropy model, etc.

– Data-driven methods, including example-based MT (Carl and Way, 2003) and statistical MT (Koehn, 2010), have become the main approaches in the MT literature.

Page 5: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• How well do MT systems perform, and are they making progress?

• Difficulties of MT evaluation

– language variability results in no single correct translation

– natural languages are highly ambiguous, and different languages do not always express the same content in the same way (Arnold, 2003)

Page 6: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• Traditional manual evaluation criteria:

– intelligibility (measuring how understandable the sentence is)

– fidelity (measuring how much information the translated sentence retains as compared to the original) by the Automatic Language Processing Advisory Committee (ALPAC) around 1966 (Carroll, 1966)

– adequacy (similar to fidelity), fluency (whether the sentence is well-formed and fluent) and comprehension (improved intelligibility) by the Defense Advanced Research Projects Agency (DARPA) of the US (White et al., 1994)

Page 7: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• Problems of manual evaluation:

– Time-consuming

– Expensive

– Unrepeatable

– Low agreement (Callison-Burch et al., 2011)

Page 8: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

2.1 Lexical similarity

2.2 Linguistic features

2.3 Metrics combination

Page 9: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• Precision-based

Bleu (Papineni et al., 2002 ACL)

• Recall-based

ROUGE (Lin, 2004 WAS)

• Precision and Recall

Meteor (Banerjee and Lavie, 2005 ACL)

Page 10: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• Word-order based

NKT_NSR (Isozaki et al., 2010 EMNLP), Port (Chen et al., 2012 ACL), ATEC (Wong et al., 2008 AMTA)

• Word-alignment based

AER (Och and Ney, 2003 J.CL)

• Edit distance-based

WER (Su et al., 1992 COLING), PER (Tillmann et al., 1997 EUROSPEECH), TER (Snover et al., 2006 AMTA)

Page 11: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• Language model

LM-SVM (Gamon et al., 2005 EAMT)

• Shallow parsing

GLEU (Mutton et al., 2007 ACL), TerrorCat (Fishel et al., 2012 WMT)

• Semantic roles

Named entity, morphological, synonymy, paraphrasing, discourse representation, etc.

Page 12: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• MTeRater-Plus (Parton et al., 2011 WMT)

– Combine BLEU, TERp (Snover et al., 2009) and Meteor (Banerjee and Lavie, 2005; Lavie and Denkowski, 2009)

• MPF & WMPBleu (Popovic, 2011 WMT)

– Arithmetic mean of F score and BLEU score

• SIA (Liu and Gildea, 2006 ACL)

– Combine the advantages of n-gram-based metrics and loose-sequence-based metrics

Page 13: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• LEPOR: an automatic machine translation evaluation metric combining Length Penalty, Precision, n-gram Position difference Penalty, and Recall.

Page 14: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• Weaknesses in existing metrics:

– perform well on certain language pairs but poorly on others, which we call the language-bias problem;

– consider no linguistic information (leading to low correlation with human judgments) or too many linguistic features (making them difficult to replicate), which we call the extremism problem;

– rely on incomprehensive factors (e.g. BLEU focuses on precision only).

– What to do?

Page 15: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• To address some of the existing problems:

– Design tunable parameters to address the language-bias problem;

– Use concise or optimized linguistic features for the linguistic extremism problem;

– Design augmented factors.

Page 16: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• Sub-factors:

• LP = exp(1 − r/c) if c < r;  1 if c = r;  exp(1 − c/r) if c > r   (1)

• r: length of the reference sentence

• c: length of the candidate (system-output) sentence
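
A minimal Python sketch of the length penalty in Eq. (1); the function name and the guard for empty candidates are my own additions, not part of the metric definition:

```python
import math

def length_penalty(c_len, r_len):
    """Length penalty LP (Eq. 1): penalizes candidates shorter or longer than the reference."""
    if c_len == 0:   # guard added for this sketch; not part of the original formula
        return 0.0
    if c_len < r_len:
        return math.exp(1 - r_len / c_len)
    if c_len > r_len:
        return math.exp(1 - c_len / r_len)
    return 1.0
```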

Page 17: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• NPosPenal = exp(−NPD)   (2)

• NPD = (1 / Length_output) × Σ_{i=1}^{Length_output} |PD_i|   (3)

• PD_i = |MatchN_output − MatchN_ref|   (4)

• MatchN_output: position of the matched token in the output sentence

• MatchN_ref: position of the matched token in the reference sentence
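
A small sketch of Eqs. (2)-(4), assuming the caller has already aligned the tokens and collected one (MatchN_output, MatchN_ref) position pair per matched token; how unmatched tokens contribute to NPD is not shown here:

```python
import math

def npos_penalty(match_positions, length_output):
    """Eqs. (2)-(4): average absolute position difference of matched tokens,
    turned into a multiplicative penalty via exp(-NPD)."""
    if length_output == 0:
        return 1.0  # no output tokens: treat as no positional penalty (sketch choice)
    npd = sum(abs(out_pos - ref_pos) for out_pos, ref_pos in match_positions) / length_output
    return math.exp(-npd)
```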

Page 18: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

Fig. 1. N-gram word alignment algorithm

Page 19: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

Fig. 2. Example of n-gram word alignment

Page 20: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

Fig. 3. Example of NPD calculation

Page 21: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• N-gram precision and recall:

• P_n = #matched n-grams / #n-gram chunks in the system output   (5)

• R_n = #matched n-grams / #n-gram chunks in the reference   (6)

• HPR = Harmonic(αR_n, βP_n) = (α + β) / (α/R_n + β/P_n)   (7)
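
A sketch of Eqs. (5)-(7); alpha and beta are the tunable weights, and the default values below are purely illustrative:

```python
def harmonic_pr(matched, out_ngrams, ref_ngrams, alpha=1.0, beta=1.0):
    """Eqs. (5)-(7): n-gram precision, recall, and their weighted harmonic mean."""
    precision = matched / out_ngrams if out_ngrams else 0.0
    recall = matched / ref_ngrams if ref_ngrams else 0.0
    if precision == 0.0 or recall == 0.0:
        return 0.0  # score collapses to zero when either factor is zero
    return (alpha + beta) / (alpha / recall + beta / precision)
```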

Page 22: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

Fig. 4. Example of bigram matching

Page 23: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• LEPOR Metrics:

• LEPOR = LP × NPosPenal × Harmonic(αR, βP)   (8)

• hLEPOR = Harmonic(w_LP·LP, w_NPosPenal·NPosPenal, w_HPR·HPR) = (Σ_{i=1}^{n} w_i) / (Σ_{i=1}^{n} w_i / Factor_i) = (w_LP + w_NPosPenal + w_HPR) / (w_LP/LP + w_NPosPenal/NPosPenal + w_HPR/HPR)   (9)

• nLEPOR = LP × NPosPenal × exp(Σ_{n=1}^{N} w_n log HPR_n)   (10)
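
To make the three combination schemes concrete, here is a hedged sketch of Eqs. (8)-(10), assuming the sub-factor scores have already been computed; all weights are tunable and the defaults are illustrative only:

```python
import math

def lepor(lp, npos_penal, hpr):
    """Eq. (8): simple product of the sub-factors."""
    return lp * npos_penal * hpr

def hlepor(lp, npos_penal, hpr, w_lp=1.0, w_np=1.0, w_hpr=1.0):
    """Eq. (9): weighted harmonic mean of the sub-factors."""
    return (w_lp + w_np + w_hpr) / (w_lp / lp + w_np / npos_penal + w_hpr / hpr)

def nlepor(lp, npos_penal, hpr_by_order, weights):
    """Eq. (10): geometric combination of HPR over n-gram orders 1..N."""
    return lp * npos_penal * math.exp(
        sum(w * math.log(h) for w, h in zip(weights, hpr_by_order)))
```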

Page 24: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• Example, employment of linguistic features:

Fig. 5. Example of n-gram POS alignment

Fig. 6. Example of NPD calculation

Page 25: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• Combination with linguistic features:

• hLEPOR_final = (1 / (w_hw + w_hp)) × (w_hw · hLEPOR_word + w_hp · hLEPOR_POS)   (11)

• hLEPOR_POS and hLEPOR_word apply the same algorithm to the POS sequence and the word sequence, respectively.
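
A one-function sketch of Eq. (11); the group weights w_hw and w_hp are tunable, and the equal defaults here are only an illustration:

```python
def hlepor_final(hlepor_word, hlepor_pos, w_hw=1.0, w_hp=1.0):
    """Eq. (11): weighted average of word-level and POS-level hLEPOR scores."""
    return (w_hw * hlepor_word + w_hp * hlepor_pos) / (w_hw + w_hp)
```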

Page 26: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• When there are multiple references:

• Select the alignment that results in the minimum NPD score.

Fig. 7. N-gram alignment with multiple references

Page 27: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• How reliable is the automatic metric?

• Evaluation criteria for evaluation metrics:

– Human judgments are currently the gold standard to approach.

• Correlation with human judgments:

– System-level correlation

– Segment-level correlation

Page 28: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• System-level correlation:

• Spearman rank correlation coefficient:

– ρ_XY = 1 − (6 Σ_{i=1}^{n} d_i²) / (n(n² − 1))   (12)

– X = {x_1, …, x_n}, Y = {y_1, …, y_n}

• Pearson correlation coefficient:

– ρ_XY = Σ_{i=1}^{n} (x_i − μ_x)(y_i − μ_y) / ( √(Σ_{i=1}^{n} (x_i − μ_x)²) · √(Σ_{i=1}^{n} (y_i − μ_y)²) )   (13)
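
The two system-level correlation coefficients of Eqs. (12)-(13) as a minimal Python sketch (no tie handling for the Spearman ranks):

```python
import math

def spearman_rho(rank_x, rank_y):
    """Eq. (12): Spearman rank correlation from two equally long rank lists."""
    n = len(rank_x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

def pearson_r(x, y):
    """Eq. (13): Pearson correlation between metric scores x and human scores y."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mu_x) * (yi - mu_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mu_x) ** 2 for xi in x)
    var_y = sum((yi - mu_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)
```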

Page 29: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• Segment-level Kendall’s tau correlation:

• τ = (num concordant pairs − num discordant pairs) / total pairs   (14)

• The segment unit can be a single sentence or a fragment containing several sentences.
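
A sketch of Eq. (14) that compares a metric's segment scores against human scores pair by pair; ties are simply skipped here, which is one of several possible conventions:

```python
def kendall_tau(metric_scores, human_scores):
    """Eq. (14): (concordant - discordant) / total over all segment pairs."""
    concordant = discordant = 0
    for i in range(len(metric_scores)):
        for j in range(i + 1, len(metric_scores)):
            m_diff = metric_scores[i] - metric_scores[j]
            h_diff = human_scores[i] - human_scores[j]
            if m_diff * h_diff > 0:
                concordant += 1
            elif m_diff * h_diff < 0:
                discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0
```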

Page 30: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• Performance on the ACL-WMT 2011 corpora

• Two translation directions:

– English-to-other (Spanish, German, French, Czech)

– Other-to-English

• System-level metrics:

– LEPOR_A = (1 / num_sent) × Σ_{i=1}^{num_sent} LEPOR_i   (15)

– LEPOR_B = LP × NPosPenal × Harmonic(αR, βP)   (16)
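
A short sketch of the two system-level variants: Eq. (15) averages sentence-level scores, while Eq. (16) reuses the Eq. (8) combination; how the factor values for LEPOR_B are pooled over the whole test set is left to the caller here (an assumption of the sketch):

```python
def lepor_system_a(sentence_scores):
    """Eq. (15): system score as the arithmetic mean of sentence-level LEPOR."""
    return sum(sentence_scores) / len(sentence_scores)

def lepor_system_b(lp, npos_penal, hpr):
    """Eq. (16): same product form as Eq. (8), fed with system-level factor values."""
    return lp * npos_penal * hpr
```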

Page 31: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

Table 1. System-level Spearman correlation with human judgment on the WMT11 corpora

Page 32: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• Performance on the ACL-WMT 2013 corpora

• Two translation directions:

– English-to-other (Spanish, German, French, Czech, and Russian)

– Other-to-English

Page 33: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• System-level & sentence-level

– LEPOR_v3.1: hLEPOR, nLEPOR_baseline

– hLEPOR = (1 / num_sent) × Σ_{i=1}^{num_sent} hLEPOR_i   (17)

– hLEPOR_final = (1 / (w_hw + w_hp)) × (w_hw · hLEPOR_word + w_hp · hLEPOR_POS)   (18)

– nLEPOR = LP × NPosPenal × exp(Σ_{n=1}^{N} w_n log HPR_n)   (19)

Page 34: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

Tables 2 & 3. System-level Pearson correlation with human judgment

Page 35: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

Tables 4 & 5. Segment-level Kendall's tau correlation with human judgment

Page 36: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors

– Aaron L.-F. Han, Derek F. Wong and Lidia S. Chao. Proceedings of COLING 2012: Posters, pages 441–450, Mumbai, India.

• Language-independent Model for Machine Translation Evaluation with Reinforced Factors

– Aaron L.-F. Han, Derek Wong, Lidia S. Chao, Liangye He, Yi Lu, Junwen Xing, Xiaodong Zeng. Proceedings of MT Summit 2013. Nice, France.

Page 37: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools
Page 38: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools
Page 39: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• Language independent MT evaluation-LEPOR: https://github.com/aaronlifenghan/aaron-project-lepor

• MT evaluation with linguistic features-hLEPOR: https://github.com/aaronlifenghan/aaron-project-hlepor

• English-French Phrase tagset mapping and application in unsupervised MT evaluation-HPPR: https://github.com/aaronlifenghan/aaron-project-hppr

• Unsupervised English-Spanish MT evaluation-EBLEU: https://github.com/aaronlifenghan/aaron-project-ebleu

• Projects Homepage: https://github.com/aaronlifenghan

Page 40: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• My research interests:

– Natural Language Processing

– Signal Processing

– Machine Learning

– Artificial Intelligence

– Pattern Recognition

• My past research works:

– Machine Translation Evaluation, Word Segmentation, Entity Recognition, Multilingual Treebanks

Page 41: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• Other publications:

• A Description of Tunable Machine Translation Evaluation Systems in WMT13 Metrics Task

– Aaron Li-Feng Han, Derek Wong, Lidia S. Chao, Yi Lu, Yervant Ho, Yiming Wang, Zhou Jiaji. Proceedings of the ACL 2013 Eighth Workshop on Statistical Machine Translation (ACL-WMT 2013), 8-9 August 2013, Sofia, Bulgaria.

– ACL-WMT13 METRICS TASK: Our metrics are language independent

English-vs-other (French, Spanish, Czech, German, Russian)

Can be used at both the system level and the segment level

The official results show our metrics have advantages compared to others.

Page 42: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• Quality Estimation for Machine Translation Using the Joint Method of Evaluation Criteria and Statistical Modeling

– Aaron Li-Feng Han, Yi Lu, Derek F. Wong, Lidia S. Chao, Yervant Ho, Anson Xing. Proceedings of the ACL 2013 Eighth Workshop on Statistical Machine Translation (ACL-WMT 2013), 8-9 August 2013, Sofia, Bulgaria.

– ACL-WMT13 QUALITY ESTIMATION TASK (no reference translation): Task 1.1: sentence level EN-ES quality estimation

Task 1.2: system selection, EN-ES, EN-DE, new

Task 2: word-level QE, EN-ES, binary classification, multi-class classification, new

We design novel EN-ES POS tagset mapping and metric EBLEU in task 1.1.

We explore Naïve Bayes and Support Vector Machines in task 1.2.

We achieve the highest F1 score in task 2 using Conditional Random Field.

Page 43: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

Designed POS tagset mapping of Spanish (TreeTagger) to the universal tagset (Petrov et al., 2012)

Page 44: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• EBLEU = MLP × exp(Σ_{n=1}^{N} w_n log H(αR_n, βP_n))

– MLP = exp(1 − s/h) if h < s;  exp(1 − h/s) if h ≥ s

– P_n = #common n-gram chunks / #n-gram chunks in the target sentence

– R_n = #common n-gram chunks / #n-gram chunks in the source sentence
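
A hedged sketch of EBLEU as reconstructed above (reference-free: precision and recall are computed between target and source n-gram chunks); the chunk extraction itself is not shown, and the weights are illustrative:

```python
import math

def modified_length_penalty(h_len, s_len):
    """MLP: compares hypothesis (target) length h with source length s."""
    return math.exp(1 - s_len / h_len) if h_len < s_len else math.exp(1 - h_len / s_len)

def ebleu(mlp, precisions, recalls, weights, alpha=1.0, beta=1.0):
    """EBLEU: MLP times the weighted geometric mean of harmonic(P_n, R_n) over n."""
    total = 0.0
    for w, p, r in zip(weights, precisions, recalls):
        h = (alpha + beta) / (alpha / r + beta / p)  # H(alpha*R_n, beta*P_n)
        total += w * math.log(h)
    return mlp * math.exp(total)
```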

Page 45: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• Bayes' rule:

– p(c_i | x_1, x_2, …, x_n) = p(x_1, x_2, …, x_n | c_i) · p(c_i) / p(x_1, x_2, …, x_n)

• SVM: find the point with the smallest margin to the hyperplane, then maximize that margin:

– argmax_{w,b} { min_n [ label · (wᵀx + b) ] · (1 / ‖w‖) }

• Conditional random fields:

– P(Y | X) ∝ exp( Σ_{e∈E,k} λ_k f_k(e, Y|_e, X) + Σ_{v∈V,k} μ_k g_k(v, Y|_v, X) )
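
Not the authors' actual WMT13 systems, but a minimal sketch of how off-the-shelf implementations of two of these classifiers could be applied to quality-estimation feature vectors; the feature design and data loading are assumptions:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

def train_and_predict(X_train, y_train, X_test):
    """Fit a Naive Bayes model and a linear max-margin classifier on QE features."""
    nb_pred = GaussianNB().fit(X_train, y_train).predict(X_test)
    svm_pred = LinearSVC().fit(X_train, y_train).predict(X_test)
    return nb_pred, svm_pred
```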

Page 46: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

Designed features for CRF & NB algorithms

Page 47: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

ACL-WMT13 word level quality estimation task results

Page 48: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools
Page 49: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• Phrase Tagset Mapping for French and English Treebanks and Its Application in Machine Translation Evaluation

– Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Yervant Ho, Shuo Li, Lynn Ling Zhu. In GSCL 2013. LNCS Vol. 8105, Volume Editors: Iryna Gurevych, Chris Biemann and Torsten Zesch.

– German Society for Computational Linguistics (oral presentation): To facilitate future research in unsupervised induction of syntactic structures

We design French-English phrase tagset mapping

We propose a universal phrase tagset

Phrase tags extracted from the French Treebank and the English Penn Treebank

Explore the employment of the proposed mapping in unsupervised MT evaluation

Page 50: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

Designed phrase tagset mapping for English and French

Page 51: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

Evaluation based on parsing information from syntactic treebanks

Page 52: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

Convert the word sequence into universal phrase tagset sequence

Page 53: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• HPPR = Har(w_Ps · N1PsDif, w_Pr · N2Pre, w_Rc · N3Rec)

– N1PsDif = (1/n) Σ_i N1PsDif_i

– N2Pre = exp(Σ_{n=1}^{N2} w_n log P_n)

– N3Rec = exp(Σ_{n=1}^{N3} w_n log R_n)

Page 54: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• A Study of Chinese Word Segmentation Based on the Characteristics of Chinese

– Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Yervant Ho, Lynn Ling Zhu, Shuo Li. Accepted. In GSCL 2013. LNCS Vol. 8105, Volume Editors: Iryna Gurevych, Chris Biemann and Torsten Zesch.

– German Society for Computational Linguistics (poster paper): No word boundary in the Chinese expression

Chinese word segmentation is a difficult problem

Word segmentation is crucial to the word alignment in machine translation

We discuss the characteristics of Chinese and design optimized features

We formulate some problems and issues in Chinese word segmentation

Page 55: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools
Page 56: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• Automatic Machine Translation Evaluation with Part-of-Speech Information

– Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Yervant Ho. In TSD 2013. Plzen, Czech Republic. LNAI Vol. 8082, pp. 121-128. Volume Editors: I. Habernal and V. Matousek. Springer-Verlag Berlin Heidelberg.

– Text, Speech and Dialogue 2013 (oral presentation): We explore the unsupervised machine translation evaluation method

We design hLEPOR algorithm for the first time

We explore the POS usage in unsupervised MT evaluation

Experiments are performed on English vs French, German

Page 57: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools
Page 58: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• Chinese Named Entity Recognition with Conditional Random Fields in the Light of Chinese Characteristics

– Aaron Li-Feng Han, Derek Fai Wong and Lidia Sam Chao. In Proceedings of LP&IIS. M.A. Klopotek et al. (Eds.): IIS 2013, LNCS Vol. 7912, pp. 57–68, Warsaw, Poland. Springer-Verlag Berlin Heidelberg.

– Intelligent Information System 2013 (oral presentation): Named entity recognition is important in IR, MT, text analysis, etc.

Chinese named entity recognition is more difficult due to the lack of word boundaries

We compare the performance of different algorithms: NB, CRF, SVM, ME

We analyze the characteristics of personal, location, and organization names respectively

We show the performance of different features and select the optimal ones.

Page 59: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools
Page 60: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• Ongoing and further works:

– The combination of translation and evaluation, tuning the translation model using evaluation metrics

– Evaluation models from the perspective of semantics

– The exploration of unsupervised evaluation models, extracting features from source and target languages

Page 61: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• The evaluation work is closely related to similarity measurement. So far I have applied it only to MT evaluation, but it can be further developed in other areas:

– information retrieval

– question answering

– searching

– text analysis

– etc.

Page 62: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

Q & A. Thanks for your attention!

Aaron L.-F. Han, 2013.08

Page 63: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• 1. Weaver, Warren: Translation. In William Locke and A. Donald Booth, editors, Machine Translation of Languages: Fourteen Essays. John Wiley and Sons, New York, pages 15-23 (1955)

• 2. Marino, B. Jose, Rafael E. Banchs, Josep M. Crego, Adria de Gispert, Patrik Lambert, Jose A. Fonollosa, Marta R. Costa-jussa: N-gram based machine translation. Computational Linguistics, Vol. 32, No. 4, pp. 527-549, MIT Press (2006)

• 3. Och, F. J.: Minimum Error Rate Training for Statistical Machine Translation. In Proceedings of ACL 2003, pp. 160-167 (2003)

• 4. Su, Hung-Yu and Chung-Hsien Wu: Improving Structural Statistical Machine Translation for Sign Language With Small Corpus Using Thematic Role Templates as Translation Memory. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 17, No. 7, September (2009)

• 5. Xiong, D., M. Zhang, H. Li: A Maximum-Entropy Segmentation Model for Statistical Machine Translation. IEEE Transactions on Audio, Speech, and Language Processing, Volume 19, Issue 8, pp. 2494-2505 (2011)

• 6. Carl, M. and A. Way (eds): Recent Advances in Example-Based Machine Translation. Kluwer Academic Publishers, Dordrecht, The Netherlands (2003)

Page 64: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• 7. Koehn, P.: Statistical Machine Translation. (University of Edinburgh), Cambridge University Press (2010)

• 8. Arnold, D.: Why translation is difficult for computers. In Computers and Translation: A translator's guide. Benjamins Translation Library (2003)

• 9. Carroll, J. B.: An experiment in evaluating the quality of translation. In Pierce, J. (Chair), Languages and machines: computers in translation and linguistics. A report by the Automatic Language Processing Advisory Committee (ALPAC), Publication 1416, Division of Behavioral Sciences, National Academy of Sciences, National Research Council, pages 67-75 (1966)

• 10. White, J. S., O'Connell, T. A., and O'Mara, F. E.: The ARPA MT evaluation methodologies: Evolution, lessons, and future approaches. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA 1994), pp. 193-205 (1994)

• 11. Su, Keh-Yih, Wu Ming-Wen and Chang Jing-Shin: A New Quantitative Quality Measure for Machine Translation Systems. In Proceedings of the 14th International Conference on Computational Linguistics, pages 433-439, Nantes, France, July (1992)

Page 65: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• 12. Tillmann, C., Stephan Vogel, Hermann Ney, Arkaitz Zubiaga, and Hassan Sawaf: Accelerated DP Based Search For Statistical Translation. In Proceedings of the 5th European Conference on Speech Communication and Technology (EUROSPEECH-97) (1997)

• 13. Papineni, K., Roukos, S., Ward, T. and Zhu, W. J.: BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL 2002, pages 311-318, Philadelphia, PA, USA (2002)

• 14. Doddington, G.: Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research (HLT 2002), pages 138-145, San Diego, California, USA (2002)

• 15. Turian, J. P., Shen, L. and Melamed, I. D.: Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pages 386-393, New Orleans, LA, USA (2003)

• 16. Banerjee, S. and Lavie, A.: Meteor: an automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of ACL-WMT, pages 65-72, Prague, Czech Republic (2005)

Page 66: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• 17. Denkowski, M. and Lavie, A.: Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of ACL-WMT, pages 85-91, Edinburgh, Scotland, UK (2011)

• 18. Snover, M., Dorr, B., Schwartz, R., Micciulla, L. and Makhoul, J.: A study of translation edit rate with targeted human annotation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA), pages 223-231, Boston, USA (2006)

• 19. Chen, B. and Kuhn, R.: AMBER: A modified BLEU, enhanced ranking metric. In Proceedings of ACL-WMT, pages 71-77, Edinburgh, Scotland, UK (2011)

• 20. Bicici, E. and Yuret, D.: RegMT system for machine translation, system combination, and evaluation. In Proceedings of ACL-WMT, pages 323-329, Edinburgh, Scotland, UK (2011)

• 21. Shawe-Taylor, J. and N. Cristianini: Kernel Methods for Pattern Analysis. Cambridge University Press (2004)

• 22. Wong, B. T-M and Kit, C.: Word choice and word position for automatic MT evaluation. In Workshop: MetricsMATR of the Association for Machine Translation in the Americas (AMTA), short paper, 3 pages, Waikiki, Hawai'i, USA (2008)

Page 67: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• 23. Isozaki, H., Hirao, T., Duh, K., Sudoh, K., and Tsukada, H.: Automatic evaluation of translation quality for distant language pairs. In Proceedings of the 2010 Conference on EMNLP, pages 944-952, Cambridge, MA (2010)

• 24. Talbot, D., Kazawa, H., Ichikawa, H., Katz-Brown, J., Seno, M. and Och, F.: A Lightweight Evaluation Framework for Machine Translation Reordering. In Proceedings of the Sixth ACL-WMT, pages 12-21, Edinburgh, Scotland, UK (2011)

• 25. Song, X. and Cohn, T.: Regression and ranking based optimisation for sentence level MT evaluation. In Proceedings of ACL-WMT, pages 123-129, Edinburgh, Scotland, UK (2011)

• 26. Popovic, M.: Morphemes and POS tags for n-gram based evaluation metrics. In Proceedings of ACL-WMT, pages 104-107, Edinburgh, Scotland, UK (2011)

• 27. Popovic, M., Vilar, D., Avramidis, E. and Burchardt, A.: Evaluation without references: IBM1 scores as evaluation metrics. In Proceedings of ACL-WMT, pages 99-103, Edinburgh, Scotland, UK (2011)

• 28. Petrov, S., Leon Barrett, Romain Thibaux, and Dan Klein: Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st ACL, pages 433-440, Sydney, July (2006)

Page 68: CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

• 29. Callison-Burch, C., Koehn, P., Monz, C. and Zaidan, O. F.: Findings of the 2011 Workshop on Statistical Machine Translation. In Proceedings of ACL-WMT, pages 22-64, Edinburgh, Scotland, UK (2011)

• 30. Callison-Burch, C., Koehn, P., Monz, C., Peterson, K., Przybocki, M. and Zaidan, O. F.: Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation. In Proceedings of ACL-WMT, pages 17-53, PA, USA (2010)

• 31. Callison-Burch, C., Koehn, P., Monz, C. and Schroeder, J.: Findings of the 2009 Workshop on Statistical Machine Translation. In Proceedings of ACL-WMT, pages 1-28, Athens, Greece (2009)

• 32. Callison-Burch, C., Koehn, P., Monz, C. and Schroeder, J.: Further meta-evaluation of machine translation. In Proceedings of ACL-WMT, pages 70-106, Columbus, Ohio, USA (2008)

• 33. Avramidis, E., Popovic, M., Vilar, D., Burchardt, A.: Evaluate with Confidence Estimation: Machine ranking of translation outputs using grammatical features. In Proceedings of the Sixth Workshop on Statistical Machine Translation (ACL-WMT), pages 65-70, Edinburgh, Scotland, UK (2011)

