Statistical Machine Translation of Texts with Misspelled Words
Nicola Bertoldi, Mauro Cettolo, Marcello Federico
FBK - Fondazione Bruno Kessler, Trento, Italy
ACL 2010
Outline
Introduction System Data Evaluation Conclusions
Introduction
non-word error
real-word error
Six different typing error operations
◆ Substitution
Target: [We] had just come in from Australia.
Error: [Ww] had just come in from Australia.
◆ Insertion
Target: is a good place to stay, if you are looking for a hotel [around] LAX airport.
Error: is a good place to stay, if you are looking for a hotel [arround] LAX airport.
◆ Deletion
Target: The room was [excellent] but the hallway was [filthy].
Error: The room was [exellent] but the hallway was [filty].
◆ Transposition
Target: The staff was [friendly].
Error: The staff was [freindly].
◆ Run-On
Target: I saw a teacher[.] who cares?
Error: I saw a teacher[ ] who cares?
◆ Split
Target: [We] had just come in from Australia.
Error: [W e] had just come in from Australia.
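The six operations above can be sketched as small string functions. This is only an illustration of the examples on the slides, not the authors' code: the function names and index arguments are hypothetical, and the run-on case is modelled as dropping a boundary character, which is one reading of the bracketed example.

```python
# Hypothetical helpers reproducing the slide examples; names are assumptions.

def substitute(word, i, ch):
    """Replace the character at position i with ch."""
    return word[:i] + ch + word[i + 1:]

def insert(word, i, ch):
    """Insert ch before position i."""
    return word[:i] + ch + word[i:]

def delete(word, i):
    """Remove the character at position i."""
    return word[:i] + word[i + 1:]

def transpose(word, i):
    """Swap the characters at positions i and i + 1."""
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def run_on(text, i):
    """Drop the boundary character at position i (assumed reading of the slide)."""
    return text[:i] + text[i + 1:]

def split(word, i):
    """Insert a space before position i, breaking one word into two."""
    return word[:i] + " " + word[i:]

print(substitute("We", 1, "w"))     # Ww
print(insert("around", 2, "r"))     # arround
print(delete("excellent", 2))       # exellent
print(transpose("friendly", 2))     # freindly
print(split("We", 1))               # W e
```

Each call reproduces one Target/Error pair from the list, so the functions can also be used to check which operation explains a given misspelling.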
System
Step 1.
Step 2.
Step 3.
Step 4.
Step 5. Translation of the CN (e) is performed with the Moses decoder (Koehn et al., 2007).
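The slides do not preserve how the CN is built, so the following is only a rough sketch of what a word-level confusion network might look like as a data structure, assuming "CN" stands for confusion network. The function names, the `alternatives` mapping, and the scores are all hypothetical; this does not reproduce the authors' pipeline or the Moses input format.

```python
# Illustrative only: a confusion network as a list of columns, where each
# column holds (word, probability) pairs for one input position.

def build_confusion_network(tokens, alternatives):
    """Build one column per token from hypothetical spelling alternatives.

    `alternatives` maps a token to {candidate: score}; tokens without an
    entry are kept as a single candidate with probability 1.
    """
    network = []
    for tok in tokens:
        scored = alternatives.get(tok, {tok: 1.0})
        total = sum(scored.values())
        # Normalise scores so each column sums to 1.
        network.append([(w, s / total) for w, s in scored.items()])
    return network

def best_path(network):
    """Pick the most probable word in every column."""
    return [max(column, key=lambda pair: pair[1])[0] for column in network]

# Made-up scores for a single misspelled token.
alts = {"teh": {"the": 3.0, "teh": 1.0}}
cn = build_confusion_network(["teh", "room"], alts)
print(best_path(cn))
```

In a real system the decoder, rather than a per-column argmax, would choose among the alternatives while translating; the argmax here only shows how the structure is read.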
Data
Evaluation Data
Non-word Noise
Words in the text are randomly replaced according to a list of 4,100 frequent non-word errors provided in Wikipedia.
Real-word Noise
Real-word errors are introduced automatically from another Wikipedia list of frequently misused words.
Random-word Noise
The original text is corrupted by randomly replacing, inserting, and deleting characters.
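The two kinds of corruption described above (list-based word replacement and random character edits) might be simulated along the following lines. This is a minimal sketch under stated assumptions: the function names, the `rate` parameter, the lowercase replacement alphabet, and the tiny error list are all hypothetical and stand in for the Wikipedia lists and settings, which the slides do not give.

```python
import random
import string

def add_list_noise(text, error_map, rate, seed=0):
    """Replace words that have an entry in error_map with probability `rate`.

    `error_map` maps a correct word to a list of misspellings (a stand-in
    for the Wikipedia error lists).
    """
    rng = random.Random(seed)
    out = []
    for word in text.split():
        if word in error_map and rng.random() < rate:
            out.append(rng.choice(error_map[word]))
        else:
            out.append(word)
    return " ".join(out)

def add_random_noise(text, rate, seed=0):
    """Corrupt each character with probability `rate` by replacing it,
    inserting a random letter after it, or deleting it."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if rng.random() < rate:
            op = rng.choice(("replace", "insert", "delete"))
            if op == "replace":
                out.append(rng.choice(string.ascii_lowercase))
            elif op == "insert":
                out.append(ch)
                out.append(rng.choice(string.ascii_lowercase))
            # "delete": emit nothing for this character
        else:
            out.append(ch)
    return "".join(out)

sample = "the staff was friendly"
print(add_list_noise(sample, {"friendly": ["freindly"]}, rate=1.0))
print(add_random_noise(sample, rate=0.1))
```

Fixing the seed makes a corrupted test set reproducible, which matters when the same noise level is compared across systems.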
Evaluation
Conclusions
◆ This paper addressed the issue of automatically translating written texts that are corrupted by misspelling errors.
◆ The enhanced MT system has been tested on texts corrupted at increasing noise levels from three different sources: random, non-word, and real-word errors.
◆ The impact of misspelling errors on MT performance depends on the noise rate, but not on the noise source.