Methods for Machine Translation
Prasanth Kolachina
Statistical methods for NLP
March 13th 2014
P. Kolachina (Sprakbanken) MT 13th Mar, 2014 1 / 34
Outline
1 Introduction to Machine Translation
2 Statistical Machine Translation
3 IBM Word Based Models
4 Current approaches in SMT
What is M. Translation?
Machine Translation
Translation is the task of transforming text in one language into another language
interpretation of meaning
preservation of meaning and structure of the original text
Importance of context in interpretation and translation
There is nothing outside the text. – Jacques Derrida, “Of Grammatology” (1967)
This transformation process: can it be automated?
Machine Translation
If not completely, to what extent?
Origins of Mechanical Translation
First ideas from information theory
“Translation memorandum” Weaver [1955]
Essentially, decoding the meaning in one language and re-encoding the same in the target language
Early attempts to translate using a bilingual dictionary
Information encoding in text is more complex than simple word meanings
Encoded at different levels of “linguistic analysis”
Morphology, Syntax, Semantics, Discourse and Pragmatics
ALPAC report led to the creation of Computational Linguistics
Advanced research in both Linguistics and Computer Science
E.g. Quick sort
Was originally called Mechanical Translation!
Formalizing approaches to MT
Post ALPAC report of 1966
Formal grammars and algorithms for NLU, NLG and MT
Vauquois [1968]
Corpus-based MT
Hand-crafted translation grammars are difficult to develop
Require many man-hours from linguistic experts
Limitations on coverage of grammars
Can these grammars be learnt from data?
Parallel corpora
Naturally “occurring” for different languages
Translation fragments are aligned at some level
Typically, sentence aligned
Translation memories!
Example-based, Statistical MT
Corpus-based MT
[Figure: example from Petrov [2012]]
Statistical MT
Noisy-Channel model
Warren Weaver’s “memorandum”
When I look at an article in Russian, I say: “This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode”.
– Weaver [1955]
Original message in English (S) can be “reconstructed” using a source model and a channel model for a signal in Russian (R)
Statistical MT
Translation is a search problem
$$\hat{E} = \arg\max_{E} P(E \mid F)$$
By application of Bayes' rule and mathematical simplification
$$\hat{E} = \arg\max_{E} P_{TM}(F \mid E) \, P_{LM}(E)$$
Two primary components in the model
Translation model $P_{TM}$ ≈ channel model
Language model $P_{LM}$ ≈ source model
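The “mathematical simplification” can be spelled out in one extra line. A short derivation, using only the symbols already on the slide: the denominator P(F) does not depend on E, so it can be dropped from the maximization.

```latex
\begin{align*}
\hat{E} &= \arg\max_{E} P(E \mid F) \\
        &= \arg\max_{E} \frac{P(F \mid E)\, P(E)}{P(F)} \\
        &= \arg\max_{E} P_{TM}(F \mid E)\, P_{LM}(E)
\end{align*}
```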
Formalizing approaches to SMT
[Figure: example from Petrov [2012]]
Word level SMT
Model to learn word translations from corpora
Proposed by Brown et al. [1993] at IBM
Hence the name, IBM word models
Notion of word alignment
What do these alignments tell us?
First approximations to extract “richer” translation models
“richer” i.e. linguistic fragments higher in the pyramid
Useful not only in MT, but many other applications
cross-lingual *
Word Alignment
[Figure: example from Petrov [2012]]
IBM Models
Different models to capture regular variations across languages
morphology
word order
Models 1-4 for $P_{TM}$
How to
estimate parameters, say p(new|nouvelles) or p(collecting|perception)
decode new sentences using these parameters
We will look at Models 1 and 2 in today’s lecture!
Nuts and Bolts of the IBM Models
IBM Models
Modeling $P_{TM}$
Different parameters are defined to explain the translation process
lexical translation t(f|e): Model 1
distortion q: Model 2
fertility n: Model 3
relative distortion q′: Model 4
t(f|e) and q for the current discussion
IBM Model 1
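Given sentence pairs with word alignments
can we compute t(f|e)?
maximum likelihood based on counts
$$t(\text{haus} \mid \text{house}) = \frac{C(\text{haus}, \text{house})}{C(\text{house})}$$
or
$$t(\text{das} \mid \text{the}) = \frac{C(\text{das}, \text{the})}{C(\text{the})}$$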
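The count-based estimate can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code: the function name `estimate_t`, the toy corpus, and the choice of 0-based `(i, j)` links (word `f_words[i]` aligned to `e_words[j]`) are all assumptions, and C(e) is taken to count aligned occurrences of e.

```python
from collections import Counter

def estimate_t(aligned_corpus):
    """Relative-frequency estimate of t(f | e) from word-aligned sentence pairs."""
    pair_counts = Counter()  # C(f, e)
    e_counts = Counter()     # C(e), counted over aligned occurrences
    for f_words, e_words, links in aligned_corpus:
        for i, j in links:
            pair_counts[(f_words[i], e_words[j])] += 1
            e_counts[e_words[j]] += 1
    return {(f, e): c / e_counts[e] for (f, e), c in pair_counts.items()}

# Hypothetical toy corpus: (source words, translation words, alignment links).
corpus = [
    (["das", "haus"], ["the", "house"], [(0, 0), (1, 1)]),
    (["das", "buch"], ["the", "book"], [(0, 0), (1, 1)]),
    (["die", "katze"], ["the", "cat"], [(0, 0), (1, 1)]),
]
t = estimate_t(corpus)
print(t[("haus", "house")], t[("das", "the")])
```

In this toy corpus "the" is aligned three times, twice to "das", so t(das|the) is 2/3 while t(haus|house) is 1.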
IBM Model 2
For the same example,
can we compute q(j|i, l, m)?
maximum likelihood based on counts
$$q(j \mid i, l, m) = \frac{C(j \mid i, l, m)}{C(i, l, m)}$$
i: word position in source sentence
j: word position in translation
m: length of source sentence
l: length of translation
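The same counting idea applies to the distortion parameter. A minimal sketch, assuming 1-based positions in the links (to match the slide's i and j) and a hypothetical toy corpus; `estimate_q` is an illustrative name, not part of the lecture material.

```python
from collections import Counter

def estimate_q(aligned_corpus):
    """Relative-frequency estimate of q(j | i, l, m) from word-aligned pairs."""
    align_counts = Counter()  # C(j | i, l, m)
    pos_counts = Counter()    # C(i, l, m)
    for f_words, e_words, links in aligned_corpus:
        m, l = len(f_words), len(e_words)
        for i, j in links:
            align_counts[(j, i, l, m)] += 1
            pos_counts[(i, l, m)] += 1
    # key = (j, i, l, m); key[1:] = (i, l, m) selects the matching denominator
    return {key: c / pos_counts[key[1:]] for key, c in align_counts.items()}

# Hypothetical toy corpus; links are (i, j) with 1-based positions.
corpus = [
    (["das", "haus"], ["the", "house"], [(1, 1), (2, 2)]),
    (["das", "buch"], ["the", "book"], [(1, 1), (2, 2)]),
    (["maison", "bleue"], ["blue", "house"], [(1, 2), (2, 1)]),
]
q = estimate_q(corpus)
print(q[(1, 1, 2, 2)], q[(2, 1, 2, 2)])
```

Two of the three sentence pairs align source position 1 to translation position 1, so q(1|1, 2, 2) comes out as 2/3 and q(2|1, 2, 2) as 1/3.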
Parameter estimation in IBM Models
So, what is missing in the previous examples?
Assuming that alignments are given is unrealistic
Alignments are hidden or latent variables
Unobserved
Recall the Expectation-Maximization algorithm (Dempster et al. [1977])
estimates a statistical model when hidden variables are present
the E-step computes expected counts over the hidden alignments under the current parameters, and the M-step re-estimates the parameters to maximize the likelihood of the translations
Estimation step
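Given the table of counts from the previous iteration
estimate distributions t and q defined previously
(in standard EM terminology, this re-estimation from counts is the M-step)
$$t(f_i \mid e_j) = \frac{c(e_j, f_i)}{c(e_j)}$$
$$q(j \mid i, l, m) = \frac{c(j \mid i, l, m)}{c(i, l, m)}$$
for all possible values of $(e_j, f_i)$ and $(j \mid i, l, m)$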
Maximization step
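Given distributions t and q
modify counts to reflect probability of translations
(in standard EM terminology, collecting these expected counts is the E-step)
how to estimate the probability of translation
$$\delta(i, j, l, m) = \frac{q(j \mid i, l, m)\, t(f_i \mid e_j)}{\sum_{j'=0}^{l} q(j' \mid i, l, m)\, t(f_i \mid e_{j'})}$$
how to modify counts
$$c(e_j, f_i) \leftarrow c(e_j, f_i) + \delta(i, j, l, m)$$
$$c(e_j) \leftarrow c(e_j) + \delta(i, j, l, m)$$
$$c(j \mid i, l, m) \leftarrow c(j \mid i, l, m) + \delta(i, j, l, m)$$
$$c(i, l, m) \leftarrow c(i, l, m) + \delta(i, j, l, m)$$
for all possible values of $(e_j, f_i)$ and $(j \mid i, l, m)$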
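To make δ concrete, here is a small sketch that computes it for a single sentence pair. The function name, the parameter values, and the reading of position j = 0 as an empty NULL word (suggested by the sum starting at j′ = 0) are assumptions made for illustration.

```python
def alignment_posteriors(f_words, e_words, t, q):
    """Return delta[(i, j)] for i = 1..m and j = 0..l, where j = 0 is NULL."""
    e_null = ["NULL"] + e_words
    l, m = len(e_words), len(f_words)
    delta = {}
    for i, f in enumerate(f_words, start=1):
        # Numerator q(j|i,l,m) * t(f_i|e_j) for every candidate position j.
        scores = [q[(j, i, l, m)] * t[(f, e)] for j, e in enumerate(e_null)]
        z = sum(scores)  # denominator: sum over j' = 0..l
        for j, s in enumerate(scores):
            delta[(i, j)] = s / z
    return delta

# Hypothetical parameter values for one sentence pair.
f_words, e_words = ["das", "haus"], ["the", "house"]
t = {("das", "NULL"): 0.1, ("das", "the"): 0.7, ("das", "house"): 0.2,
     ("haus", "NULL"): 0.1, ("haus", "the"): 0.1, ("haus", "house"): 0.8}
q = {(j, i, 2, 2): 1 / 3 for i in (1, 2) for j in (0, 1, 2)}  # uniform q
delta = alignment_posteriors(f_words, e_words, t, q)
print(delta[(1, 1)], delta[(2, 2)])
```

With a uniform q the posterior is just t renormalized over the candidate positions, so δ for das→the is about 0.7 and for haus→house about 0.8. Each δ is then added to the four count tables listed above.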
Practical issues
How to implement this EM for IBM models?
initialize parameter distributions t and q to random values
initialize all count tables c to 0 (and reset them at the start of every iteration)
Run the “maximization” (count update) step first, using the initial t and q values over the entire corpus
Estimate new parameter distributions using the new count tables
Iterate over these two steps until EM reaches convergence
EM will converge for Model 2 (Collins [2012])
The result can be a local optimum rather than the “real” (globally optimal) solution
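Put together, the loop might look like the sketch below. It is one possible reading of the recipe, not a reference implementation: it uses uniform rather than random initialization (to keep the run deterministic), a fixed number of iterations instead of a convergence test, and a NULL word at position 0. All names and the toy corpus are hypothetical.

```python
from collections import defaultdict

def train_model2(corpus, iterations=10):
    """EM sketch for IBM Model 2. corpus: list of (f_words, e_words) pairs."""
    pairs = [(f, ["NULL"] + e) for f, e in corpus]
    f_vocab = {w for f, _ in pairs for w in f}
    # Uniform initialization (an assumption; the slide suggests random values).
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    q = {}
    for f, e in pairs:
        l, m = len(e) - 1, len(f)
        for i in range(1, m + 1):
            for j in range(l + 1):
                q[(j, i, l, m)] = 1.0 / (l + 1)
    for _ in range(iterations):
        # Count tables start from zero in every iteration.
        c_ef, c_e = defaultdict(float), defaultdict(float)
        c_jilm, c_ilm = defaultdict(float), defaultdict(float)
        for f, e in pairs:
            l, m = len(e) - 1, len(f)
            for i, fw in enumerate(f, start=1):
                scores = [q[(j, i, l, m)] * t[(fw, ew)] for j, ew in enumerate(e)]
                z = sum(scores)
                for j, ew in enumerate(e):
                    d = scores[j] / z  # delta(i, j, l, m)
                    c_ef[(fw, ew)] += d
                    c_e[ew] += d
                    c_jilm[(j, i, l, m)] += d
                    c_ilm[(i, l, m)] += d
        # Re-estimate t and q from the updated count tables.
        t = defaultdict(float, {(fw, ew): c / c_e[ew] for (fw, ew), c in c_ef.items()})
        q = {key: c / c_ilm[key[1:]] for key, c in c_jilm.items()}
    return t, q

corpus = [(["das", "haus"], ["the", "house"]),
          (["das", "buch"], ["the", "book"]),
          (["ein", "buch"], ["a", "book"])]
t, q = train_model2(corpus, iterations=20)
print(round(t[("haus", "house")], 3), round(t[("das", "house")], 3))
```

On this toy corpus the estimates drift toward the intuitive pairings (haus with house, buch with book), even though no alignments were given. A real implementation would also monitor the log-likelihood to decide when to stop.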
Decoder
Given a translation model $P_{TM}$ and a language model for the target language $P_{LM}$
find the most “likely” translation for a source sentence
An intractable problem: no efficient exact solution
Maximize over all possible translations
Each translation can be generated by many underlying alignments
Sum over all such plausible alignments
The number of plausible permutations and alignments is exponential in sentence length
Inexact search instead of exact search
approximations make decoding tractable
Greedy decoder
Start by assigning each word its most probable translation: this is the initial hypothesis
Compute the probability of the hypothesis
scores from both $P_{TM}$ and $P_{LM}$
Make mutations to the hypothesis until the probability score no longer improves (Turitzin [2005])
What are plausible mutations?
Change translation options for each word
Add new words to the hypothesis or remove existing words
Move words around inside the hypothesis
swap non-overlapping segments
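The hill-climbing loop can be sketched as follows. This is a deliberately reduced version under stated assumptions: only two mutations are implemented (change one word's translation, swap two adjacent words), the language model is a toy bigram table, and every probability in `T`, `BIGRAM` and `FLOOR` is invented for illustration. It is not the decoder of Turitzin [2005].

```python
import math

# Hypothetical t(f | e) values, stored as T[f][e].
T = {"das": {"the": 0.7, "that": 0.3},
     "haus": {"home": 0.55, "house": 0.45},
     "ist": {"is": 0.9},
     "klein": {"small": 0.6, "little": 0.4}}
# Hypothetical bigram language model; unseen bigrams get FLOOR.
BIGRAM = {("<s>", "the"): 0.5, ("the", "house"): 0.4, ("the", "home"): 0.05,
          ("house", "is"): 0.5, ("home", "is"): 0.3, ("is", "small"): 0.4,
          ("is", "little"): 0.2, ("small", "</s>"): 0.5, ("little", "</s>"): 0.3}
FLOOR = 0.001

def score(source, hyp):
    """log P_TM + log P_LM for a hypothesis [(source_index, target_word), ...]."""
    tm = sum(math.log(T[source[i]][e]) for i, e in hyp)
    words = ["<s>"] + [e for _, e in hyp] + ["</s>"]
    lm = sum(math.log(BIGRAM.get(bg, FLOOR)) for bg in zip(words, words[1:]))
    return tm + lm

def neighbours(source, hyp):
    # Mutation 1: change the translation option of one word.
    for k, (i, e) in enumerate(hyp):
        for alt in T[source[i]]:
            if alt != e:
                yield hyp[:k] + [(i, alt)] + hyp[k + 1:]
    # Mutation 2: swap two adjacent words.
    for k in range(len(hyp) - 1):
        yield hyp[:k] + [hyp[k + 1], hyp[k]] + hyp[k + 2:]

def greedy_decode(source):
    # Initial hypothesis: the most probable translation of each word.
    hyp = [(i, max(T[f], key=T[f].get)) for i, f in enumerate(source)]
    best = score(source, hyp)
    while True:
        cand = max(neighbours(source, hyp),
                   key=lambda h: score(source, h), default=None)
        if cand is None or score(source, cand) <= best:
            return [e for _, e in hyp], best
        hyp, best = cand, score(source, cand)

translation, logprob = greedy_decode(["das", "haus", "ist", "klein"])
print(translation)
```

The initial hypothesis is "the home is small", because "home" has the higher t value. The language model then favours "the house", so one mutation step replaces it and the search stops at "the house is small".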
Decoding example
Beyond Word models in SMT
Shortcomings of IBM Models
Simplifying assumptions in model formulation (Brown et al. [1993])
Lack of context in predicting likely translation of a word
1. The ball went past the bat and hit the stumps in the last ball of the innings.
2. The bat flew out of the cave with wings as black as night itself.
3. They danced to the music all night at the ball.
Not very different from dictionary lookup to translate
Discarding linguistic information encoded in a sentence
Morphological variants
Syntactic structure like part-of-speech tags
Multi-word concepts
break the ice
liven up
Extending words to phrases
Phrasal translations rather than word translations (Koehn et al. [2003])
Simple way to incorporate local context into translation model
Phrase pairs are extracted using alignment templates
Word alignments are used to extract “good” phrase pairs
Reordering at phrase level instead of word reorderings
Notion of phrase is not defined linguistically
any n-gram in the language is a phrase
State-of-the-art models
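One common way to decide which phrase pairs are “good” is a consistency check against the word alignment: a pair is kept only if no word inside it is aligned to a word outside it. The sketch below assumes that criterion and ignores refinements such as extending phrases over unaligned words; the function name and the example are hypothetical.

```python
def extract_phrases(f_words, e_words, links, max_len=3):
    """links: set of 0-based (i, j) pairs, f_words[i] aligned to e_words[j]."""
    pairs = set()
    for f_start in range(len(f_words)):
        for f_end in range(f_start, min(f_start + max_len, len(f_words))):
            # Translation positions linked to the source span.
            js = [j for i, j in links if f_start <= i <= f_end]
            if not js:
                continue
            e_start, e_end = min(js), max(js)
            if e_end - e_start >= max_len:
                continue
            # Consistency: nothing in the translation span links outside the source span.
            if all(f_start <= i <= f_end
                   for i, j in links if e_start <= j <= e_end):
                pairs.add((" ".join(f_words[f_start:f_end + 1]),
                           " ".join(e_words[e_start:e_end + 1])))
    return pairs

pairs = extract_phrases(["maison", "bleue"], ["blue", "house"], {(0, 1), (1, 0)})
print(sorted(pairs))
```

For this reordered adjective-noun example the result contains the two single-word pairs plus the pair ("maison bleue", "blue house"), so the reordering is captured inside a phrase rather than by moving individual words.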
Reiterating ..
[Figure: example from Petrov [2012]]
Encoding Linguistic Information
Various levels of linguistic information
Morphology: gender, number or tense
Factored phrase models
Syntax: syntactic reordering between language pairs
regular patterns for a language pair
e.g. adjectives in English and French, or
clause reordering between English and German
Syntax-based SMT
Other information
Semantics, Discourse, Pragmatics
All of these are open research problems!
Evaluating MT
Evaluation criteria
fluency of translations
adequacy, i.e. translations preserving meaning
Human judgements are most reliable
Nießen et al. [2000]
Very expensive and time-consuming
Variation in judgements
Automatic evaluation metrics
Compute similarity of translations to reference translations
BLEU (Papineni et al. [2002]), NIST and many more
Choice of metric varies depending on application requirements
How to interpret evaluation scores?
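As a rough idea of what such a metric computes, here is a simplified sentence-level BLEU with a single reference and no smoothing. Real evaluations use corpus-level statistics and usually several references, so treat this as an illustrative sketch; the function names and example sentences are assumptions.

```python
import math
from collections import Counter

def ngrams(words, n):
    """Multiset of the n-grams in a list of words."""
    return Counter(tuple(words[k:k + n]) for k in range(len(words) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified BLEU: clipped n-gram precisions, geometric mean, brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "the house is very small".split()
candidate = "the house is small".split()
print(bleu(candidate, reference, max_n=2), bleu(candidate, reference, max_n=4))
```

The same candidate gets a positive score with bigrams but zero with 4-grams, because no 4-gram matches. This illustrates why a score needs to be interpreted with the metric's settings in mind.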
Next?
VG assignment (optional)
Implement IBM Model 2
Help session next week
Interested further in MT
Feel free to contact Richard or me :-)
References I
Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, 311–318. Philadelphia, Pennsylvania, USA: Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P02-1040.
Brown, Peter F., Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics 19:263–311. URL http://aclweb.org/anthology-new/J/J93/J93-2003.
Collins, Michael. 2012. Statistical Machine Translation: IBM Models 1 and 2.
Koehn, Philipp, Franz J. Och, and Daniel Marcu. 2003. Statistical Phrase-Based Translation. In Proceedings of Human Language Technologies: The 2003 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 48–54. Edmonton, Canada: Association for Computational Linguistics. URL http://aclweb.org/anthology-new/N/N03/N03-1017.
Nießen, Sonja, Franz Josef Och, Gregor Leusch, and Hermann Ney. 2000. An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research. In Proceedings of the Second Conference on International Language Resources and Evaluation (LREC’00), 39–45. Athens, Greece: European Language Resources Association (ELRA).
Dempster, Arthur P., Nan M. Laird, and Donald B. Rubin. 1977. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 39:1–38. URL http://www.jstor.org/stable/2984875.
Petrov, Slav. 2012. Statistical NLP.
Turitzin, Michael. 2005. SMT of French and German into English Using IBM Model 2 Greedy Decoding.
Vauquois, Bernard. 1968. A Survey of Formal Grammars and Algorithms for Recognition and Transformation in Mechanical Translation. In Proceedings of IFIP Congress, 1114–1122. Edinburgh.
Weaver, Warren. 1955. Translation. Technical report, Cambridge, Massachusetts.