Advanced Techniques in NLP: Machine Translation III – Statistical MT (Jan 2009)
Transcript
Page 1

Advanced Techniques in NLP

Machine Translation III: Statistical MT

Page 2

Approaching MT

• There are many different ways of approaching the problem of MT.

• The choice of approach is complex and depends upon:
– Task requirements
– Human resources
– Linguistic resources

Page 3

Criterial Issues

• Do we want a translation system for one language pair or for many language pairs?

• Can we assume a constrained vocabulary or do we need to deal with arbitrary text?

• What resources exist for the languages that we are dealing with?

• How long will it take us to develop the resources and what human resources?

Page 4

Parallel Data

• Lots of translated text available: 100s of millions of words of translated text for some language pairs
– a book has a few 100,000s of words
– an educated person may read 10,000 words a day
– 3.5 million words a year
– 300 million in a lifetime

• Computers can see more translated text than humans read in a lifetime

• Machines can learn how to translate foreign languages.

[Koehn 2006]

Page 5

Statistical Translation

• Robust
• Domain independent
• Extensible
• Does not require language specialists
• Does require parallel texts
• Uses noisy channel model of translation

Page 6

Noisy Channel Model: Sentence Translation (Brown et al. 1990)

[Diagram: a source sentence passes through a noisy channel and emerges as the target sentence; decoding recovers the source sentence.]

Page 7

Statistical Modelling

• Learn P(f|e) from a parallel corpus
• Not sufficient data to estimate P(f|e) directly

[from Koehn 2006]

Page 8

The Problem of Translation

• Given a sentence T of the target language, seek the source sentence S from which a translator produced T, i.e. find S that maximises P(S|T)

• By Bayes' theorem, P(S|T) = P(S) * P(T|S) / P(T), whose denominator is independent of S.

• Hence it suffices to maximise P(S) * P(T|S)
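The decision rule can be sketched directly. The probability tables below are invented toy values, not output of any real model; `decode` simply picks the candidate source sentence maximising P(S) * P(T|S).

```python
# Toy language model P(S) and translation model P(T|S) for a fixed
# target sentence T = "Jean aime Marie".  All values are invented.
p_lm = {
    "John loves Mary": 0.4,
    "Mary loves John": 0.3,
    "loves John Mary": 0.01,
}
p_tm = {  # P(T | S)
    "John loves Mary": 0.2,
    "Mary loves John": 0.05,
    "loves John Mary": 0.2,
}

def decode(candidates):
    """Return the source sentence S maximising P(S) * P(T|S)."""
    return max(candidates, key=lambda s: p_lm[s] * p_tm[s])

print(decode(p_lm.keys()))  # → John loves Mary
```

Note that the denominator P(T) never appears: since it is the same for every candidate S, it cannot change which candidate wins.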

Page 9

The Three Components of a Statistical MT model

1. Method for computing language model probabilities (P(S))

2. Method for computing translation probabilities (P(T|S))

3. Method for searching amongst source sentences for one that maximises P(S) * P(T|S)

Page 10

A Statistical MT System

[Diagram: a Source Language Model supplies P(S) and a Translation Model supplies P(T|S); given a target sentence T, the Decoder searches for the source sentence S maximising P(S) * P(T|S), which is proportional to P(S|T).]

Page 11

Three Kinds of Model

[Diagram: Statistical Models divide into Language Models and Translation Models; each comes in word based, phrase based and syntax based variants.]

Page 12

Language Models based on N-Grams of Words

• General: P(s1 s2 ... sn) = P(s1) * P(s2|s1) * ... * P(sn|s1 ... s(n-1))

• Trigram: P(s1 s2 ... sn) = P(s1) * P(s2|s1) * P(s3|s1 s2) * ... * P(sn|s(n-2) s(n-1))

• Bigram: P(s1 s2 ... sn) = P(s1) * P(s2|s1) * ... * P(sn|s(n-1))
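As a concrete illustration, a bigram model can be estimated by maximum likelihood from a toy corpus; the `<s>` padding token and the corpus itself are assumptions made for the example.

```python
from collections import Counter

# Tiny corpus; each sentence is padded with a start symbol <s>.
corpus = [
    "<s> the cat sleeps".split(),
    "<s> the dog sleeps".split(),
    "<s> the horse eats".split(),
]

bigrams = Counter()   # counts of (previous word, word)
unigrams = Counter()  # counts of the conditioning word
for sent in corpus:
    for prev, word in zip(sent, sent[1:]):
        bigrams[(prev, word)] += 1
        unigrams[prev] += 1

def p_bigram(sentence):
    """P(s1 ... sn) ≈ product of MLE estimates P(si | s(i-1))."""
    p = 1.0
    for prev, word in zip(sentence, sentence[1:]):
        p *= bigrams[(prev, word)] / unigrams[prev]
    return p

# P = P(the|<s>) * P(cat|the) * P(sleeps|cat) = 1 * 1/3 * 1
print(p_bigram("<s> the cat sleeps".split()))
```

A real system would smooth these estimates; unsmoothed MLE assigns probability zero to any unseen bigram.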

Page 13

Syntax Based Language Models

• Good syntax tree – good English
• Allows for long distance constraints
• Left sentence preferred by syntax based model

Page 14

Word-Based Translation Models

• Translation process is decomposed into smaller steps
• Each step is tied to words
• Based on IBM Models [Brown et al., 1993]

[from Koehn 2006]

Page 15

Word TM derived from Bitext

ENGLISH            FRENCH
the cat sleeps     le chat dort
the dog sleeps     le chien dort
the horse eats     le cheval mange

Page 16

le chat dort/the cat sleeps

        the  cat  dog  horse  sleeps  eats
le      I    I    -    -      I       -
chat    I    I    -    -      I       -
chien   -    -    -    -      -       -
cheval  -    -    -    -      -       -
dort    I    I    -    -      I       -
mange   -    -    -    -      -       -

Page 17

le chien dort/the dog sleeps

        the  cat  dog  horse  sleeps  eats
le      II   I    I    -      II      -
chat    I    I    -    -      I       -
chien   I    -    I    -      I       -
cheval  -    -    -    -      -       -
dort    II   I    I    -      II      -
mange   -    -    -    -      -       -

Page 18

le cheval mange/the horse eats

        the  cat  dog  horse  sleeps  eats
le      III  I    I    I      II      I
chat    I    I    -    -      I       -
chien   I    -    I    -      I       -
cheval  I    -    -    I      -       I
dort    II   I    I    -      II      -
mange   I    -    -    I      -       I

P(t|le) 3/9  1/9  1/9  1/9    2/9     1/9
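The tallies above amount to co-occurrence counting, which takes only a few lines:

```python
from collections import Counter

# Count co-occurrences over the three sentence pairs, then normalise
# to obtain P(t|s) for the French word s = "le".
bitext = [
    ("le chat dort", "the cat sleeps"),
    ("le chien dort", "the dog sleeps"),
    ("le cheval mange", "the horse eats"),
]

cooc = Counter()    # co-occurrence tallies, as in the tables
totals = Counter()  # total tallies per French word
for fr, en in bitext:
    for s in fr.split():
        for t in en.split():
            cooc[(s, t)] += 1
            totals[s] += 1

english = "the cat dog horse sleeps eats".split()
p_t_given_le = {t: cooc[("le", t)] / totals["le"] for t in english}
# P(the|le) = 3/9 and P(sleeps|le) = 2/9, matching the bottom row
print(p_t_given_le["the"], p_t_given_le["sleeps"])
```

Raw co-occurrence counts are only a starting point; the EM procedure on the following slides refines them into proper alignment-based estimates.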

Page 19

Parameter Estimation

• Based on counting occurrences within monolingual and bilingual data.

• For language model, we need only source language text.

• For translation model, we need pairs of sentences that are translations of each other.

• Use EM (Expectation Maximisation) Algorithm (Baum 1972) to optimize model parameters.

Page 20

EM Algorithm

• Word Alignments: for a sentence pair ("a b c", "x y z"), alignments are formed from arbitrary pairings of words from the two sentences and include (a.x, b.y, c.z), (a.z, b.y, c.x), etc.

• There is a large number of possible alignments, since we also allow, e.g., (ab.x, 0.y, c.z)

Page 21

EM Algorithm

1. Make an initial estimate of parameters. This can be used to compute the probability of any possible word alignment.

2. Re-estimate parameters by ranking each possible alignment by its probability according to the initial guess.

3. Repeated iterations assign ever greater probability to the set of sentences actually observed.

The algorithm leads to a local maximum of the probability of the observed sentence pairs as a function of the model parameters.
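The three steps can be sketched as a toy IBM Model 1 trainer over the bitext from the earlier slides: a uniform initial estimate, then repeated expectation over alignments and re-estimation of P(t|s).

```python
from collections import defaultdict

# Toy bitext: English source s, French target t, as on the tally slides.
bitext = [
    ("the cat sleeps".split(), "le chat dort".split()),
    ("the dog sleeps".split(), "le chien dort".split()),
    ("the horse eats".split(), "le cheval mange".split()),
]

target_vocab = {t for _, tgt in bitext for t in tgt}
# Step 1: uniform initial estimate of P(t|s)
prob = defaultdict(lambda: 1.0 / len(target_vocab))

for _ in range(10):
    count = defaultdict(float)   # expected count of (s, t) links
    total = defaultdict(float)   # expected count of s
    # Step 2 (E): fractional counts weighted by current alignment probability
    for src, tgt in bitext:
        for t in tgt:
            norm = sum(prob[(s, t)] for s in src)
            for s in src:
                c = prob[(s, t)] / norm
                count[(s, t)] += c
                total[s] += c
    # Step 3 (M): re-estimate P(t|s) from the expected counts
    for (s, t), c in count.items():
        prob[(s, t)] = c / total[s]

# EM has learned that "cat" translates as "chat", not "le" or "dort"
print(prob[("cat", "chat")] > prob[("cat", "le")])  # → True
```

On this pigeonhole data the lexicon sharpens quickly: "the" absorbs "le", "sleeps" absorbs "dort", and the remaining content words pair off.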

Page 22

Parameters for the IBM Translation Model

• Word Translation Probability P(t|s): probability that source word s is translated as target word t.

• Fertility P(n|s): probability that source word s is translated by n target words (0 ≤ n ≤ 25).

• Distortion P(i|j,l): probability that the source word at position j is translated by the target word at position i in a target sentence of length l.

Page 23

Experiment 1 (Brown et al. 1990)

• Hansard corpus: 40,000 pairs of sentences, approx. 800,000 words in each language.

• Considered the 9,000 most common words in each language.

• Assumptions (initial parameter values):
– each of the 9,000 target words equally likely as a translation of each of the source words
– each of the fertilities from 0 to 25 equally likely for each of the 9,000 source words
– each target position equally likely given each source position and target length

Page 24

English: the

French  Probability        Fertility  Probability
le      .610               1          .871
la      .178               0          .124
l'      .083               2          .004
les     .023
ce      .013
il      .012
de      .009
à       .007
que     .007

Page 25

English: not

French       Probability   Fertility  Probability
pas          .469          2          .758
ne           .460          0          .133
non          .024          1          .106
pas du tout  .003
faux         .003
plus         .002
ce           .002
que          .002
jamais       .002

Page 26

English: hear

French    Probability      Fertility  Probability
bravo     .992             0          .584
entendre  .005             1          .416
entendu   .002
entends   .001

Page 27

Sentence Translation Probability

• Given a translation model for words, we can compute the translation probability of a sentence, taking the parameters into account.

• P(Jean aime Marie|John loves Mary) =
P(Jean|John) * P(1|John) * P(1|1,3) *
P(aime|loves) * P(1|loves) * P(2|2,3) *
P(Marie|Mary) * P(1|Mary) * P(3|3,3)
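With hypothetical parameter values, all invented purely for illustration, the product works out as:

```python
# Hypothetical parameters: word translation P(t|s), fertility P(1|s),
# and distortion P(i|j,l).  The real model estimates these from data.
t = {("Jean", "John"): 0.9, ("aime", "loves"): 0.8, ("Marie", "Mary"): 0.9}
fert = {"John": 0.9, "loves": 0.9, "Mary": 0.9}        # P(1|s)
dist = {(1, 1, 3): 0.7, (2, 2, 3): 0.7, (3, 3, 3): 0.7}  # P(i|j,l)

pairs = [("Jean", "John"), ("aime", "loves"), ("Marie", "Mary")]
p = 1.0
for j, (f, e) in enumerate(pairs, start=1):
    # translation * fertility * distortion, one factor triple per word
    p *= t[(f, e)] * fert[e] * dist[(j, j, 3)]

print(round(p, 4))  # → 0.162
```

Even with generous toy parameters the probability is small; real sentence probabilities are tiny, which is why decoders work with log probabilities in practice.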

Page 28

Flaws in Word-Based Translation

• Model handles many:one translations P(ttt|s) but not one:many P(t|sss), e.g.
– Zeitmangel erschwert das Problem .
– lack of time makes more difficult the problem .

• Correct translation: Lack of time makes the problem more difficult.

• MT output: Time makes the problem.

[from Koehn 2006]

Page 29

Flaws in Word-Based Translation (2)

• Phrasal translation P(ttt|ssss), e.g. erübrigt sich / there is no point in
– Eine Diskussion erübrigt sich demnach .
– a discussion is made unnecessary itself therefore .
– Correct translation: Therefore, there is no point in a discussion.

• MT output: A debate turned therefore .

[from Koehn 2006]

Page 30

Flaws in Word-Based Translation (3)

• Syntactic transformations, e.g. object/subject reordering
– Den Vorschlag lehnt die Kommission ab
– the proposal rejects the commission off

• Correct translation: The commission rejects the proposal .

• MT output: The proposal rejects the commission.

[from Koehn 2006]

Page 31

Phrase Based Translation Models

• Foreign input is segmented into phrases.
• Phrases are any sequence of words, not necessarily linguistically motivated.
• Each phrase is translated into English.
• Phrases are reordered.

[from Koehn 2006]

Page 32

Syntax Based Translation Models

Page 33

Word Based Decoding: Searching for the Best Translation (Brown 1990)

• Maintain a list of hypotheses.
• Initial hypothesis: (Jean aime Marie | *)
• Search proceeds iteratively.
• At each iteration we extend the most promising hypotheses with additional words:
Jean aime Marie | John(1) *
Jean aime Marie | * loves(2) *
Jean aime Marie | * Mary(3) *
Jean aime Marie | Jean(1) *

• Parenthesised numbers indicate the corresponding position in the target sentence
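A minimal sketch of this beam-style expansion, with an invented scored lexicon standing in for the real translation and language models:

```python
# Hypotheses pair English words with target positions; at each step the
# most promising hypotheses are extended with one more word.
target = ["Jean", "aime", "Marie"]
lexicon = {"Jean": [("John", 0.9), ("Jean", 0.1)],
           "aime": [("loves", 0.8), ("likes", 0.2)],
           "Marie": [("Mary", 0.9), ("Marie", 0.1)]}

hyps = [([], 1.0)]  # each hypothesis: (list of (english, position), score)
for j, f in enumerate(target, start=1):
    extended = []
    for words, score in hyps:
        for e, p in lexicon[f]:
            extended.append((words + [(e, j)], score * p))
    # keep only the most promising hypotheses (beam width 2)
    hyps = sorted(extended, key=lambda h: -h[1])[:2]

best_words, best_score = hyps[0]
print(best_words)  # → [('John', 1), ('loves', 2), ('Mary', 3)]
```

A real decoder also permits covering target words out of order (hence the `*` gaps on the slide); this sketch fixes the coverage order for brevity.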

Page 34

Phrase-Based Decoding

• Build translation left to right
– select foreign word(s) to be translated
– find English phrase translation
– add English phrase to end of partial translation

[Koehn 2006]
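These three steps can be sketched with a hypothetical phrase table for the demo sentence used later with Pharaoh; the sketch omits reordering for brevity.

```python
# Hypothetical phrase table; keys are foreign word tuples.
phrase_table = {
    ("das", "ist"): "this is",
    ("ein",): "a",
    ("kleines", "haus"): "small house",
}

def translate(foreign):
    """Greedy left-to-right phrase decoding (longest match first)."""
    words = foreign.split()
    output = []
    i = 0
    while i < len(words):
        # select foreign word(s) to be translated: try the longer span first
        for span in (2, 1):
            key = tuple(words[i:i + span])
            if key in phrase_table:
                # add the English phrase to the end of the partial translation
                output.append(phrase_table[key])
                i += span
                break
        else:
            i += 1  # no translation option: skip the word
    return " ".join(output)

print(translate("das ist ein kleines haus"))  # → this is a small house
```

A real phrase-based decoder scores many competing segmentations and orderings with the translation and language models rather than committing greedily.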

Page 35

Decoding Process

• one to many translation[Koehn 2006]

Page 36

Decoding Process

• many to one translation[Koehn 2006]

Page 37

Decoding Process

• translation finished[Koehn 2006]

Page 38

Hypothesis Expansion

• Start with an empty hypothesis
– e: no English words
– f: no foreign words covered
– p: probability 1

[Koehn 2006]

Page 39

Hypothesis Expansion

Page 40

Hypothesis Expansion

• further hypothesis expansion[Koehn 2006]

Page 41

Decoding Process

• adding more hypotheses leads to explosion of search space.

[Koehn 2006]

Page 42

Hypothesis Recombination

• Sometimes different choices of hypothesis lead to the same translation result.

• Such paths can be combined.[Koehn 2006]

Page 43

Hypothesis Recombination

• Drop the weaker path
• Keep a pointer from the weaker path to the surviving one

[Koehn 2006]

Page 44

Pruning

• Hypothesis recombination is not sufficient
– Heuristically discard weak hypotheses early

• Organise hypotheses in stacks, e.g. by
– same foreign words covered
– same number of foreign words covered (Pharaoh does this)
– same number of English words produced

• Compare hypotheses in stacks, discard bad ones
– histogram pruning: keep top n hypotheses in each stack (e.g., n = 100)
– threshold pruning: keep hypotheses that are at most α times the cost of the best hypothesis in the stack (e.g., α = 0.001)
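Both pruning schemes over one stack, sketched with invented hypothesis scores; here scores are probabilities, so the threshold keeps hypotheses within a factor α of the best one.

```python
# Each hypothesis is a (partial translation, score) pair; all scores
# are invented for the example.
stack = [("the house", 0.5), ("a house", 0.3),
         ("house the", 1e-5), ("house it", 1e-9)]

def prune(stack, n=2, alpha=0.001):
    """Histogram pruning (top n) plus threshold pruning (alpha x best)."""
    stack = sorted(stack, key=lambda h: -h[1])
    best = stack[0][1]
    stack = [h for h in stack if h[1] >= alpha * best]  # threshold pruning
    return stack[:n]                                    # histogram pruning

print(prune(stack))  # → [('the house', 0.5), ('a house', 0.3)]
```

Pruning makes the search approximate: a hypothesis discarded early might have led to the best complete translation, which is why the next slides add future cost estimates.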

Page 45

Hypothesis Stacks

• Organisation of hypotheses into stacks
– here: based on number of foreign words translated
– during translation all hypotheses from one stack are expanded
– expanded hypotheses are placed into stacks

[Koehn 2006]

Page 46

Comparing Hypotheses covering Same Number of Foreign Words

• A hypothesis that covers an easy part of the sentence is preferred
• Need to consider the future cost of uncovered parts
• Should take account of one to many translation

[Koehn 2006]

Page 47

Future Cost Estimation

• Use future cost estimates when pruning hypotheses
• For each uncovered contiguous span:
– look up the future cost of each maximal contiguous uncovered span
– add it to the actually accumulated cost of the translation option for pruning

[Koehn 2006]
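A sketch of the idea, assuming hypothetical per-span costs (negative log probabilities) for a short foreign sentence; the cheapest segmentation of each uncovered span is found by a small dynamic programme.

```python
foreign = ["das", "ist", "ein", "kleines", "haus"]
# Hypothetical cheapest translation cost for each span [b, e); a real
# system derives these from its phrase table and language model.
span_cost = {(0, 1): 0.8, (1, 2): 0.6, (0, 2): 1.0, (2, 3): 0.5,
             (3, 4): 1.0, (4, 5): 0.9, (3, 5): 1.5, (2, 5): 2.5}

def future_cost(covered):
    """Sum the cheapest cost over each maximal uncovered contiguous span."""
    cost, i, n = 0.0, 0, len(foreign)
    while i < n:
        if covered[i]:
            i += 1
            continue
        j = i
        while j < n and not covered[j]:
            j += 1  # [i, j) is now a maximal uncovered span
        # dynamic programme: cheapest segmentation of the span
        best = [0.0] + [float("inf")] * (j - i)
        for b in range(i, j):
            for e in range(b + 1, j + 1):
                if (b, e) in span_cost:
                    best[e - i] = min(best[e - i], best[b - i] + span_cost[(b, e)])
        cost += best[j - i]
        i = j
    return cost

# hypothesis covering only "ein": uncovered spans are [0,2) and [3,5)
print(future_cost([False, False, True, False, False]))  # → 2.5
```

The pruning comparison then uses accumulated cost plus this estimate, so a hypothesis that has only covered the easy words no longer looks artificially cheap.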

Page 48

Pharaoh

• A beam search decoder for phrase-based models
– works with various phrase-based models
– beam search algorithm
– time complexity roughly linear in input length
– good quality takes about 1 second per sentence
– very good performance in DARPA/NIST Evaluation
– freely available for researchers: http://www.isi.edu/licensed-sw/pharaoh/

• Coming soon: open source version of Pharaoh

Page 49

Pharaoh Demo

% echo 'das ist ein kleines haus' | pharaoh -f pharaoh.ini > out
Pharaoh v1.2.9, written by Philipp Koehn
a beam search decoder for phrase-based statistical machine translation models
(c) 2002-2003 University of Southern California
(c) 2004 Massachusetts Institute of Technology
(c) 2005 University of Edinburgh, Scotland
loading language model from europarl.srilm
loading phrase translation table from phrase-table, stored 21, pruned 0, kept 21
loaded data structures in 2 seconds
reading input sentences
translating 1 sentences
translated 1 sentences in 0 seconds

% cat out
this is a small house

Page 50

Brown Experiment 2

• Perform translation using 1000 most frequent words in the English corpus.

• 1,700 most frequently used French words in translations of sentences completely covered by 1000 word English vocabulary.

• 117,000 pairs of sentences completely covered by both vocabularies.

• Parameters of English language model from 570,000 sentences in English part.

Page 51

Experiment 2 contd

• 73 French sentences from elsewhere in the corpus were tested. Results were classified as:
– Exact – same as actual translation
– Alternate – same meaning
– Different – legitimate translation but different meaning
– Wrong – could not be interpreted as a translation
– Ungrammatical – grammatically deficient

• Corrections to the last three categories were made and keystrokes were counted

Page 52

Results

Category        # sentences   percent
Exact           4             5
Alternate       18            25
Different       13            18
Wrong           11            15
Ungrammatical   27            37

Total           73

Page 53

Results - Discussion

• According to Brown et al., the system performed successfully 48% of the time (first three categories).

• 776 keystrokes were needed to repair the output, versus 1916 keystrokes to generate all 73 translations from scratch.

• According to the authors, the system therefore reduces work by 60%.

Page 54

Issues

Automatic evaluation methods
• can computers decide what are good translations?

Phrase-based models
• what are the atomic units of translation?
• how are they discovered?
• currently the best method in statistical machine translation

Discriminative training
• what methods directly optimise translation performance?

Page 55

The Speculative (Koehn 2006)

Syntax-based transfer models
• how can we build models that take advantage of syntax?
• how can we ensure that the output is grammatical?

Factored translation models
• how can we integrate different levels of abstraction?

Page 56

Bibliography

• Statistical MT: Brown et al., "A Statistical Approach to Machine Translation", Computational Linguistics 16(2), 1990, pp. 79-85 (search "ACL Anthology")

• Koehn tutorial (see http://www.iccs.inf.ed.ac.uk/~pkoehn/)

