Page 1: Direct MT, Example-based MT, Statistical MT

Direct MT, Example-based MT, Statistical MT

Page 2: Direct MT, Example-based MT, Statistical MT

Issues in Machine Translation
• Orthography
  – Writing from left-to-right vs. right-to-left
  – Character sets (alphabetic, logograms, pictograms)
  – Segmentation into word/word-like units
• Morphology
• Lexical: word senses
  – bank: "river bank", "financial institution"
• Syntactic: word order
  – Subject-verb-object vs. subject-object-verb
• Semantic: meaning
  – "ate pasta with a spoon", "ate pasta with marinara", "ate pasta with John"
• Pragmatic: world knowledge
  – "Can you pass me the salt?"
• Social: conversational norms
  – Pronoun usage depends on the conversational partner
• Cultural: idioms and phrases
  – "out of the ballpark", "came from left field"
• Contextual
• In addition, for speech translation:
  – Prosody: JOHN eats bananas; John EATS bananas; John eats BANANAS
  – Pronunciation differences
  – Speech recognition errors
• In a multilingual environment
  – Code switching: use of the linguistic apparatus of one language to express ideas in another language

Page 3: Direct MT, Example-based MT, Statistical MT

MT Approaches: Different levels of meaning transfer

[Figure: pyramid of MT approaches. Depth of analysis increases from Direct MT at the base, through Transfer-based MT (mapping between source and target syntactic structures), to Interlingua at the apex. The source side is processed by parsing and semantic interpretation; the target side is produced by semantic generation and syntactic generation.]

Page 4: Direct MT, Example-based MT, Statistical MT

Spanish : ajá quiero usar mi tarjeta de crédito

English : yeah I wanna use my credit card
Alignment : 1 3 4 5 7 0 6

Direct Machine Translation
• Words are replaced using a dictionary

  – Some amount of morphological processing
• Word reordering is limited
• Quality depends on the size of the dictionary and the closeness of the languages

English : I need to make a collect call

Japanese : 私は コレクト コールを かける 必要があります

Alignment : 1 5 0 3 0 2 4
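To make the direct approach concrete, here is a minimal sketch of dictionary-driven word substitution. The tiny dictionary (romanized forms of the Japanese example above), the crude plural-stripping rule, and the function name are invented for illustration, not taken from any real system:

```python
# A minimal sketch of direct (dictionary-driven) MT.
# The tiny dictionary and the crude morphology rule are illustrative only.

dictionary = {
    "i": "watashi wa", "need": "hitsuyo ga arimasu", "to": "",
    "make": "kakeru", "a": "", "collect": "korekuto", "call": "koru o",
}

def direct_translate(sentence: str) -> str:
    tokens = sentence.lower().split()
    out = []
    for tok in tokens:
        # "Some amount of morphological processing": strip a plural -s
        # if the surface form is unknown (very crude stemming).
        if tok not in dictionary and tok.endswith("s"):
            tok = tok[:-1]
        out.append(dictionary.get(tok, tok))  # pass unknown words through
    # Word reordering is limited: here we do none at all.
    return " ".join(t for t in out if t)

print(direct_translate("I need to make a collect call"))
```

Note that the output keeps English word order, which is why direct MT quality depends so heavily on how close the two languages are.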

Page 5: Direct MT, Example-based MT, Statistical MT

Translation Memory
• Idea is to reuse translations that were done in the past
  – Useful for technical terminology
  – Ideally used in a sub-language translation
• System helps in matching new instances against previously translated instances
• Choices are presented to a human translator through a GUI
• Human translator selects and "stitches" the available options to cover the source-language sentence
• If no match is found, the translator introduces a new translation pair into the translation memory
• Pros:
  – Maintains consistency in translation across multiple translators
  – Improves efficiency of the translation process
• Issues: How is the matching done? (see the sketch below)
  – Word-level matching, morphological root matching
  – Determines robustness of the translation memory
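As an illustration of how matching against a translation memory might work, here is a minimal sketch using edit-distance-style similarity from Python's standard library; the memory contents and the 0.6 threshold are invented for the example:

```python
# A minimal sketch of fuzzy matching against a translation memory.
from difflib import SequenceMatcher

translation_memory = {
    "click the Save button to save your changes":
        "Klicken Sie auf Speichern, um Ihre Änderungen zu speichern",
    "click the Cancel button to discard your changes":
        "Klicken Sie auf Abbrechen, um Ihre Änderungen zu verwerfen",
}

def best_matches(query: str, threshold: float = 0.6):
    """Return previously translated segments similar to the query,
    ranked by similarity, for a human translator to choose from."""
    scored = []
    for source, target in translation_memory.items():
        sim = SequenceMatcher(None, query.lower(), source.lower()).ratio()
        if sim >= threshold:
            scored.append((sim, source, target))
    return sorted(scored, reverse=True)

for sim, src, tgt in best_matches("click the Save button to keep your changes"):
    print(f"{sim:.2f}  {src}  ->  {tgt}")
```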

Page 6: Direct MT, Example-based MT, Statistical MT

Example-based MT
Translation-by-analogy requires:
a. A collection of source/target text pairs
b. A matching metric
c. A word- or phrase-level alignment
d. A method for recombination
ATR EBMT system (E. Sumita, H. Iida, 1991); CMU Pangloss EBMT (R. Brown, 1996)

[Figure: the EBMT pipeline, mirroring the MT pyramid: MATCHING (analysis) of the source, ALIGNMENT (transfer), and RECOMBINATION (generation) of the target; an exact match corresponds to direct translation.]

Page 7: Direct MT, Example-based MT, Statistical MT

Example run of EBMT

English-Japanese examples in the corpus:
1. He buys a notebook → Kare wa noto o kau
2. I read a book on international politics → Watashi wa kokusai seiji nitsuite kakareta hon o yomu

Translation input: He buys a book on international politics
Translation output: Kare wa kokusai seiji nitsuite kakareta hon o kau

• Challenge: finding a good matching metric
  – He bought a notebook
  – A book was bought
  – I read a book on world politics

Page 8: Direct MT, Example-based MT, Statistical MT

Variations in EBMT
• Database of sentence-aligned corpus
• Analysis of the SL
  – Depends on how the database is stored
  – Full sentences, sentence fragments, tree fragments
• Matching metric: idea is to arrive at a semantic closeness
  – Exact match
  – N-gram match
  – Fuzzy match
  – Similarity-based match
  – Matching with variables
• Regeneration of the TL
  – Depends on how the database produces the output

Page 9: Direct MT, Example-based MT, Statistical MT

Issues in EBMT
• Parallel corpora
• Granularity of examples
• Size of example-base
  – Does accuracy improve by growing the example-base?
• Suitability of examples
  – Diversity and consistency of examples
  – Contradictory examples
  – Exceptional examples

(a) Watashi wa komputa o kyoyosuru I share the use of a computer

(b) Watashi wa kuruma o tsukau I use a car

(c) Watashi wa dentaku o shiyosuru I share the use of a calculator

I use a calculator

Page 10: Direct MT, Example-based MT, Statistical MT

Issues in EBMT
• How are examples stored?
  – Context-based examples
    • "OK" depends on dialog context:
      – "wakarimashita (I understand)"
      – "iidesu yo (I agree)"
      – or "ijo desu (let's change the subject)"
  – Annotated tree structures
    • E.g. Kanojo wa kami ga nagai (She has long hair)
    • Trees with linking nodes
  – Multi-level lattices with typographic, orthographic, lexical, syntactic and other information
    • POS information, predicate-argument structure, chunks, dependency trees
  – Generalized examples
    • Dates, names, cities, gender, number, tense are replaced by generalized tokens (see the sketch below)
    • Precision-recall tradeoff
    • A continuum from plain strings to context-sensitive rules
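A small sketch of the "generalized examples" idea: surface tokens from predictable classes are replaced by class tokens before examples are stored or matched. The regular expressions and class names are illustrative only:

```python
# A minimal sketch of example generalization: tokens that belong to
# predictable classes (dates, numbers) are replaced by class tokens,
# trading precision for recall when matching new inputs.
import re

GENERALIZERS = [
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "<DATE>"),
    (re.compile(r"\b\d+(\.\d+)?\b"), "<NUM>"),
]

def generalize(sentence: str) -> str:
    for pattern, token in GENERALIZERS:
        sentence = pattern.sub(token, sentence)
    return sentence

print(generalize("The meeting on 12/05/2023 lasted 90 minutes"))
# -> "The meeting on <DATE> lasted <NUM> minutes"
```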

Page 11: Direct MT, Example-based MT, Statistical MT

Issues in EBMT

String-based
• Sochira ni okeru → We will send it to you
• Sochira wa jimukyoku desu → This is the office

Generalized string
• X o onegai shimasu → may I speak to the X
• X o onegai shimasu → please give me the X

Template format
• N1 N2 N3 → N2' N3' for N1'
  (N1 = sanka "participation", N2 = moshikomi "application", N3 = yoshi "form")

Distance in a thesaurus is used to select the method.

Page 12: Direct MT, Example-based MT, Statistical MT

Issues in EBMT

• Matching:
  – Metric used to measure the similarity of the SL input to the SLs in the example database
  – Exact character-based matching
  – Edit-distance based matching
  – Word-based matching
    • Thesaurus/WordNet-based similarity (see the sketch below)
      – A man eats vegetables → Hito wa yasai o taberu
      – Acid eats metal → San wa kinzoku o okasu
      – He eats potatoes → Kare wa jagaimo o taberu
      – Sulphuric acid eats iron → Ryusan wa tetsu o okasu
    • Thesaurus-free similarity matching based on distributional clustering
  – Annotated word-based matching
    • POS-based matching
    • Relaxation techniques
      – Exact match with deletions and insertions, word-order differences, morphological variants, POS differences
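A sketch of thesaurus-based word similarity for matching, using WordNet through NLTK. It assumes the `wordnet` corpus has been downloaded, and the averaging scheme is a simplification rather than the metric of any particular EBMT system:

```python
# A minimal sketch of WordNet-based similarity between an input and a
# stored example, for use as an EBMT matching metric.
# Requires: pip install nltk ; then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def word_sim(w1: str, w2: str) -> float:
    """Best path similarity between any senses of the two words."""
    best = 0.0
    for s1 in wn.synsets(w1):
        for s2 in wn.synsets(w2):
            sim = s1.path_similarity(s2)
            if sim is not None and sim > best:
                best = sim
    return best

def sentence_sim(input_words, example_words) -> float:
    """Average, over input words, of the best match in the example."""
    scores = [max((word_sim(w, e) for e in example_words), default=0.0)
              for w in input_words]
    return sum(scores) / len(scores) if scores else 0.0

example = "a man eats vegetables".split()
print(sentence_sim("he eats potatoes".split(), example))
print(sentence_sim("acid eats metal".split(), example))
```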

Page 13: Direct MT, Example-based MT, Statistical MT

Matching in EBMT (contd)

• Structure-based matching
  – Tree-based edit distance
  – Case-frame based matching
• Partial matching
  – Not all of the input needs to match the example database
  – Chunks, substrings, fragments can match
  – Assembling the TL output is more challenging

Page 14: Direct MT, Example-based MT, Statistical MT

Adaptability and Recombination in EBMT
Problem:
a. Identify which portion of the associated translation corresponds to the matched portion of the source text (adaptability)
b. Recombine the portions in an appropriate manner

Alignment: can be done using statistical techniques or using bilingual dictionaries.

Boundary friction problem: for English-Japanese, translations of noun phrases can be reused regardless of whether they are subjects or objects.

The handsome boy entered the room
The handsome boy ate his breakfast
I saw the handsome boy

Not in German:
Der schöne Junge aß sein Frühstück
Ich sah den schönen Jungen

Page 15: Direct MT, Example-based MT, Statistical MT

Adaptability

Example retrieval can be scored on two counts: (a) the closeness of the match between the input text and the example, and (b) the adaptability of the example, on the basis of the relationship between the representations of the example and its translation.

Use the Offset Command to increase the spacing between the shapes.

a. Use the Offset Command to specify the spacing between the shapes.
b. Mit der Option Abstand legen Sie den Abstand zwischen den Formen fest.

a. Use the Save Option to save your changes to disk.
b. Mit der Option Speichern können Sie Ihre Änderungen auf Diskette speichern.

Page 16: Direct MT, Example-based MT, Statistical MT

Recombination options are ranked using an n-gram model:
a. Ich sah den schönen Jungen.
b. * Ich sah der schöne Junge.

Page 17: Direct MT, Example-based MT, Statistical MT

Flavors of EBMT
• EBMT can be used as a component in an MT system which also has more traditional elements.
• EBMT may be used
  – in parallel with these other "engines",
  – or just for certain classes of problems,
  – or when some other component cannot deliver a result.
• EBMT may be better suited to some kinds of applications than others.
• The dividing line between EBMT and so-called "traditional" rule-based approaches may not be obvious.

Page 18: Direct MT, Example-based MT, Statistical MT

When to apply EBMT
When one of the following conditions holds true for a linguistic phenomenon, [rule-based] MT is less suitable than EBMT:
(a) Translation rule formation is difficult.
(b) The general rule cannot accurately describe [the] phenomen[on] because it represents a special case.
(c) Translation cannot be made in a compositional way from target words.

Page 19: Direct MT, Example-based MT, Statistical MT

Learning translation patterns
Kare wa kuruma o kuji de ateru.
HE-topic CAR-obj LOTTERY-inst STRIKES
(a) Lit. 'He strikes a car with the lottery.'
(b) He wins a car as a prize in the lottery.

Learn pattern (c) to correct (a) so that it reads like (b).

Page 20: Direct MT, Example-based MT, Statistical MT

Generation of Translation Templates
• "Two-phase" EBMT methodology: "learning" of templates (i.e. transfer rules) from a corpus.
• Parse the translation pairs; align the syntactic units with the help of a bilingual dictionary.
• Generalize by replacing the coupled units with variables marked for syntactic category:
  a. X[NP] no nagasa wa saidai 512 baito de aru. → The maximum length of X[NP] is 512 bytes.
  b. X[NP] no nagasa wa saidai Y[N] baito de aru. → The maximum length of X[NP] is Y[N] bytes.
• Any coupled unit pair can be replaced by variables. Refine templates which give rise to a conflict:
  a. play baseball → yakyu o suru
  b. play tennis → tenisu o suru
  c. play X[NP] → X[NP] o suru
  a. play the piano → piano o hiku
  b. play the violin → baiorin o hiku
  c. play X[NP] → X[NP] o hiku
• Templates are "refined" by the addition of "semantic categories":
  a. play X[NP/sport] → X[NP] o suru
  b. play X[NP/instrument] → X[NP] o hiku
• Also, automatic generalization techniques from paired strings (see the sketch below).
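As a sketch of how such a template might be induced automatically from paired strings, the function below compares two aligned example pairs and replaces the single differing token on each side with a variable; it only handles this simplest case and is purely illustrative:

```python
# A minimal sketch of translation-template induction: given two example
# pairs that differ in exactly one source token and one target token,
# replace the differing tokens with a variable to obtain a template.

def induce_template(pair1, pair2, var="X"):
    (src1, tgt1), (src2, tgt2) = pair1, pair2
    s1, s2, t1, t2 = src1.split(), src2.split(), tgt1.split(), tgt2.split()
    if len(s1) != len(s2) or len(t1) != len(t2):
        return None                      # only the equal-length case here
    s_diff = [i for i, (a, b) in enumerate(zip(s1, s2)) if a != b]
    t_diff = [j for j, (a, b) in enumerate(zip(t1, t2)) if a != b]
    if len(s_diff) != 1 or len(t_diff) != 1:
        return None                      # only a single variable slot
    s1[s_diff[0]] = var
    t1[t_diff[0]] = var
    return " ".join(s1), " ".join(t1)

print(induce_template(("play baseball", "yakyu o suru"),
                      ("play tennis",   "tenisu o suru")))
# -> ('play X', 'X o suru')
```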

Page 21: Direct MT, Example-based MT, Statistical MT

Statistical Machine Translation

Can all the steps of the EBMT technique be induced from a parallel corpus?
What are the parameters of such a model?
What are the components of SMT?

Slides adapted from Dorr and Monz, Knight, Schafer and Smith

Page 22: Direct MT, Example-based MT, Statistical MT

Word-Level Alignments

Given a parallel sentence pair we can link (align) words or phrases that are translations of each other:

Where do we get the sentence pairs from?

Page 23: Direct MT, Example-based MT, Statistical MT

Parallel Resources

Newswire: DE-News (German-English), Hong Kong News, Xinhua News (Chinese-English)
Government: Canadian Hansards (French-English), Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish), UN Treaties (Russian, English, Arabic, . . . )
Manuals: PHP, KDE, OpenOffice (all from OPUS, many languages)
Web pages: STRAND project (Philip Resnik)

Page 24: Direct MT, Example-based MT, Statistical MT

Sentence Alignment

If document De is a translation of document Df, how do we find the translation for each sentence?
The n-th sentence in De is not necessarily the translation of the n-th sentence in document Df.

In addition to 1:1 alignments, there are also 1:0, 0:1, 1:n, and n:1 alignments.
Approximately 90% of the sentence alignments are 1:1.

Page 25: Direct MT, Example-based MT, Statistical MT

Sentence Alignment (c’ntd)

There are several sentence alignment algorithms:
• Align (Gale & Church): aligns sentences based on their character length (shorter sentences tend to have shorter translations than longer sentences). Works astonishingly well.
• Char-align (Church): aligns based on shared character sequences. Works fine for similar languages or technical domains.
• K-Vec (Fung & Church): induces a translation lexicon from the parallel texts based on the distribution of foreign-English word pairs.
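To illustrate the length-based idea behind Gale & Church, here is a heavily simplified dynamic-programming aligner: it only considers 1:1, 1:0, and 0:1 links, scores a 1:1 link by the difference in character length, and uses an arbitrary skip penalty (the real algorithm uses a probabilistic cost and also handles 2:1, 1:2, and 2:2 beads):

```python
# A minimal sketch of length-based sentence alignment (simplified
# Gale & Church style): dynamic programming over sentence pairs.

SKIP_PENALTY = 20  # illustrative constant, not the Gale & Church cost

def align(src_sents, tgt_sents):
    n, m = len(src_sents), len(tgt_sents)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1:1 link scored by length difference
                c = cost[i][j] + abs(len(src_sents[i]) - len(tgt_sents[j]))
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j, "1:1")
            if i < n:            # 1:0 (source sentence left unaligned)
                c = cost[i][j] + SKIP_PENALTY
                if c < cost[i + 1][j]:
                    cost[i + 1][j], back[i + 1][j] = c, (i, j, "1:0")
            if j < m:            # 0:1 (target sentence left unaligned)
                c = cost[i][j] + SKIP_PENALTY
                if c < cost[i][j + 1]:
                    cost[i][j + 1], back[i][j + 1] = c, (i, j, "0:1")
    # Trace back the best path of alignment decisions.
    links, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj, kind = back[i][j]
        links.append((pi if kind != "0:1" else None,
                      pj if kind != "1:0" else None, kind))
        i, j = pi, pj
    return list(reversed(links))
```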

Page 26: Direct MT, Example-based MT, Statistical MT

Computing Translation Probabilities

Given a parallel corpus we can estimate P(e | f). The maximum likelihood estimate of P(e | f) is:

  P(e | f) = freq(e, f) / freq(f)

This is way too specific to get any reasonable frequencies! The vast majority of unseen data will have zero counts!
P(e | f) could be re-defined as:

  P(e | f) ≈ ∏_j max_i P(e_i | f_j)

Problem: the English words maximizing P(e | f) might not result in a readable sentence.

Page 27: Direct MT, Example-based MT, Statistical MT

Decoding

The decoder combines the evidence from P(e) and P(f | e) to find the sequence e that is the best translation:

  argmax_e P(e | f) = argmax_e P(f | e) P(e)

The choice of word e' as a translation of f' depends on the translation probability P(f' | e') and on the context, i.e. the other English words preceding e'.

Page 28: Direct MT, Example-based MT, Statistical MT

Noisy Channel Model for Translation

Page 29: Direct MT, Example-based MT, Statistical MT

Translation Modeling

Determines the probability that the foreign word f is a translation of the English word e.
How do we compute P(f | e) from a parallel corpus?
Statistical approaches rely on the co-occurrence of e and f in the parallel data: if e and f tend to co-occur in parallel sentence pairs, they are likely to be translations of one another.

Page 30: Direct MT, Example-based MT, Statistical MT

Finding Translations in a Parallel Corpus

Into which foreign words f, . . . , f' does e translate?
Commonly, four factors are used:
• How often do e and f co-occur? (translation)
• How likely is a word occurring at position i to translate into a word occurring at position j? (distortion) For example: English is a verb-second language, whereas German is a verb-final language.
• How likely is e to translate into more than one word? (fertility) For example: defeated can translate into eine Niederlage erleiden.
• How likely is a foreign word to be spuriously generated? (null translation)

Page 31: Direct MT, Example-based MT, Statistical MT

Translation Model?

Mary did not slap the green witch
Maria no dió una bofetada a la bruja verde

Generative approach:
[Figure: source-language morphological analysis → source parse tree → semantic representation → generate target structure]

Page 32: Direct MT, Example-based MT, Statistical MT

Translation Model?

Mary did not slap the green witch
Maria no dió una bofetada a la bruja verde

Generative story: [same pipeline figure as the previous slide]
What are all the possible moves and their associated probability tables?

Page 33: Direct MT, Example-based MT, Statistical MT

The Classic Translation Model
Word Substitution/Permutation [IBM Model 3, Brown et al., 1993]

Generative approach:
  Mary did not slap the green witch
  Mary not slap slap slap the green witch           n(3 | slap)   (fertility)
  Mary not slap slap slap NULL the green witch      P-Null        (NULL insertion)
  Maria no dió una bofetada a la verde bruja        t(la | the)   (word translation)
  Maria no dió una bofetada a la bruja verde        d(j | i)      (distortion)

Probabilities can be learned from raw bilingual text.

Page 34: Direct MT, Example-based MT, Statistical MT

Statistical Machine Translation

… la maison … la maison bleue … la fleur …

… the house … the blue house … the flower …

All word alignments equally likely

All P(french-word | english-word) equally likely

Page 35: Direct MT, Example-based MT, Statistical MT

Statistical Machine Translation

… la maison … la maison bleue … la fleur …

… the house … the blue house … the flower …

“la” and “the” observed to co-occur frequently,so P(la | the) is increased.

Page 36: Direct MT, Example-based MT, Statistical MT

Statistical Machine Translation

… la maison … la maison bleue … la fleur …

… the house … the blue house … the flower …

“house” co-occurs with both “la” and “maison”, butP(maison | house) can be raised without limit, to 1.0,

while P(la | house) is limited because of “the”

(pigeonhole principle)

Page 37: Direct MT, Example-based MT, Statistical MT

Statistical Machine Translation

… la maison … la maison bleue … la fleur …

… the house … the blue house … the flower …

settling down after another iteration

Page 38: Direct MT, Example-based MT, Statistical MT

Statistical Machine Translation

… la maison … la maison bleue … la fleur …

… the house … the blue house … the flower …

Inherent hidden structure revealed by EM training!
For details, see:
• "A Statistical MT Tutorial Workbook" (Knight, 1999)
• "The Mathematics of Statistical Machine Translation" (Brown et al., 1993)
• Software: GIZA++

Page 39: Direct MT, Example-based MT, Statistical MT

Statistical Machine Translation

… la maison … la maison bleue … la fleur …

… the house … the blue house … the flower …

P(juste | fair) = 0.411
P(juste | correct) = 0.027
P(juste | right) = 0.020
…

new French sentence

Possible English translations, to be rescored by language model

Page 40: Direct MT, Example-based MT, Statistical MT

IBM Models 1–5

Model 1: bag of words
• Unique local maximum
• Efficient EM algorithm (Models 1-2)
Model 2: general alignment:
  a(e_pos | f_pos, e_length, f_length)
Model 3: fertility: n(k | e)
• No full EM, count only neighbors (Models 3-5)
• Deficient (Models 3-4)
Model 4: relative distortion, word classes
Model 5: extra variables to avoid deficiency

Page 41: Direct MT, Example-based MT, Statistical MT

IBM Model 1
Given an English sentence e1 . . . el and a foreign sentence f1 . . . fm,
we want to find the 'best' alignment a, where a is a set of pairs of the form {(i, j), . . . , (i', j')}, with 0 <= i, i' <= l and 1 <= j, j' <= m.
Note that if (i, j) and (i', j) are both in a, then i equals i', i.e. no many-to-one alignments are allowed.
Note that we add a spurious NULL word to the English sentence at position 0.
In total there are (l + 1)^m different alignments A.
Allowing for many-to-many alignments results in 2^(l·m) possible alignments A.

Page 42: Direct MT, Example-based MT, Statistical MT

IBM Model 1

Simplest of the IBM models.
Does not consider word order (bag-of-words approach).
Does not model one-to-many alignments.
Computationally inexpensive.
Useful for parameter estimates that are passed on to more elaborate models.

Page 43: Direct MT, Example-based MT, Statistical MT

IBM Model 1
Translation probability in terms of alignments:

  P(f | e) = Σ_{a ∈ A} P(f, a | e)

where:

  P(f, a | e) = P(a | e) P(f | a, e) = 1/(l + 1)^m ∏_{j=1}^{m} P(f_j | e_{a_j})

and:

  P(f | e) = 1/(l + 1)^m Σ_{a ∈ A} ∏_{j=1}^{m} P(f_j | e_{a_j})

Page 44: Direct MT, Example-based MT, Statistical MT

IBM Model 1
We want to find the most likely alignment:

  argmax_{a ∈ A} 1/(l + 1)^m ∏_{j=1}^{m} P(f_j | e_{a_j})

Since P(a | e) is the same for all a, this is:

  argmax_{a ∈ A} ∏_{j=1}^{m} P(f_j | e_{a_j})

Problem: we still have to enumerate all alignments.

Page 45: Direct MT, Example-based MT, Statistical MT

IBM Model 1
Since P(f_j | e_i) is independent of P(f_j' | e_i'), we can find the maximum alignment by looking at the individual translation probabilities only.
Let a = (a_1, ..., a_m); then for each a_j:

  a_j = argmax_{0 <= i <= l} P(f_j | e_i)

The best alignment can be computed in a quadratic number of steps: (l + 1) × m.

Page 46: Direct MT, Example-based MT, Statistical MT

Computing Model 1 Parameters

How to compute translation probabilities for Model 1 from a parallel corpus?
Step 1: Determine candidates. For each English word e, collect all foreign words f that co-occur at least once with e.
Step 2: Initialize P(f | e) uniformly, i.e. P(f | e) = 1 / (number of co-occurring foreign words).

Page 47: Direct MT, Example-based MT, Statistical MT

Computing Model 1 Parameters
Step 3: Iteratively refine the translation probabilities:

for n iterations
  set tc to zero
  for each sentence pair (e, f) of lengths (l, m)
    for j = 1 to m
      total = 0
      for i = 1 to l
        total += P(f_j | e_i)
      for i = 1 to l
        tc(f_j | e_i) += P(f_j | e_i) / total
  for each word e
    total = 0
    for each word f s.t. tc(f | e) is defined
      total += tc(f | e)
    for each word f s.t. tc(f | e) is defined
      P(f | e) = tc(f | e) / total
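A runnable sketch of this procedure, with the NULL word at position 0 added as in the worked example on the next slides; variable names are ours:

```python
# A minimal sketch of IBM Model 1 EM training, following the pseudocode
# above and adding the NULL word as in the worked example.
from collections import defaultdict

corpus = [("the dog", "le chien"), ("the cat", "le chat")]
corpus = [(("NULL " + e).split(), f.split()) for e, f in corpus]

# Step 1+2: candidates and uniform initialization of P(f | e).
candidates = defaultdict(set)
for e_sent, f_sent in corpus:
    for e in e_sent:
        candidates[e].update(f_sent)
prob = {e: {f: 1.0 / len(fs) for f in fs} for e, fs in candidates.items()}

# Step 3: iteratively refine translation probabilities (EM).
for iteration in range(5):
    tc = defaultdict(lambda: defaultdict(float))   # expected counts
    for e_sent, f_sent in corpus:
        for f in f_sent:
            total = sum(prob[e][f] for e in e_sent)
            for e in e_sent:
                tc[e][f] += prob[e][f] / total
    for e, counts in tc.items():                   # renormalize per e
        total = sum(counts.values())
        for f in counts:
            prob[e][f] = counts[f] / total

for e in ("the", "dog", "cat"):
    print(e, {f: round(p, 3) for f, p in prob[e].items()})
```

Run on the toy corpus above, five iterations should give values close to those shown on the "After 5 iterations" slide below (e.g. P(chien | dog) ≈ 0.84).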

Page 48: Direct MT, Example-based MT, Statistical MT

IBM Model 1 Example

Parallel 'corpus':
  the dog :: le chien
  the cat :: le chat
Step 1+2 (collect candidates and initialize uniformly):
  P(le | the) = P(chien | the) = P(chat | the) = 1/3
  P(le | dog) = P(chien | dog) = P(chat | dog) = 1/3
  P(le | cat) = P(chien | cat) = P(chat | cat) = 1/3
  P(le | NULL) = P(chien | NULL) = P(chat | NULL) = 1/3

Page 49: Direct MT, Example-based MT, Statistical MT

IBM Model 1 Example

Step 3: Iterate
NULL the dog :: le chien
• j=1
  total = P(le | NULL) + P(le | the) + P(le | dog) = 1
  tc(le | NULL) += P(le | NULL)/1 = 0 += .333/1 = 0.333
  tc(le | the) += P(le | the)/1 = 0 += .333/1 = 0.333
  tc(le | dog) += P(le | dog)/1 = 0 += .333/1 = 0.333
• j=2
  total = P(chien | NULL) + P(chien | the) + P(chien | dog) = 1
  tc(chien | NULL) += P(chien | NULL)/1 = 0 += .333/1 = 0.333
  tc(chien | the) += P(chien | the)/1 = 0 += .333/1 = 0.333
  tc(chien | dog) += P(chien | dog)/1 = 0 += .333/1 = 0.333

Page 50: Direct MT, Example-based MT, Statistical MT

IBM Model 1 Example

NULL the cat :: le chat
• j=1
  total = P(le | NULL) + P(le | the) + P(le | cat) = 1
  tc(le | NULL) += P(le | NULL)/1 = 0.333 += .333/1 = 0.666
  tc(le | the) += P(le | the)/1 = 0.333 += .333/1 = 0.666
  tc(le | cat) += P(le | cat)/1 = 0 += .333/1 = 0.333
• j=2
  total = P(chat | NULL) + P(chat | the) + P(chat | cat) = 1
  tc(chat | NULL) += P(chat | NULL)/1 = 0 += .333/1 = 0.333
  tc(chat | the) += P(chat | the)/1 = 0 += .333/1 = 0.333
  tc(chat | cat) += P(chat | cat)/1 = 0 += .333/1 = 0.333

Page 51: Direct MT, Example-based MT, Statistical MT

IBM Model 1 Example

Re-compute translation probabilities
• total(the) = tc(le | the) + tc(chien | the) + tc(chat | the) = 0.666 + 0.333 + 0.333 = 1.333
  P(le | the) = tc(le | the)/total(the) = 0.666 / 1.333 = 0.5
  P(chien | the) = tc(chien | the)/total(the) = 0.333 / 1.333 ≈ 0.25
  P(chat | the) = tc(chat | the)/total(the) = 0.333 / 1.333 ≈ 0.25
• total(dog) = tc(le | dog) + tc(chien | dog) = 0.666
  P(le | dog) = tc(le | dog)/total(dog) = 0.333 / 0.666 = 0.5
  P(chien | dog) = tc(chien | dog)/total(dog) = 0.333 / 0.666 = 0.5

Page 52: Direct MT, Example-based MT, Statistical MT

IBM Model 1 Example

Iteration 2:
NULL the dog :: le chien
• j=1
  total = P(le | NULL) + P(le | the) + P(le | dog) = 0.5 + 0.5 + 0.5 = 1.5
  tc(le | NULL) += P(le | NULL)/1.5 = 0 += .5/1.5 = 0.333
  tc(le | the) += P(le | the)/1.5 = 0 += .5/1.5 = 0.333
  tc(le | dog) += P(le | dog)/1.5 = 0 += .5/1.5 = 0.333
• j=2
  total = P(chien | NULL) + P(chien | the) + P(chien | dog) = 0.25 + 0.25 + 0.5 = 1
  tc(chien | NULL) += P(chien | NULL)/1 = 0 += .25/1 = 0.25
  tc(chien | the) += P(chien | the)/1 = 0 += .25/1 = 0.25
  tc(chien | dog) += P(chien | dog)/1 = 0 += .5/1 = 0.5

Page 53: Direct MT, Example-based MT, Statistical MT

IBM Model 1 Example

NULL the cat :: le chat
• j=1
  total = P(le | NULL) + P(le | the) + P(le | cat) = 0.5 + 0.5 + 0.5 = 1.5
  tc(le | NULL) += P(le | NULL)/1.5 = 0.333 += .5/1.5 = 0.666
  tc(le | the) += P(le | the)/1.5 = 0.333 += .5/1.5 = 0.666
  tc(le | cat) += P(le | cat)/1.5 = 0 += .5/1.5 = 0.333
• j=2
  total = P(chat | NULL) + P(chat | the) + P(chat | cat) = 0.25 + 0.25 + 0.5 = 1
  tc(chat | NULL) += P(chat | NULL)/1 = 0 += .25/1 = 0.25
  tc(chat | the) += P(chat | the)/1 = 0 += .25/1 = 0.25
  tc(chat | cat) += P(chat | cat)/1 = 0 += .5/1 = 0.5

Page 54: Direct MT, Example-based MT, Statistical MT

IBM Model 1 Example

Re-compute translation probabilities (iteration 2):
• total(the) = tc(le | the) + tc(chien | the) + tc(chat | the) = 0.666 + 0.25 + 0.25 = 1.166
  P(le | the) = tc(le | the)/total(the) = 0.666 / 1.166 ≈ 0.571
  P(chien | the) = tc(chien | the)/total(the) = 0.25 / 1.166 ≈ 0.214
  P(chat | the) = tc(chat | the)/total(the) = 0.25 / 1.166 ≈ 0.214
• total(dog) = tc(le | dog) + tc(chien | dog) = 0.333 + 0.5 = 0.833
  P(le | dog) = tc(le | dog)/total(dog) = 0.333 / 0.833 = 0.4
  P(chien | dog) = tc(chien | dog)/total(dog) = 0.5 / 0.833 = 0.6

Page 55: Direct MT, Example-based MT, Statistical MT

IBM Model 1 Example

After 5 iterations:
  P(le | NULL) = 0.755608028335301
  P(chien | NULL) = 0.122195985832349
  P(chat | NULL) = 0.122195985832349
  P(le | the) = 0.755608028335301
  P(chien | the) = 0.122195985832349
  P(chat | the) = 0.122195985832349
  P(le | dog) = 0.161943319838057
  P(chien | dog) = 0.838056680161943
  P(le | cat) = 0.161943319838057
  P(chat | cat) = 0.838056680161943

Page 56: Direct MT, Example-based MT, Statistical MT

IBM Model 1 Recap

IBM Model 1 allows for an efficient computation of translation probabilities.
No notion of fertility, i.e., it's possible that the same English word is the best translation for all foreign words.
No positional information, i.e., depending on the language pair, there might be a tendency that words occurring at the beginning of the English sentence are more likely to align to words at the beginning of the foreign sentence.

Page 57: Direct MT, Example-based MT, Statistical MT

IBM Model 3

IBM Model 3 offers two additional features compared to IBM Model 1:
• How likely is an English word e to align to k foreign words (fertility)?
• Positional information (distortion): how likely is a word in position i to align to a word in position j?

Page 58: Direct MT, Example-based MT, Statistical MT

IBM Model 3: Fertility

The best Model 1 alignment could be that a single English word aligns to all foreign words.
This is clearly not desirable and we want to constrain the number of words an English word can align to.
Fertility models a probability distribution that word e aligns to k words: n(k | e).
Consequence: translation probabilities cannot be computed independently of each other anymore.
IBM Model 3 has to work with full alignments; note there are up to (l + 1)^m different alignments.

Page 59: Direct MT, Example-based MT, Statistical MT

IBM Model 1 + Model 3

Iterating over all possible alignments is computationally infeasible.
Solution: compute the best alignment with Model 1 and change some of the alignments to generate a set of likely alignments (pegging).
Model 3 takes this restricted set of alignments as input.

Page 60: Direct MT, Example-based MT, Statistical MT

Pegging

Given an alignment a we can derive additional alignments from it by making small changes:
• Changing a link (j, i) to (j, i')
• Swapping a pair of links (j, i) and (j', i') to (j, i') and (j', i)
The resulting set of alignments is called the neighborhood of a.

Page 61: Direct MT, Example-based MT, Statistical MT

IBM Model 3: Distortion

The distortion factor determines how likely it is that an English word in position i aligns to a foreign word in position j, given the lengths of both sentences: d(j | i, l, m).
Note: positions are absolute positions.

Page 62: Direct MT, Example-based MT, Statistical MT

Deficiency

Problem with IBM Model 3: it assigns probability mass to impossible strings.
• Well-formed string: "This is possible"
• Ill-formed but possible string: "This possible is"
• Impossible string:
Impossible strings are due to distortion values that generate different words at the same position.
Impossible strings can still be filtered out in later stages of the translation process.

Page 63: Direct MT, Example-based MT, Statistical MT

Limitations of IBM Models

Only 1-to-N word mapping
Handling fertility-zero words (difficult for decoding)
Almost no syntactic information
• Word classes
• Relative distortion
Long-distance word movement
Fluency of the output depends entirely on the English language model

Page 64: Direct MT, Example-based MT, Statistical MT

Decoding

How to translate new sentences?
A decoder uses the parameters learned on a parallel corpus:
• Translation probabilities
• Fertilities
• Distortions
In combination with a language model, the decoder generates the most likely translation.
Standard algorithms can be used to explore the search space (A*, greedy searching, …).
Similar to the traveling salesman problem.

Page 65: Direct MT, Example-based MT, Statistical MT

Decoding for "Classic" Models
Of all conceivable English word strings, find the one maximizing P(e) x P(f | e).

Decoding is an NP-complete challenge (Knight, 1999).

Several search strategies are available

Each potential English output is called a hypothesis.

Page 66: Direct MT, Example-based MT, Statistical MT

Dynamic Programming Beam Search

[Figure: beam-search lattice expanding hypotheses word by word (1st, 2nd, 3rd, 4th target word) from start to end, until all source words are covered.]

Each partial translation hypothesis contains:
- Last English word chosen + source words covered by it
- Next-to-last English word chosen
- Entire coverage vector (so far) of the source sentence
- Language model and translation model scores (so far)

[Jelinek, 1969; Brown et al., 1996 US Patent; Och, Ueffing, and Ney, 2001]

Page 67: Direct MT, Example-based MT, Statistical MT

Dynamic Programming Beam Search

[Same beam-search figure as the previous slide, with the best predecessor link of each hypothesis highlighted.]

Page 68: Direct MT, Example-based MT, Statistical MT

The Classic Results

la politique de la haine . (Foreign Original)
politics of hate . (Reference Translation)
the policy of the hatred . (IBM4+N-grams+Stack)

nous avons signé le protocole . (Foreign Original)
we did sign the memorandum of agreement . (Reference Translation)
we have signed the protocol . (IBM4+N-grams+Stack)

où était le plan solide ? (Foreign Original)
but where was the solid plan ? (Reference Translation)
where was the economic base ? (IBM4+N-grams+Stack)

the Ministry of Foreign Trade and Economic Cooperation, including foreign direct investment 40.007 billion US dollars today provide data include that year to November china actually using foreign 46.959 billion US dollars and

Page 69: Direct MT, Example-based MT, Statistical MT

Limitations of Word-Based MT

Multiple English words for one French word
• IBM models can do one-to-many (fertility) but not many-to-one
Phrasal translation
• "real estate", "note that", "interest in"
Syntactic transformations
• Verb at the beginning in Arabic
• Translation model penalizes any proposed re-ordering
• Language model not strong enough to force the verb to move to the right place

Page 70: Direct MT, Example-based MT, Statistical MT

Phrase-Based Statistical MT

Page 71: Direct MT, Example-based MT, Statistical MT

Phrase-Based Statistical MT

Foreign input is segmented into phrases
• A "phrase" is any sequence of words
Each phrase is probabilistically translated into English
• P(to the conference | zur Konferenz)
• P(into the meeting | zur Konferenz)
Phrases are probabilistically re-ordered
See [Koehn et al., 2003] for an intro.
This is state-of-the-art!

Morgen fliege ich nach Kanada zur Konferenz
Tomorrow I will fly to the conference in Canada

Page 72: Direct MT, Example-based MT, Statistical MT

Advantages of Phrase-Based

Many-to-many mappings can handle non-compositional phrases
Local context is very useful for disambiguating
• "Interest rate" …
• "Interest in" …
The more data, the longer the learned phrases
• Sometimes whole sentences

Page 73: Direct MT, Example-based MT, Statistical MT

How to Learn the Phrase Translation Table?
One method: "alignment templates" (Och et al., 1999)

Start with word alignment, build phrases from that.

[Figure: word-alignment grid between "Mary did not slap the green witch" and "Maria no dió una bofetada a la bruja verde". This word-to-word alignment is a by-product of training a translation model like IBM Model 3; it is the best (or "Viterbi") alignment.]

Page 74: Direct MT, Example-based MT, Statistical MT

How to Learn the Phrase Translation Table?

[This slide repeats the previous word-alignment figure.]

Page 75: Direct MT, Example-based MT, Statistical MT

IBM Models are 1-to-Many

Run IBM-style aligner both directions, then merge:

[Figure: run the aligner in both directions, take the E→F best alignment and the F→E best alignment, and MERGE them by union or intersection.]

Page 76: Direct MT, Example-based MT, Statistical MT

How to Learn the Phrase Translation Table?

Collect all phrase pairs that are consistent with the word alignment

[Figure: the word-alignment grid between "Mary did not slap the green witch" and "Maria no dió una bofetada a la bruja verde", with one example phrase pair highlighted.]

Page 77: Direct MT, Example-based MT, Statistical MT

Consistent with Word Alignment

Phrase alignment must contain all alignment points for all the words in both phrases!

[Figure: three sub-grids over "Mary did not slap" and "Maria no dió", illustrating one consistent phrase pair and two inconsistent ones.]

Page 78: Direct MT, Example-based MT, Statistical MT


Word Alignment Induced Phrases

(Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green)

Page 79: Direct MT, Example-based MT, Statistical MT


Word Alignment Induced Phrases

(Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green) (a la, the) (dió una bofetada a, slap the)

Page 80: Direct MT, Example-based MT, Statistical MT


Word Alignment Induced Phrases

(Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green) (a la, the) (dió una bofetada a, slap the) (Maria no, Mary did not) (no dió una bofetada, did not slap) (dió una bofetada a la, slap the) (bruja verde, green witch)

Page 81: Direct MT, Example-based MT, Statistical MT


Word Alignment Induced Phrases

(Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green) (a la, the) (dió una bofetada a, slap the) (Maria no, Mary did not) (no dió una bofetada, did not slap) (dió una bofetada a la, slap the) (bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap) (a la bruja verde, the green witch) …

Page 82: Direct MT, Example-based MT, Statistical MT


Word Alignment Induced Phrases

(Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green) (a la, the) (dió una bofetada a, slap the) (Maria no, Mary did not) (no dió una bofetada, did not slap) (dió una bofetada a la, slap the) (bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap) (a la bruja verde, the green witch) … (Maria no dió una bofetada a la bruja verde, Mary did not slap the green witch)
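A sketch of the consistency-based extraction loop that produces such a list. The alignment points below are our assumed alignment for the running example (the slides do not list them explicitly), and the length cap is arbitrary:

```python
# A minimal sketch of phrase-pair extraction: enumerate source spans,
# find the projected target span, and keep the pair if it is consistent
# with the word alignment (no alignment point leaves either span).

def extract_phrases(src, tgt, alignment, max_len=4):
    pairs = set()
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to the chosen source span.
            t_points = [j for i, j in alignment if s_start <= i <= s_end]
            if not t_points:
                continue
            t_start, t_end = min(t_points), max(t_points)
            # Consistency: no link may connect the target span to a
            # source word outside the chosen source span.
            if any(t_start <= j <= t_end and not (s_start <= i <= s_end)
                   for i, j in alignment):
                continue
            if t_end - t_start < max_len:
                pairs.add((" ".join(src[s_start:s_end + 1]),
                           " ".join(tgt[t_start:t_end + 1])))
    return pairs

src = "Maria no dió una bofetada a la bruja verde".split()
tgt = "Mary did not slap the green witch".split()
# (source index, target index) pairs -- an assumed alignment for the example
alignment = [(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3),
             (6, 4), (7, 6), (8, 5)]
for pair in sorted(extract_phrases(src, tgt, alignment)):
    print(pair)
```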

Page 83: Direct MT, Example-based MT, Statistical MT

Phrase Pair Probabilities
A certain phrase pair (f-f-f, e-e-e) may appear many times across the bilingual corpus.

• We hope so!

So, now we have a vast list of phrase pairs and their frequencies – how to assign probabilities?

Page 84: Direct MT, Example-based MT, Statistical MT

Phrase Pair Probabilities

Basic idea:
• No EM training
• Just relative frequency: P(f-f-f | e-e-e) = count(f-f-f, e-e-e) / count(e-e-e)

Important refinements:
• Smooth using word probabilities P(f | e) for individual words connected in the word alignment
  – Some low-count phrase pairs now have high probability, others have low probability
• Discount for ambiguity
  – If phrase e-e-e can map to 5 different French phrases, due to the ambiguity of unaligned words, each pair gets a 1/5 count
• Count BAD events too
  – If phrase e-e-e doesn't map onto any contiguous French phrase, increment event count(BAD, e-e-e)
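A minimal sketch of the basic relative-frequency estimate, without the smoothing, ambiguity-discounting, or BAD-event refinements listed above; the phrase-pair counts are invented:

```python
# A minimal sketch of phrase-table estimation by relative frequency:
# P(f | e) = count(f, e) / count(e), computed from extracted phrase pairs.
from collections import Counter

# (foreign phrase, english phrase) occurrences collected over a corpus
phrase_pair_counts = Counter({
    ("zur Konferenz", "to the conference"): 80,
    ("auf der Konferenz", "to the conference"): 20,
    ("nach Kanada", "in Canada"): 95,
})

english_counts = Counter()
for (f, e), c in phrase_pair_counts.items():
    english_counts[e] += c

phrase_table = {(f, e): c / english_counts[e]
                for (f, e), c in phrase_pair_counts.items()}
print(phrase_table[("zur Konferenz", "to the conference")])  # -> 0.8
```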

Page 85: Direct MT, Example-based MT, Statistical MT

Advanced Training Methods

Page 86: Direct MT, Example-based MT, Statistical MT

Basic Model, Revisited

argmax_e P(e | f)
= argmax_e P(e) x P(f | e) / P(f)
= argmax_e P(e) x P(f | e)

Page 87: Direct MT, Example-based MT, Statistical MT

Basic Model, Revisited

argmax_e P(e | f)
= argmax_e P(e) x P(f | e) / P(f)
= argmax_e P(e)^2.4 x P(f | e)   … works better!

Page 88: Direct MT, Example-based MT, Statistical MT

Basic Model, Revisited

argmax_e P(e | f)
= argmax_e P(e) x P(f | e) / P(f)
= argmax_e P(e)^2.4 x P(f | e) x length(e)^1.1

Rewards longer hypotheses, since these are unfairly punished by P(e)

Page 89: Direct MT, Example-based MT, Statistical MT

Basic Model, Revisited

argmax_e P(e)^2.4 x P(f | e) x length(e)^1.1 x KS^3.7 …

Lots of knowledge sources vote on any given hypothesis.

“Knowledge source” = “feature function” = “score component”.

Feature function simply scores a hypothesis with a real value.

(May be binary, as in “e has a verb”).

Problem: How to set the exponent weights?
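A sketch of what such a log-linear combination of feature functions looks like in code; the feature functions and weights are invented placeholders, and setting the weights well is exactly the problem addressed on the following slides:

```python
# A minimal sketch of log-linear hypothesis scoring: each knowledge
# source is a feature function returning a real value (or log-prob),
# and the model score is the weighted sum of the features.
import math

def score(hypothesis, features, weights):
    return sum(w * f(hypothesis) for f, w in zip(features, weights))

# Invented feature functions for illustration.
features = [
    lambda h: math.log(h["lm_prob"]),               # language model
    lambda h: math.log(h["tm_prob"]),               # translation model
    lambda h: len(h["text"].split()),               # length bonus
    lambda h: 1.0 if " is " in h["text"] else 0.0,  # binary feature
]
weights = [2.4, 1.0, 1.1, 3.7]   # the exponent weights to be tuned

hyp = {"text": "this is a test", "lm_prob": 1e-4, "tm_prob": 1e-3}
print(score(hyp, features, weights))
```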

Page 90: Direct MT, Example-based MT, Statistical MT

MT Evaluation

• Intrinsic: human evaluation, automatic (machine) evaluation
• Extrinsic: how useful is MT system output for…
  – Deciding whether a foreign-language blog is about politics?
  – Cross-language information retrieval?
  – Flagging news stories about terrorist attacks?
  – …

Page 91: Direct MT, Example-based MT, Statistical MT

Human Evaluation

Je suis fatigué.

                        Adequacy   Fluency
Tired is I.                 5          2
Cookies taste good!         1          5
I am exhausted.             5          5

Page 92: Direct MT, Example-based MT, Statistical MT

Human Evaluation

PRO: High quality

CON: Expensive!
A person (preferably bilingual) must make a time-consuming judgment per system hypothesis.
Expense prohibits frequent evaluation of incremental system modifications.

Page 93: Direct MT, Example-based MT, Statistical MT

Automatic Evaluation

PRO: Cheap. Given available reference translations, free thereafter.

CON: We can only measure some proxy for translation quality (such as n-gram overlap or edit distance).

Page 94: Direct MT, Example-based MT, Statistical MT

Automatic Evaluation: Bleu Score

BLEU score: brevity penalty times the geometric mean of the n-gram precisions:

  BLEU = B · exp( (1/N) Σ_{n=1}^{N} log p_n )

where the n-gram precision p_n is

  p_n = Σ_{n-gram ∈ hyp} count_clip(n-gram) / Σ_{n-gram ∈ hyp} count(n-gram)

(counts are clipped: bounded above by the highest count of the n-gram in any reference sentence), and the brevity penalty is

  B = e^(1 - |ref| / |hyp|)  if |ref| > |hyp|,  1 otherwise.

Page 95: Direct MT, Example-based MT, Statistical MT

Automatic Evaluation: Bleu Score

hypothesis 1: I am exhausted
hypothesis 2: Tired is I
reference 1: I am tired
reference 2: I am ready to sleep now

Page 96: Direct MT, Example-based MT, Statistical MT

Automatic Evaluation: Bleu Score

hypothesis 1: I am exhausted
hypothesis 2: Tired is I
hypothesis 3: I I I
reference 1: I am tired
reference 2: I am ready to sleep now and so exhausted

                1-gram   2-gram   3-gram
hypothesis 1     3/3      1/2      0/1
hypothesis 2     1/3      0/2      0/1
hypothesis 3     1/3      0/2      0/1
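A bare-bones sketch of the BLEU computation (clipped n-gram precisions, geometric mean, brevity penalty), applied to the example above. It is sentence-level and unsmoothed, so any zero precision gives a score of 0; for real evaluation a standard implementation such as sacrebleu or NLTK's should be used:

```python
# A minimal sketch of sentence-level BLEU with clipped n-gram counts.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, references, max_n=3):
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        if not hyp_counts:
            precisions.append(0.0)
            continue
        clipped = 0
        for gram, count in hyp_counts.items():
            # Clip by the highest count of this n-gram in any reference.
            max_ref = max(ngrams(r, n).get(gram, 0) for r in refs)
            clipped += min(count, max_ref)
        precisions.append(clipped / sum(hyp_counts.values()))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty against the reference closest in length.
    ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    brevity = math.exp(1 - ref_len / len(hyp)) if ref_len > len(hyp) else 1.0
    return brevity * geo_mean

refs = ["I am tired", "I am ready to sleep now and so exhausted"]
print(bleu("I am exhausted", refs))  # zero trigram precision -> 0.0
print(bleu("I am tired", refs))      # exact match -> 1.0
```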

Page 97: Direct MT, Example-based MT, Statistical MT

Maximum BLEU Training (Och, 2003)

[Figure: an automatic, trainable translation system (translation model, language models #1 and #2, length model, other features) translates Farsi into English MT output; an automatic translation-quality evaluator compares the output against English reference translations (sample "right answers") to compute a BLEU score; a learning algorithm uses the score to directly reduce translation error.]

Yields big improvements in quality.

Page 98: Direct MT, Example-based MT, Statistical MT

Minimizing Error/Maximizing Bleu

• Adjust parameters to minimize error (L) when translating a training set
• Error as a function of the parameters is
  – nonconvex: not guaranteed to find the optimum
  – piecewise constant: slight changes in parameters might not change the output
• Usual method: optimize one parameter at a time with linear programming

Page 99: Direct MT, Example-based MT, Statistical MT

Generative/Discriminative Reunion

Generative models can be cheap to train: "count and normalize" when nothing's hidden.
Discriminative models focus on the problem: "get better translations".
Popular combination:
• Estimate several generative translation and language models using relative frequencies.
• Find their optimal (log-linear) combination using discriminative techniques.

Page 100: Direct MT, Example-based MT, Statistical MT

Generative/Discriminative Reunion

Score each hypothesis with several generative models, e.g.:

  score(t, s) = p_phrase(t | s)^θ1 · p_phrase(s | t)^θ2 · p_lexical(t | s)^θ3 · … · p_LM(t)^θ7 · (# words)^θ8

If necessary, renormalize into a probability distribution:

  Z = Σ_k exp(θ · f_k)

where k ranges over all hypotheses. We then have

  p(t_i | s) = (1/Z) exp(θ · f_i)

for any given hypothesis i.

Exponentiation makes it positive.

Unnecessary if thetas sum to 1 and p’s are all probabilities.

Page 101: Direct MT, Example-based MT, Statistical MT

Minimizing Risk

Instead of the error of the 1-best translation, compute the expected error (risk) using the k-best translations; this makes the function differentiable:

  E_{p_{θ,γ}}[L(s, t)],   where   p_{θ,γ}(t_i | s) = exp(γ θ · f_i) / Σ_k exp(γ θ · f_k)

Smooth the probability estimates using γ to even out local bumpiness. Gradually increase γ to approach the 1-best error.

Page 102: Direct MT, Example-based MT, Statistical MT

Synchronous grammars

