Statistical Machine Translation
Word Alignment
Stephan Vogel
MT Class, Spring Semester 2011
Overview
- Word alignment: some observations
- Models IBM1 and IBM2: 0th-order position model
- HMM alignment model: 1st-order position model
- IBM3: fertility
- IBM4: plus relative distortion
Alignment Example
Observations:
- Mostly 1-to-1
- Some 1-to-many
- Some 1-to-nothing
- Often monotone
- Not always clear-cut: English 'eight' is a time expression, German has 'acht Uhr'; one could also leave 'Uhr' unaligned
Evaluating Alignment
Given some manually aligned data (ref) and automatically aligned data (hyp), links can be:
- Correct, i.e. link in hyp matches link in ref: true positive (tp)
- Wrong, i.e. link in hyp but not in ref: false positive (fp)
- Missing, i.e. link in ref but not in hyp: false negative (fn)

Evaluation measures:
- Precision: P = tp / (tp + fp) = correct / links_in_hyp
- Recall: R = tp / (tp + fn) = correct / links_in_ref
- Alignment Error Rate: AER = 1 - F = 1 - 2tp / (2tp + fp + fn)
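To make the bookkeeping concrete, here is a minimal Python sketch (not from the slides) that computes the three measures from sets of (j, i) link tuples; the function name and the link representation are assumptions:

```python
def alignment_scores(hyp_links, ref_links):
    """Precision, recall, and AER from sets of (source, target) link tuples."""
    hyp, ref = set(hyp_links), set(ref_links)
    tp = len(hyp & ref)   # links in both hyp and ref
    fp = len(hyp - ref)   # links only in hyp
    fn = len(ref - hyp)   # links only in ref
    precision = tp / (tp + fp) if hyp else 0.0
    recall = tp / (tp + fn) if ref else 0.0
    aer = 1.0 - (2 * tp) / (2 * tp + fp + fn)
    return precision, recall, aer

# Example: one wrong link and one missing link
p, r, aer = alignment_scores({(0, 0), (1, 1), (2, 1)}, {(0, 0), (1, 1), (3, 2)})
print(p, r, aer)  # 0.667, 0.667, 0.333
```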
Sure and Possible Links
Sometimes it is difficult for human annotators to decide; differentiate between sure and possible links:
- En: Det Noun vs. Ch: Noun: don't align the Det, or align it to NULL?
- En: Det Noun vs. Ar: DetNoun: should the Det be aligned to DetNoun?

Alignment Error Rate with sure and possible links (Och 2000):
- A = generated links
- S = sure links (not finding a sure link is an error)
- P = possible links (proposing a link which is not possible is an error)
$$\text{Precision} = \frac{|A \cap P|}{|A|}, \qquad \text{Recall} = \frac{|A \cap S|}{|S|}, \qquad \text{AER} = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}$$
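Under the same assumed set representation as above, the Och-style measures can be sketched as:

```python
def och_scores(A, S, P):
    """Precision, recall, and AER with sure (S) and possible (P) links.

    By the usual convention S is a subset of P; A is the generated alignment.
    """
    precision = len(A & P) / len(A)
    recall = len(A & S) / len(S)
    aer = 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))
    return precision, recall, aer
```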
Word Alignment Models
- IBM1: lexical probabilities only
- IBM2: lexicon plus absolute position
- IBM3: plus fertilities
- IBM4: inverted relative position alignment
- IBM5: non-deficient version of IBM4
- HMM: lexicon plus relative position
- BiBr: Bilingual Bracketing, lexical probabilities plus reordering via parallel segmentation
- Syntactic alignment models

[Brown et al. 1993, Vogel et al. 1996, Och et al. 2000, Wu 1997, Yamada et al. 2003, and many others]
GIZA++ Alignment Toolkit
All standard alignment models (IBM1 … IBM5, HMM) are implemented in GIZA++.

The toolkit was started (as GIZA) at the Johns Hopkins University workshop in 1998, then extended and improved by Franz Josef Och. It is now used by many groups.

Known problems:
- Memory consumption when training on large corpora
- Writes many large files (depends on your parameter settings)

Extensions for large corpora (Qin Gao):
- Distributed GIZA: runs on many machines, I/O bound
- Multithreaded GIZA: runs on one machine with multiple cores
Notation
Source language:
- f: source (French) word
- J: length of source sentence
- j: position in source sentence (target position)
- $f_1^J = f_1 \ldots f_j \ldots f_J$: source sentence

Target language:
- e: target (English) word
- I: length of target sentence
- i: position in target sentence (source position)
- $e_1^I = e_1 \ldots e_i \ldots e_I$: target sentence

Alignment: relation mapping source to target positions
- $i = a_j$: position i of $e_i$ which is aligned to position j
- $a_1^J = a_1 \ldots a_j \ldots a_J$: whole alignment
SMT - Principle
Translate a 'French' string $f_1^J = f_1 \ldots f_J$ into an 'English' string $e_1^I = e_1 \ldots e_I$.

Bayes' decision rule for translation:

$$\hat{e}_1^I = \operatorname*{argmax}_{e_1^I} \{ \Pr(e_1^I \mid f_1^J) \} = \operatorname*{argmax}_{e_1^I} \{ \Pr(e_1^I) \cdot \Pr(f_1^J \mid e_1^I) \}$$

Why this inversion of the translation direction?
- Decomposition of dependencies: makes modeling easier
- Cooperation of two knowledge sources for the final decision

Note: the IBM paper and GIZA call e the source and f the target.
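As a toy illustration of the decision rule (not part of the slides; the candidate list and the two scoring functions are assumed to exist), the argmax over translations is conveniently taken in log space:

```python
def best_translation(candidates, lm_logprob, tm_logprob, f_sent):
    """Bayes decision rule: argmax_e { Pr(e) * Pr(f|e) }, done in log space."""
    return max(candidates, key=lambda e: lm_logprob(e) + tm_logprob(f_sent, e))
```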
Alignment as Hidden Variable
'Hidden alignments' capture word-to-word correspondences:
- Mapping: a subset of [1, …, J] x [1, …, I]
- Number of connections: J · I (each source word with each target word)
- Number of alignments: $2^{J \cdot I}$ (each connection yes/no)

Summation over all alignments:

$$\Pr(f_1^J \mid e_1^I) = \sum_{A} \Pr(f_1^J, A \mid e_1^I)$$

Too many alignments; the summation is not feasible.
Restricted Alignment
- Each source word has exactly one connection
- The alignment mapping becomes a function: $j \to i = a_j$
- Number of alignments is now: $I^J$
- Sum over all alignments:
  - Not possible to enumerate
  - In some situations full summation is possible through Dynamic Programming
  - In other situations: take only the best alignment, and perhaps some alignments close to the best one
Empty Position (Null Word)
Sometimes a word has no correspondence. The alignment function aligns each source word to one target word, i.e. it cannot skip a source word.

Solution:
- Introduce empty position 0 with null word $e_0$
- 'Skip' source word $f_j$ by aligning it to $e_0$
- Target sentence is extended to: $e_0^I = e_0 e_1 \ldots e_I$
- Alignment is extended to: $a_0^J = a_0 a_1 \ldots a_J$
Translation Model
Sum over all alignments:

$$\Pr(f_1^J \mid e_0^I) = \sum_{a_1^J} \Pr(f_1^J, a_1^J \mid e_0^I)$$

Three probability distributions:

$$\Pr(f_1^J, a_1^J \mid e_0^I) = \Pr(J \mid e_0^I) \cdot \Pr(a_1^J \mid J, e_0^I) \cdot \Pr(f_1^J \mid a_1^J, J, e_0^I)$$

- Length: $\Pr(J \mid e_0^I)$
- Alignment: $\Pr(a_1^J \mid J, e_0^I) = \prod_{j=1}^{J} \Pr(a_j \mid a_1^{j-1}, J, e_0^I)$
- Lexicon: $\Pr(f_1^J \mid a_1^J, J, e_0^I) = \prod_{j=1}^{J} \Pr(f_j \mid f_1^{j-1}, a_1^J, J, e_0^I)$
Model Assumptions
Decompose interaction into pairwise dependencies:
- Length: source length only dependent on target length (very weak): $\Pr(J \mid e_0^I) = p(J \mid I)$
- Alignment:
  - Zero-order model: target position only dependent on source position: $\Pr(a_j \mid a_1^{j-1}, J, e_0^I) = p(a_j \mid j, I, J)$
  - First-order model: target position only dependent on previous target position: $\Pr(a_j \mid a_1^{j-1}, J, e_0^I) = p(a_j \mid a_{j-1}, I, J)$
- Lexicon: source word only dependent on the aligned word: $\Pr(f_j \mid f_1^{j-1}, a_1^J, J, e_0^I) = p(f_j \mid e_{a_j})$
Mixture Model
Interpretation as mixture model by direct decomposition
Again, simplifying model assumptions are applied:

$$\begin{aligned} \Pr(f_1^J \mid e_1^I) &= p(J \mid I) \prod_{j=1}^{J} p(f_j \mid j, e_1^I) \\ &= p(J \mid I) \prod_{j=1}^{J} \sum_{i=1}^{I} p(f_j, i \mid j, e_1^I) \\ &= p(J \mid I) \prod_{j=1}^{J} \sum_{i=1}^{I} p(i \mid j, I, J)\, p(f_j \mid e_i) \end{aligned}$$
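To make the final line concrete, here is a small Python sketch (illustrative, not from the slides; the dict-based tables lex[(f,e)], align[(i,j,I,J)], and len_prob[(J,I)] are my assumptions) that evaluates Pr(f|e) for one sentence pair:

```python
def ibm2_likelihood(f_sent, e_sent, lex, align, len_prob):
    """Pr(f_1^J | e_1^I) = p(J|I) * prod_j sum_i p(i|j,I,J) * p(f_j|e_i)."""
    J, I = len(f_sent), len(e_sent)
    prob = len_prob[(J, I)]
    for j, f in enumerate(f_sent):
        prob *= sum(align[(i, j, I, J)] * lex[(f, e_sent[i])] for i in range(I))
    return prob
```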
Training IBM2
Expectation-Maximization (EM) Algorithm.

Define the posterior weight (i.e. the sum over a column equals 1):

$$p(i \mid j; s) = \frac{p(i \mid j, I_s, J_s)\; p(f^s_j \mid e^s_i)}{\sum_{i'} p(i' \mid j, I_s, J_s)\; p(f^s_j \mid e^s_{i'})}$$

Lexicon probabilities: count how often word pairs are aligned, then turn counts into probabilities:

$$A(f, e) = \sum_{s} \sum_{j=1}^{J_s} \sum_{i=1}^{I_s} \delta(f, f^s_j)\, \delta(e, e^s_i)\; p(i \mid j; s), \qquad p(f \mid e) = \frac{A(f, e)}{\sum_{f'} A(f', e)}$$

Alignment probabilities:

$$B(i, j, I, J) = \sum_{s} \delta(I, I_s)\, \delta(J, J_s)\; p(i \mid j; s), \qquad p(i \mid j, I, J) = \frac{B(i, j, I, J)}{\sum_{i'} B(i', j, I, J)}$$
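A compact Python sketch of one IBM2 EM iteration implementing these updates (illustrative; the dict-based tables and all names are assumptions, and lex/align are expected to cover every needed key):

```python
from collections import defaultdict

def ibm2_em_iteration(corpus, lex, align):
    """One EM pass for IBM2.

    corpus: list of (f_sent, e_sent) token-list pairs;
    lex[(f, e)] = p(f|e); align[(i, j, I, J)] = p(i|j,I,J).
    """
    A = defaultdict(float)  # lexicon counts A(f, e)
    B = defaultdict(float)  # alignment counts B(i, j, I, J)
    for f_sent, e_sent in corpus:
        J, I = len(f_sent), len(e_sent)
        for j, f in enumerate(f_sent):
            # posterior weights p(i|j;s); the column sums to 1 after normalizing
            post = [align[(i, j, I, J)] * lex[(f, e_sent[i])] for i in range(I)]
            norm = sum(post)
            for i, e in enumerate(e_sent):
                A[(f, e)] += post[i] / norm
                B[(i, j, I, J)] += post[i] / norm
    # turn counts into probabilities
    lex_norm = defaultdict(float)
    for (f, e), c in A.items():
        lex_norm[e] += c
    new_lex = {(f, e): c / lex_norm[e] for (f, e), c in A.items()}
    align_norm = defaultdict(float)
    for (i, j, I, J), c in B.items():
        align_norm[(j, I, J)] += c
    new_align = {k: c / align_norm[(k[1], k[2], k[3])] for k, c in B.items()}
    return new_lex, new_align
```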
IBM1 Model
Assume a uniform probability for the position alignment:

$$p(i \mid j, I, J) = \frac{1}{I}$$

Alignment probability:

$$\Pr(f_1^J \mid e_1^I) = p(J \mid I) \prod_{j=1}^{J} \sum_{i=1}^{I} p(i \mid j, I, J)\, p(f_j \mid e_i) = p(J \mid I)\, \frac{1}{I^J} \prod_{j=1}^{J} \sum_{i=1}^{I} p(f_j \mid e_i)$$

In training: only collect counts for word pairs.
Training for IBM1 Model – Pseudo Code
# Accumulation (over corpus)
For each sentence pair
  For each source position j
    Sum = 0.0
    For each target position i
      Sum += p(f_j|e_i)
    For each target position i
      Count(f_j, e_i) += p(f_j|e_i) / Sum

# Re-estimate probabilities (over count table)
For each target word e
  Sum = 0.0
  For each source word f
    Sum += Count(f, e)
  For each source word f
    p(f|e) = Count(f, e) / Sum

# Repeat for several iterations
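As a runnable counterpart to the pseudocode, here is a minimal IBM1 trainer in Python (a sketch under my own naming assumptions, initialized with a uniform lexicon):

```python
from collections import defaultdict

def train_ibm1(corpus, iterations=5):
    """corpus: list of (f_sent, e_sent) token-list pairs. Returns p(f|e)."""
    # uniform initialization over the source vocabulary
    f_vocab = {f for f_sent, _ in corpus for f in f_sent}
    p = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)
        # accumulation over the corpus
        for f_sent, e_sent in corpus:
            for f in f_sent:
                norm = sum(p[(f, e)] for e in e_sent)
                for e in e_sent:
                    count[(f, e)] += p[(f, e)] / norm
        # re-estimate probabilities from the count table
        totals = defaultdict(float)
        for (f, e), c in count.items():
            totals[e] += c
        p = defaultdict(float, {(f, e): c / totals[e] for (f, e), c in count.items()})
    return p

# Toy usage with two sentence pairs
corpus = [(["das", "haus"], ["the", "house"]), (["das", "buch"], ["the", "book"])]
p = train_ibm1(corpus)
print(round(p[("das", "the")], 3))  # grows toward 1.0 over iterations
```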
HMM Alignment Model
Idea: relative position model
Entire word groups (phrases) are moved with respect to the source position.
[Figure: source and target sentences with aligned word groups]
HMM Alignment
First-order model: target position dependent on previous target position (captures movement of entire phrases):

$$\Pr(a_j \mid a_1^{j-1}, J, e_0^I) = p(a_j \mid a_{j-1}, I, J)$$

Alignment probability:

$$\Pr(f_1^J \mid e_1^I) = p(J \mid I) \sum_{a_1^J} \prod_{j=1}^{J} p(a_j \mid a_{j-1}, I)\, p(f_j \mid e_{a_j})$$

Maximum approximation:

$$\Pr(f_1^J \mid e_1^I) \approx p(J \mid I) \max_{a_1^J} \prod_{j=1}^{J} p(a_j \mid a_{j-1}, I)\, p(f_j \mid e_{a_j})$$
Viterbi Training on HMM Model
# Accumulation (over corpus): find the Viterbi path
For each sentence pair
  For each source position j
    For each target position i
      Pbest = 0; t = p(f_j|e_i)
      For each target position i'
        Pprev = P(j-1, i')
        a = p(i|i', I, J)
        Pnew = Pprev * t * a
        if (Pnew > Pbest)
          Pbest = Pnew
          BackPointer(j, i) = i'
      P(j, i) = Pbest

# Update counts along the Viterbi path
i = argmax_i { P(J, i) }
For each j from J downto 1
  Count(f_j, e_i)++
  iprev = BackPointer(j, i)
  Count(i, iprev, I, J)++
  i = iprev
# renormalize…
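A Python version of the Viterbi search under the max approximation (illustrative; the dict-based tables lex[(f,e)] and trans[(i,ip)] are assumptions, the initial transition p(i|0,I) is taken as uniform, and count collection is omitted):

```python
def viterbi_alignment(f_sent, e_sent, lex, trans):
    """Return the best alignment a_1..a_J as a list of target positions."""
    J, I = len(f_sent), len(e_sent)
    delta = [[0.0] * I for _ in range(J)]  # best path prob ending in (j, i)
    back = [[0] * I for _ in range(J)]     # backpointers
    for i in range(I):
        delta[0][i] = (1.0 / I) * lex[(f_sent[0], e_sent[i])]
    for j in range(1, J):
        for i in range(I):
            t = lex[(f_sent[j], e_sent[i])]
            ip_best = max(range(I), key=lambda ip: delta[j-1][ip] * trans[(i, ip)])
            delta[j][i] = delta[j-1][ip_best] * trans[(i, ip_best)] * t
            back[j][i] = ip_best
    # trace back from the best final state
    a = [max(range(I), key=lambda i: delta[J-1][i])]
    for j in range(J - 1, 0, -1):
        a.append(back[j][a[-1]])
    return list(reversed(a))
```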
HMM Forward-Backward Training
Gamma: probability to emit $f_j$ when in state i in sentence s. Sum over all paths through (j, i):

$$\gamma^s_j(i) = \sum_{a_1^J:\, a_j = i}\; \prod_{j'=1}^{J} p(a_{j'} \mid a_{j'-1}, I)\, p(f^s_{j'} \mid e_{a_{j'}})$$
HMM Forward-Backward Training
Epsilon: probability to transit from state i' into i. Sum over all paths through (j-1, i') and (j, i), emitting $f_j$:

$$\epsilon^s_j(i', i) = \sum_{a_1^J:\, a_{j-1} = i',\, a_j = i}\; \prod_{j'=1}^{J} p(a_{j'} \mid a_{j'-1}, I)\, p(f^s_{j'} \mid e_{a_{j'}})$$
Forward Probabilities
Defined as:

$$\alpha_j(i) = \sum_{a_1^j:\, a_j = i}\; \prod_{j'=1}^{j} p(a_{j'} \mid a_{j'-1}, I)\, p(f_{j'} \mid e_{a_{j'}})$$

Recursion:

$$\alpha_j(i) = \Big[ \sum_{i'=1}^{I} \alpha_{j-1}(i')\, p(i \mid i', I) \Big]\; p(f_j \mid e_i)$$

Initial condition:

$$\alpha_1(i) = p(i \mid 0, I)\, p(f_1 \mid e_i)$$
Backward Probabilities
Defined as:

$$\beta_j(i) = \sum_{a_{j+1}^J}\; \prod_{j'=j+1}^{J} p(a_{j'} \mid a_{j'-1}, I)\, p(f_{j'} \mid e_{a_{j'}}) \quad \text{with } a_j = i$$

Recursion:

$$\beta_j(i) = \sum_{i'=1}^{I} \beta_{j+1}(i')\, p(i' \mid i, I)\, p(f_{j+1} \mid e_{i'})$$

Initial condition:

$$\beta_J(i) = 1$$
Forward-Backward
Calculate Gamma and Epsilon with Alpha and Beta:
Gammas:

$$\gamma_j(i) = \frac{\alpha_j(i)\, \beta_j(i)}{\sum_{i'=1}^{I} \alpha_j(i')\, \beta_j(i')}$$

Epsilons:

$$\epsilon_j(i', i) = \frac{\alpha_{j-1}(i')\, p(i \mid i', I)\, p(f_j \mid e_i)\, \beta_j(i)}{\sum_{\tilde{i}', \tilde{i}} \alpha_{j-1}(\tilde{i}')\, p(\tilde{i} \mid \tilde{i}', I)\, p(f_j \mid e_{\tilde{i}})\, \beta_j(\tilde{i})}$$
Parameter Re-Estimation
Lexicon probabilities:

$$p(f \mid e) = \frac{\sum_{s=1}^{S} \sum_{j=1}^{J_s} \sum_{i:\, f^s_j = f,\; e^s_i = e} \gamma^s_j(i)}{\sum_{s=1}^{S} \sum_{j=1}^{J_s} \sum_{i:\, e^s_i = e} \gamma^s_j(i)}$$

Alignment probabilities:

$$p(i \mid i') = \frac{\sum_{s=1}^{S} \sum_{j=1}^{J_s} \epsilon^s_j(i', i)}{\sum_{s=1}^{S} \sum_{j=1}^{J_s} \gamma^s_j(i')}$$
Forward-Backward Training – Pseudo Code
# Accumulation
For each sentence-pair {
Forward. (Calculate Alpha’s)
Backward. (Calculate Beta’s)
Calculate Epsilon's and Gamma's.
For each source position j {
  For each target position i {
    Increase LexiconCount(f_j|e_i) by Gamma(j,i).
    For each target position i' {
      Increase AlignCount(i|i') by Epsilon(j,i,i').
    }
  }
}
}
# Update
Normalize LexiconCount to get P(f_j|e_i).
Normalize AlignCount to get P(i|i’).
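Putting the forward, backward, gamma/epsilon, and count-accumulation steps together, here is a hedged end-to-end sketch of one forward-backward pass in Python (the uniform initial transition p(i|0,I) = 1/I and the table layouts are my assumptions; numerical scaling, needed for long sentences, is omitted):

```python
from collections import defaultdict

def hmm_fb_iteration(corpus, lex, trans):
    """One forward-backward pass. lex[(f,e)] = p(f|e); trans[(i,ip)] = p(i|ip,I)."""
    lex_count = defaultdict(float)
    align_count = defaultdict(float)
    for f_sent, e_sent in corpus:
        J, I = len(f_sent), len(e_sent)
        # forward: alpha_1(i) = p(i|0,I) p(f_1|e_i), here uniform p(i|0,I) = 1/I
        alpha = [[0.0] * I for _ in range(J)]
        for i in range(I):
            alpha[0][i] = (1.0 / I) * lex[(f_sent[0], e_sent[i])]
        for j in range(1, J):
            for i in range(I):
                s = sum(alpha[j-1][ip] * trans[(i, ip)] for ip in range(I))
                alpha[j][i] = s * lex[(f_sent[j], e_sent[i])]
        # backward: beta_J(i) = 1
        beta = [[1.0] * I for _ in range(J)]
        for j in range(J - 2, -1, -1):
            for i in range(I):
                beta[j][i] = sum(beta[j+1][ip] * trans[(ip, i)]
                                 * lex[(f_sent[j+1], e_sent[ip])] for ip in range(I))
        # gammas and epsilons, accumulated into the count tables
        for j in range(J):
            norm = sum(alpha[j][i] * beta[j][i] for i in range(I))
            for i in range(I):
                gamma = alpha[j][i] * beta[j][i] / norm
                lex_count[(f_sent[j], e_sent[i])] += gamma
                if j > 0:
                    for ip in range(I):
                        eps = (alpha[j-1][ip] * trans[(i, ip)]
                               * lex[(f_sent[j], e_sent[i])] * beta[j][i]) / norm
                        align_count[(i, ip)] += eps
    # normalize these counts to obtain the new lexicon and alignment tables
    return lex_count, align_count
```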