Machine Translation: Discriminative Word Alignment
Stephan Vogel, Spring Semester 2011
Page 1: Machine Translation Discriminative Word Alignment Stephan Vogel Spring Semester 2011.

Machine Translation

Discriminative Word Alignment

Stephan Vogel, Spring Semester 2011

Page 2: Machine Translation Discriminative Word Alignment Stephan Vogel Spring Semester 2011.

Generative Alignment Models

- Generative word alignment models: P(f, a | e), with the alignment a as a hidden variable
  - The actual word alignment is not observed, so we sum over all alignments
- Well-known models: IBM 1 ... 5, HMM, ITG
  - They model lexical association, distortion, and fertility
- It is difficult to incorporate additional information:
  - POS of words (used in the distortion model, but not as direct link features)
  - Manual dictionaries
  - Syntax information
  - ...

Page 3: Machine Translation Discriminative Word Alignment Stephan Vogel Spring Semester 2011.

Discriminative Word Alignment

- Model the alignment directly: p(a | f, e)
  - Find the alignment that maximizes p(a | f, e)
- A well-suited framework: maximum entropy
  - Set of feature functions h_m(a, f, e), m = 1, ..., M
  - Set of model parameters (feature weights) c_m, m = 1, ..., M
- Decision rule: $\hat{a} = \operatorname{argmax}_{a} \sum_{m=1}^{M} c_m \, h_m(a, f, e)$
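As a toy illustration of this decision rule, the following sketch enumerates all alignments of a short sentence pair and picks the highest-scoring one; `features` and `weights` are hypothetical placeholders, and real aligners search rather than enumerate:

```python
from itertools import product

def score(a, f, e, features, weights):
    # Log-linear score: sum_m c_m * h_m(a, f, e); the normalizer Z
    # does not affect the argmax, so it can be ignored here.
    return sum(c * h(a, f, e) for h, c in zip(features, weights))

def best_alignment(f, e, features, weights):
    # Decision rule: a_hat = argmax_a sum_m c_m * h_m(a, f, e).
    # a maps each source position j to a target position i, or -1 (unaligned);
    # brute-force enumeration is only feasible for toy sentence lengths.
    candidates = product(range(-1, len(e)), repeat=len(f))
    return max(candidates, key=lambda a: score(a, f, e, features, weights))
```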

Page 4: Machine Translation Discriminative Word Alignment Stephan Vogel Spring Semester 2011.

Tasks

- Modeling: design feature functions which capture cross-lingual divergences
- Search: find the alignment with the highest probability
- Training: find the optimal feature weights
  - Minimize alignment errors given some gold-standard alignments (notice: the alignments are no longer hidden!)
  - Supervised training, i.e. we evaluate against a gold standard
- Notice: feature functions may themselves result from some training procedure
  - E.g. use a statistical dictionary resulting from an IBM-model alignment trained on a large corpus
  - Here we have an additional training step on a small (hand-aligned) corpus (similar to MERT for the decoder)

Page 5: Machine Translation Discriminative Word Alignment Stephan Vogel Spring Semester 2011.

2005 – Year of DWA

- Yang Liu, Qun Liu, and Shouxun Lin. 2005. Log-linear Models for Word Alignment.
- Abraham Ittycheriah and Salim Roukos. 2005. A Maximum Entropy Word Aligner for Arabic-English Machine Translation.
- Ben Taskar, Simon Lacoste-Julien, and Dan Klein. 2005. A Discriminative Matching Approach to Word Alignment.
- Robert C. Moore. 2005. A Discriminative Framework for Bilingual Word Alignment.
- Necip Fazil Ayan, Bonnie J. Dorr, and Christof Monz. 2005. NeurAlign: Combining Word Alignments Using Neural Networks.

Page 6: Machine Translation Discriminative Word Alignment Stephan Vogel Spring Semester 2011.

Yang Liu et al. 2005

- Start out with features used in generative alignment
- Lexicons, e.g. IBM1
  - Use both directions, p(f_j | e_i) and p(e_i | f_j), => a symmetric alignment model
  - And/or a symmetrized lexicon
- Fertility model: p(φ_i | e_i)
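For example, the two directional lexicon features might simply sum IBM1 log-probabilities over the set of links; `p_fe` and `p_ef` stand for pre-trained translation tables and are assumed inputs in this sketch:

```python
import math

def lex_feature_fe(links, f, e, p_fe):
    # h(a, f, e) = sum over links (j, i) of log p(f_j | e_i)
    return sum(math.log(p_fe.get((f[j], e[i]), 1e-10)) for j, i in links)

def lex_feature_ef(links, f, e, p_ef):
    # The same feature in the reverse direction: log p(e_i | f_j);
    # using both directions is what makes the model symmetric.
    return sum(math.log(p_ef.get((e[i], f[j]), 1e-10)) for j, i in links)
```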

Page 7: Machine Translation Discriminative Word Alignment Stephan Vogel Spring Semester 2011.

More Features

- Cross count: number of crossings in the alignment
- Neighbor count: number of links in the immediate neighborhood
- Exact match: number of src/tgt pairs where src = tgt
- Linked word count: total number of links (to influence the density)
- Link types: how many 1-1, 1-m, m-1, n-m alignments
- Sibling distance: if a word is aligned to multiple words, add the distance between these aligned words
- Link co-occurrence count: given multiple alignments (e.g. Viterbi alignments from the IBM models), count how often links co-occur
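Two of these counts made concrete, over an alignment represented as a set of (j, i) links (a sketch):

```python
def cross_count(links):
    # A pair of links crosses when their source and target orders disagree.
    links = sorted(links)
    return sum(1 for a in range(len(links)) for b in range(a + 1, len(links))
               if (links[a][0] - links[b][0]) * (links[a][1] - links[b][1]) < 0)

def neighbor_count(links):
    # Count links that have another link in their immediate
    # (8-cell) neighborhood of the alignment matrix.
    link_set = set(links)
    return sum(1 for (j, i) in link_set
               for dj in (-1, 0, 1) for di in (-1, 0, 1)
               if (dj, di) != (0, 0) and (j + dj, i + di) in link_set)
```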

Page 8: Machine Translation Discriminative Word Alignment Stephan Vogel Spring Semester 2011.

Search

- Greedy search based on the gain from adding a link
- For each of the features the gain can be calculated, e.g. for IBM1
- Algorithm:
    Start with an empty alignment
    Loop until no additional gain:
      Loop over all (j, i) not in the set:
        if gain(j, i) > best_gain then store as (j', i')
      Set link(j', i')
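A direct Python rendering of this greedy loop; `gain(j, i, links)` stands for the model's score change from adding link (j, i) and is an assumed callback:

```python
def greedy_align(J, I, gain):
    # Greedily add the link with the highest positive gain
    # until no remaining link improves the score.
    links = set()
    while True:
        best, best_gain = None, 0.0
        for j in range(J):
            for i in range(I):
                if (j, i) not in links:
                    g = gain(j, i, links)
                    if g > best_gain:
                        best, best_gain = (j, i), g
        if best is None:  # no additional gain
            return links
        links.add(best)
```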

Page 9: Machine Translation Discriminative Word Alignment Stephan Vogel Spring Semester 2011.

Moore 2005

- Log-likelihood-based model
  - Measures word association strength
  - Values can get large
- Conditional-link-probability-based model
  - Estimated probability of two words being linked
  - Uses a simpler alignment model to establish the links
  - Adds simple smoothing
- Additional features: one-to-one, one-to-many, non-monotonicity

Page 10: Machine Translation Discriminative Word Alignment Stephan Vogel Spring Semester 2011.

Training

- Finding the optimal alignment is non-trivial
  - Adding a link can affect the non-monotonicity and one-to-many features
  - Dynamic programming does not work
  - Beam search can be used; requires pruning
- Parameter optimization: modified version of averaged perceptron learning

  $\lambda_i \leftarrow \lambda_i + h_i(a_{\mathrm{ref}}, f, e) - h_i(\hat{a}, f, e)$
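One epoch of this update, sketched in Python with hypothetical helpers: `decode` returns the model's current best alignment, and the averaged variant would additionally average the weight vectors over all updates:

```python
def perceptron_epoch(data, features, weights, decode, lr=1.0):
    # Move each weight toward the features of the reference alignment
    # and away from those of the model's current best guess:
    # lambda_i += lr * (h_i(a_ref, f, e) - h_i(a_hyp, f, e))
    for f, e, a_ref in data:
        a_hyp = decode(f, e, features, weights)
        for m, h in enumerate(features):
            weights[m] += lr * (h(a_ref, f, e) - h(a_hyp, f, e))
    return weights
```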

Page 11: Machine Translation Discriminative Word Alignment Stephan Vogel Spring Semester 2011.

Modeling Alignment with CRF

- A CRF is an undirected graphical model
  - Each vertex (node) represents a random variable whose distribution is to be inferred
  - Each edge represents a dependency between two random variables
  - The distribution of each discrete random variable Y in the graph is conditioned on an input sequence X
  - Cliques: fully connected sets of nodes in the graph
- In our case:
  - Features derived from the source and target words are the input sequence X
  - Alignment links are the random variables Y
- Different ways to model alignment:
  - Blunsom & Cohn (2006): many-to-one word alignments, where each source word is aligned with zero or one target words (-> asymmetric)
  - Niehues & Vogel (2008): model not a sequence but the entire alignment matrix (-> symmetric)

Page 12: Machine Translation Discriminative Word Alignment Stephan Vogel Spring Semester 2011.

Modeling Alignment Matrix

- Random variables y_ji for all possible alignment links
  - Two values, 0/1: the word in position j is not linked / is linked to the word in position i
  - Represented as nodes in a graph

Page 13: Machine Translation Discriminative Word Alignment Stephan Vogel Spring Semester 2011.

Modeling Alignment Matrix

- Factored nodes x representing the features (observables)
  - Linked to the random variables
  - Define a potential for each y_ji

Page 14: Machine Translation Discriminative Word Alignment Stephan Vogel Spring Semester 2011.

Probability of Alignment

$$p(y \mid x) = \frac{1}{Z(x)} \prod_{c \in N} \Psi_c(V_c) = \frac{1}{Z(x)} \exp\!\Big( \sum_{c \in N} \lambda_c \cdot F_c(V_c) \Big)$$

where
- N: the set of factored nodes
- V_c: a set of connected nodes (a clique)
- F_c(V_c): a feature vector
- λ_c: a weight vector
- Ψ_c(V_c) = exp(λ_c · F_c(V_c)): the potential function
- Z(x): the normalization constant (partition function)

Page 15: Machine Translation Discriminative Word Alignment Stephan Vogel Spring Semester 2011.

Features

- Local features, e.g. lexical, POS, ...
- Fertility features
- First-order features: capture the relation between links
- Phrase features: interaction between word and phrase alignment

Page 16: Machine Translation Discriminative Word Alignment Stephan Vogel Spring Semester 2011.

Local Features

- Local information about the link probability
  - Features derived from positions j and i only
  - The factored node is connected to only one random variable
- Features:
  - Lexical probabilities, also normalized to (f, e)
  - Word identity (e.g. for numbers, names)
  - Word similarity (e.g. cognates)
  - Relative position distance
  - Link indicator feature: is (j, i) linked in the Viterbi alignment from a generative alignment model
  - POS: an indicator feature for every src/tgt POS pair
  - High-frequency word indicator: a feature for every src/tgt word pair among the most frequent words
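A sketch of a few of these local features for one candidate link (j, i); the lexicon `p_fe` and the POS tag lists are assumed inputs:

```python
def local_features(j, i, f, e, f_pos, e_pos, p_fe):
    # Local features look only at positions j and i, so the
    # corresponding factored node touches a single link variable.
    feats = {
        "lex": p_fe.get((f[j], e[i]), 0.0),        # lexical probability
        "identity": float(f[j] == e[i]),           # numbers, names
        "rel_dist": abs(j / len(f) - i / len(e)),  # relative position distance
    }
    feats["pos_%s_%s" % (f_pos[j], e_pos[i])] = 1.0  # src/tgt POS indicator
    return feats
```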

Page 17: Machine Translation Discriminative Word Alignment Stephan Vogel Spring Semester 2011.

Fertility Features

- Model word fertility on the src and tgt side
  - Link to all nodes in a row/column
  - Constraint: model fertility only up to a maximum fertility N
- Indicator features:
  - One for each fertility n <= N
  - One for all fertilities n > N
- Alternative: use the fertility probabilities from IBM4 training
  - These differ from word to word

Page 18: Machine Translation Discriminative Word Alignment Stephan Vogel Spring Semester 2011.

First-Order Features

- Links depend on the links of neighboring words
- Each such factored node always connects 2 link variables
- Different features for different directions: (1,1), (1,2), (2,1), (1,0), ...
- Captures distortions, similar to the HMM and IBM4 alignments
- Indicator features that fire if both links are set
- Also a POS first-order feature: an indicator for link(j, i) and (POS_j, POS_i) and link(j+k, i+l)

Page 19: Machine Translation Discriminative Word Alignment Stephan Vogel Spring Semester 2011.

Inference – Finding the Best Alignment

- Word alignment corresponds to an assignment of the random variables
- => Find the most probable variable assignment
- Problem:
  - Complex model structure with many loops
  - No exact inference possible
- Solution:
  - Belief propagation algorithm
  - Inference by message passing
- Runtime is exponential in the number of connected nodes

Page 20: Machine Translation Discriminative Word Alignment Stephan Vogel Spring Semester 2011.

Belief Propagation

- Messages are sent from the random variable nodes to the factored nodes, and also in the opposite direction
- Start with some initial values, e.g. uniform
- In each iteration:
  - Calculate the messages from each hidden node (j, i) and send them to the factored nodes c
  - Calculate the messages from each factored node c and send them to the hidden nodes (j, i)
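A generic sum-product iteration over such a bipartite factor graph might look like this sketch (toy data structures, messages initialized uniformly; not the aligner's actual implementation):

```python
import itertools

def bp_iteration(factors, var_to_factors, messages):
    # factors: factor id -> (variable list, potential over 0/1 assignments)
    # messages: (sender, receiver) -> 2-vector over link values {0, 1}
    new = {}
    # Variable -> factor: product of incoming messages from the other factors.
    for v, fs in var_to_factors.items():
        for c in fs:
            m = [1.0, 1.0]
            for other in fs:
                if other != c:
                    m = [m[y] * messages[(other, v)][y] for y in (0, 1)]
            s = sum(m)
            new[(v, c)] = [x / s for x in m]
    # Factor -> variable: marginalize the potential over the clique.
    for c, (vs, pot) in factors.items():
        for v in vs:
            m = [0.0, 0.0]
            for assign in itertools.product((0, 1), repeat=len(vs)):
                p = pot(assign)
                for u, y in zip(vs, assign):
                    if u != v:
                        p *= new[(u, c)][y]
                m[assign[vs.index(v)]] += p
            s = sum(m)
            new[(c, v)] = [x / s for x in m]
    messages.update(new)
    return messages
```

After a fixed number of rounds (or convergence), the beliefs used on the next slide are the normalized products of all factor-to-variable messages arriving at each hidden node.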

Page 21: Machine Translation Discriminative Word Alignment Stephan Vogel Spring Semester 2011.

Getting the Probability

- After several iterations, a belief value is calculated from the messages sent to the hidden nodes
- The belief value can be interpreted as a posterior probability

Page 22: Machine Translation Discriminative Word Alignment Stephan Vogel Spring Semester 2011.

Training

- Maximize the log-likelihood of the correct alignment
  - Use gradient descent to find the optimum
- Or train towards minimum alignment error
  - Needs a smoothed version of the AER
  - Express the AER in terms of link indicator functions
  - Use a sigmoid of the link probability
- A 2-step approach can be used:
  1. Optimize towards ML
  2. Optimize towards AER
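Concretely, AER compares the hypothesis links A against sure (S) and possible (P) gold links, AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|); for gradient training, the hard membership of each link in A is replaced by a sigmoid of its score. A sketch, where `scores` maps candidate links to model scores:

```python
import math

def aer(A, S, P):
    # Alignment Error Rate over hypothesis links A,
    # sure gold links S, and possible gold links P.
    return 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))

def smooth_aer(scores, S, P):
    # Differentiable surrogate: each link's membership in A becomes
    # sigmoid(score), so all the set sizes turn into soft counts.
    sig = {l: 1.0 / (1.0 + math.exp(-s)) for l, s in scores.items()}
    soft_A = sum(sig.values())
    soft_AS = sum(p for l, p in sig.items() if l in S)
    soft_AP = sum(p for l, p in sig.items() if l in P)
    return 1.0 - (soft_AS + soft_AP) / (soft_A + len(S))
```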

Page 23: Machine Translation Discriminative Word Alignment Stephan Vogel Spring Semester 2011.

Some Results: Spanish-English

- Features:
  - IBM1 and IBM4 lexicons
  - Fertilities
  - Link indicator feature
  - POS features
  - Phrase features
- Impact on translation quality (BLEU scores):

  System    Dev    Eval
  Baseline  40.04  47.73
  DWA       41.62  48.13

Page 24: Machine Translation Discriminative Word Alignment Stephan Vogel Spring Semester 2011.

Summary

- In the last 5 years, new efforts in word alignment: discriminative word alignment
  - Integrates many features
  - Needs a small amount of hand-aligned data to tune (train) the feature weights
- Different variants:
  - Log-linear modeling
  - Conditional random fields: sequence models and alignment-matrix models
- Significant improvements in word alignment error rate
  - Not always improvements in translation quality
- Different alignment density -> different phrase table size
  - Need to adjust the phrase extraction algorithms?

