
Neurocomputing 71 (2008) 755–765


On the application of different evolutionary algorithms to the alignment problem in statistical machine translation

Luis Rodríguez, Ismael García-Varea, José A. Gámez

Departamento de Sistemas Informáticos (SIMD), Universidad de Castilla-La Mancha, Campus Universitario s/n, Albacete 02071, Spain

Available online 19 November 2007

Abstract

In statistical machine translation, an alignment defines a mapping between the words in the source and in the target sentence. Alignments are used, on the one hand, to train the statistical models and, on the other, during the decoding process to link the words in the source sentence to the words in the partial hypotheses generated. In both cases, the quality of the alignments is crucial for the success of the translation process. In this paper, we propose several evolutionary algorithms for computing alignments between two sentences in a parallel corpus. These algorithms have been tested on different tasks involving different pairs of languages. Specifically, in the two shared tasks proposed at HLT-NAACL 2003 and ACL 2005, the EDA-based algorithm outperforms the best participant systems. In addition, the experiments show that, because of the limitations of the well-known statistical alignment models, further improvements in alignment quality cannot be achieved by using improved search algorithms alone.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Evolutionary algorithms; Estimation of distribution algorithms; Statistical machine translation; Statistical alignments

1. Introduction

The statistical approach to machine translation constitutes one of the most promising approaches in this field nowadays. The rationale behind this approach is to learn, from a parallel corpus, a statistical model which can be used for translating a sentence from one language to another. A parallel corpus can be defined as a set of sentence pairs, each pair containing a sentence in a source language and a translation of this sentence into a target language. In this context, word alignments are necessary to link the words in the source and in the target sentence, in order to establish a structural relation between the words in a pair of sentences. Owing to the fact that obtaining quality alignments between these sentences is not trivial, and due to the heavy dependence of the statistical models for machine translation on the concept of alignment, several shared tasks on alignments in statistical machine translation have been proposed in the last few years (HLT-NAACL 2003 [16] and ACL 2005 [13]).


As described in [21], word alignments have been used in many natural language applications, ranging from machine translation to word sense disambiguation, and it has been shown that these applications crucially depend on the quality of the word alignment [19,29]. For example, word alignments are the basis of single-word-based statistical machine translation systems [2,27,26,6,18,22,7], and are also the starting point of phrase-based statistical [25,12,30,10,23] or example-based translation systems [4]. In [24,14] word alignments are used to automatically extract bilingual lexica and terminology from corpora, and in [5] word alignments are applied to word sense disambiguation. Also, in [28] word alignments are used in morphological analysis and part-of-speech tagging.

In this work, we address the problem of searching for the best word alignment between a source and a target sentence according to the optimal parameters of a statistical alignment model. That is, we are provided with a source and a target sentence, and we have to link each word in the source sentence to a word (or several words) in the target sentence. As there is no efficient exact method to compute the optimal alignment (known as the Viterbi alignment) in most cases (specifically for the most widely used IBM Models 3, 4 and 5 proposed in [3]), and since, to the best of our knowledge, only a greedy method to deal with this problem has been proposed so far [3], we discuss in this paper the use of different evolutionary algorithms: on the one hand, estimation of distribution algorithms (EDAs), a recently introduced family of metaheuristics; on the other, the well-known genetic algorithms (GAs). Clearly, by using a heuristic-based method we cannot guarantee that the optimal alignment is reached. Nonetheless, we expect the global search carried out by our algorithms to produce high-quality results in most cases, as previous experiments with this technique in different optimization tasks have shown [11]. In addition, the results presented in Section 6 support the approach presented here and show that it may not be possible to significantly improve these results by considering this problem from the search point of view only.

The rest of the paper is organized as follows: in Section 2, a review of the theory of statistical alignment models is presented. In Section 3, the specific problem addressed in this paper is described. Next, Section 4 is devoted to a description of the evolutionary algorithms used in the experiments, and in Section 5 we discuss the adaptation of an EDA algorithm to deal with the alignment problem. Next, in Section 6, experimental results are presented and, finally, in Section 7, the main conclusions are stated.

2. Word alignments and statistical machine translation: a review

In this section, we introduce the statistical framework for machine translation and the concept of statistical word alignment. We also review the theory of statistical alignment models, in order to describe, and better understand, the subject matter of this paper.

f: Please , I would like to book a room .
e: NULL Por favor , desearía reservar una habitación .
a: 2 3 0 4 4 5 5 6 7 8

Fig. 1. Word alignment example and its representation as a vector. Each alignment can be represented as a vector as shown in the figure: the ith position in the vector corresponds to the ith word in the source sentence, and the value stored in that position indicates the word in the target sentence aligned to this source word.

2.1. Statistical machine translation

The translation process in statistical machine translation can be formulated as follows: a source language string $f = f_1^J = f_1 \ldots f_J$ is to be translated into a target language string $e = e_1^I = e_1 \ldots e_I$. Every target string is regarded as a possible translation for the source language string with a posteriori probability $\Pr(e \mid f)$. Therefore, the problem of translating a source language string can be formulated as the search for the most probable target string:

$$\hat{e} = \arg\max_{e} \Pr(e \mid f). \quad (1)$$

A better approximation is typically obtained by applying Bayes' theorem to Eq. (1):

$$\hat{e} = \arg\max_{e} \{\Pr(f \mid e) \cdot \Pr(e)\}, \quad (2)$$

where $\Pr(f)$ does not depend on $e$, and therefore the problem is reframed as choosing the target string $e$ that maximizes the product of the target language model $\Pr(e)$ and the inverse string translation model $\Pr(f \mid e)$. Eq. (2) is known as the fundamental equation of statistical machine translation [3].

2.2. Statistical alignment models

Given a source language string $f = f_1^J = f_1 \ldots f_J$ and a target language string $e = e_1^I = e_1 \ldots e_I$, an alignment $A$ between $f_1^J$ and $e_1^I$ is defined as a subset of the Cartesian product of the word positions, that is, $A \subseteq \{(j,i) : j \in \{1,\ldots,J\},\ i \in \{1,\ldots,I\}\}$, which allows many-to-many relationships between words. This is a general representation for modelling a non-restricted alignment.

However, statistical word-based alignment models typically impose additional constraints on the alignment representation, as for example the widely used IBM models [3] or the HMM alignment model [19]. In this case, the alignment representation establishes a correspondence between the words in the source and the target string that assigns one target word position to each source word position. These alignment models are similar to the concept of hidden Markov models (HMMs) in speech recognition. The alignment mapping is therefore $j \to i = a_j$, from source position $j$ to target position $i = a_j$. The alignment $a = a_1^J = a_1 \ldots a_J$ may contain alignments $a_j = 0$ with the "NULL" word $e_0$, to account for source words that are not aligned to any target word. In other words, every $a_j$ takes a value in the set $\{0, 1, \ldots, I\}$. An example of an alignment is shown in Fig. 1.

As mentioned before, the IBM models [3] are the most widely used models to estimate $\Pr(f \mid e)$. When considering alignments, the actual probability in the translation model becomes $\Pr(f, a \mid e)$, where the alignment $a$ is introduced into the model as a hidden variable, leading to

$$\Pr(f \mid e) = \sum_{a \in \mathcal{A}} \Pr(f, a \mid e), \quad (3)$$

where $\mathcal{A}$ is the set of all possible alignments between $f$ and $e$. Without loss of generality, $\Pr(f, a \mid e)$ can be written as

$$\Pr(f, a \mid e) = \Pr(J \mid e_1^I) \prod_{j=1}^{J} \Pr(f_j, a_j \mid f_1^{j-1}, a_1^{j-1}, e_1^I) = \Pr(J \mid e_1^I) \prod_{j=1}^{J} \Pr(a_j \mid f_1^{j-1}, a_1^{j-1}, e_1^I) \cdot \Pr(f_j \mid f_1^{j-1}, a_1^{j}, e_1^I). \quad (4)$$

Eq. (4) can be intuitively described from a generative point of view. That is, in order to generate a source sentence $f$ with an alignment $a$ from a target sentence $e$, we can proceed as follows:

(1) First, choose the length $J$ of $f$, given our knowledge of $e$.

(2) Then, choose the position in $e$ to which the first word of $f$ is aligned ($a_1$), given our knowledge of the string $e$ and the length $J$ of $f$.

(3) Then, choose the identity of the first word of $f$ ($f_1$), given our knowledge of $e$, the length of $f$, and the position $a_1$ (in $e$) to which $f_1$ is connected.

We can iterate steps (2) and (3) for $j = 2 \ldots J$ to generate the rest of the words of $f$ and the corresponding alignment $a$.

In [3], a family of five different statistical alignment models is presented. In all of these models, the probability $\Pr(f_j \mid f_1^{j-1}, a_1^{j}, e_1^I)$ in Eq. (4) is approximated by a lexicon model $p(f_j \mid e_{a_j})$ by dropping the dependencies on $f_1^{j-1}$, $a_1^{j-1}$, $e_1^{a_j - 1}$, and $e_{a_j + 1}^{I}$. The way of modelling the probability $\Pr(a_j \mid f_1^{j-1}, a_1^{j-1}, e_1^I)$ is what makes the difference among them, ranging from a uniform distribution for Model 1 to a very complex distribution for Model 5, always in increasing complexity.

Models 3–5 are also called fertility models, because they explicitly model the fertility, defined as the number of words in $f$ that each word $e$ can generate in the generative process. In other words, the fertility of a given word $e$ is the number of words (in $f$) that are aligned to it.

In Models 1 and 2, the set of all possible alignments between a given pair of sentences can be computed at a low computational cost. Consequently, the approach shown in Eq. (3) can be followed strictly. In Models 3–5, there is no algorithm to perform this computation efficiently, so approximate solutions are used instead; in [3], a local-search-based algorithm is described for the computation of the best possible alignment. In spite of this fact, it has been shown [21] that the quality of the alignments achieved by fertility models is higher.

Moreover, as shown in [21], a good tradeoff between efficiency and alignment quality is obtained when IBM Model 4 is used. Because of that, in this work we use IBM Model 4 as the statistical alignment model. We postpone a more detailed description of this model to Section 5.2, where the fitness function for the different proposed algorithms is described.

3. Definition of the problem

Once the concept of alignment has been introduced, as well as its relationship with the statistical translation models, we can describe the problem addressed in the present article in a more formal way. Given a source sentence $f$ and a target sentence $e$, the problem of searching for the best alignment between both sentences can be defined as

$$\hat{a} = \arg\max_{a \in \mathcal{A}} \Pr(a \mid f, e) = \arg\max_{a \in \mathcal{A}} \Pr(f, a \mid e), \quad (5)$$

where $\mathcal{A}$ is the set of all possible alignments between $f$ and $e$. This equality allows us to deal with the alignment problem by using the IBM word models discussed previously.

Clearly, "real" alignments in most cases do not follow the constraints imposed by the IBM models. Hence, the alignments obtained from these models have to be somehow extended to achieve more natural alignments. This is usually carried out by computing them in both directions (i.e. first from $f$ to $e$ to obtain one-to-many alignments, and then from $e$ to $f$ to obtain many-to-one alignments) and then combining them in a suitable way (a process known as symmetrization) to finally allow for more natural many-to-many alignments.

As we will show in Section 6, we evaluate the quality of an alignment by comparing it with a manual alignment performed by human experts. At this point, it is important to stress that the set of manual alignments used in the evaluation was produced by different experts. To give an idea of the difficulty of the work addressed here, the different experts proposed different possible alignments for the same pair of sentences. This "ambiguity" in the experts' criteria is summarized within a manual alignment by means of Sure and Possible alignments between words. In that sense, an alignment was labeled as sure when all the experts agreed; otherwise, the alignment was labeled as possible.

4. Evolutionary algorithms: estimation of distribution algorithms

Evolutionary algorithms are non-deterministic heuristic strategies that attack the search problem in a global way by using a population of individuals (i.e. points of the search space), each one representing a potential solution to the problem being solved. During the search process, environmental pressure causes natural selection (survival of the fittest), which forces the population to evolve toward more promising areas of the search space. Without any doubt, the most representative family of evolutionary algorithms is that of GAs [9,8].

The degree of adaptation of a given individual is measured by means of a fitness or evaluation function. Thus, if we use $X_1, X_2, \ldots$ to represent random variables, $\mathrm{dom}(X_i) = \{x_i^1, \ldots, x_i^k\}$ as the set of values that variable $X_i$ can take, $\mathbf{X} = (X_1, \ldots, X_n)$ to represent an $n$-dimensional random variable, and $\mathbf{x} = (x_1, \ldots, x_n)$ to represent one of the possible instantiations of $\mathbf{X}$, then our goal is to obtain

$$\mathbf{x}^{*} = \arg\max_{\mathbf{x}_i \in \Omega_{\mathbf{X}}} f(\mathbf{x}_i),$$

where $f(\cdot)$ is the fitness function,¹ $\mathrm{dom}(\mathbf{X}) = \mathrm{dom}(X_1) \times \cdots \times \mathrm{dom}(X_n)$ is the search space, and each $\mathbf{x}_i = (x_i^1, \ldots, x_i^n)$ is a point of the search space. In this work we assume (without loss of generality) that individuals coincide with solutions to the problem, that is to say, no additional codification is needed.

Fig. 2. A canonical GA.

Fig. 2 shows the operation mode of a canonical GA (see [15] for a description of GAs). First, random solutions are generated to form the initial population; these individuals are then evaluated using $f(\cdot)$ to determine their degree of adaptation. At each iteration, some kind of fitness-based selection is carried out in order to seed the next generation with good candidates. Then, the selected individuals are arranged in pairs and recombined with probability $p_c$ using a crossover operator that (usually) creates two offspring by interchanging genetic material from the parents. Next, with probability $p_m$, each component $X_i = x_i$ (gene) of the resulting individuals is mutated, i.e., its value is replaced by one chosen at random from $\mathrm{dom}(X_i) \setminus \{x_i\}$. Finally, the new individuals are evaluated and the new population is selected. As we can see, the variation operators (crossover and mutation) introduce diversity and facilitate novelty, while the selection operator pushes quality. A minimal sketch of one such generation is shown below.
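As an illustration (a sketch of our own, not the authors' implementation; the function and parameter names are ours), one generation of a canonical GA over the integer-vector individuals used later in this paper could look as follows:

```python
import random

def ga_generation(population, fitness, domain_size, pc=0.9, pm=0.01):
    """One generation of a canonical GA over integer-string individuals.
    Assumes fitness returns non-negative scores (e.g. shifted log-probs)."""
    scores = [fitness(ind) for ind in population]
    # Fitness-proportional selection.
    selected = random.choices(population, weights=scores, k=len(population))
    next_pop = []
    # Arrange the selected individuals in pairs; recombine with probability pc.
    for p1, p2 in zip(selected[::2], selected[1::2]):
        if random.random() < pc:
            cut = random.randint(1, len(p1) - 1)  # one-point crossover
            c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
        else:
            c1, c2 = p1[:], p2[:]
        next_pop += [c1, c2]
    # Mutate each gene with probability pm, drawing a different value.
    for ind in next_pop:
        for i, v in enumerate(ind):
            if random.random() < pm:
                ind[i] = random.choice([w for w in range(domain_size) if w != v])
    return next_pop
```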

The large number of parameters to be selected in GAs (crossover and mutation operators, crossover and mutation probabilities, type of selection, size of the population, etc.), together with the poor behavior of GAs in some problems (deceptive problems) in which the crossover and mutation operators do not guarantee the preservation of building blocks during evolution, have led to the development of other types of evolutionary algorithms. One of the most outstanding alternatives is estimation of distribution algorithms, which have a theoretical foundation in probability theory.

¹In this work we use $f(a) = \Pr(f, a \mid e)$ as the fitness function (see Eq. (11)).

4.1. Estimation of distribution algorithms

EDAs [11] are population-based evolutionary algorithms in which the genetic operators (crossover and mutation) have been replaced by an alternative variation operator. This operator generates new individuals by sampling from a probability distribution which is learnt from a database containing selected individuals from the previous generation. In contrast to what happens in other evolutionary algorithms, in EDAs the interrelations among the different variables representing the individuals can be explicitly expressed by means of the joint probability distribution associated with the database of individuals selected at each iteration. The operation mode of a canonical EDA is shown in Fig. 3.

As we can see, the differences with respect to GAs lie in steps 4(a), 4(b) and 4(c), that is, in the variation operator. The basic idea is to summarize the properties of the fittest individuals in the population by means of the joint probability distribution (JPD) defined over the random variables that represent the components (genes) of the individuals, $\Pr(X_1, \ldots, X_n)$, and then to obtain new individuals by sampling from the JPD.

The main problem in the previous description is how to estimate the JPD, because the computation of all the parameters needed to specify the JPD is, in general, intractable. This fact motivates working with approximations of the JPD, assuming that it factorizes according to a probabilistic graphical model $M$. The most general case is to assume that the JPD factorizes according to an $n$-variate probabilistic graphical model, e.g. a Bayesian network [11, Chapter 3]:

$$\Pr(X_1, \ldots, X_n) = \prod_{i=1}^{n} \Pr(X_i \mid \mathrm{pa}(X_i)), \quad (6)$$

$\mathrm{pa}(X_i)$ being the set of parents of node (variable) $X_i$ in the Bayesian network.

However, more restricted probabilistic graphical models have been shown to perform quite well in practice, while being simpler to estimate/learn. In this paper, as this is the first approach to the alignment problem with EDAs and because of some issues that will be discussed later, we use the simplest EDA approach, univariate models, and specifically two of its most representative algorithms: the univariate marginal distribution algorithm (UMDA) and population-based incremental learning (PBIL).

Fig. 3. A canonical EDA.
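A minimal sketch of this operation mode (our own illustration; the helpers learn_model and sample_model are placeholders for steps 4(b) and 4(c) of Fig. 3):

```python
import random

def canonical_eda(fitness, learn_model, sample_model, n_vars, domain_size,
                  pop_size=100, generations=50, selection_rate=0.5):
    """Canonical EDA: crossover and mutation are replaced by selecting the
    fittest individuals, learning a probabilistic model from them, and
    sampling the next population from that model."""
    population = [[random.randrange(domain_size) for _ in range(n_vars)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        selected = population[:int(selection_rate * pop_size)]      # step 4(a)
        model = learn_model(selected, n_vars, domain_size)          # step 4(b)
        population = [sample_model(model) for _ in range(pop_size)] # step 4(c)
        best = max(population + [best], key=fitness)                # elitism
    return best
```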

4.2. UMDA

In UMDA [17], it is assumed that all the variables are marginally independent; thus the $n$-dimensional probability distribution is factorized as the product of $n$ marginal/one-dimensional distributions:

$$\Pr(X_1, \ldots, X_n) = \prod_{i=1}^{n} \Pr(X_i). \quad (7)$$

Among the advantages of UMDA we can cite the following: no structural learning is needed; parameter learning is fast; small data sets can be used, because only marginal probabilities have to be estimated; and the sampling process is easy, because each variable is sampled independently. Of course, no explicit modelling of dependencies is possible with UMDA. The two UMDA-specific operations are sketched below.
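Under the factorization of Eq. (7), learning reduces to counting values per position, and sampling draws each variable independently (a sketch of our own, compatible with the canonical EDA sketch shown earlier):

```python
import random

def umda_learn(selected, n_vars, domain_size):
    """Estimate one marginal per variable (Eq. (7)) from the selected
    individuals, with a one-count floor for unseen values."""
    counts = [[1.0] * domain_size for _ in range(n_vars)]
    for ind in selected:
        for i, v in enumerate(ind):
            counts[i][v] += 1
    return counts  # unnormalized marginals; random.choices normalizes weights

def umda_sample(model):
    """Sample each variable independently from its marginal."""
    return [random.choices(range(len(m)), weights=m)[0] for m in model]
```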

4.3. PBIL

In PBIL [1], marginal independence among all the variables is also assumed, but the algorithm is quite different from UMDA. In PBIL, instead of learning a new probabilistic model $M$ at each iteration, a single model is refined during the search process by using a Hebbian-like rule. Concretely, each component $\Pr(X_i)$ of the probabilistic model $M$ is updated as

$$\forall x_i \in \mathrm{dom}(X_i): \quad \Pr(x_i) = \Pr(x_i) \cdot (1.0 - lr) + b_i \cdot lr, \quad (8)$$

where $lr$ is the learning rate and $b_i$ is 1.0 if $X_i = x_i$ in the best individual of the current population and 0.0 otherwise. That is, at each iteration the probability vector that represents the probabilistic model is moved towards the best solution in the current population.

PBIL also includes an extra feature that differentiates it from canonical EDAs: at each step, the probability vector is mutated (components can be slightly shifted) according to a mutation probability:

$$\forall x_i \in \mathrm{dom}(X_i): \quad \Pr(x_i) = \Pr(x_i) \cdot (1.0 - mut\_shift) + \mathrm{random}(0\ \mathrm{or}\ 1) \cdot mut\_shift. \quad (9)$$
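A sketch of this update (our own illustration; exactly how the mutation of Eq. (9) is scheduled and renormalized is our assumption, since the paper does not spell it out):

```python
import random

def pbil_update(probs, best_individual, lr=0.3, mut_prob=0.01, mut_shift=0.05):
    """Move the probability vectors towards the best individual (Eq. (8)),
    then randomly shift some components (Eq. (9)). probs[i][v] = Pr(X_i = v)."""
    for i, p_i in enumerate(probs):
        for v in range(len(p_i)):
            b = 1.0 if best_individual[i] == v else 0.0
            p_i[v] = p_i[v] * (1.0 - lr) + b * lr        # Hebbian-like rule
        if random.random() < mut_prob:                    # mutate this vector
            for v in range(len(p_i)):
                p_i[v] = p_i[v] * (1.0 - mut_shift) + random.randint(0, 1) * mut_shift
            total = sum(p_i)                              # renormalize (our choice)
            for v in range(len(p_i)):
                p_i[v] /= total
    return probs
```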

5. Design of an EDA to search for alignments

In this section, an EDA algorithm to align a source and a target sentence is described.

5.1. Representation

One of the most important issues in the definition of a search algorithm is to properly represent the space of solutions to the problem. In the problem considered here, we are searching for an "optimal" alignment between a source sentence $f$ and a target sentence $e$. Therefore, the space of solutions can be stated as the set of possible alignments between both sentences (that is, all the possible sets of links between the words in $f$ and the words in $e$). Owing to the constraints imposed by the IBM models, the most natural way to represent a solution to this problem consists in storing each possible alignment in a vector $a = a_1 \ldots a_J$, where $J$ is the length of $f$. Each position of this vector can take the value "0", to represent a NULL alignment (that is, a word in the source sentence that is aligned to no word in the target sentence), or an index representing any position in the target sentence (as can be seen in Fig. 1).

Although this is not the most usual representation when solving problems with EDAs (strings of binary digits or real numbers are usually considered instead), it has been adopted because it constitutes an explicit representation, so that an individual corresponds exactly to a point in the search space. On the other hand, in Section 5.4 a new technique, suitable for this representation, will be proposed in order to introduce a pseudo-local search into the search process.
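For instance, a random individual under this representation can be generated as follows (a small sketch of our own; the Fig. 1 alignment corresponds to the vector [2, 3, 0, 4, 4, 5, 5, 6, 7, 8]):

```python
import random

def random_alignment(J, I):
    """An individual: a vector a_1..a_J with a_j in {0, 1, ..., I},
    where 0 denotes alignment to the NULL word."""
    return [random.randint(0, I) for _ in range(J)]

# Example for the sentence pair of Fig. 1 (J = 10 source words, I = 8 target words).
individual = random_alignment(10, 8)
```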

5.2. Evaluation function

During the search, each individual (search hypothesis) is scored using the fitness function described as follows. Let $a = a_1 \ldots a_J$ be the alignment represented by an individual. This alignment is evaluated by computing the logarithm of the probability $\Pr(f, a \mid e)$. For fertility alignment models, this probability is defined as

$$\Pr(f, a \mid e) = \sum_{(\tau, \pi) \in \langle f, a \rangle} \Pr(\tau, \pi \mid e). \quad (10)$$

In that case, the generative process described in Section 2 can be interpreted as follows: given a string $e$, we first decide the fertility $\phi_i$ of each word $e_i$ and a list of words (which may be empty) to connect to it. The $I$ different lists constitute the random variable $\tau$. After choosing $\tau$, its words are permuted to produce $f$; this permutation is modelled by the random variable $\pi$. Notice that $\tau$ is composed of $I$ lists ($\tau = \tau_1^I$), and each $\tau_i$ is composed of $k = \phi_i$ words of $f$ ($\tau_i = \tau_{i1}^{k}$). The variable $\pi$ is structured equivalently, but denotes positions instead of words. Accordingly, knowing $\tau$ and $\pi$ determines the string $f$ and the associated alignment $a$; but, in general, different pairs $(\tau, \pi)$ may lead to the same pair $(f, a)$. The set of such pairs is denoted by $\langle f, a \rangle$.

Once the general equation for fertility models has been described, we present the specific form of Eq. (10) for IBM Model 4, which defines the fitness function of our algorithms. The formulation of IBM Model 4 is [3]

$$\begin{aligned} \Pr(\tau, \pi \mid e) = {} & \prod_{i=1}^{I} n(\phi_i \mid e_i) \times \prod_{i=1}^{I} \prod_{k=1}^{\phi_i} t(\tau_{ik} \mid e_i) \\ & \times \prod_{i=1,\, \phi_i > 0}^{I} d_{=1}(\pi_{i1} - c_{\rho_i} \mid E_c(e_{\rho_i}), F_c(\tau_{i1})) \\ & \times \prod_{i=1}^{I} \prod_{k=2}^{\phi_i} d_{>1}(\pi_{ik} - \pi_{i(k-1)} \mid F_c(\tau_{ik})) \\ & \times \binom{J - \phi_0}{\phi_0}\, p_0^{J - 2\phi_0}\, p_1^{\phi_0} \times \prod_{k=1}^{\phi_0} t(\tau_{0k} \mid e_0), \end{aligned} \quad (11)$$

where the factors separated by $\times$ symbols denote the fertility ($n(\cdot \mid e_i)$), translation ($t(\cdot \mid e_i)$), head permutation ($d_{=1}(\cdot \mid \cdot)$), non-head permutation ($d_{>1}(\cdot \mid \cdot)$), null-fertility, and null-translation ($t(\cdot \mid e_0)$) probabilities. To be more precise, the symbols in Eq. (11) are: $J$ (the length of $f$), $I$ (the length of $e$), $e_i$ (the $i$th word in $e_1^I$), $e_0$ (the NULL word), $\phi_i$ (the fertility of $e_i$), $\tau_{ik}$ (the $k$th word produced by $e_i$ in $f$), $\pi_{ik}$ (the position of $\tau_{ik}$ in $f$), $\rho_i$ (the position of the first fertile word to the left of $e_i$), $c_{\rho_i}$ (the ceiling of the average of all $\pi_{\rho_i k}$ for $\rho_i$, or 0 if $\rho_i$ is undefined), and $F_c(f)$ and $E_c(e)$ (the corresponding word classes of $f$ and $e$).

This model was trained using the GIZA++ toolkit [21] on the material available for the different alignment tasks described in Section 6.1.
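As an illustration of how an individual is scored, the following sketch (our own; the dictionaries n_table and t_table stand in for the GIZA++-trained fertility and lexicon parameters, and the distortion factors of Eq. (11) are deliberately omitted for brevity) computes a simplified log-probability fitness:

```python
import math
from collections import Counter

def log_fitness(a, f, e, n_table, t_table, floor=1e-10):
    """Simplified fertility-model score for an alignment vector a, on the
    log scale. a[j] = i links source word f[j] to target word e[i]; e[0]
    is the NULL word. Only the fertility and lexicon factors of Eq. (11)
    are included; the real IBM Model 4 score also multiplies in the
    distortion and NULL-fertility terms."""
    fert = Counter(a)                        # fertility of each target position
    score = 0.0
    for i in range(1, len(e)):               # fertility factor: n(phi_i | e_i)
        score += math.log(n_table.get((fert.get(i, 0), e[i]), floor))
    for j, i in enumerate(a):                # lexicon factor: t(f_j | e_{a_j})
        score += math.log(t_table.get((f[j], e[i]), floor))
    return score
```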

For the sake of clarity, Fig. 4 presents a different alignment for the same pair of sentences shown in Fig. 1:

f: Please , I would like to book a room .
e: NULL Por favor , desearía reservar una habitación .
a: 2 3 4 4 4 5 5 6 7 8

Fig. 4. In this word alignment, the link for the English word "I" has been changed with respect to the one shown in Fig. 1.

According to the new alignment, where the word "I" is now aligned to "desearía" (in the previous example it was aligned to NULL), the value of $\Pr(f, a \mid e)$ has changed in the following terms:

- Fertility model: the fertility of $e_3$ = "desearía" is now 3 instead of 2, and there are no words aligned to NULL.

- Statistical dictionary: the probability of translating "desearía" into NULL has been replaced by the probability of translating "desearía" into "I".

- Distortion model: we have to add to the model the probability of aligning the third word in the source sentence to the fourth word in the target sentence.

5.3. Search

In this section, some specific details about the search are given. As mentioned in Section 4, the algorithm starts by generating an initial set of hypotheses (the initial population). In this case, a set of randomly generated alignments between the source and the target sentences is produced (actually, some knowledge about the problem is injected into the initial population by including an individual representing the IBM Model 2 optimal alignment, so the population is not completely randomly generated). Afterwards, all the individuals in this population are scored using the function defined in Eq. (10). At this point, the actual search starts by applying the scheme shown in Section 4, thereby leading to a gradual improvement of the hypotheses handled by the algorithm at each step of the search.

This process ends when some termination criterion (or criteria) is met. In the problem addressed in this paper, two considerations have to be taken into account regarding this criterion. The first is related to time constraints: statistical models are usually trained from thousands or even millions of sentences, so the algorithm for aligning sentences has to be computationally efficient. On the other hand, several evolutionary algorithms are compared as part of the experimentation, and the termination criterion should permit a fair comparison among the different techniques. Taking these premises into account, the different algorithms implemented end when they reach a certain number of calls to the fitness function. This value was empirically estimated as a tradeoff between the time restriction discussed above and the quality of the final solutions achieved. Specifically, the number of evaluations permitted was set to 3000 times the length of the source sentence.

Regarding the EDA model, as commented before, our approach relies on the UMDA and PBIL models, mainly due to the size of the search space defined by the task. The algorithm has to deal with individuals of length $J$, where each position can take $(I+1)$ possible values. Thus, in the case of UMDA, the number of free parameters to be learnt for each position is $I$ (e.g. in the English–French task, avg($J$) = 15 and avg($I$) = 17.3). If more complex models were considered, the size of the probability tables would grow exponentially. For example, in a bivariate model each variable (position) is conditioned on another variable, and thus the probability tables $\Pr(\cdot \mid \cdot)$ to be learnt have $I(I+1)$ free parameters. In order to properly estimate the probability distributions, the size of the populations would have to be increased considerably. As a result, the computational resources required by the algorithm rise dramatically. This was empirically confirmed by implementing and testing more complex EDA schemes (bivariate EDAs): the algorithm spent an excessive amount of time to obtain solutions comparable to those achieved by UMDA.

The parameters used in the three algorithms are described in Table 1.

Finally, it is necessary to remark that the fitness function used in the algorithm only allows unidirectional alignments.


Table 1
Parameters used in the evolutionary algorithms tested

Parameter | GA | PBIL | UMDA
Population size | 2J | 2J | 26J
Replacement mechanism | Elitism | Elitism | Elitism
Learning rate | – | 0.3 | –
Negative learning rate | – | 0.07 | –
Population used to learn the prob. distr. | 50% | – | –
Mutation type | Random | – | –
Crossover type | One point | – | –
Selection strategy | Proportional fitness selection | – | –
Crossover probability | 0.9 | – | –
Mutation probability | 0.01 | 0.01 | –

J is the source sentence length. The termination criterion is based on a maximum number of calls to the fitness function (3000J).


Therefore, the search was conducted in both directions (i.e. from $f$ to $e$ and from $e$ to $f$), combining the final results to achieve many-to-many alignments. This can be seen as a postprocessing of the result of the search performed by the EDA, and it is carried out using well-known heuristic approaches (symmetrization methods). The results shown in Section 6.2 were obtained by applying the refined method proposed in [20].
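For illustration, a simplified symmetrization heuristic in the spirit of (but not identical to) the refined method of [20] could look like this (a sketch of our own; it starts from the intersection of the two directed alignments and grows it with union links that touch unaligned words):

```python
def symmetrize(f2e, e2f):
    """Combine two unidirectional alignment vectors into one set of links.
    f2e[j] = i aligns source word j to target word i (0 = NULL);
    e2f[i] = j aligns target word i to source word j (0 = NULL).
    Link positions are 1-based, as in the paper's vector representation."""
    links_f2e = {(j + 1, i) for j, i in enumerate(f2e) if i != 0}
    links_e2f = {(j, i + 1) for i, j in enumerate(e2f) if j != 0}
    alignment = links_f2e & links_e2f           # start from the intersection
    for j, i in sorted(links_f2e | links_e2f):  # grow with union links that
        aligned_j = {l[0] for l in alignment}   # touch an unaligned word
        aligned_i = {l[1] for l in alignment}
        if j not in aligned_j or i not in aligned_i:
            alignment.add((j, i))
    return alignment
```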

5.4. Smoothing the probability distribution

The rationale behind EDAs is to estimate a probability distribution from a set of samples (individuals). Nonetheless, the number of samples needed to perform an accurate estimation is usually prohibitive; in particular, we have to deal with unseen events in our population. Typically, this is handled by applying the Laplace correction, which consists in adding one to the absolute frequency of each event (including the unseen events, whose absolute frequency is zero).

In the present article, two different (and complementary) approaches have been adopted. The first consists in interpolating the probability distribution learnt from the previous population with a uniform distribution over all the possible states of the variables in the problem, so that the final distribution used for sampling the new population is given by

$$\Pr_{\mathrm{final}}(x) = \alpha \cdot \Pr_{\mathrm{learnt}}(x) + (1 - \alpha) \cdot \Pr_{\mathrm{uniform}}(x), \quad (12)$$

where $\alpha$ is a weight factor, $\Pr_{\mathrm{learnt}}$ is the probability distribution estimated from the previous population, and $\Pr_{\mathrm{uniform}}$ is the probability given by a uniform distribution over all the possible states of the current variable. The optimal value for the weight $\alpha$ is correlated with the population size; although it has not been intensively optimized, results seem to improve when using a value of $\alpha = 0.95$ with a population size of 100 individuals.

On the other hand, smoothing has been introduced into the algorithm in order to simulate a local search process. In some tasks, the results obtained by EDAs have been improved by applying a local search algorithm to the best individuals in the population. The problem here is that local search algorithms are usually very time-consuming, mainly due to the additional evaluations of the fitness function they require. Our proposal, which does not need additional evaluations of this function, is based on the interpolation (as shown in Eq. (12)) of the EDA probability distribution with a Gaussian mixture. This process is performed by generating a mixture of discretized Gaussians for each variable in the problem, so that the means of the Gaussians are placed on the states with the highest probability in the EDA distribution. The weight (prior) of each component in the mixture is proportional to the probability of the state covered by that component.

Therefore, the mixture tries to spread the probability mass to the neighbouring states of the best ones. As a result, we force the EDA to explore the local environments of the different optima reached so far. An example of this smoothing is depicted in Fig. 5.

In Table 6, a comparison between an EDA with the first kind of smoothing and an EDA with both kinds of smoothing is shown.

6. Experimental results

Several experiments have been carried out in order to assess the correctness of the search algorithm. Next, the experimental methodology employed and the results obtained are described.

6.1. Corpora and evaluation

The experiments consisted in obtaining word alignments between pairs of sentences following the approach described in the previous sections. Specifically, three different test sets containing several sentence pairs to be aligned were used. All the test sets were taken from the two shared tasks on word alignment developed at HLT/NAACL 2003 [16] and ACL 2005 [13]. These two tasks involved four different pairs of languages: English–French, Romanian–English, English–Inuktitut and English–Hindi. Since addressing the English–Inuktitut and English–Hindi pairs requires a certain degree of knowledge about the languages themselves, the English–French and Romanian–English pairs have been considered in these experiments. Next, a brief description of the corpora used is given.

Regarding the Romanian–English task, the test data used to evaluate the alignments consisted of 248 sentence pairs for the 2003 evaluation task and 200 for the 2005 evaluation task. In addition, a training corpus consisting of about 1 million Romanian words and about the same number of English words was used. On the other hand, a subset of the Canadian Hansards corpus has been used in the English–French task.

Fig. 5. Example of local search smoothing. In this example, a one-dimensional problem involving a discrete variable is shown. All the states the variable can take are listed in the leftmost column. In the second column, we can see the original distribution learnt by the EDA for this variable; in this case, the probability mass is distributed over just two states, 2 and 9. The next column represents the Gaussian mixture used for smoothing. Finally, the rightmost column shows the final probability distribution.

Table 2
Features of the corpora used in the different alignment tasks

 | En–Fr | Ro–En 03 | Ro–En 05
Training size | 1M | 97K | 97K
Vocabulary | 68K/86K | 48K/27K | 48K/27K
Running words | 20M/23M | 1.9M/2M | 1.9M/2M
Test size | 447 | 248 | 200

Table 3
Alignment quality (%) for the English–French task with NULL alignments

System | Ps | Rs | Fs | Pp | Rp | Fp | AER
UMDA | 73.78 | 86.79 | 79.76 | 80.97 | 32.95 | 46.84 | 14.26 (0.07)
PBIL | 74.28 | 82.51 | 78.18 | 78.93 | 32.38 | 45.93 | 16.17 (0.25)
GA | 74.32 | 81.39 | 77.70 | 78.73 | 32.20 | 45.70 | 16.51 (0.25)
GIZA++ | 73.61 | 82.56 | 77.92 | 79.94 | 32.96 | 46.67 | 15.89
Ralign.EF1 | 72.54 | 80.61 | 76.36 | 77.56 | 36.79 | 49.91 | 18.50
XRCE.Nolem.EF.3 | 55.43 | 93.81 | 69.68 | 72.01 | 36.00 | 48.00 | 21.27

The test corpus consists of 447 English–French sentence pairs. The training corpus contains about 20 million English words and about the same number of French words. Table 2 shows the features of the different corpora used.

To evaluate the quality of the final alignments obtained, different measures have been taken into account: precision, recall, F-measure, and alignment error rate. Given an alignment $A$ and a reference alignment $G$ (both $A$ and $G$ can be split into two subsets $A_S$, $A_P$ and $G_S$, $G_P$, respectively representing Sure and Probable alignments), precision ($P_T$), recall ($R_T$), F-measure ($F_T$) and alignment error rate (AER) are computed as follows:

$$P_T = \frac{|A_T \cap G_T|}{|A_T|}, \qquad R_T = \frac{|A_T \cap G_T|}{|G_T|}, \qquad F_T = \frac{2 P_T R_T}{P_T + R_T},$$

$$\mathrm{AER} = 1 - \frac{|A_S \cap G_S| + |A_P \cap G_P|}{|A_P| + |G_S|}$$

(where $T$ is the alignment type, and can be set to either $S$ or $P$).

Of all these measures, the most important one for determining the quality of the alignments is the AER, since it summarizes the information provided by the remaining ones.

It is important to emphasize that EDAs and GAs are non-deterministic algorithms. Because of this, the results presented in Section 6.2 are actually the mean and the standard deviation of the results obtained in 30 different executions of the algorithm.

6.2. Results

Tables 3–6 present the results obtained in the different tasks. The results achieved by the technique proposed in this paper are compared with the best results presented in the shared tasks described in [13,16]. The results obtained by the GIZA++ hill-climbing algorithm are also presented. In these tables, the mean (and, in brackets, the standard deviation) of the results obtained in 30 runs of the search algorithm is shown.



Table 4
Alignment quality (%) for the Romanian–English 2003 task

System | Ps | Rs | Fs | Pp | Rp | Fp | AER
UMDA | 95.05 | 49.77 | 65.33 | 80.65 | 59.35 | 68.38 | 31.62 (0.05)
PBIL | 94.85 | 46.22 | 62.15 | 78.34 | 57.95 | 66.62 | 33.38 (0.28)
GA | 94.70 | 45.43 | 61.40 | 77.40 | 52.28 | 65.87 | 34.13 (0.31)
GIZA++ | 95.20 | 48.54 | 64.30 | 79.89 | 57.82 | 67.09 | 32.91
XRCE.Trilex.RE.3 | 80.97 | 53.64 | 64.53 | 63.64 | 61.58 | 62.59 | 37.41
XRCE.Nolem-56k.RE.2 | 82.65 | 54.12 | 65.41 | 61.59 | 61.50 | 61.54 | 38.46

Table 5
Alignment quality (%) for the Romanian–English 2005 task

System | Ps | Rs | Fs | Pp | Rp | Fp | AER
UMDA | 95.37 | 54.90 | 69.68 | 80.61 | 67.83 | 73.67 | 26.33 (0.06)
PBIL | 95.88 | 50.58 | 66.22 | 83.68 | 62.51 | 71.56 | 28.44 (0.32)
GA | 95.80 | 49.90 | 65.61 | 83.03 | 61.96 | 70.97 | 29.03 (0.32)
GIZA++ | 95.73 | 52.95 | 68.18 | 83.66 | 63.38 | 72.12 | 27.88
ISI.Run5.vocab.grow | 87.90 | 63.08 | 73.45 | 87.90 | 63.08 | 73.45 | 26.55
ISI.Run4.simple.intersect | 94.29 | 57.42 | 71.38 | 94.29 | 57.42 | 71.38 | 28.62
ISI.Run2.simple.union | 70.46 | 71.31 | 70.88 | 70.46 | 71.31 | 70.88 | 29.12

Table 6
Improvements in alignment error rate achieved by using local search smoothing in UMDA

Task | Without LS smoothing | With LS smoothing
Ro-En 03 | 32.16 | 31.62
Ro-En 05 | 26.79 | 26.33
Fr-En | 14.72 | 14.26


According to these results, the proposed EDA-based search is very competitive with respect to the best results presented in the two shared tasks.

Owing to the fact that evolutionary algorithms are non-deterministic, it is necessary to perform some kind of statistical analysis of the results. As commented before, 30 executions of each algorithm were carried out on the different corpora (the numbers in the tables are the means of these executions). Of the different algorithms tested, the only one that seems to perform better than the baseline (GIZA++) is UMDA. On account of this, a Student's t-test was computed between the baseline result and the 30 runs of UMDA (previously, a Kolmogorov–Smirnov test was used to check the normality assumption on the samples). The alternative hypothesis corresponds to considering the mean of these samples to be less than the mean of the baseline (both the samples and the baseline are error rates; thus, a lower value means a better result). The t-test result for all the test sets (Ro-En 03, Ro-En 05 and Fr-En) was a p-value of less than $2.2 \times 10^{-16}$, which clearly supports the alternative hypothesis.
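This kind of analysis can be reproduced, for instance, with scipy (a sketch under the assumption of 30 per-run AER values against a single deterministic baseline; the alternative argument of ttest_1samp requires scipy >= 1.6):

```python
import statistics
from scipy import stats

def significance(aer_runs, baseline_aer):
    """Normality check followed by a one-sided, one-sample t-test.
    H1: mean(aer_runs) < baseline_aer (lower AER is better)."""
    mu, sd = statistics.mean(aer_runs), statistics.stdev(aer_runs)
    # Kolmogorov-Smirnov test against a normal with the sample's moments.
    ks_pvalue = stats.kstest(aer_runs, 'norm', args=(mu, sd)).pvalue
    t_pvalue = stats.ttest_1samp(aer_runs, popmean=baseline_aer,
                                 alternative='less').pvalue
    return ks_pvalue, t_pvalue
```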

In addition to these results, further experiments were carried out to evaluate the actual behavior of the search algorithm. These experiments focused on measuring the quality of the algorithm, distinguishing between the errors produced by the search process itself and the errors produced by the model that leads the search (i.e. the errors introduced by the fitness function). To this end, the following approach was adopted. First, the (bidirectional) reference alignments used in the computation of the alignment error rate were split into two sets of unidirectional alignments. Owing to the fact that there is no exact method to perform this decomposition, we employed the following method: for each reference alignment, all the possible decompositions into unidirectional alignments were generated, each of them was scored with the evaluation function $f(a) = \Pr(f, a \mid e)$ defined in Eq. (11), and the best one, $a_{ref}$, was selected. Afterwards, this alignment was compared with the solution provided by the EDA, $a_{eda}$. This comparison was made for each sentence in the test set, according to the fitness function. At this point, we say that a model error is produced if $f(a_{eda}) > f(a_{ref})$, and that a search error is produced if $f(a_{eda}) < f(a_{ref})$. Table 7 summarizes both kinds of errors in the different test sets.

Regarding the temporal complexity of the three algorithms tested, all three performed in a similar way, since the most time-consuming operation is the fitness function. Hence, and because of the termination criterion employed (a maximum limit of calls to the fitness function), the time spent by each algorithm was quite similar (around 4 s per sentence running on a Pentium Xeon 2.30 GHz). At this point, it is necessary to remark that the GA and PBIL algorithms reached results competitive with UMDA when this limit of calls was increased (but consuming a prohibitive amount of time for this task).
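Expressed as code, the per-sentence classification criterion above is simply (our own phrasing; "optimal" corresponds to the Ne column of Table 7):

```python
def classify_error(f_eda, f_ref):
    """Classify a sentence pair by comparing the fitness of the EDA solution
    with the fitness of the best decomposition of the reference alignment."""
    if f_eda > f_ref:
        return "model error"   # the model prefers a non-reference alignment
    if f_eda < f_ref:
        return "search error"  # the search failed to reach the reference score
    return "optimal"           # the algorithm found the reference alignment
```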


Table 7
Comparison between reference alignments (decomposed into two unidirectional alignments) and the alignments provided by the EDA

Romanian–English 03 task:
System | Romanian–English (Search / Model / Ne) | English–Romanian (Search / Model / Ne)
GIZA++ | 1.6 / 87.9 / 10.5 | 2 / 93.6 / 4.4
EDA-UMDA | 0 / 89.9 / 10.1 | 0 / 95.6 / 4.4
EDA-PBIL | 1.9 / 88.7 / 9.4 | 1.2 / 94.4 / 4.4
GA | 2.6 / 87.7 / 9.7 | 1.6 / 93.9 / 4.5

Romanian–English 05 task:
System | Romanian–English (Search / Model / Ne) | English–Romanian (Search / Model / Ne)
GIZA++ | 1 / 92.5 / 6.5 | 1 / 92 / 7
EDA-UMDA | 0 / 92 / 8 | 0 / 94 / 6
EDA-PBIL | 1.5 / 90.5 / 8 | 1 / 93 / 6
GA | 2 / 88 / 8 | 1.5 / 92.5 / 6

French–English task:
System | French–English (Search / Model / Ne) | English–French (Search / Model / Ne)
GIZA++ | 1.6 / 87.3 / 11.1 | 0.9 / 91.1 / 8
EDA-UMDA | 0 / 87.9 / 12.1 | 0 / 91.7 / 8.3
EDA-PBIL | 1.6 / 86.3 / 12.1 | 1.1 / 90.4 / 8.5
GA | 2.7 / 86.1 / 11.1 | 2 / 89.3 / 8.7

The "Search" and "Model" columns show the percentage of search and model errors produced by each technique, respectively. The "Ne" column refers to the percentage of times that the algorithm finds the optimal alignment (manual alignment).


From this, we can conclude that the actual advantage of UMDA is its speed of convergence with respect to the other two algorithms.

These experiments show that most of the errors were not due to the search process itself but to other factors. In fact, as can be seen in the table, UMDA achieves a perfect search in all the tasks, since all the errors produced in these tasks were introduced by the model that leads the search. From this, we can conclude that the model used to lead the search should be improved, rather than developing better search algorithms.

7. Conclusions

In this paper, a new approach to deal with the alignment problem in statistical machine translation, based on the use of evolutionary algorithms, has been presented. This approach has proved to be very competitive with respect to the state of the art in this field. In addition, a new technique for smoothing the probability distribution in discrete EDAs, in order to introduce a pseudo-local search into this kind of algorithm, has been proposed.

From the analysis of the results, we can conclude that, instead of improving the search algorithm, we have to put more effort into improving the statistical alignment model, since in the case of the UMDA algorithm no search errors are produced.

Finally, we are now focusing on the influence of these improved alignments on the statistical models for machine translation, and on the degree of accuracy that could be achieved by means of these alignments. In addition, the integration of the alignment algorithm into the training process of the statistical translation models is currently being carried out.

Acknowledgements

This work has been supported by the Spanish projects JCCM (PBI-05-022) and HERMES 05/06 (Vic. Inv. UCLM).

The authors wish to thank the anonymous reviewers and the editors for their valuable comments and criticisms, which improved the manuscript considerably.

References

[1] S. Baluja, Population based incremental learning, Technical Report CMU-CS-94-163, School of Computer Science, Carnegie Mellon University, 1994.
[2] A.L. Berger, P.F. Brown, S.A. Della Pietra, V.J. Della Pietra, J.R. Gillett, J.D. Lafferty, H. Printz, L. Ures, in: The Candide System for Machine Translation, Plainsboro, NJ, 1994, pp. 157–162.
[3] P.F. Brown, S.A.D. Pietra, V.J.D. Pietra, R.L. Mercer, The mathematics of statistical machine translation: parameter estimation, Comput. Linguist. 19 (2) (1993) 263–311.
[4] R.D. Brown, in: Automated Dictionary Extraction for Knowledge-free Example-based Translation, Santa Fe, NM, 1997, pp. 111–118.
[5] M. Diab, An unsupervised method for multilingual word sense tagging using parallel corpora: a preliminary investigation, in: ACL-2000 Workshop on Word Senses and Multilinguality, Hong Kong, 2000, pp. 1–9.
[6] I. García-Varea, F. Casacuberta, H. Ney, An iterative, DP-based search algorithm for statistical machine translation, in: Proceedings of the International Conference on Spoken Language Processing (ICSLP'98), Sydney, Australia, 1998, pp. 1235–1238.
[7] U. Germann, M. Jahr, K. Knight, D. Marcu, K. Yamada, in: Fast Decoding and Optimal Decoding for Machine Translation, Toulouse, France, 2001, pp. 228–235.
[8] D. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, MA, 1989.
[9] J. Holland, Adaptation in Natural and Artificial Systems, The University of Michigan Press, Ann Arbor, MI, 1975.
[10] P. Koehn, F.J. Och, D. Marcu, Statistical phrase-based translation, in: Proceedings of the Human Language Technology and North American Association for Computational Linguistics Conference (HLT/NAACL), Edmonton, Canada, 2003.
[11] P. Larrañaga, J. Lozano, Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation, Kluwer Academic Publishers, Dordrecht, 2001.
[12] D. Marcu, W. Wong, A phrase-based, joint probability model for statistical machine translation, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-2002), Philadelphia, PA, 2002, pp. 1408–1414.
[13] J. Martin, R. Mihalcea, T. Pedersen, Word alignment for languages with scarce resources, in: R. Mihalcea, T. Pedersen (Eds.), Proceedings of the ACL Workshop on Building and Exploiting Parallel Texts: Data Driven Machine Translation and Beyond, Association for Computational Linguistics, Michigan, USA, 2005, pp. 1–10.
[14] I.D. Melamed, Models of translational equivalence among words, Comput. Linguist. 26 (2) (2000) 221–249.
[15] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, Springer, Berlin, 1996.
[16] R. Mihalcea, T. Pedersen, An evaluation exercise for word alignment, in: R. Mihalcea, T. Pedersen (Eds.), HLT-NAACL 2003 Workshop: Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, Association for Computational Linguistics, Edmonton, Alberta, Canada, 2003, pp. 1–10.
[17] H. Mühlenbein, The equation for response to selection and its use for prediction, Evol. Comput. 5 (3) (1997) 303–346.
[18] S. Nießen, S. Vogel, H. Ney, C. Tillmann, in: A DP-based Search Algorithm for Statistical Machine Translation, Montreal, Canada, 1998, pp. 960–967.
[19] F.J. Och, H. Ney, in: A Comparison of Alignment Models for Statistical Machine Translation, Saarbrücken, Germany, 2000, pp. 1086–1090.
[20] F.J. Och, H. Ney, Improved statistical alignment models, in: ACL00, Hong Kong, China, 2000, pp. 440–447.
[21] F.J. Och, H. Ney, A systematic comparison of various statistical alignment models, Comput. Linguist. 29 (1) (2003) 19–51.
[22] F.J. Och, N. Ueffing, H. Ney, in: An Efficient A* Search Algorithm for Statistical Machine Translation, Toulouse, France, 2001, pp. 55–62.
[23] D. Ortiz, I. García-Varea, F. Casacuberta, Thot: a toolkit to train phrase-based statistical translation models, in: Tenth Machine Translation Summit, Phuket, Thailand, 2005, pp. 141–148.
[24] F. Smadja, K.R. McKeown, V. Hatzivassiloglou, Translating collocations for bilingual lexicons: a statistical approach, Comput. Linguist. 22 (1) (1996) 1–38.
[25] J. Tomás, F. Casacuberta, Monotone statistical translation using word groups, in: Proceedings of the Machine Translation Summit VIII, Santiago de Compostela, Spain, 2001, pp. 357–361.
[26] Y.-Y. Wang, J.D. Lafferty, A. Waibel, Word clustering with parallel spoken language corpora, in: Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP'96), Philadelphia, PA, 1996, pp. 2364–2367.
[27] D. Wu, A polynomial-time algorithm for statistical machine translation, in: Proceedings of the 34th Annual Conference of the Association for Computational Linguistics (ACL '96), Santa Cruz, CA, 1996, pp. 152–158.
[28] D. Yarowsky, G. Ngai, R. Wicentowski, Inducing multilingual text analysis tools via robust projection across aligned corpora, in: Human Language Technology Conference, San Diego, CA, 2001, pp. 109–116.
[29] D. Yarowsky, R. Wicentowski, in: Minimally Supervised Morphological Analysis by Multimodal Alignment, Hong Kong, 2000, pp. 207–216.
[30] R. Zens, F. Och, H. Ney, Phrase-based statistical machine translation, in: Advances in Artificial Intelligence, 25th Annual German Conference on Artificial Intelligence, Lecture Notes in Computer Science, vol. 2479, Springer, Berlin, 2002, pp. 18–32.

Luis Rodríguez received the M.S. degree in computer science from the University of Castilla-La Mancha, Spain, in 2002. He is currently pursuing the Ph.D. degree at the Polytechnic University of Valencia, Spain. He is working at the Department of Computing Systems of the University of Castilla-La Mancha as an assistant professor. He is an active member of the Intelligent Systems and Data Mining group (UCLM) and also of the Pattern Recognition and Human Language Technology group (UPV).

Ismael García-Varea received the Master degree in Computer Science and the Ph.D. degree in pattern recognition and artificial intelligence from the Polytechnic University of Valencia (UPV), Spain, in 1996 and 2003, respectively. In 1999 he joined the Computer Science department of the University of Castilla-La Mancha (UCLM), where he has since been serving as an Assistant Professor. His current research interests include the areas of syntactic and statistical pattern recognition, machine learning, machine translation, data mining and soft computing. Dr. García-Varea is an active member of the "Pattern Recognition and Human Language Technology" (PRHLT) and "Data Mining and Intelligent Systems" (SIMD) groups of the UPV and UCLM, respectively.

José A. Gámez received the M.S. degree in computer science in 1991, and the Ph.D. degree in computer science in 1998, both from the University of Granada, Spain. He joined the Department of Computer Science at the University of Castilla-La Mancha (UCLM) in 1991, where he is currently an Associate Professor. He served as Vice-Dean of the Escuela Politécnica Superior de Albacete (UCLM) from 1998 to 2004, and he currently serves as Chair of the Department of Computing Systems (UCLM). His research interests include probabilistic reasoning, Bayesian networks, evolutionary algorithms, machine learning and data mining. Dr. Gámez has edited five books and published more than 50 papers on these topics.

