MT Study Group
Translating into Morphologically Rich Languages with Synthetic Phrases
Victor Chahuneau, Eva Schlinger, Noah A. Smith, Chris Dyer
Proc. of EMNLP 2013, Seattle, USA
Introduced by Akiva Miura, AHC-Lab
15/10/15 © 2014 Akiba (Akiva) Miura, AHC-Lab, IS, NAIST
Contents
1. Introduction
2. Translate-and-Inflect Model
3. Morphological Grammars and Features
4. Inflection Model Parameter Estimation
5. Synthetic Phrases
6. Translation Experiments
7. Conclusion
8. Impressions
1. Introduction
Problem:
• MT into morphologically rich languages (e.g. inflecting languages) is challenging due to:
  – lexical sparsity
  – the large variety of grammatical features expressed with morphology
• This paper:
  – introduces a method using target language morphological grammars to address this challenge
  – improves translation quality from English into Russian/Hebrew/Swahili
1. Introduction (cont'd)
Proposed Approach:
• The proposed approach decomposes the MT process into two steps:
  1. translating a word (or phrase) into a meaning-bearing stem
  2. selecting an appropriate inflection using a feature-rich discriminative model that conditions on the source context of the word being translated
• An inventory of translation rules for individual words and short phrases is obtained using the standard translation rule extraction techniques of Chiang (2007).
  – The authors call these synthetic phrases.
1. Introduction (cont'd)
Advantages:
• The major advantages of this approach are:
  1. synthesized forms are targeted to a specific translation context
  2. multiple, alternative phrases may be generated, with the final choice among rules left to the global translation model
  3. virtually no language-specific engineering is necessary
  4. any phrase- or syntax-based decoder can be used without modification
  5. we can generate forms that were not attested in the bilingual training data
2. Translate-and-Inflect Model
Figure 1: The inflection model predicts a form for the target verb lemma σ = пытаться (pytat'sya) based on its source attempted and the linear and syntactic source context. The correct inflection string for the observed Russian form in this particular training instance is μ = mis-sfm-e (equivalent to the more traditional morphological string: +MAIN+IND+PAST+SING+FEM+MEDIAL+PERF).
[Figure 1 shows the aligned sentence pair она пыталась пересечь пути на ее велосипед / she had attempted to cross the road on her bike, with English POS tags (PRP VBD VBN TO VB DT NN IN PRP$ NN), Brown clusters (C50 C473 C28 C8 C275 C37 C43 C82 C94 C331), dependency links (root, nsubj, aux, xcomp), the −1/+1 context window, and the predicted analysis σ: пытаться_V, μ: mis-sfm-e]
2. Translate-and-Inflect Model (cont'd)
Modeling Inflection:
2 Translate-and-Inflect Model

The task of the translate-and-inflect model is illustrated in Fig. 1 for an English–Russian sentence pair. The input will be a sentence e in the source language (in this paper, always English) and any available linguistic analysis of e. The output f will be composed of (i) a sequence of stems, each denoted σ, and (ii) one morphological inflection pattern for each stem, denoted μ. When the information is available, a stem σ is composed of a lemma and an inflectional class. Throughout, we use M_σ to denote the set of possible morphological inflection patterns for a given stem σ. M_σ might be defined by a grammar; our models restrict M_σ to be the set of inflections observed anywhere in our monolingual or bilingual training data as a realization of σ.¹

We assume the availability of a deterministic function that maps a stem σ and morphological inflection μ to a target language surface form f. In some cases, such as our unsupervised approach in §3.2, this will be a concatenation operation, though finite-state transducers are traditionally used to define such relations (§3.1). We abstractly denote this operation by ⋄: f = σ ⋄ μ.

Our approach consists in defining a probabilistic model over target words f. The model assumes independence between each target word f conditioned on the source sentence e and its aligned position i in this sentence.² This assumption is further relaxed in §5 when the model is integrated in the translation system.

We decompose the probability of generating each target word f in the following way:

  p(f | e, i) = Σ_{σ⋄μ=f}  p(σ | e_i)  ·  p(μ | σ, e, i)
                           (gen. stem)    (gen. inflection)

Here, each stem is generated independently from a single aligned source word e_i, but in practice we use a standard phrase-based model to generate sequences of stems and only the inflection model operates word-by-word. We turn next to the inflection model.

¹ This prevents the model from generating words that would be difficult for the language model to reliably score.
² This is the same assumption that Brown et al. (1993) make in, for example, IBM Model 1.
2.1 Modeling Inflection

In morphologically rich languages, each stem may be combined with one or more inflectional morphemes to express many different grammatical features (e.g., case, definiteness, mood, tense, etc.).

Since the inflectional morphology of a word generally expresses multiple grammatical features, we would like a model that naturally incorporates rich, possibly overlapping features in its representation of both the input (i.e., conditioning context) and output (i.e., the inflection pattern). We therefore use the following parametric form to model inflectional probabilities:

  u(μ, e, i) = exp( φ(e, i)ᵀ W ψ(μ) + ψ(μ)ᵀ V ψ(μ) ),

  p(μ | σ, e, i) = u(μ, e, i) / Σ_{μ′∈M_σ} u(μ′, e, i).   (1)

Here, φ is an m-dimensional source context feature vector function, ψ is an n-dimensional morphology feature vector function, and W ∈ ℝ^{m×n} and V ∈ ℝ^{n×n} are parameter matrices. As with the more familiar log-linear parametrization that is written with a single feature vector, single weight vector and single bias vector, this model is linear in its parameters (it can be understood as working with a feature space that is the outer product of the two feature spaces). However, using two feature vectors allows us to define overlapping features of both the input and the output, which is important for modeling morphology in which output variables are naturally expressed as bundles of features. The second term in the sum in u enables correlations among output features to be modeled independently of input, and as such can be understood as a generalization of the bias terms in multi-class logistic regression (on the diagonal V_ii) and interaction terms between output variables in a conditional random field (off the diagonal V_ij).
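Equation (1) can be sketched numerically. The following is a minimal NumPy illustration, not the authors' implementation; it assumes dense φ and ψ vectors and a small candidate set M_σ stacked as rows of a matrix:

```python
import numpy as np

def inflection_probs(phi, Psi, W, V):
    """p(mu | sigma, e, i) over the k candidate inflections in M_sigma (Eq. 1).

    phi: (m,) source context feature vector phi(e, i)
    Psi: (k, n) matrix whose rows are morphology feature vectors psi(mu)
    W:   (m, n) and V: (n, n) parameter matrices
    """
    # u(mu, e, i) = exp(phi^T W psi(mu) + psi(mu)^T V psi(mu))
    scores = Psi @ (W.T @ phi) + np.einsum("kn,nm,km->k", Psi, V, Psi)
    scores -= scores.max()          # stabilize the exponentials
    u = np.exp(scores)
    return u / u.sum()              # normalize over M_sigma
```

With W = V = 0 every candidate gets the same score, so the distribution is uniform over M_σ.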
- ψ(μ): the n-dimensional morphology feature vector function
- φ(e, i): the m-dimensional source context feature vector function
2. Translate-and-Inflect Model (cont'd)
Source Context Feature Extraction:
Figure 2 (source features φ(e, i)):
• for the source aligned word e_i, its parent word e_πi with its dependency π_i → i, all children e_j | π_j = i with their dependency i → j, and the source words e_{i−1} and e_{i+1}: the token, its part-of-speech tag, and its word cluster
• additional binary features: are e_i, e_πi at the root of the dependency tree? the number of children and siblings of e_i
2.2 Source Context Features: φ(e, i)

In order to select the best inflection of a target-language word, given the source word it translates and the context of that source word, we seek to exploit as many features of the context as are available. Consider the example shown in Fig. 1, where most of the inflection features of the Russian word (past tense, singular number, and feminine gender) can be inferred from the context of the English word it is aligned to. Indeed, many grammatical functions expressed morphologically in Russian are expressed syntactically in English. Fortunately, high-quality parsers and other linguistic analyzers are available for English.

On the source side, we apply the following processing steps:

• Part-of-speech tagging with a CRF tagger trained on sections 02–21 of the Penn Treebank.
• Dependency parsing with TurboParser (Martins et al., 2010), a non-projective dependency parser trained on the Penn Treebank to produce basic Stanford dependencies.
• Assignment of tokens to one of 600 Brown clusters, trained on 8G words of English text.³

We then extract binary features from e using this information, by considering the aligned source word e_i, its preceding and following words, and its syntactic neighbors. These are detailed in Figure 2.
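The binary feature templates of Figure 2 can be sketched as follows. This is an illustrative reimplementation: the feature names and the flat-list sentence encoding are my assumptions, not the paper's code.

```python
def source_features(tokens, pos, clusters, heads, labels, i):
    """Binary phi(e, i) features for the aligned source word e_i (sketch).

    heads[j] is the dependency parent of token j (-1 for the root);
    labels[j] is the typed dependency linking j to its parent.
    """
    feats = set()

    def describe(prefix, j):
        # token, part-of-speech tag, and word cluster of position j
        feats.add(f"{prefix}:tok={tokens[j]}")
        feats.add(f"{prefix}:pos={pos[j]}")
        feats.add(f"{prefix}:cluster={clusters[j]}")

    describe("src", i)                      # aligned word e_i
    if i > 0:
        describe("prev", i - 1)             # e_{i-1}
    if i + 1 < len(tokens):
        describe("next", i + 1)             # e_{i+1}

    if heads[i] >= 0:                       # parent with its typed dependency
        describe("parent", heads[i])
        feats.add(f"dep:parent={labels[i]}")
    else:
        feats.add("src-is-root")

    children = [j for j, h in enumerate(heads) if h == i]
    for j in children:                      # children with their typed dependencies
        describe("child", j)
        feats.add(f"dep:child={labels[j]}")
    feats.add(f"num-children={len(children)}")
    return feats
```

On a fragment of the Fig. 1 sentence, the word attempted (the root) yields, among others, src:tok=attempted, src-is-root, and dep:child=nsubj for its subject she.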
3 Morphological Grammars and Features

We now describe how to obtain morphological analyses and convert them into feature vectors (ψ) for our target languages, Russian, Hebrew, and Swahili, using supervised and unsupervised methods.

3.1 Supervised Morphology

The state of the art in morphological analysis uses unweighted morphological transduction rules (usu-

³ The entire monolingual data available for the translation task of the 8th ACL Workshop on Statistical Machine Translation was used.
Figure 2: Source features φ(e, i) extracted from e and its linguistic analysis. πi denotes the parent of the token in position i in the dependency tree and πi → i the typed dependency link.
3. Morphological Grammars and Features
Supervised morphology features: ψ(μ) – extracted from the output of a morphological analyzer (available only for Russian in this paper). Since a positional tag set is used, it is straightforward to convert each fixed-length tag μ into a feature vector by defining a binary feature for each key-value pair (e.g., Tense=past) composing the tag.
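The conversion described above can be sketched as follows. The category names and the '-' convention for an unset position are illustrative assumptions, not the actual Russian positional tagset:

```python
# Hypothetical positional scheme: character position k of the tag encodes category k.
CATEGORIES = ["POS", "Mood", "Tense", "Number", "Gender"]   # illustrative only

def tag_features(tag):
    """Binary psi(mu) features from a fixed-length positional tag (sketch)."""
    return {f"{cat}={val}": 1
            for cat, val in zip(CATEGORIES, tag)
            if val != "-"}                  # assume '-' marks an unset category
```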
3. Morphological Grammars and Features (cont'd)
Unsupervised morphology features: ψ(μ) – generated by an unsupervised analyzer (no language-specific engineering used). We produce binary features corresponding to the content of each potential affixation position relative to the stem:
Table 1: Corpus statistics.
          |                    Parallel                       | Parallel+Monolingual
          | Sentences EN-tokens TRG-tokens EN-types TRG-types | Sentences TRG-tokens TRG-types
Russian   | 150k      3.5M      3.3M       131k     254k      | 20M       360M       1,971k
Hebrew    | 134k      2.7M      2.0M       48k      120k      | 806k      15M        316k
Swahili   | 15k       0.3M      0.3M       23k      35k       | 596k      13M        334k
Unsupervised morphology features: ψ(μ). For the unsupervised analyzer, we do not have a mapping from morphemes to structured morphological attributes; however, we can create features from the affix sequences obtained after morphological segmentation. We produce binary features corresponding to the content of each potential affixation position relative to the stem:
        prefix                    suffix
  ... -3  -2  -1    STEM    +1  +2  +3 ...
For example, the unsupervised analysis μ = wa+ki+wa+STEM of the Swahili word wakiwapiga will produce the following features:

  ψ_prefix[−3][wa](μ) = 1,
  ψ_prefix[−2][ki](μ) = 1,
  ψ_prefix[−1][wa](μ) = 1.
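These affix-position features can be computed mechanically from a segmentation. A minimal sketch, assuming the '+'-joined input format of the μ = wa+ki+wa+STEM notation above:

```python
def affix_features(analysis):
    """Binary psi(mu) features from an unsupervised segmentation like 'wa+ki+wa+STEM'."""
    morphs = analysis.split("+")
    s = morphs.index("STEM")
    feats = {}
    for pos, m in enumerate(morphs[:s]):          # prefixes fill slots -s .. -1
        feats[f"prefix[{pos - s}][{m}]"] = 1
    for pos, m in enumerate(morphs[s + 1:], 1):   # suffixes fill slots +1, +2, ...
        feats[f"suffix[+{pos}][{m}]"] = 1
    return feats
```

affix_features("wa+ki+wa+STEM") reproduces exactly the three ψ_prefix features listed above.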
4 Inflection Model Parameter Estimation

To set the parameters W and V of the inflection prediction model (Eq. 1), we use stochastic gradient descent to maximize the conditional log-likelihood of a training set consisting of pairs of source (English) sentence contextual features (φ) and target word inflectional features (ψ). The training instances are extracted from the word-aligned parallel corpus with the English side preprocessed as discussed in §2.2 and the target side disambiguated as discussed in §3. When morphological category information is available, we train an independent model for each open-class category (in Russian: nouns, verbs, adjectives, numerals, adverbs); otherwise a single model is used for all words (excluding words less than four characters long, which are ignored).

Statistics of the parallel corpora used to train the inflection model are summarized in Table 1. It is important to note here that our richly parameterized model is trained on the full parallel training corpus, not just on a handful of development sentences (which are typically used to tune MT system parameters). Despite this scale, training is simple: the inflection model is trained to discriminate among different inflectional paradigms, not over all possible target language sentences (Blunsom et al., 2008) or learning from all observable rules (Subotin, 2011). This makes the training problem relatively tractable: all experiments in this paper were trained on a single processor using a Cython implementation of the SGD optimizer. For our largest model, trained on 3.3M Russian words, n = 231K × m = 336 features were produced, and 10 SGD iterations were performed in less than 16 hours.
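For this model family, one SGD step on the conditional log-likelihood of Eq. (1) has a closed-form gradient (observed minus expected features). A NumPy sketch under the same dense-vector assumptions as before, not the paper's Cython implementation:

```python
import numpy as np

def sgd_step(phi, Psi, gold, W, V, lr=0.1):
    """One stochastic gradient ascent step on log p(mu_gold | sigma, e, i) (sketch)."""
    scores = Psi @ (W.T @ phi) + np.einsum("kn,nm,km->k", Psi, V, Psi)
    p = np.exp(scores - scores.max())
    p /= p.sum()                                   # current model distribution
    diff = Psi[gold] - p @ Psi                     # observed - expected psi features
    W += lr * np.outer(phi, diff)                  # grad_W log-lik = phi (psi* - E[psi])^T
    V += lr * (np.outer(Psi[gold], Psi[gold])      # grad_V log-lik = psi* psi*^T - E[psi psi^T]
               - np.einsum("k,kn,km->nm", p, Psi, Psi))
    return W, V
```

Each update pushes the gold inflection's score up and the expected competitors' scores down, which is what makes training over paradigms (rather than whole sentences) tractable.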
4.1 Intrinsic Evaluation

Before considering the broader problem of integrating the inflection model in a machine translation system, we perform an artificial evaluation to verify that the model learns sensible source sentence–target inflection patterns. To do so, we create an inflection test set as follows. We preprocess the source (English) sentences exactly as during training (§2.2), and using the target language morphological analyzer, we convert each aligned target word to ⟨stem, inflection⟩ pairs. We perform word alignment on the held-out MT development data for each language pair (cf. Table 1), exactly as if it were going to produce training instances, but instead we use them for testing.

Although the resulting dataset is noisy (e.g., due to alignment errors), this becomes our intrinsic evaluation test set. Using this data, we measure inflection quality using two measurements:⁵

⁵ Note that we are not evaluating the stem translation model,
Table 1: Corpus statistics.
Parallel Parallel+Monolingual
Sentences EN-tokens TRG-tokens EN-types TRG-types Sentences TRG-tokens TRG-typesRussian 150k 3.5M 3.3M 131k 254k 20M 360M 1,971kHebrew 134k 2.7M 2.0M 48k 120k 806k 15M 316kSwahili 15k 0.3M 0.3M 23k 35k 596k 13M 334k
will involve a more direct method for specifying orinferring these values.
Unsupervised morphology features: �(µ). Forthe unsupervised analyzer, we do not have a map-ping from morphemes to structured morphologicalattributes; however, we can create features from theaffix sequences obtained after morphological seg-mentation. We produce binary features correspond-ing to the content of each potential affixation posi-tion relative to the stem:
prefix suffix...-3 -2 -1 STEM +1 +2 +3...
For example, the unsupervised analysis µ =wa+ki+wa+STEM of the Swahili word wakiwapigawill produce the following features:
�prefix[�3][wa](µ) = 1,
�prefix[�2][ki](µ) = 1,
�prefix[�1][wa](µ) = 1.
4 Inflection Model Parameter Estimation
To set the parameters W and V of the inflection pre-diction model (Eq. 1), we use stochastic gradient de-scent to maximize the conditional log-likelihood ofa training set consisting of pairs of source (English)sentence contextual features (�) and target word in-flectional features (�). The training instances areextracted from the word-aligned parallel corpus withthe English side preprocessed as discussed in §2.2and the target side disambiguated as discussed in §3.When morphological category information is avail-able, we train an independent model for each open-class category (in Russian, nouns, verbs, adjectives,numerals, adverbs); otherwise a single model is usedfor all words (excluding words less than four char-acters long, which are ignored).
Statistics of the parallel corpora used to train theinflection model are summarized in Table 1. It isimportant to note here that our richly parameterizedmodel is trained on the full parallel training cor-pus, not just on a handful of development sentences(which are typically used to tune MT system param-eters). Despite this scale, training is simple: the in-flection model is trained to discriminate among dif-ferent inflectional paradigms, not over all possibletarget language sentences (Blunsom et al., 2008) orlearning from all observable rules (Subotin, 2011).This makes the training problem relatively tractable:all experiments in this paper were trained on a sin-gle processor using a Cython implementation of theSGD optimizer. For our largest model, trained on3.3M Russian words, n = 231K � m = 336 fea-tures were produced, and 10 SGD iterations wereperformed in less than 16 hours.
4.1 Intrinsic Evaluation

Before considering the broader problem of integrating the inflection model in a machine translation system, we perform an artificial evaluation to verify that the model learns sensible source sentence–target inflection patterns. To do so, we create an inflection test set as follows. We preprocess the source (English) sentences exactly as during training (§2.2), and using the target language morphological analyzer, we convert each aligned target word to ⟨stem, inflection⟩ pairs. We perform word alignment on the held-out MT development data for each language pair (cf. Table 1), exactly as if it were going to produce training instances, but instead we use them for testing.

Although the resulting dataset is noisy (e.g., due to alignment errors), this becomes our intrinsic evaluation test set. Using this data, we measure inflection quality using two measurements:⁵

⁵ Note that we are not evaluating the stem translation model, just the inflection prediction model.
Table 1: Corpus statistics.

                      Parallel                                        Parallel + Monolingual
          Sentences  EN-tokens  TRG-tokens  EN-types  TRG-types      Sentences  TRG-tokens  TRG-types
Russian   150k       3.5M       3.3M        131k      254k           20M        360M        1,971k
Hebrew    134k       2.7M       2.0M        48k       120k           806k       15M         316k
Swahili   15k        0.3M       0.3M        23k       35k            596k       13M         334k
3. Morphological Grammars and Features (cont’d)
Unsupervised Morphological Segmentation:
Let M represent the set of all possible morphemes and define a regular grammar M*MM* (zero or more prefixes, a stem, and zero or more suffixes).

…(usually in the form of an FST) to produce candidate analyses for each word in a sentence and then statistical models to disambiguate among the analyses in context (Hakkani-Tür et al., 2000; Hajič et al., 2001; Smith et al., 2005; Habash and Rambow, 2005, inter alia). While this technique is capable of producing high-quality linguistic analyses, it is expensive to develop, requiring hand-crafted rule-based analyzers and annotated corpora to train the disambiguation models. As a result, such analyzers are only available for a small number of languages, and, as a practical matter, each analyzer (which resulted from different development efforts) operates differently from the others.
We therefore focus on using supervised analysis for a single target language, Russian. We use the analysis tool of Sharoff et al. (2008), which produces for each word in context a lemma and a fixed-length morphological tag encoding the grammatical features. We process the target side of the parallel data with this tool to obtain the information necessary to extract ⟨lemma, inflection⟩ pairs, from which we compute µ and morphological feature vectors ψ(µ).

Supervised morphology features: ψ(µ). Since a positional tag set is used, it is straightforward to convert each fixed-length tag µ into a feature vector by defining a binary feature for each key–value pair (e.g., Tense=past) composing the tag.
3.2 Unsupervised Morphology

Since many languages into which we might want to translate do not have supervised morphological analyzers, we now turn to the question of how to generate morphological analyses and features using an unsupervised analyzer. We hypothesize that perfect decomposition into rich linguistic structures may not be required for accurate generation of new inflected forms. We will test this hypothesis by experimenting with a simple, unsupervised model of morphology that segments words into sequences of morphemes, assuming a (naïve) concatenative generation process and a single analysis per type.
Unsupervised morphological segmentation. We assume that each word can be decomposed into any number of prefixes, a stem, and any number of suffixes. Formally, we let M represent the set of all possible morphemes and define a regular grammar M*MM* (i.e., zero or more prefixes, a stem, and zero or more suffixes). To infer the decomposition structure for the words in the target language, we assume that the vocabulary was generated by the generative process given below.

We use blocked Gibbs sampling to sample segmentations for each word in the training vocabulary. Because of our particular choice of priors, it is possible to approximately decompose the posterior over the arcs of a compact finite-state machine. Sampling a segmentation or obtaining the most likely segmentation a posteriori then reduces to familiar FST operations. This model is reminiscent of work on learning morphology using adaptor grammars (Johnson et al., 2006; Johnson, 2008).

The inferred morphological grammar is very sensitive to the Dirichlet hyperparameters (αp, αs, ασ), and these are, in turn, sensitive to the number of types in the vocabulary. Using αp, αs ≪ ασ ≪ 1 tended to recover useful segmentations, but we have not yet been able to find reliable generic priors for these values. Therefore, we selected them empirically to obtain a stem vocabulary size on the parallel data that is one-to-one with English.⁴ Future work will involve a more direct method for specifying or inferring these values.

⁴ Our default starting point was to use αp = αs = 10⁻⁶, ασ = 10⁻⁴ and then to adjust all parameters by factors of 10.
The vocabulary is assumed to have been generated by the following process:

1. Sample morpheme distributions from symmetric Dirichlet distributions: θp ∼ Dir|M|(αp) for prefixes, θσ ∼ Dir|M|(ασ) for stems, and θs ∼ Dir|M|(αs) for suffixes.
2. Sample length distribution parameters λp ∼ Beta(βp, γp) for prefix sequences and λs ∼ Beta(βs, γs) for suffix sequences.
3. Sample a vocabulary by creating each word type w using the following steps:
   a) Sample affix sequence lengths: lp ∼ Geom(λp); ls ∼ Geom(λs).
   b) Sample lp prefixes p1, …, plp independently from θp; ls suffixes s1, …, sls independently from θs; and a stem σ ∼ θσ.
   c) Concatenate prefixes, the stem, and suffixes: w = p1 + … + plp + σ + s1 + … + sls.
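A toy simulation of step 3 of this generative process (our own illustration: fixed morpheme inventories and hand-picked λ values stand in for the sampled Dirichlet and Beta draws, and the stem is uppercased only for readability):

```python
import random

def sample_word(prefixes, stems, suffixes, lam_p, lam_s, rng):
    """Generate one word type: geometric numbers of prefixes/suffixes around a stem."""
    lp = 0
    while rng.random() < lam_p:  # Geometric(lam_p) number of prefixes
        lp += 1
    ls = 0
    while rng.random() < lam_s:  # Geometric(lam_s) number of suffixes
        ls += 1
    word = [rng.choice(prefixes) for _ in range(lp)]
    word.append(rng.choice(stems).upper())  # mark the stem for readability
    word += [rng.choice(suffixes) for _ in range(ls)]
    return "+".join(word)

rng = random.Random(1)
vocab = [sample_word(["wa", "ki", "ni"], ["pig", "imb"], ["a", "wa"], 0.5, 0.3, rng)
         for _ in range(5)]
print(vocab)
```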
4. Inflection Model Parameter Estimation

Estimating Parameters: To set the parameters W and V of the inflection prediction model, we use stochastic gradient descent to maximize the conditional log-likelihood of a training set consisting of pairs of source (English) sentence contextual features (φ) and target word inflectional features (ψ). When morphological category information is available, we train an independent model for each open-class category (in Russian: nouns, verbs, adjectives, numerals, adverbs); otherwise a single model is used for all words.
4. Inflection Model Parameter Estimation (cont'd)

Intrinsic Evaluation:
Table 2: Intrinsic evaluation of the inflection model (N: nouns, V: verbs, A: adjectives, M: numerals).

                                 acc.    ppl.   avg. inflections/stem
Supervised  Russian  N          64.1%   3.46    9.16
            Russian  V          63.7%   3.41   20.12
            Russian  A          51.5%   6.24   19.56
            Russian  M          73.0%   2.81    9.14
            Russian  average    63.1%   3.98   14.49
Unsup.      Russian  all        71.2%   2.15    4.73
Unsup.      Hebrew   all        85.5%   1.49    2.55
Unsup.      Swahili  all        78.2%   2.09   11.46
• the accuracy of predicting the inflection given the source, source context and target stem, and

• the inflection model perplexity on the same set of test instances.

Additionally, we report the average number of possible inflections for each stem, an upper bound to the perplexity that indicates the inherent difficulty of the task. The results of this evaluation are presented in Table 2 for the three language pairs considered. We remark on two patterns in these results. First, perplexity is substantially lower than the perplexity of a uniform model, indicating our model is overall quite effective at predicting inflections using source context only. Second, in the supervised Russian results, we see that predicting the inflections of adjectives is relatively more difficult than for other parts of speech. Since adjectives agree with the nouns they modify in gender and case, and gender is an idiosyncratic feature of Russian nouns (and therefore not directly predictable from the English source), this difficulty is unsurprising.

We can also inspect the weights learned by the model to assess the effectiveness of the features in relating source-context structure with target-side morphology. Such an analysis is presented in Fig. 3.
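The two measurements can be computed as in this small sketch (our own illustration; the input layout, a gold inflection paired with a predicted probability distribution over candidate inflections, is an assumption):

```python
import math

def evaluate_inflections(instances):
    """instances: list of (gold_inflection, {inflection: probability}) pairs.
    Returns (accuracy, perplexity) of the predicted distributions."""
    correct = 0
    log_prob_sum = 0.0
    for gold, dist in instances:
        if max(dist, key=dist.get) == gold:
            correct += 1
        # Clamp with a floor so a zero-probability gold does not crash the log.
        log_prob_sum += math.log2(dist.get(gold, 1e-12))
    accuracy = correct / len(instances)
    perplexity = 2 ** (-log_prob_sum / len(instances))
    return accuracy, perplexity

# Two toy instances: one correct prediction, one wrong.
acc, ppl = evaluate_inflections([
    ("gen.sg", {"gen.sg": 0.5, "nom.sg": 0.5}),
    ("nom.pl", {"nom.pl": 0.25, "gen.pl": 0.75}),
])
print(acc, ppl)
```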
4. Inflection Model Parameter Estimation (cont'd)

Feature Ablation:
4.2 Feature Ablation

Our inflection model makes use of numerous feature types. Table 3 explores the effect of removing different kinds of (source) features from the model, evaluated on predicting Russian inflections using supervised morphological grammars.⁶ Rows 2–3 show the effect of removing either linear or dependency context. We see that both are necessary for good performance; however, removing dependency context substantially degrades performance of the model (we interpret this result as evidence that Russian morphological inflection captures grammatical relationships that would be expressed structurally in English). The bottom four rows explore the effect of source language word representation. The results indicate that lexical features are important for accurate prediction of inflection, and that POS tags and Brown clusters are likewise important, but they seem to capture similar information (removing one has little impact, but removing both substantially degrades performance).

⁶ The models used in the feature ablation experiment were trained on fewer examples, resulting in overall lower accuracies than seen in Table 2, but the pattern of results is the relevant datapoint here.

Table 3: Feature ablation experiments using supervised Russian classification.

Features (φ(e, i))              acc.
all                             54.7%
−linear context                 52.7%
−dependency context             44.4%
−POS tags                       54.5%
−Brown clusters                 54.5%
−POS tags, −Brown clusters      50.9%
−lexical items                  51.2%

5 Synthetic Phrases

We turn now to translation; recall that our translate-and-inflect model is used to augment the set of rules available to a conventional statistical machine translation decoder. We refer to the phrases it produces as synthetic phrases.

Our baseline system is a standard hierarchical phrase-based translation model (Chiang, 2007). Following Lopez (2007), the training data is compiled into an efficient binary representation which allows extraction of sentence-specific grammars just before decoding. In our case, this also allows the creation of synthetic inflected phrases that are produced conditioning on the sentence to translate.

To generate these synthetic phrases with new inflections possibly unseen in the parallel training
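An ablation of the kind reported in Table 3 can be simulated by dropping feature groups from the training instances before refitting the model. This is our own sketch; the group prefixes (`pos:`, `cluster:`, `lex:`) are hypothetical names, not the paper's feature identifiers.

```python
def ablate(instances, dropped_prefixes):
    """Remove every feature whose name starts with one of the dropped prefixes.
    instances: list of (feature_dict, gold_label) pairs."""
    dropped = tuple(dropped_prefixes)
    return [({f: v for f, v in phi.items() if not f.startswith(dropped)}, gold)
            for phi, gold in instances]

instances = [({"pos:VBD": 1, "cluster:0110": 1, "lex:saw": 1}, "past")]
# Remove both POS tags and Brown clusters, as in the penultimate row of Table 3:
print(ablate(instances, ["pos:", "cluster:"]))
```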
5. Synthetic Phrases (cont'd)
Figure 3: Examples of highly weighted features learned by the inflection model. We selected a few frequent morphological features and show their top corresponding source context features.

Russian (supervised)
  Verb: 1st person                         child(nsubj)=I, child(nsubj)=we
  Verb: future tense                       child(aux)=MD, child(aux)=will
  Noun: animate                            source=animals/victims/...
  Noun: feminine gender                    source=obama/economy/...
  Noun: dative case                        parent(iobj)
  Adjective: genitive case                 grandparent(poss)

Hebrew
  Suffix ים (masculine plural)             parent=NNS, after=NNS
  Prefix א (first person sing. + future)   child(nsubj)=I, child(aux)='ll
  Prefix כ (preposition like/as)           child(prep)=IN, parent=as
  Suffix י (possessive mark)               before=my, child(poss)=my
  Suffix ה (feminine mark)                 child(nsubj)=she, before=she
  Prefix כש (when)                         before=when, before=WRB

Swahili
  Prefix li (past)                         source=VBD, source=VBN
  Prefix nita (1st person sing. + future)  child(aux), child(nsubj)=I
  Prefix ana (3rd person sing. + present)  source=VBZ
  Prefix wa (3rd person plural)            before=they, child(nsubj)=NNS
  Suffix tu (1st person plural)            child(nsubj)=she, before=she
  Prefix ha (negative tense)               source=no, after=not
data, we first construct an additional phrase-based translation model on the parallel corpus preprocessed to replace inflected surface words with their stems. We then extract a set of non-gappy phrases for each sentence (e.g., X → ⟨attempted, пытаться_V⟩). The target side of each such phrase is re-inflected, conditioned on the source sentence, using the inflection model from §2. Each stem is given its most likely inflection.⁷

The original features extracted for the stemmed phrase are conserved, and the following features are added to help the decoder select good synthetic phrases:

• a binary feature indicating that the phrase is synthetic,
• the log-probability of the inflected forms according to our model,
• the count of words that have been inflected, with a separate feature for each morphological category in the supervised case.

Finally, these synthetic phrases are combined with the original translation rules obtained for the baseline system to produce an extended sentence-specific grammar which is used as input to the decoder. If a phrase already existing in the standard phrase table happens to be recreated, both phrases are kept and will compete with each other with different features in the decoder.

For example, for the large EN→RU system, 6% of all the rules used for translation are synthetic phrases, with 65% of these phrases being entirely new rules.

⁷ Several reviewers asked about what happens when k-best inflections are added. The results for k ∈ {2, 4, 8} range from no effect to an improvement over k = 1 of about 0.2 BLEU (absolute). We hypothesize that larger values of k could have a greater impact, perhaps in a more "global" model of the target string; however, exploration of this question is beyond the scope of this paper.
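A sketch of how a re-inflected phrase might be packaged with the extra features listed above. The data layout and the name `make_synthetic_phrase` are our assumptions, not the paper's code; only the three feature types come from the text.

```python
import math

def make_synthetic_phrase(source, stemmed_target, inflected_target, inflection_probs):
    """Build a synthetic phrase entry carrying the decoder features described above."""
    # Count how many target words were actually changed by re-inflection.
    n_inflected = sum(1 for s, t in zip(stemmed_target, inflected_target) if s != t)
    features = {
        "IsSynthetic": 1.0,  # binary indicator that the phrase is synthetic
        "InflectionLogProb": sum(math.log(p) for p in inflection_probs),
        "InflectedWordCount": float(n_inflected),
    }
    return {"source": source, "target": " ".join(inflected_target), "features": features}

phrase = make_synthetic_phrase(
    source="attempted",
    stemmed_target=["пытаться"],
    inflected_target=["пытался"],  # the model's most likely inflection of the stem
    inflection_probs=[0.7],
)
print(phrase["features"]["InflectedWordCount"])  # 1.0
```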
6 Translation Experiments

We evaluate our approach in the standard discriminative MT framework. We use cdec (Dyer et al., 2010) as our decoder and perform MIRA training to learn feature weights of the sentence translation model (Chiang, 2012). We compare the following configurations:

• A baseline system, using a 4-gram language model trained on the entire monolingual and bilingual data available.

• An enriched system with a class-based n-gram language model⁸ trained on the monolingual data mapped to 600 Brown clusters. Class-based language modeling is a strong baseline for scenarios with high out-of-vocabulary rates but in which large amounts of monolingual target-language data are available.

• The enriched system further augmented with our inflected synthetic phrases. We expect the class-based language model to be especially helpful here and to capture some basic agreement patterns that can be learned more easily on dense clusters than from plain word sequences.

⁸ For Swahili and Hebrew, n = 6; for Russian, n = 7.
6. Translation Experiments

Experimental Set-Up:
• Comparing:
  - A baseline system (standard Hiero), using a 4-gram LM
  - An enriched system with a class-based n-gram LM trained on the monolingual data mapped to 600 Brown clusters
  - The enriched system further augmented with the inflected synthetic phrases

• Dataset:
  - Russian: News Commentary parallel corpus and additional monolingual data crawled from news websites
  - Hebrew: transcribed TED talks and additional monolingual news data
  - Swahili: parallel corpus obtained by crawling the Global Voices project website and additional monolingual data taken from the Helsinki Corpus of Swahili
6. Translation Experiments (cont'd)

Translation Results:
Detailed corpus statistics are given in Table 1:

• The Russian data consist of the News Commentary parallel corpus and additional monolingual data crawled from news websites.⁹
• The Hebrew parallel corpus is composed of transcribed TED talks (Cettolo et al., 2012). Additional monolingual news data is also used.
• The Swahili parallel corpus was obtained by crawling the Global Voices project website¹⁰ for parallel articles. Additional monolingual data was taken from the Helsinki Corpus of Swahili.¹¹

We evaluate translation quality by translating and measuring the BLEU score of a 2000–3000 sentence-long evaluation corpus, averaging the results over 3 MIRA runs to control for optimizer instability (Clark et al., 2011). Table 4 reports the results. For all languages, using class language models improves over the baseline. When synthetic phrases are added, significant additional improvements are obtained. For the English–Russian language pair, where both supervised and unsupervised analyses can be obtained, we notice that expert-crafted morphological analyzers are more efficient at improving translation quality. Globally, the amount of improvement observed varies depending on the language; this is most likely indicative of the quality of unsupervised morphological segmentations produced and the kinds of grammatical relations expressed morphologically.

Finally, to confirm the effectiveness of our approach as corpus size increases, we use our technique on top of a state-of-the-art English–Russian system trained on data from the 8th ACL Workshop on Machine Translation (30M words of bilingual text and 410M words of monolingual text). The setup is identical except for the addition of sparse

⁹ http://www.statmt.org/wmt13/translation-task.html
¹⁰ http://sw.globalvoicesonline.org
¹¹ http://www.aakkl.helsinki.fi/cameel/corpus/intro.htm
Table 4: Translation quality (measured by BLEU) aver-aged over 3 MIRA runs.
EN�RU EN�HE EN�SWBaseline 14.7±0.1 15.8±0.3 18.3±0.1
+Class LM 15.7±0.1 16.8±0.4 18.7±0.2
+Syntheticunsupervised 16.2±0.1 17.6±0.1 19.0±0.1
supervised 16.7±0.1 — —
rule shape indicator features and bigram cluster fea-tures. In these large scale conditions, the BLEU scoreimproves from 18.8 to 19.6 with the addition of wordclusters and reaches 20.0 with synthetic phrases.Details regarding this system are reported in Ammaret al. (2013).
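The evaluation protocol (averaging BLEU over 3 MIRA tuning runs to control for optimizer instability) amounts to reporting a mean and a spread, as in the ±-annotated cells of Table 4. A minimal sketch with made-up scores, not the paper's numbers:

```python
import statistics

# BLEU scores from 3 independent MIRA tuning runs (invented numbers).
# Averaging controls for optimizer instability (Clark et al., 2011).
runs = [16.1, 16.2, 16.3]

mean = statistics.mean(runs)      # central tendency across runs
spread = statistics.stdev(runs)   # run-to-run variation

print(f"{mean:.1f}±{spread:.1f}")  # reported as mean ± std, e.g. 16.2±0.1
```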
7 Related Work

Translation into morphologically rich languages is a widely studied problem and there is a tremendous amount of related work. Our technique of synthesizing translation options to improve generation of inflected forms is closely related to the factored translation approach proposed by Koehn and Hoang (2007); however, an important difference to that work is that we use a discriminative model that conditions on source context to make “local” decisions about what inflections may be used before combining the phrases into a complete sentence translation.

Combination pre-/post-processing solutions are also frequently proposed. In these, the target language is generally transformed from multi-morphemic surface words into smaller units more amenable to direct translation, and then a post-processing step is applied independent of the translation model. For example, Oflazer and El-Kahlout (2007) experiment with partial morpheme groupings to produce novel inflected forms when translating into Turkish; Al-Haj and Lavie (2010) compare different processing schemes for Arabic. A related but different approach is to enrich the source language items with grammatical features (e.g., a source sentence like John saw Mary is preprocessed into, e.g., John+subj saw+msubj+fobj Mary+obj) so as to make the source and target lexicons have similar morphological contrasts (Avramidis and Koehn, 2008; Yeniterzi and Oflazer, 2010; Chang et al.,
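The source-enrichment idea described above (John saw Mary → John+subj saw+msubj+fobj Mary+obj) can be sketched as a tiny preprocessing step. In practice the grammatical roles come from a parser; here they are hard-coded toy values for the one example sentence.

```python
# Sketch of source-side enrichment (Avramidis and Koehn, 2008):
# append grammatical-feature tags to source words so the English side
# exposes distinctions the target morphology needs. The role lookup
# below is hand-written for illustration; a real system would derive
# these tags from a syntactic parse.
roles = {"John": ["subj"], "saw": ["msubj", "fobj"], "Mary": ["obj"]}

def enrich(tokens, roles):
    """Append +tag suffixes to each token that has grammatical roles."""
    return " ".join(
        tok + "".join("+" + r for r in roles.get(tok, []))
        for tok in tokens
    )

print(enrich(["John", "saw", "Mary"], roles))
# John+subj saw+msubj+fobj Mary+obj
```

The enriched source now distinguishes, e.g., subject from object occurrences of the same English word, so the target lexicon's morphological contrasts have a source-side counterpart to align with.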
7. Conclusion
The paper
l presents an efficient technique that exploits morphologically analyzed corpora to produce new inflections possibly unseen in the bilingual training data
• This method decomposes into two simple independent steps involving well-understood discriminative models
l also achieves language independence by exploiting unsupervised morphological segmentation in the absence of linguistically informed morphological analyses
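The two-step decomposition in the conclusion (translate to a meaning-bearing stem, then choose an inflection conditioned on source context) can be sketched with a toy discriminative inflection scorer. The stem table, candidate inflections, feature names, and weights below are all invented for illustration; the paper's actual model uses rich source-context features with learned weights.

```python
import math

# Step 1: a toy phrase table maps an English word to a target stem
# (hypothetical EN -> RU entry; "kot" = "cat").
stem_table = {"cat": "kot"}

# Step 2: candidate inflections of the stem, scored by a linear model
# over source-context features. Weights here are invented, not learned.
inflections = {"kot": ["kot", "kota", "kotu"]}
weights = {("src_dep=dobj", "kota"): 2.0, ("src_dep=dobj", "kot"): 0.5}

def inflect(stem, src_features):
    """Pick the most probable inflection via a softmax over feature scores."""
    scores = {
        infl: sum(weights.get((f, infl), 0.0) for f in src_features)
        for infl in inflections[stem]
    }
    z = sum(math.exp(s) for s in scores.values())
    probs = {infl: math.exp(s) / z for infl, s in scores.items()}
    return max(probs, key=probs.get)

# The English word is a direct object, so the accusative-like form wins.
print(inflect(stem_table["cat"], ["src_dep=dobj"]))  # kota
```

Because each inflection decision conditions only on local source context, many alternative inflected phrases can be synthesized cheaply and the global translation model is left to choose among them.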