
Dynamic Past and Future for Neural Machine Translation

Zaixiang Zheng, Nanjing University
[email protected]

Shujian Huang, Nanjing University
[email protected]

Zhaopeng Tu, Tencent AI Lab
[email protected]

Xin-Yu Dai, Nanjing University
[email protected]

Jiajun Chen, Nanjing University
[email protected]

Abstract

Previous studies have shown that neural machine translation (NMT) models can benefit from explicitly modeling translated (PAST) and untranslated (FUTURE) source contents as recurrent states (Zheng et al., 2018). However, this less interpretable recurrent process hinders its power to model the dynamic updating of PAST and FUTURE contents during decoding. In this paper, we propose to model these dynamics by explicitly separating source words into groups of translated and untranslated contents through parts-to-wholes assignment. The assignment is learned through a novel variant of the routing-by-agreement mechanism (Sabour et al., 2017), namely Guided Dynamic Routing, where the translating status at each decoding step guides the routing process to assign each source word to its associated group (i.e., translated or untranslated content) represented by a capsule, enabling translation to be made from a holistic context. Experiments show that our approach achieves substantial improvements over both RNMT and Transformer by producing more adequate translations. Extensive analysis demonstrates that our method is highly interpretable and is able to recognize the translated and untranslated contents as expected.¹

1 Introduction

Neural machine translation (NMT) generally adopts an attentive encoder-decoder framework (Sutskever et al., 2014; Vaswani et al., 2017), where the encoder maps a source sentence into a sequence of contextual representations (source contents), and the decoder generates a target sentence word by word based on the part of the source content assigned by an attention model (Bahdanau et al., 2015).

¹ Code is released at https://github.com/zhengzx-nlp/dynamic-nmt.

Like human translators, NMT systems should have the ability to know the relevant source-side context for the current word (PRESENT), as well as to recognize what parts of the source contents have been translated (PAST) and what parts have not (FUTURE), at each decoding step. Accordingly, PAST, PRESENT and FUTURE are three dynamically changing states during the whole translation process.

Previous studies have shown that NMT models are likely to suffer from inadequate translation (Kong et al., 2019), which usually manifests as over- and under-translation problems (Tu et al., 2016, 2017). This issue may be attributed to the poor ability of NMT to recognize the dynamically changing translated and untranslated contents. To remedy this, Zheng et al. (2018) first demonstrate that explicitly tracking the PAST and FUTURE contents helps NMT models alleviate this issue and generate better translations. In their work, the running PAST and FUTURE contents are modeled as recurrent states. However, it remains non-trivial to determine from this recurrent process which parts of the source words are the PAST and which are the FUTURE, and to what extent the recurrent states represent them; this less interpretable nature is probably not the best way to model and exploit the dynamic PAST and FUTURE.

We argue that an explicit separation of the source words into two groups, representing PAST and FUTURE respectively (Figure 1), could be more beneficial not only for easy and direct recognition of the translated and untranslated source contents, but also for better interpretation of the model's recognition behavior. We formulate the explicit separation as a procedure of parts-to-wholes assignment: the representation of each source word (a part) should be assigned to its associated group of either PAST or FUTURE (the wholes).



Figure 1: An example of the separation of PAST and FUTURE in machine translation. When generating the current translation "his", the source tokens "〈BOS〉", "布什 (Bush)" and the phrase "为...辩护 (defend)" are the translated contents (PAST), while the remaining tokens are untranslated contents (FUTURE).

In this paper, we implement this idea using Capsule Network (Hinton et al., 2011) with the routing-by-agreement mechanism (Sabour et al., 2017), which has demonstrated an appealing strength in solving the parts-to-wholes assignment problem (Hinton et al., 2018; Gong et al., 2018; Dou et al., 2019; Li et al., 2019), to model the separation of the PAST and FUTURE:

1. We first cast the PAST and FUTURE source contents as two groups of capsules.

2. We then design a novel variant of the routing-by-agreement mechanism, called Guided Dynamic Routing (GDR), which is guided by the current translating status at each decoding step to assign each source word to its associated capsules via assignment probabilities over several routing iterations.

3. Finally, the PAST and FUTURE capsules accumulate their expected contents from the source representations and are fed into the decoder to provide a time-dependent holistic view of context for the prediction.

In addition, two auxiliary learning signals facilitate GDR in acquiring the expected functionality, rather than relying solely on implicit learning within the training process of the NMT model.

We conducted extensive experiments and analysis to verify the effectiveness of our proposed model. Experiments on Chinese-to-English, English-to-German, and English-to-Romanian show consistent and substantial improvements over the Transformer (Vaswani et al., 2017) and RNMT (Bahdanau et al., 2015). Visualized evidence shows that our approach does acquire the expected ability to separate the source words into PAST and FUTURE, which is highly interpretable. We also observe that our model does alleviate the inadequate translation problem: human subjective evaluation reveals that our model produces more adequate and higher-quality translations than the Transformer. Length analysis with respect to source sentences shows that our model generates not only longer but also better translations.

2 Neural Machine Translation

Neural models for sequence-to-sequence tasks such as machine translation often adopt an encoder-decoder framework. Given a source sentence x = 〈x_1, ..., x_I〉, an NMT model learns to predict a target sentence y = 〈y_1, ..., y_T〉 by maximizing the conditional probability

p(y \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x).

Specifically, an encoder first maps the source sentence into a sequence of encoded representations:

h = \langle h_1, \ldots, h_I \rangle = f_e(x),   (1)

where f_e is the encoder's transformation function. Given the encoded representations of the source words, a decoder generates the sequence of target words y autoregressively:

z_t = f_d(y_{<t}, a_t),   (2)

p(y_t \mid y_{<t}, x) = \mathrm{softmax}(E(y_t)^\top z_t),   (3)

where E(y_t) is the embedding of y_t. The current word is predicted based on the decoder state z_t. f_d is the transformation function of the decoder, which determines z_t based on the target translation trajectory y_{<t} and the lexical-level source content a_t that is most relevant to the PRESENT translation, obtained by an attention model (Bahdanau et al., 2015). Ideally, with all the source encoded representations from the encoder, NMT models should be able to update the translated and untranslated source contents and keep track of them. However, most existing NMT models lack an explicit mechanism to maintain the translated and untranslated contents, failing to distinguish whether each source word belongs to the PAST or the FUTURE (Zheng et al., 2018), and are thus likely to suffer from a severe inadequate translation problem (Tu et al., 2016; Kong et al., 2019).
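To make the factorization above concrete, the following minimal sketch wires Eqs. (1)-(3) into a greedy decoding loop. The encoder_fn, decoder_fn, attention_fn and embed_fn callables and the output embedding matrix E_out are hypothetical placeholders for illustration, not the paper's implementation.

```python
# A minimal sketch of the encoder-decoder factorization in Eqs. (1)-(3).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def greedy_decode(x_tokens, encoder_fn, decoder_fn, attention_fn, embed_fn,
                  E_out, bos_id, eos_id, max_len=50):
    """E_out: output embedding matrix of shape [vocab, d] (hypothetical)."""
    h = encoder_fn(x_tokens)                 # Eq. (1): h = <h_1, ..., h_I>
    y = [bos_id]
    for _ in range(max_len):
        a_t = attention_fn(embed_fn(y), h)   # source context for the PRESENT step
        z_t = decoder_fn(embed_fn(y), a_t)   # Eq. (2): decoder state z_t
        p_t = softmax(E_out @ z_t)           # Eq. (3): p(y_t | y_<t, x)
        y_t = int(p_t.argmax())
        y.append(y_t)
        if y_t == eos_id:
            break
    return y
```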

3 Approach

Motivation  Our intuition is straightforward: if we could tell the translated and untranslated source contents apart by directly separating the source words into PAST and FUTURE categories at each decoding step, the PRESENT translation could benefit from a dynamically holistic context (i.e., PAST + PRESENT + FUTURE). To this end, we need a mechanism by which each word can be recognized and assigned to a distinct category, i.e., PAST or FUTURE contents, subject to the translation status at present. This procedure can be seen as a parts-to-wholes assignment, in which the encoder hidden states of the source words (parts) are assigned to either PAST or FUTURE (wholes).

Capsule networks (Hinton et al., 2011) have shown their capability of solving the problem of assigning parts to wholes (Sabour et al., 2017). A capsule is a vector of neurons that represents different properties of the same entity from the input (Sabour et al., 2017). The functionality relies on a fast iterative process called routing-by-agreement, whose basic idea is to iteratively refine the proportion of how much a part should be assigned to a whole, based on the agreement between the part and the whole (Dou et al., 2019). Therefore, it is appealing to investigate whether this mechanism can be employed to realize our intuition.

3.1 Guided Dynamic Routing (GDR)

Dynamic routing (Sabour et al., 2017) is an implementation of routing-by-agreement which runs intrinsically without any external guidance. However, what we expect is a mechanism driven by the decoding status at present. Here we propose a variant of the dynamic routing mechanism called Guided Dynamic Routing (GDR), where the routing process is guided by the translating information at each decoding step (Figure 2).

Formally, we cast the source encoded representations h of the I source words as input capsules, and denote by Ω the output capsules, which consist of J entries. Initially, we assume that J/2 of them (Ω^P) represent the PAST contents, and the remaining J/2 capsules (Ω^F) represent the FUTURE:

\Omega^P = \langle \Omega^P_1, \cdots, \Omega^P_{J/2} \rangle, \quad \Omega^F = \langle \Omega^F_1, \cdots, \Omega^F_{J/2} \rangle,

where each capsule is represented by a d_c-dimensional vector. We assemble these PAST and FUTURE capsules together, where they are expected to compete for source information, i.e., we now have Ω = Ω^P ∪ Ω^F. We describe how to teach these capsules to retrieve their relevant parts from the source contents in Section 3.3. Note that we employ GDR at every decoding step t to obtain the time-dependent PAST and FUTURE, and omit the subscript t for simplicity.

Figure 2: Illustration of Guided Dynamic Routing. The current decoder hidden state z_t guides the routing of the encoder's representations of the source words into the PAST capsules Ω^P and the FUTURE capsules Ω^F.

In the dynamic routing process, the vector output of each capsule j is calculated with a non-linear squashing function (Sabour et al., 2017):

\Omega_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \frac{s_j}{\|s_j\|}, \quad s_j = \sum_{i=1}^{I} c_{ij} v_{ij},   (4)

where s_j is the accumulated input of capsule Ω_j, which is a weighted sum over all vote vectors v_ij. The vote vector v_ij is transformed from the input capsule h_i:

v_{ij} = W_j h_i,   (5)

where W_j ∈ R^{d×d_c} is a trainable matrix for the j-th output capsule². c_ij is the assignment probability (i.e., the agreement) determined by the iterative dynamic routing. The assignment probabilities c_i· associated with each input capsule h_i sum to 1, i.e., Σ_j c_ij = 1, and are computed by:

c_{ij} = \mathrm{softmax}(b_{ij}),   (6)

where the routing logit b_ij, initialized to 0, measures the degree to which h_i should be sent to Ω_j. The initial assignment probabilities are then iteratively updated by measuring the agreement between the vote vector v_ij and the capsule Ω_j with an MLP, considering the current decoding state z_t:

b_{ij} \leftarrow b_{ij} + w^\top \tanh(W_b [z_t; v_{ij}; \Omega_j]),   (7)

where W_b ∈ R^{d_c×(d+2d_c)} and w ∈ R^{d_c} are learnable parameters. Instead of using the simple scalar product b_ij = v_ij^⊤ Ω_j (Sabour et al., 2017), which cannot take the current decoding state into account as a conditioning signal, we resort to this MLP to incorporate z_t, inspired by MLP-based attention mechanisms (Bahdanau et al., 2015; Luong et al., 2015). That is why we call it "guided" dynamic routing.

² Note that unlike Sabour et al. (2017), where each pair of input capsule i and output capsule j has a distinct transformation matrix W_ij as their numbers are predefined (I × J transformation matrices in total), here we share the transformation matrix W_j of output capsule j among all the input capsules due to the varying number of source words. So there are J transformation matrices in our model.


Algorithm 1 Guided Dynamic Routing (GDR)
Input: Encoder hidden states h, current decoding hidden state z_t, and number of routing iterations r.
Output: PAST, FUTURE, and redundant capsules.
procedure GDR(h, z_t, r)
1: ∀i ∈ h, j ∈ Ω: b_ij ← 0, v_ij ← W_j h_i    ▷ Initialize routing logits and vote vectors
2: for r iterations do
3:   ∀i ∈ h, j ∈ Ω: compute assignment probabilities c_ij by Eq. 6
4:   ∀j ∈ Ω: compute capsules Ω_j by Eq. 4
5:   ∀i ∈ h, j ∈ Ω: update routing logits b_ij by Eq. 7
6: end for
7: [Ω^P; Ω^F; Ω^R] = Ω    ▷ Split into past, future, and redundant capsules
8: return Ω^P, Ω^F, Ω^R

Now, with awareness of the current decoding status, the hidden state (input capsule) of a source word prefers to send its representation to the output capsules that have large routing agreements with it. After a few iterations, the output capsules are able to ignore all but the most relevant information from the source hidden states, each representing a distinct aspect of either PAST or FUTURE.

Redundant Capsules  In some cases, parts of the source sentence may belong to neither the past contents nor the future contents. For example, function words in English (e.g., "the") may not find counterpart translations in Chinese. Therefore, we add additional Redundant Capsules Ω^R (also known as "orphan capsules" in Sabour et al. (2017)), which are expected to receive higher routing assignment probabilities when a source word should belong to neither PAST nor FUTURE.

We show the algorithm of our guided dynamic routing in Algorithm 1.
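As a concrete illustration of Algorithm 1, here is a minimal NumPy sketch of the routing loop. The parameter shapes (W for Eq. 5, W_b and w for Eq. 7) are assumptions for illustration, not the released implementation.

```python
# A minimal NumPy sketch of Guided Dynamic Routing (Algorithm 1, Eqs. 4-7).
import numpy as np

def squash(s):
    # Eq. (4): non-linear squashing of the accumulated capsule input.
    norm2 = np.sum(s * s, axis=-1, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + 1e-9)

def guided_dynamic_routing(h, z_t, W, W_b, w, r=3):
    """
    h:   [I, d]         encoder states (input capsules)
    z_t: [d]            current decoder state (the guide)
    W:   [J, dc, d]     per-output-capsule transformation matrices (Eq. 5)
    W_b: [dc, d + 2*dc] MLP matrix for the guided agreement (Eq. 7)
    w:   [dc]           MLP output vector (Eq. 7)
    Returns the J output capsules, each of size dc.
    """
    I, J = h.shape[0], W.shape[0]
    v = np.einsum('jcd,id->ijc', W, h)           # vote vectors v_ij, [I, J, dc]
    b = np.zeros((I, J))                         # routing logits, initialized to 0
    omega = None
    for _ in range(r):
        b_stable = b - b.max(axis=-1, keepdims=True)
        c = np.exp(b_stable)
        c = c / c.sum(axis=-1, keepdims=True)    # Eq. (6): softmax over output capsules j
        s = np.einsum('ij,ijc->jc', c, v)        # weighted sum of votes
        omega = squash(s)                        # Eq. (4): capsules, [J, dc]
        # Eq. (7): update logits with an MLP over [z_t; v_ij; Omega_j].
        feat = np.concatenate(
            [np.broadcast_to(z_t, (I, J) + z_t.shape),
             v,
             np.broadcast_to(omega, (I,) + omega.shape)], axis=-1)
        b = b + np.tanh(feat @ W_b.T) @ w
    return omega
```

The caller then splits the returned capsules into the past, future, and redundant groups, as in line 7 of Algorithm 1.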

3.2 Integrating into NMT

The proposed GDR can be applied on top of any sequence-to-sequence architecture without requiring any specific modification. Let us take a Transformer-style architecture as an example (Figure 3). Given a sentence x = 〈x_1, ..., x_I〉, the encoder leverages N stacked identical layers to map the sentence into contextual representations:

h^{(l)} = \mathrm{EncoderLayer}(h^{(l-1)}),

where the superscript l indicates the layer depth. Based on the encoded source representations h^{(N)}, a decoder generates the translation word by word.

Figure 3: Illustration of our architecture.

The decoder also has N stacked identical layers:

z^{(l)} = \mathrm{DecoderLayer}(z^{(l-1)}, a^{(l)}),

a^{(l)} = \mathrm{Attention}(z^{(l-1)}, h^{(N)}),

where a^{(l)} is the lexical-level source context assigned by an attention mechanism between the current decoder layer and the last encoder layer. Given the hidden states of the last decoder layer z^{(N)}, we perform our proposed guided dynamic routing (GDR) mechanism to compute the PAST and FUTURE contents from the source side and obtain the holistic context at each decoding step:

\Omega^P, \Omega^F, \Omega^R = \mathrm{GDR}(z^{(N)}, h^{(N)}),

o = \mathrm{FeedForward}(z^{(N)}, \Omega^P, \Omega^F) + z^{(N)},

where o = 〈o_1, ..., o_T〉 is the sequence of holistic contexts over decoding steps. Based on the holistic context, the output probabilities are computed as:

p(y_t \mid y_{<t}, x) = \mathrm{softmax}(g(o_t)).

The NMT model is now able to employ the dynamic holistic context for better generation.
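The sketch below shows one way the per-step GDR output could be combined with the last decoder layer, reusing the guided_dynamic_routing() sketch from Section 3.1. The feed-forward combination, the residual connection, and the final softmax follow the equations above; the concrete layer shapes and the tanh nonlinearity are assumptions, not the released implementation.

```python
# A sketch of folding the GDR output into the decoder's output layer (Section 3.2).
import numpy as np

def decode_with_holistic_context(z, h, params, n_past, n_future):
    """
    z: [T, d]  last-layer decoder states z^(N)
    h: [I, d]  last-layer encoder states h^(N)
    params: dict with GDR weights ('W', 'W_b', 'w'), feed-forward weights
            ('W_ff' of shape [d, d + (n_past + n_future)*dc], 'b_ff' of shape [d])
            and the output projection ('W_out' of shape [vocab, d]).
    """
    outputs = []
    for t in range(z.shape[0]):
        omega = guided_dynamic_routing(h, z[t], params['W'],
                                       params['W_b'], params['w'], r=3)
        omega_p = omega[:n_past].reshape(-1)                    # PAST capsules
        omega_f = omega[n_past:n_past + n_future].reshape(-1)   # FUTURE capsules
        # o_t = FeedForward(z_t, Omega^P, Omega^F) + z_t (residual connection)
        ff_in = np.concatenate([z[t], omega_p, omega_f])
        o_t = np.tanh(params['W_ff'] @ ff_in + params['b_ff']) + z[t]
        outputs.append(o_t)
    o = np.stack(outputs)                                       # [T, d]
    logits = o @ params['W_out'].T                              # g(o_t)
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    return probs                                                # p(y_t | ., x)
```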

3.3 Learning PAST and FUTURE as Expected

Auxiliary Guided Losses  To ensure that the dynamic routing process runs as expected, we introduce the following auxiliary guided signals to assist the learning process.

Bag-of-Words Constraint  Weng et al. (2017) propose a multitask scheme to boost NMT by predicting the bag-of-words of the target sentence using the Word Predictions approach. Inspired by this work, we introduce a BOW constraint to encourage the PAST and FUTURE capsules to be predictive of the preceding and subsequent bags-of-words at each decoding step, respectively:

L_{\mathrm{BOW}} = \frac{1}{T} \sum_{t=0}^{T} \Big( -\log p_{\mathrm{PRE}}(y_{\le t} \mid \Omega^P_t) - \log p_{\mathrm{SUB}}(y_{\ge t} \mid \Omega^F_t) \Big),

where p_PRE(y_{≤t} | Ω^P_t) and p_SUB(y_{≥t} | Ω^F_t) are the predicted probabilities of the preceding and subsequent bags-of-words at decoding step t, respectively. For instance, the probabilities of the preceding bag-of-words are computed by:

p_{\mathrm{PRE}}(y_{<t} \mid \Omega^P_t) = \prod_{\tau \in [1, t]} p_{\mathrm{PRE}}(y_\tau \mid \Omega^P_t) \propto \prod_{\tau \in [1, t]} \exp\big(E(y_\tau)^\top W^P_{\mathrm{BOW}} \Omega^P_t\big).

The computation of p_SUB(y_{≥t} | Ω^F_t) is similar. By applying the BOW constraint, the PAST and FUTURE capsules learn to reflect the target-side past and future bag-of-words information.
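A minimal sketch of the BOW constraint follows, assuming hypothetical projection matrices W_p and W_f standing in for W^P_BOW and W^F_BOW; the bilinear scoring follows the formula above, while the softmax normalization is an assumption about how the proportional scores are turned into probabilities.

```python
# A NumPy sketch of the bag-of-words constraint L_BOW.
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def bow_loss(omega_p, omega_f, y, E, W_p, W_f):
    """
    omega_p, omega_f: [T, dc]  PAST / FUTURE capsule summaries per step
    y: list[int] of length T   target token ids
    E: [vocab, d_emb]          target embeddings
    W_p, W_f: [d_emb, dc]      bilinear projections (hypothetical shapes)
    """
    T = len(y)
    loss = 0.0
    for t in range(T):
        logp_pre = log_softmax(E @ (W_p @ omega_p[t]))   # scores over the vocabulary
        logp_sub = log_softmax(E @ (W_f @ omega_f[t]))
        loss += -sum(logp_pre[tok] for tok in y[:t + 1])  # preceding words y_<=t
        loss += -sum(logp_sub[tok] for tok in y[t:])      # subsequent words y_>=t
    return loss / T
```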

Bilingual Content Agreement  Intuitively, the translated source contents should be semantically equivalent to the translated target contents, and so should the untranslated contents. Thus, a natural idea is to encourage the source PAST contents, modeled by the PAST capsules, to be close to the target PAST representation at each decoding step, and the same for the FUTURE. Hence, we introduce a Bilingual Content Agreement (BCA) that requires the bilingual semantically equivalent contents to be predictive of each other through a mean squared error (MSE) loss:

L_{\mathrm{BCA}} = \frac{1}{T} \sum_{t=1}^{T} \Big\| \Omega^P_t - W^P_{\mathrm{BCA}} \Big( \frac{1}{t} \sum_{\tau=1}^{t} z_\tau \Big) \Big\|^2 + \Big\| \Omega^F_t - W^F_{\mathrm{BCA}} \Big( \frac{1}{T-t+1} \sum_{\tau=t}^{T} z_\tau \Big) \Big\|^2,

where the target-side past information is represented by the average of the decoder hidden states of all preceding words, while the average of the subsequent decoder hidden states represents the target-side future information.
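A minimal sketch of the BCA loss, mirroring the equation above; the projection shapes of W^P_BCA and W^F_BCA are assumptions for illustration.

```python
# A NumPy sketch of the bilingual content agreement loss L_BCA.
import numpy as np

def bca_loss(omega_p, omega_f, z, W_p_bca, W_f_bca):
    """
    omega_p, omega_f: [T, dc]  PAST / FUTURE capsule summaries per step
    z: [T, d]                  decoder hidden states
    W_p_bca, W_f_bca: [dc, d]  projections of the target-side averages
    """
    T = z.shape[0]
    loss = 0.0
    for t in range(1, T + 1):
        past_avg = z[:t].mean(axis=0)        # average of z_1 .. z_t
        future_avg = z[t - 1:].mean(axis=0)  # average of z_t .. z_T
        loss += np.sum((omega_p[t - 1] - W_p_bca @ past_avg) ** 2)
        loss += np.sum((omega_f[t - 1] - W_f_bca @ future_avg) ** 2)
    return loss / T
```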

Training  Given a dataset of parallel training examples {〈x^{(m)}, y^{(m)}〉}_{m=1}^{M}, the model parameters are trained by minimizing the loss L(θ), where θ is the set of all parameters of the proposed model:

L(\theta) = \frac{1}{M} \sum_{m=1}^{M} \Big( -\log p(y^{(m)} \mid x^{(m)}) + \lambda_1 \cdot L_{\mathrm{BOW}} + \lambda_2 \cdot L_{\mathrm{BCA}} \Big),

where λ_1 and λ_2 are hyper-parameters.
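For completeness, a small sketch of how per-example losses could be combined into the overall objective L(θ); the per-sentence loss values are assumed to be computed elsewhere (e.g., by the bow_loss and bca_loss sketches above), so this is only an illustration of the weighting.

```python
# A sketch of the overall training objective L(theta).
def training_loss(batch_nll, batch_l_bow, batch_l_bca, lambda_1=1.0, lambda_2=1.0):
    """batch_*: per-sentence loss values for the M examples in a minibatch."""
    M = len(batch_nll)
    return sum(nll + lambda_1 * lb + lambda_2 * lc
               for nll, lb, lc in zip(batch_nll, batch_l_bow, batch_l_bca)) / M
```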

4 Experiment

We mainly evaluated our approach on the widely used NIST Chinese-to-English (Zh-En) translation task. We also conducted translation experiments on WMT14 English-to-German (En-De) and WMT16 English-to-Romanian (En-Ro):

1. NIST Zh-En. The training data consists of 1.09 million sentence pairs extracted from LDC³. We used NIST MT03 as the development set (Dev) and MT04, MT05, MT06 as the test sets.

2. WMT14 En-De. The training data consists of 4.5 million sentence pairs from the WMT14 news translation task. We used newstest2013 as the development set and newstest2014 as the test set.

3. WMT16 En-Ro. The training data consists of 0.6 million sentence pairs from the WMT16 news translation task. We used newstest2015 as the development set and newstest2016 as the test set.

We used the Transformer base configuration (Vaswani et al., 2017) for all the models. We ran the dynamic routing for r = 3 iterations. The dimension d_c of a single capsule is 256. Either the PAST or FUTURE content was represented by J/2 = 2 capsules. Our proposed models were trained on top of pre-trained baseline models⁴. λ_1 and λ_2 in the training objective were set to 1. We provide details of the training settings in the Appendix.

4.1 NIST Zh-En Translation

We list the results of our experiments on the NIST Zh-En task in Table 1 for two different architectures, i.e., Transformer and RNMT. As we can see, all of our models substantially outperform the baselines in terms of the averaged BLEU score over all the test sets. Among them, our best model achieves 45.65 BLEU based on the Transformer architecture. We also find that the redundant capsules are helpful: discarding them leads to a 0.35 BLEU degradation (45.65 vs. 45.30).

³ The corpora include LDC2002E18, LDC2003E07, LDC2003E14, the Hansards portion of LDC2004T07, LDC2004T08, and LDC2005T06.

⁴ Pre-training is only for efficiency purposes. Our approach can also learn from scratch.


Model                          | |θ|   | vtrain | vtest | Dev   | MT04  | MT05  | MT06  | Tests Avg.
Transformer                    | 66.1m | 1.00×  | 1.00× | 45.83 | 46.66 | 43.36 | 42.17 | 44.06
GDR                            | 68.9m | 0.77×  | 0.94× | 46.50 | 47.03 | 45.50 | 42.21 | 44.91 (+0.75)
  + LBOW                       | 69.2m | 0.70×  | 0.94× | 47.12 | 48.09 | 45.98 | 42.68 | 45.58 (+1.42)
  + LBCA                       | 69.4m | 0.75×  | 0.94× | 46.86 | 48.00 | 45.67 | 42.62 | 45.43 (+1.37)
  + LBOW + LBCA [OURS]         | 69.7m | 0.67×  | 0.94× | 47.52 | 48.13 | 45.98 | 42.85 | 45.65 (+1.59)
OURS - redundant capsules      | 68.7m | 0.69×  | 0.94× | 47.20 | 47.82 | 45.59 | 42.51 | 45.30 (+1.24)
RNMT                           | 50.2m | 1.00×  | 1.00× | 35.98 | 37.85 | 36.12 | 35.86 | 36.61
  + PFRNN (Zheng et al., 2018) | N/A   | 0.54×  | 0.74× | 37.90 | 40.37 | 36.75 | 36.44 | 37.85 (+1.24)
  + AOL (Kong et al., 2019)    | N/A   | 0.57×  | 1.00× | 37.61 | 40.05 | 37.58 | 36.87 | 38.16 (+1.55)
OURS                           | 53.9m | 0.62×  | 0.90× | 38.10 | 40.87 | 37.50 | 37.00 | 38.45 (+1.84)

Table 1: Experiment results on the NIST Zh-En task, including the number of parameters (|θ|, excluding word embeddings), training/testing speeds (vtrain/vtest), and translation results in case-insensitive BLEU.

Architectures  Our approach shows consistent gains on both the Transformer and RNMT architectures. In comparison to the Transformer baseline, our model achieves up to a +1.59 BLEU improvement (45.65 vs. 44.06), and a +1.84 BLEU improvement over the RNMT baseline (38.45 vs. 36.61). These results indicate the compatibility of our approach with different architectures.

Auxiliary Guided Losses  Both auxiliary guided losses help our model learn better. The BOW constraint leads to a +0.67 improvement over the vanilla GDR, while the benefit is +0.62 for BCA. Combining both yields the largest margin (+0.84), which means that they complement each other.

Efficiency  To examine the efficiency of the proposed approach, we also list the relative speeds of both training and testing. Our approach trains at 0.67× the speed of the Transformer baseline; however, it does not hurt the testing speed much (0.94×). Because most of the extra computation in the training phase comes from the softmax operations of the BOW losses, the degradation in testing efficiency is moderate.

Comparison to Other Work  For the experiments on the RNMT architecture, we list two related works. Zheng et al. (2018) use extra PAST and FUTURE RNNs to capture the translated and untranslated contents recurrently (PFRNN), while Kong et al. (2019) directly leverage translation adequacy as a learning reward in their proposed Adequacy-oriented Learning (AOL). Compared to them, our model also enjoys competitive improvements due to the explicit separation of source contents. In addition, PFRNN is non-trivial to adapt to the Transformer, because it requires a recurrent process that is incompatible with the parallel training of the Transformer, sacrificing the Transformer's efficiency advantage.

Model                              | En-De | En-Ro
GNMT+RL (Wu et al., 2016)          | 24.6  | N/A
ConvS2S (Gehring et al., 2017)     | 25.2  | 29.88
Transformer (Vaswani et al., 2017) | 27.3  | N/A
  + AOL (Kong et al., 2019)        | 28.01 | N/A
Transformer (Gu et al., 2017)      | N/A   | 31.91
Transformer                        | 27.14 | 32.10
OURS                               | 28.10 | 32.96

Table 2: Case-sensitive BLEU on WMT14 En-De and WMT16 En-Ro tasks.

4.2 WMT En-De and En-Ro Translation

We evaluated our approach on the WMT14 En-De and WMT16 En-Ro tasks. As shown in Table 2, our reproduced Transformer baseline systems are close to the state-of-the-art results in previous work, which guarantees the comparability of our experiments. The results show the same trend of consistent improvements as on the NIST Zh-En task, on both the WMT14 En-De (+0.96 BLEU) and WMT16 En-Ro (+0.86 BLEU) benchmarks. We also list the results of other published research for comparison, and our model outperforms the previous results on both language pairs. Note that our approach also surpasses Kong et al. (2019) on the WMT14 En-De task. These experiments demonstrate the effectiveness of our approach across different language pairs.

4.3 Analysis and Discussion

Our model learns PAST and FUTURE.  We visualize the assignment probabilities in the last routing iteration (Figure 4). Interestingly, there is a clear trend that the assignment probabilities to the PAST capsules gradually rise, while those to the FUTURE capsules drop to around zero. This phenomenon is consistent with the intuition that the translated contents should aggregate and the untranslated contents should decline (Zheng et al., 2018). The assignment weights of a specific word change from FUTURE to PAST after it is generated. These pieces of evidence give a strong verification that our GDR mechanism has indeed learned to distinguish the PAST and FUTURE contents on the source side.

Figure 4: Visualization of the assignment probabilities of iterative routing. Each sub-heatmap is associated with a target word, where the left column shows the probabilities of each source word routing to the PAST capsules, and the right one to the FUTURE capsules. Examples in the red frame indicate the changes before and after the generation of the central word. We omit the assignment probabilities associated with the redundant capsules for simplicity. For instance, after the target word "defended" was generated, the assignment probabilities of its source translation "辩护" changed from FUTURE to PAST. Results for "Bush", "his", "revive" and "economy" are similar, except for an adverse case ("plan").

Moreover, we measure how well our capsules accumulate the expected contents by comparing the BOW predictions with the ground-truth target words. Accordingly, the top-5 overlap rates (r_OL) for predicting preceding and subsequent words are defined as follows, respectively:

r^P_{OL} = \frac{1}{T} \sum_{t=1}^{T} \frac{|\mathrm{Top5}_{t}(p_{\mathrm{pre}}(\Omega^P_t)) \cap y_{\le t}|}{|y_{\le t}|}, \quad r^F_{OL} = \frac{1}{T} \sum_{t=1}^{T} \frac{|\mathrm{Top5}_{(T-t)}(p_{\mathrm{sub}}(\Omega^F_t)) \cap y_{\ge t}|}{|y_{\ge t}|}.

The PAST capsules achieve an r^P_{OL} of 0.72, while the FUTURE capsules achieve an r^F_{OL} of 0.70. The results indicate that the capsules can predict the corresponding words to a certain extent, which implies that the capsules contain the expected information of the PAST or FUTURE contents.

Translations become better and more adequate.  To validate the translation adequacy of our model, we use the Coverage Difference Ratio (CDR) proposed by Kong et al. (2019), i.e., CDR = 1 - |C_ref \ C_gen| / |C_ref|, where C_ref and C_gen are the sets of source words covered by the reference and the translation, respectively. The CDR reflects translation adequacy by comparing the source coverage of the reference and the translation. As shown in Table 3, our approach achieves a better CDR than the Transformer baseline, which indicates superior translation adequacy.

Model       | Transformer | OURS
CDR         | 0.73        | 0.79
HUMAN EVALUATION
QUALITY     | 4.39±.11    | 4.66±.10
OVER (%)    | 0.03±.01    | 0.01±.01
UNDER (%)   | 3.83±.97    | 2.41±.80

Table 3: Evaluation of translation quality and adequacy. For the HUMAN evaluation, we asked three evaluators to score translations of 100 source sentences, randomly sampled from the test sets and produced by anonymized systems: the QUALITY from 1 to 5 (higher is better), and the proportions of source words concerning OVER- and UNDER-translation, respectively.
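As a side note, the CDR metric itself is simple to compute once the covered source-word sets are available. The sketch below treats C_ref and C_gen as plain token sets and leaves how source coverage is obtained (e.g., via word alignment) to the evaluation setup, so it is an illustration rather than the exact procedure of Kong et al. (2019).

```python
# A small sketch of the Coverage Difference Ratio (CDR) defined above.
def coverage_difference_ratio(c_ref, c_gen):
    """c_ref, c_gen: sets of source words covered by the reference / translation."""
    c_ref, c_gen = set(c_ref), set(c_gen)
    return 1.0 - len(c_ref - c_gen) / len(c_ref)

# Example: the reference covers 5 source words, the system translation misses one.
# coverage_difference_ratio({'a', 'b', 'c', 'd', 'e'}, {'a', 'b', 'c', 'd'}) == 0.8
```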

Following Zheng et al. (2018), we also conduct subjective evaluations to validate the benefit of modeling PAST and FUTURE (the last three rows of Table 3). Surprisingly, we find that the modern NMT model, i.e., the Transformer, rarely produces over-translation but still suffers from under-translation. Our model obtains the highest human rating on translation quality while substantially alleviating the under-translation problem compared to the Transformer.

Longer sentences benefit much more.  We report the comparison with respect to sentence length (Figure 5). In all the length intervals, our model generates better (Figure 5b) and longer (Figure 5a) translations. Interestingly, our approach gains a larger improvement when the input sentences become longer, which are commonly thought hard to translate. We attribute this to the smaller number of under-translation cases in our model, meaning that our model achieves better translation quality and adequacy, especially for long sentences.

Figure 5: Comparison regarding source length. (a) Translation length vs. source length. (b) BLEU vs. source length.

Does guided dynamic routing really matter?  Despite the promising numbers of GDR and the auxiliary guided losses, a straightforward question arises: would other, simpler models also work if they were just equipped with the guided losses to recognize PAST and FUTURE contents? In other words, does the proposed guided dynamic routing really matter?

To answer this question, we integrate the proposed auxiliary losses into two simple baselines to guide the recognition of past and future: an MLP classifier model (CLF) that determines whether a source word is a past word, and otherwise a future word⁵; and an attention-based model (ATTN) that uses two individual attention modules to retrieve the past or future parts from the source words. As shown in Figure 6, surprisingly, the simple baselines do obtain improvements, emphasizing the contribution of the proposed guided losses, while a considerable gap remains between them and our model. In fact, CLF is essentially a one-iteration variant of GDR, and iterative refinement over multiple iterations is necessary and effective⁶. Moreover, the attention mechanism is designed for feature pooling and is not suitable for parts-to-wholes assignment⁷. These experiments reveal that our guided dynamic routing is a better choice to model and exploit the dynamic PAST and FUTURE.

⁵ CLF is a 3-way classifier that computes the probabilities p^P(x_i), p^F(x_i) and p^R(x_i) (which sum to 1) as past, future and redundant weights, similar to Equation 6. The PAST and FUTURE representations are computed by weighted summation, similar to Equation 4.

⁶ See the Appendix for an analysis of the number of iterations.

⁷ Consider an extreme case where, at the end of translation, there is no FUTURE content left, but the attention model still produces a weighted average over all the source representations, which is meaningless. In contrast, GDR is able to assign zero probabilities to the FUTURE capsules, addressing the source of the problem.

Figure 6: Comparison with simple baselines equipped with the same auxiliary guided losses on NIST Zh-En.

5 Related Work

The inadequate translation problem is a widely known weakness of NMT models, especially when translating long sentences (Kong et al., 2019; Tu et al., 2016; Lei et al., 2019). To alleviate this problem, one direction is to recognize the translated and untranslated contents and pay more attention to the untranslated parts. Tu et al. (2016), Mi et al. (2016) and Li et al. (2018) employ a coverage vector or coverage ratio to indicate the lexical-level coverage of source words. Meng et al. (2018) influence the attentive vectors with translated/untranslated information. Our work mainly follows the path of Zheng et al. (2018), who introduce two extra recurrent layers in the decoder to maintain the representations of the past and future translation contents. However, it is not easy to show the direct correspondence between the source contents and the learned representations in the past/future RNN layers, nor is that approach compatible with the state-of-the-art Transformer, since the additional recurrences prevent the Transformer decoder from being parallelized.

Another direction is to introduce global representations. Lin et al. (2018) model a global source representation with deconvolution networks. Xia et al. (2017), Zhang et al. (2018) and Geng et al. (2018) propose to provide a holistic view of the target sentence by multi-pass decoding. Zhou et al. (2019) extend Zhang et al. (2018) to a synchronous bidirectional decoding fashion. Similarly, Weng et al. (2019) deploy bidirectional decoding in an interactive translation setting. Different from these works, which aim at providing static global information for the whole translation process, our approach models a dynamically global (holistic) context by using capsule networks to separate the source contents at every decoding step.

Other efforts explore exploiting future hints. Serdyuk et al. (2018) design a Twin Regularization to encourage the hidden states in the forward decoder RNN to estimate the representations of a backward RNN. Weng et al. (2017) require the decoder states to not only generate the current word but also predict the remaining untranslated words. Actor-critic algorithms are employed to predict future properties (Li et al., 2017; Bahdanau et al., 2017; He et al., 2017) by estimating the future rewards for decision making. Kong et al. (2019) propose a policy-gradient-based adequacy-oriented approach to improve translation adequacy. These methods use future information only at the training stage, while our model can also exploit past and future information at inference, which provides accessible clues about the translated and untranslated contents.

Capsule networks (Hinton et al., 2011) and their associated assignment policies of dynamic routing (Sabour et al., 2017) and EM routing (Hinton et al., 2018) aim at addressing the limited expressive ability of parts-to-wholes assignment in computer vision. In the natural language processing community, however, capsule networks have not yet been widely investigated. Zhao et al. (2018) test capsule networks on text classification, and Gong et al. (2018) propose to aggregate a sequence of vectors via dynamic routing for sequence encoding. Dou et al. (2019) first propose to employ capsule networks in NMT, using the routing-by-agreement mechanism for layer representation aggregation. Wang (2019) develops a constant-time NMT model using capsule networks. These studies mainly use capsule networks for information aggregation, where the capsules could have a less interpretable meaning. In contrast, our model learns what we expect with the aid of auxiliary learning signals, which endows it with better interpretability.

6 Conclusion

In this paper, we propose to recognize the translated PAST and untranslated FUTURE contents via parts-to-wholes assignment in neural machine translation. We propose guided dynamic routing, a novel mechanism that explicitly separates source words into PAST and FUTURE, guided by the PRESENT target decoding status at each decoding step. We empirically demonstrate that such an explicit separation of source contents benefits neural machine translation, with considerable and consistent improvements on three language pairs. Extensive analysis shows that our approach learns to model the PAST and FUTURE as expected and alleviates the inadequate translation problem. It would be interesting to apply our approach to other sequence-to-sequence tasks, e.g., text summarization (as discussed in the Appendix).

Acknowledgement

We would like to thank the anonymous reviewers for their insightful comments. Shujian Huang is the corresponding author. This work is supported by the National Science Foundation of China (No. U1836221 and No. 61772261) and the Jiangsu Provincial Research Foundation for Basic Research (No. BK20170074).

References

Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. An actor-critic algorithm for sequence prediction. In ICLR.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Ziyi Dou, Zhaopeng Tu, Xing Wang, Longyue Wang, Shuming Shi, and Tong Zhang. 2019. Dynamic layer aggregation for neural machine translation. In AAAI.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann Dauphin. 2017. Convolutional sequence to sequence learning. In ICML.

Xinwei Geng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2018. Adaptive multi-pass decoder for neural machine translation. In EMNLP, pages 523–532.

Jingjing Gong, Xipeng Qiu, Shaojing Wang, and Xuanjing Huang. 2018. Information aggregation via dynamic routing for sequence encoding. In COLING, pages 2742–2752.

Jiatao Gu, James Bradbury, Caiming Xiong, Victor O. K. Li, and Richard Socher. 2017. Non-autoregressive neural machine translation.

Di He, Hanqing Lu, Yingce Xia, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2017. Decoding with value networks for neural machine translation. In NIPS, pages 178–187.

Geoffrey E. Hinton, Alex Krizhevsky, and Sida D. Wang. 2011. Transforming auto-encoders. In ICANN, pages 44–51. Springer.

Geoffrey E. Hinton, Sara Sabour, and Nicholas Frosst. 2018. Matrix capsules with EM routing. In ICLR.

Xiang Kong, Zhaopeng Tu, Shuming Shi, Eduard Hovy, and Tong Zhang. 2019. Neural machine translation with adequacy-oriented learning. In AAAI.

Wenqiang Lei, Weiwen Xu, Ai Ti Aw, Yuanxin Xiang, and Tat-Seng Chua. 2019. Revisit automatic error detection for wrong and missing translation – a supervised approach. In EMNLP.

Jian Li, Baosong Yang, Zi-Yi Dou, Xing Wang, Michael R. Lyu, and Zhaopeng Tu. 2019. Information aggregation for multi-head attention with routing-by-agreement. In NAACL-HLT, pages 3566–3575.

Jiwei Li, Will Monroe, and Dan Jurafsky. 2017. Learning to decode for future success. arXiv preprint arXiv:1701.06549.

Yanyang Li, Tong Xiao, Yinqiao Li, Qiang Wang, Changming Xu, and Jingbo Zhu. 2018. A simple and effective approach to coverage-aware neural machine translation. In ACL, volume 2, pages 292–297.

Junyang Lin, Xu Sun, Xuancheng Ren, Shuming Ma, Jinsong Su, and Qi Su. 2018. Deconvolution-based global decoding for neural machine translation. In COLING, pages 3260–3271.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP.

Fandong Meng, Zhaopeng Tu, Yong Cheng, Haiyang Wu, Junjie Zhai, Yuekui Yang, and Di Wang. 2018. Neural machine translation with key-value memory-augmented attention. In IJCAI, pages 2574–2580. AAAI Press.

Haitao Mi, Baskaran Sankaran, Zhiguo Wang, and Abe Ittycheriah. 2016. Coverage embedding models for neural machine translation. In EMNLP.

Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. 2017. Dynamic routing between capsules. In NIPS, pages 3856–3866.

Dmitriy Serdyuk, Nan Rosemary Ke, Alessandro Sordoni, Adam Trischler, Chris Pal, and Yoshua Bengio. 2018. Twin networks: Matching the future for sequence generation.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.

Zhaopeng Tu, Yang Liu, Lifeng Shang, Xiaohua Liu, and Hang Li. 2017. Neural machine translation with reconstruction. In AAAI.

Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In ACL.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.

Mingxuan Wang. 2019. Towards linear time neural machine translation with capsule networks. In EMNLP.

Rongxiang Weng, Shujian Huang, Zaixiang Zheng, Xin-Yu Dai, and Jiajun Chen. 2017. Neural machine translation with word predictions. In EMNLP.

Rongxiang Weng, Hao Zhou, Shujian Huang, Lei Li, Yifan Xia, and Jiajun Chen. 2019. Correct-and-memorize: Learning to translate from interactive revisions. In IJCAI.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin, Tao Qin, Nenghai Yu, and Tie-Yan Liu. 2017. Deliberation networks: Sequence generation beyond one-pass decoding. In NIPS, pages 1784–1794.

Xiangwen Zhang, Jinsong Su, Yue Qin, Yang Liu, Rongrong Ji, and Hongji Wang. 2018. Asynchronous bidirectional decoding for neural machine translation. In AAAI.

Wei Zhao, Jianbo Ye, Min Yang, Zeyang Lei, Suofei Zhang, and Zhou Zhao. 2018. Investigating capsule networks with dynamic routing for text classification. In AAAI.

Zaixiang Zheng, Hao Zhou, Shujian Huang, Lili Mou, Xinyu Dai, Jiajun Chen, and Zhaopeng Tu. 2018. Modeling past and future for neural machine translation. TACL, 6:145–157.

Long Zhou, Jiajun Zhang, and Chengqing Zong. 2019. Synchronous bidirectional neural machine translation. TACL, 7:91–105.

