Modeling Past and Future for Neural Machine Translation

Existing neural machine translation systems do not explicitly model what has been translated and what has not during the decoding phase. To address this problem, we propose a novel mechanism that separates the source information into two parts: translated Past contents and untranslated Future contents, which are modeled by two additional recurrent layers. The Past and Future contents are fed to both the attention model and the decoder states, which provides Neural Machine Translation (NMT) systems with the knowledge of translated and untranslated contents. Experimental results show that the proposed approach significantly improves the performance in Chinese-English, German-English, and English-German translation tasks. Specifically, the proposed model outperforms the conventional coverage model in terms of both the translation quality and the alignment error rate.


Introduction
Neural machine translation (NMT) generally adopts an encoder-decoder framework (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014), where the encoder summarizes the source sentence into a source context vector, and the decoder generates the target sentence word by word based on the given source. During translation, the decoder implicitly serves several functionalities at the same time: (1) building a language model over the target sentence for translation fluency (LM); (2) acquiring the most relevant source-side information to generate the current target word (PRESENT); and (3) maintaining which parts of the source have been translated (PAST) and which have not (FUTURE). However, it may be difficult for a single recurrent neural network (RNN) decoder to accomplish these functionalities simultaneously. A recent successful extension of NMT models is the attention mechanism (Bahdanau et al., 2015; Luong et al., 2015), which makes a soft selection over source words and yields an attentive vector representing the most relevant source parts for the current decoding state. In this sense, the attention mechanism separates the PRESENT functionality from the decoder RNN, achieving a significant performance improvement.
In addition to PRESENT, we address the importance of modeling PAST and FUTURE contents in machine translation. The PAST contents indicate translated information, whereas the FUTURE contents indicate untranslated information, both being crucial to NMT models, especially to avoid undertranslation and over-translation (Tu et al., 2016). Ideally, PAST grows and FUTURE declines during the translation process. However, it may be difficult for a single RNN to explicitly model the above processes.
In this paper, we propose a novel neural machine translation system that explicitly models the PAST and FUTURE contents with two additional RNN layers. The RNN modeling the PAST contents (called the PAST layer) starts from scratch and accumulates the information being translated at each decoding step (i.e., the PRESENT information yielded by attention). The RNN modeling the FUTURE contents (called the FUTURE layer) begins with a holistic source summarization, and subtracts the PRESENT information at each step. The two processes are guided by proposed auxiliary objectives. Intuitively, the RNN state of the PAST layer corresponds to the source contents that have been translated up to a particular step, and the RNN state of the FUTURE layer corresponds to the source contents of untranslated words. At each decoding step, PAST and FUTURE together provide a full summarization of the source information. We then feed the PAST and FUTURE information to both the attention model and the decoder states. In this way, our proposed mechanism not only provides coverage information for the attention model, but also gives a holistic view of the source information at each step.
We conducted experiments on Chinese-English, German-English, and English-German benchmarks. Experiments show that the proposed mechanism yields BLEU improvements of 2.7, 1.7, and 1.1 points on the three tasks, respectively. In addition, it obtains an alignment error rate of 35.90%, significantly lower than that of the baseline (39.73%) and the coverage model (38.73%) of Tu et al. (2016). We observe that in traditional attention-based NMT, most errors occur due to over- and under-translation, probably because the decoder RNN fails to keep track of what has been translated and what has not. Our model alleviates such problems by explicitly modeling the PAST and FUTURE contents.

Motivation
In this section, we first introduce the standard attention-based NMT, and then motivate our model by several empirical findings.
The attention mechanism, proposed in Bahdanau et al. (2015), yields a dynamic source context vector for the translation at a particular decoding step, modeling PRESENT information as described in Section 1. This process is illustrated in Figure 1.
Formally, let x = {x_1, ..., x_I} be a given input sentence. The encoder RNN, generally implemented as a bi-directional RNN (Schuster and Paliwal, 1997), transforms the sentence to a sequence of source annotations h = {h_1, ..., h_I}. Based on the source annotations, another decoder RNN generates the translation by predicting a target word y_t at each time step t:

P(y_t | y_{<t}, x) = g(y_{t-1}, s_t, c_t),    (1)

where g(.) is a non-linear activation, and s_t is the decoding state for time step t, computed by

s_t = f(s_{t-1}, y_{t-1}, c_t).    (2)

Here f(.) is an RNN activation function, e.g., the Gated Recurrent Unit (Cho et al., 2014, GRU) or Long Short-Term Memory (Hochreiter and Schmidhuber, 1997, LSTM). c_t is a vector summarizing relevant source information, computed as a weighted sum of the source annotations:

c_t = \sum_{i=1}^{I} \alpha_{t,i} h_i,    (3)

where the weights \alpha_{t,i} (for i = 1, ..., I) are given by the attention mechanism:

\alpha_{t,i} = softmax(a(s_{t-1}, h_i)).    (4)

Here, a(.) is a scoring function measuring the degree to which the decoding state and the source information match each other. Intuitively, the attention-based decoder selects the source annotations that are most relevant to the decoder state, based on which the current target word is predicted. In other words, c_t is the source information for the PRESENT translation.
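As an illustration, the attention computation above (scoring, normalization, and weighted sum) can be sketched in NumPy. The additive scoring function and all dimensions below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(s_prev, h, Wa, Ua, va):
    """Additive attention (Bahdanau et al., 2015), as a sketch.
    s_prev: previous decoder state, shape (d,)
    h:      source annotations, shape (I, d)
    Returns the context vector c_t and the attention weights alpha_t."""
    # a(s_{t-1}, h_i) = v_a^T tanh(W_a s_{t-1} + U_a h_i)
    scores = np.tanh(s_prev @ Wa + h @ Ua) @ va   # shape (I,)
    alpha = softmax(scores)                       # normalized attention weights
    c_t = alpha @ h                               # weighted sum of annotations
    return c_t, alpha

# toy example: 5 source positions, hidden size 4 (arbitrary choices)
rng = np.random.default_rng(0)
I, d = 5, 4
h = rng.normal(size=(I, d))
s_prev = rng.normal(size=d)
Wa, Ua, va = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
c_t, alpha = attention(s_prev, h, Wa, Ua, va)
```

The weights form a proper distribution over source positions, so c_t always lies in the convex hull of the annotations.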
The decoder RNN is initialized with a summarization of the entire source sentence [\overrightarrow{h}_I; \overleftarrow{h}_1], given by

s_0 = tanh(W_s [\overrightarrow{h}_I; \overleftarrow{h}_1]).    (5)

After analyzing existing attention-based NMT in detail, our intuition arises as follows. Ideally, with the source summarization in mind, after generating each target word y_t from the source contents c_t, the decoder should keep track of (1) the translated source contents, by accumulating c_t, and (2) the untranslated source contents, by subtracting c_t from the source summarization. However, such information is not well learned in practice, as there is no explicit mechanism to maintain translated and untranslated contents. Evidence shows that attention-based NMT still suffers from serious over- and under-translation problems (Tu et al., 2016; Tu et al., 2017b). Examples of under-translation are shown in Table 1a.
Another piece of evidence also suggests that the decoder may lack a holistic view of the source information, as explained below. We conduct a pilot experiment in which we remove the initialization of the RNN decoder. If the "holistic" context were well exploited by the decoder, translation performance would decrease significantly without the initialization. As shown in Table 1b, however, translation performance decreases only slightly after we remove the initialization. This indicates that NMT decoders do not make full use of the source summarization; the initialization only helps the prediction at the beginning of the sentence. We attribute the vanishing of such signals to the overloaded use of decoder states (e.g., the LM, PAST, and FUTURE functionalities), and hence we propose to explicitly model the holistic source summarization with PAST and FUTURE contents at each decoding step.

Related Work
Our research is built upon an attention-based sequence-to-sequence model (Bahdanau et al., 2015), but is also related to coverage modeling, future modeling, and functionality separation. We discuss these topics in the following.
Coverage Modeling. Tu et al. (2016) and Mi et al. (2016) maintain a coverage vector to indicate which source words have been translated and which source words have not. These vectors are updated by accumulating attention probabilities at each decoding step, which provides an opportunity for the attention model to distinguish translated source words from untranslated ones. Viewing coverage vectors as a (soft) indicator of translated source contents, we take one step further following this idea. We model translated and untranslated source contents by directly manipulating the attention vector (i.e., the source contents that are being translated) instead of attention probability (i.e., the probability of a source word being translated).
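For concreteness, the simplest linguistic form of such a coverage vector just accumulates the attention distribution at each decoding step; the toy attention history below is fabricated for illustration:

```python
import numpy as np

def update_coverage(coverage, alpha):
    """Simplest form of a coverage vector in the spirit of Tu et al. (2016):
    accumulate the attention probabilities at each decoding step."""
    return coverage + alpha

# three decoding steps over a 4-word source sentence (toy numbers)
coverage = np.zeros(4)
attn_history = [np.array([0.7, 0.1, 0.1, 0.1]),
                np.array([0.1, 0.8, 0.05, 0.05]),
                np.array([0.0, 0.1, 0.6, 0.3])]
for alpha in attn_history:
    coverage = update_coverage(coverage, alpha)
# a source word with low accumulated coverage is likely still untranslated
```

After three steps, each entry of `coverage` records how much attention mass a source word has received, which is exactly the (soft) "translated so far" signal the attention model is given.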
In addition, we explicitly model both translated contents (with the PAST-RNN) and untranslated contents (with the FUTURE-RNN), instead of using a single coverage vector to indicate translated source words. Another difference from Tu et al. (2016) is that the PAST and FUTURE contents in our model are fed not only to the attention mechanism but also to the decoder's states.
In the context of semantic-level coverage, prior work proposes a memory-enhanced decoder, and Meng et al. (2016) propose a memory-enhanced attention model. Both implement the memory with a Neural Turing Machine (Graves et al., 2014), in which the reading and writing operations are expected to erase translated contents and highlight untranslated contents. However, these models lack an explicit objective to guide such intuition, which is one of the key ingredients for the success of this work. In addition, we use two separate layers to explicitly model translated and untranslated contents, which is another distinguishing feature of the proposed approach.
Future Modeling. Standard neural sequence decoders generate target sentences from left to right, and thus fail to estimate some desired properties of the future (e.g., the length of the target sentence). To address this problem, actor-critic algorithms have been employed to predict future properties (Bahdanau et al., 2017); in these models, an interpolation of the actor (the standard generation policy) and the critic (a value function that estimates future values) is used for decision making. Concerning future generation at each decoding step, Weng et al. (2017) guide the decoder's hidden states to not only generate the current target word, but also predict the target words that remain untranslated. Along this direction of future modeling, we introduce a FUTURE layer that maintains the untranslated source contents and is updated at each decoding step by subtracting the source content being translated (i.e., the attention vector) from the last state (i.e., the untranslated source content so far).
Functionality Separation. Recent work has revealed that the overloaded use of representations makes model training difficult, and that this problem can be alleviated by explicitly separating these functions (Reed and de Freitas, 2015; Ba et al., 2016; Miller et al., 2016; Gulcehre et al., 2016; Rocktäschel et al., 2017). For example, Miller et al. (2016) separate the functionality of look-up keys and memory contents in memory networks (Sukhbaatar et al., 2015). Rocktäschel et al. (2017) propose a key-value-predict attention model, which outputs three vectors at each step: the first is used to predict the next-word distribution; the second serves as the key for decoding; and the third is used by the attention mechanism. In this work, we further separate the PAST and FUTURE functionalities from the decoder's hidden representations.

Modeling PAST and FUTURE for Neural Machine Translation
In this section, we describe how to separate the PAST and FUTURE functions from the decoding states. We introduce two additional RNN layers (Figure 2):

• FUTURE Layer (Section 4.1): encodes the source contents to be translated.
• PAST Layer (Section 4.2): encodes the translated source contents.

Let us take y = {y_1, y_2, y_3, y_4} as an example target sentence. The initial state of the FUTURE layer is a summarization of the whole source sentence, indicating that all source contents need to be translated. The initial state of the PAST layer is an all-zero vector, indicating that no source content is yet translated.
After c_1 is obtained by the attention mechanism, we (1) update the FUTURE layer state by "subtracting" c_1 from the previous state, and (2) update the PAST layer state by "adding" c_1 to the previous state. The two RNN states are updated in this way at every step of generating y_1, y_2, y_3, and y_4. Thus, at each time step, the FUTURE layer encodes the source contents still to be translated in future steps, while the PAST layer encodes the source contents translated up to the current step.
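The dataflow just described can be sketched with the gated RNN updates replaced by plain vector addition and subtraction (a deliberate simplification for illustration; the actual model uses the GRU variants of Section 4.1.1, and the toy vectors below are fabricated):

```python
import numpy as np

# toy source summarization and per-step attention contexts c_1..c_4,
# chosen so the contexts exactly exhaust the source (an idealization)
source_summary = np.array([1.0, 2.0, 3.0])
contexts = [np.array([0.5, 0.5, 0.0]),
            np.array([0.5, 1.0, 0.5]),
            np.array([0.0, 0.5, 1.5]),
            np.array([0.0, 0.0, 1.0])]

past = np.zeros(3)               # s^P_0: nothing translated yet
future = source_summary.copy()   # s^F_0: everything still to translate

for c_t in contexts:
    past = past + c_t      # PAST layer "adds" the content being translated
    future = future - c_t  # FUTURE layer "subtracts" it
    # the invariant we would anticipate: PAST + FUTURE == HOLISTIC source
    assert np.allclose(past + future, source_summary)
```

With plain addition and subtraction the invariant holds exactly at every step; the gated variants relax it, which is why the auxiliary objectives of Section 5 are needed to keep the layers close to this behavior.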
The advantages of PAST and FUTURE layers are two-fold. First, they provide coverage information, which is fed to the attention model and guides NMT systems to pay more attention to untranslated source contents. Second, they provide a holistic view of the source information, since we would anticipate "PAST + FUTURE = HOLISTIC." We describe them in detail in the rest of this section.

Modeling FUTURE
Formally, the FUTURE layer is a recurrent neural network (the first gray layer in Figure 2), and its state at time step t is computed by

s^F_t = F(s^F_{t-1}, c_t),    (6)

where F is the activation function for the FUTURE layer. We design several variants of F, aiming to better model the expected subtraction, as described in Section 4.1.1. The FUTURE RNN is initialized with the summarization of the whole source sentence, as computed by Equation 5. When calculating the attention context at time step t, we feed the attention model with the FUTURE state from the last time step, which encodes the source contents still to be translated. We rewrite Equation 4 as

\alpha_{t,i} = softmax(a(s_{t-1}, h_i, s^F_{t-1})).    (7)

After obtaining the attention context c_t, we update the FUTURE state via Equation 6, and feed both of them to the decoder state:

s_t = f(s_{t-1}, y_{t-1}, c_t, s^F_t),    (8)

where c_t encodes the source context of the present translation, and s^F_t encodes the source context of future translations.

Activation Functions for Subtraction
We design several variants of RNN activation functions to better model the subtractive operation (Figure 3):

GRU.
A natural choice is the standard GRU, 1 which learns the subtraction directly from the data:

s^F_t = u_t ⊙ s^F_{t-1} + (1 - u_t) ⊙ \tilde{s}^F_t,    (9)
\tilde{s}^F_t = tanh(W c_t + U (r_t ⊙ s^F_{t-1})),    (10)
r_t = σ(W_r c_t + U_r s^F_{t-1}),    (11)
u_t = σ(W_u c_t + U_u s^F_{t-1}),    (12)

where r_t is a reset gate determining the combination of the input with the previous state, and u_t is an update gate defining how much of the previous state to keep. The standard GRU uses a feed-forward neural network (Equation 10) to model the subtraction without any explicit subtractive operation, which may make training difficult. In the following two variants, we provide the GRU with explicit subtraction operations, inspired by the well-known phenomenon that a minus operation can be applied to the semantics of word embeddings (Mikolov et al., 2013). 2 We therefore subtract the semantics being translated from the untranslated FUTURE contents at each decoding step.
GRU with Outside Minus (GRU-o). Instead of directly feeding c_t to the GRU, we compute the current untranslated contents M(s^F_{t-1}, c_t) with an explicit minus operation, and then feed it to the GRU:

s^F_t = GRU(s^F_{t-1}, M(s^F_{t-1}, c_t)),    (13)
M(s^F_{t-1}, c_t) = tanh(W_m s^F_{t-1} - U_m c_t).    (14)

GRU with Inside Minus (GRU-i). We can alternatively integrate a minus operation into the calculation of the candidate state \tilde{s}^F_t:

\tilde{s}^F_t = tanh(U s^F_{t-1} - W (r_t ⊙ c_t)).    (15)

Compared with Equation 10, the differences between GRU-i and the standard GRU are: (1) the minus operation is applied to produce the energy of the intermediate candidate state \tilde{s}^F_t; and (2) the reset gate r_t is used to control the amount of information flowing from the input instead of from the previous state s^F_{t-1}. Note that for both GRU-o and GRU-i, we leave enough freedom for the GRU to decide the extent to which the subtraction operations are integrated. In other words, the information subtraction is "soft."
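A minimal NumPy sketch of one GRU-i step may look as follows; the parameter names, shapes, and random initialization are illustrative assumptions rather than the paper's configuration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_i_step(s_prev, c, P):
    """One step of the GRU-i ("inside minus") variant, as a sketch.
    s_prev: previous FUTURE state s^F_{t-1};  c: attention context c_t.
    P: dict of weight matrices (names are hypothetical)."""
    r = sigmoid(P["Wr"] @ c + P["Ur"] @ s_prev)   # reset gate on the *input*
    u = sigmoid(P["Wu"] @ c + P["Uu"] @ s_prev)   # update gate
    # inside minus: the candidate state subtracts the gated input
    # from the transformed previous state
    s_tilde = np.tanh(P["U"] @ s_prev - P["W"] @ (r * c))
    return u * s_prev + (1.0 - u) * s_tilde

rng = np.random.default_rng(1)
d = 4
P = {k: rng.normal(scale=0.1, size=(d, d))
     for k in ["Wr", "Ur", "Wu", "Uu", "W", "U"]}
s = rng.normal(size=d)
s_next = gru_i_step(s, rng.normal(size=d), P)
```

The gating means the subtraction is "soft": with u close to 1 the previous state passes through unchanged, and with r close to 0 little of the input is subtracted.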

Modeling PAST
Formally, the PAST layer is another recurrent neural network (the second gray layer in Figure 2), and its state at time step t is calculated by

s^P_t = GRU(s^P_{t-1}, c_t).    (16)

Initially, s^P_0 is an all-zero vector, denoting that no source content is yet translated. We choose the GRU as the activation function for the PAST layer, since the internal structure of the GRU is in accord with the "addition" operation.
We feed the PAST state from the last time step to both the attention model and the decoder state:

\alpha_{t,i} = softmax(a(s_{t-1}, h_i, s^P_{t-1})),    (17)
s_t = f(s_{t-1}, y_{t-1}, c_t, s^P_{t-1}).    (18)

Modeling PAST and FUTURE
We integrate the PAST and FUTURE layers together in our final model (Figure 2):

\alpha_{t,i} = softmax(a(s_{t-1}, h_i, s^P_{t-1}, s^F_{t-1})),    (19)
s_t = f(s_{t-1}, y_{t-1}, c_t, s^P_{t-1}, s^F_t).    (20)

In this way, both the attention model and the decoder state are aware of what has been translated and what has not.

Learning
We introduce additional loss functions to estimate the semantic subtraction and addition, which guide the training of the FUTURE layer and PAST layer, respectively.
Loss Function for Subtraction. As described above, the FUTURE layer models the future semantics in a declining way: ∆^F_t = s^F_{t-1} - s^F_t ≈ c_t. Since the source and target sides contain equivalent semantic information in machine translation (Tu et al., 2017a), c_t ≈ E(y_t), we directly measure the consistency between ∆^F_t and E(y_t), which guides the subtraction to learn the right thing:

loss(∆^F_t, y_t) = -log P(y_t | ∆^F_t),    (21)

where P(y_t | ∆^F_t) is a softmax over the target vocabulary scored by ∆^F_t. In other words, we explicitly guide the FUTURE layer with this subtractive loss, expecting ∆^F_t to be discriminative of the current word y_t.
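The subtractive loss can be sketched as a cross-entropy in which ∆^F_t scores every target word against its embedding. The dot-product parameterization and the toy one-hot embeddings below are assumptions chosen for illustration:

```python
import numpy as np

def subtractive_loss(delta_f, E, y_t):
    """-log softmax score of the gold word y_t under the direction Delta^F_t.
    delta_f: s^F_{t-1} - s^F_t, shape (d,)
    E:       target embedding matrix, shape (V, d)
    The dot-product scoring is an illustrative assumption."""
    logits = E @ delta_f                 # score every target word
    logits = logits - logits.max()       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[y_t]               # cross-entropy for the gold word

# toy vocabulary of 4 words with one-hot embeddings
E = np.eye(4)
delta_f = np.array([0.0, 0.0, 5.0, 0.0])   # Delta pointing at word 2
loss_gold = subtractive_loss(delta_f, E, 2)
loss_other = subtractive_loss(delta_f, E, 0)
```

When ∆^F_t aligns with the gold word's embedding, its loss is small, and any other word incurs a larger loss, which is exactly the discriminative pressure the objective places on the FUTURE layer.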
Loss Function for Addition. Likewise, we introduce another loss function to measure the information increment of the PAST layer. Notice that ∆^P_t = s^P_t - s^P_{t-1} ≈ c_t, which is defined similarly to ∆^F_t except for a minus sign. In this way, we can reasonably expect the FUTURE and PAST layers to indeed perform subtraction and addition, respectively.


Experiments

We also evaluate the alignment performance on the standard benchmark of Liu and Sun (2015), which contains 900 manually aligned sentence pairs. We measure the alignment quality with the alignment error rate (Och and Ney, 2003).
For De-En and En-De, we conduct experiments on the WMT17 corpus (Bojar et al., 2017), which consists of 5.6M sentence pairs. We use newstest2016 as our development set and newstest2017 as our test set. Following Sennrich et al. (2017a), we segment both German and English words into subwords using byte-pair encoding (Sennrich et al., 2016, BPE).
We measure the translation quality with BLEU scores (Papineni et al., 2002).
We use the multi-bleu script for Zh-En (mosesdecoder/blob/master/scripts/generic/multi-bleu.perl), and the multi-bleu-detok script for De-En and En-De (https://github.com/EdinburghNLP/nematus/blob/master/data/multi-bleu-detok.perl).

Training Details. We use Nematus (Sennrich et al., 2017b) to implement our baseline translation system, RNNSEARCH. For Zh-En, we limit the vocabulary size to 30K. For De-En and En-De, the number of joint BPE operations is 90,000, and we use the full BPE vocabulary for each side. We tie the weights of the target-side embeddings and the output weight matrix (Press and Wolf, 2017) for De-En. All out-of-vocabulary words are mapped to a special token UNK.
We train each model on sentences of up to 50 words in the training data. The dimension of the word embeddings is 512, and all hidden sizes are 1024. In training, we set the batch size to 80 for Zh-En, and 64 for De-En and En-De. We set the beam size to 12 in testing. We shuffle the training corpus after each epoch.
We use Adam (Kingma and Ba, 2014) with annealing (Denkowski and Neubig, 2017) as our optimization algorithm. We set the initial learning rate to 0.0005, and halve it whenever the validation cross-entropy does not decrease.
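The halving rule can be sketched as follows; this is a minimal sketch with no patience or warmup, and the exact comparison details in Nematus may differ:

```python
def anneal_lr(lr, history, new_loss):
    """Halve the learning rate when validation cross-entropy stops
    decreasing (i.e., fails to beat the best loss seen so far)."""
    if history and new_loss >= min(history):
        lr *= 0.5
    history.append(new_loss)
    return lr

# toy validation curve: two plateaus trigger two halvings
lr, history = 0.0005, []
for val_loss in [3.2, 2.9, 2.95, 2.7, 2.75]:
    lr = anneal_lr(lr, history, val_loss)
# lr is now 0.0005 / 4 = 0.000125
```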
For the proposed model, we use the same settings as the baseline model. The FUTURE and PAST layer sizes are 1024. We employ a two-pass strategy for training the proposed model, which has proven useful for easing the training difficulty when the model is relatively complicated (Shen et al., 2016; Wang et al., 2017; Wang et al., 2018). Model parameters shared with the baseline are initialized from the baseline model.

Results on Chinese-English
We first evaluate the proposed model on the Chinese-English translation and alignment tasks. FUTURE Layer. The GRU variants with explicit minus operations outperform a regular GRU, and GRU-i is the best, which shows that our elaborately designed architecture is better suited to modeling the declining future semantics.

Translation Quality
Adding the subtractive loss gives an extra 0.68 BLEU improvement, which indicates that the added objective is beneficial in guiding the FUTURE RNN to learn the minus operation.
PAST Layer. (Rows 5-6). We observe the same trend when introducing the PAST layer: using it alone achieves a significant improvement (+1.19), and the additional objective further improves translation performance (+0.57).
Stacking FUTURE and PAST Together. (Rows 7-8). The final architecture outperforms our intermediate models (Rows 1-6) by combining the FUTURE RNN and the PAST RNN. By further separating the functionalities of past-content modeling and language modeling into different neural components, the final model is more flexible, obtaining a 0.91 BLEU improvement over the best intermediate model (Row 4) and an improvement of 2.71 BLEU points over the RNNSEARCH baseline.
Comparison with Other Work. (Rows 9-11). We also conduct experiments with multi-layer decoders to see whether an NMT system can automatically model the translated and untranslated contents with additional decoder layers (Rows 9-10). However, we find that the performance is not improved by a two-layer decoder (Row 9); only a deeper version (a three-layer decoder, Row 10) helps. This indicates that enhancing performance by simply adding more RNN layers to the decoder, without any explicit instruction, is non-trivial, which is consistent with the observation of Britz et al. (2017). Our model also outperforms the word-level COVERAGE model (Tu et al., 2016), which considers the coverage information of the source words independently. Our proposed model can be regarded as a high-level coverage model, which captures higher-level coverage information and gives more specific signals for attention and target-word prediction. Our model is also more deeply involved in generating target words, since the coverage information is fed not only to the attention model, as in Tu et al. (2016), but also to the decoder state.

Subjective Evaluation
Following Tu et al. (2016), we conduct subjective evaluations to validate the benefit of modeling PAST and FUTURE (Table 3). Four human evaluators are asked to evaluate the translations of 100 source sentences, randomly sampled from the test sets, without knowing which system each translation comes from. For the BASE system, 1.7% of the source words are over-translated and 8.8% are under-translated. Our proposed model alleviates these problems by explicitly modeling the dynamic source contents with the PAST and FUTURE layers, reducing over-translation and under-translation errors by 11.8% and 35.2%, respectively. The proposed model is especially effective at alleviating under-translation, which is the more serious problem for NMT systems and is mainly caused by a lack of necessary coverage information (Tu et al., 2016).

Table 3: Subjective evaluation of over- and under-translation for Chinese-English. "Ratio" denotes the percentage of source words which are over- or under-translated; "∆" indicates the relative improvement. "BASE" denotes RNNSEARCH and "OURS" denotes "+ FRNN (GRU-i) + PRNN + LOSS".

Alignment Quality. Table 4 lists the alignment performance of our proposed model. We find that the COVERAGE model does improve the attention model, but our model produces much better alignments than the word-level coverage model (Tu et al., 2016). Our model distinguishes PAST from FUTURE directly, which is a higher-level coverage mechanism than the word-level coverage model.

Table 4: Evaluation of the alignment quality. The lower the score, the better the alignment quality.

Results on German-English
We also evaluate our model on the WMT17 benchmarks for both De-En and En-De. As shown in Table 5, our baseline gives BLEU scores comparable to the state-of-the-art NMT systems of WMT17. Our proposed model improves upon the strong baseline on both De-En and En-De, which shows that the proposed model works well across different language pairs. Rikters et al. (2017) and Sennrich et al. (2017a) obtain higher BLEU scores than our model, but they use additional large-scale synthetic data (about 10M sentence pairs) for training, so it may be unfair to compare our model to theirs directly.

Analysis
We conduct analyses on Zh-En to better understand our model from different perspectives.

Speed. The two additional layers lead to relatively slower training. However, our proposed model does not significantly slow down decoding. The most time-consuming part is the calculation of the subtraction and addition losses. As we show in the next paragraph, our system works well when the losses are used only in training, which further improves the decoding speed of our model.

Effectiveness of Subtraction and Addition Loss.
Adding the subtraction and addition losses helps in two ways: (1) guiding the training of the proposed subtraction and addition operations, and (2) enabling better reranking of generated candidates at test time. Table 7 lists the improvements from the two perspectives. When applied only in training, the two loss functions lead to an improvement of 0.48 BLEU points by better modeling the subtraction and addition operations. On top of that, reranking with the FUTURE and PAST loss scores at test time further improves performance by 0.99 BLEU points.
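The reranking step can be sketched as follows; the linear combination with unit interpolation weights is an illustrative assumption, not the paper's tuned configuration:

```python
def rerank(candidates):
    """candidates: list of (translation, log_prob, future_loss, past_loss).
    Higher log-probability and lower auxiliary losses are better, so we
    minimize the combined cost.  Unit weights are an assumption."""
    return min(candidates, key=lambda c: -c[1] + c[2] + c[3])[0]

# toy beam of three hypotheses with fabricated scores
beams = [
    ("hypothesis a", -4.0, 2.0, 2.5),   # combined cost 8.5
    ("hypothesis b", -4.5, 1.0, 1.0),   # combined cost 6.5, best overall
    ("hypothesis c", -3.5, 3.0, 3.0),   # combined cost 9.5
]
best = rerank(beams)
```

Note that hypothesis b wins despite the lowest log-probability, because its FUTURE and PAST losses indicate it covers the source better.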
Initialization of the FUTURE Layer. The baseline model does not obtain a substantial accuracy improvement from feeding the source summarization into the decoder (Table 1). We also experiment with not feeding the source summarization into the decoder of the proposed model, which leads to a significant BLEU drop on Zh-En. This shows that, by explicitly modeling the FUTURE, our proposed model makes better use of the source summarization than the conventional encoder-decoder baseline.
Case Study. We also compare translation cases for the baseline, the word-level coverage model, and our proposed model. As shown in Table 9, our baseline system suffers from under-translation (case 1), which is consistent with the results of the human evaluation. The BASE system also incorrectly translates "the royal family" into "the people of hong kong" (case 2), which is totally irrelevant here. We attribute the former error to the lack of modeling the untranslated future, and the latter to the overloaded use of the decoder state, where the language-modeling functionality of the decoder leads to fluent but wrong predictions. In contrast, the proposed approach fixes almost all of the errors in these cases.

Conclusion
Modeling source contents well is crucial for encoder-decoder based NMT systems. However, current NMT models have difficulty distinguishing translated from untranslated source contents, due to the lack of explicit modeling of past and future translations. In this paper, we separate the PAST and FUTURE functionalities from the decoder states, which maintains a dynamic yet holistic view of the source content at each decoding step. Experimental results show that the proposed approach significantly improves translation performance across different language pairs. With better modeling of past and future translations, our approach performs much better than the standard attention-based NMT, reducing the errors of under- and over-translation.

Table 9: Example translations for the case study.

Case 1:
Source: 布什 还 表示 , 应 巴基斯坦 和 印度 政府 的 邀请 , 他 将 于 3月份 对 巴基斯坦 和 印度 进行 访问 。
Reference: bush also said that at the invitation of the pakistani and indian governments , he would visit pakistan and india in march .
BASE: bush also said that he would visit pakistan and india in march .
COVERAGE: bush also said that at the invitation of pakistan and india , he will visit pakistan and india in march .
OURS: bush also said that at the invitation of the pakistani and indian governments , he will visit pakistan and india in march .

Case 2:
Reference: therefore , many people say that it will have a great impact on the royal family and japanese society .
BASE: therefore , many people are of the view that if this is the case , it will also have a great impact on the people of hong kong and the japanese society .
COVERAGE: therefore , many people think that if this is the case , there will be great impact on the royal and japanese society .
OURS: therefore , many people think that if this is the case , it will have a great impact on the royal and japanese society .