Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation

Neural machine translation (NMT) aims at solving machine translation (MT) problems using neural networks and has exhibited promising results in recent years. However, most of the existing NMT models are shallow and there is still a performance gap between a single NMT model and the best conventional MT system. In this work, we introduce a new type of linear connections, named fast-forward connections, based on deep Long Short-Term Memory (LSTM) networks, and an interleaved bi-directional architecture for stacking the LSTM layers. Fast-forward connections play an essential role in propagating the gradients and building a deep topology of depth 16. On the WMT’14 English-to-French task, we achieve BLEU=37.7 with a single attention model, which outperforms the corresponding single shallow model by 6.2 BLEU points. This is the first time that a single NMT model achieves state-of-the-art performance and outperforms the best conventional model by 0.7 BLEU points. We can still achieve BLEU=36.3 even without using an attention mechanism. After special handling of unknown words and model ensembling, we obtain the best score reported to date on this task with BLEU=40.4. Our models are also validated on the more difficult WMT’14 English-to-German task.


Introduction
Neural machine translation (NMT) has attracted a lot of interest as an approach to machine translation (MT) in recent years (Kalchbrenner and Blunsom, 2013; Bahdanau et al., 2014). Unlike conventional statistical machine translation (SMT) systems (Koehn et al., 2003; Durrani et al., 2014), which consist of multiple separately tuned components, NMT models encode the source sequence into a continuous representation space and generate the target sequence end to end. Moreover, NMT models can be easily adapted to other tasks such as dialog systems, question answering systems (Yu et al., 2015) and image caption generation (Mao et al., 2014).
Generally, there are two types of NMT topologies: the encoder-decoder network and the attention network (Bahdanau et al., 2014). The encoder-decoder network represents the source sequence with a fixed-dimensional vector, and the target sequence is generated from this vector word by word. The attention network considers the representations from all time steps of the input sequence, building a detailed relationship between the target words and the input words. Recent results show that systems based on these models can achieve performance similar to that of conventional SMT systems (Luong et al., 2015; Jean et al., 2015).
However, single neural models of either type have not been competitive with the best conventional system (Durrani et al., 2014) on the WMT'14 English-to-French task. The best BLEU score from a single model, which has six layers, is only 31.5 (Luong et al., 2015), while the conventional method of Durrani et al. (2014) gives 37.0.
We focus on improving single-model performance by increasing the model depth. Deep topologies have been shown to outperform shallow architectures in computer vision. In the past two years, the top positions of the ImageNet contest have been occupied by systems with tens or even hundreds of layers (Szegedy et al., 2015; He et al., 2015). In NMT, however, the largest depth used successfully is only six (Luong et al., 2015). We attribute this to the properties of Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997), which is widely used in NMT.
An LSTM contains more non-linear activations than a convolution layer. These activations significantly shrink the gradient in a deep topology, especially when the gradient propagates through the recurrent computation.
There have also been efforts to increase the depth of LSTM networks, such as (Kalchbrenner et al., 2015), but the shortcuts proposed there do not avoid the nonlinear and recurrent computation.
In this work, we introduce new linear connections for multi-layer recurrent networks. These connections, which we call fast-forward connections, play an essential role in building a deep topology with a depth of 16. In addition, we introduce an interleaved bi-directional way of stacking LSTM layers in the encoder. The topology works for both the encoder-decoder network and the attention network. On the WMT'14 English-to-French task, this is the deepest NMT topology that has been investigated. With our deep attention model, the BLEU score is improved to 37.7, outperforming the shallow model with six layers (Luong et al., 2015) by 6.2 BLEU points. This is also the first time on this task that a single NMT model achieves state-of-the-art performance and outperforms the best conventional SMT system (Durrani et al., 2014), with an improvement of 0.7. Even without the attention mechanism, we still achieve 36.3 with a single model. After model ensembling and unknown word processing, the BLEU score is further improved to 40.4. When evaluated on the subset of the test corpus without unknown words, our model gives 41.4. As a reference, previous work showed that oracle re-scoring of the SMT-generated 1000-best sequences gives a BLEU score of about 45. Our models are also verified on the more difficult WMT'14 English-to-German task.

Neural Machine Translation
Neural machine translation aims at generating the target word sequence $y = \{y_1, \ldots, y_n\}$ given the source word sequence $x = \{x_1, \ldots, x_m\}$ with neural models. In this task, the likelihood $p(y \mid x, \theta)$ of the target sequence is maximized (Forcada and Ñeco, 1997) with respect to the parameters $\theta$ to be learned:
$$p(y \mid x, \theta) = \prod_{j=1}^{m+1} p(y_j \mid y_{0:j-1}, x; \theta) \qquad (1)$$
where $y_{0:j-1}$ is the sub-sequence from $y_0$ to $y_{j-1}$, and $y_0$ and $y_{m+1}$ denote the start mark and end mark of the target sequence respectively. The process can be explicitly split into an encoding part, a decoding part and the interface between these two parts. In the encoding part, the source sequence is processed and transformed into a group of vectors $e = \{e_1, \cdots, e_m\}$, one for each time step. Further operations are applied at the interface part to extract the final representation $c$ of the source sequence from $e$. At the decoding step, the target sequence is generated from the representation $c$.
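To make the factorized objective concrete, the following minimal sketch sums per-step log-probabilities into a sequence log-likelihood. The function name and the toy probabilities are illustrative assumptions, not values from our experiments.

```python
import math

def sequence_log_likelihood(step_probs):
    """Sum log p(y_j | y_{0:j-1}, x) over all target positions, end mark included."""
    if any(p <= 0.0 or p > 1.0 for p in step_probs):
        raise ValueError("each conditional probability must lie in (0, 1]")
    return sum(math.log(p) for p in step_probs)

# Hypothetical conditional probabilities for three target words plus the end mark.
probs = [0.5, 0.25, 0.8, 0.9]
log_lik = sequence_log_likelihood(probs)
```

Working in log space avoids numerical underflow for long target sequences, which is why the sum of logs is preferred over the raw product.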
Recently, two types of NMT models have been proposed which differ in the interface part. In the encoder-decoder model, a single vector extracted from $e$ is used as the representation. In the attention model (Bahdanau et al., 2014), $c$ is dynamically obtained according to the relationship between the target sequence and the source sequence.
A recurrent neural network (RNN), or its specific form the LSTM, is generally employed as the basic unit of the encoding and decoding functions. However, most existing works use a shallow topology. In the attention network, the encoding part and the decoding part each have only one LSTM layer. In the encoder-decoder network, at most six LSTM layers have been used (Luong et al., 2015). As machine translation is considered to be a difficult problem, we believe more complex encoding and decoding functions are needed to model the relationship between the source sequence and the target sequence. In this work, we focus on enhancing the complexity of the encoding/decoding functions by increasing the model depth.
Deep neural models have been studied in a wide range of problems. In computer vision, models with tens of convolution layers or more have outperformed shallow ones on a series of image tasks in recent years (Srivastava et al., 2015; He et al., 2015; Szegedy et al., 2015). Different kinds of shortcut connections have been proposed to decrease the length of the gradient propagation path. Training networks based on LSTM layers, which are widely used in language problems, is a much more challenging task. Because of the large number of nonlinear activations and the recurrent computation, gradient values are not stable and are generally smaller. Following the same spirit as for convolutional networks, a lot of effort has also been spent on training deep LSTM topologies. Yao et al. (2015) introduced depth-gated shortcuts connecting LSTM cells at adjacent layers to provide a fast way to propagate the gradients. They verified these shortcuts on an MT task and a language modeling task, but their best scores come from models with three layers. Similarly, Kalchbrenner et al. (2015) extended this topology to be multi-dimensional. They decreased the number of nonlinear activations and the path length, but the gradient propagation still relies on the recurrent computation. Investigations were also made on question answering to encode the questions, where at most two LSTM layers are stacked (Hermann et al., 2015).
Based on the above considerations, we propose new connections to facilitate gradient propagation in the following section.

Deep Topology
We build the deep LSTM network with the newly proposed linear connections. The shortest paths through the proposed connections do not include any nonlinear transformations and do not rely on any recurrent computation. We call these connections fast-forward connections. Within the deep topology, we also introduce an interleaved bi-directional way of stacking the LSTM layers, and the way the source sequence representation is extracted is modified accordingly.

Network
Our whole deep neural network is shown in Fig. 2. This topology can be divided into three parts: the encoder part (P-E) on the left, the decoder part (P-D) on the right, and the interface between these two parts (P-I), which extracts the representation of the source sequence. We have two instantiations of this topology, named Deep-ED and Deep-Att, which extend the encoder-decoder network and the attention network respectively. Our essential innovation concerns adjacent stacked recurrent layers, so we start with the basic RNN model for the sake of clarity.
Recurrent layer: When an input sequence $\{x_1, \ldots, x_m\}$ is given to a recurrent layer, the output $h_t$ at each time step $t$ can be computed as (see Fig. 1)
$$h_t = \sigma(W_f x_t + W_r h_{t-1}) \qquad (2)$$
where the bias parameter is not included for simplicity. We use a red circle and a blue empty square to denote the input and the hidden state. A blue square with "-" denotes the previous hidden state. A dotted line means that the hidden state is used recurrently. This computation can be equivalently split into two consecutive steps:
• Feed-Forward computation: $f_t = W_f x_t$. Left part in Fig. 1 (b). "f" block.
• Recurrent computation: $h_t = \sigma(W_r h_{t-1} + f_t)$. "r" block.
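The split of a recurrent layer into a feed-forward block and a recurrent block can be sketched as follows. This is a minimal illustration that assumes tanh as the activation and uses plain Python lists; the helper names are hypothetical.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step_direct(W_f, W_r, x_t, h_prev):
    """One step of the basic recurrent layer, computed in a single expression."""
    a, b = matvec(W_f, x_t), matvec(W_r, h_prev)
    return [math.tanh(p + q) for p, q in zip(a, b)]

def feed_forward_block(W_f, x_t):
    """The "f" block: a purely linear map of the input, with no recurrence."""
    return matvec(W_f, x_t)

def recurrent_block(W_r, f_t, h_prev):
    """The "r" block: adds the recurrent term to f_t and applies the activation."""
    return [math.tanh(f + r) for f, r in zip(f_t, matvec(W_r, h_prev))]

# Toy parameters (assumed values, 2 inputs -> 2 hidden units).
W_f = [[0.5, -0.2], [0.1, 0.3]]
W_r = [[0.0, 0.4], [-0.3, 0.2]]
x_t, h_prev = [1.0, 2.0], [0.1, -0.1]

direct = rnn_step_direct(W_f, W_r, x_t, h_prev)
split = recurrent_block(W_r, feed_forward_block(W_f, x_t), h_prev)
```

Both routes give the same hidden state; the point of the split is that the "f" block exposes a linear quantity that later layers can consume directly.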
For a deep topology with stacked recurrent layers, the input of each block f at recurrent layer $k$ (denoted by $f^k$) is usually the output of block r at the previous recurrent layer $k-1$ (denoted by $h^{k-1}$). In our work, we add fast-forward connections (F-F connections) which connect the feed-forward computation blocks f of adjacent recurrent layers. This means that each block f at recurrent layer $k$ takes the outputs of both block f and block r at the previous layer as input (Fig. 1 (c)). F-F connections are denoted by dashed red lines in Fig. 1 (c) and Fig. 2. The path of F-F connections contains neither nonlinear activations nor recurrent computation. It provides a fast way for information to travel, so we call this path fast-forward connections.
Additionally, in order to learn more temporal dependencies, the sequences can be processed in different directions at each pair of adjacent recurrent layers. This is quantitatively expressed in Eq. 3:
$$f_t^k = \begin{cases} W_f^k\,[f_t^{k-1}, h_t^{k-1}], & k > 1 \\ W_f^k x_t, & k = 1 \end{cases} \qquad h_t^k = \sigma(W_r^k h_{t+(-1)^k}^k + f_t^k) \qquad (3)$$
The opposite directions are marked by the direction term $(-1)^k$. At the first layer, the block f takes $x_t$ as the input. $[\,\cdot\,, \cdot\,]$ denotes the concatenation of the enclosed vectors. This is shown in Fig. 1 (c). Two changes relative to the traditional stacked model are summarized here:
• We add a connection between $f_t^k$ and $f_t^{k-1}$. Without $f_t^{k-1}$, our model reduces to the traditional stacked model.
• We alternate the RNN direction at different layers $k$ with the direction term $(-1)^k$. If we fix the direction term to $-1$, all layers work in the forward direction.
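The two changes above can be sketched for basic recurrent layers as follows. This is a simplified illustration rather than our training code: it assumes tanh as the activation, zero initial hidden states, and plain Python lists; `ff_stack` and `matvec` are hypothetical names.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def ff_stack(xs, layers, alternate=True):
    """Run stacked recurrent layers with fast-forward connections.

    xs: list of input vectors, one per time step.
    layers: list of (W_f, W_r) pairs for layers k = 1..K.
    Returns (f, h) of the top layer, each a list over time steps.
    """
    steps = len(xs)
    f_prev, h_prev = None, None
    for k, (W_f, W_r) in enumerate(layers, start=1):
        # "f" block: layer 1 reads x_t; deeper layers read [f_t^{k-1}, h_t^{k-1}].
        inputs = xs if k == 1 else [f + h for f, h in zip(f_prev, h_prev)]
        f_cur = [matvec(W_f, v) for v in inputs]
        # Direction term: h_t depends on h_{t+shift}; shift = -1 is a forward scan.
        shift = (-1) ** k if alternate else -1
        order = range(steps) if shift == -1 else range(steps - 1, -1, -1)
        h_cur = [None] * steps
        for t in order:
            src = t + shift
            h_src = h_cur[src] if 0 <= src < steps else [0.0] * len(W_r)
            rec = matvec(W_r, h_src)
            h_cur[t] = [math.tanh(a + b) for a, b in zip(f_cur[t], rec)]
        f_prev, h_prev = f_cur, h_cur
    return f_prev, h_prev

# Toy example: two time steps, one hidden unit per layer.
xs = [[1.0], [2.0]]
layers = [([[1.0]], [[0.0]]),        # layer 1
          ([[1.0, 0.0]], [[1.0]])]   # layer 2 reads only the f part of [f, h]
f_bi, h_bi = ff_stack(xs, layers)                      # interleaved directions
f_uni, h_uni = ff_stack(xs, layers, alternate=False)   # all layers forward
```

In the toy example the second layer's `W_f` selects only the fast-forward input, so `f_bi` equals the first layer's linear output: the signal reaches layer 2 without passing through any activation or recurrence.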
LSTM layer: In practice, a specific type of recurrent layer called LSTM (Hochreiter and Schmidhuber, 1997; Graves et al., 2009) is used in our work. The LSTM is structurally more complex than the basic RNN used in Eq. 2. We define the computation of the LSTM as a function which maps the input $f$ and its state-output pair $(h, s)$ at the previous time step to a new state-output pair. The exact computations for $(h_t, s_t) = \mathrm{LSTM}(f_t, h_{t-1}, s_{t-1})$ are given in Eq. 4, where the pre-activation vector is the concatenation of four vectors of equal size, $\circ$ means element-wise multiplication, $\sigma_i$ is the input activation function, $\sigma_o$ is the output activation function, $\sigma_g$ is the activation function for gates, and $W_r$, $\theta_\rho$, $\theta_\phi$, and $\theta_\pi$ are the parameters of the LSTM. This differs slightly from the standard notation in that no matrix multiplies the input $f$.
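One possible reading of this LSTM function is sketched below. It assumes a peephole-style cell in which $\theta_\rho$, $\theta_\phi$ and $\theta_\pi$ weight the cell state inside the input, forget and output gates, with sigmoid gates and tanh input/output activations as in our model settings. The ordering of the four chunks and the names `z`, `z_rho`, `z_phi`, `z_pi` are assumptions made for illustration only.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def lstm_step(f_t, h_prev, s_prev, W_r, theta_rho, theta_phi, theta_pi):
    """One LSTM step on a pre-computed feed-forward input f_t (length 4n).

    No matrix multiplies f_t: it is added directly to W_r h_{t-1}.
    """
    n = len(h_prev)
    rec = [sum(w * h for w, h in zip(row, h_prev)) for row in W_r]  # 4n values
    pre = [a + b for a, b in zip(f_t, rec)]
    # Split the pre-activation into four chunks of equal size (assumed order).
    z, z_rho, z_phi, z_pi = (pre[i * n:(i + 1) * n] for i in range(4))
    s_t = [math.tanh(z[i]) * sigmoid(z_rho[i] + s_prev[i] * theta_rho[i])
           + sigmoid(z_phi[i] + s_prev[i] * theta_phi[i]) * s_prev[i]
           for i in range(n)]
    h_t = [math.tanh(s_t[i]) * sigmoid(z_pi[i] + s_t[i] * theta_pi[i])
           for i in range(n)]
    return h_t, s_t

# Toy call with one memory cell and all recurrent/peephole weights set to zero.
h_t, s_t = lstm_step([1.0, 0.0, 0.0, 0.0], [0.0], [2.0],
                     [[0.0], [0.0], [0.0], [0.0]], [0.0], [0.0], [0.0])
```

Because `f_t` already has four times the width of `h_t`, the feed-forward block of the layer is what produces the gate pre-activations, which is the property the F-F connections rely on.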
With this notation, we can write down the computations for our deep bi-directional LSTM model with F-F connections:
$$f_t^k = \begin{cases} W_f^k\,[f_t^{k-1}, h_t^{k-1}], & k > 1 \\ W_f^k x_t, & k = 1 \end{cases} \qquad (h_t^k, s_t^k) = \mathrm{LSTM}(f_t^k, h_{t+(-1)^k}^k, s_{t+(-1)^k}^k) \qquad (5)$$
where $x_t$ is the input to the deep bi-directional LSTM model. For the encoder, $x_t$ is the embedding of the $t$-th word in the source sentence. For the decoder, $x_t$ is the concatenation of the embedding of the $t$-th word in the target sentence and the encoder representation for step $t$. In our final model, two additional operations are applied to Eq. 5, as shown in Eq. 6:
$$f_t^k = W_f^k\,[\mathrm{Half}(f_t^{k-1}), \mathrm{Dr}(h_t^{k-1})], \quad k > 1 \qquad (6)$$
Half($f$) denotes the first half of the elements of $f$, and Dr($h$) is the dropout operation, which randomly sets an element of $h$ to zero with a certain probability. Half(·) is used to reduce the parameter size and does not affect the performance. When we tried using only the first third of the elements, however, the performance degraded considerably.
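The two additional operations can be illustrated with a short sketch. The function names are hypothetical, and the dropout shown here only zeroes elements as described in the text; any rescaling of the surviving elements is not specified and is therefore omitted.

```python
import random

def half(f):
    """Half(f): keep the first half of the elements of f."""
    return f[: len(f) // 2]

def dropout(h, p, rng):
    """Dr(h): set each element of h to zero with probability p."""
    return [0.0 if rng.random() < p else v for v in h]

def ff_input(f_prev, h_prev, p, rng):
    """Input to the next layer's "f" block: [Half(f^{k-1}), Dr(h^{k-1})]."""
    return half(f_prev) + dropout(h_prev, p, rng)

rng = random.Random(0)
f_prev = [1.0, 2.0, 3.0, 4.0]   # toy f vector, four times the width of h
h_prev = [0.5]
layer_input = ff_input(f_prev, h_prev, p=0.1, rng=rng)
```

Halving `f` before concatenation shrinks the input width of the next `W_f`, which is where the reduction in parameter size comes from.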
With the F-F connections, we build a channel to propagate the gradients in the deep topology. F-F connections accelerate model convergence and at the same time improve performance.
Figure 2: Network. It includes three parts from left to right: encoder part (P-E), interface (P-I) and decoder part (P-D). We only show the topology of Deep-Att as an example. "f" and "r" blocks correspond to the feed-forward part and the subsequent LSTM computation. The F-F connections are denoted by red lines.
Encoder: The LSTM layers are stacked following Eq. 5. We call this type of encoder the interleaved bi-directional encoder. In addition, there are two similar columns ($a_1$ and $a_2$) in the encoder part. Each column consists of $n_e$ stacked LSTM layers. There is no connection between the columns. The first layers of the two columns process the word representations of the source sequence in different directions. At the last LSTM layers, there are two groups of vectors representing the source sequence. The group size is the same as the length of the input sequence.
Interface: The previous encoder-decoder model and attention model differ in how they extract the representation of the source sequence. In our work, as a consequence of the introduced F-F connections, we have 4 output vectors ($h_t^k$ and $f_t^k$ of both columns). The representations are modified for both Deep-ED and Deep-Att. For Deep-Att, we do not need the above two operations. We only concatenate the 4 output vectors at each time step to obtain $e_t$, and a soft attention mechanism (Bahdanau et al., 2014) is used to calculate the final representation $c_t$ from $e_t$. Note that the vector dimension of $f$ is four times larger than that of $h$ (see Eq. 4). $c_t$ is summarized as:
$$c_t = \sum_{t'=1}^{m} \alpha_{t,t'} W_p e_{t'}$$
where $\alpha_{t,t'}$ is the normalized attention weight computed by:
$$\alpha_{t,t'} = \frac{\exp\big(a(W_p e_{t'}, h_{t-1}^{1,\mathrm{dec}})\big)}{\sum_{t''} \exp\big(a(W_p e_{t''}, h_{t-1}^{1,\mathrm{dec}})\big)}$$
Here $h_{t-1}^{1,\mathrm{dec}}$ is the output of the first hidden layer in the decoding part, and $a(\cdot)$ is an alignment model described in (Bahdanau et al., 2014). For Deep-Att, in order to reduce the memory cost, we linearly project (with $W_p$) the concatenated vector $e_t$ to a vector with 1/4 of its dimension, denoted by the fc (fully connected) block in Fig. 2.
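The soft attention step can be sketched as a softmax over alignment scores followed by a weighted sum. In this minimal illustration a dot product stands in for the alignment model $a(\cdot)$, which is an assumption made to keep the example short; the vectors passed in play the role of the projected encoder outputs.

```python
import math

def dot_score(e, query):
    """Stand-in for the alignment model a(.); a plain dot product (assumption)."""
    return sum(a * b for a, b in zip(e, query))

def soft_attention(encodings, query, score=dot_score):
    """Return (weights, context) for one decoding step.

    encodings: list of (projected) encoder vectors, one per source position.
    query: decoder state used to score each source position.
    """
    scores = [score(e, query) for e in encodings]
    top = max(scores)                      # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]    # normalized attention weights
    dim = len(encodings[0])
    context = [sum(w * e[i] for w, e in zip(weights, encodings))
               for i in range(dim)]
    return weights, context

# Toy example: two source positions; a zero query gives uniform weights.
weights, context = soft_attention([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

The context vector has the same dimension as each encoder vector, so projecting the encoder outputs before attention directly reduces the size of the representation passed to the decoder.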
Decoder: The decoder follows Eq. 5 and Eq. 6 with the direction term fixed to $-1$. At the first layer, we use the following $x_t$:
$$x_t = [c_t, y_{t-1}]$$
where $y_{t-1}$ is the target word embedding at the previous time step and $y_0$ is initially zero. There is a single column of $n_d$ stacked LSTM layers. We also use F-F connections as in the encoder, and all layers work in the forward direction. Note that at the last LSTM layer, we only use $h_t$ to make the prediction with a softmax layer.

Training technique
We take the parallel data as the only input, without using any monolingual data for either word representation pre-training or language modeling. Because of the deep bi-directional structure, we do not need to reverse the sequence order as done in previous work. The deep topology makes model training difficult, especially when a simple first-order method such as stochastic gradient descent (SGD) (LeCun et al., 1998) is used: the parameters must be properly initialized and convergence is slow. We tried several adaptive optimization techniques such as AdaDelta (Zeiler, 2012), RMSProp (Tieleman and Hinton, 2012) and Adam (Kingma and Ba, 2014). We found that all of them speed up training considerably compared to simple SGD, while no significant difference in performance is observed among them. In this work, we choose Adam for model training and do not compare their detailed behavior.
Dropout is also used to avoid over-fitting. It is applied to the LSTM outputs $h_t^k$ (see Eq. 5) with a ratio of $p_d$ for both the encoder and the decoder. During the whole training process, we keep all hyper-parameters fixed without any intermediate interruption. The hyper-parameters are selected according to the performance on the development set. For such a deep and large network, it is not easy to determine a tuning strategy, and this is left for future work.

Generation
We use the common left-to-right beam-search method for sequence generation. At each time step $t$, the word $y_t$ can be predicted by:
$$\hat{y}_t = \arg\max_{y} \; p(y \mid \hat{y}_{0:t-1}, x; \theta) \qquad (11)$$
where $\hat{y}_t$ is the predicted target word and $\hat{y}_{0:t-1}$ is the generated sequence from time step 0 to $t-1$. We keep the $n_b$ best candidates according to Eq. 11 at each time step, until the end mark is generated. The hypotheses are ranked by the total likelihood of the generated sequence, although some works use the normalized likelihood instead (Jean et al., 2015).
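A minimal sketch of left-to-right beam search ranked by total log-likelihood is given below. The toy next-word distribution, the token names and the simplified stopping rule (stop once every kept hypothesis has produced the end mark) are assumptions for illustration, not our decoder.

```python
import math

def beam_search(next_log_probs, beam_size, end_token, max_len):
    """Return the (score, sequence) pair with the highest total log-likelihood.

    next_log_probs(prefix) -> dict mapping each candidate word to its log-probability.
    """
    beams = [(0.0, [])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            for token, log_p in next_log_probs(prefix).items():
                candidates.append((score + log_p, prefix + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_size]:
            # Hypotheses that emitted the end mark leave the beam.
            (finished if seq[-1] == end_token else beams).append((score, seq))
        if not beams:
            break
    pool = finished or beams
    return max(pool, key=lambda c: c[0])

# Hypothetical toy model: the locally best first word leads to a worse sequence.
TOY = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"</s>": 0.55, "c": 0.45},
    ("b",): {"</s>": 0.9, "c": 0.1},
}

def toy_model(prefix):
    dist = TOY.get(tuple(prefix), {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in dist.items()}

greedy_score, greedy_seq = beam_search(toy_model, 1, "</s>", 5)
beam_score, beam_seq = beam_search(toy_model, 3, "</s>", 5)
```

In this toy case a beam of size 1 commits to "a" and ends with total probability 0.33, while a beam of size 3 recovers the better sequence starting with "b" (0.36).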

Experiment
We evaluate our method mainly on the widely used WMT'14 English-to-French translation task. In order to verify our model on a more difficult language pair, we also give results on the WMT'14 English-to-German translation task.

Data sets
For both tasks, we use the full WMT'14 parallel corpus as our training data. The detailed data sets are listed below:
• English-to-French: Europarl v7, Common Crawl, UN, News Commentary, Gigaword
• English-to-German: Europarl v7, Common Crawl, News Commentary
In total, the English-to-French corpus includes 36 million sentence pairs, and the English-to-German corpus includes 4.5 million sentence pairs. The news-test-2012 and news-test-2013 sets are concatenated as our development set, and news-test-2014 is the test set. Our data set is consistent with previous work on NMT (Luong et al., 2015; Jean et al., 2015) to ensure a fair comparison.
For the source language, we select the most frequent 200K words as the input vocabulary. For the target language, we select the most frequent 80K French words and 160K German words as the output vocabulary. The full vocabulary of the German corpus is larger (Jean et al., 2015), so we select more German words to build the target vocabulary. Out-of-vocabulary words are replaced with the unknown symbol unk. For a complete comparison with previous work on the English-to-French task, we also show results with a small vocabulary of 30K input words and 30K output words on a training subset of 12M selected parallel sequences (Schwenk, 2014).

Model settings
We have two models as described above, named Deep-ED and Deep-Att. Both models have exactly the same configuration and layer size except the interface part P-I.
We use 256 dimensional word embeddings for both source and target languages. All LSTM layers, including 2 × n e layers in encoder and n d layers in decoder, have 512 memory cells. The output layer size is the same as the size of the target vocabulary. The dimension of c t is 5120 and 1280 for Deep-ED and Deep-Att respectively. For each LSTM layer, activation functions for gates, inputs and outputs are sigmoid, tanh, and tanh respectively.
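The stated sizes can be checked with a few lines of arithmetic. The sketch below assumes that the representation is built from the $h$ and $f$ vectors of both encoder columns, that $f$ is four times as wide as $h$, and that Deep-Att projects the concatenation to a quarter of its size; the depths $n_e = 9$ and $n_d = 7$ are those of our largest model.

```python
hidden = 512                 # memory cells per LSTM layer
f_dim = 4 * hidden           # f is four times larger than h
columns = 2                  # encoder columns a1 and a2

concat_dim = columns * (hidden + f_dim)   # four output vectors concatenated
deep_att_dim = concat_dim // 4            # after the 1/4 linear projection

n_e, n_d = 9, 7                           # depths of the largest model
total_lstm_layers = 2 * n_e + n_d         # two encoder columns plus the decoder
model_depth = n_e + n_d
```

Under these assumptions the concatenated width is 5120 and the projected width is 1280, matching the dimensions of $c_t$ quoted above, and the largest model has 25 LSTM layers at a depth of 16.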
Our network is narrow in both word embeddings and LSTM layers. Note that in previous work (Bahdanau et al., 2014), 1000-dimensional word embeddings and 1000-dimensional LSTM layers are used. We also tried larger-scale models but did not obtain further improvements. Compared with Luong et al. (2015), the computational complexity of our model may also be considerably lower.

Optimization
Note that each LSTM layer includes two parts as described in Eq. 3, the feed-forward computation and the recurrent computation. Since there are non-linear activations in the recurrent computation, a larger learning rate $l_r = 5 \times 10^{-4}$ is used, while for the feed-forward computation a smaller learning rate $l_f = 4 \times 10^{-5}$ is used. Word embeddings and the softmax layer also use the learning rate $l_f$. We refer to all parameters not used for recurrent computation as the non-recurrent part of the model.
Because of the large model size, we use strong $L_2$ regularization to constrain the parameter matrix $v$ in the following way:
$$v \leftarrow v - l \cdot (g + r \cdot v) \qquad (12)$$
Here $r$ is the regularization strength, $l$ is the corresponding learning rate, and $g$ stands for the gradients of $v$. The two embedding layers are not regularized. All the other layers have the same $r = 2$.
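The effect of the regularization term can be seen in a plain gradient step, sketched below. This illustration assumes a simple gradient-descent update in which the term $r \cdot v$ is added to the gradient; how the term interacts with the Adam update used in training is not shown, and the function name is hypothetical.

```python
def l2_regularized_step(v, g, lr, r):
    """Apply v <- v - lr * (g + r * v) element-wise to a flat parameter list."""
    return [vi - lr * (gi + r * vi) for vi, gi in zip(v, g)]

# With a zero gradient, each step shrinks the parameters by a factor (1 - lr * r).
v = [1.0, -2.0]
shrunk = l2_regularized_step(v, [0.0, 0.0], lr=5e-4, r=2.0)
```

With the recurrent learning rate $5 \times 10^{-4}$ and $r = 2$, the shrink factor per step is 0.999, so the penalty acts as a slow decay toward zero that the data gradient has to counteract.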
Parameters of the recurrent computation part are set to be zero at the beginning. All non-recurrent parts are randomly initialized with zero mean and standard deviation of 0.07. A detailed guide for setting hyper-parameters can be found in (Bengio, 2012).
Dropout ratio p d is set to be 0.1. In each batch, there are 500 ∼ 800 sequences in our work. The exact number depends on the sequence lengths and model size. We also find that larger batch size results in better convergence although the improvement is not large. But this is constrained by the GPU memory. We need 4 ∼ 8 GPU machines (each has 4 K40 GPU cards) running for 10 days to train the full model with parallelization at data batch level. It takes nearly 1.5 days for each pass.
One thing we want to stress here is that our deep model is not sensitive to these settings. Small variations around this region will not affect the final performance.

Results
We follow the same evaluation protocol as most previous NMT works (Luong et al., 2015; Jean et al., 2015). All reported BLEU scores are computed with the multi-bleu.perl script, which is also used in the above works. The results are tokenized and case sensitive.

Single models
English-to-French: First we list our single model results on the English-to-French task in Tab. 1. In the first block we show the results with the full corpus. The previous best single NMT encoder-decoder model (Enc-Dec) with six layers gives BLEU=31.5. With Deep-ED, we obtain a BLEU score of 36.3, which outperforms the Enc-Dec model by 4.8 BLEU points. This result is even better than the ensemble result of eight Enc-Dec models, which is 35.6 (Luong et al., 2015). This shows that a deep topology also works well with LSTM layers, not only with convolutional layers in computer vision. For Deep-Att, the performance is further improved to 37.7. We also list the previous state-of-the-art performance from a conventional SMT system (Durrani et al., 2014), with a BLEU of 37.0. This is the first time that a single NMT model trained end to end beats the best conventional system.
We also show the results on the smaller data set with 12M sentence pairs and a 30K vocabulary in the second block. The two attention models RNNsearch (Bahdanau et al., 2014) and RNNsearch-LV (Jean et al., 2015) give BLEU scores of 28.5 and 32.7 respectively. Note that RNNsearch-LV uses a large output vocabulary of 500K words on top of the standard attention model RNNsearch. We obtain BLEU=35.9, which outperforms the corresponding shallow model RNNsearch by 7.4 BLEU points. The SMT result from (Schwenk, 2014) is also listed and falls behind our model by 2.6 BLEU points.
Table 1: English-to-French task: BLEU scores of single neural models. We also list the conventional SMT system for comparison.
In addition, during generation we obtained the best BLEU score with a beam size of 3 (with a beam size of 2, the BLEU score differs by only 0.1). This is different from other works listed in Tab. 1, where the beam size is 12 (Jean et al., 2015). We attribute this difference to the improved model performance, where the ground truth generally exists among the top hypotheses. Consequently, with a much smaller beam size, generation efficiency is significantly improved.
Next we list the effect of the novel F-F connections in our Deep-Att model with shallow topology in Tab. 2. When $n_e = 1$ and $n_d = 1$, the BLEU scores are 31.2 without F-F and 32.3 with F-F. Note that the model without F-F is exactly the standard attention model (Bahdanau et al., 2014). Since there is only a single layer here, using F-F connections means that at the interface part we include $f_t$ in the representation (see Eq. 7). We find that F-F connections bring an improvement of 1.1 BLEU points. After we increase the model depth to $n_e = 2$ and $n_d = 2$, the improvement grows to 1.4. When the model is trained with larger depth without F-F connections, the parameter exploding problem (Bengio et al., 1994) occurs so frequently that we could not finish training. This suggests that F-F connections provide a fast way for gradient propagation.
Removing F-F connections also reduces the corresponding model size. In order to compare models with the same parameter size, we increase the LSTM layer width of Deep-Att without F-F. In Tab. 3 we show that, after doubling the LSTM layer width to 1024, we only obtain a BLEU score of 33.8. It is still behind the corresponding Deep-Att with F-F.
We also notice that the interleaved bi-directional encoder starts to work when the encoder depth is larger than 1. This property is investigated in Tab. 4.
For our largest model with $n_e = 9$ and $n_d = 7$, we compared the BLEU scores of the interleaved bi-directional encoder and the uni-directional encoder (where all LSTM layers work in the forward direction). We find a gap of about 1.5 points between these two encoders for both Deep-Att and Deep-ED. We list the BLEU scores of our largest Deep-Att and Deep-ED. The encoder term Bi denotes that the interleaved bi-directional encoder is used. Uni denotes that all LSTM layers work in the forward direction.
Next we look into the model depth. In Tab. 5, starting from $n_e = 1$ and $n_d = 1$ and gradually increasing the model depth, we significantly increase the BLEU scores. With $n_e = 9$ and $n_d = 7$, the best score for Deep-Att is 37.7. We tried to increase the LSTM width on top of this, but obtained little improvement. As we stated in Sec. 2, the complexity of the encoding/decoding functions, which is related to the model depth, is more important than the model size. We also tried even larger depths, but the results start to get worse. With our topology and training technique, $n_e = 9$ and $n_d = 7$ is the best depth we can achieve.
Table 5: BLEU score of Deep-Att with different model depth. With $n_e = 1$ and $n_d = 1$, F-F connections only contribute to the representation at the interface part, where $f_t$ is included (see Eq. 7).
The last line in Tab. 5 shows the BLEU score of 36.6 of our deepest model where only one encoding column (Col = 1) is used. We find 1.1 BLEU points degradation with single encoding column. Note that results in Tab. 4 with uni-direction still have two encoding columns. In order to find out whether this is caused by the decreased parameter size, we use a wider LSTM layer with width of 1024 memory blocks. It is shown in Tab. 6 that there is a minor improvement of only 0.1. We attribute this to the complementary information provided by the double encoding column.
English-to-German: We also verify our deep topology on the English-to-German task. The English-to-German task is considered a relatively more difficult task because the two languages are less similar. Since the German vocabulary is much larger than the French vocabulary, we select the 160K most frequent words as the target vocabulary. All the other hyper-parameters are exactly the same as those in the English-to-French task.
We list our single model Deep-Att performance in Tab. 7. Our single model result with BLEU=20.6 is similar to the conventional SMT result of 20.7 (Buck et al., 2014). We also outperform the shallow attention models, as shown in the first two lines of Tab. 7. All the results are consistent with those in the English-to-French task.

Post processing
Two post processing techniques are used to improve the performance further on English-to-French task.
First, three Deep-Att models are built for ensemble results. They are initialized with different parameters, and the training corpus for each model is shuffled with a different random seed. We sum over the predicted word distributions of the models and normalize the final distribution to generate the next word. It is shown in Tab. 8 that the model ensemble improves the performance further to 38.9. In (Luong et al., 2015) and (Jean et al., 2015), eight models are used for the best scores, but we only use three models and do not obtain further gain from more models.
Table 8: BLEU scores of different models. The first two blocks are our results of two single models and models with post processing. In the last block we list two baselines of the best conventional SMT system and NMT system.
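The ensembling rule (sum the predicted distributions, then normalize) can be sketched as follows. The dictionary representation and the toy probabilities are assumptions for illustration; all models are assumed to share the same output vocabulary.

```python
def ensemble_distribution(distributions):
    """Sum per-model next-word distributions and renormalize the result."""
    vocab = distributions[0].keys()
    summed = {w: sum(d[w] for d in distributions) for w in vocab}
    total = sum(summed.values())
    return {w: p / total for w, p in summed.items()}

# Hypothetical next-word distributions from two models over a two-word vocabulary.
combined = ensemble_distribution([
    {"chat": 0.6, "chien": 0.4},
    {"chat": 0.2, "chien": 0.8},
])
best_word = max(combined, key=combined.get)
```

Since each input distribution already sums to one, the normalization amounts to dividing by the number of models, so the ensemble is an arithmetic mean of the per-model probabilities.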
Second, we recover the unknown words in the generated sequences with the Positional Unknown (PosUnk) model introduced in (Luong et al., 2015). The full parallel corpus is used to obtain the word mappings (Liang et al., 2006). We find that this method provides an additional 1.5 BLEU points, consistent with the conclusion in (Luong et al., 2015). We obtain a new BLEU score of 39.2 with a single Deep-Att model. For the ensemble of Deep-Att models, the BLEU score rises to 40.4. In the last two lines, we list the conventional SMT model (Durrani et al., 2014) and the previous best neural system, the encoder-decoder model of (Luong et al., 2015), for comparison. We find that our best score outperforms the previous best score by nearly 3 points.

Length
On the English-to-French task, we analyze the effect of the source sequence length on our models, as shown in Fig. 3. Here we show five curves: our Deep-Att single model, our Deep-Att ensemble model, our Deep-ED model, the previous Enc-Dec model with four layers, and the SMT model (Durrani et al., 2014). We find that our Deep-Att model works better than the two previous models (Enc-Dec and SMT) on nearly all length scales. It is also shown that for very long sequences with a length over 70 words, the performance of our Deep-Att does not degrade, in contrast to the other NMT model Enc-Dec. Our Deep-ED also has much better performance than the shallow Enc-Dec model on nearly all length scales, although for long

Unknown words
Next we look in detail at the effect of unknown words on the English-to-French task. We select the subset of the original test set whose target sentences contain no unknown words. There are 1705 such sequences (56.8%). We compute the BLEU scores on this subset and show the results in Tab. 9. We also list the results from the SMT model (Durrani et al., 2014) for comparison.
We find that the BLEU score of Deep-Att on the subset rises to 40.3. Since the score on the full test set is 37.7, this is a gap of 2.6. On this subset, the SMT model gives 37.5, while its score on the full set is 37.0, a gap of only 0.5. This suggests that the difficulty of this subset is not much different from that of the full set, so we attribute the larger gap for Deep-Att to the existence of unknown words. We also compute the BLEU score on the subset with the ensemble model and obtain 41.4. As a reference related to human-level performance, it has been reported that oracle re-scoring of the LIUM 1000-best results (Schwenk, 2014) gives a BLEU score of 45.

Over-fitting
Deep models have more parameters, and thus have a stronger ability to fit a large data set. In addition, we show that deep models are less prone to over-fitting the data set. In Fig. 4, on the English-to-French task, we show results from three models with different depths. These three models are evaluated by token error rate, which is defined as the ratio of incorrectly predicted words in the whole target sequence when the correct history is given as input. The curve with square marks corresponds to Deep-Att with $n_e = 9$ and $n_d = 7$. The curve with circle marks corresponds to $n_e = 5$ and $n_d = 3$. The curve with triangle marks corresponds to $n_e = 1$ and $n_d = 1$. We find that, at the same token error rate on the training set, the deep model performs better on the test set than the shallow models. This shows that, as the token error rate decreases, the deep model is better at avoiding over-fitting. We only plot the curves for the early training stage because the curves are no longer smooth in the late training stage.
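The token error rate used in this comparison can be sketched as below. The sketch assumes that each prediction was made with the correct preceding reference words as input, so the two sequences are aligned position by position; the function name and example words are illustrative.

```python
def token_error_rate(predicted, reference):
    """Fraction of target positions whose predicted word differs from the reference."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must be aligned position by position")
    if not reference:
        raise ValueError("reference must not be empty")
    errors = sum(p != r for p, r in zip(predicted, reference))
    return errors / len(reference)

# Hypothetical predictions made with the correct history at every step.
rate = token_error_rate(["le", "chat", "noir"], ["le", "chien", "noir"])
```

Because the history is always the reference, an early mistake does not propagate to later positions, which makes this measure smoother than sequence-level scores when tracking training progress.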

Examples
On the English-to-French task, we select two example sequences, shown in Tab. 10. For each example, there are four lines corresponding to the source language, the reference (ref), our ensemble model result with PosUnk, and the SMT model (Durrani et al., 2014) respectively.

Conclusion
With the introduction of fast-forward connections to the deep LSTM network, we build a fast path with neither non-linear transformations nor recurrent computation to propagate the gradients from the top to the deep bottom. On this path, gradients decay much more slowly than in the standard LSTM network. This enables us to build the deep topology of NMT models.
We trained NMT models with a depth of 16, including 25 LSTM layers, and evaluated them mainly on the WMT'14 English-to-French translation task. This is the deepest topology that has been investigated in NMT on this task. We showed that our Deep-Att exhibits a 6.2 BLEU point improvement over the previous best single model, achieving a BLEU score of 37.7. This single end-to-end NMT model outperforms the best conventional SMT system (Durrani et al., 2014) and achieves state-of-the-art performance. After applying unknown word processing and an ensemble of three models, we obtained a BLEU score of 40.4, an improvement of 2.9 BLEU points over the previous best result. When evaluated on the subset of the test corpus without unknown words, our model gives 41.4. As a reference number, previous work showed that oracle re-scoring of the SMT-generated 1000-best lists gives a BLEU score of about 45. Our model is also verified on the more difficult English-to-German task. In addition, our model is efficient in sequence generation. The best results from both the single model and the model ensemble are obtained with a beam size of 3, much smaller than in previous NMT systems where the beam size is about 12 (Jean et al., 2015). From our analysis, we find that deep models are better at learning long sequences and that the deep topology is resistant to over-fitting. We tried deeper models and did not obtain further improvements with the current topology and training techniques. However, 16 is not a large depth compared to the models in computer vision (He et al., 2015). We believe we can benefit from deeper models with new designs of topologies and training techniques, which are left as future work.