Dynamically Shaping the Reordering Search Space of Phrase-Based Statistical Machine Translation

Defining the reordering search space is a crucial issue in phrase-based SMT between distant languages. In fact, the optimal trade-off between accuracy and complexity of decoding is nowadays reached by harshly limiting the input permutation space. We propose a method to dynamically shape such a space and, thus, capture long-range word movements without hurting translation quality or decoding time. The space defined by loose reordering constraints is dynamically pruned through a binary classifier that predicts whether a given input word should be translated right after another. The integration of this model into a phrase-based decoder improves a strong Arabic-English baseline that already includes a state-of-the-art early distortion cost (Moore and Quirk, 2007) and hierarchical phrase orientation models (Galley and Manning, 2008). Significant improvements in the reordering of verbs are achieved by a system that is notably faster than the baseline, while BLEU and METEOR remain stable, or even increase, at a very high distortion limit.


Introduction
Word order differences are among the most important factors determining the performance of statistical machine translation (SMT) on a given language pair (Birch et al., 2009). This is particularly true in the framework of phrase-based SMT (PSMT) (Zens et al., 2002; Koehn et al., 2003), an approach that remains highly competitive despite the recent advances of the tree-based approaches.
During the PSMT decoding process, the output sentence is built from left to right, while the input sentence positions can be covered in different orders. Thus, reordering in PSMT can be viewed as the problem of choosing the input permutation that leads to the highest-scoring output sentence. Due to efficiency reasons, however, the input permutation space cannot be fully explored, and is therefore limited with hard reordering constraints.
Although many solutions have been proposed to explicitly model word reordering during decoding, PSMT still largely fails to handle long-range word movements in language pairs with different syntactic structures 1 . We believe this is mostly due not to deficiencies of the existing reordering models, but rather to a very coarse definition of the reordering search space. Indeed, the existing reordering constraints are rather simple and typically based on word-to-word distances. Moreover, they are uniform throughout the input sentence and insensitive to the actual words being translated. Relaxing this kind of constraint dramatically increases the size of the search space and makes the reordering model's task extremely complex. As a result, even in language pairs where long reordering is regularly observed, PSMT quality degrades when the decoder is allowed to perform long word movements.
We address this problem by training a binary classifier to predict whether a given input position should be translated right after another, given the words at those positions and their contexts. When this model is integrated into the decoder, its predictions can be used not only as an additional feature function, but also as an early indication of whether or not a given reordering path should be further explored. More specifically, at each hypothesis expansion, we consider the set of input positions that are reachable according to the usual reordering constraints, and prune it based only on the reordering model score. Then, the hypothesis can be expanded normally by covering the non-pruned positions. This technique makes it possible to dynamically shape the search space while decoding with a very high distortion limit, which can improve translation quality and efficiency at the same time.
The remainder of the paper is organized as follows. After an overview of the relevant literature, we describe in detail our word reordering model. In the following section, we introduce early pruning of reordering steps as a way to dynamically shape the input permutation space. Finally, we present an empirical analysis of our approach, including intrinsic evaluation of the model and SMT experiments on a well-known Arabic-English news translation task.

Previous Work
In this paper, we focus on methods that guide the reordering search during the phrase-based decoding process. See, for instance, Costa-jussà and Fonollosa (2009) for a review of pre- and post-reordering approaches that are not treated here.
Assuming a one-to-one correspondence between source and target phrases, reordering in PSMT can be viewed as the problem of searching through a set of permutations of the input sentence. Thus, two sub-problems arise: defining the set of allowed permutations (reordering constraints) and scoring the allowed permutations according to some likelihood criterion (reordering model). We begin with the latter, returning to the constraints later in this section.

Reordering modeling
In its original formulation, the PSMT approach includes a basic reordering model, called distortion cost, that exponentially penalizes longer jumps between consecutively translated phrases $(\tilde{f}_{i-1}, \tilde{f}_i)$:

d(\tilde{f}_{i-1}, \tilde{f}_i) = \alpha^{D(\tilde{f}_{i-1}, \tilde{f}_i)}, \qquad 0 < \alpha < 1

where D is the distortion between the two phrases, as defined in the next subsection. A number of more sophisticated solutions have been proposed to explicitly model word reordering during decoding. These can mostly be grouped into three families: phrase orientation models, jump models and source decoding sequence models.
Phrase orientation models (Tillmann, 2004; Koehn et al., 2005; Zens and Ney, 2006; Galley and Manning, 2008), also known as lexicalized reordering models, predict the orientation of a phrase with respect to the last translated one, by classifying it as monotone, swap or discontinuous. These models have proven very useful for short and medium-range reordering and are among the most widely used in PSMT. However, their coarse classification of reordering steps makes them unsuitable to predict long-range reorderings.
Jump models (Al-Onaizan and Papineni, 2006; Green et al., 2010; Yahyaei and Monz, 2010) predict the direction and length of a jump to perform after a given input word 2 . The first two of these works achieve their best Arabic-English results within a rather small distortion limit: namely, 8 in (Al-Onaizan and Papineni, 2006) and 5 in (Green et al., 2010), thus failing to capture the rare but crucial long reorderings that were their main motivation. A drawback of this approach is that long jumps are typically penalized because of their low frequency compared to short jumps. This strong bias is undesirable, given that we are especially interested in detecting probable long reorderings.
Source decoding sequence models predict which input word is likely to be translated at a given state of decoding. For instance, reordered source language models (Feng et al., 2010) are smoothed n-gram models trained on a corpus of source sentences reordered to match the target word order. When integrated into the SMT system, they assign a probability to each newly translated word given the n-1 previously translated words. Finally, source word pair reordering models (Visweswariah et al., 2011) estimate, for each pair of input words i and j, the cost of translating j right after i given various features of i, j and their respective contexts. Differently from reordered source LMs, these models are discriminative and can profit from richer feature sets. At the same time, they do not employ decoding history-based features, which allows for more effective hypothesis recombination. The model we are going to present belongs to this last sub-group, which we find especially suitable to predict long reorderings.

Reordering constraints
The reordering constraint originally included in the PSMT framework and implemented in our reference toolkit, Moses (Koehn et al., 2007), is called the distortion limit (DL). It allows the decoder to skip, or jump over, at most k words between the last translated phrase and the next one. More precisely, the limit is imposed on the distortion D between consecutively translated phrases $(\tilde{f}_{i-1}, \tilde{f}_i)$:

D(\tilde{f}_{i-1}, \tilde{f}_i) = |\mathrm{start}(\tilde{f}_i) - \mathrm{end}(\tilde{f}_{i-1}) - 1| \le k

Limiting the input permutation space is necessary for beam-search PSMT decoders to run in linear time. Reordering constraints are also important for translation quality, because the existing models are typically not discriminative enough to guide the search over very large sets of reordering hypotheses.
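For illustration, a minimal sketch of this distortion computation and the corresponding limit check, assuming integer word positions for phrase boundaries; the function names are illustrative and not part of the Moses implementation:

    def distortion(prev_phrase_end, next_phrase_start):
        # Distortion between consecutively translated phrases:
        # 0 for a monotone continuation, k when k words are skipped or re-crossed.
        return abs(next_phrase_start - prev_phrase_end - 1)

    def within_distortion_limit(prev_phrase_end, next_phrase_start, k=8):
        # The DL constraint: the jump is allowed only if its distortion is at most k.
        return distortion(prev_phrase_end, next_phrase_start) <= k

    # Example: the last phrase ends at position 4 and the next one starts at 12,
    # i.e. seven source words are skipped: D = |12 - 4 - 1| = 7 <= 8.
    assert distortion(4, 12) == 7 and within_distortion_limit(4, 12, k=8)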
Despite their crucial effect on the complexity of reordering modeling, though, reordering constraints have drawn less attention in the literature. The existing reordering constraints are typically based either on word-to-word distances, as in the IBM constraints (Berger et al., 1996) and the DL (Koehn et al., 2007), or on permutation patterns, as in ITG constraints (Wu, 1997). Both kinds of constraints are uniform throughout the input sentence, and insensitive to the word being translated and to its context. This results in a very coarse definition of the reordering search space, which is problematic in language pairs with different syntactic structures.
To address this problem, Yahyaei and Monz (2010) present a technique to dynamically set the DL: they train a classifier to predict the most probable jump length after each input word, and use the predicted value as the DL after that position. Unfortunately, this method can generate inconsistent constraints leading to decoding dead-ends. As a solution, the dynamic DL is relaxed when needed to reach the first uncovered position. Translation improvements are reported only on a small-scale task with short sentences (BTEC), over a baseline that includes a very simple reordering model. In our work we develop this idea further and use a reordering model to predict which specific input words, rather than input intervals, are likely to be translated next. Moreover, our solution is not affected by the constraint inconsistency problem (see Sect. 4).
Another related approach generates likely reorderings of the input sentence by means of language-specific fuzzy rules based on shallow syntax. Long jumps are then suggested to the PSMT decoder by reducing the distortion cost for specific pairs of input words. In comparison to the dynamic DL, this is a much finer way to define the reordering space, leading to consistent improvements of both translation quality and efficiency over a strong baseline. However, the need for language-specific reordering rules makes the method harder to apply to new language pairs.

The WaW reordering model
We model reordering as the problem of deciding whether a given input word should be translated right after another (Word-after-Word). This formulation is particularly suitable to help the decoder decide whether a reordering path is promising enough to be further explored. Moreover, when translating a sentence, choosing the next source word to translate appears a more natural problem than guessing how far to the left or to the right we should move from the current source position. The WaW reordering model addresses this binary decision task through the following maximum-entropy classifier:

P(R_{i,j} = Y \mid f_1^J) = \frac{\exp\big(\sum_m \lambda_m h_m(f_1^J, i, j, Y)\big)}{\sum_{Y' \in \{0,1\}} \exp\big(\sum_m \lambda_m h_m(f_1^J, i, j, Y')\big)}

where f_1^J is a source sentence of J words, h_m are feature functions and λ_m the corresponding feature weights. The outcome Y can be either 1 or 0, with R_{i,j}=1 meaning that the word at position j is translated right after the word at position i.

Our WaW reordering model is strongly related to that of Visweswariah et al. (2011), hereafter called the Travelling Salesman Problem (TSP) model, with a few important differences: (i) we do not include in the features any explicit indication of the jump length, in order to avoid the bias towards short jumps; (ii) they train a linear model with MIRA (Crammer and Singer, 2003) by minimizing the number of input words that get placed after the wrong position, while we use a maximum-entropy classifier trained by maximum likelihood; (iii) they use an off-the-shelf TSP solver to find the best source sentence permutation and apply it as pre-processing to training and test data. By contrast, we integrate the maximum-entropy classifier directly into the SMT decoder and let all its other models (phrase orientation, translation, target LM, etc.) contribute to the final reordering decision.
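As a minimal sketch, the classifier's decision can be computed as follows; this assumes per-outcome feature weights and hypothetical variable names (an equivalent binary formulation with a single weight vector and a sigmoid is also common):

    import math

    def waw_probability(features, weights):
        # P(R_ij = 1 | f, i, j) under a binary maximum-entropy model.
        # features: list of active (binary) feature names for the pair (i, j);
        # weights:  dict mapping (feature, outcome) to its lambda, 0.0 if absent.
        score = {y: sum(weights.get((f, y), 0.0) for f in features) for y in (0, 1)}
        z = math.exp(score[0]) + math.exp(score[1])
        return math.exp(score[1]) / z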

Features
Like the TSP model (Visweswariah et al., 2011), the WaW model builds on binary features similar to those typically employed for dependency parsing (McDonald et al., 2005): namely, combinations of surface forms or POS tags of the words i and j and their context. Our feature templates are presented in Table 1. The main novelties with respect to the TSP model are the mixed word-POS templates (rows 16-17) and the shallow syntax features. In particular, we use the chunk types of i, j and their context (rows 18-19), as well as the chunk head words of i and j (row 20). Finally, we add a feature indicating whether the words i and j belong to the same chunk (row 21). The jump orientation (forward/backward) is included in the features that represent the words comprised between i and j (rows 6, 7, 14, 15). No explicit indication of the jump length is included in any feature.
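To make the templates concrete, the sketch below instantiates a few of them for a position pair (i, j); the token representation (per-position word, POS, chunk type, chunk id and chunk head) and the feature naming are assumptions made for readability, not the exact format used in the experiments:

    def waw_features(tokens, i, j):
        # tokens: list of dicts with keys 'w' (word), 'p' (POS tag),
        # 'c' (chunk type), 'chunk' (chunk id) and 'h' (chunk head word).
        direction = "fwd" if j > i else "bwd"
        lo, hi = min(i, j), max(i, j)
        between = tokens[lo + 1:hi]                       # words comprised between i and j
        feats = [
            f"wi={tokens[i]['w']}|wj={tokens[j]['w']}",   # word-word
            f"pi={tokens[i]['p']}|pj={tokens[j]['p']}",   # POS-POS
            f"wi={tokens[i]['w']}|pj={tokens[j]['p']}",   # mixed word-POS
            f"ci={tokens[i]['c']}|cj={tokens[j]['c']}",   # chunk types
            f"hi={tokens[i]['h']}|hj={tokens[j]['h']}",   # chunk heads
            f"same_chunk={tokens[i]['chunk'] == tokens[j]['chunk']}",
        ]
        # in-between features carry the jump orientation, but never its length
        feats += [f"pb={t['p']}|{direction}" for t in between]
        feats.append("pb_all=" + "_".join(t['p'] for t in between) + "|" + direction)
        return feats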

Training data
To generate training data for the classifier, we first extract reference reorderings from a word-aligned parallel corpus. Given a parallel sentence, different heuristics may be used to convert arbitrary word alignments to a source permutation (Birch et al., 2010; Feng et al., 2010; Visweswariah et al., 2011). Similarly to this last work, we compute for each source word f_i the mean a_i of the target positions aligned to f_i, then sort the source words according to this value. 3 As a difference, though, we do not discard unaligned words but assign them the mean of their neighbouring words' alignment means, so that a complete permutation of the source sentence (σ) is obtained. Table 2(a) illustrates this procedure.

Table 1: Feature templates used to learn whether a source position j is to be translated right after i (w: word identity, p: POS tag, c: chunk type, h: chunk head). Positions comprised between i and j are denoted by b and generate two feature templates: one for each position (6 and 14) and one for the concatenation of them all (7 and 15).

Given the reference permutation, we then generate positive and negative training samples by simulating the decoding process. We traverse the source positions in the order defined by σ, keeping track of the positions that have already been covered and, for each t: 1 ≤ t ≤ J, generate:
• one positive sample (R_{σt,σt+1}=1) for the source position σt+1 that comes right after σt in the reference permutation;
• negative samples (R_{σt,j}=0) for every other position j that lies within a sampling window δ of σt and has not yet been translated.
Here, the sampling window δ serves to control the size of the training data and the proportion between positive and negative samples. Its value naturally correlates with the DL used in decoding. The generation of training samples is illustrated by Table 2(b).
Table 2: (a) Converting word alignments to a permutation: source words are sorted by the mean a of their target alignments; the unaligned word "D" is assigned the mean of its neighbouring words' a values, (2 + 5)/2 = 3.5. (b) Generating binary samples by simulating the decoding process: shaded circles represent covered positions, while dashed arrows represent negative samples.
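The two procedures illustrated by Table 2 can be sketched as follows; the handling of unaligned words at sentence boundaries and the exact definition of the sampling window are simplifying assumptions:

    def source_permutation(alignments):
        # alignments: one set of aligned target positions per source word
        # (empty set for unaligned words).
        means = [sum(a) / len(a) if a else None for a in alignments]
        for i, m in enumerate(means):
            if m is None:                                  # unaligned word: inherit the mean
                left = means[i - 1] if i > 0 else None     # of its neighbours' alignment means
                right = next((x for x in means[i + 1:] if x is not None), None)
                neigh = [x for x in (left, right) if x is not None]
                means[i] = sum(neigh) / len(neigh) if neigh else float(i)
        return sorted(range(len(means)), key=means.__getitem__)

    def waw_samples(sigma, delta=10):
        # Simulate left-to-right decoding along the reference permutation sigma
        # and yield (i, j, label) training samples.
        covered = set()
        for t in range(len(sigma) - 1):
            i, nxt = sigma[t], sigma[t + 1]
            covered.add(i)
            yield i, nxt, 1                                # positive: the true next position
            for j in range(max(0, i - delta), min(len(sigma), i + delta + 1)):
                if j not in covered and j != nxt:
                    yield i, j, 0                          # negatives: uncovered positions
                                                           # within the sampling window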

Integration into phrase-based decoding
Rather than using the new reordering model for data pre-processing as done by Visweswariah et al. (2011), we directly integrate it into the PSMT decoder Moses (Koehn et al., 2007). Two main computation phases are required by the WaW model: (i) at system initialization time, all feature weights are loaded into memory, and (ii) before translating each new sentence, features are extracted from it and model probabilities are pre-computed for each pair of source positions (i, j) such that |j − i − 1| ≤ DL. Note that this efficient solution is possible because our model does not employ decoding history-based features, such as the word that was translated before the last one, or the previous jump length. This is an important difference with respect to the reordered source LM proposed by Feng et al. (2010), which requires the inclusion of the last n translated words in the decoder state.

Fig. 1 illustrates the scoring process: when a partial translation hypothesis H is expanded by covering a new source phrase f̃, the model returns the log-probability of translating the words of f̃ in that particular order, just after the last translated word of H. In detail, this is done by converting the phrase-internal word alignment 4 to a source permutation, in just the same way as was done to produce the model's training examples. Thus, the global score is independent of phrase segmentation and normalized across outputs of different lengths: that is, the probability of any complete hypothesis decomposes into J factors, where J is the length of the input sentence.
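The following sketch illustrates phase (ii) and the phrase-level scoring described above; waw_probability and waw_features refer to the sketches given earlier, and the remaining names (cache layout, alignment representation, probability floor) are illustrative assumptions:

    import math

    def precompute_waw(tokens, waw_probability, weights, dl=18):
        # Phase (ii): cache P(R_ij = 1) for every ordered pair within the distortion limit.
        J = len(tokens)
        cache = {}
        for i in range(J):
            for j in range(J):
                if i != j and abs(j - i - 1) <= dl:
                    cache[(i, j)] = waw_probability(waw_features(tokens, i, j), weights)
        return cache

    def score_expansion(cache, last_covered, phrase_positions, target_means):
        # Log-probability of covering the source words of a phrase, in the order implied
        # by its internal word alignment, right after the last covered source position.
        order = sorted(phrase_positions, key=lambda p: target_means.get(p, 0.0))
        score, prev = 0.0, last_covered
        for pos in order:
            score += math.log(cache.get((prev, pos), 1e-9))   # floor for out-of-limit pairs
            prev = pos
        return score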
The WaW reordering model is fully compatible with, and complementary to, the lexicalized reordering (phrase orientation) models included in Moses.

Early pruning of reordering steps
We now explain how the WaW reordering model can be used to dynamically refine the input permutation space. This method is not dependent on the particular classifier described in this paper, but can in principle work with any device estimating the probability of translating a given input word after another.
The method consists of querying the reordering model at the time of hypothesis expansion, and filtering out hypotheses solely based on their reordering score. The rationale is to avoid costly hypothesis expansions for those source positions that the reordering model considers very unlikely to be covered at a given point of decoding. In practice, this works as follows:
• at each hypothesis expansion, we first enumerate the set of uncovered input positions that are reachable within a fixed DL, and query the WaW reordering model for each of them 5 ;
• only based on the WaW score, we apply histogram and threshold pruning to this set and proceed to expand the non-pruned positions.
Furthermore, it is possible to ensure that local reorderings are always allowed by setting a so-called non-prunable zone of width ϑ around the last covered input position. 6 Depending on how the DL, the pruning parameters, and ϑ are set, we can aim at different objectives: with a low DL, loose pruning parameters, and ϑ=0, we can try to speed up the search without sacrificing much translation quality. With a high DL, strict pruning parameters, and a medium ϑ, we ensure that the standard medium-range reordering space is explored, along with those few long jumps that are promising according to the reordering model. In our experiments, we explore this second option with the setting DL=18 and ϑ=5.
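A minimal sketch of this pruning step, reusing the pre-computed score cache from the previous section; the parameter names (histogram size, relative threshold, non-prunable zone ϑ) mirror the text, while the exact tie-breaking and data structures are assumptions:

    def prune_reorderings(cache, last_covered, uncovered, dl=18,
                          histogram=3, threshold=0.1, theta=5):
        # Return the uncovered positions that may still be expanded from last_covered.
        reachable = [j for j in uncovered if abs(j - last_covered - 1) <= dl]
        if not reachable:
            return set()
        # Local reorderings (within the non-prunable zone) are always allowed.
        local = {j for j in reachable if abs(j - last_covered - 1) <= theta}
        # Rank the remaining candidates by the WaW score alone.
        ranked = sorted(reachable, key=lambda j: cache[(last_covered, j)], reverse=True)
        best = cache[(last_covered, ranked[0])]
        kept = {j for rank, j in enumerate(ranked)
                if rank < histogram and cache[(last_covered, j)] >= threshold * best}
        return kept | local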
The underlying idea is similar to that of early pruning proposed by Moore and Quirk (2007), which consisted in discarding possible extensions of a partial hypothesis based on their estimated score before computing the exact language model score. Our technique too has the effect of introducing additional points at which the search space is pruned. However, while theirs was mainly an optimization technique meant to avoid useless LM queries, we instead aim at refining the search space by exploiting the fact that some SMT models are more important than others at different stages of the translation process. Our approach actually involves a continuous alternation of two processes: during hypothesis expansion the reordering score is combined with all other scores, while during early pruning some reordering decisions are taken only based on the reordering score. In this way, we try to combine the benefits of fully integrated reordering models with those of monolingual pre-ordering methods.

Evaluation
We test our approach on an Arabic-English news translation task where sentences are typically long and complex. In this language pair, long reordering errors mostly concern verbs, as Subject-Verb-Object (SVO), VSO and, more rarely, VOS constructions are all attested in modern written Arabic. This issue is well known in the SMT field and was addressed by several recent works with deep or shallow parsing-based techniques (Green et al., 2009; Carpuat et al., 2012; Andreas et al., 2011). We ask whether our approach, which is not specifically designed to solve this problem and requires no manual rules to predict verb reordering, can improve long reordering in a fully data-driven way. As SMT training data, we use all the in-domain parallel data provided for the NIST-MT09 evaluation, for a total of 986K sentence pairs (31M English words). 7 The target LM used to run the main series of experiments is trained on the English side of all available NIST-MT09 parallel data, UN included (147M words). In the large-scale experiments, the LM training data also include the sections of the English Gigaword that best fit the development data in terms of perplexity: namely, the Agence France-Presse, Xinhua News Agency and Associated Press Worldstream sections (2130M words in total).
For development and test, we use the newswire sections of the NIST benchmarks: dev06-nw, eval08-nw and eval09-nw, consisting of 1033, 813 and 586 sentences, respectively. Each set includes 4 reference translations and the average sentence length is 33 words. To focus the evaluation on problematic reordering, we also consider a subset of eval09-nw containing only sentences where the Arabic main verb is placed before the subject (vs-09: 299 sentences). 8 As pre-processing, we apply standard tokenization to the English data, while the Arabic data is segmented with AMIRA (Diab et al., 2004) according to the ATB scheme 9 . The same tool also produces POS tags and shallow syntax annotation.

Reordering model intrinsic evaluation
Before proceeding to the SMT experiments, we evaluate the performance of the WaW reordering model in isolation. All the tested configurations are trained with the freely available MegaM toolkit 10, which implements the conjugate gradient method (Hestenes and Stiefel, 1952), for a maximum of 100 iterations. Training samples are generated within a sampling window of width δ=10, from a subset (30K sentences) of the parallel data described above, resulting in 8M training word pairs 11. Test samples are generated from TIDES-MT04 (1324 sentences, 370K samples with δ=10), one of the corpora included in our SMT training data. Features with fewer than 20 occurrences are ignored.

10 http://www.cs.utah.edu/~hal/megam/ (Daumé III, 2004).
11 This is the maximum number of samples manageable by MegaM. However, even scaling from 4M to 8M was only slightly helpful in our experiments. In the future we plan to test other learning approaches that scale better to large data sets.
Classification accuracy. Table 3 presents precision, recall, and F-score achieved by different feature subsets, where W stands for word-based, P for POS-based and C for chunk-based feature templates. We can see that all feature types contribute to improve the classifier's performance. The word-based model achieves the highest precision but a very low recall, while the POS-based one has much more balanced scores. A better overall performance is obtained by combining word-, POS- and mixed word-POS-based features (62.6% F-score). Finally, the addition of chunk-based features yields a further improvement of about 1 point, reaching 63.8% F-score. Given these results, we decide to use the W,P,C model for the rest of the evaluation.

Ranking accuracy. A more important aspect to evaluate for our application is how well our model's scores can rank a typical set of reordering options. In fact, the WaW model is not meant to be used as a stand-alone classifier, but as one of several SMT feature functions. Moreover, for early reordering pruning to be effective, it is especially important that the correct reordering option be ranked in the top n among those available at the time of a given hypothesis expansion. In order to measure this, we simulate the decoding process by traversing the source words in target order and, for each of them, we examine the ranking of all words that may be translated next (i.e. the uncovered positions within a given DL). We check how often the correct jump was ranked first (Top-1) or at most third (Top-3). We also compute the latter score on long reorderings only (Top-3-long): i.e. backward jumps with distortion D>7 and forward jumps with D>6. In Table 4, results are compared with the ranking produced by standard distortion, which always favors shorter jumps. Two conditions are considered: DL=10, corresponding to the sampling window δ used to produce the training data, and DL=18, the maximum distortion of jumps that will be considered in our early-pruning SMT experiment.

We can see that, in terms of overall accuracy, the WaW reordering model outperforms standard distortion by a large margin (about 10% absolute). This is an important result, considering that the jump length, which strongly correlates with the jump likelihood, is not directly known to our model. As regards the DL, the higher limit naturally results in a lower DL-error rate (percentage of correct jumps beyond the DL): namely 0.8% instead of 2.4%. However, jump prediction becomes much harder: the Top-3 accuracy of long jumps by distortion drops from 50.7% to 18.9% (backward) and from 66.0% to 52.3% (forward). Our model is remarkably robust to this effect on backward jumps, where it achieves 68.0% accuracy. Due to the syntactic characteristics of Arabic and English, the typical long reordering pattern consists in (i) skipping a clause-initial Arabic verb, (ii) covering a long subject, then finally (iii) jumping back to translate the verb and (iv) jumping forward to continue translating the rest of the sentence (see Fig. 3 for an example).
Deciding when to jump back to cover the verb (iii) is the hardest part of this process, and that is precisely where our model seems most helpful, while distortion always prefers to proceed monotonically, achieving a very low accuracy of 18.9%. In the case of long forward jumps (iv), instead, distortion has an advantage, as the correct choice typically corresponds to translating the first uncovered position, that is, the shortest jump available from the last translated word. Even here, our model achieves an accuracy of 51.8%, only slightly lower than that of distortion (52.3%).
In summary, the WaW reordering model significantly outperforms distortion in the ranking of long jumps. In the large majority of cases, it is able to rank a correct long jump in the top 3 reordering options, which suggests that it can be effectively used for early reordering pruning.
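For reference, the ranking evaluation just described can be sketched as follows, with score standing for either the WaW model (higher is better) or negative distortion; DL-error handling and tie-breaking are simplified:

    def topk_accuracy(sigma, score, dl=10, k=3):
        # Fraction of simulated decoding steps where the correct next source
        # position is ranked within the top k reachable candidates.
        covered, hits, total = set(), 0, 0
        for t in range(len(sigma) - 1):
            i, gold = sigma[t], sigma[t + 1]
            covered.add(i)
            candidates = [j for j in range(len(sigma))
                          if j not in covered and abs(j - i - 1) <= dl]
            if gold not in candidates:          # correct jump beyond the DL (a DL error)
                continue
            ranked = sorted(candidates, key=lambda j: score(i, j), reverse=True)
            hits += gold in ranked[:k]
            total += 1
        return hits / total if total else 0.0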

SMT experimental setup
Our SMT systems are built with the Moses toolkit, while word alignment is produced by the Berkeley Aligner (Liang et al., 2006). The baseline decoder includes a phrase translation model, a lexicalized reordering model, a 6-gram target language model, distortion cost, and word and phrase penalties. More specifically, the baseline reordering model is a hierarchical phrase orientation model (Tillmann, 2004; Koehn et al., 2005; Galley and Manning, 2008) trained on all the available parallel data. This variant was shown to outperform the default word-based variant on an Arabic-English task. To make our baseline even more competitive, we apply early distortion cost, as proposed by Moore and Quirk (2007). This function has the same value as the standard one over a complete translation hypothesis, but it anticipates the gradual accumulation of the cost, making hypotheses of the same length more comparable to one another. Note that this option has no effect on the distortion limit, but only on the distortion cost feature function. As proposed by Johnson et al. (2007), statistically improbable phrase pairs are removed from the translation model. The language models are estimated with the IRSTLM toolkit (Federico et al., 2008) using modified Kneser-Ney smoothing (Chen and Goodman, 1999).
Feature weights are optimized by minimum BLEU-error training (Och, 2003) on dev06-nw. To reduce the effects of the optimizer instability, we tune each configuration four times and use the average of the resulting weight vectors to translate the test sets, as suggested by Cettolo et al. (2011).
Finally, eval08-nw is used to select the early pruning parameters for the last experiment, while eval09-nw is always reserved as blind test.

Evaluation metrics
We evaluate global translation quality with BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005). These metrics, though, are only indirectly sensitive to word order, and are especially unlikely to capture improvements at the level of long-range reordering. For this reason, we also compute the Kendall Reordering Score, or KRS (Birch et al., 2010), a positive score based on the Kendall's Tau distance between the source-output permutation π and the source-reference permutations σ, which also includes a sentence-level brevity penalty BP similar to that of BLEU. The KRS is robust to lexical choice because it performs no comparison between output and reference words, but only between the positions of their translations. Besides, it was shown to correlate strongly with human judgements of fluency.
Our work specifically addresses long-range reordering phenomena in language pairs where these are quite rare, although crucial for preserving the source text meaning. Hence, an improvement at this level may not be detected by the general-purpose metrics. We therefore develop a KRS variant that is only sensitive to the positioning of specific input words. Assuming that each input word f_i is assigned a weight λ_i, the Kendall's Tau distance is modified so that each pair of source positions contributes in proportion to the weights of the words involved, with the normalization adjusted accordingly. A similar element-weighted version of Kendall's Tau was proposed by Kumar and Vassilvitskii (2010) to evaluate document rankings in information retrieval. Because long reordering errors in Arabic-English mostly affect verbs, we set the weights to 1 for verbs and 0 for all other words, so as to capture only verb reordering errors, and call the resulting metric KRS-V. The source-reference word alignments needed to compute the reordering scores are generated by the Berkeley Aligner previously trained on the training data. Source-output word alignments are instead obtained from the decoder's trace.
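The exact KRS and KRS-V formulas follow Birch et al. (2010) and Kumar and Vassilvitskii (2010); the sketch below only illustrates the general idea of a brevity-penalized, element-weighted Kendall-based score. In particular, the pair weight (here the maximum of the two word weights) and the normalization are assumptions; with all weights equal to 1 the function reduces to a plain Kendall-based reordering score:

    def weighted_kendall_score(pi, sigma, weights, bp=1.0):
        # pi, sigma: permutations of the source positions (output vs. reference order);
        # weights:   one weight per source position (e.g. 1 for verbs, 0 otherwise);
        # bp:        sentence-level brevity penalty.
        rank_out = {p: r for r, p in enumerate(pi)}
        rank_ref = {p: r for r, p in enumerate(sigma)}
        positions = sorted(rank_out)
        discordant, norm = 0.0, 0.0
        for a in range(len(positions)):
            for b in range(a + 1, len(positions)):
                i, j = positions[a], positions[b]
                w = max(weights[i], weights[j])       # assumed pair weight
                norm += w
                if (rank_out[i] - rank_out[j]) * (rank_ref[i] - rank_ref[j]) < 0:
                    discordant += w                   # i and j are ordered differently
        return bp * (1.0 - discordant / norm) if norm else bp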

Results and discussion
To motivate the choice of our baseline setup (early distortion cost and DL=8), we first compare the performance of standard and early distortion costs under various DL conditions. As shown in Fig. 2, most results are close to each other in terms of BLEU and KRS, but early distortion consistently outperforms the standard one (the difference is statistically significant). The most striking difference appears at a very high distortion limit (18), where the standard distortion scores drop by more than 1 BLEU point and almost 7 KRS points. Early distortion is much more robust (only -1 KRS when going from DL=8 to DL=18), which makes our baseline system especially strong at the level of reordering.

[Table caption fragment: statistical significance is assessed following Riezler and Maxwell (2005); run times are measured on an Intel Xeon X5650 processor on the first 500 sentences of eval08-nw and exclude the loading time of all models.]
Medium-scale evaluation. Integrating the WaW model as an additional feature function results in small but consistent improvements in all DL conditions, which shows that this type of model conveys information that is missing from the state-of-the-art reordering models. As regards efficiency, the new model makes decoding time increase by 8%.
Among the DL settings considered, DL=8 is confirmed as the optimal one, with or without the WaW model. Raising the DL to 18 with no special pruning has a negative impact on both translation quality and efficiency. The effect is especially visible on the reordering scores: from 84.7 to 83.9 KRS and from 86.2 to 85.8 KRS-V on eval09-nw. Run times almost double: from 87 to 164 and from 94 to 178 ms/word, that is, an 89% increase.
We then proceed to the last experiment where the reordering space is dynamically pruned based on the WaW model score. As explained in Sect. 4, a non-prunable zone of width ϑ=5 is set around the last covered position. To set the early pruning parameters, we perform a grid search over the values (1, 2, 3, 4, 5) for histogram and (0.5, 0.25, 0.1) for relative threshold, and select the values that achieve the best BLEU and KRS on eval08-nw, namely 3 (histogram) and 0.1 (threshold). The resulting configuration is then re-optimized by MERT on dev06-nw. This setting implies that, at a given point of decoding where i is the last covered position, a new word can be translated only if:
• it lies within a DL of 5 from i, or
• it lies within a DL of 18 from i and its WaW reordering score is among the top 3 and at least equal to 1/10 of the best score (in linear space).
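Expressed in code, and reusing the names from the sketch in Sect. 4, this decision rule amounts to the following check (all names are illustrative):

    def may_expand(j, i, cache, best_score, rank, dl=18, theta=5,
                   histogram=3, threshold=0.1):
        # i: last covered position; j: candidate position;
        # rank: rank of j by WaW score among the reachable candidates.
        d = abs(j - i - 1)
        if d <= theta:                               # within a DL of 5 from i
            return True
        return (d <= dl                              # within a DL of 18 from i
                and rank < histogram                 # WaW score among the top 3
                and cache[(i, j)] >= threshold * best_score)   # and >= 1/10 of the best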
As shown in Table 5, early pruning achieves the best results overall: despite the high DL, we report no loss in BLEU, METEOR and KRS, and we actually observe several improvements. In particular, the gains on the blind test eval09-nw are +0.3 BLEU, +0.2 METEOR and +0.2 KRS (only the METEOR gain is significant). While these gains are admittedly small, we recall that our techniques affect rare and isolated events which can hardly emerge from the general-purpose evaluation metrics. Moreover, to our knowledge, this is the first time that a PSMT system is shown to maintain a good performance on this language pair while admitting very long-range reorderings.
Finally and more importantly, the reordering of verbs improves significantly both on the generic test sets and on the VS-sentence subset (vs-09): in the latter, we achieve a notable gain of 1.4 KRS-V. Efficiency is also largely improved by our early reordering pruning technique: decoding time is reduced to 68 ms/word, corresponding to a 22% speed-up over the baseline.
Large-scale evaluation. We also investigate whether our methods can be useful in a scenario where efficiency is less important and more data is available for training. To this end, we build a very large LM by interpolating the main LM with three other LMs trained on different Gigaword sections (see Sect. 5). Moreover, we relax the decoder's beam size from the default value of 200 to 400 hypotheses, to reduce the risk of search errors and obtain the best possible baseline performance.
By comparing the large-scale with the medium-scale baseline in Table 5, we note that the addition of LM data is especially beneficial for BLEU (+1.5 on eval08-nw and +1.0 on eval09-nw), but not as much for the other metrics, which challenges the commonly held idea that more data always improves translation quality.
Here too, relaxing the DL without special pruning hurts not only efficiency but also translation quality: all the scores decrease considerably, showing that even the stronger LM is not sufficient to guide search through a very large reordering search space.
As for our enhanced system, it achieves similar gains as in the medium-scale scenario: that is, BLEU and METEOR are preserved or slightly improved despite the very high DL, while all the reordering scores increase. In particular, we report statistically significant improvements in the reordering of verbs, which is where the impact of our method is expected to concentrate (+0.7, +0.8 and +1.0 KRS-V on eval08-nw, eval09-nw and vs-09, respectively).
These results confirm the usefulness of our method not only as an optimization technique, but also as a way to improve translation quality on top of a very strong baseline.

Long jump statistics and examples.
To better understand the behavior of the early-pruning system, we extract phrase-to-phrase jump statistics from the decoder log file. We find that 132 jumps beyond the non-prunable zone (D>5) were performed to translate the 586 sentences of eval09-nw; 38 of these were longer than 8 and mostly concentrated on the VS-sentence subset (27 jumps with D>8 performed in vs-09). 13 Together with the higher reordering scores, this suggests that long jumps are mainly carried out to correctly reorder clause-initial verbs over long subjects. Fig. 3 shows two Arabic sentences taken from eval09-nw that were erroneously reordered by the baseline system. The system including the WaW model and early reordering pruning, instead, produced the correct translation. The first sentence is a typical example of VSO order with a long subject: while the baseline system left the verb in its Arabic position, producing an incomprehensible translation, the new system placed it correctly between the English subject and object. This reordering involved two long jumps: one backward with D=9 and one forward with D=8.
The second sentence displays another, less common, Arabic construction: namely VOS, with a personal pronoun object. In this case, a backward jump with D=10 and a forward jump with D=8 were necessary to achieve the correct reordering. 13 Statistics computed on the medium-LM system.

Conclusions
We have trained a discriminative model to predict likely reordering steps in a way that is complementary to state-of-the-art PSMT reordering models. We have effectively integrated it into a PSMT decoder as an additional feature function, ensuring that its total score over a complete translation hypothesis is consistent across different phrase segmentations. Lastly, we have proposed early reordering pruning as a novel method to dynamically shape the input reordering space and capture long-range reordering phenomena that are often critical when translating between languages with different syntactic structures.
Evaluated on a popular Arabic-English news translation task against a strong baseline, our approach leads to similar or even higher BLEU, METEOR and KRS scores at a very high distortion limit (18), which is by itself an important achievement. At the same time, the reordering of verbs, measured with a novel version of the KRS, is consistently improved, while decoding gets significantly faster. The improvements are also confirmed when a very large LM is used and the decoder's beam size is doubled, which shows that our method reduces not only search errors but also model errors even when baseline models are very strong.
Word reordering is probably the most difficult aspect of SMT and an important factor of both its quality and efficiency. Given its strong interaction with the other aspects of SMT, it appears natural to solve word reordering during decoding, rather than before or after it. To date, however, this objective was only partially achieved. We believe there is a promising way to go between fully-integrated reordering models and monolingual pre-ordering methods. This work has started to explore it.