Modelling and Optimizing on Syntactic N-Grams for Statistical Machine Translation

The role of language models in SMT is to promote fluent translation output, but traditional n-gram language models are unable to capture fluency phenomena between distant words, such as some morphological agreement phenomena, subcategorisation, and syntactic collocations with string-level gaps. Syntactic language models have the potential to fill this modelling gap. We propose a language model for dependency structures that is relational rather than configurational and thus particularly suited for languages with a (relatively) free word order. It is trainable with Neural Networks, and not only improves over standard n-gram language models, but also outperforms related syntactic language models. We empirically demonstrate its effectiveness in terms of perplexity and as a feature function in string-to-tree SMT from English to German and Russian. We also show that using a syntactic evaluation metric to tune the log-linear parameters of an SMT system further increases translation quality when coupled with a syntactic language model.


Introduction
Many languages exhibit fluency phenomena that are discontinuous in the surface string, and are thus not modelled well by traditional n-gram language models. Examples include morphological agreement, e.g. subject-verb agreement in languages that do not (exclusively) follow SVO word order, subcategorisation, and collocations involving distant, but syntactically linked words.
Syntactic language models try to overcome the limitation to a local n-gram context by using syntactically related words (and non-terminals) as context information. Despite their theoretical attractiveness, it has proven difficult to improve SMT with parsers as language models (Och et al., 2004; Post and Gildea, 2008).
This paper describes an effective method to model, train, decode with, and weight a syntactic language model for SMT. While all these aspects are important for successfully applying a syntactic language model, our primary contributions are twofold: a novel dependency language model that improves over prior work by making relational modelling assumptions, which we argue are better suited for languages with a (relatively) free word order; and the use of a syntactic evaluation metric for optimizing the log-linear parameters of the SMT model.
While language models that operate on words linked through a dependency chain, called syntactic n-grams (Sidorov et al., 2013), can improve translation, some of the improvement is invisible to an n-gram metric such as BLEU. As a result, tuning to BLEU does not show the full value of a syntactic language model. What does show its value is an optimization metric that operates on the same syntactic n-grams that are modelled by the dependency LM.
The paper is structured as follows. Section 2 describes our relational dependency language model; Section 3 describes our neural network training procedure and the integration of the model into an SMT decoder. We describe the syntactic evaluation metric we use for tuning in Section 4. The language models are evaluated on the basis of perplexity and SMT performance in Section 5. We discuss related work in Section 6, and finish with concluding remarks in Section 7.

A Relational Dependency Language Model
As motivation, and as a working example for the model description, consider the dependency tree in Figure 1, which is taken from the output of our baseline string-to-tree SMT system. 1 The output contains two errors:
• a morphological agreement error between the subject Ergebnisse (plural) and the finite verb wird (singular).
• a subcategorisation error: überraschen is transitive, but the translation has a prepositional phrase instead of an object.
While these errors might not have occurred if the words involved were adjacent to one another here and throughout the training set, non-adjacency is common, especially where the distance between subject and finite verb, or between a full verb and its arguments, can be arbitrarily long. Prior work on syntactic language modelling has typically focused on English, and we argue that some modelling decisions do not transfer well to other languages. The dependency models proposed by Shen et al. (2010) and Zhang (2009) rely heavily on structural information such as the direction and distance of the dependent from the parent. In a language where the order of syntactic dependents is more flexible than in English, such as German 2, grammatical function (and thus the inflection) is hard to predict from the dependent order. Instead, we make dependency labels, which encode grammatical relations, a core element of our model. 3
1 The tree is converted into constituency format for compatibility with SCFG decoding algorithms, with dependency edges represented as non-terminal nodes.
2 German has a strict word order within noun phrases and for the placement of verbs, but has different word order for main clauses and subordinate clauses, and some flexibility in the order of dependents of a verb.
3 Tsarfaty (2010) classifies parsing approaches into configurational approaches that rely on structural information, and relational ones that take grammatical relations as primitives. While she uses dependency syntax as a prototypical example of relational approaches, the dependency LM by Shen et al. (2010) would fall into the configurational category, while ours is relational.
Shen et al. (2010) propose a model that estimates the probability of each token given its parent and/or preceding siblings. We start with a variant of their model that does not hard-code configurational modelling assumptions, and then extend it by including dependency labels.

Unlabelled Model
Let S be a sequence of terminal symbols w_1, w_2, ..., w_n with a dependency topology T, and let h_s(i) and h_a(i) be lists of heads of preceding siblings and ancestors of w_i according to T, from closest to furthest. In our example in Figure 1:
• h_a(4) = (Umfrage, Ergebnisse, wird, ε)
Note that h_a and its subsequences are instances of syntactic n-grams. For this model, we follow related work and assume that T is available (Popel and Marecek, 2010), approximating P(S) as P(S|T). We make the Markov assumption that the probability of each word only depends on its preceding siblings 4 and ancestors, and decompose the probability of a sentence like this:
$$P(S \mid T) \approx \prod_{i=1}^{n} P(w_i \mid h_s(i), h_a(i)) \tag{1}$$
We further make the Markov assumption that only a fixed window of the closest q siblings, and the closest r ancestors, affect the probability of a word:
$$P(S \mid T) \approx \prod_{i=1}^{n} P(w_i \mid h_s(i)_1^q, h_a(i)_1^r) \tag{2}$$
Here the subscript 1 and the superscript q (or r) select the first q (or r) elements of a list.
Equation 2 represents our basic, unlabelled model. It differs from that of Shen et al. (2010) in two ways.
4 Shen et al. (2010) use the siblings that are between the word and its parent, i.e. the following siblings if the word comes before its parent. We believe both preceding and following siblings are potentially useful, but leave expansion of the context to future work.
die Ergebnisse der jüngsten Umfrage wird für viele überraschen .
Figure 1: Translation output of baseline English→German string-to-tree SMT system with original dependency representation and conversion into constituency representation.
First, it uses separate context windows for siblings and ancestors. In contrast, Shen et al. (2010) treat the ancestor as the first symbol in a context window that is shared between the ancestor and siblings. Our formulation encodes our belief that the model should always assume dependence on the r nearest ancestor nodes, regardless of the number of siblings. Secondly, Shen et al. (2010) separate dependents to the left and to the right of the parent. While the fixed SVO verb order in English is compatible with such a separation, allowing P_L to model subjects and P_R to model objects, most arguments can occur before or after the head verb in German main clauses. We thus argue that left and right dependents should be modelled by a single model to allow for sharing of statistical strength. 5
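The context lists can be made concrete with a small sketch. The Python below is illustrative only: it works on plain head indices rather than on the constituency conversion, the names `WORDS`, `HEADS`, `ROOT`, `ancestors` and `preceding_siblings` are hypothetical, and every head index other than the chain jüngsten → Umfrage → Ergebnisse → wird from the running example is an assumption about the tree in Figure 1.

```python
from typing import List

# Example sentence (the text counts words from 1, the code from 0).
WORDS = ["die", "Ergebnisse", "der", "jüngsten", "Umfrage",
         "wird", "für", "viele", "überraschen", "."]
# ASSUMPTION: head index per word, -1 = attached to the virtual root.
# Only the chain jüngsten -> Umfrage -> Ergebnisse -> wird is given in the text.
HEADS = [1, 5, 4, 4, 1, -1, 8, 6, 5, -1]
ROOT = "<root>"  # hypothetical spelling of the virtual root's special head token


def ancestors(i: int, r: int) -> List[str]:
    """Heads of the r closest ancestors of word i, closest first."""
    chain: List[str] = []
    node = HEADS[i]
    while node != -1 and len(chain) < r:
        chain.append(WORDS[node])
        node = HEADS[node]
    if len(chain) < r:
        chain.append(ROOT)  # the chain ended at the virtual root
    return chain


def preceding_siblings(i: int, q: int) -> List[str]:
    """The q closest siblings to the left of word i, closest first."""
    siblings = [WORDS[j] for j in range(i - 1, -1, -1) if HEADS[j] == HEADS[i]]
    return siblings[:q]


print(ancestors(3, 4))           # ancestor context of "jüngsten"
print(preceding_siblings(3, 1))  # sibling context of "jüngsten"
```

Under these assumed attachments, a window of q = 1 and r = 2 would reduce the context of jüngsten to one sibling and its two closest ancestors.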

Labelled Model
The motivation for the inclusion of dependency labels is twofold. Firstly, having dependency labels in the context serves as a strong signal for the prediction of the correct inflectional form. Secondly, dependency labels are the appropriate level of abstraction to model subcategorisation frames.
5 Similar arguments have been made for parsing of (relatively) free word-order languages, e.g. by Tsarfaty et al. (2009).
Let D be a sequence of dependency labels l_1, l_2, ..., l_n, with each label l_i being the label of the incoming arc at position i in T, and l_s(i) and l_a(i) the lists of dependency labels of the siblings and ancestors of w_i, respectively. Continuing the example for w_4, these are:
• l_a(4) = (gmod, subj, vroot, sent)
We predict both the terminal symbols S and dependency labels D. The latter lets us model subcategorisation by penalizing unlikely relations, e.g. objects whose parent is an intransitive verb. We decompose P(S, D) into P(D) × P(S|D) to obtain:
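The decomposition can be written out as follows. This is a reconstruction rather than a verbatim formula: the conditioning contexts are inferred from the (2q + 2r)-word network input, with l_i as one additional input for P_w, and x_1^k is assumed to denote the first k elements of a list x. As in the unlabelled model, the topology T is taken as given.

```latex
\begin{align}
P(S, D) &= P(D) \times P(S \mid D)
         \approx \prod_{i=1}^{n} P_l(i) \times P_w(i) \tag{3} \\
P_l(i) &= P\bigl(l_i \mid h_s(i)_1^q,\; l_s(i)_1^q,\; h_a(i)_1^r,\; l_a(i)_1^r\bigr) \notag \\
P_w(i) &= P\bigl(w_i \mid h_s(i)_1^q,\; l_s(i)_1^q,\; h_a(i)_1^r,\; l_a(i)_1^r,\; l_i\bigr) \notag
\end{align}
```

Read this way, P_l scores how plausible a grammatical relation is in its context, and P_w scores the word form given that relation.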

Head and Label Extraction
We here discuss some details of the extraction of the context h_s and h_a. Dependency structures require no language-specific head extraction rules, even in a converted constituency representation. In the constituency representation shown in Figure 1, each non-terminal node in the tree that is not a pre-terminal has exactly one pre-terminal child. The head of a non-terminal node can thus be extracted by identifying the pre-terminal child, and taking its terminal symbol as head. An exception is the virtual node sent, which is added to the root of the tree to combine subtrees that are not connected in the original grammar, e.g. the main tree and the punctuation symbol. If a node has no pre-terminal child, we use a special token ε as its head.
If the sibling of a node is a pre-terminal node, we represent this through a special token in h_s and l_s. We also use special out-of-bound tokens (separate for h_s, h_a, l_s and l_a) to fill up the context window if the window is larger than the number of siblings and/or ancestors.
The context extraction rules are language-independent and can be applied to any dependency structure. Language-specific or grammar-specific rules are possible in principle. For instance, for verbal heads in German, one could consider separable verb prefixes part of the head, and thus model differences in subcategorisation between schlagen (Engl. beat) and schlagen ... vor (Engl. suggest).
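The extraction rules above can be sketched in a few lines of Python. Everything named here is hypothetical: the `Node` class, the token spellings `<eps>`, `<pt>`, `<none_h>` and `<none_l>`, and the POS tags and the labels det and attr in the toy subtree are assumptions made for illustration, not the representation used in our system.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# All token spellings below are hypothetical placeholders.
EPSILON = "<eps>"       # head of a node without a pre-terminal child
PT_SIBLING = "<pt>"     # stands in for a pre-terminal sibling
OOB_HEAD = "<none_h>"   # out-of-bound filler for the head window
OOB_LABEL = "<none_l>"  # out-of-bound filler for the label window


@dataclass
class Node:
    label: str
    word: Optional[str] = None  # set only on pre-terminals
    children: List["Node"] = field(default_factory=list)

    @property
    def is_preterminal(self) -> bool:
        return self.word is not None


def head(node: Node) -> str:
    """Terminal of the node's pre-terminal child, or EPSILON if there is none."""
    for child in node.children:
        if child.is_preterminal:
            return child.word
    return EPSILON


def sibling_context(parent: Node, index: int, q: int) -> Tuple[List[str], List[str]]:
    """Heads and labels of the q closest preceding siblings of parent.children[index]."""
    heads: List[str] = []
    labels: List[str] = []
    for sibling in reversed(parent.children[:index]):
        if sibling.is_preterminal:
            heads.append(PT_SIBLING)
            labels.append(PT_SIBLING)
        else:
            heads.append(head(sibling))
            labels.append(sibling.label)
    heads += [OOB_HEAD] * q    # pad, then cut to the window size
    labels += [OOB_LABEL] * q
    return heads[:q], labels[:q]


# Toy subtree for "die Ergebnisse der jüngsten Umfrage" (tags and det/attr assumed).
gmod = Node("gmod", children=[
    Node("det", children=[Node("ART", "der")]),
    Node("attr", children=[Node("ADJA", "jüngsten")]),
    Node("NN", "Umfrage"),
])
subj = Node("subj", children=[
    Node("det", children=[Node("ART", "die")]),
    Node("NN", "Ergebnisse"),
    gmod,
])

print(head(subj), sibling_context(subj, 2, 2))
```

The same padding idea would apply to the ancestor windows, with their own out-of-bound tokens.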

Predicting the Tree Topology
The model in equation 3 still assumes the topology of the dependency tree to be given, and we remedy this by also predicting pre-terminal nodes, and a virtual STOP node as the last child of each node. This models the position of the head in a subtree (through the prediction of pre-terminal nodes), and the probability that a word has no more dependents (by assigning probability mass to the STOP node).
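A minimal sketch of why such a label sequence is enough to determine the topology, assuming nodes are listed top-down and depth-first: the sequence can be read back with a stack that is pushed for every ordinary node and popped at every STOP. The function names `serialise` and `parents`, the tuple-based tree encoding, and the labels det and attr are assumptions made for this illustration.

```python
from typing import List, Optional, Tuple

Tree = Tuple[str, list]  # (label, children); a pre-terminal is ("PT", [])


def serialise(node: Tree) -> List[str]:
    """List node labels depth-first, closing every non-pre-terminal with STOP."""
    label, children = node
    if label == "PT":
        return ["PT"]
    labels = [label]
    for child in children:
        labels.extend(serialise(child))
    labels.append("STOP")  # virtual last child of this node
    return labels


def parents(labels: List[str]) -> List[Optional[int]]:
    """Recover the parent position of every node from the label sequence alone."""
    stack: List[int] = []
    result: List[Optional[int]] = []
    for i, label in enumerate(labels):
        result.append(stack[-1] if stack else None)
        if label == "STOP":
            stack.pop()      # the current node has no more dependents
        elif label != "PT":
            stack.append(i)  # later nodes attach below this one
    return result


# Subtree for "der jüngsten Umfrage": two dependents, then the head itself.
subtree: Tree = ("gmod", [("det", [("PT", [])]), ("attr", [("PT", [])]), ("PT", [])])
sequence = serialise(subtree)
print(sequence)
print(parents(sequence))
```

The position of the bare PT among the children of gmod encodes where the head word sits relative to its dependents.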
Instead of generating all n terminal symbols as in equation 3, we generate all m nodes in the dependency tree in top-down, depth-first order, with l_i being PT for pre-terminals, and the node label otherwise, and w_i being either the head of the node, or ε if the node has no pre-terminal child. Our final model is given in equation 4.
Figure 2: Snippet of prediction steps when generating terminals (top) or all nodes in tree (bottom) for dependency tree in Figure 1.
Figure 2 illustrates the prediction of a subtree of the dependency tree in Figure 1. Note that T is encoded implicitly, and can be retrieved from D through a stack to which all nodes (except for pre-terminal and STOP nodes) are pushed after prediction, and from which the last node is popped when predicting a STOP node.
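Equation 4 can be reconstructed from this description as follows. The form is a sketch under two assumptions: every one of the m nodes contributes a label prediction P_l(i), and a word prediction P_w(i) is made only where the node has a head, i.e. where w_i ≠ ε. Here P_l(i) denotes the probability of the label l_i and P_w(i) the probability of the word w_i, each given the windowed sibling and ancestor context.

```latex
\begin{equation}
P(S, D, T) \approx \prod_{i=1}^{m}
\begin{cases}
P_l(i) \times P_w(i) & \text{if } w_i \neq \varepsilon \\
P_l(i) & \text{otherwise}
\end{cases}
\tag{4}
\end{equation}
```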

Neural Network Training and SMT Decoding
We extract all training instances from automatically parsed training text, and perform training with a standard feed-forward neural network (Bengio et al., 2003), using the NPLM toolkit (Vaswani et al., 2013). Back-off smoothing schemes are unsatisfactory because it is unclear which part of the context should be forgotten first, and neural networks elegantly solve this problem. We use two separate networks, one for P_w and one for P_l. Both networks share the same input vocabulary, but are trained and applied independently. The model input is a (2q + 2r)-word context vector (+1 for P_w to encode l_i), each word being mapped to a shared embedding layer. We use a single hidden layer with rectified-linear activation function, and noise-contrastive estimation (NCE).
We integrate our dependency language models into a string-to-tree SMT system as additional feature functions that score each translation hypothesis. The model in equation 4 predicts P(S, D, T). Obtaining the probability of the translation hypothesis P(S) would require the (costly) marginalization over all sequences of dependency labels D and topologies T, but like the SMT decoder itself, we approximate the search for the best translation by searching for the highest-scoring derivation, meaning that we directly integrate P_w and P_l as two features into the log-linear SMT model. We use self-normalized neural networks with precomputation of the hidden layer, which makes the integration into decoding reasonably fast.
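To make the input layout of the two networks concrete, the following pure-Python sketch assembles the context for P_l and P_w and passes it through one rectified-linear hidden layer. It is a toy under stated assumptions: the ordering of the context slots, the function names `build_context` and `hidden_layer`, the random weights, the tiny dimensions, and the labels det and attr are all illustrative and do not reflect the NPLM implementation.

```python
import random
from typing import Dict, List, Optional, Sequence


def build_context(h_s: Sequence[str], l_s: Sequence[str],
                  h_a: Sequence[str], l_a: Sequence[str],
                  label: Optional[str] = None) -> List[str]:
    """Concatenate sibling and ancestor heads and labels (slot order is assumed)."""
    context = list(h_s) + list(l_s) + list(h_a) + list(l_a)
    if label is not None:  # only the word model P_w also sees l_i
        context.append(label)
    return context


def hidden_layer(context: Sequence[str],
                 embeddings: Dict[str, List[float]],
                 weights: List[List[float]]) -> List[float]:
    """Look up and concatenate embeddings, then apply a linear map and ReLU."""
    x = [value for token in context for value in embeddings[token]]
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


q, r, emb_dim, hidden_dim = 1, 2, 3, 4  # toy sizes, far smaller than in practice
rng = random.Random(0)

p_l_input = build_context(["der"], ["det"], ["Umfrage", "Ergebnisse"], ["gmod", "subj"])
p_w_input = build_context(["der"], ["det"], ["Umfrage", "Ergebnisse"], ["gmod", "subj"],
                          label="attr")

# Words and labels live in one shared input vocabulary.
embeddings = {token: [rng.uniform(-1, 1) for _ in range(emb_dim)]
              for token in sorted(set(p_w_input))}
weights = [[rng.uniform(-1, 1) for _ in range(len(p_w_input) * emb_dim)]
           for _ in range(hidden_dim)]
hidden = hidden_layer(p_w_input, embeddings, weights)
print(len(p_l_input), len(p_w_input), len(hidden))
```

With q = 1 and r = 2, the label model reads six context symbols and the word model seven.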
The decoder builds the translation bottom-up, and the full context is not available for all symbols in the hypothesis. Vaswani et al. (2013) propose to use a special null word for unavailable context, its embedding being the weighted average of the input embeddings of all other words. We adopt this strategy, with the difference that we use separate null words for each position in the context window in order to reflect distributional differences between the different positions, e.g. between ancestor labels and sibling labels. Symbols are re-scored as more context becomes available in decoding, but poor approximations could affect pruning and thus lead to search errors. In Table 1, we illustrate the use of null words with a 5-gram and a bigram NNLM model. We observe a small increase in entropy when querying the 5-gram model with bigrams, compared to querying a bigram model directly.
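The position-specific null words can be sketched as follows. The weighting by how often each symbol occurs at a given context position is an assumption, since the text only states that the average is weighted; the function name `null_embedding`, the position names and the counts are hypothetical.

```python
from typing import Dict, List


def null_embedding(embeddings: Dict[str, List[float]],
                   counts: Dict[str, int]) -> List[float]:
    """Count-weighted average of the input embeddings (weighting scheme assumed)."""
    total = sum(counts.values())
    dim = len(next(iter(embeddings.values())))
    return [sum(counts[w] * embeddings[w][d] for w in counts) / total
            for d in range(dim)]


# Toy shared embeddings and hypothetical per-position symbol counts.
embeddings = {"subj": [1.0, 0.0], "det": [0.0, 1.0]}
position_counts = {
    "sibling_label": {"subj": 1, "det": 3},
    "ancestor_label": {"subj": 3, "det": 1},
}

# One null word per context position, so that each reflects its own distribution.
null_words = {position: null_embedding(embeddings, counts)
              for position, counts in position_counts.items()}
print(null_words)
```

Because the two positions see different symbol distributions, their null vectors differ even though the underlying embeddings are shared.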
Some hierarchical SMT systems allow glue rules which concatenate two subtrees. Since the resulting glue structures do not occur in the training data, we do not estimate their probability in our model. When encountering the root of a glue rule in our language model, we recursively evaluate its children, but ignore the glue node itself. This could introduce a bias towards using more glue rules during translation. To counter this, and encourage the production of linguistically plausible trees, we assign a fixed, high cost to glue rules. Glue rules thus play a small role in our systems, with about 100 glue rule applications per 3000 sentences, and could be abandoned entirely. 6

Optimizing Syntactic N-grams
N-gram based metrics such as BLEU (Papineni et al., 2002) are still predominantly used to optimize the log-linear parameters of SMT systems, and (to a lesser extent) to evaluate the final translation systems. However, n-gram metrics are not well suited to measure fluency phenomena with string-level gaps, and there is a danger that BLEU underestimates the modelling power of dependency language models, resulting in a suboptimal assignment of log-linear weights. As an alternative metric that operates on the level of syntactic n-grams, we use a variant of the head-word chain metric (HWCM) (Liu and Gildea, 2005).
HWCM is a precision metric similar to BLEU, but instead of counting n-gram matches between the translation output and the reference, it compares head-word chains, or syntactic n-grams. HWCM is not only suitable for our task because it operates on the same structures as the dependency language models, but also because our string-to-tree SMT architecture produces trees that can be evaluated directly, without requiring a separate parse of the translation output, a task for which few parsers are optimized. For extracting syntactic n-grams from the reference translations of the respective development and test sets, we automatically parse them, using the same preprocessing as for training.
We count syntactic n-grams of sizes 1 to 4, mirroring the typical usage of BLEU. Banerjee and Lavie (2005) have demonstrated the importance of recall in MT evaluation, and we compute the harmonic mean of precision and recall, which we denote HWCM_f, instead of the original, precision-based metric.
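A sentence-level sketch of such a metric is given below. It is not the implementation used in our experiments: the tree encoding, the function names `chains` and `hwcm_f`, the choice to average precision and recall over chain lengths before taking the harmonic mean, and the skipping of chain lengths for which a tree has no chains are all assumptions, and a real metric would aggregate counts over a whole test set.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Tree = Tuple[Sequence[str], Sequence[int]]  # (words, head index per word; -1 = root)


def chains(words: Sequence[str], heads: Sequence[int], n: int) -> Counter:
    """Count head-word chains of exactly n words, written from ancestor to descendant."""
    found: Counter = Counter()
    for i in range(len(words)):
        chain = [words[i]]
        node = heads[i]
        while node != -1 and len(chain) < n:
            chain.append(words[node])
            node = heads[node]
        if len(chain) == n:
            found[tuple(reversed(chain))] += 1
    return found


def hwcm_f(hyp: Tree, ref: Tree, max_n: int = 4) -> float:
    """Harmonic mean of chain precision and recall, each averaged over chain lengths."""
    precisions: List[float] = []
    recalls: List[float] = []
    for n in range(1, max_n + 1):
        hyp_chains = chains(hyp[0], hyp[1], n)
        ref_chains = chains(ref[0], ref[1], n)
        matched = sum((hyp_chains & ref_chains).values())  # clipped matches
        if hyp_chains:
            precisions.append(matched / sum(hyp_chains.values()))
        if ref_chains:
            recalls.append(matched / sum(ref_chains.values()))
    if not precisions or not recalls:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy trees: a wrong verb form breaks every chain that passes through the verb.
hyp = (["die", "Ergebnisse", "wird"], [1, 2, -1])
ref = (["die", "Ergebnisse", "werden"], [1, 2, -1])
print(hwcm_f(hyp, ref), hwcm_f(ref, ref))
```

In the toy pair, a single agreement error on the verb removes one unigram, one of two bigram chains and the only trigram chain, which illustrates why chains are sensitive to errors high in the tree.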

Evaluation
We perform three evaluations of our dependency language models. Our perplexity evaluation measures model perplexity on the 1-best output of a baseline SMT system and a human reference translation. Our SMT evaluation integrates the model as a feature function in a string-to-tree SMT system and evaluates its impact on translation quality. Finally, we quantify the effect of different language models on grammaticality by measuring the number of agreement errors of our SMT systems.
We refer to the unlabelled variant of our model (equation 2) as DLM, and to the labelled variant (equation 4) as RDLM, emphasizing that the latter is a relational dependency LM.

Data and Methods
We perform our experiments on English→German data from the WMT 2014 shared translation task (Bojar et al., 2014), consisting of about 4.5 million sentence pairs of parallel data and 120 million sentences of monolingual German data. We train all language models on the German side of the parallel text and the monolingual data. We also perform some experiments on the English→Russian data from the same translation task, with 2 million sentence pairs of parallel data and 34 million sentences of monolingual Russian data.
For a 5-gram Neural Network LM baseline (NNLM), and the dependency language models, we train feed-forward Neural Network language models with the NPLM toolkit. We use 150 dimensions for the input embeddings, and a single hidden layer with 750 dimensions. We use a vocabulary of 500 000 words (70 for the output vocabulary of P_l), from which we draw 100 noise samples for NCE (50 for P_l). We train for two epochs, each epoch being a full traversal of the training text. For unknown words, we back off to a special unk token for the sequence models and P_l, and to the pre-terminal symbol for the other dependency models. We report perplexity values with softmax normalization, but disable normalization during decoding, relying on the self-normalization of NCE for efficiency. For the translation experiments with DLM and RDLM, we set the sibling window size q to 1, and the ancestor window size r to 2. 7
We train baseline language models with interpolated modified Kneser-Ney smoothing with SRILM (Stolcke, 2002). The model in the SMT baseline uses the full vocabulary and a linear interpolation of component models for domain adaptation. For the perplexity evaluation, we use the same vocabulary and training data as for the Neural Network models.
7 On our test set, a node has an average of 4.6 ancestors (σ = 2.5), and 1.2 left siblings (σ = 1.3).
For the English→German SMT evaluation, our baseline system is a string-to-tree SMT system with Moses (Koehn et al., 2007), with dependency parsing of the German texts (Sennrich et al., 2013). It is described in more detail in Williams et al. (2014). This setup was ranked 1-2 (out of 18) in the WMT 2014 shared translation task and is state of the art. Our biggest deviation from this setup is that we do not enforce the morphological agreement constraints that are provided by a unification grammar (Williams and Koehn, 2011), but use them for analysis instead. For English→Russian, we copy the language-independent settings from the English→German set-up, and perform dependency parsing with a Russian model for the Maltparser (Nivre et al., 2006; Sharoff and Nivre, 2011), applying projectivization after parsing.
We tune our system on a development set of 2000 sentences with k-best batch MIRA (Cherry and Foster, 2012) on BLEU and a linear interpolation of BLEU and HWCM_f, and report both scores for evaluation. We also report METEOR (Denkowski and Lavie, 2011) for German and TER (Snover et al., 2006). We control for optimizer instability by running the optimization three times per system and performing significance testing with Multeval (Clark et al., 2011), which we enhanced to also perform significance testing for HWCM_f.

Shen et al. (2010)
We reimplement the model by Shen et al. (2010) for our evaluation. The authors did not specify training and smoothing of their model, so we only adopt their definition of the context window, and use the same neural network architecture as for our other models. Specifically, we use two neural networks: one for left dependents, and one for right dependents. We use maximum-likelihood estimation for the head of root nodes, ignoring unseen events. To distinguish between parents and siblings in the context window, we double the input vocabulary and mark parents with a suffix. Like Shen et al. (2010), we ignore the [...] We consider scalability to a larger ancestor context a real concern, since another duplication of the vocabulary may be necessary for each ancestor level.

Perplexity
There are a number of factors that make a direct comparison of the reference set perplexity unfair. Mainly, the unlabelled dependency model DLM and the one by Shen et al. (2010) assume that the dependency topology is given; P_w even assumes this for the dependency labels D. Conversely, the full RDLM predicts the terminal sequence, the dependency labels, and the dependency topology, and we thus expect it to have a higher perplexity. 8 Also note that we compare 5-gram n-gram models to 3- and 4-gram dependency models. A more minor difference is that n-gram models also predict end-of-sentence tokens, which the dependency models do not. Rather than directly comparing perplexity between different models, our focus lies on a perplexity comparison between a human reference translation and the 1-best SMT output of a baseline translation system. Our basic assumption is that the difference in perplexity (or cross-entropy) tells us whether a model contains information that is not already part of the baseline model, and if incorporating it into our SMT system can nudge the system towards producing a translation that is more similar to the reference.
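The comparison can be illustrated with a few lines of Python. The per-symbol log-probabilities below are invented for the example, and the function names `cross_entropy` and `perplexity` are hypothetical; the sketch only shows how the gap between the two texts would be read.

```python
import math
from typing import Sequence


def cross_entropy(log_probs: Sequence[float]) -> float:
    """Average negative natural-log probability per predicted symbol."""
    return -sum(log_probs) / len(log_probs)


def perplexity(log_probs: Sequence[float]) -> float:
    """Exponentiated cross-entropy."""
    return math.exp(cross_entropy(log_probs))


# Hypothetical scores that one language model assigns to two texts.
reference = [math.log(0.25)] * 4
smt_output = [math.log(0.125)] * 4

# A positive gap means the model finds the reference more likely than the
# SMT output, i.e. it carries a signal the baseline system is not yet using.
gap = cross_entropy(smt_output) - cross_entropy(reference)
print(perplexity(reference), perplexity(smt_output), gap)
```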
Results for English→German are shown in Table 2. The baseline 5-gram language model with Kneser-Ney smoothing prefers the SMT output over the reference translation, which is natural given that this language model is part of the system producing the SMT output. The 5-gram NNLM improves over the Kneser-Ney models, and happens to assign almost the same perplexity score to both texts. This still means that it is less biased towards the SMT output than the baseline model, and can be a valuable addition to the model.
The dependency language models all show a preference for the reference translation, with DLM having a stronger preference than the model by Shen et al. (2010), and RDLM having the strongest preference. The direct comparison of DLM and P_w, which is the component of RDLM that predicts the terminal symbols, shows that dependency labels serve as a strong signal for predicting the terminals, confirming our initial hypothesis. The prediction of the dependency topology and labels through P_l means that the full RDLM has the highest perplexity of all models. However, it also strongly prefers the human reference text over the baseline SMT output.

Translation Quality
Translation results for English→German with different language models added to our baseline are shown in Table 3. Considering the systems tuned on BLEU, we observe that the 5-gram NNLM and RDLM are best in terms of BLEU and TER, but that RDLM is the only winner 9 according to HWCM_f and METEOR. In particular, we observe a sizable gap of 0.6 HWCM_f points between the NNLM and the RDLM systems, despite similar BLEU scores. The unlabelled DLM and the dependency LM by Shen et al. (2010), which are generally weaker than RDLM, also tend to improve HWCM_f more than BLEU. This reflects the fact that the dependency LMs improve fluency along the syntactic n-grams that HWCM measures, whereas NNLM only improves local fluency, to which BLEU is most sensitive. The fact that the models cover different phenomena is also reflected in the fact that we see further gains from combining the 5-gram NNLM with the strongest dependency LM, RDLM, for a total improvement of 0.9-1.1 BLEU over the baseline.
If we use BLEU+HWCM_f as our tuning objective, the difference between the models increases. Compared to the 5-gram NNLM, the RDLM system gains 0.8-0.9 points in HWCM_f and 0.3-0.5 points in BLEU. Compared to the original baseline, tuned only on BLEU, the system with RDLM that is tuned on BLEU+HWCM_f yields an improvement of 1.1-1.3 BLEU and 1.3-1.4 HWCM_f.
If we compare the same system being trained on both tuning objectives, we observe that tuning on BLEU+HWCM_f, unsurprisingly, yields higher HWCM_f scores than tuning on BLEU only. What is more surprising is that adding HWCM_f as a tuning objective also yields significantly higher BLEU on the test sets for 9 out of 10 data points. The gap is larger for the two systems with RDLM (0.3-0.6 BLEU) than for the baseline or the NNLM system (0.1-0.2 BLEU). We hypothesize that the inclusion of HWCM_f as a tuning metric reduces overfitting and encourages the production of more grammatically well-formed constructions, which we expect to be a robust objective across different texts, especially when coupled with a strong dependency language model such as RDLM.
Some example translations are shown in Table 4. They illustrate three error types in the baseline system:
1. an error in subject-verb agreement.
2. a subcategorisation error: gelten is a valid translation of the intransitive meaning of apply, but cannot be used for transitive constructions, where anwenden is correct.
3. a collocation error: two separate collocations are conflated in the baseline translation.
All errors are due to inter-dependencies in the sentence that have string-level gaps, but which can be modelled through syntactic n-grams, and are corrected by the system with RDLM and tuning on BLEU+HWCM_f.

1  source     also the user manages his identity and can therefore be anonymous.
   baseline   auch der Benutzer verwaltet seine Identität und können daher anonym sein.
   best       auch der Benutzer verwaltet seine Identität und kann daher anonym sein.
   reference  darüber hinaus verwaltet der Inhaber seine Identität und kann somit anonym bleiben.
2  source     how do you apply this definition to their daily life and social networks?
   baseline   wie kann man diese Definition für ihr tägliches Leben und soziale Netzwerke gelten?
   best       wie kann man diese Definition auf ihren Alltag und sozialen Netzwerken anwenden?
   reference  wie wird diese Definition auf seinen Alltag und die sozialen Netzwerke angewendet?
3  source     the City Council must reach a decision on this in December.
   baseline   Der Stadtrat muss im Dezember eine Entscheidung darüber erzielen.
   best       Im Dezember muss der Stadtrat eine Entscheidung darüber treffen.
   reference  Im Dezember muss dann noch die Stadtverordnetenversammlung entscheiden.
Table 4: SMT output of baseline system and best system (RDLM tuned on BLEU+HWCM_f).

We evaluate a subset of the systems on an English→Russian task to test whether the improvements from adding RDLM and tuning on BLEU+HWCM_f apply to other language pairs. Results are shown in Table 5. The system with RDLM is the consistent winner, and significantly outperforms the baseline for all metrics and test sets. Tuning on BLEU+HWCM_f results in further improvements in HWCM_f and TER. Looking at the combined effect of adding RDLM and changing the tuning objective, we observe gains in BLEU by 0.5-0.9 points, and gains in HWCM_f by 2.1-3.4 points.

Table 5: Translation quality of English→Russian string-to-tree SMT system with DLM and RDLM, with k-best batch MIRA optimization on BLEU and BLEU+HWCM_f. Average of 3 optimization runs. bold: no other system in same block is significantly better (p < 0.05); *: significantly better than same model with other MIRA objective (p < 0.05). Higher scores are better for BLEU and HWCM_f; lower scores are better for TER.

Morphological Agreement
We argue that the dependency language models and HWCM_f as a tuning metric improve grammaticality, and we are able to quantify one aspect thereof, morphological agreement, for English→German. Williams and Koehn (2011) introduce a unification grammar with hand-crafted agreement constraints to identify and suppress selected morphological agreement violations in German, namely with regard to noun phrase agreement, prepositional phrase agreement, and subject-verb agreement. We can use their grammar to analyse the effect of different models on morphological agreement by counting the number of translations that violate at least one agreement constraint. We assume that the number of false positives (i.e. correct analyses that trigger an agreement violation) remains roughly constant throughout all systems, so that a reduction in the number of agreement violations is an indicator of better grammatical agreement. Table 6 shows the results. While the 5-gram NNLM reduces the number of agreement errors somewhat compared to the baseline (-18%), the reduction is greater for DLM (-34%) and RDLM (-46%). Neither the baseline nor the 5-gram NNLM profits strongly from tuning on HWCM_f, while the number of agreement errors is further reduced for the system with DLM (-41%) and RDLM (-54%). Adding the 5-gram NNLM to the RDLM system yields no further reduction in the number of agreement errors.
Enforcing the agreement constraints on the baseline system (tuned on BLEU+HWCM_f) provides us with a gain of 0.3 in both BLEU and HWCM_f; on the RDLM system, only 0.03. The fact that the benefit of enforcing the agreement constraints drops off more sharply than the number of constraint violations indicates that the remaining violations tend to be harder for the model to correct, e.g. because the translation model has not learned to produce the required inflection of a word, or because some of the remaining violations are false positives. While the dependency language models' effect of improving morphological agreement is not (fully) cumulative with the benefit from enforcing the unification constraints formulated by Williams and Koehn (2011), our model has the advantage of being language-independent, learning from the data itself rather than relying on manually developed, grammar-specific constraints, and covering a wider range of phenomena such as subcategorisation and syntactic collocations.
The results confirm that the RDLM is more effective at reducing morphological agreement errors than a similarly trained n-gram NNLM and the unlabelled DLM, and that adding HWCM_f to the tuning objective is beneficial. On a meta-evaluation level, we compare the rank correlation between the automatic metrics and the number of agreement errors with Kendall's τ correlation, and observe that the number of agreement errors is more strongly (negatively) correlated with HWCM_f (τ = −0.92) than with BLEU (τ = −0.77), METEOR (τ = −0.54) or TER (τ = 0.69). This supports our theoretical expectation that HWCM_f is more sensitive to morphological agreement, which is enforced along syntactic n-grams, than n-gram metrics such as BLEU, or the unigram metric METEOR.
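For readers unfamiliar with the statistic, a rank correlation of this kind can be computed as sketched below. The numbers are invented and are not the values from our tables, and the tie-free tau-a variant and the function name `kendall_tau` are assumptions, since the exact variant is not stated here.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / pairs


# Hypothetical systems: agreement errors per system and a metric score per system.
agreement_errors = [500, 400, 300, 250]
metric_scores = [20.0, 20.5, 20.4, 21.0]
print(kendall_tau(agreement_errors, metric_scores))
```

A value close to −1 would mean that systems with fewer agreement errors are almost always ranked higher by the metric.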

Related Work
While a wide range of dependency language models has been proposed (e.g. Chelba et al., 1997; Quirk et al., 2004; Shen et al., 2010; Zhang, 2009; Popel and Marecek, 2010), there are vast differences in modelling assumptions. Our work is most similar to the dependency language model described in Shen et al. (2010), or the h-gram model proposed by Zhang (2009), both of which have been used for SMT. We make different modelling assumptions, relying less on configurational information, but including the prediction of dependency labels in the model. We argue that our relational modelling assumptions are more suitable for languages with a relatively free word order such as German.
To a lesser extent, our work is similar to other parsing models that have been used for language modelling, such as lexicalized PCFGs (Charniak, 2001;Collins, 2003;Charniak et al., 2003), or structured language models (Chelba and Jelinek, 2000); previous efforts to include them in the translation process failed to improve translation performance (Och et al., 2004;Post and Gildea, 2008). Differences in our work that could explain why we see improvements include the use of Neural Networks for training the model on the automatically parsed training text, instead of re-using existing parser models, which could be seen as a form of self-training (McClosky et al., 2006), and the integration of the language model into the decoder instead of n-best reranking. Also, there are major differences in the parsing models themselves. For instance, note that the structured LM by Chelba and Jelinek (2000) uses a binary branching structure, and that complex label sets would be required to encode subcategorisation frames in binary trees (Hockenmaier and Steedman, 2002).
Our neural network is a standard feed-forward neural network as introduced by Bengio et al. (2003). Recently, recursive neural networks have been proposed for syntactic parsing (Socher et al., 2010;Le and Zuidema, 2014). The recursive nature of such models allows for the encoding of more context; for an efficient integration into the dynamic programming search of SMT decoding, we deem our model, which makes stronger Markov assumptions, more suitable.
While BLEU has been the standard objective function for tuning the log-linear parameters in SMT systems, recent work has investigated alternative objective functions. Some authors concluded that none of the tested alternatives could consistently outperform BLEU (Cer et al., 2010;Callison-Burch et al., 2011). Liu et al. (2011) report that tuning on the TESLA metric gives better results than tuning on BLEU; Lo et al. (2013) do the same for MEANT.
There is related work on improving morphological agreement and subcategorisation through post-editing (Rosa et al., 2012) or independent models for inflection generation (Toutanova et al., 2008;Weller et al., 2013). The latter models initially produce a stemmed translation, then predict the inflection through feature-rich sequence models. Such a pipeline of prediction steps is less powerful than our joint prediction of stems and inflection. For instance, in example 2 in Table 4, our model chooses a different stem to match the subcategorisation frame of the translation; it is not possible to fix the baseline translation with inflection changes alone.

Conclusion
The main contribution of this paper is the description of a relational dependency language model. We show that it is a valuable asset to a state-of-the-art SMT system by comparing perplexity values with other types of language models, and by its integration into decoding, which results in improvements according to automatic MT metrics and reduces the number of agreement errors. We show that the disfluencies that our model captures are qualitatively different from those captured by an n-gram Neural Network language model, with our model being more effective at modelling fluency phenomena along syntactic n-grams.
A second important contribution is the optimization of the log-linear parameters of an SMT system based on syntactic n-grams. We are, to our knowledge, the first to tune an SMT system on a non-shallow syntactic similarity metric. Apart from showing improvements by tuning on HWCM f , our results also shed light on the interaction between models and tuning metrics. With n-gram language models, the choice of tuning metric only had a small effect on the English→German translation results. Only with dependency language models, which are able to model the syntactic n-grams that HWCM scores, did we see large improvements from adding HWCM f to the objective function. On the one hand, this has implications when evaluating new model components: using an objective function that cannot capture the impact of a model component can result in false negatives because the model component will not receive an appropriate weight, and the model may thus seem to be of little use, even in a human evaluation. On the other hand, it is an important finding for the evaluation of objective functions: the performance of an objective function is tied to the power of the underlying model. Without a model that is able to model syntactic n-grams, we might have concluded that HWCM is of little help as an objective function. Now, we hypothesize that HWCM is well-suited to optimize dependency language models because both operate on syntactic n-grams, just as BLEU and n-gram models are natural counterparts.
The approach we present is language-independent, and we evaluated it on SMT into German and Russian. While we have no empirical data on the model's effectiveness for other target languages, we suspect that syntactic n-grams are especially suited for modelling and evaluating translations into languages with inter-dependencies between distant words and relatively free word order, such as German, Czech, or Russian.
In this work, we relied on parse hypotheses being provided by a string-to-tree SMT decoder, but other settings are conceivable for future work, such as performing n-best string reranking by coupling the relational dependency LM with a monolingual parsing algorithm. Another obvious extension of the relational dependency LM is the inclusion of more context, for instance through larger windows for siblings and ancestors, or source context as in Devlin et al. (2014). Also, we believe that the model can benefit from further advances in neural network modelling, for instance recent findings that ensembles of networks outperform a single network (Mikolov et al., 2011;Devlin et al., 2014).