Locally Non-Linear Learning for Statistical Machine Translation via Discretization and Structured Regularization

Linear models, which support efficient learning and inference, are the workhorses of statistical machine translation; however, linear decision rules are less attractive from a modeling perspective. In this work, we introduce a technique for learning arbitrary, rule-local, non-linear feature transforms that improve model expressivity, but do not sacrifice the efficient inference and learning associated with linear models. To demonstrate the value of our technique, we discard the customary log transform of lexical probabilities and drop the phrasal translation probability in favor of raw counts. We observe that our algorithm learns a variation of a log transform that leads to better translation quality compared to the explicit log transform. We conclude that non-linear responses play an important role in SMT, an observation that we hope will inform the efforts of feature engineers.


Introduction
Linear models using log-transformed probabilities as features have emerged as the dominant model in MT systems. This practice can be traced back to the IBM noisy channel models (Brown et al., 1993), which decompose decoding into the product of a translation model (TM) and a language model (LM), motivated by Bayes' Rule. When Och and Ney (2002) introduced a log-linear model for translation (a linear sum of log-space features), they noted that the noisy channel model was a special case of their model using log probabilities. This same formulation persisted even after the introduction of MERT (Och, 2003), which optimizes a linear model; again, using two log probability features (TM and LM) with equal weight recovered the noisy channel model. Yet systems now use many more features, some of which are not even probabilities. We no longer believe that equal weights between the TM and LM provide optimal translation quality; the probabilities in the TM obey neither the chain rule nor Bayes' rule, nullifying several theoretical justifications for multiplying probabilities. The story of multiplying probabilities may just amount to heavily penalizing small values.
The community has abandoned the original motivations for a linear interpolation of two log-transformed features. Is there empirical evidence that we should continue using this particular transformation? Do we have any reason to believe it is better than other non-linear transformations? To answer these questions, we explore the issue of non-linearity in models for MT. In the process, we will discuss the impact of linearity on feature engineering and develop a general mechanism for learning a class of non-linear transformations of real-valued features.
Applying a non-linear transformation such as log to features is one way of achieving a non-linear response function, even if those features are aggregated in a linear model. Alternatively, we could achieve a non-linear response using a natively non-linear model such as an SVM (Wang et al., 2007) or RankBoost (Sokolov et al., 2012). However, MT is a structured prediction problem, in which a full hypothesis is composed of partial hypotheses. MT decoders take advantage of the fact that the model score decomposes as a linear sum over both local features and partial hypotheses to efficiently perform inference in these structured spaces (§2). Currently, there are no scalable solutions to integrating the hypothesis-level non-linear feature transforms typically associated with kernel methods while still maintaining polynomial time search. Another alternative is incorporating a recurrent neural network (Schwenk, 2012; Auli et al., 2013; Kalchbrenner and Blunsom, 2013) or an additive neural network (Liu et al., 2013a). While these models have shown promise as methods of augmenting existing models, they have not yet offered a path for replacing or transforming existing real-valued features.
In this article, we discuss background (§2); describe local discretization, our approach to learning non-linear transformations of individual features, and compare it with globally non-linear models (§3); present our experimental setup (§5); empirically verify the importance of non-linear feature transformations in MT and demonstrate that discretization can be used to recover non-linear transformations (§6); discuss related work (§7); and conclude (§8).

Feature Locality & Structured Hypotheses
Decoding a given source sentence f can be expressed as search over target hypotheses e, each with an associated complete derivation D. To find the best-scoring hypothesis ê(f), a linear model applies a set of weights w to a complete hypothesis' feature vector H:

$$\hat{e}(f) = \arg\max_{e, D} \sum_{i} w_i H_i(f, e, D) \tag{1}$$

However, this hides many of the realities of performing inference in modern decoders. Traditional inference would be intractable if every feature were allowed access to the entire derivation D and its associated target hypothesis e. Decoders take advantage of the fact that features decompose over partial derivations d. For a complete derivation D, the global features H(D) are an efficient summation over local features h(d):

$$H_i(D) = \sum_{d \in D} h_i(d) \tag{2}$$

This contrasts with non-local features such as the language model (LM), which cannot be exactly calculated given an arbitrary partial hypothesis, which may lack both left and right context. Such features require special handling including future cost estimation. In this study, we limit ourselves to local features, leaving the traditional non-local LM feature unchanged. In general, feature locality is relative to a particular structured hypothesis space, and is unrelated to the structured features described in Section 4.2.
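The decomposition of global features into local ones can be illustrated with a minimal Python sketch. The feature names, weights, and values below are hypothetical and chosen only for illustration; the point is that scoring the summed vector H(D) gives the same result as summing the scores of the local vectors h(d), which is what lets a decoder score partial derivations incrementally.

```python
from typing import Dict, List

FeatureVector = Dict[str, float]


def dot(w: FeatureVector, h: FeatureVector) -> float:
    """Sparse dot product between a weight vector and a feature vector."""
    return sum(w.get(name, 0.0) * value for name, value in h.items())


def global_features(derivation: List[FeatureVector]) -> FeatureVector:
    """H(D): sum the local feature vectors h(d) of every partial derivation d."""
    total: FeatureVector = {}
    for h in derivation:
        for name, value in h.items():
            total[name] = total.get(name, 0.0) + value
    return total


def score_global(w: FeatureVector, derivation: List[FeatureVector]) -> float:
    """Score the complete derivation from its global feature vector."""
    return dot(w, global_features(derivation))


def score_local(w: FeatureVector, derivation: List[FeatureVector]) -> float:
    """Score the derivation incrementally, one partial derivation at a time."""
    return sum(dot(w, h) for h in derivation)


# Hypothetical weights and a derivation built from three rule applications.
w = {"log_p_lex": 0.5, "word_count": -0.25}
D = [
    {"log_p_lex": -1.0, "word_count": 2.0},
    {"log_p_lex": -0.5, "word_count": 1.0},
    {"log_p_lex": -2.0, "word_count": 3.0},
]
assert abs(score_global(w, D) - score_local(w, D)) < 1e-12
```

A non-local feature such as the LM would not fit this pattern, because its value for one partial derivation depends on context supplied by others.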

Feature Non-Linearity and Separability
Unlike models that rely primarily on a large number of sparse indicator features, state-of-the-art machine translation systems rely heavily on a small number of dense real-valued features. However, unlike indicator features, real-valued features may benefit from non-linear transformations to allow a linear model to better fit the data.
Decoders use a linear model to rank hypotheses, selecting the highest-ranked derivation. Since the absolute score of the model is irrelevant, non-linear responses are useful only in cases where they elicit novel rankings. In this section, we will discuss these cases in terms of separability. Here, we are separating the correctly ranked pairs of hypotheses from the incorrect in the implicit pairwise rankings defined by the total ordering on hypotheses provided by our model.
When the local feature vectors h of each oracle-best hypothesis (or hypotheses) are distinct from those of all other competing hypotheses, we say that the inputs are oracle separable given the feature set. If there exists a weight vector that distinguishes the oracle-best ranking from all other rankings under a linear model, we say that the inputs are linearly separable given the feature set. If the inputs are oracle separable but not linearly separable, we say that there are non-linearities that are unexplained by the feature set. For example, this can happen if a feature is positively related to quality in some regions but negatively related in other regions.
As we add more sentences to our corpus, separability becomes increasingly difficult. For a given corpus, if all hypotheses are oracle separable, we can always produce the oracle translation, assuming an optimal (and potentially very complex) model and weight vector. If our hypothesis space also contains all reference translations, we can always recover the reference. In practice, both of these conditions are typically violated to a certain degree. However, if we modify our feature set such that some lower-ranked higher-quality hypothesis can be separated from all higher-ranked lower-quality hypotheses, then we can improve translation quality. For this reason, we believe that separability remains an informative tool for thinking about modeling in MT.
Currently, non-linearities in novel real-valued features are typically addressed via manual feature engineering involving a good deal of trial and error (Gimpel and Smith, 2009) or by manually discretizing features (e.g. indicator features for count=N). We will explore one technique for automatically avoiding non-linearities in Section 3.

Learning with Large Feature Sets
While MERT has proven to be a strong baseline, it does not scale to larger feature sets, owing to both inefficiency and overfitting. While MIRA (Chiang et al., 2008), Rampion (Gimpel and Smith, 2012), and HOLS (Flanigan et al., 2013) have been shown to be effective over larger feature sets, they are difficult to explicitly regularize; this will become important in Section 4.2. Therefore, we use the PRO optimizer (Hopkins and May, 2011) as our baseline learner since it has been shown to perform comparably to MERT for a small number of features, and to significantly outperform MERT for a large number of features (Hopkins and May, 2011; Ganitkevitch et al., 2012). Other very recent MT optimizers such as the linear structured SVM (Cherry and Foster, 2012), AROW (Chiang, 2012) and regularized MERT are also compatible with the discretization and structured regularization techniques described in this article.

Discretization and Feature Induction
In this section, we propose a feature induction technique based on discretization that produces a feature set that is less prone to non-linearities (see §2.2).
We define feature induction as a function Φ(y) that takes the result of the feature function y = h(x) ∈ R and returns a tuple ⟨y′, j⟩, where y′ ∈ R is a transformed feature value and j is the transformed feature index. (One could also imagine a feature transformation function Φ that returns a vector of bins for a single value returned by a feature function h, or a transformation that has access to values from multiple feature functions at once.) Building on equation 2, we can apply feature induction as follows:

$$\hat{e}(f) = \arg\max_{e, D} \sum_{d \in D} \sum_{i} w'_{j}\, y' \quad \text{where } \langle y', j \rangle = \Phi_i(h_i(d)) \tag{3}$$

At first glance, one might be tempted to simply choose some non-linear function for Φ (e.g. log(x), exp(x), sin(x), x^n). However, even if we were to restrict ourselves to some "standard" set of non-linear functions, many of these functions have hyperparameters that are not directly tunable by conventional optimizers (e.g. period and amplitude for sin, n in x^n). Discretization allows us to avoid many non-linearities (§2.2) while preserving the fast inference provided by feature locality (§2.1). We first discretize real-valued features into a set of indicator features and then use a conventional optimizer to learn a weight for each indicator feature (Figure 1). This technique is sometimes referred to as binning and is closely related to quantization. Effectively, discretization allows us to re-shape a feature function (Figure 2). In fact, given an infinite number of bins, we can perform any non-linear transformation of the original function.
[Figure 2 (caption partially lost). Left: since we may only adjust w_0, these "bins" will be rigidly fixed along the feature function's value. Right: After discretizing the feature into 4 bins, we may now adjust 4 weights independently, to achieve a non-linear re-shaping of the function.]
For indicator discretization, we define Φ_i in terms of a binning function BIN_i : R → N:

$$\Phi_i(x) = \langle 1,\; i \circ \mathrm{BIN}_i(x) \rangle \tag{4}$$

where the ∘ operator indicates concatenation of a feature identifier with a bin identifier to form a new, unique feature identifier.
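Indicator discretization can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the implementation used in the experiments: the bin edges, the feature name `p_lex`, the `_bin` naming scheme, and the weights are all hypothetical, and a value equal to an edge is assigned to the upper bin.

```python
from bisect import bisect_right
from typing import Callable, List, Tuple


def make_bin_function(edges: List[float]) -> Callable[[float], int]:
    """Build BIN(x) from sorted inner bin edges (len(edges) + 1 bins in total)."""
    def bin_index(x: float) -> int:
        # bisect_right counts the edges that are <= x, which is the bin index.
        return bisect_right(edges, x)
    return bin_index


def discretize(feature_name: str, x: float,
               bin_index: Callable[[float], int]) -> Tuple[float, str]:
    """Return (1, new feature id): the feature id concatenated with the bin id."""
    return 1.0, f"{feature_name}_bin{bin_index(x)}"


# Hypothetical edges for a probability feature, with one learned weight per bin.
bin_p = make_bin_function([0.01, 0.1, 0.5])
weights = {"p_lex_bin0": -2.0, "p_lex_bin1": -1.0,
           "p_lex_bin2": -0.3, "p_lex_bin3": 0.0}


def response(p: float) -> float:
    """Model response to the raw value p after discretization."""
    value, name = discretize("p_lex", p, bin_p)
    return weights[name] * value
```

Because each bin carries its own weight, the response to the raw value is a step function whose shape is learned rather than fixed in advance.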

Local Discretization
Unlike other approaches to non-linear learning in MT, we perform non-linear transformation on partial hypotheses as in equation 3, where discretization is applied as Φ_i(h_i(d)), which allows locally non-linear transformations, instead of applying Φ to complete hypotheses as in Φ_i(H_i(D)), which would allow globally non-linear transformations. This enables our transformed model to produce non-linear responses with regard to the initial feature set H while inference remains linear with regard to the optimized parameters w′. Importantly, our transformed feature set requires no additional non-local information for inference. By performing transformations within a local context, we effectively reinterpret the feature set. For example, consider the familiar target word count feature applied to el gato come furtivamente (figure not reproduced).
In terms of predictive power, this transformation can provide the learned model with increased ability to discriminate between hypotheses. This is primarily a result of moving to a higher-dimensional feature space. As we introduce new parameters, we expect that some hypotheses that were previously indistinguishable under H become separable under H′ (§2.2). We show specific examples comparing linear, locally non-linear, and globally non-linear models in Figures 4-6. As seen in these examples, locally non-linear models (Eq. 3, 4) are neither an approximation nor a subset of globally non-linear models, but rather a different class of models.
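The difference between the two model classes can be made concrete with a small Python sketch built on a target word count feature. The derivations and the one-bin-per-integer binning are hypothetical simplifications chosen for illustration. Two derivations whose rule-local word counts differ but sum to the same total are indistinguishable to a linear model and to a globally discretized model, yet receive different feature vectors under local discretization.

```python
from collections import Counter
from typing import List


def local_discretize(rule_word_counts: List[int]) -> Counter:
    """Apply the discretizer to each partial derivation (one indicator per rule)."""
    return Counter(f"wc={count}" for count in rule_word_counts)


def global_discretize(rule_word_counts: List[int]) -> Counter:
    """Apply the discretizer once, to the summed hypothesis-level value."""
    return Counter([f"wc={sum(rule_word_counts)}"])


# Two hypothetical derivations: rule-local target word counts per rule.
d_a = [1, 3]
d_b = [2, 2]

# Same global value, so a linear or globally discretized model cannot tell them apart.
same_total = sum(d_a) == sum(d_b)
same_global = global_discretize(d_a) == global_discretize(d_b)

# Local discretization yields different indicator features for each derivation.
differ_locally = local_discretize(d_a) != local_discretize(d_b)
```

The converse also holds in this toy setting: derivations `[1, 1]` and `[1, 1, 1]` differ only in how often the same local indicator fires, so a locally discretized model responds linearly to that difference, while a globally discretized model may assign the two totals unrelated weights.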

Binning Algorithm
To initialize the learning procedure, we construct the binning function BIN used by the indicator discretizer Φ. We have two desiderata: (1) any monotonic transformation of a feature should not affect the induced binning since we should not require feature engineers to determine the optimal feature transformation and (2) no bin's data should be so sparse that the optimizer cannot reliably estimate a weight for each bin. Therefore, we construct bins that are (i) populated uniformly subject to (ii) each bin containing no more than one feature value. We call this approach uniform population feature binning. While one could consider the predictive power of the features when determining bin boundaries, this would suggest that we should jointly optimize and determine bin boundaries, which is beyond the scope of this work. This problem has recently been considered for NLP by Suzuki and Nagata (2013) and for MT by Liu et al. (2013b), though the latter involves decoding the entire training data.
[Figures 4-6 compare three models side by side: Linear, Globally Non-Linear, and Locally Non-Linear.]
Figure 4: An example showing a collinearity over multiple input sentences S_3, S_4 in which the oracle-best hypothesis is "trapped" along a line with other lower quality hypotheses in the linear model's output space. Ranking shows how the hypotheses would appear in a k-best list with each partial derivation having its partial feature vector h under it; the complete feature vector H is shown to the right of each hypothesis and the oracle-best hypothesis is notated with a *. Pairs explicates the implicit pairwise rankings. Pairwise Ranking graphs those pairs in order to visualize whether or not the hypotheses are separable. (⊕ indicates that the pair of hypotheses is ranked correctly according to the extrinsic metric and ⊖ indicates the pair is ranked incorrectly.) In the pairwise ranking row, some ⊕ and ⊖ points are annotated with their positions along the third axis H_3 (omitted for clarity). Collinearity can also occur with a single input having at least 3 hypotheses.
Let X be the list of feature values to bin, where i indexes feature values x_i ∈ X and their associated frequencies f_i. We want each bin to have a uniform size u. For the sake of simplifying our final algorithm, we first create adjusted frequencies f′_i so that very frequent feature values will not occupy more than 100% of a bin. This adjustment iterates over k and returns u = u^k once f_i^k < u^k for all i. Next, we solve for a binning B of N bins, where b_j is the population of each bin, using Algorithm 1. In our experiments, we construct a translation model for each sentence in our tuning corpus; we then add one feature value instance to X for each rule instance.
Algorithm 1: POPULATEBINSUNIFORMLY(X, N). [Only fragments of the pseudocode survive: a definition C for the remaining frequency mass within the ideal bound, and a step that handles a value straddling ideal boundaries by minimizing its violation of the ideal, beginning "if i ≤ R(j) and f_i − C(j)".]
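The idea behind uniform population binning can be sketched in Python as follows. This is not a reconstruction of Algorithm 1: it is a simplified greedy variant written under explicit assumptions. In particular, instead of pre-computing adjusted frequencies, it recomputes the target population as the remaining mass divided by the remaining bins whenever a bin is closed, and it places a value that straddles an ideal boundary on whichever side leaves the current bin closer to its target. The function name and the example data are hypothetical.

```python
from collections import Counter
from math import log
from typing import List, Sequence


def populate_bins_uniformly(values: Sequence[float], n_bins: int) -> List[List[float]]:
    """Group distinct feature values into at most n_bins bins of similar population."""
    counts = sorted(Counter(values).items())  # (value, frequency), ascending by value
    remaining = sum(freq for _, freq in counts)
    bins: List[List[float]] = []
    i = 0
    for bins_left in range(n_bins, 0, -1):
        if i >= len(counts):
            break  # ran out of distinct values before using every bin
        target = remaining / bins_left  # ideal population for this bin
        current: List[float] = []
        mass = 0
        while i < len(counts):
            value, freq = counts[i]
            # Close the bin if taking this value would move it further from the target.
            overshoots = abs(mass + freq - target) > abs(mass - target)
            if current and bins_left > 1 and overshoots:
                break
            current.append(value)
            mass += freq
            i += 1
        bins.append(current)
        remaining -= mass
    return bins


# Hypothetical feature value instances (one per rule instance).
probs = [0.1] * 4 + [0.2] * 2 + [0.3] * 2 + [0.4, 0.5, 0.9, 0.9]
bins = populate_bins_uniformly(probs, 3)

# Only the order and frequency of values matter, so a monotonic transform
# such as log yields bins with the same membership structure.
log_bins = populate_bins_uniformly([log(p) for p in probs], 3)
```

Because the procedure depends only on the sorted order of the values and their frequencies, it satisfies the first desideratum above: a monotonically increasing transformation of the feature leaves the induced grouping unchanged.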

Structured Regularization
Unfortunately, choosing the right number of bins can have important effects on the model, including:
Fidelity. If we choose too few bins, we risk degrading the model's performance by discarding important distinctions encoded in fine differences between the feature values. In the extreme, we could reduce a real-valued feature to a single indicator feature.
Sparsity. If we choose too many bins, we risk making each indicator feature too sparse, which is likely to result in the optimizer overfitting such that we generalize poorly to unseen data.
While one may be tempted to simply throw more data or millions of sparse features at the problem, we elect to more strategically use existing data, since (1) large in-domain tuning data is not always available, and (2) when it is available, it can add considerable computational expense. In this section, we explore methods for mitigating data sparsity by embedding more knowledge into the learning procedure.

Overlapping Bins
One very simplistic way we could combat sparsity is to extend the edges of each bin such that they cover their neighbors' values (see Equation 4). This way, each bin will have more data points to estimate its weight, reducing data sparsity, and the bins will mutually constrain each other, reducing the ability to overfit. We include this technique as a contrastive baseline for structured regularization.
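A minimal Python sketch of overlapping bins follows. The exact amount of overlap is an assumption: here each value activates its own bin plus one neighbor on each side (the hypothetical `width` parameter), clipped at the ends of the bin range. The feature naming scheme is likewise illustrative.

```python
from typing import List


def overlapping_indicators(feature_name: str, bin_idx: int,
                           n_bins: int, width: int = 1) -> List[str]:
    """Return the indicator features fired by a value that falls in bin_idx.

    The value fires its own bin and up to `width` neighbors on each side,
    so adjacent bins share data points and constrain each other's weights.
    """
    lo = max(0, bin_idx - width)
    hi = min(n_bins - 1, bin_idx + width)
    return [f"{feature_name}_bin{j}" for j in range(lo, hi + 1)]


# A value in an interior bin fires three indicators; an edge bin fires two.
interior = overlapping_indicators("p_lex", 2, n_bins=4)
edge = overlapping_indicators("p_lex", 0, n_bins=4)
```

With `width=0` the function reduces to ordinary, non-overlapping indicator discretization.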

Linear Neighbor Regularization
Regularization has long been used to discourage optimization solutions that give too much weight to any one feature. This encodes our prior knowledge that such solutions are unlikely to generalize. Regularization terms such as the ℓ_p norm are frequently used in gradient-based optimizers including our baseline implementation of PRO. Unregularized discretization is potentially brittle with regard to the number of bins chosen. Primarily, it suffers from sparsity. At the same time, we note that we know much more about discretized features than initial features since we control how they are formed. These features make up a structured feature space. With these things in mind, we propose linear neighbor regularization, a structured regularizer that embeds a small amount of knowledge into the objective function: that the indicator features resulting from the discretization of a single real-valued feature are spatially related. We expect similar weights to be given to the indicator features that represent neighboring values of the original real-valued feature such that the resulting transformation appears somewhat smooth.
To incorporate this knowledge of nearby bins, the linear neighbor regularizer R_LNR penalizes each feature's weight by the squared amount it differs from its neighbors' midpoint:

$$R_{LNR}(w) = \beta \sum_{j} \left( \tfrac{1}{2}\,(w_{j-1} + w_{j+1}) - w_j \right)^2$$

This is a special case of the feature network regularizer of Sandler (2010). Unlike traditional regularizers, we do not hope to reduce the active feature count. With the PRO loss l and an ℓ2 regularizer R_2, our final loss function internal to each iteration of PRO is:

$$L(w) = l(w) + R_2(w) + R_{LNR}(w)$$
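The penalty described above, together with its gradient for use in a gradient-based optimizer, can be sketched in Python as follows. The assumptions are that `w` holds the weights of the bins of a single discretized feature in bin order, that only interior bins (those with two neighbors) are penalized, and that `beta` is the regularization strength; the function names are hypothetical.

```python
from typing import List, Sequence


def lnr_penalty(w: Sequence[float], beta: float = 1.0) -> float:
    """Sum of squared gaps between each interior weight and its neighbors' midpoint."""
    total = 0.0
    for j in range(1, len(w) - 1):
        gap = 0.5 * (w[j - 1] + w[j + 1]) - w[j]
        total += gap * gap
    return beta * total


def lnr_gradient(w: Sequence[float], beta: float = 1.0) -> List[float]:
    """Gradient of lnr_penalty with respect to every weight."""
    grad = [0.0] * len(w)
    for j in range(1, len(w) - 1):
        gap = 0.5 * (w[j - 1] + w[j + 1]) - w[j]
        grad[j - 1] += beta * gap      # d(gap^2)/dw_{j-1} = 2 * gap * 0.5
        grad[j] -= 2.0 * beta * gap    # d(gap^2)/dw_j     = 2 * gap * (-1)
        grad[j + 1] += beta * gap      # d(gap^2)/dw_{j+1} = 2 * gap * 0.5
    return grad


# Weights that lie on a line incur no penalty; a spike is penalized.
linear_weights = [0.0, 1.0, 2.0, 3.0]
spiked_weights = [0.0, 1.0, 0.0]
```

Any weights that lie on a straight line incur zero penalty, which is why a very large `beta` pushes the learned transformation toward a linear one.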

Monotone Neighbor Regularization
However, as β → ∞, the linear neighbor regularizer R_LNR forces a linear arrangement of weights; this violates our premise that we should be agnostic to non-linear transformations. We now describe a structured regularizer R_MNR whose limiting solution is any monotone arrangement of weights. We augment R_LNR with a smooth damping term D(w, j), which has the shape of a bathtub curve with steepness γ. D is nearly zero while w_j ∈ [w_{j−1}, w_{j+1}] and nearly one otherwise. Briefly, D is computed from a ratio whose numerator measures how far w_j is from the midpoint of w_{j−1} and w_{j+1}, while the denominator scales that distance by the radius from the midpoint to the neighboring weight.
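The following Python sketch shows one way such a damped penalty could behave. The functional form of the damping term is an assumption made for illustration: it applies tanh to the ratio described above raised to the power 2γ, which gives a bathtub-shaped curve but is not necessarily the form used in the experiments. The small `eps` guarding against a zero radius and the clamp on the ratio are likewise hypothetical implementation details.

```python
import math
from typing import Sequence


def midpoint_gap(w: Sequence[float], j: int) -> float:
    """Signed distance from w[j] to the midpoint of its two neighbors."""
    return 0.5 * (w[j - 1] + w[j + 1]) - w[j]


def damping(w: Sequence[float], j: int, gamma: int = 2, eps: float = 1e-12) -> float:
    """Assumed bathtub curve: near 0 between the neighbors, near 1 outside them."""
    radius = 0.5 * abs(w[j + 1] - w[j - 1]) + eps  # midpoint-to-neighbor distance
    ratio = min(abs(midpoint_gap(w, j)) / radius, 10.0)  # clamp to avoid overflow
    return math.tanh(ratio ** (2 * gamma))


def lnr_penalty(w: Sequence[float], beta: float = 1.0) -> float:
    """Undamped linear neighbor penalty, for comparison."""
    return beta * sum(midpoint_gap(w, j) ** 2 for j in range(1, len(w) - 1))


def mnr_penalty(w: Sequence[float], beta: float = 1.0, gamma: int = 2) -> float:
    """Linear neighbor penalty scaled by the damping term at each interior bin."""
    return beta * sum(damping(w, j, gamma) * midpoint_gap(w, j) ** 2
                      for j in range(1, len(w) - 1))


# Monotone but non-linear weights are barely penalized once damped,
# while a non-monotone kink is penalized as strongly as before.
monotone = [0.0, 0.3, 1.0]
kinked = [0.0, 1.5, 1.0]
```

Under this assumed form, increasing `gamma` flattens the floor of the bathtub, so weights lying anywhere between their neighbors are penalized less, while weights that break monotonicity still pay close to the full linear neighbor penalty.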

Experimental Setup
Formalism: In our experiments, we use a hierarchical phrase-based translation model (Chiang, 2007). A corpus of parallel sentences is first word-aligned, and then phrase translations are extracted heuristically. In addition, hierarchical grammar rules are extracted where phrases are nested. In general, our choice of formalism is rather unimportant: our techniques should apply to most common phrase-based and chart-based paradigms including Hiero and syntactic systems.
Decoder: For decoding, we use cdec (Dyer et al., 2010), a multi-pass decoder that supports syntactic translation models and sparse features.
Optimizer: Optimization is performed using PRO (Hopkins and May, 2011) as implemented by the cdec decoder. We run PRO for 30 iterations as suggested by Hopkins and May (2011). The PRO optimizer internally uses an L-BFGS optimizer with the default ℓ2 regularization implemented in cdec. Any additional regularization is explicitly noted.
Baseline Features: We use the baseline features produced by Lopez' suffix array grammar extractor (Lopez, 2008), which is distributed with cdec: bidirectional lexical log-probabilities, the coherent phrasal translation log-probability, target word count, glue rule count, source OOV count, target OOV count, and target language model log-probability. Note that these features may be simplified or removed as specified in each experimental condition. All code is available at http://github.com/jhclark/cdec.
We also construct a Czech→English system based on the CzEng 1.0 data (Bojar et al., 2012). First, we lowercased and performed sentence-level deduplication of the data. Then, we uniformly sampled a training set of 1M sentences (sections 1-97) along with a weight-tuning set (section 98), a hyperparameter-tuning set (section 99), and a test set (section 99) from the paraweb domain contained in CzEng. Sentences shorter than 5 words were discarded due to noise.
Evaluation: We quantify increases in translation quality using case-insensitive BLEU (Papineni et al., 2002). We control for test set variation and optimizer instability by averaging over multiple optimizer replicas (Clark et al., 2011).
Table 3: Top: Translation quality for systems with and without the typical log transform. Bottom: Translation quality for systems using discretization and structured regularization with probabilities P or counts C as the input of discretization. MNR P consistently recovers or outperforms a state-of-the-art system, but without any assumptions about how to transform the initial features. All scores are averaged over 3 end-to-end optimizer replications. One marker denotes a score significantly different from log probs (row 2) with p(CHANCE) < 0.01 under Clark et al. (2011), and † is likewise used with regard to P (row 1).

Does Non-Linearity Matter?
In our first set of experiments, we seek to answer "Does non-linearity matter?" We start with our baseline system of 7 typical features (the Log Probability system) and then remove the log transform from all of the log probability features in our grammar (the Probs. system). The results are shown in Table 3 (rows 1, 2). If a naïve feature engineer were to remove the non-linear log transform, the systems would degrade by between 1.1 and 3.6 BLEU. From this, we conclude that non-linearity does affect translation quality. This is a potential pitfall for any real-valued feature including probability features, count features, similarity measures, etc.

Learning Non-Linear Transformations
Next, we evaluate the effects of discretization (Disc), overlapping bins (Over.), linear neighbor regularization (LNR), and monotone neighbor regularization (MNR) on three language pairs: a small Zh→En system, a large Ar→En system and a large Cz→En system. In the first row of Table 3, we use raw probabilities rather than log probabilities for p_coherent(t|s), p_lex(t|s), and p_lex(s|t). In rows 3-7, all translation model features (without the log-transformed features) are then discretized into indicator features. The number of bins and the structured regularization strength were tuned on the hyperparameter tuning set. Discretization alone does not consistently recover the performance of the log-transformed features (row 3). The naïve overlap strategy in fact degrades performance (row 4). Linear neighbor regularization (row 5) behaves more consistently than discretization alone, but is consistently outperformed by the monotone neighbor regularizer (row 6), which is able to meet or significantly exceed the performance of the log-transformed system. Importantly, this is done without any knowledge of the correct non-linear transformation. In the final row, we go a step further by removing p_coherent(t|s) altogether and replacing it with simple count features, c(s) and c(s, t), with slight to no degradation in quality. We take this as evidence that a feature engineer developing a new real-valued feature may find discretization and monotone neighbor regularization useful.
We also observe that different data sets benefit from non-linear feature transformation to different degrees (Table 3, rows 1, 2). We noticed that discretization with monotone neighbor regularization is able to improve over a log transform (rows 2, 6) in proportion to the improvement of a log transform over probability-based features (rows 1, 2).
To provide insight into how translation quality can be affected by the number of bits for discretization, we offer Table 2.
In Figure 7, we present the weights learned by the Ar→En system for probability-based features. We see that even without a bias toward a log transform, a log-like shape still emerges for many SMT features based only on the criteria of optimizing BLEU and a preference for monotonicity. However, the optimizer chooses some important variations on the log curve, especially for low probabilities, that lead to improvements in translation quality.
Figure 7: Plots of weights learned for the discretized p_coherent(e|f) (top) and c(f) (bottom) for the Ar→En system with 4 bits and monotone neighbor regularization (horizontal axis: original raw feature value; vertical axis: weight). p(e|f) > 0.11 is omitted for exposition as values were constant after this point. The gray line fits a log curve to the weights. The system learns a shape that deviates from the log in several regions. Each non-monotonic segment represents the learner choosing to better fit the data while paying a strong regularization penalty.

Related Work
Previous work on feature discretization in machine learning has focused on the conversion of real-valued features into discrete values for learners that are either incapable of handling real-valued inputs or perform suboptimally given real-valued inputs (Dougherty et al., 1995; Kotsiantis and Kanellopoulos, 2006). Decision trees and random forests have been successfully used in language modeling (Jelinek et al., 1994; Xu and Jelinek, 2004) and parsing (Charniak, 2010; Magerman, 1995).
Kernel methods such as support vector machines (SVMs) are often considered when non-linear interactions between features are desired since they allow for easy usage of non-linear kernels. Wu et al. (2004) showed improvements using non-linear kernel PCA for word sense disambiguation. Taskar et al. (2003) describe a method for incorporating kernels into structured Markov networks. Tsochantaridis et al. (2004) then proposed a structured SVM for grammar learning, named-entity recognition, text classification, and sequence alignment. This was followed by a structured SVM with inexact inference (Finley and Joachims, 2008) and the latent structured SVM (Yu and Joachims, 2009). Even within kernel methods, learning non-linear mappings with kernels remains an open area of research; for example, Cortes et al. (2009) investigated learning non-linear combinations of kernels. In MT, Giménez and Màrquez (2007) used an SVM to annotate a phrase table with binary features indicating whether or not a phrase translation was appropriate in context. Nguyen et al. (2007) also applied non-linear features for SMT n-best reranking.
Toutanova and Ahn (2013) use a form of regression decision trees to induce locally non-linear features in an n-best reranking framework. He and Deng (2012) directly optimize the lexical and phrasal features using expected BLEU. Nelakanti et al. (2013) use tree-structured ℓ_p regularizers to train language models and improve perplexity over Kneser-Ney.
Learning parameters under weak order restrictions has also been studied for regression. Isotonic regression (Barlow et al., 1972;Robertson et al., 1988;Silvapulle and Sen, 2005) fits a curve to a set of data points such that each point in the fitted curve is greater than or equal to the previous point in the curve. Nearly isotonic regression allows violations in monotonicity (Tibshirani et al., 2011).

Conclusion
In the absence of highly refined knowledge about a feature, discretization with structured regularization allows new feature sets that contain non-linearities to have a greater impact on translation quality. In our experiments, we observed that discretization outperformed naïve features lacking a good non-linear transformation by up to 4.4 BLEU and that it can outperform a baseline by up to 0.8 BLEU while dropping the log transform of the lexical probabilities and removing the phrasal probabilities in favor of counts. Looking beyond this basic feature set, non-linear transformations could be the difference between showing quality improvements or not for novel features. As researchers include more real-valued features including counts, similarity measures, and separately-trained models with millions of features, we suspect this will become an increasingly relevant issue. We conclude that non-linear responses play an important role in SMT, even for a commonly-used feature set, an observation that we hope will inform feature engineers.