Large-scale Word Alignment Using Soft Dependency Cohesion Constraints

Dependency cohesion refers to the observation that phrases dominated by disjoint dependency subtrees in the source language generally do not overlap in the target language. It has been verified to be a useful constraint for word alignment. However, previous work either treats this as a hard constraint or uses it as a feature in discriminative models, which is ineffective for large-scale tasks. In this paper, we take dependency cohesion as a soft constraint, and integrate it into a generative model for large-scale word alignment experiments. We also propose an approximate EM algorithm and a Gibbs sampling algorithm to estimate model parameters in an unsupervised manner. Experiments on large-scale Chinese-English translation tasks demonstrate that our model achieves improvements in both alignment quality and translation quality.


Introduction
Word alignment is the task of identifying word correspondences between parallel sentence pairs. Word alignment has become a vital component of statistical machine translation (SMT) systems, since it is required by almost all state-of-the-art SMT systems for the purpose of extracting phrase tables or even syntactic transformation rules (Koehn et al., 2007; Galley et al., 2004).
During the past two decades, generative word alignment models such as the IBM Models (Brown et al., 1993) and the HMM model (Vogel et al., 1996) have been widely used, primarily because they are trained on bilingual sentences in an unsupervised manner and an implementation is freely available in the GIZA++ toolkit (Och and Ney, 2003). However, the word alignment quality of generative models is still far from satisfactory for SMT systems. In recent years, discriminative alignment models incorporating linguistically motivated features have become increasingly popular (Moore, 2005; Taskar et al., 2005; Riesa and Marcu, 2010; Saers et al., 2010; Riesa et al., 2011). These models are usually trained with manually annotated parallel data. However, when moving to a new language pair, large amounts of hand-aligned data are usually unavailable and expensive to create.
A more practical way to improve large-scale word alignment quality is to introduce syntactic knowledge into a generative model and train the model in an unsupervised manner (Wu, 1997; Yamada and Knight, 2001; Lopez and Resnik, 2005; DeNero and Klein, 2007; Pauls et al., 2010). In this paper, we take dependency cohesion (Fox, 2002) into account, which assumes that phrases dominated by disjoint dependency subtrees tend not to overlap after translation. Instead of treating dependency cohesion as a hard constraint or using it as a feature in discriminative models (Cherry and Lin, 2006b), we treat dependency cohesion as a distortion constraint, and integrate it into a modified HMM word alignment model to softly influence the probabilities of alignment candidates. We also propose an approximate EM algorithm and an explicit Gibbs sampling algorithm to train the model in an unsupervised manner. Experiments on a large-scale Chinese-English translation task demonstrate that our model achieves improvements in both word alignment quality and machine translation quality.
The remainder of this paper is organized as follows: Section 2 introduces the dependency cohesion constraint for word alignment. Section 3 presents our generative model for word alignment using the dependency cohesion constraint. Section 4 describes algorithms for parameter estimation. We discuss and analyze the experiments in Section 5. Section 6 reviews related work. Finally, we conclude this paper and mention future work in Section 7.

Dependency Cohesion Constraint for Word Alignment
Given a source (foreign) sentence $f_1^J = f_1, f_2, \ldots, f_J$ and a target (English) sentence $e_1^I = e_1, e_2, \ldots, e_I$, the alignment $\mathcal{A}$ between $f_1^J$ and $e_1^I$ is defined as a subset of the Cartesian product of word positions:
$$\mathcal{A} \subseteq \{(j, i) : j = 1, \ldots, J;\ i = 1, \ldots, I\}$$
When given the source side dependency tree $T$, we can project dependency subtrees in $T$ onto the target sentence through the alignment $\mathcal{A}$. Dependency cohesion assumes that the projection spans of disjoint subtrees tend not to overlap. Let $T(f_j)$ be the subtree of $T$ rooted at $f_j$. We define two kinds of projection span for the node $f_j$: the subtree span and the head span. The subtree span is the projection span of the whole subtree $T(f_j)$, while the head span is the projection span of the node $f_j$ itself. Following Fox (2002), we consider two types of dependency cohesion: head-modifier cohesion and modifier-modifier cohesion. Head-modifier cohesion holds when the subtree span of a node does not overlap the head span of its head (parent) node, while modifier-modifier cohesion holds when the subtree spans of two nodes under the same head node do not overlap each other. We call a situation where cohesion is not maintained a crossing.
Using the dependency tree in Figure 1 as an example, given the correct alignment "R", the subtree span of "有/have" is [8,14], and the head span of its head node "之一/one of" is [3,4]. They do not overlap, so head-modifier cohesion is maintained. Similarly, the subtree span of "少数/few" is [6,6], and it does not overlap the subtree span of "有/have", so modifier-modifier cohesion is maintained. However, when "R" is replaced with the incorrect alignment "W", the subtree span of "有/have" becomes [3,14], and it overlaps the head span of its head "之一/one of", so a head-modifier crossing occurs. Meanwhile, the subtree spans of the two nodes "有/have" and "少数/few" overlap each other, so a modifier-modifier crossing occurs.
Figure 1: A Chinese-English sentence pair including the word alignments and the Chinese side dependency tree. The Chinese and English words are listed horizontally and vertically, respectively. The black grids are gold-standard alignments. For the Chinese word "有/have", we give two alignment positions, where "R" is the correct alignment and "W" is the incorrect alignment.
Fox (2002) showed that dependency cohesion is generally maintained between English and French. To test how well this assumption holds between Chinese and English, we measure the dependency cohesion between the two languages on a manually annotated bilingual Chinese-English data set of 502 sentence pairs (the development set used in Section 5). We use the head-modifier cohesion percentage (HCP) and the modifier-modifier cohesion percentage (MCP) to measure the degree of cohesion in the corpus. HCP (or MCP) measures how many head-modifier (or modifier-modifier) pairs are actually cohesive. Table 1 lists the relative percentages in both the Chinese-to-English (ch-en, using Chinese side dependency trees) and English-to-Chinese (en-ch, using English side dependency trees) directions.
As we see from Table 1, dependency cohesion is generally maintained between Chinese and English, so it should be helpful for word alignment between the two languages. However, there are still a number of crossings. If we restrict the alignment space with a hard cohesion constraint, the correct alignments that result in crossings will be ruled out directly. In the next section, we describe an approach to integrating the dependency cohesion constraint into a generative model to softly influence the probabilities of alignment candidates. We show that our new approach addresses the shortcomings of using dependency cohesion as a hard constraint.
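To make the span definitions concrete, the following Python sketch computes head spans, subtree spans, and the resulting HCP and MCP for a toy tree. It is an illustration only: the function names, the dictionary-based tree encoding, the inclusive treatment of span boundaries, and the choice to treat unaligned nodes as never overlapping are our assumptions, and the toy alignments are not the sentence from Figure 1.

```python
from typing import Dict, List, Optional, Set, Tuple

Span = Optional[Tuple[int, int]]  # inclusive target span, or None if unaligned


def head_span(node: int, links: Dict[int, Set[int]]) -> Span:
    """Projection span of the node itself (min/max linked target positions)."""
    targets = links.get(node, set())
    return (min(targets), max(targets)) if targets else None


def subtree_span(node: int, children: Dict[int, List[int]],
                 links: Dict[int, Set[int]]) -> Span:
    """Projection span of the whole subtree rooted at `node`."""
    spans = [head_span(node, links)]
    spans += [subtree_span(c, children, links) for c in children.get(node, [])]
    spans = [s for s in spans if s is not None]
    if not spans:
        return None
    return (min(s[0] for s in spans), max(s[1] for s in spans))


def overlaps(a: Span, b: Span) -> bool:
    """Two inclusive spans overlap if they share at least one target position."""
    return a is not None and b is not None and a[0] <= b[1] and b[0] <= a[1]


def cohesion_percentages(children: Dict[int, List[int]],
                         links: Dict[int, Set[int]]) -> Tuple[float, float]:
    """Return (HCP, MCP) in percent over all pairs in the tree."""
    hc_ok = hc_all = mc_ok = mc_all = 0
    for head, mods in children.items():
        h = head_span(head, links)
        subs = [subtree_span(m, children, links) for m in mods]
        for s in subs:  # head-modifier pairs
            hc_all += 1
            hc_ok += not overlaps(s, h)
        for x in range(len(subs)):  # modifier-modifier pairs
            for y in range(x + 1, len(subs)):
                mc_all += 1
                mc_ok += not overlaps(subs[x], subs[y])
    hcp = 100.0 * hc_ok / hc_all if hc_all else 100.0
    mcp = 100.0 * mc_ok / mc_all if mc_all else 100.0
    return hcp, mcp


# Toy tree: node 1 heads nodes 2 and 3; node 3 heads node 4.
children = {1: [2, 3], 3: [4]}
right = {1: {5}, 2: {1}, 3: {3}, 4: {2}}   # all pairs cohesive
wrong = {1: {5}, 2: {1}, 3: {3}, 4: {6}}   # subtree of 3 now covers [3,6]
print(cohesion_percentages(children, right))
print(cohesion_percentages(children, wrong))
```

In the second alignment, moving a single link stretches the subtree span of node 3 over the head span of node 1, which is the same kind of head-modifier crossing as the "W" alignment described above.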

A Generative Word Alignment Model with Dependency Cohesion Constraint
The most influential generative word alignment models are the IBM Models 1-5 and the HMM model (Brown et al., 1993; Vogel et al., 1996; Och and Ney, 2003). These models can be classified into sequence-based models (IBM Models 1, 2 and HMM) and fertility-based models (IBM Models 3, 4 and 5). The sequence-based model is easier to implement, and recent experiments have shown that appropriately modified sequence-based models can achieve performance comparable to fertility-based models (Lopez and Resnik, 2005; Liang et al., 2006; DeNero and Klein, 2007; Zhao and Gildea, 2010; Bansal et al., 2011). We therefore build our generative word alignment model with the dependency cohesion constraint on top of the sequence-based model.

The Sequence-based Alignment Model
According to Brown et al. (1993) and Och and Ney (2003), the sequence-based model is built as a noisy channel model, where the source sentence $f_1^J$ and the alignment $a_1^J$ are generated conditioning on the target sentence $e_1^I$. The model assumes each source word is assigned to exactly one target word, and defines an asymmetric alignment for the sentence pair as $a_1^J = a_1, a_2, \ldots, a_j, \ldots, a_J$, where each $a_j \in [0, I]$ is an alignment from the source position $j$ to the target position $a_j$, and $a_j = 0$ means $f_j$ is not aligned with any target word. The sequence-based model divides the alignment procedure into two stages (distortion and translation) and factors as:
$$p(f_1^J, a_1^J \mid e_1^I) = \prod_{j=1}^{J} p_d(a_j \mid a_{j-1}, I)\, p_t(f_j \mid e_{a_j}) \tag{1}$$
where $p_d$ is the distortion model and $p_t$ is the translation model. IBM Models 1, 2 and the HMM model all assume the same translation model $p_t(f_j \mid e_{a_j})$. However, they use three different distortion models. IBM Model 1 assumes a uniform distortion probability $1/(I+1)$, IBM Model 2 assumes $p_d(a_j \mid j)$, which depends on the word position $j$, and the HMM model assumes $p_d(a_j \mid a_{j-1}, I)$, which depends on the previous alignment $a_{j-1}$. More recently, tree distance models (Lopez and Resnik, 2005; DeNero and Klein, 2007) formulate the distortion model as $p_d(a_j \mid a_{j-1}, T)$, where the distance between $a_j$ and $a_{j-1}$ is calculated by walking through the phrase (or dependency) tree $T$.
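As a small worked instance of the factorization in Eq. (1), the sketch below scores one alignment of a toy sentence pair in log space. The function names, the dictionary-based translation table, the "NULL" token for position 0, and the start state of 0 before the first source word are illustrative assumptions; the toy probabilities are made up.

```python
import math
from typing import Callable, Dict, List, Tuple


def sequence_model_logprob(src: List[str], tgt: List[str], alignment: List[int],
                           p_t: Dict[Tuple[str, str], float],
                           p_d: Callable[[int, int, int], float]) -> float:
    """log p(f, a | e) as a sum of log distortion and log translation terms."""
    I = len(tgt)
    logp = 0.0
    prev = 0  # assumed start state before the first source word
    for f_j, a_j in zip(src, alignment):
        e = "NULL" if a_j == 0 else tgt[a_j - 1]  # target positions are 1-based
        logp += math.log(p_d(a_j, prev, I)) + math.log(p_t[(f_j, e)])
        prev = a_j
    return logp


def uniform_distortion(a_j: int, a_prev: int, I: int) -> float:
    """IBM Model 1 style distortion: every position 0..I is equally likely."""
    return 1.0 / (I + 1)


src = ["la", "maison"]
tgt = ["the", "house"]
p_t = {("la", "the"): 0.5, ("maison", "house"): 0.8}
logp = sequence_model_logprob(src, tgt, [1, 2], p_t, uniform_distortion)
print(math.exp(logp))  # (1/3 * 0.5) * (1/3 * 0.8)
```

Swapping `uniform_distortion` for a function that depends on `a_prev` gives an HMM-style distortion without changing the rest of the computation.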

Proposed Model
To integrate the dependency cohesion constraint into a generative model, we refine the sequence-based model in two ways with the help of the source side dependency tree $T$.
First, we design a new word alignment order. In the sequence-based model, source words are aligned from left to right by taking the source sentence as a linear sequence. However, to apply the dependency cohesion constraint, the subtree span of a head node is computed based on the alignments of its children, so children must be aligned before the head node. Riesa and Marcu (2010) propose a hierarchical search procedure to traverse all nodes in a phrase structure tree. Similarly, we define a bottom-up topological order (BUT-order) to traverse all words in the source side dependency tree $T$. In the BUT-order, tree nodes are aligned bottom-up with $T$ as a backbone. For all children under the same head node, left children are aligned from right to left, and then right children are aligned from left to right. For example, the BUT-order for the example dependency tree is "C B E F D A H G". For the sake of clarity, we define a function that maps all nodes in $T$ into their BUT-order, and notate it as $BUT(f_1^J) = \pi_1, \pi_2, \ldots, \pi_j, \ldots, \pi_J$, where $\pi_j$ means that the $j$-th node in BUT-order is the $\pi_j$-th word in the original source sentence. We arrange the alignment sequence $a_1^J$ according to the BUT-order and notate it as $a_{[1,J]} = a_{\pi_1}, \ldots, a_{\pi_j}, \ldots, a_{\pi_J}$, where $a_{\pi_j}$ is the aligned position for the node $f_{\pi_j}$. We also notate the sub-sequence $a_{\pi_i}, \ldots, a_{\pi_j}$ as $a_{[i,j]}$.
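One way to read the BUT-order is as a post-order traversal in which, at each head, left children are visited from right to left and right children from left to right. The sketch below follows that reading. The function name and tree encoding are our assumptions, and the tree itself is hypothetical: it is one tree over the words A to H that is consistent with the order "C B E F D A H G", not necessarily the tree the example was drawn from.

```python
from typing import Dict, List


def but_order(root: int, children: Dict[int, List[int]]) -> List[int]:
    """Return 1-based word positions in bottom-up topological order."""
    order: List[int] = []

    def visit(node: int) -> None:
        kids = children.get(node, [])
        left = sorted((c for c in kids if c < node), reverse=True)  # right to left
        right = sorted(c for c in kids if c > node)                 # left to right
        for child in left + right:
            visit(child)
        order.append(node)  # a head is emitted only after all of its children

    visit(root)
    return order


words = "A B C D E F G H".split()
# Hypothetical tree: G is the root with children A and H; A heads B and D;
# B heads C; D heads E and F.
children = {7: [1, 8], 1: [2, 4], 2: [3], 4: [5, 6]}
pi = but_order(7, children)
print(" ".join(words[p - 1] for p in pi))  # C B E F D A H G
```

The returned list plays the role of $\pi_1, \ldots, \pi_J$: its $j$-th entry is the sentence position of the $j$-th node to be aligned.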
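Second, we keep the same translation model as the sequence-based model and integrate the dependency cohesion constraints into the distortion model. The main idea is to influence the distortion procedure with the dependency cohesion constraints. Assume node $f_h$ and node $f_m$ are a head-modifier pair in $T$, where $f_h$ is the head and $f_m$ is the modifier. The head-modifier cohesion relationship between them is notated as $h_{h,m} \in \{cohesion, crossing\}$. When the head-modifier cohesion is maintained, $h_{h,m} = cohesion$; otherwise $h_{h,m} = crossing$.
We represent the set of head-modifier cohesion relationships for all the head-modifier pairs in $T$ as:
$$\mathcal{H} = \{h_{h,m} \mid h, m \in [1, J],\ f_h \text{ and } f_m \text{ are a head-modifier pair in } T\}$$
and write $\mathcal{H}_h$ for the subset of $\mathcal{H}$ whose pairs take $f_h$ as the head node. Similarly, we assume node $f_k$ and node $f_l$ are a modifier-modifier pair in $T$. To avoid repetition, we assume $f_k$ is the node sitting at the position after $f_l$ in BUT-order and call $f_k$ the higher-order node of the pair. The modifier-modifier cohesion relationship between them is notated as $m_{k,l} \in \{cohesion, crossing\}$. When the modifier-modifier cohesion is maintained, $m_{k,l} = cohesion$; otherwise $m_{k,l} = crossing$. We represent the set of modifier-modifier cohesion relationships for all the modifier-modifier pairs in $T$ as:
$$\mathcal{M} = \{m_{k,l} \mid k, l \in [1, J],\ f_k \text{ and } f_l \text{ are a modifier-modifier pair in } T\}$$
The set of modifier-modifier cohesion relationships for all the modifier-modifier pairs taking $f_k$ as the higher-order node can be represented as:
$$\mathcal{M}_k = \{m_{k,l} \mid l \in [1, J],\ f_k \text{ and } f_l \text{ are a modifier-modifier pair in } T\}$$
Obviously, $\mathcal{M} = \bigcup_{k=0}^{J} \mathcal{M}_k$. With the above notations, we formulate the distortion probability for a node $f_{\pi_j}$ as $p_d(a_{\pi_j}, \mathcal{H}_{\pi_j}, \mathcal{M}_{\pi_j} \mid a_{[1,j-1]})$. According to Eq. (1) and the two improvements, we formulate our model as:
$$p(f_1^J, a_{[1,J]} \mid e_1^I, T) \approx \prod_{\pi_j \in BUT(f_1^J)} p_d(a_{\pi_j}, \mathcal{H}_{\pi_j}, \mathcal{M}_{\pi_j} \mid a_{[1,j-1]})\, p_t(f_{\pi_j} \mid e_{a_{\pi_j}}) \tag{2}$$
Here, we use the approximation symbol because the right-hand side is not guaranteed to be normalized. In practice, we only compute ratios of these terms, so this is not actually a problem. Such a model is called deficient (Brown et al., 1993), and many successful unsupervised models are deficient, e.g., IBM Model 3 and IBM Model 4.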

Dependency Cohesive Distortion Model
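We assume the distortion procedure is influenced by three factors: word distance, head-modifier cohesion and modifier-modifier cohesion. Therefore, we further decompose the distortion model into three terms as follows:
$$p_d(a_{\pi_j}, \mathcal{H}_{\pi_j}, \mathcal{M}_{\pi_j} \mid a_{[1,j-1]}) = p_{wd}(a_{\pi_j} \mid a_{\pi_{j-1}}, I)\; p_{hc}(\mathcal{H}_{\pi_j} \mid a_{[1,j]})\; p_{mc}(\mathcal{M}_{\pi_j} \mid a_{[1,j]}) \tag{3}$$
where $p_{wd}$ is the word distance term, $p_{hc}$ is the head-modifier cohesion term and $p_{mc}$ is the modifier-modifier cohesion term.
The word distance term $p_{wd}$ has been verified to be very useful in the HMM alignment model. However, in our model, the word distance is calculated based on the previous node in BUT-order rather than the previous word in the original sentence. We follow the HMM word alignment model (Vogel et al., 1996) and parameterize $p_{wd}$ in terms of the jump width:
$$p_{wd}(a_{\pi_j} \mid a_{\pi_{j-1}}, I) = \frac{c(a_{\pi_j} - a_{\pi_{j-1}})}{\sum_{i'=1}^{I} c(i' - a_{\pi_{j-1}})} \tag{4}$$
where $c(\cdot)$ is the count of the jump width.
The head-modifier cohesion term $p_{hc}$ is used to penalize the distortion probability according to the relationships between the head node and its children (modifiers). Therefore, we define $p_{hc}$ as the product of probabilities for all head-modifier pairs taking $f_{\pi_j}$ as the head node:
$$p_{hc}(\mathcal{H}_{\pi_j} \mid a_{[1,j]}) = \prod_{h_{\pi_j,c} \in \mathcal{H}_{\pi_j}} p_h(h_{\pi_j,c} \mid f_c, e_{a_{\pi_j}}, e_{a_c}) \tag{5}$$
where $h_{\pi_j,c} \in \{cohesion, crossing\}$ is the head-modifier cohesion relationship between $f_{\pi_j}$ and one of its children $f_c$, $p_h$ is the corresponding probability, and $e_{a_{\pi_j}}$ and $e_{a_c}$ are the aligned words for $f_{\pi_j}$ and $f_c$. Similarly, the modifier-modifier cohesion term $p_{mc}$ is used to penalize the distortion probability according to the relationships between $f_{\pi_j}$ and its siblings. Therefore, we define $p_{mc}$ as the product of probabilities for all the modifier-modifier pairs taking $f_{\pi_j}$ as the higher-order node:
$$p_{mc}(\mathcal{M}_{\pi_j} \mid a_{[1,j]}) = \prod_{m_{\pi_j,s} \in \mathcal{M}_{\pi_j}} p_m(m_{\pi_j,s} \mid f_s, e_{a_{\pi_j}}, e_{a_s}) \tag{6}$$
where $m_{\pi_j,s} \in \{cohesion, crossing\}$ is the modifier-modifier cohesion relationship between $f_{\pi_j}$ and one of its siblings $f_s$, $p_m$ is the corresponding probability, and $e_{a_{\pi_j}}$ and $e_{a_s}$ are the aligned words for $f_{\pi_j}$ and $f_s$. Both $p_h$ and $p_m$ in Eq. (5) and Eq. (6) are conditioned on three words, which would make them very sparse. To cope with this problem, we use the word clustering toolkit mkcls (Och et al., 1999) to cluster all words into 50 classes, and replace the three words with their classes.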
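The jump-width parameterization and the product form of the cohesion terms can be illustrated with a short sketch. Everything here is a simplified assumption for illustration: the function names are ours, the jump counts are made up, and the per-pair cohesion probabilities are passed in as plain numbers rather than looked up from class-conditioned tables.

```python
import math
from collections import Counter
from typing import Iterable


def jump_prob(a_cur: int, a_prev: int, I: int, jump_counts: Counter) -> float:
    """Word distance term: count of this jump width, normalized over the
    jumps from a_prev to every target position 1..I."""
    num = jump_counts[a_cur - a_prev]
    den = sum(jump_counts[i - a_prev] for i in range(1, I + 1))
    return num / den if den else 0.0


def cohesive_distortion(p_wd: float, hc_probs: Iterable[float],
                        mc_probs: Iterable[float]) -> float:
    """Distortion as the product of the word distance term and one
    probability per head-modifier pair and per modifier-modifier pair."""
    return p_wd * math.prod(hc_probs) * math.prod(mc_probs)


# Made-up jump-width counts: forward jumps of width 1 dominate.
counts = Counter({-1: 1, 0: 2, 1: 6, 2: 1})
p_wd = jump_prob(a_cur=3, a_prev=2, I=4, jump_counts=counts)
print(p_wd)
# One head-modifier pair that stays cohesive (0.9) and one sibling pair (0.8).
print(cohesive_distortion(p_wd, [0.9], [0.8]))
```

A leaf node has no children, so its head-modifier product is empty and contributes a factor of 1; the same holds for a node with no earlier sibling in BUT-order.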

Parameter Estimation
To align sentence pairs with the model in Eq. (2), we have to estimate the parameters $p_t$, $p_{wd}$, $p_h$ and $p_m$. The traditional approach for sequence-based models uses the Expectation Maximization (EM) algorithm to estimate parameters. However, in our model, it is hard to find an efficient way to sum over all the possible alignments, which is required in the E-step of the EM algorithm. Therefore, we propose an approximate EM algorithm and a Gibbs sampling algorithm for parameter estimation.

Approximate EM Algorithm
The approximate EM algorithm is similar to the training algorithm for fertility-based alignment models (Och and Ney, 2003). The main idea is to enumerate only a small subset of good alignments in the E-step, then collect expectation counts and estimate parameters over this small subset in the M-step. Following Och and Ney (2003), we employ the neighbor alignments of the Viterbi alignment as the small subset. Neighbor alignments are obtained by performing one swap or move operation on the Viterbi alignment.
Obtaining the Viterbi alignment itself is not easy for our model. Therefore, we take the Viterbi alignment of the sequence-based model (HMM model) as the starting point, and iterate the hill-climbing algorithm (Brown et al., 1993) to find the best alignment greedily. In each iteration, we find the best alignment under Eq. (2) among the neighbor alignments of the current point, and then take that alignment as the starting point for the next iteration. The algorithm iterates until no further update can be made.
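The greedy search over move and swap neighbors can be sketched as follows. This is a minimal illustration under stated assumptions: the alignment is a list of target positions (0 for unaligned), `score` stands in for the model probability of Eq. (2), and the toy scoring function in the usage lines is invented purely to exercise the loop.

```python
from typing import Callable, Iterator, List


def neighbors(a: List[int], I: int) -> Iterator[List[int]]:
    """Yield every alignment reachable by one move or one swap."""
    for j in range(len(a)):
        for i in range(I + 1):  # move: re-link source j to target i (0 = NULL)
            if i != a[j]:
                yield a[:j] + [i] + a[j + 1:]
    for j1 in range(len(a)):
        for j2 in range(j1 + 1, len(a)):  # swap: exchange two links
            if a[j1] != a[j2]:
                b = list(a)
                b[j1], b[j2] = b[j2], b[j1]
                yield b


def hill_climb(start: List[int], I: int,
               score: Callable[[List[int]], float]) -> List[int]:
    """Repeatedly jump to the best-scoring neighbor until none improves."""
    best, best_score = list(start), score(start)
    while True:
        cand = max(neighbors(best, I), key=score, default=None)
        if cand is None or score(cand) <= best_score:
            return best
        best, best_score = cand, score(cand)


# Toy score (an assumption): closeness to a known reference alignment.
reference = [1, 2, 3]
toy_score = lambda a: -sum(abs(x - y) for x, y in zip(a, reference))
print(hill_climb([3, 3, 1], I=3, score=toy_score))
```

Because each accepted step strictly increases the score and the alignment space is finite, the loop always terminates, though only at a local optimum of the scoring function.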

Gibbs Sampling Algorithm
Gibbs sampling is another effective algorithm for unsupervised learning problems. As described in the literature (Johnson et al., 2007; Gao and Johnson, 2008), there are two types of Gibbs samplers: explicit and collapsed. An explicit sampler represents and samples the model parameters in addition to the word alignments, while in a collapsed sampler the parameters are integrated out and only alignments are sampled. Mermer and Saraçlar (2011) proposed a collapsed sampler for IBM Model 1. However, their sampler updates parameters constantly and thus cannot run efficiently on large-scale tasks. Instead, we take advantage of explicit Gibbs sampling to build a highly parallelizable sampler. Our Gibbs sampler is similar to the MCMC algorithm in Zhao and Gildea (2010), but we assume Dirichlet priors when sampling model parameters and take a different sampling approach based on the source side dependency tree.
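Our sampler performs a sequence of consecutive iterations. Each iteration consists of two sampling steps. The first step samples the aligned position for each dependency node according to the BUT-order. Concretely, when sampling the aligned position $a_{\pi_j}^{(t+1)}$ for node $f_{\pi_j}$ on iteration $t+1$, the aligned positions $a_{[1,j-1]}$ are fixed at the values already sampled on iteration $t+1$, and the aligned positions $a_{[j+1,J]}$ are fixed at their values from iteration $t$. The new position is then drawn with probability
$$p(a_{\pi_j}^{(t+1)} = i) = \frac{p(f_1^J, \hat{a}^{(i)} \mid e_1^I, T)}{\sum_{i'=0}^{I} p(f_1^J, \hat{a}^{(i')} \mid e_1^I, T)}$$
where $\hat{a}^{(i)}$ denotes the alignment that sets $a_{\pi_j} = i$ and keeps all other positions fixed as above. The numerator is the probability of aligning $f_{\pi_j}$ with the target position $i$, calculated with Eq. (2), and the denominator is the summation of the probabilities of aligning $f_{\pi_j}$ with each target word. The second step of our sampler calculates the parameters $p_t$, $p_{wd}$, $p_h$ and $p_m$ using their counts, where all these counts can be easily collected during the first sampling step. Because all these parameters follow multinomial distributions, we consider Dirichlet priors for them, which greatly simplifies the inference procedure.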
In the first sampling step, all the sentence pairs are processed independently, so this step can be parallelized and all the sentence pairs can be processed efficiently with multiple threads. When using the Gibbs sampler for decoding, we simply skip the second sampling step and iterate the first sampling step many times.
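The two sampling steps can be sketched in a few lines. This is an illustration under explicit assumptions rather than the sampler used in our experiments: `joint_prob` stands in for the unnormalized probability of Eq. (2), the alignment is a plain list indexed by source position, and the second step is shown as a draw from the Dirichlet posterior (counts plus a symmetric hyper-parameter), which is one standard choice for an explicit sampler.

```python
import random
from typing import Callable, List, Sequence


def sample_position(weights: Sequence[float], rng: random.Random) -> int:
    """Draw index i with probability weights[i] / sum(weights)."""
    r = rng.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r < acc:
            return i
    return len(weights) - 1  # reached only if all weights are zero


def gibbs_sweep(alignment: List[int], order: Sequence[int], I: int,
                joint_prob: Callable[[List[int]], float],
                rng: random.Random) -> List[int]:
    """First step: resample each node in the given order, holding all other
    positions at their current values (new for earlier nodes, old for later)."""
    for j in order:
        weights = []
        for i in range(I + 1):  # position 0 means unaligned
            alignment[j] = i
            weights.append(joint_prob(alignment))
        alignment[j] = sample_position(weights, rng)
    return alignment


def sample_multinomial_params(counts: Sequence[float], alpha: float,
                              rng: random.Random) -> List[float]:
    """Second step (assumed form): draw from Dirichlet(counts + alpha)
    by normalizing independent gamma variates."""
    draws = [rng.gammavariate(c + alpha, 1.0) for c in counts]
    total = sum(draws)
    return [d / total for d in draws]


# Toy joint probability that factorizes over two source words (made up).
table = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
toy_joint = lambda a: table[0][a[0]] * table[1][a[1]]
rng = random.Random(0)
print(gibbs_sweep([0, 0], order=[0, 1], I=2, joint_prob=toy_joint, rng=rng))
print(sample_multinomial_params([3, 1, 2], alpha=0.5, rng=rng))
```

Since `gibbs_sweep` touches only one sentence pair, separate pairs can be handed to separate workers, and decoding corresponds to calling it repeatedly without the parameter step.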

Experiments
We performed a series of experiments to evaluate our model. All the experiments are conducted on the Chinese-English language pair. We employ two training sets: FBIS and LARGE. The size and source corpus of these training sets are listed in Table 2. We use the smaller training set FBIS to evaluate the characteristics of our model and use the LARGE training set to evaluate whether our model is suitable for large-scale tasks. For word alignment quality evaluation, we take the hand-aligned data sets from SSMT2007, which contain 505 sentence pairs in the testing set and 502 sentence pairs in the development set. Following Och and Ney (2003), we evaluate word alignment quality with the alignment error rate (AER), where lower AER is better.
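For reference, AER as defined by Och and Ney (2003) compares a predicted link set A against sure links S and possible links P (with S contained in P): AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|). The sketch below implements that formula; the function name and the toy link sets are illustrative, and whether a given gold standard distinguishes sure from possible links depends on the annotation.

```python
from typing import Iterable, Set, Tuple

Link = Tuple[int, int]  # (source position, target position)


def aer(predicted: Iterable[Link], sure: Iterable[Link],
        possible: Iterable[Link]) -> float:
    """Alignment error rate; lower is better."""
    a: Set[Link] = set(predicted)
    s: Set[Link] = set(sure)
    p: Set[Link] = set(possible) | s  # sure links are always possible
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


predicted = {(1, 1), (2, 2), (3, 4)}
sure = {(1, 1), (2, 2)}
possible = {(3, 3)}
print(aer(predicted, sure, possible))  # 1 - (2 + 2) / (3 + 2)
```

If the gold standard contains only sure links, passing an empty `possible` set reduces the measure to one minus the balanced F-measure of the predicted links.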
Because our model takes dependency trees as input, we parse both sides of the two training sets, the development set and the testing set with the Berkeley parser (Petrov et al., 2006), and then convert the generated phrase trees into dependency trees according to Wang and Zong (2010; 2011). Our model is an asymmetric model, so we perform word alignment in both the forward (Chinese-to-English) and reverse (English-to-Chinese) directions.

Effectiveness of Cohesion Constraints
In Eq. (3), the distortion probability $p_d$ is decomposed into three terms: $p_{wd}$, $p_{hc}$ and $p_{mc}$. To study whether cohesion constraints are effective for word alignment, we construct four sub-models as follows: (1) wd: $p_d = p_{wd}$; (2) wd-hc: $p_d = p_{wd} \cdot p_{hc}$; (3) wd-mc: $p_d = p_{wd} \cdot p_{mc}$; (4) wd-hc-mc: $p_d = p_{wd} \cdot p_{hc} \cdot p_{mc}$. We train these four models with the approximate EM and the Gibbs sampling algorithms on the FBIS training set. For the approximate EM algorithm, we first train an HMM model (with 5 iterations of IBM Model 1 and 5 iterations of the HMM model), then train the four sub-models with 10 iterations of the approximate EM algorithm. For Gibbs sampling, we choose symmetric Dirichlet priors with all hyper-parameters equal to 0.0001 to obtain a sparse Dirichlet prior. Then, we take the alignments produced by the HMM model as the initial points, and train the sub-models with 20 iterations of Gibbs sampling.
AERs on the development set are listed in Table 3. We observe that: 1) when employing the head-modifier cohesion constraint, the wd-hc model yields better AERs than the wd model; 2) when employing the modifier-modifier cohesion constraint, the wd-mc model also yields better AERs than the wd model; and 3) when employing both the head-modifier and the modifier-modifier cohesion constraints together, the wd-hc-mc model yields the best AERs among the four sub-models. So both the head-modifier cohesion constraint and the modifier-modifier cohesion constraint are helpful for word alignment. Table 3 also shows that the approximate EM algorithm yields better AERs in the forward direction than in the reverse direction, while the Gibbs sampling algorithm yields close AERs in both directions.

Comparison with State-of-the-Art Models
To show the effectiveness of our model, we compare it with several state-of-the-art models. The systems are as follows:
1) IBM4: The fertility-based model (IBM Model 4) implemented in the GIZA++ toolkit. The training scheme is 5 iterations of IBM Model 1, 5 iterations of the HMM model and 10 iterations of IBM Model 4.
2) IBM4-L0: A modification to the GIZA++ toolkit which extends the IBM models with the ℓ0-norm (Vaswani et al., 2012). The training scheme is the same as IBM4.
3) IBM4-Prior: A modification to the GIZA++ toolkit which extends the translation model of the IBM models with Dirichlet priors (Riley and Gildea, 2012). The training scheme is the same as IBM4.
4) Agree-HMM: The HMM alignment model with joint training of the forward and reverse models (Liang et al., 2006), as implemented in the BerkeleyAligner. The training scheme is 5 iterations of jointly training IBM Model 1 and 5 iterations of jointly training the HMM model.
5) Tree-Distance: The tree distance alignment model proposed in DeNero and Klein (2007), as implemented in the BerkeleyAligner. The training scheme is 5 iterations of jointly training IBM Model 1 and 5 iterations of jointly training the tree distance model.
6) Hard-Cohesion: An implementation of the "Cohesion Checking Algorithm", which takes dependency cohesion as a hard constraint during beam search word alignment decoding. We use the model trained by the Agree-HMM system to estimate alignment candidates.
We also build two systems for our soft dependency cohesion model:
7) Soft-Cohesion-EM: The wd-hc-mc sub-model trained with the approximate EM algorithm as described in sub-section 5.1.
8) Soft-Cohesion-Gibbs: The wd-hc-mc sub-model trained with the Gibbs sampling algorithm as described in sub-section 5.1.
We train all these systems on the FBIS training set, and test them on the testing set. We also combine the forward and reverse alignments with the grow-diag-final-and (GDFA) heuristic (Koehn et al., 2007). All AERs are listed in Table 4. Our soft cohesion systems produce better AERs than the Hard-Cohesion system as well as the other systems. Table 5 gives the head-modifier cohesion percentage (HCP) and the modifier-modifier cohesion percentage (MCP) of each system. The HCPs and MCPs of our soft cohesion systems are much closer to those of the gold-standard alignments.
To evaluate whether our model is suitable for large-scale tasks, we retrained these systems using the LARGE training set. AERs on the testing set are listed in Table 6. (The Tree-Distance system requires too much memory to run on our server when using the LARGE data set, so we could not obtain its result.) Compared with Table 4, all the systems yield better performance when using more training data. Our soft cohesion systems still produce better AERs than the other systems, suggesting that our soft cohesion model is effective for large-scale word alignment tasks.

Machine Translation Quality Comparison
We then evaluate the effect of word alignment on machine translation quality using the phrase-based translation system Moses (Koehn et al., 2007). We take the NIST MT03 test data as the development set and the NIST MT05 test data as the testing set. We train a 5-gram language model on the Xinhua portion of the English Gigaword corpus and the English side of the training set using the SRILM Toolkit (Stolcke, 2002). We train machine translation models using the GDFA alignments of each system. BLEU scores on NIST MT05 are listed in Table 7, where BLEU scores are calculated using lowercased and tokenized data (Papineni et al., 2002). Although the IBM4-L0, Agree-HMM, Tree-Distance and Hard-Cohesion systems improve word alignment over IBM4, they fail to outperform the IBM4 system on machine translation. The BLEU score of our Soft-Cohesion-EM system is better than that of the IBM4 system when using the FBIS training set, but worse when using the LARGE training set. Our Soft-Cohesion-Gibbs system produces the best BLEU score on both training sets. We also performed a statistical significance test using bootstrap resampling with 1000 samples (Koehn, 2004; Zhang et al., 2004). The results show that the Soft-Cohesion-Gibbs system is significantly better (p<0.05) than the IBM4 system. The IBM4-Prior system slightly outperforms IBM4, but the difference is not significant.
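The idea behind paired bootstrap resampling can be sketched generically: resample test sentences with replacement, score both systems on each resampled set, and count how often one system wins. The code below is a minimal illustration, not the exact test procedure used above. The function name is ours, the metric callables are placeholders for a corpus-level BLEU computation over the selected sentences, and the toy usage substitutes a simple mean of made-up per-sentence scores.

```python
import random
from typing import Callable, List, Sequence


def paired_bootstrap(n_sentences: int,
                     metric_a: Callable[[Sequence[int]], float],
                     metric_b: Callable[[Sequence[int]], float],
                     n_samples: int = 1000, seed: int = 0) -> float:
    """Fraction of resampled test sets on which system A scores above B."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_samples):
        # Draw sentence indices with replacement; both systems see the same set.
        sample = [rng.randrange(n_sentences) for _ in range(n_sentences)]
        if metric_a(sample) > metric_b(sample):
            wins += 1
    return wins / n_samples


# Toy stand-in for a corpus-level metric: mean of made-up sentence scores.
scores_a: List[float] = [0.5, 0.6, 0.7]
scores_b: List[float] = [0.4, 0.5, 0.6]
mean_a = lambda idx: sum(scores_a[i] for i in idx) / len(idx)
mean_b = lambda idx: sum(scores_b[i] for i in idx) / len(idx)
win_rate = paired_bootstrap(3, mean_a, mean_b)
print(win_rate)
```

Under this scheme, a win rate of at least 0.95 for system A corresponds to the usual reading of significance at p<0.05.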

Related Work
There have been many proposals for integrating syntactic knowledge into generative alignment models. Wu (1997) proposed the inversion transduction grammar (ITG) to model word alignment as synchronous parsing of a sentence pair. Yamada and Knight (2001) represented translation as a sequence of re-ordering operations over child nodes of a syntactic tree. Gildea (2003) introduced a "loosely" tree-based alignment technique, which allows alignments to violate syntactic constraints by incurring a cost in probability. Pauls et al. (2010) gave a new instance of the ITG formalism, in which one side of the synchronous derivation is constrained by the syntactic tree. Fox (2002) measured syntactic cohesion in gold-standard alignments and showed that syntactic cohesion is generally maintained between English and French. She also compared three variant syntactic representations (phrase tree, verb phrase flattening tree and dependency tree), and found that the dependency tree produced the highest degree of cohesion. Accordingly, earlier work (Cherry and Lin, 2006a) used dependency cohesion as a hard constraint, under which alignments that violate cohesion are ruled out directly. Although the alignment quality is improved, this approach ignores situations where a small set of correct alignments can violate cohesion. To address this limitation, Cherry and Lin (2006b) proposed a soft constraint approach, which took dependency cohesion as a feature of a discriminative model, and verified that the soft constraint works better than the hard constraint. However, the training procedure is very time-consuming, and they trained the model with only 100 hand-annotated sentence pairs. Therefore, their method is not suitable for large-scale tasks. In this paper, we also use dependency cohesion as a soft constraint. However, unlike Cherry and Lin (2006b), we integrate the soft dependency cohesion constraint into a generative model that is more suitable for large-scale word alignment tasks.

Conclusion and Future Work
We described a generative model for word alignment that uses dependency cohesion as a soft constraint. We proposed an approximate EM algorithm and an explicit Gibbs sampling algorithm for unsupervised parameter estimation. Experiments on a large-scale data set show that our model improves word alignment quality as well as machine translation quality. Our experimental results also indicate that the soft constraint approach outperforms the hard constraint approach. Our word alignment model could be improved further in at least two ways. First, we generated word alignments in the forward and reverse directions separately, but it might be helpful to use the dependency trees of the two sides simultaneously. Second, we only used the one-best automatically generated dependency trees in the model. Errors are inevitable in those trees, so we will investigate whether N-best dependency trees or dependency forests (Hayashi et al., 2011) can improve our model.