Joint Prediction of Word Alignment with Alignment Types

Current word alignment models do not distinguish between different types of alignment links. In this paper, we provide a new probabilistic model for word alignment in which word alignments are associated with linguistically motivated alignment types. We propose the novel task of jointly predicting word alignments and their alignment types, and present novel semi-supervised learning algorithms for this task. We also solve a sub-task of predicting the alignment type given an aligned word pair. In our experimental results, the generative models we introduce to model alignment types significantly outperform the models without alignment types.


Introduction
Word alignment is an essential component in a statistical machine translation (SMT) system. Soft alignments, or attention, are also an important component in neural machine translation (NMT) systems. The classic generative model approach to word alignment is based on IBM Models 1-5 (Brown et al., 1993) and the HMM model (Vogel et al., 1996; Och and Ney, 2000a). These traditional models use unsupervised algorithms to learn alignments, relying on a large amount of parallel training data without hand-annotated alignments. Supervised algorithms for word alignment have become more widespread with the availability of manually annotated word-aligned data and have shown promising results (Taskar et al., 2005; Blunsom and Cohn, 2006; Moore et al., 2006; Liang et al., 2006). Manually word-aligned data are a valuable resource for SMT research, but they are costly to create and are only available for a handful of language pairs. Semi-supervised methods for word alignment combine hand-annotated word alignment data with parallel data without explicit word alignments. Even small amounts of hand-annotated word alignment data have been shown to improve alignment and translation quality (Callison-Burch et al., 2004). In this paper, we provide a novel semi-supervised word alignment model that adds alignment type information to word alignments.
Unsupervised or semi-supervised probabilistic word alignment models do not play a central role in neural machine translation (NMT) (Bahdanau et al., 2015; Sutskever et al., 2014; Luong et al., 2015; Chung et al., 2016). However, attention models, which are crucial for high-quality NMT, have been augmented with ideas from statistical word alignment (Luong et al., 2015; Cohn et al., 2016). Beyond machine translation, word alignments are also important in the best-performing models for several NLP tasks. They play a central role in learning paraphrases in a source language by doing round-trips from source to target and back using word alignments (Ganitkevitch et al., 2013). Alignments also form the basis for learning multi-lingual word embeddings (Faruqui and Dyer, 2014; Lu et al., 2015) and for projecting syntactic and semantic annotations from one language to another (Hwa et al., 2005; McDonald et al., 2011). Therefore, there is still a prominent role for word alignment in NLP, and research into improvements in word alignment is a worthy goal.
Adding additional information such as part-of-speech tags and syntactic parse information has yielded some improvements in word alignment quality. Toutanova et al. (2002) incorporated the part-of-speech (POS) tags of the words in the sentence pair as a constraint on HMM-based word alignment. Additional constraints have also been injected into generative and discriminative models by designing linguistically motivated features (Ittycheriah and Roukos, 2005; Blunsom and Cohn, 2006; Deng and Gao, 2007; Berg-Kirkpatrick et al., 2010). These models provide evidence that additional constraints can help in modelling word alignments in a log-linear model where word-based features can be augmented with morphological, syntactic or semantic features. For example, such a model might learn that function words in one language tend to be aligned to function words in the other language.

Figure 1: An alignment between a Chinese sentence (所以 一定 要 好好 照顾 自己 。) and its English translation ("so you must be sure to take really good care of yourself .") which is enriched with alignment types. SEM (semantic), FUN (function), GIF (grammatically inferred function) and GIS (grammatically inferred semantic) are tags of the links.

In this paper, we propose a novel task: the joint prediction of word alignment and alignment types for a given sentence pair in a parallel corpus. We show how to enhance the alignment model with alignment types. The primary contribution of this paper is to demonstrate the success of the proposed joint model (the alignment-type-enhanced model) in improving word alignment and translation quality. We apply our method to Chinese-English, because annotated alignment type training data is available for this language pair. However, the proposed method is potentially language-independent and can be applied to any language pair for which alignment type annotated data is created. The alignment types themselves may be language dependent and may vary across language pairs.

The Data Set
The Linguistic Data Consortium (LDC) developed a linguistically-enriched word alignment data set: the GALE Chinese-English Word Alignment and Tagging Corpus. This human-annotated data set adds alignment type information to word alignments. The goal was to sub-categorize alignment links by drawing distinctions between different types of alignment; for instance, the corpus distinguishes aligned function words from aligned content words. The ultimate goal was to improve word alignment and translation quality. Figure 1 shows an example of an enriched word alignment with alignment types, extracted from the LDC data. Each link tag in the figure indicates the alignment type between its constituents.
The GALE Chinese-English Word Alignment and Tagging corpus contains 22,313 manually word-aligned sentence pairs, from which we extracted 20,357 sentence pairs for training; we kept the rest as a test set. Table 1 shows the type and number of each alignment type in our training data. We briefly explain the alignment types in the GALE Chinese-English Word Alignment and Tagging Corpus. The SEM tag represents a semantic link between content words/phrases of the source and translation, indicating a direct equivalence; content words are typically nouns, verbs, adjectives and adverbs. FUN refers to a function link, which indicates that a word on either side of the link is a function word. A Grammatically Inferred Function (GIF) link is a link in which stripping off extra words yields a pure function link. In Grammatically Inferred Semantic (GIS) links, stripping off extra words results in a pure semantic link. The alignment types PDE (DE-possessive), CDE (DE-clause) and MDE (DE-modifier) are designed to handle the different functions of the Chinese word 的 (DE). In Contextually Inferred (COI) links, the extra words attached to one side of the link are required: without these words, the grammatical structure might still be acceptable, but it is not semantically sensible. The TIN (Translated Incorrectly) and NTR (Not Translated) types are designed to handle the various errors that occur in the translation process, such as incorrect translation and no translation. MTA (Meta word) was designed to handle special characters that usually appear in the context of web pages.

Table 1: Alignment types and their counts in the training data.
Sub-categorizing different types of word alignments is likely to result in better word alignments. The alignment types provided by the LDC as annotations on each word alignment link have never been used (as far as we are aware) to improve word alignment. A subset of this data was used by Wang et al. (2014) to refine word segmentation for machine translation, but they ignored the alignment link types in their experiments.

Word Alignment
Given a source sentence $\mathbf{f} = f_1, f_2, \ldots, f_J$ and a target sentence $\mathbf{e} = e_1, e_2, \ldots, e_I$, the goal in SMT is to model the translation probability $\Pr(\mathbf{f}|\mathbf{e})$. In alignment models, a hidden variable $\mathbf{a} = a_1, a_2, \ldots, a_J$ is introduced which describes a mapping between source and target words. Using this terminology, $a_j = i$ denotes that $f_j$ is aligned to $e_i$. The translation probability can therefore be written as a marginal probability over all alignments:

$$\Pr(\mathbf{f}|\mathbf{e}) = \sum_{\mathbf{a}} \Pr(\mathbf{f}, \mathbf{a}|\mathbf{e}) \qquad (1)$$

In IBM Model 1, the alignment model is decomposed into the product of translation probabilities as follows:

$$\Pr(\mathbf{f}, \mathbf{a}|\mathbf{e}) = \frac{1}{(I+1)^J} \prod_{j=1}^{J} p(f_j|e_{a_j}) \qquad (2)$$

In the Hidden Markov alignment model, we assume a first-order dependence for the alignments $a_j$. The HMM-based model has the following form:

$$\Pr(\mathbf{f}, \mathbf{a}|\mathbf{e}) = \prod_{j=1}^{J} p(a_j|a_{j-1}, I)\, p(f_j|e_{a_j}) \qquad (3)$$

where $p(a_j|a_{j-1}, I)$ are the alignment probabilities (transition parameters) and $p(f_j|e_{a_j})$ are the translation probabilities (emission parameters). Vogel et al. (1996) make the alignment parameters $p(i|i', I)$ independent of the absolute word positions and assume that $p(i|i', I)$ depends only on the jump width $(i - i')$. Hence, the alignment probabilities are estimated using a set of distortion parameters $c(i - i')$ as follows:

$$p(i|i', I) = \frac{c(i - i')}{\sum_{i''} c(i'' - i')} \qquad (4)$$

where at each EM iteration $c(i - i')$ is the fractional count of transitions with jump width $i - i'$. The HMM network is extended by $I$ NULL words (Och and Ney, 2000a) with the following constraints on the transition probabilities ($i \leq I$, $i' \leq I$):

$$p(i + I|i', I) = p_0\, \delta(i, i') \qquad (5)$$
$$p(i + I|i' + I, I) = p_0\, \delta(i, i') \qquad (6)$$
$$p(i|i' + I, I) = p(i|i', I) \qquad (7)$$

where $\delta$ is the Kronecker delta function. The parameter $p_0$ controls NULL insertion and is optimized on a held-out dataset.
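To make the training procedure concrete, the following is a minimal sketch of IBM Model 1 EM training on a toy corpus; the variable names (t_prob, corpus) and the toy data are ours, not from the paper.

```python
from collections import defaultdict

# Toy parallel corpus of (source f, target e) sentence pairs.
corpus = [(["照顾", "自己"], ["care", "yourself"]),
          (["照顾", "他"], ["care", "him"])]

# Initialize p(f|e) uniformly over co-occurring word pairs.
e_vocab = {e for _, e_sent in corpus for e in e_sent}
t_prob = defaultdict(float)
for f_sent, e_sent in corpus:
    for f in f_sent:
        for e in e_sent:
            t_prob[(f, e)] = 1.0 / len(e_vocab)

for _ in range(5):  # EM iterations
    count = defaultdict(float)  # expected counts c(f|e)
    total = defaultdict(float)  # normalizer per target word e
    for f_sent, e_sent in corpus:
        for f in f_sent:
            # The Model 1 posterior factorizes per source word.
            z = sum(t_prob[(f, e)] for e in e_sent)
            for e in e_sent:
                delta = t_prob[(f, e)] / z
                count[(f, e)] += delta
                total[e] += delta
    for f, e in count:  # M-step: renormalize
        t_prob[(f, e)] = count[(f, e)] / total[e]

print(t_prob[("照顾", "care")])  # rises towards 1.0 over iterations
```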

Joint Model for IBM Model 1 and HMM
We consider two classic generative models, IBM Model 1 (Brown et al., 1993) and the HMM alignment model (Vogel et al., 1996), as our baselines and show how we can enhance these models with alignment types. In this section, we introduce two models (a generative and a discriminative model) for each baseline to jointly find the word alignments and the corresponding alignment types for a sentence pair.

Generative Models

IBM Model 1 with Alignment Types
We augment IBM Model 1 (Equation 2) with alignment type information. In addition to the alignment function $a : j \to i$, our model has a tagging function $h : j \to k$ which specifies the mapping of each alignment link $(f_j, e_i)$ to an alignment type $k$. The alignment type $k$ can be any tag in the set of all possible linguistic tags. The new generative model with alignment types has the following form:

$$\Pr(\mathbf{f}, \mathbf{a}, \mathbf{h}|\mathbf{e}) = \frac{1}{(I+1)^J N^J} \prod_{j=1}^{J} p(f_j, h_j|e_{a_j}) \qquad (8)$$

where $N$ is the number of possible linguistic alignment types. Using the chain rule, we have the following enhanced IBM Model 1 which includes alignment types:

$$p(f_j, h_j|e_{a_j}) = p(f_j|e_{a_j})\, p(h_j|f_j, e_{a_j}) \qquad (9)$$

In order to normalize the probability, we modify the fraction in Equation 2 by adding the term $N^J$, as there are $N$ different alignment types for each alignment link from each source word.
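As a hedged sketch, the joint probability of Equation 8 under the factorization of Equation 9 can be computed as follows; t_prob and h_prob are hypothetical parameter tables, and index 0 is reserved for NULL.

```python
def joint_prob(f_sent, e_sent, alignment, types, t_prob, h_prob, N):
    """Pr(f, a, h | e) for the type-enhanced IBM Model 1 (Equations 8-9).

    alignment[j] = i maps source position j to target position i
    (i = 0 encodes NULL); types[j] = k is the type of link (f_j, e_i)."""
    I, J = len(e_sent), len(f_sent)
    prob = 1.0 / (((I + 1) * N) ** J)  # uniform choice of position and type
    for j in range(J):
        i, k = alignment[j], types[j]
        e = e_sent[i - 1] if i > 0 else "NULL"
        # p(f_j, h_j | e_i) = p(f_j | e_i) * p(h_j | f_j, e_i)
        prob *= t_prob[(f_sent[j], e)] * h_prob[(k, f_sent[j], e)]
    return prob
```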

EM algorithm
Similar to IBM Model 1, we use the EM algorithm to estimate the parameters of our model. In the expectation step, we need to compute the posterior probability $\Pr(\mathbf{a}, \mathbf{h}|\mathbf{f}, \mathbf{e})$, which is the probability of an alignment with its types given the sentence pair. Applying the chain rule gives:

$$\Pr(\mathbf{a}, \mathbf{h}|\mathbf{f}, \mathbf{e}) = \Pr(\mathbf{a}|\mathbf{f}, \mathbf{e}) \times \Pr(\mathbf{h}|\mathbf{a}, \mathbf{f}, \mathbf{e}) \qquad (10)$$

where $\Pr(\mathbf{a}|\mathbf{f}, \mathbf{e})$ is the posterior probability of IBM Model 1:

$$\Pr(\mathbf{a}|\mathbf{f}, \mathbf{e}) = \prod_{j=1}^{J} \frac{p(f_j|e_{a_j})}{\sum_{i=0}^{I} p(f_j|e_i)} \qquad (11)$$

$\Pr(\mathbf{h}|\mathbf{a}, \mathbf{f}, \mathbf{e})$ can be written as a product of alignment type parameters over the individual source and target words and their corresponding alignment type:

$$\Pr(\mathbf{h}|\mathbf{a}, \mathbf{f}, \mathbf{e}) = \prod_{j=1}^{J} p(h_j|f_j, e_{a_j}) \qquad (12)$$

Substituting Equations 11 and 12 into Equation 10 and simplifying results in:

$$\Pr(\mathbf{a}, \mathbf{h}|\mathbf{f}, \mathbf{e}) = \prod_{j=1}^{J} \frac{p(f_j|e_{a_j})\, p(h_j|f_j, e_{a_j})}{\sum_{i=0}^{I} p(f_j|e_i)} \qquad (13)$$

We collect the expected counts over all possible alignments and their alignment types, weighted by their probability. Suppose $c(f, h|e; \mathbf{f}, \mathbf{e})$ is the expected count for a word $e$ generating a word $f$ with an alignment type $h$ in a sentence pair $(\mathbf{f}, \mathbf{e})$:

$$c(f, h|e; \mathbf{f}, \mathbf{e}) = \sum_{\mathbf{a}, \mathbf{h}} \Pr(\mathbf{a}, \mathbf{h}|\mathbf{f}, \mathbf{e}) \sum_{j=1}^{J} \delta(f, f_j)\, \delta(e, e_{a_j})\, \delta(h, h_j) \qquad (14)$$

Plugging $\Pr(\mathbf{a}, \mathbf{h}|\mathbf{f}, \mathbf{e})$ (Equation 13) into Equation 14 yields:

$$c(f, h|e; \mathbf{f}, \mathbf{e}) = \frac{p(f|e)\, p(h|f, e)}{\sum_{i=0}^{I} p(f|e_i)} \sum_{j=1}^{J} \delta(f, f_j) \sum_{i=0}^{I} \delta(e, e_i) \qquad (15)$$

The alignment type parameters are then estimated by:

$$p(h|f, e) = \frac{\sum_{(\mathbf{f}, \mathbf{e})} c(f, h|e; \mathbf{f}, \mathbf{e})}{\sum_{h'} \sum_{(\mathbf{f}, \mathbf{e})} c(f, h'|e; \mathbf{f}, \mathbf{e})} \qquad (16)$$
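A sketch of one EM iteration for the type parameters (Equations 13-16), reusing the corpus and parameter tables of the earlier sketch; note that since the type probabilities sum to one, the per-word normalizer reduces to the Model 1 normalizer.

```python
from collections import defaultdict

def e_step_types(corpus, t_prob, h_prob, tags):
    """Expected counts c(f, h | e) over the corpus (Equations 13-15)."""
    count = defaultdict(float)
    for f_sent, e_sent in corpus:
        for f in f_sent:
            # Types sum to one, so the normalizer is the Model 1 one.
            z = sum(t_prob[(f, e)] for e in e_sent)
            for e in e_sent:
                for h in tags:
                    # posterior that f aligns to e with type h
                    count[(f, h, e)] += t_prob[(f, e)] * h_prob[(h, f, e)] / z
    return count

def m_step_types(count):
    """Re-estimate p(h | f, e) by normalizing over types (Equation 16)."""
    total = defaultdict(float)
    for (f, h, e), c in count.items():
        total[(f, e)] += c
    return {(h, f, e): c / total[(f, e)] for (f, h, e), c in count.items()}
```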
Translation probabilities are estimated similarly to IBM Model 1.
This model is called IBM1+Type+Gen in the experiments section.
After training, we can jointly predict the best alignment and the best alignment types for each sentence pair:

$$(\hat{\mathbf{a}}, \hat{\mathbf{h}}) = \operatorname*{argmax}_{\mathbf{a}, \mathbf{h}} \prod_{j=1}^{J} p(f_j|e_{a_j})\, p(h_j|f_j, e_{a_j}) \qquad (17)$$

In this decoding method, for each source word $f_j$ in a given sentence pair, we go through all the target words $e_{a_j}$ in the target sentence and all the possible alignment types, and find the pair of target position and alignment type that maximizes $p(f_j|e_{a_j})\, p(h_j|f_j, e_{a_j})$.
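A minimal sketch of this joint decoding rule (Equation 17); NULL alignment is omitted for brevity and the parameter tables are assumed from the sketches above.

```python
def decode_joint(f_sent, e_sent, t_prob, h_prob, tags):
    """Per source word, pick the (target position, type) pair maximizing
    p(f_j | e_i) * p(h | f_j, e_i) (Equation 17); NULL omitted for brevity."""
    links = []
    for j, f in enumerate(f_sent):
        i, h = max(((i, h) for i in range(len(e_sent)) for h in tags),
                   key=lambda ih: t_prob[(f, e_sent[ih[0]])]
                                  * h_prob[(ih[1], f, e_sent[ih[0]])])
        links.append((j, i, h))
    return links
```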

HMM with Alignment Types
Our HMM with alignment types has the factor $p(f_j, h_j|e_{a_j})$ in its formulation, which can be further decomposed to give:

$$p(f_j, h_j|e_{a_j}) = p(f_j|e_{a_j})\, p(h_j|f_j, e_{a_j}) \qquad (18)$$

This model is called HMM+Type+Gen in this paper.
We now explain how we can estimate the parameters of this model. A compact representation of the posterior probability of this model is:

$$\Pr(\mathbf{a}, \mathbf{h}|\mathbf{f}, \mathbf{e}) = \Pr(\mathbf{a}|\mathbf{f}, \mathbf{e}) \prod_{j=1}^{J} p(h_j|f_j, e_{a_j}) \qquad (19)$$

Equation 19 confirms that the posterior probability of the HMM+Type+Gen model is the HMM posterior multiplied by the alignment type probability factor $p(h|f_j, e_i)$, which is similar to the relationship between the posterior probabilities in the case of IBM Model 1, as shown in Equation 13. Using this equation, we can compute the expected counts:

$$c(f, h|e; \mathbf{f}, \mathbf{e}) = \sum_{j=1}^{J} \sum_{i=1}^{I} \gamma_j(i)\, p(h|f_j, e_i)\, \delta(f, f_j)\, \delta(e, e_i) \qquad (20)$$

where $\gamma_j(i)$ denotes the HMM posterior probability that $f_j$ is aligned to $e_i$, computed with the forward-backward algorithm. The expected counts are then normalized in the M-step to re-estimate the parameters. The transition and emission parameters are estimated as in the standard HMM-based alignment model.
The EM algorithm for this model is similar to the Baum-Welch algorithm for the standard HMM-based word alignment model. The only change is that in the E-step, we need to collect the alignment type expected counts, and in the M-step, the alignment type parameters are re-estimated.
After training, Viterbi decoding is used to find the best word alignment and alignment types for new sentences. We define $V_i(j, h)$ to be the probability of the most probable alignment for $f_1 \ldots f_j$ such that $f_j$ is aligned to $e_i$ and the alignment type for this link is $h$. It can be computed recursively as follows:

$$V_i(j, h) = p(f_j|e_i)\, p(h|f_j, e_i) \max_{i', h'} \left[ p(i|i', I)\, V_{i'}(j-1, h') \right] \qquad (21)$$
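A sketch of this Viterbi recursion (Equation 21). Because the transition probabilities do not depend on the alignment type, the maximization over h can be done locally at each position; trans, t_prob and h_prob are hypothetical parameter tables.

```python
def viterbi_types(f_sent, e_sent, trans, t_prob, h_prob, tags):
    """Best alignment and types under HMM+Type+Gen (Equation 21).

    trans[(i, i_prev)] = p(i | i_prev, I).  Since the type does not
    affect transitions, h is maximized locally at each (j, i)."""
    I, J = len(e_sent), len(f_sent)
    best_h = [[max(tags, key=lambda k: h_prob[(k, f_sent[j], e_sent[i])])
               for i in range(I)] for j in range(J)]
    emit = lambda j, i: (t_prob[(f_sent[j], e_sent[i])]
                         * h_prob[(best_h[j][i], f_sent[j], e_sent[i])])
    V = [[0.0] * I for _ in range(J)]
    back = [[0] * I for _ in range(J)]
    for i in range(I):
        V[0][i] = emit(0, i) / I  # uniform initial state distribution
    for j in range(1, J):
        for i in range(I):
            ip = max(range(I), key=lambda p: V[j - 1][p] * trans[(i, p)])
            V[j][i] = V[j - 1][ip] * trans[(i, ip)] * emit(j, i)
            back[j][i] = ip
    i = max(range(I), key=lambda p: V[J - 1][p])  # best final state
    path = [(J - 1, i, best_h[J - 1][i])]
    for j in range(J - 1, 0, -1):  # backtrace
        i = back[j][i]
        path.append((j - 1, i, best_h[j - 1][i]))
    return list(reversed(path))
```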

Discriminative Models
Although we can use the generative models explained in Section 4.1 to estimate the alignment type probabilities p(h|f, e), we can alternatively build a classifier to predict the alignment type given a pair of aligned words.
We have a set of 11 possible alignment types in the LDC data, which are the possible classes in the classification problem. We use logistic regression to model the alignment type prediction problem. The rationale for using this model is that it provides both the predicted alignment type and the probability of the pair being classified as that type.
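As an illustration, a minimal alignment type classifier of this kind can be put together with scikit-learn; the toy feature dictionaries and data below are ours and stand in for the full templates of Table 2.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training instances: aligned (Chinese, English) word pairs with gold types.
pairs = [{"c0": "照顾", "e0": "care"}, {"c0": "的", "e0": "of"},
         {"c0": "自己", "e0": "yourself"}, {"c0": "了", "e0": "the"}]
labels = ["SEM", "FUN", "SEM", "FUN"]

clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(pairs, labels)

# predict_proba supplies the p(h | f, e) factor used by the joint decoder.
probs = clf.predict_proba([{"c0": "他", "e0": "him"}])[0]
print(dict(zip(clf.classes_, probs)))
```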

Features
We used 22 different types of features in our logistic regression model, as shown in Table 2. Lexical features are the heart of all lexical translation models; here, they are defined on pairs of Chinese and English words, shown by the feature template (c_0, e_0) in Table 2.
Moreover, we include features taking the context into consideration. For example, (c_{-1}, c_0, e_0) uses the previous Chinese word in addition to the pair of Chinese and English words. Part-of-speech (POS) tags are used to address the sparsity of the lexical features; for example, the POS tags of the pair of Chinese and English words, (ct_0, et_0), are included. We also use the first five letters of the English word in a feature, to approximate the stem of an English word. An example is (c_0, [e_0]_5), where the pair of the Chinese word and the prefix of the English word is used as a feature.

Table 2 (word-based feature templates): (c_0, e_0), (c_{-1}, c_0, e_0), (c_{-2}, c_{-1}, c_0, e_0), (c_0, c_1, c_2, e_0), (c_0, e_{-1}, e_0), (c_0, e_{-1}, e_0, e_1), (c_0, e_0, e_1)
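A sketch of how a few of these templates could be instantiated for the link (c_j, e_i); the <PAD> boundary token and the helper name extract_features are our own choices, not from the paper.

```python
def extract_features(c_sent, e_sent, j, i, c_tags, e_tags):
    """Instantiate a few templates from Table 2 for the link (c_j, e_i)."""
    w = lambda s, k: s[k] if 0 <= k < len(s) else "<PAD>"  # boundary padding
    c0, e0 = c_sent[j], e_sent[i]
    return {
        "c0,e0": "%s|%s" % (c0, e0),                           # lexical pair
        "c-1,c0,e0": "%s|%s|%s" % (w(c_sent, j - 1), c0, e0),  # Chinese context
        "c0,e-1,e0": "%s|%s|%s" % (c0, w(e_sent, i - 1), e0),  # English context
        "ct0,et0": "%s|%s" % (c_tags[j], e_tags[i]),           # POS pair
        "c0,[e0]5": "%s|%s" % (c0, e0[:5]),                    # prefix ~ stem
    }
```

The resulting dictionary can be fed directly to the DictVectorizer of the classifier sketch above.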

EM for the Discriminative Models
We introduce discriminative variants of the generative models explained in Section 4.1, referred to as IBM1+Type+Disc and HMM+Type+Disc. The main difference between these models and their generative counterparts is in the way they compute the alignment type probabilities p(h|f, e): whereas the generative models estimate these probabilities using the EM algorithm, the discriminative models obtain them from the logistic regression classifier. For the discriminative models, we first train a logistic regression model on the LDC data (see Section 5.1.2). The model provides us with the alignment type probabilities, which are used in the decoding stage. For the IBM1+Type+Gen model, expected counts for alignment types are collected and the alignment type parameters are updated in each iteration. In the EM algorithm for IBM1+Type+Disc, however, we do not need to collect the expected counts for alignment types, since these parameter values are obtained from the pre-trained logistic regression classifier. There is, however, an important difference in the decoding step: Equation 17 is used to jointly find the best alignment and alignment types for each sentence pair. This joint decoding step makes this approach different from simply using a pipeline of an EM-trained model followed by a discriminative classifier on the Viterbi output of the EM-trained model. A comparison with the pipeline model is given in Section 5.2. Similarly, the EM training of HMM+Type+Disc is the same as the EM training of the baseline HMM. For decoding a new sentence pair, Equation 21 is used.

Experiments
For the experiments, we have used two datasets. The first is the GALE Chinese-English Word Alignment and Tagging corpus, released by the LDC (catalog numbers LDC2012T16, LDC2012T20, LDC2012T24, LDC2013T05, LDC2013T23 and LDC2014T25). This dataset is annotated with gold alignments and alignment types (see Section 2 for more details). The second dataset is the Hong Kong parliament proceedings (HK Hansards), for which we do not have gold alignments or alignment types.
We used 1 million sentences of the HK Hansards in the experiments to augment the training data. In the following sections, we describe three experiments. First, we examine how effective the logistic regression classifier is for alignment type prediction. Second, we present our experiments for two tasks: word alignment and the joint prediction of word alignment and alignment types. Finally, we explain the machine translation experiment. All our code for the baselines and the proposed models is available at https://github.com/sfu-natlang/align-type-tacl2017-code.

Alignment Type Prediction Given Alignments
For the alignment type prediction task given an aligned word pair, we have examined three simple maximum likelihood classifiers, as well as the logistic regression classifier with the features shown in Table 2. We trained all these classifiers on the parallel Chinese-English 20K LDC data, which is annotated with gold alignments and alignment types. To obtain the word pairs, we extracted them from the parallel sentences using the gold alignment. To get the part-of-speech tags, we annotated the 20K LDC data with the Stanford POS tagger (Toutanova et al., 2003). We ignored a gold alignment if the Chinese side of the alignment is not contiguous, i.e., it cannot form one Chinese word; this usually happens in many-to-one and many-to-many alignments. There were only a small number of these discontiguous alignments, as mentioned in the LDC catalog entries for this data.

Maximum Likelihood Classifiers
We have examined three maximum likelihood (ML) classifiers. The first model is a word-based ML classifier that uses the maximum likelihood estimate of the alignment type parameters p(h|f, e), computed from the training data, to predict the alignment type for a new given pair of aligned words in a sentence pair in the test data. If the aligned words were not seen in the training data, this model backs off to SEM, as it is the most probable alignment type.
The second model, a tag-based ML classifier, uses the maximum likelihood estimate of the p(h|t_f, t_e) parameters of a model trained on the POS-tagged data, where t_f and t_e are the POS tags of the Chinese word f and the English word e, respectively. It backs off to SEM for unseen pairs of POS tags. Finally, for a word pair (f, e), the last classifier first uses the ML estimate of the p(h|f, e) parameter. For unseen pairs of words, it backs off to the ML estimate of p(h|t_f, t_e), and in case the pair of POS tags was not seen either, it backs off to SEM.
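A sketch of this back-off chain, assuming p_word and p_tag are hypothetical MLE tables mapping a conditioning context to a distribution over types:

```python
def predict_type(f, e, t_f, t_e, p_word, p_tag):
    """Back-off chain: word pair -> POS tag pair -> SEM.

    p_word[(f, e)] and p_tag[(t_f, t_e)] are MLE distributions over types."""
    if (f, e) in p_word:
        return max(p_word[(f, e)], key=p_word[(f, e)].get)
    if (t_f, t_e) in p_tag:
        return max(p_tag[(t_f, t_e)], key=p_tag[(t_f, t_e)].get)
    return "SEM"  # most frequent alignment type overall
```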

Logistic Regression Classifier
We evaluated the logistic regression classifier, which makes use of the features shown in Table 2, and combinations of different subsets of these features. We assessed the performance of our features using 10-fold cross-validation. The best average cross-validation accuracy of 81.5% was achieved by a classifier that combines all 22 features shown in Table 2. We used this trained classifier for the discriminative models (IBM1+Type+Disc and HMM+Type+Disc) in the experiments reported in Section 5.2. Table 3 shows the accuracy of the classifiers on the 2K sentences used as held-out data; the logistic regression classifier achieved the best accuracy on the test data. Since the logistic regression classifier obtains 87.5% accuracy on the training data, and the cross-validation accuracy variance was small, we do not believe the classifier overfits on our training data.

Table 3: Accuracy of the alignment type classifiers given the alignment.

Joint Word Alignment and Alignment Type Experiments
We evaluated the performance of our models and the baseline models, measured by precision, recall, and F1-score, on two different tasks: (1) the traditional word alignment task, and (2) the joint prediction of word alignment and alignment types. The second task is harder, as the model has to predict both word alignment and alignment types correctly. Moreover, as the baseline IBM Model 1 and the baseline HMM cannot predict alignment types, we can only compare our generative and discriminative models on the second task.
We initialized the translation probabilities of Model 1 uniformly over the word pairs that occur together in the same sentence pair.
We built an HMM similar to the one proposed by Och and Ney (2003); this model is referred to as HMM in this paper. HMM was initialized with uniform transition probabilities and Model 1 translation probabilities. Model 1 was trained for 5 iterations, followed by 5 iterations of HMM.
To handle unseen data when the model is applied to the test data, smoothing is used. We smooth the translation probability p(f|e) by backing off to a uniform probability 1/|V|, where |V| is the source vocabulary size.
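This back-off can be expressed as a one-line lookup; the function name is illustrative.

```python
def smoothed_t_prob(f, e, t_prob, src_vocab_size):
    """p(f | e), backing off to a uniform 1/|V| for unseen pairs."""
    return t_prob.get((f, e), 1.0 / src_vocab_size)
```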
t_f and t_e are the POS tags of the Chinese word f and the English word e, respectively. To obtain p(h|t_f, t_e), we labelled the parallel data with the Stanford POS tagger (Toutanova et al., 2003).

Table 4: Word Alignment (WA) task: models trained on the 20K LDC data.

To learn the hyper-parameters, we split the 20K LDC training data into two sets: a train set of 18K sentences and a 2K validation set. To learn p_0 and the NULL emission probability, we performed a two-dimensional grid search, varying p_0 over the set {0.05, 0.1, 0.2, 0.3, 0.4} and the NULL emission probability over the set {1e-7, 5e-7, ..., 1e-2, 5e-2, 1e-1}. The best result was achieved when p_0 = 0.3 and the NULL emission probability was 5e-6. To tune the hyper-parameters λ_1, λ_2 and λ_3, we performed another two-dimensional grid search (λ_2 is determined by the other two). The best result was achieved when λ_1 = 0.9999999999 and λ_3 = 1e-15; hence, λ_2 = 1 − λ_1 − λ_3 ≈ 9.99e-11. We then used these learned parameters in the experiments.
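A sketch of the two-dimensional grid search described above; train and validate are hypothetical callables that build a model for a hyper-parameter setting and score it on the validation set.

```python
import itertools

def grid_search(train, validate, p0_grid, null_emit_grid):
    """Return the (p0, NULL emission) pair with the best validation score."""
    return max(itertools.product(p0_grid, null_emit_grid),
               key=lambda cfg: validate(train(*cfg)))

# e.g. grid_search(train, validate,
#                  [0.05, 0.1, 0.2, 0.3, 0.4],
#                  [1e-7, 5e-7, 1e-6, 5e-6, 1e-5, 1e-4, 1e-3, 1e-2, 5e-2, 1e-1])
```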
Finally, for HMM-based models, we smooth the transition parameters p(i|i', I) by backing off to a uniform prior 1/I.

Table 5: Results of the models trained on 20K LDC data (22K LDC data for GIZA++) and tested on 2K LDC test data for (1) the joint prediction of word alignment and alignment types task, and (2) word alignment models followed by the discriminative classifier to predict alignment types.

Results on the LDC Alignment Type Data
We can see that our generative models consistently outperform their corresponding baselines. The best performing model, HMM+Type+Gen, achieves up to 13.9% improvement in F1-score over the baseline HMM. To compare our models against GIZA++, we add the test data to the training data and use Moses (Koehn et al., 2007) with its default parameters to obtain word alignments; we report its performance on the test data. Unlike the other models in Table 4, which are trained on 20K data, the GIZA++ model is trained on 22K data. Table 5 shows the results obtained for the joint prediction of word alignment and alignment types task. As mentioned previously, the basic IBM Model 1 and HMM are incapable of predicting the alignment types and hence are not included in this table. However, it is interesting to compute word alignments using our baselines and then apply the logistic regression classifier on the alignments to get the corresponding alignment types. In Table 5, we compare our discriminative models with these pipeline models; the difference between them is in the decoding step. The former jointly predict word alignment and alignment types, while the latter perform word alignment and then apply the classifier on the output of word alignment to obtain the alignment types. We also computed word alignments using GIZA++, as explained for the previous experiment, and then ran our logistic regression classifier on the alignments to get the corresponding alignment types. This model is denoted as GIZA++→Disc in Table 5. The results in this table show that the generative model outperforms its discriminative counterpart. As in the previous experiment, the HMM+Type+Gen model achieved the best result.

Results with Augmented Model
We conducted another experiment to see whether we can improve the current results by augmenting the training data. We trained on the 20K LDC data with gold alignments and alignment types, plus 1 million HK Hansards sentence pairs which have no alignment or alignment type annotations, and tested on the 2K sentences used as held-out data. Although the HK Hansards data is not annotated, it can augment our vocabulary. We built a model with the 20K LDC data; we call it the LDC model. We then trained a model using the 20K LDC data and the 1 million HK Hansards data; we refer to this as the augmented model. The alignment type parameters of the augmented model are initialized based on the maximum likelihood estimates from the 20K LDC data. Table 6 shows the results of the augmented model for the word alignment task. Table 7 shows the results of this model for the joint prediction of word alignment and alignment types task.

Table 7: Results using the augmented model for (1) the joint prediction of word alignment and alignment types task, and (2) word alignment models followed by the discriminative classifier to predict alignment types.

Results with Augmented Model and Back-off Smoothing
Purely using the augmented model was not effective in estimating the translation probabilities p(f|e), and hence did not contribute to any improvement compared to the previous experiment. This is due to the fact that the HK Hansards data is from a different domain than our LDC test data. Since the 2K test data is from the LDC data, we applied a back-off smoothing technique: we estimated p(f|e) from the LDC model if the word pair (f, e) was seen by the LDC model, and we used the augmented model to compute p(f|e) otherwise. Table 8 shows the results of the augmented model after the smoothing step for the word alignment task; compared to the results in Table 6, the back-off smoothing improves the performance.

Table 9: Results with back-off using the augmented model for (1) the joint prediction of word alignment and alignment types task, and (2) word alignment models followed by the discriminative classifier to predict alignment types.
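A sketch of this back-off smoothing between the two models, assuming each model is simply a dictionary from word pairs to probabilities:

```python
def backoff_t_prob(f, e, ldc_model, aug_model):
    """Prefer the in-domain LDC estimate of p(f|e) when the pair was seen
    there; otherwise fall back to the LDC + HK Hansards augmented model."""
    if (f, e) in ldc_model:
        return ldc_model[(f, e)]
    return aug_model.get((f, e), 0.0)
```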
The results for the joint prediction task are shown in Table 9. This confirms our success in improving the performance of all the methods, compared to the results in Table 5.
Statistical significance tests were performed using the approximate randomization test (Yeh, 2000) with 10,000 iterations. The generative models significantly outperform their baseline and discriminative counterparts (p-value < 0.0001).

Machine Translation Experiment
To see whether the improvement in F1-score from our generative model also improves the BLEU score, we aligned the 20K LDC data and 1 million sentences of the HK Hansards data using the augmented model and tested on 919 sentences of MTC part 4 (LDC2006T04). We trained models in each translation direction and then symmetrized the produced alignments using the grow-diag-final heuristic (Och and Ney, 2003). We used Moses (Koehn et al., 2007) with standard features and tuned the weights with MERT (Och, 2003). An English 5-gram language model was trained using KenLM (Heafield, 2011) on the Gigaword corpus (Parker et al., 2011). We give a comparison between the HMM+Type+Gen model, our baseline HMM, GIZA++ HMM and standard GIZA++ (as used by Moses) in Table 10. We report BLEU scores and TER computed using MultEval (Clark et al., 2011).
The generative model improves over GIZA++ HMM by 1.0 BLEU points, and over standard GIZA++ by 1.2 BLEU points. Figure 2 compares example alignments produced by the baseline HMM and the HMM+Type+Gen model. In both examples, the HMM+Type+Gen model identifies difficult alignments over long distances better than the baseline HMM. For example, Figure 2(a) illustrates how our baseline HMM makes a mistake by aligning the Chinese word "。" to "with", possibly because the transition probabilities were dominant in the baseline HMM. The HMM+Type+Gen model, however, avoids this mistake by making use of the alignment type information: it takes into account the fact that "。" and "." are function words and should be aligned to each other with a FUN tag. Figure 2(b) shows that the HMM+Type+Gen model favors aligning 见面 (meet) to "meet", whereas the baseline HMM incorrectly aligns 见面 (meet) to "jintao". We hypothesize that this occurs because p(SEM|见面, meet) has a high value.
To give a detailed analysis of the precision of the generative model in alignment type prediction, we present a confusion matrix on the test data in Table 11, where the vertical axis represents the actual alignment type and the horizontal axis represents the predicted alignment type. From the confusion matrix, we found that our model works well in predicting the SEM, FUN, GIS, GIF, MDE and CDE alignment types, since the numbers on the diagonal are the largest in their rows. PDE is hard to distinguish from MDE. COI and TIN can easily be mispredicted by the model. NTR and MTA are omitted from this table as all the predictions for these alignment types are zero. For MTA, this is probably because the type rarely occurs in our training data. An alignment type is NTR if either the Chinese or the English list of tokens for that alignment is empty; in other words, the NTR alignment type is used when some words are dropped during the translation process. We could predict NTR for the Chinese words that are aligned to NULL. However, predicting NTR in such cases worsened the F1-score of the generative model (a 2.0 point drop for the HMM+Type+Gen model). Hence, we do not predict the NTR alignment type for any Chinese words. In total, 10,216 alignments (or 25.54% of all alignments) are not included in Table 11, which shows the alignment type predictions only for the word pairs that were correctly aligned by HMM+Type+Gen. We should note that these incorrectly aligned word pairs are only kept out of the confusion matrix; all alignments, correct or incorrect, are included in the results we show in the other tables.

       SEM   FUN   GIS  GIF  MDE  PDE  CDE  COI  TIN
SEM  11374   136  2002   10    0    0    0   14    3
FUN    196  8172    21   16    2    0    0    1    1
GIS   2790    31  3312   18    2    1    2    8    0
GIF     16   118    26  772    0    0    0    0    2
MDE      0     0     1    0  293   26    2    0    0
PDE      1     0     1    2   40   55    0    0    0
CDE      0     5     0    0    0    0   79    0    0
COI     91     2    48    0    0    0    0   38    0
TIN     22    12    11    3    0    0    0    0    5

Table 11: Confusion matrix of the HMM+Type+Gen model on the LDC test data. The vertical axis represents the actual alignment type and the horizontal axis represents the predicted alignment type.

Related Work
There have been several studies on semi-supervised word alignment models. Callison-Burch et al. (2004) improve alignment and translation quality by interpolating hand-annotated, word-aligned data and automatic sentence-aligned data. They showed that a much higher weight should be assigned to the model trained on word-aligned data. Fraser and Marcu (2006) propose a semi-supervised training approach to word alignment, based on IBM Model 4, that alternates the EM step, which is applied on a large training corpus, with a discriminative error training step on a small hand-annotated sub-corpus. The alignment problem is viewed as a search problem over a log-linear space with features (sub-models) coming from IBM Model 4. In the proposed algorithm, discriminative training controls the contribution of the sub-models while an EM-like procedure is used to estimate the sub-model parameters.

Unlike previous approaches (Och and Ney, 2003; Fraser and Marcu, 2006; Fraser and Marcu, 2007) that use discriminative methods to tune the weights of generative models, Gao et al. (2010b) propose a semi-supervised word alignment technique that integrates discriminative and generative methods. They use a discriminative word aligner to produce high-precision partial alignments that serve as constraints for the EM algorithm. The discriminative word aligner uses the generative aligner's output as features. This feedback loop iteratively improves the quality of both aligners. Niehues and Vogel (2008) propose a discriminative model that directly models the alignment matrix. Although the discriminative model provides the flexibility to use manually word-aligned data to tune its weights, it still relies on the model parameters of the IBM models and alignment links from GIZA++ as features. Gao et al. (2010a) present a semi-supervised algorithm that extends IBM Model 4 by using partial manual alignments; the partial alignments are fixed and treated as constraints in the EM training. DeNero and Klein (2010) present a supervised model for extracting phrase pairs under a discriminative model by using word alignments. They consider two types of alignment links, sure and possible, that are extracted from the manually word-aligned data; possible alignment links dictate which phrase pairs can be extracted from a sentence pair.
Among the unsupervised methods, Toutanova et al. (2002) utilize an additional source of information apart from the parallel sentences: part-of-speech tags of the words in the sentence pair are incorporated as a linguistic constraint on the HMM-based word alignment. The part-of-speech tag translation probabilities in this model are then learned along with the other probabilities using the EM algorithm. POS tags as used by Toutanova et al. (2002) were also utilized to act similarly to word classes in (Och and Ney, 2000a; Och and Ney, 2000b); however, the improvements provided by the HMM with the POS tag model over the HMM alignment model of Och and Ney (2000b) were only for small training data sizes (<50K parallel sentences).
All previous studies on word alignment have assumed that word alignments are untyped. To our knowledge, the alignment types provided by the LDC as annotations on word alignment links have never been used to improve word alignment. Our work differs from previous work in that it proposes a new task of jointly predicting word alignment and alignment types, and presents a semi-supervised learning algorithm to solve this task. Our method is semi-supervised as it combines LDC data, which is annotated with alignments and alignment types, with sentence-aligned (but not word-aligned) data from the HK Hansards corpus. Our generative algorithm makes use of the gold alignment and alignment type data to initialize the alignment type parameters; EM training is then used to re-estimate the parameters of the model in an unsupervised manner. We also use POS tags to smooth the alignment type parameters, unlike the approach in Toutanova et al. (2002).

Conclusion
In this paper, we introduced new probabilistic models for augmenting word alignments with linguistically motivated alignment types. Our proposed HMM-based aligners with alignment types achieved up to 13.9% improvement in the alignment F1-score over the baseline. The BLEU score improved by 1.2 points over the standard GIZA++ aligner. In the future, we plan to use alignment type information as a feature function for feature-rich word alignment models and explore how alignment types can improve attention models for neural MT. The alignment types we predict can also be used for other tasks, such as projecting part-of-speech tags and dependency trees from a resource-rich language to a resource-poor language.