Domain Adaptation for Syntactic and Semantic Dependency Parsing Using Deep Belief Networks

In current systems for syntactic and semantic dependency parsing, a very high-dimensional feature space is usually defined to achieve good performance. But these systems often suffer severe performance drops on out-of-domain test data due to the diversity of features across domains. This paper focuses on how to relieve this domain adaptation problem with the help of unlabeled target domain data. We propose a deep learning method to adapt both syntactic and semantic parsers. With additional unlabeled target domain data, our method learns a latent feature representation (LFR) that is beneficial to both domains. Experiments on the English data of the CoNLL 2009 shared task show that our method largely reduces the performance drop on out-of-domain test data. Moreover, we obtain a Macro F1 score that is 2.32 points higher in out-of-domain tests than that of the best system in the CoNLL 2009 shared task.

Introduction
Both syntactic and semantic dependency parsing are standard tasks in the NLP community. State-of-the-art models perform well if the test data comes from the same domain as the training data, but if the test data comes from a different domain, performance drops severely. The results of the CoNLL 2008 and 2009 shared tasks (Surdeanu et al., 2008; Hajič et al., 2009) substantiate this. To relieve this domain adaptation problem, in this paper, we propose a deep learning method for both syntactic and semantic parsers. We focus on the situation where, besides source domain training data and target domain test data, we also have some unlabeled target domain data.
Many syntactic and semantic parsers are developed in a supervised learning paradigm, where each data sample is represented as a vector of features, usually a high-dimensional one. The performance degradation on target domain test data is mainly caused by the diversity of features across domains, i.e., many features in the target domain test data are never seen in the source domain training data.
Previous work has shown that using word clusters to replace sparse lexicalized features (Koo et al., 2008; Turian et al., 2010) helps relieve the performance degradation on the target domain. But syntactic and semantic parsing also uses many syntactic features, i.e., features extracted from syntactic trees. For example, the relation path between a predicate and an argument is a syntactic feature used in semantic dependency parsing (Johansson and Nugues, 2008). Figure 1 shows an example of this relation path feature. Syntactic features like this are obviously also very sparse and usually specific to each domain, and the clustering method fails to generalize them. Our method is very different from clustering specific features and substituting them with their clusters. Instead, we attack the domain adaptation problem by learning a latent feature representation (LFR) shared across domains, which is similar to Titov (2011). Formally, we propose a Deep Belief Network (DBN) model to represent a data sample as a vector of latent features. This latent feature vector is inferred by our DBN model (Blitzer, 2006). Discriminative models using our latent features adapt better to the target domain than models using the original features. Discriminative models in syntactic and semantic parsers usually use millions of features, and applying a typical DBN to learn a sensible LFR over that many original features is computationally too expensive to be practical (Raina et al., 2009). Therefore, we constrain the DBN by splitting the original features into groups, which largely reduces the computational cost and makes LFR learning practical. We carried out experiments on the English data of the CoNLL 2009 shared task, using a basic pipelined system to compare the effectiveness of two feature representations: the original feature representation and our LFR.
Using the original features, the performance drop on out-of-domain test data is 10.58 points in Macro F1 score; using the LFR, the drop is only 4.97 points. We achieve a Macro F1 score of 80.83% on the out-of-domain test data which, as far as we know, is the best result on this data set to date.

Related Work
Dependency parsing and semantic role labeling are two standard tasks in the NLP community, and there has been much work on both (McDonald et al., 2005; Gildea and Jurafsky, 2002; Yang and Zong, 2014; Zhuang and Zong, 2010a; Zhuang and Zong, 2010b, etc.). Among this work, research on domain adaptation for dependency parsing and SRL is directly related to ours. Dredze et al. (2007) showed that domain adaptation is hard for dependency parsing, based on results in the CoNLL 2007 shared task. Chen et al. (2008) adapted a syntactic dependency parser by learning reliable information on shorter dependencies in unlabeled target domain data, but they did not consider semantic dependency parsing. Huang et al. (2010) used an HMM-based latent variable language model to adapt an SRL system; their method is tailored to a chunking-based SRL system and can hardly be applied to our dependency-based task. Weston et al. (2008) used deep neural networks to improve an SRL system, but their tests are on in-domain data.
On methodology, the work of Glorot et al. (2011) and Titov (2011) is closely related to ours: they also focus on learning LFRs for domain adaptation. However, their work deals with domain adaptation for sentiment classification, which uses far fewer features and training samples, so they do not need to worry about computational cost as much as we do. Titov (2011) used a graphical model with only one layer of hidden variables; in contrast, we need a model with two layers of hidden variables and split the first hidden layer to reduce computational cost. The model of Titov (2011) also embodies a specific classifier, whereas our model is independent of the classifier to be used. Glorot et al. (2011) used Stacked Denoising Auto-Encoders, which also contain multiple hidden layers, but they do not exploit the hierarchical structure of their model to reduce computational cost. By splitting, our model contains far fewer parameters than theirs. In fact, the models of Glorot et al. (2011) and Titov (2011) cannot be applied to our task simply because of their high computational cost.

Our DBN Model for LFR
In discriminative models, each data sample is represented as a vector of features. Our DBN model maps this original feature vector to a vector of latent features, and we use the latent feature vector to represent the sample, i.e., we replace the whole original feature vector with the latent feature vector. In this section, we introduce how our DBN model represents a data sample as a vector of latent features. Before introducing our DBN model, we first review a simpler model, the Restricted Boltzmann Machine (RBM), which is used as the basic building block when training a DBN.

Restricted Boltzmann Machines
An RBM is an undirected graphical model with a layer of visible variables v = (v_1, ..., v_m) and a layer of hidden variables h = (h_1, ..., h_n). These variables are binary. Figure 2 shows a graphical representation of an RBM. The parameters of an RBM are θ = (W, a, b), where W = (W_ij)_{m×n} is a matrix with W_ij being the weight of the edge between v_i and h_j, and a = (a_1, ..., a_m), b = (b_1, ..., b_n) are bias vectors for v and h respectively. The probabilistic model of an RBM is:

P(v, h) = exp(−E(v, h)) / Z,    (1)

where E(v, h) = −∑_i a_i v_i − ∑_j b_j h_j − ∑_{i,j} v_i W_ij h_j is the energy function and Z = ∑_{v,h} exp(−E(v, h)) is the partition function. Because the connections in an RBM are only between visible and hidden variables, the conditional distribution over a hidden or a visible variable is quite simple:

P(h_j = 1 | v) = σ(b_j + ∑_i v_i W_ij),    (2)
P(v_i = 1 | h) = σ(a_i + ∑_j W_ij h_j),    (3)

where σ(x) = 1/(1 + exp(−x)) is the logistic sigmoid function. An RBM can be efficiently trained on a sequence of visible vectors using the Contrastive Divergence method (Hinton, 2002).
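The RBM conditionals and a one-step Contrastive Divergence (CD-1) update can be sketched in NumPy as follows. This is a minimal illustration; the class and method names are ours, not from the paper, and practical details such as momentum and weight decay are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Minimal binary RBM trained with one-step Contrastive Divergence."""
    def __init__(self, m, n, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((m, n))  # visible-hidden weights
        self.a = np.zeros(m)                         # visible biases
        self.b = np.zeros(n)                         # hidden biases
        self.rng = rng

    def hidden_probs(self, v):
        # P(h_j = 1 | v) = sigma(b_j + sum_i v_i W_ij)
        return sigmoid(v @ self.W + self.b)

    def visible_probs(self, h):
        # P(v_i = 1 | h) = sigma(a_i + sum_j W_ij h_j)
        return sigmoid(h @ self.W.T + self.a)

    def cd1_update(self, v0, lr=0.1):
        # Positive phase: hidden activations driven by the data
        ph0 = self.hidden_probs(v0)
        h0 = (self.rng.random(ph0.shape) < ph0).astype(float)
        # Negative phase: one Gibbs step back down and up
        pv1 = self.visible_probs(h0)
        ph1 = self.hidden_probs(pv1)
        # Gradient approximation, averaged over the mini-batch
        batch = v0.shape[0]
        self.W += lr * (v0.T @ ph0 - pv1.T @ ph1) / batch
        self.a += lr * (v0 - pv1).mean(axis=0)
        self.b += lr * (ph0 - ph1).mean(axis=0)
```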

The Problem of Large Scale
In our syntactic and semantic parsing tasks, all features are binary, so each data sample (a parsing action in syntactic parsing or an argument candidate in semantic parsing) is represented as a binary feature vector. By treating a sample's feature vector as the visible variable vector of an RBM and taking the hidden variables as latent features, we could get the LFR of this sample using the RBM. However, for our syntactic and semantic parsing tasks, training such an RBM is computationally impractical, for the following reason. Let m and n denote the number of visible and hidden variables in the RBM respectively; then the RBM has O(mn) parameters. If we train the RBM on d samples, the time complexity of Contrastive Divergence training is O(mnd). For syntactic or semantic parsing, there are over 1 million unique binary features and millions of training samples, so both m and d are on the order of 10^6. With m and d of that order, n should not be chosen too small if we want a sensible LFR (Hinton, 2010); our experience indicates that n should be at least on the order of 10^3. This is why the O(mnd) complexity is formidable for our task.

Our DBN Model
A DBN is a probabilistic generative model composed of multiple layers of stochastic latent variables. The motivation for using a DBN is two-fold. First, previous research has shown that a deep network can capture high-level correlations between visible variables better than an RBM (Bengio, 2009). Second, as shown in the preceding subsection, the large scale of our task poses a great challenge for learning an LFR. By manipulating the hierarchical structure of a DBN, we can significantly reduce the number of parameters in the model, which largely reduces the computational cost of training. Without this technique, it would be impractical to learn a DBN model with so many parameters on large training sets.
As shown in Figure 3, our DBN model contains two layers of hidden variables, h_1 and h_2, and a visible vector v. The visible vector corresponds to a sample's original feature vector; the second-layer hidden variable vector h_2 is used as the LFR of this sample.
Suppose there are m, n_1 and n_2 variables in v, h_1 and h_2 respectively. To reduce the number of parameters in the DBN, we split its first layer (h_1–v) into k groups, as we explain in the following subsection, and confine the connections in this layer to variables within the same group. So there are only mn_1/k parameters in the first layer, whereas without splitting there would be mn_1, too many to learn with reasonable computation. By splitting, we reduce the number of parameters by a factor of k; if we choose k big enough, learning is feasible.
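The factor-of-k reduction can be checked with a quick parameter count. The sketch below is ours; the figures plugged in are the syntactic-parsing settings reported later in the experiments section (m = 748,598; n_1 = 7,486; n_2 = 3,743; k = 200), and integer division stands in for the uneven last group.

```python
def dbn_param_counts(m, n1, n2, k):
    """Weight counts for the unsplit vs. split first layer, plus the
    fully connected second layer (biases ignored for simplicity)."""
    unsplit_first = m * n1
    # each of the k groups connects ~m/k visible to ~n1/k hidden variables
    split_first = k * (m // k) * (n1 // k)
    second = n1 * n2
    return unsplit_first, split_first, second

# Syntactic-parsing setting from the experiments section:
unsplit, split, second = dbn_param_counts(748_598, 7_486, 3_743, 200)
# the split first layer is roughly k = 200 times smaller than the unsplit one
```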
The second layer (h_2–h_1) is fully connected, so that the variables in the second layer can capture the relations between variables in different groups of the first layer. There are n_1 n_2 parameters in the second layer. Because n_1 and n_2 are relatively small, learning the parameters of the second layer is also feasible.
In summary, by splitting the first layer into groups, we have largely reduced the number of parameters in our DBN model, which makes learning it practical for our task. In our task, the visible variables correspond to the original binary features, and the second-layer hidden variables are used as the LFR of these original features. One deficiency of splitting is that relationships between original features in different groups cannot be captured by hidden variables in the first layer. However, this deficiency is compensated by the second layer, which captures relationships among all variables in the first layer and thus, indirectly, among all original features.

Splitting Features into Groups
When we split the first layer into k groups, every group except the last contains m/k visible variables and n_1/k hidden variables; the last group contains the remaining visible and hidden variables. But how should we split the visible variables, i.e., the original features, into these groups? There are many ways to split the original features, and it is difficult to find a principled criterion, so we tried two splitting strategies in this paper. The first strategy is very simple: we arrange all features in the order they appear in the training data. Suppose each group contains r original features. We just put the first r unique features of the training data into the first group, the following r unique features into the second group, and so on.
The second strategy is more sophisticated. All features can be divided into three categories: common features, source-specific features and target-specific features. The main idea is to make each group contain the three categories of features evenly, which we think makes each group's distribution of features close to the 'true' distribution over domains. Let F_s and F_t denote the sets of features that appear in source and target domain data respectively; we collect F_s and F_t from our training data, ordering the features as they appear in it. Let F_s∩t = F_s ∩ F_t (the common features), F_s\t = F_s \ F_t (the source-specific features), and F_t\s = F_t \ F_s (the target-specific features). To distribute features in F_s∩t, F_s\t and F_t\s evenly, each group should consist of |F_s∩t|/k, |F_s\t|/k and |F_t\s|/k features from F_s∩t, F_s\t and F_t\s respectively. Therefore, we put the first |F_s∩t|/k features from F_s∩t, the first |F_s\t|/k features from F_s\t and the first |F_t\s|/k features from F_t\s into the first group; similarly, we put the second |F_s∩t|/k, |F_s\t|/k and |F_t\s|/k features from each set into the second group, and so on. The intuition of this strategy is to let features in F_s∩t act as pivot features that link the features in F_s\t and F_t\s within each group. In this way, the first hidden layer might better capture relationships between features from the source and target domains.
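The second splitting strategy can be sketched in a few lines. This is our own illustrative implementation, assuming each feature list contains unique features in first-appearance order; the last group absorbs the remainder from each category, as in the text.

```python
def split_features(src_feats, tgt_feats, k):
    """Strategy 2: distribute the common, source-specific and
    target-specific features evenly over k groups, preserving the
    order of first appearance."""
    src_set, tgt_set = set(src_feats), set(tgt_feats)
    common = [f for f in src_feats if f in tgt_set]      # F_s intersect F_t
    src_only = [f for f in src_feats if f not in tgt_set]  # F_s \ F_t
    tgt_only = [f for f in tgt_feats if f not in src_set]  # F_t \ F_s
    groups = [[] for _ in range(k)]
    for ordered in (common, src_only, tgt_only):
        size = len(ordered) // k
        for g in range(k):
            # last group takes whatever remains of this category
            chunk = ordered[g * size:(g + 1) * size] if g < k - 1 \
                else ordered[(k - 1) * size:]
            groups[g].extend(chunk)
    return groups
```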

LFR of a Sample
Given a sample represented as a vector of original features, our DBN model represents it as a vector of latent features. The sample's original feature vector corresponds to the visible vector v of our DBN model in Figure 3, and the model uses the second-layer hidden variable vector h_2 to represent the sample. Therefore, we must infer the values of the hidden variables in the second layer given the visible vector. Given the visible vector, the values of the hidden variables in every layer can be efficiently inferred in a single bottom-up pass.
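The bottom-up pass can be sketched as follows. This is our own sketch of the inference, not the paper's code: each first-layer group maps its slice of the feature vector to hidden activation probabilities, the concatenated first-layer activations feed the fully connected second layer, and the second layer's activation probabilities serve as the sample's LFR.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def infer_lfr(v, group_slices, group_params, W2, b2):
    """Single bottom-up inference pass through the split DBN.

    v            : batch of original binary feature vectors
    group_slices : slice of v owned by each first-layer group
    group_params : (W, b) weight matrix and hidden bias per group
    W2, b2       : fully connected second-layer parameters
    """
    # first layer: each group sees only its own slice of v
    h1 = np.concatenate(
        [sigmoid(v[..., sl] @ W + b)
         for sl, (W, b) in zip(group_slices, group_params)],
        axis=-1)
    # second layer: activation probabilities used as the LFR
    return sigmoid(h1 @ W2 + b2)
```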

Training Our DBN Model
Inference in a DBN is simple and fast. Training a DBN, however, is more complicated. A DBN can be trained in two stages: greedy layer-wise pretraining and fine tuning.

Greedy Layer-wise Pretraining
In this stage, the DBN is treated as a stack of RBMs, as shown in Figure 4.
The second layer is treated as a single RBM. The first layer is treated as k parallel RBMs, with each group being one RBM. These k RBMs are parallel because their visible variable vectors constitute a partition of the original feature vector. In this stage, we train these constituent RBMs in a bottom-up, layer-wise manner.
To learn the parameters of the first layer, we only need to learn the parameters of each of its RBMs. With the original feature vector v given, these k RBMs can be trained using the Contrastive Divergence method (Hinton, 2002). After the first layer is trained, we fix its parameters and start to train the second layer.
For the RBM of the second layer, the visible variables are the hidden variables of the first layer. Given an original feature vector v, we first infer the activation probabilities of the hidden variables in the first layer using equation (2), and we use these activation probabilities as the values of the visible variables of the second-layer RBM. Then we train the second-layer RBM using the Contrastive Divergence algorithm. Note that the activation probabilities are not binary values; this is only a training trick, used because feeding probabilities generally produces better models. It does not change our assumption that each variable is binary.

Fine Tuning
The greedy layer-wise pretraining initializes the parameters of our DBN to sensible values, but these values are not optimal and the parameters need to be fine-tuned. For fine tuning, we unroll the DBN to form an autoencoder as in Hinton and Salakhutdinov (2006), shown in Figure 5.
In this autoencoder, the stochastic activities of the binary hidden variables are replaced by their activation probabilities, so the autoencoder is in essence a feed-forward neural network. We tune the parameters of our DBN model on this autoencoder using the backpropagation algorithm.
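A forward pass through the unrolled autoencoder might look like the sketch below, assuming tied (transposed) decoder weights as in Hinton and Salakhutdinov (2006); the backpropagation step that minimizes the reconstruction error is omitted. Function and parameter names are ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def autoencoder_forward(v, layers, vis_biases):
    """Encode bottom-up with each layer's weights, then decode top-down
    with the transposed weights. `layers[i]` is (W_i, hidden bias),
    `vis_biases[i]` is the visible-side bias of layer i. Fine-tuning
    backpropagates the error between v and the returned reconstruction."""
    x = v
    for W, b in layers:              # encoder: activation probabilities
        x = sigmoid(x @ W + b)
    code = x                         # the LFR sits at the top of the encoder
    for (W, _), a in zip(reversed(layers), reversed(vis_biases)):
        x = sigmoid(x @ W.T + a)     # decoder with tied weights
    return code, x                   # latent code and reconstruction
```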

Domain Adaptation with Our DBN Model
In this section, we introduce how to use our DBN model to adapt a basic syntactic and semantic dependency parsing system to the target domain.

The Basic Pipelined System
We build a typical pipelined system, which first analyzes syntactic dependencies and then semantic dependencies. This basic system only serves as a platform for experimenting with different feature representations, so we introduce it only briefly in this subsection.

Syntactic Dependency Parsing
For syntactic dependency parsing, we use a deterministic shift-reduce method as in Nivre et al. (2006). It has four basic actions: left-arc, right-arc, shift, and reduce. A classifier is used to determine the action at each step. To decide the label of each dependency link, we extend the left/right-arc actions to their corresponding multi-label actions, leading to 31 left-arc and 66 right-arc actions. Altogether, a 99-class classification problem is yielded for parsing action prediction. We add arcs to the dependency graph in an arc-eager manner, and projectivize the non-projective sequences in the training data using the transformation of Nivre and Nilsson (2005). A maximum entropy classifier is used to make the decision at each step. The features utilized are the same as those in Zhao et al. (2008).

Semantic Dependency Parsing
Our semantic dependency parser is similar to the one in Che et al. (2009). We first train a predicate sense classifier on the training data, using the same features as in Che et al. (2009); again, a maximum entropy classifier is employed. Given a predicate, we need to decide its semantic dependency relation with each word in the sentence. To reduce the number of argument candidates, we adopt the pruning strategy of Zhao et al. (2009), which is adapted from the strategy in Xue and Palmer (2004). In the semantic role classification stage, we use a maximum entropy classifier to predict the probability of a candidate filling each semantic role. We train two different classifiers for verb and noun predicates, using the same features as in Che et al. (2009). We use a simple post-processing method: if there are duplicate arguments for ARG0∼ARG5, we preserve the one with the highest classification probability and remove its duplicates.
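The duplicate-argument post-processing step can be sketched as below. The data layout (word index, label, probability triples) is our own illustrative choice, not the paper's.

```python
CORE_LABELS = {"ARG0", "ARG1", "ARG2", "ARG3", "ARG4", "ARG5"}

def dedup_core_arguments(args):
    """Among duplicate ARG0-ARG5 labels for one predicate, keep only the
    candidate with the highest classification probability; all other
    labels (e.g. adjuncts) pass through unchanged.

    args: list of (word_index, label, probability) triples.
    """
    best = {}   # label -> best-scoring triple so far
    out = []
    for word_idx, label, prob in args:
        if label in CORE_LABELS:
            if label not in best or prob > best[label][2]:
                best[label] = (word_idx, label, prob)
        else:
            out.append((word_idx, label, prob))
    return out + list(best.values())
```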

Adapting the Basic System to Target Domain
In our basic pipelined system, both the syntactic and semantic dependency parsers are built using discriminative models. We train a syntactic parsing model and a semantic parsing model using the original feature representation, referred to below as OriSynModel and OriSemModel respectively. However, these two models do not adapt well to the target domain, so we use the LFR of our DBN model to train new syntactic and semantic parsing models, referred to as LatSynModel and LatSemModel. Details of using our DBN model are as follows.

Adapting the Syntactic Parser
The input data for training our DBN model are the original feature vectors on the training and unlabeled data. Therefore, to train our DBN model, we first need to extract the original features for syntactic parsing on these data. Features on the training data can be directly extracted using gold-standard annotations. On the unlabeled data, however, some features cannot be directly extracted, because our syntactic parser uses history-based features that depend on the previous actions taken when parsing a sentence; such features can only be extracted after the data are parsed. To solve this problem, we first parse the unlabeled data using the already trained OriSynModel, and thereby obtain the features on the unlabeled data. Because of the poor performance of the OriSynModel on the target domain, the features extracted on the unlabeled data contain some noise. However, experiments show that our DBN model can still learn a good LFR despite this noise. Using the LFR, we train the syntactic parsing model LatSynModel; then, by applying the LFR on the test and unlabeled data, we can parse those data using LatSynModel. Experiments in later sections show that the LatSynModel adapts much better to the target domain than the OriSynModel.

Adapting the Semantic Parser
The situation here is similar to the adaptation of the syntactic parser. Features on the training data can be directly extracted. To extract features on the unlabeled data, we need syntactic dependency trees on these data, so we first parse the unlabeled data with our LatSynModel, and we automatically identify predicates on the unlabeled data using a classifier as in Che et al. (2008). Then we extract the original features for semantic parsing on the unlabeled data. By feeding the original features extracted on these data to our DBN model, we learn the LFR for semantic dependency parsing, and using this LFR we train the semantic parsing model LatSemModel.

Experiment Data
We use the English data in the CoNLL 2009 shared task for experiments. The training data and in-domain test data are from the WSJ corpus, whereas the out-of-domain test data is from the Brown corpus. We also use unlabeled data consisting of sections K, L, M, N and P of the Brown corpus. The test data are excerpts from fiction; the unlabeled data are also excerpts from fiction or stories similar to the test data. Although the unlabeled data is actually annotated in Release 3 of the Penn Treebank, we do not use any information contained in the annotation; we use only the raw text. The training, test and unlabeled data contain 39,279, 425, and 16,407 sentences respectively.

Settings of Our DBN Model
For the syntactic parsing task, there are 748,598 original features in total. We use 7,486 hidden variables in the first layer and 3,743 hidden variables in the second layer. For semantic parsing, there are 1,074,786 original features. We use 10,748 hidden variables in the first layer and 5,374 hidden variables in the second layer.
In our DBN models, we need to determine the number of groups k. Because larger k means lower computational cost, k should not be set too small. We set k empirically as follows: according to our experience, each group should contain about 5000 original features, and we have about 10^6 original features in our tasks, so we estimate k ≈ 10^6/5000 = 200. We set k to 200 in the DBN models for both syntactic and semantic parsing. As for the splitting strategy, we use the more sophisticated one in subsection 3.3.1, as it should generate better results than the simple one.
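The heuristic above amounts to a one-line calculation (our own sketch of the rule of thumb, not code from the paper):

```python
def estimate_num_groups(num_features, feats_per_group=5000):
    """Heuristic from the text: aim for roughly 5000 original features
    per group, so k is approximately num_features / feats_per_group."""
    return round(num_features / feats_per_group)

# ~10^6 original features in both tasks gives k of about 200
```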

Details of DBN Training
In greedy pretraining of the DBN, the Contrastive Divergence algorithm is configured as follows: the training data is divided into mini-batches of 100 samples each, and the weights are updated with a learning rate of 0.3, a momentum of 0.9 and a weight decay of 0.0001. Each layer is trained for 30 passes (epochs) over the entire training data.
In fine-tuning, the backpropagation algorithm is configured as follows: the training data is divided into mini-batches of 50 samples each, and the weights are updated with a learning rate of 0.1, a momentum of 0.9 and a weight decay of 0.0001. The fine-tuning is repeated for 50 epochs over the entire training data.
We use the fast computing technique of Raina et al. (2009) to learn the LFRs. Moreover, in greedy pretraining, we train the RBMs of the first layer in parallel.

Results and Discussion
We use the official evaluation measures of the CoNLL 2009 shared task, which consist of three scores: (i) syntactic dependencies are scored using the labeled attachment score, (ii) semantic dependencies are evaluated using a labeled F1 score, and (iii) the overall task is scored with a macro average of the two previous scores. These three scores are denoted LAS, Sem F1, and Macro F1 respectively in this paper.

Table 1: The results of our basic and adapted systems

Comparison with Un-adapted System
Our basic system uses the OriSynModel for syntactic parsing, and the OriSemModel for semantic parsing. Our adapted system uses the LatSynModel for syntactic parsing, and the LatSemModel for semantic parsing. The results of these two systems are shown in Table 1, in which our basic and adapted systems are denoted as Ori and Lat respectively.
From the results in Table 1, we can see that Lat performs slightly worse than Ori on the in-domain WSJ test data. But on the out-of-domain Brown test data, Lat performs much better than Ori, with an improvement of about 5 points in Macro F1 score. This shows the effectiveness of our method for domain adaptation.

Different Splitting Configurations
As described in subsection 5.1.2, we have empirically set the number of groups k to be 200 and chosen the more sophisticated splitting strategy. In this subsection, we experiment with different splitting configurations to see their effects.
Under each splitting configuration, we learn the LFRs using our DBN models. Using these LFRs, we test our adapted systems on both in-domain and out-of-domain data, yielding one test result per splitting configuration. The in-domain and out-of-domain test results are reported in Table 2 and Table 3 respectively. In these two tables, 's1' and 's2' represent the simple and the more sophisticated splitting strategies in subsection 3.3.1 respectively, and 'k' is the number of groups in our DBN models. For both syntactic and semantic parsing, we use the same k in their DBN models. The 'Time' column reports the training time, in hours, of our DBN models for both syntactic and semantic parsing. Note that we only need to train our DBN models once: the training times reported in Table 2 are repeated in Table 3 for easy viewing, not because new DBN models are trained for the out-of-domain test.
From Tables 2 and 3 we make the following observations. First, although the more sophisticated splitting strategy 's2' generates slightly better results than the simple strategy 's1', the difference is not significant. This means that the hierarchical structure of our DBN model robustly captures the relationships between features: even with the simple splitting strategy 's1', we still get quite good results.
Second, the 'Time' column in Table 2 shows that different splitting strategies with the same k value have the same training time. This is reasonable, because the training time depends only on the number of parameters in our DBN model, and different splitting strategies do not affect that number. Third, the number of groups k affects both the training time and the final results. When k increases, the training time decreases but the results degrade. As k grows larger, the time reduction becomes less pronounced while the degradation of results becomes more obvious. When k = 100, 200 or 300, there is little difference between the results, which shows that our DBN model is not sensitive to values of k within a range of 100 around our initial estimate of 200. But when k is further from the estimate, e.g. k = 400, the results become significantly worse.
Please note that the results in Tables 2 and 3 are not used to tune the parameter k or to choose a splitting strategy in our DBN model. As mentioned in subsection 5.1.2, we have chosen k = 200 and the more sophisticated splitting strategy beforehand. In this paper, we always use the results with k = 200 and the 's2' strategy as our main results, even though the results with k = 100 are better.

The Size of Unlabeled Target Domain Data
An interesting question for our method is how much unlabeled target domain data should be used. To answer this question empirically, we learn several LFRs by gradually adding more unlabeled data when training our DBN model, and compare the performance of these LFRs in Figure 6. From Figure 6, we can see that by adding more unlabeled target domain data, our system adapts better to the target domain with only a small degradation on the source domain. However, as more unlabeled data is used, the improvement on the target domain gradually diminishes.

Comparison with Other Methods
In this subsection, we compare our method with several systems, described below.
Daume07. Daumé III (2007) proposed a simple and effective adaptation method based on feature augmentation. Each feature in the original problem is made into three versions: a general version, a source-specific version and a target-specific version. The augmented source data thus contains only the general and source-specific versions, and the augmented target data contains the general and target-specific versions. We adopt the same technique for syntactic and semantic dependency parsing in this baseline system.
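The augmentation step can be sketched as follows. This is our own minimal sketch of the idea in Daumé III (2007): each feature is copied into a 'general' version plus a domain-specific version, and the missing third copy is implicitly zero in a sparse representation. The domain tags and feature strings are illustrative.

```python
def augment(features, domain):
    """Feature augmentation for domain adaptation: return the general
    copies plus the copies specific to `domain` ('src' or 'tgt')."""
    assert domain in ("src", "tgt")
    return ([("general", f) for f in features] +
            [(domain, f) for f in features])
```

A source-domain sample represented by features `["w=dog", "pos=NN"]` would thus fire four augmented features: two general copies and two source-specific copies.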
Chen. The participating system of Zhao et al. (2009), which achieved the best result in the out-of-domain test of the CoNLL 2009 shared task.
Daumé III and Marcu (2006) presented and discussed several 'obvious' ways to attack the domain adaptation problem without developing new algorithms. Following their ideas, we construct similar systems.
OnlySrc. The system is trained on only the data of the source domain (News).
OnlyTgt. The system is trained on only the data of the target domain (Fiction).
All. The system is trained on all data of the source domain and the target domain.
It is worth noting that training the Daume07, OnlyTgt and All systems requires labeled target domain data. We use OnlySrc to parse the unlabeled target domain data and treat the output as labeled data.
All comparison results are shown in Table 4, in which the 'Diff' column is the difference between scores on in-domain and out-of-domain test data.
First, we compare OnlySrc, OnlyTgt and All. OnlyTgt performs very poorly in both the source domain and the target domain. Its poor performance in the source domain is easy to understand: it is the adaptation problem in reverse. OnlyTgt also performs poorly in the target domain, mainly, we believe, because it is trained on automatically parsed data that contains many parsing errors. Note, however, that All performs better than both OnlySrc and OnlyTgt on the target domain test, even though its training data contains some automatically parsed data. Therefore, target domain data, labeled or unlabeled, has potential for alleviating the adaptation problem. But All merely puts the automatically parsed target domain data into the training set, so its improvement on target domain test data is limited. In fact, how to use target domain data, especially unlabeled data, for domain adaptation is still an open and active topic in NLP and machine learning.
Second, we compare Daume07, All and our method. Daumé III (2007) reported improvements on target domain tests, but note that the target domain data used in their experiments was labeled, whereas in our case only unlabeled data is available. We can see that Daume07 has performance comparable to All, which uses no adaptation strategy besides adding more target domain data. We think the main reason is that the automatically parsed target domain data contains many parsing errors. Our method performs much better than Daume07 and All even though the same faulty data is also utilized in our system. This suggests that our method successfully learns robust new representations for different domains, even in the presence of noisy data.
Third, we compare Chen with our method. Chen reached the best result in the out-of-domain test of the CoNLL 2009 shared task. The results in Table 4 show that Chen's system performs better than ours on in-domain test data, especially on the LAS score. Chen's system uses a sophisticated graph-based syntactic dependency parser. Graph-based parsers use substantially more features, e.g. more than 1.3 × 10^7 features are used in McDonald et al. (2005). Learning an LFR for that many features would take months using our DBN model, so at present we only use a transition-based parser. The better in-domain performance of Chen's system mainly comes from their sophisticated syntactic parsing method.
To reduce feature sparsity, Chen's system uses word cluster features as in Koo et al. (2008). On out-of-domain tests, however, our system still performs much better than Chen's, especially on semantic parsing. To our knowledge, our system has obtained the best out-of-domain performance on this data set to date. More importantly, the performance difference between in-domain and out-of-domain tests is much smaller for our system, which shows that it adapts much better to the target domain.

Conclusions
In this paper, we propose a DBN model to learn LFRs for syntactic and semantic parsers. These LFRs are shared representations of original features across the source and target domains. Syntactic and semantic parsers using the LFRs adapt to the target domain much better than the same parsers using the original feature representation. Our model provides a unified method that adapts both syntactic and semantic dependency parsers to a new domain. In the future, we hope to further scale up our method to adapt parsing models that use substantially more features, such as graph-based syntactic dependency parsing models, and to search for better splitting strategies for our DBN model. Finally, although our experiments are conducted on syntactic and semantic parsing, we expect that the proposed approach can be applied to domain adaptation for other tasks with little adaptation effort.