Aspect-augmented Adversarial Networks for Domain Adaptation

We introduce a neural method for transfer learning between two (source and target) classification tasks or aspects over the same domain. Rather than training on target labels, we use a few keywords pertaining to source and target aspects indicating sentence relevance instead of document class labels. Documents are encoded by learning to embed and softly select relevant sentences in an aspect-dependent manner. A shared classifier is trained on the source encoded documents and labels, and applied to target encoded documents. We ensure transfer through aspect-adversarial training so that encoded documents are, as sets, aspect-invariant. Experimental results demonstrate that our approach outperforms different baselines and model variants on two datasets, yielding an improvement of 27% on a pathology dataset and 5% on a review dataset.


Introduction
Many NLP problems are naturally multitask classification problems. For instance, values extracted for different fields from the same document are often dependent as they share the same context. Existing systems rely on this dependence (transfer across fields) to improve accuracy. In this paper, we consider a version of this problem where there is a clear dependence between two tasks but annotations are available only for the source task. For example, the target goal may be to classify pathology reports (shown in Figure 1) for the presence of lymph invasion but training data are available only for carcinoma in the same reports. We call this problem aspect transfer as the objective is to learn to classify examples differently, focusing on different aspects, without access to target aspect labels. Clearly, such transfer learning is possible only with auxiliary information relating the tasks together.
Footnote 1: The code is available at https://github.com/yuanzh/aspect_adversarial.
The key challenge is to articulate and incorporate commonalities across the tasks. For instance, in classifying reviews of different products, sentiment words (referred to as pivots) can be shared across the products. This commonality enables one to align feature spaces across multiple products, enabling useful transfer (Blitzer et al., 2006). Similar properties hold in other contexts and beyond sentiment analysis. Figure 1 shows that certain words and phrases like "identified", which indicates the presence of a histological property, are applicable to both carcinoma and lymph invasion. Our method learns and relies on such shared indicators, and utilizes them for effective transfer.
The unique feature of our transfer problem is that both the source and the target classifiers operate over the same domain, i.e., the same examples. In this setting, traditional transfer methods will always predict the same label for both aspects and thus fail. Instead of supplying the target classifier with direct training labels, our approach builds on a secondary relationship between the tasks using aspect-relevance annotations of sentences. These relevance annotations indicate a possibility that the answer could be found in a sentence, not what the answer is. One can often write simple keyword rules that identify sentence relevance to a particular aspect through representative terms, e.g., specific hormonal markers in the context of pathology reports. Annotations of this kind can be readily provided by domain experts, or extracted from medical literature such as codex rules in pathology (Pantanowitz et al., 2008). We assume a small number of relevance annotations (rules) pertaining to both source and target aspects as a form of weak supervision. We use this sentence-level aspect relevance to learn how to encode the examples (e.g., pathology reports) from the point of view of the desired aspect. In our approach, we construct different aspect-dependent encodings of the same document by softly selecting sentences relevant to the aspect of interest. The key to effective transfer is how these encodings are aligned.
This encoding mechanism brings the problem closer to the realm of standard domain adaptation, where the derived aspect-specific representations are considered as different domains. Given these representations, our method learns a label classifier shared between the two domains. To ensure that it can be adjusted only based on the source class labels, and that it also reasonably applies to the target encodings, we must align the two sets of encoded examples. Learning this alignment is possible because, as discussed above, some keywords are directly transferable and can serve as anchors for constructing this invariant space. To learn this invariant representation, we introduce an adversarial domain classifier analogous to the recent successful use of adversarial training in computer vision (Ganin and Lempitsky, 2014). The role of the domain classifier (adversary) is to learn to distinguish between the two types of encodings. During training we update the encoder with an adversarial objective to cause the classifier to fail. The encoder therefore learns to eliminate aspect-specific information so that encodings look invariant (as sets) to the classifier, thus establishing aspect-invariant encodings and enabling transfer. All three components in our approach, 1) aspect-driven encoding, 2) classification of source labels, and 3) domain adversary, are trained jointly (concurrently) to complement and balance each other.
Adversarial training of domain and label classifiers can be challenging to stabilize. In our setting, sentences are encoded with a convolutional model. Feedback from adversarial training can be an unstable guide for how the sentences should be encoded. To address this issue, we incorporate an additional word-level auto-encoder reconstruction loss to ground the convolutional processing of sentences. We empirically demonstrate that this additional objective yields richer and more diversified feature representations, improving transfer.
We evaluate our approach on pathology reports (aspect transfer) as well as on a more standard review dataset (domain adaptation). On the pathology dataset, we explore cross-aspect transfer across different types of breast disease. Specifically, we test on six adaptation tasks, consistently outperforming all other baselines. Overall, our full model achieves 27% and 20.2% absolute improvement arising from aspect-driven encoding and adversarial training respectively. Moreover, our unsupervised adaptation method is only 5.7% behind the accuracy of a supervised target model. On the review dataset, we test adaptations from hotel to restaurant reviews. Our model outperforms the marginalized denoising autoencoder (Chen et al., 2012) by 5%. Finally, we examine and illustrate the impact of individual components on the resulting performance.

Domain Adaptation for Deep Learning
Existing approaches commonly induce abstract representations without pulling apart different aspects in the same example, and therefore are likely to fail on the aspect transfer problem. The majority of these prior methods first learn a task-independent representation, and then train a label predictor (e.g. SVM) on this representation in a separate step. For example, earlier work employs a shared autoencoder (Glorot et al., 2011; Chopra et al., 2013) to learn a cross-domain representation. Chen et al. (2012) further improve and stabilize the representation learning by utilizing marginalized denoising autoencoders. Later, Zhou et al. (2016) propose to minimize domain-shift of the autoencoder in a linear data combination manner. Other work has focused on learning transferable representations in an end-to-end fashion. Examples include using transduction learning for object recognition (Sener et al., 2016) and using residual transfer networks for image classification (Long et al., 2016). In contrast, we use adversarial training to encourage learning domain-invariant features in a more explicit way. Our approach offers two further advantages over prior work. First, we jointly optimize features with the final classification task while many previous works only learn task-independent features using autoencoders. Second, our model can handle traditional domain transfer as well as aspect transfer, while previous methods can only handle the former.

Adversarial Learning in Vision and NLP
Our approach closely relates to the idea of domain-adversarial training. Adversarial networks were originally developed for image generation (Goodfellow et al., 2014; Makhzani et al., 2015; Springenberg, 2015; Radford et al., 2015; Taigman et al., 2016), and were later applied to domain adaptation in computer vision (Ganin and Lempitsky, 2014; Ganin et al., 2015; Bousmalis et al., 2016; Tzeng et al., 2014) and speech recognition (Shinohara, 2016). The core idea of these approaches is to promote the emergence of invariant image features by optimizing the feature extractor as an adversary against the domain classifier. While Ganin et al. (2015) also apply this idea to sentiment analysis, their practical gains have remained limited.
Our approach presents two main departures. In computer vision, adversarial learning has been used for transferring across domains, while our method can also handle aspect transfer. In addition, we introduce a reconstruction loss which results in more robust adversarial training. We believe that this formulation will benefit other applications of adversarial training, beyond the ones described in this paper.

Semi-supervised Learning with Keywords
In our work, we use a small set of keywords as a source of weak supervision for aspect-relevance scoring. This relates to prior work on utilizing prototypes and seed words in semi-supervised learning (Haghighi and Klein, 2006; Grenager et al., 2005; Chang et al., 2007; Mann and McCallum, 2008; Jagarlamudi et al., 2012; Li et al., 2012; Eisenstein, 2017). All these prior approaches utilize prototype annotations primarily for model bootstrapping, not for learning representations. In contrast, our model uses provided keywords to learn aspect-driven encoding of input examples.

Attention Mechanism in NLP
One may view our aspect-relevance scorer as a sentence-level "semi-supervised attention", in which relevant sentences receive more attention during feature extraction. While traditional attention-based models typically induce attention in an unsupervised manner, they have to rely on a large amount of labeled data for the target task (Bahdanau et al., 2014; Rush et al., 2015; Chen et al., 2015; Cheng et al., 2016; Xu and Saenko, 2015; Yang et al., 2015; Martins and Astudillo, 2016; Lei et al., 2016). Unlike these methods, our approach assumes no label annotations in the target domain. Other work has focused on utilizing human-provided rationales as "supervised attention" to improve prediction (Zaidan et al., 2007; Marshall et al., 2015; Zhang et al., 2016; Brun et al., 2016). In contrast, our model only assumes access to a small set of keywords as a source of weak supervision. Moreover, all these prior approaches focus on in-domain classification. In this paper, however, we study the task in the context of domain adaptation.

Multitask Learning
Existing multitask learning methods focus on the case where supervision is available for all tasks. A typical architecture involves using a shared encoder with a separate classifier for each task (Caruana, 1998; Pan and Yang, 2010; Collobert and Weston, 2008; Liu et al., 2015; Bordes et al., 2012). In contrast, our work assumes labeled data only for the source aspect. We train a single classifier for both aspects by learning an aspect-invariant representation that enables the transfer.

Problem Formulation
We begin by formalizing aspect transfer with the idea of differentiating it from standard domain adaptation. In our setup, we have two classification tasks called the source and the target tasks. In contrast to source and target tasks in domain adaptation, both of these tasks are defined over the same set of examples (here documents, e.g., pathology reports). What differentiates the two classification tasks is that they pertain to different aspects in the examples. If each training document were annotated with both the source and the target aspect labels, the problem would reduce to multi-label classification. However, in our setting training labels are available only for the source aspect so the goal is to solve the target task without any associated training label.
To fix notation, let $d = \{s_i\}_{i=1}^{|d|}$ be a document that consists of a sequence of $|d|$ sentences $s_i$. Given a document $d$ and the aspect of interest, we wish to predict the corresponding aspect-dependent class label $y$ (e.g., $y \in \{-1, 1\}$). We assume that the set of possible labels is the same across aspects. We use $y^s_{l;k}$ to denote the $k$-th coordinate of a one-hot vector indicating the correct training source aspect label for document $d_l$. Target aspect labels are not available during training.
Beyond labeled documents for the source aspect $\{d_l, y^s_l\}_{l \in L}$, and shared unlabeled documents for source and target aspects $\{d_l\}_{l \in U}$, we assume further that we have relevance scores pertaining to each aspect. The relevance is given per sentence, for some subset of sentences across the documents, and indicates the possibility that the answer for that document would be found in the sentence but without indicating which way the answer goes. Relevance is always aspect dependent yet often easy to provide in the form of simple keyword rules.
We use $r^a_i \in \{0, 1\}$ to denote the given relevance label pertaining to aspect $a$ for sentence $s_i$. Only a small subset of sentences in the training set have associated relevance labels. Let $R = \{(a, l, i)\}$ denote the index set of relevance labels such that if $(a, l, i) \in R$ then aspect $a$'s relevance label $r^a_{l,i}$ is available for the $i$-th sentence in document $d_l$. In our case relevance labels arise from aspect-dependent keyword matches: $r^a_i = 1$ when the sentence contains any keywords pertaining to aspect $a$, and $r^a_i = 0$ if it has any keywords of other aspects. Separate subsets of relevance labels are available for each aspect as the keywords differ.
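The keyword rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the released implementation: the function name `relevance_label`, the whitespace tokenization, the example keyword lists, and the choice to let a match for the aspect itself take priority over matches for other aspects are all hypothetical. Sentences that match no keyword stay unlabeled, mirroring the fact that only a subset of sentences has relevance labels.

```python
from typing import Dict, List, Optional


def relevance_label(sentence: str, aspect: str,
                    keywords: Dict[str, List[str]]) -> Optional[int]:
    """Return 1, 0, or None (unlabeled) for `sentence` with respect to `aspect`."""
    # Normalize whitespace and pad with spaces so that only whole words match.
    text = " " + " ".join(sentence.lower().split()) + " "

    def mentions(aspect_name: str) -> bool:
        return any(" " + kw.lower() + " " in text for kw in keywords[aspect_name])

    if mentions(aspect):
        return 1  # assumption: a match for the aspect itself takes priority
    if any(mentions(other) for other in keywords if other != aspect):
        return 0  # keyword of another aspect: negative relevance label
    return None  # no keyword matched, so no relevance label is available


# Hypothetical keyword lists, for illustration only.
KEYWORDS = {
    "DCIS": ["dcis", "ductal carcinoma in-situ"],
    "LCIS": ["lcis", "lobular carcinoma in-situ"],
}

for aspect_name in ("DCIS", "LCIS"):
    print(aspect_name, relevance_label("Focal DCIS is identified", aspect_name, KEYWORDS))
```

A real rule set would also need to handle punctuation attached to keywords, which this sketch ignores.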
The transfer that is sought here is between two tasks over the same set of examples rather than between two different types of examples for the same task as in standard domain adaptation. However, the two formulations can be reconciled if full relevance annotations are assumed to be available during training and testing. In this scenario, we could simply lift the sets of relevant sentences from each document as new types of documents. The goal would be then to learn to classify documents of type T (consisting of sentences relevant to the target aspect) based on having labels only for type S (source) documents, a standard domain adaptation task. Our problem is more challenging as the aspect-relevance of sentences must be learned from limited annotations.
Finally, we note that the aspect transfer problem and the method we develop to solve it work the same even when source and target documents are a priori different, something we will demonstrate later.

Overview of our approach
Our model consists of three key components as shown in Figure 2. Each document is encoded in a relevance weighted, aspect-dependent manner (green, left part of Figure 2) and classified using the label predictor (blue, top-right). During training, the encoded documents are also passed on to the domain classifier (orange, bottom-right). The role of the domain classifier, as the adversary, is to ensure that the aspect-dependent encodings of documents are distributionally matched. This matching justifies the use of the same end-classifier to provide the predicted label regardless of the task (aspect).
To encode a document, the model first maps each sentence into a vector and then passes the vector to a scoring network to determine whether the sentence is relevant for the chosen aspect. These predicted relevance scores are used to obtain document vectors by taking a relevance-weighted sum of the associated sentence vectors. Thus, the manner in which the document vector is constructed is always aspect-dependent due to the chosen relevance weights.
During training, the resulting adjusted document vectors are consumed by the two classifiers. The primary label classifier aims to predict the source labels (when available), while the domain classifier determines whether the document vector pertains to the source or target aspect, which is the label that we know by construction. Furthermore, we jointly update the document encoder with a reverse of the gradient from the domain classifier, so that the encoder learns to induce document representations that fool the domain classifier. The resulting encoded representations will be aspect-invariant, facilitating transfer.
Our adversarial training scheme uses all the training losses concurrently to adjust the model parameters. During testing, we simply encode each test document in a target-aspect dependent manner, and apply the same label predictor. We expect that the same label classifier does well on the target task since it solves the source task, and operates on relevance-weighted representations that are matched across the tasks. While our method is designed to work in the extreme setting that the examples for the two aspects are the same, this is by no means a requirement. Our method will also work fine in the more traditional domain adaptation setting, which we will demonstrate later.
Figure 3: Illustration of the convolutional model and the reconstruction of word embeddings from the associated convolutional layer. (Figure label: $x^{sen} = \max\{h_1, h_2, \ldots\}$.)

Sentence embedding
We apply a convolutional model illustrated in Figure 3 to each sentence s i to obtain sentence-level vector embeddings x sen i . The use of RNNs or bi-LSTMs would result in more flexible sentence embeddings but based on our initial experiments, we did not observe any significant gains over the simpler CNNs.
We further ground the resulting sentence embeddings by including an additional word-level reconstruction step in the convolutional model. The purpose of this reconstruction step is to balance adversarial training signals propagating back from the domain classifier. Specifically, it forces the sentence encoder to keep rich word-level information in contrast to adversarial training that seeks to eliminate aspect-specific features. We provide an empirical analysis of the impact of this reconstruction in the experiment section (Section 7).
More concretely, we reconstruct word embeddings from the corresponding convolutional layer, as shown in Figure 3. We use $x_{i,j}$ to denote the embedding of the $j$-th word in sentence $s_i$. Let $h_{i,j}$ be the convolutional output when $x_{i,j}$ is at the center of the window. We reconstruct $x_{i,j}$ by
$$\hat{x}_{i,j} = \tanh(W^c h_{i,j} + b^c) \qquad (1)$$
where $W^c$ and $b^c$ are parameters of the reconstruction layer. The loss associated with the reconstruction for document $d$ is
$$L^{rec}(d) = \frac{1}{n} \sum_{i,j} \|\hat{x}_{i,j} - x_{i,j}\|_2^2 \qquad (2)$$
where $n$ is the number of tokens in the document and indexes $i, j$ identify the sentence and word, respectively. The overall reconstruction loss $L^{rec}$ is obtained by summing over all labeled/unlabeled documents.
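As a rough illustration of the word-level reconstruction step, the sketch below computes a per-document reconstruction loss in plain Python. It assumes a tanh reconstruction layer and a squared Euclidean error averaged over the tokens of the document; the function names and the nested-list data layout are hypothetical choices made for the example.

```python
import math
from typing import List

Vector = List[float]


def reconstruct(h: Vector, W: List[Vector], b: Vector) -> Vector:
    """Map one convolutional output h to a reconstructed word embedding."""
    return [math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
            for row, bias in zip(W, b)]


def reconstruction_loss(H: List[List[Vector]], X: List[List[Vector]],
                        W: List[Vector], b: Vector) -> float:
    """Average squared reconstruction error over all tokens of one document.

    H[i][j] is the convolutional output centered on word j of sentence i,
    and X[i][j] is the embedding of that word.
    """
    total, n_tokens = 0.0, 0
    for h_sentence, x_sentence in zip(H, X):
        for h, x in zip(h_sentence, x_sentence):
            x_hat = reconstruct(h, W, b)
            total += sum((a - c) ** 2 for a, c in zip(x_hat, x))
            n_tokens += 1
    return total / n_tokens


# Toy document: one sentence with two words, 2-dimensional vectors.
W_c = [[1.0, 0.0], [0.0, 1.0]]
b_c = [0.0, 0.0]
H_doc = [[[0.0, 0.0], [0.0, 0.0]]]
X_doc = [[[0.0, 0.0], [1.0, 0.0]]]
print(reconstruction_loss(H_doc, X_doc, W_c, b_c))
```

Summing this quantity over labeled and unlabeled documents gives the overall reconstruction term.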

Relevance prediction
We use a small set of keyword rules to generate binary relevance labels, both positive (r = 1) and negative (r = 0). These labels represent the only supervision available to predict relevance. The prediction is made on the basis of the sentence vector x sen i passed through a feedforward network with a ReLU output unit. The network has a single shared hidden layer and a separate output layer for each aspect. Note that our relevance prediction network is trained as a non-negative regression model even though the available labels are binary, as relevance varies more on a linear rather than binary scale.
Given relevance labels indexed by $R = \{(a, l, i)\}$, we minimize
$$L^{rel} = \sum_{(a,l,i) \in R} \left( r^a_{l,i} - \hat{r}^a_{l,i} \right)^2 \qquad (3)$$
where $\hat{r}^a_{l,i}$ is the predicted (non-negative) relevance score pertaining to aspect $a$ for the $i$-th sentence in document $d_l$, as shown in the left part of Figure 2. $r^a_{l,i}$, defined earlier, is the given binary (0/1) relevance label. We use scores on a [0, 1] scale because they can be naturally used as weights for vector combinations, as shown next.
Footnote 4: The word-level reconstruction step is omitted in Figure 2 for brevity.
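The relevance objective only touches sentences that carry a keyword-derived label. A minimal sketch, assuming a squared-error regression loss and a hypothetical dictionary layout keyed by (aspect, document index, sentence index):

```python
from typing import Dict, Tuple

Key = Tuple[str, int, int]  # (aspect, document index, sentence index)


def relevance_loss(labels: Dict[Key, int], scores: Dict[Key, float]) -> float:
    """Sum of squared errors over the labeled index set only.

    `labels` holds the binary keyword-derived labels; `scores` holds the
    predicted non-negative relevance scores and may cover more sentences.
    """
    return sum((labels[key] - scores[key]) ** 2 for key in labels)


labels = {("s", 0, 0): 1, ("t", 0, 1): 0}
scores = {("s", 0, 0): 0.5, ("t", 0, 1): 0.5, ("s", 0, 2): 0.9}
print(relevance_loss(labels, scores))
```

The unlabeled sentence `("s", 0, 2)` contributes nothing to the loss, although its predicted score is still used when the document is encoded.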

Document encoding
The initial vector representation for each document such as $d_l$ is obtained as a relevance-weighted combination of the associated sentence vectors, i.e.,
$$x^{doc,a}_l = \frac{\sum_i \hat{r}^a_{l,i} \, x^{sen}_{l,i}}{\sum_i \hat{r}^a_{l,i}} \qquad (4)$$
The resulting vector selectively encodes information from the sentences based on relevance to the focal aspect.
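To make the aspect dependence concrete, the sketch below encodes the same two-sentence document under two different sets of relevance scores. Normalizing by the sum of the scores and the small `eps` guard against an all-zero score vector are assumptions of this example, as are the function and variable names.

```python
from typing import List

Vector = List[float]


def encode_document(sentence_vectors: List[Vector],
                    relevance_scores: List[float],
                    eps: float = 1e-8) -> Vector:
    """Relevance-weighted average of sentence vectors for one aspect."""
    total = sum(relevance_scores) + eps  # eps avoids division by zero
    dim = len(sentence_vectors[0])
    return [
        sum(r * vec[k] for r, vec in zip(relevance_scores, sentence_vectors)) / total
        for k in range(dim)
    ]


# Two sentence vectors; each aspect softly selects a different sentence.
sentences = [[1.0, 0.0], [0.0, 1.0]]
source_doc = encode_document(sentences, [0.9, 0.1])
target_doc = encode_document(sentences, [0.1, 0.9])
print(source_doc, target_doc)
```

The two encodings differ even though the underlying document is identical, which is what allows a single document to be treated as two "domains".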

Transformation layer
The manner in which document vectors arise from sentence vectors means that they will retain aspect-specific information that will hinder transfer across aspects. To help remove non-transferable information, we add a transformation layer to map the initial document vectors $x^{doc,a}_l$ to their domain-invariant (as a set) versions, as shown in Figure 2. Specifically, the transformed representation is given by $x^{tr,a}_l = W^{tr} x^{doc,a}_l$. Meanwhile, the transformation has to be strongly regularized lest the gradient from the adversary wipe out all the document signal. We add the following regularization term
$$\Omega^{tr} = \lambda^{tr} \|W^{tr} - I\|_F^2 \qquad (5)$$
to discourage significant deviation away from identity $I$. $\lambda^{tr}$ is a regularization parameter that has to be set separately based on validation performance. We show an empirical analysis of the impact of this transformation layer in Section 7.
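The transformation and its identity-centered regularizer are simple to write down. The following sketch uses plain nested lists; the function names are hypothetical and the toy matrix is chosen only to make the penalty easy to check by hand.

```python
from typing import List

Matrix = List[List[float]]


def transform(W_tr: Matrix, x_doc: List[float]) -> List[float]:
    """Linear transformation of a document vector: x_tr = W_tr x_doc."""
    return [sum(w * x for w, x in zip(row, x_doc)) for row in W_tr]


def transformation_penalty(W_tr: Matrix, lam: float) -> float:
    """lam * ||W_tr - I||_F^2, which vanishes when W_tr is the identity."""
    return lam * sum(
        (W_tr[i][j] - (1.0 if i == j else 0.0)) ** 2
        for i in range(len(W_tr))
        for j in range(len(W_tr[i]))
    )


identity = [[1.0, 0.0], [0.0, 1.0]]
perturbed = [[1.0, 0.5], [0.0, 1.0]]
print(transformation_penalty(identity, 0.1))
print(transformation_penalty(perturbed, 0.1))
print(transform(perturbed, [1.0, 2.0]))
```

A very large weight pins the matrix to the identity, which effectively removes the layer, while a zero weight leaves the adversary free to reshape the representation without restraint.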

Primary label classifier
As shown in the top-right part of Figure 2, the classifier takes in the adjusted document representation as an input and predicts a probability distribution over the possible class labels. The classifier is a feed-forward network with a single hidden layer using ReLU activations and a softmax output layer over the possible class labels. Note that we train only one label classifier that is shared by both aspects. The classifier operates the same regardless of the aspect to which the document was encoded. It must therefore be cooperatively learned together with the encodings.
Let $\hat{p}_{l;k}$ denote the predicted probability of class $k$ for document $d_l$ when the document is encoded from the point of view of the source aspect. Recall that $[y^s_{l;1}, \ldots, y^s_{l;m}]$ is a one-hot vector for the correct (given) source class label for document $d_l$, hence also a distribution. We use the cross-entropy loss for the label classifier
$$L^{lab} = \sum_{l \in L} \left[ -\sum_{k=1}^{m} y^s_{l;k} \log \hat{p}_{l;k} \right] \qquad (6)$$
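For readers who want to see the label loss spelled out, here is a small self-contained sketch of a softmax followed by cross-entropy against one-hot source labels. The helper names are hypothetical, and the logits stand in for the output of the shared classifier, which is not modeled here.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over class logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_loss(probabilities: List[List[float]],
               one_hot_labels: List[List[int]]) -> float:
    """Cross-entropy summed over labeled source documents."""
    loss = 0.0
    for probs, target in zip(probabilities, one_hot_labels):
        # Only the coordinate of the correct class contributes.
        loss -= sum(y * math.log(p) for y, p in zip(target, probs) if y)
    return loss


predicted = [softmax([0.0, 0.0]), softmax([2.0, 0.0])]
gold = [[1, 0], [1, 0]]
print(label_loss(predicted, gold))
```

Only documents with source labels enter this term; target-aspect encodings never do, since no target labels are available during training.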

Domain classifier
As shown in the bottom-right part of Figure 2, the domain classifier functions as an adversary to ensure that the documents encoded with respect to the source and target aspects look the same as sets of examples. The invariance is achieved when the domain classifier (as the adversary) fails to distinguish between the two. Structurally, the domain classifier is a feed-forward network with a single ReLU hidden layer and a softmax output layer over the two aspect labels.
Let $y^a = [y^a_1, y^a_2]$ denote the one-hot domain label vector for aspect $a \in \{s, t\}$. In other words, $y^s = [1, 0]$ and $y^t = [0, 1]$. We use $\hat{q}_k(x^{tr,a}_l)$ as the predicted probability that the domain label is $k$ when the domain classifier receives $x^{tr,a}_l$ as the input. The domain classifier is trained to minimize
$$L^{dom} = \sum_{l \in L \cup U} \sum_{a \in \{s,t\}} \left[ -\sum_{k=1}^{2} y^a_k \log \hat{q}_k(x^{tr,a}_l) \right] \qquad (7)$$
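The adversary's loss is an ordinary two-class cross-entropy in which the "label" is simply the aspect used to encode the document. A minimal sketch, with hypothetical function and argument names:

```python
import math
from typing import List


def domain_loss(q_source: List[List[float]], q_target: List[List[float]]) -> float:
    """Cross-entropy of the domain classifier over both encodings.

    q_source[l] and q_target[l] are the predicted distributions
    [P(source), P(target)] for document l encoded under the source and
    target aspect, respectively.
    """
    loss = 0.0
    for q in q_source:
        loss -= math.log(q[0])  # true domain label is [1, 0]
    for q in q_target:
        loss -= math.log(q[1])  # true domain label is [0, 1]
    return loss


# An adversary at chance level pays log 2 per encoded document.
print(domain_loss([[0.5, 0.5]], [[0.5, 0.5]]))
# A confident, correct adversary pays much less.
print(domain_loss([[0.99, 0.01]], [[0.01, 0.99]]))
```

When the encodings are aspect-invariant, the adversary cannot do better than chance, so its loss stays near log 2 per encoding.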

Joint learning
We combine the individual component losses pertaining to word reconstruction, relevance labels, transformation layer regularization, source class labels, and domain adversary into an overall objective function
$$L^{all} = L^{rec} + L^{rel} + \Omega^{tr} + L^{lab} - \rho L^{dom} \qquad (8)$$
which is minimized with respect to the model parameters except for the adversary (domain classifier). The adversary is maximizing the same objective with respect to its own parameters. The last term $-\rho L^{dom}$ corresponds to the objective of causing the domain classifier to fail. The proportionality constant $\rho$ controls the impact of gradients from the adversary on the document representation; the adversary itself is always directly minimizing $L^{dom}$. All the parameters are optimized jointly using standard backpropagation (concurrent for the adversary). Each mini-batch is balanced by aspect, half coming from the source, the other half from the target. All the loss functions except $L^{lab}$ make use of both labeled and unlabeled documents. Additionally, it would be straightforward to add a loss term for target labels if they are available.
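The combination of the losses and the aspect-balanced mini-batches can be sketched as follows. This is an illustration only: the function names, the representation of a batch as (document id, aspect) pairs, and the uniform sampling are assumptions, and the individual loss values are passed in as plain numbers rather than computed by a network.

```python
import random
from typing import List, Tuple


def joint_objective(l_rec: float, l_rel: float, omega_tr: float,
                    l_lab: float, l_dom: float, rho: float) -> float:
    """Objective minimized by every component except the adversary.

    The domain loss enters with a negative sign, so lowering this objective
    pushes the encoder toward encodings the adversary cannot tell apart.
    """
    return l_rec + l_rel + omega_tr + l_lab - rho * l_dom


def balanced_batch(doc_ids: List[int], batch_size: int,
                   rng: random.Random) -> List[Tuple[int, str]]:
    """Half of the batch is encoded under the source aspect, half under the target."""
    half = batch_size // 2
    source = [(doc, "s") for doc in rng.sample(doc_ids, half)]
    target = [(doc, "t") for doc in rng.sample(doc_ids, half)]
    return source + target


rng = random.Random(0)
batch = balanced_batch(list(range(10)), 4, rng)
print(batch)
print(joint_objective(1.0, 1.0, 1.0, 1.0, 2.0, rho=0.5))
```

In a gradient-based implementation the adversary would take its own step to reduce the domain loss on the same batch, which is equivalent to increasing the joint objective with respect to its parameters.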

Experimental Setup
Pathology dataset
This dataset contains 96.6k breast pathology reports collected from three hospitals (Yala et al., 2016). A portion of this dataset is manually annotated with 20 categorical values, representing various aspects of breast disease. In our experiments, we focus on four aspects related to carcinomas and atypias: Ductal Carcinoma In-Situ (DCIS), Lobular Carcinoma In-Situ (LCIS), Invasive Ductal Carcinoma (IDC) and Atypical Lobular Hyperplasia (ALH). Each aspect is annotated using binary labels. We use 500 held out reports as our test set and use the rest of the labeled data as our training set: 23.8k reports for DCIS, 10.7k for LCIS, 22.9k for IDC, and 9.2k for ALH. Table 1 summarizes statistics of the dataset.
We explore the adaptation problem from one aspect to another. For example, we want to train a model on annotations of DCIS and apply it on LCIS. For each aspect, we use up to three common names as a source of supervision for learning the relevance scorer, as illustrated in Table 2. Note that the provided list is by no means exhaustive. In fact, Buckley et al. (2012) provide examples of 60 different verbalizations of LCIS, not counting negations.

Review dataset
Our second experiment is based on a domain transfer of sentiment classification. For the source domain, we use the hotel review dataset introduced in previous work (Wang et al., 2010; Wang et al., 2011), and for the target domain, we use the restaurant review dataset from Yelp. Both datasets have ratings on a scale of 1 to 5 stars. Following previous work (Blitzer et al., 2007), we label reviews with ratings > 3 as positive and those with ratings < 3 as negative, discarding the rest. The hotel dataset includes a total of around 200k reviews collected from TripAdvisor, so we split 100k as labeled and the other 100k as unlabeled data. We randomly select 200k restaurant reviews as the unlabeled data in the target domain. Our test set consists of 2k reviews. Table 1 summarizes the statistics of the review dataset.
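The rating-to-label rule above is easy to state in code. The sketch below is a hypothetical helper (the name `sentiment_label` and the string labels are illustrative) that maps a star rating to a binary sentiment label and drops neutral reviews.

```python
from typing import List, Optional, Tuple


def sentiment_label(stars: float) -> Optional[str]:
    """Map a 1-5 star rating to a binary label; 3-star reviews are discarded."""
    if stars > 3:
        return "positive"
    if stars < 3:
        return "negative"
    return None


def label_reviews(reviews: List[Tuple[str, float]]) -> List[Tuple[str, str]]:
    """Keep only reviews that receive a label."""
    labeled = []
    for text, stars in reviews:
        label = sentiment_label(stars)
        if label is not None:
            labeled.append((text, label))
    return labeled


toy_reviews = [("great room", 5), ("average stay", 3), ("dirty bathroom", 1)]
print(label_reviews(toy_reviews))
```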
The hotel reviews naturally have ratings for six aspects, including value, room quality, checkin service, room service, cleanliness and location. We use the first five aspects because the sixth aspect location has positive labels for over 95% of the reviews and thus the trained model will suffer from the lack of negative examples. The restaurant reviews, however, only have single ratings for an overall impression. Therefore, we explore the task of adaptation from each of the five hotel aspects to the restaurant domain. The hotel reviews dataset also provides a total of 280 keywords for different aspects that are generated by the bootstrapping method used in Wang et al. (2010). We use those keywords as supervision for learning the relevance scorer.

Baselines and our method
We first compare against a linear SVM trained on the raw bag-of-words representation of labeled data in source. Second, we compare against our SourceOnly model that assumes no target domain data or keywords. It thus has no adversarial training or target aspect-relevance scoring. Next, we compare against the marginalized denoising autoencoder (mSDA) of Chen et al. (2012). In the rest of the paper, we name our method and its variants as AAN (Aspect-augmented Adversarial Networks). We compare against AAN-NA and AAN-NR that are our model variants without adversarial training and without aspect-relevance scoring respectively. Finally we include supervised models trained on the full set of In-Domain annotations as the performance upper bound. Table 3 summarizes the usage of labeled and unlabeled data in each domain as well as keyword rules by our model (AAN-Full) and different baselines. Note that our model assumes the same set of data as the AAN-NA, AAN-NR and mSDA methods.

Implementation details
Following prior work (Ganin and Lempitsky, 2014), we gradually increase the adversarial strength ρ and decay the learning rate during training. We also apply batch normalization (Ioffe and Szegedy, 2015) on the sentence encoder and apply dropout with ratio 0.2 on word embeddings and each hidden layer activation. We set the hidden layer size to 150 and pick the transformation regularization weight λ t = 0.1 for the pathology dataset and λ t = 10.0 for the review dataset.
Table 4: Pathology: Classification accuracy (%) of different approaches on the pathology reports dataset, including the results of twelve adaptation scenarios from four different aspects (IDC, ALH, DCIS and LCIS) in breast cancer pathology reports. "mSDA" indicates the marginalized denoising autoencoder in (Chen et al., 2012). "AAN-NA" and "AAN-NR" correspond to our model without the adversarial training and the aspect-relevance scoring component, respectively. We also include in the last column the in-domain supervised training results of our model as the performance upper bound. Boldface numbers indicate the best accuracy for each testing scenario.
Table 4 summarizes the classification accuracy of different methods on the pathology dataset, including the results of twelve adaptation tasks. Our full model (AAN-Full) consistently achieves the best performance on each task compared with other baselines and model variants. It is not surprising that SVM and mSDA perform poorly on this dataset because they only predict labels based on an overall feature representation of the input, and do not utilize weak supervision provided by aspect-specific keywords. As a reference, we also provide a performance upper bound by training our model on the full labeled set in the target domain, denoted as In-Domain in the last column of Table 4. On average, the accuracy of our model (AAN-Full) is only 5.7% behind this upper bound.
Table 5 shows the adaptation results from each aspect in the hotel reviews to the overall ratings of restaurant reviews. AAN-Full and AAN-NR are the two best performing systems on this review dataset, attaining around 5% improvement over the mSDA baseline. Below, we summarize our findings when comparing the full model with the two model variants AAN-NA and AAN-NR.

Impact of adversarial training
We first focus on comparisons between AAN-Full and AAN-NA. The only difference between the two models is that AAN-NA has no adversarial training. On the pathology dataset, our model significantly outperforms AAN-NA, yielding a 20.2% absolute average gain (see Table 4). On the review dataset, our model obtains 2.5% average improvement over AAN-NA. As shown in Table 5, the gains are more significant when training on room and checkin aspects, reaching 6.9% and 4.5%, respectively.

Impact of relevance scoring
As shown in Table 4, the relevance scoring component plays a crucial role in classification on the pathology dataset. Our model achieves more than 27% improvement over AAN-NR. This is because in general aspects have zero correlations to each other in pathology reports. Therefore, it is essential for the model to have the capacity of distinguishing across different aspects in order to succeed in this task.
On the review dataset, however, we observe that relevance scoring has no significant impact on performance. On average, AAN-NR actually outperforms AAN-Full by 0.9%. This observation can be explained by the fact that different aspects in hotel reviews are highly correlated to each other. For example, the correlation between room quality and cleanliness is 0.81, much higher than aspect correlations in the pathology dataset. In other words, the sentiment is typically consistent across all sentences in a review, so that selecting aspect-specific sentences becomes unnecessary. Moreover, our supervision for the relevance scorer is weak and noisy because the aspect keywords are obtained in a semi-automatic way. Therefore, it is not surprising that AAN-NR sometimes delivers better classification accuracy.

Analysis
Impact of the reconstruction loss
Table 6 summarizes the impact of the reconstruction loss on the model performance. For our full model (AAN-Full), adding the reconstruction loss yields an average of 5.0% gain on the pathology dataset and 5.6% on the review dataset.
Figure 5: Examples of restaurant reviews and their nearest neighboring hotel reviews induced by different models (columns 2 and 3). We use room quality as the source aspect. The sentiment phrases of each review are in blue, and some reviews are also shortened for space. Column titles: Restaurant Reviews; Nearest Hotel Reviews by Ours-Full; Nearest Hotel Reviews by Ours-NA.
Table 7: The effect of regularization of the transformation layer λ_t on the performance.
To analyze the reasons behind this difference, consider Figure 4, which shows heat maps of the learned document representations on the review dataset. The top half of each matrix corresponds to input documents from the source domain and the bottom half corresponds to the target domain. Unlike the first matrix, the other two matrices show no significant difference between the two halves, indicating that adversarial training helps learn domain-invariant representations. However, adversarial training also removes a lot of information from the representations, as the second matrix is much sparser than the first one. The third matrix shows that adding the reconstruction loss effectively addresses this sparsity issue. Almost 85% of the entries of the second matrix have small values (< 10^-6), while the sparsity is only about 30% for the third one. Moreover, the standard deviation of the third matrix is ten times higher than that of the second one. These comparisons demonstrate that the reconstruction loss improves both the richness and diversity of the learned representations. Note that in the case of no adversarial training (AAN-NA), adding the reconstruction component has no clear effect. This is expected because the main motivation for adding this component is to make adversarial training more robust.
Table 7 shows the averaged accuracy with different regularization weights λ_t in Equation 5. We vary λ_t to reflect different model variants. First, λ_t = ∞ corresponds to removing the transformation layer because the transformation is always the identity in this case. Our model performs better than this variant on both datasets, yielding an average improvement of 9.8% on the pathology dataset and 2.1% on the review dataset. This result indicates the importance of the transformation layer. Second, using zero regularization (λ_t = 0) also consistently results in inferior performance, such as a 13.8% loss on the pathology dataset.
We hypothesize that zero regularization will dilute the effect from reconstruction because there is too much flexibility in transformation. As a result, the transformed representation will become sparse due to the adversarial training, leading to a performance loss.
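The sparsity and spread statistics quoted in this analysis can be computed along the following lines. This is a minimal sketch under stated assumptions: "small" is taken to mean an absolute value below 10^-6, the standard deviation is taken over all entries of the matrix, and the helper names `sparsity` and `entry_std` as well as the toy matrices are hypothetical and not taken from the released code.

```python
import statistics

def sparsity(matrix, threshold=1e-6):
    """Fraction of entries whose absolute value is below `threshold`.

    Assumption: 'small values' means |v| < threshold.
    """
    values = [v for row in matrix for v in row]
    if not values:
        raise ValueError("matrix is empty")
    return sum(1 for v in values if abs(v) < threshold) / len(values)

def entry_std(matrix):
    """Population standard deviation over all entries of the matrix."""
    values = [v for row in matrix for v in row]
    return statistics.pstdev(values)

# Hypothetical document representations (rows = documents), illustration only.
without_reconstruction = [[0.0, 0.0, 0.3], [0.0, 1e-8, 0.0]]
with_reconstruction = [[0.5, 0.0, 0.9], [0.2, 0.7, 0.0]]

for name, reps in [("without", without_reconstruction),
                   ("with", with_reconstruction)]:
    print(name, round(sparsity(reps), 3), round(entry_std(reps), 3))
```

A lower sparsity together with a higher standard deviation is the pattern the text associates with richer, more diverse representations.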

Examples of neighboring reviews
We illustrate in Figure 5 a case study on the characteristics of the abstract representations learned by different models. The first column shows an example restaurant review. Sentiment phrases in this example are mostly food-specific, such as "undercooked" and "tasted weird". In the other two columns, we show example hotel reviews that are nearest neighbors to the restaurant reviews, measured by cosine similarity between their representations. In column 2, many sentiment phrases are specific to room quality, such as "old" and "rained water from above". In column 3, however, most sentiment phrases are either common sentiment expressions (e.g. dirty) or food-related (e.g. food poison), even though the source aspect is the room quality of hotels. This observation indicates that adversarial training (AAN-Full) successfully learns to eliminate domain-specific information and to map those domain-specific words into similar domain-invariant representations. In contrast, AAN-NA only captures domain-invariant features from phrases that are commonly present in both domains.
Figure 6: Classification accuracy (y-axis) on two transfer scenarios (one on review and one on pathology dataset) with a varied number of keyword rules for learning sentence relevance (x-axis).
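The nearest-neighbor lookup behind this case study can be sketched as follows. The code ranks candidate hotel-review vectors by cosine similarity to a restaurant-review vector; the function names and the three-dimensional toy vectors are hypothetical, and in practice the vectors would be the document representations produced by the trained encoder.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_neighbors(query, candidates, k=2):
    """Indices of the k candidates most similar to `query`, best first."""
    scored = [(cosine_similarity(query, vec), idx)
              for idx, vec in enumerate(candidates)]
    # Sort by descending similarity; break ties by candidate index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]

# Hypothetical representations, illustration only.
restaurant_review = [1.0, 0.0, 1.0]
hotel_reviews = [
    [0.9, 0.1, 1.1],  # similar direction
    [0.0, 1.0, 0.0],  # orthogonal
    [2.0, 0.0, 2.0],  # same direction, larger norm
]
print(nearest_neighbors(restaurant_review, hotel_reviews, k=2))
```

Because cosine similarity ignores vector norm, the third candidate ranks first here despite its larger magnitude.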

Impact of keyword rules
Finally, Figure 6 shows the accuracy of our full model (y-axis) when trained with varying numbers of keyword rules for relevance learning (x-axis). As expected, the transfer accuracy drops significantly when using fewer rules on the pathology dataset (LCIS as source and ALH as target). In contrast, the accuracy on the review dataset (hotel service as source and restaurant as target) is not sensitive to the number of relevance rules used. This can be explained by the observation from Table 5 that the model without relevance scoring performs as well as the full model due to the tight dependence among aspect labels.

Conclusions
In this paper, we propose a novel aspect-augmented adversarial network for cross-aspect and cross-domain adaptation tasks. Experimental results demonstrate that our approach successfully learns invariant representations from aspect-relevant fragments, yielding significant improvement over the mSDA baseline and our model variants. The effectiveness of our approach suggests the potential application of adversarial networks to a broader range of NLP tasks for improved representation learning, such as machine translation and language generation.