Adapting to All Domains at Once: Rewarding Domain Invariance in SMT

Existing work on domain adaptation for statistical machine translation has consistently assumed access to a small sample from the test distribution (target domain) at training time. In practice, however, the target domain may not be known at training time or it may change to match user needs. In such situations, it is natural to push the system to make safer choices, giving higher preference to domain-invariant translations, which work well across domains, over risky domain-specific alternatives. We encode this intuition by (1) inducing latent subdomains from the training data only; (2) introducing features which measure how specialized phrases are to individual induced subdomains; (3) estimating feature weights on out-of-domain data (rather than on the target domain). We conduct experiments on three language pairs and a number of different domains. We observe consistent improvements over a baseline which does not explicitly reward domain invariance.


Introduction
Mismatch in phrase translation distributions between test data (target domain) and training data is known to harm the performance of statistical translation systems (Carpuat et al., 2014). Domain-adaptation methods (Foster et al., 2010; Bisazza et al., 2011; Sennrich, 2012b; Razmara et al., 2012; Sennrich et al., 2013; Haddow, 2013; Joty et al., 2015) aim to specialize a system estimated on out-of-domain training data to a target domain represented by a small data sample. In practice, however, the target domain may not be known at training time, or it may change over time depending on user needs. In this work we address exactly the setting where we have a domain-agnostic system but no access to any samples from the target domain at training time. This is an important and challenging setting which, as far as we are aware, has not yet received attention in the literature.
When the target domain is unknown at training time, the system could be trained to make safer choices, preferring translations which are likely to work across different domains. For example, when translating from English to Russian, the most natural translation for the word 'code' would be highly dependent on the domain (and the corresponding word sense). The Russian words 'shifr', 'zakon' or 'programma' would perhaps be optimal choices if we consider the cryptography, legal and software development domains, respectively. However, the translation 'kod' is also acceptable across all these domains and, as such, would be a safer choice when the target domain is unknown. Note that such a translation may not be the most frequent overall and, consequently, might not be proposed by a standard (i.e., domain-agnostic) phrase-based translation system.
In order to encode a preference for domain-invariant translations, we introduce a measure which quantifies how likely a phrase (or a phrase pair) is to be "domain-invariant". We recall that most large parallel corpora are heterogeneous, consisting of diverse language use originating from a variety of unspecified subdomains. For example, news articles may cover sports, finance, politics, technology and a variety of other news topics. None of the subdomains may match the target domain particularly well, but they can still reveal how domain-specific a given phrase is. For example, if we observe that the word 'code' can be translated as 'kod' across the cryptography and legal subdomains present in the training data, we can hypothesize that it may work better on a new unknown domain than 'zakon', which was specific to a single subdomain (legal). This would be a suitable decision if the test domain happens to be software development, even though no texts pertaining to this domain were included in the heterogeneous training data.
Importantly, the subdomains are usually not specified in the heterogeneous training data. Therefore, we treat the subdomains as latent, so we can induce them automatically. Once induced, we define measures of domain specificity, expressing two generic properties:

Phrase domain specificity: How specific is a target or a source phrase to some of the induced subdomains?
Phrase pair domain coherence: How coherently are a source phrase and its target-language translation used across the induced subdomains?
These features capture two orthogonal aspects of phrase behaviour in heterogeneous corpora, with the rationale that phrase pairs can be weighted along these two dimensions. Domain-specificity captures the intuition that the more specific a phrase is to certain subdomains, the less applicable it is in general. Note that specificity is applied not only to target phrases (as 'kod' and 'zakon' in the above example) but also to source phrases. When applied to a source phrase, it may give a preference towards using shorter phrases as they are inherently less domain specific. In contrast to phrase domain specificity, phrase pair coherence reflects whether candidate target and source phrases are typically used in the same set of domains. The intuition here is that the more divergent the distributional behaviour of source and target phrases across subdomains, the less certain we are whether this phrase pair is valid for the unknown target domain. In other words, a translation rule with source and target phrases having two similar distributions over the latent subdomains is likely safer to use.
Weights for these features, alongside all other standard features, are tuned on a development set. Importantly, we show that there is no noteworthy benefit from tuning the weights on a sample from the target domain. It is enough to tune them on a mixed-domain dataset sufficiently different from the training data. We attribute this attractive property to the fact that our features, unlike the ones typically considered in standard domain-adaptation work, are generic and only affect the amount of risk our system takes. In contrast, for example, in Eidelman et al. (2012), Chiang et al. (2011), Hu et al. (2014), Hasler et al. (2014), Sennrich (2012b), Chen et al. (2013b), and Carpuat et al. (2014), features capture similarities between a target domain and each of the training subdomains. Clearly, domain adaptation with such rich features, though potentially more powerful, would not be possible without a development set closely matching the target domain.
We conduct our experiments on three language pairs and explore 9 domain adaptation tasks in total. We observe significant and consistent performance improvements over the baseline domain-agnostic systems. This result confirms that our two features, and the latent subdomains they are computed from, are useful even in the very challenging domain adaptation setting considered in this work.

Domain-Invariance for Phrases
At the core of a standard state-of-the-art phrase-based system (Koehn et al., 2003; Och and Ney, 2004) lies a phrase table $\{\langle \tilde{e}, \tilde{f} \rangle\}$ extracted from a word-aligned training corpus, together with estimates for the phrase translation probabilities $P_{count}(\tilde{e} \mid \tilde{f})$ and $P_{count}(\tilde{f} \mid \tilde{e})$. Typically the phrases and their probabilities are obtained from large parallel corpora, which are usually broad enough to cover a mixture of several subdomains. In such mixtures, phrase distributions may differ across subdomains. Some phrases (whether source or target) are more specific to certain subdomains than others, while some phrases are useful across many subdomains. Moreover, for a phrase pair, the distribution over the subdomains for its source side may or may not be similar to the distribution for its target side.

Figure 1: The projection framework of phrases into the K-dimensional vector space of probabilistic latent subdomains.
Coherent pairs seem safer to employ than pairs that exhibit different distributions over the subdomains. These two factors, domain specificity and domain coherence, can be estimated from the training corpus if we have access to subdomain statistics for the phrases. In the setting addressed here, the subdomains are not known in advance and we have to treat them as latent in the training data. Therefore, we introduce a random variable $z \in \{1, \ldots, K\}$ encoding $K$ (arbitrary) latent subdomains that generate the source and target phrases $\tilde{e}$ and $\tilde{f}$ of every phrase pair $\langle \tilde{e}, \tilde{f} \rangle$. In the next section, we aim to estimate the distributions $P(z \mid \tilde{e})$ and $P(z \mid \tilde{f})$ over subdomains $z$ for source and target phrases respectively. In other words, we aim at projecting phrases onto a compact $(K-1)$-dimensional simplex of subdomains with vectors:

$$\vec{\tilde{e}} = \langle P(z=1 \mid \tilde{e}), \ldots, P(z=K \mid \tilde{e}) \rangle, \quad (1)$$

$$\vec{\tilde{f}} = \langle P(z=1 \mid \tilde{f}), \ldots, P(z=K \mid \tilde{f}) \rangle. \quad (2)$$

Each of the $K$ elements encodes how well a source or target phrase expresses a specific latent subdomain in the training data. See Fig. 1 for an illustration of the projection framework. Once the projection is performed, the hidden cross-domain translation behaviour of phrases and phrase pairs can be modeled as follows:

• Domain-specificity of phrases: A rule with source and target phrases having a peaked distribution over latent subdomains is likely domain-specific.
Technically speaking, entropy is a natural choice for quantifying domain specificity. Here, we opt for the Rényi entropy and define the domain specificity as follows:

$$D_\alpha(\vec{\tilde{e}}) = \frac{1}{1-\alpha} \log \Big( \sum_{z=1}^{K} P(z \mid \tilde{e})^\alpha \Big).$$

For convenience, we refer to $D_\alpha(\cdot)$ as the domain specificity of a phrase; note that a peaked (i.e., domain-specific) distribution yields a low entropy value. In this study, we choose the value $\alpha = 2$, the default choice (also known as the collision entropy).
• Source-target coherence across subdomains: A translation rule with source and target phrases having two similar distributions over the latent subdomains is likely safer to use. We use the Chebyshev distance for measuring the similarity between the two distributions. The divergence of two vectors $\vec{\tilde{e}}$ and $\vec{\tilde{f}}$ is defined as follows:

$$D(\vec{\tilde{e}}, \vec{\tilde{f}}) = \max_{z} \big| P(z \mid \tilde{e}) - P(z \mid \tilde{f}) \big|.$$

We refer to $D(\vec{\tilde{e}}, \vec{\tilde{f}})$ as the phrase pair coherence across latent subdomains.
We investigated some other similarity measures for phrase pair coherence (the Kullback-Leibler divergence and the Hellinger distance) but did not observe any noticeable improvements in performance. We discuss these experiments in the empirical section.
Once computed for every phrase pair, the three measures $D_\alpha(\vec{\tilde{e}})$, $D_\alpha(\vec{\tilde{f}})$ and $D(\vec{\tilde{e}}, \vec{\tilde{f}})$ are integrated into a phrase-based SMT system as feature functions.
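To make these definitions concrete, the following is a minimal sketch (in Python, with numpy) that computes both measures for a single phrase pair; it assumes the subdomain posteriors have already been estimated, as described in Section 3, and are given as length-K arrays.

```python
import numpy as np

def domain_specificity(p, alpha=2.0):
    """Renyi entropy D_alpha of a phrase's subdomain distribution p.
    A peaked (domain-specific) distribution yields a low value; with
    alpha=2 this is the collision entropy used as the default here."""
    return (1.0 / (1.0 - alpha)) * np.log(np.sum(p ** alpha))

def pair_coherence(p_e, p_f):
    """Chebyshev distance between the source- and target-phrase subdomain
    distributions; small values indicate coherent (safer) phrase pairs."""
    return float(np.max(np.abs(p_e - p_f)))

# Toy example: 'kod' is spread over subdomains, 'zakon' is peaked on one.
p_kod = np.array([0.3, 0.4, 0.3])
p_zakon = np.array([0.9, 0.05, 0.05])
print(domain_specificity(p_kod) > domain_specificity(p_zakon))  # True
```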

Latent Subdomain Induction
We now present our approach for inducing the latent subdomain distributions $P(z \mid \tilde{e})$ and $P(z \mid \tilde{f})$ for every source and target phrase $\tilde{e}$ and $\tilde{f}$. In our experiments, we compare our subdomain induction framework with relying on topic distributions provided by a standard topic model, Latent Dirichlet Allocation (Blei et al., 2003). Note that, unlike LDA, we rely on parallel data and word alignments when inducing domains. Our intuition is that latent variables capturing regularities in bilingual data may be more appropriate for the translation task. Inducing these probabilities directly is rather difficult, as designing a fully generative phrase-based model is known to be challenging: it requires incorporating into the model additional hidden variables encoding phrase segmentation (DeNero et al., 2006), which would significantly complicate inference (Mylonakis and Sima'an, 2011; Neubig et al., 2011; Cohn and Haffari, 2013). In order to avoid this, we follow Matsoukas et al. (2009) and Cuong and Sima'an (2014a), who "embed" such a phrase-level model into a latent subdomain model that works at the sentence level. In other words, we associate latent domains with sentence pairs rather than with phrases, and use the posterior probabilities computed for the sentences for all the phrases appearing in the corresponding sentences. Given $P(z \mid e, f)$, a latent subdomain model over sentence pairs $\langle e, f \rangle$, the estimation of $P(z \mid \tilde{e})$ and $P(z \mid \tilde{f})$ for phrases $\tilde{e}$ and $\tilde{f}$ can be simplified by computing expectations for all $z \in \{1, \ldots, K\}$:

$$P(z \mid \tilde{e}) \propto \sum_{\langle e, f \rangle} c(\tilde{e}, e)\, P(z \mid e, f),$$

and analogously for $P(z \mid \tilde{f})$. Here, $c(\tilde{e}, e)$ is the count of phrase $\tilde{e}$ in sentence $e$ of the training corpus.
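A minimal sketch of this aggregation step; the data layout (an iterable of per-sentence phrase counts paired with a length-K posterior vector) is a hypothetical simplification:

```python
from collections import defaultdict
import numpy as np

def phrase_posteriors(sentences, K):
    """Aggregate sentence-level posteriors P(z|e,f) into phrase-level
    distributions P(z|phrase). `sentences` is assumed to be an iterable
    of (phrase_counts, posterior) pairs, where phrase_counts maps each
    phrase occurring in the sentence pair to its count c(phrase, sentence)
    and posterior is a length-K numpy array."""
    totals = defaultdict(lambda: np.zeros(K))
    for phrase_counts, posterior in sentences:
        for phrase, count in phrase_counts.items():
            totals[phrase] += count * posterior  # expected domain counts
    # Normalize each phrase's expected counts into a distribution.
    return {p: v / v.sum() for p, v in totals.items()}
```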
Latent subdomains for sentences. We now turn to describing our latent subdomain model for sentences. We assume the following generative story for sentence pairs:
1. generate the domain z from the prior P(z);
2. choose the generation direction, f-to-e or e-to-f, with equal probability;
3. if the e-to-f direction is chosen, generate the pair relying on $P(e \mid z)\, P(f \mid e, z)$; otherwise, generate it relying on $P(f \mid z)\, P(e \mid f, z)$.
Formally, this is a uniform mixture of the generative processes for the two potential translation directions. This generative story implies having two translation models (TMs) and two language models (LMs), each augmented with latent subdomains. The posterior $P(z \mid e, f)$ can then be computed as

$$P(z \mid e, f) \propto P(z)\, \frac{1}{2} \big[ P(e \mid z)\, P(f \mid e, z) + P(f \mid z)\, P(e \mid f, z) \big].$$

As we aim for a simple approach, our TMs are computed through the introduction of hidden alignments $a$ and $a'$ in the f-to-e and e-to-f directions respectively, in which $P(f \mid e, z) = \sum_a P(f, a \mid e, z)$ and $P(e \mid f, z) = \sum_{a'} P(e, a' \mid f, z)$. To make the marginalization over alignments tractable, we restrict $P(f, a \mid e, z)$ and $P(e, a' \mid f, z)$ to the same assumptions as IBM Model 1 (Brown et al., 1993), i.e., a product of lexical translation probabilities conditioned on latent subdomains.
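For illustration, under these IBM Model 1 assumptions the sum over alignments factorizes per target word; a sketch with a hypothetical domain-conditioned lexical table `t_z` (NULL alignment omitted for brevity):

```python
import numpy as np

def ibm1_likelihood(f_ids, e_ids, t_z):
    """P(f | e, z) under IBM Model 1 with a domain-conditioned lexical
    table t_z[f, e] = P(f | e, z): alignments are summed out in closed
    form, one uniform choice of source position per target word."""
    prob = 1.0
    for f_j in f_ids:
        prob *= t_z[f_j, e_ids].mean()  # (1/|e|) * sum_i t_z(f_j | e_i)
    return prob
```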
We use standard $n$-th order Markov models for $P(e \mid z)$ and $P(f \mid z)$:

$$P(e \mid z) = \prod_{i} P(e_i \mid e_{i-n}^{i-1}, z), \qquad P(f \mid z) = \prod_{j} P(f_j \mid f_{j-n}^{j-1}, z).$$

Here, the notation $e_{i-n}^{i-1}$ and $f_{j-n}^{j-1}$ denotes the history of length $n$ for the source and target words $e_i$ and $f_j$, respectively.
Training. For training, we maximize the log-likelihood $\mathcal{L}$ of the data:

$$\mathcal{L} = \sum_{\langle e, f \rangle} \log P(e, f).$$

As there is no closed-form solution, we use the expectation-maximization (EM) algorithm (Dempster et al., 1977).
In the E-step, we compute the posterior distributions $P(a, z \mid e, f)$ and $P(a', z \mid e, f)$ as follows:

$$P(a, z \mid e, f) \propto P(z)\, P(e \mid z)\, P(f, a \mid e, z),$$
$$P(a', z \mid e, f) \propto P(z)\, P(f \mid z)\, P(e, a' \mid f, z).$$

In the M-step, we use the posteriors $P(a, z \mid e, f)$ and $P(a', z \mid e, f)$ to re-estimate the parameters of both alignment models. This is done in a very similar way to the estimation of the standard IBM Model 1.
We use the posteriors to re-estimate the LM parameters analogously, accumulating expected $n$-gram counts weighted by the posteriors. To obtain better parameter estimates for word predictions and avoid overfitting, we apply smoothing in the M-step. In this work, we chose the expected Kneser-Ney smoothing technique (Zhang and Chiang, 2014), as it is simple and achieves state-of-the-art performance on the language modeling problem. Finally, $P(z)$ can be estimated simply by normalizing the expected domain counts:

$$P(z) \propto \sum_{\langle e, f \rangle} P(z \mid e, f).$$

Hierarchical Training. In practice, we found that training the full joint model leads to brittle performance, as EM is very likely to get stuck in bad local maxima. To address this difficulty, in our implementation we start out by first jointly training $P(z)$, $P(e \mid z)$ and $P(f \mid z)$; a simplified sketch of this first stage is given below. In the E-step, we fix the model parameters and compute $P(z \mid e, f)$ for every sentence pair: $P(z \mid e, f) \propto P(e \mid z)\, P(f \mid z)\, P(z)$. In the M-step, we use the posteriors to re-estimate the model parameters, as in Equations (7), (8) and (9). Once this model is trained, we fix the language modeling parameters and finally train the full model. This parallel latent subdomain language model is less expressive and, consequently, less likely to get stuck in a local maximum. The LMs estimated in this way then drive the full alignment model towards better configurations in the parameter space. In practice, this training scheme is particularly useful when learning a more fine-grained latent subdomain model with larger $K$.

The target domains in our evaluation are not close to the training data. In this way, we test the stability of our results across a wide range of target domains.
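The promised sketch of the first training stage, under simplifying assumptions: unigram LMs stand in for the smoothed n-gram models, and add-epsilon smoothing replaces expected Kneser-Ney.

```python
import numpy as np

def train_parallel_lms(pairs, K, V_e, V_f, iters=5, seed=0, eps=1e-6):
    """First stage of hierarchical training: EM over P(z), P(e|z), P(f|z).
    `pairs` is a list of (e_ids, f_ids) sentence pairs over integer
    vocabularies of sizes V_e and V_f."""
    rng = np.random.default_rng(seed)
    prior = np.full(K, 1.0 / K)
    lm_e = rng.dirichlet(np.ones(V_e), size=K)  # rows: P(. | z) per domain
    lm_f = rng.dirichlet(np.ones(V_f), size=K)
    for _ in range(iters):
        c_z = np.zeros(K)
        c_e = np.zeros((K, V_e))
        c_f = np.zeros((K, V_f))
        for e, f in pairs:
            # E-step: P(z | e, f) proportional to P(e|z) P(f|z) P(z),
            # computed in log space for numerical stability.
            log_post = (np.log(prior)
                        + np.log(lm_e[:, e]).sum(axis=1)
                        + np.log(lm_f[:, f]).sum(axis=1))
            post = np.exp(log_post - log_post.max())
            post /= post.sum()
            # M-step statistics: expected counts weighted by the posterior.
            c_z += post
            for w in e:
                c_e[:, w] += post
            for w in f:
                c_f[:, w] += post
        prior = c_z / c_z.sum()
        lm_e = (c_e + eps) / (c_e + eps).sum(axis=1, keepdims=True)
        lm_f = (c_f + eps) / (c_f + eps).sum(axis=1, keepdims=True)
    return prior, lm_e, lm_f
```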

Systems
We use a standard state-of-the-art phrase-based system. The Baseline system includes the MOSES baseline feature functions, plus eight hierarchical lexicalized reordering model feature functions (Galley and Manning, 2008). The training data is first word-aligned using GIZA++ (Och and Ney, 2003) and then symmetrized with grow-diag-final-and (Koehn et al., 2003). We limit the phrase length to a maximum of seven words. The language models are interpolated 5-grams with Kneser-Ney smoothing, estimated by KenLM (Heafield et al., 2013) from a large monolingual corpus of nearly 2.1B English words collected within the WMT 2015 MT Shared Task. Finally, we use MOSES as the decoder.
Our system is exactly the same as the baseline, plus three additional feature functions induced for the translation rules: two features for the domain specificity of phrases, on the source side ($D_\alpha(\vec{\tilde{f}})$) and the target side ($D_\alpha(\vec{\tilde{e}})$), and one feature for source-target coherence across subdomains ($D(\vec{\tilde{e}}, \vec{\tilde{f}})$). For the projection, we use K=12. We also explored different values of K, but did not observe significant differences in the scores. In our experiments we run one iteration of EM with parallel LMs (as described in Section 3) before continuing with the full model for three more iterations. We did not observe a significant improvement from running EM any longer. Finally, we use hard EM, as it has been found to yield better models than standard soft EM on a number of different tasks (e.g., Johnson (2007)). In other words, instead of standard 'soft' EM updates with phrase counts weighted according to the posterior $P(z = i \mid e, f)$, we use the 'winner-takes-all' approach, assigning each sentence pair entirely to its "winning" latent subdomain

$$\hat{z}_{\langle e, f \rangle} = \arg\max_{z} P(z \mid e, f).$$

In practice, we found that using this hard version leads to better performance.
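The hard assignment can be sketched as follows, replacing the soft posterior update in the earlier EM sketch:

```python
import numpy as np

def harden(post):
    """'Winner-takes-all' update: all posterior mass of a sentence pair
    is assigned to its most probable latent subdomain z-hat."""
    hard = np.zeros_like(post)
    hard[np.argmax(post)] = 1.0
    return hard
```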

Alternative tuning scenarios
In order to tune all systems, we use k-best batch MIRA (Cherry and Foster, 2012). We report translation accuracy with three metrics: BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2011) and TER (Snover et al., 2006). We mark an improvement as significant when we obtain a p-level of 5% under paired bootstrap resampling (Koehn, 2004). Note that better results correspond to larger BLEU and METEOR but to smaller TER. For every system reported, we run the optimizer at least three times before running MultEval (Clark et al., 2011) for resampling and significance testing. Note that the scores for the systems are averages over multiple runs.

Table 4: Adaptation results when tuning on the mixed-domain development set. Bold face indicates that the improvement over the baseline is significant.

For tuning the systems we explore two kinds of development sets: (1) an in-domain development set of in-domain data that directly exemplifies the translation task (i.e., a sample of target-domain data), and (2) a mixed-domain development set, which is a full concatenation of the development sets from all the available domains for a language pair; this scenario is the more realistic one when no in-domain data is available. In the analysis section we also test these two scenarios against a third scenario, mixed-domain minus in-domain, which excludes the in-domain development set from the mixed-domain development set. By exploring the three different development sets we hope to shed light on the importance of having samples from the target domain when using our features. If our features indeed capture domain invariance of phrases, then they should improve performance in all three settings, including the most difficult setting where the in-domain data has been explicitly excluded from the tuning phase.

Main results
In-domain tuning scenario. Table 3 presents the results for the in-domain development set scenario. Integrating the domain-invariance feature functions into the baseline results in significant improvements across all domains: an average of +0.50 BLEU on two adaptation tasks for English-French, +0.40 BLEU on three adaptation tasks for English-Spanish and +0.43 BLEU on four adaptation tasks for English-German.
Mixed-domain tuning scenario. While the improvements are robust and consistent for the in-domain development set scenario, we are especially delighted to see a similar improvement for the mixed-domain tuning scenario (Table 4). In detail, we observe an average of +0.45 BLEU on two adaptation tasks for English-French, +0.47 BLEU on three adaptation tasks for English-Spanish and +0.30 BLEU on four adaptation tasks for English-German. We would like to emphasize that this performance improvement is obtained without tuning specifically for the target domain or using other domain-related meta-information in the training corpus.

Additional analysis
We investigate the individual contribution of each domain-invariance feature. We conduct experiments using a basic large-scale phrase-based system described in Koehn et al. (2003) as the baseline. The baseline includes two bi-directional phrase-based models ($P_{count}(\tilde{e} \mid \tilde{f})$ and $P_{count}(\tilde{f} \mid \tilde{e})$), three penalties for word, phrase and distortion, and finally the language model. On top of the baseline, we build four different systems, each augmented with domain-invariance features. The first feature is the source-target coherence feature, $D(\vec{\tilde{e}}, \vec{\tilde{f}})$, with the Chebyshev distance as our default option. We also investigate the performance of other metrics, including the Hellinger distance and the Kullback-Leibler divergence. Our second and third features are the domain specificity of phrases on the source ($D_\alpha(\vec{\tilde{f}})$) and target ($D_\alpha(\vec{\tilde{e}})$) sides. Finally, we also deploy all three domain-invariance features together ($D_\alpha(\vec{\tilde{f}}) + D_\alpha(\vec{\tilde{e}}) + D(\vec{\tilde{e}}, \vec{\tilde{f}})$). The experiments are conducted for the Legal task on English-German.

Table 7: Translation outputs produced by the basic baseline and its augmented systems with additional feature functions derived from hidden domain information.

Table 5: Results for English-German (Task: Legal) with individual domain-invariance features.

Table 6: Using different metrics as the measure of coherence.

Table 5 and Table 6 present the results. Overall, we can see that all domain-invariance features contribute to adaptation performance. Specifically, we observe the following:

• Favouring source-target coherence across subdomains (i.e., adding the feature $D(\vec{\tilde{e}}, \vec{\tilde{f}})$) provides a significant translation improvement of +0.3 BLEU. Which specific similarity measure is used does not seem to matter much (see Table 6): we obtain the best result (+0.4 BLEU) with the KL divergence ($D_{KL}(\vec{\tilde{e}}, \vec{\tilde{f}})$), but the differences are not statistically significant.
• Integrating a preference for less domain-specific translation phrases on the target side ($D_\alpha(\vec{\tilde{e}})$) leads to a translation improvement of +0.6 BLEU.
• Doing the same for the source side ($D_\alpha(\vec{\tilde{f}})$), in turn, leads to an improvement of +1.0 BLEU.
• Augmenting the baseline with all our features leads to the best result, an improvement of +1.1 BLEU.
• The translation improvement is also observed when tuning on a development set of mixed domains (even in the mixed-domain minus in-domain setting, where the Legal data is excluded from the mixed development set).
German phrase "elektronischen Übertragung", and from "internal administrative systems" to "internal management systems" for the German phrase "internen verwaltungssysteme". The revisions, however, are not always successful. For instance, adding D α (ẽ) and D α (f ) resulted in revising the translation of the German phrase "rechtliche verpflichtungen" to "legal obligations", which is a worse choice (at least according to BLEU) than "legal commitments" produced by the baseline. We also present a brief analysis of latent subdomains induced by our projection framework.
For each subdomain $z$ we integrate, as feature functions, the domain posteriors $P(z \mid \tilde{e})$ and $P(z \mid \tilde{f})$ and the source-target domain-coherence feature $|P(z \mid \tilde{e}) - P(z \mid \tilde{f})|$. We hypothesize that whenever we observe an improvement on a translation task with these domain-informed features, the corresponding latent subdomain $z$ is close to the target translation domain.
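A sketch of these per-subdomain feature functions (the function name is hypothetical; p_e and p_f are the length-K posterior vectors from Section 2):

```python
def per_domain_features(p_e, p_f, z):
    """For a fixed subdomain z: the two domain posteriors of the source
    and target phrases, plus their absolute difference (per-domain
    source-target coherence)."""
    return p_e[z], p_f[z], abs(p_e[z] - p_f[z])
```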
The results are presented in Table 8. Apparently, among the latent subdomains, $z_4$, $z_5$, $z_6$ and $z_9$ are closest to the target domain of Hardware: their derived feature functions are helpful in improving the translation accuracy for the task. Similarly, $z_1$, $z_2$, $z_5$, $z_6$, $z_9$ and $z_{11}$ are closest to Professional & Business, $z_6$ is closest to Software, and $z_3$ is closest to Legal. Meanwhile, $z_4$, $z_5$ and $z_{12}$ are not relevant to the Software task. Similarly, $z_3$ is not relevant to Professional & Business, and $z_2$, $z_5$ and $z_{10}$ are not relevant to Legal.
Using topic models instead of latent domains. Our domain-invariance framework demands access to posterior distributions of latent domains for phrases. Though we argued for our domain induction approach, other latent variable models can be used to compute these posteriors. One natural option is topic models, more specifically LDA (Blei et al., 2003). Will our domain-invariance framework still work with topic models, and how closely related are the latent domains induced with LDA and with our model? These are the questions we study in this section.
We estimate LDA at the sentence level in a monolingual regime on one side of each parallel corpus (let us assume for now that this is the source side). Once the model is estimated, we obtain the posterior distributions over topics (we denote them as z, as we treat them as domains) for each source-side sentence in the training set. Then, as we did with our phrase induction framework, we associate these posteriors with every phrase on both the source and the target side of the corresponding sentence pair. The phrase and phrase-pair features defined in Section 2 are computed relying on these probabilities averaged over the entire training set. We try both directions, i.e., also estimating LDA on the target side and transferring the posterior probabilities to the source side.
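A minimal sketch of this substitution; we estimate LDA with Mallet's Gibbs sampler (described below), while here sklearn's variational LDA stands in purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def lda_sentence_posteriors(sentences, K=12, seed=0):
    """Estimate LDA on one side of the parallel corpus and return, for
    each sentence, its topic posterior (a length-K distribution). These
    rows are then shared with all phrases of the corresponding sentence
    pair, as with the sentence-level subdomain posteriors in Section 3."""
    counts = CountVectorizer().fit_transform(sentences)
    lda = LatentDirichletAllocation(n_components=K, random_state=seed)
    return lda.fit_transform(counts)  # rows approximate P(z | sentence)
```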
In order to estimate LDA, we used Gibbs sampling implemented in the Mallet package (McCallum, 2002) with the default hyper-parameter values (α = 0.01 and β = 0.01). Table 9 presents the results for the Legal task with three different system optimization settings; BLEU, METEOR and TER are reported. As the results suggest, using our induction framework tends to yield slightly better translation results in terms of METEOR and especially BLEU, while using LDA seems to lead to slightly better results in terms of TER.
Topics in LDA-like models encode co-occurrence patterns in bag-of-word representations of sentences.
In contrast, domains in our domain-induction framework rely on n-grams and word-alignment information. Consequently, these models are likely to encode different latent information about sentences. We also investigate translation performance when we use both the coherence features from LDA and the coherence features from our own framework. Table 10 shows that using all the induced coherence features results in the best translation quality, no matter which translation metric is used. We leave further exploration of such combinations for future work.

Related Work and Discussion
Domain adaptation is an important challenge for many NLP problems. Lexical selection appears to be the most common source of errors in domain adaptation scenarios (van der Wees et al., 2015). Other translation errors include reordering errors (Chen et al., 2013a), alignment errors (Cuong and Sima'an, 2015) and overfitting to the source domain at the parameter tuning stage (Joty et al., 2015). Adaptation in SMT can be regarded as injecting prior knowledge about the target translation task into the learning process. Various approaches have so far been exploited in the literature; they can be loosely categorized according to the type of prior knowledge exploited for adaptation. Often, a seed in-domain corpus exemplifying the target translation task is used as a form of prior knowledge. Various techniques can then be used for adaptation. For example, one approach is to combine a system trained on the in-domain data with another, general-domain system trained on the rest of the data (e.g., see Koehn and Schroeder (2007), Foster et al. (2010), Bisazza et al. (2011), Sennrich (2012b), Razmara et al. (2012), Sennrich et al. (2013), Haddow (2013), Joty et al. (2015)). Rather than using the entire training data, it is also common to combine the in-domain system with a system trained on a selected subset of the data (e.g., see Axelrod et al. (2011), Koehn and Haddow (2012), Duh et al. (2013), Kirchhoff and Bilmes (2014), Cuong and Sima'an (2014b)).
Recently, there has been some research on adapting simultaneously to multiple domains, a goal related to ours (Clark et al., 2012; Sennrich, 2012a). For instance, Clark et al. (2012) augment a phrase-based MT system with various domain indicator features to build a single system that performs well across a range of domains. Sennrich (2012a) proposed to cluster training data in an unsupervised fashion to build mixture models that yield good performance on multiple test domains. However, their approaches are very different from ours, which minimizes the risk associated with choosing domain-specific translations.
Moreover, the present work deviates radically from earlier work in that it explores the scenario where no prior data or knowledge about the translation task is available at training time. The focus of our approach is to aim for safer translations by rewarding domain-invariance of translation rules over latent subdomains, which can still be useful on adaptation tasks. The present study is inspired by prior work which exploits topic-insensitivity learned over documents for translation. The goal and setting we address are markedly different (i.e., we do not have access to meta-information about the training and translation tasks at all). The induced domain-invariance is integrated into SMT systems as feature functions, redirecting the decoder to a better search space for translation on adaptation tasks. This biases the decoder towards translations that are less domain-specific and more source-target domain coherent.
There is an interesting relation between this work and extensive prior work on minimum Bayes risk (MBR) objectives, used either at test time (Kumar and Byrne, 2004) or during training (Smith and Eisner, 2006; Pauls et al., 2009). As in our work, the goal of MBR minimization is to select translations that are less "risky". There, risk is due to the uncertainty in model predictions, and some of this uncertainty may indeed be associated with the domain-variability of translations. Still, a system trained with an MBR objective will tend to output the most frequent translation rather than the most domain-invariant one, which, as we argued in the introduction, might not be the right decision when applying the system across domains. We believe the two classes of methods are largely complementary, and leave further investigation for future work.
At a conceptual level, our work is also related to regularizers used in learning domain-invariant neural models, specifically autoencoders (Titov, 2011). Though these approaches also consider divergences between distributions of latent variable vectors, they use the divergences at learning time to bias models towards inducing representations that are maximally invariant across domains. Moreover, they assume access to meta-information about domains and consider only classification problems.

Conclusion
This paper aims at adapting machine translation systems to all domains at once by favoring phrases that are domain-invariant, i.e., safe to use across a variety of domains. While typical domain adaptation systems expect a sample of the target domain, our approach does not require one and is directly applicable to any domain adaptation scenario. Experiments show that the proposed approach results in modest but consistent improvements in BLEU, METEOR and TER. To the best of our knowledge, our results are the first to suggest consistent and significant improvements by a fully unsupervised adaptation method across a wide variety of translation tasks.
The proposed adaptation framework is fairly simple, leaving much space for future research. One potential direction is the introduction of additional features relying on the assignment of phrases to domains. The framework for inducing latent domains proposed in this paper should be beneficial in this future work. The implementation of our subdomain-induction framework is available at https://github.com/hoangcuong2011/UDIT.