Measuring Machine Translation Errors in New Domains

We develop two techniques for analyzing the effect of porting a machine translation system to a new domain. One is a macro-level analysis that measures how domain shift affects corpus-level evaluation; the second is a micro-level analysis for word-level errors. We apply these methods to understand what happens when a Parliament-trained phrase-based machine translation system is applied in four very different domains: news, medical texts, scientific articles and movie subtitles. We present quantitative and qualitative experiments that highlight opportunities for future research in domain adaptation for machine translation.


Introduction
When building a statistical machine translation (SMT) system, the expected use case is often limited to a specific domain, genre and register (henceforth "domain" refers to this set, in keeping with standard, imprecise, terminology), such as a particular type of legal or medical document. Unfortunately, it is expensive to obtain enough parallel data to reliably estimate translation models in a new domain. Instead, one can hope that large amounts of data from another, "old domain," might be close enough to stand as a proxy. This is the de facto standard: we train SMT systems on Parliament proceedings, but then use them to translate all sorts of new text. Unfortunately, this results in significantly degraded translation quality. In this paper, we present two complementary methods for quantitatively measuring the source of translation errors (§5.1 and §5.2) in a novel taxonomy (§4). We show quantitative (§7.1) and qualitative (§7.2) results obtained from our methods on four very different new domains: newswire, medical texts, scientific abstracts, and movie subtitles.
[Table 1 caption: There are three types of errors: unseen words (blue), incorrect sense selection (red) and unknown sense (green).]
Our basic approach is to think of translation errors in the context of a novel taxonomy of error categories, "S4." Our taxonomy contains categories for the errors shown in Table 1, in which an SMT system trained on the Hansard parliamentary proceedings is applied to a new domain (in this case, medical texts). Our categorization focuses on the following: new French words, new French senses, and incorrectly chosen translations. The first methodology we develop for studying such errors is a micro-level study of the frequency and distribution of these error types in real translation output at the level of individual words (§5.1), without respect to how these errors affect overall translation quality. The second is a macro-level study of how these errors affect translation performance (measured by BLEU; §5.2). One important feature of our methodologies is that we focus on errors that could possibly be fixed given access to data from a new domain, rather than all errors that might arise because the particular translation model used is inadequate to capture the required translation task (formally: we measure estimation error, not approximation error).
Our goal is neither to build better SMT systems nor to develop novel domain adaptation methods. We take an ab initio approach and ask: given a large unadapted, out-of-the-box SMT system, what happens when it is applied in a new domain? In order to answer this question, we will use parallel data in new domains, but only for testing purposes. The baseline SMT system is not adapted, except for the use of (1) a language model trained on monolingual new-domain language data [1], and (2) a few thousand parallel sentences of tuning data in the new domain.

Summary of Results
We conduct experiments across a variety of domains (described in §6) [2]. As in any study, our results are limited by assumptions about language, domains, and MT systems: these assumptions and their consequences are discussed in §8. Our high-level conclusions on the domains we study are summarized below (details may be found in §7).
1. Adapting an SMT system from the Parliament domain to the news domain is not a representative adaptation task; there are very few errors due to unseen words, far fewer than in all other domains. (This is despite the fact that most previous work focuses exclusively on using news as a "new" domain; see §3.)
2. For the remaining domains, unseen words have a significant effect, both in terms of BLEU scores and in terms of fine-grained translation distinctions. However, many of these words have multiple translations, and a system must be able to correctly select which one to use in a particular context.
3. Likewise, words that gain new senses account for approximately as much error as unseen words, suggesting a novel avenue for research in sense induction. Unfortunately, it appears that choosing the right sense for these at translation time is even more difficult than in the unseen word case.
4. The story is more complicated for seen words with known translations: if we limit ourselves to "high confidence" translations, there is a lot to be gained by improving the scores in translation models. However, for an entire phrase table, manipulating scores can hurt as often as it helps.
[1] We use old/new to refer to domains and source/target to refer to languages, to avoid ambiguity (we stay away from "in-domain" and "out-of-domain," which are themselves ambiguous).
[2] All source data, methodological code and outputs are available at http://hal3.name/damt.

Related Work
Most related work has focused on either (a) analyzing errors made by machine translation systems in a non-adaptation setting (Popović and Ney, 2011), or (b) trying to directly improve machine translation performance. A small amount of work (discussed next) addresses issues of analyzing MT systems in a domain adaptation setting.

Analysis of Domain Effects
To date, work on domain adaptation in SMT has mostly proposed methods to efficiently combine data from multiple domains. To the best of our knowledge, only a few studies have tried to understand how domain shifts affect translation quality (Duh et al., 2010; Bisazza et al., 2011; Haddow and Koehn, 2012). However, these start from different premises than this paper and, as a result, ask related but complementary questions. These previous analyses focus on how to improve a particular MT architecture (trained on new domain data) by injecting old domain data into a specific part of the pipeline in order to improve BLEU score. In comparison, we focus on finer-grained phenomena: we distinguish between effects previously lumped together as "missing phrase-table entries." Despite different starting assumptions, language pairs and data, some of our conclusions are consistent with previous work: in particular, we highlight the importance of differences in coverage in an adaptation setting. However, our fine-grained analysis shows that correctly scoring translations for previously unseen words and senses is a complex issue. Finally, these other studies suggest potential directions for refining our error categories: for instance, Haddow and Koehn (2012) show that the impact of additional new or old domain data is different for rare vs. frequent phrases.

Domain Adaptation for MT
Prior work focuses on methods combining data from old and new domains to learn translation and language models. Many filtering techniques have been proposed to select OLD data that is similar to NEW. Information retrieval techniques have been used to improve the language model (Zhao et al., 2004), the translation model (Hildebrand et al., 2005; Lu et al., 2007; Gong et al., 2011; Duh et al., 2010; Banerjee et al., 2012), or both (Lu et al., 2007); language model cross-entropy has also been used for data selection (Axelrod et al., 2011; Mansour et al., 2011; Sennrich, 2012).
Another research thread addresses corpus weighting, rather than hard filtering. Weighting has been applied at different levels of granularity: sentence pairs (Matsoukas et al., 2009), phrase pairs (Foster et al., 2010), n-grams (Ananthakrishnan et al., 2011), or sub-corpora through factored models (Niehues and Waibel, 2010). In particular, Foster et al. (2010) show that adapting at the phrase-pair level outperforms earlier, coarser corpus-level combination approaches (Foster and Kuhn, 2007). This is consistent with our analysis: domain shifts have a fine-grained impact on translation quality.
Finally, strategies have been proposed to combine sub-models trained independently on different sub-corpora. Linear interpolation is widely used for mixing language models in speech recognition, and it has also been used for adapting translation and language models in MT (Foster and Kuhn, 2007; Tiedemann, 2010; Lavergne et al., 2011). Log-linear combination fits well in existing SMT architectures (Foster and Kuhn, 2007; Koehn and Schroeder, 2007). Koehn and Schroeder (2007) consider both an intersection setting (where only entries occurring in all phrase-tables combined are considered), and a union setting (where entries which are not in the intersection are given an arbitrary null score). Razmara et al. (2012) take this approach further and frame combination as ensemble decoding.

Targeting Specific Error Types
The experiments conducted in this article motivated follow-up work on identifying when a word has gained a new sense in a new domain (Carpuat et al., 2013), as well as learning joint word translation probability distributions from comparable new domain corpora. Earlier, Daumé III and Jagarlamudi (2011) showed how mining translations for unseen words from comparable corpora can improve SMT in a new domain.

The S4 Taxonomy
We begin with a simple question: when we move an SMT system from an old domain to a new domain, what goes wrong? We employ a set of four error types as our taxonomy. We refer to these error types as SEEN, SENSE, SCORE and SEARCH, and together as the S4 taxonomy:
SEEN: an attempt to translate a source word or phrase that has never been seen before. For example, "voie(s)" in Table 1.
SENSE: an attempt to translate a previously seen source word or phrase, but for which the correct target language sense has never been observed. In Table 1, the Hansard-trained system had never seen "mode" translated as "method."
SCORE: an incorrect translation for which the system could have succeeded but did not because an incorrect alternative outweighed the correct translation. In a conventional translation system, this could be due to errors in the language model, translation model, or both. In Table 1, the Hansard-trained system had seen "administration" translated as "administration," but "directors" had a higher probability.
SEARCH: an error due to pruning in beam search.
When limiting oneself to issues of lexical selection, this set is exhaustive and disjoint: any lexical selection error made by an MT system can be attributed to exactly one of these error categories. This observation is important for developing methodologies for measuring the impact of each of these sources of error. Partitions of the set of errors that focus on categories other than lexical choice have been investigated by Vilar et al. (2006).

Methodology for Analyzing MT Systems
Given the S4 taxonomy for categorizing SMT errors, it would be possible (if painstaking) to manually annotate SMT output with error types. We prefer automated methods. In this section we describe two such methods: a micro-level analysis to see what happens at the word level (regardless of how it affects translation performance) and a macro-level analysis to discover impact on corpus translation performance. We focus on the first three S4 categories and separately discuss search errors (§7). In both cases, we use exact string match to detect translation equivalences, as has been done previously in other settings that also use word alignments to inspect errors or automatically generate data for other tasks (Blatz et al., 2004; Carpuat and Wu, 2007; Popović and Ney, 2011; Bach et al., 2011, among others).

Micro-analysis: WADE
We define Word Alignment Driven Evaluation, or WADE, which is a technique for analyzing MT system output at the word level, allowing us to (1) manually browse visualizations of MT output annotated with S4 error types, and (2) aggregate counts of errors. WADE is based on the fact that we can automatically word-align a French test sentence and its English reference translation, and the MT decoder naturally produces a word alignment between a French sentence and its machine translation. We can then check whether the MT output has the same set of English words aligned to each French word that we would hope for, given the reference.
In some ways, WADE is similar to the word-based analysis technique of Popović and Ney (2011). However, in contrast to that work, we do not directly align the hypothesis and reference translations but, rather, pivot through the source text. Additionally, we use WADE to annotate S4 errors, which are driven more by how lexical choice is made within the SMT framework than by linguistic properties of words in the reference and hypothesis translations. For example, in the case of domain adaptation, we do not expect the rate of inflectional errors to be affected by domain shift.
In WADE, the unit of analysis is each word alignment between a French word, f_i, and a reference English word, e_j. To annotate the aligned pair, a_{i,j}, we consider the word(s), H_i, in the output English sentence which are aligned (by the decoder) to f_i. If e_j appears in the set H_i, then the alignment a_{i,j} is marked correct. If not, the alignment is categorized with one of the S4 error types. If the French word f_i does not appear in the phrase table used for translation, then the alignment is marked as a SEEN error. If f_i does appear in the phrase table, but it is never observed translating as e_j, then the alignment is marked as a SENSE error. If f_i had been observed translating as e_j, but the decoder chose an alternate translation, then the alignment is marked as a SCORE error. Our results in §7 show that SEARCH errors are very infrequent, so we mark all errors other than SEEN and SENSE as SCORE errors. We make use of one additional category: Freebie. Our MT system copies unseen (aka "OOV") French words into the English output, and "freebies" are French words for which this is correct.
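The annotation rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released WADE code: the function name `wade_label`, the dictionary `seen_translations` (standing in for the phrase table), and the toy links are all hypothetical, and the English references in the toy data are invented for illustration.

```python
from collections import Counter

def wade_label(f, e, hyp_words, seen_translations):
    """Label one reference alignment link (f, e).

    f                 -- French source word of the link
    e                 -- reference English word aligned to f
    hyp_words         -- set of output words the decoder aligned to f
    seen_translations -- dict: French word -> set of English words it was
                         observed translating as (a stand-in for the
                         phrase table; assumption of this sketch)
    """
    in_table = f in seen_translations
    if e in hyp_words:
        # An unseen word copied into the output that matches the reference.
        return "CORRECT" if in_table else "FREEBIE"
    if not in_table:
        return "SEEN"
    if e not in seen_translations[f]:
        return "SENSE"
    # Everything else is counted as SCORE (SEARCH errors are folded in).
    return "SCORE"

# Toy, hypothetical data: (French word, reference word, decoder-aligned words).
seen_translations = {
    "mode": {"mode", "fashion"},
    "administration": {"administration", "directors"},
    "le": {"the"},
}
links = [
    ("voie", "route", {"voie"}),
    ("mode", "method", {"fashion"}),
    ("administration", "administration", {"directors"}),
    ("le", "the", {"the"}),
    ("EMEA", "EMEA", {"EMEA"}),
]
labels = [wade_label(f, e, hyp, seen_translations) for f, e, hyp in links]
counts = Counter(labels)
print(counts)
```

Checking correctness first and table membership second mirrors the order of the tests in the text; the Freebie branch reflects one reading of the definition, namely an out-of-table word whose copied form matches the reference.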
For WADE analysis only, we use the alignments yielded by a model trained over our train and test datasets and the grow-diag-final heuristic. Because WADE's unit of analysis is each alignment link between the source text and its reference, it ignores unaligned words in the input source text. Figure 1 shows an example of a WADE-annotated sentence. In addition to providing an easy way to visualize and browse the errors in MT output, WADE allows us to aggregate counts over the S4 error types. In our analysis (§7), we present results that show not only total numbers of each error type but also how WADE annotations change when we introduce some NEW-domain parallel training data. For example, SEEN errors could remain SEEN errors, become correct, or become SENSE or SCORE errors when we introduce additional training data.
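Because the source-reference alignment links are the same regardless of which system produced the output, the change in annotations between two systems can be tabulated link by link. The following is a hedged sketch of that bookkeeping; the function name `transition_table` and the toy label sequences are hypothetical and are not taken from the paper's data.

```python
from collections import Counter

def transition_table(old_labels, mixed_labels):
    """Percentage of links moving from each OLD label to each MIXED label.

    Both sequences must annotate the same alignment links in the same order.
    """
    if len(old_labels) != len(mixed_labels):
        raise ValueError("label sequences must cover the same alignment links")
    total = len(old_labels)
    pair_counts = Counter(zip(old_labels, mixed_labels))
    return {pair: 100.0 * n / total for pair, n in pair_counts.items()}

# Toy, hypothetical annotations for four links under two systems.
old_labels = ["SEEN", "SEEN", "SENSE", "CORRECT"]
mixed_labels = ["CORRECT", "SENSE", "SCORE", "CORRECT"]
table = transition_table(old_labels, mixed_labels)
for (before, after), pct in sorted(table.items()):
    print(f"{before:>8} -> {after:<8} {pct:5.1f}%")
```

Expressing each cell as a percentage of all links, rather than of the row total, keeps the cells directly comparable with overall error rates.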

Macro-analysis: TETRA
In this section, we discuss an approach to measuring the effect of each potential source of error when a translation system is considered in full. The key idea is to enhance the translation model of OLD, an MT system trained on old domain parallel text, to compare the impact of potential sources of improvement. We use parallel new domain data to propose enhancements to the OLD system. This provides a realistic measure of what could be achieved if one had access to parallel data in the new domain. The specific system we build, called MIXED, is a linear interpolation of a translation model trained only on old domain data and a model trained only on new domain data (Foster and Kuhn, 2007). The mixing weights are selected via grid search on a tuning set, selecting for BLEU. We call our approach TETRA: Table Enhancement for Translation Analysis.
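As a rough illustration of how a MIXED-style model could be formed, the sketch below linearly interpolates two toy translation distributions and picks the mixing weight by grid search. It is a simplification under stated assumptions: each phrase pair carries a single probability, one weight is shared by all entries, the grid has eleven points, and `toy_tuning_score` is a hypothetical stand-in for decoding a tuning set and computing BLEU. None of these details are specified in the text.

```python
def interpolate(p_old, p_new, lam):
    """Linear mixture lam * p_new + (1 - lam) * p_old over phrase pairs."""
    pairs = set(p_old) | set(p_new)
    return {pair: lam * p_new.get(pair, 0.0) + (1.0 - lam) * p_old.get(pair, 0.0)
            for pair in pairs}

def grid_search(p_old, p_new, tuning_score, grid):
    """Return the weight in `grid` whose mixed table scores best on tuning data."""
    return max(grid, key=lambda lam: tuning_score(interpolate(p_old, p_new, lam)))

# Toy conditional probabilities p(english | french); hypothetical values.
p_old = {("mode", "fashion"): 0.7, ("mode", "mode"): 0.3,
         ("chambre", "house"): 0.9, ("chambre", "chamber"): 0.1}
p_new = {("mode", "method"): 0.8, ("mode", "mode"): 0.2,
         ("chambre", "room"): 1.0}

def toy_tuning_score(table):
    # Stand-in for BLEU on a tuning set: rewards tables that give mass to
    # one new-domain translation and one old-domain translation.
    return table[("mode", "method")] * table[("chambre", "house")]

grid = [i / 10 for i in range(11)]
best_lam = grid_search(p_old, p_new, toy_tuning_score, grid)
mixed = interpolate(p_old, p_new, best_lam)
print(best_lam, mixed[("mode", "method")])
```

A useful property of the linear mixture is that the mixed scores for each source phrase still sum to one whenever both inputs do, which is not true of a log-linear combination.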
Below, we design experiments to tease apart the differences in domains by adjusting the models and enhancing OLD to be more like MIXED. We perform different enhancements depending on the error category we are targeting. As discussed in §6, our experiments are conducted using phrase-based SMT systems, so the translation models (TM) that are enhanced are the phrase table and reordering table.
Seen. In order to estimate the effect of SEEN errors, we enhance the TM of OLD by adding phrase pairs that translate words found only in the new-domain data, and we measure the BLEU improvement. More precisely, we identify the set of phrase pairs in the TM of MIXED for which the French side contains at least one word that does not appear in the old-domain training data. These are the phrases responsible for the SEEN errors. We build system TETRA+SEEN by adding these phrases to the TM of OLD. When adding these phrases, we add them together with their feature value scores.
Sense. Analogously, the phrases responsible for SENSE errors are those from MIXED where the French side exists in the phrase table of OLD, but their English translations do not. We build TETRA+SENSE by adding these phrases to OLD.
Score. To isolate and measure the effect of phrase scores, we consider the phrases that our OLD and MIXED systems have in common: the intersection of their translation tables. We build two systems, OLD SCORE and NEW SCORE, with identical phrase pairs; in OLD SCORE, the feature values are taken from the OLD system's tables; in NEW SCORE the feature values are taken from the MIXED system's tables.
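The three table enhancements described above amount to set operations on phrase pairs. The sketch below shows one way to express them, assuming a phrase table is a dictionary from (French phrase, English phrase) to a single score; real phrase tables carry several feature values, and the reordering table would need the same treatment. The function name `tetra_tables` and all toy entries are hypothetical.

```python
def tetra_tables(old_tm, mixed_tm, old_vocab):
    """Build the four analysis tables from OLD and MIXED phrase tables.

    old_tm, mixed_tm -- dict: (french_phrase, english_phrase) -> score
    old_vocab        -- set of French words in the old-domain training data
    """
    old_sources = {src for src, _ in old_tm}
    # SEEN: French side contains a word never seen in old-domain data.
    seen_add = {pair: score for pair, score in mixed_tm.items()
                if any(w not in old_vocab for w in pair[0].split())}
    # SENSE: French side is known to OLD, but this translation is not.
    sense_add = {pair: score for pair, score in mixed_tm.items()
                 if pair[0] in old_sources and pair not in old_tm}
    # SCORE: same phrase pairs in both tables, differing only in scores.
    shared = old_tm.keys() & mixed_tm.keys()
    return {
        "TETRA+SEEN": {**old_tm, **seen_add},
        "TETRA+SENSE": {**old_tm, **sense_add},
        "OLD SCORE": {pair: old_tm[pair] for pair in shared},
        "NEW SCORE": {pair: mixed_tm[pair] for pair in shared},
    }

# Toy, hypothetical tables.
old_tm = {("mode", "fashion"): 0.7, ("mode", "mode"): 0.3,
          ("administration", "administration"): 0.4,
          ("administration", "directors"): 0.6}
mixed_tm = {("mode", "method"): 0.5, ("mode", "fashion"): 0.2,
            ("mode", "mode"): 0.3,
            ("administration", "administration"): 0.8,
            ("administration", "directors"): 0.2,
            ("voie", "route"): 1.0,
            ("voie d' administration", "route of administration"): 1.0}
old_vocab = {"mode", "administration", "d'"}

tables = tetra_tables(old_tm, mixed_tm, old_vocab)
print(sorted(set(tables["TETRA+SEEN"]) - set(old_tm)))
print(sorted(set(tables["TETRA+SENSE"]) - set(old_tm)))
```

Added entries keep the scores they have in MIXED, matching the statement that phrases are added together with their feature values.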
Experimental Conditions

Domains and Data
We conduct our study on French-English datasets. We consider five very different domains for which large corpora are publicly available. The largest corpus is the Hansard parliamentary proceedings. Corpora in the four other domains are smaller and more specialized, and, thus, more naturally serve as new domains. For each new domain, we use all available data. We do not attempt to hold the amount of new domain data constant, as we suspect that such artificial constraints would not be sufficient to control for the very different natures of the domains. Detailed statistics for the parallel corpora are given in Table 2.
Hansard: Canadian parliamentary proceedings, consisting of manual transcriptions and translations of meetings of Canada's House of Commons and its committees from 2001 to 2009. Discussions cover a wide variety of topics, and speaking styles range from prepared speeches by a single speaker to more interactive discussions. It is significantly larger than Europarl, the common source of old domain data.
EMEA: Documents from the European Medicines Agency, made available with the OPUS corpora collection (Tiedemann, 2009). This corpus primarily consists of drug usage guidelines.
News: News commentary corpus made available for the WMT 2009 evaluation. It has been commonly used in the domain adaptation literature (Koehn and Schroeder, 2007; Foster and Kuhn, 2007; Haddow and Koehn, 2012, for instance).
Science: Parallel abstracts from scientific publications in many disciplines including physics, biology, and computer science. We collected data from two distinct sources: (1) Canadian Science Publishing made available translated abstracts from their journals which span many research disciplines; (2) parallel abstracts from PhD theses in Physics and Computer Science collected from the HAL public repository (Lambert et al., 2012).
Subs: Translated movie subtitles, available through the OPUS corpora collection (Tiedemann, 2009). In contrast to the other domains considered, subtitles consist of informal noisy text.
In this study, we use the Hansard domain as the OLD domain, and we consider four possible NEW domains: EMEA, News, Science and Subs. Data sets for all domains were processed consistently. After tokenization, we paid particular attention to normalization in order to minimize artificial differences when combining data, such as American, British and Canadian spellings. This proved particularly important for the news domain; the impact of SEEN was reduced by more than half after normalization.

MT systems
We build standard phrase-based SMT systems using the Moses toolkit for all experiments. Each system scores translation candidates using standard features: 5 phrase-table features, including phrasal translation probabilities and lexical weights in both translation directions, and a constant phrase penalty; 6 lexicalized reordering features, including bidirectional models built for monotone, swap, and discontinuous reorderings; 1 distance-based reordering feature; and 2 language models, a 5-gram model learned on the OLD domain, and a 5-gram model learned on the NEW domain.
Features are combined using a log-linear model optimized for BLEU, using the n-best batch MIRA algorithm (Cherry and Foster, 2012). This results in a strong large-scale OLD system, which performs well on the old domain and is a good starting point for studying domain shifts [5].

Results
Before moving on to the interesting results, we show that SEARCH is not a major source of error. We analyzed search errors separately by computing BLEU scores for each domain with varying beam size from 10 to 1000, using the OLD system. We find that increasing the beam from 10 to 200 yields approximately a one BLEU point advantage across all domains. Increasing it further (to 500 or 1000) does not bestow any additional advantages. This suggests that for sufficiently wide beams, search is unlikely to contribute to adaptation errors [6]. This is consistent with previous results obtained in non-adapted settings using other measurement techniques: search errors account for less than 5% of the error in modern MT systems (Wisniewski et al., 2010), or 0.13% for small beam settings with a "gap constraint" (Chang and Collins, 2011). We use a beam value of 200 for all other experiments in this work.

Quantitative Results
Results are summarized in Tables 3 and 4. Table 3 gives an overview of our WADE analysis on test sets in each domain translated using OLD and MIXED models. Table 4 shows BLEU score results based on the TETRA analysis.
We first present general observations based on each set of results. WADE shows that for news, new domain data helps solve only a small number of SEEN issues, and SENSE and SCORE errors remain essentially unchanged. TETRA agrees that SENSE and SCORE are not issues in this domain. In general, the OLD system performs better on news than on the other three domains. For comparison, using the OLD system to translate a test set in the old (Hansard) domain yields a BLEU score of 37.41 and, according to our WADE analysis, 67.64% of all alignments are correct. As in the news domain, most of the errors are SCORE followed by SENSE and then SEEN.
[5] [...] al., 1996) in both directions, combined using grow-diag-final (Koehn et al., 2005). We estimate alignments jointly on all data sets. Thus, TETRA may have artificially good phrase tables.
[6] This is likely dependent on language choice and the large amount of old domain parallel data.
For the other three domains, the two evaluation methods agree that SEEN is a fairly substantial problem. TETRA indicates that SENSE is a fairly substantial issue, but WADE does not show this for Science. For SCORE, TETRA detects significant room for improvement, especially for Science.
The large changes in BLEU score found with TETRA are somewhat surprising given how little the phrase tables change in each of these experimental conditions. For News, EMEA, and Science, adding unseen words results in an increase in number of phrase pairs between 0.045% (News) and 0.3% (Science). The sense additions were similarly small: from 0.15% (EMEA) to 0.59% (News). For Subs the story was different: adding unseen words amounted to a growth of 4.2% in phrase table size; sense amounted to 25.1%. In all cases, the size of the score phrase tables was only 0.05% smaller than that of OLD.
At first glance, the WADE and TETRA analyses of the SCORE error type seem to contradict each other. The MIXED systems are worse in terms of SCORE (positive deltas, more errors than OLD), but have better BLEU scores. To understand this discrepancy, we must recognize that TETRA analyzes the score errors in isolation: by restricting the phrase tables to the intersection of the OLD and MIXED phrase tables, we remove all SEEN and SENSE errors. In the WADE analysis, however, many errors that "used to be" SEEN errors under the OLD system become SCORE errors under the MIXED system. To see the full picture, we must look at how the different error categories change from the OLD system to the MIXED system in WADE. This is shown in Table 5.
In this table, the rightmost column contains the total percentage of errors in the OLD systems; the rows labeled Total show the total percentage of errors in the MIXED systems; the remaining cells show how these errors change from OLD to MIXED. For the news domain, the OLD system has 25.8% SCORE errors. Of those, 2.2% are fixed in the MIXED system.
For the three domains of interest (all except news), addressing SEEN errors can be substantially helpful, in terms of both BLEU score and the fine-grained distinctions considered by WADE. The more interesting conclusion, however, is that simply bringing in new words isn't enough. Table 5 shows that in these three domains there are a substantial number of errors that transition from being SEEN-Incorrect to SENSE-Incorrect. This indicates that besides observing a new word, we must also observe it with all of its correct translations.
Likewise, there is a lot to be gained in BLEU by correcting new SENSE translation errors (essentially the same percentage as for SEEN). But this is harder to solve. We can see in Table 5 that of the SENSE errors of the OLD system, half become correct but the other half become SCORE errors. So giving appropriate scores to the new senses is a challenge. This makes sense: these new senses are now "competing" with old ones, and getting the interpolation right between old and new domain tables is difficult.
For SCORE, the situation is more complicated. Our TETRA analysis clearly indicates that there is room for improvement. But this is based on intersected phrase tables, from which we removed seen and sense distinctions, and in which there is no competition between phrases from the OLD and NEW systems. The WADE analysis shows a positive effect only for Science. There, the data in Table 5 show that a large share of the SCORE errors (5.8% out of 20%) are corrected, but we also introduce a number of additional errors (3.6% that were correct, 0.3% that were SEEN and 1.6% that were SENSE). Similarly, in the EMEA domain, we fix 5% out of 18% SCORE errors but introduce 2.6% that were new sense errors before, 0.5% that were SEEN errors before, and make 3% additional errors on words we got right before. Subs is similar: out of 25% SCORE errors we fix 4.5%, but introduce 0.5% from SEEN and 5.4% from SENSE, and suffer additional error on 2.5% of what we had correct before.
Table 7 shows examples of the French words that WADE frequently identified as incorrectly translated by the OLD system due to SCORE or SENSE but that were correctly translated under the MIXED system [7]. For example, in the Science domain, 'mesures' suffered from SCORE errors under the OLD system. While its correct translation was often 'measurements,' the OLD system preferred its most probable translations ('savings,' 'actions,' 'issues,' and 'provisions'). Thirty of these error cases were correctly translated by the MIXED system. Similarly, in the Science domain, the French word 'finis,' when it should have been translated as 'finite,' was translated incorrectly due to a sense error 27 times. Its most frequent translations under the OLD system were 'finish,' 'finished,' and 'more.' The MIXED system corrected these sense errors. We omit examples of where seen errors made by OLD were frequently corrected by the MIXED system because they tend to be less interesting. Examples can be found in Daumé III and Jagarlamudi (2011).
[7] Complete output lists are available at http://hal3.name/damt.
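The SCORE bookkeeping quoted in the discussion of Table 5 can be netted out with simple arithmetic. The sketch below uses only the percentages stated in the text for Science, EMEA and Subs; the variable names are hypothetical, and it assumes no SCORE errors move to other error categories, since no such transitions are quoted.

```python
# Percentages of all alignment links, as quoted in the text for Table 5.
# "introduced" lists links that become SCORE errors under MIXED after being
# correct, SEEN, or SENSE under OLD (in that order).
score_transitions = {
    "Science": {"fixed": 5.8, "introduced": [3.6, 0.3, 1.6]},
    "EMEA": {"fixed": 5.0, "introduced": [3.0, 0.5, 2.6]},
    "Subs": {"fixed": 4.5, "introduced": [2.5, 0.5, 5.4]},
}

# Net change in the SCORE error rate when moving from OLD to MIXED.
net_change = {
    domain: round(sum(t["introduced"]) - t["fixed"], 1)
    for domain, t in score_transitions.items()
}
for domain, delta in net_change.items():
    print(f"{domain}: {delta:+.1f} points of SCORE error")
```

Under these figures the net change is negative only for Science (about -0.3 points, against roughly +1.1 for EMEA and +3.9 for Subs), which is consistent with WADE showing a positive effect on SCORE for Science alone.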

Qualitative Results
We annotate the French test sentences using the Stanford part-of-speech (POS) tagger (Toutanova et al., 2003) and examine which POS categories correspond to the most errors of each type. Using the OLD system, new sense errors in the Subs domain are made on French nouns 40% of the time and on verbs 35% of the time. In EMEA, 51% are nouns and 23% are adjectives; in Science, 51% nouns and 20% adjectives; in News, 46% nouns and 23% verbs. Seen errors show a very similar trend: in the Subs domain 50% are nouns and 25% verbs; in EMEA, 48% are nouns and 37% adjectives; in Science, 46% are nouns and 40% adjectives; in News, 46% are nouns and 28% adjectives. Similarly, for all domains, more score errors are made on source nouns than any other POS category. In summary, we find that most errors correspond to source language nouns, followed by adjectives, except for Subs, where verbs are also commonly mistranslated due to all error types.
Table 6 (left) shows some examples of how TETRA can automatically estimate the errors due to unseen words when moving to a new domain. For example, the OCR error "miie" in the source sentence is correctly translated as "miss" by the enhanced system. The enhanced phrase tables of TETRA can also automatically estimate the errors due to poor lexical choice when moving to a new domain, and can select a more lucid translation term. For example, the enhanced system appropriately selected "shoot" instead of "growth" in the Science example in [...] tion tables, the system produces translations that take on the flavor of the new domain, yielding higher BLEU scores. This can be observed in Table 6 (right) where the TETRA-enhanced system used the science-specific word "equilibrium" rather than the political word "balance."

Results on an Adapted System
To show how WADE can be used on already adapted systems, we performed a simple experiment based on a standard adaptation technique. We used bilingual cross-entropy difference (Axelrod et al., 2011) to quantify the distance between each OLD domain sentence pair and each NEW domain. We selected the top K closest sentences for each domain. For EMEA and Science, we set K to the size of the NEW domain data. For Subs, this would select nearly all of Hansard, so we arbitrarily set K = 1m. (We excluded the news domain.) We took this data, concatenated it to the NEW domain data, trained full models, and ran the WADE analysis on their outputs.
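To give a feel for the selection step, the sketch below ranks OLD sentence pairs by a bilingual cross-entropy difference and keeps the K lowest-scoring ones. It is only an illustration under loud assumptions: the add-one-smoothed unigram models stand in for the n-gram language models one would normally use, the OLD models are trained on the very pool being ranked, and the function names and toy sentences are hypothetical.

```python
import math
from collections import Counter

def unigram_lm(sentences):
    """Add-one-smoothed unigram model (toy stand-in for a real n-gram LM)."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unknown words
    return lambda w: (counts[w] + 1) / (total + vocab)

def cross_entropy(lm, sentence):
    """Per-word cross-entropy of a sentence under lm, in bits."""
    words = sentence.split()
    return -sum(math.log2(lm(w)) for w in words) / max(len(words), 1)

def select_closest(old_pairs, new_pairs, k):
    """Keep the k OLD pairs with the lowest bilingual cross-entropy difference."""
    new_src = unigram_lm([f for f, _ in new_pairs])
    new_tgt = unigram_lm([e for _, e in new_pairs])
    old_src = unigram_lm([f for f, _ in old_pairs])
    old_tgt = unigram_lm([e for _, e in old_pairs])

    def score(pair):
        f, e = pair
        # Lower means "looks like NEW and unlike OLD" on both language sides.
        return ((cross_entropy(new_src, f) - cross_entropy(old_src, f))
                + (cross_entropy(new_tgt, e) - cross_entropy(old_tgt, e)))

    return sorted(old_pairs, key=score)[:k]

# Toy, hypothetical sentence pairs (French, English).
old_pairs = [("le comité a voté", "the committee voted"),
             ("la dose recommandée", "the recommended dose"),
             ("le ministre a parlé", "the minister spoke")]
new_pairs = [("la dose est de deux comprimés", "the dose is two tablets"),
             ("dose recommandée chez l' adulte", "recommended dose in adults")]

selected = select_closest(old_pairs, new_pairs, k=1)
print(selected)
```

In the experiment described above, the selected pairs would then be concatenated with the NEW domain data before training.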
The trends across the three domains were remarkably similar. In all, SCORE errors in the adapted system were lower by around 2% than even the MIXED baseline (as much as 4% for Subs). This is likely because by excluding parts of the OLD domain most unlike the relevant NEW domain, the correct sense is observed more often. However, this comes at a price: SENSE and SEEN errors go up about 1% or 2% each. This suggests that a more fine-grained adaptation approach might achieve the best of both worlds.

Limiting Assumptions
This paper represents a partial exploration of the space of possible assumptions about models and data. We cannot hope to explore the combinatorial explosion of possibilities, and therefore have restricted our analysis to the following settings:
Phrase-based models. All of our experiments are carried out using phrase-based translation, as implemented in the open-source Moses translation system to ensure that they are reproducible. Our methods are easily extended to hierarchical phrase-based models (Chiang, 2007). It is not clear whether the same conclusions would hold: on the one hand, complex phrasal rules might overfit even more badly than phrases; on the other hand, hierarchical models might have more flexibility to generalize structures.
Translation languages. We only translate from French to English. This well-studied language pair presents several advantages: large quantities of data are publicly available in a wide variety of domains, and standard statistical machine translation architectures yield good performance. Unlike with more distant language pairs such as Chinese-English, or languages with radically different morphology or word order such as German or Czech, we know that the old-domain translation quality is high, and that translation failures during domain shift can be primarily attributed to domain issues rather than to problems with the SMT system.

Constant old domain. Our old domain is always drawn from the Hansards, and we vary only our new domain. It would be interesting to consider other datasets as old domains. We deliberately use only the Hansard data: based on its size and scope, we assume that it yields the most general of our SMT systems.
Monolingual new-domain data. We assume that we always have access to monolingual English data in the new domain for learning a domain-specific language model. Our focus is on the effect of the translation model; the effect of adapting language models has been studied previously (see §3). Without access to a new domain language model, the effect of unseen words and words with new senses is likely to be dramatically underestimated, because their translations are likely to be "thrown out" by an old-domain LM. Moreover, since SCORE errors conflate language model and translation model scores, using a new-domain language model lets us mostly isolate the effect of the translation model.
Parallel new-domain data for tuning. We assume that we always have access to a small amount of parallel data in the new domain, essentially for the purpose of running parameter tuning. Without this, one would not even be able to evaluate the performance of one's system, which is typically a non-starter.
Automatic word alignments for WADE. WADE is fundamentally based upon word alignments, so alignment errors may affect its accuracy. Such errors are obvious when manually inspecting sentence triples using the visualizer. When developing this tool, we checked that alignment noise does not invalidate conclusions drawn from WADE counts. In order to estimate how much alignment errors affect WADE, a French speaker manually corrected the word alignments for 955 EMEA test set sentences. The analyses based on the manually corrected alignments show fewer errors overall, but the erroneous annotations appear to be randomly distributed among all categories (details omitted for space). As a result, we believe that WADE yields results which are informative despite the inevitable automatic alignment errors. In particular, because alignments between a test and reference set are held constant in a system comparison, such errors should impact all analyses in the same way.

Discussion
Translation performance degrades dramatically when migrating an SMT system to a new and different domain. Our work demonstrates that the majority of this degradation in performance is due to SEEN and SENSE errors: namely, unknown source-language words and known source-language words with unknown translations. This result holds in all domains we studied, except for news, in which there appears to be little adaptation influence at all (especially after spelling normalization).
Our two analysis methods, WADE (Section 5.1) and TETRA (Section 5.2), are both lenses on the adverse effects of domain mismatch. Using WADE, we are able to pinpoint precise translation errors and their sources. This could be extended to more nuanced, human-assisted analysis of adaptation effects. WADE also "labels" translations with different error types, which could be used to train more complex models. Using TETRA, we are able to see how these errors affect overall translation performance. In principle, this performance could be any measure, including human assessment. We started with the BLEU metric since it is the most widely used in the community. One point of possible improvement would be to replace exact string match in WADE, and BLEU in TETRA, with metrics that are more morphologically or semantically informed.
Error analysis opens the door to building adapted machine translation systems that directly target specific error categories. As we have seen, most existing domain adaptation techniques in MT aim to improve translation quality in general, and are accordingly evaluated using corpus-level metrics such as BLEU. Our intuitive finer-grained analysis suggests that finer-grained models might be better suited to understanding and comparing the errors made by adapted and unadapted systems. We have shown that considering the S 4 taxonomy is important: improving coverage, for example, does not necessarily improve translation quality. Translation candidates must also be complete and must be scored correctly.
Our techniques provide an intuitive way to understand the effectiveness of new MT domain adaptation approaches.