Automatic Detection and Language Identification of Multilingual Documents

Language identification is the task of automatically detecting the language(s) present in a document based on the content of the document. In this work, we address the problem of detecting documents that contain text from more than one language (multilingual documents). We introduce a method that is able to detect that a document is multilingual, identify the languages present, and estimate their relative proportions. We demonstrate the effectiveness of our method over synthetic data, as well as real-world multilingual documents collected from the web.


Introduction
Language identification is the task of automatically detecting the language(s) present in a document based on the content of the document. Language identification techniques commonly assume that every document is written in one of a closed set of known languages for which there is training data; the task is thus formulated as selecting the most likely language from the set of training languages. In this work, we remove this monolingual assumption, and address the problem of language identification in documents that may contain text from more than one language from the candidate set. We propose a method that concurrently detects that a document is multilingual, and estimates the proportion of the document that is written in each language.
Detecting multilingual documents has a variety of applications. Most natural language processing techniques presuppose monolingual input data, so inclusion of data in foreign languages introduces noise, and can degrade the performance of NLP systems (Alex et al., 2007; Cook and Lui, 2012). Automatic detection of multilingual documents can be used as a pre-filtering step to improve the quality of input data. Detecting multilingual documents is also important for acquiring linguistic data from the web (Scannell, 2007; Abney and Bird, 2010), and has applications in mining bilingual texts for statistical machine translation from online resources (Resnik, 1999; Nie et al., 1999; Ling et al., 2013). There has been particular interest in extracting text resources for low-density languages from multilingual web pages containing both the low-density language and another language such as English (Yamaguchi and Tanaka-Ishii, 2012; King and Abney, 2013). King and Abney (2013, p. 1118) specifically mention the need for an automatic method "to examine a multilingual document, and with high accuracy, list the languages that are present in the document".
We introduce a method that is able to detect multilingual documents, and simultaneously identify each language present as well as estimate the proportion of the document written in that language. We achieve this with a probabilistic mixture model, using a document representation developed for monolingual language identification (Lui and Baldwin, 2011). The model posits that each document is generated as samples from an unknown mixture of languages from the training set. We introduce a Gibbs sampler to map samples to languages for any given set of languages, and use this to select the set of languages that maximizes the posterior probability of the document.
Our method is able to learn a language identifier for multilingual documents from monolingual training data. This is an important property as there are no standard corpora of multilingual documents available, whereas corpora of monolingual documents are readily available for a reasonably large number of languages (Lui and Baldwin, 2011). We demonstrate the effectiveness of our method empirically, firstly by evaluating it on synthetic datasets drawn from Wikipedia data, and then by applying it to real-world data, showing that we are able to identify multilingual documents in targeted web crawls of minority languages (King and Abney, 2013).
Our main contributions are: (1) we present a method for identifying multilingual documents, the languages contained therein and the relative proportion of the document in each language; (2) we show that our method outperforms state-of-the-art methods for language identification in multilingual documents; (3) we show that our method is able to estimate the proportion of the document in each language to a high degree of accuracy; and (4) we show that our method is able to identify multilingual documents in real-world data.

Background
Most language identification research focuses on language identification for monolingual documents (Hughes et al., 2006). In monolingual LangID, the task is to assign each document D a unique language L_i ∈ L. Some work has reported near-perfect accuracy for language identification of large documents in a small number of languages (Cavnar and Trenkle, 1994; McNamee, 2005). However, in order to attain such accuracy, a large number of simplifying assumptions have to be made (Hughes et al., 2006; Baldwin and Lui, 2010a). In this work, we tackle the assumption that each document is monolingual, i.e. that it contains text from a single language.
In language identification, documents are modeled as a stream of characters (Cavnar and Trenkle, 1994; Kikui, 1996), often approximated by the corresponding stream of bytes (Kruengkrai et al., 2005; Baldwin and Lui, 2010a) for robustness over variable character encodings. In this work, we follow Baldwin and Lui (2010a) in training a single model for languages that naturally use multiple encodings (e.g. UTF-8, Big5 and GB encodings for Chinese), as issues of encoding are not the focus of this research.
The document representation used for language identification generally involves estimating the relative distributions of particular byte sequences, selected such that their distributions differ between languages. In some cases the relevant sequences may be externally specified, such as function words and common suffixes (Giguet, 1995) or grammatical word classes (Dueire Lins and Gonçalves, 2004), though they are more frequently learned from labeled data (Cavnar and Trenkle, 1994; Grefenstette, 1995; Prager, 1999a; Lui and Baldwin, 2011).

Multilingual Documents
Language identification over documents that contain text from more than one language has been identified as an open research question (Hughes et al., 2006). Common examples of multilingual documents are web pages that contain excerpts from another language, and documents from multilingual organizations such as the European Union. The Australasian Language Technology Workshop 2010 hosted a shared task where participants were required to predict the language(s) present in a held-out test set containing monolingual and bilingual documents (Baldwin and Lui, 2010b). The dataset was prepared using data from Wikipedia, and bilingual documents were produced by combining a segment from a page in one language with a segment from the same page in another language. We use the dataset from this shared task for our initial experiments.
To the authors' knowledge, the only other work to directly tackle identification of multiple languages and their relative proportions in a single document is the LINGUINI system (Prager, 1999a). The system is based on a vector space model: classification uses the cosine similarity between a feature vector for the test document and a feature vector for each language L_i, computed as the sum of the feature vectors of all the documents for language L_i in the training data. The elements in the feature vectors are frequency counts over byte n-grams (2≤n≤5) and words. Language identification for multilingual documents is performed through the use of virtual mixed languages. Prager (1999a) shows how to construct vectors representative of particular combinations of languages independent of the relative proportions, and proposes a method for choosing combinations of languages to consider for any given document.
Language identification in multilingual documents could also be performed by applying supervised language segmentation algorithms. Given a system that can segment a document into labeled monolingual segments, we can then extract the languages present as well as the relative proportion of text in each language. Several methods for supervised language segmentation have been proposed. Teahan (2000) proposed a system based on text compression that identifies multilingual documents by first segmenting the text into monolingual blocks. Rehurek and Kolkus (2009) perform language segmentation by computing a relevance score between terms and languages, smoothing across adjoining terms, and finally identifying points of transition between high and low relevance, which are interpreted as boundaries between languages. Yamaguchi and Tanaka-Ishii (2012) use a minimum description length approach, embedding a compressive model to compute the description length of text segments in each language. They present a linear-time dynamic programming solution to optimize the location of segment boundaries and language labels.

Methodology
Language identification for multilingual documents is a multi-label classification task, in which a document can be mapped onto any number of labels from a closed set. In the remainder of this paper, we denote the set of all languages by L. We denote a document D which contains languages L_x and L_y as D → {L_x, L_y}. We denote a document that does not contain a language L_x by D ↛ {L_x}, though we generally omit all the languages not contained in the document for brevity. We denote classifier output using ⇒; e.g. D ⇒ {L_a, L_b} indicates that document D has been predicted to contain text in languages L_a and L_b.

Document Representation and Feature Selection
We represent each document D as a frequency distribution over byte n-gram sequences such as those in Table 1. Each document is converted into a vector where each entry counts the number of times a particular byte n-gram is present in the document. This is analogous to a bag-of-words model, where the vocabulary of "words" is a set of byte sequences that has been selected to distinguish between languages. The exact set of features is selected from the training data using Information Gain (IG), an information-theoretic metric developed as a splitting criterion for decision trees (Quinlan, 1993). IG-based feature selection combined with a naive Bayes classifier has been shown to be particularly effective for language identification (Lui and Baldwin, 2011).
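To make this concrete, the representation and IG-based selection can be sketched as follows. This is our own illustration rather than the authors' code: the function names are invented, IG is computed over a binary present/absent view of each n-gram, and n-grams up to length 4 are assumed.

```python
from collections import Counter
from math import log2

def byte_ngrams(text: bytes, n_max: int = 4) -> Counter:
    """Count all byte n-grams (1 <= n <= n_max) in a byte string."""
    return Counter(text[i:i + n]
                   for n in range(1, n_max + 1)
                   for i in range(len(text) - n + 1))

def entropy(counts):
    """Shannon entropy of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(docs_by_lang, feature):
    """IG of one binary feature (n-gram present/absent) w.r.t. language.

    docs_by_lang: dict mapping language -> list of n-gram Counters.
    """
    langs = list(docs_by_lang)
    prior = entropy([len(docs_by_lang[l]) for l in langs])  # class entropy
    present = [sum(1 for d in docs_by_lang[l] if feature in d) for l in langs]
    absent = [len(docs_by_lang[l]) - p for l, p in zip(langs, present)]
    n_total = sum(len(docs_by_lang[l]) for l in langs)
    n_pres = sum(present)
    cond = 0.0  # conditional entropy after splitting on the feature
    for split, n_split in ((present, n_pres), (absent, n_total - n_pres)):
        if n_split:
            cond += (n_split / n_total) * entropy(split)
    return prior - cond
```

In practice, the highest-scoring n-grams per language would be retained as the feature vocabulary, and each document represented as a count vector over that vocabulary.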

Generative Mixture Models
Generative mixture models are popular for text modeling tasks where a mixture of influences governs the content of a document, such as in multi-label document classification (McCallum, 1999; Ramage et al., 2009), and topic modeling (Blei et al., 2003). Such models normally assume full exchangeability between tokens (i.e. the bag-of-words assumption), and label each token with a single discrete label.
Multi-label text classification, topic modeling and our model for language identification in multilingual documents share the same fundamental representation of the latent structure of a document. Each label is modeled with a probability distribution over tokens, and each document is modeled as a probabilistic mixture of labels. As presented in Griffiths and Steyvers (2004), the probability of the i-th token w_i given a set of T labels z_1 ... z_T is modeled as:

P(w_i) = Σ_{j=1}^{T} P(w_i | z_i = j) P(z_i = j)    (1)

The set of tokens w is the document itself, which in all cases is observed. In the case of topic modeling, the tokens are words and the labels are topics, and z is latent. Whereas topic modeling is generally unsupervised, multi-label text classification is a supervised text modeling task, where the labels are a set of pre-defined categories (such as RUBBER, IRON-STEEL or TRADE in the popular Reuters-21578 dataset (Lewis, 1997)), and the tokens are individual words in documents. Here z is still latent, but constrained in the training data (i.e. documents are labeled but the individual words are not). Some approaches to labeling unseen documents require that z for the training data be inferred; methods for doing this include an application of the Expectation-Maximization (EM) algorithm (McCallum, 1999) and Labeled LDA (Ramage et al., 2009).

The model that we propose for language identification in multilingual documents is similar to multi-label text classification. In the framework of Equation 1, each per-token label z_i is a language, and the vocabulary of tokens is not given by words but rather by specific byte sequences (Section 3.1). The key difference from multi-label text classification is that we use monolingual (i.e. mono-label) training data. Hence, z is effectively observed for the training data, since all tokens in a training document must share the same label. To infer z for unlabeled documents, we utilize a Gibbs sampler, closely related to that proposed by Griffiths and Steyvers (2004) for LDA.
The sampling probability for a label z_i for token w_i in a document d is:

P(z_i = j | z_{-i}, w) ∝ φ̂^{(w_i)}_j · θ̂^{(d)}_{-i,j}

The label distribution for each document θ^{(d)}_j is assumed to have a Dirichlet prior with hyperparameter α, and the word distribution for each label φ^{(w)}_j is also assumed to have a Dirichlet prior with hyperparameter β. Following Griffiths (2002), φ is estimated as:

φ̂^{(w)}_j = (n^{(w)}_j + β) / (n^{(·)}_j + Wβ)

where n^{(w)}_j is the number of times word w occurs with label j, n^{(·)}_j is the total number of words that occur with label j, and W is the size of the vocabulary. By setting β to 1, we obtain standard Laplacian smoothing. Since φ̂ is fully determined by the monolingual training data, only θ̂^{(d)}_j is updated at each step in the Gibbs sampler:

θ̂^{(d)}_{-i,j} = (n^{(d)}_{-i,j} + α) / (n^{(d)}_{-i} + Tα)

where n^{(d)}_{-i,j} is the number of tokens in document d that are currently mapped to language j, and n^{(d)}_{-i} is the total number of tokens in document d; in both cases, the current assignment of z_i is excluded from the count. T is the number of languages (i.e. the size of the label set). For simplicity, we set α to 0. We note that in the LDA model, α and β influence the sparsity of the solution, and so it may be possible to tune these parameters for our model as well. We leave this as an avenue for further research.
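A minimal sketch of the sampler follows. This is our own illustration: `phi` is assumed to be precomputed from monolingual training data (with β = 1 smoothing), and a small floor of 1e-9 stands in for α = 0 so that a language whose current count reaches zero is not permanently excluded.

```python
import random
from collections import Counter

def gibbs_language_assignments(tokens, phi, langs, iters=200, seed=0):
    """Sample a language label z_i for each token of one document.

    phi[lang][token] approximates P(token | lang), estimated from
    monolingual training data and held fixed; the per-document mixture
    theta is collapsed out. The 1e-9 floors are a numerical guard of
    our own, standing in for alpha = 0.
    """
    rng = random.Random(seed)
    z = [rng.choice(langs) for _ in tokens]   # random initialisation
    counts = Counter(z)                       # n_j^(d): tokens per language
    for _ in range(iters):
        for i, w in enumerate(tokens):
            counts[z[i]] -= 1                 # exclude current assignment
            # P(z_i = j | z_-i, w) proportional to phi_j^(w) * n_{-i,j}^(d)
            weights = [phi[j].get(w, 1e-9) * (counts[j] + 1e-9) for j in langs]
            r = rng.random() * sum(weights)
            new = langs[-1]
            for j, wt in zip(langs, weights):
                if r <= wt:
                    new = j
                    break
                r -= wt
            z[i] = new
            counts[new] += 1
    return z
```

Tabulating the returned assignments and normalizing by the document length gives the per-language proportions P(L_j | D).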

Language Identification in Multilingual Documents
The model described in Section 3.2 can be used to compute the most likely distribution of languages to have generated an unlabeled document, over any given set of languages for which we have monolingual training data, by letting the set of terms w be the byte n-gram sequences selected using per-language information gain (Section 3.1), and allowing the labels z to range over the set of all languages L. Using the training data, we compute φ̂ (the per-language distribution over terms), and then infer P(L_j | D) for each L_j ∈ L for the unlabeled document by running the Gibbs sampler until the samples for z converge, tabulating the z_i over the whole document and normalizing by its length. Naively, we could identify the languages present in the document as D ⇒ {L_x : ∃i . z_i = L_x}, but closely related languages tend to have similar frequency distributions over byte n-gram features, and hence it is likely that some tokens will be incorrectly mapped to a language that is similar to the "correct" language.
We address this issue by finding the subset of languages λ ⊆ L that maximizes P(λ|D) (a similar approach is taken in McCallum (1999)). Through an application of Bayes' theorem, P(λ|D) ∝ P(D|λ)·P(λ), noting that P(D) is a normalizing constant and can be dropped. We assume that P(λ) is constant (i.e. any subset of languages is equally likely, a reasonable assumption in the absence of other evidence), and hence maximize P(D|λ). For any given D = w_1 ... w_n and λ, we infer P(D|λ) from the output of the Gibbs sampler:

P(D|λ) = Π_{i=1}^{n} Σ_{j∈λ} P(w_i | z_i = j) P(z_i = j)

where both P(w_i | z_i = j) and P(z_i = j) are estimated by their maximum likelihood estimates. In practice, exhaustive evaluation of the powerset of L is prohibitively expensive, and so we greedily approximate the optimal λ using Algorithm 1. In essence, we initially rank all the candidate languages by computing the most likely distribution over the full set of candidate languages. Then, for each of the top-N languages in turn, we consider whether to add it to λ. λ is initialized with L_u, a dummy language with a uniform distribution over terms (i.e. P(w|L_u) = 1/|W|, where W is the vocabulary). A language is added if it improves P(D|λ) by at least t. The threshold t is required to suppress the addition of spurious classes. Adding languages gives the model additional freedom to fit parameters, and so will generally increase P(D|λ); in the limit case, adding a completely irrelevant language will result in no tokens being mapped to that language, and so the model will be no worse than without it. The threshold t thus controls how much improvement is required before including a new language in λ.
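The greedy search can be sketched as follows. This is our own illustration of the idea behind Algorithm 1: `log_prob` is a hypothetical stand-in for the Gibbs-sampler-based estimate of log P(D|λ), and the candidate ranking is assumed to have been computed beforehand.

```python
UNIFORM = "<uniform>"   # dummy language L_u with a uniform term distribution

def select_language_set(doc, ranked_candidates, log_prob, t=1.0):
    """Greedily grow the language set lambda.

    ranked_candidates: the top-N candidate languages, e.g. ranked by the
        probability mass they receive under a Gibbs run over all of L.
    log_prob(doc, lang_set): returns log P(D | lang_set).
    t: minimum log-probability gain required before adding a language.
    """
    lam = [UNIFORM]
    best = log_prob(doc, lam)
    for lang in ranked_candidates:
        score = log_prob(doc, lam + [lang])
        if score - best >= t:          # only keep languages that clearly help
            lam.append(lang)
            best = score
    return [l for l in lam if l != UNIFORM]
```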

Benchmark Approaches
We compare our approach to two methods for language identification in multilingual documents: (1) the virtual mixed languages approach (Prager, 1999a); and (2) the text segmentation approach (Yamaguchi and Tanaka-Ishii, 2012). Prager (1999a) describes LINGUINI, a language identifier based on the vector-space model commonly used in text classification and information retrieval. The document representation used by Prager (1999a) is a vector of counts across a set of character sequences, with the feature set selected via a TFIDF-like approach. Terms with occurrence count m < n × k are rejected, where m is the number of times the term occurs in the training data (the TF component), n is the number of languages in which the term occurs (the IDF component, where "document" is replaced with "language"), and k is a parameter to control the overall number of terms selected. In Prager (1999a), the value of k is reported to be optimal in the region 0.3 to 0.5. In practice, the value of k indirectly controls the number of features selected. Values of k are not comparable across datasets as m is not normalized for the size of the training data, so in this work we do not report values of k and instead directly select the top-N features, weighted by m/n. In LINGUINI, each language is modeled as a single pseudo-document, obtained by concatenating all the training data for the given language. A document is then classified according to the language vector with which it has the smallest angle; this is implemented by finding the language vector with the highest cosine similarity to the document vector. Prager (1999a) also proposes an extension to the approach to allow identification of bilingual documents, and suggests how this may be generalized to any number of languages in a document.
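The monolingual decision rule of LINGUINI reduces to a cosine comparison, which might be sketched as follows (our own illustration of the vector-space step, not Prager's implementation):

```python
from collections import Counter
from math import sqrt

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def linguini_classify(doc_vec: Counter, lang_vecs: dict) -> str:
    """Monolingual LINGUINI decision: pick the language whose
    pseudo-document vector has the smallest angle to (i.e. the highest
    cosine with) the document vector."""
    return max(lang_vecs, key=lambda lang: cosine(doc_vec, lang_vecs[lang]))
```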
The gist of the method is simple: for any given pair of languages, the projection of a document vector onto the hyperplane containing the language vectors of the two languages gives the mixture proportions of the two languages that minimize the angle with the document vector. Prager (1999a) terms this projection a virtual mixed language (VML), and shows how to find the angle between the document vector and the VML. If this angle is less than that between the document vector and any individual language vector, the document is labeled as bilingual in the two languages from which the mixed vector was derived. The practical difficulty presented by this approach is that exhaustively evaluating all possible combinations of languages is prohibitively expensive. Prager (1999a) addresses this by arguing that in multilingual documents, "the individual component languages will be close to d (the document vector) - probably closer than most or all other languages". Hence, language mixtures are only considered for combinations of the top m languages. Prager (1999a) shows how to obtain the mixture coefficients for bilingual VMLs, arguing that the process generalizes. Prager (1999b) includes the coefficients for 3-language VMLs, which are much more complex than the 2-language variants. Using a computer algebra system, we verified the analytic forms of the coefficients in the 3-language VML. We also attempted to obtain an analytic form for the coefficients in a 4-language VML, but these were too complex for the computer algebra system to compute. Thus, our evaluation of the VML approach proposed by Prager (1999a) is limited to 3-language VMLs. Neither Prager (1999a) nor Prager (1999b) includes an empirical evaluation over multilingual documents, so to the best of our knowledge this paper is the first empirical evaluation of the method on multilingual documents.
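The geometric idea behind a 2-language VML can be sketched as a least-squares projection. This is our own illustration: rather than Prager's closed-form coefficients, we simply solve the 2×2 normal equations for the mixture that minimizes the angle to the document vector.

```python
from math import sqrt

def vml2(d: dict, v1: dict, v2: dict):
    """Project document vector d onto the plane spanned by two language
    vectors: the least-squares mixture a*v1 + b*v2 is the 2-language
    virtual mixed language closest (in angle) to d. Vectors are dicts
    mapping feature -> count."""
    def dot(a, b):
        return sum(a[k] * b[k] for k in a.keys() & b.keys())
    g11, g22, g12 = dot(v1, v1), dot(v2, v2), dot(v1, v2)
    b1, b2 = dot(v1, d), dot(v2, d)
    det = g11 * g22 - g12 * g12           # normal-equation determinant
    a = (b1 * g22 - b2 * g12) / det       # mixture coefficient of v1
    b = (g11 * b2 - g12 * b1) / det       # mixture coefficient of v2
    mix = {k: a * v1.get(k, 0) + b * v2.get(k, 0) for k in set(v1) | set(v2)}
    cos = dot(d, mix) / (sqrt(dot(d, d)) * sqrt(dot(mix, mix)))
    return cos, (a, b)
```

If the returned cosine exceeds the cosine with either individual language vector, the document would be labeled as bilingual in that pair.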
As no reference implementation of this method is available, we have produced our own implementation, which we have made freely available. The other benchmark we consider in this paper is the method for text segmentation by language proposed by Yamaguchi and Tanaka-Ishii (2012) (hereafter referred to as SEGLANG). The actual task addressed by Yamaguchi and Tanaka-Ishii (2012) is to divide a document into monolingual segments. This is formulated as the task of segmenting a document D = x_1, ..., x_|D| (where x_i denotes the i-th character of D and |D| is the length of the document) by finding a list of boundaries B = [B_1, ..., B_|B|], where each B_i indicates the location of a language boundary as an offset from the start of the document, resulting in a list of segments X = [X_0, ..., X_|B|]. For each segment X_i, the system predicts L_i, the language associated with the segment, producing a list of labellings L = [L_0, ..., L_|B|], with the constraint that adjacent elements in L must differ. Yamaguchi and Tanaka-Ishii (2012) solve the problem of determining X and L for an unlabeled text using a method based on minimum description length. They present a dynamic programming solution to this problem, and analyze a number of parameters that affect the overall accuracy of the system. Given this method to determine X and L, it is then trivial to label an unlabeled document as D ⇒ {L_x : L_x ∈ L}, and the length of each segment in X can then be used to determine the proportion of the document that is in each language. In this work, we use a reference implementation of SEGLANG kindly provided to us by the authors.
Using the text segmentation approach of SEGLANG to detect multilingual documents differs from LINGUINI and our method primarily in that LINGUINI and our method fragment the document into small sequences of bytes, and discard information about the relative order of the fragments. In contrast, SEGLANG utilizes this information in the sequential prediction of labels for consecutive segments of text, and is thus able to make better use of the locality of text (since there are likely to be monolingual blocks of text in any given multilingual document). The disadvantage of this is that the underlying model becomes more complex and hence more computationally expensive, as we observe in Section 5.

Table 2: Results on the ALTW2010 dataset. "Benchmark" is the benchmark system proposed by the shared task organizers. "Winner" is the highest-Fµ system submitted to the shared task.

Evaluation
We seek to evaluate the ability of each method: (1) to correctly identify the language(s) present in each test document; and (2) for multilingual documents, to estimate the relative proportion of the document written in each language. In the first instance, this is a classification problem, and the standard notions of precision (P), recall (R) and F-score (F) apply. Consistent with previous work in language identification, we report both the document-level micro-average and the language-level macro-average. For consistency with Baldwin and Lui (2010a), the macro-averaged F-score we report is the average of the per-class F-scores, rather than the harmonic mean of the macro-averaged precision and recall; as such, it is possible for the F-score to not fall between the precision and recall values. As is common practice, we compute the F-score for β = 1, giving equal importance to precision and recall. We tested the difference in performance for statistical significance using an approximate randomization procedure (Yeh, 2000) with 10000 iterations. Within each table of results (Tables 2, 3 and 4), all differences between systems are statistically significant at the p < 0.05 level.
To evaluate the predictions of the relative proportions of a document D written in each detected language L_i, we compare the proportion predicted by our model to the gold-standard proportion, measured as a byte ratio:

(length of the L_i portion of D in bytes) / (length of D in bytes)    (7)

We report the correlation between predicted and actual proportions in terms of Pearson's r coefficient. We also report the mean absolute error (MAE) over all document-language pairs.
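Both measures are straightforward to compute from paired lists of predicted and gold proportions; a sketch with our own helper functions:

```python
from math import sqrt

def mean_absolute_error(pred, gold):
    """MAE over all document-language proportion pairs."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(pred)

def pearson_r(xs, ys):
    """Pearson's r between predicted and gold proportions."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```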

Experiments on ALTW2010
Our first experiment utilizes the ALTW2010 shared task dataset (Baldwin and Lui, 2010b), a synthetic dataset of 10000 bilingual documents generated from Wikipedia data and introduced in the ALTW2010 shared task. The dataset is organized into training, development and test partitions. Following standard machine learning practice, we train each system using the training partition, and tune parameters using the development partition. We then report macro- and micro-averaged precision, recall and F-score on the test partition, using the tuned parameters. The results on the ALTW2010 shared task dataset are summarized in Table 2. Each of the three systems we compare was re-trained using the training data provided for the shared task, with a slight difference: in the shared task, participants were provided with multilingual training documents, but the systems targeted in this research require monolingual training data. We thus split the training documents into monolingual segments using the metadata provided with the dataset. The metadata was only published after completion of the task and was not available to task participants. For comparison, we have included the benchmark results published by the shared task organizers, as well as the score attained by the winning entry (Tran et al., 2010).
We tune the parameters for each system using the development partition of the dataset, and report results on the test partition. For LINGUINI, there is a single parameter k to be tuned: the number of features per language. We tested values between 10000 and 50000, and selected 46000 features as the optimal value. For our method, there are two parameters to be tuned: (1) the number of features selected for each language, and (2) the threshold t for including a language. We tested features-per-language counts between 30 and 150, and found that adding features beyond 70 per language had minimal effect. We tested values of the threshold t from 0.01 to 0.15, and found the best value was 0.14. For SEGLANG, we introduce a threshold t on the minimum proportion of a document (measured in bytes) that must be labeled by a language before that language is included in the output set. This was done because our initial experiments indicate that SEGLANG tends to over-produce labels. Using the development data, we found the best value of t was 0.10.
We find that of the three systems tested, two outperform the winning entry to the shared task. This is more evident in the macro-averaged results than in the micro-averaged results. In micro-averaged terms, our method is the best performer, whereas on the macro-average, SEGLANG has the highest F-score. This suggests that our method does well on higher-density languages (relative to the ALTW2010 dataset), and poorly on lower-density languages. This also accounts for the higher micro-averaged precision but lower micro-averaged recall of our method as compared to SEGLANG. The improved macro-averaged F-score of SEGLANG comes at a much higher computational cost, which increases dramatically as the number of languages is increased. In our testing on a 16-core workstation, SEGLANG took almost 24 hours to process the ALTW2010 shared task test data, compared to 2 minutes for our method and 40 seconds for LINGUINI. As such, SEGLANG is poorly suited to detecting multilingual documents where a large number of candidate languages is considered.
The ALTW2010 dataset is an excellent starting point for this research, but it predominantly contains bilingual documents, making it difficult to assess the ability of systems to distinguish multilingual documents from monolingual ones. Furthermore, we are unable to use it to assess the ability of systems to detect more than 2 languages in a document. To address these shortcomings, we construct a new dataset in a similar vein. The dataset and experiments performed on it are described in the next section.

Experiments on WIKIPEDIAMULTI
To fully test the capabilities of our model, we generated WIKIPEDIAMULTI, a dataset that contains a mixture of monolingual and multilingual documents. To allow for replicability of our results and to facilitate research in language identification, we have made the dataset publicly available. WIKIPEDIAMULTI is generated using excerpts from the mediawiki sources of Wikipedia pages downloaded from the Wikimedia Foundation. The dumps we used are from July-August 2010.
To generate WIKIPEDIAMULTI, we first normalized the raw mediawiki documents. Mediawiki documents typically contain one paragraph per line, interspersed with structural elements. We filtered each document to remove all structural elements, and only kept documents that exceeded 2500 bytes after normalization. This yielded a collection of around 500,000 documents in 156 languages. From this initial document set (hereafter referred to as WIKICONTENT), we only retained languages that had more than 1000 documents (44 languages), and generated documents for WIKIPEDIAMULTI as follows:

1. randomly select the number of languages K (1≤K≤5);
2. randomly select a set of K languages S = {L_i ∈ L for i = 1...K} without replacement;
3. randomly select a document for each L_i ∈ S from WIKICONTENT without replacement;
4. take the top 1/K of the lines of each document;
5. join the K sections into a single document.

As a result of this procedure, the relative proportion of each language in a multilingual document tends not to be uniform, as it is conditioned on the length of the original document from which it was sourced, independently of the other K−1 documents it was combined with. Overall, the average document length is 5500 bytes (standard deviation 3800 bytes). Due to rounding up when taking the top 1/K of the lines (step 4), documents with higher K tend to be longer (6200 bytes on average for K = 5 vs 5100 bytes for K = 1).

Table 3: Results on the WIKIPEDIAMULTI dataset.
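The generation procedure can be sketched as follows. This is our own illustration: `wikicontent` is a hypothetical in-memory stand-in for the WIKICONTENT collection, mapping each language to its pool of documents.

```python
import random

def make_multilingual_doc(wikicontent, rng):
    """Generate one WIKIPEDIAMULTI-style document.

    wikicontent: dict mapping language -> list of documents, where each
    document is a list of lines; documents are drawn without replacement.
    """
    k = rng.randint(1, 5)                           # 1. number of languages K
    langs = rng.sample(sorted(wikicontent), k)      # 2. K languages, no repeats
    parts = []
    for lang in langs:                              # 3. one document per language
        doc = wikicontent[lang].pop(rng.randrange(len(wikicontent[lang])))
        n = -(-len(doc) // k)                       # 4. top 1/K of lines, rounded up
        parts.append("\n".join(doc[:n]))
    return "\n".join(parts), langs                  # 5. join the K sections
```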
The WIKIPEDIAMULTI dataset contains training, development and test partitions. The training partition consists of 5000 monolingual (i.e. K = 1) documents. The development partition consists of 5000 documents, 1000 documents for each value of K where 1≤K≤5. The test partition contains 200 documents for each K, for a total of 1000 documents. There is no overlap between any of the partitions.

Results over WIKIPEDIAMULTI
We trained each system using the monolingual training partition, and tuned parameters using the development partition. For LINGUINI, we tested feature counts between 10000 and 50000, and found that the effect was relatively small. We thus use 10000 features as the optimum value. For SEGLANG, we tested values for threshold t between 0.01 and 0.20, and found that the maximal macro-averaged F-score is attained when t = 0.06. Finally, for our method we tested features-per-language counts between 30 and 130 and found the best performance with 120 features per language, although the actual effect of varying this value is rather small. We tested values of the threshold t for adding an extra language to λ from 0.01 to 0.15, and found that the best results were attained when t = 0.02.
The results of evaluating each system on the test partition are summarized in Table 3. In this evaluation, our method clearly outperforms both SEGLANG and LINGUINI. The results on WIKIPEDIAMULTI and ALTW2010 are difficult to compare directly due to the different compositions of the two datasets. ALTW2010 is predominantly bilingual, whereas WIKIPEDIAMULTI contains documents with text in 1-5 languages. Furthermore, the average document in ALTW2010 is half the length of that in WIKIPEDIAMULTI. Overall, we observe that SEGLANG has a tendency to over-label (despite the introduction of the t parameter to reduce this effect), evidenced by high recall but lower precision. LINGUINI is inherently limited in that it is only able to detect up to 3 languages per document, causing recall to suffer on WIKIPEDIAMULTI. However, it also tends to always output 3 languages, regardless of the actual number of languages in the document, hurting precision. Furthermore, even on ALTW2010 it has lower recall than the other two systems.

Estimating Language Proportions
In addition to detecting multiple languages within a document, our method also estimates the relative proportions of the document that are written in each language. This information may be useful for detecting documents that are candidate bitexts for training machine translation systems, since we may expect languages in the document to be present in equal proportions. It also allows us to identify the predominant language of a document.
A core element of our model of a document is a distribution over a set of labels. Since each label corresponds to a language, as a first approximation we take the probability mass associated with each label as a direct estimate of the proportion of the document written in that language. We examine the results for predicting the language proportions in the test partition of WIKIPEDIAMULTI. Mapping label distributions directly to language proportions already produces good results, with a Pearson's r of 0.863 and a mean absolute error (MAE) of 0.108.
Although labels have a one-to-one correspondence with languages, the label distribution does not actually correspond directly to the language proportion, because the distribution estimates the proportion of byte n-gram sequences associated with a label and not the proportion of bytes directly. The same number of bytes in different languages can produce different numbers of n-gram sequences, because after feature selection not all n-gram sequences are retained in the feature set. Hereafter, we refer to each n-gram sequence as a token, and the average number of tokens produced per byte of text as the token emission rate.
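The token emission rate, as defined above, is the number of retained n-gram tokens a text emits per byte; its reciprocal gives the bytes-per-token rate used in the correction below. A minimal sketch of the computation, assuming the post-feature-selection feature set is available as a set of byte n-grams (function and parameter names are ours, for illustration only):

```python
def token_emission_rate(text: bytes, feature_set: set, n_max: int = 4) -> float:
    """Count the byte n-gram tokens (n = 1..n_max) of `text` that survive
    feature selection, and return the count per byte of input text."""
    tokens = 0
    for n in range(1, n_max + 1):
        # Slide a window of width n over the byte string.
        for i in range(len(text) - n + 1):
            if text[i:i + n] in feature_set:
                tokens += 1
    return tokens / len(text)
```

Averaging this quantity over monolingual training documents for each language yields the per-language estimate; languages whose frequent n-grams were pruned during feature selection emit fewer tokens per byte.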
We estimate the per-language token emission rate using the training partition of WIKIPEDIAMULTI (Figure 1 shows an example of calculating the n-gram emission rate for a text string). To improve our estimates of the language proportions, we correct our label distribution using estimates of the per-language token emission rate R_{L_i}, in bytes per token, for each L_i ∈ L. Assume that a document D of length |D| is estimated to contain K languages in proportions P_i for i = 1...K. The corrected estimate P'_i for the proportion of L_i is:

P'_i = (P_i · R_{L_i}) / (Σ_{j=1}^{K} P_j · R_{L_j})

Note that the |D| term is common to the numerator and denominator, and has thus been eliminated. This correction improves our estimates of language proportions: after correction, Pearson's r rises to 0.981, and the MAE is reduced to 0.024. The improvement is most noticeable for language-document pairs where the proportion of the document in the given language is about 0.5 (Figure 2).
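The correction amounts to reweighting each language's label-probability mass by its bytes-per-token rate and renormalizing, with the document length cancelling out. A minimal sketch, assuming the label distribution and per-language rates are given (names are illustrative, not from the released system):

```python
def correct_proportions(label_probs, bytes_per_token):
    """Convert a label distribution over K languages into estimated byte
    proportions, weighting each language's probability mass P_i by its
    bytes-per-token rate R_{L_i} and renormalizing. The document length
    |D| cancels out of the normalization."""
    weighted = [p * r for p, r in zip(label_probs, bytes_per_token)]
    total = sum(weighted)
    return [w / total for w in weighted]
```

For example, a document whose label mass is split evenly between two languages, but where the first language emits tokens half as densely (2.0 vs. 1.0 bytes per token), is estimated to be two-thirds written in the first language.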

Real-world Multilingual Documents
So far, we have demonstrated the effectiveness of our proposed approach using synthetic data. The results have been excellent, and in this section we validate the approach by applying it to a real-world task that has recently been discussed in the literature. Yamaguchi and Tanaka-Ishii (2012) and King and Abney (2013) both observe that in trying to gather linguistic data for "non-major" languages from the web, one challenge faced is that documents retrieved often contain sections in another language. SEGLANG (the solution of Yamaguchi and Tanaka-Ishii (2012)) concurrently detects multilingual documents and segments them by language, but the approach is computationally expensive and has a tendency to over-label (Section 5). On the other hand, the solution of King and Abney (2013) is incomplete, and they specifically mention the need for an automatic method "to examine a multilingual document, and with high accuracy, list the languages that are present in the document". In this section, we show that our method is able to fill this need.

We make use of manually-annotated data kindly provided to us by Ben King, which consists of 149 documents containing 42 languages, retrieved from the web using a set of targeted queries for low-density languages. Note that the dataset described in King and Abney (2013) was based on manual confirmation of the presence of English in addition to the low-density language of primary interest; our dataset contains these bilingual documents as well as monolingual documents in the low-density language of interest. Our purpose in this section is to investigate the ability of automatic systems to select this subset of bilingual documents. Specifically, given a collection of documents retrieved for a target language, the task is to identify the documents that contain text in English in addition to the target language.

Table 4: Detection accuracy for English-language inclusion in web documents from targeted web crawls for low-density languages.
Thus, we re-train each system for each target language, using only training data for English and the target language. We reserve the data provided by Ben King for evaluation, and train our methods using data separately obtained from the Universal Declaration of Human Rights (UDHR). Where UDHR translations for a particular language were not available, we used data from Wikipedia or from a bible translation. Approximately 20-80 kB of data were used for each language. As we do not have suitable development data, we made use of the best parameters for each system from the experiments on WIKIPEDIAMULTI.
We find that all 3 systems are able to detect that each document contains the target language with 100% accuracy. However, the systems vary in their ability to detect whether a document also contains English in addition to the target language. The detection accuracy for English-language inclusion is summarized in Table 4. For comparison, we include a heuristic baseline that labels all documents as containing English. We find that, like the heuristic baseline, SEGLANG and LINGUINI both tend to over-label documents, producing false positive labels of English, resulting in increased recall at the expense of precision. Our method produces fewer false positives (but slightly more false negatives). Overall, our method attains the best F-score for detecting English inclusions. Manual error analysis suggests that the false negatives for our method generally occur where a relatively small proportion of the document is written in English.
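The precision/recall trade-off discussed above reduces to standard binary detection metrics over per-document English-inclusion labels. A generic sketch (the function is ours, not the paper's evaluation code):

```python
def detection_metrics(gold, predicted):
    """Precision, recall, and F-score for binary English-inclusion
    detection, given per-document boolean gold and predicted labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```

Under these metrics, the all-English heuristic baseline achieves perfect recall by construction, so any system must trade some recall for precision to beat it on F-score.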

Future Work
Document segmentation by language could be accomplished by combining our method with the method of King and Abney (2013); the result could then be compared to the method of Yamaguchi and Tanaka-Ishii (2012) in the context of constructing corpora for low-density languages using the web. Another area for future work identified in this paper is the tuning of the parameters α and β in our model (currently α = 0 and β = 1), which may affect the sparsity of the model.
Further work is required in dealing with cross-domain effects, to allow for "off-the-shelf" language identification in multilingual documents. Previous work has shown that it is possible to generate a document representation that is robust to variation across domains (Lui and Baldwin, 2011), and we intend to investigate if these results are also applicable to language identification in multilingual documents. Another open question is the extension of the generative mixture models to "unknown" language identification (i.e. eliminating the closed-world assumption (Hughes et al., 2006)), which may be possible through the use of non-parametric mixture models such as Hierarchical Dirichlet Processes (Teh et al., 2006).

Conclusion
We have presented a system for language identification in multilingual documents using a generative mixture model inspired by supervised topic modeling algorithms, combined with a document representation based on previous research in language identification for monolingual documents. We showed that the system outperforms alternative approaches from the literature on synthetic data, as well as on real-world data from related research on linguistic corpus creation for low-density languages using the web as a resource. We also showed that our system is able to accurately estimate the proportion of the document written in each of the languages identified. We have made a full reference implementation of our system freely available, as well as the synthetic dataset prepared for this paper (Section 5), in order to facilitate the adoption of this technology and further research in this area.