Language Modeling for Morphologically Rich Languages: Character-Aware Modeling for Word-Level Prediction

Neural architectures are prominent in the construction of language models (LMs). However, word-level prediction is typically agnostic of subword-level information (characters and character sequences) and operates over a closed vocabulary, consisting of a limited word set. Indeed, while subword-aware models boost performance across a variety of NLP tasks, previous work did not evaluate the ability of these models to assist next-word prediction in language modeling tasks. Such subword-level informed models should be particularly effective for morphologically-rich languages (MRLs) that exhibit high type-to-token ratios. In this work, we present a large-scale LM study on 50 typologically diverse languages covering a wide variety of morphological systems, and offer new LM benchmarks to the community, while considering subword-level information. The main technical contribution of our work is a novel method for injecting subword-level information into semantic word vectors, integrated into the neural language modeling training, to facilitate word-level prediction. We conduct experiments in the LM setting where the number of infrequent words is large, and demonstrate strong perplexity gains across our 50 languages, especially for morphologically-rich languages. Our code and data sets are publicly available.


Introduction
Language Modeling (LM) is a key NLP task, serving as an important component for applications that require some form of text generation, such as machine translation (Vaswani et al., 2013), speech recognition (Mikolov et al., 2010), dialogue generation (Serban et al., 2016), or summarisation (Filippova et al., 2015).
A traditional recurrent neural network (RNN) LM setup operates on a limited closed vocabulary of words (Bengio et al., 2003;Mikolov et al., 2010). The limitation arises due to the model learning parameters exclusive to single words. A standard training procedure for neural LMs gradually modifies the parameters based on contextual/distributional information: each occurrence of a word token in training data contributes to the estimate of a word vector (i.e., model parameters) assigned to this word type. Low-frequency words therefore often have incorrect estimates, not having moved far from their random initialisation. A common strategy for dealing with this issue is to simply exclude the low-quality parameters from the model (i.e., to replace them with the <unk> placeholder), leading to only a subset of the vocabulary being represented by the model. This limited vocabulary assumption enables the model to bypass the problem of unreliable word estimates for low-frequency and unseen words, but it does not resolve it. The assumption is far from ideal, partly due to the Zipfian nature of each language (Zipf, 1949), and its limitation is even more pronounced for morphologically-rich languages (MRLs): these languages inherently generate a plethora of words by their morphological systems. As a consequence, there will be a large number of words for which a standard RNN LM cannot guarantee a reliable word estimate.
Since gradual parameter estimation based on contextual information is not feasible for rare phenomena in the full vocabulary setup (Adams et al., 2017), it is of crucial importance to construct and enable techniques that can obtain these parameters in alternative ways. One solution is to draw information from additional sources, such as characters and character sequences. As a consequence, such character-aware models should facilitate LM word-level prediction in a real-life LM setup which deals with a large number of low-frequency or unseen words.
Efforts into this direction have yielded exciting results, primarily on the input side of neural LMs. A standard RNN LM architecture relies on two word representation matrices learned during training for its input and next-word prediction. This effectively means that there are two sets of per-word specific parameters that need to be trained. Recent work shows that it is possible to generate a word representation on-the-fly based on its constituent characters, thereby effectively solving the problem for the parameter set on the input side of the model (Kim et al., 2016;Luong and Manning, 2016;Miyamoto and Cho, 2016;Ling et al., 2015). However, it is not straightforward how to advance these ideas to the output side of the model, as this second set of word-specific parameters is directly responsible for the next-word prediction: it has to encode a much wider range of information, such as topical and semantic knowledge about words, which cannot be easily obtained from its characters alone (Jozefowicz et al., 2016).
While one solution is to directly output characters instead of words (Graves, 2013; Miyamoto and Cho, 2016), recent work by Jozefowicz et al. (2016) suggests that such purely character-based architectures, which do not reserve parameters for information specific to single words, cannot attain state-of-the-art LM performance on word-level prediction.
In this work, we combine the two worlds and propose a novel LM approach which relies on both word-level (i.e., contextual) and subword-level knowledge. In addition to training word-specific parameters for word-level prediction using a regular LM objective, our method encourages the parameters to also reflect subword-level patterns by injecting knowledge about morphology. This information is extracted in an unsupervised manner based on already available information in convolutional filters from earlier network layers. The proposed method leads to large improvements in perplexity across a wide spectrum of languages: 22 in English, 144 in Hebrew, 378 in Finnish, 957 in Korean on our LM benchmarks. We also show that the gains extend to another multilingual LM evaluation set, compiled recently for 7 languages by Kawakami et al. (2017).
We conduct a systematic LM study on 50 typologically diverse languages, sampled to represent a variety of morphological systems. We discuss the implications of typological diversity on the LM task, both theoretically in Section 2, and empirically in Section 7; we find a clear correspondence between performance of state-of-the-art LMs and structural linguistic properties. Further, the consistent perplexity gains across the large sample of languages suggest wide applicability of our novel method.
Finally, this article can also be read as a comprehensive multilingual analysis of current LM architectures on a set of languages which is much larger than the ones used in recent LM work (Botha and Blunsom, 2014; Vania and Lopez, 2017; Kawakami et al., 2017). We hope that this article, with its new datasets, methodology and models, all available online at http://people.ds.cam.ac.uk/dsg40/lmmrl.html, will pave the way for true multilingual research in language modeling.

LM Data and Typological Diversity
A language model defines a probability distribution over sequences of tokens, and is typically trained to maximise the likelihood of token input sequences. Formally, the LM objective is expressed as follows:

$$P(t_1, \dots, t_n) = \prod_{i=1}^{n} P(t_i \mid t_1, \dots, t_{i-1}) \tag{1}$$

where $t_i$ is the token with index $i$ in the sequence. For word-level prediction a token corresponds to one word, whereas for character-level (also termed char-level) prediction it is one character.
LMs are most commonly tested on Western European languages. Standard LM benchmarks in English include the Penn Treebank (PTB) (Marcus et al., 1993), the 1 Billion Word Benchmark (BWB) (Chelba et al., 2014), and the Hutter Prize data (Hutter, 2012). English datasets extracted from BBC News (Greene and Cunningham, 2006) and IMDB Movie Reviews (Maas et al., 2011) are also used for LM evaluation (Wang and Cho, 2016; Miyamoto and Cho, 2016; Press and Wolf, 2017).
Regarding multilingual LM evaluation, Botha and Blunsom (2014) extract datasets for other languages from the sets provided by the 2013 Workshop on Statistical Machine Translation (WMT) (Bojar et al., 2013): they experiment with Czech, French, Spanish, German and Russian. A recent work of Kim et al. (2016) reuses these datasets and adds Arabic. Ling et al. (2015) evaluate on English, Portuguese, Catalan, German and Turkish datasets extracted from Wikipedia. Verwimp et al. (2017) use a subset of the Corpus of Spoken Dutch (Oostdijk, 2000) for Dutch LM. Kawakami et al. (2017) evaluate on 7 European languages using Wikipedia data, including Finnish.
Perhaps the largest and most diverse set of languages used for multilingual LM evaluation so far is the one of Vania and Lopez (2017). Their study includes 10 languages in total representing several morphological types (fusional, e.g., Russian, and agglutinative, e.g., Finnish), as well as languages with particular morphological phenomena (root-and-pattern in Hebrew and reduplication in Malay). In this work, we provide LM evaluation datasets for 50 typologically diverse languages, with their selection guided by structural properties.
Language Selection Aiming for a comprehensive multilingual LM evaluation, we include languages for all possible types of morphological systems. Our starting point is the Polyglot Wikipedia (PW) (Al-Rfou et al., 2013). While at first PW seems comprehensive and quite large already (covering 40 languages), the majority of the PW languages are similar from both a genealogical perspective (26/40 are Indo-European) and a geographic perspective (28/40 Western European). As a consequence, they share many patterns and are not a representative sample of the world's languages.
In order to quantitatively analyse global trends and cross-linguistic generalisations across a large set of languages, we propose to test on all PW languages and source additional data from the same domain, Wikipedia, considering candidates in descending order of corpus size and morphological type. (Chinese, Japanese, and Thai are sourced from Wikipedia and processed with the Polyglot tokeniser, since we found that their preprocessing in the PW is not adequate for language modeling.) Traditionally, languages have been grouped into four main types: isolating, fusional, introflexive and agglutinative, based on their position along a spectrum that measures their preference for breaking up concepts into many words (at one extreme) or composing them into single words (at the other extreme). However, even languages belonging to the same type display different out-of-vocabulary rates and type-token ratios. This happens because languages specify different subsets of grammatical categories (such as tense for verbs, or number for nouns) and values (such as future for tense, plural for number). The number of grammatical categories expressed in a language determines its inflectional synthesis (Bickel and Nichols, 2013).
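The two corpus statistics mentioned above, type-token ratio and the number of unseen word types, can be computed directly from tokenised text. The following is a minimal sketch; the function names and the toy sentences are illustrative assumptions and are not taken from the released code.

```python
def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def unseen_type_count(train_tokens, test_tokens):
    """Number of word types in the test data that never occur in training."""
    return len(set(test_tokens) - set(train_tokens))


# Toy example (hypothetical data).
train = "the cat sat on the mat".split()
test = "the dog sat".split()
ttr = type_token_ratio(train)               # 5 types over 6 tokens
new_types = unseen_type_count(train, test)  # only "dog" is unseen
```

A higher type-token ratio at a fixed corpus size means that more parameters in a word-level model are estimated from very few occurrences.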
In our final sample of languages, we select languages belonging to morphological types different from the fusional one, which is over-represented in the PW. In particular, we include new isolating (Min Nan, Burmese, Khmer), agglutinative (Basque, Georgian, Kannada, Tamil, Mongolian, Javanese), and introflexive languages (Amharic).
As the underlying model we opt for the state-of-the-art neural LM architecture of Kim et al. (2016): it has been shown to work across a number of languages and in a large-scale setup (Jozefowicz et al., 2016). It already provides a solution for the input side parameters of the model by building word vectors based on the word's constituent character sequences. However, its output side still operates with a standard word-level matrix within the closed and limited vocabulary assumption. We refer to this model as Char-CNN-LSTM and describe its details in the following. Figure 1 (left) illustrates the model architecture.
Char-CNN-LSTM constructs input word vectors based on the characters in each word using a convolutional neural network (CNN) (LeCun et al., 1989), then processes the input at the word level using an LSTM (Hochreiter and Schmidhuber, 1997). The next word is predicted using word embeddings, a large number of parameters which have to be trained specifically to represent the semantics of single words. We refer to this space of word representations as $M^w$.
Formally, for the input layer the model trains a look-up matrix $C \in \mathbb{R}^{|V^c| \times d_c}$, corresponding to one $d_c$-dimensional vector per character $c$ in the char vocabulary $V^c$. For each input, it takes a sequence of characters of a fixed length $m$, $[c_1, \dots, c_m]$, where $m$ is the maximum length of all words in the word vocabulary $V^w$, and the length of each word is $l \leq m$.
Looking up all characters of a word yields a sequence of char representations in $\mathbb{R}^{d_c \times l}$, which is zero-padded to fit the fixed length $m$. For each word one gets a sequence of char representations $C^w \in \mathbb{R}^{d_c \times m}$, passed through a 1D convolution:

$$f^w_i = \mathrm{Conv1D}(C^w, H_i) \tag{2}$$

where $H_i \in \mathbb{R}^{d_{f,i} \times s_i}$ is a filter or kernel of size/width $s_i$; the convolution scores each window of $C^w$ against the kernel with the Frobenius inner product $\langle A, B \rangle = \mathrm{Tr}(AB^T)$. The model has multiple filters, $H_i$, with kernels of different width, $s_i$, and dimensionality $d_{f,i}$; $i$ is used to index filters. Since the model performs a convolution over char embeddings, $s_i$ corresponds to the char window the convolution is operating on: e.g., a filter of width $s_i = 3$ and $d_{f,i} = 150$ could be seen as learning 150 features for detecting 3-grams. By learning kernels of different width, $s_i$, the model can learn subword-level features for character sequences of different lengths.

$f^w_i$ is the output of taking the convolution with filter $H_i$ for word $w$. Since $f^w_i$ can get quite large, its dimensionality is reduced using max-over-time (1D) pooling:

$$y^w_i[j] = \max_{k} f^w_i[j, k] \tag{3}$$

Here, $j$ indexes the $d_{f,i}$ dimensions of $f^w_i$, $k$ ranges over the character positions, and $y^w_i \in \mathbb{R}^{d_{f,i}}$. This corresponds to taking the maximum value for each feature of $H_i$, with the intuition that the most informative feature would have the highest activation. The output of all max-pooling operations $y^w_i$ is concatenated to form a word vector $y^w \in \mathbb{R}^{d_p}$, where $d_p$ is the number of all features for all $H_i$:

$$y^w = [\, y^w_1; y^w_2; \dots \,] \tag{4}$$

This vector is passed through a highway network (Srivastava et al., 2015) to give the network the possibility to reweigh or transform the features: $h^w = \mathrm{Highway}(y^w)$. So far all transformations were done per word; after the highway transformation, word representations are processed in a sequence by an LSTM (Hochreiter and Schmidhuber, 1997). The LSTM yields one output vector $o^w_t$ per word in the sequence, given all previous time steps $[y^{w_1}, \dots, y^{w_{t-1}}]$.
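The construction of the word vector $y^w$ (character look-up, zero-padding, convolution, max-over-time pooling, concatenation) can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the function name `char_cnn_word_vector`, the layout of `filters` (one kernel of shape width × $d_c$ per feature), and the omission of bias terms and non-linearities are illustrative simplifications.

```python
def char_cnn_word_vector(word, char_emb, filters, max_len):
    """Build a word vector from characters: look-up, pad, convolve, pool, concatenate.

    char_emb: dict mapping a character to its d_c-dimensional vector.
    filters:  list of (width, kernels); each kernel is a width x d_c matrix
              and yields one feature (assumed layout, for illustration only).
    """
    d_c = len(next(iter(char_emb.values())))
    # Look up the character vectors and zero-pad to the fixed length max_len.
    pad = [[0.0] * d_c] * (max_len - len(word))
    cols = [char_emb[c] for c in word] + pad
    y = []
    for width, kernels in filters:
        for kernel in kernels:
            best = float("-inf")
            for k in range(max_len - width + 1):
                # Frobenius inner product of the character window with the kernel.
                act = sum(cols[k + a][b] * kernel[a][b]
                          for a in range(width) for b in range(d_c))
                best = max(best, act)  # max-over-time pooling
            y.append(best)
    return y  # concatenation of the pooled features of all filters


# Toy example: one width-2 kernel that responds to "a" followed by "b".
char_emb = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
filters = [(2, [[[1.0, 0.0], [0.0, 1.0]]])]
vec_ab = char_cnn_word_vector("ab", char_emb, filters, max_len=3)
vec_ba = char_cnn_word_vector("ba", char_emb, filters, max_len=3)
```

In the toy example the kernel activates more strongly on "ab" than on "ba", which mirrors the intuition that a filter of width $s_i$ acts as a detector for particular character $s_i$-grams.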
To predict the next word $w_{t+1}$, one takes the dot product of the vector $o^w_t \in \mathbb{R}^{1 \times d_l}$ with a lookup matrix $M^w \in \mathbb{R}^{d_l \times |V^w|}$, where $d_l$ corresponds to the LSTM hidden state size. The resulting vector $p_{t+1} \in \mathbb{R}^{1 \times |V^w|}$ is normalised to contain values between 0 and 1, representing a probability distribution over the next word. This corresponds to calculating the softmax function for every word $k$ in $V^w$:

$$P(w_{t+1} = k \mid o_t) = \frac{\exp(o_t \cdot m_k)}{\sum_{k' \in V^w} \exp(o_t \cdot m_{k'})}$$

where $P(w_{t+1} = k \mid o_t)$ is the probability of the next word $w_{t+1}$ being $k$ given $o_t$, and $m_k$ is the output embedding vector taken from $M^w$.
Word-Level Vector Space: $M^w$ The model parameters in $M^w$ can be seen as the bottleneck of the model, as they need to be trained specifically for single words, leading to unreliable estimates for infrequent words. As an analysis of the corpus statistics later in Section 7 reveals, the Zipfian effect and its influence on word vector estimation cannot be fully resolved even with a large corpus, especially taking into account how flexible MRLs are in terms of word formation and combination. Yet, having a good estimate for the parameters in $M^w$ is essential for the final LM performance, as they are directly responsible for the next-word prediction. Therefore, our aim is to improve the quality of representations in $M^w$, focusing on infrequent words. To achieve this, we turn to another source of information: character patterns. In other words, since $M^w$ does not have any information about character patterns from lower layers, we seek a way to: a) detect words with similar subword structures (i.e., "morpheme"-level information), and b) let these words share their semantic information.
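The next-word prediction step can be illustrated with a short sketch. It assumes, for convenience, that $M^w$ is stored as a list with one output embedding $m_k$ per vocabulary word; the function name `next_word_distribution` is hypothetical, and subtracting the maximum score is a standard numerical-stability device that leaves the probabilities unchanged.

```python
import math


def next_word_distribution(o_t, M_w):
    """Softmax over the dot products of the LSTM output with each output embedding."""
    # One score per vocabulary word k: the dot product o_t . m_k.
    scores = [sum(a * b for a, b in zip(o_t, m_k)) for m_k in M_w]
    top = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example with a hidden size of 2 and a vocabulary of 3 words.
probs = next_word_distribution([1.0, 0.0], [[2.0, 0.0], [0.0, 5.0], [2.0, 1.0]])
```

Because every word $k$ has its own vector $m_k$, a rarely observed word receives few gradient updates to that vector, which is the estimation problem discussed in this section.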

Character-Aware Vector Space
The CNN part of Char-CNN-LSTM, see Eq. (3), in fact provides information about such subword-level patterns: the model constructs a word vector $y^w$ on-the-fly based on the word's constituent characters. We let the model construct $y^w$ for all words in the vocabulary, resulting in a character-aware word vector space $M^c \in \mathbb{R}^{|V^w| \times d_p}$. The construction of the space is completely unsupervised and independent of the word's context; only the first (CNN) network layers are activated. Our core idea is to leverage this information obtained from $M^c$ to influence the output matrix $M^w$, and consequently the network prediction, and extend the model to handle unseen words.
We first take a closer look at the character-aware space $M^c$, and then describe how to improve and expand the semantic space $M^w$ based on the information contained in $M^c$ (Section 5). Each vocabulary entry in $M^c$ encodes character n-gram patterns about the represented word, for $1 \leq n \leq 7$. The n-gram patterns arise through filters of different lengths, and their maximum activation is concatenated to form each individual vector $y^w$. The matrix $M^c$ is of dimensionality $|V^w| \times 1100$, where each of the 1,100 dimensions corresponds to the activation of one kernel feature. In practice, dimensions [0, 1, ... We delve deeper into the filter activations and analyse the key properties of the vector space $M^c$.

The qualitative analysis reveals that many features are interpretable by humans, and indeed correspond to frequent subword patterns, as illustrated in Table 1. For instance, tokenised Chinese data favours short words: consequently short filters activate strongly for one or two characters. The first two filters (width 1) are highly active for two common single characters each: one filter is active for 更 (again, more), 不 (not), and the other for 今 (now), 代 (time period). Larger filters (width 5-7) do not show interpretable patterns in Chinese, since the vocabulary largely consists of short words (length 1-4). Agglutinative languages show a tendency towards long words. We find that medium-sized filters (width 3-5) are active for morphemes or short common subword units, and the long filters are activated for different surface realisations of the same root word. In Turkish, one filter is highly active on various forms of the word üniversite (university). Further, in MRLs with the Latin alphabet short filters are typically active on capitalisation or special chars.

Table 2 shows examples of nearest neighbours based on the activations in $M^c$. The space seems to be arranged according to shared subword patterns based on the CNN features. It does not rely only on a simple character overlap, but also captures shared morphemes. This property is exploited to influence the LM output word embedding matrix $M^w$ in a completely unsupervised way, as illustrated on the right side of Figure 1.

Fine-Tuning the LM Prediction
While the output vector space $M^w$ captures word-level semantics, $M^c$ arranges words by subword features. A model which relies solely on character-level knowledge (similar to the information stored in $M^c$) for word-level prediction cannot fully capture word-level semantics and even hurts LM performance (Jozefowicz et al., 2016). However, shared subword units still provide useful evidence of shared semantics (Cotterell et al., 2016): injecting this into the space $M^w$ to additionally reflect shared subword-level information should lead to improved word vector estimates, especially for MRLs.

Fine-Tuning and Constraints
We inject this information into $M^w$ by adapting recent fine-tuning (often termed retrofitting or specialisation) methods for vector space post-processing (Faruqui et al., 2015; Wieting et al., 2015; Vulić et al., 2017, i.a.). These models enrich initial vector spaces by encoding external knowledge provided in the form of simple linguistic constraints (i.e., word pairs) into the initial vector space.
There are two fundamental differences between our work and previous work on specialisation. First, previous models typically use rich hand-crafted lexical resources such as WordNet (Fellbaum, 1998) or the Paraphrase Database (Ganitkevitch et al., 2013), or manually defined rules to extract the constraints, while we generate them directly using the implicit knowledge coded in $M^c$. Second, our method is integrated into a language model: it performs updates after each epoch of the LM training. In Section 5.2, we describe our model for fine-tuning $M^w$ based on the information provided in $M^c$.
Our fine-tuning approach relies on constraints: positive and negative word pairs $(x_i, x_j)$, where $x_i, x_j \in V^w$. Iterating over each cue word $x_w \in V^w$, we find a set of positive word pairs $P_w$ and negative word pairs $N_w$: their extraction is based on their (dis)similarity with $x_w$ in $M^c$. Positive pairs $(x_w, x_p)$ contain the words $x_p$ yielding the highest cosine similarity to $x_w$ (i.e., its nearest neighbours) in $M^c$. Negative pairs $(x_w, x_n)$ are constructed by randomly sampling words $x_n$ from the vocabulary. Since $M^c$ gets updated during the LM training, we (re)generate the sets $P_w$ and $N_w$ after each epoch.
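The constraint extraction described above can be sketched as follows. This is a simplified illustration, not the released implementation: the names `cosine` and `extract_constraints` are hypothetical, $M^c$ is assumed to be a dictionary from words to character-aware vectors, and excluding the cue word itself from both candidate lists is our assumption.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def extract_constraints(M_c, cue_words, n_pos=3, n_neg=3, seed=0):
    """Return a dict: cue word -> (positive words, negative words)."""
    rng = random.Random(seed)
    vocab = sorted(M_c)
    constraints = {}
    for w in cue_words:
        others = [x for x in vocab if x != w]  # assumption: a word is never paired with itself
        # Positives: nearest neighbours of the cue word by cosine similarity in M_c.
        ranked = sorted(others, key=lambda x: cosine(M_c[w], M_c[x]), reverse=True)
        positives = ranked[:n_pos]
        # Negatives: words sampled uniformly at random from the vocabulary.
        negatives = rng.sample(others, min(n_neg, len(others)))
        constraints[w] = (positives, negatives)
    return constraints


# Toy character-aware space (hypothetical vectors).
M_c = {"walk": [1.0, 0.0], "walks": [0.9, 0.1], "walked": [0.8, 0.3], "tree": [0.0, 1.0]}
pairs = extract_constraints(M_c, ["walk"], n_pos=2, n_neg=2)
```

Since the sets are regenerated after each epoch, a function of this kind would be called once per epoch on the current state of $M^c$.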

Attract-Preserve
We now present a method for fine-tuning the output matrix $M^w$ within the Char-CNN-LSTM LM framework. As noted, the fine-tuning procedure runs after each epoch of the standard log-likelihood LM training (see Figure 1). We adapt a variant of a state-of-the-art post-processing specialisation procedure (Wieting et al., 2015). The idea of the fine-tuning method, which we label Attract-Preserve (AP), is to pull the positive pairs closer together in the output word-level space, while pushing the negative pairs further away.
Let $v_i$ denote the word vector of the word $x_i$. The AP cost function has two parts: attract and preserve. In the attract term, using the extracted sets $P_w$ and $N_w$, we push the vector of $x_w$ to be closer to $x_p$ by a similarity margin $\delta$ than to its negative sample $x_n$:

$$attr = \sum_{(x_w, x_p) \in P_w,\ (x_w, x_n) \in N_w} \mathrm{ReLU}(\delta + v_w \cdot v_n - v_w \cdot v_p)$$

where $\mathrm{ReLU}(x)$ is the standard rectified linear unit (Nair and Hinton, 2010). The $\delta$ margin is set to 0.6 in all experiments, as in prior work, without any subsequent fine-tuning.
The preserve cost acts as a regularisation pulling the "fine-tuned" vector back to its initial value:

$$pres = \sum_{x_w \in V^w} \lambda_{reg} \, \lVert \hat{v}_w - v_w \rVert_2$$

where $\lambda_{reg} = 10^{-9}$ is the $L_2$-regularisation constant and $\hat{v}_w$ is the original word vector before the procedure. This term tries to preserve the semantic content present in the original vector space, as long as this information does not contradict the knowledge injected by the constraints. The final cost function adds the two costs: $cost = attr + pres$.
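To make the two cost terms concrete, the sketch below evaluates an Attract-Preserve style cost for given vectors. It is an illustration under assumptions: the function name `attract_preserve_cost` is hypothetical, similarity is taken to be the plain dot product, and every positive of a cue word is combined with every negative of that cue word, which is one possible reading of the description above.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attract_preserve_cost(vecs, init_vecs, positives, negatives, delta=0.6, lam=1e-9):
    """vecs / init_vecs: dict word -> current / original output vector.
    positives / negatives: dict cue word -> list of words."""
    attr = 0.0
    for w, pos_words in positives.items():
        for p in pos_words:
            for n in negatives.get(w, []):
                # Hinge loss: the cue should be more similar to p than to n by at least delta.
                attr += max(0.0, delta + dot(vecs[w], vecs[n]) - dot(vecs[w], vecs[p]))
    # Preserve: L2 distance of every vector from its value before fine-tuning.
    pres = sum(lam * math.dist(init_vecs[w], vecs[w]) for w in vecs)
    return attr + pres


# Toy example: "w" is already close to its positive "p" and far from its negative "n".
vecs = {"w": [1.0, 0.0], "p": [1.0, 0.0], "n": [0.0, 1.0]}
satisfied = attract_preserve_cost(vecs, vecs, {"w": ["p"]}, {"w": ["n"]})
violated = attract_preserve_cost(vecs, vecs, {"w": ["n"]}, {"w": ["p"]})
```

In the toy example the first call incurs no cost because the margin is already satisfied, whereas swapping the roles of the positive and the negative word produces a positive attract cost that gradient updates would then reduce.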

Experiments
Datasets We use the Polyglot Wikipedia (Al-Rfou et al., 2013) for all available languages except for Japanese, Chinese, and Thai, and add these and further languages using Wikipedia dumps. The Wiki dumps were cleaned and preprocessed by the Polyglot tokeniser. We construct similarly-sized datasets by extracting 46K sentences for each language from the beginning of each dump, filtered to contain only full sentences, and split into train (40K), validation (3K), and test (3K). The final list of languages along with standard language codes (ISO 639-1 standard, used throughout the paper) and statistics on vocabulary and token counts are provided in Table 4.

Evaluation Setup
We report perplexity scores (Jurafsky and Martin, 2017, Chapter 4.2.1) using the full vocabulary of the respective LM dataset. This means that we explicitly decide to retain also infrequent words in the modeled data. Replacing infrequent words by a placeholder token <unk> is a standard technique in LM to obtain equal vocabulary sizes across different datasets. Motivated by the observation that infrequent words constitute a significant part of the vocabulary in MRLs, and that vocabulary sizes naturally differ between languages, we have decided to avoid the <unk> placeholder for low-frequency words, and run all models on the full vocabulary (Adams et al., 2017;Grave et al., 2017).
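For reference, perplexity is the exponentiated average negative log-probability that the model assigns to the tokens of the evaluation data. A minimal sketch follows; the function name `perplexity` and the toy probabilities are illustrative.

```python
import math


def perplexity(token_probs):
    """Exponentiated average negative log-probability of the evaluated tokens."""
    nll = -sum(math.log(p) for p in token_probs)
    return math.exp(nll / len(token_probs))


# A model that assigns probability 0.25 to every token has perplexity 4.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

Because of the exponentiation, a small number of tokens with very low probability, such as rare or unseen words in the full-vocabulary setup, can raise the score substantially.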
We believe that this full-vocabulary setup offers additional insight into the standard LM techniques, leading to evaluation which pinpoints crucial limitations of current word-based models with regard to morphologically-rich languages. In our setup the vocabulary contains all words occurring at least once in the training set. To ensure a fair comparison between all neural models, words occurring only in the test set are mapped to a random vector with the same technique for all neural models, as described next.
Sampling Vectors of Unseen Words Since zero-shot semantic vector estimation at test time is an unresolved problem, we seek an alternative way to compare model predictions at test time. We report all results with unseen test words being mapped to one randomly sampled <unk> vector. The <unk> vector is part of the vocabulary at training time, but remains untrained and at its random initialisation since it never occurs in the training data. Therefore, we sample a random <unk> vector at test time from the same part of the space as the trained vectors, using a normal distribution with the mean and the variance of $M^w$ and the same fixed random seed for all models. We employ this methodology for all neural LM models, and thereby ensure that results are comparable.
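The sampling procedure can be sketched as follows. The function name `sample_unk_vector` is hypothetical, and computing a single mean and variance over all entries of the output matrix (rather than per dimension) is an assumption, since the text does not specify this detail.

```python
import random
import statistics


def sample_unk_vector(M_w, seed=0):
    """Sample an <unk> vector from a normal distribution fitted to the trained vectors.

    M_w: list of trained output vectors, one per vocabulary word.
    """
    values = [x for row in M_w for x in row]
    mean = statistics.fmean(values)   # assumption: one mean over all entries
    std = statistics.pstdev(values)   # assumption: one variance over all entries
    rng = random.Random(seed)         # fixed seed, shared by all compared models
    return [rng.gauss(mean, std) for _ in range(len(M_w[0]))]


# Toy example with three trained 2-dimensional vectors.
unk = sample_unk_vector([[0.1, -0.2], [0.3, 0.0], [-0.1, 0.2]], seed=13)
```

Fixing the seed means that every model is evaluated with an <unk> vector drawn in the same way, so differences in perplexity are not caused by the sampling itself.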

Training Setup and Parameters
We reproduce the standard LM setup of Zaremba et al. (2015) and parameter choices of Kim et al. (2016), with batches of 20 and a sequence length of 35, where one step corresponds to one token. The maximum word length is chosen dynamically based on the longest word in the corpus. The corpus is processed continuously, and the RNN hidden state resets occur at the beginning of each epoch. Parameters are optimised with stochastic gradient descent. The gradient is averaged over the batch size and sequence length. We then scale the averaged gradient by the sequence length (=35) and clip to 5.0 for more stable training. The learning rate is 1.0, decayed by 0.5 after each epoch if the validation perplexity does not improve. We train all models for 15 epochs on the smaller corpora, and for 30 on the large ones, which is typically sufficient for model convergence.
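Two of the optimisation details above, gradient clipping and the learning rate schedule, can be illustrated with a short sketch. The function names are hypothetical, and clipping by the global gradient norm is an assumption; the text only states that the gradient is clipped to 5.0.

```python
import math


def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale the gradient if its L2 norm exceeds max_norm (assumed clipping scheme)."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def next_learning_rate(lr, val_ppl, best_val_ppl, decay=0.5):
    """Decay the learning rate when validation perplexity does not improve."""
    return lr if val_ppl < best_val_ppl else lr * decay


clipped = clip_by_global_norm([6.0, 8.0])                    # norm 10 is rescaled to norm 5
lr = next_learning_rate(1.0, val_ppl=310.0, best_val_ppl=300.0)  # no improvement, so decay
```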
Our AP fine-tuning method operates on the whole $M^w$ space, but we only allow words with a frequency above 5 as cue words $x_w$ (see Section 5), while there are no restrictions on $x_p$ and $x_n$. Our preliminary analysis on the influence of the number of nearest neighbours in $M^c$ shows that this parameter has only a moderate effect on the final LM scores. We thus fix it to 3 positive and 3 negative samples for each $x_w$ without any tuning. AP is optimised with Adagrad (Duchi et al., 2011) with a learning rate of 0.05; the gradients are clipped to ±2. A full summary of all hyper-parameters and their values is provided in Table 3.
(Baseline) Language Models The availability of LM evaluation sets in a large number of diverse languages, described in Section 2, now provides an opportunity to perform a full-fledged multilingual analysis of representative LM architectures. At the same time, these different architectures serve as the baselines for our novel model which fine-tunes the output matrix $M^w$. As mentioned, the traditional LM setup is to use words both on the input and on the output side (Goodman, 2001; Bengio et al., 2003; Deschacht and Moens, 2009) relying on n-gram word sequences. We evaluate a strong model from the n-gram family of models from the KenLM package (https://github.com/kpu/kenlm): it is based on 5-grams with extended Kneser-Ney smoothing (KN5) (Kneser and Ney, 1995; Heafield et al., 2013). The rationale behind including this non-neural model is to also probe the limitations of such n-gram-based LM architectures on a diverse set of languages.
Recurrent neural networks (RNNs), especially Long Short-Term Memory networks (LSTMs), have taken over the LM universe recently (Mikolov et al., 2010; Sundermeyer et al., 2015; Chen et al., 2016, i.a.). These LMs map a sequence of input words to embedding vectors using a look-up matrix. The embeddings are passed to the LSTM as input, and the model is trained in an autoregressive fashion to predict the next word from the pre-defined vocabulary given the current context. As a strong baseline from this LM family, we train a standard LSTM LM (LSTM-Word) relying on the setup from Zaremba et al. (2015) (see Table 3).
Finally, a recent strand of LM work uses characters on the input side while retaining word-level prediction on the output side. A representative architecture from this group, also serving as the basis in our work (Section 3), is Char-CNN-LSTM (Kim et al., 2016).
All neural models operate on exactly the same vocabulary and treat out-of-vocabulary (OOV) words in exactly the same way. As mentioned, we include KN5 as a strong (non-neural) baseline to give perspective on how this more traditional model performs across 50 typologically diverse languages. We have selected the setup for the KN5 model to be as close as possible to that of neural LMs. However, due to the different nature of the models, we note that the results between KN5 and other models are not comparable.
In KN5 discounts are added for low-frequency words, and unseen words at test time are regarded as outliers and assigned low probability estimates. In contrast, for all neural models we sample unseen word vectors to lie in the space of trained vectors (see above). We find the latter setup to better reflect our intuition that especially in MRLs unseen words are not outliers but often arise due to morphological complexity.

Results and Discussion
In this section, we present the main empirical findings of our work. The focus is on: a) the results of our novel language model with the AP fine-tuning procedure, and its comparison to the other language models; b) the analysis of the LM results in relation to typological features and corpus statistics. Table 4 lists all 50 test languages along with their language codes and provides the key statistics of our 50 LM evaluation benchmarks. The statistics include the number of word types in training data, the number of word types occurring in test data but unseen in training, as well as the total number of word tokens in both training and test data, and type-to-token ratios. Table 4 also reports the results for KN5, LSTM-Word, Char-CNN-LSTM, and our model with the AP fine-tuning. Furthermore, a visualisation of the Char-CNN-LSTM+AP model as a function of type/token ratio is shown in Figure 2.

Table 4 (caption): Note that the absolute scores in the KN5 column are not comparable to the scores obtained with neural models (see Section 6). Right: Results with Char-CNN-LSTM and our AP fine-tuning strategy. ∆ indicates the difference in performance over the original Char-CNN-LSTM model. The best scoring neural baseline is underlined. The overall best performing neural model for each language is in bold.

Fine-Tuning the Output Matrix
First, we test the impact of our AP fine-tuning method. As the main finding, the inclusion of fine-tuning into Char-CNN-LSTM (this model is termed +AP) yields improvements on a large number of test languages. The model is better than both strong neural baseline language models for 47/50 languages, and it improves over the original Char-CNN-LSTM LM for 47/50 languages. The largest gains are indicated for the subset of agglutinative MRLs (e.g., 950 perplexity points in Korean, large gains also marked for FI, HE, KA, HU, TA, ET). We also observe large gains for the three introflexive languages included in our study (Amharic, Arabic, Hebrew). While these large absolute gains may be partially attributed to the exponential nature of the perplexity measure, one cannot ignore the substantial relative gains achieved by our models: e.g., EU (∆PPL=38) improves more than a fusional language like DA (∆PPL=24) even with a lower baseline perplexity. This suggests that injecting subword-level information is more straightforward for the former: in agglutinative languages, the mapping between morphemes and meanings is less ambiguous. Moreover, the number of words that benefit from the injection of character-based information is larger for agglutinative languages, because they also tend to display the highest inflectional synthesis.
For the opposite reasons, we do not surpass Char-CNN-LSTM in a few fusional (IT) and isolating languages (KM, VI). We also observe improvements for Slavic languages with rich morphology (RU, HR, PL). The gains are also achieved for some isolating and fusional languages with smaller vocabularies and a smaller number of rare words, e.g., in Tagalog, English, Catalan, and Swedish. This suggests that our method for fine-tuning the LM prediction is not restricted to MRLs only, and has the ability to improve the estimation for rare words in multiple typologically diverse languages.
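To make the remark about the exponential nature of perplexity concrete, recall that perplexity is the exponential of the mean negative log-likelihood per token. The sketch below uses this standard definition; the baseline perplexities of 1000 and 100 and the 0.1-nat gain are hypothetical values chosen for illustration, not results from our experiments.

```python
import math


def perplexity(token_log_probs):
    """Perplexity: exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def perplexity_after_gain(ppl, nll_gain):
    """Perplexity after the mean negative log-likelihood drops by `nll_gain` nats."""
    return ppl * math.exp(-nll_gain)


# A model assigning probability 1/100 to every token has perplexity 100.
uniform_ppl = perplexity([math.log(1 / 100)] * 10)

# Hypothetical baselines: the same 0.1-nat gain gives very different absolute drops.
drop_high = 1000.0 - perplexity_after_gain(1000.0, 0.1)
drop_low = 100.0 - perplexity_after_gain(100.0, 0.1)
print(f"uniform: {uniform_ppl:.1f}, drop at 1000: {drop_high:.1f}, drop at 100: {drop_low:.1f}")
```

The same improvement in log-likelihood produces an absolute perplexity drop ten times larger at the higher baseline, which is why absolute gains should be read alongside relative ones.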

Language Models, Typological Features, and Corpus Statistics
In the next experiment, we estimate the correlation strength of all perplexity scores with a series of independent variables. The variables are 1) type-token ratio in the train data; 2) new word types in the test data; 3) the morphological type of the language among isolating, fusional, introflexive, and agglutinative, capturing different aspects related to the morphological richness of a language. Results with Pearson's ρ (numerical) and η² in one-way ANOVA (categorical) are shown in Table 5. Significance tests show p-values < 10⁻³ for all combinations of models and independent variables, demonstrating that all of them are good performance predictors. Our main finding indicates that linguistic categories and data statistics both correlate well (≈ 0.35 and ≈ 0.82, respectively) with the performance of language models.
For the categorical variables we compare the mean values per category with the numerical dependent variable. As such, η² can be interpreted as the amount of variation explained by the model. The resulting high correlations suggest that perplexities tend to be homogeneous for languages of the same morphological type, especially so for state-of-the-art models. This is intuitively evident in Figure 2, where perplexity scores of Char-CNN-LSTM+AP are plotted against type/token ratio. Isolating languages are placed on the left side of the spectrum as expected, with low type/token ratio and good performance (e.g., VI, ZH). As for fusional languages, sub-groups behave differently. We find that Romance and Germanic languages display roughly the same level of performance as isolating languages, despite their overall larger type/token ratio. Balto-Slavic languages (e.g., CS, LV) instead show both higher perplexities and higher type/token ratio. These differences may be explained in terms of different inflectional synthesis.
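The two statistics used in this analysis can be sketched as follows: the Pearson correlation for numerical variables, and η² for categorical ones, computed as the between-group sum of squares divided by the total sum of squares. This is a minimal standard-library sketch; the function names and all values in the toy data are made up for illustration and do not come from Table 5.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def eta_squared(groups):
    """Eta squared for a one-way layout: SS_between / SS_total.

    `groups` maps a category label to the list of values in that category.
    """
    values = [v for vals in groups.values() for v in vals]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
        for vals in groups.values()
    )
    return ss_between / ss_total


# Made-up type/token ratios and perplexities for four imaginary languages.
ttr = [0.05, 0.10, 0.20, 0.30]
ppl = [150.0, 300.0, 900.0, 1400.0]
r = pearson_r(ttr, ppl)

# Made-up perplexities grouped by morphological type.
by_type = {"isolating": [150.0, 200.0], "agglutinative": [900.0, 1400.0]}
eta2 = eta_squared(by_type)
print(f"r = {r:.3f}, eta^2 = {eta2:.3f}")
```

An η² close to 1 means that most of the variance in perplexity lies between morphological types rather than within them, which is the sense in which perplexities are homogeneous within a type.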
Introflexive and agglutinative languages can be found mostly on the right side of the spectrum in terms of performance (see Figure 2). Although the languages with the highest absolute perplexity scores are certainly classified as agglutinative (e.g., Dravidian languages such as KN and TA), we also find some outliers among the agglutinative languages (EU) with remarkably low perplexity scores.
Table 5: Correlations between model performance and language typology as well as with corpus statistics (type/token ratio and new word types in test data). All variables are good performance predictors.

Corpus Size and Type/Token Ratio
Building on the strong correlation between type/token ratio and model performance from Section 7.2, we now further analyse the results in light of corpus size and type/token statistics. The LM datasets for our 50 languages are similar in size to the widely used English PTB dataset (Marcus et al., 1993). As such, we hope that these evaluation datasets can help guide multilingual language modeling research across a wide spectrum of languages. However, our goal now is to verify that type/token ratio and not absolute corpus size is the deciding factor when unraveling the limitations of standard LM architectures across different languages. To this end, we conduct additional experiments on all languages of the recent Multilingual Wikipedia Corpus (MWC) (Kawakami et al., 2017) for language modeling, using the same setup as before (see Table 3). The corpus provides datasets for 7 languages from the same domain as our benchmarks (Wikipedia), and comes in two sizes. We choose the larger corpus variant for each language, which provides about 3-5 times as many tokens as contained in our data sets from Table 4.
The results on the MWC evaluation data along with corpus statistics are summarised in Table 6. As one important finding, we observe that the gains in perplexity using our fine-tuning AP method extend also to these larger evaluation datasets. In particular, we find improvements of the same magnitude as in the PTB-sized data sets over the strongest baseline model (Char-CNN-LSTM) for all MWC languages. For instance, perplexity is reduced from 1781 to 1578 for Russian, and from 365 to 352 for English. We also observe a gain for French and Spanish with perplexity reduced from 282 to 272 and 255 to 243 respectively.
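Since absolute perplexity differences depend heavily on the baseline level, it can help to view the figures quoted above as relative reductions. The short sketch below does this for the four language pairs mentioned in the text; the helper name `relative_reduction` is our own illustrative choice.

```python
# (baseline Char-CNN-LSTM perplexity, +AP perplexity) as quoted in the text above.
reported = {
    "Russian": (1781, 1578),
    "English": (365, 352),
    "French": (282, 272),
    "Spanish": (255, 243),
}


def relative_reduction(baseline, improved):
    """Fraction of the baseline perplexity removed by the improved model."""
    return (baseline - improved) / baseline


for lang, (base, ap) in reported.items():
    print(f"{lang}: {base} -> {ap} "
          f"(absolute {base - ap}, relative {relative_reduction(base, ap):.1%})")
```

For these numbers the relative reductions are roughly 11.4% for Russian, 3.6% for English, 3.5% for French, and 4.7% for Spanish.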
In addition, we test on samples of the Europarl corpus (Koehn, 2005; Tiedemann, 2012), which contains approximately 10 times more tokens than our PTB-sized evaluation data: we use 400K sentences from Europarl for training and testing. However, this data comes from a much narrower domain of parliamentary proceedings: this property yields a very low type/token ratio, as visible from Table 6. In fact, we find the type/token ratio in this corpus to be on the same level as, or even smaller than, that of isolating languages (compare with the scores in Table 4): 0.02 for Dutch and 0.03 for Czech. This leads to similar perplexities with and without +AP for these two selected test languages. The third EP test language, Finnish, has a slightly higher type/token ratio. Consequently, we do observe an improvement of 10 points in perplexity. A more detailed analysis of this phenomenon follows.
Table 7 displays the overall type/token ratio in the training set of these corpora. We observe that the MWC has comparable or even higher type/token ratios than the smaller sets despite its increased size. The corpus has been constructed by sampling the data from a variety of different Wikipedia categories (Kawakami et al., 2017): it can therefore be regarded as more diverse and challenging to model. Europarl, on the other hand, shows substantially lower type/token ratios, presumably due to its narrower domain and more repetitive nature.
Table 7: Comparison of type/token ratios in the corpora used for evaluation. The ratio is not dependent only on the corpus size but also on the language and domain of the corpus.
In general, we find that although the type/token ratio decreases with increasing corpus size, the rate of decrease slows down dramatically at a certain point (Herdan, 1960; Heaps, 1978). This depends on the typology of the language and domain of the corpus. Figure 3 provides empirical support for this intuition: it shows the variation of type/token ratios in Wikipedia and Europarl with increasing corpus size. We can see that in a very large corpus of 800K sentences, the type/token ratio in MRLs such as Korean or Finnish stays close to 0.1, a level where we still expect an improvement in perplexity with the proposed AP fine-tuning method applied on top of Char-CNN-LSTM.
Figure 3: Type/token ratio values vs. corpus size. A domain-specific corpus (Europarl) has a lower type/token ratio than a more general corpus (Wikipedia), regardless of the absolute corpus size.
In order to isolate and verify the effect of the type/token ratio, we now present results on synthetically created data sets where the ratio is controlled explicitly. We experiment with subsets of the German Wikipedia with an equal number of sentences (25K), a comparable number of tokens, but varying type/token ratio. We generate these controlled data sets by clustering sparse bag-of-words sentence vectors with the k-means algorithm, sampling from different clusters, and then selecting the final combinations according to their type/token ratio and the number of tokens. Corpus statistics along with corresponding perplexity scores are shown in Table 8, and plotted in Figure 4. These results demonstrate that the effectiveness of the AP method increases for corpora with higher type/token ratios. This finding also further supports the usefulness of the proposed method for morphologically-rich languages in particular, where such high type/token ratios are expected.
Caption (truncated): … Table 8. The AP method is especially helpful for corpora with high type/token ratios.
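The dependence of the type/token ratio on corpus size discussed in this section can be traced by measuring the ratio on growing prefixes of a corpus. The following is a minimal sketch of such a measurement; the function name `ttr_curve`, the `step` parameter, and the toy token stream are illustrative assumptions, and this is not the script used to produce Figure 3.

```python
def ttr_curve(tokens, step):
    """Type/token ratio measured after every `step` tokens of a growing prefix.

    Returns a list of (prefix_length, type_token_ratio) pairs.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    seen = set()
    curve = []
    for i, token in enumerate(tokens, start=1):
        seen.add(token)
        if i % step == 0:
            curve.append((i, len(seen) / i))
    return curve


# Toy token stream: new types appear less and less often as the prefix grows.
tokens = "a b a c a b a d".split()
curve = ttr_curve(tokens, step=2)
print(curve)
```

On the toy stream the ratio falls from 1.0 to 0.75 and then levels off at 0.5, mirroring on a very small scale the flattening described above.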

Conclusion
We have presented a comprehensive language modeling study over a set of 50 typologically diverse languages. The languages were carefully selected to represent a wide spectrum of different morphological systems that are found among the world's languages. Our comprehensive study provides new benchmarks and language modeling baselines which should guide the development of next-generation language models focused on the challenging multilingual setting.
One particular LM challenge is an effective learning of parameters for infrequent words, especially for morphologically-rich languages (MRLs). The methodological contribution of this work is a new neural approach which enriches word vectors at the LM output with subword-level information to capture similar character sequences and consequently to facilitate word-level LM prediction. Our method has been implemented as a fine-tuning step which gradually refines word vectors during the LM training, based on subword-level knowledge extracted in an unsupervised manner from character-aware CNN layers. Our approach yields gains for 47/50 languages in the challenging full-vocabulary setup, with largest gains reported for MRLs such as Korean or Finnish. We have also demonstrated that the gains extend to larger training corpora, and are well correlated with the type-to-token ratio in the training data.
In future work we plan to deal with the open vocabulary LM setup and extend our framework to also handle unseen words at test time. One interesting avenue might be to further fine-tune the LM prediction based on additional evidence beyond purely contextual information. In summary, we hope that this article will encourage further research into learning semantic representations for rare and unseen words, and steer further developments in multilingual language modeling across a large number of diverse languages. Code and data are available online: http://people.ds.cam.ac.uk/dsg40/lmmrl.html.