Enriching Word Vectors with Subword Information

Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skip-gram model, where each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram, and words are represented as the sum of these representations. Our method is fast, allowing us to train models on large corpora quickly, and it allows us to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, on both word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.


Introduction
Learning continuous representations of words has a long history in natural language processing (Rumelhart et al., 1988).
These representations are typically derived from large unlabeled corpora using co-occurrence statistics (Deerwester et al., 1990; Schütze, 1992; Lund and Burgess, 1996). A large body of work, known as distributional semantics, has studied the properties of these methods (Turney et al., 2010; Baroni and Lenci, 2010). In the neural network community, Collobert and Weston (2008) proposed to learn word embeddings using a feedforward neural network, by predicting a word based on the two words on the left and two words on the right. More recently, Mikolov et al. (2013b) proposed simple log-bilinear models to learn continuous representations of words on very large corpora efficiently.
Most of these techniques represent each word of the vocabulary by a distinct vector, without parameter sharing. In particular, they ignore the internal structure of words, which is an important limitation for morphologically rich languages, such as Turkish or Finnish. These languages contain many words that occur rarely, making it difficult to learn good word-level representations. In this paper, we propose to learn representations for character n-grams, and to represent words as the sum of the n-gram vectors. Our main contribution is to introduce an extension of the continuous skip-gram model (Mikolov et al., 2013b) which takes subword information into account. We evaluate this model on five different languages with varying degrees of morphological richness, showing the benefit of our approach.

Related work
Morphological word representations. In recent years, many methods have been proposed to incorporate morphological information into word representations.
To model rare words better, Alexandrescu and Kirchhoff (2006) introduced factored neural language models, where words are represented as sets of features.
These features might include morphological information, and this technique was successfully applied to morphologically rich languages, such as Turkish (Sak et al., 2010).
Recently, several works have proposed different composition functions to derive representations of words from morphemes (Lazaridou et al., 2013; Luong et al., 2013; Botha and Blunsom, 2014; Qiu et al., 2014). These different approaches rely on a morphological decomposition of words, while ours does not. Similarly, Chen et al. (2015) introduced a method to jointly learn embeddings for Chinese words and characters. Cui et al. (2015) proposed to constrain morphologically similar words to have similar representations. Soricut and Och (2015) described a method to learn vector representations of morphological transformations, which makes it possible to obtain representations for unseen words by applying these rules. Word representations trained on morphologically annotated data were introduced by Cotterell and Schütze (2015). Closest to our approach, Schütze (1993) learned representations of character four-grams through singular value decomposition, and derived representations for words by summing the four-gram representations.
Character level features for NLP. Another area of research closely related to our work is character-level models for natural language processing, which learn representations directly from sequences of characters. A first class of such models are recurrent neural networks, applied to language modeling (Mikolov et al., 2012; Sutskever et al., 2011; Graves, 2013; Bojanowski et al., 2015), text normalization (Chrupała, 2014), part-of-speech tagging (Ling et al., 2015) and parsing (Ballesteros et al., 2015). Another family of models are convolutional neural networks trained on characters, which were applied to part-of-speech tagging (dos Santos and Zadrozny, 2014), sentiment analysis (dos Santos and Gatti, 2014), text classification (Zhang et al., 2015) and language modeling (Kim et al., 2016).
Sperr et al. (2013) introduced a language model based on restricted Boltzmann machines, in which words are encoded as a set of character n-grams. Finally, recent works in machine translation have proposed to use subword units to obtain representations of rare words (Sennrich et al., 2016; Luong and Manning, 2016).

Model
In this section, we propose a model to learn representations taking into account word morphology.

General model. We start by briefly reviewing the continuous skip-gram model (Mikolov et al., 2013b), from which our model is derived. Given a vocabulary of size $W$, where a word is identified by its index $w \in \{1, ..., W\}$, the goal is to learn a vectorial representation for each word $w$. Inspired by the distributional hypothesis (Harris, 1954), these representations are trained to predict words appearing in the context of a given word. More formally, given a large training corpus represented as a sequence of words $w_1, ..., w_T$, the objective of the skip-gram model is to maximize the log-likelihood

$$\sum_{t=1}^{T} \sum_{c \in C_t} \log p(w_c \mid w_t),$$

where the context $C_t$ is the set of indices of words surrounding $w_t$. The probability of observing a context word $w_c$ given $w_t$ is parametrized using the word vectors. Given a scoring function $s$, which maps pairs of (word, context) to scores in $\mathbb{R}$, a possible choice to define the probability of a context word is the softmax

$$p(w_c \mid w_t) = \frac{e^{s(w_t, w_c)}}{\sum_{j=1}^{W} e^{s(w_t, j)}}.$$
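As an illustration of the softmax parametrization, the following minimal sketch computes the probability of a context word from a list of scores. The function name `softmax_probability`, the 0-based indexing and the toy scores are assumptions made for this example only.

```python
import math

def softmax_probability(scores, c):
    """Return p(w_c | w_t), where scores[j] holds s(w_t, j) for every word j.

    Indices are 0-based here, whereas the text numbers words from 1 to W.
    """
    # Subtracting the maximum score avoids overflow and leaves the ratio unchanged.
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    return exps[c] / sum(exps)

# Toy vocabulary of W = 4 words with made-up scores s(w_t, j).
scores = [2.0, 0.5, -1.0, 0.0]
probs = [softmax_probability(scores, j) for j in range(len(scores))]
```

The probabilities over the whole vocabulary sum to one, so the cost of evaluating a single term grows linearly with W.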
However, such a model is not adapted to our case, as it implies that, given a word $w_t$, we only predict one context word $w_c$. The problem of predicting context words can instead be framed as a set of independent binary classification tasks, where the goal is to predict the presence (or absence) of context words. For the word at position $t$ and the context $c$, we obtain the negative log-likelihood

$$\log\left(1 + e^{-s(w_t, w_c)}\right) + \sum_{n \in \mathcal{N}_{t,c}} \log\left(1 + e^{s(w_t, n)}\right),$$

where $\mathcal{N}_{t,c}$ is a set of negative examples sampled from the vocabulary. Introducing the logistic loss function $\ell : x \mapsto \log(1 + e^{-x})$, we obtain the objective

$$\sum_{t=1}^{T} \sum_{c \in C_t} \left[ \ell(s(w_t, w_c)) + \sum_{n \in \mathcal{N}_{t,c}} \ell(-s(w_t, n)) \right].$$

A natural parametrization for the score function between a word $w_t$ and a context word $w_c$ is to take the scalar product between word and context embeddings, $s(w_t, w_c) = u_{w_t}^\top v_{w_c}$, where $u_{w_t}$ and $v_{w_c}$ are vectors in $\mathbb{R}^d$. This is the skip-gram model with negative sampling, introduced by Mikolov et al. (2013b).

Subword model. By using a distinct vector representation for each word, the skip-gram model ignores the internal structure of words. In this section, we thus propose a different scoring function $s$, in order to take this information into account. Given a word $w$, let us denote by $G_w \subset \{1, ..., G\}$ the set of n-grams appearing in $w$. We associate a vector representation $z_g$ with each n-gram $g$. We represent a word by the sum of the vector representations of its n-grams. We thus obtain the scoring function

$$s(w, c) = \sum_{g \in G_w} z_g^\top v_c.$$

We always include the word $w$ in the set of its n-grams, to also learn a vector representation for each word. The set of n-grams is thus a superset of the vocabulary. It should be noted that different vectors are assigned to a word and an n-gram sharing the same sequence of characters. For example, the word as and the bigram as, appearing in the word paste, will be assigned different vectors. This simple model allows representations to be shared across words, which makes it possible to learn reliable representations for rare words.
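To make the subword scoring function and the negative-sampling loss concrete, here is a minimal sketch in plain Python. The helper names, the dictionary-based parameter tables and the toy vectors are all assumptions for illustration; this is not the authors' implementation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def logistic_loss(x):
    """l(x) = log(1 + exp(-x)), written to stay stable for large |x|."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

def subword_score(ngram_ids, z, v_c):
    """s(w, c): sum over the n-grams g of the word of z_g . v_c."""
    return sum(dot(z[g], v_c) for g in ngram_ids)

def negative_sampling_loss(ngram_ids, z, v, c, negatives):
    """l(s(w_t, w_c)) plus the sum over negatives n of l(-s(w_t, n))."""
    loss = logistic_loss(subword_score(ngram_ids, z, v[c]))
    for n in negatives:
        loss += logistic_loss(-subword_score(ngram_ids, z, v[n]))
    return loss

# Toy parameters in R^2 (made-up values): n-gram vectors z and context vectors v.
z = {0: [0.5, 0.0], 1: [0.0, 0.5], 2: [1.0, 1.0]}
v = {0: [1.0, 1.0], 1: [-1.0, 0.0], 2: [0.0, -1.0]}
G_w = [0, 1]  # hypothetical n-gram indices of the input word
loss = negative_sampling_loss(G_w, z, v, c=0, negatives=[1, 2])
```

Because the score is linear in the n-gram vectors, a gradient step on this loss updates every n-gram of the input word at once, which is how rare words benefit from n-grams seen in frequent words.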
Dictionary of n-grams. The presented model is simple and leaves room for design choices in the definition of $G_w$. In this paper, we adopt a very simple scheme: we keep all the n-grams with a length greater than or equal to 3 and less than or equal to 6. Different sets of n-grams could be used, for example prefixes and suffixes. We also add a special character for the beginning and the end of the word, which makes it possible to distinguish prefixes and suffixes from other n-grams.
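The n-gram extraction scheme can be sketched as follows. The text does not say which special characters mark the word boundaries, so the symbols `<` and `>` used here are an assumption, as is the function name `char_ngrams`.

```python
def char_ngrams(word, min_n=3, max_n=6, bow="<", eow=">"):
    """Return all character n-grams of length min_n..max_n of the padded word.

    `bow` and `eow` are assumed boundary symbols; the text only states that
    a special character is added at the beginning and at the end of the word.
    """
    padded = bow + word + eow
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(padded) - n + 1):
            ngrams.append(padded[i:i + n])
    return ngrams

grams = char_ngrams("where")
# The trigrams come first: "<wh", "whe", "her", "ere", "re>"
```

With boundary symbols, the prefix trigram `<wh` and the suffix trigram `re>` are distinct from trigrams that occur in the middle of a word.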
In order to bound the memory requirements of our model, we use a hashing function that maps n-grams to integers in 1 to $K$. In the following, we use $K$ equal to 2 million. In the end, a word is represented by its index in the word dictionary and the values of the hashes of its n-grams. To improve the efficiency of our model, we do not use n-grams to represent the $P$ most frequent words in the vocabulary. There is a trade-off in the choice of $P$, as smaller values imply higher computational cost but better performance. When $P = W$, our model is the skip-gram model of Mikolov et al. (2013b).
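The hashing scheme can be illustrated with the sketch below. The text does not name the hashing function, so 32-bit FNV-1a is used purely as an assumed stand-in; the layout of the parameter table, the 0-based buckets and the function `word_features` are likewise hypothetical.

```python
K = 2_000_000  # number of hash buckets, as stated in the text

def fnv1a_32(text):
    """32-bit FNV-1a hash of the UTF-8 bytes of `text` (an assumed choice)."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h

def word_features(word_index, ngrams, num_words, P, K=K):
    """Return the parameter rows that represent a word.

    Assumed layout: rows 0..num_words-1 hold word vectors and rows
    num_words..num_words+K-1 hold hashed n-gram vectors (buckets are
    0-based here, while the text numbers them from 1 to K). The dictionary
    is assumed to be sorted by decreasing frequency, so the P most frequent
    words are those with index < P and they use their word vector only.
    """
    features = [word_index]
    if word_index >= P:
        features += [num_words + fnv1a_32(g) % K for g in ngrams]
    return features

frequent = word_features(0, ["<th", "the", "he>"], num_words=10, P=5)
rare = word_features(7, ["<ra", "rar", "are", "re>"], num_words=10, P=5)
```

Distinct n-grams may collide in the same bucket and then share a vector; this is the price paid for bounding memory at K rows regardless of how many n-grams the corpus contains.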

Experiments
Datasets and baseline. First, we compare our model to the C implementation of the skip-gram and CBOW models from the word2vec package. The different models are trained on Wikipedia data in five languages: English, Czech, French, German and Spanish. We randomly sampled datasets of three different sizes: small (50M tokens), medium (200M tokens) and full (the complete Wikipedia dump). All the datasets are shuffled, and we train for five epochs.
Implementation details. For all experiments, for both our model and the baselines, we use the following parameters. We sample 5 negatives, use a window size of 5 and a rejection threshold of $10^{-4}$, and keep words appearing at least 5 times. On small datasets, we use vector representations of dimension 100; on medium datasets, we use vectors of dimension 200; and on the full datasets, we use a dimension of 300. The learning rate is set to 0.025 for the skip-gram baseline and to 0.05 for both our model and the CBOW baseline. Using this setting on English data, our model with character n-grams is approximately 1.5× slower to train than the skip-gram baseline (105k words/second/thread versus 145k words/second/thread for the baseline). Our model is implemented in C++, and we will make our code available upon publication.
Human similarity judgement. We first evaluate the quality of our representations by computing Spearman's rank correlation coefficient (Spearman, 1904) between human judgement and the cosine similarity between the vector representations. For English, we use the WS353 dataset introduced by Finkelstein et al. (2001) and the rare word dataset (RW), introduced by Luong et al. (2013).
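The evaluation protocol can be sketched in a few lines of plain Python: compute the cosine similarity of each word pair, then the Spearman correlation between these similarities and the human scores. The word vectors and human scores below are made up for illustration, and the helper names are assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up vectors and human similarity scores, for illustration only.
vectors = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "car": [0.1, 1.0], "bus": [0.3, 0.9]}
pairs = [("cat", "dog", 9.0), ("car", "bus", 8.5), ("cat", "car", 2.0), ("dog", "bus", 3.0)]
model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in pairs]
human_scores = [h for _, _, h in pairs]
rho = spearman(model_scores, human_scores)
```

Since only ranks matter, the measure is insensitive to the scale of the human ratings and of the cosine similarities.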
For German, we compare the different models on three datasets: Gur65, Gur350 and ZG222 (Gurevych, 2005; Zesch and Gurevych, 2006). Finally, we evaluate the French word vectors on the translated dataset RG65 (Joubarne and Inkpen, 2011) and the Spanish word vectors on the dataset WS353 (Hassan and Mihalcea, 2009). Some words from these datasets do not appear in our training data, and thus we cannot obtain word representations for these words using the CBOW and skip-gram baselines. Therefore, we decided to exclude pairs containing such words from the evaluation. We report the out-of-vocabulary rate (OOV) with our results in Table 1. It should be noted that our method and the baselines share the same vocabulary, so that results of different methods on the same training set are comparable. On the other hand, results on different training corpora are not comparable, since the vocabularies are not the same (hence the different OOV rates).
First, we notice that the proposed model, which uses subword information, outperforms the baselines on most datasets. We also observe that the effect of using character n-grams is significantly larger for German than for English or Spanish. This is not surprising, since German is morphologically richer than English or Spanish. Unsurprisingly, the difference between our model and the baselines is also larger on smaller datasets. Second, we observe that on the English rare words dataset (RW), our approach also outperforms the baselines.
Word analogy tasks. We now evaluate our approach on word analogy questions, of the form A is to B as C is to D, where D must be predicted by the models.
We use the syntactic (en-syn) and semantic (en-sem) questions introduced by Mikolov et al. (2013a) for English and the dataset cs-all, introduced by Svoboda and Brychcin (2016), for Czech. Some questions contain words that do not appear in our training corpus, and we thus exclude these questions and report the out-of-vocabulary rate. All methods are trained on the same data, therefore the reported results are comparable. We report accuracy for the different models in Table 1. We observe that morphological information significantly helps for the syntactic tasks, our approach outperforming the baselines on en-syn. In contrast, it degrades the performance on semantic tasks for small training datasets. Finally, we observe that for Czech, a morphologically rich language, using subword information strongly improves the results over the baselines.
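The text does not spell out how D is predicted from A, B and C. A common choice, assumed here, is to return the word whose vector has the highest cosine similarity to B - A + C, excluding the three query words. The sketch below uses made-up three-dimensional vectors, and the function name `predict_analogy` is hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def predict_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' by vector offset (an assumed procedure)."""
    query = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three question words are excluded from the candidate answers.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], query))

# Made-up vectors: the third coordinate loosely encodes tense.
vectors = {
    "walk": [1.0, 0.0, 0.0],
    "walked": [1.0, 0.0, 1.0],
    "jump": [0.0, 1.0, 0.0],
    "jumped": [0.0, 1.0, 1.0],
    "jumps": [0.0, 1.0, -1.0],
    "tree": [0.3, 0.3, 0.3],
}
answer = predict_analogy("walk", "walked", "jump", vectors)

# Accuracy is the fraction of questions whose predicted word equals the expected D.
questions = [("walk", "walked", "jump", "jumped")]
accuracy = sum(predict_analogy(a, b, c, vectors) == d for a, b, c, d in questions) / len(questions)
```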

Comparison with morphological representations.
We also compare our approach to previous work on incorporating subword information in word vectors, on word similarity tasks. The methods used are: the recursive neural network of Luong et al. (2013), the morpheme CBOW of Qiu et al. (2014) and the morphological transformations of Soricut and Och (2015). For English, we trained our model on the Wikipedia data released by Shaoul and Westbury (2010), while for German, Spanish and French, we use the news crawl data from the 2013 WMT shared task. These datasets were also used to train the models we are comparing to. Contrary to the previous experiments, we keep all the words from the evaluation data, even if they did not appear in the training data. Using our model, we obtain representations of out-of-vocabulary words by summing the representations of character n-grams. This makes our results comparable to those reported in previous work. We report results in Table 3. We observe that our simple approach, based on character n-grams, performs well relative to techniques based on subword information obtained from morphological segmentors. In particular, our method does not need any preprocessing of the data, making it fast to apply.
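Building a vector for an out-of-vocabulary word by summing n-gram vectors can be sketched as follows. The boundary symbols `<` and `>`, the string-keyed table `ngram_vectors` and the choice to skip n-grams missing from that table are assumptions of this toy example; with a hashed table, every n-gram would map to some row.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of the word padded with assumed boundary symbols."""
    padded = "<" + word + ">"
    return [padded[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(padded) - n + 1)]

def oov_vector(word, ngram_vectors, dim):
    """Sum the vectors of the word's n-grams; unknown n-grams contribute nothing."""
    vec = [0.0] * dim
    for g in char_ngrams(word):
        for k, x in enumerate(ngram_vectors.get(g, ())):
            vec[k] += x
    return vec

# Made-up n-gram vectors in R^2; "unseen" itself has no word vector.
ngram_vectors = {"<un": [1.0, 0.0], "see": [0.0, 2.0], "en>": [0.5, 0.5]}
vec = oov_vector("unseen", ngram_vectors, dim=2)
```

The word-level baselines have no counterpart to this step, which is why pairs containing unseen words had to be excluded in the earlier experiments.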

Discussion
In this paper, we investigate a very simple method to learn word representations that takes subword information into account. Our approach, which incorporates character n-grams into the skip-gram model, is related to an old idea (Schütze, 1993), which has not received a lot of attention in the last decade. We show that our method outperforms baselines that do not take subword information into account, on rare words, morphologically rich languages and small training datasets. We will open source the implementation of our model, in order to facilitate comparison of future work on learning subword representations.

Table 1:
Correlation between vector similarity score and human judgement on several datasets (top) and accuracies on analogy tasks (bottom) for models trained on Wikipedia. For each dataset, we evaluate models trained on several sizes of training sets. Small contains 50M tokens, Medium 200M tokens and Full is the complete Wikipedia dump. For each dataset, we report the out-of-vocabulary rate as well as the performance of our model and the skip-gram and CBOW models from Mikolov et al. (2013b).

Table 2:
Nearest neighbors of rare words using our representations and skip-gram. These hand-picked examples are for illustration.

Table 3:
Comparison of our approach with previous work incorporating morphology in word representations, on word similarity tasks. We keep all the word pairs of the evaluation set and obtain representations for out-of-vocabulary words with our model by summing the vectors of character n-grams. We report Spearman's rank correlation coefficient between model scores and human judgement.