Sparse Coding of Neural Word Embeddings for Multilingual Sequence Labeling

In this paper we propose and carefully evaluate a sequence labeling framework which solely utilizes sparse indicator features derived from dense distributed word representations. The proposed model obtains (near) state-of-the-art performance for both part-of-speech tagging and named entity recognition for a variety of languages. Our model relies only on a few thousand sparse coding-derived features, without applying any modification of the word representations employed for the different tasks. The proposed model has favorable generalization properties, as it retains over 89.8% of its average POS tagging accuracy when trained on 1.2% of the total available training data, i.e. 150 sentences per language.


Introduction
Determining the linguistic structure of natural language texts based on rich hand-crafted features has a long history in natural language processing. The focus of traditional approaches has mostly been on building linguistic analyzers for a particular kind of analysis, which often leads to the incorporation of extensive linguistic and/or domain knowledge for defining the feature space. Consequently, traditional models easily become language and/or task specific, resulting in poor generalization properties.
A new research direction has emerged recently that aims at building more general models that require far less feature engineering or none at all. These advancements in natural language processing, pioneered by Bengio et al. (2003), followed by Collobert and Weston (2008), Collobert et al. (2011) and Mikolov et al. (2013a), among others, employ a different philosophy. The objective of these works is to find representations for linguistic phenomena in an unsupervised manner by relying on large amounts of text. Natural language phenomena are extremely sparse by their nature, whereas continuous word embeddings employ dense representations of words. In our paper we empirically verify via rigorous experiments that turning these dense representations into a much sparser (yet denser than one-hot encoding) form can keep the most salient parts of word representations that are highly suitable for sequence models.
Furthermore, our experiments reveal that our proposed model performs substantially better than traditional feature-rich models in the absence of abundant training data. Our proposed model also has the advantage of performing well on multiple sequence labeling tasks without any modification in the applied word representations thanks to the sparse features derived from continuous word representations.
Our work aims at introducing a novel sequence labeling model solely utilizing features derived from the sparse coding of continuous word embeddings. Even though sparse coding has been utilized in NLP before (Chen et al., 2016), to the best of our knowledge, we are the first to propose a sequence labeling framework incorporating it, with the following contributions:
• We show that the proposed sparse representation is general, as sequence labeling models trained on it achieve (near) state-of-the-art performance for both POS tagging and NER.
• We show that the representation is also general in another sense: it produces reasonable results for more than 40 treebanks for POS tagging.
• We rigorously compare different sparse coding approaches in conjunction with differently trained continuous word embeddings.
• We highlight the favorable generalization properties of our model in settings where access to only a very limited training corpus is assumed.
• We release the sparse word representations determined for our experiments at https://begab.github.io/sparse_embeds to ensure the replicability of our results and to foster further multilingual NLP research.

Related work
The line of research introduced in this paper relies on distributed word representations (Al-Rfou et al., 2013) and dictionary learning for sparse coding (Mairal et al., 2010) and also shows close resemblance to Faruqui et al. (2015).

Distributed word representations
Distributed word representations assign relatively low-dimensional, dense vectors to each word in a corpus such that words with similar context and meaning tend to have similar representations. From an algebraic point of view, the embedding of word $i$ having index $idx_i$ in a vocabulary $V$ can be thought of as the result of a matrix-vector multiplication $W \mathbf{1}_i$, where the $i$th column of matrix $W \in \mathbb{R}^{k \times |V|}$ contains the $k$-dimensional ($k \ll |V|$) embedding for word $i$ and vector $\mathbf{1}_i \in \mathbb{R}^{|V|}$ is the one-hot representation of word $i$. The one-hot representation of word $i$ is a vector which contains zeros for all of its entries except for index $idx_i$, where it stores a one. Depending on how the columns of $W$ (i.e. the word embeddings) get determined, we could distinguish a plethora of approaches (Bengio et al., 2003; Lebret and Collobert, 2014; Mnih and Kavukcuoglu, 2013; Collobert and Weston, 2008; Mikolov et al., 2013a; Pennington et al., 2014).
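The algebraic view above can be checked on a toy example. The following is a minimal sketch in plain Python; the three-word vocabulary and the numbers in `W` are made up for illustration and are not taken from any embedding used in the paper. It shows that multiplying the lookup matrix by a one-hot vector simply selects the corresponding column.

```python
# Toy illustration: W times a one-hot vector selects one column of W.
# The vocabulary and the values in W are invented for this example.

vocab = ["cat", "dog", "runs"]  # |V| = 3

# W has k = 2 rows and |V| = 3 columns; column i is the embedding of word i.
W = [
    [0.1, 0.4, -0.3],
    [0.7, -0.2, 0.5],
]


def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    return [1 if j == index else 0 for j in range(size)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def embed(word):
    """Embedding of `word`, computed as W times its one-hot vector."""
    idx = vocab.index(word)
    return matvec(W, one_hot(idx, len(vocab)))


def column(matrix, index):
    """Read column `index` of the matrix directly."""
    return [row[index] for row in matrix]
```

In practice the product is never formed explicitly; an embedding lookup reads the column directly, which is what `column` does here.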
Prediction-based distributed word embedding approaches such as word2vec (Mikolov et al., 2013a) have been conjectured to have superior performance over count-based word representations (Baroni et al., 2014). However, as Lebret and Collobert (2015), Levy et al. (2015) and Qu et al. (2015) point out, count-based distributional models can perform on par with prediction-based distributed word embedding models. Levy et al. (2015) illustrate that the effectiveness of neural word embeddings largely depends on the selection of model hyperparameters and other design choices.
In light of these findings, and in order to avoid tuning the hyperparameters of the word embedding model employed, we primarily use the publicly available pre-trained polyglot word embeddings of Al-Rfou et al. (2013), without any task-specific modification, for our experiments.
A key thing to note is that polyglot word embeddings are not tailored toward any specific language analysis task such as POS tagging or NER. These word embeddings are instead trained in a manner favoring the word analogy task introduced by Mikolov et al. (2013c). The polyglot project distributes word embeddings for more than 100 languages. Al-Rfou et al. (2013) also report results on POS tagging; however, the word representations they apply for these experiments are different from the task-agnostic representations they made publicly available.
There has been previous research on training neural networks for learning distributed word representations for various specific language analysis tasks. Collobert et al. (2011) propose neural network architectures for four natural language processing tasks, i.e. POS tagging, named entity recognition, semantic role labeling and chunking. Collobert et al. (2011) train word representations on large amounts of unannotated texts from Wikipedia, then update the pretrained word representations for the individual tasks. Our approach is different in that we do not update our word representations for the different tasks and, most importantly, in that we successfully use the features derived from sparse coding in a log-linear model instead of a neural network architecture. A final difference from Collobert et al. (2011) is that we experiment with a much wider range of languages, while they report results for English only. Qu et al. (2015) evaluate the impact of choosing different embedding methods on four sequence labeling tasks, i.e. POS tagging, NER, syntactic chunking and multiword expression identification. The hand-crafted features they employ for POS tagging and NER are the same as in Collobert et al. (2011) and Turian et al. (2010).

Sparse coding
The general goal of sparse coding is to express signals as sparse linear combinations of basis vectors; the task of finding an appropriate set of basis vectors is referred to as the dictionary learning problem (Mairal et al., 2010). Generally, given a data matrix $X \in \mathbb{R}^{k \times n}$ with its $i$th column $x_i$ representing the $i$th $k$-dimensional signal, the task is to find $D \in \mathbb{R}^{k \times m}$ and $\alpha \in \mathbb{R}^{m \times n}$ such that $X \approx D\alpha$. This can be formalized as an $\ell_1$-regularized linear least-squares minimization problem having the form
$$\min_{D \in \mathcal{C},\, \alpha} \|X - D\alpha\|_F^2 + \lambda \|\alpha\|_1, \qquad (1)$$
with $\mathcal{C}$ being the convex set of matrices whose column vectors have an $\ell_2$ norm of at most one, matrix $D$ acting as the shared dictionary across the signals, and the columns of the sparse matrix $\alpha$ containing the coefficients for the linear combinations of each of the $n$ observed signals.
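To make the coding step concrete, the sketch below solves the problem for a single signal $x$ and a fixed dictionary $D$, i.e. it minimizes $\|x - D\alpha\|_2^2 + \lambda \|\alpha\|_1$ over $\alpha$. It uses plain iterative soft-thresholding (ISTA), chosen here only because it fits in a few lines; it is not the solver used by SPAMS, and it does not learn $D$. The step size uses $2\|D\|_F^2$ as a safe upper bound on the Lipschitz constant of the gradient. The toy dictionary and signal are invented for illustration.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold` (proximal map of the l1 norm)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sparse_code(x, D, lam, n_iter=500):
    """Approximately minimize ||x - D a||_2^2 + lam * ||a||_1 over a with ISTA.

    x: list of k floats (one signal); D: k x m dictionary as a list of rows.
    """
    k, m = len(D), len(D[0])
    # 2 * ||D||_F^2 upper-bounds the Lipschitz constant 2 * ||D||_2^2.
    lipschitz = 2.0 * sum(d * d for row in D for d in row)
    alpha = [0.0] * m
    for _ in range(n_iter):
        # residual = D alpha - x
        residual = [
            sum(D[r][j] * alpha[j] for j in range(m)) - x[r] for r in range(k)
        ]
        # gradient of the squared error term: 2 * D^T residual
        grad = [2.0 * sum(D[r][j] * residual[r] for r in range(k)) for j in range(m)]
        alpha = [
            soft_threshold(alpha[j] - grad[j] / lipschitz, lam / lipschitz)
            for j in range(m)
        ]
    return alpha


# Toy dictionary with three unit-norm basis vectors (columns) in two dimensions.
D_toy = [
    [1.0, 0.0, 0.6],
    [0.0, 1.0, 0.8],
]
alpha_toy = sparse_code([1.0, 0.0], D_toy, lam=0.1)
```

For this toy input only the first basis vector ends up with a non-zero coefficient, slightly shrunk from 1 by the $\ell_1$ penalty, which illustrates how $\lambda$ trades reconstruction error for sparsity.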
Performing sparse coding of word embeddings has recently been proposed by Faruqui et al. (2015); however, the objective function they optimize differs from Eq. 1. In Section 4, we compare the effects of employing different sparse coding paradigms, including the ones in Faruqui et al. (2015).
In their work,  proposed an efficient learning algorithm for determining hierarchically organized sparse word representations using stochastic proximal methods. Most recently, Sun et al. (2016) have proposed an online learning algorithm using regularized dual averaging to directly obtain $\ell_1$-regularized continuous bag-of-words (CBOW) representations (Mikolov et al., 2013a) without the need to determine dense CBOW representations first.

Sequence labeling framework
This section introduces the sequence labeling framework we use for both POS tagging and NER. Since our goal is to measure the effectiveness of sparse word embeddings alone, we do not apply any features based on gazetteers, capitalization patterns or character suffixes.
As described previously, word embedding methods turn a high-dimensional (i.e., as many dimensions as words in the vocabulary) and extremely sparse (i.e. containing only one non-zero element at the vocabulary index of the word it represents) one-hot encoded representation of words into a dense embedding of much lower dimensionality $k$.
In our work, instead of using the low-dimensional dense word embeddings, we use a dictionary learning approach to obtain sparse codings for the embedded word representations. Formally, given the lookup matrix $W \in \mathbb{R}^{k \times |V|}$ which contains the embedding vectors, we learn $D \in \mathbb{R}^{k \times m}$, the dictionary matrix shared across all the embedding vectors, and $\alpha \in \mathbb{R}^{m \times |V|}$, containing sparse linear combination coefficients for each of the word embeddings, so that $\|W - D\alpha\|_F^2 + \lambda \|\alpha\|_1$ is minimized. Once the dictionary matrix $D$ is learned, the sparse linear combination coefficients $\alpha_i$ can easily be determined for a word embedding vector $w_i$ by solving an $\ell_1$-regularized linear least-squares minimization problem (Mairal et al., 2010). We define features based on vector $\alpha_i$ by taking the signs and indices of its non-zero coefficients, that is
$$f(w_i) = \{\operatorname{sign}(\alpha_i[j])\, j \mid \alpha_i[j] \neq 0\}, \qquad (2)$$
where $\alpha_i[j]$ denotes the $j$th coefficient in the sparse vector $\alpha_i$. The intuition behind this feature is that words with similar meaning are expected to use an overlapping set of basis vectors from dictionary $D$.
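The feature function described above, which keeps only the sign and index of each non-zero coefficient, can be sketched as follows. The string naming scheme (`P` for positive, `N` for negative, followed by the basis vector index) is a hypothetical choice made for this illustration; the coefficient vectors are invented.

```python
def sparse_features(alpha):
    """Map a sparse coefficient vector to indicator features.

    Each non-zero coefficient yields one feature encoding its sign and index;
    the magnitude of the coefficient is discarded.
    """
    features = []
    for j, coeff in enumerate(alpha):
        if coeff > 0:
            features.append(f"P{j}")
        elif coeff < 0:
            features.append(f"N{j}")
    return features


# Two made-up coefficient vectors that share one basis vector with the same sign.
alpha_a = [0.0, 0.8, 0.0, -0.3]
alpha_b = [0.2, 0.5, 0.0, 0.0]
shared = set(sparse_features(alpha_a)) & set(sparse_features(alpha_b))
```

Here `shared` contains the single feature that both toy words have in common, mirroring the intuition that related words reuse some of the same basis vectors.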
Incorporating the signs of coefficients into the feature function can help to distinguish cases when a basis vector takes part in the reconstruction of a word representation "destructively" or "constructively". When assigning features to a target word at some position within a sentence, we determine the same set of feature functions for the target word itself and its neighboring words of window size 1. Experiments with window size 2 were also performed. However, we omit these results for brevity as they do not substantially differ from those obtained with a window size of 1.
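A window of size 1 means that the features of the previous, current and next word are all attached to the target position. The sketch below is one way to do this, assuming per-word feature lists are already available; prefixing each feature with its relative offset and skipping positions outside the sentence are assumptions of this illustration rather than details stated in the text.

```python
def window_features(sentence_features, position, window=1):
    """Collect offset-tagged features for the target word and its neighbors.

    sentence_features: one list of feature names per word in the sentence.
    Positions that fall outside the sentence are skipped.
    """
    feats = []
    for offset in range(-window, window + 1):
        idx = position + offset
        if 0 <= idx < len(sentence_features):
            feats.extend(f"{offset:+d}:{name}" for name in sentence_features[idx])
    return feats


# Made-up per-word features for a three-word sentence.
toy_sentence = [["P1"], ["N3", "P7"], ["P2"]]
```

Tagging each feature with its offset keeps "the previous word uses basis vector 1" distinct from "the current word uses basis vector 1".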
We then use the previously described set of features in a linear chain CRF (Lafferty et al., 2001) using CRFsuite (Okazaki, 2007) with its default settings for hyperparameters, i.e., the coefficients of 1.0 and 0.001 for $\ell_1$ and $\ell_2$ regularization, respectively.

Experiments
We rely on the SPArse Modeling Software (SPAMS) (Mairal et al., 2010) for performing sparse coding of distributed word representations. For dictionary learning as formulated in Equation 1, one should choose $m$ and $\lambda$, controlling the number of basis vectors and the regularization coefficient affecting the sparsity of $\alpha$, respectively. Starting with $m = 256$ and doubling it at each iteration, our preliminary investigations showed a steady growth in the usefulness of sparse word representations as a function of $m$, plateauing at $m = 1024$. We set $m$ to that value for further experiments.

Baseline methods
Brown clustering
Various studies have identified Brown clustering (Brown et al., 1992) as a useful source of feature generation for sequence labeling tasks (Ratinov and Roth, 2009; Turian et al., 2010; Owoputi et al., 2013; Stratos and Collins, 2015; Derczynski et al., 2015). We should note that sparse coding can also be viewed as a kind of clustering that, unlike Brown clustering, has the capability of assigning word forms to multiple clusters at a time (corresponding to the non-zero coefficients in $\alpha$).
We thus define a linear chain CRF relying on features from the Brown cluster identifier of words as one of our baseline approaches. Since Brown clustering defines a hierarchical clustering over words, cluster supersets can easily function as features. We generate features from length-$p$ ($p \in \{4, 6, 10, 20\}$) prefixes of Brown cluster identifiers similar to Ratinov and Roth (2009) and Turian et al. (2010).
In our experiments we use the implementation by Liang (2005) for performing Brown clustering. As input text for determining Brown clusters, we provide the very same Wikipedia articles that are used for training the polyglot word embeddings. We also set the number of Brown clusters to be identified to 1024, which is the number of basis vectors applied during sparse coding (cf. $D \in \mathbb{R}^{64 \times 1024}$).

Table 1: Features and feature templates applied by our feature-rich baseline for target word $w_t$. ⊕ is a binary operator forming a feature from words and their relative positions by combining them together.
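The prefix features can be illustrated with a short sketch. Brown cluster identifiers are bit strings describing a path in the cluster hierarchy, so a shorter prefix corresponds to a coarser cluster. The feature naming and the handling of identifiers shorter than $p$ (the whole identifier is used) are assumptions of this example, and the bit strings are invented.

```python
PREFIX_LENGTHS = (4, 6, 10, 20)


def brown_prefix_features(bit_string, lengths=PREFIX_LENGTHS):
    """Return one feature per prefix length of a Brown cluster identifier.

    If the identifier is shorter than a prefix length, slicing returns the
    whole identifier for that length.
    """
    return [f"brown{p}={bit_string[:p]}" for p in lengths]


# Two made-up identifiers that agree on the first five bits only.
feats_a = brown_prefix_features("1011001110")
feats_b = brown_prefix_features("1011010001")
common = set(feats_a) & set(feats_b)
```

The two toy words share only the coarsest (length-4) cluster feature, which shows how prefixes expose the hierarchy to the CRF.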

Feature-rich representation
We report results relying on linear chain CRFs that assign a standard state-of-the-art feature-rich representation to sequences. We apply the very same features and feature templates included in the POS tagging model of CRFsuite. We summarize these features in Table 1, where ⊕ denotes the binary operator which defines features as a combination of word forms at different (not necessarily contiguous) positions of a sentence.
We use the same pool of features described in Table 1 for both POS tagging and NER. The reason why we do not adjust the feature-rich representation employed as our baseline for the different tasks is that we do not alter our representation in any way when using our sparse coding-based model either.
Note that features #1 through #5 in Table 1 operate at character level, whereas our proposed framework solely uses features derived from the sparse coding of word forms. We thus distinguish two feature-rich baselines, i.e. FR$_{w+c}$, including both word and character-level features, and FR$_w$, treating word forms as atomic units to derive features from.
Using dense word representations
As our ultimate goal is to demonstrate the usefulness of sparse features derived from dense word representations, it is important to address the question of whether sparse word representations are more beneficial for sequence labeling tasks compared to their dense counterparts. To this end, we developed a similar model to the one proposed in Section 3, except for using the original dense word representations for inducing features.
Accordingly, we made the following change in our feature function: instead of calculating Equation (2) for some word $i$, the modified feature function we use for this baseline is
$$f(w_i) = \{\, j : w_i[j] \mid j \in \{1, \ldots, k\} \,\},$$
that is, instead of relying on the non-zero values in $\alpha_i$, each word is characterized by its $k$ real-valued coordinates in the embedding space. In order to notationally distinguish sparse and dense representations, we add the subscript SC when we refer to a sparse coded version of some word embedding (e.g. SG$_{SC}$).
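The dense baseline can be sketched in the same style as the sparse features: every word receives exactly $k$ features, one per embedding dimension, each carrying a real value instead of acting as a binary indicator. The `d{j}` naming and the dictionary output format are assumptions of this illustration, and the vector is invented.

```python
def dense_features(embedding):
    """Map a dense embedding to k real-valued features, one per coordinate."""
    return {f"d{j}": float(value) for j, value in enumerate(embedding)}


# A made-up 4-dimensional embedding; every coordinate becomes a feature,
# including those that happen to be close to zero.
toy_embedding = [0.25, -1.5, 0.0, 0.75]
toy_dense = dense_features(toy_embedding)
```

In contrast to the sparse coded variant, the number of active features per word is fixed at $k$ here and does not depend on the regularization strength.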

POS tagging experiments
Even though it is reasonable to assume that languages share a common coarse set of linguistic categories, linguistic resources had their own notations for part-of-speech tags. The first notable attempt to canonize the multiple tag sets was the Google universal part-of-speech tags introduced by Petrov et al. (2012) in which the POS tags of various tagging schemes were mapped to 12 language-independent part-of-speech tags.
The recent initiative of universal dependencies (UD) aims to provide a unified notation for multiple linguistic phenomena, including part-of-speech tags as well. The POS tag set proposed for UD has 17 categories which partially overlap with those defined by Petrov et al. (2012).

Experiments using CoNLL 2006/07 data
We use 12 treebanks in the CoNLL-X format from the CoNLL-2006/07 (Buchholz and Marsi, 2006;Nivre et al., 2007) shared tasks. The complete list of the treebanks included in our experiments is presented in Table 2.
We rely on the official scripts released by Petrov et al. (2012) for mapping the treebank-specific POS tags to the Google universal POS tags in order to obtain results comparable across languages.
For our experiments we used the original CoNLL-X train/test splits of the treebanks.
A key factor for the effectiveness of our proposed model is the coverage of word embeddings, i.e. the proportion of tokens/word forms for which a distributed representation is available. Figure 1 depicts these coverage scores calculated over the merged training and test sets for the different languages. Figure 1 reveals that a substantial proportion of tokens has a distributed representation (around 90% for the majority of languages, except for Turkish, where it is 5 points lower). Token coverages of the word embeddings are most likely affected by the morphological richness of the languages and the elaborateness of the corresponding Wikipedia articles used for training word embeddings.
Comparing word embeddings
Our motivation for choosing polyglot word embeddings as input to sparse coding is that they are publicly available for a variety of languages. However, distributed word representations trained in any other reasonable manner can serve as input to our approach. In order to investigate whether some of the popular word embedding techniques are more favorable for our algorithm, we conduct experiments using alternatively trained embeddings, i.e. skip-gram (SG), continuous bag-of-words (CBOW) and Glove.
So that the utility of the different word embeddings is not conflated with other factors, we train them on the same Wikipedia dumps used for training the polyglot word vectors. We choose further hyperparameters identically to polyglot, i.e. we train 64-dimensional dense word representations using a symmetric context window of size 2 for both SG/CBOW and Glove.
Figure 2 includes POS tagging accuracies over the 12 treebanks from the CoNLL 2006/07 shared tasks evaluated against Google Universal POS tags. Instead of reporting results as a function of $\lambda$, we present accuracies as a function of the different sparsity levels induced by different $\lambda$ values. Figure 2 demonstrates that POS tagging performance is quite insensitive to the choice of $\lambda$ unless it yields some extreme sparsity level (>99.5%). Figure 2 also reveals that the usage of polyglot$_{SC}$ word representations tends to produce superior results over all alternative representations we experiment with. Furthermore, models using polyglot$_{SC}$ consistently outperform the FR$_w$ and Brown clustering-based baselines. Models relying on SG$_{SC}$ and CBOW$_{SC}$ representations have an average tagging accuracy of 93.74 and 93.63, respectively, and they typically perform better than the baseline using Brown clustering, which has an average tagging performance of 93.27. Although utilizing Glove embeddings produces the lowest scores (91.92 on average), its scores still surpass those of the FR$_w$ baseline for all languages except for Turkish.
The average tagging performance over the 12 languages when relying on features based on polyglot$_{SC}$ is only 1.3 points below that of FR$_{w+c}$ (i.e. 94.4 versus 95.7). Recall that FR$_{w+c}$ uses a feature-rich representation, whereas our proposed model uses only $O(m)$ features, i.e. it is tied to the number of basis vectors employed for sparse coding. Furthermore, our model does not employ word identity features, nor does it rely on character-level features of words.
Analyzing the effects of window size
Hyperparameters for training word representations can greatly impact their quality, as also concluded by Levy et al. (2015). We thus investigate whether providing a larger context window size during the training of CBOW, SG and Glove embeddings can improve their performance in our model. According to Figure 3, applying a context window size of 2 for training the word embeddings tends to produce better overall POS tagging accuracies than applying a larger window size of 10. Differences are the most pronounced in the case of the skip-gram representation, confirming the findings of Lin et al. (2015), i.e. embedding models that model short-range context are more effective for POS tagging.
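The coverage statistic discussed above, i.e. the share of tokens and of distinct word forms that have an embedding, can be computed along the following lines. This is a minimal sketch: the function name, the exact-match lookup (no lowercasing or other normalization) and the toy data are assumptions of the illustration.

```python
def coverage(tokens, embedding_vocab):
    """Return (token coverage, word form coverage) for a tokenized corpus.

    Token coverage counts every occurrence; word form coverage counts each
    distinct word form once. Lookup is by exact string match.
    """
    tokens = list(tokens)
    if not tokens:
        return 0.0, 0.0
    covered_tokens = sum(1 for t in tokens if t in embedding_vocab)
    word_forms = set(tokens)
    covered_forms = sum(1 for t in word_forms if t in embedding_vocab)
    return covered_tokens / len(tokens), covered_forms / len(word_forms)


# Made-up corpus of five tokens and an embedding vocabulary of three words.
toy_tokens = ["the", "cat", "saw", "the", "zyzzyva"]
toy_vocab = {"the", "cat", "saw"}
token_cov, form_cov = coverage(toy_tokens, toy_vocab)
```

Token coverage is typically higher than word form coverage because frequent words are more likely to have an embedding, which the toy example also shows.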
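The sparsity level used as the horizontal axis of such comparisons is simply the fraction of zero entries in the coefficient matrix $\alpha$. A minimal sketch, with helper names and toy values chosen for this illustration:

```python
def sparsity_level(alpha_columns):
    """Fraction of zero coefficients over all words and basis vectors.

    alpha_columns: one coefficient vector (length m) per word.
    """
    total = sum(len(column) for column in alpha_columns)
    zeros = sum(1 for column in alpha_columns for value in column if value == 0)
    return zeros / total


def average_nonzeros(m, sparsity):
    """Average number of non-zero coefficients per word at a given sparsity."""
    return m * (1.0 - sparsity)


# Two made-up coefficient vectors with m = 4.
toy_alpha = [[0.0, 0.0, 0.5, 0.0], [0.0, -0.2, 0.0, 0.0]]
toy_sparsity = sparsity_level(toy_alpha)
```

As a rough guide, with $m = 1024$ basis vectors a sparsity level of 99.5% corresponds to about 5 non-zero coefficients per word on average, i.e. `average_nonzeros(1024, 0.995)`.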

Comparing dense and sparse representations
Unless stated otherwise, we use $\lambda = 0.1$ for the experiments below in accordance with Figure 2. Table 3 demonstrates that performances obtained by models using dense word representations as features are consistently inferior to those of models relying on sparse word representations.
In Table 3b, we can see that polyglot embeddings perform the best for dense representations as well. When using dense features, the CBOW representation-based model tends to outperform the one based on SG embeddings by a margin of 1.4 points on average. This performance gap between the two word2vec variants vanishes, however, when dense representations are replaced by their sparse counterparts.

Comparing the effects of training corpus size
We also investigate the generalization characteristics of the proposed representation by training models that have access to substantially different amounts of training data per language. We distinguish three scenarios, i.e. when using only the first 150, the first 1,500 and all the available training sentences from each corpus. Figure 4 illustrates the average POS tagging accuracy over the 12 CoNLL-X datasets for different amounts of training data and models.  6.85 points advantage over FR$_w$ with a training corpus of 1,500 sentences. FR$_{w+c}$ has an average advantage of 1.3 points over polyglot$_{SC}$ when all training data is available during training; nevertheless, FR$_w$ still underperforms polyglot$_{SC}$ in that setting by 3.67 points.
Comparing sparse coding techniques
Next, we compare different sparse coding approaches on the pre-trained polyglot word representations. The recent work of Faruqui et al. (2015) formulated alternative approaches to determine sparse word representations. One of the objective functions Faruqui et al. (2015) apply is
$$\min_{D,\, \alpha} \|X - D\alpha\|_F^2 + \lambda \|\alpha\|_1 + \tau \|D\|_F^2. \qquad (3)$$
The main difference between Eq. 1 and Eq. 3 is that the latter does not explicitly constrain $D$ to be a member of the convex set of matrices comprising column vectors with a pre-defined upper bound on their norm. In order to implicitly control the norms of the basis vectors, Faruqui et al. (2015) apply an additional regularization term governed by an extra parameter $\tau$ in their objective function. Faruqui et al. (2015) also formulated a constrained objective function of the form
$$\min_{D,\, \alpha \geq 0} \|X - D\alpha\|_F^2 + \lambda \|\alpha\|_1 + \tau \|D\|_F^2, \qquad (4)$$
for which a non-negativity constraint on the elements of $\alpha$ (but no constraint on $D$) is imposed. When using the objective functions introduced by Faruqui et al. (2015), we use the default value $\tau = 10^{-5}$. Notationally, we distinguish the sparse coding approaches based on the equation they use as their objective function, i.e. SC-$i$, $i \in \{1, 3, 4\}$.
We applied $\lambda = 0.05$ for SC-1 and $\lambda = 0.5$ for SC-3 and SC-4 in order to obtain word representations of comparable average sparsity levels across the 12 languages, i.e. 95.3%, 94.5% and 95.2%, respectively (cf. the left of Figure 5). The right of Figure 5 further illustrates the spread of POS tagging accuracies over the 12 CoNLL-X treebanks when using models that rely on different sparse coding strategies with comparable sparsity levels.
Although Murphy et al. (2012) mention non-negativity as a desired property of word representations for cognitive plausibility, Figure 5 reveals that our sequence labeling model cannot benefit from it, as the average POS tagging accuracy for SC-4 is 0.7 points below that of the SC-3 approach. The average performances when applying SC-1 and SC-3 are nearly identical, with a 0.18 point difference between the two.
It is instructive to analyze the patterns different sparse coding approaches exhibit. Even though the objective functions used by the different approaches are similar, the decompositions obtained by them convey rather different sparsity structures. Figure 6a illustrates that there exists substantial variation in the length of the basis vectors obtained by SC-3 and SC-4, both within and across languages. However, SC-1 produces practically no variation in the length of the basis vectors comprising $D$ due to the constraint present in the objective function it employs. Figure 6b shows similar differences in the relative frequency of basis vectors taking part in the reconstruction of word embeddings. Figure 7 shows a strong correlation between the $\ell_2$ norm of basis vectors and the relative number of times a non-zero coefficient is assigned to them in $\alpha$ for SC-3 and SC-4, but not for SC-1. It can further be noted that the norms of the basis vectors determined by SC-3 and SC-4 are often orders of magnitude larger than those determined by SC-1. This effect, however, can be naturally mitigated by increasing $\tau$. Overall, the different approaches convey comparable POS tagging accuracies but different decompositions due to the differences in the objective functions they employ. Experiments described below are conducted using the objective function in Eq. 1.
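The two quantities compared in this analysis, the $\ell_2$ norm of each basis vector and how often it receives a non-zero coefficient, can be computed as sketched below. Measuring usage as the fraction of words that use a basis vector is an assumption of this illustration (the text only speaks of a relative number of times), and the toy dictionary and coefficients are invented.

```python
import math


def basis_vector_norms(D):
    """l2 norm of each column of the k x m dictionary D (given as rows)."""
    k, m = len(D), len(D[0])
    return [math.sqrt(sum(D[r][j] ** 2 for r in range(k))) for j in range(m)]


def usage_frequency(alpha_columns):
    """Fraction of words assigning a non-zero coefficient to each basis vector.

    alpha_columns: one coefficient vector (length m) per word.
    """
    n_words = len(alpha_columns)
    m = len(alpha_columns[0])
    return [
        sum(1 for alpha in alpha_columns if alpha[j] != 0) / n_words
        for j in range(m)
    ]


# Made-up dictionary with two basis vectors and coefficients for four words.
toy_D = [
    [3.0, 0.0],
    [4.0, 1.0],
]
toy_alphas = [[1.0, 0.0], [0.5, 0.0], [0.0, 2.0], [1.0, 1.0]]
norms = basis_vector_norms(toy_D)
usage = usage_frequency(toy_alphas)
```

Plotting `norms` against `usage` for a learned decomposition is one way to inspect the kind of relationship described for SC-3 and SC-4.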

Experiments using UD treebanks
For POS tagging we also experiment with UD v1.2 treebanks. We used the default train-test splits of the treebanks, not utilizing the development sets for fine-tuning performance on any of the languages during our experiments. We omitted the Japanese treebank, as the word forms in it are stripped due to licensing issues. Also, there are no polyglot vectors released for Old Church Slavonic and Gothic. Even though polyglot word representations are released for Arabic, they were of no practical use, as they contain unvocalized surface forms of tokens in contrast to the vocalized forms in UD v1.2. For this reason, we discarded the Arabic treebank, as less than 30% of its tokens could be associated with a representation. By omitting these 4 languages from our experiments we are finally left with 33 treebanks for 29 languages. We note that for the Ancient Greek treebanks (grc*) we use word embeddings trained on Modern Greek.
We should add that there are 4 languages (related to 6 treebanks) for which polyglot word vectors are accessible; however, the Wikipedia dumps used for training them are not distributed. For this reason, Brown clustering-based baselines are missing for the affected treebanks.
We report our results on UD v1.2 in Table 5. Recall that the default behavior of our sparse coding-based models (SC in Table 5) is that they do not handle word identity as an explicit feature. We now investigate how much word identity features contribute on their own and also when used in conjunction with sparse coding-derived features. To this end, we introduce a simple linear chain CRF model generating features solely from the identity of the current word and the ones surrounding it (WI in Table 5). Likewise, we define a model that relies on WI and SC features simultaneously (WI+SC). Table 5 reveals that SC outperforms WI by a large margin and that combining the two feature sets yields some further improvements over SC scores.
We also present in Table 5 the state-of-the-art results of the bidirectional LSTM models by Plank et al. (2016) for comparative purposes. Note that the authors reported results only on a subset of UD v1.2 (i.e. treebanks with at least 60k tokens), for which reason we can include their results on 21 treebanks. Out of these 21 UD v1.2 treebanks there are 15 and 20 cases, respectively, for which SC and WI+SC produce better results than bi-LSTM$_w$. Only FR$_{w+c}$ and bi-LSTM$_{w+c}$, models which enjoy the additional benefit of employing character-level features besides word-level ones, are capable of outperforming SC and WI+SC.

Named entity recognition experiments
Besides the POS tagging experiments, we investigated if the very same features as the ones applied for POS tagging can be utilized in a different sequence labeling task, namely named entity recognition. In order to evaluate our approach, we obtained the English, Spanish and Dutch datasets from the 2002 and 2003 CoNLL shared tasks on multilingual Named Entity Recognition (Tjong Kim Sang, 2002;Tjong Kim Sang and De Meulder, 2003).
We use the train-test splits provided by the organizers and report our NER results using the F1 scores based on the official evaluation script of the CoNLL shared task. Similar to Collobert et al. (2011), we also apply the 17-tag IOBES tagging scheme during training and inference. The best F1 score reported for English by Collobert et al. (2011) without employing additional unlabeled texts to enhance their language model is 81.47. When pre-training their neural language model on large amounts of Wikipedia texts, they report an F1 score of 87.58.
Figure 8 includes our NER results obtained using different word embedding representations as input for sparse coding and different levels of sparsity. Similar to our POS tagging experiments, using polyglot$_{SC}$ vectors tends to perform best for NER as well. However, a substantial difference compared to the POS tagging results is that NER performance does not degrade even for extreme levels of sparsity. Also, the sparse coding-based models perform much better when compared to the FR$_{w+c}$ baseline.
In Table 6, we compare the effectiveness of models relying on sparse and dense word representations for NER. In order not to fine-tune hyperparameters for a particular experiment, similarly to our previous choices, $m$ and $\lambda$ are set to 1024 and 0.1, respectively. Results in Table 6 are in line with those reported in Table 3 for POS tagging.
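The 17-tag IOBES scheme mentioned above follows from the four CoNLL entity types (PER, LOC, ORG, MISC), each with a Begin, Inside, End and Single tag, plus the Outside tag: 4 × 4 + 1 = 17. The sketch below converts a tag sequence to IOBES under the assumption that the input is in the IOB2 convention (every entity starts with a `B-` tag); this input convention is an assumption of the example and may require an extra conversion step for some of the shared task files.

```python
ENTITY_TYPES = ("PER", "LOC", "ORG", "MISC")


def iobes_tagset(entity_types=ENTITY_TYPES):
    """All IOBES tags: B-, I-, E-, S- for each entity type, plus O."""
    tags = [f"{prefix}-{etype}" for etype in entity_types for prefix in "BIES"]
    return tags + ["O"]


def iob2_to_iobes(tags):
    """Convert an IOB2 tag sequence (entities start with B-) to IOBES."""
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, etype = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- of the same type.
        continues = next_tag == f"I-{etype}"
        if prefix == "B":
            converted.append(("B-" if continues else "S-") + etype)
        else:  # prefix == "I"
            converted.append(("I-" if continues else "E-") + etype)
    return converted


toy_iob2 = ["B-PER", "I-PER", "O", "B-LOC", "B-ORG", "I-ORG", "I-ORG"]
toy_iobes = iob2_to_iobes(toy_iob2)
```

Single-token entities receive an `S-` tag and the last token of a longer entity receives an `E-` tag, which gives the CRF explicit information about entity boundaries.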

Conclusion
In this paper we show that it is possible to train sequence models that perform nearly as well as the best existing models on a variety of languages for both POS tagging and NER. Our approach does not require word identity features to perform reliably; furthermore, it is capable of achieving results comparable to traditional feature-rich models. We also illustrate the advantageous generalization property of our model, as it retained 89.8% of its original average POS tagging accuracy when trained on only 1.2% of the total accessible training sentences.
As Mikolov et al. (2013b) pointed out the similarities of continuous word embeddings across languages, we think that our proposed model could be employed not just in multilingual, but also in cross-lingual language analysis settings. We plan to investigate its feasibility in future work. Finally, we have made the sparse coded word embedding vectors publicly available in order to facilitate the reproducibility of our results and to foster multilingual and cross-lingual research.