Encoding Prior Knowledge with Eigenword Embeddings

Canonical correlation analysis (CCA) is a method for reducing the dimension of data represented using two views. It has been previously used to derive word embeddings, where one view indicates a word, and the other view indicates its context. We describe a way to incorporate prior knowledge into CCA, give a theoretical justification for it, and test it by deriving word embeddings and evaluating them on a myriad of datasets.


Introduction
In recent years there has been immense interest in representing words as low-dimensional continuous real vectors, namely word embeddings. Word embeddings aim to capture lexico-semantic information such that regularities in the vocabulary are topologically represented in a Euclidean space. Such word embeddings have achieved state-of-the-art performance on many natural language processing (NLP) tasks, e.g., syntactic parsing (Socher et al., 2013), word or phrase similarity (Mikolov et al., 2013b; Mikolov et al., 2013c), dependency parsing (Bansal et al., 2014), unsupervised learning (Parikh et al., 2014) and others. Since the discovery that word embeddings are useful as features for various NLP tasks, research on word embeddings has taken on a life of its own, with a vibrant community searching for better word representations in a variety of problems and datasets.
These word embeddings are often induced from large raw text capturing distributional co-occurrence information via neural networks (Bengio et al., 2003; Mikolov et al., 2013b; Mikolov et al., 2013c) or spectral methods (Deerwester et al., 1990; Dhillon et al., 2015). While these general purpose word embeddings have achieved significant improvement in various tasks in NLP, it has been discovered that further tuning of these continuous word representations for specific tasks improves their performance by a larger margin. For example, in dependency parsing, word embeddings could be tailored to capture similarity in terms of context within syntactic parses (Bansal et al., 2014) or they could be refined using semantic lexicons such as WordNet (Miller, 1995), FrameNet (Baker et al., 1998) and the Paraphrase Database (Ganitkevitch et al., 2013) to improve various similarity tasks (Yu and Dredze, 2014; Faruqui et al., 2015; Rothe and Schütze, 2015). This paper proposes a method to encode prior semantic knowledge in spectral word embeddings (Dhillon et al., 2015).
Spectral learning algorithms are of great interest for their speed, scalability, theoretical guarantees and performance in various NLP applications. These algorithms are no strangers to word embeddings either. In latent semantic analysis (LSA; Deerwester et al., 1990; Landauer et al., 1998), word embeddings are learned by performing SVD on the word-by-document matrix. Recently, Dhillon et al. (2015) have proposed to use canonical correlation analysis (CCA) as a method to learn low-dimensional real vectors, called Eigenwords. Unlike LSA-based methods, CCA-based methods are scale invariant and can capture multiview information such as the left and right contexts of the words. As a result, the eigenword embeddings of Dhillon et al. (2015), which were learned using simple linear methods, give accuracies comparable to or better than the state of the art when compared with highly nonlinear deep learning based approaches (Collobert and Weston, 2008; Mnih and Hinton, 2007; Mikolov et al., 2013b; Mikolov et al., 2013c).
The main contribution of this paper is a technique to incorporate prior knowledge into the derivation of canonical correlation analysis. In contrast to previous work where prior knowledge is introduced in the off-the-shelf embeddings as a post-processing step (Faruqui et al., 2015; Rothe and Schütze, 2015), our approach introduces prior knowledge in the CCA derivation itself. In this way it preserves the theoretical properties of spectral learning algorithms for learning word embeddings. The prior knowledge is based on lexical resources such as WordNet, FrameNet and the Paraphrase Database.
Our derivation of CCA to incorporate prior knowledge is not limited to eigenwords and can be used with CCA for other problems. It follows a similar idea to the one proposed by Koren and Carmel (2003) for improving the visualization of principal vectors with principal component analysis (PCA). Our derivation represents the solution to CCA as that of an optimization problem which maximizes the distance between the two view projections of training examples, while weighting these distances using the external source of prior knowledge. As such, our approach applies to other uses of CCA in the NLP literature, such as the one of Jagarlamudi and Daumé (2012), who used CCA for transliteration, or the one of Silberer et al. (2013), who used CCA for semantically representing visual attributes.

Background and Notation
For an integer n, we denote by [n] the set of integers {1, ..., n}. We assume the existence of a vocabulary of words, usually taken from a corpus. This set of words is denoted by $H = \{h_1, \ldots, h_{|H|}\}$. For a square matrix A, we denote by diag(A) a diagonal matrix B which has the same dimensions as A such that $B_{ii} = A_{ii}$ for all i. For a vector $v \in \mathbb{R}^d$, we denote its $\ell_2$ norm by $\|v\|$, i.e. $\|v\| = \sqrt{\sum_{i=1}^{d} v_i^2}$. We also denote by $v_j$ or $[v]_j$ the jth coordinate of v. For a pair of vectors u and v, we denote their dot product by $\langle u, v \rangle$.
We define a word embedding as a function f from H to $\mathbb{R}^m$ for some (relatively small) m. For example, in our experiments we vary m between 50 and 300. The word embedding function maps the word to some real-vector representation, with the intention to capture regularities in the vocabulary that are topologically represented in the corresponding Euclidean space. For example, all vocabulary words that correspond to city names could be grouped together in that space.
Research on the derivation of word embeddings that capture various regularities has accelerated enormously in recent years. The methods used for this purpose range from low-rank approximations of co-occurrence statistics (Deerwester et al., 1990; Dhillon et al., 2015) to neural networks that jointly learn a language model (Bengio et al., 2003; Mikolov et al., 2013a) or other NLP tasks (Collobert and Weston, 2008).

Canonical Correlation Analysis for Deriving Word Embeddings
One recent approach to derive word embeddings, developed by Dhillon et al. (2015), is through the use of canonical correlation analysis, resulting in so-called "eigenwords." CCA is a technique for multiview dimensionality reduction. It assumes the existence of two views for a set of data, similarly to co-training (Yarowsky, 1995; Blum and Mitchell, 1998), and then projects the data in the two views in a way that maximizes the correlation between the projected views. Dhillon et al. (2015) used CCA to derive word embeddings through the following procedure. They first break each document in a corpus of documents into n sequences of words of a fixed length 2k + 1, where k is a window size. For example, if k = 2, the short document "Harry Potter has been a bestseller" would be broken into "Harry Potter has been a" and "Potter has been a bestseller." In each such sequence, the middle word is identified as a pivot.
This leads to the construction of the following training set from a set of documents: $\{(w_1^{(i)}, \ldots, w_k^{(i)}, w^{(i)}, w_{k+1}^{(i)}, \ldots, w_{2k}^{(i)}) \mid i \in [n]\}$. With abuse of notation, this is a multiset, as certain words are expected to appear in certain contexts multiple times. Each $w^{(i)}$ is a pivot word, and the rest of the elements are words in the sequence called "the context words." With this training set in mind, the two views for CCA are defined as follows.
We define the first view through a sparse "context matrix" $C \in \mathbb{R}^{n \times 2k|H|}$ such that each row in the matrix is a vector, consisting of 2k one-hot vectors, each of length |H|. Each such one-hot vector corresponds to a word that fired in a specific index in the context. In addition, we also define a second view through a matrix $W \in \mathbb{R}^{n \times |H|}$ such that $W_{ij} = 1$ if $w^{(i)} = h_j$. We present both views of the training set in Figure 1.
Figure 1: The word and context views represented as matrices W and C. Each row in W is a vector of length |H|, corresponding to a one-hot vector for the word in the example indexed by the row. Each row in C is a vector of length 2k|H|, divided into sub-vectors each of length |H|. Each such sub-vector is a one-hot vector for one of the 2k context words in the example indexed by the row.
Note that now the matrix $M = W^\top C$ is in $\mathbb{R}^{|H| \times 2k|H|}$ such that each element $M_{ij}$ gives the number of times that $h_i$ appeared with the corresponding context word and context index encoded by j.
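The construction of the two views can be made concrete with a minimal sketch, assuming dense NumPy arrays and whitespace tokenization (a real corpus would need sparse matrices). The helper name `build_views` and the alphabetical ordering of the vocabulary are illustrative choices, not part of the original method.

```python
import numpy as np

def build_views(docs, k):
    """Build the one-hot word view W and the context view C from tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {h: j for j, h in enumerate(vocab)}
    size = len(vocab)
    rows_w, rows_c = [], []
    for doc in docs:
        # Slide a window of length 2k + 1; the middle token is the pivot.
        for start in range(len(doc) - 2 * k):
            window = doc[start:start + 2 * k + 1]
            pivot = window[k]
            context = window[:k] + window[k + 1:]
            w_row = np.zeros(size)
            w_row[index[pivot]] = 1.0
            c_row = np.zeros(2 * k * size)
            for pos, tok in enumerate(context):
                # Sub-vector number `pos` is the one-hot vector of that context slot.
                c_row[pos * size + index[tok]] = 1.0
            rows_w.append(w_row)
            rows_c.append(c_row)
    return vocab, np.array(rows_w), np.array(rows_c)

doc = "Harry Potter has been a bestseller".split()
vocab, W, C = build_views([doc], k=2)
M = W.T @ C  # |H| x 2k|H| matrix of word/context-slot co-occurrence counts
```

With k = 2 the six-word document yields two training examples, whose pivots are "has" and "been".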
Similarly, we define the matrices $D_1 = \mathrm{diag}(W^\top W)$ and $D_2 = \mathrm{diag}(C^\top C)$. Finally, to get the word embeddings, we perform singular value decomposition (SVD) on the matrix $D_1^{-1/2} M D_2^{-1/2}$. Note that in its original form, CCA requires the use of $W^\top W$ and $C^\top C$ in their full form, and not just the corresponding diagonal matrices $D_1$ and $D_2$. In practice, however, inverting these matrices can be computationally intensive and can lead to memory issues. As such, we approximate CCA by using the diagonal matrices $D_1$ and $D_2$.
From the SVD step, we get two projections $U \in \mathbb{R}^{|H| \times m}$ and $V \in \mathbb{R}^{2k|H| \times m}$ such that $D_1^{-1/2} M D_2^{-1/2} \approx U \Sigma V^\top$, where $\Sigma \in \mathbb{R}^{m \times m}$ is a diagonal matrix with $\Sigma_{ii} > 0$ being the ith largest singular value of $D_1^{-1/2} M D_2^{-1/2}$. In order to get the final word embeddings, we calculate $D_1^{-1/2} U \in \mathbb{R}^{|H| \times m}$. Each row in this matrix corresponds to an m-dimensional vector for the corresponding word in the vocabulary. This means that $f(h_i)$ for $h_i \in H$ is the ith row of the matrix $D_1^{-1/2} U$. The projection V can be used to get "context embeddings." See more about this in Dhillon et al. (2015).
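The scaling and SVD steps can be sketched as follows, assuming the decomposed matrix is $D_1^{-1/2} M D_2^{-1/2}$ and that W and C fit in memory as dense arrays. The function name `eigenword_embeddings`, the random toy data and the guard against zero counts are illustrative assumptions; a full-scale implementation would use a sparse, truncated SVD instead.

```python
import numpy as np

def eigenword_embeddings(W, C, m):
    """Approximate CCA with diagonal scaling; returns D_1^{-1/2} U, V and the singular values."""
    M = W.T @ C
    d1 = (W * W).sum(axis=0)  # diagonal of W^T W
    d2 = (C * C).sum(axis=0)  # diagonal of C^T C
    # Assumption: words or context slots with zero counts are left unscaled.
    d1 = np.where(d1 > 0, d1, 1.0)
    d2 = np.where(d2 > 0, d2, 1.0)
    scaled = M / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]
    U, s, Vt = np.linalg.svd(scaled, full_matrices=False)
    U, s, V = U[:, :m], s[:m], Vt[:m].T
    word_emb = U / np.sqrt(d1)[:, None]  # row i is f(h_i)
    return word_emb, V, s

# Toy data: n random examples over a vocabulary of size H, window size k.
rng = np.random.default_rng(0)
n, H, k = 50, 5, 1
W = np.eye(H)[rng.integers(0, H, size=n)]
C = np.hstack([np.eye(H)[rng.integers(0, H, size=n)] for _ in range(2 * k)])
word_emb, V, s = eigenword_embeddings(W, C, m=3)
```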
The use of CCA in this way to derive word embeddings follows the usual distributional hypothesis (Harris, 1957) that most word embedding techniques rely on. In the case of CCA, this hypothesis is translated into action in the following way. CCA finds projections for the contexts and for the pivot words which are most correlated. This means that if a word co-occurs in a specific context many times (either directly, or transitively through similarity to other words), then this context is expected to be projected to a point "close" to the point to which the word is projected. As such, if two words occur in a specific context many times, these two words are expected to be projected to points which are close to each other.
For the next section, we denote $X = W D_1^{-1/2}$ and $Y = C D_2^{-1/2}$. To refer to the dimensions of X and Y generically, we denote d = |H| and d′ = 2k|H|. In addition, we refer to the column vectors of U and V as $u_1, \ldots, u_m$ and $v_1, \ldots, v_m$.

Mathematical Intuition Behind CCA
The procedure that CCA follows finds a projection of the two views in a shared space, such that the correlation between the two views is maximized at each coordinate, and there is minimal redundancy between the coordinates of each view. This means that CCA solves the following sequence of optimization problems for j ∈ [m], where $a_j \in \mathbb{R}^{1 \times d}$ and $b_j \in \mathbb{R}^{1 \times d'}$:
$$\arg\max_{a_j, b_j} \; \mathrm{corr}(a_j W^\top, b_j C^\top) \quad \text{such that} \quad \mathrm{corr}(a_j W^\top, a_k W^\top) = 0 \text{ and } \mathrm{corr}(b_j C^\top, b_k C^\top) = 0 \text{ for } k < j,$$
where corr is a function that accepts two vectors and returns the Pearson correlation between the pairwise elements of the two vectors. The approximate solution to this optimization problem (when using diagonal $D_1$ and $D_2$) is $\hat{a}_i^\top = D_1^{-1/2} u_i$ and $\hat{b}_i^\top = D_2^{-1/2} v_i$ for i ∈ [m]. CCA also has a probabilistic interpretation as a maximum likelihood solution of a latent variable model for two normal random vectors, each drawn based on a third latent Gaussian vector (Bach and Jordan, 2005).
The way we describe CCA for deriving word embeddings is related to Latent Semantic Indexing (LSI), which performs singular value decomposition on the matrix M directly, without doing any kind of variance normalization. Dhillon et al. (2015) describe some differences between LSI and CCA. The extra normalization step decreases the importance of frequent words when doing SVD.

Incorporating Prior Knowledge into Canonical Correlation Analysis
In this section, we detail the technique we use to incorporate prior knowledge into the derivation of canonical correlation analysis. The main motivation behind our approach is to improve the optimization of correlation between the two views by weighing them using the external source of prior knowledge. The prior knowledge is based on lexical resources such as WordNet, FrameNet and the Paraphrase Database. Our approach follows a similar idea to the one proposed by Koren and Carmel (2003) for improving the visualization of principal vectors with principal component analysis (PCA). It is also related to Laplacian manifold regularization (Belkin et al., 2006).
An important notion in our derivation is that of a Laplacian matrix. The Laplacian of an undirected weighted graph is an n × n matrix where n is the number of nodes in the graph. It equals D − A where A is the adjacency matrix of the graph (so that $A_{ij}$ is the weight for the edge (i, j) in the graph, if it exists, and 0 otherwise) and D is a diagonal matrix such that $D_{ii} = \sum_j A_{ij}$. The Laplacian is always a symmetric square matrix such that the sum over rows (or columns) is 0. It is also positive semidefinite.
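The three properties of the Laplacian stated above can be checked on a small example. The following is a minimal sketch with a hypothetical three-node weighted graph; the function name `graph_laplacian` is illustrative.

```python
import numpy as np

def graph_laplacian(A):
    """Return D - A for a symmetric, non-negative adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    D = np.diag(A.sum(axis=1))  # D_ii = sum_j A_ij
    return D - A

# Hypothetical graph: edge (0, 1) has weight 2 and edge (1, 2) has weight 1.
A = np.array([[0, 2, 0],
              [2, 0, 1],
              [0, 1, 0]])
L = graph_laplacian(A)

is_symmetric = np.allclose(L, L.T)
rows_sum_to_zero = np.allclose(L.sum(axis=1), 0.0)
smallest_eigenvalue = np.linalg.eigvalsh(L).min()  # non-negative up to rounding
```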
We propose a generalization of CCA, in which we introduce a Laplacian matrix into the derivation of CCA itself, as shown in Figure 2. We encode prior knowledge about the distances between the projections of two views into the Laplacian.The Laplacian allows us to improve the optimization of the correlation between the two views by weighing them using the external source of prior knowledge.

Generalization of CCA
We present three lemmas (proofs are given in Appendix A), followed by our main proposition.These three lemmas are useful to prove our final proposition.
The main proposition shows that CCA maximizes the distance between the two view projections for any pair of examples i and j, i ≠ j, while minimizing the two view projection distance for the two views of an example i. The two views we discuss here in practice are the view of the word through a one-hot representation, and the view which represents the context words for a specific word token. The distance between two view projections is defined in Eq. 2.
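Lemma 1. Let X and Y be two matrices of size n × d and n × d′, respectively, for example, as defined in §3. Assume that $\sum_{i=1}^{n} X_{ij} = 0$ for j ∈ [d] and $\sum_{i=1}^{n} Y_{ij} = 0$ for j ∈ [d′]. Let L be an n × n Laplacian matrix such that
$$L_{ij} = \begin{cases} n - 1 & \text{if } i = j \\ -1 & \text{if } i \neq j. \end{cases} \quad (1)$$
Then $X^\top L Y$ equals $X^\top Y$ up to a multiplication by a positive constant.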
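A short derivation sketch may help make the constant in Lemma 1 explicit. It assumes that L is the Laplacian of the complete graph on n nodes, with n − 1 on the diagonal and −1 elsewhere, and writes $\mathbf{1}$ for the all-ones vector of length n; it is not a substitute for the proof in Appendix A.

```latex
\begin{align*}
L &= n I_n - \mathbf{1}\mathbf{1}^\top, \\
X^\top L Y &= n\, X^\top Y - (X^\top \mathbf{1})(\mathbf{1}^\top Y) = n\, X^\top Y,
\end{align*}
% The last step uses the centering assumption: every column of X and of Y
% sums to zero, so X^\top \mathbf{1} = 0 and \mathbf{1}^\top Y = 0.
```

Under these assumptions the positive constant is n, the number of training examples.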
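Lemma 2. Let $A \in \mathbb{R}^{d \times d'}$. Then the rank m thin-SVD of A can be found by solving the following optimization problem:
$$\max_{u_1, \ldots, u_m, v_1, \ldots, v_m} \; \sum_{i=1}^{m} u_i^\top A v_i \quad \text{such that} \quad u_i^\top u_j = v_i^\top v_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j, \end{cases}$$
where $u_i \in \mathbb{R}^{d \times 1}$ denote the left singular vectors, and $v_i \in \mathbb{R}^{d' \times 1}$ denote the right singular vectors.
The last utility lemma we describe shows that interjecting the Laplacian between the two views can be expressed as a weighted sum of the distances between the projections of the two views (these distances are given in Eq. 2), where the weights come from the Laplacian.
Lemma 3. Let $u_1, \ldots, u_m$ and $v_1, \ldots, v_m$ be two sets of vectors of length d and d′ respectively. Let $L \in \mathbb{R}^{n \times n}$ be a Laplacian and $X \in \mathbb{R}^{n \times d}$ and $Y \in \mathbb{R}^{n \times d'}$. Then:
$$\sum_{k=1}^{m} (X u_k)^\top L (Y v_k) = \sum_{i,j} -L_{ij} \left(d_{ij}^m\right)^2,$$
where
$$d_{ij}^m = \sqrt{\frac{1}{2} \sum_{k=1}^{m} \left([X u_k]_i - [Y v_k]_j\right)^2}. \quad (2)$$
The following proposition is our main result for this section.
Proposition 4. The matrices $U \in \mathbb{R}^{d \times m}$ and $V \in \mathbb{R}^{d' \times m}$ that CCA computes are the m-dimensional projections that maximize
$$\sum_{i \neq j} \left(d_{ij}^m\right)^2 - (n - 1) \sum_{i=1}^{n} \left(d_{ii}^m\right)^2, \quad (3)$$
where $d_{ij}^m$ is defined as in Eq. 2 for $u_1, \ldots, u_m$ being the columns of U and $v_1, \ldots, v_m$ being the columns of V.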
Proof. According to Lemma 3, the objective in Eq. 3 equals $\sum_{k=1}^{m} (X u_k)^\top L (Y v_k)$, where L is defined as in Eq. 1. Therefore, maximizing Eq. 3 corresponds to maximization of $\sum_{k=1}^{m} u_k^\top (X^\top L Y) v_k$ under the constraints that the U and the V matrices have orthonormal columns. Using Lemma 2, it can be shown that the solution to this maximization is obtained by performing singular value decomposition on $X^\top L Y$. According to Lemma 1, this corresponds to finding U and V by performing singular value decomposition on $X^\top Y$, because a multiplicative constant does not change the value of the right/left singular vectors.
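The identity behind Lemma 3 can be checked numerically. The sketch below assumes the squared distance between the projection of view one of example i and view two of example j is $\frac{1}{2}\sum_k ([Xu_k]_i - [Yv_k]_j)^2$, and uses random matrices and a random weighted graph; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_prime, m = 6, 4, 5, 2
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, d_prime))
U = rng.normal(size=(d, m))
V = rng.normal(size=(d_prime, m))

# Random symmetric adjacency matrix with zero diagonal, and its Laplacian.
A = np.triu(rng.random((n, n)), 1)
A = A + A.T
L = np.diag(A.sum(axis=1)) - A

P, Q = X @ U, Y @ V  # projections of the two views, one row per example

# Left-hand side: sum over k of (X u_k)^T L (Y v_k).
lhs = sum(P[:, k] @ L @ Q[:, k] for k in range(m))

# Right-hand side: sum over i, j of -L_ij times the assumed squared distance.
sq_dist = 0.5 * ((P[:, None, :] - Q[None, :, :]) ** 2).sum(axis=2)
rhs = (-L * sq_dist).sum()
```

The two sides agree because the rows and columns of a Laplacian sum to zero, which cancels the squared-norm terms and leaves only the cross terms.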
The above proposition shows that CCA tries to find projections of both views such that the distances between the two views for pairs of examples with indices i ≠ j are maximized (first term in Eq. 3), while minimizing the distance between the projections of the two views for a specific example (second term in Eq. 3). Therefore, CCA tries to project a context and a word in that context to points that are close to each other in a shared space, while maximizing the distance between a context and a word which do not often co-occur together.
As long as L is a Laplacian, Proposition 4 is still true, only with the maximization of the objective
$$\sum_{i \neq j} -L_{ij} \left(d_{ij}^m\right)^2 - \sum_{i=1}^{n} L_{ii} \left(d_{ii}^m\right)^2, \quad (4)$$
where $L_{ij} \le 0$ for i ≠ j and $L_{ii} \ge 0$. This result lends itself to a generalization of CCA, in which we use predefined weights for the Laplacian that encode some prior knowledge about the distances that the projections of two views should satisfy.
If the weight $-L_{ij}$ is large for a specific (i, j), then we will try harder to maximize the distance between one view of example i and the other view of example j (i.e. we will try to project the word $w^{(i)}$ and the context of example j into distant points in the space).
This means that in the current formulation, $-L_{ij}$ plays the role of a dissimilarity indicator between pairs of words. The more dissimilar words are, the larger the weight, and then the more distant the projections are for the contexts and the words.

From CCA with Dissimilarities to CCA with Similarities
It is often more convenient to work with similarity measures between pairs of words.To do that, we can retain the same formulation as before with the Laplacian, where −L ij now denotes a measure of similarity.Now, instead of maximizing the objective in Eq. 4, we are required to minimize it.
It can be shown that such a mirror formulation can be obtained with an algorithm similar to CCA, leading to a proposition in the style of Proposition 4. To solve this minimization formulation, we just need to choose the singular vectors associated with the smallest m singular values (instead of the largest).
Once we change the CCA algorithm with the Laplacian to choose these projections, we can define L, for example, based on a similarity graph. The graph is an undirected graph that has |H| nodes, one for each word in the vocabulary, and there is an edge between a pair of words whenever the two words are similar to each other based on some external source of information, such as WordNet (for example, if they are synonyms).
We then define the Laplacian L such that $L_{ij} = -1$ if i and j are adjacent in the graph (and i ≠ j), $L_{ii}$ is the degree of the node i and $L_{ij} = 0$ in all other cases. By using this variant of CCA, we strive to minimize the distance of the two views between words which are adjacent in the graph (or continuing the example above, minimize the distance between words which are synonyms). In addition, the more adjacent nodes a word has (or the more synonyms it has), the less important it is to minimize the distance between the two views of that given word.

Final Algorithm
In order to use an arbitrary Laplacian matrix with CCA, we require that the data is centered, i.e. that the average over all examples of each of the coordinates of the word and context vectors is 0. However, such a prerequisite would make the matrices C and W dense (with many non-zero values), and hard to maintain in memory, and would also make singular value decomposition inefficient.
We therefore do not center the data, which keeps it sparse, and use a matrix L which is not strictly a Laplacian, but that behaves better in practice. Given the graph mentioned in §4, which is extracted from an external source of information, we use L such that $L_{ij} = \alpha$ if i ≠ j are adjacent in the graph, for an α ∈ (0, 1) which is treated as a smoothing factor for the graph (see below for the choices of α), $L_{ij} = 0$ if i ≠ j are not adjacent, and finally $L_{ii} = 1$ for all i ∈ [n]. Therefore, this matrix is symmetric, and the only constraint it does not satisfy is that of rows and columns summing to 0.
Scanning the documents and calculating the statistic matrix with the Laplacian is computationally infeasible with a large number of tokens given as input, since it is quadratic in that number. As such, we make another modification to the algorithm, and calculate a "local" Laplacian. The modification requires an integer N as input (we use N = 12), and then it makes updates to pairs of word tokens only if they are within an N-sized window of each other.
Figure 3: Steps of the final algorithm.
• If i = j, increase $M_{rs}$ by 1 for r denoting the index of word $w^{(i)}$ and for all s denoting the context indices of words $w_1^{(i)}, \ldots, w_k^{(i)}$ and $w_{k+1}^{(i)}, \ldots, w_{2k}^{(i)}$.
• If i ≠ j and word $w^{(i)}$ is connected to word $w^{(j)}$ in G, increase $M_{rs}$ by α for r denoting the index of word $w^{(i)}$ and for all s denoting the context indices of words $w_1^{(j)}, \ldots, w_k^{(j)}$ and $w_{k+1}^{(j)}, \ldots, w_{2k}^{(j)}$.
• Calculate $D_1$ and $D_2$ as specified in §3.
(Singular value decomposition step)
• Perform singular value decomposition on $D_1^{-1/2} M D_2^{-1/2}$.
(Word embedding projection)
• For each word $h_i$ for i ∈ [|H|] return the word embedding that corresponds with the ith row of U.
The final algorithm we use is described in Figure 3. The algorithm works by directly computing the co-occurrence matrix M (instead of maintaining W and C). It does so by increasing by 1 any cells corresponding to word-context co-occurrence in the documents and by α any cells corresponding to words and contexts that are connected in the graph.
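The statistics-collection step can be sketched as follows. This is a minimal illustration rather than the SWELL implementation: the function name `cooccurrence_with_prior`, the dictionary-based sparse matrix, the adjacency-set representation of the graph and the reading of "within an N-sized window" as a token distance of at most `window` are all assumptions.

```python
from collections import defaultdict

def cooccurrence_with_prior(docs, graph, k, alpha, window):
    """Accumulate M directly, adding alpha for graph-linked pivot/context pairs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    idx = {h: j for j, h in enumerate(vocab)}
    size = len(vocab)
    M = defaultdict(float)  # sparse matrix keyed by (word index, context index)

    def context_columns(doc, p):
        # Column index for each of the 2k context slots around pivot position p.
        positions = list(range(p - k, p)) + list(range(p + 1, p + k + 1))
        return [slot * size + idx[doc[q]] for slot, q in enumerate(positions)]

    for doc in docs:
        pivots = range(k, len(doc) - k)
        for p in pivots:
            r = idx[doc[p]]
            # Same example (i = j): ordinary word-context co-occurrence, weight 1.
            for s in context_columns(doc, p):
                M[r, s] += 1.0
            # Different examples (i != j): the pivot of example i is paired with
            # the context of example j when the pivots are close and linked in G.
            for q in pivots:
                if q == p or abs(q - p) > window:
                    continue
                if doc[q] in graph.get(doc[p], ()):
                    for s in context_columns(doc, q):
                        M[r, s] += alpha
    return vocab, M

doc = "the big dog saw a large cat".split()
graph = {"big": {"large"}, "large": {"big"}}  # hypothetical synonym graph
vocab, M = cooccurrence_with_prior([doc], graph, k=1, alpha=0.5, window=12)
```

In this toy run the pivot "big" receives weight 1 for its own context words ("the", "dog") and weight 0.5 for the context words of its synonym "large" ("a", "cat").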

Experiments
In this section we describe our experiments.

Experimental Setup
Training Data We used three datasets, WIKI1, WIKI2 and WIKI5, based on the first 1, 2 and 5 billion words from Wikipedia respectively. Each dataset is broken into chunks of length 13 (a window size of 6), each corresponding to a document. The above Laplacian L is calculated within each document separately. This means that −L ij is 1 only if i and j denote two words that appear in the same document. This is done to make the calculations computationally feasible. We calculate word embeddings for the 200K most frequent words.
Prior Knowledge Resources We consider three sources of prior knowledge: WordNet (Miller, 1995), the Paraphrase Database of Ganitkevitch et al. (2013), abbreviated as PPDB, and FrameNet (Baker et al., 1998). Since FrameNet and WordNet index words in their base form, we use WordNet's stemmer to identify the base form for the text in our corpora whenever we calculate the Laplacian graph.
For WordNet, we have an edge in the graph if one word is a synonym, hypernym or hyponym of the other.For PPDB, we have an edge if one word is a paraphrase of the other, according to the database.For FrameNet, we connect two words in the graph if they appear in the same frame.

System Implementation
We modified the implementation of the SWELL Java package of Dhillon et al. (2015). Specifically, we needed to modify the loop that iterates over words in each document to a nested loop that iterates over pairs of words, in order to compute a sum of the form $\sum_{i,j} X_{ri} L_{ij} Y_{js}$.
Table 1: [...] 2012), Faruqui and Dyer (2014) and Dhillon et al. (2015), respectively. The second, middle blocks (D-F) are our eigenword embeddings encoded with prior knowledge using our method. Each row in the block corresponds to a specific use of an α value (smoothing factor), as described in Figure 3. In the lower blocks (G-I) we take the word embeddings from the second block, and retrofit them using the method of Faruqui et al. (2015). Best results in each block are in bold.

Baselines
Off-the-shelf Word Embeddings We compare our word embeddings with existing state-of-the-art word embeddings, such as Glove (Pennington et al., 2014), Skip-Gram (Mikolov et al., 2013b), Global Context (Huang et al., 2012) and Multilingual (Faruqui and Dyer, 2014). We also compare our word embeddings with the eigenword embeddings of Dhillon et al. (2015) without any prior knowledge.

Retrofitting for Prior Knowledge
We compare our approach of incorporating prior knowledge into the derivation of CCA against previous work where prior knowledge is introduced in the off-the-shelf embeddings as a post-processing step (Faruqui et al., 2015; Rothe and Schütze, 2015). In this paper, we focus on the retrofitting approach of Faruqui et al. (2015).
Retrofitting works by optimizing an objective function which has two terms: one that tries to keep the distance between the word vectors close to the original distances, and the other which enforces the vectors of words which are adjacent in the prior knowledge graph to be close to each other in the new embedding space. We use the retrofitting package to compare our results in different settings against the results of retrofitting of Faruqui et al. (2015).
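The two-term objective can be illustrated with a minimal sketch of an iterative retrofitting update. It assumes equal weight on the original vector and on the average of the graph neighbours, and a fixed number of passes; these weights, the function name `retrofit` and the toy vectors are assumptions for illustration and are not taken from the retrofitting package itself.

```python
import numpy as np

def retrofit(vectors, graph, iterations=10):
    """Pull each vector toward its graph neighbours while staying near the original."""
    original = {w: np.asarray(v, dtype=float) for w, v in vectors.items()}
    new = {w: v.copy() for w, v in original.items()}
    for _ in range(iterations):
        for word, neighbours in graph.items():
            nbrs = [u for u in neighbours if u in new]
            if word not in new or not nbrs:
                continue  # words outside the graph or vocabulary stay unchanged
            # Assumed weights: 1 for the original vector, 1/degree per neighbour.
            neighbour_mean = sum(new[u] for u in nbrs) / len(nbrs)
            new[word] = (neighbour_mean + original[word]) / 2.0
    return new

# Hypothetical two-dimensional embeddings and synonym graph.
vectors = {"happy": [1.0, 0.0], "glad": [0.0, 1.0], "table": [5.0, 5.0]}
graph = {"happy": {"glad"}, "glad": {"happy"}}
fitted = retrofit(vectors, graph)
```

Words that do not appear in the graph, such as "table" here, keep their original vectors, which is one way to see why coverage of the prior knowledge resource matters for this kind of post-processing.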

Evaluation Benchmarks
We evaluated the quality of our eigenword embeddings with three different tasks: word similarity, geographic analogies and NP bracketing.
Word Similarity For the word similarity task we experimented with 11 different widely used benchmarks. The WS-353-ALL dataset (Finkelstein et al., 2002) consists of 353 pairs of English words with their human similarity ratings. Later, Agirre et al. (2009) re-annotated WS-353-ALL for similarity (WS-353-SIM) and relatedness (WS-353-REL) with specific distinctions between them. The SimLex-999 dataset (Hill et al., 2014) was built to measure how well models capture similarity, rather than relatedness or association. The MEN-TR-3000 dataset (Bruni et al., 2014) consists of 3000 word pairs sampled from words that occur at least 700 times in a large web corpus. The datasets MTurk-287 (Radinsky et al., 2011) and MTurk-771 (Halawi et al., 2012) were scored by Amazon Mechanical Turk workers for relatedness of English word pairs. The YP-130 (Yang and Powers, 2005) and Verb-143 (Baker et al., 2014) datasets were developed for verb similarity predictions. The last two datasets, MC-30 (Miller and Charles, 1991) and RG-65 (Rubenstein and Goodenough, 1965), consist of 30 and 65 noun pairs respectively.
For each dataset, we calculate the cosine similarity between the vectors of word pairs and measure Spearman's rank correlation coefficient between the scores produced by the embeddings and human ratings. We report the average of the correlations on all 11 datasets. Each word similarity task in the above list represents a different aspect of word similarity, and as such, averaging the results points to the quality of the word embeddings on several tasks. We later analyze specific datasets.
Geographic Analogies Mikolov et al. (2013c) created a test set of analogous word pairs such as a:b :: c:d, raising the analogy question of the form "a is to b as c is to ___" where d is unknown. We report on a subset of this dataset which focuses on finding capitals of common countries, e.g., Greece is to Athens as Iraq is to ___. This dataset consists of 506 word pairs. For a given pair of word pairs a:b :: c:d where d is unknown, we use the vector offset method (Mikolov et al., 2013b), i.e., we compute a vector $v = v_b - v_a + v_c$ where $v_a$, $v_b$ and $v_c$ are vector representations of the words a, b and c respectively, and return the word d with the greatest cosine similarity to v.
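The vector offset method can be sketched in a few lines. The embeddings below are hypothetical three-dimensional vectors chosen for illustration, and excluding the three query words from the candidate answers is a common convention that is assumed here rather than stated in the text.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def solve_analogy(emb, a, b, c):
    """Return the word d maximizing cos(v_d, v_b - v_a + v_c)."""
    target = emb[b] - emb[a] + emb[c]
    # Assumption: the query words themselves are not allowed as answers.
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

# Hypothetical toy embeddings.
emb = {
    "Greece": np.array([1.0, 0.0, 0.0]),
    "Athens": np.array([1.0, 1.0, 0.0]),
    "Iraq": np.array([0.0, 0.0, 1.0]),
    "Baghdad": np.array([0.0, 1.0, 1.0]),
    "banana": np.array([1.0, 0.0, 1.0]),
}
answer = solve_analogy(emb, "Greece", "Athens", "Iraq")
```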

NP Bracketing
Here the goal is to identify the correct bracketing of a three-word noun phrase (Lazaridou et al., 2013). For example, the bracketing of annual (price growth) is "right," while the bracketing of (entry level) machine is "left." Similar to Faruqui and Dyer (2015), we concatenate the word vectors of the three words, and use this vector for binary classification into left or right.
Since most of the datasets that we evaluate in this paper are not standardly separated into development and test sets, we report all results we calculated (with respect to hyperparameter differences) and do not select just a subset of the results.

Evaluation
Preliminary Experiments In our first set of experiments, we vary the dimension of the word embedding vectors. We try m ∈ {50, 100, 200, 300}. Our experiments showed that the results consistently improve when the dimension increases for all the different datasets. For example, for m = 50 and WIKI1, we get an average of 46.4 on the word similarity tasks, 50.1 for m = 100, 53.4 for m = 200 and 54.2 for m = 300. The more data are available, the more likely it is that a larger dimension will improve the quality of the word embeddings. Indeed, for WIKI5, we get an average of 49.4, 54.9, 57.0 and 59.5 for each of the dimensions. The improvements with respect to the dimension are consistent across all of our results, so we fix m at 300.
We also noticed a consistent improvement in accuracy when using more data from Wikipedia. For example, for m = 300, using WIKI1 gives an average of 54.1, while using WIKI2 gives an average of 54.9 and finally, using WIKI5 gives an average of 59.5. We fix the dataset we use to be WIKI5.

Results
Table 1 describes the results from our first set of experiments. In general, adding prior knowledge to eigenword embeddings does improve the quality of word vectors for the word similarity, geographic analogies and NP bracketing tasks on several occasions. For example, our eigenword vectors encoded with prior knowledge (CCAPrior) consistently perform better than the eigenword vectors that do not have any prior knowledge for the word similarity task (59.5, Eigen in the first row under the NPK column, versus block D). The only exceptions are for α = 0.1 with WordNet (59.1), for α = 0.7 with PPDB (59.3) and for α = 0.9 with FrameNet (58.9), where α denotes the smoothing factor.
In several cases, running the retrofitting algorithm of Faruqui et al. (2015) on top of our word embeddings helps further, as if "adding prior knowledge twice is better than once." Results for these word embeddings (CCAPrior+RF) are shown in Table 1. Adding retrofitting to our encoding of prior knowledge often performs better for word similarity and NP bracketing tasks (block D versus G and block F versus I). Interestingly, CCAPrior+RF embeddings also often perform better than eigenword vectors (Eigen) of Dhillon et al. (2015) when retrofitted using the method of Faruqui et al. (2015). For example, in the word similarity task, eigenwords retrofitted with WordNet get an accuracy of 62.2 whereas encoding prior knowledge using both CCA and retrofitting gets a maximum accuracy of 63.3. We see the same pattern for PPDB, with 63.6 for "Eigen" and 64.9 for "CCAPrior+RF". We hypothesize that the reason for these changes is that the two methods for encoding prior knowledge maximize different objective functions.
The performance with FrameNet is weaker, in some cases leading to worse performance (e.g., with Glove and SG vectors). We believe that FrameNet does not perform as well as the other lexicons because it groups words based on very abstract concepts; often words with seemingly distantly related meanings (e.g., push and growth) can evoke the same frame. This also supports the findings of Faruqui et al. (2015), who noticed that the use of FrameNet as a prior knowledge resource for improving the quality of word embeddings is not as helpful as other resources such as WordNet and PPDB.
We note that CCA works especially well for the geographic analogies dataset. The quality of eigenword embeddings (and the other embeddings) degrades when we encode prior knowledge using the method of Faruqui et al. (2015). Our method, in contrast, improves the quality of eigenword embeddings.

Global Picture of the Results
When comparing retrofitting to CCA with prior knowledge, there is a noticeable difference. Retrofitting performs well or badly, depending on the dataset, while the results with CCA are more stable. We attribute this to the difference between our algorithm and retrofitting. Retrofitting makes direct use of the source of prior knowledge, by adding a regularization term that enforces words which are similar according to the prior knowledge to be closer in the embedding space. Our algorithm, on the other hand, makes a more indirect use of the source of prior knowledge, by changing the co-occurrence matrix on which we do singular value decomposition.
Specifically, we believe that our algorithm is more robust in cases where the vocabulary needed for the relevant task is not covered by the source of prior knowledge. This is demonstrated with the geographic analogies task: in that case, retrofitting lowers the results in most cases.

Further Analysis
We further inspected the results on the word similarity tasks for the RG-65 and WS-353-ALL datasets. Our goal was to find cases in which either CCA embeddings by themselves outperform other types of embeddings or encoding prior knowledge into CCA the way we describe significantly improves the results.
For the WS-353-ALL dataset, the eigenword embeddings get a correlation of 69.6. The next best performing word embeddings are the multilingual word embeddings (68.0) and skip-gram (65.3). Interestingly enough, the multilingual word embeddings also use CCA to project words into a low-dimensional space using a linear transformation, suggesting that linear projections are a good fit for the WS-353-ALL dataset. The dataset itself includes pairs of common words with a corresponding similarity score. The words that appear in the dataset are actually expected to occur in similar contexts, a property that CCA directly encodes when deriving word embeddings.
The best performance on the RG-65 dataset is with the Glove word embeddings (76.6). CCA embeddings give a correlation of 69.7 on that dataset. However, with this dataset, we observe significant improvement when encoding prior knowledge using our method. For example, using WordNet with this dataset improves the results by 4.2 points (73.9). Using the method of Faruqui et al. (2015) (with WordNet) on top of our CCA word embeddings improves the results even further, by 8.7 points (78.4).

The Role of Prior Knowledge
We also designed an experiment to test whether using distributional information is necessary for having well-performing word embeddings, or whether it is sufficient to rely on the prior knowledge resource. In order to test this, we created a sparse matrix that corresponds to the graph derived from the external resource. We then performed singular value decomposition on that matrix, and obtained embeddings of size 300. Table 2 gives the results when using these embeddings. We see that the results are consistently lower than the results that appear in Table 1, implying that the use of prior knowledge comes hand in hand with the use of distributional information. When using the retrofitting method of Faruqui et al. (2015) on top of these word embeddings, the results barely improved.
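The construction of such graph-only embeddings can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for Table 2: the function name, the 0/1 edge weights, the dense matrix, and the square-root scaling of the singular values are all choices made for the example. For a vocabulary of realistic size, a sparse truncated solver such as `scipy.sparse.linalg.svds` would be the natural replacement for the dense decomposition.

```python
import numpy as np

def graph_only_embeddings(edges, vocab, dim):
    """Embed words using only a prior-knowledge graph (no corpus counts).

    edges is a list of (word, word) pairs from the lexicon, vocab is the
    list of words, and dim is the embedding size (300 in the experiment;
    smaller values are used here for toy inputs).
    """
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(vocab)))
    for u, v in edges:
        # Undirected graph: symmetric 0/1 adjacency (weights are an assumption).
        A[index[u], index[v]] = 1.0
        A[index[v], index[u]] = 1.0
    U, S, _ = np.linalg.svd(A)
    # Keep the top `dim` left singular vectors, scaled by sqrt of the
    # singular values (this scaling is an assumption).
    return U[:, :dim] * np.sqrt(S[:dim])

# Toy usage: "cat" and "dog" share the same single neighbor, "animal".
vocab = ["cat", "dog", "animal", "pet"]
edges = [("cat", "animal"), ("dog", "animal"), ("pet", "animal")]
emb = graph_only_embeddings(edges, vocab, dim=2)
```

Words with identical neighborhoods in the graph receive identical vectors in this sketch, which hints at why such embeddings carry less information than those that also use distributional counts.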

Related Work
Our ideas in this paper for encoding prior knowledge to eigenword embeddings relate to three main threads in existing literature.
One of the threads focuses on modifying the objective of word vector training algorithms. Yu and Dredze (2014), Xu et al. (2014), Fried and Duh (2014) and Bian et al. (2014) augment the training objective in the neural language models of Mikolov et al. (2013a) to encourage semantically related word vectors to come closer to each other. Wang et al. (2014) propose a method for jointly embedding entities (from FreeBase, a large community-curated knowledge base) and words (from Wikipedia) into the same continuous vector space. Chen and de Melo (2015) propose a similar joint model to improve the word embeddings, but rather than using structured knowledge sources, their model focuses on discovering stronger semantic connections in specific contexts in a text corpus.
Another research thread relies on post-processing steps to encode prior knowledge from semantic lexicons into off-the-shelf word embeddings. The main intuition behind this trend is to update word vectors by running belief propagation on a graph extracted from the relation information in semantic lexicons. The retrofitting approach of Faruqui et al. (2015) uses such techniques to obtain higher quality semantic vectors using WordNet, FrameNet, and the Paraphrase Database. They report on how retrofitting helps improve the performance of various off-the-shelf word vectors such as Glove, Skip-Gram, Global Context, and Multilingual, on various word similarity tasks. Rothe and Schütze (2015) also describe how standard word vectors can be extended to various data types in semantic lexicons, e.g., synsets and lexemes in WordNet.
Most of the standard word vector training algorithms use co-occurrence within window-based contexts to measure relatedness among words. Several studies question the limitations of defining relatedness in this way and investigate whether the word co-occurrence matrix can be constructed to encode prior knowledge directly, so as to improve the quality of word vectors. Wang et al. (2015) investigate the notion of relatedness in embedding models by incorporating syntactic and lexicographic knowledge. In spectral learning, Yih et al. (2012) augment the word co-occurrence matrix on which LSA operates with relational information such that synonyms will tend to have positive cosine similarity, and antonyms will tend to have negative similarities. Their vector space representation successfully projects synonyms and antonyms on opposite sides in the projected space. Chang et al. (2013) further generalize this approach to encode multiple relations (and not just opposing relations, such as synonyms and antonyms) using multi-relational LSA.
In spectral learning, most of the studies on incorporating prior knowledge into word vectors focus on LSA-based word embeddings (Yih et al., 2012; Chang et al., 2013; Turney and Littman, 2005; Turney, 2006; Turney and Pantel, 2010).
From the technical perspective, our work is also related to that of Jagarlamudi et al. (2011), who showed how to generalize CCA so that it uses locality preserving projections (He and Niyogi, 2004). They also assume the existence of a weight matrix in a multi-view setting that describes the distances between pairs of points in the two views.

Conclusion
We described a method for incorporating prior knowledge into CCA. Our method requires a relatively simple change to the original canonical correlation analysis, where extra counts are added to the matrix on which singular value decomposition is performed. We used our method to derive word embeddings in the style of eigenwords, and tested them on a set of datasets. Our results demonstrate several advantages in encoding prior knowledge into eigenword embeddings.

Figure 2: Introducing prior knowledge in CCA. $W \in \mathbb{R}^{n \times d}$ and $C \in \mathbb{R}^{n \times d'}$ denote the word and context views respectively. $L \in \mathbb{R}^{n \times n}$ is a Laplacian matrix encoded with the prior knowledge about the distances between the projections of $W$ and $C$.
an integer $m$, an $\alpha \in (0, 1]$, an undirected graph $G$ over $H$, an integer $N$.
Data structures: A matrix $M$ of size $|H| \times (2k|H|)$ (cross-covariance matrix), a matrix $U$ corresponding to the word embeddings.
Algorithm:
(Cross-covariance estimation) $\forall i, j \in [n]$ such that $|i - j| \le N$:
• If $i = j$, increase $M_{rs}$ by 1 for $r$ denoting the index of word $w^{(i)}$ and for all $s$ denoting the context indices of words w

Figure 3: The CCA-like algorithm that returns word embeddings with prior knowledge encoded based on a similarity graph.

Table 1: Word similarity datasets, geographic analogies and NP bracketing. The upper blocks (A-C) present the results with retrofitting. NPK stands for no prior knowledge (no retrofitting is used), WN for WordNet, PD for PPDB and FN for FrameNet. Glove, Skip-Gram, Global Context, Multilingual and Eigen are the word embeddings of Pennington et al. (2014), Mikolov et al. (2013b), Huang et al. (