Word Embeddings as Metric Recovery in Semantic Spaces

Continuous word representations have been remarkably useful across NLP tasks but remain poorly understood. We ground word embeddings in semantic spaces studied in the cognitive-psychometric literature, taking these spaces as the primary objects to recover. To this end, we relate log co-occurrences of words in large corpora to semantic similarity assessments and show that co-occurrences are indeed consistent with an Euclidean semantic space hypothesis. Framing word embedding as metric recovery of a semantic space unifies existing word embedding algorithms, ties them to manifold learning, and demonstrates that existing algorithms are consistent metric recovery methods given co-occurrence counts from random walks. Furthermore, we propose a simple, principled, direct metric recovery algorithm that performs on par with the state-of-the-art word embedding and manifold learning methods. Finally, we complement recent focus on analogies by constructing two new inductive reasoning datasets—series completion and classification—and demonstrate that word embeddings can be used to solve them as well.


Introduction
Continuous space models of words, objects, and signals have become ubiquitous tools for learning rich representations of data, from natural language processing to computer vision. Specifically, there has been particular interest in word embeddings, largely due to their intriguing semantic properties (Mikolov et al., 2013b) and their success as features for downstream natural language processing tasks, such as named entity recognition (Turian et al., 2010) and parsing (Socher et al., 2013).
The empirical success of word embeddings has prompted a parallel body of work that seeks to better understand their properties, associated estimation algorithms, and explore possible revisions. Recently, Levy and Goldberg (2014a) showed that linear linguistic regularities first observed with word2vec extend to other embedding methods. In particular, explicit representations of words in terms of cooccurrence counts can be used to solve analogies in the same way. In terms of algorithms, Levy and Goldberg (2014b) demonstrated that the global minimum of the skip-gram method with negative sampling of Mikolov et al. (2013b) implicitly factorizes a shifted version of the pointwise mutual information (PMI) matrix of word-context pairs. Arora et al. (2015) explored links between random walks and word embeddings, relating them to contextual (probability ratio) analogies, under specific (isotropic) assumptions about word vectors.
In this work, we take semantic spaces studied in the cognitive-psychometric literature as the prototypical objects that word embedding algorithms estimate. Semantic spaces are vector spaces over concepts where Euclidean distances between points are assumed to indicate semantic similarities. We link such semantic spaces to word co-occurrences through semantic similarity assessments, and demonstrate that the observed co-occurrence counts indeed possess statistical properties that are consistent with an underlying Euclidean space where distances are linked to semantic similarity.

Figure 1: Inductive reasoning in semantic space, as proposed by Sternberg and Gardner (1983). A, B, and C are given, I is the ideal point, and D are the choices. The correct answer is shaded green.
Formally, we view word embedding methods as performing metric recovery. This perspective is significantly different from current approaches. Instead of aiming for representations that exhibit specific semantic properties or that perform well at a particular task, we seek methods that recover the underlying metric of the hypothesized semantic space. The clearer foundation afforded by this perspective enables us to analyze word embedding algorithms in a principled task-independent fashion. In particular, we ask whether word embedding algorithms are able to recover the metric under specific scenarios. To this end, we unify existing word embedding algorithms as statistically consistent metric recovery methods under the theoretical assumption that cooccurrences arise from (metric) random walks over semantic spaces. The new setting also suggests a simple and direct recovery algorithm which we evaluate and compare against other embedding methods.
The main contributions of this work can be summarized as follows: • We ground word embeddings in semantic spaces via log co-occurrence counts. We show that PMI (pointwise mutual information) relates linearly to human similarity assessments, and that nearest-neighbor statistics (centrality and reciprocity) are consistent with an Euclidean space hypothesis (Sections 2 and 3).
• In contrast to prior work (Arora et al., 2015), we take metric recovery as the key object of study, unifying existing algorithms as consistent metric recovery methods based on co-occurrence counts from simple Markov random walks over graphs and manifolds. This strong link to manifold estimation opens a promising direction for extensions of word embedding methods (Sections 4 and 5).
• We propose and evaluate a new principled direct metric recovery algorithm that performs comparably to the existing state of the art on both word embedding and manifold learning tasks, and show that GloVe (Pennington et al., 2014) is closely related to the second-order Taylor expansion of our objective.
• We construct and make available two new inductive reasoning datasets, series completion and classification, to extend the evaluation of word representations beyond analogies, and demonstrate that these tasks can likewise be solved with vector operations on word embeddings (examples in Table 1).

Word vectors and semantic spaces
Most current word embedding algorithms build on the distributional hypothesis (Harris, 1954), where similar contexts imply similar meanings, so as to tie co-occurrences of words to their underlying meanings. The relationship between semantics and co-occurrences has also been studied in psychometrics and cognitive science (Rumelhart and Abrahamson, 1973; Sternberg and Gardner, 1983), often by means of free word association tasks and semantic spaces. The semantic spaces, in particular, provide a natural conceptual framework for continuous representations of words as vector spaces where semantically related words are close to each other. For example, the observation that word embeddings can solve analogies was already shown by Rumelhart and Abrahamson (1973) and Sternberg and Gardner (1983).
The underlying hypothesis is that an Euclidean semantic space is a valid representation of semantic concepts. There is substantial empirical evidence in favor of this hypothesis. For example, Rumelhart and Abrahamson (1973) showed experimentally that analogical problem solving with fictitious words and human mistake rates were consistent with an Euclidean space. Sternberg and Gardner (1983) provided further evidence supporting this hypothesis, proposing that general inductive reasoning was based upon operations in metric embeddings. Using the analogy, series completion, and classification tasks shown in Table 1 as testbeds, they proposed that subjects solve these problems by finding the word closest (in semantic space) to an ideal point: the vertex of a parallelogram for analogies, a displacement from the last word in series completion, and the centroid in the case of classification (Figure 1). We use semantic spaces as the prototypical structures that word embedding methods attempt to uncover, and we investigate the suitability of word co-occurrence counts for doing so. In the next section, we show that co-occurrences from large corpora indeed relate to semantic similarity assessments, and that the resulting metric is consistent with an Euclidean semantic space hypothesis.
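The three ideal-point rules can be made concrete with a short sketch in a toy semantic space. This is a minimal illustration with made-up 2-d vectors; the function names and example coordinates are ours, not from the original studies.

```python
import numpy as np

def ideal_point(task, vectors):
    """Ideal point for each task type (after Sternberg and Gardner, 1983)."""
    if task == "analogy":         # A:B :: C:? -> vertex of the parallelogram
        a, b, c = vectors
        return b - a + c
    if task == "series":          # A, B, ? -> displace last word by the common step
        a, b = vectors
        return b + (b - a)
    if task == "classification":  # which word belongs with A, B, C? -> centroid
        return np.mean(vectors, axis=0)

def solve(task, given, choices):
    ideal = ideal_point(task, np.asarray(given, dtype=float))
    dists = [np.linalg.norm(ideal - np.asarray(c, dtype=float)) for c in choices]
    return int(np.argmin(dists))  # closest choice in semantic space wins

# Toy 2-d semantic space: man=(0,0), woman=(1,0), king=(0,1), queen=(1,1)
man, woman, king, queen = [0, 0], [1, 0], [0, 1], [1, 1]
assert solve("analogy", [man, woman, king], [queen, man, woman]) == 0
assert solve("series", [[0, 0], [0, 1]], [[0, 2], [5, 5]]) == 0
assert solve("classification", [[0, 0], [2, 0], [1, 1]], [[1, 0], [9, 9]]) == 0
```

The only geometric ingredients are vector addition and nearest-neighbor search, which is what makes the hypothesis testable with any embedding.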
The semantic space of log co-occurrences
Most word embedding algorithms are based on word co-occurrence counts. In order for such methods to uncover an underlying Euclidean semantic space, we must demonstrate that co-occurrences themselves are indeed consistent with some semantic space. We must relate co-occurrences to semantic similarity assessments, on one hand, and show that they can be embedded into a Euclidean metric space, on the other. We provide here empirical evidence for both of these premises.
We commence by demonstrating in Figure 2 that the pointwise mutual information (Church and Hanks, 1990) evaluated from co-occurrence counts has a strong linear relationship with semantic similarity judgments from survey data (Pearson's r = 0.75). However, this suggestive linear relationship does not by itself demonstrate that log co-occurrences (with normalization) can be used to define an Euclidean metric space.
Earlier psychometric studies have asked whether human semantic similarity evaluations are consistent with an Euclidean space. For example, Tversky and Hutchinson (1986) investigate whether concept representations are consistent with the geometric sampling (GS) model: a generative model in which points are drawn independently from a continuous distribution in an Euclidean space. They use two nearest neighbor statistics to test agreement with this model, and conclude that certain hierarchical vocabularies are not consistent with an Euclidean embedding. Similar results are observed by Griffiths et al. (2007). We extend this embeddability analysis to lexical co-occurrences and show that semantic similarity estimates derived from these are mostly consistent with an Euclidean space hypothesis.
The first test statistic for the GS model, the centrality C, is defined as

C = (1/n) Σ_{i=1}^n N_i^2,

where N_i is the number of words that have word i as their nearest neighbor. Under the GS model (i.e. when the words are consistent with a Euclidean space representation), C ≤ 2 with high probability as the number of words n → ∞, regardless of the dimension or the underlying density (Tversky and Hutchinson, 1986); for metrically embeddable data, typical non-asymptotic values of C remain small (in our data, between 2 and 3). The second statistic, the reciprocity fraction R_f (Schwarz and Tversky, 1980; Tversky and Hutchinson, 1986), is defined as

R_f = (1/n) Σ_{i=1}^n 1[word i is the nearest neighbor of its own nearest neighbor]

and measures the fraction of words that are their nearest neighbor's nearest neighbor. Under the GS model, this fraction should be greater than 0.5. Table 2 shows the two statistics computed on three popular large corpora and a free word association dataset (see Section 6 for details). The nearest neighbor calculations are based on PMI. The results show surprisingly high agreement on both statistics for all corpora, with C and R_f contained in small intervals: C ∈ [2.21, 2.66] and R_f ∈ [0.62, 0.73]. These results are consistent with Euclidean semantic spaces and the GS model in particular. The largest violators of C and R_f are consistent with Tversky's analysis: the word with the largest centrality in the non-stopword Wikipedia corpus is 'the', whose inclusion would increase C to 3.46 compared to 2.21 without it. Tversky's original analysis of semantic similarities argued that certain words, such as superordinate and function words, could not be embedded. Despite such specific exceptions, we find that for an appropriately normalized corpus, the majority of words are consistent with the GS model, and can therefore be represented meaningfully as vectors in Euclidean space.
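Both statistics are simple to compute from any dissimilarity matrix. The sketch below is our own minimal implementation (not the original study's code); it checks the two GS-model predictions on points drawn i.i.d. in a Euclidean space, where both should hold.

```python
import numpy as np

def nn_statistics(D):
    """Centrality C and reciprocity fraction R_f from a pairwise
    dissimilarity matrix D, following Tversky and Hutchinson (1986)."""
    n = D.shape[0]
    D = D.astype(float).copy()
    np.fill_diagonal(D, np.inf)               # a word is not its own neighbor
    nn = D.argmin(axis=1)                     # nearest neighbor of each word
    N = np.bincount(nn, minlength=n)          # how often each word is an NN
    C = float((N ** 2).sum()) / n             # centrality: mean of N_i^2
    R_f = float((nn[nn] == np.arange(n)).mean())  # mutual-NN fraction
    return C, R_f

# Points drawn from the GS model (i.i.d. in Euclidean space)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
C, R_f = nn_statistics(D)
assert C < 3.0 and R_f > 0.5   # consistent with the GS predictions
```

In our experiments the same function is applied with D given by negative PMI rather than geometric distances.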
The results of this section are an important step towards justifying the use of word co-occurrence counts as the central object of interest for semantic vector representations of words. We have shown that they are empirically related to a human notion of semantic similarity and that they are metrically embeddable, a desirable condition if we expect word vectors derived from them to truly behave as elements of a metric space. (Both R_f and C are asymptotically dimension independent because they rely only on the single nearest neighbor; estimating the latent dimensionality requires other measures and assumptions (Kleindessner and von Luxburg, 2015).) This, however, does not yet fully justify their use to derive semantic representations. The missing piece is to formalize the connection between these co-occurrence counts and some intrinsic notion of semantics, such as the semantic spaces described in Section 2. In the next two sections, we establish this connection by framing word embedding algorithms that operate on co-occurrences as metric recovery methods.

Semantic spaces and manifolds
We take a broader, unified view on metric recovery of semantic spaces since the notion of semantic spaces and the associated parallelogram rule for analogical reasoning extend naturally to objects other than words. For example, images can be approximately viewed as points in an Euclidean semantic space by representing them in terms of their underlying degrees of freedom (e.g. orientation, illumination). Thus, questions about the underlying semantic spaces and how they can be recovered should be related.
The problem of recovering an intrinsic Euclidean coordinate system over objects has been specifically addressed in manifold learning. For example, methods such as Isomap (Tenenbaum et al., 2000) reconstitute an Euclidean space over objects (when possible) based on local comparisons. Intuitively, these methods assume that naive distance metrics, such as the L2 distance over pixels in an image, may be meaningful only when images are very similar. Longer distances between objects are evaluated through a series of local comparisons. These longer distances, the geodesic distances over the manifold, can be approximated by shortest paths on a neighborhood graph. If we view the geodesic distances on the manifold (represented as a graph) as semantic distances, then the goal is to isometrically embed these distances into an Euclidean space. Tenenbaum (1998) showed that such isometric embeddings of image manifolds can be used to solve "visual analogies" via the parallelogram rule.
Typical approaches to manifold learning as discussed above differ from word embedding in terms of how the semantic distances between objects are extracted. Word embeddings approximate semantic distances between words using the negative log co-occurrence counts (Section 3), while manifold learning approximates semantic distances using neighborhood graphs built from local comparisons of the original, high-dimensional points. Both views seek to estimate a latent geodesic distance. In order to study the problem of metric recovery from co-occurrence counts, and to formalize the connection between word embedding and manifold learning, we introduce a simple random walk model over the underlying objects (e.g. words or images). This toy model permits us to establish clean consistency results for recovery algorithms. We emphasize that while the random walk is introduced over the words, it is not intended as a model of language but rather as a tool to understand the recovery problem.

Random walk model
Consider now a simple metric random walk X_t over words, where the probability of transitioning from word i to word j is given by

P(X_{t+1} = j | X_t = i) ∝ h(||x_i − x_j||_2^2 / σ).   (1)

Here ||x_i − x_j||_2^2 is the squared Euclidean distance between words in the underlying semantic space to be recovered, and h is some unknown, sub-Gaussian function linking semantic similarity to co-occurrence. Under this model, the log frequency of occurrences of word j immediately after word i will be proportional to log h(||x_i − x_j||_2^2 / σ) as the corpus size grows large. Here we make the surprising observation that if we consider co-occurrences over a sufficiently large window, the log co-occurrence instead converges to −||x_i − x_j||_2^2 / σ, i.e. it directly relates to the underlying metric. Intuitively, this result is an analog of the central limit theorem for random walks. Note that, for this reason, we do not need to know the link function h.
Formally, given an m-token corpus consisting of sentences generated according to Equation 1 from a vocabulary of size n, let C_ij^{m,n}(t_n) be the number of times word j occurs t_n steps after word i in the corpus. We can show that there exist unigram normalizers a_i^{m,n}, b_j^{m,n} such that the following holds:

Lemma 1. Given a corpus generated by Equation 1, there exist a_i and b_j such that, simultaneously over all i, j,

log C_ij^{m,n}(t_n) − a_i − b_j → −||x_i − x_j||_2^2 / σ.

We defer the precise statement and conditions of Lemma 1 to Corollary 6. Conceptually, this limiting result captures the intuition that while one-step transitions in a sentence may be complex and include non-metric structure expressed in h, co-occurrences over large windows relate directly to the latent semantic metric. For ease of notation, we henceforth omit the corpus and vocabulary size descriptors m, n (using C_ij, a_i, and b_j in place of C_ij^{m,n}(t_n), a_i^{m,n}, and b_j^{m,n}), since in practice the corpus is large but fixed.
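The window-size intuition behind Lemma 1 can be checked numerically on a one-dimensional toy walk. In the self-contained sketch below (our own illustration, not the paper's experiment), the one-step kernel h is a crude indicator that carries no metric detail beyond "nearby", yet the multi-step log-transition probabilities become approximately quadratic in distance.

```python
import numpy as np

n, t = 41, 8
x = np.arange(n)
# One-step kernel: h is a plain indicator (ball of radius 2), so a single
# step only says "nearby", nothing about the metric's shape.
A = (np.abs(x[:, None] - x[None, :]) <= 2).astype(float)
np.fill_diagonal(A, 0.0)
P = A / A.sum(axis=1, keepdims=True)       # one-step transition matrix
Pt = np.linalg.matrix_power(P, t)          # t-step transition probabilities

i = n // 2
js = np.arange(i - 10, i + 11)
js = js[js != i]
logp = np.log(Pt[i, js])
sqdist = (js - i) ** 2.0
# After many steps, -log P^t(i, j) is approximately proportional to the
# squared distance, independently of the link function h (cf. Lemma 1).
r = np.corrcoef(-logp, sqdist)[0, 1]
assert r > 0.9
```

Replacing the indicator with any other sub-Gaussian kernel leaves the long-window behavior essentially unchanged, which is the point of the lemma.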
Lemma 1 serves as the basis for establishing consistency of recovery for word embedding algorithms (next section). It also allows us to establish a precise link between manifold learning and word embedding, which we describe in the remainder of this section.

Connection to manifold learning
Let {v_1, ..., v_n} ∈ R^D be points drawn i.i.d. from a density p, where D is the dimension of observed inputs (e.g. number of pixels, in the case of images), and suppose that these points lie on a manifold M ⊂ R^D that is isometrically embeddable into d < D dimensions, where d is the intrinsic dimensionality of the data (e.g. coordinates representing illumination or camera angle in the case of images). The problem of manifold learning consists of finding an embedding of v_1, ..., v_n into R^d that preserves the structure of M by approximately preserving the distances between points along this manifold. In light of Lemma 1, this problem can be solved with word embedding algorithms in the following way:
1. Construct a neighborhood graph (e.g. connecting points within a distance ε) over {v_1, ..., v_n}.
2. Record the vertex sequence of a simple random walk over this graph as a sentence, and concatenate sequences initialized at different nodes into a corpus.
3. Use a word embedding method on this corpus to generate d-dimensional vector representations of the data.
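The first two steps of this reduction can be sketched in a few lines. This is a minimal version using a uniform random walk on a k-nearest-neighbor graph; the function name and parameter choices are illustrative, not those of any released implementation.

```python
import numpy as np

def random_walk_corpus(X, k=10, walks_per_node=5, walk_len=40, seed=0):
    """Steps 1-2: build a k-NN graph over the points X and record simple
    random walks as 'sentences' whose tokens are vertex indices."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    nbrs = np.argsort(D, axis=1)[:, :k]      # k nearest neighbors of each point
    corpus = []
    for start in range(len(X)):
        for _ in range(walks_per_node):
            v, sent = start, [start]
            for _ in range(walk_len - 1):
                v = rng.choice(nbrs[v])      # uniform step on the graph
                sent.append(v)
            corpus.append(sent)
    return corpus

X = np.random.default_rng(1).normal(size=(100, 3))
corpus = random_walk_corpus(X)
assert len(corpus) == 100 * 5 and len(corpus[0]) == 40
# Step 3 would feed these index 'sentences' to any word embedding method.
```

Any off-the-shelf embedding method can then treat the vertex indices exactly like word tokens.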
Under the conditions of Lemma 1, the negative log co-occurrences over the vertices of the neighborhood graph will converge, as n → ∞, to the squared geodesic distance over the manifold. In this case we will show that the globally optimal solutions of word embedding algorithms recover the low-dimensional embedding (Section 5).

Recovering semantic distances with word embeddings
We now show that, under the conditions of Lemma 1, three popular word embedding methods can be viewed as doing metric recovery from co-occurrence counts. We use this observation to derive a new, simple word embedding method inspired by Lemma 1.

Word embeddings as metric recovery
GloVe The Global Vectors (GloVe) (Pennington et al., 2014) method for word embedding optimizes the objective function

min_{x,c,a,b} Σ_{ij} f(C_ij) (⟨x_i, c_j⟩ + a_i + b_j − log C_ij)^2

with f(C_ij) = min(C_ij, 100)^{3/4}. If we rewrite the bias terms as ã_i = a_i + ||x_i||_2^2 / 2 and b̃_j = b_j + ||c_j||_2^2 / 2, we obtain the equivalent representation

min_{x,c,ã,b̃} Σ_{ij} f(C_ij) (−||x_i − c_j||_2^2 / 2 + ã_i + b̃_j − log C_ij)^2.

Together with Lemma 1, we recognize this as a weighted multidimensional scaling (MDS) objective with weights f(C_ij). Splitting the word vector x_i and the context vector c_i is helpful in practice, but not necessary under the assumptions of Lemma 1, since the rescaled embedding x̂_i = ĉ_i = √(2/σ) x_i together with ã_i = b̃_i = 0 is a global minimum whenever dim(x̂) = d. In other words, GloVe can recover the true metric provided that we set d correctly. (As an aside, this approach of applying random walks and word embeddings to general graphs has already been shown to be surprisingly effective for social networks (Perozzi et al., 2014), and demonstrates that word embeddings serve as a general way to connect metric random walks to embeddings.)

word2vec The skip-gram model of word2vec approximates a softmax objective:

max_{x,c} Σ_{ij} C_ij log [ exp(⟨x_i, c_j⟩) / Σ_{k=1}^n exp(⟨x_i, c_k⟩) ].
Without loss of generality, we can rewrite the above with a bias term b_j by making dim(x̂) = d + 1 and setting one of the dimensions of x̂ to 1; the corresponding coordinate of ĉ_j then acts as the bias. Since according to Lemma 1 C_ij / Σ_{k=1}^n C_ik approaches exp(−||x_i − x_j||_2^2/σ + b_j) / Σ_k exp(−||x_i − x_k||_2^2/σ + b_k), this is the stochastic neighbor embedding (SNE) (Hinton and Roweis, 2002) objective weighted by Σ_{k=1}^n C_ik. The global optimum is achieved by x̂_i = ĉ_i = √(2/σ) x_i and b_j = 0 (see Theorem 8). The negative sampling approximation used in practice behaves much like the SVD approach of Levy and Goldberg (2014b), and by applying the same stationary point analysis as they do, we show that in the absence of a bias term the true embedding is a global minimum under the additional assumption that (2/σ^2) ||x_i||_2^2 = log(Σ_j C_ij / √(Σ_{ij} C_ij)).
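The bias reparameterization behind these rewrites is a pointwise algebraic identity: absorbing half the squared norms into the biases turns the inner product into a negative squared distance. A quick numerical check, under the factor-of-1/2 sign convention assumed in our reading of the objective:

```python
import numpy as np

rng = np.random.default_rng(0)
x, c = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
a, b = rng.normal(size=4), rng.normal(size=4)

inner = x @ c.T + a[:, None] + b[None, :]          # <x_i, c_j> + a_i + b_j
a_t = a + 0.5 * (x ** 2).sum(axis=1)               # a~_i = a_i + ||x_i||^2 / 2
b_t = b + 0.5 * (c ** 2).sum(axis=1)               # b~_j = b_j + ||c_j||^2 / 2
sq = ((x[:, None] - c[None, :]) ** 2).sum(axis=-1)
mds = -0.5 * sq + a_t[:, None] + b_t[None, :]      # -||x_i - c_j||^2/2 + a~_i + b~_j

assert np.allclose(inner, mds)   # the two parameterizations are identical
```

Because the identity holds entrywise, any objective built from ⟨x_i, c_j⟩ + a_i + b_j can equivalently be read as a distance-based objective, which is what licenses the MDS and SNE interpretations above.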
SVD The method of Levy and Goldberg (2014b) uses the log PMI matrix, defined in terms of the unigram frequencies C_i = Σ_j C_ij as

M_ij = log C_ij − log C_i − log C_j + log Σ_{kl} C_kl,

and computes the SVD of the shifted and truncated matrix (M_ij + τ)_+, where τ is a truncation parameter that keeps the entries finite. Under the limit of Lemma 1, the corpus is sufficiently large that no truncation is necessary (i.e. τ = −min(M_ij) < ∞). We recover the underlying embedding if we additionally assume (1/σ) ||x_i||_2^2 = log(C_i / √(Σ_j C_j)), via the law of large numbers, since then M_ij → (2/σ) ⟨x_i, x_j⟩ (see Theorem 7). Centering the matrix M_ij before obtaining the SVD would relax the norm assumption, resulting exactly in classical MDS (Sibson, 1979).
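The shifted-PMI construction is compact enough to sketch directly. The version below is our own minimal rendering (with a symmetrization step added for numerical convenience), not the reference implementation of Levy and Goldberg.

```python
import numpy as np

def svd_embedding(C, d, tau=None):
    """Embed from a co-occurrence matrix C via the shifted, truncated
    log-PMI matrix, in the spirit of Levy and Goldberg (2014b)."""
    total = C.sum()
    Ci = C.sum(axis=1, keepdims=True)            # unigram counts (rows)
    Cj = C.sum(axis=0, keepdims=True)            # unigram counts (columns)
    with np.errstate(divide="ignore"):
        M = np.log(C * total / (Ci * Cj))        # pointwise mutual information
    if tau is None:
        tau = -M[np.isfinite(M)].min()           # shift so all entries are finite
    M = np.maximum(M + tau, 0.0)                 # shift and truncate
    U, s, _ = np.linalg.svd((M + M.T) / 2)       # symmetrize before the SVD
    return U[:, :d] * np.sqrt(s[:d])

C = np.array([[10., 5., 1.], [5., 8., 2.], [1., 2., 6.]])
E = svd_embedding(C, d=2)
assert E.shape == (3, 2)
```

Subtracting the row and column means of M before the SVD (double centering) gives the classical-MDS variant mentioned above.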

Metric regression from log co-occurrences
We have shown that by simple reparameterizations and use of Lemma 1, existing embedding algorithms can be interpreted as consistent metric recovery methods. However, the same Lemma suggests a more direct regression method for recovering the latent coordinates, which we propose here. This new embedding algorithm serves as a litmus test for our metric recovery paradigm.
Lemma 1 describes a log-linear relationship between distance and co-occurrences. The canonical way to fit this relationship is a generalized linear model in which the co-occurrences follow a negative binomial distribution, C_ij ∼ NegBin(θ, p_ij), parameterized so that the expected count is

E[C_ij] = λ_ij = exp(−(1/2) ||x_i − x_j||_2^2 + a_i + b_j).

Under this overdispersed log-linear model, the parameter θ controls the contribution of large C_ij, and is akin to GloVe's f(C_ij) weight function. Fitting the model is straightforward: in terms of the expected rate λ_ij, the log-likelihood is, up to constants,

L(x, a, b) = Σ_{ij} [ C_ij log λ_ij − (C_ij + θ) log(λ_ij + θ) ].

To generate word embeddings, we minimize the negative log-likelihood using stochastic gradient descent. The implementation mirrors that of GloVe and randomly selects word pairs i, j and attracts or repulses the vectors x̂_i and ĉ_j in order to achieve the relationship in Lemma 1. Implementation details are provided in Appendix C.
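The likelihood is easy to write down directly. Below is a minimal numpy sketch (ours, not the paper's optimized implementation); constants depending only on C_ij are dropped, and the check picks biases so that the true configuration attains the per-entry optimum λ_ij = C_ij.

```python
import numpy as np

def nb_loglik(X, a, b, C, theta=1.0):
    """Negative-binomial log-likelihood (up to C-only constants) with
    rate lambda_ij = exp(-0.5 ||x_i - x_j||^2 + a_i + b_j)."""
    sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
    log_lam = -0.5 * sq + a[:, None] + b[None, :]
    lam = np.exp(log_lam)
    return float(np.sum(C * (log_lam - np.log(lam + theta))
                        + theta * (np.log(theta) - np.log(lam + theta))))

rng = np.random.default_rng(0)
X_true = rng.normal(size=(5, 2))
lam_true = np.exp(-0.5 * ((X_true[:, None] - X_true[None, :]) ** 2).sum(-1))
C = 50.0 * lam_true                        # counts proportional to the true rates
a = b = np.full(5, 0.5 * np.log(50.0))     # biases absorbing the factor of 50
# With these biases lambda_ij = C_ij exactly, the per-entry maximizer, so the
# true coordinates score at least as high as any other configuration.
assert nb_loglik(X_true, a, b, C) >= nb_loglik(rng.normal(size=(5, 2)), a, b, C)
```

A practical fitter would replace the full sums with stochastic gradients over sampled pairs, as described above.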
Relationship to GloVe The overdispersion parameter θ in our metric regression model sheds light on the role of GloVe's weight function f(C_ij). Taking the second-order Taylor expansion of the log-likelihood around λ_ij = C_ij gives

L(x, a, b) ≈ Σ_{ij} [ k_ij − (C_ij θ / (2(C_ij + θ))) u_ij^2 ],

where u_ij = log λ_ij − log C_ij and k_ij does not depend on x. Note the similarity of the second-order term with the GloVe objective. As C_ij grows, the weight functions C_ij θ / (2(C_ij + θ)) and f(C_ij) = min(C_ij, x_max)^{3/4} converge to θ/2 and x_max^{3/4} respectively, down-weighting large co-occurrences.
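The saturation behavior of the two weight functions can be verified numerically (a small self-contained check; the constants θ = 10 and x_max = 100 are arbitrary illustrative choices):

```python
import numpy as np

theta, x_max = 10.0, 100.0
C = np.array([1.0, 10.0, 1e3, 1e6])
w_taylor = C * theta / (2 * (C + theta))   # second-order Taylor weight
w_glove = np.minimum(C, x_max) ** 0.75     # GloVe-style weight

# Both weights saturate for large counts instead of growing without bound.
assert abs(w_taylor[-1] - theta / 2) < 1e-3
assert w_glove[-1] == w_glove[-2] == x_max ** 0.75
```

In both models, the effect is that very frequent pairs do not dominate the fit.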

Empirical validation
We will now experimentally validate two aspects of our theory: the semantic space hypothesis (Sections 2 and 3), and the correspondence between word embedding and manifold learning (Sections 4 and 5). Our goal with this empirical validation is not to find the absolute best method and evaluation metric for word embeddings, which has been studied before (e.g. Levy et al. (2015)). Instead, we provide empirical evidence in favor of the semantic space hypothesis, and show that our simple algorithm for metric recovery is competitive with the state of the art on both semantic induction tasks and manifold learning. Since metric regression naturally operates over integer co-occurrences, we use co-occurrences over unweighted windows for this method and, for fairness, for the other methods (see Appendix C for details).

Datasets
Corpus and training: We used three different corpora for training: a Wikipedia snapshot of 03/2015 (2.4B tokens), the original word2vec corpus (Mikolov et al., 2013a) (6.4B tokens), and a combination of Wikipedia with Gigaword5 emulating GloVe's corpus (Pennington et al., 2014) (5.8B tokens). We preprocessed all corpora by removing punctuation and numbers and lower-casing all the text. The vocabulary was restricted to the 100K most frequent words in each corpus. We trained embeddings using four methods: word2vec, GloVe, randomized SVD, and metric regression (referred to as regression). Full implementation details are provided in the Appendix. For open-vocabulary tasks, we restrict the set of answers to the top 30K words, since this improves performance while covering the majority of the questions. In the following, we show performance for the GloVe corpus throughout, but include results for all corpora along with our code package.

Evaluation tasks:
We test the quality of the word embeddings on three types of inductive tasks: analogies, sequence completion, and classification (Figure 1). For the analogies, we used the standard open-vocabulary analogy task of Mikolov et al. (2013a) (henceforth denoted Google), consisting of 19,544 semantic and syntactic questions. In addition, we use the more difficult SAT analogy dataset (version 3) (Turney and Littman, 2005), which contains 374 questions from actual exams and guidebooks. Each question consists of 5 exemplar pairs of words word1:word2, where the same relation holds for all pairs. The task is to pick from among another five pairs of words the one that best fits the category implicitly defined by the exemplars.
Inspired by Sternberg and Gardner (1983), we propose two new difficult inductive reasoning tasks beyond analogies to verify the semantic space hypothesis: sequence completion and classification. As described in Section 2, the former involves choosing the next step in a semantically coherent sequence of words (e.g. hour, minute, . . .), and the latter consists of selecting an element within the same category out of five possible choices. Given the lack of publicly available datasets, we generated our own questions using WordNet (Fellbaum, 1998) relations and word-word PMI values. These datasets were constructed before training the embeddings, so as to avoid biasing them towards any one method.
For the classification task, we created in-category words by selecting words from WordNet relations associated to root words, from which we pruned to four words based on PMI-similarity to the other words in the class. Additional options for the multiple choice questions were created searching over words related to the root by a different relation type, and selecting those most similar to the root. For the sequence completion task, we obtained WordNet trees of various relation types, and pruned these based on similarity to the root word to obtain the sequence. For the multiple-choice questions, we proceeded as before to select additional (incorrect) options of a different relation type to the root.
After pruning, we obtain 215 classification questions and 220 sequence completion questions, of which 51 are open-vocabulary and 169 are multiple choice. Both new datasets are publicly available.

Results on inductive reasoning tasks
Solving analogies using survey data alone: We demonstrate that, surprisingly, word embeddings trained directly on semantic similarity derived from survey data can solve analogy tasks. Extending a study by Rumelhart and Abrahamson (1973), we use a free-association dataset (Nelson et al., 2004) to construct a similarity graph, where vertices correspond to words and the weights w_ij are given by the number of times word j was considered most similar to word i in the survey. We take the largest connected component of this graph (consisting of 4845 words and 61570 weights) and embed it using Isomap, for which squared edge distances are defined as −log(w_ij / max_kl(w_kl)). We use the resulting vectors as word embeddings to solve the Google analogy task. The results in Table 4 show that embeddings obtained with Isomap on survey data can outperform the corpus-based metric regression vectors on semantic, but not syntactic tasks. We hypothesize that free-association surveys capture semantic, but not syntactic similarity between words.

Figure 3: Dimensionality reduction using word embedding and manifold learning. Performance is quantified by the percentage of 5-nearest neighbors sharing the same digit label.
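The survey-graph embedding pipeline (log-similarity edge lengths, shortest-path geodesics, then classical MDS, i.e. an Isomap-style procedure) fits in a few lines. The tiny 4-node graph and the small offset in the normalizer below are our own illustrative choices, not the actual free-association data.

```python
import numpy as np

def embed_similarity_graph(W, d=2):
    """Isomap-style embedding: squared edge lengths -log(w_ij / w_max),
    geodesics via Floyd-Warshall, then classical MDS."""
    n = W.shape[0]
    L = np.full((n, n), np.inf)
    pos = W > 0
    # A small offset keeps edge lengths strictly positive at the max weight.
    L[pos] = np.sqrt(-np.log(W[pos] / (1.01 * W.max())))
    np.fill_diagonal(L, 0.0)
    for k in range(n):                      # Floyd-Warshall shortest paths
        L = np.minimum(L, L[:, k][:, None] + L[k, :][None, :])
    G = L ** 2                              # squared geodesic distances
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ G @ J                    # double centering (classical MDS)
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:d]
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# Tiny symmetric free-association graph over four 'words'
W = np.array([[0, 9, 3, 0],
              [9, 0, 9, 1],
              [3, 9, 0, 9],
              [0, 1, 9, 0]], float)
E = embed_similarity_graph(W)
# Strongly associated words end up closer than weakly associated ones.
assert np.linalg.norm(E[0] - E[1]) < np.linalg.norm(E[0] - E[2])
```

On the real survey graph, a sparse-graph shortest-path routine would replace the dense Floyd-Warshall loop.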
Analogies: The results on the Google analogies shown in Table 3 demonstrate that our proposed framework of metric regression with L2 distance is competitive with the baseline of word2vec with cosine distance. The performance gap across methods is small and fluctuates across corpora, but metric regression consistently outperforms GloVe on most tasks and outperforms all methods on semantic analogies, while word2vec does better on syntactic categories. For the SAT dataset, the L2 distance performs better than the cosine similarity, and we find word2vec to perform best, followed by metric regression. The results on these two analogy datasets show that directly embedding the log co-occurrence metric and taking L2 distances between vectors is competitive with current approaches to analogical reasoning.

Sequence and classification tasks: As predicted by the semantic space hypothesis, word embeddings perform well on the two novel inductive reasoning tasks (Table 3). Again, we observe that metric regression coupled with the L2 distance consistently performs as well as, and often better than, the current state-of-the-art word embedding methods on these two additional semantic datasets.

Word embeddings can embed manifolds
In Section 4 we proposed a reduction for solving manifold learning problems with word embeddings, which we now show achieves performance comparable to dedicated manifold learning methods. We test this relation by performing nonlinear dimensionality reduction on the MNIST digit dataset, reducing from D = 256 to two dimensions. Using a four-thousand image subset, we construct a k-nearest neighbor graph (k = 20) and generate 10 simple random walks of length 200 starting from each vertex in the graph, resulting in 40,000 sentences of length 200 each. We compare the four word embedding methods against standard dimensionality reduction methods: PCA, Isomap, SNE, and t-SNE. We evaluate the methods by clustering the resulting low-dimensional data and computing cluster purity, measured using the percentage of 5-nearest neighbors having the same digit label. The resulting embeddings, shown in Fig. 3, demonstrate that metric regression is highly effective at this task, outperforming metric SNE and beaten only by t-SNE (91% cluster purity), which is a visualization method specifically designed to preserve cluster separation. All word embedding methods, including SVD (68%), embed the MNIST digits remarkably well and outperform the baselines of PCA (48%) and Isomap (49%).

Discussion
Our work recasts word embedding as a metric recovery problem pertaining to the underlying semantic space. We use co-occurrence counts from random walks as a theoretical tool to demonstrate that existing word embedding algorithms are consistent metric recovery methods. Our direct regression method is competitive with the state of the art on various semantic tasks, including the two new inductive reasoning problems of series completion and classification.
Our framework highlights the strong interplay and common foundation between word embedding methods and manifold learning, suggesting several avenues for recovering vector representations of phrases and sentences via properly defined Markov processes and their generalizations.

Appendix A Metric recovery from Markov processes on graphs and manifolds
Consider an infinite sequence of points X_n = {x_1, ..., x_n}, where the x_i are sampled i.i.d. from a density p(x) over a compact Riemannian manifold equipped with a geodesic metric ρ. For our purposes, p(x) should have a bounded log-gradient and a strict lower bound p_0 over the manifold. The random walks we consider are over unweighted spatial graphs, defined as follows.

Definition 2 (Spatial graph). Let σ_n : X_n → R_{>0} be a local scale function and h : R_{≥0} → [0, 1] a piecewise continuous function with sub-Gaussian tails. A spatial graph G_n corresponding to σ_n and h is a random graph with vertex set X_n and a directed edge from x_i to x_j with probability h(||x_i − x_j||_2^2 / σ_n(x_i)).

Simple examples of spatial graphs where the connectivity is not random include the ε-ball graph (σ_n(x) = ε) and the k-nearest-neighbor graph (σ_n(x) = distance to the k-th neighbor).
Log co-occurrences and the geodesic will be connected in two steps: (1) we use known results to show that a simple random walk over the spatial graph, properly scaled, behaves similarly to a diffusion process; (2) we relate the log-transition probability of that diffusion process to the geodesic metric on the manifold.
(1) The limiting random walk on a graph: Just as the simple random walk over the integers converges to Brownian motion, we may expect that, under suitable conditions, the simple random walk $X^n_t$ over the graph $G_n$ converges to a well-defined continuous process. We require that the scale functions converge to a continuous function $\tilde\sigma$ ($\sigma_n(x) g_n^{-1} \xrightarrow{a.s.} \tilde\sigma(x)$), and that the size of a single step vanishes ($g_n \to 0$) while each neighborhood $\sigma_n(x)$ contains at least a polynomial number of points ($g_n n^{1/(d+2)} \log(n)^{-1/(d+2)} \to \infty$). Under this limit, our assumptions about the density $p(x)$, and regularity of the transitions,^10 the following holds:

Theorem 3 ((Hashimoto et al., 2015b; Ting et al., 2011)). The simple random walk $X^n_t$ on $G_n$ converges in Skorokhod space $D([0, \infty), D)$, after the time scaling $t \mapsto \lfloor t g_n^{-2} \rfloor$, to an Itô process $Y_t$ valued in $C([0, \infty), D)$: $X^n_{\lfloor t g_n^{-2} \rfloor} \to Y_t$. The process $Y_t$ is defined over the normal coordinates of the manifold $(D, g)$ with reflecting boundary conditions on $D$.

^10 For $t = \Theta(g_n^{-2})$, the marginal distribution $n P(X_t \mid X_0)$ must be a.s. uniformly equicontinuous. For undirected spatial graphs this always holds (Croydon and Hambly, 2008); for directed graphs it is an open conjecture of Hashimoto et al. (2015b).

The equicontinuity constraint on the marginal densities of the random walk implies that the transition density of the random walk converges to its continuum limit.
Lemma 4 (Convergence of marginal densities (Hashimoto et al., 2015a)). Let $x_0$ be some point in our domain $X_n$ and define the marginal densities $q_t(x) = P(Y_t = x \mid Y_0 = x_0)$ and $q_{t_n}(x) = P(X^n_{t_n} = x \mid X^n_0 = x_0)$. If $t_n g_n^2 = t = \Theta(1)$, then under the equicontinuity condition above and the conclusion of Theorem 3 ($X^n_t \to Y_t$ weakly), the rescaled marginals converge uniformly: $n\, p(x)\, q_{t_n}(x) \to q_t(x)$.

(2) Log-transition probability as a metric: We may now use the stochastic process $Y_t$ to connect the log-transition probability to the geodesic distance via Varadhan's large deviation formula.
Theorem 5 ((Varadhan, 1967; Molchanov, 1975)). Let $Y_t$ be an Itô process defined over a complete Riemannian manifold $(D, g)$ with geodesic distance $\rho$. Then
$$\lim_{t \to 0} -t \log P(Y_t = y \mid Y_0 = x) = \tfrac{1}{4}\rho(x, y)^2.$$
This estimate holds more generally for any space admitting a diffusive stochastic process (Saloff-Coste, 2010). Taken together, we finally obtain:

Corollary 6 (Varadhan's formula on graphs). For any $\delta, \gamma, n_0$ there exist some $t$, $n > n_0$, and a sequence $b^n_j$ such that the following holds for the simple random walk $X^n_t$:
$$P\Big(\sup_{i,j} \big| -t\big(\log P(X^n_{\lfloor t g_n^{-2} \rfloor} = x_j \mid X^n_0 = x_i) + b^n_j\big) - \tfrac{1}{4}\rho_{\tilde\sigma}(x_i, x_j)^2 \big| > \delta\Big) < \gamma,$$
where $\rho_{\tilde\sigma}$ is the geodesic defined with the metric locally rescaled by $\tilde\sigma$:
$$\rho_{\tilde\sigma}(x, y) = \inf_{c:\, c(0)=x,\, c(1)=y} \int_0^1 \frac{\|c'(s)\|_g}{\tilde\sigma(c(s))}\, ds.$$

Proof. The proof is in two parts. First, by Varadhan's formula (Theorem 5; (Molchanov, 1975, Eq. 1.7)), for any $\delta_1 > 0$ there exists some $t$ such that
$$\sup_{x, y \in D} \big| -t \log P(Y_t = y \mid Y_0 = x) - \tfrac{1}{4}\rho_{\tilde\sigma}(x, y)^2 \big| < \delta_1.$$
The uniform equicontinuity of the marginals implies their uniform convergence (Lemma 4), so for any $\delta_2 > 0$ and $\gamma_0$ there exists an $n$ such that
$$P\Big(\sup_{i,j} \big| n\, p(x_j)\, P(X^n_{\lfloor t g_n^{-2} \rfloor} = x_j \mid X^n_0 = x_i) - P(Y_t = x_j \mid Y_0 = x_i) \big| > \delta_2\Big) < \gamma_0.$$
By the lower bound on $p$ and compactness of $D$, $P(Y_t \mid Y_0)$ is lower bounded by some strictly positive constant $c$, and we can apply uniform continuity of $\log(x)$ over $(c, \infty)$ to get that, for some $\delta_3$ and $\gamma$,
$$P\Big(\sup_{i,j} \big| \log\big(n\, p(x_j)\, P(X^n_{\lfloor t g_n^{-2} \rfloor} = x_j \mid X^n_0 = x_i)\big) - \log P(Y_t = x_j \mid Y_0 = x_i) \big| > \delta_3\Big) < \gamma. \quad (3)$$
To combine the bounds, given some $\delta$ and $\gamma$, set $b^n_j = \log(n p(x_j))$, pick $t$ such that $\delta_1 < \delta/2$, then pick $n$ such that the bound in Eq. 3 holds with probability $\gamma$ and error $\delta_3 < \delta/(2t)$.
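Varadhan's formula can be checked by hand in the simplest setting, Brownian motion on the real line, where the transition density is the explicit heat kernel $p_t(x, y) = (4\pi t)^{-1/2} \exp(-(x-y)^2/4t)$ and $-t \log p_t(x, y) \to (x-y)^2/4$ as $t \to 0$. This toy check is ours and uses no graph structure:

```python
import math

def gauss_log_transition(x, y, t):
    # Heat kernel on R for the diffusion dY = sqrt(2) dW (generator = Laplacian):
    # log p_t(x, y) = -(x - y)^2 / (4t) - (1/2) log(4*pi*t)
    return -((x - y) ** 2) / (4 * t) - 0.5 * math.log(4 * math.pi * t)

def varadhan_estimate(x, y, t):
    # -t * log p_t(x, y), which tends to rho(x, y)^2 / 4 as t -> 0
    return -t * gauss_log_transition(x, y, t)
```

For x = 0, y = 2 the limit is 2²/4 = 1, and the estimate approaches it as t shrinks; the vanishing correction comes from the normalizing constant of the kernel.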

Appendix B Consistency proofs for word embedding
Lemma 7 (Consistency of SVD). Assume the norm of the latent embedding is tied to the unigram frequency as $C_i / \sum_j C_j = \|x_i\|^2 / \sigma^2$. Let $\widehat{X}$ be the embedding derived from the SVD of $M_{ij} = \log(C_{ij}) + \tau$ via the square-root factorization $\widehat{X} = U_d S_d^{1/2}$, where $M = U S V^\top$. Then there exists a $\tau$ such that this embedding is close to the true embedding, under the same equivalence class as before.

Proof. By Corollary 6, for any $\delta_1 > 0$ and $\varepsilon_1 > 0$ there exists an $m$ such that, with probability at least $1 - \varepsilon_1$, $\log C_{ij}$ is uniformly within $\delta_1$ of its continuum limit. If additionally $C_i / \sum_j C_j = \|x_i\|^2 / \sigma^2$, then the bound can be rewritten so that $M_{ij} = \log(C_{ij}) + \tau$ is uniformly within $\delta_1$ of the dot-product matrix $\langle x_i, x_j \rangle$. Given that the dot-product matrix has error at most $\delta_1$, the resulting embedding is known to have at most $\sqrt{\delta_1}$ error (Sibson, 1979). This completes the proof, since we can pick $\tau = -\log(mc)$, $\delta_1 = \delta^2$ and $\varepsilon_1 = \varepsilon$.
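A minimal sketch of the SVD estimator analyzed above, assuming the square-root factorization $\widehat{X} = U_d S_d^{1/2}$ of $M_{ij} = \log(C_{ij}) + \tau$; the handling of zero counts (those entries are simply left at 0) is a simplification of ours:

```python
import numpy as np

def svd_embedding(C, dim, tau=0.0):
    """SVD estimator sketch: factorize M_ij = log(C_ij) + tau and return
    the square-root factorization X = U_d * sqrt(S_d)."""
    M = np.zeros_like(C, dtype=float)
    mask = C > 0
    M[mask] = np.log(C[mask]) + tau  # zero counts left at 0 (simplification)
    U, S, _ = np.linalg.svd(M)
    return U[:, :dim] * np.sqrt(S[:dim])
```

When the log co-occurrence matrix is exactly a Gram matrix of rank d, the estimator recovers it up to rotation, which is the equivalence class at play in the lemma.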
Theorem 8 (Consistency of softmax/word2vec). Define the softmax objective function with bias as
$$\min_{x, c, b}\; -\sum_{ij} C_{ij} \log \frac{\exp(-\|x_i - c_j\|_2^2 + b_j)}{\sum_k \exp(-\|x_i - c_k\|_2^2 + b_k)}.$$
Define $x^m, c^m, b^m$ as the global minima of the above objective for a co-occurrence matrix $C_{ij}$ over a corpus of size $m$. For any $\varepsilon > 0$ and $\delta > 0$ there exists some $m$ such that, with probability at least $1 - \delta$, $x^m$ lies within $\varepsilon$ of the true embedding (up to the usual equivalence class).

Proof. By differentiation, any objective of the form
$$\min_{\lambda}\; -\sum_{ij} C_{ij} \log \frac{\exp(-\lambda_{ij})}{\sum_k \exp(-\lambda_{ik})}$$
has the minima $\lambda^*_{ij} = -\log(C_{ij}) + a_i$, up to unidentifiable per-row constants $a_i$, with objective value $-\sum_{ij} C_{ij} \log(C_{ij} / \sum_k C_{ik})$. This gives a global lower bound on the objective function. Now consider the objective value of the true embedding $x_\sigma$: by Corollary 6 its induced $\lambda_{ij}$ approaches $\lambda^*_{ij}$ as $m$ grows, so the value attained by the true embedding approaches the lower bound, and any global minimizer must attain it as well.
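The first-order argument above — that the row-wise softmax objective is minimized at $\lambda^*_{ij} = -\log(C_{ij}) + a_i$ with value $-\sum_{ij} C_{ij} \log(C_{ij}/\sum_k C_{ik})$ — can be verified numerically; the helper names are ours:

```python
import math

def softmax_nll(C, lam):
    """Objective: -sum_ij C_ij * log( exp(-lam_ij) / sum_k exp(-lam_ik) )."""
    total = 0.0
    for Ci, li in zip(C, lam):
        Z = sum(math.exp(-l) for l in li)
        for cij, lij in zip(Ci, li):
            total -= cij * math.log(math.exp(-lij) / Z)
    return total

def entropy_bound(C):
    """Claimed global minimum value: -sum_ij C_ij * log(C_ij / sum_k C_ik)."""
    total = 0.0
    for Ci in C:
        s = sum(Ci)
        total -= sum(cij * math.log(cij / s) for cij in Ci)
    return total
```

Plugging in $\lambda_{ij} = -\log C_{ij}$ attains the bound exactly, any row-wise constant shift $a_i$ leaves the objective unchanged (the unidentifiability in the proof), and any other perturbation strictly increases it.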
Note that for word2vec with negative sampling, applying the stationary-point analysis of Levy and Goldberg (2014b) combined with the analysis in Lemma 7 shows that the true embedding is a global minimum.

Appendix C Empirical evaluation details

C.1 Implementation details
We used off-the-shelf implementations of word2vec^11 and GloVe^12. The two other methods, (randomized) SVD and regression embedding, are both implemented on top of the GloVe codebase. We used 300-dimensional vectors and a window size of 5 in all models. Further details are provided below.

^11 http://code.google.com/p/word2vec
^12 http://nlp.stanford.edu/projects/glove

word2vec. We used the skip-gram version with 5 negative samples, 10 iterations, α = 0.025, and frequent-word sub-sampling with a parameter of 10^−3.

GloVe.
We disabled GloVe's corpus weighting, since doing so generally produced superior results. The default step sizes result in NaN-valued embeddings, so we reduced them. We used X_MAX = 100, η = 0.01, and 10 iterations.
SVD. For the SVD algorithm of Levy and Goldberg (2014b), we use the GloVe co-occurrence counter combined with a parallel randomized-projection SVD factorizer based on the redsvd library,^13 due to memory and runtime constraints. Following Levy et al. (2015), we used the square-root factorization, no negative shift (τ = 0 in our notation), and 50,000 random projections.
Regression embedding. We use standard SGD with two modifications. First, we drop co-occurrence values with probability proportional to 1 − C_ij/10 when C_ij < 10 and scale the gradient accordingly, which speeds up training with no loss in accuracy. Second, we use an initial line-search step combined with a linear step-size decay by epoch. We use θ = 50, and η is line-searched starting at η = 10.
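The drop-and-rescale trick for rare co-occurrences can be sketched as follows; the unbiasedness-preserving weight threshold/C_ij is our reading of "scale the gradient", and the exact scaling in the authors' implementation may differ:

```python
import random

def subsample_cooccurrences(pairs, threshold=10.0, seed=0):
    """Drop a pair (i, j, C_ij) with probability 1 - C_ij/threshold when
    C_ij < threshold; surviving rare pairs get gradient weight threshold/C_ij
    so the expected update is unchanged. Returns (i, j, C_ij, weight)."""
    rng = random.Random(seed)
    kept = []
    for i, j, c in pairs:
        if c < threshold:
            if rng.random() < c / threshold:  # keep with probability C_ij / threshold
                kept.append((i, j, c, threshold / c))
        else:
            kept.append((i, j, c, 1.0))
    return kept
```

Since most co-occurrence entries are small, this prunes the bulk of the SGD updates while leaving frequent pairs untouched.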

C.2 Solving inductive reasoning tasks
The ideal point for a task is defined below:
• Analogies: Given A : B :: C, the ideal point is B − A + C (the parallelogram rule).
• Analogies (SAT): Given a prototype A : B and candidates C_1 : D_1, . . . , C_n : D_n, we compare each D_i − C_i to the ideal point B − A.
• Categories: Given a category implied by w_1, . . . , w_n, the ideal point is I = (1/n) Σ_{i=1}^n w_i.
• Sequences: Given a sequence w_1 : · · · : w_n, we compute the ideal point as I = w_n + (1/n)(w_n − w_1).
Once we have the ideal point I, we pick as the answer the word closest to I among the options, using L2 or cosine distance. For cosine distance, we first normalize I to unit norm; for L2, we apply no normalization.
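The ideal-point constructions and the nearest-word selection can be sketched end to end (L2 variant only; the function names are ours):

```python
def ideal_point(task, vectors):
    """Compute the ideal point for each task type:
       analogy   A:B::C        -> B - A + C
       category  w_1 ... w_n   -> (1/n) * sum(w_i)
       sequence  w_1 : ... : w_n -> w_n + (w_n - w_1)/n"""
    kind, words = task
    vs = [vectors[w] for w in words]
    if kind == "analogy":
        a, b, c = vs
        return [bi - ai + ci for ai, bi, ci in zip(a, b, c)]
    if kind == "category":
        n = len(vs)
        return [sum(coord) / n for coord in zip(*vs)]
    if kind == "sequence":
        first, last, n = vs[0], vs[-1], len(vs)
        return [li + (li - fi) / n for fi, li in zip(first, last)]
    raise ValueError(kind)

def closest(ideal, candidates, vectors):
    """Pick the candidate word minimizing L2 distance to the ideal point."""
    def d2(w):
        return sum((a - b) ** 2 for a, b in zip(ideal, vectors[w]))
    return min(candidates, key=d2)
```

For example, with one-hot-like toy vectors, the analogy a : b :: c lands at B − A + C, and the closest option to that point is returned as the answer.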