Learning Multilingual Word Embeddings in Latent Metric Space: A Geometric Approach

We propose a novel geometric approach for learning bilingual mappings given monolingual embeddings and a bilingual dictionary. Our approach decouples the source-to-target language transformation into (a) language-specific rotations on the original embeddings to align them in a common, latent space, and (b) a language-independent similarity metric in this common space to better model the similarity between the embeddings. Overall, we pose the bilingual mapping problem as a classification problem on smooth Riemannian manifolds. Empirically, our approach outperforms previous approaches on the bilingual lexicon induction and cross-lingual word similarity tasks. We next generalize our framework to represent multiple languages in a common latent space. Language-specific rotations for all the languages and a common similarity metric in the latent space are learned jointly from bilingual dictionaries for multiple language pairs. We illustrate the effectiveness of joint learning for multiple languages in an indirect word translation setting.


Introduction
Bilingual word embeddings are a useful tool in natural language processing (NLP) that has attracted a lot of interest lately due to a fundamental property: similar concepts/words across different languages are mapped close to each other in a common embedding space. Hence, they are useful for joint/transfer learning and sharing annotated data across languages in different NLP applications such as machine translation (Gu et al., 2018), building bilingual dictionaries (Mikolov et al., 2013b), mining parallel corpora (Conneau et al., 2018), text classification (Klementiev et al., 2012), sentiment analysis (Zhou et al., 2015), and dependency parsing (Ammar et al., 2016). Mikolov et al. (2013b) empirically show that a linear transformation of embeddings from one language to another preserves the geometric arrangement of word embeddings. In a supervised setting, the transformation matrix, W, is learned given a small bilingual dictionary and the corresponding monolingual embeddings. Subsequently, many refinements to the bilingual mapping framework have been proposed (Xing et al., 2015; Smith et al., 2017b; Conneau et al., 2018; Artetxe et al., 2016, 2017, 2018a).
* This work was carried out during the author's internship at Microsoft, India.
In this work, we propose a novel geometric approach for learning bilingual embeddings. We rotate the source and target language embeddings from their original vector spaces to a common latent space via language-specific orthogonal transformations. Furthermore, we define a similarity metric, the Mahalanobis metric, in this common space to refine the notion of similarity between a pair of embeddings. We achieve the above by learning the transformation matrix as W = U_t^T B U_s, where U_t and U_s are the orthogonal transformations for the target and source language embeddings, respectively, and B is a positive definite matrix representing the Mahalanobis metric.
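To make the factorized similarity concrete, here is a minimal numpy sketch (random toy matrices and dimensions, purely illustrative) that builds W_ts = U_t^T B U_s from random orthogonal factors and a symmetric positive definite metric, and evaluates the similarity h_st(x, z) = (U_t z)^T B (U_s x):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

def random_orthogonal(rng, d):
    # QR of a random Gaussian matrix yields an orthogonal factor.
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

U_s, U_t = random_orthogonal(rng, d), random_orthogonal(rng, d)
A = rng.standard_normal((d, d))
B = A @ A.T + d * np.eye(d)          # symmetric positive definite metric

# Source-to-target transformation W_ts = U_t^T B U_s.
W_ts = U_t.T @ B @ U_s

def h_st(x, z):
    # Similarity of source embedding x and target embedding z:
    # score of the rotated embeddings under the Mahalanobis metric B.
    return (U_t @ z) @ B @ (U_s @ x)

x = rng.standard_normal(d)           # source-language embedding
z = rng.standard_normal(d)           # target-language embedding
```

Since B is symmetric, h_st(x, z) = h_ts(z, x), so the same model scores both translation directions.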
The proposed formulation has the following benefits: • The learned similarity metric allows for a more effective similarity comparison of embeddings based on evidence from the data.
• A common latent space decouples the source and target language transformations, and naturally enables representation of word embeddings from both languages in a single vector space.
• We also show that the proposed method can be easily generalized to jointly learn multilingual embeddings, given bilingual dictionaries of multiple language pairs. We map multiple languages into a single vector space by learning the characteristics common across languages (the similarity metric) as well as language-specific attributes (the orthogonal transformations).
The optimization problem resulting from our formulation involves orthogonal constraints on language-specific transformations (U i for language i) as well as the symmetric positivedefinite constraint on the metric B. Instead of solving the optimization problem in the Euclidean space with constraints, we view it as an optimization problem in smooth Riemannian manifolds, which are well-studied topological spaces (Lee, 2003). The Riemannian optimization framework embeds the given constraints into the search space and conceptually views the problem as an unconstrained optimization problem over the manifold.
We evaluate our approach on different bilingual as well as multilingual tasks across multiple languages and datasets. The following is a summary of our findings: • Our approach outperforms state-of-the-art supervised and unsupervised bilingual mapping methods on the bilingual lexicon induction as well as the cross-lingual word similarity tasks.
• An ablation analysis reveals that the following contribute to our model's improved performance: (a) aligning the embedding spaces of different languages, (b) learning a similarity metric which induces a latent space, (c) performing inference in the induced latent space, and (d) formulating the task as a classification problem.
• We evaluate our multilingual model on an indirect word translation task: translation between a language pair that does not have a bilingual dictionary, but the source and target languages each possess a bilingual dictionary with a third, common pivot language. Our multilingual model outperforms a strong unsupervised baseline as well as methods based on adapting bilingual methods for this indirect translation task.
• Lastly, we propose a semi-supervised extension of our approach that further improves performance over the supervised approaches.
The rest of the paper is organized as follows. Section 2 discusses related work. The proposed framework, including problem formulations for bilingual and multilingual mappings, is presented in Section 3. The proposed Riemannian optimization algorithm is described in Section 4. In Section 5, we discuss our experimental setup. Section 6 presents the results of experiments on direct translation with our algorithms and analyzes the results. Section 7 presents experiments on indirect translation using our generalized multilingual algorithm. We discuss a semi-supervised extension to our framework in Section 8. Section 9 concludes the paper.

Related Work
Bilingual Embeddings. Mikolov et al. (2013b) show that a linear transformation from the embeddings of one language to another can be learned from a bilingual dictionary and the corresponding monolingual embeddings by performing linear least-squares regression. A popular modification to this formulation constrains the transformation matrix to be orthogonal (Xing et al., 2015; Smith et al., 2017b; Artetxe et al., 2018a). This is known as the orthogonal Procrustes problem (Schönemann, 1966). Orthogonality preserves monolingual distances and ensures the transformation is reversible. Lazaridou et al. (2015), among others, optimize alternative loss functions in this framework. Artetxe et al. (2018a) improve on the Procrustes solution and propose a multi-step framework consisting of a series of linear transformations of the data. Faruqui and Dyer (2014) use Canonical Correlation Analysis (CCA) to learn linear projections from the source and target languages to a common space such that correlations between the embeddings projected to this space are maximized. Procrustes solution-based approaches have been shown to perform better than CCA-based approaches (Artetxe et al., 2016, 2018a).

We view the problem of mapping the source and target language word embeddings as (a) aligning the two language spaces and (b) learning a similarity metric in this (learned) common space. We accomplish this by learning suitable language-specific orthogonal transformations (for alignment) and a symmetric positive-definite matrix (as the Mahalanobis metric). The similarity metric is useful in addressing the limitations of mapping to a common latent space under orthogonality constraints, an issue discussed by Doval et al. (2018). Whereas Doval et al. (2018) learn a second correction transformation by assuming the average of the projected source and target embeddings to be the true latent representation, we make no such assumption and learn the similarity metric from the data. Kementchedjhieva et al. (2018) recently employed the generalized Procrustes analysis (GPA) method (Gower, 1975) for the bilingual mapping problem. GPA maps both the source and target language embeddings to a latent space, which is constructed by averaging over the two language spaces.
Unsupervised methods have shown promising results, matching supervised methods in many studies. Artetxe et al. (2017) propose a bootstrapping method for the bilingual lexicon induction problem that starts from a small seed bilingual dictionary. Subsequently, Artetxe et al. (2018b) and Hoshen and Wolf (2018) have proposed initialization methods that eliminate the need for a seed dictionary. Zhang et al. (2017b), among others, propose aligning the source and target language word embeddings by optimizing the Wasserstein distance. Unsupervised methods based on adversarial training objectives have also been proposed (Barone, 2016; Zhang et al., 2017a; Conneau et al., 2018; Chen and Cardie, 2018). Recent work discusses cases in which unsupervised bilingual lexicon induction does not lead to good performance.
Multilingual Embeddings. Ammar et al. (2016) and Smith et al. (2017a) adapt bilingual approaches for representing embeddings of multiple languages in a common vector space by designating one of the languages as a pivot language. In this simple approach, bilingual mappings are learned independently from all other languages to the pivot language. A GPA-based method (Kementchedjhieva et al., 2018) may also be used to jointly transform multiple languages to a common latent space. However, this requires an n-way dictionary to represent n languages. In contrast, the proposed approach requires only pairwise bilingual dictionaries such that every language under consideration is represented in at least one bilingual dictionary.
The above-mentioned approaches are referred to as offline since the monolingual and bilingual embeddings are learned separately. In contrast, online approaches directly learn a bilingual/ multilingual embedding from parallel corpora (Hermann and Blunsom, 2014;Duong et al., 2017), optionally augmented with monolingual corpora (Klementiev et al., 2012;Chandar et al., 2014;Gouws et al., 2015). In this work, we focus on offline approaches.

Learning Latent Space Representation
In this section, we first describe the proposed geometric framework to learn bilingual embeddings. We then present its generalization to the multilingual setting.

Geometry-aware Factorization
We propose to transform the word embeddings from the source and target languages to a common space in which the similarity of word embeddings may be better learned. To this end, we align the source and target language embedding spaces by learning language-specific rotations: U_s ∈ O_d and U_t ∈ O_d for the source and target language embeddings, respectively. Here O_d represents the space of d-dimensional orthogonal matrices. An embedding x in the source language is thus transformed to ψ_s(x) = U_s x. Similarly, for an embedding z in the target language: ψ_t(z) = U_t z. These orthogonal transformations map (align) both the source and target language embeddings to a common space in which we learn a data-dependent similarity measure, as discussed below.
We learn a Mahalanobis metric B to refine the notion of similarity^1 between the two transformed embeddings ψ_s(x) and ψ_t(z). The Mahalanobis metric incorporates the feature correlation information from the given training data. This allows for a more effective similarity comparison of language embeddings than the cosine similarity. In fact, the Mahalanobis similarity measure reduces to cosine similarity when the features are uncorrelated and have unit variance, which may be a strong assumption in real-world applications. It has also been argued that monolingual embedding spaces across languages are not necessarily isomorphic, hence learning an orthogonal transformation alone may not be sufficient. A similarity metric learned from the data may mitigate this limitation to some extent by learning a correction in the latent space. We require B ≻ 0, i.e., B is symmetric positive definite. The similarity between the embeddings x and z in the proposed setting is h_st(x, z) = ψ_t(z)^T B ψ_s(x) = z^T U_t^T B U_s x. The source-to-target language transformation is expressed as W_ts = U_t^T B U_s. For an embedding x in the source language, its transformation to the target language space is given by W_ts x.
The proposed factorization of the transformation, W = U B V^T with U, V ∈ O_d and B ≻ 0, is sometimes referred to as the polar factorization of a matrix (Bonnabel and Sepulchre, 2010; Meyer et al., 2011). Polar factorization is similar to the singular value decomposition (SVD). The key difference is that SVD enforces B to be a diagonal matrix with non-negative entries, which accounts for only axis rescaling instead of full feature correlation and is more difficult to optimize (Harandi et al., 2017).
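As an illustration of the polar viewpoint (a sketch on random toy data, not part of the training algorithm), any matrix W can be refactored from its SVD into an orthogonal factor times a full symmetric positive semi-definite factor, rather than the diagonal scaling that SVD alone provides:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
W = rng.standard_normal((d, d))

# SVD: W = P diag(s) Q^T with orthogonal P, Q and non-negative s.
P, s, Qt = np.linalg.svd(W)

# Polar-style refactoring: W = (P Q^T) (Q diag(s) Q^T) = U @ B_full,
# where U is orthogonal and B_full is symmetric positive semi-definite
# with full feature correlations, not just a diagonal rescaling.
U = P @ Qt
B_full = Qt.T @ np.diag(s) @ Qt
```

The learned model additionally shares B across language pairs, which the SVD of a single W cannot express.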

Latent Space Interpretation
Computing the Mahalanobis similarity measure is equivalent to computing the cosine similarity in a special latent (feature) space. This latent space is defined by the transformation φ : R^d → R^d, where the mapping is defined as φ(w) = B^(1/2) w.
1 The Mahalanobis metric generalizes the notion of cosine similarity. For two given unit-normalized vectors x_1, x_2 ∈ R^d, their cosine similarity is given by sim_I(x_1, x_2) = x_1^T I x_2 = x_1^T x_2, where I is the identity matrix. If this space is endowed with a metric B ≻ 0, then sim_B(x_1, x_2) = x_1^T B x_2.
Since B is a symmetric positive-definite matrix, B^(1/2) is well-defined and unique. Hence, our model may equivalently be viewed as learning a suitable latent space as follows. The source and target language embeddings are linearly transformed as x → φ(ψ_s(x)) and z → φ(ψ_t(z)), respectively. The functions φ(ψ_s(·)) and φ(ψ_t(·)) map the source and target language embeddings, respectively, to a common latent space. We learn the matrices B, U_s, and U_t corresponding to the transformations φ(·), ψ_s(·), and ψ_t(·), respectively. Since the matrix B is embedded implicitly in this latent feature space, we employ the usual cosine similarity measure in it: φ(ψ_t(z))^T φ(ψ_s(x)) = z^T U_t^T B U_s x. It should be noted that this is equal to h_st(x, z).
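The latent-space equivalence can be checked numerically. The sketch below (random toy matrices, illustrative only) computes B^(1/2) from the eigendecomposition of B and verifies that the inner product of the latent representations equals the Mahalanobis similarity of the rotated embeddings:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4

def random_orthogonal(rng, d):
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

U_s, U_t = random_orthogonal(rng, d), random_orthogonal(rng, d)
A = rng.standard_normal((d, d))
B = A @ A.T + d * np.eye(d)          # B is symmetric positive definite

# Unique symmetric square root B^(1/2) via the eigendecomposition of B.
w, V = np.linalg.eigh(B)
B_half = V @ np.diag(np.sqrt(w)) @ V.T

x = rng.standard_normal(d)           # source embedding
z = rng.standard_normal(d)           # target embedding

latent_x = B_half @ (U_s @ x)        # phi(psi_s(x))
latent_z = B_half @ (U_t @ z)        # phi(psi_t(z))
```

Because B_half is symmetric and B_half @ B_half = B, the two views of the similarity coincide exactly.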

A Classification Model
We assume a small bilingual dictionary (of size n) is available as the training data. Let X_s ∈ R^{d×n_s} and X_t ∈ R^{d×n_t} denote the embeddings of the dictionary words from the source and target languages, respectively. Here, n_s and n_t are the number of unique words in the source and target languages present in the dictionary. We propose to model the bilingual word embedding mapping problem as a binary classification problem. Consider word embeddings x and z from the source and target languages, respectively. If the words corresponding to x and z constitute a translation pair, then the pair {x, z} belongs to the positive class; else it belongs to the negative class. The prediction function for the pair {x, z} is h_st(x, z). We create a binary label matrix Y_st ∈ {0, 1}^{n_s×n_t} whose (i, j)-th entry corresponds to the correctness of mapping the i-th embedding in X_s to the j-th embedding in X_t. Our overall optimization problem is as follows:

    min_{U_s ∈ O_d, U_t ∈ O_d, B ≻ 0}  ‖X_s^T U_s^T B U_t X_t − Y_st‖_F^2 + λ‖B‖_F^2,    (1)

where ‖·‖_F is the Frobenius norm and λ > 0 is the regularization parameter. We employ the square loss function since it is smooth and relatively easy to optimize. It should be noted that our prediction function is invariant to the direction of mapping, i.e., h_st(x, z) = h_ts(z, x). Hence, our model learns a bidirectional mapping. The transformation matrix from the target to the source language is given by W_st = U_s^T B U_t.

The computational complexity of computing the loss term in (1) is linear in n, the size of the given bilingual dictionary. This is because the loss term in (1) can be re-written as follows:

    ‖X_s^T U_s^T B U_t X_t − Y_st‖_F^2 = Tr(B M_s B M_t) + n − 2 Σ_{(i,j)∈Ω} (U_s x_si)^T B (U_t x_tj),    (2)

where M_s = U_s X_s X_s^T U_s^T, M_t = U_t X_t X_t^T U_t^T, x_si represents the i-th column in X_s, x_tj represents the j-th column in X_t, Ω is the set of row-column indices corresponding to entry value 1 in Y_st, and Tr(·) denotes the trace of a matrix. The complexity of computing the first and third terms in (2) is O(d^3 + n_s d^2 + n_t d^2) and O(nd + n_s d^2 + n_t d^2), respectively.
Similarly, the computation cost of the gradient of the objective function in (1) is also linear in n. Hence, our framework can efficiently leverage information from all the negative samples.
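To illustrate the complexity argument, the following numpy check (toy dimensions, random data) compares the naive loss, which materializes the full n_s × n_t prediction matrix, against an expanded form that touches the label matrix only through its n nonzero entries:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_s, n_t, n = 4, 6, 7, 5

X_s = rng.standard_normal((d, n_s))
X_t = rng.standard_normal((d, n_t))
U_s, _ = np.linalg.qr(rng.standard_normal((d, d)))
U_t, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = rng.standard_normal((d, d))
B = A @ A.T + d * np.eye(d)

# Toy dictionary: n distinct (row, column) pairs with label 1.
rows = rng.choice(n_s, size=n, replace=False)
cols = rng.choice(n_t, size=n, replace=False)
Y = np.zeros((n_s, n_t))
Y[rows, cols] = 1.0

# Naive loss: builds the full n_s x n_t prediction matrix.
pred = X_s.T @ U_s.T @ B @ U_t @ X_t
loss_naive = np.linalg.norm(pred - Y, "fro") ** 2

# Expanded loss: a trace term, the constant n, and a sum over the
# n dictionary pairs only, so the cost is linear in n.
M_s = U_s @ X_s @ X_s.T @ U_s.T
M_t = U_t @ X_t @ X_t.T @ U_t.T
cross = sum(
    (U_s @ X_s[:, i]) @ B @ (U_t @ X_t[:, j]) for i, j in zip(rows, cols)
)
loss_expanded = np.trace(B @ M_s @ B @ M_t) + n - 2.0 * cross
```

The two quantities agree up to floating-point error, confirming the expansion.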
In the next section, we discuss a generalization of our approach to multilingual settings.

Generalization to Multilingual Setting
In this section, we propose a unified framework for learning mappings when bilingual dictionaries are available for multiple language pairs. We formalize the setting as an undirected, connected graph G(V, E), where each node represents a language and an edge represents the availability of a bilingual dictionary between the corresponding pair of languages. Given all bilingual dictionaries corresponding to the edge set E, we propose to align the embedding spaces of all languages in the node set V and learn a common latent space for them.
To this end, we jointly learn an orthogonal transformation U_i ∈ O_d for every language L_i and the Mahalanobis metric B ≻ 0. The latter is common across all languages in the multilingual setup and helps incorporate information across languages in the latent space. It should be noted that the transformation U_i is employed for all the bilingual mapping problems in this graph associated with L_i. The transformation from L_i to L_j is given by W_ji = U_j^T B U_i. Further, we are also able to obtain transformations between any language pair in the graph, even if a bilingual dictionary between them is not available. Let X_i^j ∈ R^{d×m} be the embeddings of the dictionary words of L_i in the dictionary corresponding to edge e_ij ∈ E. Let Y_ij ∈ {0, 1}^{m×m} be the binary label matrix corresponding to the dictionary between L_i and L_j. The proposed optimization problem for the multilingual setting is

    min_{B ≻ 0, U_i ∈ O_d for all i}  Σ_{e_ij ∈ E} ‖(X_i^j)^T U_i^T B U_j X_j^i − Y_ij‖_F^2 + λ‖B‖_F^2.    (3)

We term our approach Geometry-aware Multilingual Mapping (GeoMM). We next discuss the optimization algorithm for solving the bilingual mapping problem (1) as well as its generalization to the multilingual setting (3).
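As a concrete illustration, the following numpy sketch (toy shapes and argument names are hypothetical) evaluates the multilingual objective as a sum of per-edge squared classification losses plus a regularizer on the shared metric B:

```python
import numpy as np

def multilingual_objective(U, B, edges, lam):
    """Sketch of the multilingual objective: sum of squared losses over
    all dictionary edges plus a Frobenius regularizer on B.

    U     : dict mapping language id -> d x d orthogonal matrix
    B     : d x d symmetric positive definite metric (shared)
    edges : list of (i, j, X_i, X_j, Y_ij), with X_* of shape d x m and
            Y_ij the m x m binary label matrix for that dictionary
    lam   : regularization parameter
    """
    total = lam * np.linalg.norm(B, "fro") ** 2
    for i, j, X_i, X_j, Y_ij in edges:
        # Per-edge prediction matrix uses the two language rotations
        # and the shared metric, as in the bilingual formulation.
        pred = X_i.T @ U[i].T @ B @ U[j] @ X_j
        total += np.linalg.norm(pred - Y_ij, "fro") ** 2
    return total
```

With identity rotations, an identity metric, and a perfectly matched toy dictionary, the loss term vanishes and only the regularizer remains.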

Optimization Algorithm
The geometric constraints U_s ∈ O_d, U_t ∈ O_d, and B ≻ 0 in the proposed problems (1) and (3) have been studied as smooth Riemannian manifolds, which are well-explored topological spaces (Edelman et al., 1998). The orthogonal matrices U_i lie in what is popularly known as the d-dimensional Orthogonal manifold. The space of d × d symmetric positive definite matrices (B ≻ 0) is known as the Symmetric Positive Definite manifold. The Riemannian optimization framework embeds such constraints into the search space and conceptually views the problem as an unconstrained problem over the manifolds. In the process, it is able to exploit the geometry of the manifolds and the symmetries involved in them. Absil et al. (2008) discuss several tools to systematically optimize such problems. We optimize problems (1) and (3) using the Riemannian conjugate gradient algorithm (Absil et al., 2008; Sato and Iwai, 2013).
Publicly available toolboxes such as Manopt (Boumal et al., 2014), Pymanopt (Townsend et al., 2016), or ROPTLIB (Huang et al., 2016) have scalable off-the-shelf generic implementations of several Riemannian optimization algorithms. We employ Pymanopt in our experiments, where we only need to supply the objective function.
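The experiments rely on Pymanopt's Riemannian conjugate gradient; purely as a minimal illustration of the underlying machinery (not the paper's implementation), the numpy sketch below runs plain Riemannian gradient descent on the Orthogonal manifold for a toy Procrustes-style objective, using the standard tangent-space projection and a QR retraction:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4

def riemannian_grad(U, egrad):
    # Project the Euclidean gradient onto the tangent space of the
    # Orthogonal manifold at U: grad = egrad - U sym(U^T egrad).
    UtG = U.T @ egrad
    return egrad - U @ (UtG + UtG.T) / 2.0

def qr_retraction(U, xi):
    # Map the update U + xi back onto the manifold via QR factorization,
    # fixing signs so the orthogonal factor is uniquely determined.
    q, r = np.linalg.qr(U + xi)
    return q * np.sign(np.diag(r))

# Toy objective: min_U ||U X - Z||_F^2 over orthogonal U.
X = rng.standard_normal((d, 8))
U_true, _ = np.linalg.qr(rng.standard_normal((d, d)))
Z = U_true @ X

obj = lambda U: np.linalg.norm(U @ X - Z) ** 2

U, _ = np.linalg.qr(rng.standard_normal((d, d)))
obj_init = obj(U)
step = 0.01
for _ in range(2000):
    egrad = 2.0 * (U @ X - Z) @ X.T    # Euclidean gradient of the objective
    U = qr_retraction(U, -step * riemannian_grad(U, egrad))
```

Every iterate stays exactly orthogonal because the retraction re-projects onto the manifold; conjugate gradient adds momentum-like search directions on top of this same machinery.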

Experimental Settings
In this section, we describe the evaluation tasks, the datasets used, and the experimental details of the proposed approach.
Evaluation Tasks. We evaluate our approach on several tasks: • To evaluate the quality of the bilingual mappings generated, we evaluate our algorithms primarily on the bilingual lexicon induction (BLI) task, i.e., the word translation task, and compare Precision@1 with previously reported state-of-the-art results on benchmark datasets (Artetxe et al., 2016; Conneau et al., 2018).
• We also evaluate on the cross-lingual word similarity task using the SemEval 2017 dataset.
• To ensure that the quality of the embeddings on monolingual tasks does not degrade, we evaluate our embeddings on the monolingual word analogy task (Artetxe et al., 2016).
• To illustrate the utility of representing embeddings of multiple languages in a single latent space, we evaluate our multilingual embeddings on the one-hop translation task, i.e., a direct dictionary between the source and target languages is not available, but the source and target languages each share a bilingual dictionary with a pivot language.
Datasets. For bilingual and multilingual experiments, we report results on the following widely used, publicly available datasets: • VecMap: This dataset, with subsequent extensions by other researchers (Artetxe et al., 2017, 2018a), contains bilingual dictionaries from English (en) to four languages: Italian (it), German (de), Finnish (fi), and Spanish (es). The detailed experimental settings for this BLI task can be found in Artetxe et al. (2018b).
• MUSE: This dataset was originally made available by Conneau et al. (2018). It contains bilingual dictionaries from English to many languages such as Spanish (es), French (fr), German (de), Russian (ru), Chinese (zh), and vice versa. The detailed experimental settings for this BLI task can be found in Conneau et al. (2018). This dataset also contains bilingual dictionaries between several other European languages, which we employ in multilingual experiments.
Experimental Settings of GeoMM. We select the regularization hyper-parameter λ from the set {10, 10^2, 10^3, 10^4} by evaluation on a validation set created out of the training dataset. For inference, we use the (normalized) latent space representations of the embeddings (B^(1/2) U_i x) to compute similarity between embeddings. For inference in the bilingual lexicon induction task, we employ the Cross-domain Similarity Local Scaling (CSLS) similarity score (Conneau et al., 2018) in the nearest neighbor search, unless otherwise mentioned. CSLS has been shown to perform better than other methods in mitigating the hubness problem for search in high-dimensional spaces.
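For reference, CSLS rescales each cosine similarity by the mean similarity of each word to its k nearest cross-lingual neighbours, penalizing "hub" words that are close to everything. A small numpy sketch (toy unit-normalized matrices, k is a free parameter) of the score matrix:

```python
import numpy as np

def csls(S, T, k=2):
    """CSLS score matrix between source queries and target vocabulary.

    S : m x d unit-normalized source embeddings (e.g., latent space)
    T : n x d unit-normalized target embeddings
    CSLS(x, z) = 2 cos(x, z) - r_T(x) - r_S(z), where r_T(x) is the mean
    cosine of x with its k nearest targets, and r_S(z) the mean cosine
    of z with its k nearest source queries.
    """
    cos = S @ T.T                                        # m x n cosines
    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)    # r_T(x), length m
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)    # r_S(z), length n
    return 2.0 * cos - r_src[:, None] - r_tgt[None, :]
```

Retrieval then takes the argmax of each row of the CSLS matrix instead of the raw cosine row.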
While discussing experiments, we denote our bilingual mapping algorithm (Section 3.3) as GeoMM and its generalization to the multilingual setting (Section 3.4) as GeoMM_multi. Our code is available at https://github.com/anoopkunchukuttan/geomm.

Direct Translation: Results and Analysis
In this section, we evaluate the performance of our approach on two tasks: bilingual lexicon induction and cross-lingual word similarity. We also perform ablation tests to understand the effect of major sub-components of our algorithm. We verify the monolingual performance of the mapped embeddings generated by our algorithm.

Bilingual Lexicon Induction (BLI)
We compare GeoMM with the best performing supervised methods. We also compare with unsupervised methods as they have been shown to be competitive with supervised methods. The following baselines are compared in the BLI experiments.
Among the baselines, MSF improves on the original system (MSF-ISF) by Artetxe et al. (2018a), which employs the inverted softmax function (ISF) score for retrieval.
We also include results of the correction algorithm proposed by Doval et al. (2018) applied to the MSF results (referred to as MSF_µ). In addition, we include results of several recent works (Kementchedjhieva et al., 2018; Chen and Cardie, 2018; Hoshen and Wolf, 2018) on the MUSE and VecMap datasets, as reported in the original papers. Table 1 reports the results on the MUSE dataset. We observe that our algorithm GeoMM outperforms all the supervised baselines. GeoMM also obtains significant improvements over unsupervised approaches.

Results on MUSE Dataset:
The performance of the multilingual extension, GeoMM_multi, is almost equivalent to that of the bilingual GeoMM. This means that in spite of multiple embeddings being jointly learned and represented in a common space, its performance is still better than existing bilingual approaches. Thus, our multilingual framework is quite robust, since languages from diverse language families have been embedded in the same space. This can allow downstream applications to support multiple languages without performance degradation. Even if bilingual embeddings are represented in a single vector space using a pivot language, the embedding quality is inferior compared with GeoMM_multi. We discuss more multilingual experiments in Section 7.

Results on VecMap Dataset: Table 2 reports the results on the VecMap dataset (the results of CCA (Faruqui and Dyer, 2014) and Adv-Refine are reported by Artetxe et al. (2018b); CCA-NN employs a nearest neighbor retrieval procedure; the results of GPA are reported by Kementchedjhieva et al. (2018)). We observe that GeoMM obtains the best performance on each language pair, surpassing state-of-the-art results reported on this dataset. GeoMM also outperforms GPA (Kementchedjhieva et al., 2018), which also learns bilingual embeddings in a latent space.

Ablation Tests
We next study the impact of different components of our framework by varying one component at a time. The results of these tests on the VecMap dataset are shown in Table 3 and are discussed below.
(1) Classification with unconstrained W. We learn the transformation W directly by solving

    min_{W ∈ R^{d×d}}  ‖X_s^T W^T X_t − Y_st‖_F^2 + λ‖W‖_F^2.

The performance drops in this setting compared with GeoMM, underlining the importance of the proposed factorization and the latent space representation. In addition, the proposed factorization helps GeoMM generalize to the multilingual setting (GeoMM_multi). Further, we also observe that the overall performance of this simple classification-based model is better than recent supervised approaches such as Procrustes, MSF-ISF (Artetxe et al., 2018a), and GPA (Kementchedjhieva et al., 2018). This suggests that a classification model is better suited for the BLI task. Next, we look at both components of the factorization.
(2) Without language-specific rotations. We enforce U_s = U_t = I in (1), i.e., W = B. We observe a significant drop in performance, which highlights the need for aligning the feature spaces of different languages.
(3) Without similarity metric. We enforce B = I in (1), i.e., W = U_t^T U_s. The results are poor, which underlines the importance of a suitable similarity metric in the proposed classification model.
(4) Target space inference. We learn W = U_t^T B U_s by solving (1), as in GeoMM. During the retrieval stage, however, the similarity between embeddings is computed in the target space, i.e., given embeddings x and z from the source and target languages, respectively, we compute the similarity of the (normalized) vectors Wx and z. It should be noted that GeoMM computes the similarity of x and z in the latent space, i.e., it computes the similarity of the (normalized) vectors B^(1/2) U_s x and B^(1/2) U_t z. We observe that inference in the target space degrades the performance. This shows that the latent space representation captures useful information and allows GeoMM to obtain much better accuracy.
(5) Regression instead of classification. We pose BLI as a regression problem, as done in previous approaches, by employing the loss function ‖U_t^T B U_s X_s − X_t‖_F^2 (with the dictionary word pairs aligned column-wise). We observe that its performance is worse than the classification baseline (W ∈ R^{d×d}). The classification setting directly models the similarity score via the loss function, and hence corresponds more closely with inference. This result further reinforces the observation made in the first ablation test.
To summarize, the proposed modeling choices are better than the alternatives compared in the ablation tests.

Cross-lingual Word Similarity
The results on the cross-lingual word similarity task using the SemEval 2017 dataset (Camacho-Collados et al., 2017) are shown in Table 4. We observe that GeoMM performs better than Procrustes, MSF, and the SemEval 2017 baseline NASARI (Camacho-Collados et al., 2016). It is also competitive with Luminoso run2 (Speer and Lowry-Duda, 2017), the best reported system on this dataset. It should be noted that NASARI and Luminoso run2 use additional knowledge sources like BabelNet and ConceptNet.

Monolingual Word Analogy: Table 5 shows the results on the English monolingual word analogy task after obtaining the it→en mapping on the VecMap dataset (Mikolov et al., 2013a; Artetxe et al., 2016). We observe that there is no significant drop in monolingual performance from the use of non-orthogonal mappings, compared with the original monolingual embeddings as well as other bilingual embeddings (Procrustes and MSF).

Indirect Translation: Results and Analysis
In the previous sections, we established the efficacy of our approach for the bilingual mapping problem when a bilingual dictionary between the source and target languages is available. We also showed that our proposed multilingual generalization (Section 3.4) performs well in this scenario. In this section, we explore whether our multilingual generalization is beneficial when a bilingual dictionary is not available between the source and target languages, in other words, for indirect translation. For this evaluation, our algorithm learns a single model for various language pairs such that word embeddings of different languages are transformed to a common latent space.

Evaluation Task: One-hop Translation
We consider the BLI task from language L_src to language L_tgt in the absence of a bilingual lexicon between them. We, however, assume the availability of lexicons for L_src-L_pvt and L_pvt-L_tgt, where L_pvt is a pivot language. As baselines, we adapt any supervised bilingual approach (Procrustes, MSF, and the proposed GeoMM) to the one-hop translation setting by considering the following variants: • Composition (cmp): Using the given bilingual approach, we learn the L_src → L_pvt and L_pvt → L_tgt transformations as W_1 and W_2, respectively. Given an embedding x from L_src, the corresponding embedding in L_tgt is obtained by a composition of the transformations, i.e., W_2 W_1 x. This is equivalent to computing the similarity of L_src and L_tgt embeddings in the L_pvt embedding space. Recently, Smith et al. (2017a) explored this technique with the Procrustes algorithm.
• Pipeline (pip): Using the given bilingual approach, we learn the L_src → L_pvt and L_pvt → L_tgt transformations as W_1 and W_2, respectively. Given a word embedding x from L_src, we first infer its translation embedding z in L_pvt. Then, the corresponding embedding of x in L_tgt is W_2 z.
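The operational difference between the two variants can be sketched in a few lines of numpy (toy matrices and a hypothetical pivot vocabulary; real systems would retrieve with CSLS over the full vocabulary): cmp chains the linear maps directly, while pip first snaps the projected point to an actual pivot-vocabulary entry before applying the second map.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 3

# Hypothetical learned bilingual maps: W1 (src -> pvt), W2 (pvt -> tgt).
W1, _ = np.linalg.qr(rng.standard_normal((d, d)))
W2, _ = np.linalg.qr(rng.standard_normal((d, d)))

pivot_vocab = rng.standard_normal((6, d))   # toy pivot-language embeddings
x = rng.standard_normal(d)                  # source-language embedding

# Composition (cmp): chain the linear maps directly.
z_cmp = W2 @ (W1 @ x)

# Pipeline (pip): retrieve the nearest pivot *word*, then map that word.
sims = pivot_vocab @ (W1 @ x)
z_hat = pivot_vocab[np.argmax(sims)]        # snap to a pivot vocabulary entry
z_pip = W2 @ z_hat
```

The intermediate retrieval step is where pip can lose or cascade errors, since the continuous projection is quantized to the pivot vocabulary.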
As discussed in Section 3.4, our framework allows the flexibility to jointly learn the common latent space of multiple languages, given bilingual dictionaries of multiple language pairs. Our multilingual approach, GeoMM_multi, views this setting as a graph with three nodes {L_src, L_tgt, L_pvt} and two edges {L_src-L_pvt, L_pvt-L_tgt} (dictionaries).

Experimental Settings
We experiment with the following one-hop translation cases: (a) fr-it-pt, (b) it-de-es, and (c) es-pt-fr (read the triplets as L_src-L_pvt-L_tgt). The training/test dictionaries and the word embeddings are from the MUSE dataset. In order to minimize direct transfer of information from L_src to L_tgt, we generate the L_src-L_pvt and L_pvt-L_tgt training dictionaries such that they do not share any L_pvt word. The training dictionaries have the same size as the L_src-L_pvt and L_pvt-L_tgt dictionaries provided in the MUSE dataset, while the test dictionaries have 1,500 entries.

Results and Analysis

Table 6 shows the results of the one-hop translation experiments. We observe that GeoMM_multi outperforms the pivoting methods (cmp and pip) built on top of MSF and Procrustes for all language pairs. It should be noted that pivoting may lead to cascading of errors in the solution, whereas jointly learning a common embedding space mitigates this disadvantage. This is reaffirmed by our observation that GeoMM_multi performs significantly better than GeoMM (cmp) and GeoMM (pip). Since unsupervised methods have been shown to be competitive with supervised methods, they can be an alternative to pivoting. Indeed, we observe that the unsupervised method SL-unsup is better than the pivoting methods, although it uses no bilingual dictionaries. On the other hand, GeoMM_multi is better than the unsupervised methods too. It should be noted that the unsupervised methods use a much larger vocabulary than GeoMM_multi during the training stage.

We also experimented with scenarios where some words from L_pvt occur in both the L_src-L_pvt and L_pvt-L_tgt training dictionaries. In these cases too, we observed that GeoMM_multi performs better than the other methods. We omit these results due to space constraints.

Semi-supervised GeoMM
In this section, we discuss an extension of GeoMM that benefits from unlabeled data. For the bilingual mapping problem, unlabeled data is available in the form of vocabulary lists for the source and target languages. Existing unsupervised and semi-supervised techniques (Artetxe et al., 2017, 2018b; Hoshen and Wolf, 2018) employ an iterative refinement procedure that uses the vocabulary lists to augment the dictionary with positive or negative mappings.
Given a seed bilingual dictionary, we implement a bootstrapping procedure that iterates over the following two steps until convergence:
1. Learn the GeoMM model by solving the proposed formulation (1) with the current bilingual dictionary.
2. Compute a new bilingual dictionary from the vocabulary lists, using the current GeoMM model for retrieval. The seed dictionary along with this new dictionary is used in the next iteration.
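A minimal runnable sketch of this bootstrapping loop, on a toy problem where the target embeddings are an exact rotation of the source ones. The least-squares `fit` below is only a stand-in for solving formulation (1), and retrieval uses plain cosine similarity rather than CSLS:

```python
import numpy as np

# Toy setup: target embeddings are an exact orthogonal rotation of the
# source ones, so a perfect mapping exists.
rng = np.random.default_rng(0)
d, n = 4, 50
X = rng.standard_normal((n, d))                      # source vocabulary
W_true = np.linalg.qr(rng.standard_normal((d, d)))[0]
Y = X @ W_true.T                                     # target vocabulary

seed_dict = [(i, i) for i in range(5)]               # seed dictionary
val_dict = [(i, i) for i in range(5, 10)]            # held-out validation set

def fit(pairs):
    """Stand-in for solving formulation (1): least-squares map X -> Y."""
    s = np.array([p[0] for p in pairs])
    t = np.array([p[1] for p in pairs])
    return np.linalg.lstsq(X[s], Y[t], rcond=None)[0]

def retrieve(W, k=20):
    """Induce a dictionary over the k most frequent words by
    nearest-neighbour (cosine) matching in the mapped space."""
    M = X[:k] @ W
    M /= np.linalg.norm(M, axis=1, keepdims=True)
    Yk = Y[:k] / np.linalg.norm(Y[:k], axis=1, keepdims=True)
    return [(i, int(np.argmax(M[i] @ Yk.T))) for i in range(k)]

def accuracy(W, pairs):
    M = X @ W
    M /= np.linalg.norm(M, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    pred = np.argmax(M @ Yn.T, axis=1)
    return float(np.mean([pred[s] == t for s, t in pairs]))

# Bootstrapping: refit on seed + induced pairs; stop when validation
# accuracy no longer improves.
dictionary, best = list(seed_dict), -1.0
for _ in range(10):
    W = fit(dictionary)                              # step 1
    score = accuracy(W, val_dict)
    if score <= best:                                # convergence criterion
        break
    best = score
    dictionary = list(seed_dict) + retrieve(W)       # step 2
```

On this toy data the loop recovers the rotation from the seed pairs alone and stops once the induced dictionary no longer improves validation accuracy.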
To keep the computational cost low, we restrict the vocabulary lists to the k most frequent words in each language (Artetxe et al., 2018b; Hoshen and Wolf, 2018). In addition, we perform bidirectional dictionary induction (Artetxe et al., 2018b; Hoshen and Wolf, 2018). We track the model's performance on a validation set to avoid overfitting and use it as the convergence criterion for the bootstrapping procedure. We evaluate the proposed semi-supervised GeoMM algorithm (referred to as GeoMM semi ) on the bilingual lexicon induction task on the MUSE and VecMap datasets. The bilingual dictionary for training is split 80/20 into a seed dictionary and a validation set. We set k = 25,000, which works well in practice.
We compare GeoMM semi with RCSLS, a recently proposed state-of-the-art semi-supervised algorithm. RCSLS directly optimizes the CSLS similarity score (Conneau et al., 2018), which GeoMM, among other algorithms, uses only during the retrieval stage. GeoMM semi , on the other hand, optimizes a simpler classification-based square loss function (see Section 3.3). In addition to the training dictionary, RCSLS uses the full vocabulary lists of the source and target languages (200,000 words each) during training.
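CSLS itself is not defined in this excerpt; the sketch below implements its standard form from Conneau et al. (2018), 2 cos(x, y) − r T (x) − r S (y), where r T (x) is the mean cosine of x with its k nearest target neighbours and r S (y) is the analogous term on the source side. Penalising words with dense neighbourhoods reduces hubness at retrieval time. All inputs here are random toy embeddings:

```python
import numpy as np

def csls(Xm, Y, k=10):
    """CSLS scores between mapped source embeddings Xm and target
    embeddings Y: 2*cos(x, y) - r_T(x) - r_S(y), where r_T(x) is the mean
    cosine of x with its k nearest target neighbours (r_S likewise)."""
    Xn = Xm / np.linalg.norm(Xm, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    cos = Xn @ Yn.T                                        # pairwise cosines
    r_T = np.mean(np.sort(cos, axis=1)[:, -k:], axis=1)    # per source word
    r_S = np.mean(np.sort(cos, axis=0)[-k:, :], axis=0)    # per target word
    return 2 * cos - r_T[:, None] - r_S[None, :]

rng = np.random.default_rng(0)
scores = csls(rng.standard_normal((6, 4)), rng.standard_normal((8, 4)), k=3)
```

Retrieval then picks, for each source word, the target word with the highest CSLS score (the row-wise argmax of `scores`).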
The results are reported in Table 7. We observe that the overall performance of GeoMM semi is slightly better than that of RCSLS. Our supervised approach GeoMM performs only slightly worse than RCSLS, even though it lacks the advantage of learning from unlabeled data that RCSLS and GeoMM semi have. GeoMM semi improves on GeoMM for almost all language pairs. We also evaluate GeoMM semi on the VecMap dataset; the results are reported in Table 8. To the best of our knowledge, GeoMM semi obtains state-of-the-art results on the VecMap dataset.

Conclusion and Future Work
In this work, we develop a framework for learning multilingual word embeddings by aligning the embeddings for various languages in a common space and inducing a Mahalanobis similarity metric in the common space. We view the translation of embeddings from one language to another as a series of geometrical transformations and jointly learn the language-specific orthogonal rotations and the symmetric positive definite matrix representing the Mahalanobis metric. Learning such transformations can also be viewed as learning a suitable common latent space for multiple languages. We formulate the problem in the Riemannian optimization framework, which models the above transformations efficiently.
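Under these transformations, the similarity between a source embedding x and a target embedding y takes the form (U src x)ᵀ B (U tgt y), with language-specific orthogonal rotations U and a symmetric positive definite matrix B. The sketch below uses random stand-ins for the learned parameters (the paper's exact parameterization may differ in detail), with B built to be SPD by construction:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Random stand-ins for the learned parameters: orthogonal rotations into
# the common latent space, and the SPD matrix of the Mahalanobis metric.
U_src = np.linalg.qr(rng.standard_normal((d, d)))[0]
U_tgt = np.linalg.qr(rng.standard_normal((d, d)))[0]
A = rng.standard_normal((d, d))
B = A @ A.T + d * np.eye(d)      # symmetric positive definite by construction

def similarity(x, y):
    """Similarity of a source word x and a target word y after rotating
    both into the common latent space, under the metric induced by B."""
    return (U_src @ x) @ B @ (U_tgt @ y)

x = rng.standard_normal(d)
y = rng.standard_normal(d)
s = similarity(x, y)
```

With B fixed to the identity, this reduces to an ordinary inner product between rotated embeddings; learning B jointly is what induces the Mahalanobis metric in the latent space.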
We evaluate our bilingual and multilingual algorithms on the bilingual lexicon induction and cross-lingual word similarity tasks. The results show that our algorithm outperforms existing approaches on multiple datasets. In addition, we demonstrate the efficacy of our multilingual algorithm in a one-hop translation setting for bilingual lexicon induction, in which a direct dictionary between the source and target languages is not available. The semi-supervised extension of our algorithm shows that our framework can leverage unlabeled data for further improvements. Our analysis shows that the combination of the proposed transformations, inference in the induced latent space, and modeling the problem in a classification setting allows our approach to achieve state-of-the-art performance.
In the future, an unsupervised extension to our approach can be explored. Optimizing the CSLS loss function within our framework can also be investigated as a way to address the hubness problem. We plan to work on downstream applications such as text classification and machine translation, which may benefit from the proposed latent-space representation of multiple languages by sharing annotated resources across languages.