Hierarchical Mapping for Crosslingual Word Embedding Alignment

Abstract The alignment of word embedding spaces in different languages into a common crosslingual space has recently been in vogue. Strategies that do so compute pairwise alignments and then map multiple languages to a single pivot language (most often English). These strategies, however, are biased towards the choice of the pivot language, given that language proximity and the linguistic characteristics of the target language can strongly impact the resultant crosslingual space to the detriment of topologically distant languages. We present a strategy that eliminates the need for a pivot language by learning the mappings across languages in a hierarchical way. Experiments demonstrate that our strategy significantly improves vocabulary induction scores in all existing benchmarks, as well as in a new non-English-centered benchmark we built, which we make publicly available.


Introduction
Word embeddings have changed how we build text processing applications, given their capabilities for representing the meaning of words (Mikolov et al., 2013a; Pennington et al., 2014; Bojanowski et al., 2017). Traditional embedding-generation strategies create different embeddings for the same word depending on the language. Even if the embeddings themselves are different across languages, their distributions tend to be consistent: the relative distances across word embeddings are preserved regardless of the language (Mikolov et al., 2013b). This behavior has been exploited for crosslingual embedding generation by aligning any two monolingual embedding spaces into one (Dinu et al., 2014; Xing et al., 2015; Artetxe et al., 2016).
Alignment techniques have been successful in generating bilingual embedding spaces that can later be merged into a crosslingual space using a pivoting language, English being the most common choice. Unfortunately, mapping one language into another suffers from a neutrality problem, as the resultant bilingual space is impacted by language-specific phenomena and corpus-specific biases of the target language (Doval et al., 2018). To address this issue, Doval et al. (2018) propose mapping any two languages into a different middle space. This mapping, however, precludes the use of a pivot language for merging multiple bilingual spaces into a crosslingual one, limiting the solution to a bilingual scenario. Additionally, the pivoting strategy suffers from a generalized bias problem, as languages that are the most similar to the pivot obtain a better alignment and are therefore better represented in the crosslingual space. This is because language proximity is a key factor when learning alignments. This is evidenced by the results in Artetxe et al. (2017), which indicate that when using English (Indo-European) as a pivot, the vocabulary induction results for Finnish (Uralic) are about 10 points below the rest of the Indo-European languages under study.
If we want to incorporate all languages into the same crosslingual space regardless of their characteristics, we need to go beyond the train-bilingual/merge-by-pivoting (TB/MP) model, and instead seek solutions that can directly generate crosslingual spaces without requiring a bilingual step. This motivates the design of HCEG (Hierarchical Crosslingual Embedding Generation), the hierarchical pivotless approach for generating crosslingual embedding spaces that we present in this paper. HCEG addresses both the language proximity and target-space bias problems by learning a compositional mapping across multiple languages in a hierarchical fashion. This is accomplished by taking advantage of a language family tree for aggregating multiple languages into a single crosslingual space. What distinguishes HCEG from TB/MP strategies is that it does not need to include the pivot language in all mapping functions. This enables the option to learn mappings between typologically similar languages, known to yield better quality mappings (Artetxe et al., 2017). The main contributions of our work include:
• A strategy that leverages a language family tree for learning mapping matrices that are composed hierarchically to yield crosslingual embedding spaces for language families.
• An analysis of the benefits of hierarchically generating mappings across multiple languages compared to traditional unsupervised and supervised TB/MP alignment strategies.

Related Work
Recent interest in crosslingual word embedding generation has led to manifold strategies that can be classified into four groups (Ruder et al., 2017): (1) Mapping techniques that rely on a bilingual lexicon for mapping an already trained monolingual space into another (Mikolov et al., 2013b; Artetxe et al., 2017; Doval et al., 2018); (2) Pseudo-crosslingual techniques that generate synthetic crosslingual corpora that are then used in a traditional monolingual strategy, by randomly replacing words of a text with their translations (Gouws and Søgaard, 2015; Duong et al., 2016) or by combining texts in various languages into one (Vulić and Moens, 2016); (3) Approaches that only optimize for a crosslingual objective function, which require parallel corpora in the form of aligned sentences (Hermann and Blunsom, 2013; Lauly et al., 2014) or texts; and (4) Approaches using a joint objective function that optimizes both mono- and crosslingual loss, which rely on parallel corpora aligned at the word (Zou et al., 2013; Luong et al., 2015) or sentence level (Coulmance et al., 2015).
A key factor for crosslingual embedding generation techniques is the amount of supervised signal needed. Parallel corpora are a scarce resource, even nonexistent for some isolated or low-resource languages. Thus, we focus on mapping-based strategies that can go from requiring just a bilingual lexicon (Mikolov et al., 2013b) to absolutely no supervised signal (Artetxe et al., 2018). This aligns with one of the premises of our research: to enable the generation of a single crosslingual embedding space for as many languages as possible.
Mikolov et al. (2013b) first introduced a mapping strategy for aligning two monolingual spaces that learns a linear transformation from source to target space using stochastic gradient descent. This approach was later enhanced with the use of least squares for finding the optimal solution, L2-normalizing the word embeddings, or constraining the mapping matrix to be orthogonal (Dinu et al., 2014; Shigeto et al., 2015; Xing et al., 2015; Artetxe et al., 2016; Smith et al., 2017); enhancements that soon became standard in the area. These models, however, are affected by two problems: hubness, where some words tend to be in the neighborhood of an exceptionally large number of other words, causing problems when using nearest-neighbor as the retrieval algorithm; and neutrality, where the resultant crosslingual space is highly conditioned by the characteristics of the language used as target. Hubness was addressed by a correction applied to nearest-neighbor retrieval, using either an inverted softmax (Smith et al., 2017) or cross-domain similarity local scaling (Conneau et al., 2017), later incorporated as part of the training loss. Neutrality was noticed by Doval et al. (2018), who proposed using two independent linear transformations so that the resulting crosslingual space is in a middle point between the two languages rather than just on the target language, and therefore not biased towards either language.
Other important trends in the area concentrate on (i) the search for unsupervised techniques for learning mapping functions (Conneau et al., 2017; Artetxe et al., 2018) and their versatility in dealing with low-resource languages; (ii) the long-tail problem, where most existing crosslingual embedding generation strategies tend to under-perform (Braune et al., 2018; Czarnowska et al., 2019); and (iii) the formulation of more robust evaluation procedures oriented to determining the quality of generated crosslingual spaces (Glavas et al., 2019).
Most existing works focus on a bilingual scenario. Yet, there is increasing interest in designing strategies that directly consider more than two languages at training time, thus creating fully multilingual spaces that do not depend on the TB/MP model (Kementchedjhieva et al., 2018) for multilingual inference. Attempts to do so include efforts that leverage an inverted index based on the Wikipedia multilingual links to generate multilingual word representations. Wada et al. (2019) instead use a sentence-level neural language model for directly learning multilingual word embeddings, thereby bypassing the need for mapping functions. In the paradigm of aligning pre-trained word embeddings, where we focus, Heyman et al. (2019) propose a technique that iteratively builds a multilingual space starting from a monolingual space and incrementally incorporating languages into it. Even if this strategy deviates from the traditional TB/MP model, it still preserves the idea of having a pivot language. Chen and Cardie (2018) separate the mapping functions into encoders and decoders, which are not language-pair dependent, unlike those in the TB/MP model. This removes the need for a pivot language, given that the multilingual space is now latent among all encoders and decoders and not centered in a specific language. The same pivot-removal effect is achieved by the strategy introduced in Jawanpuria et al. (2019), which generalizes a bilingual word embedding strategy into a multilingual counterpart by inducing a Mahalanobis similarity metric in the common space. These two strategies, however, still consider all languages equidistant to each other, ignoring the similarities and differences that lie among them.
Our work is inspired by Doval et al. (2018) and Chen and Cardie (2018), in the sense that it focuses on obtaining a non-biased or neutral crosslingual space that does not need to be centered in English (or any other pivot language) as the primary source. This neutrality is obtained by a compositional mapping strategy that hierarchically combines mapping functions in order to generate a single, non-language-centered crosslingual space, enabling a better mapping for languages that are distant from, or typologically unrelated to, English.

Proposed Strategy
A language family tree is a natural categorization of languages that has historically been used by linguists as a reference that encodes similarities and differences across languages (Comrie, 1989). For example, based on the relative distances among languages in the tree illustrated in Figure 1, we infer that Spanish and Portuguese are relatively similar to each other, given that they are part of the same Italic family. At the same time, both languages are farther from English than they are from each other, and are radically different with respect to Finnish.
A language family tree offers a natural organization that can be exploited when building crosslingual spaces that integrate typologically diverse languages. We leverage this structure in HCEG, in order to generate a hierarchically compositional crosslingual word embedding space. Unlike traditional TB/MP strategies that generate a single crosslingual space, the result of HCEG is a set of transformation matrices that can be used to hierarchically compose the space required in each use-case. This maximizes the typological intra-similarity among languages used for generating the embedding space, while minimizing the differences across languages that can hinder the quality of the crosslingual embedding space. Thus, if an external application only considers languages that are Germanic, then it can just use the Germanic crosslingual space generated by HCEG, whereas if it needs languages beyond Germanic it can utilize a higher level family, such as the Indo-European. This cannot be done with the traditional TB/MP model. In that case, if an application is, for example, using only Uralic languages, then it would be forced to use an English-centered crosslingual space; this would result in a decrease in the quality of the crosslingual space used, because of the potentially poor quality of mappings between typologically different languages, such as Uralic and Indo-European languages (Artetxe et al., 2017).

Definitions
Let $L = \{l_1, \ldots, l_{|L|}\}$ be a set of languages considered, $F = \{f_1, \ldots, f_{|F|}\}$ a set of language families, and $S = L \cup F = \{s_1, \ldots, s_{|F|+|L|}\}$ a set of possible language spaces. Let $X_l \in \mathbb{R}^{V_l \times d}$ be the set of word embeddings in language $l$, where $V_l$ is the vocabulary of $l$ and $d$ is the number of dimensions of each embedding. Consider $T$ as a language family tree (exemplified in Figure 1). The nodes in $T$ represent language spaces in $S$, while each edge represents a transformation between the two nodes attached to it; that is, $W_{s_a \leftarrow s_b} \in \mathbb{R}^{d \times d}$ refers to the transformation from space $s_b$ to space $s_a$. For ease of notation, we refer to $W_{s_a \overset{*}{\leftarrow} s_b}$ as the transformation that results from aggregating all transformations in the path from $s_b$ to $s_a$, using the dot product:

$$W_{s_a \overset{*}{\leftarrow} s_b} = W_{s_a \leftarrow s_{t_1}} \cdot W_{s_{t_1} \leftarrow s_{t_2}} \cdots W_{s_{t_k} \leftarrow s_b} \quad (1)$$

where $s_{t_1}, \ldots, s_{t_k}$ are the intermediate nodes on the path between $s_a$ and $s_b$. Finally, $P$ is a set of bilingual lexicons, where $P_{l_1,l_2} \in \{0, 1\}^{V_{l_1} \times V_{l_2}}$ is a bilingual lexicon with word pairs in languages $l_1$ and $l_2$. $P_{l_1,l_2}(i, j) = 1$ if the $i$-th word of $V_{l_1}$ and the $j$-th word of $V_{l_2}$ are aligned, and $P_{l_1,l_2}(i, j) = 0$ otherwise.
Example. Consider the set of embeddings for English $X_{en}$, the transformation that converts embeddings in the English space to the Germanic language family space $W_{s_{ge} \overset{*}{\leftarrow} s_{en}}$, and the English embeddings transformed to the Germanic space $W_{s_{ge} \overset{*}{\leftarrow} s_{en}} X_{en}$. HCEG makes it so that $W_{s_{ge} \overset{*}{\leftarrow} s_{en}} X_{en}$ and $W_{s_{ge} \overset{*}{\leftarrow} s_{de}} X_{de}$ (the transformed embeddings of English and German) are in the same Germanic embedding space, while $W_{s_{in} \overset{*}{\leftarrow} s_{en}} X_{en}$ and $W_{s_{in} \overset{*}{\leftarrow} s_{es}} X_{es}$ (the transformed embeddings of English and Spanish) are in the same Indo-European embedding space.
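The composition of per-edge transformations along a tree path can be illustrated with a minimal sketch. The toy tree, the `PARENT` dictionary, and the function names below are hypothetical and only loosely follow the Figure 1 example; the random matrices stand in for learned transformations.

```python
import numpy as np

# Hypothetical toy tree (child -> parent); codes follow the example above.
PARENT = {"es": "it", "pt": "it", "en": "ge", "de": "ge", "it": "in", "ge": "in"}

def path_to_ancestor(node, ancestor):
    """Return the nodes visited from `node` up to `ancestor`, inclusive."""
    path = [node]
    while path[-1] != ancestor:
        path.append(PARENT[path[-1]])  # KeyError if `ancestor` is not above `node`
    return path

def compose_to_ancestor(edge_maps, node, ancestor, dim):
    """Multiply the per-edge matrices along the path from `node` to `ancestor`.

    edge_maps[(parent, child)] is the d x d matrix mapping child space -> parent space.
    """
    total = np.eye(dim)
    path = path_to_ancestor(node, ancestor)
    for child, parent in zip(path, path[1:]):
        total = edge_maps[(parent, child)] @ total  # edges closer to the ancestor go on the left
    return total

rng = np.random.default_rng(0)
dim = 4
# Random stand-ins for learned transformations (assumption for illustration).
edge_maps = {(p, c): rng.standard_normal((dim, dim)) for c, p in PARENT.items()}
W_in_es = compose_to_ancestor(edge_maps, "es", "in", dim)
```

Here `W_in_es` plays the role of the aggregated Spanish-to-Indo-European transformation. When embeddings are stored as rows of a matrix `X`, applying a transformation `W` to every embedding can be written as `X @ W.T`.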
In the rest of this section we describe HCEG in detail. Values given to each hyperparameter mentioned in this section are defined in Section 4.4.

Embedding Normalization
When dealing with embeddings generated from different sources and languages, it is important to normalize them. For doing so, HCEG follows a normalization sequence shown to be beneficial (Artetxe et al., 2018), which consists of length normalization, mean centering, and a second length normalization. The last length normalization allows computing cosine similarity between embeddings in a more efficient manner, simplifying the computation of cosine similarity to a dot product given that the embeddings are of unit-length.
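The normalization sequence described above can be sketched in a few lines of NumPy. The function name and the small `eps` guard against division by zero are assumptions added for illustration.

```python
import numpy as np

def normalize_embeddings(X, eps=1e-12):
    """Length-normalize, mean-center, then length-normalize again (rows are embeddings)."""
    X = X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)  # first length normalization
    X = X - X.mean(axis=0, keepdims=True)                     # mean centering per dimension
    X = X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)  # second length normalization
    return X

rng = np.random.default_rng(0)
X = normalize_embeddings(rng.standard_normal((100, 16)))
cosines = X @ X.T  # with unit-length rows, the dot product equals cosine similarity
```

Because every row ends up with unit length, the full matrix of cosine similarities reduces to a single matrix product.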

Word Pairs
In order to generate a crosslingual embedding space, HCEG requires a set $P$ of aligned words across different languages. When using HCEG in a supervised way, $P$ can be any existing resource consisting of bilingual lexicons, such as the ones described in Section 4.1. However, the proposed strategy is used to best advantage with unsupervised lexicon induction techniques, as they enable generating input lexicons for any pair of languages needed. Unlike TB/MP strategies that can only take advantage of signal that involves the pivot language, HCEG can use signal across all combinations of languages. For example, a TB/MP model where English is the pivot can only use lexicons composed of English words. Instead, HCEG can exploit bilingual lexicons from other languages, such as Spanish-Portuguese or Spanish-Dutch (see the language tree in Figure 1).
Figure 2: Distributions of word rankings across languages. The coordinates of each dot (representing a word pair) are determined by the position in the frequency ranking of the word pair in each of the languages. Numbers are written in thousands. Scores computed using FastText embedding rankings and MUSE crosslingual pairs (Conneau et al., 2017). Pearson's correlation (ρ) computed using the full set of word pairs; figures generated using a random sample of 500 word pairs for illustration purposes.
When using HCEG in unsupervised mode, $P$ needs to be automatically inferred. Yet, computing each $P_{l_1,l_2} \in P$ given two monolingual embedding matrices $X_{l_1}$ and $X_{l_2}$ is not a trivial task, as $X_{l_1}$ and $X_{l_2}$ are not aligned in vocabulary or dimension axes. Artetxe et al. (2018) leverage the fact that the relative distances among words are maintained across languages (Mikolov et al., 2013b), and thus propose using a language-agnostic representation $M_l$ for generating an initial alignment $P_{l_1,l_2}$:

$$M_l = \mathrm{sorted}(X_l X_l^\top) \quad (2)$$

where, given that $X_l$ is length normalized, $X_l X_l^\top$ computes a matrix of dimensions $V_l \times V_l$ containing in each row the cosine similarities of the corresponding word embedding with respect to all other word embeddings. The values in each row are then sorted to generate a distribution representation of each word that, in an ideal case where the isometry assumption holds perfectly, would be language agnostic. Using the representations $M_{l_1}$ and $M_{l_2}$, $P_{l_1,l_2}$ can be computed by assigning each word its most similar representation as its pair, that is, $P_{l_1,l_2}(i, j) = 1$ if:

$$j = \arg\max_{1 \le k \le V_{l_2}} M_{l_1}(i) \cdot M_{l_2}(k) \quad (3)$$

The results in Artetxe et al. (2018) indicate that this assumption is strong enough to generate an initial alignment across languages. However, as we demonstrate in Section 3.3, the quality of this type of initial alignment is dependent on the languages used, making this initialization not applicable for languages that are typologically too distant from each other, a statement also echoed by Artetxe et al. (2018).
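The idea of matching words through their sorted similarity distributions can be sketched as follows. This is an illustrative approximation, not the authors' implementation: the function names are hypothetical, both vocabularies are assumed to have the same size, and the profiles are length-normalized (an added assumption) so that the dot product between profiles behaves like a cosine similarity.

```python
import numpy as np

def similarity_profile(X):
    """Row-wise sorted cosine similarities of each word to all words (X has unit-length rows)."""
    M = np.sort(X @ X.T, axis=1)
    # Assumption: normalize each profile so that profile dot products are cosines.
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def initial_alignment(X1, X2):
    """For each word i in language 1, return the index j of its most similar profile in language 2."""
    M1, M2 = similarity_profile(X1), similarity_profile(X2)
    return np.argmax(M1 @ M2.T, axis=1)

# Synthetic check under a perfect isometry: language 2 is a rotated, shuffled copy of language 1.
rng = np.random.default_rng(0)
X1 = rng.standard_normal((30, 8))
X1 /= np.linalg.norm(X1, axis=1, keepdims=True)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))  # random orthogonal rotation
perm = rng.permutation(30)
X2 = (X1 @ Q)[perm]
pairs = initial_alignment(X1, X2)
```

In this synthetic setting the rotation leaves all pairwise similarities unchanged, so the profiles match exactly and the shuffled correspondence can be recovered. Real embedding spaces are only approximately isometric, which is why the quality of this initialization varies with the language pair.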
To ensure a more robust initialization, we enhance the strategy presented in Artetxe et al. (2018) by introducing a new signal based on the frequency of use of words. Lin et al. (2012) found that the top-2 most frequent words tend to be consistent across different languages. Motivated by this result, we measure to what extent the frequency rankings of words correlate across languages. As shown in Figure 2, the word-frequency rankings are strongly correlated across languages, meaning that popular words tend to be popular regardless of the language. We exploit this behavior in order to reduce the search space of Equation (3) as follows:

$$j = \arg\max_{i - t \le k \le i + t} M_{l_1}(i) \cdot M_{l_2}(k) \quad (4)$$

where $t$ is a value used to determine the search window. Note that we assume the embeddings in any matrix $X_l$ are sorted in descending order of frequency, namely, the embedding in the first row represents the most frequent word of language $l$. Apart from improving the overall quality of the inferred lexicons (see Section 5.1), incorporating a frequency-ranking-based search as part of the initialization reduces the computation time needed, as the search space is considerably reduced.
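A minimal sketch of a frequency-windowed search is shown below. It assumes that rows are ordered from most to least frequent word, that both languages have the same number of rows, and that the window is symmetric around the word's own rank; the function name and these choices are illustrative assumptions.

```python
import numpy as np

def windowed_alignment(M1, M2, t):
    """For each row i of M1, pick the best-matching row of M2 among ranks i - t .. i + t."""
    assert M1.shape[0] == M2.shape[0], "sketch assumes equally sized vocabularies"
    n = M1.shape[0]
    pairs = np.empty(n, dtype=int)
    for i in range(n):
        lo, hi = max(0, i - t), min(n, i + t + 1)  # clip the window at the vocabulary edges
        scores = M2[lo:hi] @ M1[i]                 # similarities restricted to the window
        pairs[i] = lo + int(np.argmax(scores))
    return pairs

rng = np.random.default_rng(0)
M1 = rng.standard_normal((50, 50))  # random stand-ins for the word representations
M2 = rng.standard_normal((50, 50))
pairs = windowed_alignment(M1, M2, t=5)
```

Each word is compared against at most 2t + 1 candidates instead of the full vocabulary, which is where the reduction in computation comes from.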

Objective Function
Unlike traditional objective functions that optimize a transformation matrix for two languages at a time, the goal of HCEG is to simultaneously optimize the set of all transformation matrices $\mathcal{W}$ such that the loss function $\mathcal{L}$ is minimized:

$$\arg\min_{\mathcal{W}} \mathcal{L} \quad (5)$$

$\mathcal{L}$ is a linear combination of three different losses:

$$\mathcal{L} = \beta_1 L_{align} + \beta_2 L_{orth} + \beta_3 L_{reg} \quad (6)$$

where $L_{align}$, $L_{orth}$, and $L_{reg}$ represent the alignment, orthogonality, and regularization losses, and $\beta_1$, $\beta_2$, $\beta_3$ are their weights. $L_{align}$ gauges the extent to which training word pairs align. This is done by computing the sum of the cosine similarity among all word pairs in $P$, negated so that better-aligned pairs yield a lower loss:

$$L_{align} = -\sum_{P_{l_1,l_2} \in P} \; \sum_{(i,j):\, P_{l_1,l_2}(i,j) = 1} \cos\left(W_{s_{l_1,l_2} \overset{*}{\leftarrow} s_{l_1}} X_{l_1}(i),\; W_{s_{l_1,l_2} \overset{*}{\leftarrow} s_{l_2}} X_{l_2}(j)\right) \quad (7)$$

where $s_{l_1,l_2}$ refers to the space in the lowest common parent node for $s_{l_1}$ and $s_{l_2}$ in $T$ (e.g., $s_{es,en} = s_{in}$ in Figure 1). We found that using $s_{l_1,l_2}$ instead of the space in the root node of $T$ improves the overall performance of HCEG, apart from reducing the time taken for training (see Section 5.3). Several researchers have found it beneficial to enforce orthogonality in the transformation matrices $\mathcal{W}$ (Xing et al., 2015; Artetxe et al., 2016; Smith et al., 2017). This constraint ensures that the original quality of the embeddings is not degraded when transforming them to a crosslingual space. For this reason, we incorporate an orthogonality constraint $L_{orth}$ into our loss function (Equation 8), which penalizes each transformation matrix's deviation from orthogonality, measured with respect to the identity matrix $I$.
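The structure of the combined loss can be sketched as below. Several details are assumptions made for illustration: the alignment term is negated so that higher cosine similarity lowers the loss, the orthogonality penalty is taken to be the squared Frobenius norm of W Wᵀ − I, and the regularization term is passed in as a precomputed value since its exact form is not reproduced here. All function names are hypothetical.

```python
import numpy as np

def cosine_rows(A, B):
    """Cosine similarity between corresponding rows of A and B."""
    num = np.sum(A * B, axis=1)
    return num / (np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1))

def align_loss(X1, X2, W1, W2, idx1, idx2):
    """Negative summed cosine of paired words after mapping both into their shared space.

    W1 and W2 are the (already composed) maps from each language to the shared space.
    """
    A = X1[idx1] @ W1.T  # rows are embeddings, so applying W to each row is X @ W.T
    B = X2[idx2] @ W2.T
    return -np.sum(cosine_rows(A, B))

def orth_loss(mats):
    """Assumed form: squared Frobenius distance between W W^T and the identity."""
    return sum(np.linalg.norm(W @ W.T - np.eye(W.shape[0])) ** 2 for W in mats)

def total_loss(l_align, l_orth, l_reg, betas=(1.0, 1.0, 1.0)):
    """Weighted linear combination of the three terms."""
    b1, b2, b3 = betas
    return b1 * l_align + b2 * l_orth + b3 * l_reg

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
idx = np.arange(5)
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # an orthogonal map
l_align = align_loss(X, X, Q, Q, idx, idx)        # identical inputs, identical maps
l_orth = orth_loss([Q])
```

With identical inputs and the same orthogonal map on both sides, every pair has cosine 1, so the alignment term reaches its minimum of minus the number of pairs while the orthogonality penalty stays at zero up to rounding.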
We also find it beneficial to include a regularization term $L_{reg}$ in $\mathcal{L}$.

Learning the Parameters
HCEG utilizes stochastic gradient descent for tuning the parameters in $\mathcal{W}$ with respect to the training word pairs in $P$. In each iteration, $\mathcal{L}$ is computed and backpropagated in order to tune each transformation matrix in $\mathcal{W}$ such that $\mathcal{L}$ is minimized. Batching is used to reduce the computational load in each iteration. A batch of word pairs $\tilde{P}$ is sampled from $P$ by randomly selecting $\alpha_{lpairs}$ language pairs as well as $\alpha_{wpairs}$ word pairs in each $\tilde{P}_{l_1,l_2} \in \tilde{P}$; for example, a batch might consist of 10 $\tilde{P}_{l_1,l_2}$ matrices, each containing 500 aligned words. Iterations are grouped into epochs of $\alpha_{iter}$ iterations, at the end of which $\mathcal{L}$ is computed for the whole $P$. We take a conservative approach to the convergence criterion: if no improvement is found in $\mathcal{L}$ in the last $\alpha_{conv}$ epochs, the training loop stops.
We achieve the best convergence time by initializing each $W_{s_1 \leftarrow s_2} \in \mathcal{W}$ to be orthogonal. We tried several methods for orthogonal initialization, such as simply initializing to the identity matrix. However, we obtained the most consistent results using the random semi-orthogonal initialization introduced by Saxe et al. (2013).
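One common way to obtain a random orthogonal matrix is through the QR decomposition of a Gaussian matrix, which is in the spirit of the initialization mentioned above. The sketch below is an assumption about how such an initialization could be realized, not a description of the authors' code; the function name is hypothetical.

```python
import numpy as np

def random_orthogonal(dim, rng):
    """Draw a random dim x dim orthogonal matrix via QR of a Gaussian matrix."""
    A = rng.standard_normal((dim, dim))
    Q, R = np.linalg.qr(A)
    # Flip column signs using the diagonal of R so the result does not depend
    # on the sign convention of the QR routine.
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(0)
W0 = random_orthogonal(6, rng)
```

Starting from an orthogonal matrix means the orthogonality penalty is already near zero at the first iteration, so early updates can concentrate on the alignment term.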

Iterative Refinement
As shown by Artetxe et al. (2017), the initial lexicon $P$ is iteratively improved by using the generated crosslingual space for inferring a new lexicon $P'$ at the end of each learning phase described in Section 3.5. Potentially, any new bilingual lexicon $P'_{l_1,l_2}$ can be inferred and included in $P'$ at the end of each learning phase. However, as the cardinality of $L$ grows, this process can take a prohibitive amount of time given the combinatorial explosion. Therefore, in practice, we only infer $P'_{l_1,l_2}$ following a criterion intended to maximize lexicon quality. $P'_{l_1,l_2}$ is inferred for languages $l_1$ and $l_2$ only if $l_1$ and $l_2$ are siblings in $T$ (they share the same parent node) or $l_1$ and $l_2$ are the best representatives of their corresponding family. A language is deemed the best representative of its family if it is the most widely spoken language in its subtree. For example, in Figure 1, Spanish is the best representative for the Italic family, but not for Indo-European, for which English is used.
The set criterion not only reduces the amount of time required to infer P ′ but also improves overall HCEG performance. This is due to a better utilization of the hierarchical characteristics of our crosslingual space, only inferring bilingual lexicons from typologically related languages or their best representatives in terms of resource quality.
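The pair-selection criterion can be sketched on a toy tree. The tree, the `RANK` values (arbitrary ranks used only to order languages, not real speaker counts), and the function names are hypothetical. The sketch also adopts one possible reading of the second condition: two languages qualify when each is the best representative of a different child of the same family node.

```python
# Hypothetical toy tree (child -> parent) and arbitrary popularity ranks.
PARENT = {"es": "it", "pt": "it", "en": "ge", "de": "ge",
          "it": "in", "ge": "in", "fi": "ur", "in": "root", "ur": "root"}
RANK = {"en": 5, "es": 4, "pt": 3, "de": 2, "fi": 1}  # higher = more widely spoken (made up)

def leaves_under(node):
    """All languages (leaf nodes) in the subtree rooted at `node`."""
    kids = [c for c, p in PARENT.items() if p == node]
    if not kids:
        return [node]
    return [leaf for k in kids for leaf in leaves_under(k)]

def best_representative(node):
    """Most widely spoken language in the subtree of `node`, according to RANK."""
    return max(leaves_under(node), key=RANK.get)

def should_infer(a, b):
    """True if a lexicon for (a, b) would be inferred under the assumed reading."""
    if a == b:
        return False
    if PARENT[a] == PARENT[b]:  # siblings in the tree
        return True
    for family in set(PARENT.values()):
        reps = {best_representative(c) for c, p in PARENT.items() if p == family}
        if a in reps and b in reps:  # both represent children of the same family
            return True
    return False
```

Under these toy values, Spanish-Portuguese qualifies as a sibling pair, Spanish-English and English-Finnish qualify as representative pairs, and Portuguese-German is skipped.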

Retrieval Criterion
As discussed in Section 2, one of the issues affecting nearest-neighbor retrieval is hubness (Dinu et al., 2014), where certain words are in the surrounding of an abnormally large number of other words, causing the nearest-neighbor algorithm to incorrectly prioritize hub words. To address this issue, we use Cross-domain Similarity Local Scaling (CSLS) (Conneau et al., 2017) as the retrieval algorithm during both training and prediction time. CSLS is a rectification for nearest-neighbor retrieval that avoids hubness by counterbalancing the cosine similarity between two embeddings by a factor consisting of the average similarity of each embedding with its k closest neighbors. Following the criteria in Conneau et al. (2017), we set the number of neighbors used by CSLS to k = 10.
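A compact sketch of CSLS scoring is given below, assuming both embedding matrices have unit-length rows and already live in the same space. The function name is hypothetical, and k is clipped to the number of available neighbors so the sketch also runs on tiny inputs.

```python
import numpy as np

def csls_scores(S, T, k=10):
    """CSLS score matrix between source rows S and target rows T (unit-length rows)."""
    sims = S @ T.T                                   # cosine similarities
    k_src = min(k, T.shape[0])
    k_tgt = min(k, S.shape[0])
    # Average similarity of each source word to its k nearest target words.
    r_src = np.sort(sims, axis=1)[:, -k_src:].mean(axis=1)
    # Average similarity of each target word to its k nearest source words.
    r_tgt = np.sort(sims, axis=0)[-k_tgt:, :].mean(axis=0)
    return 2 * sims - r_src[:, None] - r_tgt[None, :]

S = np.eye(2)
T = np.eye(2)
scores = csls_scores(S, T, k=1)
predictions = scores.argmax(axis=1)  # retrieved target index for each source word
```

Words that are close to many others receive a large neighborhood average, which lowers their scores and counteracts the tendency of hubs to be retrieved too often.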

Evaluation Framework
We describe below the evaluation set up used for conducting the experiments presented in Section 5.

Word Pair Datasets
Dinu-Artetxe. The Dinu-Artetxe dataset, presented by Dinu et al. (2014) and enhanced by Artetxe et al. (2016), is one of the first benchmarks for evaluating crosslingual embeddings. It is composed of English-centered bilingual lexicons for Italian, Spanish, German, and Finnish.
MUSE. The MUSE dataset (Conneau et al., 2017) contains bilingual lexicons for all combinations of German, English, Spanish, French, Italian, and Portuguese. In addition, it includes word pairs for 44 languages with respect to English.
Panlex. Dinu-Artetxe and MUSE are both English-centered datasets, given that most (if not all) of their word pairs have English as their source or target language. This makes the datasets suboptimal for our purpose of generating and evaluating a non-language centered crosslingual space. For this reason, we generated a dataset using Panlex (Kamholz et al., 2014), a panlingual lexical database. This dataset (made public in our repository) includes bilingual lexicons for all combinations of 157 languages for which FastText is available, totalling 24,492 bilingual lexicons. Each of the lexicons was generated by randomly sampling 5k words from the top-200k words in the embedding set for the source language, and translating them to the target language using the Panlex database. We find it important to highlight that this dataset contains considerably more noise than other datasets given that Panlex is generated in an automatic way and is not as finely curated by humans as previous datasets. We still find comparisons using this dataset fair, given that its noisy nature should affect all strategies equally.

Language Selection and Family Tree
As previously stated, we aim to generate a single crosslingual space for as many languages as possible. We started with the 157 languages for which FastText embeddings are available. We then removed languages that did not meet both of the following criteria: (1) there must exist a bilingual lexicon with at least 500 word pairs for the language in any of the datasets described in Section 4.1, and (2) the embedding set provided by FastText must contain at least 20k words. The first criterion is a minimal condition for evaluation, while the second one is necessary for the unsupervised initialization strategy. The criteria are met by 107 languages, which are the ones used in our experiments. Their corresponding ISO-639 codes can be seen later in Table 5. We use the language family tree defined by Lewis and Gary (2015).

Framework
For experimental purposes, each dataset described in Section 4.1 is split into training and testing sets. We use the original train-test splits for Dinu-Artetxe and MUSE. For Panlex, we generate a split by randomly sampling word pairs, keeping 80% for training and the remaining 20% for testing. For development and parameter tuning purposes, we use a disjoint set of word pairs specifically created for this purpose based on the Panlex lexical database. This development set contains 10 different languages with varied popularity. None of the word pairs present in this development set are part of either the train or test sets.

Evaluation
We discuss below the results of the study conducted over 107 languages to assess HCEG.

Unsupervised Initialization
We first evaluate the performance of the unsupervised initialization strategy described in Section 3.3, and compare it with the state-of-the-art strategy proposed by Artetxe et al. (2018). In this case, we run both initialization strategies using the top-20k FastText embeddings for all pairwise combinations of the 107 languages we study. For each language pair, we measure how many of the inferred word pairs are present in the corresponding lexicons in the MUSE and Panlex datasets. For MUSE, our proposed initialization strategy (Frequency based) obtains an average of 48.09 correct pairs, an improvement with respect to the 29.62 obtained by the strategy proposed by Artetxe et al. (2018). For Panlex, the respective average correct pair counts are 1.05 and 0.55. Both differences are statistically significant (p < 0.01) using a paired t-test. The noticeable difference across datasets is due to how the sampling was done for generating the datasets: MUSE contains a considerably higher number of frequent words in comparison to Panlex, making the latter a relatively harder dataset for vocabulary induction. In Figure 3 we illustrate the results of each strategy grouped by language-pair similarity. This similarity is based on the number of common parents the two languages share. For example, in Figure 1, Spanish has a similarity of 3, 2, and 1 with Portuguese, English, and Finnish, respectively. As we see in Figure 3, similarity is a factor that strongly determines the quality of the alignment generated by the unsupervised initialization. Even if this phenomenon affects both analyzed strategies, our proposed frequency-based initialization strategy consistently obtains a few more correct word pairs for the least similar language pairs, which, as we show in Table 4, are key for generating a correct mapping for those languages.
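The language-pair similarity used for grouping (the number of common parents two languages share) can be computed with a short sketch. The toy tree below is a hypothetical reconstruction that is consistent with the Spanish example in the text, not the actual tree used in the experiments.

```python
# Hypothetical toy tree (child -> parent), consistent with the Spanish example.
PARENT = {"es": "it", "pt": "it", "en": "ge", "de": "ge",
          "it": "in", "ge": "in", "fi": "ur", "in": "root", "ur": "root"}

def ancestors(node):
    """All nodes strictly above `node`, from its parent up to the root."""
    out = []
    while node in PARENT:
        node = PARENT[node]
        out.append(node)
    return out

def similarity(a, b):
    """Number of ancestors shared by languages a and b."""
    return len(set(ancestors(a)) & set(ancestors(b)))

sims = [similarity("es", other) for other in ("pt", "en", "fi")]
```

On this toy tree, Spanish shares three ancestors with Portuguese (Italic, Indo-European, root), two with English, and only the root with Finnish, matching the values quoted in the paragraph.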

State-of-the-Art Comparison
In order to contextualize the performance of HCEG with respect to the state-of-the-art (listed in Tables 1 and 2), we measure the word translation prediction capabilities of each of the strategies. We do so using Precision@1 for bilingual lexicon induction as a means to quantify vocabulary induction performance. Scores reported hereafter are average Precision@1 in percentage form, for each of the words in the testing set.
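Precision@1 for bilingual lexicon induction can be computed as sketched below. The function name and the representation of gold translations as one set of acceptable target indices per source word are assumptions for illustration.

```python
import numpy as np

def precision_at_1(scores, gold):
    """Percentage of source words whose top-scored target is an accepted translation.

    scores: (n_source, n_target) retrieval scores (e.g., cosine or CSLS).
    gold:   list with one set of acceptable target indices per source word.
    """
    predicted = scores.argmax(axis=1)
    hits = [int(p) in g for p, g in zip(predicted, gold)]
    return 100.0 * float(np.mean(hits))

scores = np.array([[0.9, 0.1, 0.0],
                   [0.8, 0.2, 0.1],
                   [0.1, 0.3, 0.7]])
gold = [{0}, {1}, {1, 2}]
p_at_1 = precision_at_1(scores, gold)
```

In this toy example the first and third source words retrieve an accepted translation while the second does not, so two out of three predictions count as correct.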
When applicable, we report results for both the supervised (HCEG-S) and unsupervised (HCEG-U) versions of HCEG. In the supervised mode, we train one single model per dataset using all the training word pairs available. We then use this model for computing all pairwise scores. In the unsupervised mode, unless explicitly stated otherwise, we train a single model regardless of the dataset used for testing purposes. This means that, in some cases, the unsupervised mode leverages monolingual data beyond the languages used for testing, as it uses all 107 language embeddings. We found it unfair to train a supervised model using the Dinu-Artetxe dataset given that it only contains four bilingual lexicons, not enough for training our tree structure. Thus, only unsupervised results are shown for that dataset.
Table caption (partially recovered): ... Artetxe et al. (2018); the remaining ones were reported in the corresponding original papers.
As shown in Table 1, the unsupervised version of HCEG achieves, in most cases, the best performance among all unsupervised strategies, even improving over state-of-the-art supervised models in some cases. The improvement is most noticeable for Italian and Spanish, where HCEG-U obtains an improvement of 1 and 3 points, respectively. A similar behavior can be seen in Table 2, where we describe the results on the MUSE dataset. Spanish, along with Catalan, Italian, and Portuguese, obtains a substantially larger improvement compared with other languages. We attribute this to the fact that Spanish is the second most resource-rich language in terms of corpora after English. This makes the quality of Spanish word embeddings comparably better than that of other languages, which as a result improves the mapping quality of typologically related languages, such as Portuguese, Italian, or Catalan.
To further contextualize the performance of HCEG-U, in terms of its capability for generating crosslingual embeddings in an unsupervised fashion, we conducted further experiments. In Table 3, we summarize the results obtained from comparing HCEG-U with other unsupervised strategies focused on learning crosslingual word embeddings. In our comparisons we include (i) a direct bilingual learning baseline that simply learns a bilingual mapping using two monolingual word embeddings (Conneau et al., 2017), (ii) a pivot-based strategy that can leverage a third language for learning a crosslingual space (Conneau et al., 2017), and (iii) a fully multilingual, pivotless strategy that aggregates languages into a joint space in an iterative manner (Chen and Cardie, 2018). From the reported results, we see that HCEG-U⁻ outperforms all other considered strategies for 24 out of 30 language pairs. The highest improvements are found for languages of the Italic family (Spanish, Portuguese, Italian, and French). We observe that HCEG-U⁻ under-performed when the corresponding experiment involved the German language as source or target. We attribute this behavior to the fact that the Italic family is predominant in the languages explored in this experiment.
Table caption (partially recovered): ... Artetxe et al. (2018) were obtained using the scripts shared by the authors. All other scores were taken from previously reported results. HCEG-U⁻ only considers the 29 languages in the experiment for training.
To perform a fair comparison with the work of Chen and Cardie (2018), we limited the monolingual data that HCEG-U − used to the six languages considered in this experiment (results that are reported in Table 3). However, to show the full potential of HCEG-U, we also include the results achieved when using 107 languages (column HCEG-U). As seen in Tables 2 and 3, the differences between HCEG-U − and HCEG-U are considerable, demonstrating the ability of the proposed model to take advantage of monolingual data in multiple languages at the same time.
[Table columns: Method, Type, en-de, en-fr, en-es, en-it, en-pt, de-fr, de-es, de-it, de-pt, fr-es, fr-it, fr-pt, es-it, es-pt, it-pt.] Multi merges all languages into the same space without using a pivot. All scores except HCEG-U were originally reported by Chen and Cardie (2018). HCEG-U − only considers the six languages in the experiment for training. Note that HCEG-U is excluded when highlighting the best model (bold), given that it uses monolingual data beyond what other models do.
The importance of explicitly considering topological connections among languages to enhance mappings becomes more evident when analyzing the data in Table 5. Here we include the pairing that yielded the best and worst mapping for each language, as well as the position of English in the quality ranking. English and Spanish map strongly to each other: Spanish is the language with which English obtains the best mapping, and English is the second-best mapped language for Spanish. Additionally, Spanish is the language with which Italian, Portuguese, and Catalan obtain the best mapping quality. On the other side of the spectrum, the worst mappings are dominated by two languages, Georgian and Vietnamese, with 40 languages having one of these two as their worst mapping; they are followed by Maltese, Albanian, and Finnish, with 8 occurrences each. This is not unexpected, as these languages are relatively isolated in the language family tree and also have a low number of speakers. We also see that English is usually near the top of the ranking for most languages. For languages that are completely isolated, such as Basque and Yoruba, English tends to be their best mapped language. From this we surmise that when typological relations are lacking, the quality of the embedding space is the only aspect the mapping strategy can rely on.
Given space constraints, we cannot show the vocabulary induction scores for the 24,492 language pairs in the Panlex dataset. Instead, we group the results using two variables: the sum of the number of speakers of the two languages, and the minimum similarity (as defined in Section 5.1) of the two languages with respect to English. We rely on these variables for grouping purposes as they align with two of our objectives for designing HCEG: (1) remove the bias towards the pivot language (English), and (2) improve the performance of low-resource languages by taking advantage of typologically similar languages.
Figure 4 captures the improvement (2.7 on average) of HCEG-U over the strategy introduced in Artetxe et al. (2018) (the best-performing baseline), grouped by the aforementioned variables. We excluded Hindi and Chinese from the figure, as they made any pattern hard to observe given their high number of speakers. The axis for the sum of the number of speakers was also logarithmically scaled to facilitate visualization. The figure captures an evident trend in the similarity axis: the lower the similarity of the language with respect to English, the higher the improvement achieved by HCEG-U. This can be attributed to the manner in which TB/MP models generate the space using English as the primary resource, hindering the potential quality of languages that are distant from it. Additionally, we see a less prominent but still present trend in the speaker sum axis. Despite some exceptions, the less spoken a language is, the larger the difference HCEG-U obtains with respect to Artetxe et al. (2018). A behavior that is similar in essence to a Pareto front can also be observed in the figure. Even if both variables contribute to the difference in improvement of HCEG-U, one variable needs to compensate for the other in order to maximize accuracy. In other words, the improvement is higher the fewer speakers the language pair has or the more distant the two languages are from English, but when both variables go to the extreme, the improvement decreases. The aforementioned trends serve as evidence that the hierarchical structure is indeed important when building a crosslingual space that considers typologically diverse languages, validating our premises for designing HCEG.
Figure 4: Improvement over the strategy proposed by Artetxe et al. (2018) in Panlex, in terms of language similarity and number of speakers. Darker denotes larger improvement.
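The grouping behind Figure 4 can be illustrated with a minimal sketch. It assumes that each language pair comes with the speaker counts of both languages, their similarity to English (taken here to lie in [0, 1], which is an assumption about the measure of Section 5.1), and the improvement of HCEG-U over the baseline. The function name, the bin widths, and the toy values are hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


def group_improvements(pairs, speaker_bin_width=1.0, similarity_bin_width=0.25):
    """Average the per-pair improvement inside (speaker-sum, similarity) bins.

    Each item of `pairs` is a tuple
    (speakers_a, speakers_b, sim_a_to_english, sim_b_to_english, improvement).
    """
    buckets = defaultdict(list)
    for speakers_a, speakers_b, sim_a, sim_b, improvement in pairs:
        # Sum the speakers of both languages and bin on a log10 scale.
        log_speakers = math.log10(speakers_a + speakers_b)
        speaker_bin = math.floor(log_speakers / speaker_bin_width)
        # Use the minimum similarity of the two languages to English.
        similarity_bin = math.floor(min(sim_a, sim_b) / similarity_bin_width)
        buckets[(speaker_bin, similarity_bin)].append(improvement)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


# Hypothetical values for illustration only; these are not results from the paper.
toy_pairs = [
    (5_000_000, 3_000_000, 0.2, 0.6, 4.0),
    (6_000_000, 1_000_000, 0.1, 0.9, 6.0),
    (400_000_000, 60_000_000, 0.8, 0.9, 1.0),
]
grouped = group_improvements(toy_pairs)
```

Each key of `grouped` identifies one cell of the grid, and its value corresponds to the shade that such a cell would receive in a heat map like the one in Figure 4.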

Ablation Study
To assess the contribution of each functionality included in HCEG, we conducted an ablation study. We summarize the results of this study in Table 4, where the symbol ¬ indicates that the subsequent feature is ablated in the model. For example, ¬Hierarchy indicates that the hierarchical structure is removed and replaced by a structure where each language needs just one transformation matrix to reach the World languages space. As indicated by the ablation results, the hierarchical structure is indeed a key part of HCEG: removing it considerably reduces performance, with the strongest effect on the dataset with the highest number of languages (i.e., Panlex). The importance of the Iterative Refinement strategy is also noticeable, as removing it renders the unsupervised version of HCEG useless. The Frequency-based initialization also considerably improves the results of HCEG-U. Looking deeper into the data, we found 2,198 language pairs (about 9% of all pairs) that obtained a vocabulary induction accuracy close to 0 (<0.05) without this initialization, but were able to produce enough signal to yield more substantial accuracy values (>10.0) when using it. Finally, the design decisions that we initially took for reducing training time, namely (i) the orthogonal initialization, (ii) the heuristic-based inference, and (iii) using the lowest common root for computing the loss function, also have a positive effect on the performance of HCEG.
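To make the difference between the hierarchical structure and its ¬Hierarchy ablation concrete, the following sketch finds the lowest common root of two languages in a small tree and counts the transformations needed to reach a given node. The tree, the node names, and the helper functions are hypothetical illustrations; they are not the tree used by HCEG or its implementation.

```python
# Hypothetical toy tree given as child -> parent; "World" is the root space.
PARENT = {
    "Spanish": "Italic",
    "Italian": "Italic",
    "English": "Germanic",
    "Italic": "Indo-European",
    "Germanic": "Indo-European",
    "Finnish": "Uralic",
    "Indo-European": "World",
    "Uralic": "World",
}


def path_to_root(node):
    """Return the nodes visited from `node` up to the root, inclusive."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lowest_common_root(lang_a, lang_b):
    """Return the first node on lang_a's upward path that also lies above lang_b."""
    ancestors_b = set(path_to_root(lang_b))
    for node in path_to_root(lang_a):
        if node in ancestors_b:
            return node
    raise ValueError("the two languages do not share a root")


def hops_to(lang, target):
    """Count the transformations applied to move `lang` up to `target`."""
    return path_to_root(lang).index(target)
```

In this toy tree, Spanish and Italian already meet in the Italic space after one transformation each, whereas Spanish and Finnish only meet in the World space. Under the flat ¬Hierarchy structure, every language would instead reach the World space through a single matrix.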

Influence of Pivot Choice
One of the premises for building HCEG was to design a strategy that would not require pivots for achieving a single space with multiple word embeddings, given that a pivot induces a bias into the final space that can hinder the quality of the mapping for languages that are too distant from it. In this section we describe the results of experiments conducted for measuring the effect pivot selection can have on the performance of the mapping. To do so, we measure the performance of state-of-the-art bilingual mapping strategies in a pivot-based inference scenario. We use 11 different pivots and average the results of two different strategies, Conneau et al. (2017) and Artetxe et al. (2018), grouped by several language families. As shown by the results in Table 6, selecting a pivot that belongs to the family of the languages being tested is always the best choice. In cases where we considered multiple pivots of the same family, the most resource-rich language was the best option, namely, Spanish in the case of the Italic family and English for the Germanic family. On average, English is the best choice of pivot if all language families need to be considered, followed by Spanish and Portuguese. This validates two of the design decisions for HCEG, that is, the need to avoid selecting a pivot and the importance of using the languages with the largest speaker base when performing language transfer.
Table 6: Results obtained by existing bilingual mapping strategies using different pivots on the Panlex dataset. Values in each cell indicate the average performance obtained for each of the pairwise combinations of languages under the family noted in the corresponding column title. For example, the first cell indicates the average score obtained for all possible combinations of Afro-Asiatic languages using English as a pivot. Results are averaged across the strategies presented in Conneau et al. (2017) and Artetxe et al. (2018) in order to avoid system-specific biases.
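As an illustration of a pivot-based inference scenario, the sketch below maps a source word and every target candidate into the space of the pivot language and retrieves the nearest neighbor by cosine similarity. This is one plausible reading of the setup and not the evaluation code used in the experiments; the function names, the two-dimensional embeddings, and the mapping matrices are all hypothetical.

```python
import math


def apply(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def induce_via_pivot(word, src_emb, tgt_emb, src_to_pivot, tgt_to_pivot):
    """Return the target word closest to `word` once both sides are in the pivot space."""
    query = apply(src_to_pivot, src_emb[word])
    return max(
        tgt_emb,
        key=lambda cand: cosine(query, apply(tgt_to_pivot, tgt_emb[cand])),
    )


# Hypothetical 2-d embeddings and orthogonal mappings, for illustration only.
SRC_EMB = {"gato": [1.0, 0.0], "perro": [0.0, 1.0]}
TGT_EMB = {"gatto": [0.0, 1.0], "cane": [-1.0, 0.0]}
SRC_TO_PIVOT = [[1.0, 0.0], [0.0, 1.0]]   # identity
TGT_TO_PIVOT = [[0.0, 1.0], [-1.0, 0.0]]  # rotation by -90 degrees

translation = induce_via_pivot("gato", SRC_EMB, TGT_EMB, SRC_TO_PIVOT, TGT_TO_PIVOT)
```

Because both languages are compared only after being projected into the pivot space, any distortion introduced by either mapping affects the retrieved neighbor, which is the kind of bias this section measures.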

Conclusion and Future Work
We have introduced HCEG, a crosslingual space learning strategy that does not depend on a pivot language; instead, it takes advantage of the natural hierarchy existing among languages. Results from extensive studies on 107 languages demonstrate that the proposed strategy outperforms existing crosslingual space generation techniques, in terms of vocabulary induction, for both popular and less popular languages. HCEG improves the mapping quality of many low-resource languages. However, we noticed that this improvement mostly occurs when a language has more typologically related counterparts. Therefore, as future work, we intend to investigate other techniques that can help improve the quality of mapping for typologically isolated low-resource languages. Additionally, it is important to note that the time complexity of the proposed algorithm grows as N(N − 1), with N being the number of languages considered, whereas for the traditional TB/MP strategy it is limited to learning from N language pairs. Therefore, we plan on exploring strategies to reduce the number of language pairs that need to be learned for creating the crosslingual space. Finally, we will explore different data-driven strategies for building the tree structure, such as geographical proximity or lexical overlap, which could lead to better optimized arrangements of the crosslingual space.
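To give a sense of scale for the cost noted above, the following short sketch compares the N(N − 1) language pairs with the N pairs stated for the TB/MP strategy, using the 107 languages mentioned in this section. The function names are hypothetical, and the counts only reflect the two formulas as stated in the text, not measured training time.

```python
def pairs_hierarchical(n_languages):
    """Language pairs to learn from when every ordered pair is considered: N(N - 1)."""
    return n_languages * (n_languages - 1)


def pairs_pivot(n_languages):
    """Language pairs to learn from under the pivot-based (TB/MP) strategy, as stated: N."""
    return n_languages


n = 107
hierarchical = pairs_hierarchical(n)  # 107 * 106 = 11342
pivot = pairs_pivot(n)                # 107
ratio = hierarchical / pivot          # N - 1 = 106.0
```

The ratio between the two counts is N − 1, so the gap widens linearly with every language added, which is what motivates reducing the number of pairs to be learned.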