Semantic Specialisation of Distributional Word Vector Spaces using Monolingual and Cross-Lingual Constraints

We present Attract-Repel, an algorithm for improving the semantic quality of word vectors by injecting constraints extracted from lexical resources. Attract-Repel facilitates the use of constraints from mono- and cross-lingual resources, yielding semantically specialised cross-lingual vector spaces. Our evaluation shows that the method can make use of existing cross-lingual lexicons to construct high-quality vector spaces for a plethora of different languages, facilitating semantic transfer from high- to lower-resource ones. The effectiveness of our approach is demonstrated with state-of-the-art results on semantic similarity datasets in six languages. We next show that Attract-Repel-specialised vectors boost performance in the downstream task of dialogue state tracking (DST) across multiple languages. Finally, we show that cross-lingual vector spaces produced by our algorithm facilitate the training of multilingual DST models, which brings further performance improvements.


Introduction
Word representation learning has become a research area of central importance in modern natural language processing. The common techniques for inducing distributed word representations are grounded in the distributional hypothesis, relying on co-occurrence information in large textual corpora to learn meaningful word representations (Mikolov et al., 2013b; Pennington et al., 2014; Ó Séaghdha and Korhonen, 2014; Levy and Goldberg, 2014). Recently, methods which go beyond stand-alone unsupervised learning have gained increased popularity.
These models typically build on distributional ones by using human- or automatically-constructed knowledge bases to enrich the semantic content of existing word vector collections. Often this is done as a post-processing step, where the distributional word vectors are refined to satisfy constraints extracted from a lexical resource such as WordNet (Faruqui et al., 2015; Wieting et al., 2015; Mrkšić et al., 2016). We term this approach semantic specialisation.
In this paper we advance the semantic specialisation paradigm in a number of ways. We introduce a new algorithm, ATTRACT-REPEL, that uses synonymy and antonymy constraints drawn from lexical resources to tune word vector spaces using linguistic information that is difficult to capture with conventional distributional training. Our evaluation shows that ATTRACT-REPEL outperforms previous methods which make use of similar lexical resources, achieving state-of-the-art results on two word similarity datasets: SimLex-999 (Hill et al., 2015) and SimVerb-3500 (Gerz et al., 2016).
We then deploy the ATTRACT-REPEL algorithm in a multilingual setting, using semantic relations extracted from BabelNet (Navigli and Ponzetto, 2012; Ehrmann et al., 2014), a cross-lingual lexical resource, to inject constraints between words of different languages into the word representations. This allows us to embed the vector spaces of multiple languages into a single vector space, exploiting information from high-resource languages to improve the word representations of lower-resource ones. In each case, the vast majority of each word's neighbours are meaningful synonyms/translations.1 While there is a considerable amount of prior research on joint learning of cross-lingual vector spaces (see Sect. 2.2), to the best of our knowledge we are the first to apply semantic specialisation to this problem.2 We demonstrate its efficacy with state-of-the-art results on the four languages in the Multilingual SimLex-999 dataset (Leviant and Reichart, 2015). To show that our approach yields semantically informative vectors for lower-resource languages, we collect intrinsic evaluation datasets for Hebrew and Croatian and show that cross-lingual specialisation significantly improves word vector quality in these two (comparatively) low-resource languages.
In the second part of the paper, we explore the use of ATTRACT-REPEL-specialised vectors in a downstream application. One important motivation for training word vectors is to improve the lexical coverage of supervised models for language understanding tasks, e.g. question answering (Iyyer et al., 2014) or textual entailment (Rocktäschel et al., 2016). In this work, we use the task of dialogue state tracking (DST) for extrinsic evaluation. This task, which arises in the construction of statistical dialogue systems (Young et al., 2013), involves understanding the goals expressed by the user and updating the system's distribution over those goals as the conversation progresses and new information becomes available.

1 Some residual (negative) effects of the distributional hypothesis do persist. For example, nl_krieken, which is Dutch for cherries, is (presumably) identified as a synonym for en_morning due to a song called 'A Morning Wish' by Emile Van Krieken.

2 Our approach is not suited to languages for which no lexical resources exist. However, many languages have some coverage in cross-lingual lexicons. For instance, BabelNet 3.7 automatically aligns WordNet to Wikipedia, providing accurate cross-lingual mappings between 271 languages. In our evaluation, we demonstrate substantial gains for Hebrew and Croatian, both of which are spoken by fewer than 10 million people worldwide.
We show that incorporating our specialised vectors into a state-of-the-art neural-network model for DST improves performance on English dialogues. In the multilingual spirit of this paper, we produce new Italian and German DST datasets and show that using ATTRACT-REPEL-specialised vectors leads to even stronger gains in these two languages. Finally, we show that our cross-lingual vectors can be used to train a single model that performs DST in all three languages, in each case outperforming the monolingual model. To the best of our knowledge, this is the first work on multilingual training of any component of a statistical dialogue system. Our results indicate that multilingual training holds great promise for bootstrapping language understanding models for other languages, especially for dialogue domains where data collection is very resource-intensive.
All resources relating to this paper are available at www.github.com/nmrksic/attract-repel. These include: 1) the ATTRACT-REPEL source code; 2) bilingual word vector collections combining English with 51 other languages; 3) Hebrew and Croatian intrinsic evaluation datasets; and 4) the Italian and German Dialogue State Tracking datasets collected for this work.
Related Work

Semantic specialisation methods (broadly) fall into two categories: a) those which train distributed representations 'from scratch' by combining distributional knowledge and lexical information; and b) those which inject lexical information into pre-trained collections of word vectors. Methods from both categories make use of similar lexical resources; common examples include WordNet (Miller, 1995), FrameNet (Baker et al., 1998) or the Paraphrase Database (PPDB) (Ganitkevitch et al., 2013; Ganitkevitch and Callison-Burch, 2014; Pavlick et al., 2015).
Learning from Scratch: some methods modify the prior or the regularisation of the original training procedure using the set of linguistic constraints (Yu and Dredze, 2014; Xu et al., 2014; Bian et al., 2014; Kiela et al., 2015; Aletras and Stevenson, 2015). Other methods modify the skip-gram (Mikolov et al., 2013b) objective function by introducing semantic constraints (Yih et al., 2012; Liu et al., 2015) to train word vectors which emphasise word similarity over relatedness. Osborne et al. (2016) propose a method for incorporating prior knowledge into the Canonical Correlation Analysis (CCA) method used by Dhillon et al. (2015) to learn spectral word embeddings. While such methods introduce semantic similarity constraints extracted from lexicons, approaches such as the one proposed by Schwartz et al. (2015) use symmetric patterns (Davidov and Rappoport, 2006) to push antonymous words apart in their pattern-based vector space. Ono et al. (2015) combine both approaches, using thesauri and distributional data to train embeddings specialised for capturing antonymy. Faruqui and Dyer (2015) use many different lexicons to create interpretable sparse binary vectors which achieve competitive performance across a range of intrinsic evaluation tasks.
In theory, word representations produced by models which consider distributional and lexical information jointly could be as good as (or better than) representations produced by fine-tuning distributional vectors. However, their performance has not surpassed that of fine-tuning methods.3

Fine-Tuning Pre-trained Vectors: Rothe and Schütze (2015) fine-tune word vector spaces to improve the representations of synsets/lexemes found in WordNet. Faruqui et al. (2015) and Jauhar et al. (2015) use synonymy constraints in a procedure termed retrofitting to bring the vectors of semantically similar words close together, while Wieting et al. (2015) modify the skip-gram objective function to fine-tune word vectors by injecting paraphrasing constraints from PPDB. Mrkšić et al. (2016) build on the retrofitting approach by jointly injecting synonymy and antonymy constraints; the same idea is reassessed by Nguyen et al. (2016). Kim et al. (2016a) further expand this line of work by incorporating semantic intensity information for the constraints, while Recski et al. (2016) use ensembles of rich concept dictionaries to further improve a combined collection of semantically specialised word vectors.
ATTRACT-REPEL is an instance of the second family of models, providing a portable, light-weight approach for incorporating external knowledge into arbitrary vector spaces. In our experiments, we show that ATTRACT-REPEL outperforms previously proposed post-processors, setting the new state-of-the-art performance on the widely used SimLex-999 word similarity dataset. Moreover, we show that starting from distributional vectors allows our method to use existing cross-lingual resources to tie the distributional vector spaces of different languages into a unified vector space which benefits from positive semantic transfer between its constituent languages.
However, prior work on cross-lingual word embedding has tended not to exploit pre-existing linguistic resources such as BabelNet. In this work, we make use of cross-lingual constraints derived from such repositories to induce high-quality cross-lingual vector spaces by facilitating semantic transfer from high- to lower-resource languages. In our experiments, we show that cross-lingual vector spaces produced by ATTRACT-REPEL consistently outperform a representative selection of five strong cross-lingual word embedding models in both intrinsic and extrinsic evaluation across several languages.

The ATTRACT-REPEL Model
In this section, we propose a new algorithm for producing semantically specialised word vectors by injecting similarity and antonymy constraints into distributional vector spaces. This procedure, which we term ATTRACT-REPEL, builds on the Paragram (Wieting et al., 2015) and counter-fitting (Mrkšić et al., 2016) procedures, both of which inject linguistic constraints into existing vector spaces to improve their ability to capture semantic similarity.
Let V be the vocabulary, S the set of synonymous word pairs (e.g. intelligent and brilliant), and A the set of antonymous word pairs (e.g. vacant and occupied). For ease of notation, let each word pair (x_l, x_r) in these two sets correspond to a vector pair (x_l, x_r). The optimisation procedure operates over mini-batches B, each consisting of a set of synonymy pairs B_S (of size k_1) and a set of antonymy pairs B_A (of size k_2). Let T_S = [(t_l^1, t_r^1), ..., (t_l^{k_1}, t_r^{k_1})] and T_A = [(t_l^1, t_r^1), ..., (t_l^{k_2}, t_r^{k_2})] be the pairs of negative examples for each synonymy and antonymy example pair. These negative examples are chosen from the 2(k_1 + k_2) word vectors present in B_S ∪ B_A:

• For each synonymy pair (x_l, x_r), the negative example pair (t_l, t_r) is chosen from the remaining in-batch vectors so that t_l is the one closest (by cosine similarity) to x_l and t_r is the one closest to x_r.

• For each antonymy pair (x_l, x_r), the negative example pair (t_l, t_r) is chosen from the remaining in-batch vectors so that t_l is the one furthest from x_l and t_r is the one furthest from x_r.
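The in-batch negative example selection described above can be sketched as follows. This is a simplified NumPy illustration under the assumption of unit-normalised vectors (so dot products equal cosine similarities); the function and variable names are ours, not the authors' implementation:

```python
import numpy as np

def choose_negative_examples(batch_syn, batch_ant):
    """Pick in-batch negative examples for each constraint pair.

    batch_syn, batch_ant: lists of (x_l, x_r) unit-normalised vector pairs.
    Returns two lists of (t_l, t_r) negative-example pairs.
    """
    # Pool of all 2*(k1 + k2) vectors present in the mini-batch.
    pool = [v for pair in batch_syn + batch_ant for v in pair]

    def nearest(x, exclude):
        # Most similar remaining in-batch vector (cosine = dot product here).
        cands = [v for v in pool if not any(v is e for e in exclude)]
        return max(cands, key=lambda v: float(np.dot(x, v)))

    def furthest(x, exclude):
        # Least similar remaining in-batch vector.
        cands = [v for v in pool if not any(v is e for e in exclude)]
        return min(cands, key=lambda v: float(np.dot(x, v)))

    # Synonyms get the hardest (closest) negatives; antonyms the furthest.
    neg_syn = [(nearest(xl, (xl, xr)), nearest(xr, (xl, xr)))
               for xl, xr in batch_syn]
    neg_ant = [(furthest(xl, (xl, xr)), furthest(xr, (xl, xr)))
               for xl, xr in batch_ant]
    return neg_syn, neg_ant
```

Choosing the closest in-batch vector as the negative for a synonym pair is what makes the updates context-sensitive: a pair is only pushed together when some other batch vector is already competitively close.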
These negative examples are used to: a) force synonymous pairs to be closer to each other than to their respective negative examples; and b) force antonymous pairs to be further away from each other than from their negative examples. The first term of the cost function pulls synonymous words together:

  S(B_S) = Σ_{(x_l, x_r) ∈ B_S} [ τ(δ_sim + x_l · t_l − x_l · x_r) + τ(δ_sim + x_r · t_r − x_l · x_r) ]

where τ(x) = max(0, x) is the hinge loss function and δ_sim is the similarity margin which determines how much closer synonymous vectors should be to each other than to their respective negative examples.
The second part of the cost function pushes antonymous word pairs away from each other:

  A(B_A) = Σ_{(x_l, x_r) ∈ B_A} [ τ(δ_ant + x_l · x_r − x_l · t_l) + τ(δ_ant + x_l · x_r − x_r · t_r) ]

In addition to these two terms, we include a regularisation term which aims to preserve the abundance of high-quality semantic content present in the initial (distributional) vector space, as long as this information does not contradict the injected linguistic constraints. If V(B) is the set of all word vectors present in the given mini-batch, then:

  R(B_S, B_A) = Σ_{x_i ∈ V(B)} λ_reg ‖x̂_i − x_i‖_2

where λ_reg is the L2 regularisation constant and x̂_i denotes the original (distributional) word vector for word x_i. The final cost function of the ATTRACT-REPEL algorithm can then be expressed as:

  C(B_S, B_A) = S(B_S) + A(B_A) + R(B_S, B_A)

Comparison to Prior Work ATTRACT-REPEL draws inspiration from three methods: 1) retrofitting (Faruqui et al., 2015); 2) PARAGRAM (Wieting et al., 2015); and 3) counter-fitting (Mrkšić et al., 2016). Whereas retrofitting and PARAGRAM do not consider antonymy, counter-fitting models both synonymy and antonymy. ATTRACT-REPEL differs from counter-fitting in two important ways:

1. Context-Sensitive Updates: Counter-fitting uses attract and repel terms which pull synonyms together and push antonyms apart without considering their relation to other word vectors. For example, its 'attract term' is given by:

  Σ_{(u, w) ∈ S} τ(δ_syn − x_u · x_w)

where S is the set of synonymy constraints and δ_syn is the (minimum) similarity enforced between synonyms. Conversely, ATTRACT-REPEL fine-tunes vector spaces by operating over mini-batches of example pairs, updating word vectors only if the position of their negative example implies a stronger semantic relation than that expressed by the position of its target example. Importantly, ATTRACT-REPEL makes fine-grained updates to both the example pair and the negative examples, rather than updating the example word pair but ignoring how this affects its relation to all other word vectors.
2. Regularisation: Counter-fitting preserves distances between pairs of word vectors in the initial vector space, trying to 'pull' each word's neighbourhood along as the word moves to incorporate external knowledge. The radius of this initial neighbourhood introduces an opaque hyperparameter into the procedure. Conversely, ATTRACT-REPEL implements standard L2 regularisation, which 'pulls' each vector towards its distributional vector representation.
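Putting the attract, repel and regularisation terms together, the mini-batch cost can be sketched as follows. This is our own simplified NumPy illustration, not the released implementation; the default margins mirror the hyperparameter values reported later in the paper:

```python
import numpy as np

def attract_repel_cost(syn_pairs, ant_pairs, syn_negs, ant_negs,
                       originals, current, lam=1e-9,
                       delta_sim=0.6, delta_ant=0.0):
    """Mini-batch ATTRACT-REPEL cost: attract + repel + L2 regularisation.

    syn_pairs/ant_pairs: lists of (word_l, word_r) keys; syn_negs/ant_negs:
    aligned lists of (t_l, t_r) negative-example keys. `current` and
    `originals` map each word to its current / distributional vector.
    """
    tau = lambda x: max(0.0, x)  # hinge loss

    def sim(u, w):
        return float(np.dot(current[u], current[w]))

    # Attract: a synonym pair should be more similar to each other
    # than either word is to its in-batch negative example.
    attract = sum(tau(delta_sim + sim(xl, tl) - sim(xl, xr)) +
                  tau(delta_sim + sim(xr, tr) - sim(xl, xr))
                  for (xl, xr), (tl, tr) in zip(syn_pairs, syn_negs))

    # Repel: an antonym pair should be less similar to each other
    # than either word is to its in-batch negative example.
    repel = sum(tau(delta_ant + sim(xl, xr) - sim(xl, tl)) +
                tau(delta_ant + sim(xl, xr) - sim(xr, tr))
                for (xl, xr), (tl, tr) in zip(ant_pairs, ant_negs))

    # Regularisation: pull every word in the batch towards its
    # original (distributional) vector.
    words = {w for pair in syn_pairs + ant_pairs for w in pair}
    reg = sum(lam * float(np.linalg.norm(originals[w] - current[w]))
              for w in words)

    return attract + repel + reg
```

In practice the word vectors would be updated by back-propagating through this cost with AdaGrad; the sketch only computes the objective value for one mini-batch.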
In our intrinsic evaluation (Sect. 5), we perform an exhaustive comparison of these models, showing that ATTRACT-REPEL significantly outperforms counter-fitting in both mono- and cross-lingual setups.
Optimisation Following Wieting et al. (2015), we use the AdaGrad algorithm (Duchi et al., 2011) to train the word embeddings for five epochs, which suffices for the magnitude of the parameter updates to converge. Like Faruqui et al. (2015), Wieting et al. (2015) and Mrkšić et al. (2016), we do not use early stopping. By not relying on language-specific validation sets, the ATTRACT-REPEL procedure can induce semantically specialised word vectors for languages with no intrinsic evaluation datasets.4

Hyperparameter Tuning We use Spearman's correlation of the final word vectors with the Multilingual WordSim-353 gold-standard association dataset (Finkelstein et al., 2002; Leviant and Reichart, 2015). The ATTRACT-REPEL procedure has six hyperparameters: the regularisation constant λ_reg, the similarity and antonymy margins δ_sim and δ_ant, the mini-batch sizes k_1 and k_2, and the size of the PPDB constraint set used for each language (larger sizes include more constraints, but also a larger proportion of false synonyms). We ran a grid search over these for the four SimLex languages, choosing the hyperparameters which achieved the best WordSim-353 score.5

For Russian, we use the distributional vectors of Kutuzov and Andreev (2015). In addition, for each of the 16 languages we also train the skip-gram with negative sampling variant of the word2vec model (Mikolov et al., 2013b) on the latest Wikipedia dump of each language to induce 300-dimensional word vectors.6
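A single AdaGrad step (Duchi et al., 2011) scales each coordinate's update by the accumulated squared gradients, so frequently-updated dimensions get smaller steps. A minimal sketch follows; the learning rate here is illustrative, not the paper's setting:

```python
import numpy as np

def adagrad_step(param, grad, accum, lr=0.05, eps=1e-8):
    """One in-place AdaGrad update with per-coordinate learning-rate decay.

    accum holds the running sum of squared gradients for this parameter.
    """
    accum += grad ** 2
    param -= lr * grad / (np.sqrt(accum) + eps)
    return param, accum
```

Each word vector touched by a mini-batch would carry its own accumulator; untouched vectors are left as-is.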

Linguistic Constraints
Table 2 shows the number of monolingual and crosslingual constraints for the four SimLex languages.

Monolingual Similarity
We employ the Multilingual Paraphrase Database (Ganitkevitch and Callison-Burch, 2014). This resource contains paraphrases automatically extracted from parallel-aligned corpora for ten of our sixteen languages. In our experiments, the remaining six languages (HE, HR, SV, GA, VI, FA) serve as examples of lower-resource languages, as they have no monolingual synonymy constraints.

5 The grid search ran over the six PPDB sizes for the four SimLex languages. λ_reg = 10^-9, δ_sim = 0.6, δ_ant = 0.0 and k_1 = k_2 ∈ {10, 25, 50} consistently achieved the best performance (we use k_1 = k_2 = 50 in all experiments for consistency). The PPDB constraint set size XL was best for English, German and Italian, and M achieved the best performance for Russian.

6 The frequency cut-off was set to 50: words that occurred less frequently were removed from the vocabularies. Other word2vec parameters were set to the standard values (Vulić and Korhonen, 2016a): 15 epochs, 15 negative samples, global (decreasing) learning rate 0.025, subsampling rate 1e-4.
Cross-Lingual Similarity We employ BabelNet, a multilingual semantic network automatically constructed by linking Wikipedia to WordNet (Navigli and Ponzetto, 2012; Ehrmann et al., 2014). BabelNet groups words from different languages into Babel synsets. We consider two words from any (distinct) language pair to be synonymous if they belong to (at least) one common Babel synset. We made use of all BabelNet word senses tagged as conceptual but ignored the ones tagged as Named Entities.
Given a large collection of cross-lingual semantic constraints (e.g. the translation pair en_sweet and it_dolce), ATTRACT-REPEL can use them to bring the vector spaces of different languages together into a shared cross-lingual space. Ideally, sharing information across languages should lead to improved semantic content for each language, especially for those with limited monolingual resources.
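Because words are language-prefixed (en_sweet, it_dolce), cross-lingual constraints need no special machinery: all languages share one vocabulary, and a Babel-style synset simply yields pairs across languages. A toy sketch of this extraction, using plain Python sets as stand-ins for BabelNet synsets (this is not the BabelNet API):

```python
def build_crosslingual_constraints(babel_synsets):
    """Turn Babel-style synsets into cross-lingual synonymy constraints.

    babel_synsets: list of sets of language-prefixed words,
    e.g. {"en_sweet", "it_dolce"}. Returns sorted pairs of words from
    *distinct* languages that co-occur in at least one synset.
    """
    constraints = set()
    for synset in babel_synsets:
        for w1 in synset:
            for w2 in synset:
                lang1 = w1.split("_", 1)[0]
                lang2 = w2.split("_", 1)[0]
                # Keep each unordered pair once; skip same-language pairs,
                # since monolingual synonymy comes from PPDB instead.
                if w1 < w2 and lang1 != lang2:
                    constraints.add((w1, w2))
    return sorted(constraints)
```

The resulting pairs feed directly into the attract term of the cost function, exactly like monolingual synonyms.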
Antonymy BabelNet is also used to extract both monolingual and cross-lingual antonymy constraints. Following Faruqui et al. (2015), who found PPDB constraints more beneficial than WordNet ones, we do not use BabelNet for monolingual synonymy.
Availability of Resources Both PPDB and BabelNet are created automatically. However, PPDB relies on large, high-quality parallel corpora such as Europarl (Koehn, 2005). In total, Multilingual PPDB provides collections of paraphrases for 22 languages. On the other hand, BabelNet uses Wikipedia's inter-language links and statistical machine translation (Google Translate) to provide cross-lingual mappings for 271 languages. In our evaluation, we show that PPDB and BabelNet can be used jointly to improve word representations for lower-resource languages by tying them into bilingual spaces with high-resource ones. We validate this claim on Hebrew and Croatian, which act as 'lower-resource' languages because of their lack of any PPDB resource and their relatively small Wikipedia sizes.7


Intrinsic Evaluation

Datasets
Spearman's rank correlation with the SimLex-999 dataset (Hill et al., 2015) is used as the intrinsic evaluation metric throughout the experiments. Unlike other gold-standard resources such as WordSim-353 (Finkelstein et al., 2002) or MEN (Bruni et al., 2014), SimLex-999 consists of word pairs scored by annotators instructed to discern between semantic similarity and conceptual association, so that related but non-similar words (e.g. book and read) have a low rating. Leviant and Reichart (2015) translated SimLex-999 to German, Italian and Russian, crowd-sourcing the similarity scores from native speakers of these languages. We use this resource for multilingual intrinsic evaluation.8 To investigate the portability of our approach to lower-resource languages, we used the same experimental setup to collect SimLex-999 datasets for Hebrew and Croatian.9 For English vectors, we also report Spearman's correlation with SimVerb-3500 (Gerz et al., 2016), a semantic similarity dataset that focuses on verb pair similarity.
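The evaluation metric can be sketched as a self-contained NumPy implementation: rank the model's cosine similarities and the gold judgements, then take the Pearson correlation of the ranks. The convention of skipping out-of-vocabulary pairs is our assumption; reporting conventions vary across papers:

```python
import numpy as np

def _ranks(values):
    # Average ranks (1-based), with tied values sharing their mean rank.
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    ranks = np.empty(len(values))
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman(xs, ys):
    # Spearman's rho = Pearson correlation of the two rank vectors.
    rx = _ranks(xs) - _ranks(xs).mean()
    ry = _ranks(ys) - _ranks(ys).mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))

def evaluate_similarity(word_vectors, scored_pairs):
    """Correlate cosine similarities with gold similarity judgements.

    word_vectors: {word: np.ndarray}; scored_pairs: [(w1, w2, gold), ...].
    """
    model, gold = [], []
    for w1, w2, score in scored_pairs:
        if w1 in word_vectors and w2 in word_vectors:
            v1, v2 = word_vectors[w1], word_vectors[w2]
            model.append(float(np.dot(v1, v2) /
                               (np.linalg.norm(v1) * np.linalg.norm(v2))))
            gold.append(score)
    return spearman(model, gold)
```

Using ranks rather than raw scores is what makes the metric insensitive to the absolute scale of cosine similarities, which differs across vector spaces.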

Experiments
Monolingual and Cross-Lingual Specialisation We start from distributional vectors for the SimLex languages: English, German, Italian and Russian. For each language, we first perform semantic specialisation of these spaces using: a) monolingual synonyms; b) monolingual antonyms; and c) the combination of both. We then add cross-lingual synonyms and antonyms to these constraints and train a shared four-language vector space.
Comparison to Baseline Methods Both mono- and cross-lingual specialisation were performed using ATTRACT-REPEL and counter-fitting, in order to conclusively determine which of the two methods exhibits superior performance. The retrofitting and PARAGRAM methods only inject synonymy, and their cost functions can be expressed using sub-components of the counter-fitting and ATTRACT-REPEL cost functions. As such, the performance of the two investigated methods when they make use of similarity (but not antonymy) constraints illustrates the performance range of the two preceding models.

Importance of Initial Vectors
We use three different sets of initial vectors: a) well-known distributional word vector collections (Sect. 4.1); b) distributional vectors trained on the latest Wikipedia dumps; and c) word vectors randomly initialised using the XAVIER initialisation (Glorot and Bengio, 2010).

Specialisation for Lower-Resource Languages
In this experiment, we first construct bilingual spaces which combine: a) one of the four SimLex languages; with b) each of the other twelve languages.10 Since each pair contains at least one SimLex language, we can analyse the improvement over monolingual specialisation to understand how robust the performance gains are across different language pairs. We next use the newly collected SimLex datasets for Hebrew and Croatian to evaluate the extent to which bilingual semantic specialisation using ATTRACT-REPEL and BabelNet constraints can improve word representations for lower-resource languages.

Comparison to State-of-the-Art Bilingual Spaces
The English-Italian and English-German bilingual spaces induced by ATTRACT-REPEL were compared to five state-of-the-art methods for constructing bilingual vector spaces: 1. (Mikolov et al., 2013a), retrained using the constraints used by our model; and 2.-5. (Hermann and Blunsom, 2014a; Gouws et al., 2015; Vulić and Korhonen, 2016a; Vulić and Moens, 2016). The latter models use various sources of supervision (word-, sentence- and document-aligned corpora), which means they cannot be trained using our sets of constraints. For these models, we use the competitive setups proposed by Vulić and Korhonen (2016a). The goal of this experiment is to show that vector spaces induced by ATTRACT-REPEL exhibit better intrinsic and extrinsic performance when deployed in language understanding tasks.

Results and Discussion
Table 3 shows the effects of monolingual and cross-lingual semantic specialisation of four well-known distributional vector spaces for the SimLex languages.
Monolingual specialisation leads to very strong improvements in SimLex performance across all languages. Cross-lingual specialisation brings further improvements, with all languages benefiting from sharing the cross-lingual vector space. Italian in particular shows strong evidence of effective transfer, with the Italian vectors' performance coming close to that of the top-performing English ones.
Comparison to Baselines Table 3 gives an exhaustive comparison of ATTRACT-REPEL to counter-fitting: ATTRACT-REPEL achieved substantially stronger performance in all experiments. We believe these results conclusively show that the fine-grained updates and L2 regularisation employed by ATTRACT-REPEL present a better alternative to the context-insensitive attract/repel terms and pair-wise regularisation employed by counter-fitting.
State-of-the-Art Wieting et al. (2016) note that the hyperparameters of the widely used Paragram-SL999 vectors (Wieting et al., 2015) are tuned on SimLex-999, and as such are not comparable to methods which hold out the dataset. This implies that the high scores of further work which uses these vectors as a starting point (e.g. Mrkšić et al., 2016; Recski et al., 2016) are not meaningful either. Our reported English score of 0.71 on the Multilingual SimLex-999 corresponds to 0.751 on the original SimLex-999: it outperforms the 0.706 score reported by Wieting et al. (2016) and sets a new high score for this dataset.
Similarly, the SimVerb-3500 score of these vectors is 0.674, outperforming the current state-of-the-art score of 0.628 reported by Gerz et al. (2016). When comparing distributional vectors trained on Wikipedia to the high-quality word vector collections used in Table 3, the Italian and Russian vectors in particular start from substantially weaker SimLex scores. This difference in performance is largely mitigated through semantic specialisation, although these vector spaces still exhibit somewhat weaker final performance than those in Table 3. We believe this shows that the quality of the initial distributional vector spaces is important, but can in large part be compensated for through semantic specialisation.

Starting Distributional Spaces
Bilingual Specialisation Table 5 shows the effect of combining the four original SimLex languages with each other and with twelve other languages (Sect. 4.1). Bilingual specialisation substantially improves over monolingual specialisation for all language pairs. This indicates that our improvements are to a large extent language independent.
Interestingly, even though we use no monolingual synonymy constraints for the six right-most languages, combining them with the SimLex languages still improved word vector quality for these four high-resource languages. The reason why even resource-deprived languages such as Irish help improve the vector space quality of high-resource ones such as English or Italian is that they provide implicit indicators of semantic similarity: English words which map to the same Irish word are likely to be synonyms, even if those English pairs are not present in the PPDB datasets (Faruqui and Dyer, 2014).12

Lower-Resource Languages The previous experiment indicates that bilingual specialisation further improves the (already) high-quality estimates for high-resource languages. However, it does little to show how much (or whether) the word vectors of lower-resource languages improve during such specialisation. Table 6 investigates this proposition using the newly collected SimLex datasets for Hebrew and Croatian.
Tying the distributional vectors for these languages (which have no monolingual constraints) into cross-lingual spaces with high-resource ones (which do, in our case from PPDB) leads to substantial improvements. Table 6 also shows how the distributional vectors of the four SimLex languages improve when tied to other languages (in each row, we use monolingual constraints only for the 'added' language). Hebrew and Croatian exhibit similar trends to the original SimLex languages: tying to English and Italian leads to stronger gains than tying to the morphologically sophisticated German and Russian. Indeed, tying to English consistently led to the strongest performance. We believe this shows that bilingual ATTRACT-REPEL specialisation with English promises to produce high-quality vector spaces for many lower-resource languages which have coverage among the 271 BabelNet languages (but are not available in PPDB).
For both languages in both language pairs, ATTRACT-REPEL achieves substantial gains over all of these methods. In the next section, we show that these differences in intrinsic performance lead to substantial gains in downstream evaluation.
Downstream Task Evaluation

Dialogue State Tracking
Task-oriented dialogue systems help users achieve goals such as making travel reservations or finding restaurants. In slot-based systems, application domains are defined by ontologies which enumerate the goals that users can express (Young, 2010). The goals are expressed by slot-value pairs such as [price: cheap] or [food: Thai]. For modular task-based systems, the Dialogue State Tracking (DST) component is in charge of maintaining the belief state, which is the system's internal distribution over the possible states of the dialogue. Figure 1 shows the correct dialogue state for each turn of an example dialogue.
Unseen Data/Labels As dialogue ontologies can be very large, many of the possible class labels (i.e. the various food types or street names) will not occur in the training set. To overcome this problem, delexicalisation-based DST models (Henderson et al., 2014c; Henderson et al., 2014b; Mrkšić et al., 2015; Wen et al., 2017) replace occurrences of ontology values in the user utterance with generic tags, allowing them to generalise to values unseen in training.

Exact Matching as a Bottleneck Semantic lexicons can be hand-crafted for small dialogue domains. Mrkšić et al. (2016) showed that semantically specialised vector spaces can be used to automatically induce such lexicons for simple dialogue domains. However, as domains grow more sophisticated, the reliance on (manually- or automatically-constructed) semantic dictionaries which list potential rephrasings for ontology values becomes a bottleneck for deploying dialogue systems. Ambiguous rephrasings are just one problematic instance of this approach: a user asking about Iceland could be referring to the country or the supermarket chain, and someone asking for songs by Train is not interested in train timetables. More importantly, the use of English as the principal language in most dialogue systems research understates the challenges that complex linguistic phenomena present in other languages. In this work, we investigate the extent to which semantic specialisation can empower DST models which do not rely on such dictionaries.
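The exact-matching behaviour that makes delexicalisation brittle can be illustrated with a toy sketch (our own simplified version, not the cited models' implementations):

```python
def delexicalise(utterance, ontology_values):
    """Replace ontology value mentions with generic slot tags.

    ontology_values maps slot names to sets of lower-cased values.
    Only exact token matches are replaced, so rephrasings such as
    "affordable" for "cheap" are missed -- the bottleneck discussed above.
    """
    out = []
    for tok in utterance.lower().split():
        for slot, values in ontology_values.items():
            if tok in values:
                out.append(f"<{slot}>")  # generic tag for the matched slot
                break
        else:
            out.append(tok)  # no slot matched; keep the token
    return " ".join(out)
```

A real system would also match multi-word values and consult a rephrasing lexicon, but the core limitation is the same: anything not listed verbatim slips through.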

Neural Belief Tracker (NBT)
The NBT is a novel DST model which operates purely over distributed representations of words, learning to compose utterance and context representations which it then uses to decide which of the potentially many ontology-defined intents (goals) have been expressed by the user (Mrkšić et al., 2017). To overcome the data sparsity problem, the NBT uses label embedding to decompose this multi-class classification problem into many binary classification problems: for each slot, the model iterates over the slot values defined by the ontology, deciding whether each of them was expressed in the current utterance and its surrounding context. The first NBT layer consists of neural networks which produce distributed representations of the user utterance, the preceding system output and the embedded label of the candidate slot-value pair. These representations are then passed to the downstream semantic decoding and context modelling networks, which subsequently make the binary decision regarding the current slot-value candidate. When contradicting goals are detected (e.g. cheap and expensive), the model chooses the more probable one.
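The binary decomposition and conflict resolution described above can be sketched as follows. The `binary_scorer` callable and the decision threshold are our stand-ins for the NBT's decoding networks, not the model's actual interface:

```python
def track_beliefs(utterance_repr, context_repr, ontology, binary_scorer,
                  threshold=0.5):
    """Decompose multi-class DST into per-candidate binary decisions.

    binary_scorer(utterance_repr, context_repr, slot, value) returns the
    probability that the candidate slot-value pair was expressed. For each
    slot, conflicting detections are resolved by keeping the most
    probable value; slots with no confident candidate stay unset.
    """
    state = {}
    for slot, values in ontology.items():
        scored = [(binary_scorer(utterance_repr, context_repr, slot, v), v)
                  for v in values]
        best_prob, best_value = max(scored)
        if best_prob >= threshold:
            state[slot] = best_value
    return state
```

Because each candidate is scored independently against its embedded label, the same networks can score values never seen in training, which is what makes the label-embedding decomposition attractive for large ontologies.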
The NBT training procedure keeps the initial word vectors fixed: that way, at test time, unseen words semantically related to familiar slot values (e.g. affordable or cheaper to cheap) are recognised purely by their position in the original vector space. Thus, it is essential that the deployed word vectors are specialised for semantic similarity, as distributional effects which keep antonymous words' vectors together can be very detrimental to DST performance (e.g. by matching northern to south or inexpensive to expensive).
The Multilingual WOZ 2.0 Dataset
Our DST evaluation is based on the WOZ 2.0 dataset introduced by Wen et al. (2017) and Mrkšić et al. (2017). This dataset is based on the ontology used for the 2nd Dialog State Tracking Challenge (DSTC2) (Henderson et al., 2014a). It consists of 1,200 Wizard-of-Oz (Fraser and Gilbert, 1991) dialogues in which Amazon Mechanical Turk users assumed the role of the dialogue system or the caller looking for restaurants in Cambridge, UK. Since users typed instead of using speech and interacted with intelligent assistants, the language they used was more sophisticated than in the case of DSTC2, where users would quickly adapt to the system's inability to cope with complex queries. For our experiments, the ontology and the 1,200 dialogues were translated into Italian and German through gengo.com, a web-based human translation platform.

DST Experiments
The principal evaluation metric in our DST experiments is the joint goal accuracy, which represents the proportion of test set dialogue turns where all the search constraints expressed up to that point in the conversation were decoded correctly. Our DST experiments investigate two propositions:

1. Intrinsic vs. Downstream Evaluation
If mono- and cross-lingual semantic specialisation improves the semantic content of word vector collections according to intrinsic evaluation, we would expect the NBT model to perform higher-quality belief tracking when such improved vectors are deployed. We investigate the difference in DST performance for English, German and Italian when the NBT model employs the following word vector collections: 1) distributional word vectors; 2) monolingual semantically specialised vectors; and 3) monolingual subspaces of the cross-lingual semantically specialised EN-DE-IT-RU vectors. For each language, we also compare to the NBT performance achieved using the five state-of-the-art bilingual vector spaces we compared to in Section 5.3.

2. Training a Multilingual DST Model
The values expressed by the domain ontology (e.g., cheap, north, Thai) are language-independent. If we assume common semantic grounding across languages, we can decouple the ontologies from the dialogue corpora and use a single ontology (i.e. its values' vector representations) across all languages. Since we know that high-performing English DST is attainable, we ground the Italian and German ontologies (i.e. all slot-value pairs) to the original English ontology. The use of a single ontology coupled with cross-lingual vectors then allows us to combine the training data for multiple languages and train a single NBT model capable of performing belief tracking across all three languages at once. Given a high-quality cross-lingual vector space, combining the languages effectively increases the training set size and should therefore lead to improved performance across all languages.
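The grounding and data-pooling steps can be sketched as follows. This is an illustrative sketch under our own assumptions about the data layout: `translations` is a hypothetical mapping from each non-English slot-value pair to its English equivalent, and dialogue corpora are simply lists of training turns.

```python
def ground_ontologies(english_value_vecs, translations):
    """Map non-English slot-value pairs onto English ontology embeddings.

    `english_value_vecs`: English slot-value pair -> label embedding.
    `translations`: (language, foreign slot-value pair) -> English pair.
    Every language then shares a single set of label embeddings.
    """
    grounded = {}
    for (lang, foreign_pair), english_pair in translations.items():
        grounded[(lang, foreign_pair)] = english_value_vecs[english_pair]
    return grounded

def combine_training_data(corpora):
    """Pool dialogue turns across languages into one training set.

    `corpora`: language code -> list of training turns.
    """
    return [turn for dialogues in corpora.values() for turn in dialogues]
```

With a shared ontology and a cross-lingual vector space, a single tracker can then be trained on the pooled turns, which is what enables the transfer to lower-resource languages.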

Results and Discussion
The DST performance of the NBT-CNN model on the English, German and Italian WOZ 2.0 datasets is shown in Table 8. The first five rows show the performance when the model employs the five baseline bilingual vector spaces. The subsequent three rows show the performance of: a) distributional vector spaces; b) their monolingual specialisation; and c) their EN-DE-IT-RU cross-lingual specialisation. The last row shows the performance of the multilingual DST model trained using ontology grounding, where the training data of all three languages was combined and used to train an improved model. Figure 2 investigates the usefulness of ontology grounding for bootstrapping DST models for new languages with less data: the two figures display the Italian / German performance of models trained using different proportions of the in-language training dataset. The top-performing dash-dotted curve shows the performance of the model trained using the language-specific dialogues and all of the English training data.
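As a reminder, the joint goal accuracy reported in these tables and figures is the proportion of turns where every accumulated constraint is decoded correctly, which can be sketched as (function and variable names are illustrative):

```python
def joint_goal_accuracy(predicted, gold):
    """Proportion of turns where *all* accumulated constraints are correct.

    `predicted` and `gold` are parallel lists with one dict per dialogue
    turn, mapping slots to the values constrained up to that turn; a turn
    counts as correct only if the two dicts match exactly.
    """
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)
```

Because a single wrong slot makes the whole turn incorrect, this metric is considerably stricter than per-slot accuracy, which is why small vector-space improvements can translate into visible joint-accuracy gains.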
The results in Table 8 show that both types of specialisation improve over the DST performance achieved using the distributional vectors or the five baseline bilingual spaces. Interestingly, the bilingual vectors of Vulić and Korhonen (2016a) outperform ours for EN (but not for IT and DE) despite their weaker SimLex performance, showing that intrinsic evaluation does not capture all relevant aspects pertaining to word vectors' usability for downstream tasks.
Each figure shows the performance of the model trained using the subspace of the given vector space corresponding to the target language.For the English baseline figures, we show the stronger of the EN-IT / EN-DE figures.
Combining the training data of all three languages brings further improvements, with particularly large gains in the low-data scenario investigated in Figure 2 (dash-dotted purple line). This figure also shows that the difference in performance between our mono- and cross-lingual vectors is not very substantial. Again, the large disparity in SimLex scores induced only minor improvements in DST performance. In summary, our results show that: a) semantically specialised vectors benefit DST performance; b) large gains in SimLex scores do not always induce large downstream gains; and c) high-quality cross-lingual spaces facilitate transfer learning between languages and offer an effective method for bootstrapping DST models for lower-resource languages.
Finally, German DST performance is substantially weaker than both English and Italian performance, corroborating our intuition that linguistic phenomena such as cases and compounding make German DST very challenging. We release these datasets in the hope that multilingual DST evaluation can give the NLP community a tool for evaluating the downstream performance of vector spaces for morphologically richer languages.

Conclusion
We have presented ATTRACT-REPEL, a novel method for injecting linguistic constraints into word vector space representations. The procedure semantically specialises word vectors by jointly injecting mono- and cross-lingual synonymy and antonymy constraints, creating unified cross-lingual vector spaces which achieve state-of-the-art performance on the well-established SimLex-999 dataset and its multilingual variants. Next, we have shown that ATTRACT-REPEL can induce high-quality vectors for lower-resource languages by tying them into bilingual vector spaces with high-resource ones. We also demonstrated that the substantial gains in intrinsic evaluation translate to gains in the downstream task of dialogue state tracking (DST), for which we release two novel non-English datasets (in German and Italian). Finally, we have shown that our semantically rich cross-lingual vectors facilitate language transfer in DST, providing an effective method for bootstrapping belief tracking models for new languages.

Further Work
Our results, especially with DST, emphasise the need for improving vector space models for morphologically rich languages. Moreover, our intrinsic and task-based experiments exposed the discrepancies between the conclusions that can be drawn from these two types of evaluation. We consider these to be major directions for future work.

Figure 1: Annotated dialogue states in a sample dialogue. Underlined words show rephrasings for ontology values which are typically handled using semantic dictionaries.

Figure 2: Joint goal accuracy of the NBT-CNN model for the Italian (left) and German (right) WOZ 2.0 test sets as a function of the number of in-language dialogues used for training.

Table 1: Nearest neighbours for three example words across Slavic, Germanic and Romance language groups (with English included as part of each word vector collection). Semantically dissimilar words have been underlined. The table illustrates the effects of cross-lingual ATTRACT-REPEL specialisation by showing the nearest neighbours for three English words across three cross-lingual spaces.

Table 2: Linguistic constraint counts (in thousands). For each language pair, the two figures show the number of injected synonymy and antonymy constraints. Monolingual constraints (the diagonal elements) are underlined.

Table 3: Multilingual SimLex-999. The effect of using the COUNTER-FITTING and ATTRACT-REPEL procedures to inject mono- and cross-lingual synonymy and antonymy constraints into the four collections of distributional word vectors. Our best results set the new state-of-the-art performance for all four languages.

Table 5: SimLex-999 performance when tying the SimLex languages into bilingual vector spaces with 16 different languages. The first number in each row represents monolingual specialisation. All but two of the bilingual spaces improved over these baselines. The EN-FR vectors set a new high score of 0.754 on the original (English) SimLex-999.

Table 6: Bilingual semantic specialisation for: a) Hebrew and Croatian; and b) the original SimLex languages. Each row shows how the SimLex scores for that language improve when its distributional vectors are tied into bilingual vector spaces with the four high-resource languages.

Table 8: The first five rows show the performance when the model employs the five baseline bilingual vector spaces.