Learning to Understand Phrases by Embedding the Dictionary

Distributional models that learn rich semantic word representations are a success story of recent NLP research. However, developing models that learn useful representations of phrases and sentences has proved far harder. We propose using the definitions found in everyday dictionaries as a means of bridging this gap between lexical and phrasal semantics. Neural language embedding models can be effectively trained to map dictionary definitions (phrases) to (lexical) representations of the words defined by those definitions. We present two applications of these architectures: reverse dictionaries that return the name of a concept given a definition or description and general-knowledge crossword question answerers. On both tasks, neural language embedding models trained on definitions from a handful of freely-available lexical resources perform as well as or better than existing commercial systems that rely on significant task-specific engineering. The results highlight the effectiveness of both neural embedding architectures and definition-based training for developing models that understand phrases and sentences.


Introduction
Much recent research in computational semantics has focussed on learning representations of arbitrary-length phrases and sentences. This task is challenging partly because there is no obvious gold standard of phrasal representation that could be used in training, evaluation and comparison of different systems. Consequently, it is difficult to design approaches that could learn from such a gold standard, and also hard to evaluate or compare different models.
In this work, we use dictionary definitions to address this issue. The composed meaning of the words in a dictionary definition (a tall, long-necked, spotted ruminant of Africa) should correspond to the meaning of the word they define (giraffe). This bridge between lexical and phrasal semantics is useful because high quality vector representations of single words can be used as a target when learning to combine the words into a coherent phrasal representation.
This approach still requires a model capable of learning to map between arbitrary-length phrases and fixed-length continuous-valued word vectors. For this purpose we use a recurrent neural network (RNN) (Schmidhuber, 1989) with long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997). Prior to training the RNN, we learn its target lexical representations by training the Word2Vec software (Mikolov et al., 2013) on billions of words of raw text.
We demonstrate the usefulness of our approach for two applications. The first is a reverse dictionary or concept finder: a system that returns words based on user descriptions or definitions (Zock and Bilac, 2004). Reverse dictionaries are used by copywriters, novelists, translators and other professional writers to find words for notions or ideas that might be on the tip of their tongue. For instance, a travel-writer might look to enhance her prose by searching for examples of a country that people associate with warm weather or an activity that is mentally or physically demanding. We show that an RNN-based reverse dictionary trained on only a handful of dictionaries identifies novel definitions and concept descriptions comparably to or better than commercial systems, which took many years to develop and rely on a much larger memory footprint. Moreover, thanks to recent work on multilingual embedding spaces (Gouws et al., 2014), we show that the RNN approach can be easily extended to produce a potentially useful cross-lingual reverse dictionary.
The second application of our model is as a general-knowledge crossword question answerer. When trained on both dictionary definitions and the opening sentences of Wikipedia articles, the RNN produces plausible answers to (non-cryptic) crossword clues, even those that apparently require detailed world knowledge. Our system outperforms two bespoke commercial crossword solving tools, and again has a smaller memory footprint, making it much more portable. Qualitative analysis reveals that the RNN learns to relate concepts that are not directly connected in the training data and can thus generalise well to unseen input. To facilitate further research, all of our code, training and evaluation sets (together with a system demo) are published online with this paper.

Model Architecture
The architecture underlying our model is a recurrent neural network (RNN). RNNs operate on variable-length sequences of inputs; in our case, natural language definitions, descriptions or sentences. RNNs (with LSTMs) have achieved state-of-the-art performance in language modelling (Mikolov et al., 2010) and image caption generation (Kiros et al., 2014), and approach state-of-the-art performance in machine translation (Bahdanau et al., 2015). During training, the input to the RNN is a dictionary definition (for the reverse dictionary model), or a sentence from an encyclopedia (in the question answering model). The objective of the model is to map these definitions to an embedding of the word that the definition defines. The target word embeddings are learned independently of the RNN weights, using the Word2Vec software (Mikolov et al., 2013).
The set of all words in the training data constitutes the vocabulary of the RNN. For each word in this vocabulary we randomly initialise a real-valued vector (input embedding) of model parameters. The RNN 'reads' the first word in the input by applying a non-linear projection of its embedding $v_1$, parameterised by input weight matrix $W$ and $b$, a vector of biases,

$$A_1 = \phi(W v_1 + b),$$

yielding the first internal activation state $A_1$. In our implementation, we use $\phi(x) = \tanh(x)$, though in theory $\phi$ can be any differentiable non-linear function. Subsequent internal activations (after time-step $t$) are computed by projecting the embedding of the $t$-th word and using this information to 'update' the internal activation state,

$$A_t = \phi(U A_{t-1} + W v_t + b),$$

where $U$ is a matrix of update weights.
As such, the values of the final internal activation state units $A_N$ are a weighted function of all input word embeddings, and constitute a 'summary' of the information in the sentence.
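The recurrence described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the Theano implementation used for the experiments: the function names, the zero initial state, the name U for the recurrent weights and the toy weight values are all assumptions made for the example.

```python
import math


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def rnn_summary(embeddings, W, U, b, phi=math.tanh):
    """Return the final activation state A_N for a sequence of word embeddings.

    W projects the current word embedding, U (assumed name) projects the
    previous activation state, and b is the bias vector.
    """
    # Assumption: a zero initial state, so that A_1 = phi(W v_1 + b).
    state = [0.0] * len(b)
    for v in embeddings:
        pre = [x + y + z for x, y, z in zip(matvec(W, v), matvec(U, state), b)]
        state = [phi(p) for p in pre]
    return state


# Toy example: two words with 2-dimensional embeddings, 2 hidden units.
W = [[0.5, 0.0], [0.0, 0.5]]
U = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]
summary = rnn_summary([[1.0, 0.0], [0.0, 1.0]], W, U, b)
```

The returned vector plays the role of the sentence 'summary': it depends on every input embedding, with earlier words influencing it only through the repeated application of U.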

Long Short Term Memory
A known limitation when training RNNs to read language using gradient descent is that the error signal (gradient) on the training examples either vanishes or explodes as the number of time steps (sentence length) increases. Consequently, after reading longer sentences the final internal activation $A_N$ typically retains useful information about the most recently read (sentence-final) words, but can neglect important information near the start of the input sentence. LSTMs were designed to mitigate this long-term dependency problem.
At each time step $t$, in place of the single internal layer of units $A$, the LSTM RNN computes six internal layers $g_w$, $g_i$, $g_f$, $g_o$, $h$ and $m$. The first, $g_w$, represents the core information passed to the LSTM unit by the latest input word at $t$. It is computed as a simple linear projection of the input embedding $v_t$ (by input weights $W_w$) and the output state of the LSTM at the previous time step $h_{t-1}$ (by update weights $U_w$), with bias vector $b_w$:

$$g_w^t = W_w v_t + U_w h_{t-1} + b_w$$

The layers $g_i$, $g_f$ and $g_o$ are computed as weighted sigmoid functions of the input embeddings, again parameterised by layer-specific weight matrices $W$ and $U$:

$$g_x^t = \frac{1}{1 + \exp\left(-(W_x v_t + U_x h_{t-1} + b_x)\right)}$$

where $x$ stands for one of $i$, $f$ or $o$. These vectors take values on [0, 1] and are often referred to as gating activations. Finally, the internal memory state $m_t$ and new output state $h_t$ of the LSTM at $t$ are computed as

$$m_t = g_w^t \odot g_i^t + m_{t-1} \odot g_f^t$$

$$h_t = g_o^t \odot \phi(m_t),$$

where $\odot$ indicates elementwise vector multiplication and $\phi$ is, as before, some non-linear function (we use tanh). Thus, $g_i$ determines to what extent the new input word is considered at each time step, $g_f$ determines to what extent the existing state of the internal memory is retained or forgotten in computing the new internal memory, and $g_o$ determines how much this memory is considered when computing the output state at $t$.
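To make the roles of the gates concrete, a single LSTM step following the description above might be sketched as below. This is a hedged illustration in plain Python rather than the actual implementation: the `params` layout, the helper names and the inclusion of bias vectors are assumptions, as is the exact form of the memory update (input information scaled by the input gate, plus previous memory scaled by the forget gate).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, v, U, h, b):
    """Compute W v + U h + b using plain Python lists."""
    return [
        sum(w * x for w, x in zip(W_row, v))
        + sum(u * y for u, y in zip(U_row, h))
        + b_k
        for W_row, U_row, b_k in zip(W, U, b)
    ]


def lstm_step(v_t, h_prev, m_prev, params):
    """One LSTM step; params maps 'w', 'i', 'f', 'o' to (W, U, b) triples."""

    def layer(name):
        W, U, b = params[name]
        return affine(W, v_t, U, h_prev, b)

    g_w = layer("w")                         # linear projection of the input
    g_i = [sigmoid(x) for x in layer("i")]   # input gate
    g_f = [sigmoid(x) for x in layer("f")]   # forget gate
    g_o = [sigmoid(x) for x in layer("o")]   # output gate
    # Assumed update: gated new information plus gated previous memory.
    m_t = [w * i + m * f for w, i, m, f in zip(g_w, g_i, m_prev, g_f)]
    h_t = [o * math.tanh(m) for o, m in zip(g_o, m_t)]
    return h_t, m_t


# Toy example: 1-dimensional embedding and a single hidden unit.
unit = ([[1.0]], [[0.0]], [0.0])
params = {"w": unit, "i": unit, "f": unit, "o": unit}
h_t, m_t = lstm_step([0.0], [0.0], [1.0], params)
```

With a zero input all gates sit at 0.5, so half of the previous memory is retained and half of the squashed memory is exposed as output.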
The sentence-final memory state of the LSTM, $m_N$, a 'summary' of all the information in the sentence, is then projected via an extra non-linear projection (parameterised by a further weight matrix) to a target embedding space. This layer enables the target (defined) word embedding space to take a different dimension to the activation layers of the RNN, and in principle enables a more complex definition-reading function to be learned.
The training objective of the model $M$ is to map the input sentence $s_c$ defining word $c$ to the pretrained embedding $v_c$ of $c$. The cost of the word-sentence pair $(c, s_c)$ from the training data is then simply the cosine distance between $M(s_c)$ and $v_c$.
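The per-example cost can be illustrated as follows. This sketch assumes the common convention that cosine distance is one minus cosine similarity; the text does not spell out the exact form, and the function name is hypothetical.

```python
import math


def cosine_distance(u, v):
    """Assumed convention: 1 - cos(u, v), so identical directions cost 0."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Cost for a model output M(s_c) and a target embedding v_c (toy vectors).
cost_same = cosine_distance([1.0, 0.0], [2.0, 0.0])
cost_orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
cost_opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
```

Because the cost depends only on direction, the model is not penalised for producing a vector whose length differs from that of the target embedding.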

Implementation Details
The RNN word embeddings in our implementation had length 256, and at each time step each of the four LSTM RNN internal layers (gating and activation states) had 512 units. To create the space for target embeddings, we trained a continuous bag-of-words (CBOW) model using the Word2Vec software on approximately 8 billion words of running text. The embeddings in the target space had dimension 500.
The model was implemented with Theano (Bergstra et al., 2010) and trained with minibatch SGD on GPUs. The batch size was fixed at 16 and the learning rate was controlled by AdaDelta (Zeiler, 2012). Training each model took approximately 24 hours. We make all model code publicly available.

Reverse Dictionaries
The most immediate application of our trained models is as a reverse dictionary or concept finder. It is simple to look up a definition in a dictionary given a word, but professional writers often also require suitable words for a given idea, concept or definition. Reverse dictionaries satisfy this need by returning candidate words given a phrase, description or definition. For instance, when queried with the phrase an activity that requires strength and determination, the OneLook.com reverse dictionary returns the concepts exercise and work. Our trained RNN model can perform a similar function, simply by mapping a phrase to a point in the target (Word2Vec) embedding space, and returning the words corresponding to the embeddings that are closest to that point.
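The retrieval step described here amounts to a nearest-neighbour search in the target embedding space. A minimal sketch is given below, assuming cosine similarity as the closeness measure and using a tiny hand-made embedding table; the vectors, the `nearest_words` name and the query vector (standing in for the RNN output) are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nearest_words(query_vec, embeddings, k=5):
    """Return the k words whose embeddings are closest to query_vec."""
    ranked = sorted(
        embeddings.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [word for word, _ in ranked[:k]]


# Toy target space; in practice these would be the Word2Vec embeddings.
toy_space = {
    "exercise": [0.9, 0.1],
    "work": [0.8, 0.3],
    "banana": [-0.5, 0.9],
}
# Stand-in for the vector the trained model produces for a query phrase.
query = [1.0, 0.2]
candidates = nearest_words(query, toy_space, k=2)
```

An exhaustive sort is adequate for a sketch; for a vocabulary of realistic size a vectorised or approximate nearest-neighbour search would be the more practical choice.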
Several other academic studies have proposed reverse dictionary models. These generally rely on common techniques from information retrieval, comparing definitions in their internal database to the input query, and returning the word whose definition is 'closest' to the input query (Bilac et al., 2003; Bilac et al., 2004; Zock and Bilac, 2004). Proximity is quantified differently in each case, but is generally a function of hand-engineered features of the two sentences. For instance, Shaw et al. (2013) propose a method in which the candidates for a given input query are all words in the model's database whose definitions contain one or more words from the query. This candidate list is then ranked according to a query-definition similarity metric based on the hypernym and hyponym relations in WordNet, features commonly used in IR such as tf-idf and a parser.
There are, in addition, at least two commercial online reverse dictionary applications, whose architecture is proprietary knowledge. The first is the Dictionary.com reverse dictionary, which retrieves candidate words from the Dictionary.com dictionary based on user definitions or descriptions. The second is OneLook.com, whose algorithm searches 1061 indexed dictionaries, including all major freely-available online dictionaries and resources such as Wikipedia and WordNet.

Training
To compile a bank of dictionary definitions for training the model, we started with all words in the target embedding space. For each of these words, we extracted dictionary-style definitions from five electronic resources: WordNet, The American Heritage Dictionary, The Collaborative International Dictionary of English, Wiktionary and Webster's. We chose these five dictionaries because they are freely-available via the WordNik API, but in theory any dictionary could be chosen. Most words in our training data had multiple definitions. For each word $w$ with definitions $\{d_1 \ldots d_n\}$ we included all pairs $(w, d_1) \ldots (w, d_n)$ as training examples. This resulted in ≈ 900,000 word-definition pairs of ≈ 10,000 unique words. We label the model trained on all of these definitions (except those in the test set) RNN All All.
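The expansion of each word's definitions into separate training examples can be sketched as follows. The input dictionary, the function names and the second toy definition are hypothetical; the `first_only` helper illustrates restricting training to one definition per word, assuming definitions are stored in dictionary order.

```python
def build_pairs(definitions):
    """Expand {word: [d_1, ..., d_n]} into a list of (word, d_i) pairs."""
    return [(word, d) for word, defs in definitions.items() for d in defs]


def first_only(definitions):
    """Keep only the first listed definition of each word."""
    return [(word, defs[0]) for word, defs in definitions.items() if defs]


# Toy data for illustration only.
toy_definitions = {
    "giraffe": [
        "a tall, long-necked, spotted ruminant of Africa",
        "an animal with a very long neck",
    ],
    "valve": ["a device that controls the flow of a fluid"],
}
all_pairs = build_pairs(toy_definitions)
first_pairs = first_only(toy_definitions)
```

Treating every definition as an independent example means that frequent, polysemous words contribute more training pairs than rare words with a single sense.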
We also wished to explore the effect of training on only subsets of this data. To test whether there is any advantage in training on multiple dictionaries, we trained an equivalent model, RNN WN All, on definitions from WordNet only. To test whether any advantage is gained by training on multiple definitions for each word, we trained an additional model, RNN WN First, on only the first definition in WordNet for each word. As with other dictionaries, the first definition of a word in WordNet generally corresponds to the most typical or common sense of that word.

Comparison and Evaluation
As a baseline for the RNN approach, we implemented two unsupervised methods using the neural (Word2Vec) word embeddings from the target word space. In the first (W2V add), we compose the embeddings for each word in the input query by pointwise addition, and return as candidates the nearest word embeddings to the resulting composed vector. The second baseline (W2V mult) is identical except that the embeddings are composed by elementwise multiplication. Both methods of composition were suggested in a recent study on building phrase representations from word embeddings (Milajevs et al., 2014).
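The two composition functions used by these baselines are simple enough to state directly. The sketch below shows only the composition step with toy vectors; the function names are hypothetical, and the composed vector would then be passed to a nearest-neighbour search over the target space.

```python
def compose_add(vectors):
    """W2V add: sum the word embeddings dimension by dimension."""
    return [sum(dims) for dims in zip(*vectors)]


def compose_mult(vectors):
    """W2V mult: multiply the word embeddings dimension by dimension."""
    result = list(vectors[0])
    for vec in vectors[1:]:
        result = [a * b for a, b in zip(result, vec)]
    return result


# Toy embeddings for a two-word query.
query_vectors = [[1.0, 2.0], [3.0, 4.0]]
added = compose_add(query_vectors)
multiplied = compose_mult(query_vectors)
```

Neither function takes word order into account, which is one respect in which these baselines differ from the RNN.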
None of the models or evaluations from previous academic research on reverse dictionaries is publicly available, so direct comparison is not possible. However, we do compare performance with the commercial systems. The Dictionary.com system returned no candidates for over 96% of our input definitions. We therefore conduct detailed comparison with OneLook.com, which is the first reverse dictionary tool returned by a Google search and seems to be the most popular among writers.
To our knowledge there are no established means of measuring reverse dictionary performance. In the only previous academic research on English reverse dictionaries that we are aware of, evaluation was conducted on 300 word-definition pairs written by lexicographers, but which are not publicly available (Shaw et al., 2013). We therefore developed new evaluation sets and make them publicly available for evaluating future models.
The evaluation items are of three types, designed to test different properties of the models. To create the seen evaluation, we randomly selected 500 words from the WordNet training data (seen by all models), and then randomly selected a definition for each word. Testing models on the resulting 500 word-definition pairs assesses their ability to recall or decode previously encoded information. For the unseen evaluation, we randomly selected 500 words from WordNet and excluded all definitions of these words from the training data of all models.
Finally, for a fair comparison with OneLook, which has both the seen and unseen pairs in its internal database, we built a new dataset of concept descriptions that do not appear in the training data for any model. To do so, we randomly selected 200 adjectives, nouns or verbs from among the top 3000 most frequent tokens in the British National Corpus (Leech et al., 1994) (but outside the top 100). We then asked ten native English speakers to write a single-sentence 'description' of these words. To ensure the resulting descriptions were good quality, for each description we asked two participants who did not produce that description to list all words that fitted the description. If the target word was not produced by one of the two checkers, the original participant was asked to rewrite the description. These concept descriptions, together with other evaluation sets, can be downloaded from our website for future comparisons.

Table 1: Performance of different reverse dictionary models in different evaluation settings. *Low variance in mult models is due to consistently poor scores, so not highlighted.

Results
Table 1 shows the performance of the different models in the three evaluation settings. Of the baseline methods involving bottom-up composition of Word2Vec embeddings, elementwise addition is clearly more effective than multiplication, which almost never returns the correct word as the nearest neighbour of the composition. Overall, however, the bespoke reverse dictionary models (RNN and OneLook) outperform both baselines. This is unsurprising given that both systems are designed for this task whereas the baselines involve no task-specific training.
Training the RNN models on different sets of definitions results in interesting variation in performance. When training is restricted to the first WordNet definition for each word (RNN WN First), the model is effective at retrieving words for definitions it has already seen, but lacks the general knowledge to effectively generalise to new, unseen items. Interestingly, when training is extended to all definitions for each word (RNN WN All), the performance on seen (first) definitions from WordNet improves further. This implies that training data can improve the retrieval of words to which that data does not directly pertain, possibly by increasing the general linguistic and conceptual knowledge of the model. However, there is a limit to this effect: when all definitions from all available dictionaries (RNN All All) are included in the training data, the performance on the (seen) WordNet first definitions degrades. This suggests that directly relevant knowledge can be lost if the model sees too much indirectly-relevant information. Nevertheless, the model trained on all available data is clearly the most robust, in that it performs similarly well on both seen and unseen definitions and descriptions.
The results also indicate interesting differences between the RNN approach and the OneLook dictionary search engine. The Seen (WN first) definitions in Table 1 occur in both the training data for the RNN models and the lookup data for the OneLook model. Clearly the OneLook algorithm is better than the RNN models at retrieving already available information (returning 89% of correct words among the top-ten candidates on this set). However, this comes at the cost of a greater memory footprint, since the model requires access to its database of dictionaries at query time. Moreover, performance on the unseen concept descriptions suggests that the RNN model is better than OneLook at generalising to novel, unseen definitions. While the two models place the correct word among their top candidates on this evaluation with approximately equal frequency (accuracy@10/100), the mean rank and rank variance of the correct word among the RNN candidates is much lower. The RNN is therefore more 'consistent' than OneLook in its ability to assign a reasonably high ranking to the correct word. In the next section we explore the differences in the model output more closely.
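The quantities discussed here (accuracy@10/100, mean rank and rank variance of the correct word) can be computed from the rank of each target word in a model's candidate list. The following sketch is one possible way to do so; the function names, the use of population variance and the toy ranks are assumptions rather than details taken from the evaluation code.

```python
import statistics


def rank_of(target, ranked_candidates):
    """1-based position of the target word in a ranked candidate list."""
    return ranked_candidates.index(target) + 1


def accuracy_at_k(ranks, k):
    """Fraction of test items whose target word is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def summarise(ranks):
    return {
        "accuracy@10": accuracy_at_k(ranks, 10),
        "accuracy@100": accuracy_at_k(ranks, 100),
        "mean_rank": statistics.mean(ranks),
        # Assumption: population variance; the choice is not specified.
        "rank_variance": statistics.pvariance(ranks),
    }


# Toy ranks for four test items, for illustration only.
toy_ranks = [1, 3, 12, 150]
report = summarise(toy_ranks)
```

Two systems with the same accuracy@10 can differ sharply in rank variance if one of them occasionally places the correct word very far down its list, which is the sense of 'consistency' used above.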

Qualitative Analysis
Example queries and top-five candidates from the models are presented in Table 6. They illustrate properties of the RNN output that should also be evident when querying the web demo. The first example demonstrates how the model generalises beyond its training data. Four of the top five responses could be classed as appropriate in that they refer to inhabitants of cold countries. However, there is no mention of cold or anything to do with climate in the dictionary definitions of Eskimo, Scandinavian, Scandinavia etc. in the training data. The model has learned that coldness is a characteristic of Scandinavia, Siberia and relates to Eskimos via connections with other concepts that are described or defined as cold. In contrast, the candidates produced by the OneLook and W2V baseline models have nothing to do with coldness, suggesting that they are not capable of drawing such indirect (or higher-order) connections between entities in the training data.
The second example demonstrates how the RNN model returns candidates whose linguistic function is appropriate to the query. For a query referring explicitly to a means, method or process, the RNN model produces verbs in different forms or an appropriate deverbal noun. In contrast, OneLook returns words of all types (aerodynamics, draught) that are arbitrarily related to the words in the query. A similar effect is apparent in the third example. While the candidates produced by the OneLook model are the correct part of speech (noun), and related to the query topic, they are not semantically appropriate. The RNN model is the only one that returns a list of plausible habits, the class of noun requested by the input.

Cross-Lingual Reverse Dictionaries
We now show how the RNN architecture can be easily modified to create a bilingual reverse dictionary: a system that returns candidate words in one language given a description or definition in another. A bilingual reverse dictionary could have clear applications for translators or transcribers. Indeed, the problem of attaching appropriate words to concepts may be more common when searching for words in a second language than in a monolingual context.
To create the bilingual variant, we simply replace the Word2Vec target embeddings with those from a bilingual embedding space. Bilingual embedding models use bilingual corpora to learn a space of representations of the words in two languages, such that words from either language that have similar meanings are close together (Hermann and Blunsom, 2013; Chandar et al., 2014; Gouws et al., 2014). For our experiment, we used English-French embeddings learned by the state-of-the-art BilBOWA model (Gouws et al., 2014) from the Wikipedia (monolingual) and Europarl (bilingual) corpora. We trained the RNN model to map from English definitions to English words in the bilingual space. At test time, after reading an English definition, we then simply return the nearest French word neighbours to that definition.
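The test-time step of the bilingual variant differs from the monolingual lookup only in that the neighbour search is restricted to words of the other language. A minimal sketch follows, assuming a shared space keyed by (language, word) pairs; this layout, the function names and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nearest_in_language(query_vec, bilingual_space, language, k=5):
    """Return the k nearest words of one language in a shared space.

    bilingual_space maps (language, word) tuples to vectors (assumed layout).
    """
    candidates = [
        (word, vec)
        for (lang, word), vec in bilingual_space.items()
        if lang == language
    ]
    candidates.sort(key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [word for word, _ in candidates[:k]]


# Toy shared English-French space, for illustration only.
toy_bilingual = {
    ("en", "sadness"): [0.9, 0.1],
    ("fr", "tristesse"): [0.85, 0.15],
    ("fr", "voler"): [0.1, 0.9],
}
# Stand-in for the model output after reading an English definition.
query = [1.0, 0.1]
french_candidates = nearest_in_language(query, toy_bilingual, "fr", k=2)
```

Because the model is trained only against English targets, the quality of the French candidates depends on how well the bilingual space aligns the two languages.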
Because no benchmarks exist for quantitative evaluation of bilingual reverse dictionaries, we compare this approach qualitatively with two alternative methods for mapping definitions to words across languages. The first is analogous to the W2V Add model of the previous section: in the bilingual embedding space, we first compose the embeddings of the English words in the query definition with elementwise addition, and then return the French word whose embedding is nearest to this vector sum. The second uses the RNN monolingual reverse dictionary model to identify an English word from an English definition, and then translates that word using Google Translate.
Table 2 shows that the RNN model can be effectively modified to create a cross-lingual reverse dictionary. It is perhaps unsurprising that the W2V Add model candidates are generally the lowest quality, given the performance of the method in the monolingual setting. In comparing the two RNN-based methods, the fully bilingual RNN appears to have two advantages over the RNN + Google approach. First, it does not require online access to a bilingual word-word mapping as defined e.g. by Google Translate. Second, it is less prone to errors caused by word sense ambiguity. For example, in response to the query an emotion you feel after being rejected, the bilingual embedding RNN returns emotions or adjectives describing mental states. In contrast, the monolingual + Google model incorrectly maps the plausible English response regret to the verbal infinitive regretter. The model makes the same error when responding to a description of a fly, returning the verb voler (to fly).

Discussion
We have shown that simply training the RNN-to-word-embedding architecture on five dictionaries yields a reverse dictionary that performs comparably to the leading commercial system, and with certain key advantages. First, it consistently returns syntactically and semantically plausible responses as part of a more coherent and homogeneous set of candidates. Second, it requires many times less memory, which is a significant advantage given that language applications and tools generally benefit from portability (e.g. deployable on mobile devices). We also showed how the architecture can be easily extended to produce bilingual versions of the same model. Of course, in the analyses performed thus far, we only test the RNN approach on tasks that it was trained to accomplish (mapping definitions or descriptions to words). In the next section, we test the general applicability of the approach by exploring whether the representations learned by the model can be effectively transferred to a novel task.

General Knowledge (crossword) Question Answering
The automatic answering of questions posed in natural language is a central problem of Artificial Intelligence. Although web search and IR techniques provide a means to find sites or documents related to language queries, at present, internet users requiring a specific fact must still sift through pages to locate the desired information. Systems that attempt to overcome this, via fully open-domain or general knowledge question-answering (open QA), generally require large teams of researchers, modular design and powerful infrastructure, exemplified by IBM's Watson (Ferrucci et al., 2010). For this reason, much academic research focuses on settings in which the scope of the task is reduced. This has been achieved by restricting questions to a specific topic or domain (Mollá and Vicedo, 2007), allowing systems access to pre-specified passages of text from which the answer can be inferred (Iyyer et al., 2014; Weston et al., 2015), or centering both questions and answers on a particular knowledge base (Berant and Liang, 2014; Bordes et al., 2014).
In what follows, we show that RNNs trained on dictionary data may ultimately be a useful component of an open QA system. Given the absence of a knowledge base or web-scale information in our architecture, we narrow the scope of the challenge facing our models by focusing on general knowledge crossword questions. General knowledge (non-cryptic, or quick) crosswords appear in national newspapers in many countries. Crossword question answering is more tractable than general open QA for two reasons. First, models know the length of the correct answer (in letters), reducing the search space. Second, some crossword questions mirror definitions, in that they refer to fundamental properties of concepts (a twelve-sided shape) or request a category member (a city in Egypt).
The architecture of the model we apply to crossword questions is identical to that used to create the reverse dictionary. However, since many general-knowledge crossword questions refer to named entities, people and places, we experiment by supplementing the dictionary definitions used by the previous model with content from Wikipedia. For every word in the model's target embedding space that is also the title of an article in Wikipedia, we treat the sentences in the first paragraph of the article as if they were (independent) definitions of that word. When a word in Wikipedia also occurs in one (or more) of the dictionaries used previously, we simply add these pseudo-definitions to the training set of definitions for the word, noting the experiments with WordNet in the previous section, which showed that using more definitions for each word generally improves performance.

Evaluation
General Knowledge crossword questions come in different styles and forms. We used the Eddie James crossword website to compile a bank of sentence-like general-knowledge questions. Eddie James is one of the UK's leading crossword compilers, working for several national newspapers. Our long question set consists of the first 150 questions (starting from puzzle #1) from his general-knowledge crosswords, excluding clues of fewer than four words and those whose answer was not a single word (e.g. kingjames).
To evaluate models on a different type of clue, we also compiled a set of shorter questions based on the Guardian Quick Crossword. Guardian questions still require general factual or linguistic knowledge, but are generally shorter and somewhat more cryptic than the longer Eddie James clues. We again formed a list of 150 questions, beginning on 1 January 2015 and excluding any questions with multiple-word answers. For clear contrast, we excluded those few questions of length greater than four words. Of these 150 clues, a subset of 30 were single-word clues. All evaluation datasets are available online with the paper.

Benchmarks and Comparisons
We evaluate RNN models trained with and without Wikipedia integrated into the training data. As before, candidates are extracted from the model by inputting definitions and returning words corresponding to the closest embeddings in the target space, but in this case we only consider candidate words whose length matches the length specified in the clue. We compare with the baseline of elementwise addition of Word2Vec vectors in the embedding space (we discard the ineffective W2V mult baseline), again restricting candidates to words of the pre-specified length.
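The length constraint can be applied as a simple filter over a ranked candidate list, as in the sketch below. The function name and the toy candidates are hypothetical; the ranking itself would come from the nearest-neighbour search in the target space.

```python
def filter_by_length(ranked_candidates, answer_length):
    """Keep only candidates with the required number of letters, preserving rank order."""
    return [word for word in ranked_candidates if len(word) == answer_length]


# Toy ranked candidates for a five-letter answer.
ranked = ["eiger", "matterhorn", "alps", "blanc"]
five_letter = filter_by_length(ranked, 5)
```

Filtering after ranking keeps the relative order produced by the model, so the first surviving candidate is the model's best guess of the required length.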
We also compare to two bespoke online crossword-solving engines. The first, OneAcross (http://www.oneacross.com/), is the candidate generation module of the award-winning Proverb crossword system (Littman et al., 2002). Proverb, which was produced by academic researchers, has featured in national media such as New Scientist, and beaten expert humans in crossword solving tournaments. Our other comparison is with Crossword Maestro (http://www.crosswordmaestro.com/), a commercial crossword solving system that handles both cryptic and non-cryptic crossword clues (we focus only on the non-cryptic setting), and has also been featured in national media. We are unable to compare against a third well-known automatic crossword solver, Dr Fill (Ginsberg, 2011), because code for Dr Fill's candidate-generation module is not readily available. As with the RNN and baseline models, when evaluating existing systems we discard candidates whose length does not match the length specified in the clue.
Certain principles connect the design of the existing commercial systems and differentiate them from our approach. Unlike the RNN model, they each require query-time access to large databases containing common crossword clues, dictionary definitions, the frequency with which words typically appear as crossword solutions and other hand-engineered and task-specific components (Littman et al., 2002; Ginsberg, 2011).
Table 7: Responses from different models to example crossword clues. In each case the model output is filtered to exclude any candidates that are not of the same length as the correct answer.

Results
The performance of models on the various question types is presented in Table 6. On the long questions, the RNN models place the correct answer in the top ten candidates for over half of the questions, and in the top 100 candidates almost 80% of the time, clearly outperforming the baseline and commercial systems. Their responses are also more consistent (in terms of rank variance) than the W2V baseline. When evaluating the two commercial systems, OneAcross and Crossword Maestro, we have access to web interfaces that return up to approximately 100 candidates for each query, so can only reliably record membership of the top ten (accuracy@10). On this metric, OneAcross beats Crossword Maestro on the long questions, but the RNN model outperforms both commercial systems.
Interestingly, as the questions get shorter, the advantage of the RNN model diminishes. Both the Word2Vec baseline and the commercial systems answer the short questions more accurately than the RNN model, and generally produce more consistent sets of candidate responses. One obvious reason for this effect is the clear difference in form and style between these shorter clues and the full definitions or encyclopedia sentences in the RNN training data. As the length of the clue decreases, finding the answer often reduces to generating synonyms (culpability - guilt), or category members (tall animal - giraffe). Word2Vec representations are known to encode these sorts of relationships (even after elementwise addition) (Mikolov et al., 2013), and seem particularly powerful in this case as the nearest neighbour search is constrained by a specified word length. The commercial systems also retrieve good candidates for such clues among their databases of entities, relationships and common crossword answers.
To produce an 'optimal' neural crossword solver, which stores all knowledge in weights and embeddings and is good at both long and short questions, the RNN model and Word2Vec baseline can be easily combined. We simply let the length of the clue (in words) determine how to generate candidates. This architecture requires no more memory than the RNN model, since that already stores all Word2Vec embeddings. The RNN-W2V row in Table 6 shows the performance of a model in which the RNN is used for questions of length > 3 words and the W2V add model is used otherwise. This composite model outperforms the commercial systems on questions of any length.
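The dispatch rule of the composite model is simple enough to sketch directly. In the illustration below, the two models are represented by stub functions that return fixed ranked lists; these stubs, the function names and the parameter `max_short_len` are hypothetical, and only the threshold of three words and the answer-length filter are taken from the description above.

```python
def composite_candidates(clue, answer_length, rnn_model, w2v_model,
                         max_short_len=3):
    """Route a clue to the RNN or the W2V add model by its length in words,
    then keep only candidates with the required number of letters."""
    n_words = len(clue.split())
    # Clues longer than `max_short_len` words go to the RNN, the rest to W2V add.
    model = rnn_model if n_words > max_short_len else w2v_model
    ranked = model(clue)
    return [c for c in ranked if len(c) == answer_length]


# Stub models returning fixed ranked lists (hypothetical stand-ins).
def stub_rnn(clue):
    return ["eiger", "matterhorn", "alps"]


def stub_w2v(clue):
    return ["guilt", "blame", "fault"]


long_result = composite_candidates(
    "Swiss mountain peak famed for its north face", 5, stub_rnn, stub_w2v)
short_result = composite_candidates("culpability", 5, stub_rnn, stub_w2v)
```

Any callable that maps a clue to a ranked list of words could be substituted for the stubs, so the same routing applies to trained models.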

Qualitative Analysis
A better understanding of how the different models arrive at their answers can be gained from considering specific examples, as presented in Table 7. The first three examples show that, despite the apparently superficial nature of its training data (definitions and introductory sentences), the RNN model can answer questions that require factual knowledge about people and places. Another notable characteristic of the RNN model is the consistent semantic appropriateness of the candidate set. In the first case, the top five candidates are all mountains, valleys or places in the Alps; in the second, they are all biblical names; and in the third, four of the five are currencies. None of the alternative approaches exhibits this 'smoothness' or consistency in candidate generation. Despite its simplicity, the W2V add method is at times surprisingly effective, as shown by the fact that it returns Joshua among its top candidates for the second query.
The final example in Table 7 highlights a limitation of the RNN approach in its current form. In this specific case, although there is an embedding in the target space for Schoenberg, there are no corresponding definitions or articles in the training data. The RNN model is not able to infer the connection between Schoenberg and the (comparatively infrequent) notion of atonality. It seems likely that a model trained only on definitions or introductory passages would always struggle to learn secondary properties of concepts, such as the name of Obama's second daughter. The existing systems, which may store these characteristics or relations explicitly in their databases, seem to be better at these sorts of questions.
More generally, it is an open question whether the world knowledge required for open QA could be encoded and retained as weights in a (larger) dynamic network, or whether it will be necessary to combine the RNN with an external memory that is less frequently (or never) updated. This latter approach has begun to achieve impressive results on certain QA and entailment tasks (Bordes et al., 2014; Graves et al., 2014; Weston et al., 2015).

Conclusion
Dictionaries exist in many of the world's languages. We have shown how these lexical resources can be valuable for training the latest neural language models to interpret and represent the meaning of phrases and sentences. While humans use the phrasal definitions in dictionaries to better understand the meaning of words, machines can use the words to better understand the phrases. We presented a recurrent neural network architecture with long short-term memory to explicitly exploit this idea.
On the reverse dictionary task that mirrors its training setting, the RNN performs comparably to the best known commercial applications despite having access to far fewer definitions. Moreover, it generates smoother sets of candidates, uses less memory at query time and, perhaps most significantly, requires no linguistic pre-processing or task-specific engineering. We also showed how the description-to-word objective can be used to train models useful for other tasks. The architecture trained additionally on an encyclopedia performs well as a crossword question answerer, outperforming commercial systems on questions containing more than four words. While our QA experiments focused on a particular question type, the results suggest that a similar neural-language-model approach may ultimately lead to improved output from more general QA and dialog systems and from information retrieval engines.
We make all code, training data, evaluation sets and both of our linguistic tools publicly available online for future research. In particular, we propose the reverse dictionary task as a comparatively general-purpose and objective way of evaluating how well models compose lexical meaning into phrase or sentence representations (whether or not they involve training on definitions directly).
In the next stage of this research, we will explore ways to enhance the RNN model, especially in the question-answering context. The model is currently not trained on any question-like language, and would conceivably improve on exposure to such linguistic forms. Compared to state-of-the-art word representation learning models, it actually sees very few words during training, and may also benefit from learning from both dictionaries and unstructured text. Finally, we intend to explore ways to endow the model with richer world knowledge. This may require the integration of an external memory module, similar to the promising approaches proposed in several recent papers (Graves et al., 2014; Weston et al., 2015).

Table 2: Style difference between dictionary definitions and concept descriptions in the evaluation.

Table 3: The top-five candidates for example queries from different reverse dictionary models.

Table 4: Responses from cross-lingual reverse dictionary models to selected queries. Underlined responses are 'correct' or potentially useful.

Table 5: Examples of the different question types in the crossword question evaluation dataset.